Extract the HTML from the URL

I'm using Boilerpipe to extract text from a URL with this code:

URL url = new URL("http://www.example.com/some-location/index.html");
String text = ArticleExtractor.INSTANCE.getText(url);

The String text contains only the extracted text of the HTML page, but I need the whole HTML code.

Has anyone used this library who knows how to extract the HTML code?

You can check the demo page for more info on the library.


For something as simple as this you don't really need an external library:

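 import java.io.BufferedReader;
 import java.io.InputStreamReader;
 import java.net.URL;
 import java.nio.charset.StandardCharsets;

 URL url = new URL("http://www.google.com");
 StringBuilder sb = new StringBuilder();
 // openStream() avoids the unchecked cast that url.getContent() needs.
 // UTF-8 is an assumption; use the charset the page actually declares.
 try (BufferedReader br = new BufferedReader(
         new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
   String line;
   while ((line = br.readLine()) != null) {
     sb.append(line).append('\n'); // readLine() strips the line terminator
   }
 }
 String htmlContent = sb.toString();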
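
If you want this as something you can compile and run, here is a minimal sketch of the same idea wrapped in a class. The class name HtmlFetcher, the helper names readAll and fetch, the 5 second timeouts, and the UTF-8 charset are all my own assumptions, not anything from Boilerpipe. It copies characters in blocks instead of calling readLine(), so the page's original line breaks are kept as they are.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HtmlFetcher {

    // Reads the whole stream into a String, keeping line breaks untouched.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    // Downloads the raw HTML of a page.
    // Assumption: the page is UTF-8 encoded; the timeouts are arbitrary.
    static String fetch(String address) throws IOException {
        URLConnection conn = new URL(address).openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try (InputStream in = conn.getInputStream()) {
            return readAll(in, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java HtmlFetcher <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

Run it with the address from the question, for example: java HtmlFetcher http://www.example.com/some-location/index.html. Keeping readAll separate from fetch means you can try the stream handling on an in-memory stream without touching the network.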