Wednesday, January 19, 2011

How to get pure content from an HTML page in Java via Regex

Introduction
A few weeks ago I wrote a web crawler while developing a search engine. It extracts the content of each page and saves it to the database. HTML tags don't matter to most search engines, so I strip them out before saving. To do the same, follow the steps below:
1- Remove the script tags and their content:
// htmlContent holds the full HTML source of the page.
// Requires: import java.util.regex.Pattern;

String content;
Pattern pattern;

// Match each <script ...>...</script> element, including its body.
pattern = Pattern.compile("<script[^>]*>.*?</script>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
content = pattern.matcher(htmlContent).replaceAll("");
Note: In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
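
To see why the DOTALL flag matters here, the following minimal sketch strips script elements with and without it. The class name DotallDemo and the helper stripScripts are hypothetical names chosen for illustration, and the pattern assumes the script element is written as an opening tag, a body, and a closing tag.

```java
import java.util.regex.Pattern;

class DotallDemo {
    // Hypothetical helper: removes <script>...</script> elements,
    // optionally letting '.' match line terminators (DOTALL).
    static String stripScripts(String html, boolean dotall) {
        int flags = Pattern.CASE_INSENSITIVE | (dotall ? Pattern.DOTALL : 0);
        return Pattern.compile("<script[^>]*>.*?</script>", flags)
                      .matcher(html)
                      .replaceAll("");
    }
}
```

Without DOTALL, a script body that spans several lines is not matched at all, so the element stays in the output; with DOTALL the whole element is removed.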

2- Remove the style tags and their content:
// content and pattern are already declared in step 1, so reuse them.
pattern = Pattern.compile("<style[^>]*>.*?</style>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
content = pattern.matcher(content).replaceAll("");

3- Remove all remaining HTML tags, keeping the text between them:
pattern = Pattern.compile("<[^>]*>");
content = pattern.matcher(content).replaceAll("");
4- Replace new lines, tabs and multiple spaces with a single space:
content = content.replaceAll("\n+", " ");
content = content.replaceAll("\t+", " ");
// Collapse runs of two or more spaces into one space.
content = content.replaceAll(" {2,}", " ");

And now you have the pure content :)
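
The four steps above can be combined into one helper method. This is only a sketch: the class name HtmlText and the method name extract are made up for illustration, the whitespace step uses \s+ as a slight generalization of the separate newline, tab and space replacements, and the final trim() is an addition that is not part of the original steps.

```java
import java.util.regex.Pattern;

class HtmlText {
    // Patterns are compiled once and reused for every page.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script[^>]*>.*?</script>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE = Pattern.compile(
            "<style[^>]*>.*?</style>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Assumption: any run of whitespace (spaces, tabs, line breaks) becomes one space.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical helper: returns the text of an HTML page without markup.
    static String extract(String htmlContent) {
        String content = SCRIPT.matcher(htmlContent).replaceAll(""); // step 1
        content = STYLE.matcher(content).replaceAll("");             // step 2
        content = TAG.matcher(content).replaceAll("");               // step 3
        content = WHITESPACE.matcher(content).replaceAll(" ");       // step 4
        return content.trim(); // addition: drop leading/trailing space
    }
}
```

Calling HtmlText.extract(htmlContent) from the crawler then gives the text to store. Regular expressions are good enough for this kind of rough extraction, but they do not understand HTML structure, so unusual markup (for example a > character inside an attribute value) can still slip through.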
