Wednesday, December 12, 2012

Extract Text from HTML Code in ColdFusion

Our Goal:
How can we extract text from HTML code using ColdFusion?

Description:
We will use regular expressions to achieve this.

Use the regular expression "<.*?>" to strip the HTML tags and keep the text. This works fine as long as no JavaScript (JS) or CSS code is present inside the HTML. When JS/CSS code is present, this expression removes only the <script> and <style> tags themselves and leaves the code between them in the output. So we need a second regular expression to remove the JS and CSS code from the HTML.
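
To illustrate the limitation, here is a minimal sketch that applies only the tag-stripping expression. The sample string and variable names are hypothetical (they are not the HTML from this post), and reReplace with the "all" scope is just one way to apply the pattern.

```cfm
<cfscript>
// Hypothetical sample input, not the HTML used in this post.
sampleHtml = '<p>Hello</p><script>alert("hi");</script>';

// Remove every tag using only the first expression.
tagsOnly = reReplace(sampleHtml, "<.*?>", "", "all");

// The tags are gone, but the script body is still in the result.
writeOutput(tagsOnly); // Helloalert("hi");
</cfscript>
```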

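To remove JS and CSS code we have to use "<(script|style).*?</\1>". Here "\1" is a backreference to the tag name captured by "(script|style)", so an opening <script> tag is paired with </script> and an opening <style> tag with </style>.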

So, if we combine the two regular expressions, we can get the actual text from HTML code that may contain some CSS and JS code.

The final regular expression will be "<(script|style).*?</\1>|<.*?>".
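
A minimal sketch of the combined expression in use, assuming a hypothetical sample string (not the HTML from this post). It uses reReplaceNoCase so that upper-case tags such as <SCRIPT> are also matched; that is a choice made for this sketch, and reReplace would work the same way for lower-case markup.

```cfm
<cfscript>
// Hypothetical sample input, not the HTML used in this post.
sampleHtml = '<html><head><style>p { color: red; }</style>'
           & '<script>var x = 1;</script></head>'
           & '<body><p>Hello <b>World</b></p></body></html>';

// script/style blocks are removed whole by the first alternative;
// every other tag is removed by the second alternative.
plainText = reReplaceNoCase(sampleHtml, "<(script|style).*?</\1>|<.*?>", "", "all");

writeOutput(trim(plainText)); // Hello World
</cfscript>
```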

Example:
Our HTML code is:


So, the final ColdFusion code to extract text from the above HTML would be as follows:


After all these steps, we will get the following text as the output.

NOTE:
In the final regular expression "<(script|style).*?</\1>|<.*?>", the part that removes CSS/JS comes first and the part that removes HTML tags comes second. The order matters: if we change it to "<.*?>|<(script|style).*?</\1>", the CSS/JS code will still be there in the final output, because the opening <script> or <style> tag matches the first part and is treated as an ordinary HTML tag, so the code inside it is never removed.
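
The effect of the order can be checked with a small comparison. This is only a sketch: the sample string and variable names are hypothetical.

```cfm
<cfscript>
// Hypothetical sample input, not the HTML used in this post.
sampleHtml = '<style>p { color: red; }</style><p>Hi</p>';

// CSS/JS alternative first: the whole style block is removed.
cssFirst = reReplaceNoCase(sampleHtml, "<(script|style).*?</\1>|<.*?>", "", "all");

// Tag alternative first: <style> is consumed as a plain tag, so the CSS stays.
tagsFirst = reReplaceNoCase(sampleHtml, "<.*?>|<(script|style).*?</\1>", "", "all");

writeOutput(cssFirst);   // Hi
writeOutput("<br>");
writeOutput(tagsFirst);  // p { color: red; }Hi
</cfscript>
```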
