Our Goal:
How can we extract text from HTML code using ColdFusion?
Description:
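We will use regular expressions to achieve this.
The regular expression "<.*?>" matches any HTML tag, so replacing every match with an empty string leaves only the text. This works fine as long as there is no JavaScript (JS) or CSS code inside the HTML. When JS/CSS code is present, this expression removes only the <script> and <style> tags themselves and leaves the code between them in the output. So a second regular expression is needed to remove the JS and CSS code from the HTML.
To remove JS and CSS code we have to use "<(script|style).*?</\1>". Here "\1" is a backreference to whichever of "script" or "style" was matched, so the expression covers everything from the opening tag to its matching closing tag.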
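The two expressions above can be applied one after the other. The following is a minimal sketch, assuming a small hypothetical HTML string; the variable names htmlCode, noCode and textOnly are illustrative and not taken from the original code.

```cfml
<cfscript>
// Hypothetical sample input, kept on a single line for simplicity.
htmlCode = '<style>p { color: red; }</style><p>Hello <b>World</b></p><script>alert("hi");</script>';

// Pass 1: remove <script> and <style> blocks together with their contents.
noCode = reReplaceNoCase(htmlCode, "<(script|style).*?</\1>", "", "all");

// Pass 2: remove the remaining HTML tags.
textOnly = reReplaceNoCase(noCode, "<.*?>", "", "all");

writeOutput(textOnly); // Hello World
</cfscript>
```

reReplaceNoCase is used here so that upper-case tags such as <SCRIPT> are matched as well, and the "all" scope replaces every match rather than only the first one.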
So, if we combine the two regular expressions, we can get the actual text from HTML code that may contain some CSS and JS code.
The final regular expression will be "<(script|style).*?</\1>|<.*?>".
Example:
Our HTML code is:
So, the final ColdFusion code to extract text from the above HTML would be as follows:
After all these steps, we will get the following text as the output.
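As a minimal sketch of the whole example, the combined expression can be applied in a single call. The HTML string and the variable names htmlCode and textOnly below are hypothetical and chosen only for illustration.

```cfml
<cfscript>
// Hypothetical sample page containing CSS, JS and ordinary markup.
htmlCode = '<html><head><style>h1 { color: red; }</style></head><body><h1>Welcome</h1><script>alert("hi");</script></body></html>';

// One pass: the first alternative removes script/style blocks with their
// contents, the second alternative removes any other tag.
textOnly = reReplaceNoCase(htmlCode, "<(script|style).*?</\1>|<.*?>", "", "all");

writeOutput(textOnly); // Welcome
</cfscript>
```

Text from neighbouring elements is joined without a separator, so real pages may need extra whitespace handling afterwards.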
NOTE:
In the final regular expression "<(script|style).*?</\1>|<.*?>", the part that removes CSS/JS comes first and the part that removes HTML tags comes second. If we change the order to "<.*?>|<(script|style).*?</\1>", the CSS/JS code will remain in the final output. This is because the opening <script> or <style> tag matches the first alternative and is treated as an ordinary HTML tag, so only the tag is removed and the code that follows it stays.
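The effect of the ordering can be checked with a short comparison. This is only a sketch, assuming a hypothetical input string; the variable names are illustrative.

```cfml
<cfscript>
// Hypothetical sample input with one style block and one script block.
htmlCode = '<style>p { color: red; }</style><p>Hello <b>World</b></p><script>alert("hi");</script>';

// Script/style alternative first: the blocks are removed with their contents.
rightOrder = reReplaceNoCase(htmlCode, "<(script|style).*?</\1>|<.*?>", "", "all");

// Generic tag alternative first: <style> and <script> are consumed as plain
// tags, so the second alternative never gets a chance to match.
wrongOrder = reReplaceNoCase(htmlCode, "<.*?>|<(script|style).*?</\1>", "", "all");

writeOutput(rightOrder); // Hello World
writeOutput(" / ");
writeOutput(wrongOrder); // p { color: red; }Hello Worldalert("hi");
</cfscript>
```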