Wednesday, December 12, 2012

Extract Text from HTML Code in ColdFusion

Our Goal:
How can we extract text from HTML code using ColdFusion?

Description:
We will use regular expressions to achieve this.

Use the regular expression "<.*?>" to strip the HTML tags and keep the text. This works fine as long as there is no JavaScript (JS) or CSS code inside the HTML. When JS/CSS code is present, this expression removes only the <script> and <style> tags themselves and leaves the code between them in the output. So we need a second regular expression to remove the JS and CSS code from the HTML.
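As a minimal sketch of this first step, the snippet below applies the tag-only expression to a short sample string. The variable names sample and tagsOnly are made up for illustration. The tags disappear, but the JS between the script tags is still in the result.

```cfml
<!--- Hypothetical sample: one paragraph followed by one script block --->
<cfset sample = "<p>Hello</p><script>var x = 1;</script>">

<!--- Remove every tag; the text between the tags is left untouched --->
<cfset tagsOnly = reReplaceNoCase(sample, "<.*?>", "", "ALL")>

<!--- Outputs: Hellovar x = 1; (the JS source survives next to the text) --->
<cfoutput>#tagsOnly#</cfoutput>
```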

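To remove JS and CSS code, we have to use "<(script|style).*?</\1>". The back-reference \1 matches whichever tag name the group captured, so a <script> block ends at </script> and a <style> block ends at </style>.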
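The effect of this expression on its own can be sketched as follows. The sample string and the variable names sample and noCode are assumptions for illustration. Only the style and script blocks are removed, so the paragraph tags are still there and the result is escaped before output to make them visible.

```cfml
<!--- Hypothetical sample: a style block, a paragraph, and a script block --->
<cfset sample = "<style>p { color: red; }</style><p>Hello</p><script>var x = 1;</script>">

<!--- Remove each <script>...</script> and <style>...</style> block, tags and content --->
<cfset noCode = reReplaceNoCase(sample, "<(script|style).*?</\1>", "", "ALL")>

<!--- noCode now holds <p>Hello</p>; encodeForHTML (ColdFusion 10 and later)
      escapes the remaining tags so the browser displays them as text --->
<cfoutput>#encodeForHTML(noCode)#</cfoutput>
```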

So, if we combine the two regular expressions, we get the actual text from HTML code that may contain some CSS and JS code.

The final regular expression will be "<(script|style).*?</\1>|<.*?>".

Example:
Our HTML code is:

<div id="fb-root">&nbsp;</div>
<script>
(function(d, s, id) {
var js, fjs = d.getElementsByTagName(s)[0];
if (d.getElementById(id))
return; js = d.createElement(s);
js.id = id;
js.src = "//connect.facebook.net/en_US/all.js#xfbml=1&appId=154111381338316";
fjs.parentNode.insertBefore(js, fjs);
}(document, 'script', 'facebook-jssdk'));
</script>
<p>
CF10 was originally referred to by the codename Zeus, after first being confirmed as coming by Adobe at Adobe MAX 2010, and during much of its prerelease period. It was also commonly referred to as "ColdFusion next" and "ColdFusion X" in blogs, on Twitter, etc., before Adobe finally confirmed it would be "ColdFusion 10". For much of 2010, ColdFusion Product Manager Adam Lehman toured the US setting up countless meetings with customers, developers, and user groups to formulate a master blueprint for the next feature set. In September 2010, he presented the plans to Adobe where they were given full support and approval by upper management.[4]
The first public beta of ColdFusion 10 was released via Adobe Labs on 17 February 2012.
</p>
</div>

So, the final ColdFusion code to extract text from the above HTML is as follows:

<cfsavecontent variable="request.htmlString">
<div id="fb-root">&nbsp;</div>
<script>
(function(d, s, id) {
var js, fjs = d.getElementsByTagName(s)[0];
if (d.getElementById(id))
return; js = d.createElement(s);
js.id = id;
js.src = "//connect.facebook.net/en_US/all.js#xfbml=1&appId=1234567890";
fjs.parentNode.insertBefore(js, fjs);
}(document, 'script', 'facebook-jssdk'));
</script>
<p>
CF10 was originally referred to by the codename Zeus, after first being confirmed as coming by Adobe at Adobe MAX 2010, and during much of its prerelease period. It was also commonly referred to as "ColdFusion next" and "ColdFusion X" in blogs, on Twitter, etc., before Adobe finally confirmed it would be "ColdFusion 10". For much of 2010, ColdFusion Product Manager Adam Lehman toured the US setting up countless meetings with customers, developers, and user groups to formulate a master blueprint for the next feature set. In September 2010, he presented the plans to Adobe where they were given full support and approval by upper management.[4]
The first public beta of ColdFusion 10 was released via Adobe Labs on 17 February 2012.
</p>
</div>
</cfsavecontent>
<div style="width:500px">
<cfoutput>#reReplaceNoCase(request.htmlString, "<(script|style).*?</\1>|<.*?>","","ALL")#</cfoutput>
</div>

After all these steps, the output contains only the text of the paragraph; the HTML tags and the JS code are removed.

NOTE:
In the final regular expression "<(script|style).*?</\1>|<.*?>", the part that removes CSS/JS comes first and the part that removes HTML tags comes second. If we change the order to "<.*?>|<(script|style).*?</\1>", the CSS/JS code stays in the final output. This is because the opening <script> or <style> tag matches the first part and is removed as a normal HTML tag, so the code between the tags is left behind.
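To see the difference between the two orders, here is a small sketch that runs both on the same input. The sample string and the variable names sample, codeFirst and tagsFirst are hypothetical and used only for this comparison.

```cfml
<!--- Hypothetical sample: one paragraph followed by one script block --->
<cfset sample = "<p>Hello</p><script>var x = 1;</script>">

<!--- CSS/JS part first: the whole script block is matched and removed --->
<cfset codeFirst = reReplaceNoCase(sample, "<(script|style).*?</\1>|<.*?>", "", "ALL")>

<!--- Tag part first: <script> is removed as an ordinary tag, so the JS stays --->
<cfset tagsFirst = reReplaceNoCase(sample, "<.*?>|<(script|style).*?</\1>", "", "ALL")>

<cfoutput>
codeFirst: #codeFirst#<br>
tagsFirst: #tagsFirst#
</cfoutput>
```

With this input, codeFirst holds "Hello" while tagsFirst holds "Hellovar x = 1;".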
