Friday, October 4, 2013

Bring HTML Pages Into Relational World Using Web Scraping Teiid Translator

You have probably heard many times that we live in the era of the Semantic Web. Unfortunately, not every HTML page you see was built with RDF. We still have to parse pages with web browsers, HTTP clients and a variety of custom tools. Many HTML pages are old and unstructured, and were built to outdated standards or with poorly designed tools.
It would be really nice to retrieve all the data kept in HTML with minimal effort and to access it in a relational way. I had a sleepless night last week, and this is what I came up with.

In short, this is a poor attempt to wrap the great Jsoup Java HTML parser in Teiid translator logic. A single example is better than a hundred words. This SQL statement:

SELECT text, attributes
FROM (call scrapedata.scrap('http://www.bing.com/search?q=jboss+teiid','a[href]')) as S
WHERE upper(text) like '%TEIID%'

Returns this:
Teiid - JBoss Community - Community driven open source … , href="http://www.jboss.org/teiid" h="ID=SERP,5095.1"
Teiid - Downloads - JBoss Community , href="https://www.jboss.org/teiid/downloads" h="ID=SERP,5108.1"
Teiid Download , href="/search?q=Teiid+Download&FORM=QSRE4" h="ID=SERP,5240.1"
Teiid Designer , href="/search?q=Teiid+Designer&FORM=QSRE5" h="ID=SERP,5241.1"
Teiid Forum , href="/search?q=Teiid+Forum&FORM=QSRE6" h="ID=SERP,5242.1"
Teiid - Tools - JBoss Community , href="https://www.jboss.org/teiid/tools" h="ID=SERP,5121.1"
Teiid Installation , Community - JBoss, href="https://community.jboss.org/wiki/TeiidInstallation" h="ID=SERP,5134.1"
Teiid - JBoss Issue Tracker , href="https://issues.jboss.org/browse/TEIID" h="ID=SERP,5147.1"
Teiid 7.0 Installation Guide , href="https://community.jboss.org/wiki/Teiid70InstallationGuide" h="ID=SERP,5160.1"
TEIID on tomcat - Community - JBoss, href="https://community.jboss.org/thread/205308?start=0&tstart=0" h="ID=SERP,5172.1"
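
To make the mapping from HTML to rows a bit more tangible, here is a rough Java sketch of the idea: every element matched by the selector becomes one row with a text column and an attributes column, and the WHERE clause is just a filter on top. Everything in it is an assumption made for illustration. ScrapeSketch, Row, scrape and filterByText are hypothetical names, the HTML is a made-up sample, and a regular expression stands in for Jsoup so the snippet runs on a plain JDK. It is not the translator's actual code, which relies on Jsoup for parsing and selector matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the row mapping; not the translator's actual code. */
final class ScrapeSketch {

    /** One result row, mirroring the "text" and "attributes" columns of the query above. */
    static final class Row {
        final String text;
        final String attributes;

        Row(String text, String attributes) {
            this.text = text;
            this.attributes = attributes;
        }
    }

    // Crude approximation of the CSS selector a[href]: an <a> tag carrying an href attribute.
    // A regex only copes with a tidy sample like this one; real pages need a real parser.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+([^>]*\\bhref\\s*=[^>]*)>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Turns every matching anchor into a row of (visible text, raw attribute string). */
    static List<Row> scrape(String html) {
        List<Row> rows = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            // Strip nested tags so only the visible link text is left.
            String text = m.group(2).replaceAll("<[^>]+>", "").trim();
            rows.add(new Row(text, m.group(1).trim()));
        }
        return rows;
    }

    /** Rough equivalent of: WHERE upper(text) LIKE '%TEIID%'. */
    static List<Row> filterByText(List<Row> rows, String needle) {
        List<Row> kept = new ArrayList<>();
        String upperNeedle = needle.toUpperCase(Locale.ROOT);
        for (Row row : rows) {
            if (row.text.toUpperCase(Locale.ROOT).contains(upperNeedle)) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample page: two links with href, one anchor without.
        String html = "<ul>"
                + "<li><a href=\"http://www.jboss.org/teiid\" h=\"ID=SERP,1\">Teiid - JBoss Community</a></li>"
                + "<li><a href=\"/about\">About</a></li>"
                + "<li><a name=\"top\">Teiid anchor without href</a></li>"
                + "</ul>";
        for (Row row : filterByText(scrape(html), "TEIID")) {
            System.out.println(row.text + " , " + row.attributes);
        }
    }
}
```

The anchor without an href is skipped by the selector step, and the "About" link is dropped by the text filter, which is the same two-stage shape as the SQL above: the procedure call produces rows, and plain SQL narrows them down.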

Feel free to clone the translator-scrape repository from GitHub, check the sources, and play with ScrapeTest.java. It is a unit test built with Embedded Teiid and should give you an idea of how to use this translator.