It would be really nice to retrieve all the data kept in HTML with minimal effort and be able to access it in a relational way. I had a sleepless night last week, and this is what I came up with.
In short, this is a poor attempt to wrap the great Jsoup Java HTML parser in Teiid translator logic. A single example is worth a hundred words. This SQL statement:
SELECT text, attributes FROM (call scrapedata.scrap('http://www.bing.com/search?q=jboss+teiid','a[href]')) as S WHERE upper(text) like '%TEIID%'
returns this:
Teiid - JBoss Community - Community driven open source … , href="http://www.jboss.org/teiid" h="ID=SERP,5095.1"
Teiid - Downloads - JBoss Community , href="https://www.jboss.org/teiid/downloads" h="ID=SERP,5108.1"
Teiid Download , href="/search?q=Teiid+Download&FORM=QSRE4" h="ID=SERP,5240.1"
Teiid Designer , href="/search?q=Teiid+Designer&FORM=QSRE5" h="ID=SERP,5241.1"
Teiid Forum , href="/search?q=Teiid+Forum&FORM=QSRE6" h="ID=SERP,5242.1"
Teiid - Tools - JBoss Community , href="https://www.jboss.org/teiid/tools" h="ID=SERP,5121.1"
Teiid Installation , Community - JBoss, href="https://community.jboss.org/wiki/TeiidInstallation" h="ID=SERP,5134.1"
Teiid - JBoss Issue Tracker , href="https://issues.jboss.org/browse/TEIID" h="ID=SERP,5147.1"
Teiid 7.0 Installation Guide , href="https://community.jboss.org/wiki/Teiid70InstallationGuide" h="ID=SERP,5160.1"
TEIID on tomcat - Community - JBoss, href="https://community.jboss.org/thread/205308?start=0&tstart=0" h="ID=SERP,5172.1"
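The attributes column in the output above comes back as one string of name="value" pairs, so a caller who only wants the href has to split it. Here is a minimal sketch of how that could be done on the client side with plain Java. The AttributeColumn class and its parse method are hypothetical names made up for this illustration, they are not part of the translator, and the sketch assumes the values never contain a double quote, which holds for the rows shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the translator: splits the "attributes"
// column (e.g. href="..." h="...") into a name -> value map.
class AttributeColumn {

    // One name="value" pair. Assumption: values contain no double quotes.
    private static final Pattern PAIR =
            Pattern.compile("([^\\s=\"]+)=\"([^\"]*)\"");

    static Map<String, String> parse(String attributes) {
        // LinkedHashMap keeps the pairs in the order they appear in the column.
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(attributes);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // First row of the result set shown above.
        String attributes =
                "href=\"http://www.jboss.org/teiid\" h=\"ID=SERP,5095.1\"";
        Map<String, String> parsed = parse(attributes);
        System.out.println(parsed.get("href")); // prints http://www.jboss.org/teiid
        System.out.println(parsed.get("h"));    // prints ID=SERP,5095.1
    }
}
```

Note that some of the href values are relative (for example /search?q=Teiid+Download&FORM=QSRE4), so they would still need to be resolved against the page URL before they can be followed.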
Feel free to clone the translator-scrape repository from GitHub, check the sources, and play with ScrapeTest.java. It is a unit test built with Embedded Teiid, and it should give you an idea of how to use this translator.