Screen Scraping

Another (legit) use for scraping

>> Don't have access directly to the database or a web service that will provide you the information you need (or you won't have access soon enough)

Or... using a web service would be more complicated and more expensive than just grabbing HTML that is *almost* what you need (IF the HTML can be expected to remain stable).

That was the case on http://www.vpr.net/community/school_closings/, which I developed several months back. VPR subscribes to a school closing service, and the provider has APIs that return the full closing data as XML, but they also offer an HTML table version that was almost exactly what we needed. Rather than parsing their XML and rebuilding a table nearly identical to the one they already had, I simply used PHP's fopen() to grab the HTML and a tiny bit of regex to remove some info we didn't need. We then just re-styled the table with CSS to match our site.

Too bad there aren't any school closings to demonstrate. Come back next winter! In any case, it proved far simpler to just slightly modify the HTML that already existed than it would have been to deal with XML parsing to come up with something that was almost identical.
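
For anyone who wants to try the same trick, here is a rough sketch of the fetch-and-trim approach described above. It is in Python rather than the PHP used on the VPR page, and it is not the original code. The provider URL, the `class="updated"` marker on the unwanted cells, and the function names are all assumptions made up for illustration; the comment does not say what was actually removed.

```python
import re
from urllib.request import urlopen

# Hypothetical address: the real provider URL is not given in the comment.
PROVIDER_URL = "https://provider.example/closings.html"

# Assumption: the cells we do not want are marked with class="updated".
UNWANTED_CELL = re.compile(
    r'<t[dh]\b[^>]*\bclass="updated"[^>]*>.*?</t[dh]>',
    re.IGNORECASE | re.DOTALL,
)


def fetch_html(url: str = PROVIDER_URL) -> str:
    """Download the provider's ready-made HTML page as text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def strip_unwanted(html: str) -> str:
    """Remove every header or data cell that matches UNWANTED_CELL."""
    return UNWANTED_CELL.sub("", html)
```

Typical use would be `strip_unwanted(fetch_html())`, with the result dropped into the page and styled with CSS. A regex like this only holds up while the provider's markup stays put, which is the same stability caveat raised above.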

Screen scraping as quality control

Hi David, nice blog, congrats! We've used Internet Macros, from iOpus, to automate quality checks and some minor "scraping", all legit. It has a variety of interfaces; the VBA one is simple enough that you can pull extractions or summaries into Excel for quick analysis. Cheers,
Alex

Automating HTML testing

If checking HTML is something you have to do regularly, a tool like Selenium (http://www.openqa.org/selenium/) makes it possible to automate the test runs.
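
Selenium drives a real browser. For the narrower job of confirming that a scraped table still has the columns a scraper expects, something much smaller may be enough. The sketch below uses only Python's standard library, not Selenium, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class HeaderCollector(HTMLParser):
    """Collect the text of every <th> cell in a document."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self._in_th = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._in_th = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_th:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "th" and self._in_th:
            self.headers.append("".join(self._buffer).strip())
            self._in_th = False


def table_headers(html):
    """Return the header cell texts in document order."""
    parser = HeaderCollector()
    parser.feed(html)
    parser.close()
    return parser.headers


def layout_unchanged(html, expected):
    """True if the page still has exactly the expected header cells."""
    return table_headers(html) == list(expected)
```

Run on a schedule against the fetched page, `layout_unchanged(page, ["School", "Status"])` would flag a changed layout before a regex-based scraper quietly starts producing bad output. The header names here are placeholders.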
