Screen Scraping

This may not seem very Web 2.0 (O'Reilly wrote that web services are 2.0 but screen scraping is 1.0), but I think there are a variety of reasons that screen scraping is still helpful, including:

  • Need to be closer to what the user sees
  • Don't have access directly to the database or a web service that will provide you the information you need (or you won't have access soon enough)

For example:

  • Testing whether your web pages are looking the way you expect. Sometimes testing this from the back end just isn't going to cut it, and you need to analyze the HTML to see if the page looks reasonable.
  • Writing a report that doesn't already exist on top of some reporting tool (for instance, on top of a defect-tracking system that you don't have access to the code for).
  • Creating archived versions of sites. Sometimes HTTRACK isn't enough on its own (for example, when you need to pull in full-sized videos from the source system as opposed to the streamed versions on the web). Also, you can use Perl to wrap around HTTRACK so that you have a standard way of passing options to it.
  • Seeing which of a large set of your sites are indexed in Google.
  • Testing your RSS feeds to determine if they have the right number of content items, etc. (I guess this would be more "RSS scraping" than screen scraping).
  • Importing from a static site to a CMS (less and less commonly needed nowadays).
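
As a rough illustration of the RSS item-count check in the list above, here is a minimal Python sketch. It assumes a plain RSS 2.0 feed with no namespaces on the item elements; the function name `count_rss_items` and the sample feed are made up for the example.

```python
import xml.etree.ElementTree as ET


def count_rss_items(rss_text):
    """Return the number of <item> elements in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    # RSS 2.0 nests items under <channel>; iter() finds them at any depth.
    return sum(1 for _ in root.iter("item"))


# Hypothetical sample feed, standing in for one fetched from a real site.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First</title></item>
    <item><title>Second</title></item>
  </channel>
</rss>"""

if __name__ == "__main__":
    print(count_rss_items(SAMPLE_FEED))
```

A test would compare the count against whatever number of items the feed is supposed to carry.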

Often, if there's a direct DB connection or an RSS feed or some other XML interface that you can use, then it probably makes sense to use that. Even in that case, the archiving and web page testing cases would probably benefit from screen scraping. 
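
For the web page testing case, one possible approach is to parse the HTML and assert on a few things a visitor would notice. The sketch below uses Python's standard `html.parser`; the class name `PageSummary`, the helper `summarize`, and the sample page are assumptions for illustration, not part of any particular test tool.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect the <title> text and count how often each tag opens."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.tag_counts = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so append rather than assign.
        if self._in_title:
            self.title += data


def summarize(html_text):
    """Parse a page and return the populated PageSummary."""
    parser = PageSummary()
    parser.feed(html_text)
    parser.close()
    return parser


# Hypothetical page, standing in for HTML fetched from the site under test.
SAMPLE_PAGE = (
    "<html><head><title>School Closings</title></head>"
    "<body><h1>Closings</h1><table><tr><td>A</td></tr></table></body></html>"
)

if __name__ == "__main__":
    page = summarize(SAMPLE_PAGE)
    print(page.title, page.tag_counts.get("table", 0))
```

Checks like "the title is what I expect" and "there is exactly one table" are crude, but they catch the kind of breakage that back-end tests tend to miss.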

Another (legit) use for scraping

>> Don't have access directly to the database or a web service that will provide you the information you need (or you won't have access soon enough)

Or... using a web service would be more complicated and bigger-budget than just grabbing HTML that is *almost* what you need (if the HTML can be expected to remain stable).

That was the case on http://www.vpr.net/community/school_closings/, which I developed several months back. VPR subscribes to the service and there are APIs to get full XML of the school closing data from the provider, but they also offer an HTML table version that was almost exactly what we needed. Rather than parsing their XML and re-building a table that would be almost identical to what they already had, I simply used PHP's fopen() to grab the HTML and used a tiny bit of regex to remove some info that we didn't need. We then just re-styled the table with CSS to match our site.

Too bad there aren't any school closings to demonstrate. Come back next winter! In any case, it proved far simpler to just slightly modify the HTML that already existed than it would have been to deal with XML parsing to come up with something that was almost identical.
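
The original was done in PHP with fopen(), and the exact regex isn't shown here, so the following is only a sketch of the same idea in Python. It assumes, purely for illustration, that the unwanted cells carry class="updated"; that class name, the function names, and the sample table are all hypothetical.

```python
import re
from urllib.request import urlopen

# Assumption: the cells to drop are <td>/<th> elements with class="updated".
UNWANTED_CELL = re.compile(
    r'<t[dh]\b[^>]*class="updated"[^>]*>.*?</t[dh]>\s*',
    re.DOTALL | re.IGNORECASE,
)


def fetch_html(url):
    """Download a page and decode it (roughly what fopen() on a URL gives you in PHP)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def strip_unwanted_cells(table_html):
    """Remove the unwanted cells and leave the rest of the table untouched."""
    return UNWANTED_CELL.sub("", table_html)


# Hypothetical provider output; the real table is not reproduced here.
SAMPLE_TABLE = (
    '<table><tr><th>School</th><th class="updated">Updated</th></tr>'
    '<tr><td>Example Elementary</td><td class="updated">6:05 AM</td></tr></table>'
)

if __name__ == "__main__":
    print(strip_unwanted_cells(SAMPLE_TABLE))
```

A regex like this only holds up while the provider's markup stays the same, which is the stability caveat mentioned above.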

Screen scraping as quality control

Hi David, nice blog, congrats! We've used Internet Macros, from Iopus, to automate quality checks and some minor "scraping", all legit. It has a variety of interfaces; the VBA one is simple enough that you can pull extractions or summaries into Excel for quick analysis. Cheers,
Alex

Automating HTML testing

If checking HTML is something you have to do regularly, a tool like Selenium (http://www.openqa.org/selenium/) can make automated test runs possible.
