As I mentioned in my last post, My Edmonton draws on over 30 different data sources, and gathering them required a number of different methods, including screen-scraping. I thought it would be interesting to developers out there to find out just how this was done, since screen-scraping (although legally grey) is a great way to produce your own API.
First, some background on screen-scraping. Many different apps out there do this, as it's often the only way to get the data an application needs. Some of my favorite apps that are based on screen-scraping are:
Mint.com: You know, the startup that Intuit (my former employer) acquired last year for untold millions?
Pageonce.com: Another financial website/iPhone app that I personally love for managing my finances. If only CIBC would stop blocking them...
AppSales: An iPhone app available only to iPhone developers with jailbroken phones. It screen-scrapes Apple's sales reports website and presents the data in a nice, easy-to-use format. As one Twitter poster wrote this morning (aptly timed): "Anyone who thinks iTunes sales reports work well should be forced to visit a website where they download their email one message at a time." This little app makes iTunes sales reports actually consumable. It has recently been supplanted by Apple's own ITC mobile app, but AppSales is still far better and was two years ahead of it. Why Apple doesn't provide an API for this service is beyond me.
Virtually every "cheap flight" website like Expedia uses screen-scraping behind the scenes to get flight schedule and price information.
So, sometimes you need to do a little screen-scraping to get by.
As I mentioned, My Edmonton uses over 30 data sources. These include:
23 datasets from the city's Open Data Initiative. These include amenity locations, neighborhood and ward information, garbage collection schedules, transit data, and event and construction calendars.
7 datasets scraped from the city's own website, including the assessment data.
Geocoding APIs such as Google, Yahoo, and geocoder.ca
Canada Post (for postal code lookups)
All of these datasets combined provide the immensely useful set of features that My Edmonton offers. For example, because My Edmonton knows the location of a given home (thanks to the City's assessment data and Google Maps' Geocoding API), we can surface all sorts of related information (a minimal geocoding sketch follows this list):
We can determine which garbage collection zone the user is in without them having to look it up on a map.
We can look at construction zones near their home.
We can tell them which ward and neighborhood they are in, and provide details about how to call their city councilor.
We can tell them their nearest bus stops, and the schedules for the routes that serve them.
We can tell them the locations of nearby amenities such as schools and playgrounds.
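To make that "address in, everything out" idea concrete, here is a minimal sketch of the geocoding step, assuming the Google Geocoding API's public JSON endpoint. This isn't our production code, and the address is just a made-up example:

require 'net/http'
require 'uri'
require 'json'

# Sketch only: geocode a street address with the Google Geocoding API
# and pull out the coordinates used for the lookups above.
def geocode(address)
  url = URI("http://maps.googleapis.com/maps/api/geocode/json" +
            "?address=#{URI.encode_www_form_component(address)}" +
            "&sensor=false")
  response = JSON.parse(Net::HTTP.get(url))
  location = response["results"].first["geometry"]["location"]
  [location["lat"], location["lng"]]
end

lat, lng = geocode("10240 105 Street NW, Edmonton, AB")  # example address only
# With lat/lng in hand, we can test the point against zone and ward polygons,
# and sort bus stops and amenities by distance to find the nearest ones.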
To my knowledge, this is a novel use of the city's data, and a very useful one.
Screen-scraping
This article was about screen-scraping, right?
We "only" had to screen-scrape about 8 different data sets - the city's website, and Canada Post. Three different methods were used for these:
When possible, the city's website was scraped with Needlebase, a recent Google acquisition. Needlebase is one of the coolest tools I've used in a while: you point it at a web page and give it "examples" of what to look for and how to parse the data, and after about 3 examples it usually gets everything right. I used it to scrape city councilor data, as well as amenities not provided in the City's Open Data Catalogue, like playgrounds, tennis courts, etc. I would have loved to have gathered facility schedule information, like the Kinsmen pool's hours, but that information is all in disparate formats and not scraper-friendly; some of it is in PDFs, some on web pages. Given enough time, I could gather this, but it wasn't worth it for this competition.
Canada Post was scraped with a custom Ruby script built on a package called Mechanize (a minimal sketch follows this list). Their site is NOT scrape-friendly: there's a kind of interview where you have to fill out multiple forms to get at the data you want, the site appears to be Microsoft-based, and all the form names are very long and convoluted.
The city's assessment website was scraped with a custom Ruby script that uses Nokogiri to parse the HTML (a sketch of the caching pattern also follows this list). Originally, my teammate Terry wrote it in C#, but it wasn't fast enough and didn't interface with a database, so I translated it to Ruby. The script makes heavy use of caching to avoid any duplicate calls to the website. As I mentioned, we've sent over 3.5 million requests to the city's servers in 2 months, and we now have 200,000 assessments.
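To make those last two methods a bit more concrete, here are two stripped-down sketches. Neither is our actual script, and the URLs, form field names, and selectors are placeholders, not Canada Post's or the city's real ones. First, the general Mechanize pattern for a form-driven "interview" site:

require 'rubygems'
require 'mechanize'

# Sketch only: walk a multi-form lookup with Mechanize.
agent = Mechanize.new
page  = agent.get("http://www.canadapost.ca/postal-code-lookup")  # hypothetical URL
form  = page.forms.first
form["streetNumber"] = "10240"            # hypothetical field names; the real
form["streetName"]   = "105 Street NW"    # ones are long and convoluted
form["city"]         = "Edmonton"
form["province"]     = "AB"
results = agent.submit(form, form.buttons.first)
puts results.search("td.postal-code").map { |td| td.text.strip }  # hypothetical selector

And here is the cached-fetch pattern the assessment scraper relies on, again with a hypothetical URL and hypothetical CSS selectors:

require 'open-uri'
require 'nokogiri'
require 'fileutils'
require 'digest/md5'

CACHE_DIR = "cache"
FileUtils.mkdir_p(CACHE_DIR)

# Return the page body, reading from disk if we've already fetched this URL,
# so we never hit the server twice for the same account.
def fetch(url)
  path = File.join(CACHE_DIR, Digest::MD5.hexdigest(url))
  return File.read(path) if File.exist?(path)
  html = open(url).read
  File.open(path, "w") { |f| f.write(html) }
  html
end

doc = Nokogiri::HTML(fetch("http://assessment.edmonton.example/account/1234567"))
address = doc.at_css("#propertyAddress").text.strip    # hypothetical selectors
value   = doc.at_css("#assessedValue").text.strip
puts "#{address}: #{value}"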
I'd prefer not to post the full code, as there is a lot of it, but if anybody is interested in seeing it, feel free to leave a comment.
Source: http://blog.myyeg.com/2010/09/my-edmontons-many-data-sources/