If you’ve ever dug through endless web search results, you have scraped, albeit manually. Web data scraping is tech slang for getting your hands on useful but unstructured information.
Computers can do a lot of this work for us, but those of us who are used to old-fashioned research—ahem, journalists—don’t always realize how easy this can be, or how much time we can save.
Scraping is not the same as web indexing, which is how bots and web crawlers like Google match up search results to your query by looking at metadata. "Data scraping," by contrast, tries to replicate the human process of manually looking for information that is not as accessible—think research papers, where data is usually just plain text.
Scraping was born out of developers' need to interact with data from websites without an open API, or data interface. Since there was no easy way for them to extract information their apps needed, they wrote scraping software. (A great example of commercialization of a scraper can be found in the famous case of Craigslist versus Padmapper.)
Craigslist, to this day, chooses not to offer an API for developers, making it nearly impossible for outside apps to do cool things with their data. This left developers such as Padmapper with a duty to innovate software that took the good stuff from plain text and structured it for purpose. Padmapper used a software which scraped data from Craigslist and mapped out rentals based on zip code, number of rooms, and price. Pretty useful and way more user friendly than the Craiglist interface.
Craiglist sued Padmapper on grounds of infringement, claiming the copyrights on user data. But Craiglist only had the license to publicly share their user’s data, nowhere near a copyright, and the case was lost. This is one of many situations which illustrate why developers have built scraping solutions for a lack of APIs.
It’s not always so legally entangled, a lot of times scraping can be used for simple tasks like taking data on a table inside a PDF or searching for a new pair of cheap kicks. It’s especially a skill worth learning if you regularly use information that comes from charts, or you’re an investigative journalist, or are constantly looking at sources with periodic updates.
Last year, a close friend of mine set out to explain the impact online streaming had on traditional television viewership. He wanted to see if a cable network attracted more viewers or lost ad revenue when it made a show available online for free. He compiled a massive amount of data by using the Wayback Machine and looking at when these channels first started putting shows online (ABC was the pioneer in 2006 with shows like Lost and Desperate Housewives). My friend also used an endless amount of numbers and table with statistics on viewers from Nielsen reports and sites like Wikipedia.
The only problem was the websites he used didn’t allow him to copy and paste a table into STATS (the software he used for analytics). He would end up with 15 cells on Excel for a table with 300 cells online. He tried screenshots and text conversion but it was a mess. By the end of it, he decided it was easier to manually type in thousands of numbers for his database. At one point, he considered outsourcing it, but he realized it’d be a lot of money. The point is: Even basic research like his can be discouraged by little formatting issues. However, most of this process—getting the data into the right format so he could manipulate it— could have easily been automated. For free.
The first way to go about getting data from a table online into nice rows and columns in an Excel or CSV document is with Google Spreadsheets. Many people don’t know this, but Spreadsheets is actually pre-packaged with a command for importing data from HTML. If you open up a blank document, and you pick a cell and type in "=ImportHTML(url, query, index)" and you provide those inputs, Spreadsheets imports the information and formats those rows and columns exactly like the online equivalent.
Here’s an example I did for AMC’s The Killing (the best show on television right now!) that just got renewed for a fourth season by Netflix. I simply created a new Spreadsheets file on my Google Drive and I typed in:
"=ImportHtml("http://en.wikipedia.org/wiki/The_Killing_(U.S._TV_series)", "table", 4),"
And before I knew it, I had a nice table ready begging me to extrapolate:
If the table is interactive Google Spreadsheets sometimes gives a PARSE ERROR, ConvExtra is a workaround. This website simply takes a URL and with a simple search-like click, extracts up to a thousand pages of it for free. It gives different fields for each section and allows you to download it in different formats.
Whatever your use may be, understanding a scraper is easiest when it’s broken down into three components. The first is a simple queue of pages that contain the data you need somewhere in them, the second is storage for the data, and the third is a downloader and a parser that structures the information.
For any Google Doc enthusiasts like myself, ImportHTML is a pretty nifty way to reinforce your love for the enterprise but there are also sites like ScraperWiki that aid you in writing code to scrape multiple sites, which is often a little more time consuming and complicated but way more powerful. Here is a package specifically geared for journalists.
[Image: Flickr user Jill Ferry]