A defense of web-scraping as a vital tool for journalists

Another day, another overlong justification of programming. On the NICAR mailing list, someone asked how people are coping with the shutdown of Needlebase, a web-scraping tool maintained by Google. This led to the inevitable debate over whether an out-of-luck reporter should just learn enough code to scrape the web herself or keep searching for push-button scraping tools.

Ed. B, who’s no technical slouch himself, wrote to the list saying that scraping is “rarely necessary.”

The underlying tools in ScraperWiki, at least the Python and Ruby ones, are about as easy as you can get. You can build something on ScraperWiki in Ruby, Python or PHP, then run it on your own machine. Recursive “wget” followed by local parsing is another option, at least for some kinds of authentication.

Personally I’m not a huge fan of scraping for a number of reasons:

1. It’s time-consuming and error-prone, neither of which is a characteristic compatible with journalism. If you want to be fast and correct, scraping is a bad way to start. 😉

2. It’s very often a violation of someone’s terms of service.

3. It’s easy to collect mountains of irrelevant bits when the real story can be uncovered much more effectively by human-to-human interactions.

I think scraping is *rarely* necessary. Sure, not every site provides an API or downloadable spreadsheets, but we are nowhere near as effective at using the sites *with* APIs and spreadsheets as we could be.
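The “wget followed by local parsing” workflow Ed mentions can be sketched with nothing but Python’s standard library. This is a minimal, hypothetical example: the sample HTML below stands in for a page you’ve already saved to disk with `wget -r`, and the goal is just to pull out every linked document.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect every href found in an <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# Stand-in for a page fetched earlier with recursive wget
page = """<html><body>
<h1>Inspection reports</h1>
<a href="/reports/2010.pdf">2010</a>
<a href="/reports/2011.pdf">2011</a>
</body></html>"""

extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['/reports/2010.pdf', '/reports/2011.pdf']
```

In practice you’d loop this over every file wget saved, but the core idea — download once, parse locally, rerun the parser as often as you like — is all here.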

My problem with this argument is that it makes scraping seem like a straightforward exercise of data-collection. This is not at all the case. The most obvious rebuttal is technical: there is a near-infinite combination of design patterns and configurations – the vast majority of them resulting in slop when it comes to government sites – and no commercial program or third-party script can anticipate every variation. Learning to code gives you the power to adapt efficiently, rather than trying to wrangle someone else’s program into something barely useful for the situation.
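To make the “slop” concrete: suppose – a hypothetical example – a state agency’s table mixes comma-separated numbers, dollar signs, stray whitespace, and several different “no data” markers in one column. No off-the-shelf tool knows this site’s quirks, but a few lines of custom code handle them exactly:

```python
def clean_cell(cell):
    """Normalize one cell from a messy table: strip whitespace,
    map this (hypothetical) site's various 'no data' markers to None,
    and drop commas and dollar signs before converting to int."""
    cell = cell.strip()
    if cell in ("", "N/A", "n/a", "--"):
        return None
    return int(cell.replace(",", "").replace("$", ""))

raw = ["1,234", " 56 ", "N/A", "$7,890", "--"]
cleaned = [clean_cell(c) for c in raw]
print(cleaned)  # [1234, 56, None, 7890, None]
```

The point isn’t this particular function – every site needs its own – it’s that writing it yourself takes minutes, while bending a generic tool around quirks like these can take hours.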

But my main concern about being code-oblivious is that you cannot know, as Don Rumsfeld said, what you don’t know. This is why I compare scraping to conducting interviews without speaking the same language. It’s quite possible, with an interpreter, to get the single piece of data that you need:

  • Reporter: “Did you see the man who killed the child?”
  • Interpreter: “¿Has visto al hombre que mató al niño?”
  • Subject: “Sí”
  • Interpreter: “Yes”

But if an interview is meant to explore – that is, you don’t really know what the subject might tell you and where he will lead you – then you are at a huge disadvantage if you use an interpreter, at least in my brief experience with non-English speakers. You not only miss whatever nuance (or, in some cases, actual meaning) that the translator misses, you miss the ability to tell at which point in a long statement the subject’s eyes looked down, as if trying to hide something. You lose the ability for give-and-take, because each of you has to wait for the interpreter to finish a translation (and this repetition, of course, effectively halves the length of time that you have for an interview).

Learning a new language is a real investment, but it’s hard to imagine being effective without it in a foreign country. And to belabor the metaphor, we’re all entering a new country with the advent of our digitized society. Here, too, it is worth being acquainted with the new language.

My response to Ed on the NICAR list:

I feel these complaints about scraping either miss the point of scraping or set unreasonable expectations. Scraping (whether from web, PDF, or other non-normalized formats) is just information gathering: it carries the same limitations as every other form of reporting, and it rewards skill in the same way.

All of your complaints could be made about the method of interviewing, something which is not inherently required to write a worthwhile journalistic story: it’s *extremely* time-consuming and error-prone. It violates people’s sense of comfort and privacy and – at times – official disclosure policies. And it frequently nets you mountains of cumbersome handwritten notes that you’ll never use.

The underlying theme in most data stories, in fact, is that information coming from officials’ interviews is mostly bullshit, which is why we turn to data to find the fuller picture. And single-source data is frequently bullshit, which is why we include interviewing, observation, and other data sources for the fuller picture.

There’s rarely an API to access data on a website in the same way that human interview subjects rarely issue (useful) pre-written questions for you to ask. Information found on the web is as diverse in formats and content as it is elsewhere and no prebuilt software solution will be able to deal with all of it at an acceptable standard.

Whether it’s worth it to learn to code custom scrapers* is definitely still a debate. It certainly is more justifiable than it was even five years ago, given the much lower barrier to entry and the much larger amount of digital information. I agree with Aron Pilhofer that a day of Excel training for every journalist would bring massive benefits to the industry as a whole. But that’s because journalism has long been overdue for spreadsheet skills, not because spreadsheets are in themselves, alone, useful sources of information.

(* of course, learning enough programming to make custom scrapers will, as a bonus, let you do every other powerful data-related task made possible through the mechanism and abstract principles of code)

I think it’s ironic that programming advocates are sometimes suspected of putting too much faith into scraping/the Internet/computers writ large. It’s usually the people with minimal insight about scraping who are dangerously naïve about the flaws inherent to information gathered online.

At some point, it’s worth evaluating whether the time and energy (and money) you spend in clumsily navigating a website and figuring out how to wrangle pre-built solutions could be better spent becoming more technical. Just like you don’t get a practical appreciation of “if your mother says she loves you, check it out” until you’ve actually done some reporting.

…just to be clear, I’m not saying that it’s either learn-to-code or stay-off-the-computer. The solutions so far mentioned in the thread are good and, in some cases, may be all that’s needed. But you could say the same for a Mexico bureau chief who hires a full-time translator instead of trying to learn conversational Spanish. There are plenty of as-yet-unrealized benefits in moving past the third-party solutions.

I'm a programmer journalist, currently teaching computational journalism at Stanford University. I'm trying to do my new blogging at blog.danwin.com.