Coding for Journalists: A four-part series

Photo by Nico Cavallotto on Flickr

Update: Read the Bastards Book of Ruby

I just wanted to point visitors interested in web scraping to a much better resource than this page: The Bastards Book of Ruby, which I wrote and have put online for your free perusal. It contains everything in this scraping guide (which I’m keeping up for posterity) but with much better code and examples. For instance, I wrote five whole chapters on web scraping:

http://ruby.bastardsbook.com/chapters/web-scraping

http://ruby.bastardsbook.com/chapters/web-inspecting-html

http://ruby.bastardsbook.com/chapters/web-inspecting-traffic

http://ruby.bastardsbook.com/chapters/html-parsing

http://ruby.bastardsbook.com/chapters/web-crawling

And other chapters with specific web-scraping projects.

You can read my old scraping guide for entertainment purposes, but I wouldn’t use it for actual education.
---

So a little while ago, I set out to write some tutorials that would guide the non-coding-but-computer-savvy journalist through enough programming fundamentals so that he or she could write a web scraper to collect data from public websites. A “little while” turned out to be more than a month and a half. I actually wrote most of it in a week and then forgot about it. The timeliness of the fourth lesson, which shows how to help Pfizer in its mission to be more transparent, compelled me to just publish them in incomplete form. There are probably inconsistencies in the writing and in some of the code examples, but the final code sections at the end of each tutorial do seem to execute as expected.

As the tutorials are aimed at people who aren’t experienced programmers, the code is pretty verbose, pedantic, and in some cases, a little inefficient. That was my attempt to make the code as readable as possible, and I very much welcome corrections and suggested edits.

DISCLAIMER: The code, data files, and results are meant for reference and example only. Use them at your own risk.

Tutorial 1: Go from knowing nothing to scraping Web pages. In an hour. Hopefully

Tutorial 2: Scraping a County Jail Website to Find Out Who’s in Jail – This uses all the concepts from the first tutorial and applies them to something that a cops reporter might actually want to try out.

Tutorial 3: Who’s Been in Jail Before: Cross-checking the jail logs with the court system with Ruby’s Mechanize – This lesson introduces you to another Ruby library, Mechanize, which lets you automate the filling-out of forms so that you can access online databases. In this case, we look up California criminal case histories to see whether current inmates are alleged repeat offenders.

Tutorial 4: Improving Pfizer’s Dollars-to-Doctors Pay List – Last week, Pfizer released a list of nearly 5,000 doctors and medical institutions to which it made $35 million in consulting and expense payments. Fun. Unfortunately, the list, as it initially existed online, is just about useless to anyone wanting to examine trends. This tutorial provides a script to make the list more interesting to journalists.
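To give a rough idea of why getting a payment list into tabular form matters, here is a minimal sketch of the kind of question you can ask once the data is in rows and columns: who received the most in total? This is not the script from Tutorial 4. The recipients, column names, and amounts below are all made up for illustration, and the helper `totals_by_recipient` is a hypothetical name. It uses only Ruby's standard `csv` library.

```ruby
require 'csv'

# Hypothetical sample rows. The names, columns, and amounts are invented
# for illustration and are NOT taken from Pfizer's actual list.
SAMPLE_CSV = <<~DATA
  recipient,state,category,amount
  Dr. A. Example,CA,Consulting,1500
  Dr. B. Sample,NY,Meals,250
  Dr. A. Example,CA,Travel,700
  Example University Hospital,TX,Research,12000
  Dr. B. Sample,NY,Consulting,3000
DATA

# Add up the amount column for each recipient and return
# [name, total] pairs sorted from the largest total to the smallest.
def totals_by_recipient(csv_text)
  totals = Hash.new(0) # recipients not seen yet start at 0
  CSV.parse(csv_text, headers: true) do |row|
    totals[row['recipient']] += row['amount'].to_i
  end
  totals.sort_by { |_name, total| -total }
end

# Print one line per recipient, biggest total first.
totals_by_recipient(SAMPLE_CSV).each do |name, total|
  puts "#{name}: $#{total}"
end
```

With the payments scattered across separate web pages, a ranking like this is tedious to produce by hand. Once they are in one table, it takes a dozen lines.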

danwin.com

Words, photos, and code by Dan Nguyen. The 'g' is mostly silent.
