Tag Archives: tutorial

dataist blog: An inspiring case for journalists learning to code

About a year ago I threw up a long, rambling guide hoping to teach non-programming journalists some practical code. Looking back at it, it seems inadequate. Actually, I misspoke, I haven’t looked back at it because I’m sure I’ll just spend the next few hours cringing. For example, what a dumb idea it was to put everything from “What is HTML” to actual Ruby scraping code all in a gigantic, badly formatted post.

The series of articles have gotten a fair number of hits but I don’t know how many people were able to stumble through it. Though last week I noticed this recent trackback from dataist, a new “blog about data exploration” by Finnish journo Jens Finnäs. He writes that he has “almost no prior programming experience” but, after going through my tutorials and checking out Scraperwiki, was able to produce this cool network graph of the Ratata blog network after about “two days of trial and error”:

Mapping of Ratata blogging network by Jens Finnäs of dataist.wordpress.com

Mapping of Ratata blogging network by Jens Finnäs of dataist.wordpress.com

I hope other non-coders who are still intimidated by the thought of learning programming are inspired by Finnas’s example. Becoming good at coding is not a trivial task. But even the first steps of it can teach a non-coder some profound lessons about data important enough on their own. And if you’re a curious-type with a question you want to answer, you’ll soon figure out a way to put something together, as in Finnas’s case.

ProPublica’s Dollars for Docs project originated in part from this Pfizer-scraping lesson I added on to my programming tutorial: I needed a timely example of public data that wasn’t as useful as it should be.

My colleagues Charles Ornstein and Tracy Weber may not be programmers (yet), but they are experienced enough with data to know its worth as an investigative resource, and turned an exercise in transparency into a focused and effective investigation. It’s not trivial to find a story in data. Besides being able to do Access queries themselves, C&T knew both the limitations of the data (for example, it’s difficult to make comparisons between the companies because of different reporting periods) and its possibilities, such as the cross-checking of names en masse from the payment lists with state and federal doctor databases.

Their investigation into the poor regulation of California nurses – a collaboration with the LA Times that was a Pulitzer finalist in the Public Service category – was similarly data-oriented. They (and the LA Times’ Maloy Moore and Doug Smith) had been diligently building a database of thousands of nurses – including their disciplinary records and the time it took for the nursing board to act – which made my part in building a site to graphically represent the data extremely simple.

The point of all this is: don’t put off your personal data-training because you think it requires a computer science degree, or that you have to become great at it in order for it to be useful. Even if after a week of learning, you can barely put together a programming script to alphabetize your tweets, you’ll likely gain enough insight to how data is made structured and useful, which will aid in just about every other aspect of your reporting repertoire.

In fact, just knowing to avoid taking notes like this:

Colonel Mustard used the revolver in the library? (not library)
Miss Scarlet used the Candlestick in the dining room? (not Scarlet)
“Mrs. Peacock, in the dining room, with the revolver? “
“Colonel Mustard, rope, conservatory?”
Mustard? Dining room? Rope (nope)?
“Was it Mrs. Peacock with the candlestick, inside the dining room?”

And instead, recording them like this:

Who/What? Role? Ruled out?
Mustard Suspect N
Scarlet Suspect Y
Peacock Suspect N
Revolver Weapon Y
Candlestick Weapon Y
Rope Weapon Y
Conservatory Place Y
Dining Room Place N
Library Place Y

…will make you a significantly more effective reporter, as well as position you to have your reporting and research become much more ready for thorough analysis and online projects.

There’s a motherlode of programming resources available through single Google search. My high school journalism teacher told us that if you want to do journalism, don’t major in it, just do it. I think the same can be said for programming. I’m glad I chose a computer field as an undergraduate so that I’m familiar with the theory. But if you have a career in reporting or research, you have real-world data-needs that most undergrads don’t. I’ve found that having those goals and needing to accomplish them has pushed my coding expertise far quicker than did any coursework.

If you aren’t set on learning to program, but want to get a better grasp of data, I recommend learning:

  • Regular expressions – a set of character patterns, easily printable on a cheat-sheet for memorization, that you use in a text-editor’s Find and Replace dialog to turn a chunk of text into something you can put into a spreadsheet, as well as clean up the data entries themselves. Regular-expressions.info is the most complete resource I’ve found. A cheat-sheet can be found here. Wikipedia has a list of some simple use cases.
  • Google Refine – A spreadsheet-like program that makes easy the task of cleaning and normalizing messy data. Ever go through campaign contribution records and wish you could easily group together and count as one, all the variations of “Jon J. Doe”, “Jonathan J. Doe”, “Jon Johnson Doe”, “JON J DOE”, etc.? Refine will do that. Refine developer David Huynh has an excellent screencast demonstrating Refine’s power. I wrote a guide as part of the Dollars for Docs tutorials. Even if you know Excel like a pro – which I do not – Refine may make your data-life much more enjoyable.

If you want to learn coding from the ground up, here’s a short list of places to start:

Coding for Journalists 101 : A four-part series

nico.cavallotto

Photo by Nico Cavallotto on Flickr

Update, January 2012: Everything…yes, everything, is superseded by my free online book, The Bastards Book of Ruby, which is a much more complete walkthrough of basic programming principles with far more practical and up-to-date examples and projects than what you’ll find here.

I’m only keeping this old walkthrough up as a historical reference. I’m sure the code is so ugly that I’m not going to even try re-reading it.

So check it out: The Bastards Book of Ruby

-Dan

Update, Dec. 30, 2010: I published a series of data collection and cleaning guides for ProPublica, to describe what I did for our Dollars for Docs project. There is a guide for Pfizer which supersedes the one I originally posted here.

So a little while ago, I set out to write some tutorials that would guide the non-coding-but-computer-savvy journalist through enough programming fundamentals so that he/she could write a web scraper to collect data from public websites. A “little while” turned out to be more than a month-and-a-half. I actually wrote most of it in a week and then forgot about. The timeliness of the fourth lesson, which shows how to help Pfizer in its mission to more transparent, compelled me to just publish them in incomplete form. There’s probably inconsistencies in the writing and some of the code examples, but the final code sections at the end of each tutorial do seem to execute as expected.

As the tutorials are aimed at people who aren’t experienced programming, the code is pretty verbose, pedantic, and in some cases, a little inefficient. It was my attempt to think how to make the code most readable, and I’m very welcome to editing changes.

DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.

Coding for Journalists 104: Pfizer’s Doctor Payments; Making a Better List

Update (12/30): So about an eon later, I’ve updated this by writing a guide for ProPublica. Heed that one. This one will remain in its obsolete state.

Update (4/28): Replaced the code and result files. Still haven’t written out a thorough explainer of what’s going on here.

Update (4/19): After revisiting this script, I see that it fails to capture some of the payments to doctors associated with entities. I’m going to rework this script and post and update soon.

So the world’s largest drug maker, Pfizer, decided to tell everyone which doctors they’ve been giving money to to speak and consult on its behalf in the latter half of 2009. These doctors are the same ones who, from time to time, recommend the use of Pfizer products.

From the NYT:

Pfizer, the world’s largest drug maker, said Wednesday that it paid about $20 million to 4,500 doctors and other medical professionals for consulting and speaking on its behalf in the last six months of 2009, its first public accounting of payments to the people who decide which drugs to recommend. Pfizer also paid $15.3 million to 250 academic medical centers and other research groups for clinical trials in the same period.

A spokeswoman for Pfizer, Kristen E. Neese, said most of the disclosures were required by an integrity agreement that the company signed in August to settle a federal investigation into the illegal promotion of drugs for off-label uses.

So, not an entirely altruistic release of information. But it’s out there nonetheless. You can view their list here. Jump to my results here

Not bad at first glance. However, on further examination, it’s clear that the list is nearly useless unless you intend to click through all 480 pages manually, or, if you have a doctor in mind and you only care about that one doctor’s relationship. As a journalist, you probably have other questions. Such as:

  • Which doctor received the most?
  • What was the largest kind of expenditure?
  • Were there any unusually large single-item payments?

None of these questions are answerable unless you have the list in a spreadsheet. As I mentioned in earlier lessons…there are cases when the information is freely available, but the provider hasn’t made it easy to analyze. Technically, they are fulfilling their requirement to be “transparent.”

I’ll give them the benefit of the doubt that they truly want this list to be as accessible and visible as possible…I tried emailing them to ask for the list as a single spreadsheet, but the email function was broken. So, let’s just write some code to save them some work and to get our answers a little quicker.
Continue reading

Coding for Journalists 103: Who’s been in jail before: Cross-checking the jail log with the court system; Use Ruby’s mechanize to fill out a form

This is part of a four-part series on web-scraping for journalists. As of Apr. 5, 2010, it was a published a bit incomplete because I wanted to post a timely solution to the recent Pfizer doctor payments list release, but the code at the bottom of each tutorial should execute properly. The code examples are meant for reference and I make no claims to the accuracy of the results. Contact dan@danwin.com if you have any questions, or leave a comment below.

DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.

In particular, with lesson 3, I skipped basically any explanation to the code. I hope to get around to it later.

Going to Court

In the last lesson, we learned how to write a script that would record who was in jail at a given hour. This could yield some interesting stories for a crime reporter, including spates of arrests for notable crimes and inmates who are held with $1,000,000 bail for relatively minor crimes. However, an even more interesting angle would be to check the inmates’ prior records, to get a glimpse of the recidivism rate, for example.

Sacramento Superior Court allows users to search by not just names, but by the unique ID number given to inmates by Sacramento-area jurisdictions. This makes it pretty easy to link current inmates to court records.


However, the techniques we used in past lessons to automate the data collection won’t work here. As you can see in the above picture, you have to fill out a form. That’s not something any of the code we’ve written previously will do. Luckily, that’s where Ruby’s mechanize comes in.

Continue reading

Coding for Journalists 101: Go from knowing nothing to scraping Web pages. In an hour. Hopefully.

UPDATE (12/1/2011): Ever since writing this guide, I’ve wanted to put together a site that is focused both on teaching the basics of programming and showing examples of practical code. I finally got around to making it: The Bastards Book of Ruby.

I’ve since learned that trying to teach the fundamentals of programming in one blog post is completely dumb. Also, I hope I’m a better coder now than I was a year and a half ago when I first wrote this guide. Check it out and let me know what you think:

http://ruby.bastardsbook.com

Someone asked in this online chat for journalists: I want to program/code, but where does a non-programmer journalist begin?

My colleague Jeff Larson gave what I believe is the most practical and professionally-useful answer: web-scraping (jump to my summary of web-scraping here, or read this more authorative source).

This is my attempt to walk someone through the most basic computer science theory so that he/she can begin collecting data in an automated way off of web pages, which I think is one of the most useful (and time-saving) tools available to today’s journalist. And thanks to the countless hours of work by generous coders, the tools are already there to make this within the grasp of a beginning programmer.

You just have to know where the tools are and how to pick them up.

Click here for this page’s table of contents. Or jump to the the theory lesson. Or to the programming exercise. Or, if you already know what a function and variable is, and have Ruby installed, go straight to two of my walkthroughs of building a real-world journalistic-minded web scraper: Scraping a jail site, and scraping Pfizer’s doctor payment list.

Or, read on for some more exposition:

Continue reading