Next week, my short course on data journalism at New York University (SCPS) begins. It’s only five weeks long, but I wanted the material and readings I used for the class to be accessible for anyone. You can check it out at smalldatajournalism.com, a site I built using Jekyll and which will serve as the home for future musings and data projects.
The panel was titled, “Better Know a Developer” and my part of it was to discuss how non-programming journalists can work best with programmers.
You can see the slides here. The advice boils down to: Don’t believe in magic. Think about how you would do it yourself. And use a spreadsheet.
Update: I guess I’m not being completely hyperbolic; Mr. Pope’s “Republia Times” is nominated for “Most Significant Impact” and “Best Gameplay” awards at this year’s Games for Change Festival…not bad for a game he made in 48 hours as practice.
Ever wondered what it’s like to edit a newspaper and influence what the public thinks and cares about? The small, but financially stable Republia Times has an opening for editor-in-chief. The job duties are simply “increase [the public’s] loyalty by editing the Republia Times carefully. Pick only stories that highlight the good things about Republia and its government.”
“The Republia Times” was created by developer Lucas Pope and is as sharp a satire of newspapering as I’ve ever seen in the gaming world. Its crude mechanics and appearance may be off-putting, but as a whole, “The Republia Times” is astonishing considering that Pope wrote it to practice for a 48-hour game development competition. Not only that, but it was his first Flash game, which, if you’ve never tried learning the Flash development environment, is astonishing in itself.
I don’t think Pope has been a newspaper editor before, either, but he manages to capture the cynicism behind modern and classic yellow journalism: political articles bore the readership, weather and sports attract it. The twist here is that the Republia Times is the mouthpiece of the state, and so you have to balance the interesting tabloid material (“C&J Tie the Knot!”) with boilerplate to make the government look good (“Latest poll shows broad satisfaction with government leaders”). There’s a little mini-Tetris challenge in fitting the stories in (you choose how much real-estate each article gets) before the clock runs out, and an additional plot twist halfway through the game.
The game is probably too cynical for most journalists, at least the ones who don’t fancy themselves government spokespeople, but even the most idealistic of editors will get a kick out of how Pope manages to distill the profession into something so simplistic. That Pope manages to make it entertaining and thought-provoking despite the limits he was working with is a notable achievement. I can’t think of any news-related game that has been better executed, though, admittedly, the field is small. The Knight Foundation News Challenge has given hundreds of thousands of dollars in grants to journalism-themed games. If I were them, I’d give Pope six figures to make something, even though it may be more subversive than the journalism industry would prefer.
I’ve actually buried the lede here. I only came across the Republia Times, which Pope created last year, because I read about his upcoming game, “Papers, Please!“, which puts you in the shoes of a border inspector in a Cold War-era nation. It’s only in playable beta (free for Mac and PC), but I wouldn’t be surprised if it’s my favorite game of the year. The trailer speaks for itself:
Pope says the game will hopefully be out this summer. If you’re on Steam, give Pope an upvote on Greenlight.
I used to work with Susan White at ProPublica but even I was completely surprised yesterday when InsideClimate News, the non-profit news website she now leads, won the Pulitzer Prize for National Reporting for an in-depth investigation of a 2010 pipeline spill in Michigan.
Don’t remember that spill? Maybe that’s why InsideClimate titled its story, “the biggest oil spill you’ve never heard of.”
You might also describe InsideClimate News as “the online news startup you’ve never heard of” – I wouldn’t know anything about it if it hadn’t been where Susan moved to. The surprise isn’t that she led yet another Pulitzer Prize project (she edited two such projects already at the San Diego Union Tribune and ProPublica) – it’s that InsideClimate News just seemed too small, too novel of a news organization to earn the Pulitzer committee’s notice.
At just 5 years old and with only 7 full-time reporters, InsideClimate News is likely the smallest news organization ever to win in the National Reporting category (see table below), and perhaps the smallest news organization ever to win any Pulitzer since the Point Reyes Light in 1979.
Here’s another size measurement: According to the AP, InsideClimate had about 200,000 page views last month. The winner of last year’s National Reporting Pulitzer, the Huffington Post, is also an online-only news site. But it reportedly racks up a billion page views a month: i.e., 5,000 times the page views at InsideClimate.
Numbers may seem like a superficial metric, but there’s a reason why big papers dominate every Pulitzer category (except for maybe Public Service) – big investigations require big resources. InsideClimate’s investigation occupied 3 of their reporters for 7 months, a major commitment for a news organization still struggling to draw a daily readership. Even more impressive: InsideClimate is based in Brooklyn, but they invested time and money (i.e. a travel budget) for a story several states away.
“That’s quite a sacrifice to make when you’re trying to get eyeballs on your website,” said McGowan, who started her reporting with a trip out to Marshall, Mich., in November 2011. “We made the commitment to this story because we thought this story mattered.”
“Pulling me off, their most seasoned reporter, was an act of faith to some degree because I could’ve been pounding out five, six, seven stories a week”
I didn’t read InsideClimate’s project when it came out and the comment/social-media sections on the early stories didn’t show huge pickup initially. The presentation is what you’d expect from a small no-frills operation: nearly all the photos come from government sources and the graphics are relatively straightforward and non-interactive. But thankfully, the stories were judged by the quality and impact of their investigation, rather than fanciness of presentation.
The future of journalism as a profession, never mind investigative news, is still uncertain. But InsideClimate’s Pulitzer is a great validation of how passionate startups can still make a huge impact in the proud tradition of watchdog journalism. Congrats to InsideClimate and its lead reporters, Lisa Song, Elizabeth McGowan and David Hasemyer.
An aggregated list of National Reporting Pulitzers
The list below is scraped from the Pulitzer’s official list, and I used OpenRefine to cluster the names together. Interestingly, the last three National Reporting Pulitzers have been won by online-only organizations: InsideClimate News, Huffington Post, and ProPublica. In 2009, the St. Petersburg Times won a National Reporting Pulitzer for its PolitiFact project. PolitiFact had a print component but it can be reasonably seen as the first Pulitzer-winning website.
Fifteen years ago, there was debate over whether the Pulitzer committee should have a separate prize for online-only submissions. The committee has wisely decided to judge journalism by its quality and not what format it comes in, and the success of news websites in this prestigious category is a good sign of how forward-thinking the Pulitzers have become.
|Name||National Reporting Pulitzers|
|New York Times||17|
|Wall Street Journal||14|
|Des Moines Register and Tribune||7|
|Los Angeles Times||7|
|United Press International||3|
|St. Petersburg Times||3|
|Dallas Times Herald||2|
|Dayton Daily News||2|
|Christian Science Monitor||2|
|Chicago Daily News||1|
|Gannett News Service||1|
|Kansas City Star||1|
|New York Daily News||1|
|New York Herald Tribune||1|
|Newhouse News Service||1|
|Providence Journal and Evening Bulletin||1|
|Scripps-Howard Newspaper Alliance||1|
|Atlanta Journal and Constitution||1|
|Dallas Morning News||1|
|Kansas City Times||1|
|Miami (FL) News||1|
|Washington Daily News||1|
But “non-notable” only in that their name wasn’t immediately connected to any famous event or accomplishment that most readers remember or had ever heard about. Because even with just a half-day’s worth of interviews to learn about a late, complete stranger, you could find out at least one notable accomplishment from his/her surviving relatives, as well as details of personal drama universal to us all, and distill his/her life into a profile as interesting and inspiring as the celebrity obits that shared space in the next-day’s section.
I hadn’t read many non-celeb obits since moving to NYC. But while waiting for take-out, I checked the Times on my phone and came across this obit about a young well-off-salesman-turned-social-worker:
After she had unpacked, and her toothbrush was on the sink, the woman realized something was missing. She turned to John Sullivan, the tall, smiling social worker who had discovered her on a bench in the Broadway median. The woman was a nurse who had lost her grip and had been living in a tent on the Upper West Side, until Mr. Sullivan coaxed her off the street. She was delighted to be in an apartment of her own.
“Just one thing,” she told him. “I really need a tent for here.”
Mr. Sullivan left. He came back with a tent, which she pitched in the living room. Some time and medication later, she put it away.
In Mr. Sullivan’s line of work, there was no instruction manual.
Mr. Sullivan grew up in Sleepy Hollow, N.Y., a star high school quarterback and pitcher who took his golden personality and looks into sales. He made a fine living that provided him, as he once said, “lots of travel and a closet full of Brooks Brothers clothes.” He also drank too much. Then he stopped.
One morning, on his way to a run around the reservoir in Central Park, he passed homeless people in the street. The next day, he applied to Fordham University to begin graduate school in social work. In 1995, he got a job with Pathways to Housing, an agency that finds homes and help for people with mental illness and addiction living on the street. He prowled East Harlem before it was gentrified, meeting people living under railroad tracks and in abandoned buildings.
Read the rest of the obit here: In Helping Others, Finding What Was Never Truly Lost, by Jim Dwyer
Just came back from an inspiring week at the National Institute for Computer-Assisted Reporting in Raleigh, NC. Of all the journalism conferences I’ve been to, this one had the most to learn from and the most attendees excited to learn. There was real discussion about news apps being their own form of story-telling and art and not just uploading a bunch of numbers as HTML.
I led a couple of sessions. One boiled down to basically, use Firebug, which you can pretty much glean from a tutorial I wrote for ProPublica on how I grabbed the data from drugmaker Cephalon’s Flash site. I wrote another Ruby tutorial, starting from “Hello World” to building a Foursquare/Google Maps mashup…that was almost doable in an hour-session had I been better prepared with presentation materials.
One reason to try learning how to code now is that the number of teaching resources has never been more abundant. The NICAR resources collected on Chrys’s blog are more proof of this.
About a year ago I threw up a long, rambling guide hoping to teach non-programming journalists some practical code. Looking back at it, it seems inadequate. Actually, I misspoke, I haven’t looked back at it because I’m sure I’ll just spend the next few hours cringing. For example, what a dumb idea it was to put everything from “What is HTML” to actual Ruby scraping code all in a gigantic, badly formatted post.
The series of articles has gotten a fair number of hits but I don’t know how many people were able to stumble through it. Though last week I noticed this recent trackback from dataist, a new “blog about data exploration” by Finnish journo Jens Finnäs. He writes that he has “almost no prior programming experience” but, after going through my tutorials and checking out Scraperwiki, was able to produce this cool network graph of the Ratata blog network after about “two days of trial and error”:
I hope other non-coders who are still intimidated by the thought of learning programming are inspired by Finnas’s example. Becoming good at coding is not a trivial task. But even the first steps of it can teach a non-coder some profound lessons about data important enough on their own. And if you’re a curious-type with a question you want to answer, you’ll soon figure out a way to put something together, as in Finnas’s case.
ProPublica’s Dollars for Docs project originated in part from this Pfizer-scraping lesson I added on to my programming tutorial: I needed a timely example of public data that wasn’t as useful as it should be.
My colleagues Charles Ornstein and Tracy Weber may not be programmers (yet), but they are experienced enough with data to know its worth as an investigative resource, and turned an exercise in transparency into a focused and effective investigation. It’s not trivial to find a story in data. Besides being able to do Access queries themselves, C&T knew both the limitations of the data (for example, it’s difficult to make comparisons between the companies because of different reporting periods) and its possibilities, such as the cross-checking of names en masse from the payment lists with state and federal doctor databases.
Their investigation into the poor regulation of California nurses – a collaboration with the LA Times that was a Pulitzer finalist in the Public Service category – was similarly data-oriented. They (and the LA Times’ Maloy Moore and Doug Smith) had been diligently building a database of thousands of nurses – including their disciplinary records and the time it took for the nursing board to act – which made my part in building a site to graphically represent the data extremely simple.
The point of all this is: don’t put off your personal data-training because you think it requires a computer science degree, or that you have to become great at it in order for it to be useful. Even if after a week of learning, you can barely put together a programming script to alphabetize your tweets, you’ll likely gain enough insight into how data is made structured and useful, which will aid in just about every other aspect of your reporting repertoire.
In fact, just knowing to avoid taking notes like this:
Colonel Mustard used the revolver in the library? (not library)
Miss Scarlet used the Candlestick in the dining room? (not Scarlet)
“Mrs. Peacock, in the dining room, with the
“Colonel Mustard, rope,
Mustard? Dining room? Rope (nope)?
“Was it Mrs. Peacock with the
candlestick, inside the dining room?”
And instead, recording them like this:
…will make you a significantly more effective reporter, as well as position you to have your reporting and research become much more ready for thorough analysis and online projects.
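To make the contrast concrete, here is a minimal Python sketch of what structured note-taking could look like. The column names (`suspect`, `weapon`, `room`, `ruled_out`) and the rows themselves are assumptions invented for illustration, loosely based on the scribbled notes above; the point is only that each fact gets its own column, so the notes can be filtered or opened in a spreadsheet.

```python
import csv
import io

# Hypothetical structured notes: one row per theory, one column per fact.
# The column names and values are illustrative assumptions.
FIELDS = ["suspect", "weapon", "room", "ruled_out"]
NOTES = [
    {"suspect": "Colonel Mustard", "weapon": "revolver", "room": "library", "ruled_out": "room"},
    {"suspect": "Miss Scarlet", "weapon": "candlestick", "room": "dining room", "ruled_out": "suspect"},
    {"suspect": "Colonel Mustard", "weapon": "rope", "room": "dining room", "ruled_out": "weapon"},
    {"suspect": "Mrs. Peacock", "weapon": "candlestick", "room": "dining room", "ruled_out": ""},
]


def to_csv(rows):
    """Serialize the notes as CSV text that any spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def open_leads(rows):
    """Return the theories in which nothing has been ruled out yet."""
    return [row for row in rows if not row["ruled_out"]]


if __name__ == "__main__":
    print(to_csv(NOTES))
    for row in open_leads(NOTES):
        print(row["suspect"], "/", row["weapon"], "/", row["room"])
```

Once the notes are in this shape, a question like "which theories are still open?" becomes a one-line filter instead of a re-read of the whole notebook.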
There’s a motherlode of programming resources available through a single Google search. My high school journalism teacher told us that if you want to do journalism, don’t major in it, just do it. I think the same can be said for programming. I’m glad I chose a computer field as an undergraduate so that I’m familiar with the theory. But if you have a career in reporting or research, you have real-world data-needs that most undergrads don’t. I’ve found that having those goals and needing to accomplish them has pushed my coding expertise far quicker than did any coursework.
If you aren’t set on learning to program, but want to get a better grasp of data, I recommend learning:
- Regular expressions – a set of character patterns, easily printable on a cheat-sheet for memorization, that you use in a text-editor’s Find and Replace dialog to turn a chunk of text into something you can put into a spreadsheet, as well as clean up the data entries themselves. Regular-expressions.info is the most complete resource I’ve found. A cheat-sheet can be found here. Wikipedia has a list of some simple use cases.
- Google Refine – A spreadsheet-like program that makes easy the task of cleaning and normalizing messy data. Ever go through campaign contribution records and wish you could easily group together and count as one, all the variations of “Jon J. Doe”, “Jonathan J. Doe”, “Jon Johnson Doe”, “JON J DOE”, etc.? Refine will do that. Refine developer David Huynh has an excellent screencast demonstrating Refine’s power. I wrote a guide as part of the Dollars for Docs tutorials. Even if you know Excel like a pro – which I do not – Refine may make your data-life much more enjoyable.
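For a taste of both ideas in code form, here is a small Python sketch. It assumes a made-up chunk of text (`RAW`) in a made-up "Last, First - $amount - date" layout; a regular expression turns each line into a spreadsheet-ready row, and a crude `name_key` function groups name variants that differ only in case and punctuation. This is not how Refine clusters names; it is only a simplified stand-in to show the general idea.

```python
import re
from collections import defaultdict

# Hypothetical raw text, e.g. pasted from a PDF or web page.
RAW = """Doe, Jon J. - $1,500 - 03/14/2010
DOE, JON J - $250 - 04/02/2010
Smith, Ann - $2,000 - 04/09/2010"""

# One pattern with named groups: last name, first name(s), amount, date.
LINE = re.compile(
    r"^(?P<last>[^,]+),\s*(?P<first>.+?)\s*-\s*"
    r"\$(?P<amount>[\d,]+)\s*-\s*(?P<date>\d{2}/\d{2}/\d{4})$"
)


def parse(text):
    """Turn each matching line into a dict; skip lines that do not match."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            rows.append({
                "name": f"{match['first']} {match['last']}",
                "amount": int(match["amount"].replace(",", "")),
                "date": match["date"],
            })
    return rows


def name_key(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(cleaned.split())


def totals_by_name(rows):
    """Sum amounts for names that share the same normalized key."""
    totals = defaultdict(int)
    for row in rows:
        totals[name_key(row["name"])] += row["amount"]
    return dict(totals)


if __name__ == "__main__":
    for key, total in totals_by_name(parse(RAW)).items():
        print(key, total)
```

Note the limits of the simple key: it treats "Jon J. Doe" and "JON J DOE" as one person, but it would not merge "Jonathan J. Doe" with them. Catching that kind of variation is where a dedicated tool earns its keep.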
If you want to learn coding from the ground up, here’s a short list of places to start:
- The Pragmatic Programmer’s Guide to Programming Ruby – this covers an older version of Ruby, but is still a great comprehensive, browser-friendly book.
- Learn to Program (also in Ruby) by Chris Pine – Written in 2004, this is still an elegant beginner’s guide
- Invent Your Own Computer Games With Python – You may not be interested in writing game software, but the same programming techniques apply in that field as they do anywhere else. This guide covers all the fundamentals and gives you great project examples.
- ScraperWiki has a massive collection of web-scraping scripts for your perusal, and is where the dataist’s Finnäs learned from example. ScraperWiki has a set of python tutorials, too.
- Here’s a giant list of free programming books.
- Visit the learnprogramming subforum in Reddit to find a small, but active community of beginners who aren’t afraid to start the most basic of discussions with the forum’s programming experts. StackOverflow is the single best site for specific questions or problems; often, you can Google your exact problem and a relevant StackOverflow discussion will be at the top.
- And you can always refer back to my four-part programming tutorial from last year, which aims to cover HTML to writing Ruby to scrape websites. I also wrote a series of tutorials (with complete code) on how I collected data for Dollars for Docs, including how to scrape from websites, Flash applications, PDFs, and even image files (the solution is specific to one kind of format, so I will gladly welcome anyone else to generalize it).
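If you are wondering what "scraping" amounts to at its smallest, here is a rough Python sketch using only the standard library. It is not taken from any of the tutorials listed above (those use Ruby); the HTML fragment, the names in it, and the `TableScraper` class are all invented for illustration. A real scraper would first download the page, then do something like this to pull the table cells out into rows.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this first.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Charge</th></tr>
  <tr><td>Doe, Jon</td><td>Trespassing</td></tr>
  <tr><td>Smith, Ann</td><td>Vandalism</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table in the given HTML as a list of rows of cell text."""
    scraper = TableScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.rows


if __name__ == "__main__":
    for row in scrape_table(SAMPLE_HTML):
        print(row)
```

The resulting list of rows can be handed straight to a CSV writer, which is usually the whole goal: getting data off a web page and into a spreadsheet.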
Legendary reporter Wayne Barrett filed his last column for the Village Voice this week. It reads like it’s from someone who has muckraked for nearly 40 years and has had a lot of time to think about his job:
When I was asked in recent years to blog frequently, I wouldn’t do it unless I had something new to tell a reader, not just a clever regurgitation of someone else’s reporting.
My credo has always been that the only reason readers come back to you again and again over decades is because of what you unearth for them, and that the joy of our profession is discovery, not dissertation.
There is also no other job where you get paid to tell the truth. Other professionals do sometimes tell the truth, but it’s ancillary to what they do, not the purpose of their job. I was asked years ago to address the elementary school that my son attended and tell them what a reporter did and I went to the auditorium in a trenchcoat with the collar up and a notebook in my pocket, baring it to announce that “we are detectives for the people.”
…It never mattered to me what the party or ideology was of the subject of an investigative piece; the reporting was as nonpartisan as the wrongdoing itself. I never looked past the wrist of any hand in the public till. It was the grabbing that bothered me, and there was no Democratic or Republican way to pick up the loot.
It’s been a huge last few days for ProPublica. My colleagues Jesse Eisinger and Jake Bernstein unveiled the result of 7+ months of reporting, a much anticipated collaboration with “This American Life” on how the hedge fund Magnetar Capital helped prolong the housing bubble by betting against risky investments that it advocated for. Also, our story on private jet owners hiding in public airspace, uncovered by Michael Grabell (after our lawyers’ successful litigation), was one of our most viewed, thanks to it getting top play by USA Today and Yahoo.
Those both alone would’ve made it one of ProPublica’s most prominent weeks, but then Sheri Fink won the Pulitzer for Investigative Reporting for her massive investigation, published in the NYT magazine, on how a hospital’s doctors, post-Katrina, reportedly put patients to death under the guise of mercy and grace under chaos. Sheri’s win is extremely gratifying, because her subject had a lot of things going against it: Katrina was a four-year-old painful, chaotic memory that most Americans wanted to forget. And for N.O. residents, it seemed that the overwhelming sentiment was for the doctors and other authorities who did what they could. Anna Pou, the doctor at the center of Sheri’s story, had been exonerated (and the prosecutor who went after her was removed). And after Sheri’s story, no new charges have been made against her.
The story itself is a long-read. In addition to the factors above going against it, it also doesn’t deliver an immediate payoff for the ADD-afflicted reader. It’s not until the end that you can appreciate the light that Sheri shed on a universally important, yet opaque topic: who deserves life in a time of crisis? I think Sheri’s story, and subsequent follow-ups related to swine flu preparations, raised the alarm that not even our medical professionals are on the same page, and moved the ball in such a way that her findings would shock even the most cynical skeptics of the medical profession.
Also, congrats to my colleagues Charles Ornstein and Tracy Weber for being finalists in the Public Service category for their exposure of California’s broken nursing board. For them to even be considered for that prize, considering they won it recently before in the same area (lax oversight of medical care) is a testament to how thorough their work was again, and how much impact their stories had (Gov. Schwarzenegger immediately sacked or forced out a majority of the board afterwards).
I think our office felt confident our work was as good as any Pulitzer contender and it wouldn’t be a shock to win, even though we would be the first online-only organization (and possibly the youngest, at two years old) to do it. The drama was less about whether we would win than about which one of our reporters would win. For example, T. Christian Miller and his work on defense contractors was, in my mind, as deserving as any. Like Sheri, he shed light, in an exhaustive, dogged fashion, on a subject that most people would rather not care about: the treatment of civilians who are injured in warzones while working as contractors. With the bad rep of Blackwater, it’s proof of T’s herculean reporting and writing efforts that he got lawmakers to make some real moves into an easily overlooked (for political reasons) but essential area of our national security (in terms of prizes though, T already brought home the Selden Ring).
And of course, all those stories above would’ve had a harder hill to climb without the collaboration of all our great editors and research staff. And in my own department, Krista Kjellman and Jeff Larson put in just as much dedication and deliberation to further illuminate the stories in their online presentation (and in the process, often provided research and work important to the stories themselves).
Congrats to the other Pulitzer winners. I haven’t had time to look through all their work. I did put WaPo’s Gene Weingarten’s winning feature on the hellish punishment of parents who left their children to die in overheated cars on my iPad’s Instapaper. I got about a fourth of the way through before I had to put it away so I wouldn’t be crying in the subway car.
Update, January 2012: Everything…yes, everything, is superseded by my free online book, The Bastards Book of Ruby, which is a much more complete walkthrough of basic programming principles with far more practical and up-to-date examples and projects than what you’ll find here.
I’m only keeping this old walkthrough up as a historical reference. I’m sure the code is so ugly that I’m not going to even try re-reading it.
So check it out: The Bastards Book of Ruby
Update, Dec. 30, 2010: I published a series of data collection and cleaning guides for ProPublica, to describe what I did for our Dollars for Docs project. There is a guide for Pfizer which supersedes the one I originally posted here.
So a little while ago, I set out to write some tutorials that would guide the non-coding-but-computer-savvy journalist through enough programming fundamentals so that he/she could write a web scraper to collect data from public websites. A “little while” turned out to be more than a month-and-a-half. I actually wrote most of it in a week and then forgot about it. The timeliness of the fourth lesson, which shows how to help Pfizer in its mission to be more transparent, compelled me to just publish them in incomplete form. There are probably inconsistencies in the writing and some of the code examples, but the final code sections at the end of each tutorial do seem to execute as expected.
As the tutorials are aimed at people who aren’t experienced at programming, the code is pretty verbose, pedantic, and in some cases, a little inefficient. It was my attempt to think how to make the code most readable, and I very much welcome edits and corrections.
DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.
Tutorial 1: Go from knowing nothing to scraping Web pages. In an hour. Hopefully – A massive, sprawling tutorial that attempts to take you from learning what HTML is, to the definition of an “if
Tutorial 2: Scraping a County Jail Website to Find Out Who’s in Jail – This uses all the concepts from the first tutorial and applies them to something that a cops reporter might actually want to try out.
Tutorial 3: Who’s Been in Jail Before: Cross-checking the jail logs with the court system with Ruby’s Mechanize – This lesson introduces you to another Ruby library that allows you to automate the filling-out of forms so that you can access online databases, in this case, California criminal case histories to see if current inmates are repeat-alleged-offenders.
Tutorial 4: Improving Pfizer’s Dollars-to-Doctors Pay List – Last week, Pfizer released a list of nearly 5,000 doctors and medical institutions to which it made $35 million in consulting and expense payments. Fun. Unfortunately, the list, as it initially existed online, is just about useless to anyone wanting to examine trends. This tutorial provides a script to make the list more interesting to journalists.