dataist blog: An inspiring case for journalists learning to code

About a year ago I threw up a long, rambling guide hoping to teach non-programming journalists some practical code. Looking back at it, it seems inadequate. Actually, I misspoke, I haven’t looked back at it because I’m sure I’ll just spend the next few hours cringing. For example, what a dumb idea it was to put everything from “What is HTML” to actual Ruby scraping code all in a gigantic, badly formatted post.

The series of articles has gotten a fair number of hits, but I don’t know how many people were able to stumble through it. Last week, though, I noticed a recent trackback from dataist, a new “blog about data exploration” by Finnish journo Jens Finnäs. He writes that he has “almost no prior programming experience” but, after going through my tutorials and checking out Scraperwiki, was able to produce this cool network graph of the Ratata blog network after about “two days of trial and error”:

Mapping of Ratata blogging network by Jens Finnäs of dataist.wordpress.com

I hope other non-coders who are still intimidated by the thought of learning programming are inspired by Finnäs’s example. Becoming good at coding is not a trivial task. But even the first steps can teach a non-coder profound lessons about data, lessons that are valuable enough on their own. And if you’re a curious type with a question you want to answer, you’ll soon figure out a way to put something together, as Finnäs did.

ProPublica’s Dollars for Docs project originated in part from this Pfizer-scraping lesson I added on to my programming tutorial: I needed a timely example of public data that wasn’t as useful as it should be.

My colleagues Charles Ornstein and Tracy Weber may not be programmers (yet), but they are experienced enough with data to know its worth as an investigative resource, and turned an exercise in transparency into a focused and effective investigation. It’s not trivial to find a story in data. Besides being able to do Access queries themselves, C&T knew both the limitations of the data (for example, it’s difficult to make comparisons between the companies because of different reporting periods) and its possibilities, such as the cross-checking of names en masse from the payment lists with state and federal doctor databases.

Their investigation into the poor regulation of California nurses – a collaboration with the LA Times that was a Pulitzer finalist in the Public Service category – was similarly data-oriented. They (and the LA Times’ Maloy Moore and Doug Smith) had been diligently building a database of thousands of nurses – including their disciplinary records and the time it took for the nursing board to act – which made my part in building a site to graphically represent the data extremely simple.

The point of all this is: don’t put off your personal data-training because you think it requires a computer science degree, or that you have to become great at it in order for it to be useful. Even if, after a week of learning, you can barely put together a script to alphabetize your tweets, you’ll likely gain enough insight into how data is structured and made useful to aid just about every other aspect of your reporting repertoire.

In fact, just knowing to avoid taking notes like this:

Colonel Mustard used the revolver in the library? (not library)
Miss Scarlet used the Candlestick in the dining room? (not Scarlet)
“Mrs. Peacock, in the dining room, with the revolver? “
“Colonel Mustard, rope, conservatory?”
Mustard? Dining room? Rope (nope)?
“Was it Mrs. Peacock with the candlestick, inside the dining room?”

And instead, recording them like this:

Who/What? | Role? | Ruled out?
Dining Room | Place | N

…will make you a significantly more effective reporter, as well as position you to have your reporting and research become much more ready for thorough analysis and online projects.
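As a minimal sketch of why the second format pays off, here is the same idea in Python. The column names (`who_what`, `role`, `ruled_out`) and every row except the Dining Room one are hypothetical, loosely drawn from the scribbled notes above. Once each note is a row, a question like "which places are still in play?" becomes a one-line filter:

```python
import csv
import io

# Hypothetical notes, one fact per row, using the same three columns as the
# table above. Apart from the Dining Room row, the rows are illustrative.
NOTES_CSV = """who_what,role,ruled_out
Dining Room,Place,N
Library,Place,Y
Miss Scarlet,Suspect,Y
Rope,Weapon,Y
"""


def load_notes(text):
    """Parse CSV text into a list of dicts, one per note."""
    return list(csv.DictReader(io.StringIO(text)))


def still_possible(notes, role):
    """Return the names with the given role that have not been ruled out."""
    return [n["who_what"] for n in notes
            if n["role"] == role and n["ruled_out"] == "N"]


notes = load_notes(NOTES_CSV)
print(still_possible(notes, "Place"))  # ['Dining Room']
```

The same rows could just as easily live in a spreadsheet; the point is the one-fact-per-row shape, not the tool.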

There’s a motherlode of programming resources available through a single Google search. My high school journalism teacher told us that if you want to do journalism, don’t major in it, just do it. I think the same can be said for programming. I’m glad I chose a computer field as an undergraduate so that I’m familiar with the theory. But if you have a career in reporting or research, you have real-world data-needs that most undergrads don’t. I’ve found that having those goals and needing to accomplish them has pushed my coding expertise along far more quickly than any coursework did.

If you aren’t set on learning to program, but want to get a better grasp of data, I recommend learning:

  • Regular expressions – a set of character patterns, easily printable on a cheat-sheet for memorization, that you use in a text-editor’s Find and Replace dialog to turn a chunk of text into something you can put into a spreadsheet, as well as clean up the data entries themselves. Regular-expressions.info is the most complete resource I’ve found. A cheat-sheet can be found here. Wikipedia has a list of some simple use cases.
  • Google Refine – A spreadsheet-like program that makes easy the task of cleaning and normalizing messy data. Ever go through campaign contribution records and wish you could easily group together and count as one, all the variations of “Jon J. Doe”, “Jonathan J. Doe”, “Jon Johnson Doe”, “JON J DOE”, etc.? Refine will do that. Refine developer David Huynh has an excellent screencast demonstrating Refine’s power. I wrote a guide as part of the Dollars for Docs tutorials. Even if you know Excel like a pro – which I do not – Refine may make your data-life much more enjoyable.
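To make both ideas concrete, here is a rough sketch in Python. The names come from the example above, while the input lines and dollar amounts are invented for illustration. The regular expression plays the role of the Find and Replace pattern. The `key` function is a crude normalization in the spirit of Refine's clustering: it is my own simplification, not Refine's actual code.

```python
import re
from collections import defaultdict

# Hypothetical free-text lines; the dollar amounts are made up.
RAW = """Jon J. Doe gave $250
JON J DOE gave $1,000
Jonathan J. Doe gave $500
"""

# Capture the name and the amount so each line can become a spreadsheet row.
LINE = re.compile(r"^(?P<name>.+?) gave \$(?P<amount>[\d,]+)$", re.MULTILINE)


def to_rows(text):
    """Turn free text into (name, amount) rows."""
    return [(m["name"], int(m["amount"].replace(",", "")))
            for m in LINE.finditer(text)]


def key(name):
    """Crude grouping key: lowercase, drop punctuation, sort the words."""
    words = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(sorted(words))


def group_totals(rows):
    """Sum amounts for all names that share the same key."""
    totals = defaultdict(int)
    for name, amount in rows:
        totals[key(name)] += amount
    return dict(totals)


rows = to_rows(RAW)
print(rows)
print(group_totals(rows))
```

Note that this simple key merges “Jon J. Doe” with “JON J DOE” but still treats “Jonathan J. Doe” as a separate person. Catching that kind of variation takes fuzzier matching and a human decision about whether the two really are the same donor, which is where a tool like Refine earns its keep.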

If you want to learn coding from the ground up, here’s a short list of places to start:

ProPublica Investigates Dialysis: For-Profit Providers Flourish as Care Quality Flounders

A great in-depth look by my ProPublica colleague Robin Fields into the dialysis industry. It was a field I was barely familiar with, as only about 400,000 Americans are on dialysis, but the entitlement has grown from $135 million to $20 billion annually, with mixed and depressing results.

I took this photo of a woman whose mother nearly bled to death after being improperly hooked up to a dialysis machine.

Cathleen Sharkey holds a frame of photographs of her mother, Barbara Scott, whose bloodline became disconnected during a dialysis treatment at Dutchess Dialysis Center. Scott never fully recovered and died shortly after of heart failure. (Dan Nguyen/ProPublica)

The Big Pharma-Dollars-for-Doctors Database, at ProPublica

Haven’t had much time to blog, or eat, or sleep in the past few months because of this project, but the first part just rolled out today (at about 2am, actually): at ProPublica, my colleagues and I collected the past two years of reports (albeit from just 7 companies) disclosing what they pay doctors to speak on their behalf. I still have a few posts and articles to write about the undertaking and its background, but it’s the first time that someone has compiled all these reports and made them available to the public, something that will be mandated by law in 2013.

Our first investigation related to the data looked at how some of the companies’ top earners, who are ostensibly supposed to be experts in their field, had either shady or slim expertise. I did most of the datawork, including collecting the data and managing it, polling the various state websites to look up physician disciplinary records, and designing and coding (with the help of my genius coder co-worker Jeff Larson) the website. Whew!

Check it out.

“Letting Go” – The New Yorker’s Atul Gawande, on giving up life to live

Trinity Church Cemetery

Dr. Atul Gawande’s latest New Yorker piece is described as another examination of what’s behind the cost of health care, but it serves more as a lesson on how both to cope with the finality of death and to appreciate life.

It took me several tries to get through it, and luckily I wore my sunglasses in the subway so I wouldn’t look like some snob getting teary-eyed over his iPad.

The opening (and, ultimately, closing) anecdote is about an ill-fated patient of Dr. Gawande’s, a Sara Thomas Monopoli, who discovers she has incurable cancer 39 weeks into her first pregnancy. Dr. Gawande describes Monopoli’s long struggle to stay alive, taking a series of experimental drugs with harsh side effects; at one point, she hides the fact that she’s lost feeling in her hands and had double vision for two months, for fear her treatment would be stopped.

Gawande’s sprawling piece ends up being kind of a travelogue of his journey toward accepting death for his patients. He had believed, as do most people, that hospice care is meant to hasten death, even though at least one survey of terminal cancer patients found that those who elected for intensive care survived no longer than those who entered hospice care.

As Gawande puts it:

Curiously, hospice care seemed to extend survival for some patients; those with pancreatic cancer gained an average of three weeks, those with lung cancer gained six weeks, and those with congestive heart failure gained three months. The lesson seems almost Zen: you live longer only when you stop trying to live longer.

Gawande relates this to the current health care crisis by pointing out a 2004 Aetna study in which policyholders expected to die within a year could choose hospice services without giving up any of their other treatments. The hospice care was so appealing, apparently, that these patients spent far less time in hospitals and ICUs, even though they didn’t have to give up any options. Costs fell by nearly 25%.

The benefits of accepting fate are not just monetary. Gawande writes that by many objective metrics, patients who seriously discussed end-of-life care ended up suffering less:

Two-thirds of the terminal-cancer patients in the Coping with Cancer study reported having had no discussion with their doctors about their goals for end-of-life care, despite being, on average, just four months from death. But the third who did were far less likely to undergo cardiopulmonary resuscitation or be put on a ventilator or end up in an intensive-care unit. Two-thirds enrolled in hospice. These patients suffered less, were physically more capable, and were better able, for a longer period, to interact with others. Moreover, six months after the patients died their family members were much less likely to experience persistent major depression.

In other words, people who had substantive discussions with their doctor about their end-of-life preferences were far more likely to die at peace and in control of their situation, and to spare their family anguish.

I can’t think of many other journalists whom I respect more than Dr. Gawande. Besides having incredible eloquence as a writer, he’s a respected professional in the field he covers. His book “Complications” nearly made me quit journalism to try med school – it was that fascinating a look into how terrifying, yet intellectually challenging, it would be to be an ER surgeon – until I realized it would be a long, uphill slog for someone who never took a college-level biology class.

Atul Gawande

Gawande has had at least two other notable pieces. One, related to the subject of “Complications,” was how a checklist consisting of steps as simple as reminding doctors to wash their hands was saving a staggering number of patients from post-surgical infections. And the second, about how one town in Texas managed to have the highest, by far, health care costs per capita. The article reportedly caught President Obama’s eye during the health care overhaul.

Both are instructive essays on the complexity of health care. “Letting Go” is less so, perhaps because there are no cost-benefit studies that would convince either death-panel-fearing Tea Partiers or insurance-company demonizers that the health care system would ever be right to compel a patient to give up treatment.

But as a collection of tragic anecdotes, “Letting Go” really shook me and at least made me remember to appreciate what’s good in life. Maybe that’s Gawande’s ulterior strategy all along, to convince the reader to place enjoying life over prolonging it, and by doing so, maybe, get both.

Another New Yorker piece (h/t longform.org), written in 2001 by Gary Greenberg, also examines the moving line between life and death, and in particular, how the drawing of that line has been influenced by the demand for organs. With the concept of “brain death”, organs can be retrieved in a more viable state, as opposed to waiting until the heart stops beating. But doctors and ethicists (I assume today, as well as in 2001) are still arguing about the different kinds of brain death, and even those who accept it still have to train themselves to think of a warm, breathing body as “dead”:

“It took us years to get the public to understand what brain death was,” [Howard M. Nathan, who heads an organ-procurement group] said. “We had to train people in how to talk about it. Not that they’re brain dead, but they’re dead: ‘What you see is the machine artificially keeping the body alive . . .’ “ He stopped and pointed to my notebook. “No, don’t even use that. Say ‘keeping the organs functioning.’ “

And if you’ve got even more time to spend reading life-or-death longform pieces, I’ll pitch this 13,000-word Pulitzer-winning piece by my ProPublica colleague, Sheri Fink. The subject is how doctors at a New Orleans hospital may have been too quick to euthanize a group of patients while desperately waiting for rescue after Katrina.

The overarching theme, as it is in the New Yorker articles mentioned here, is what makes a life worth living/saving, and can doctors make that decision when patients can’t?


“Is she dying?” one of the sisters asked me. I didn’t know how to answer the question. I wasn’t even sure what the word “dying” meant anymore. In the past few decades, medical science has rendered obsolete centuries of experience, tradition, and language about our mortality, and created a new difficulty for mankind: how to die.


He wanted to show that the higher-brain rationale, which holds that living without consciousness is not really living—and which the President’s commission rejected because it raised questions about quality of life which science can never settle—was the sub-rosa justification for deciding to call a brain-dead person dead. He wanted to make it clear that these doctors were not making a straightforward medical judgment but, rather, a moral judgment that people like Matthew were so devastated that they had lost their claim on existence.


According to Memorial workers on the second floor, about a dozen patients who were designated as “3’s” [a triage category for patients thought to be close to death] remained in the lobby by the A.T.M. Other Memorial patients were being evacuated with help from volunteers and medical staff, including Bryant King. Around noon, King told me, he saw Anna Pou holding a handful of syringes and telling a patient near the A.T.M., “I’m going to give you something to make you feel better.” King remembered an earlier conversation with a colleague who, after speaking with Mulderick and Pou, asked him what he thought of hastening patients’ deaths. That was not a doctor’s job, he replied. Patients were hot and uncomfortable, and a few might be terminally ill, but he didn’t think they were in the kind of pain that calls for sedation, let alone mercy killing. When he saw Pou with the syringes, he assumed she was doing just that and said to anyone within earshot: “I’m getting out of here. This is crazy!” King grabbed his bag and stormed downstairs to get on a boat.

Been away for the redesign

Convincing the editors to use this as the lede image for the redesign was my most visible contribution

Been stuck at the office for the past couple of weeks, but it’s been worth it. Helped with ProPublica’s redesign, which we did in conjunction with Mule Design Studio. Our journalism has always been top notch, but the site didn’t quite look the part. Now it does. As my boss writes, “Our goal is simple: For the design of our site to match the sophistication of our reporting.”

Pulitzer Prize at ProPublica

It’s been a huge last few days for ProPublica. My colleagues Jesse Eisinger and Jake Bernstein unveiled the result of 7+ months of reporting, a much anticipated collaboration with “This American Life” on how the hedge fund Magnetar Capital helped prolong the housing bubble by betting against risky investments that it advocated for. Also, our story on private jet owners hiding in public airspace, uncovered by Michael Grabell (after our lawyers’ successful litigation), was one of our most viewed, thanks to it getting top play by USA Today and Yahoo.

Those two alone would’ve made it one of ProPublica’s most prominent weeks, but then Sheri Fink won the Pulitzer for Investigative Reporting for her massive investigation, published in the NYT magazine, on how a hospital’s doctors, post-Katrina, reportedly put patients to death under the guise of mercy and grace under chaos. Sheri’s win is extremely gratifying, because her subject had a lot of things going against it: Katrina was a four-year-old painful, chaotic memory that most Americans wanted to forget. And among N.O. residents, the overwhelming sentiment seemed to be in favor of the doctors and other authorities who did what they could. Anna Pou, the doctor at the center of Sheri’s story, had been exonerated (and the prosecutor who went after her was removed). And after Sheri’s story, no new charges have been made against her.

The story itself is a long read. In addition to the factors above going against it, it also doesn’t deliver an immediate payoff for the ADD-afflicted reader. It’s not until the end that you can appreciate the light that Sheri shed on a universally important, yet opaque topic: who deserves life in a time of crisis? I think Sheri’s story, and subsequent follow-ups related to swine flu preparations, raised the alarm that not even our medical professionals are on the same page, and moved the ball in such a way that her findings would shock even the most cynical skeptics of the medical profession.

Also, congrats to my colleagues Charles Ornstein and Tracy Weber for being finalists in the Public Service category for their exposure of California’s broken nursing board. For them to even be considered for that prize, considering they won it recently before in the same area (lax oversight of medical care) is a testament to how thorough their work was again, and how much impact their stories had (Gov. Schwarzenegger immediately sacked or forced out a majority of the board afterwards).

I think our office felt confident our work was as good as any Pulitzer contender and it wouldn’t be a shock to win, even though we would be the first online-only organization (and possibly the youngest, at two years old) to do it. The drama was less about whether we would win than about which one of our reporters would win. For example, T. Christian Miller and his work on defense contractors was, in my mind, as deserving as any. Like Sheri, he shed light, in an exhaustive, dogged fashion, on a subject that most people would rather not care about: the treatment of civilians who are injured in warzones while working as contractors. With the bad rep of Blackwater, it’s proof of T’s herculean reporting and writing efforts that he got lawmakers to make some real moves into an easily overlooked (for political reasons) but essential area of our national security (in terms of prizes though, T already brought home the Selden Ring).

And of course, all those stories above would’ve had a harder hill to climb without the collaboration of all our great editors and research staff. And in my own department, Krista Kjellman and Jeff Larson put in just as much dedication and deliberation to further illuminate the stories in their online presentation (and in the process, often provided research and work important to the stories themselves).

Congrats to the other Pulitzer winners. I haven’t had time to look through all their work. I did put WaPo’s Gene Weingarten’s winning feature on the hellish punishment of parents who left their children to die in overheated cars on my iPad’s Instapaper. I got about a quarter of the way through before I had to put it away so I wouldn’t be crying in the subway car.

NYT: How Unemployment Taxes are Collected; Also, Watch Your Texting Habit

A pretty interesting piece in Jay Goltz’s “You’re the Boss” blog on how unemployment tax is paid for (in Illinois, an employer can pay up to $1.48 per dollar that a former employee collects in unemployment benefits). Goltz argues that this creates a disincentive for employers to hire, knowing that a prospective employee who turns out to be a failure will cost the company in time lost and extra unemployment tax.

Speaking of which, there’s this amusing nugget about a negligent employee who almost cost Goltz’s company that incremental tax, despite “working” for only 21 days:

I have recently learned that you can be charged with a claim even if you’ve employed someone for less than 30 days. We fired someone after three weeks because she was text-messaging her friends all day. After we told her twice that she had to work during the day and stop texting, she put her phone away. We then noticed she was leaving her desk drawer open and looking into it a lot. She was now texting out of the drawer.

Now is a good time to refer to my colleagues Jeff Larson and Olga Pierce’s fantastic work in documenting the crisis in states’ unemployment insurance funds. Jeff devised a pretty smart way to scrape the information, and he and Olga came up with a formula to accurately predict whether states’ funds were in the red (see their nerdy formula page here).

Incidentally, Illinois, Goltz’s state of business (he owns five small Chicago businesses), is in the shitter for its unemployment funds, so to speak, according to ProPublica’s Unemployment Insurance Tracker.

ProPublica tracks the bailout, a year or so later

Today, my ProPublica colleague Paul Kiel and I put out some graphical revisions to PP’s bank bailout tracking site, including our master list of companies to get taxpayer bailout money:

Graphic: The Status of the Bailout

Bailout List Page

Nothing fancy; we mostly made the numbers easier to find and compare. The site itself has been far from fancy since its inception, since it was my first project after taking a crash course on Ruby on Rails. Back when the bailout was first announced in Q4 2008, the Treasury declined to name the banks it was doling taxpayer money to, for fear that non-listed banks would take a hit in reputation. Paul was one of the first few people to comb through banks’ press releases and enter them into a spreadsheet. His list of the first 26 – put into a simple html table – was a pretty big hit.

As the list grew into the dozens and hundreds, it became more cumbersome to maintain the static list, which was nothing more than the bank’s name, date of announcement, and amount of bailout. Plus, it was no longer just one bailout per company; Citigroup and Bank of America were beneficiaries of billions of dollars through a couple other programs.

So, I proposed a bailout site that would allow Paul to record the data at a more granular level. Up to that point, for example, most online lists showed that AIG had several dozen billion dollars committed to it, but not the various programs, reasons, and dates on which those allocations were made. A little anal maybe, but it gave the site the flexibility to adapt when the bailout grew to include all varieties of disbursements, including to auto parts manufacturers and mortgage servicers, as well as the money flow coming in the opposite direction, in the form of refunds and dividends.
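For the curious, here is a rough sketch of what recording each money movement separately might look like. It is written in Python for illustration, not taken from the actual Rails site, and the class name, field names, company, programs, dates, and amounts are all hypothetical. Each row is one transaction, so totals per company, per program, or per direction can all be derived from the same list:

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Transaction:
    """One movement of money, instead of one lump sum per company."""
    company: str
    program: str
    on: date
    amount: int  # whole dollars, always positive
    kind: str    # "disbursement", "refund", or "dividend"


# Hypothetical ledger: made-up company, programs, dates, and amounts.
LEDGER = [
    Transaction("Example Bank A", "Program X", date(2008, 10, 28), 100, "disbursement"),
    Transaction("Example Bank A", "Program Y", date(2009, 1, 15), 50, "disbursement"),
    Transaction("Example Bank A", "Program X", date(2009, 5, 15), 5, "dividend"),
    Transaction("Example Bank A", "Program X", date(2009, 6, 17), 100, "refund"),
]


def summarize(transactions):
    """Total each kind of money flow per company."""
    summary = defaultdict(lambda: defaultdict(int))
    for t in transactions:
        summary[t.company][t.kind] += t.amount
    return {company: dict(kinds) for company, kinds in summary.items()}


print(summarize(LEDGER))
```

A static list with one amount per bank can’t answer “how much came back?” without being restructured; a list of transactions can, which is the flexibility described above.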

I saw the site as more of a place for Paul to base his bailout coverage on (he’s been doing an excellent job covering the progress of the mortgage modification program), as I assumed that in the near future, Treasury would have its own, easy-to-use site for the data. Unfortunately, that is not quite the case, nearly a year and a half later. Besides some questionable UI decisions (such as having the front-facing page consist of a Flash map), the data is not put forth in an easily accessible form. It could be that I need to take an Excel refresher course here, but trying to sort the columns in these Excel spreadsheets just to find the biggest bailout amount, for example, throws an error.

Only in the past couple of months did Treasury finally release dividends in non-pdf form, and even then, it’s still a pain to work with (there’s no way, for example, to link the bank names in the dividends sheet to the master spreadsheet of bailouts). I would’ve thought that’d be the set of bailout data Treasury would be most eager to send out, because it’s the taxpayers’ return on investment. But, as it turns out, there is a half-empty perspective from this data (such as banks not having enough reserves to pay dividends in a timely fashion), one that would’ve been immediately obvious if the data were in a more sortable form.

ProPublica’s bailout tracking site doesn’t have much data other than the official Treasury bailout numbers; there are all kinds of other unofficial numbers, such as how much each bank is giving out in bonuses, that people are more interested in. American University has gathered all kinds of financial health indicators for each bailout bank, too. There’s definitely much more data that PP and other bailout trackers need to collect to provide a bigger picture of the bailout situation. But for now, I guess it’s a small victory to be one of the top starting points for finding out exactly where hundreds of billions of our tax dollars went. And the why, too; Paul’s done a great job writing translations of the Treasury’s official-speak on each program.

ProPublica’s Eye on the Bailout

Track the Hydraulic Frack: ProPublica mini-site on oil/gas wells per state, and the few staff that regulate them

My colleague Jeff Larson made this very cool site that shows how many more gas/oil wells there are per state since 2003, and the relatively small change in the number of staff to inspect them. All done with Flot, the jQuery plotting library.

frack track for Texas

Related story by Abrahm Lustgarten. Abrahm has pretty much been the journalist at the forefront of covering the important, yet under-the-radar issue of whether the drive for natural gas will threaten our water supplies. Essentially, the technique for drilling – hydraulic fracturing – involves injecting millions of gallons of chemically tainted water to crack open the ground to allow the gas to escape. Yet the process is exempted from the Clean Water Act. And there are currently no realistic ways to treat the billions of gallons of wastewater this drilling is expected to produce.

hydrofracking graphic

ProPublica’s complete coverage here.