Tag Archives: journalism

Coding for Journalists 104: Pfizer’s Doctor Payments; Making a Better List

Update (12/30): So about an eon later, I’ve updated this by writing a guide for ProPublica. Heed that one. This one will remain in its obsolete state.

Update (4/28): Replaced the code and result files. Still haven’t written out a thorough explainer of what’s going on here.

Update (4/19): After revisiting this script, I see that it fails to capture some of the payments to doctors associated with entities. I’m going to rework this script and post an update soon.

So the world’s largest drug maker, Pfizer, decided to tell everyone which doctors it paid to speak and consult on its behalf in the latter half of 2009. These doctors are the same ones who, from time to time, recommend the use of Pfizer products.

From the NYT:

Pfizer, the world’s largest drug maker, said Wednesday that it paid about $20 million to 4,500 doctors and other medical professionals for consulting and speaking on its behalf in the last six months of 2009, its first public accounting of payments to the people who decide which drugs to recommend. Pfizer also paid $15.3 million to 250 academic medical centers and other research groups for clinical trials in the same period.

A spokeswoman for Pfizer, Kristen E. Neese, said most of the disclosures were required by an integrity agreement that the company signed in August to settle a federal investigation into the illegal promotion of drugs for off-label uses.

So, not an entirely altruistic release of information. But it’s out there nonetheless. You can view their list here. Jump to my results here.

Not bad at first glance. However, on further examination, it’s clear that the list is nearly useless unless you intend to click through all 480 pages manually, or you have a particular doctor in mind and care only about that one doctor’s relationship with Pfizer. As a journalist, you probably have other questions. Such as:

  • Which doctor received the most?
  • What was the largest kind of expenditure?
  • Were there any unusually large single-item payments?

None of these questions are answerable unless you have the list in a spreadsheet. As I mentioned in earlier lessons…there are cases when the information is freely available, but the provider hasn’t made it easy to analyze. Technically, they are fulfilling their requirement to be “transparent.”
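
For a rough idea of what having the list in a spreadsheet buys you, here’s a minimal Ruby sketch that answers those three questions from CSV data. The rows are made up, and the column names (doctor, category, amount) are my assumptions, not Pfizer’s actual headings.

```ruby
require 'csv'

# Made-up sample rows for illustration; the column names are assumptions,
# not the headings Pfizer actually uses.
SAMPLE = <<~CSV
  doctor,category,amount
  Dr. A,Speaking,1500
  Dr. B,Consulting,4000
  Dr. A,Consulting,3000
  Dr. C,Meals,120
CSV

# Turn each CSV row into a plain Hash and make the amounts numeric.
rows = CSV.parse(SAMPLE, headers: true).map(&:to_h)
rows.each { |row| row['amount'] = row['amount'].to_i }

# Sum the amounts for each distinct value in the given column.
def totals_by(rows, column)
  rows.group_by { |row| row[column] }
      .transform_values { |group| group.sum { |row| row['amount'] } }
end

top_doctor   = totals_by(rows, 'doctor').max_by { |_, total| total }   # which doctor received the most?
top_category = totals_by(rows, 'category').max_by { |_, total| total } # largest kind of expenditure?
largest_item = rows.max_by { |row| row['amount'] }                     # largest single payment?

puts "Top doctor: #{top_doctor.inspect}"
puts "Top category: #{top_category.inspect}"
puts "Largest single item: #{largest_item.inspect}"
```

With the real list saved as a file, you would presumably swap the sample text for CSV.read and adjust the column names to match.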

I’ll give them the benefit of the doubt that they truly want this list to be as accessible and visible as possible…I tried emailing them to ask for the list as a single spreadsheet, but the email function was broken. So, let’s just write some code to save them some work and to get our answers a little quicker.
Continue reading

Coding for Journalists 103: Who’s been in jail before: Cross-checking the jail log with the court system; Use Ruby’s mechanize to fill out a form

This is part of a four-part series on web-scraping for journalists. As of Apr. 5, 2010, it was published a bit incomplete because I wanted to post a timely solution to the recent Pfizer doctor payments list release, but the code at the bottom of each tutorial should execute properly. The code examples are meant for reference and I make no claims to the accuracy of the results. Contact dan@danwin.com if you have any questions, or leave a comment below.

DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.

In particular, with lesson 3, I skipped basically any explanation of the code. I hope to get around to it later.

Going to Court

In the last lesson, we learned how to write a script that would record who was in jail at a given hour. This could yield some interesting stories for a crime reporter, including spates of arrests for notable crimes and inmates who are held with $1,000,000 bail for relatively minor crimes. However, an even more interesting angle would be to check the inmates’ prior records, to get a glimpse of the recidivism rate, for example.

Sacramento Superior Court allows users to search by not just names, but by the unique ID number given to inmates by Sacramento-area jurisdictions. This makes it pretty easy to link current inmates to court records.

However, the techniques we used in past lessons to automate the data collection won’t work here. As you can see in the above picture, you have to fill out a form. That’s not something any of the code we’ve written previously will do. Luckily, that’s where Ruby’s mechanize comes in.
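
To give a sense of what that looks like, here’s a minimal sketch of filling out and submitting a form with mechanize. It assumes the mechanize gem is installed, and the URL and field name in the usage line are hypothetical placeholders, not the court site’s real values. It is not the script from the full lesson.

```ruby
# Copy each name => value pair into a form object.
# A Mechanize form accepts form['field_name'] = value, and so does a plain
# Hash, which makes this helper easy to try without touching the network.
def fill_fields(form, values)
  values.each { |name, value| form[name] = value }
  form
end

# Fetch the search page, fill in the first form on it, and submit it.
# search_url and field_name are placeholders you would replace after
# inspecting the real page.
def lookup_by_inmate_id(search_url, field_name, inmate_id)
  require 'mechanize'  # loaded here so the helper above works without the gem
  agent = Mechanize.new
  page  = agent.get(search_url)
  form  = fill_fields(page.forms.first, field_name => inmate_id)
  form.submit          # returns the page that comes back from the search
end

# Hypothetical usage (needs a network connection and the real form details):
# results = lookup_by_inmate_id('http://example.com/search', 'inmate_id', '1234567')
# puts results.body
```

The point of splitting out fill_fields is just that the form-filling step can be checked on its own; the rest depends on the live site.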

Continue reading

Coding for Journalists 101: Go from knowing nothing to scraping Web pages. In an hour. Hopefully.

UPDATE (12/1/2011): Ever since writing this guide, I’ve wanted to put together a site that is focused both on teaching the basics of programming and showing examples of practical code. I finally got around to making it: The Bastards Book of Ruby.

I’ve since learned that trying to teach the fundamentals of programming in one blog post is completely dumb. Also, I hope I’m a better coder now than I was a year and a half ago when I first wrote this guide. Check it out and let me know what you think:


Someone asked in this online chat for journalists: I want to program/code, but where does a non-programmer journalist begin?

My colleague Jeff Larson gave what I believe is the most practical and professionally useful answer: web-scraping (jump to my summary of web-scraping here, or read this more authoritative source).

This is my attempt to walk someone through the most basic computer science theory so that he/she can begin collecting data in an automated way off of web pages, which I think is one of the most useful (and time-saving) tools available to today’s journalist. And thanks to the countless hours of work by generous coders, the tools are already there to make this within the grasp of a beginning programmer.

You just have to know where the tools are and how to pick them up.

Click here for this page’s table of contents. Or jump to the theory lesson. Or to the programming exercise. Or, if you already know what functions and variables are, and have Ruby installed, go straight to two of my walkthroughs of building a real-world, journalistic-minded web scraper: scraping a jail site, and scraping Pfizer’s doctor payment list.

Or, read on for some more exposition:

Continue reading

Day of the Tiger: How Newspapers, Networks, and News Aggregators Played Tiger Woods on Friday

On Friday, golfer Tiger Woods made a TV appearance to talk about life after marital problems. At around 2:30 p.m., I took screen captures of the websites of some of the largest news organizations and aggregators. Today, I looked at the screen-caps, cropped them to the top 1600 pixels, and marked in green the areas of the pages devoted to Woods coverage (or related coverage, such as “Slideshow: Top 10 Adultery Confessions”).

Continue reading

200 Jobs rated for 2010, by CareerCast.com. Actuary #1, Software Engineer #2, Philosopher #11, Newspaper Reporter #184

CareerCast.com released a list of 200 jobs ranked by such factors as stress level, pay, work environment, and hiring outlook. Read their methodology here. The WSJ made it into a sortable multipage list but I took the liberty of making a single-page version with bar graphs showing the starting, mid, and top salaries.

At first glance…seems like it’s great to be a geek, with the top 6 jobs steeped in mathematics and science (exception being historian…which is a geekiness of its own sort).

But go down the list…say, all the way to position #11, and your BS meter should be going off. Apparently, philosopher is the 11th best job, with very low physical demands and stress, a “very good” hiring outlook, and a median income of $60,000.

Really? A comment on this physicsforums thread sums up my a priori assumption: “I have no factual information but I guess your career choices would be either getting a faculty position at some university or flipping burgers.”

Continue reading

Bad Nurses, and Our Tragic Inability to Track Them

Get rich in the temp nursing business

On Sunday, my ProPublica colleagues Tracy Weber and Charles Ornstein, in conjunction with the Los Angeles Times, put out a story examining the lack of standards in the temp nursing industry, a dangerous situation considering California’s desperate shortage of nursing staff.

Emboldened by a chronic nursing shortage and scant regulation, the firms vie for their share of a free-wheeling, $4-billion industry. Some have become havens for nurses who hopscotch from place to place to avoid the consequences of their misconduct. (see related story: A ‘Crazy’ Way for an Industry to Operate)

A joint investigation with the Los Angeles Times found dozens of instances in which staffing agencies skimped on background checks or ignored warnings from hospitals about sub-par nurses on their payrolls. Some hired nurses sight unseen, without even conducting an interview.

The gist of the problem: California lacks virtually any kind of tracking of errant temp nurses. This nurse, for example, was accused of stealing drugs from at least six hospitals, suffered a drug-induced seizure on the job, and had his Minnesota nursing license suspended before California got around to filing an accusation against him. Two years later, after a few more reported incidents of drug theft, the California registered nursing board finally revoked his license when he didn’t make his hearing on time.

Charlie and Tracy have been covering this story since before they joined ProPublica; LATimers Maloy Moore and Doug Smith contributed a massive amount of the essential research and data analysis. This temp nurses chapter is just another consequence of what appears to be awful record-keeping and sloth by the various oversight bodies.

My own contribution to the coverage was small. The most notable part of it was this Ruby on Rails site I built to catalogue the sanctioned nurses, a relatively minor task compared to actually collecting and parsing the data (i.e. reading through all the PDF files for the buried information). It was pretty simple, allowing users to see at a glance the numbers of disciplined nurses by various categories, including year and type of discipline. I was a little skeptical of doing it at first, just because the CA nursing board does have a searchable and functional database of its own.

Theoretically (well, if it weren’t the case that the records themselves are often incomplete, so that criminal nurses come up with a clean sheet), any member of the public could look up their own nurses’ records and avoid the bad ones. But the meat of Charlie and Tracy’s story is the numbers: 1,254 days on average to discipline a nurse (compared to 173 for Texas). 1,706 days before one nurse, who was kicked out of a drug-recovery program and considered a threat to public safety, had even an accusation filed against her. Our site makes it evident that hard numbers, not just heartbreaking anecdotes, argue against California’s regulatory status quo.

A screenshot from our sanctioned nurses database

The reporters on this story put in months of time manually tabulating the data to come up with the thrust of their stories. Sadly, all of these numbers and statistical conclusions were probably right under the nursing board’s nose. The regulators apparently track dates and types of accusations and disciplines for each nurse. A few simple database queries would’ve quickly uncovered the glaring delays and bottlenecks in the system (e.g. SELECT AVG(TO_DAYS(`date_discipline`) - TO_DAYS(`date_initial_complaint`)) AS average_delay FROM `disciplinary_actions`).
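
The same arithmetic as that query, sketched in plain Ruby for anyone without a database handy. The two rows are made up for illustration, and the field names simply mirror the hypothetical column names in the query; I have no idea what the board’s actual schema looks like.

```ruby
require 'date'

# Made-up rows for illustration only; field names mirror the query above.
actions = [
  { date_initial_complaint: '2009-01-01', date_discipline: '2009-01-11' },
  { date_initial_complaint: '2009-01-01', date_discipline: '2009-01-31' }
]

# Average number of days between the initial complaint and the discipline.
def average_delay_in_days(rows)
  delays = rows.map do |row|
    # Subtracting one Date from another gives the gap in days.
    (Date.parse(row[:date_discipline]) - Date.parse(row[:date_initial_complaint])).to_i
  end
  delays.sum.to_f / delays.size
end

puts average_delay_in_days(actions)  # prints 20.0 (delays of 10 and 30 days)
```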

A day after Charlie and Tracy’s initial story in July 2009, Gov. Schwarzenegger sacked a majority of the registered nursing board and new regulations include making public the restrictions on a nurse’s license. Read ProPublica’s complete coverage on California’s flawed oversight of health-care workers here.