Update, January 2012: Everything…yes, everything, is superseded by my free online book, The Bastards Book of Ruby, which is a much more complete walkthrough of basic programming principles with far more practical and up-to-date examples and projects than what you’ll find here.
I’m only keeping this old walkthrough up as a historical reference. I’m sure the code is so ugly that I’m not going to even try re-reading it.
So a little while ago, I set out to write some tutorials that would guide the non-coding-but-computer-savvy journalist through enough programming fundamentals so that he/she could write a web scraper to collect data from public websites. A “little while” turned out to be more than a month-and-a-half. I actually wrote most of it in a week and then forgot about. The timeliness of the fourth lesson, which shows how to help Pfizer in its mission to more transparent, compelled me to just publish them in incomplete form. There’s probably inconsistencies in the writing and some of the code examples, but the final code sections at the end of each tutorial do seem to execute as expected.
As the tutorials are aimed at people who aren’t experienced programming, the code is pretty verbose, pedantic, and in some cases, a little inefficient. It was my attempt to think how to make the code most readable, and I’m very welcome to editing changes.
DISCLAIMER:The code, data files, and results are meant for reference and example only. You use it at your own risk.
Tutorial 1: Go from knowing nothing to scraping Web pages. In an hour. Hopefully – A massive, sprawling tutorial that attempts to take you from learning what HTML is, to the definition of an “if loop statement”, and finally, to using a Ruby library to scrape some information from Wikipedia. It may be too confusing for total neophytes and laughably basic for self-taught programmers. But at least you can kind of see, from beginning to end, one roadmap on going from nothing to something in the programming world.
Update (4/28): Replaced the code and result files. Still haven’t written out a thorough explainer of what’s going on here.
Update (4/19): After revisiting this script, I see that it fails to capture some of the payments to doctors associated with entities. I’m going to rework this script and post and update soon.
So the world’s largest drug maker, Pfizer, decided to tell everyone which doctors they’ve been giving money to to speak and consult on its behalf in the latter half of 2009. These doctors are the same ones who, from time to time, recommend the use of Pfizer products.
Pfizer, the world’s largest drug maker, said Wednesday that it paid about $20 million to 4,500 doctors and other medical professionals for consulting and speaking on its behalf in the last six months of 2009, its first public accounting of payments to the people who decide which drugs to recommend. Pfizer also paid $15.3 million to 250 academic medical centers and other research groups for clinical trials in the same period.
A spokeswoman for Pfizer, Kristen E. Neese, said most of the disclosures were required by an integrity agreement that the company signed in August to settle a federal investigation into the illegal promotion of drugs for off-label uses.
Not bad at first glance. However, on further examination, it’s clear that the list is nearly useless unless you intend to click through all 480 pages manually, or, if you have a doctor in mind and you only care about that one doctor’s relationship. As a journalist, you probably have other questions. Such as:
Which doctor received the most?
What was the largest kind of expenditure?
Were there any unusually large single-item payments?
None of these questions are answerable unless you have the list in a spreadsheet. As I mentioned in earlier lessons…there are cases when the information is freely available, but the provider hasn’t made it easy to analyze. Technically, they are fulfilling their requirement to be “transparent.”
I’ll give them the benefit of the doubt that they truly want this list to be as accessible and visible as possible…I tried emailing them to ask for the list as a single spreadsheet, but the email function was broken. So, let’s just write some code to save them some work and to get our answers a little quicker. Continue reading →