Update (12/30): So about an eon later, I’ve updated this by writing a guide for ProPublica. Heed that one. This one will remain in its obsolete state.
Update (4/28): Replaced the code and result files. Still haven’t written out a thorough explainer of what’s going on here.
Update (4/19): After revisiting this script, I see that it fails to capture some of the payments to doctors associated with entities. I’m going to rework this script and post an update soon.
So the world’s largest drug maker, Pfizer, decided to tell everyone which doctors it has been paying to speak and consult on its behalf in the latter half of 2009. These are the same doctors who, from time to time, recommend the use of Pfizer products.
Pfizer, the world’s largest drug maker, said Wednesday that it paid about $20 million to 4,500 doctors and other medical professionals for consulting and speaking on its behalf in the last six months of 2009, its first public accounting of payments to the people who decide which drugs to recommend. Pfizer also paid $15.3 million to 250 academic medical centers and other research groups for clinical trials in the same period.
A spokeswoman for Pfizer, Kristen E. Neese, said most of the disclosures were required by an integrity agreement that the company signed in August to settle a federal investigation into the illegal promotion of drugs for off-label uses.
So, not an entirely altruistic release of information. But it’s out there nonetheless. You can view Pfizer’s list here; jump to my results here.
Not bad at first glance. On further examination, though, it’s clear that the list is nearly useless unless you intend to click through all 486 pages manually, or unless you have a particular doctor in mind and only care about that one doctor’s relationship. As a journalist, you probably have other questions, such as:
- Which doctor received the most?
- What was the largest kind of expenditure?
- Were there any unusually large single-item payments?
None of these questions are answerable unless you have the list in a spreadsheet. As I mentioned in earlier lessons…there are cases when the information is freely available, but the provider hasn’t made it easy to analyze. Technically, they are fulfilling their requirement to be “transparent.”
I’ll give them the benefit of the doubt that they truly want this list to be as accessible and visible as possible…I tried emailing them to ask for the list as a single spreadsheet, but the email function was broken. So, let’s just write some code to save them some work and to get our answers a little quicker.
DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.
The Code
The following code uses the same Nokogiri strategies as the past three lessons. But here are the specific considerations we have to make for Pfizer’s list:
- The base URL is: http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?enPdNm=All&iPageNo=1 The most interesting parameter is iPageNo: replace the ‘1’ with any number and you’ll progress through the list. There appear to be 486 pages.
- Each page has a table of data with the id hcpPayments. The rows of data aren’t very normalized. For example, each “Entity Paid” can have many services/activities listed, with each of those items having another name attached to it. Then there are “cash” and “non-cash” values, which may or may not be numeric (“—” apparently means 0). There’s no easy CSS selector to grab each entity, but it seems we can safely assume that if the first table column has a name (and the second and third contain city and state), then this is a new entity.
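That entity/service/total distinction can be sketched as a small classification function over the cell text of a row. This is just an illustration with made-up sample rows, not part of the actual scraper; note that a row introducing a new entity also carries that entity’s first service, so it gets two labels:

```ruby
# A stripped-down sketch of the row heuristic described above, operating
# on plain arrays of cell text instead of live Nokogiri nodes.
# Returns every label that applies to a row.
def row_labels(tds)
  return [:blank] if tds.all? { |t| t.to_s.strip.empty? }
  return [:unknown] if tds[3].to_s.strip.empty?
  labels = []
  # a city in the second column means the first column names a new "Entity Paid"
  labels << :new_entity unless tds[1].to_s.strip.empty?
  labels << (tds[3] =~ /Total/ ? :total : :service)
  labels
end

p row_labels(["SOME CLINIC", "NEW YORK", "NY", "Professional Education", "1500", "---"])
# => [:new_entity, :service]
p row_labels(["", "", "", "Total", "1500", "---"])
# => [:total]
```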
These are the steps we’ll take:

- Download pages 1 to 486 of the list (each page has 10 entries).
- Run a method that gathers all the doctor names from the pages we just downloaded to our hard drive.
- From that list of doctors, query the Pfizer site and gather the individual payments to every doctor.

The scrape produces three files:

- pfizer_doctors.txt – every doctor name listed. We use this in the next step to query each doctor individually on Pfizer’s site.
- pfizer_entities.txt – a list of every payment made to entities.
- pfizer_entity_totals.txt – a list of the total payments made to entities.
At the top, I’ve written a few convenience methods to deal with strings. Also included: get_doc_query, which we call to extract the doctor name from the links on the site, and puts_error, a quick function to log any errors we hit.
# requires for the scraper: Nokogiri for parsing, open-uri for fetching
require 'rubygems'
require 'nokogiri'
require 'open-uri'

# Some general functions to deal with strings
class String
  alias_method :old_strip, :strip
  def strip
    self.old_strip.gsub(/^[\302\240|\s]*|[\302\240|\s]*$/, '').gsub(/[\r\n]/, " ")
  end
  def strip_for_num
    self.strip.gsub(/[^0-9]/, '')
  end
  def blank?
    respond_to?(:empty?) ? empty? : !self
  end
end

END_PAGE = 486
BASE_URL = ''
DOC_QUERY_URL = 'http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?hcpdisplayName='

# extracts the doctor-name query string from a link's href
def get_doc_query(str)
  str.match(/hcpdisplayName\=(.+)/)[1]
end

# logs errors to the console and to a file
def puts_error(str)
  err = "#{Time.now}: #{str}"
  puts err
  File.open("pfizer_error_log.txt", 'a+'){|f| f.puts(err)}
end
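As a quick sanity check, here’s how that numeric-stripping behaves on typical cell values. This standalone sketch (not part of the scraper) re-declares the whitespace stripper under a different name, strip_nbsp, so it doesn’t override String#strip, and writes the non-breaking space as \u00A0, since newer Rubies reject the 1.8-era \302\240 byte escapes inside a character class:

```ruby
# Standalone demo of the String helpers. "\u00A0" is the non-breaking
# space that the Pfizer pages pad cells with.
class String
  def strip_nbsp
    gsub(/\A[\u00A0\s]*|[\u00A0\s]*\z/, '').gsub(/[\r\n]/, ' ')
  end
  def strip_for_num
    strip_nbsp.gsub(/[^0-9]/, '') # digits only: dashes and "$" signs vanish
  end
end

p "\u00A0 $1,250 \n".strip_for_num # => "1250"
p "---".strip_for_num              # => ""
```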
I found it easiest to download all the pages onto the hard drive first, using something like curl, and then run the following code against them.
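If you’d rather stay in Ruby than shell out to curl, a minimal download loop might look like this. The pfizer_pages directory name, the one-second pause, and URI.open (the modern open-uri entry point) are my own choices, not part of the original script:

```ruby
require 'open-uri'

LIST_URL = 'http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?enPdNm=All&iPageNo='

# builds the URL for a given page number of the list
def page_url(i)
  "#{LIST_URL}#{i}"
end

# saves pages 1..last_page as local files, pausing between requests
# so as not to hammer the server
def download_pages(last_page, dir = 'pfizer_pages')
  Dir.mkdir(dir) unless File.directory?(dir)
  (1..last_page).each do |i|
    File.open(File.join(dir, "#{i}.html"), 'w') do |f|
      f.write(URI.open(page_url(i)).read)
    end
    sleep 1
  end
end

# download_pages(486)  # commented out here: it hits the live site
```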
process_local_pages iterates through every page (you can set BASE_URL either to a path on your hard drive, if you’ve downloaded all the pages yourself, or to the Pfizer page), runs process_row on each table row, and stores all the doctor names and payees in separate files, while also keeping track of the total amounts.
def process_row(row, i, current_entity, arrays)
  tds = row.css('td').collect{|r| r.text.strip}
  if !tds[3].blank?
    if !tds[1].blank?
      # new entity
      puts tds[0]
      current_entity = {:name=>tds[0], :city=>tds[1], :state=>tds[2], :page=>i, :services=>[]}
      arrays[:entities].push(current_entity) if arrays[:entities]
      current_class = row['class']
    end
    if tds[3].match(/Total/)
      arrays[:totals].push([current_entity[:name], tds[4].strip_for_num, tds[5].strip_for_num].join("\t")) if arrays[:totals]
    else
      # new service
      services_td = row.css('td')[3]
      service_name = services_td.css("ul li a")[0].text.strip
      puts "#{current_entity[:name]}\t#{service_name}"
      current_entity[:services].push([service_name, tds[4].strip_for_num, tds[5].strip_for_num])
      arrays[:doctors].push(services_td.css("ul li ul li a").map{|a| get_doc_query(a['href'])}.uniq) if arrays[:doctors]
    end
  elsif tds.reject{|t| t.blank?}.length == 0
    # blank row; nothing to do
  else
    puts_error "Page #{i}: Encountered a row and didn't know what to do with it: #{tds.join("\t")}"
  end
  return current_entity
end

def process_local_pages
  doctors_arr = []
  entities_arr = []
  totals_arr = []
  for i in 1..END_PAGE
    begin
      page = Nokogiri::HTML(open("#{BASE_URL}#{i}.html"))
      count1, count2 = page.css('#pagination td.alignRight').last.text.match(/([0-9]{1,}) - ([0-9]{1,})/)[1..2].map{|c| c.to_i}
      count = count2 - count1 + 1
      puts_error("Page #{i} WARNING: Pagination count is bad") if count < 0
      puts("Page #{i}: #{count1} to #{count2}")
      rows = page.css('#hcpPayments tbody tr')
      current_entity = nil
      rows.each do |row|
        current_entity = process_row(row, i, current_entity, {:doctors=>doctors_arr, :entities=>entities_arr, :totals=>totals_arr})
      end
    rescue Exception => e
      puts_error "Oops, had a problem getting page #{i}: #{[e.to_s, e.backtrace.map{|b| "\n\t#{b}"}].join("\n")}"
    end
  end
  File.open("pfizer_doctors.txt", 'w'){|f|
    doctors_arr.uniq.each{|d| f.puts(d)}
  }
  File.open("pfizer_entities.txt", 'w'){|f|
    entities_arr.each do |e|
      e[:services].each do |s|
        f.puts("#{e[:name]}\t#{e[:page]}\t#{e[:city]}\t#{e[:state]}\t#{s[0]}\t#{s[1]}\t#{s[2]}")
      end
    end
  }
  File.open("pfizer_entity_totals.txt", 'w'){|f|
    totals_arr.uniq.each{|d| f.puts(d)}
  }
end
process_doctor is what we run after we’ve compiled the list of doctor names that appear on the Pfizer site. Each doctor has his/her own page with detailed spending. The data rows are in roughly the same format as the main list, so we reuse process_row.
def process_doctor(r, time='')
  url = "#{DOC_QUERY_URL}#{r}"
  begin
    page = Nokogiri::HTML(open(url))
  rescue Exception => e
    puts_error "Oops, had a problem getting the #{r} entry: #{[e.to_s, e.backtrace.map{|b| "\n\t#{b}"}].join("\n")}"
    return []
  end
  rows = page.css('#hcpPayments tbody tr')
  entities_arr = []
  current_entity = nil
  rows.each do |row|
    current_entity = process_row(row, '', current_entity, {:entities=>entities_arr})
  end
  name = r.split('+')
  puts_error("Should've been a last name at #{r}") if !name[0].match(/,$/)
  name = "#{name[0].gsub(/,$/, '')}\t#{name[1..-1].join(' ')}"
  vals = []
  entities_arr.each do |e|
    e[:services].each do |s|
      vals.push("#{name}\t#{e[:name]}\t#{e[:page]}\t#{e[:city]}\t#{e[:state]}\t#{s[0]}\t#{s[1]}\t#{s[2]}\t#{url}\t#{time}")
    end
  end
  File.open("pfizer_doctor_details.txt", "a"){|f| vals.each{|val| f.puts val}}
  puts vals
  return vals
end
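One fiddly detail in process_doctor is turning the hcpdisplayName query string back into a readable name. A standalone sketch of that step (the sample query string is made up, patterned on the links on Pfizer’s list):

```ruby
# How a doctor's query string, in the form "LAST,+FIRST+MIDDLE",
# gets split into last and first names.
def split_doctor_name(query)
  parts = query.split('+')
  # the last name should end with a comma; anything else is suspect
  warn "Expected a trailing comma in: #{query}" unless parts[0] =~ /,\z/
  [parts[0].sub(/,\z/, ''), parts[1..-1].join(' ')]
end

p split_doctor_name('SACKS,+GERALD+MICHAEL')
# => ["SACKS", "GERALD MICHAEL"]
```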
process_doctor_pages is just a function that calls process_doctor for each name in the pfizer_doctors.txt file we previously compiled.
The final result is pfizer_doctor_details.txt, which contains a line for every payment to every doctor.
def process_doctor_pages
  time = Time.now
  File.open("pfizer_doctors.txt", 'r'){|f|
    f.readlines.each do |r|
      process_doctor(r.strip, time) # strip the trailing newline from each name
    end
  }
end
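With pfizer_doctor_details.txt in hand, the questions from the top of this post become a few lines of code each. Here’s a sketch, not part of the original scripts, assuming the tab-separated column order that process_doctor writes out (last name, first name, entity, page, city, state, service, cash, non-cash, URL, timestamp):

```ruby
# Sum cash + non-cash payments per doctor and list the biggest recipients.
def top_paid_doctors(path, n = 5)
  totals = Hash.new(0)
  File.foreach(path) do |line|
    cols = line.chomp.split("\t")
    name = "#{cols[0]}, #{cols[1]}"
    totals[name] += cols[7].to_i + cols[8].to_i # cash + non-cash columns
  end
  totals.sort_by { |_, total| -total }.first(n)
end

# top_paid_doctors("pfizer_doctor_details.txt").each do |name, total|
#   puts "#{name}\t#{total}"
# end
```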
The Results
After Googling the top-paid doctor on the list (Gerald Michael Sacks, at roughly $150K), I came across the Pharma Conduct blog, which had already posted partial aggregations of the list, including the top five doctors, complete with profiles and pictures.
Since Pharma Conduct has already been on the ball, I’ll defer to its analysis. It has some good background here on how lame pharma companies have been in past releases of data. Overall, Pharma Conduct is less than impressed with Pfizer:
Despite reporting more information than some of its peers, Pfizer’s interface is still very limited. For one, to use the search filtering, you must know a physician’s first name and last name, as well as the state where the payment was made. Also, the data cannot be sorted by payment amount, which is a big limitation. Pfizer should be given credit for releasing the information and being so thorough. However, by releasing it in a format that is not really amenable to data analysis and is more suited to simply looking up results one physician at a time, I echo John Mack’s sentiment, namely, that this data is translucent, but not transparent.
Hi Dan,
What an excellent write-up. I was pondering the idea of whether or not my readers would be interested in a tutorial for how I aggregated the data so quickly. Thus, by posting such an excellent overview of web-scraping, you have saved me a tremendous amount of time. I’m a PHP/MySQL user myself, but anyone with a modicum of programming skill should be able to follow your example. Posting the raw data, which I had hoped to do later this week (I’m traveling at the moment), was a nice thing to do. I didn’t want to post the full set until I’d had a chance to perform some serious QC on the data. Looking over your numbers, we seem to have found most of the same things, as well as some of the same problems.
For example, regarding Duke, I think I understand where the discrepancy lies. I alluded to this briefly in a previous post (see Who were the top 5 recipients of money from Pfizer during the period 2009 Q3-Q4?), but I have a more complete post in the works. Thanks again for writing such an excellent tutorial, for making the data available, and for posting on my blog. In one of my next posts, I’ll be sure to spotlight this post.
-Eric
Dan
Where is ‘process_row’? I can’t see it defined anywhere? Thanks
Joe
Really, really awesome that you did this. I think if there were more GUI-oriented tools built into the browser itself, via plugins and panels, this practice of large data-set analysis could really catch on, especially with these cool examples.
On a side note, these CEOs sitting on multiple boards is fascinating.