Category Archives: works

actual works, projects

The Most Viewed Portraits: Marina Abramović: The Artist Is Present, at MoMA

Thought it’d be fun to see the 200 most-viewed portraits in MoMA’s “Marina Abramović: The Artist Is Present” Flickr set, so I wrote a scraper to collect each portrait’s stats, including page views. A number of celebrities participated in the marathon performance-art exhibit, including Sharon Stone, Rufus Wainwright, and Björk. Also near the top: the guy who showed up a dozen times, along with children and pretty young women.
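The core of the scraper is just extracting each photo page’s view count and ranking by it. A minimal sketch of that step, using only the standard library; the markup and URLs below are stand-ins for illustration, not Flickr’s actual HTML:

```ruby
# Stand-in photo pages; the real scraper fetched each portrait's stats page.
sample_pages = [
  { url: "https://flickr.example/photo/1", html: "<span class='views'>24504</span>" },
  { url: "https://flickr.example/photo/2", html: "<span class='views'>2477</span>" },
  { url: "https://flickr.example/photo/3", html: "<span class='views'>11482</span>" },
]

# Pull the view count out of each page and rank from most to least viewed.
ranked = sample_pages.map { |p|
  { url: p[:url], views: p[:html][/class='views'>(\d+)</, 1].to_i }
}.sort_by { |p| -p[:views] }

ranked.each { |p| puts "#{p[:views]}\t#{p[:url]}" }
```

The real set required fetching each of the 200 photo pages first (with something like `open-uri`), but the parse-and-sort logic is the same.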

Photos by Marco Anelli. © 2010 Marina Abramović


[Gallery: the 200 most-viewed portraits, each captioned with its Flickr view count, ranging from 24,504 views at the top down to 2,477 at number 200.]

Marina Abramović Melts Before Your Eyes

Wrote a quick scrape of the Museum of Modern Art’s gallery for Marina Abramović’s “The Artist Is Present”. These are Abramović’s own portraits from the last 68 days (I guess the upload isn’t complete yet).
(Update: all 72 days are up; I’ll get around to updating this. Number 72 is a doozy.)

Photos by Marco Anelli. © 2010 Marina Abramović

Coding for Journalists 101 : A four-part series

nico.cavallotto

Photo by Nico Cavallotto on Flickr

Update, January 2012: Everything…yes, everything, is superseded by my free online book, The Bastards Book of Ruby, which is a much more complete walkthrough of basic programming principles with far more practical and up-to-date examples and projects than what you’ll find here.

I’m only keeping this old walkthrough up as a historical reference. I’m sure the code is so ugly that I’m not going to even try re-reading it.

So check it out: The Bastards Book of Ruby

-Dan

Update, Dec. 30, 2010: I published a series of data collection and cleaning guides for ProPublica, to describe what I did for our Dollars for Docs project. There is a guide for Pfizer which supersedes the one I originally posted here.

So a little while ago, I set out to write some tutorials that would guide the non-coding-but-computer-savvy journalist through enough programming fundamentals to write a web scraper that collects data from public websites. A “little while” turned out to be more than a month and a half. I actually wrote most of it in a week and then forgot about it. The timeliness of the fourth lesson, which shows how to help Pfizer in its mission to be more transparent, compelled me to publish the tutorials in incomplete form. There are probably inconsistencies in the writing and some of the code examples, but the final code sections at the end of each tutorial do seem to execute as expected.

As the tutorials are aimed at people who aren’t experienced in programming, the code is pretty verbose, pedantic, and in some cases a little inefficient. It was my attempt at making the code as readable as possible, and I welcome suggested edits.

DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.

Coding for Journalists 104: Pfizer’s Doctor Payments; Making a Better List

Update (12/30): So about an eon later, I’ve updated this by writing a guide for ProPublica. Heed that one. This one will remain in its obsolete state.

Update (4/28): Replaced the code and result files. Still haven’t written out a thorough explainer of what’s going on here.

Update (4/19): After revisiting this script, I see that it fails to capture some of the payments to doctors associated with entities. I’m going to rework this script and post an update soon.

So the world’s largest drug maker, Pfizer, decided to tell everyone which doctors it has been paying to speak and consult on its behalf in the latter half of 2009. These doctors are the same ones who, from time to time, recommend the use of Pfizer products.

From the NYT:

Pfizer, the world’s largest drug maker, said Wednesday that it paid about $20 million to 4,500 doctors and other medical professionals for consulting and speaking on its behalf in the last six months of 2009, its first public accounting of payments to the people who decide which drugs to recommend. Pfizer also paid $15.3 million to 250 academic medical centers and other research groups for clinical trials in the same period.

A spokeswoman for Pfizer, Kristen E. Neese, said most of the disclosures were required by an integrity agreement that the company signed in August to settle a federal investigation into the illegal promotion of drugs for off-label uses.

So, not an entirely altruistic release of information. But it’s out there nonetheless. You can view Pfizer’s list here; jump to my results here.

Not bad at first glance. On further examination, though, it’s clear that the list is nearly useless unless you intend to click through all 480 pages manually, or unless you have one particular doctor in mind and care only about that doctor’s relationship. As a journalist, you probably have other questions, such as:

  • Which doctor received the most?
  • What was the largest kind of expenditure?
  • Were there any unusually large single-item payments?

None of these questions is answerable unless you have the list in a spreadsheet. As I mentioned in earlier lessons, there are cases when the information is freely available, but the provider hasn’t made it easy to analyze. Technically, they are fulfilling their requirement to be “transparent.”

I’ll give them the benefit of the doubt that they truly want this list to be as accessible and visible as possible. I tried emailing them to ask for the list as a single spreadsheet, but the email function was broken. So, let’s just write some code to save them some work and get our answers a little quicker.
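Once the paginated list has been scraped into a single CSV, the three questions above each take a few lines of Ruby. The rows below are invented for illustration (the real scrape produced one row per payment line item), but the aggregation pattern is the same:

```ruby
require 'csv'

# Invented sample rows standing in for the scraped Pfizer payment data.
data = CSV.parse(<<~CSV, headers: true)
  doctor,category,amount
  "Smith, John",Expert-Led Forums,5000
  "Smith, John",Meals,120
  "Doe, Jane",Professional Advising,18000
  "Doe, Jane",Meals,95
CSV

rows = data.map { |r| { doctor: r['doctor'], category: r['category'], amount: r['amount'].to_i } }

# Which doctor received the most?
top_doctor = rows.group_by { |r| r[:doctor] }
                 .map { |doc, rs| [doc, rs.sum { |r| r[:amount] }] }
                 .max_by { |_, total| total }

# What was the largest kind of expenditure?
top_category = rows.group_by { |r| r[:category] }
                   .map { |cat, rs| [cat, rs.sum { |r| r[:amount] }] }
                   .max_by { |_, total| total }

# Were there any unusually large single-item payments?
biggest_item = rows.max_by { |r| r[:amount] }

puts top_doctor.inspect
puts top_category.inspect
puts biggest_item.inspect
```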
Continue reading

Coding for Journalists 103: Who’s been in jail before: Cross-checking the jail log with the court system; Use Ruby’s mechanize to fill out a form

This is part of a four-part series on web-scraping for journalists. As of Apr. 5, 2010, it was published a bit incomplete because I wanted to post a timely solution to the recent Pfizer doctor-payments list release, but the code at the bottom of each tutorial should execute properly. The code examples are meant for reference and I make no claims to the accuracy of the results. Contact dan@danwin.com if you have any questions, or leave a comment below.

DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.

In particular, with lesson 3, I skipped basically any explanation of the code. I hope to get around to it later.

Going to Court

In the last lesson, we learned how to write a script that would record who was in jail at a given hour. This could yield some interesting stories for a crime reporter, including spates of arrests for notable crimes and inmates held on $1,000,000 bail for relatively minor crimes. An even more interesting angle, however, would be to check the inmates’ prior records, to get a glimpse of the recidivism rate, for example.

Sacramento Superior Court allows users to search not just by name, but by the unique ID number given to inmates by Sacramento-area jurisdictions. This makes it pretty easy to link current inmates to court records.


However, the techniques we used in past lessons to automate the data collection won’t work here. As you can see in the picture above, you have to fill out a form, which is not something any of the code we’ve written previously can do. Luckily, that’s where Ruby’s mechanize comes in.
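Under the hood, submitting a search form is just an HTTP POST whose body is the form’s fields, URL-encoded; mechanize’s job is to handle the fetching, cookies, and field bookkeeping for you. A sketch of the encoding step with only the standard library (the field names here are hypothetical, not the court site’s actual form):

```ruby
require 'uri'

# Hypothetical form fields; a real court search form would name these differently.
fields = { 'personId' => 'X-1234567', 'searchType' => 'defendant' }

# This is the body mechanize would POST when the form is submitted.
body = URI.encode_www_form(fields)
puts body

# With the mechanize gem, the same submission reads roughly like:
#   agent = Mechanize.new
#   page  = agent.get(search_url)
#   form  = page.forms.first
#   form['personId'] = 'X-1234567'
#   results = form.submit
```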

Continue reading

Coding for Journalists 101: Go from knowing nothing to scraping Web pages. In an hour. Hopefully.

UPDATE (12/1/2011): Ever since writing this guide, I’ve wanted to put together a site that is focused both on teaching the basics of programming and showing examples of practical code. I finally got around to making it: The Bastards Book of Ruby.

I’ve since learned that trying to teach the fundamentals of programming in one blog post is completely dumb. Also, I hope I’m a better coder now than I was a year and a half ago when I first wrote this guide. Check it out and let me know what you think:

http://ruby.bastardsbook.com

Someone asked in this online chat for journalists: I want to program/code, but where does a non-programmer journalist begin?

My colleague Jeff Larson gave what I believe is the most practical and professionally useful answer: web-scraping (jump to my summary of web-scraping here, or read this more authoritative source).

This is my attempt to walk someone through the most basic computer science theory so that he/she can begin collecting data in an automated way from web pages, which I think is one of the most useful (and time-saving) tools available to today’s journalist. And thanks to the countless hours of work by generous coders, the tools are already there to put this within the grasp of a beginning programmer.

You just have to know where the tools are and how to pick them up.

Click here for this page’s table of contents. Or jump to the theory lesson. Or to the programming exercise. Or, if you already know what functions and variables are, and have Ruby installed, go straight to two of my walkthroughs of building a real-world, journalistic-minded web scraper: scraping a jail site, and scraping Pfizer’s doctor payment list.

Or, read on for some more exposition:

Continue reading

Day of the Tiger: How Newspapers, Networks, and News Aggregators Played Tiger Woods on Friday

On Friday, golfer Tiger Woods held a TV appearance to talk about life after his marital problems. At around 2:30 p.m., I screen-capped the websites of some of the largest news organizations and aggregators. Today, I looked at the screen caps, cropped them to the top 1600 pixels, and marked in green the areas of the pages devoted to Woods coverage (or related coverage, such as “Slideshow: Top 10 Adultery Confessions”).

Continue reading

ProPublica tracks the bailout, a year or so later

Today, my ProPublica colleague Paul Kiel and I put out some graphical revisions to PP’s bank bailout tracking site, including our master list of the companies that received taxpayer bailout money:

Graphic: The Status of the Bailout

Bailout List Page

Nothing fancy; mostly, we made the numbers easier to find and compare. The site itself has been far from fancy since its inception, as it was my first project after taking a crash course on Ruby on Rails. Back when the bailout was first announced in Q4 2008, the Treasury declined to name the banks it was doling taxpayer money out to, for fear that non-listed banks would take a hit in reputation. Paul was one of the first few people to comb through banks’ press releases and enter them into a spreadsheet. His list of the first 26, put into a simple HTML table, was a pretty big hit.

As the list grew into the dozens and hundreds, the static list became more cumbersome to maintain; it was nothing more than each bank’s name, date of announcement, and amount of bailout. Plus, it was no longer just one bailout per company: Citigroup and Bank of America were beneficiaries of billions of dollars through a couple of other programs.

So, I proposed a bailout site that would allow Paul to record the data at a more granular level. Up to that point, for example, most online lists showed that AIG had several dozen billion dollars committed to it, but not the various programs, reasons, and dates on which those allocations were made. A little anal, maybe, but it gave the site the flexibility to adapt when the bailout grew to include all varieties of disbursements, including to auto-parts manufacturers and mortgage servicers, as well as money flowing in the opposite direction, in the form of refunds and dividends.
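The data model described above boils down to one row per transaction, signed so that refunds and dividends flow the other way, with company totals derived rather than stored. A sketch of the idea (the company names, programs, and figures below are invented for illustration):

```ruby
# One row per transaction; negative amounts represent money flowing back
# to the Treasury (refunds, dividends). All values are invented.
transactions = [
  { company: "MegaBank",     program: "Capital Purchase",     date: "2008-10-28", amount: 25_000 },
  { company: "MegaBank",     program: "Targeted Investment",  date: "2009-01-16", amount: 20_000 },
  { company: "MegaBank",     program: "Refund",               date: "2009-12-09", amount: -45_000 },
  { company: "AutoParts Co", program: "Auto Supplier Support", date: "2009-04-08", amount: 3_500 },
]

# Net totals per company fall out of the detailed rows instead of being
# maintained by hand, so new programs and refunds need no schema changes.
net_by_company = transactions.group_by { |t| t[:company] }
                             .transform_values { |ts| ts.sum { |t| t[:amount] } }

net_by_company.each { |company, net| puts "#{company}: #{net}" }
```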

I saw the site more as a base for Paul’s bailout coverage (he’s been doing an excellent job covering the progress of the mortgage modification program), as I assumed that in the near future, Treasury would have its own easy-to-use site for the data. Unfortunately, that is not quite the case, nearly a year and a half later. Besides some questionable UI decisions (such as making the front-facing page a Flash map), the data is not put forth in an easily accessible format. It could be that I need to take an Excel refresher course, but trying to sort the columns in these Excel spreadsheets just to find the biggest bailout amount, for example, throws an error.

Only in the past couple of months did Treasury finally release dividend data in non-PDF form, and even then, it’s still a pain to work with (there’s no way, for example, to link the bank names in the dividends sheet to the master spreadsheet of bailouts). I would’ve thought that’d be the set of bailout data Treasury would be most eager to send out, because it’s the taxpayers’ return on investment. But, as it turns out, there is a half-empty perspective in this data (such as banks not having enough reserves to pay dividends in a timely fashion), one that would’ve been immediately obvious if the data were in a more sortable form.
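The missing link between Treasury’s two spreadsheets is essentially a join on bank name. A sketch of how that join could work in Ruby, with invented rows; real names rarely match exactly, so some normalization (here, just case and whitespace) is the first step:

```ruby
# Invented stand-ins for the master bailout sheet and the dividends sheet.
master = [
  { name: "First Example Bank", bailout: 10_000 },
  { name: "Sample Trust Co.",   bailout: 4_000 },
]
dividends = [
  { name: "FIRST EXAMPLE BANK", paid: 250 },
]

# Normalize names so trivially different spellings still match.
norm = ->(s) { s.strip.downcase }

# Index dividends by normalized name, then attach them to the master rows;
# banks with no dividend row get 0.
paid_by_name = dividends.to_h { |d| [norm.(d[:name]), d[:paid]] }
joined = master.map do |m|
  m.merge(dividends_paid: paid_by_name.fetch(norm.(m[:name]), 0))
end
```

In practice the name mismatches are messier than case differences, which is exactly why a shared identifier in both sheets would have been so useful.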

ProPublica’s bailout tracking site doesn’t have much data beyond the official Treasury bailout numbers; there are all kinds of other, unofficial numbers, such as how much each bank is giving out in bonuses, that people are more interested in. American University has gathered all kinds of financial-health indicators for each bailout bank, too. There’s definitely much more data that PP and other bailout trackers need to collect to provide a bigger picture of the bailout situation. But for now, I guess it’s a small victory to be one of the top starting points for finding out just where hundreds of billions of our tax dollars went. And the why, too; Paul’s done a great job writing translations of the Treasury’s official-speak on each program.

ProPublica’s Eye on the Bailout