Category Archives: works

actual works, projects

The Most Viewed Portraits: Marina Abramović: The Artist Is Present, at MoMA

Thought it’d be fun to see the 200 most-viewed portraits in MoMA’s “Marina Abramović: The Artist Is Present” Flickr set, so I wrote a scraper to collect each portrait’s stats, including page views. A number of celebrities participated in the marathon performance-art exhibit, including Sharon Stone, Rufus Wainwright, and Björk. Also near the top: the guy who showed up a dozen times, along with children and pretty young women.
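The core of the scraper is just extracting each photo page’s view count and ranking by it. A minimal sketch of that step, using only the standard library; the markup and URLs below are stand-ins for illustration, not Flickr’s actual HTML:

```ruby
# Stand-in photo pages; the real scraper fetched each portrait's stats page.
sample_pages = [
  { url: "https://flickr.example/photo/1", html: "<span class='views'>24504</span>" },
  { url: "https://flickr.example/photo/2", html: "<span class='views'>2477</span>" },
  { url: "https://flickr.example/photo/3", html: "<span class='views'>11482</span>" },
]

# Pull the view count out of each page and rank from most to least viewed.
ranked = sample_pages.map { |p|
  { url: p[:url], views: p[:html][/class='views'>(\d+)</, 1].to_i }
}.sort_by { |p| -p[:views] }

ranked.each { |p| puts "#{p[:views]}\t#{p[:url]}" }
```

The real set required fetching each of the 200 photo pages first (with something like `open-uri`), but the parse-and-sort logic is the same.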

Photos by Marco Anelli. © 2010 Marina Abramović


[Gallery: the 200 most-viewed portraits, each captioned with its Flickr view count, ranging from 24,504 views at the top down to 2,477 at number 200.]

Marina Abramović Melts Before Your Eyes

Wrote a quick scrape of the Museum of Modern Art’s gallery for Marina Abramović’s “The Artist Is Present”. These are Abramović’s own portraits from the last 68 days (I guess the upload isn’t complete yet).
(Update: all 72 days are up; I’ll get around to updating this. Number 72 is a doozy.)

Photos by Marco Anelli. © 2010 Marina Abramović

Coding for Journalists 101 : A four-part series

nico.cavallotto

Photo by Nico Cavallotto on Flickr

Update, January 2012: Everything…yes, everything, is superseded by my free online book, The Bastards Book of Ruby, which is a much more complete walkthrough of basic programming principles with far more practical and up-to-date examples and projects than what you’ll find here.

I’m only keeping this old walkthrough up as a historical reference. I’m sure the code is so ugly that I’m not going to even try re-reading it.

So check it out: The Bastards Book of Ruby

-Dan

Update, Dec. 30, 2010: I published a series of data collection and cleaning guides for ProPublica, to describe what I did for our Dollars for Docs project. There is a guide for Pfizer which supersedes the one I originally posted here.

So a little while ago, I set out to write some tutorials that would guide the non-coding-but-computer-savvy journalist through enough programming fundamentals to write a web scraper that collects data from public websites. A “little while” turned out to be more than a month and a half. I actually wrote most of it in a week and then forgot about it. The timeliness of the fourth lesson, which shows how to help Pfizer in its mission to be more transparent, compelled me to publish the tutorials in incomplete form. There are probably inconsistencies in the writing and some of the code examples, but the final code sections at the end of each tutorial do seem to execute as expected.

As the tutorials are aimed at people who aren’t experienced in programming, the code is pretty verbose, pedantic, and in some cases a little inefficient. It was my attempt at making the code as readable as possible, and I welcome suggested edits.

DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.

Coding for Journalists 104: Pfizer’s Doctor Payments; Making a Better List

Update (12/30): So about an eon later, I’ve updated this by writing a guide for ProPublica. Heed that one. This one will remain in its obsolete state.

Update (4/28): Replaced the code and result files. Still haven’t written out a thorough explainer of what’s going on here.

Update (4/19): After revisiting this script, I see that it fails to capture some of the payments to doctors associated with entities. I’m going to rework this script and post an update soon.

So the world’s largest drug maker, Pfizer, decided to tell everyone which doctors it has been paying to speak and consult on its behalf in the latter half of 2009. These doctors are the same ones who, from time to time, recommend the use of Pfizer products.

From the NYT:

Pfizer, the world’s largest drug maker, said Wednesday that it paid about $20 million to 4,500 doctors and other medical professionals for consulting and speaking on its behalf in the last six months of 2009, its first public accounting of payments to the people who decide which drugs to recommend. Pfizer also paid $15.3 million to 250 academic medical centers and other research groups for clinical trials in the same period.

A spokeswoman for Pfizer, Kristen E. Neese, said most of the disclosures were required by an integrity agreement that the company signed in August to settle a federal investigation into the illegal promotion of drugs for off-label uses.

So, not an entirely altruistic release of information. But it’s out there nonetheless. You can view Pfizer’s list here; jump to my results here.

Not bad at first glance. On further examination, though, it’s clear that the list is nearly useless unless you intend to click through all 480 pages manually, or unless you have one particular doctor in mind and care only about that doctor’s relationship. As a journalist, you probably have other questions, such as:

  • Which doctor received the most?
  • What was the largest kind of expenditure?
  • Were there any unusually large single-item payments?

None of these questions is answerable unless you have the list in a spreadsheet. As I mentioned in earlier lessons, there are cases when the information is freely available, but the provider hasn’t made it easy to analyze. Technically, they are fulfilling their requirement to be “transparent.”

I’ll give them the benefit of the doubt that they truly want this list to be as accessible and visible as possible. I tried emailing them to ask for the list as a single spreadsheet, but the email function was broken. So, let’s just write some code to save them some work and get our answers a little quicker.
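Once the paginated list has been scraped into a single CSV, the three questions above each take a few lines of Ruby. The rows below are invented for illustration (the real scrape produced one row per payment line item), but the aggregation pattern is the same:

```ruby
require 'csv'

# Invented sample rows standing in for the scraped Pfizer payment data.
data = CSV.parse(<<~CSV, headers: true)
  doctor,category,amount
  "Smith, John",Expert-Led Forums,5000
  "Smith, John",Meals,120
  "Doe, Jane",Professional Advising,18000
  "Doe, Jane",Meals,95
CSV

rows = data.map { |r| { doctor: r['doctor'], category: r['category'], amount: r['amount'].to_i } }

# Which doctor received the most?
top_doctor = rows.group_by { |r| r[:doctor] }
                 .map { |doc, rs| [doc, rs.sum { |r| r[:amount] }] }
                 .max_by { |_, total| total }

# What was the largest kind of expenditure?
top_category = rows.group_by { |r| r[:category] }
                   .map { |cat, rs| [cat, rs.sum { |r| r[:amount] }] }
                   .max_by { |_, total| total }

# Were there any unusually large single-item payments?
biggest_item = rows.max_by { |r| r[:amount] }

puts top_doctor.inspect
puts top_category.inspect
puts biggest_item.inspect
```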
Continue reading

Coding for Journalists 103: Who’s been in jail before: Cross-checking the jail log with the court system; Use Ruby’s mechanize to fill out a form

This is part of a four-part series on web-scraping for journalists. As of Apr. 5, 2010, it was published a bit incomplete because I wanted to post a timely solution to the recent Pfizer doctor-payments list release, but the code at the bottom of each tutorial should execute properly. The code examples are meant for reference and I make no claims to the accuracy of the results. Contact dan@danwin.com if you have any questions, or leave a comment below.

DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.

In particular, with lesson 3, I skipped basically any explanation of the code. I hope to get around to it later.

Going to Court

In the last lesson, we learned how to write a script that would record who was in jail at a given hour. This could yield some interesting stories for a crime reporter, including spates of arrests for notable crimes and inmates held on $1,000,000 bail for relatively minor crimes. An even more interesting angle, however, would be to check the inmates’ prior records, to get a glimpse of the recidivism rate, for example.

Sacramento Superior Court allows users to search not just by name, but by the unique ID number given to inmates by Sacramento-area jurisdictions. This makes it pretty easy to link current inmates to court records.


However, the techniques we used in past lessons to automate the data collection won’t work here. As you can see in the picture above, you have to fill out a form, which is not something any of the code we’ve written previously can do. Luckily, that’s where Ruby’s mechanize comes in.
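Under the hood, submitting a search form is just an HTTP POST whose body is the form’s fields, URL-encoded; mechanize’s job is to handle the fetching, cookies, and field bookkeeping for you. A sketch of the encoding step with only the standard library (the field names here are hypothetical, not the court site’s actual form):

```ruby
require 'uri'

# Hypothetical form fields; a real court search form would name these differently.
fields = { 'personId' => 'X-1234567', 'searchType' => 'defendant' }

# This is the body mechanize would POST when the form is submitted.
body = URI.encode_www_form(fields)
puts body

# With the mechanize gem, the same submission reads roughly like:
#   agent = Mechanize.new
#   page  = agent.get(search_url)
#   form  = page.forms.first
#   form['personId'] = 'X-1234567'
#   results = form.submit
```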

Continue reading

Coding for Journalists 101: Go from knowing nothing to scraping Web pages. In an hour. Hopefully.

UPDATE (12/1/2011): Ever since writing this guide, I’ve wanted to put together a site that is focused both on teaching the basics of programming and showing examples of practical code. I finally got around to making it: The Bastards Book of Ruby.

I’ve since learned that trying to teach the fundamentals of programming in one blog post is completely dumb. Also, I hope I’m a better coder now than I was a year and a half ago when I first wrote this guide. Check it out and let me know what you think:

http://ruby.bastardsbook.com

Someone asked in this online chat for journalists: I want to program/code, but where does a non-programmer journalist begin?

My colleague Jeff Larson gave what I believe is the most practical and professionally useful answer: web-scraping (jump to my summary of web-scraping here, or read this more authoritative source).

This is my attempt to walk someone through the most basic computer science theory so that he/she can begin collecting data in an automated way from web pages, which I think is one of the most useful (and time-saving) tools available to today’s journalist. And thanks to the countless hours of work by generous coders, the tools are already there to put this within the grasp of a beginning programmer.

You just have to know where the tools are and how to pick them up.

Click here for this page’s table of contents. Or jump to the theory lesson. Or to the programming exercise. Or, if you already know what functions and variables are, and have Ruby installed, go straight to two of my walkthroughs of building a real-world, journalistic-minded web scraper: scraping a jail site, and scraping Pfizer’s doctor payment list.

Or, read on for some more exposition:

Continue reading

Day of the Tiger: How Newspapers, Networks, and News Aggregators Played Tiger Woods on Friday

On Friday, golfer Tiger Woods held a TV appearance to talk about life after his marital problems. At around 2:30 p.m., I screen-capped the websites of some of the largest news organizations and aggregators. Today, I looked at the screen caps, cropped them to the top 1600 pixels, and marked in green the areas of the pages devoted to Woods coverage (or related coverage, such as “Slideshow: Top 10 Adultery Confessions”).

Continue reading

ProPublica tracks the bailout, a year or so later

Today, my ProPublica colleague Paul Kiel and I put out some graphical revisions to PP’s bank bailout tracking site, including our master list of the companies that received taxpayer bailout money:

Graphic: The Status of the Bailout

Bailout List Page

Nothing fancy; mostly, we made the numbers easier to find and compare. The site itself has been far from fancy since its inception, as it was my first project after taking a crash course on Ruby on Rails. Back when the bailout was first announced in Q4 2008, the Treasury declined to name the banks it was doling taxpayer money out to, for fear that non-listed banks would take a hit in reputation. Paul was one of the first few people to comb through banks’ press releases and enter them into a spreadsheet. His list of the first 26, put into a simple HTML table, was a pretty big hit.

As the list grew into the dozens and hundreds, the static list became more cumbersome to maintain; it was nothing more than each bank’s name, date of announcement, and amount of bailout. Plus, it was no longer just one bailout per company: Citigroup and Bank of America were beneficiaries of billions of dollars through a couple of other programs.

So, I proposed a bailout site that would allow Paul to record the data at a more granular level. Up to that point, for example, most online lists showed that AIG had several dozen billion dollars committed to it, but not the various programs, reasons, and dates on which those allocations were made. A little anal, maybe, but it gave the site the flexibility to adapt when the bailout grew to include all varieties of disbursements, including to auto-parts manufacturers and mortgage servicers, as well as money flowing in the opposite direction, in the form of refunds and dividends.
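The data model described above boils down to one row per transaction, signed so that refunds and dividends flow the other way, with company totals derived rather than stored. A sketch of the idea (the company names, programs, and figures below are invented for illustration):

```ruby
# One row per transaction; negative amounts represent money flowing back
# to the Treasury (refunds, dividends). All values are invented.
transactions = [
  { company: "MegaBank",     program: "Capital Purchase",     date: "2008-10-28", amount: 25_000 },
  { company: "MegaBank",     program: "Targeted Investment",  date: "2009-01-16", amount: 20_000 },
  { company: "MegaBank",     program: "Refund",               date: "2009-12-09", amount: -45_000 },
  { company: "AutoParts Co", program: "Auto Supplier Support", date: "2009-04-08", amount: 3_500 },
]

# Net totals per company fall out of the detailed rows instead of being
# maintained by hand, so new programs and refunds need no schema changes.
net_by_company = transactions.group_by { |t| t[:company] }
                             .transform_values { |ts| ts.sum { |t| t[:amount] } }

net_by_company.each { |company, net| puts "#{company}: #{net}" }
```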

I saw the site more as a base for Paul’s bailout coverage (he’s been doing an excellent job covering the progress of the mortgage modification program), as I assumed that in the near future, Treasury would have its own easy-to-use site for the data. Unfortunately, that is not quite the case, nearly a year and a half later. Besides some questionable UI decisions (such as making the front-facing page a Flash map), the data is not put forth in an easily accessible format. It could be that I need to take an Excel refresher course, but trying to sort the columns in these Excel spreadsheets just to find the biggest bailout amount, for example, throws an error.

Only in the past couple of months did Treasury finally release dividend data in non-PDF form, and even then, it’s still a pain to work with (there’s no way, for example, to link the bank names in the dividends sheet to the master spreadsheet of bailouts). I would’ve thought that’d be the set of bailout data Treasury would be most eager to send out, because it’s the taxpayers’ return on investment. But, as it turns out, there is a half-empty perspective in this data (such as banks not having enough reserves to pay dividends in a timely fashion), one that would’ve been immediately obvious if the data were in a more sortable form.
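The missing link between Treasury’s two spreadsheets is essentially a join on bank name. A sketch of how that join could work in Ruby, with invented rows; real names rarely match exactly, so some normalization (here, just case and whitespace) is the first step:

```ruby
# Invented stand-ins for the master bailout sheet and the dividends sheet.
master = [
  { name: "First Example Bank", bailout: 10_000 },
  { name: "Sample Trust Co.",   bailout: 4_000 },
]
dividends = [
  { name: "FIRST EXAMPLE BANK", paid: 250 },
]

# Normalize names so trivially different spellings still match.
norm = ->(s) { s.strip.downcase }

# Index dividends by normalized name, then attach them to the master rows;
# banks with no dividend row get 0.
paid_by_name = dividends.to_h { |d| [norm.(d[:name]), d[:paid]] }
joined = master.map do |m|
  m.merge(dividends_paid: paid_by_name.fetch(norm.(m[:name]), 0))
end
```

In practice the name mismatches are messier than case differences, which is exactly why a shared identifier in both sheets would have been so useful.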

ProPublica’s bailout tracking site doesn’t have much data beyond the official Treasury bailout numbers; there are all kinds of other, unofficial numbers, such as how much each bank is giving out in bonuses, that people are more interested in. American University has gathered all kinds of financial-health indicators for each bailout bank, too. There’s definitely much more data that PP and other bailout trackers need to collect to provide a bigger picture of the bailout situation. But for now, I guess it’s a small victory to be one of the top starting points for finding out just where hundreds of billions of our tax dollars went. And the why, too; Paul’s done a great job writing translations of the Treasury’s official-speak on each program.

ProPublica’s Eye on the Bailout