Category Archives: works

actual works, projects

Go to aGogh for museums, arts, culture listings

So I’ve finally finished my update to my listing… though it’s not quite finished. But it’s good enough for now for people to get some use out of it.

Same idea as before: an easy-to-read list of cultural venues in the city. But I’ve added profile pages for all the venues and a sampling of exhibition listings. After viewing more than 200 homepages, I’m even more convinced that it’s a huge pain to serendipitously find what’s going on and when (other than at the most popular, obvious attractions), because of how different each place’s web presence is.

This site is an attempt to make it all a little more uniform, whether you want to see the latest exhibits in the city or what’s free today. Let me know what you think.

NICAR 2011 wrapup

Just came back from an inspiring week at the National Institute for Computer-Assisted Reporting’s conference in Raleigh, NC. Of all the journalism conferences I’ve been to, this one had the most to learn from and the most attendees excited to learn. There was real discussion about news apps being their own form of storytelling and art, not just a bunch of numbers uploaded as HTML.

Chrys Wu has a compilation of the tipsheets and the highly technical tutorials. It’s a great trove for anyone – journalists or not – wanting to learn how to collect and process data and build powerful news applications. Some of my favorites, for their step-by-step nature: Jacob Fenton’s R tutorial, David Huynh’s detailed guide to his Google Refine, Andy Boyle’s on setting up Varnish, and Timothy Barmann’s walkthrough of JavaScript mapping. My colleague Jeff Larson shows off his own JavaScript skills with this MVC framework.

I led a couple of sessions. One boiled down to, basically, “use Firebug,” which you can pretty much glean from a tutorial I wrote for ProPublica on how I grabbed the data from drugmaker Cephalon’s Flash site. I wrote another Ruby tutorial, going from “Hello World” to building a Foursquare/Google Maps mashup; it would have been doable in an hour-long session had I been better prepared with presentation materials.

One reason to try learning how to code now is that teaching resources have never been more abundant. The NICAR resources collected on Chrys’s blog are more proof of this.

The free list of free New York museums

Last Wednesday, in my haste to get it over with before I forgot about it after a weekend at NICAR, I threw up a hand-compiled chart of New York museums and other cultural attractions, focused primarily on when they were open and free. This was in response to a NY reddit user who asked just the right question to hit my “hey-maybe-*I*-can-do-something” buttons:

Does something like this exist? A chart? It seems like every museum has a day or two that it isn’t open and then one day that it’s open late (ideal for me) but they’re all different. Today, for example, I thought “I’d like to go to a museum but it’s going to be 5 soon and I have no idea if any are open late.” If somebody has an idea how this could be most logically put together, I wouldn’t mind doing it. I just can’t even imagine what form this would take other than some dry list or spreadsheet.

Well, I’m not much of a designer, but I like making stuff that uses simple color bars and graphics to represent data, ever since my boss made me attend an Edward Tufte lecture. I’m also a big fan of the special nights that museums have; a friend took me to the MOMA on one of the Target Free Fridays and I became a member afterward. I can’t count the times I’ve been since, or the number of friends I’ve brought in at the $5 member discount rate. Considering my tendency to sit around at home, I might never have gone without that first free night.

I got interview requests from writers at the Village Voice and the WSJ the day the map went up, so hopefully this chart gets out to the people who need one more reminder to check out all that’s great in this city.

The site’s a pretty lame technical feat; I looked at lists of museums from Wikipedia and Yelp and then hit up each website to fill out a spreadsheet, which I converted to a webpage that’s way too big a file for being mostly simple HTML. I guess I could’ve run a scraper on each site, but I wanted to acquaint myself with each place so I could get inspired to check out some new ones. The info-gathering was by far the most painful and time-consuming part (my humble explanation for why it took seven days to make a sloppy HTML page with a Google map on top).

It reminded me of the many restaurants that make you click through bouncy Flash graphics just to find their business hours. In the museums’ defense, their site-design M.O. is probably to wow people with images so that they won’t mind digging for the pertinent visitor and admission info. Still, it’s kind of annoying for those of us who just want to get down to some art-seeing business.

Now that I’ve got the basic info down, along with a lot of the museums’ social media links, the next step will be to…well, make this a real site built on a framework rather than a Ruby script that reads from a Google spreadsheet. Then, to make a newsfeed of exhibits and events and put everything in the standard hCard format. I’ll probably tackify the site up with photos I’ve taken, too. As someone who needs Google to find what direction I’m walking in, I’m always kind of reluctant to redo what the Great Indexers, including Wikipedia contributors, have already done. But then again, those broad informational frameworks don’t always show you enough specific details up front (such as the existence of free hours) to encourage you to go beyond the first search results. And since working on the Dollars for Docs project, I’ve learned there’s always a way to make already-easily-available information much more useful.

Check it out here.

dataist blog: An inspiring case for journalists learning to code

About a year ago I threw up a long, rambling guide hoping to teach non-programming journalists some practical code. Looking back at it, it seems inadequate. Actually, I misspoke: I haven’t looked back at it, because I’m sure I’d just spend the next few hours cringing. For example, what a dumb idea it was to put everything from “What is HTML” to actual Ruby scraping code into one gigantic, badly formatted post.

The series of articles has gotten a fair number of hits, but I don’t know how many people were able to stumble through it. Then last week I noticed a recent trackback from dataist, a new “blog about data exploration” by Finnish journo Jens Finnäs. He writes that he has “almost no prior programming experience” but, after going through my tutorials and checking out ScraperWiki, was able to produce this cool network graph of the Ratata blog network after about “two days of trial and error”:

Mapping of the Ratata blogging network, by Jens Finnäs


I hope other non-coders who are still intimidated by the thought of learning programming are inspired by Finnäs’s example. Becoming good at coding is not a trivial task. But even the first steps can teach a non-coder some profound lessons about data that are important on their own. And if you’re the curious type with a question you want to answer, you’ll soon figure out a way to put something together, as in Finnäs’s case.

ProPublica’s Dollars for Docs project originated in part from this Pfizer-scraping lesson I added on to my programming tutorial: I needed a timely example of public data that wasn’t as useful as it should be.

My colleagues Charles Ornstein and Tracy Weber may not be programmers (yet), but they are experienced enough with data to know its worth as an investigative resource, and turned an exercise in transparency into a focused and effective investigation. It’s not trivial to find a story in data. Besides being able to do Access queries themselves, C&T knew both the limitations of the data (for example, it’s difficult to make comparisons between the companies because of different reporting periods) and its possibilities, such as the cross-checking of names en masse from the payment lists with state and federal doctor databases.

Their investigation into the poor regulation of California nurses – a collaboration with the LA Times that was a Pulitzer finalist in the Public Service category – was similarly data-oriented. They (and the LA Times’ Maloy Moore and Doug Smith) had been diligently building a database of thousands of nurses – including their disciplinary records and the time it took for the nursing board to act – which made my part in building a site to graphically represent the data extremely simple.

The point of all this is: don’t put off your personal data-training because you think it requires a computer science degree, or because you think you have to become great at it for it to be useful. Even if, after a week of learning, you can barely put together a script to alphabetize your tweets, you’ll likely gain enough insight into how data is structured and made useful to aid just about every other aspect of your reporting repertoire.
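For what it’s worth, that “alphabetize your tweets” script really is a first-week exercise. A minimal Ruby sketch (the tweet text below is invented for illustration):

```ruby
# Sort a handful of tweets alphabetically, ignoring case.
# A trivial first-week script, but it already forces you to think
# of your text as a collection of records you can operate on.
tweets = [
  "Reading about Google Refine",
  "at NICAR, learning to scrape",
  "Museums with free hours are the best"
]

tweets.sort_by { |t| t.downcase }.each { |t| puts t }
```

Swap in a file of real tweets and you’ve written your first data-processing program.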

In fact, just knowing to avoid taking notes like this:

Colonel Mustard used the revolver in the library? (not library)
Miss Scarlet used the Candlestick in the dining room? (not Scarlet)
“Mrs. Peacock, in the dining room, with the revolver? “
“Colonel Mustard, rope, conservatory?”
Mustard? Dining room? Rope (nope)?
“Was it Mrs. Peacock with the candlestick, inside the dining room?”

And instead, recording them like this:

Who/What?      Role?      Ruled out?
Mustard        Suspect    N
Scarlet        Suspect    Y
Peacock        Suspect    N
Revolver       Weapon     Y
Candlestick    Weapon     Y
Rope           Weapon     Y
Conservatory   Place      Y
Dining Room    Place      N
Library        Place      Y

…will make you a significantly more effective reporter, and leave your reporting and research far more ready for thorough analysis and online projects.
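To make that concrete: once the notes are rows with consistent fields, questions like “what hasn’t been ruled out?” become one-liners. A Ruby sketch of the same table:

```ruby
# The structured Clue notes from above, one hash per row,
# mirroring the Who/What? / Role? / Ruled out? columns.
notes = [
  { name: "Mustard",      role: "Suspect", ruled_out: false },
  { name: "Scarlet",      role: "Suspect", ruled_out: true  },
  { name: "Peacock",      role: "Suspect", ruled_out: false },
  { name: "Revolver",     role: "Weapon",  ruled_out: true  },
  { name: "Candlestick",  role: "Weapon",  ruled_out: true  },
  { name: "Rope",         role: "Weapon",  ruled_out: true  },
  { name: "Conservatory", role: "Place",   ruled_out: true  },
  { name: "Dining Room",  role: "Place",   ruled_out: false },
  { name: "Library",      role: "Place",   ruled_out: true  }
]

# Everything still in play, in one line:
remaining = notes.reject { |n| n[:ruled_out] }.map { |n| n[:name] }
puts remaining.join(", ")
```

The messy freeform notes can’t be queried like that without a human re-reading every line.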

There’s a motherlode of programming resources available through a single Google search. My high school journalism teacher told us that if you want to do journalism, don’t major in it; just do it. I think the same can be said of programming. I’m glad I chose a computer field as an undergraduate, so I’m familiar with the theory. But if you have a career in reporting or research, you have real-world data needs that most undergrads don’t. Having those goals, and needing to accomplish them, has pushed my coding expertise far faster than any coursework did.

If you aren’t set on learning to program, but want to get a better grasp of data, I recommend learning:

  • Regular expressions – a set of character patterns, easily printable on a cheat-sheet for memorization, that you use in a text editor’s Find and Replace dialog to turn a chunk of text into something you can put into a spreadsheet, as well as to clean up the data entries themselves. The most complete resource I’ve found is here; a cheat-sheet can be found here, and Wikipedia has a list of some simple use cases.
  • Google Refine – A spreadsheet-like program that makes it easy to clean and normalize messy data. Ever gone through campaign contribution records and wished you could group together and count as one all the variations of “Jon J. Doe”, “Jonathan J. Doe”, “Jon Johnson Doe”, “JON J DOE”, etc.? Refine will do that. Refine developer David Huynh has an excellent screencast demonstrating Refine’s power, and I wrote a guide as part of the Dollars for Docs tutorials. Even if you know Excel like a pro – which I do not – Refine may make your data-life much more enjoyable.
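As a small taste of the regular-expressions idea above, here’s the same capture-and-restructure trick in a few lines of Ruby (the sample line and pattern are invented, not from any real museum site):

```ruby
# Pull structured fields out of a loose line of text with one
# regex, the same kind of pattern you'd type into an editor's
# Find and Replace dialog.
line = "MOMA -- open Fri 10:30am-8pm"
m = line.match(/\A(.+?) -- open (\w+) ([\d:apm\-]+)/)
# Tab-separated output, ready to paste into a spreadsheet.
puts [m[1], m[2], m[3]].join("\t") if m
```

Run over a few hundred such lines, that one pattern turns a wall of prose into three clean spreadsheet columns.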

If you want to learn coding from the ground up, here’s a short list of places to start:

Google Refine, a.k.a. Gridworks 2.0 released; ProPublica’s “Dollars for Docs” featured.

Good news for data nerds everywhere. The 2.0 version of Google’s fantastic data-cleaning tool, Google Refine (formerly Gridworks), has been released. And they were nice enough to feature ProPublica’s Dollars for Docs as an example use case. I talked briefly about how I used Refine to put together the pharma top-earners list.

It’s possible I could’ve done it using SQL queries and Ruby libraries. But I definitely would’ve missed a lot of matches, and probably overdosed on over-the-counter pharma-painkillers.
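To show what hand-rolling that matching might have looked like: here’s a crude Ruby sketch in the spirit of Refine’s “fingerprint” clustering (this is my own toy version, not Refine’s actual code, and the name variants are just illustrations):

```ruby
# Collapse name variants to a rough fingerprint key: lowercase,
# strip punctuation, then sort the unique words. Variants of the
# same name end up with the same key and thus the same cluster.
def fingerprint(name)
  name.downcase.gsub(/[^a-z ]/, "").split.uniq.sort.join(" ")
end

variants = ["Jon J. Doe", "JON J DOE", "Doe, Jon J."]
keys = variants.map { |v| fingerprint(v) }
puts keys.uniq.size  # all three variants collapse to one cluster
```

Refine does this (and several smarter methods) interactively, which is why I reached for it instead.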

The Big Pharma-Dollars-for-Doctors Database, at ProPublica

Haven’t had much time to blog, or eat, or sleep in the past few months because of this project, but the first part just rolled out today (at about 2am, actually): at ProPublica, my colleagues and I collected the past two years of reports (albeit from just 7 companies) disclosing what drug companies pay doctors to speak on their behalf. I still have a few posts and articles to write about the undertaking and its background, but it’s the first time that anyone has compiled all these reports and made them available to the public, something that will be mandated by law in 2013.

Our first investigation related to the data looked at how some of the companies’ top earners, who are ostensibly supposed to be experts in their fields, had either shady or slim expertise. I did most of the data work, including collecting and managing the data, polling the various state websites to look up physician disciplinary records, and designing and coding the website (with the help of my genius coder co-worker Jeff Larson). Whew!

Check it out.

Eye Heart New York

Bubble Dealer

A Bubble Dealer, on Spring St. and Broadway, shortly before the market crash of 2008

UPDATE: It is now a Tumblr.

It’s such a nice day out, I think I’ll make yet another blog about New York…

I’ve been looking for an area to test out HTML5, some other WP themes, and to shuffle all my New York-centric BS. I don’t know if I’ll ever complete this site about New York but at least I can get it indexed now.

Trying out Inuit Types, a half-magazine, half-blog format. Seems nice, though the image handling isn’t as flexible as I’d like. Or intuitive, at least.

Marina Abramović’s Top 50 Time Hogs (women sit around a lot)

OK, now it’s time to arrange the participants in the MOMA’s “Marina Abramović: The Artist Is Present” Flickr set by the number of minutes each person stared at Abramović. The Paco dude who went about a dozen times is apparently the only person to have stayed the whole day. It’s interesting to read the comments on the portraits of the long-suffering sitters; some people are understandably pissed to have been stuck in line behind them.
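The ranking itself is the easy part of a script like mine; in Ruby it’s basically this (the sitter records below are invented, since the real numbers came from the Flickr set):

```ruby
# Rank sitters by total minutes sat, descending. The same sort
# my script ran over the Flickr data; these records are made up.
sitters = [
  { name: "Sitter A", minutes: 75  },
  { name: "Sitter B", minutes: 421 },
  { name: "Sitter C", minutes: 12  }
]

ranked = sitters.sort_by { |s| -s[:minutes] }
ranked.each { |s| puts "#{s[:name]}: #{s[:minutes]} min" }
```

The real work was extracting the minutes from the photo captions in the first place.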

Surprisingly, women made up the vast majority of the top 50 sitters: 37, by my quick visual count. Just a statistical fluke? Does the MOMA have a higher base of female visitors? Did women identify more with the female artist?

(One of Marina’s photos is mistakenly labeled, which is why my script placed her in this list…too lazy to fix right now)

See my list of the top 200 most popular portraits from Marina’s exhibit.

Photos by Marco Anelli. © 2010 Marina Abramović