Category Archives: works

actual works, projects

I’ve always been interested in exploring the various online Congressional information sources, and the recent SOPA debate seemed like a good time to put some effort into it. Also, I’ve always wanted to try out the excellent Isotope JavaScript library.

I had been passively paying attention to the debate and was surprised at how hard it was to find a list of supporters and opponents, given how much it’s dominated my (admittedly small and bubble-ish) internet communities.

When I set out to compile the list, though, I could see why: the official government sites don’t make it easy to find or interpret the information. So SOPAopera is my game attempt at putting together some basic information about it. The feedback I’ve gotten so far indicates that even constituents who have been reading a lot about SOPA/PROTECT-IP are surprised at the level and diversity of support the bills have among Congressmembers.

Crossing Bleecker and Lafayette through a snowstorm

Back when I wrote my “Coding for Journalists 101” guide about a year and a half ago, I barely realized how useful code could be as a journalistic tool. Since then, after the Dollars for Docs project at ProPublica and various other programming adventures, I’ve become a slightly better coder and even more adamant that programming is basically a necessity for anyone who cares about understanding and communicating about the world in a quantitative, meaningful way.

The world of data has exploded in the past few years without a corresponding increase in the people or tools to efficiently make sense of it. And so I’ve had a hankering to create a more cohesive, useful programming guide aimed not just at journalists, but at anyone in any field.

It’s called the Bastards Book of Ruby. It’s not really just about Ruby and “bastards” was a working title that I came up with but never got around to changing. But it seems to work for now.

As I was writing the introduction (“Programming is for Anyone”), I came across this Steve Jobs interview with Fresh Air. He says pretty much exactly what I’m thinking, but he said it 15 years ago, which is surprising given that the Web was in its infancy and Jobs’s fame came largely from making computers brain-dead simple for people. He wasn’t much of a programmer, but he really was a genius at understanding the bigger picture of what he himself only dabbled in:

“In my perspective … science and computer science is a liberal art, it’s something everyone should know how to use, at least, and harness in their life. It’s not something that should be relegated to 5 percent of the population over in the corner. It’s something that everybody should be exposed to and everyone should have mastery of to some extent, and that’s how we viewed computation and these computation devices.”

The Bastards Book of Ruby is just a rough draft but already runs to 75,000 words. See the table of contents.

UPDATE 1:30PM: New NOAA numbers project REDUCED probabilities, table updated:

According to raw data from the National Hurricane Center, the probability that NYC will suffer sustained high winds has increased significantly.

I had yesterday's numbers saved in my web cache. Here they are compared with this morning's numbers (reports 26 and 28 respectively):

City  KT  SAT 0200-1400      SAT 1400-SUN 0200  SUN 0200-1400      SUN 1400-MON 0200  MON 0200-TUE 0200  TUE 0200-WED 0200  WED 0600-THU 0600
Old proj (#26):
NYC   34  1( 1)              23(24)             44(68)             1(69)              X(69)              X(69)
NYC   50  X( X)              2( 2)              27(29)             X(29)              X(29)              X(29)
NYC   64  X( X)              X( X)              5( 5)              X( 5)              X( 5)              X( 5)
New proj (#28):
NYC   34  1                  35(36)             47(83)             X(83)              X(83)              X(83)              X(83)
NYC   50  X                  3( 3)              41(44)             X(44)              X(44)              X(44)              X(44)
NYC   64  X                  X( X)              10(10)             X(10)              X(10)              X(10)              X(10)
NEWER proj (#29):
NYC   34  10                 59(69)             5(74)              X(74)              X(74)              X(74)              X(74)
NYC   50  X                  30(30)             3(33)              X(33)              X(33)              X(33)              X(33)
NYC   64  X                  5( 5)              1( 6)              X( 6)              X( 6)              X( 6)              X( 6)


The KT values are sustained wind speeds (1-minute average) in knots. They translate to:

34 KT = 39 mph
50 KT = 58 mph
64 KT = 74 mph

The number in parentheses is the projected cumulative chance that NYC experiences those wind speeds by the end of that time period. The number outside the parentheses is the chance that those wind speeds will begin during the given time period.
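The individual/cumulative notation can be pulled apart mechanically. The following is only a rough Python sketch, not the web app mentioned in this post: the function names are made up for illustration, "X" (less than 1 percent) is treated as 0, and the first period, which carries no parenthesised value in the bulletin, is assumed to have a cumulative value equal to its individual one.

```python
import re

KNOTS_TO_MPH = 1.15078  # one knot is roughly 1.15 statute miles per hour

# A cell is either a bare individual probability ("10", "X") or
# individual(cumulative), for example "59(69)" or "X( X)".
CELL = re.compile(r"(X|\d+)(?:\(\s*(X|\d+)\))?")


def to_percent(token):
    """Assumption for this sketch: 'X' (under 1 percent) counts as 0."""
    return 0 if token == "X" else int(token)


def parse_location_line(line):
    """Split a bulletin row such as 'NEW YORK CITY  34 10  59(69) ...'."""
    match = re.match(r"^(.*?)\s+(34|50|64)\s+(.*)$", line.rstrip())
    if match is None:
        return None  # not a location row (header, footer, blank line)
    name, knots, rest = match.groups()
    periods = []
    for individual, cumulative in CELL.findall(rest):
        periods.append({
            "individual": to_percent(individual),
            # the first period has no parenthesised cumulative value
            "cumulative": to_percent(cumulative or individual),
        })
    return {
        "location": name.strip(),
        "knots": int(knots),
        "mph": round(int(knots) * KNOTS_TO_MPH),
        "periods": periods,
    }


row = parse_location_line(
    "NEW YORK CITY  34 10  59(69)   5(74)   X(74)   X(74)   X(74)   X(74)"
)
print(row["location"], row["mph"], "mph, cumulative:", row["periods"][-1]["cumulative"])
```

Running each line of the raw bulletin through a function like this and skipping the `None` results would give one record per city and wind threshold.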

How bad are those wind speeds for New York? Nate Silver of the New York Times has a great article and chart showing the projected damage. Summary: It's not good, at all:

Nate Silver Hurricane Irene damage chart


The NYTimes is keeping a good up-to-date blog of the latest Irene news.

Here's the current NOAA raw data for all the cities (next time around, I'll just make a web app to translate this mess):


000
FONT14 KNHC 271449
PWSAT4

HURRICANE IRENE WIND SPEED PROBABILITIES NUMBER  29                 
NWS NATIONAL HURRICANE CENTER MIAMI FL       AL092011               
1500 UTC SAT AUG 27 2011                                            

AT 1500Z THE CENTER OF HURRICANE IRENE WAS LOCATED NEAR LATITUDE    
35.2 NORTH...LONGITUDE 76.4 WEST WITH MAXIMUM SUSTAINED WINDS NEAR  
75 KTS...85 MPH...140 KM/H.                                         

Z INDICATES COORDINATED UNIVERSAL TIME (GREENWICH)                  
   ATLANTIC STANDARD TIME (AST)...SUBTRACT 4 HOURS FROM Z TIME      
   EASTERN  DAYLIGHT TIME (EDT)...SUBTRACT 4 HOURS FROM Z TIME      
   CENTRAL  DAYLIGHT TIME (CDT)...SUBTRACT 5 HOURS FROM Z TIME      


I.  MAXIMUM WIND SPEED (INTENSITY) PROBABILITY TABLE                

CHANCES THAT THE MAXIMUM SUSTAINED (1-MINUTE AVERAGE) WIND SPEED OF 
THE TROPICAL CYCLONE WILL BE WITHIN ANY OF THE FOLLOWING CATEGORIES 
AT EACH OFFICIAL FORECAST TIME DURING THE NEXT 5 DAYS.              
PROBABILITIES ARE GIVEN IN PERCENT.  X INDICATES PROBABILITIES LESS 
THAN 1 PERCENT.                                                     


      - - - MAXIMUM WIND SPEED (INTENSITY) PROBABILITIES - - -      

VALID TIME   00Z SUN 12Z SUN 00Z MON 12Z MON 12Z TUE 12Z WED 12Z THU
FORECAST HOUR   12      24      36      48      72      96     120  
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
DISSIPATED       X       4       4      10      25      30      31
TROP DEPRESSION  3      19       7      26      31      29      28
TROPICAL STORM  41      56      65      53      41      38      38
HURRICANE       56      21      24      12       3       4       3
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
HUR CAT 1       52      18      21      10       3       3       3
HUR CAT 2        4       2       3       2       X       X       X
HUR CAT 3        1       1       X       X       X       X       X
HUR CAT 4        X       X       X       X       X       X       X
HUR CAT 5        X       X       X       X       X       X       X
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
FCST MAX WIND   70KT    65KT    60KT    45KT    40KT    35KT    35KT


II. WIND SPEED PROBABILITY TABLE FOR SPECIFIC LOCATIONS             

CHANCES OF SUSTAINED (1-MINUTE AVERAGE) WIND SPEEDS OF AT LEAST     
   ...34 KT (39 MPH... 63 KPH)...                                   
   ...50 KT (58 MPH... 93 KPH)...                                   
   ...64 KT (74 MPH...119 KPH)...                                   
FOR LOCATIONS AND TIME PERIODS DURING THE NEXT 5 DAYS               

PROBABILITIES FOR LOCATIONS ARE GIVEN AS IP(CP) WHERE               
    IP  IS THE PROBABILITY OF THE EVENT BEGINNING DURING            
        AN INDIVIDUAL TIME PERIOD (INDIVIDUAL PROBABILITY)          
   (CP) IS THE PROBABILITY OF THE EVENT OCCURRING BETWEEN           
        12Z SAT AND THE FORECAST HOUR (CUMULATIVE PROBABILITY)      

PROBABILITIES ARE GIVEN IN PERCENT                                  
X INDICATES PROBABILITIES LESS THAN 1 PERCENT                       
PROBABILITIES FOR 34 KT AND 50 KT ARE SHOWN AT A GIVEN LOCATION WHEN
THE 5-DAY CUMULATIVE PROBABILITY IS AT LEAST 3 PERCENT.             
PROBABILITIES FOR 64 KT ARE SHOWN WHEN THE 5-DAY CUMULATIVE         
PROBABILITY IS AT LEAST 1 PERCENT.                                  


  - - - - WIND SPEED PROBABILITIES FOR SELECTED  LOCATIONS - - - -  

               FROM    FROM    FROM    FROM    FROM    FROM    FROM 
  TIME       12Z SAT 00Z SUN 12Z SUN 00Z MON 12Z MON 12Z TUE 12Z WED
PERIODS         TO      TO      TO      TO      TO      TO      TO  
             00Z SUN 12Z SUN 00Z MON 12Z MON 12Z TUE 12Z WED 12Z THU

FORECAST HOUR    (12)   (24)    (36)    (48)    (72)    (96)   (120)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
LOCATION       KT                                                   

BURGEO NFLD    34  X   X( X)   X( X)   X( X)   6( 6)   X( 6)   X( 6)

PTX BASQUES    34  X   X( X)   X( X)   2( 2)   8(10)   X(10)   X(10)

EDDY POINT NS  34  X   X( X)   X( X)   4( 4)   1( 5)   X( 5)   X( 5)

SYDNEY NS      34  X   X( X)   X( X)   2( 2)   3( 5)   X( 5)   X( 5)

HALIFAX NS     34  X   X( X)   1( 1)   8( 9)   X( 9)   X( 9)   X( 9)

YARMOUTH NS    34  X   X( X)  16(16)   6(22)   X(22)   X(22)   X(22)

MONCTON NB     34  X   X( X)   3( 3)  20(23)   1(24)   X(24)   X(24)

ST JOHN NB     34  X   X( X)  12(12)  18(30)   X(30)   X(30)   X(30)
ST JOHN NB     50  X   X( X)   X( X)   3( 3)   X( 3)   X( 3)   X( 3)

EASTPORT ME    34  X   X( X)  22(22)  16(38)   X(38)   X(38)   X(38)
EASTPORT ME    50  X   X( X)   1( 1)   4( 5)   X( 5)   X( 5)   X( 5)

BAR HARBOR ME  34  X   X( X)  41(41)  12(53)   X(53)   X(53)   X(53)
BAR HARBOR ME  50  X   X( X)   6( 6)   6(12)   X(12)   X(12)   X(12)
BAR HARBOR ME  64  X   X( X)   1( 1)   1( 2)   X( 2)   X( 2)   X( 2)

AUGUSTA ME     34  X   1( 1)  62(63)   7(70)   X(70)   X(70)   X(70)
AUGUSTA ME     50  X   X( X)  18(18)   6(24)   X(24)   X(24)   X(24)
AUGUSTA ME     64  X   X( X)   3( 3)   1( 4)   X( 4)   X( 4)   X( 4)

PORTLAND ME    34  X   5( 5)  67(72)   2(74)   X(74)   X(74)   X(74)
PORTLAND ME    50  X   X( X)  26(26)   2(28)   X(28)   X(28)   X(28)
PORTLAND ME    64  X   X( X)   5( 5)   X( 5)   X( 5)   X( 5)   X( 5)

CONCORD NH     34  X   9( 9)  68(77)   1(78)   X(78)   X(78)   X(78)
CONCORD NH     50  X   X( X)  37(37)   X(37)   X(37)   X(37)   X(37)
CONCORD NH     64  X   X( X)   7( 7)   X( 7)   X( 7)   X( 7)   X( 7)

BOSTON MA      34  X  18(18)  54(72)   X(72)   X(72)   X(72)   X(72)
BOSTON MA      50  X   X( X)  29(29)   X(29)   X(29)   X(29)   X(29)
BOSTON MA      64  X   X( X)   5( 5)   X( 5)   X( 5)   X( 5)   X( 5)

HYANNIS MA     34  X  19(19)  34(53)   X(53)   X(53)   X(53)   X(53)
HYANNIS MA     50  X   X( X)  12(12)   X(12)   X(12)   X(12)   X(12)
HYANNIS MA     64  X   X( X)   1( 1)   X( 1)   X( 1)   X( 1)   X( 1)

NANTUCKET MA   34  X  20(20)  26(46)   X(46)   X(46)   X(46)   X(46)
NANTUCKET MA   50  X   1( 1)   6( 7)   X( 7)   X( 7)   X( 7)   X( 7)
NANTUCKET MA   64  X   X( X)   1( 1)   X( 1)   X( 1)   X( 1)   X( 1)

PROVIDENCE RI  34  X  30(30)  39(69)   1(70)   X(70)   X(70)   X(70)
PROVIDENCE RI  50  X   2( 2)  28(30)   X(30)   X(30)   X(30)   X(30)
PROVIDENCE RI  64  X   X( X)   6( 6)   X( 6)   X( 6)   X( 6)   X( 6)

HARTFORD CT    34  2  39(41)  34(75)   X(75)   X(75)   X(75)   X(75)
HARTFORD CT    50  X   6( 6)  29(35)   X(35)   X(35)   X(35)   X(35)
HARTFORD CT    64  X   X( X)   6( 6)   X( 6)   X( 6)   X( 6)   X( 6)

MONTAUK POINT  34  4  42(46)  23(69)   X(69)   X(69)   X(69)   X(69)
MONTAUK POINT  50  X  11(11)  23(34)   X(34)   X(34)   X(34)   X(34)
MONTAUK POINT  64  X   1( 1)   6( 7)   X( 7)   X( 7)   X( 7)   X( 7)

NEW YORK CITY  34 10  59(69)   5(74)   X(74)   X(74)   X(74)   X(74)
NEW YORK CITY  50  X  30(30)   3(33)   X(33)   X(33)   X(33)   X(33)
NEW YORK CITY  64  X   5( 5)   1( 6)   X( 6)   X( 6)   X( 6)   X( 6)

NEWARK NJ      34  9  53(62)   5(67)   X(67)   X(67)   X(67)   X(67)
NEWARK NJ      50  X  21(21)   2(23)   X(23)   X(23)   X(23)   X(23)
NEWARK NJ      64  X   3( 3)   1( 4)   X( 4)   X( 4)   X( 4)   X( 4)

TRENTON NJ     34 15  45(60)   2(62)   X(62)   X(62)   X(62)   X(62)
TRENTON NJ     50  X  16(16)   X(16)   X(16)   X(16)   X(16)   X(16)
TRENTON NJ     64  X   2( 2)   X( 2)   X( 2)   X( 2)   X( 2)   X( 2)

ATLANTIC CITY  34 44  38(82)   X(82)   X(82)   X(82)   X(82)   X(82)
ATLANTIC CITY  50  1  42(43)   X(43)   X(43)   X(43)   X(43)   X(43)
ATLANTIC CITY  64  X   7( 7)   X( 7)   X( 7)   X( 7)   X( 7)   X( 7)

BALTIMORE MD   34 26   9(35)   X(35)   X(35)   X(35)   X(35)   X(35)

DOVER DE       34 54  20(74)   1(75)   X(75)   X(75)   X(75)   X(75)
DOVER DE       50  2  20(22)   X(22)   X(22)   X(22)   X(22)   X(22)
DOVER DE       64  X   2( 2)   X( 2)   X( 2)   X( 2)   X( 2)   X( 2)

ANNAPOLIS MD   34 35  10(45)   1(46)   X(46)   X(46)   X(46)   X(46)

WASHINGTON DC  34 26   7(33)   X(33)   X(33)   X(33)   X(33)   X(33)

OCEAN CITY MD  34 83   9(92)   X(92)   X(92)   X(92)   X(92)   X(92)
OCEAN CITY MD  50 43  26(69)   X(69)   X(69)   X(69)   X(69)   X(69)
OCEAN CITY MD  64  5   9(14)   X(14)   X(14)   X(14)   X(14)   X(14)

RICHMOND VA    34 57   1(58)   X(58)   X(58)   X(58)   X(58)   X(58)

NORFOLK NAS    34 99   X(99)   X(99)   X(99)   X(99)   X(99)   X(99)
NORFOLK NAS    50 71   X(71)   X(71)   X(71)   X(71)   X(71)   X(71)
NORFOLK NAS    64  6   X( 6)   X( 6)   X( 6)   X( 6)   X( 6)   X( 6)

NORFOLK VA     34 99   X(99)   X(99)   X(99)   X(99)   X(99)   X(99)
NORFOLK VA     50 84   X(84)   X(84)   X(84)   X(84)   X(84)   X(84)
NORFOLK VA     64 10   X(10)   X(10)   X(10)   X(10)   X(10)   X(10)

GREENSBORO NC  34  4   X( 4)   X( 4)   X( 4)   X( 4)   X( 4)   X( 4)

RALEIGH NC     34 12   1(13)   X(13)   X(13)   X(13)   X(13)   X(13)

CAPE HATTERAS  34 99   X(99)   X(99)   X(99)   X(99)   X(99)   X(99)
CAPE HATTERAS  50 99   X(99)   X(99)   X(99)   X(99)   X(99)   X(99)
CAPE HATTERAS  64 99   X(99)   X(99)   X(99)   X(99)   X(99)   X(99)

CHARLOTTE NC   34  3   X( 3)   X( 3)   X( 3)   X( 3)   X( 3)   X( 3)

MOREHEAD CITY  34 99   X(99)   X(99)   X(99)   X(99)   X(99)   X(99)
MOREHEAD CITY  50 99   X(99)   X(99)   X(99)   X(99)   X(99)   X(99)
MOREHEAD CITY  64 14   X(14)   X(14)   X(14)   X(14)   X(14)   X(14)

WILMINGTON NC  34 99   X(99)   X(99)   X(99)   X(99)   X(99)   X(99)
WILMINGTON NC  50 99   X(99)   X(99)   X(99)   X(99)   X(99)   X(99)

MYRTLE BEACH   34  3   X( 3)   X( 3)   X( 3)   X( 3)   X( 3)   X( 3)

$$                                                                  
FORECASTER BROWN                                                    


	

After trying too hard to rewrite my really old Flash gallery as a jQuery plugin, I thought “to hell with it” and decided to join the one-pager trend: http://photos.danwin.com. I have to say, this was one of the more pleasant site-designing jobs I’ve done in a while. I’m going to try to limit my sites to one-page or fewer from here on out.

photos.danwin.com


I started with an HTML5 template from initializr.com and then tacked on the 1140 CSS grid sheet, a fluid framework.

As far as JavaScript goes, besides jQuery, I’m using Ben Alman’s throttle-debounce plugin, Leandro Vieira’s lightbox plugin, and Ariel Flesler’s scrollTo plugin for the simple interaction bits.

It’s pretty rudimentary in terms of code sophistication…I haven’t yet decided how to lazy-load the images while still providing a full page for non-JS users. I think I’ll end up tacking on backbone.js and figuring out a JSON structure to load in the “galleries”. So, for now, deal with loading some 100+ images all at once from S3…

To me, it’s an improvement over the typical slideshow galleries in which only one image at a time is shown. Maybe it’s because I don’t have enough Big Picture show-stoppers to justify displaying every photo as full-screen. But I think there’s some artistic room in manually arranging the images as a collage and purposefully deciding the size of each image in relation to the others.

The best part is that with the 1140 grid system, not only was designing for variable-width desktop browsers (and placing the images) a breeze…the site works very well on the iPad and passably well on the iPhone…and I barely even left Google Chrome on my Mac during the whole development process.

Now I just have to get some better photos. And maybe think about the typography a little more…Meanwhile, check it out:

Obama 1, Osama 0

This (totally not-double-checked) analysis is a riff on the excellent New York Times visualization (The Death of a Terrorist: A Turning Point?) of how people reacted to Osama bin Laden’s death. In the days following the news, the Times asked online readers not only to write their thoughts on bin Laden’s killing, but also to put a mark on a scatterplot graph that best described their reaction.

The Times used the data to show the continuum of reactions from everyone who participated. I wanted to see how reactions differed across geographical location and gender.

The Times collected about 13,000 reactions before closing it down. Besides the nature and content of reaction, users had the choice of leaving their names and geographical areas.

I used Google Refine to quickly sort out the geographic locations (which varied from zip codes, to city/state, to neighborhoods, such as “Upper East Side”). Gender was not a checkbox in the NYT’s form, so I used Refine to sort based on first names. More details in the methodology section.

Conclusion

The conclusion my totally-unscientific analysis came to: Among all NYT website users, there was general moral approval and optimism for killing bin Laden. This did not vary significantly among U.S. citizens, whether they were from the cities attacked on Sept. 11 or elsewhere.

However, non-U.S. NYT-website-users were less supportive of the action. This gap of moral approval also exists between male and female NYT-website-users and at roughly the same magnitude (about 10 points).

There wasn’t much variation in terms of how significant NYT-website-users believed OBL’s death would be. All demographic groups averaged about 60 (out of 100) in terms of how significant they rated OBL’s death in the war on terror.

In case you’re wondering: the 260 non-U.S. female respondents averaged a 43 in positivity, which is a whole step below the average female response. U.S. females (2,270 of them) averaged a 52, compared to the 6,059 U.S. males, who averaged a 65.

Data

I’ll just get right to the results tables.

The original graph was arranged so that its x-axis represented how positive users felt about OBL’s death and the y-axis represented how significant an impact they thought it would have on the war on terror.

So, someone who thought that OBL’s demise was very good news and would have a strong impact on the war would be in the top right quadrant. Those who thought it was a bad deed, and would amount to nothing, would be in the bottom left. In the scatterplot, darker points correspond to more users with the same type of reaction.

I have two sections of tables. The first section consists of the basic numbers: The count of users, the average positivity rating (from 0 to 100) and the average significance rating.

The second section consists of visualizations. The first is a scatterplot similar to the NYT’s original graphic, with less granularity. The second and third plot positivity and significance ratings, respectively, on the x-axis, with the y-axis showing the relative popularity of each rating.

The most interesting graph is the female respondents’: it was the only one in which the most-positive rating did not garner the most respondents. It appears that the most popular choice was on-the-fence.

 

Group             Number   Average Positivity   Average Significance
All               13864    60.23                61.04
Males             7067     64.01                62.07
Females           2580     51.81                60.08
U.S.              11537    61.28                61.45
Outside U.S.      1820     53.80                59.06
U.S. non-NYC/DC   9191     61.28                61.28
NYC               1978     61.15                62.18
Washington DC     368      62.07                61.74

Graphs

A quick note: I was not as adept as the NYT at making my scatterplot more discrete and readable. The darkness of each pixel is relative to the highest respondent count in that particular group. So, the female scatterplot looks to be denser than the others, when what probably happened was that the responses were more evenly spread out.
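To make that scaling concrete, here is a small Python sketch of per-group normalization. It is not the RMagick code used for the actual graphs: the 10-point cell size and the function name are assumptions for illustration only.

```python
from collections import Counter


def darkness_grid(points, cell_size=10):
    """Bin (positivity, significance) pairs on a 0-100 scale, then scale
    each cell's count by the busiest cell within this one group."""
    last_cell = 100 // cell_size - 1
    counts = Counter()
    for positivity, significance in points:
        # a rating of exactly 100 is clamped into the last cell
        col = min(int(positivity) // cell_size, last_cell)
        row = min(int(significance) // cell_size, last_cell)
        counts[(col, row)] += 1
    if not counts:
        return {}
    peak = max(counts.values())
    return {cell: count / peak for cell, count in counts.items()}


# A group clustered at one extreme: only the peak cell is fully dark.
clustered = darkness_grid([(95, 90), (99, 95), (50, 50)])
# A group spread evenly: every occupied cell ties for the peak,
# so all of them are drawn at full darkness.
spread = darkness_grid([(5, 5), (55, 55), (95, 95)])
print(clustered)
print(spread)
```

Because each group is divided by its own peak, an evenly spread group renders as uniformly dark even when it has far fewer respondents, which is the effect described for the female scatterplot.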

[Graphs for each group (All, Males, Females, U.S., Outside U.S., U.S. non-NYC/DC, NYC, Washington DC): scatterplot, distribution of positivity, distribution of significance]

Caveats

In my summary of conclusions, I was careful to say “NYT-website-users.” The NYT reactions graph is not a random sampling of the population, or even of the NYT’s audience. It is a feature accessible only to web users, which – if the Internet is still stereotypically male-dominated – might account for the high male-to-female ratio.

The reactions feature was a passive one, in that the onus was on the readers to actually interact with the graphic and fill out a form. So this would seem to filter out most of the apathetic – or busy – crowd. Moreover, the NYT team removed any comments that were off-topic, trolling, or strongly inappropriate…so anyone who is driven to cuss when the topic is bin Laden has probably been filtered out.

I also think the nature of the graphic, having users pick out a point out of 10,000 (or so), might naturally have them gravitate towards the axes and midpoints. For example, someone might verbalize their reaction as “Meh, neither happy nor sad” and pick the exact midpoint, when they’re really more of a 4 or 6. Or, someone who is really happy that bin Laden is dead automatically goes for the farthest right spot because anything less than the highest positivity scale would indicate some kind of partial sympathy for bin Laden. Each scatterplot graph reflects this, with the darker spots collecting around the extremes.

And if you want to be part of the “NYT’s a bunch of liberal-brie-eaters” crowd, then it’s possible that the entire respondent base is slanted leftwards politically. I thought it would be interesting to see if results varied by red and blue states, but I think that a red-state fan of the NYT is probably not much different than a blue-state fan. And, it would’ve taken way more time to sort out by state.

So with that said, this survey is not at all an accurate reflection of the general population, the way a general poll would be. Still, it’s interesting to see that even within this select sample group, there is a large disparity between males and females, and between U.S. and non-U.S. respondents. But again, we can’t really make any sweeping generalizations, such as “Women are less positive about killing” or “Foreigners are against American unilateral raids,” without prefacing them with “Women who use the New York Times’ website and who are opinionated enough to participate in their interactive graphic are…”

Methodology

I used Google Refine to quickly cluster around geographic locations and first names. To decide whether a user was in the U.S. or not, I used regular expressions to quickly find all the location entries with postal or AP-style state abbreviations. To filter for NYC users, I used regular expressions that looked for “NY” and rejected any that specifically stated a non-NYC city, such as Poughkeepsie. And I also just did a search for all well-known NYC neighborhoods. Finding DC was mostly just looking for “DC.”
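For readers who want to see roughly what that kind of filtering looks like outside of Refine, here is a hedged Python sketch. The patterns, the short AP-style list, the reject list, and the neighborhood list are all illustrative stand-ins, not the expressions actually used for this analysis.

```python
import re

# Two-letter postal codes, plus a small sample of AP-style abbreviations.
POSTAL = ("AL AK AZ AR CA CO CT DE DC FL GA HI ID IL IN IA KS KY LA ME MD MA "
          "MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX "
          "UT VT VA WA WV WI WY").split()
AP_STYLE = ["Calif.", "N.Y.", "Mass.", "Fla.", "Ill."]  # sample only

STATE_RE = re.compile(
    r"(?:^|[,\s])(?:" + "|".join(re.escape(s) for s in POSTAL + AP_STYLE) + r")\s*$"
)
ZIP_RE = re.compile(r"^\d{5}(?:-\d{4})?$")
NON_NYC_CITIES = {"poughkeepsie", "albany", "buffalo"}  # hypothetical reject list
NYC_NEIGHBORHOODS = {"upper east side", "harlem", "park slope"}  # sample only


def classify_location(text):
    """Return 'NYC', 'DC', 'US', or 'other' for a free-text location."""
    cleaned = text.strip()
    lowered = cleaned.lower()
    if lowered in NYC_NEIGHBORHOODS:
        return "NYC"
    if re.search(r"\bD\.?C\.?$", cleaned):
        return "DC"
    if re.search(r"\b(?:NY|N\.Y\.)$", cleaned):
        # "NY" counts as NYC unless a non-NYC city is named explicitly
        city = lowered.split(",")[0].strip()
        return "US" if city in NON_NYC_CITIES else "NYC"
    if ZIP_RE.match(cleaned) or STATE_RE.search(cleaned):
        return "US"
    return "other"


for place in ["Upper East Side", "Poughkeepsie, NY", "Austin, TX", "London, UK"]:
    print(place, "->", classify_location(place))
```

As in the write-up, anything ending in “NY” is assumed to be in the city unless a known upstate city is named, so the quality of the result depends heavily on how complete the reject list is.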

Gender was a little bit trickier. I found the easiest way was to Google for a list of the most common male and female names and do a large regular expression to filter for them. I rejected names that could belong to either gender, such as “Pat” or “Kim”. And for names that I wasn’t sure of, I just didn’t include them in the sample, so this means foreign and rare names weren’t part of the mix.

For both geography and names, I ended up rejecting most values that didn’t have a count of at least 2 or 3. So the upshot is, people with common names, like “John”, are more represented than those with relatively uncommon names, like “Leopold.”
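The name-matching and minimum-count steps can be sketched in a few lines of Python. This version swaps the one large regular expression for plain set lookups, and the name lists here are tiny samples chosen for illustration, not the lists used for the analysis.

```python
import re
from collections import Counter

MALE_NAMES = {"john", "michael", "david", "james", "robert"}    # sample only
FEMALE_NAMES = {"mary", "jennifer", "linda", "susan", "sarah"}  # sample only
AMBIGUOUS = {"pat", "kim"}  # names the write-up says were rejected


def first_name(full_name):
    """Take the leading run of letters as the first name, lowercased."""
    match = re.match(r"\s*([A-Za-z]+)", full_name)
    return match.group(1).lower() if match else None


def guess_gender(full_name):
    """Return 'M', 'F', or None when the name is ambiguous or unknown."""
    name = first_name(full_name)
    if name is None or name in AMBIGUOUS:
        return None
    if name in MALE_NAMES:
        return "M"
    if name in FEMALE_NAMES:
        return "F"
    return None  # rare or unfamiliar names stay out of the sample


def common_values(values, minimum=2):
    """Keep only values that occur at least `minimum` times."""
    counts = Counter(values)
    return {value for value, count in counts.items() if count >= minimum}


names = ["John Smith", "john q.", "Leopold B.", "Kim Lee", "Mary Ann"]
print([guess_gender(n) for n in names])
print(common_values(first_name(n) for n in names))
```

The `common_values` step shows why “John” survives and “Leopold” does not: a name that appears only once falls below the cutoff before any gender lookup happens.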

I used RMagick to generate the scatterplots and Google Image Charts API for the bar graphs.

I’ve said it before and I’ll say it again: for geeky data analysis, Google Refine is a godsend.

A sidenote: The Jessica Dovey quote, misattributed to Martin Luther King Jr., “I will mourn the loss of thousands of precious lives, but I will not rejoice in the death of one, not even an enemy,” made an appearance 42 times in the NYT response matrix.

So I’ve finally finished my update to my iheartnymuseums.com listing…http://agogh.com…though it’s not quite finished. But it’s good enough for now for people to get some use out of it.

Same idea as before: an easy-to-read list of cultural venues in the city. But I’ve added profile pages for all the venues and a sampling of exhibition listings. After viewing more than 200 homepages, I’m even more convinced that it’s a huge pain to just serendipitously find what’s going on and when (other than at the most popular, obvious attractions) because of how different each place’s web presence is.

This site is an attempt to make it all a little more uniform, whether you want to see the latest exhibits in the city or what’s free today. Let me know what you think.

Just came back from an inspiring week at the National Institute for Computer-Assisted Reporting in Raleigh, NC. Of all the journalism conferences I’ve been to, this one had the most to learn from and the most attendees excited to learn. There was real discussion about news apps being their own form of story-telling and art and not just uploading a bunch of numbers as HTML.

Chrys Wu has a compilation of the tipsheets and the highly technical tutorials. It’s a great trove for anyone – journalists or not – wanting to learn how to collect and process data and build powerful news applications. Some of my favorites, for their step-by-step nature: Jacob Fenton’s R tutorial, David Huynh’s detailed guide on his Google Refine, Andy Boyle’s on setting up Varnish, and Timothy Barmann’s walkthrough of JavaScript mapping. My colleague Jeff Larson shows off his own JavaScript skills with this MVC framework.

I led a couple of sessions. One boiled down to, basically, “use Firebug,” which you can pretty much glean from a tutorial I wrote for ProPublica on how I grabbed the data from drugmaker Cephalon’s Flash site. I wrote another Ruby tutorial, going from “Hello World” to building a Foursquare/Google Maps mashup…that was almost doable in an hour-long session, had I been better prepared with presentation materials.

One reason to try learning how to code now is that teaching resources have never been more abundant. The NICAR resources collected on Chrys’s blog are more proof of this.

IHeartNYMuseums.com


Last Wednesday, in my haste to get it over with before I forgot about it after a weekend at NICAR, I threw up a hand-compiled chart of New York museums and other cultural attractions, focused primarily on when they were open and free. This was in response to a NY reddit user who asked just the right question to hit my “hey-maybe-*I*-can-do-something” buttons:

Does something like this exist? A chart? It seems like every museum has a day or two that it isn’t open and then one day that it’s open late (ideal for me) but they’re all different. Today, for example, I thought “I’d like to go to a museum but it’s going to be 5 soon and I have no idea if any are open late.” If somebody has an idea how this could be most logically put together, I wouldn’t mind doing it. I just can’t even imagine what form this would take other than some dry list or spreadsheet.

Well, I’m not much of a designer but I like making stuff that uses simple color bars and graphics to represent data, ever since my boss made me attend an Edward Tufte lecture. I also am a big fan of the special nights that museums have; a friend took me to the MOMA on one of the Target Free Fridays and I became a member afterward; I can’t count the times I’ve been since or the number of friends I’ve brought in, at the $5 member discount rate. Considering my tendency to sit around at home, I may have never gone without that first free night.

I got interview requests from writers at the Village Voice and the WSJ the day the map went up, so hopefully this chart gets out to the people who need one more reminder to check out all that’s great in this city.

The site’s a pretty lame technical feat; I looked at lists of museums from Wikipedia and Yelp and then hit up each website to fill out a spreadsheet, which I converted to a webpage that’s way too big of a file for being mostly simple HTML. I guess I could’ve run a scraper on each site, but I wanted to acquaint myself with each place so I could get inspired to check out some new places. The info-gathering was by far the most painful and time-consuming aspect of this (my humble explanation for why it would take 7 days to make a sloppy HTML page with a Google map on top). It reminded me of the many restaurants that make you click through bouncy Flash graphics just to find their business hours. In defense of the museums though, their site-design M.O. is probably to wow people enough with images so that they won’t mind digging through to find the pertinent visitor and admission info. Still, it’s kind of annoying for those of us who just want to get down to some art-seeing business.

Now that I’ve got the basic info down, along with a lot of the museums’ social media links, the next step will be to…well, make this a real site from a framework rather than a Ruby script that reads from a Google spreadsheet. Then, to make a newsfeed of exhibits and events and put everything in a standard hcard format. I’ll probably tackify the site up with photos I’ve taken, too. As someone who needs Google to find what direction I’m walking in, I’m always kind of reluctant to do what the Great Indexers, including Wikipedia contributors, have already done. But then again, those broad informational frameworks don’t always show you enough specific details up front (such as the existence of free hours) to encourage you to go beyond the first search results. And since working on the Dollars for Docs project, I’ve learned there’s always a way to make already-easily available information much more useful.

Check out IHeartNYMuseums.com here.

About a year ago I threw up a long, rambling guide hoping to teach non-programming journalists some practical code. Looking back at it, it seems inadequate. Actually, I misspoke, I haven’t looked back at it because I’m sure I’ll just spend the next few hours cringing. For example, what a dumb idea it was to put everything from “What is HTML” to actual Ruby scraping code all in a gigantic, badly formatted post.

The series of articles has gotten a fair number of hits, but I don’t know how many people were able to stumble through it. Last week, though, I noticed this recent trackback from dataist, a new “blog about data exploration” by Finnish journo Jens Finnäs. He writes that he has “almost no prior programming experience” but, after going through my tutorials and checking out Scraperwiki, was able to produce this cool network graph of the Ratata blog network after about “two days of trial and error”:

Mapping of Ratata blogging network by Jens Finnäs of dataist.wordpress.com


I hope other non-coders who are still intimidated by the thought of learning programming are inspired by Finnäs’s example. Becoming good at coding is not a trivial task. But even the first steps can teach a non-coder some profound lessons about data that are important enough on their own. And if you’re a curious type with a question you want to answer, you’ll soon figure out a way to put something together, as in Finnäs’s case.

ProPublica’s Dollars for Docs project originated in part from this Pfizer-scraping lesson I added on to my programming tutorial: I needed a timely example of public data that wasn’t as useful as it should be.

My colleagues Charles Ornstein and Tracy Weber may not be programmers (yet), but they are experienced enough with data to know its worth as an investigative resource, and turned an exercise in transparency into a focused and effective investigation. It’s not trivial to find a story in data. Besides being able to do Access queries themselves, C&T knew both the limitations of the data (for example, it’s difficult to make comparisons between the companies because of different reporting periods) and its possibilities, such as the cross-checking of names en masse from the payment lists with state and federal doctor databases.

Their investigation into the poor regulation of California nurses – a collaboration with the LA Times that was a Pulitzer finalist in the Public Service category – was similarly data-oriented. They (and the LA Times’ Maloy Moore and Doug Smith) had been diligently building a database of thousands of nurses – including their disciplinary records and the time it took for the nursing board to act – which made my part in building a site to graphically represent the data extremely simple.

The point of all this is: don’t put off your personal data-training because you think it requires a computer science degree, or that you have to become great at it in order for it to be useful. Even if, after a week of learning, you can barely put together a script to alphabetize your tweets, you’ll likely gain enough insight into how data is made structured and useful, which will aid in just about every other aspect of your reporting repertoire.

In fact, just knowing to avoid taking notes like this:

Colonel Mustard used the revolver in the library? (not library)
Miss Scarlet used the Candlestick in the dining room? (not Scarlet)
“Mrs. Peacock, in the dining room, with the revolver? “
“Colonel Mustard, rope, conservatory?”
Mustard? Dining room? Rope (nope)?
“Was it Mrs. Peacock with the candlestick, inside the dining room?”

And instead, recording them like this:

Who/What?      Role?      Ruled out?
Mustard        Suspect    N
Scarlet        Suspect    Y
Peacock        Suspect    N
Revolver       Weapon     Y
Candlestick    Weapon     Y
Rope           Weapon     Y
Conservatory   Place      Y
Dining Room    Place      N
Library        Place      Y

…will make you a significantly more effective reporter, and will leave your reporting and research far more ready for thorough analysis and online projects.
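To see why the structured version pays off, here is a minimal Ruby sketch that treats each row of the table as a record and asks what is still in play. The method names (`notes`, `remaining`) and the field names are my own, made up for illustration; the data is just the table above.

```ruby
# Each note is one record with the same three fields as the table:
# who/what it is, its role, and whether it has been ruled out.
def notes
  [
    { name: "Mustard",      role: "Suspect", ruled_out: false },
    { name: "Scarlet",      role: "Suspect", ruled_out: true  },
    { name: "Peacock",      role: "Suspect", ruled_out: false },
    { name: "Revolver",     role: "Weapon",  ruled_out: true  },
    { name: "Candlestick",  role: "Weapon",  ruled_out: true  },
    { name: "Rope",         role: "Weapon",  ruled_out: true  },
    { name: "Conservatory", role: "Place",   ruled_out: true  },
    { name: "Dining Room",  role: "Place",   ruled_out: false },
    { name: "Library",      role: "Place",   ruled_out: true  }
  ]
end

# Names for a given role ("Suspect", "Weapon" or "Place") not yet ruled out.
def remaining(role, records = notes)
  records.select { |note| note[:role] == role && !note[:ruled_out] }
         .map { |note| note[:name] }
end

puts remaining("Suspect").inspect
puts remaining("Place").inspect
```

Answering the same question from the free-form notes would mean re-reading every line. With the records, it is a one-line filter, and the same rows could be pasted straight into a spreadsheet.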

There’s a motherlode of programming resources available through a single Google search. My high school journalism teacher told us that if you want to do journalism, don’t major in it, just do it. I think the same can be said for programming. I’m glad I chose a computer field as an undergraduate so that I’m familiar with the theory. But if you have a career in reporting or research, you have real-world data-needs that most undergrads don’t. I’ve found that having those goals and needing to accomplish them has pushed my coding expertise along far more quickly than any coursework did.

If you aren’t set on learning to program, but want to get a better grasp of data, I recommend learning:

  • Regular expressions – a set of character patterns, easily printable on a cheat-sheet for memorization, that you use in a text-editor’s Find and Replace dialog to turn a chunk of text into something you can put into a spreadsheet, as well as clean up the data entries themselves. Regular-expressions.info is the most complete resource I’ve found. A cheat-sheet can be found here. Wikipedia has a list of some simple use cases.
  • Google Refine – A spreadsheet-like program that makes easy the task of cleaning and normalizing messy data. Ever go through campaign contribution records and wish you could easily group together and count as one, all the variations of “Jon J. Doe”, “Jonathan J. Doe”, “Jon Johnson Doe”, “JON J DOE”, etc.? Refine will do that. Refine developer David Huynh has an excellent screencast demonstrating Refine’s power. I wrote a guide as part of the Dollars for Docs tutorials. Even if you know Excel like a pro – which I do not – Refine may make your data-life much more enjoyable.
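For a rough feel of what those two tools do, here is a small Ruby sketch. It is not how Refine is implemented. The sample lines, the `Name - $amount` layout, and the method names are all hypothetical, and the grouping key is only loosely modeled on the "fingerprint" idea of lowercasing, stripping punctuation, and sorting the words.

```ruby
# Hypothetical raw lines, as they might look pasted from a web page.
def sample_lines
  [
    "Jon J. Doe - $250",
    "JON J DOE - $1,000",
    "Doe, Jon J. - $75",
    "Jonathan J. Doe - $500"
  ]
end

# Use a regular expression to split one line into a name and a dollar amount.
# Returns nil when the line does not fit the assumed "Name - $amount" layout.
def parse_line(line)
  match = line.match(/\A(?<name>.+?)\s+-\s+\$(?<amount>[\d,]+)\z/)
  return nil unless match

  { name: match[:name], amount: match[:amount].delete(",").to_i }
end

# Build a crude grouping key: lowercase, drop punctuation, sort the words.
def fingerprint(name)
  name.downcase.gsub(/[^a-z0-9\s]/, "").split.uniq.sort.join(" ")
end

# Sum the amounts for every group of names that share a fingerprint.
def totals_by_person(lines)
  lines.map { |line| parse_line(line) }
       .compact
       .group_by { |row| fingerprint(row[:name]) }
       .transform_values { |rows| rows.sum { |row| row[:amount] } }
end

totals_by_person(sample_lines).each do |key, total|
  puts "#{key}: $#{total}"
end
```

Note what this simple key does and doesn't catch: differences in case, punctuation, and word order collapse into one group, but "Jonathan J. Doe" stays separate from "Jon J. Doe". Catching that kind of variation takes fuzzier matching and a human eye, which is the sort of work Refine is meant to make easier.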

If you want to learn coding from the ground up, here’s a short list of places to start:

Good news for data-nerds everywhere. The 2.0 version of Google’s fantastic data-cleaning tool, Google Refine (formerly Gridworks), has been released. And they were nice enough to feature ProPublica’s Dollars for Docs as an example of a use-case. I talked briefly to BusinessJournalism.org about how I used Refine to put together the pharma top earners list.

It’s possible I could’ve done it using SQL queries and Ruby libraries. But I definitely would’ve missed a lot of matches, and probably overdosed on over-the-counter pharma-painkillers.