Tag Archives: ruby

Ruby MiniTest Cheat Sheet, Unit and Spec reference

Ruby’s standard testing suite, MiniTest, is in dire need of a quick-and-handy reference for its syntax. I’ve put one together comparing the unit syntax (assert/refute) and the spec syntax (must/wont). You can see it below or download the Google Spreadsheet I made and roll your own sheet (HTML, XLS)

Test syntax

Unit	Spec	Arguments	Examples
`assert_empty refute_empty`	`must_be_empty wont_be_empty`	`obj, msg=nil`	`assert_empty []` `refute_empty [1,2,3]` `[].must_be_empty` `[1,2,3].wont_be_empty`
`assert_equal refute_equal`	`must_equal wont_equal`	`exp, act, msg=nil`	`assert_equal 2, 2` `refute_equal 2,1` `2.must_equal 2` `2.wont_equal 1`
`assert_in_delta refute_in_delta`	`must_be_within_delta wont_be_within_delta`	`exp, act, dlt=0.001, msg=nil`	`assert_in_delta 2012, 2010, 2` `refute_in_delta 2012, 3012, 2` `2012.must_be_within_delta 2010, 2` `2012.wont_be_within_delta 3012, 2`
	`must_be_close_to wont_be_close_to`	`act, dlt=0.001, msg=nil`	`2012.must_be_close_to 2010, 2` `2012.wont_be_close_to 3012, 2`
`assert_in_epsilon refute_in_epsilon`	`must_be_within_epsilon wont_be_within_epsilon`	`a, b, eps=0.001, msg=nil`	`assert_in_epsilon 1.0, 1.02, 0.05` `refute_in_epsilon 1.0, 1.06, 0.05` `1.0.must_be_within_epsilon 1.02, 0.05` `1.0.wont_be_within_epsilon 1.06, 0.05`
`assert_includes refute_includes`	`must_include wont_include`	`collection, obj, msg=nil`	`assert_includes [1, 2], 2` `refute_includes [1, 2], 3` `[1, 2].must_include 2` `[1, 2].wont_include 3`
`assert_instance_of refute_instance_of`	`must_be_instance_of wont_be_instance_of`	`klass, obj, msg=nil`	`assert_instance_of String, "bar"` `refute_instance_of String, 100` `"bar".must_be_instance_of String` `100.wont_be_instance_of String`
`assert_kind_of refute_kind_of`	`must_be_kind_of wont_be_kind_of`	`klass, obj, msg=nil`	`assert_kind_of Numeric, 100` `refute_kind_of Numeric, "bar"` `100.must_be_kind_of Numeric` `"bar".wont_be_kind_of Numeric`
`assert_match refute_match`	`must_match wont_match`	`exp, act, msg=nil`	`assert_match /\d/, "42"` `refute_match /\d/, "foo"` `"42".must_match /\d/` `"foo".wont_match /\d/`
`assert_nil refute_nil`	`must_be_nil wont_be_nil`	`obj, msg=nil`	`assert_nil [].first` `refute_nil [1].first` `[].first.must_be_nil` `[1].first.wont_be_nil`
`assert_operator refute_operator`	`must_be wont_be`	`o1, op, o2, msg=nil`	`assert_operator 1, :<, 2` `refute_operator 1, :>, 2` `1.must_be :<, 2` `1.wont_be :>, 2`
`assert_output`	`must_output`	`stdout = nil, stderr = nil`	`assert_output("hi\n"){ puts "hi" }` `Proc.new{puts "hi"}.must_output "hi\n"`
`assert_raises`	`must_raise`	`*exp`	`assert_raises(NoMethodError){ nil! }` `Proc.new{nil!}.must_raise NoMethodError`
`assert_respond_to refute_respond_to`	`must_respond_to wont_respond_to`	`obj, meth, msg=nil`	`assert_respond_to "foo",:empty?` `refute_respond_to 100, :empty?` `"foo".must_respond_to :empty?` `100.wont_respond_to :empty?`
`assert_same refute_same`	`must_be_same_as wont_be_same_as`	`exp, act, msg=nil`	`assert_same :foo, :foo` `refute_same ['foo'], ['foo']` `:foo.must_be_same_as :foo` `['foo'].wont_be_same_as ['foo']`
`assert_silent`	`must_be_silent`	`-`	`assert_silent{ 1 + 1 }` `Proc.new{ 1 + 1}.must_be_silent`
`assert_throws`	`must_throw`	`sym, msg=nil`	`assert_throws(:up){ throw :up}` `Proc.new{throw :up}.must_throw :up`

Test Setup

Unit	Spec
`setup()`	`before(type = nil, &block)`
`teardown()`	`after(type = nil, &block)`

Mocks

via MiniTest::Mock

expect(name, retval, args=[]) – Expect that method name is called, optionally with args or a blk, and returns retval.

Example:

@mock.expect(:meaning_of_life, 42)
@mock.meaning_of_life # => 42

@mock.expect(:do_something_with, true, [some_obj, true])
@mock.do_something_with(some_obj, true) # => true

@mock.expect(:do_something_else, true) do |a1, a2|
   a1 == "buggs" && a2 == :bunny
end

verify – Verify that all methods were called as expected. Raises MockExpectationError if the mock object was not called as expected.

Other syntax

def flunk(msg=nil)
def pass(msg=nil)
def skip(msg=nil, bt=caller)
def it (desc="anonymous", &block)
i_suck_and_my_tests_are_order_dependent!() – Call this at the top of your tests when you absolutely positively need to have ordered tests. In doing so, youâ€™re admitting that you suck and your tests are weak. (TestCase public class method)
parallelize_me!() – Call this at the top of your tests when you want to run your tests in parallel. In doing so, youâ€™re admitting that you rule and your tests are awesome.
make_my_diffs_pretty!() – Make diffs for this TestCase use pretty_inspect so that diff in assert_equal can be more details. NOTE: this is much slower than the regular inspect but much more usable for complex objects.

Tutorials

Reference

Analyzing the U.S. Senate Smiles: A Ruby tutorial with the Face.com and NYT Congress APIs

U.S. Senate Smiles, ranked by Face.com face-detection algorithm

The smiles of your U.S. Senate from most smiley-est to least, according to Face.com's algorithm

Who’s got the biggest smile among our U.S. senators? Let’s find out and exercise our Ruby coding and civic skills. This article consists of a quick coding strategy overview (from the full code is at my Github). Or jump here to see the results, as sorted by Face’s algorithm.

About this tutorial

This is a Ruby coding lesson to demonstrate the basic features of Face.com’s face-detection API for a superficial use case. We’ll mash with the New York Times Congress API and data from the Sunlight Foundation.

The code comprehension is at a relatively simple level and is intended for learning programmers who are comfortable with RubyGems, hashes, loops and variables.

If you’re a non-programmer: The use case may be a bit silly here but I hope you can view it from an abstract-big-picture level and see the use of programming to: 1) Make quick work of menial work and 2) create and analyze datapoints where none existed before.

On to the lesson!

—

The problem with portraits

For the SOPA Opera app I built a few weeks ago, I wanted to use the Congressional mugshots to illustrate
the front page. The Sunlight Foundation provides a convenient zip file download of every sitting Congressmember’s face. The problem is that the portraits were a bit inconsistent in composition (and quality). For example, here’s a usable, classic head-and-shoulders portrait of Senator Rand Paul:

Sen. Rand Paul

But some of the portraits don’t have quite that face-to-photo ratio; Here’s Sen. Jeanne Shaheen’s portrait:

Sen. Jeanne Shaheen

It’s not a terrible Congressional portrait. It’s just out of proportion compared to Sen. Paul’s. What we need is a closeup crop of Sen. Shaheen’s face:

Sen. Jeanne Shaheen's face cropped

How do we do that for a given set of dozens (even hundreds) of portraits that doesn’t involve manually opening each image and cropping the heads in a non-carpal-tunnel-syndrome-inducing manner?

Easy face detection with Face.com’s Developer API

Face-detection is done using an algorithm that scans an image and looks for shapes proportional to the average human face and containing such inner shapes as eyes, a nose and mouth in the expected places. It’s not as if the algorithm has to have an idea of what an eye looks like exactly; two light-ish shapes about halfway down what looks like a head might be good enough.

You could write your own image-analyzer to do this, but we just want to crop faces right now. Luckily, Face.com provides a generous API that when you send it an image, it will send you back a JSON file in this format:

{
    "photos": [{
        "url": "http:\/\/face.com\/images\/ph\/12f6926d3e909b88294ceade2b668bf5.jpg",
        "pid": "F@e9a7cd9f2a52954b84ab24beace23046_1243fff1a01078f7c339ce8c1eecba44",
        "width": 200,
        "height": 250,
        "tags": [{
            "tid": "TEMP_F@e9a7cd9f2a52954b84ab24beace23046_1243fff1a01078f7c339ce8c1eecba44_46.00_52.40_0_0",
            "recognizable": true,
            "threshold": null,
            "uids": [],
            "gid": null,
            "label": "",
            "confirmed": false,
            "manual": false,
            "tagger_id": null,
            "width": 43,
            "height": 34.4,
            "center": {
                "x": 46,
                "y": 52.4
            },
            "eye_left": {
                "x": 35.66,
                "y": 44.91
            },
            "eye_right": {
                "x": 58.65,
                "y": 43.77
            },
            "mouth_left": {
                "x": 37.76,
                "y": 61.83
            },
            "mouth_center": {
                "x": 49.35,
                "y": 62.79
            },
            "mouth_right": {
                "x": 57.69,
                "y": 59.75
            },
            "nose": {
                "x": 51.58,
                "y": 56.15
            },
            "ear_left": null,
            "ear_right": null,
            "chin": null,
            "yaw": 22.37,
            "roll": -3.55,
            "pitch": -8.23,
            "attributes": {
                "glasses": {
                    "value": "false",
                    "confidence": 16
                },
                "smiling": {
                    "value": "true",
                    "confidence": 92
                },
                "face": {
                    "value": "true",
                    "confidence": 79
                },
                "gender": {
                    "value": "male",
                    "confidence": 50
                },
                "mood": {
                    "value": "happy",
                    "confidence": 75
                },
                "lips": {
                    "value": "parted",
                    "confidence": 39
                }
            }
        }]
    }],
    "status": "success",
    "usage": {
        "used": 42,
        "remaining": 4958,
        "limit": 5000,
        "reset_time_text": "Tue, 24 Jan 2012 05:23:21 +0000",
        "reset_time": 1327382601
    }
}

The JSON includes an array of photos (if you sent more than one to be analyzed) and then an array of tags – one tag for each detected face. The important part for cropping purposes are the attributes dealing with height, width, and center:

		"width": 43,
      "height": 34.4,
      "center": {
          "x": 46,
          "y": 52.4
      },

These numbers represent percentage values from 0-100. So the width of the face is 43% of the image’s total width. If the image is 200 pixels wide, then the face spans 86 pixels.

Using your favorite HTTP-calling library (I like the RestClient gem), you can simply ping the Face.com API’s detect feature to get these coordinates for any image you please.

Image manipulation with RMagick

So how do we do the actual cropping? By using the RMagick (a Ruby wrapper for the ImageMagick graphics library) gem, which lets us do crops with commands as simple as these:

img = Magick::Image.read("somefile.jpg")[0]

# crop a 100x100 image starting from the top left corner
img = img.crop(0,0,100,100)

The RMagick documentation page is a great place to start. I’ve also written an image-manipulation chapter for The Bastards Book of Ruby.

The Process

The code for all of this is stored at my Github account.

I’ve divided this into two parts/scripts. You could combine it into one script but to make things easier to comprehend (and to lessen the amount of best-practices error-handling code for me to write), I divide it into a “fetch” and “process” stage.

In the fetch.rb stage, we essentially download all the remote files we need to do our work:

Download a zip file of images from Sunlight Labs and unzip it at the command line
Use NYT’s Congress API to get latest list of Senators
Use Face.com API to download face-coordinates as JSON files

In the process.rb stage, we use RMagick to crop the photos based from the metadata we downloaded from the NYT and Face.com. As a bonus, I’ve thrown in a script to programmatically create a crude webpage that ranks the Congressmembers’ faces by smile, glasses-wearingness, and androgenicity. How do I do this? The Face.com API handily provides these numbers in its response:

	"attributes": {
            "glasses": {
                "value": "false",
                "confidence": 16
            },
            "smiling": {
                "value": "true",
                "confidence": 92
            },
            "face": {
                "value": "true",
                "confidence": 79
            },
            "gender": {
                "value": "male",
                "confidence": 50
            },
            "mood": {
                "value": "happy",
                "confidence": 75
            },
            "lips": {
                "value": "parted",
                "confidence": 39
            }
        }

I’m not going to reprint the code from my Github account, you can see the scripts yourself there:

https://github.com/dannguyen/Congressmiles

First things first: sign up for API keys at the NYT and Face.com

I also use the following gems:

The Results

Here’s what you should see after you run the process.rb script (all judgments made by Face.com’s algorithm…I don’t think everyone will agree with about the quality of the smiles):

10 Biggest Smiles

Sen. Wicker (R-MS) [100]

Sen. Reid (D-NV) [100]

Sen. Shaheen (D-NH) [99]

Sen. Hagan (D-NC) [99]

Sen. Snowe (R-ME) [98]

Sen. Kyl (R-AZ) [98]

Sen. Klobuchar (D-MN) [98]

Sen. Crapo (R-ID) [98]

Sen. Johanns (R-NE) [98]

Sen. Hutchison (R-TX) [98]

10 Most Ambiguous Smiles

Sen. Inouye (D-HI) [40]

Sen. Kohl (D-WI) [43]

Sen. McCain (R-AZ) [47]

Sen. Durbin (D-IL) [49]

Sen. Roberts (R-KS) [50]

Sen. Whitehouse (D-RI) [52]

Sen. Hoeven (R-ND) [54]

Sen. Alexander (R-TN) [54]

Sen. Shelby (R-AL) [62]

Sen. Johnson (D-SD) [63]

The Non-Smilers

Sen. Bingaman (D-NM) [79]

Sen. Coons (D-DE) [77]

Sen. Burr (R-NC) [72]

Sen. Hatch (R-UT) [72]

Sen. Reed (D-RI) [71]

Sen. Paul (R-KY) [71]

Sen. Lieberman (I-CT) [59]

Sen. Bennet (D-CO) [55]

Sen. Udall (D-NM) [51]

Sen. Levin (D-MI) [50]

Sen. Boozman (R-AR) [48]

Sen. Isakson (R-GA) [41]

Sen. Franken (D-MN) [37]

10 Most Bespectacled Senators

Sen. Franken (D-MN) [99]

Sen. Sanders (I-VT) [98]

Sen. McConnell (R-KY) [98]

Sen. Grassley (R-IA) [96]

Sen. Coburn (R-OK) [93]

Sen. Mikulski (D-MD) [93]

Sen. Roberts (R-KS) [93]

Sen. Inouye (D-HI) [91]

Sen. Akaka (D-HI) [88]

Sen. Conrad (D-ND) [86]

10 Most Masculine-Featured Senators

Sen. Bingaman (D-NM) [94]

Sen. Boozman (R-AR) [92]

Sen. Bennet (D-CO) [92]

Sen. McConnell (R-KY) [91]

Sen. Nelson (D-FL) [91]

Sen. Rockefeller IV (D-WV) [90]

Sen. Carper (D-DE) [90]

Sen. Casey (D-PA) [90]

Sen. Blunt (R-MO) [89]

Sen. Toomey (R-PA) [88]

10 Most Feminine-Featured Senators

Sen. McCaskill (D-MO) [95]

Sen. Boxer (D-CA) [93]

Sen. Shaheen (D-NH) [93]

Sen. Gillibrand (D-NY) [92]

Sen. Hutchison (R-TX) [91]

Sen. Collins (R-ME) [90]

Sen. Stabenow (D-MI) [86]

Sen. Hagan (D-NC) [81]

Sen. Ayotte (R-NH) [79]

Sen. Klobuchar (D-MN) [79]

—

For the partisan data-geeks, here’s some faux analysis with averages:

Party	Smiles	Non-smiles	Avg. Smile Confidence
D	44	7	85
R	42	5	86
I	1	1	85

There you have it, the Republicans are the smiley-est party of them all.

Further discussion

This is an exercise to show off the very cool Face.com API and to demonstrate the value of a little programming knowledge. Writing the script doesn’t take too long, though I spent more time than I liked on idiotic bugs of my own making. But this was way preferable than cropping photos by hand. And once I had the gist of things, I not only had a set of cropped files, I had the ability to whip up any kind of visualization I needed with just a minute’s more work.

And it wasn’t just face-detection that I was using, but face-detection in combination with deep data-sources like the Times’s Congress API and the Sunlight Foundation. For the SOPA Opera app, it didn’t take long at all to populate the site with legislator data and faces. (I didn’t get around to using this face-detection technique to clean up the images, but hey, I get lazy too…)

Please don’t judge the value of programming by my silly example here – having an easy-to-use service like Face.com API (mind the usage terms, of course) gives you a lot of great possibilities if you’re creative. Off the top of my head, I can think of a few:

As a photographer, I’ve accumulated thousands of photos but have been quite lazy in tagging them. I could conceivably use Face.com’s API to quickly find photos without faces for stock photo purposes. Or maybe a client needs to see male/female portraits. The Face.com API gives me an ad-hoc way to retrieve those without menial browsing.
Data on government hearing webcasts are hard to come by. I’m sure there’s a programmatic way to split up a video into thousands of frames. Want to know at which points Sen. Harry Reid shows up? Train Face.com’s API to recognize his face and set it loose on those still frames to find when he speaks.
Speaking of breaking up video…use the Face API to detect the eyes of someone being interviewed and use RMagick to detect when the eyes are closed (the pixels in those positions are different in color than the second before) to do that college-level psych experiment of correlating blinks-per-minute to truthiness.

Thanks for reading. This was a quick post and I’ll probably go back to clean it up. At some point, I’ll probably add this to the Bastards Book.

The Bastards Book: A Programming Tutorial for journalists, researchers, analysts, and anyone else who cares about data

Crossing Bleecker and Lafayette through a snowstorm

Back when I wrote my “Coding for Journalists 101″ guide about a year and a half ago, I barely realized how useful code could be as a journalistic tool. Since then, after the Dollars for Docs project at ProPublica and various other programming adventures, I’ve become a slightly better coder and even more adamant that programming is basically a necessity for anyone who cares about understanding and communicating about the world in a quantitative, meaningful way.

The world of data has exploded in the past few years without a corresponding increase in the people or tools to efficiently make sense of it. And so I’ve had a hankering to create a more cohesive, useful programming guide aimed at not just journalists, but for anyone in any field.

It’s called the Bastards Book of Ruby. It’s not really just about Ruby and “bastards” was a working title that I came up with but never got around to changing. But it seems to work for now.

As I was writing the introduction (“Programming is for Anyone“), I came across this Steve Jobs interview with Fresh Air. He says pretty much exactly what I’m thinking, but he said it 15 years ago — surprising given that the Web was in its infancy and Jobs’s fame was largely out of making computers brain-dead simple for people. He wasn’t much of a programmer, but he really was a genius at understanding the bigger picture of what he himself only dabbled in:

“In my perspective … science and computer science is a liberal art, it’s something everyone should know how to use, at least, and harness in their life. It’s not something that should be relegated to 5 percent of the population over in the corner. It’s something that everybody should be exposed to and everyone should have mastery of to some extent, and that’s how we viewed computation and these computation devices.”

Bastards Book of Ruby. It’s just a rough draft but already numbers at 75,000 words. See the table of contents.

dataist blog: An inspiring case for journalists learning to code

About a year ago I threw up a long, rambling guide hoping to teach non-programming journalists some practical code. Looking back at it, it seems inadequate. Actually, I misspoke, I haven’t looked back at it because I’m sure I’ll just spend the next few hours cringing. For example, what a dumb idea it was to put everything from “What is HTML” to actual Ruby scraping code all in a gigantic, badly formatted post.

The series of articles have gotten a fair number of hits but I don’t know how many people were able to stumble through it. Though last week I noticed this recent trackback from dataist, a new “blog about data exploration” by Finnish journo Jens FinnÃ¤s. He writes that he has “almost no prior programming experience” but, after going through my tutorials and checking out Scraperwiki, was able to produce this cool network graph of the Ratata blog network after about “two days of trial and error”:

Mapping of Ratata blogging network by Jens FinnÃ¤s of dataist.wordpress.com

I hope other non-coders who are still intimidated by the thought of learning programming are inspired by Finnas’s example. Becoming good at coding is not a trivial task. But even the first steps of it can teach a non-coder some profound lessons about data important enough on their own. And if you’re a curious-type with a question you want to answer, you’ll soon figure out a way to put something together, as in Finnas’s case.

ProPublica’s Dollars for Docs project originated in part from this Pfizer-scraping lesson I added on to my programming tutorial: I needed a timely example of public data that wasn’t as useful as it should be.

My colleagues Charles Ornstein and Tracy Weber may not be programmers (yet), but they are experienced enough with data to know its worth as an investigative resource, and turned an exercise in transparency into a focused and effective investigation. It’s not trivial to find a story in data. Besides being able to do Access queries themselves, C&T knew both the limitations of the data (for example, it’s difficult to make comparisons between the companies because of different reporting periods) and its possibilities, such as the cross-checking of names en masse from the payment lists with state and federal doctor databases.

Their investigation into the poor regulation of California nurses – a collaboration with the LA Times that was a Pulitzer finalist in the Public Service category – was similarly data-oriented. They (and the LA Times’ Maloy Moore and Doug Smith) had been diligently building a database of thousands of nurses – including their disciplinary records and the time it took for the nursing board to act – which made my part in building a site to graphically represent the data extremely simple.

The point of all this is: don’t put off your personal data-training because you think it requires a computer science degree, or that you have to become great at it in order for it to be useful. Even if after a week of learning, you can barely put together a programming script to alphabetize your tweets, you’ll likely gain enough insight to how data is made structured and useful, which will aid in just about every other aspect of your reporting repertoire.

In fact, just knowing to avoid taking notes like this:

Colonel Mustard used the revolver in the library? (not library)
Miss Scarlet used the Candlestick in the dining room? (not Scarlet)
“Mrs. Peacock, in the dining room, with the ~~revolver~~? “
“Colonel Mustard, rope, ~~conservatory~~?”
Mustard? Dining room? Rope (nope)?
“Was it Mrs. Peacock with the ~~candlestick~~, inside the dining room?”

And instead, recording them like this:

Who/What?	Role?	Ruled out?
Mustard	Suspect	N
Scarlet	Suspect	Y
Peacock	Suspect	N
Revolver	Weapon	Y
Candlestick	Weapon	Y
Rope	Weapon	Y
Conservatory	Place	Y
Dining Room	Place	N
Library	Place	Y

…will make you a significantly more effective reporter, as well as position you to have your reporting and research become much more ready for thorough analysis and online projects.

There’s a motherlode of programming resources available through single Google search. My high school journalism teacher told us that if you want to do journalism, don’t major in it, just do it. I think the same can be said for programming. I’m glad I chose a computer field as an undergraduate so that I’m familiar with the theory. But if you have a career in reporting or research, you have real-world data-needs that most undergrads don’t. I’ve found that having those goals and needing to accomplish them has pushed my coding expertise far quicker than did any coursework.

If you aren’t set on learning to program, but want to get a better grasp of data, I recommend learning:

Regular expressions – a set of character patterns, easily printable on a cheat-sheet for memorization, that you use in a text-editor’s Find and Replace dialog to turn a chunk of text into something you can put into a spreadsheet, as well as clean up the data entries themselves. Regular-expressions.info is the most complete resource I’ve found. A cheat-sheet can be found here. Wikipedia has a list of some simple use cases.
Google Refine – A spreadsheet-like program that makes easy the task of cleaning and normalizing messy data. Ever go through campaign contribution records and wish you could easily group together and count as one, all the variations of “Jon J. Doe”, “Jonathan J. Doe”, “Jon Johnson Doe”, “JON J DOE”, etc.? Refine will do that. Refine developer David Huynh has an excellent screencast demonstrating Refine’s power. I wrote a guide as part of the Dollars for Docs tutorials. Even if you know Excel like a pro – which I do not – Refine may make your data-life much more enjoyable.

If you want to learn coding from the ground up, here’s a short list of places to start:

Lifehacker’s “Full Beginner’s Guide” – a four day guide that covers the very basics to how to write a simple guessing game. It’s in Javascript, but as you’ll hear plenty of times from veterans, it really doesn’t matter what language you start out with.
The Pragmatic Programmer’s Guide to Programming Ruby – this covers an older version of Ruby, but is still a great comprehensive, browser-friendly book.
Learn to Program (also in Ruby) by Chris Pine – Written in 2004, this is still an elegant beginner’s guide
Invent Your Own Computer Games With Python – You may not be interested in writing game software, but the same programming techniques apply in that field as they do anywhere else. This guide covers all the fundamentals and gives you great project examples.
ScraperWiki has a massive collection of web-scraping scripts for your perusal, and is where the dataist’s FinnÃ¤s learned from example. ScraperWiki has a set of python tutorials, too.
Here’s a giant list of free programming books.
Visit the learnprogramming subforum in Reddit to find a small, but active community of beginners who aren’t afraid to start the most basic of discussions with the forum’s programming experts. StackOverflow is the single best site for specific questions or problems; often, you can Google your exact problem and a relevant StackOverflow discussion will be at the top.
And you can always refer back to my four-part programming tutorial from last year, which aims to cover HTML to writing Ruby to scrape websites. I also wrote a series of tutorials (with complete code) on how I collected data for Dollars for Docs, including how to scrape from websites, Flash applications, PDFs, and even image files (the solution is specific to one kind of format, so I will gladly welcome anyone else to generalize it).

Coding for Journalists 101 : A four-part series

Photo by Nico Cavallotto on Flickr

Update, January 2012: Everything…yes, everything, is superseded by my free online book, The Bastards Book of Ruby, which is a much more complete walkthrough of basic programming principles with far more practical and up-to-date examples and projects than what you’ll find here.

I’m only keeping this old walkthrough up as a historical reference. I’m sure the code is so ugly that I’m not going to even try re-reading it.

So check it out: The Bastards Book of Ruby

-Dan

—

Update, Dec. 30, 2010: I published a series of data collection and cleaning guides for ProPublica, to describe what I did for our Dollars for Docs project. There is a guide for Pfizer which supersedes the one I originally posted here.

So a little while ago, I set out to write some tutorials that would guide the non-coding-but-computer-savvy journalist through enough programming fundamentals so that he/she could write a web scraper to collect data from public websites. A “little while” turned out to be more than a month-and-a-half. I actually wrote most of it in a week and then forgot about. The timeliness of the fourth lesson, which shows how to help Pfizer in its mission to more transparent, compelled me to just publish them in incomplete form. There’s probably inconsistencies in the writing and some of the code examples, but the final code sections at the end of each tutorial do seem to execute as expected.

As the tutorials are aimed at people who aren’t experienced programming, the code is pretty verbose, pedantic, and in some cases, a little inefficient. It was my attempt to think how to make the code most readable, and I’m very welcome to editing changes.

DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.

Tutorial 1: Go from knowing nothing to scraping Web pages. In an hour. Hopefully

~~loop~~

Tutorial 2: Scraping a County Jail Website to Find Out Who’s in Jail – This uses all the concepts from the first tutorial and applies them to something that a cops reporter might actually want to try out.

Tutorial 3: Who’s Been in Jail Before: Cross-checking the jail logs with the court system with Ruby’s Mechanize – This lesson introduces you to another Ruby library that allows you to automate the filling-out of forms so that you can access online databases, in this case, California criminal case histories to see if current inmates are repeat-alleged-offenders.

Tutorial 4: Improving Pfizer’s Dollars-to-Doctors Pay List – Last week, Pfizer released a list of nearly 5,000 doctors and medical institutions that it made $35 million in consulting and expense payments. Fun. Unfortunately, the list, as it initially existed online, is just about useless to anyone wanting to examine trends. This tutorial provides a script to make the list more interesting to journalists.

Coding for Journalists 103: Who’s been in jail before: Cross-checking the jail log with the court system; Use Ruby’s mechanize to fill out a form

This is part of a four-part series on web-scraping for journalists. As of Apr. 5, 2010, it was a published a bit incomplete because I wanted to post a timely solution to the recent Pfizer doctor payments list release, but the code at the bottom of each tutorial should execute properly. The code examples are meant for reference and I make no claims to the accuracy of the results. Contact dan@danwin.com if you have any questions, or leave a comment below.

DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.

In particular, with lesson 3, I skipped basically any explanation to the code. I hope to get around to it later.

Going to Court

In the last lesson, we learned how to write a script that would record who was in jail at a given hour. This could yield some interesting stories for a crime reporter, including spates of arrests for notable crimes and inmates who are held with $1,000,000 bail for relatively minor crimes. However, an even more interesting angle would be to check the inmates’ prior records, to get a glimpse of the recidivism rate, for example.

Sacramento Superior Court allows users to search by not just names, but by the unique ID number given to inmates by Sacramento-area jurisdictions. This makes it pretty easy to link current inmates to court records.

However, the techniques we used in past lessons to automate the data collection won’t work here. As you can see in the above picture, you have to fill out a form. That’s not something any of the code we’ve written previously will do. Luckily, that’s where Ruby’s mechanize comes in.

Continue reading →

Coding for Journalists 102: Who’s in Jail Now: Collecting info from a county jail site

This is part 2 of a 4-part series in introductory coding for journalists. Go here for the first lesson. This lesson and code will still be verbose, but will have a lot less hand-holding than the previous one.

Continue reading →

Coding for Journalists 101: Go from knowing nothing to scraping Web pages. In an hour. Hopefully.

UPDATE (12/1/2011): Ever since writing this guide, I’ve wanted to put together a site that is focused both on teaching the basics of programming and showing examples of practical code. I finally got around to making it: The Bastards Book of Ruby.

I’ve since learned that trying to teach the fundamentals of programming in one blog post is completely dumb. Also, I hope I’m a better coder now than I was a year and a half ago when I first wrote this guide. Check it out and let me know what you think:

http://ruby.bastardsbook.com

Someone asked in this online chat for journalists: I want to program/code, but where does a non-programmer journalist begin?

My colleague Jeff Larson gave what I believe is the most practical and professionally-useful answer: web-scraping (jump to my summary of web-scraping here, or read this more authorative source).

This is my attempt to walk someone through the most basic computer science theory so that he/she can begin collecting data in an automated way off of web pages, which I think is one of the most useful (and time-saving) tools available to today’s journalist. And thanks to the countless hours of work by generous coders, the tools are already there to make this within the grasp of a beginning programmer.

You just have to know where the tools are and how to pick them up.

Click here for this page’s table of contents. Or jump to the the theory lesson. Or to the programming exercise. Or, if you already know what a function and variable is, and have Ruby installed, go straight to two of my walkthroughs of building a real-world journalistic-minded web scraper: Scraping a jail site, and scraping Pfizer’s doctor payment list.

Or, read on for some more exposition:

Continue reading →