Coding for Journalists 103: Who’s been in jail before: Cross-checking the jail log with the court system; Use Ruby’s mechanize to fill out a form

This is part of a four-part series on web-scraping for journalists. As of Apr. 5, 2010, it was published a bit incomplete because I wanted to post a timely solution to the recent Pfizer doctor payments list release, but the code at the bottom of each tutorial should execute properly. The code examples are meant for reference and I make no claims to the accuracy of the results. Contact me if you have any questions, or leave a comment below.

DISCLAIMER: The code, data files, and results are meant for reference and example only. Use them at your own risk.

In lesson 3 in particular, I skipped almost all explanation of the code. I hope to get around to it later.

Going to Court

In the last lesson, we learned how to write a script that would record who was in jail at a given hour. This could yield some interesting stories for a crime reporter, including spates of arrests for notable crimes and inmates who are held with $1,000,000 bail for relatively minor crimes. However, an even more interesting angle would be to check the inmates’ prior records, to get a glimpse of the recidivism rate, for example.

Sacramento Superior Court allows users to search by not just names, but by the unique ID number given to inmates by Sacramento-area jurisdictions. This makes it pretty easy to link current inmates to court records.

However, the techniques we used in past lessons to automate the data collection won’t work here, because the court’s search page makes you fill out a form. That’s not something any of the code we’ve written previously will do. Luckily, that’s where Ruby’s mechanize comes in.

Ruby Mechanize

Go to the mechanize library homepage to learn how to install it as a Ruby gem. It requires that nokogiri is installed, which you should’ve done if you’ve made it this far into my tutorials.

There are some basic examples on the project page, but you’re going to have to read some of the technical documentation to learn some of mechanize’s commands.

Here’s a code example we’ll be using:

search_form['txtXref'] = '00112233'
result_page_form = search_form.submit

search_form refers to a mechanize Form object. In that HTML form is a textfield with a name of ‘txtXref’. The array notation we used above is setting that textfield to the value ‘00112233’.

Then, using mechanize’s Form object’s submit method, we submit the form just as if we had clicked the “Submit” button on a webpage.

That’s the basic theory.
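
If you’re curious what “submitting” amounts to under the hood, here’s a rough sketch using only Ruby’s standard library. It builds, but never sends, the kind of POST request a form submission produces. The '/search' path and the button value 'Find' are made up for illustration; txtXref and btnFindByNumber are field names from the court’s form. The real form also carries hidden fields such as __VIEWSTATE, which is one reason to let mechanize handle the details for you.

```ruby
require 'net/http'
require 'uri'

# Field names come from the court's form; the button value 'Find' is a
# made-up stand-in, since the real value isn't shown in this lesson.
form_fields = {
  'txtXref'         => '00112233',
  'btnFindByNumber' => 'Find'
}

# Build (but do not send) a POST request. '/search' is a placeholder path.
request = Net::HTTP::Post.new('/search')
request.set_form_data(form_fields)

puts request['Content-Type'] # the encoding used for ordinary web forms
puts request.body            # the name=value pairs, joined with '&'
```

A browser, or mechanize, does essentially this with every field in the form, including the ones you never see on the page.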

The Code

Note: The following code works if you have an inmates.txt file from the last lesson (use this one if you don’t; keep in mind that the last names and birthdates have been changed/redacted). However, it’s very rudimentary, with no error-checking at all. Still, it’ll give you a couple of tab-delimited files that list an inmate’s past charges and past sentences served, with XREF being the key that links those files to inmates.txt.
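
To give a sense of how the XREF key ties the files together, here’s a minimal sketch that counts past charges per inmate. The rows are made-up sample data, and the layout is an assumption: XREF in the eighth column of inmates.txt (where the script looks for it) and in the first column of court_charges.txt.

```ruby
# Made-up sample rows standing in for inmates.txt (XREF assumed in column 8)
inmate_lines = [
  ["DOE", "JOHN", "", "", "", "", "", "00112233"].join("\t"),
  ["ROE", "JANE", "", "", "", "", "", "00445566"].join("\t")
]

# Made-up sample rows standing in for court_charges.txt (XREF assumed first)
charge_lines = [
  "00112233\tGUILTY\tCHARGE A\t01/15/2008\tFelony",
  "00112233\tNO CONTEST\tCHARGE B\t06/02/2009\tMisdemeanor"
]

# Group the charge rows by their XREF
charges_by_xref = Hash.new { |hash, key| hash[key] = [] }
charge_lines.each do |line|
  xref, *fields = line.chomp.split("\t")
  charges_by_xref[xref] << fields
end

# Look up each inmate's XREF and count the charges filed under it
prior_counts = inmate_lines.map do |line|
  xref = line.split("\t")[7].to_s[/[0-9]+/].to_s
  [xref, charges_by_xref.fetch(xref, []).length]
end

prior_counts.each { |xref, count| puts "#{xref}\t#{count} past charge(s)" }
```

With real files you would swap the two sample arrays for File.readlines calls; the grouping and lookup stay the same.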

Remember that you’re accessing a live site here. This script pauses for 2 seconds after each inmate lookup; there should be no reason to hit the site more often than that.
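
One way to build that pause into a loop is sketched below. each_with_pause is a hypothetical helper name, not part of mechanize. It waits between items rather than after the last one, and it takes the sleep function as an argument so you can try it out without actually waiting.

```ruby
# Yields each item in turn, pausing `delay` seconds between items.
# `sleeper` defaults to Ruby's own sleep; pass a stand-in to skip the waiting.
def each_with_pause(items, delay = 2, sleeper = Kernel.method(:sleep))
  items.each_with_index do |item, index|
    sleeper.call(delay) if index > 0
    yield item
  end
end

# Dry run: record the pauses instead of sleeping through them
pauses = []
visited = []
recorder = lambda { |seconds| pauses << seconds }

each_with_pause(%w[00112233 00445566 00778899], 2, recorder) do |xref|
  visited << xref
end

puts "Visited #{visited.length} xrefs with #{pauses.length} pauses"
```

In a real run you would leave off the third argument and put the page requests inside the block.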

This tutorial will be updated in the future.

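require 'rubygems'
require 'mechanize'

# open datafile and collect the unique XREF numbers from its 8th tab-delimited column
xrefs = File.readlines("inmates.txt").map{|x| x.split("\t")[7].match(/[0-9]+/).to_s}.uniq

# set up a mechanize agent that identifies itself as Safari
a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

# NOTE: the search page's address is missing from this listing.
# Replace this placeholder with the court's criminal search URL before running.
search_url = 'SEARCH_PAGE_URL'
search_page = a.get(search_url)
search_form = search_page.form_with(:name=>'frmCriminalSearch')

# show the field names
p search_form.fields.map{|f| f.name}
#=> ["__EVENTTARGET", "__EVENTARGUMENT", "__VIEWSTATE", "txtLastName", "txtFirstName", "txtDOB", "txtXref", "txtCaseNumber", "lstCaseType"]
# show the button names
p search_form.buttons.map{|m| m.name}
# => ["btnFindByName", "btnFindByNumber"]

xrefs.each do |xref|
  puts "\nFinding info for xref: #{xref}"
  # fill in the XREF field, then submit with the "find by number" button
  # (assumption: that button is the one that triggers an XREF search)
  search_form['txtXref'] = xref
  result_page_form = search_form.submit(search_form.button_with(:name=>'btnFindByNumber')).forms.first
  # skip the first and last buttons on the results form; the rest open case files
  case_buttons = result_page_form.buttons[1..-2]

  puts "There are #{case_buttons.length} cases to check:"
  case_buttons.each do |cb|
    file_page = result_page_form.click_button(cb)
    file_page = file_page.parser # the underlying nokogiri document
    charges_arr = []
    sentences_arr = []

    # the first row of each table is a header, so start at the second row
    charge_rows = file_page.css('#dgDispositionCharges tr')
    if charge_rows.length > 0
      puts "Charges: "
      charge_rows[1..-1].each do |cr|
        ctd = cr.css('td').map{|td| td.text}
        charges_arr << {:plea=>ctd[1], :charge=>ctd[2], :date=>ctd[4], :severity=>ctd[5]}
        puts "\t - #{charges_arr.last.values.join("\t")}"
      end
    end

    sentence_rows = file_page.css('#dgSentenceSummary tr')
    if sentence_rows.length > 0
      puts "Sentences: "
      sentence_rows[1..-1].each do |sr|
        sentences_arr << sr.css('td').map{|td| td.text}.join("\t")
        puts "\t - #{sentences_arr.last}"
      end
    end

    # append to the output files, one row per line
    # (column layout is a reconstruction: XREF goes first so it can serve as the key)
    File.open("court_charges.txt",'a+'){ |f|
      charges_arr.each do |c|
        f.puts("#{xref}\t#{c.values.join("\t")}")
      end
    }
    File.open("sentences.txt", 'a+'){ |f|
      sentences_arr.each do |c|
        f.puts("#{xref}\t#{c}")
      end
    }
  end #done checking a case entry
  puts "Done with #{xref}, sleeping"
  sleep 2 # go easy on the live site
end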



I'm a programmer journalist, currently teaching computational journalism at Stanford University.