Coding for Journalists 102: Who’s in Jail Now: Collecting info from a county jail site

This is part 2 of a 4-part series in introductory coding for journalists. Go here for the first lesson. This lesson and code will still be verbose, but will have a lot less hand-holding than the previous one.

This is part of a four-part series on web-scraping for journalists. As of Apr. 5, 2010, it was published a bit incomplete because I wanted to post a timely solution to the recent Pfizer doctor payments list release, but the code at the bottom of each tutorial should execute properly. The code examples are meant for reference and I make no claims to the accuracy of the results. Contact dan@danwin.com if you have any questions, or leave a comment below.

DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.

A note about privacy: This tutorial uses files that I archived from a real-world jail website. Though booking records are public record, I make no claims about the legal proceedings involving the inmates who happened to be in jail when I took my snapshot. For all I know, they could have all been wrongfully arrested and therefore don’t deserve to have their name attached in online perpetuity to erroneous charges (even if the site only purports to record who was arrested and when, and not any legal conclusions). For that reason, I’ve redacted the last names of the inmates and randomized their birthdates.

The Cops Reporter and the Log

If you’re a daily cops reporter, calling the police station to ask for the list of last night’s arrests is probably part of your job, because many papers have some kind of cops blotter where arrested suspects are listed. Online and in print, this is usually one of a paper’s top features. The St. Petersburg Times has a modern version of the feature, complete with mugshots and stats summaries.

Arrest logs have sometimes been criticized for being little more than voyeurism (here’s a discussion over the St. Pete’s mugshot site). But knowing who your law officers are arresting, and why, is essential to a nice, free society (and for a fair and efficient police force). And the more data you have as a reporter, the better you’ll be able to cover your beat.

Most pro-active police departments will announce when they’ve made high-profile arrests. But relying on the police to tell you what the most noteworthy arrests are kind of begs the question, and doesn’t give the whole picture of arrest activity. Most states consider arrest logs to be public information (not that that stops some jurisdictions from hiding them). But a paper list or a PDF is hard to analyze. Luckily, some police departments are putting their work on the Web. They might be willing to send you a spreadsheet of arrest activity, but what if you wanted up-to-the-hour information, so that you could be aware of:

  1. Suspected crimes that fall between egregious and infamous (non-fatal assaults, robberies, carjackings, etc.)
  2. An abnormally large number of arrests at a given time
  3. Unusual types of suspected crimes at a given time

This is where the web-scraping you learned in my last tutorial gets useful. You’re going to have an automated way of collecting the latest arrests news, in an ordered fashion (so that you could, for example, find the inmate with the largest bail at a given time), and you’ll save yourself and your friendly police PIO tedious paper shuffling and typing.

I’m going to base my lesson on this sheriff department’s jail system. I’ve mirrored a snapshot of their site here (zip file here), so I recommend you run your scripts on my mirror (root directory: http://danwin.com/static/jail-list/) before doing a real-world test.

The jail web site has these characteristics:

  • At this page is a list of every person booked in the last 24 hours
  • The list typically has 100 to 200 inmates at a time
  • Most entries in that list contain a link to an inmate’s page containing data including name, DOB, bail, charges, booking time.
  • Each inmate has a unique identifying number called X-REF
  • Not all entries have a link; inmates who have been released have only their names listed

The site is pretty useful and user-friendly. However, it’s hard to quickly glean any useful information from the main list. You have to click through each individual entry to find out why someone was jailed. The purpose of the following lesson is to automate that process so you can efficiently get the big picture of a jail’s activity.

Program flow will go something like this:

  1. Create two text files: one to store the list of inmates (inmates.txt), one to store the list of charges (charges.txt)
  2. Open the inmate listing page
  3. Collect each list entry
  4. If list entry is not a link (i.e. inmate has been released)
    1. Fetch first name, middle name, last name, intake time and release date
  5. Else, if the list entry is a link, open it
    1. Fetch first name, middle name, last name, xref, intake time, and DOB of an inmate
    2. Fetch and parse list of charges
    3. Fetch the bail amount
  6. In an each loop, for each inmate entry we collected above:
    1. Output inmate information, in tab-delimited format, into inmates.txt, including the XREF.
    2. Output the charges associated with the inmate into charges.txt. Each charge will take up one line, and the XREF of the inmate will also be included so as to provide a key to the associated inmate

File I/O

We didn’t cover opening and writing to an external text file in the last lesson. So here’s how it goes briefly: Using Ruby’s File class (a subclass of IO), we’re going to create two files, inmates.txt and charges.txt, and write to them what we find on the jail’s website. We’ll be using the variables inmates_file and charges_file to refer to the external files.

To open the files and set the variables, use the File class’s new method, which takes in two parameters: a string designating the file name, and a string designating the mode…which in this case, will be “a”: write-only, appending to the end of the file (read about the various modes here).

inmates_file = File.new('inmates.txt', 'a')
charges_file = File.new('charges.txt', 'a')

If these files don’t already exist, they will now. If they did, the ‘a’ mode will append new content to the end of the file.

To write something to the file, use the puts method, which writes whatever string you supply to it as one line in the file (we’ve used this method without the IO class, in which case it outputs to the screen):

charges_file.puts("Adding a new line of text to the charges file.")
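To see the append behavior for yourself, here is a minimal stand-alone sketch, separate from the scraper. It writes to a throwaway file inside a temporary directory (demo_log.txt is a made-up name), so it won't touch your real inmates.txt. It also calls close, which the snippets above leave out; closing makes sure anything still buffered gets written to disk.

```ruby
require 'tmpdir'

lines_read = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, 'demo_log.txt')   # hypothetical file name, for illustration only

  # First open: the file doesn't exist yet, so 'a' mode creates it
  log = File.new(path, 'a')
  log.puts("first line")
  log.close   # flush buffered output to disk

  # Second open: the file exists now, so 'a' mode adds to the end instead of overwriting
  log = File.new(path, 'a')
  log.puts("second line")
  log.close

  # Read the file back to confirm both lines survived
  lines_read = File.readlines(path).map { |line| line.chomp }
end

puts lines_read.inspect
```

Both lines should come back, in the order they were written.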

While we’re setting up, let’s create an array of hashes, with each hash object holding an inmate and his/her information. We don’t have to do this…we could just output to the file each inmate record as we get to it, but this will allow us some flexibility later. All we have to do is initialize the array:

inmates_array = []

Open the inmate listing page

Now let’s fetch the inmates listing. We’ll be using Nokogiri in the same fashion we did in the last lesson, beginning by requiring the nokogiri and open-uri libraries, then using the Open-URI’s open method to fetch the page, and then Nokogiri’s HTML class to wrap up the page in a parsable format.

require 'rubygems'
require 'nokogiri'
require 'open-uri'
		
base_url='http://danwin.com/static/jail-list/' # all links on the list will be relative to this address		
inmate_listing = Nokogiri::HTML(open("#{base_url}current_listing.cfm.html"))
A reminder. The construct #{something_here}, when put inside a double-quoted string, will treat something_here as an actual value of the variable something_here, not just the string. This is called string interpolation. The two following expressions, the latter using interpolation, are equivalent, though the latter will not throw an error if string2 happens to not be a String.

a_combined_string = "Hello " + string2
a_combined_string = "Hello #{string2}"

Read more about Ruby’s string interpolation here.
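If you want to see the difference in action, here's a small sketch. The bail variable and its value are made up; the point is only that it holds an Integer rather than a String.

```ruby
bail = 5000   # an Integer, not a String (made-up value)

# Interpolation calls to_s on the value for you
with_interpolation = "Total bail: #{bail}"

# Concatenation with + insists on a String and raises a TypeError otherwise
with_plus = nil
error_message = nil
begin
  with_plus = "Total bail: " + bail
rescue TypeError => e
  error_message = e.message
end

puts with_interpolation
puts "Concatenation failed: #{error_message}"
```

The usual fix for the second form is to call to_s yourself: "Total bail: " + bail.to_s.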

Let’s visit the page with a browser and examine the structure. The list is an HTML table, with each row containing several columns, the first column being the inmate’s full name and, if the inmate hasn’t been released, a link to his/her booking page.

If you inspect the HTML closely, you’ll see that this page is composed of several tables. What we want is the table contained inside the <td> element with a class of “content.”

So we’ll collect all the table rows, using Nokogiri’s xpath method, and iterate through them using an each loop. We’re going to use a variation of an each loop called each_index, which provides the numerical index of the current iteration we’re on.

	inmate_rows = inmate_listing.xpath("//td[@class='content']/table")[0].xpath(".//tr").to_a[1..-1]

The XPath syntax here is looking for a td element with class=’content’, then the table inside of that. There’s more than one, but the first one on the page has the data. From that, we gather all the rows (tr) within it. We call the to_a method to convert the result into an array, since Nokogiri’s xpath method returns a NodeSet, which doesn’t have the each_index method. (Calling collect without a block used to do the same job, but on Ruby 1.9 and later it returns an Enumerator, which can’t be sliced with [1..-1].) each_index loops through an array, just like each, but it provides the index of the current iteration.

	inmate_rows.each_index do |i|
		inmate_row = inmate_rows[i]
		inmates_array[i] = {}
		inmate = inmates_array[i]

		# each row has a set of columns with the inmate info
		list_columns = inmate_row.xpath('./td')

Because we know we’re on the ith row, we can also initialize the ith index in inmates_array as a hash to store the ith inmate’s information. Remember that each element in the inmates_array is going to be a hash of information.

Let’s use the variable named inmate as a shorthand way to refer to this position in the inmates_array. Each time we iterate through the loop, inmate will refer to the next spot in the inmates_array.

This is easier to type out 10 times than inmates_array[i].
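To convince yourself that inmate really is just a second name for the same hash, and not a copy, here's a tiny stand-alone sketch (the name is made up):

```ruby
inmates_array = []
i = 0

inmates_array[i] = {}          # a new, empty hash in slot i
inmate = inmates_array[i]      # inmate now points at that same hash object

inmate['first_name'] = 'JOHN'  # write through the shorthand...

puts inmates_array[i]['first_name']    # ...and the array slot sees the change
puts inmate.equal?(inmates_array[i])   # equal? checks object identity
```

Because both names refer to one object, anything we store in inmate inside the loop is automatically stored in inmates_array.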

Before we get to visiting the individual inmate pages, let’s just collect the name and other information readily available here.

Each name consists of a String in this format: last_name, first_name middle_name

So let’s use the String split method, first to split the string by the comma; this will give us an array with the first element being what’s on the left side of the comma. Splitting the second element of that array by a space will give us another array, consisting of a first name and middle name.
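Here's what those two splits look like on plain strings, outside of Nokogiri. This is only a sketch: split_inmate_name is a hypothetical helper I'm using for illustration, and the names are invented.

```ruby
# Hypothetical helper: break "last, first middle" into its parts
def split_inmate_name(full_name)
  name_parts  = full_name.split(',')       # e.g. ["DOE", " JOHN QUINCY"]
  given_names = name_parts[1].split(' ')   # e.g. ["JOHN", "QUINCY"]; leading space is ignored

  {
    'last_name'   => name_parts[0],
    'first_name'  => given_names[0],
    # everything after the first name; an empty String if there is nothing left
    'middle_name' => given_names[1..-1].join(' ')
  }
end

with_middle    = split_inmate_name("DOE, JOHN QUINCY")
without_middle = split_inmate_name("ROE, JANE")

puts with_middle.inspect
puts without_middle.inspect
```

Note that [1..-1] returns an array, so the sketch joins it back into a String before storing it.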

		
		
		
		# remember that you need to call Nokogiri's content method to get the text, as a String, inside a tag	
		the_inmate_name =  list_columns[0].content.gsub(/\302\240/, ' ').strip.split(',')
		
		inmate['last_name'] = the_inmate_name[0]					# the name before the comma
		inmate['first_name'] = the_inmate_name[1].split(' ')[0]		# the name after the comma, but before the next space
		# everything after the first name, joined back into one String
		inmate['middle_name'] = the_inmate_name[1].split(' ')[1..-1].join(' ') if the_inmate_name[1].split(' ').length > 1

I’m going to be using this method call after each use of content: gsub(/\302\240/, ' ').strip

Not all entries have a middle name. So we use the if the_inmate_name[1].split(' ').length > 1 conditional statement to tell Ruby to skip this line if nothing follows the first name. (Testing the_inmate_name.length won’t work here: the_inmate_name comes from splitting on the comma, so it has two elements whether or not there is a middle name.)

		
		# Moving on to the next table cell, which will be the 1 spot in list_columns
		inmate['sex'] = list_columns[1].content
		
		
		# next cell, DOB
		inmate['dob'] = list_columns[2].content
			
		# next cell, booking time
		inmate['intake_time'] = list_columns[3].content
		
	
		
		
		# let's go back to the first column to see if it contained a link
		if list_columns[0].xpath('./a').length == 0  # if there was no link, there would be 0 links returned
			
			# No link to visit, so this must have been a released inmate. Let's grab his/her release date 
			# which comes in the pattern "Released mm/dd/yyyy"...so we'll split the string and capture the second term

			inmate['release_date'] = list_columns[4].content.gsub(/\302\240/, ' ').split(' ')[1]
			
		else
		
			# visit link
			# we'll get to this subroutine in the next section
			
			
		end
	end
I call the gsub method to cleanse the strings of data. This particular website uses &nbsp; (non-breaking space) to form a space character, and Nokogiri converts these into a non-breaking-space character rather than a normal space, so strip doesn’t remove them as intended. So this method call is used frequently:
.gsub(/\302\240/, ' ')

Read more about this from Vita Ara
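Here's a quick way to watch strip miss a non-breaking space. This sketch assumes a current Ruby where strings are UTF-8, and writes the character as "\u00A0"; in UTF-8 that is the same two bytes the regular expression above spells in octal as \302\240. The cell text is made up.

```ruby
nbsp = "\u00A0"   # non-breaking space; its UTF-8 bytes are \302\240 in octal

raw_cell = "#{nbsp}Released 04/05/2010#{nbsp}"   # made-up cell text

stripped_only = raw_cell.strip                   # strip leaves the non-breaking spaces alone
cleaned       = raw_cell.gsub(nbsp, ' ').strip   # swap them for plain spaces first, then strip

puts stripped_only == raw_cell   # nothing was removed
puts cleaned
puts cleaned.split(' ')[1]       # the release date, as in the loop above
```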

OK, that should’ve given you a refresher on arrays, hashes, XPath, and string manipulation. Now we’ll handle the case of when the first item in the list_columns array does contain a link. It will involve fetching the page from that link and then more XPathing to pick out the wanted data.

At this time, go to the inmate list page and click on one of the inmate pages in the browser.

There’s a lot more information here; what will be most relevant to us right now is the X-Reference Number, charges, and bail. This next section of code will fit into the else branch of our previous section of code.

	# visit link (remember that the xpath method returns a NodeSet, which we can index like an array,
	# so we have to explicitly refer to the 0th index to get the link)
	inmate_link = list_columns[0].xpath('./a')[0]["href"] 
	
	# remember that we set base_url to contain the site's base address. we append 
	# inmate_link to it to get the absolute address to the inmate page
	inmate_page = Nokogiri::HTML(open("#{base_url}#{inmate_link}"))
	
	# everything is inside a <td> with a class="content" attribute, so let's set a variable
	# to hold the table rows inside
	
	content_table_rows = inmate_page.xpath("//td[@class='content']/table/tr")
	
	# the xref number appears to be in the third row and in the third cell
	# again, we're still using the inmate variable to hold the data associated with an inmate
	
	inmate["xref"] = content_table_rows[2].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
	# the strip method removes leading and trailing whitespace, such as spaces, tabs and carriage returns
	
	inmate['booking_number'] = content_table_rows[3].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
	inmate['arresting_agency'] = content_table_rows[13].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
	inmate['total_bail'] = content_table_rows[16].xpath("./td")[2].content.gsub(/(\302\240)|\s|\n|\r/, ' ').strip
Total bail gets an extra gsub condition because there are a few cases where carriage returns are in the table cell, which causes issues when we later try to import the result into a tab-delimited file/spreadsheet.

OK, so we collected the basic info about each inmate. Now, we want to collect the charges leveled against them. This is a little bit trickier. If you inspect the table-cell containing the charges, you’ll see that the charge listing itself is a table. The first row of the table lists the case number and type of arrest (warrant, or fresh pickup). Below that is a list of charges, with each charge taking up two rows, like so:

1st Row 1st Cell: Charge code (e.g. PC 459) 2nd Cell: Charge severity (e.g. Felony)
2nd Row: Charge description (e.g. “Burglary”)

For most of the inmate listings, this is immediately followed by another row listing the bail amount.

However, there are a few inmates who are held on more than one charge. And there are some who are being held on multiple charges stemming from multiple warrants, such as this person here, who appears to have racked up a number of public nuisance accusations, including evading ticket fare and prohibited public drinking. In his case, the charge listing is one row after another, and each row could mention the case, the agency that issued the warrant, the charge, or the bail amount per warrant.

My point here is that you won’t be able to predict that the third row, for instance, always contains the charge code and severity. But using Inspect Element, we see that the table cells containing the code, severity, and description have class attributes “cellTopLeft”, “cellTopMiddle” and “cellBottom”, respectively. The bail amount per case is in the cell with class “cellBail”…but we’re not interested in bail per case, so we’ll ignore it.

We’re going to loop through the rows inside this table, and if a row contains a td cell of class “cellTopLeft”, we know that this row will contain the code and severity of a charge. We’re going to assume that the row immediately following it has a cell with class “cellBottom,” which contains the description.

Processing this sub-table of charges will require its own loop. And since each inmate could have more than one charge, we need to store “charges” inside our inmate hash…charges will point to an array. And each item in the charges array will itself be a hash, with keys of “code”, “severity”, and “description.”

Confusing? Well, here’s a quick diagram of what we have so far, in terms of variables:

inmates		=> an array of Hashes...
				inmate = inmates[index] (each inmate is a Hash)
			=> inmate['first_name'] => inmate's first name
			=> inmate['last_name']  => inmate's last name
			=> inmate['xref'] 		=> inmate's xref
			... all the other attributes
			=> inmate['charges']  =>  an array of hashes
						charge = inmate['charges'][charge_index] (each charge is a Hash)
						charge['code']			=> charge's code
						charge['severity']			=> charge's severity
						charge['description']	=> charge's description
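Written out as actual Ruby, the structure in that diagram looks something like the following sketch. The inmate, the xref and the second charge are placeholders I made up; the first charge reuses the example values from above.

```ruby
inmates = [
  {
    'first_name' => 'JOHN',      # made-up inmate
    'last_name'  => 'DOE',
    'xref'       => '0000001',   # placeholder xref
    'charges'    => [
      { 'code' => 'PC 459', 'severity' => 'Felony', 'description' => 'Burglary' },
      { 'code' => 'XX 000', 'severity' => 'Misdemeanor', 'description' => 'Placeholder charge' }
    ]
  }
]

# Walk the outer array, then each inmate's inner array of charges
inmates.each do |inmate|
  inmate['charges'].each do |charge|
    puts "#{inmate['xref']}\t#{charge['code']}\t#{charge['severity']}\t#{charge['description']}"
  end
end

first_code = inmates[0]['charges'][0]['code']   # drill down with one index or key per level
puts first_code
```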

The loop to fill out that charge array is as follows:

	# first, grab the entire table of charges that exists in the 16th row of the main content table
	table_of_charges = content_table_rows[15].xpath("./td")[2]
	
	# and give this inmate an array of charges
	inmate['charges'] = []
	
	# Now, collect all rows that have a td with class "cellTopLeft"
	charge_1st_rows =  table_of_charges.xpath(".//tr[td[@class='cellTopLeft']]")
	
	# Now, collect all rows that have a td with class "cellBottom"
	charge_2nd_rows = table_of_charges.xpath(".//tr[td[@class='cellBottom']]")
	
	# OK, you should do some basic error checking here. We expect the arrays of charge_1st_rows and charge_2nd_rows to have
	# equal length, since each charge has a code, severity and description, right?
	
	# If not, that means our assumption was wrong, and you should do something...like exit the script and re-examine your
	# datasource and assumptions about it. But I'll skip that for now
	
	charge_1st_rows.to_a.each_index do |charge_row_index|
	
		# we found a row with a charge, so let's create a new hash that will hold the charge's attributes
		hash_of_inmate_charge = {}
		
		charge_1st_row = charge_1st_rows[charge_row_index]
		hash_of_inmate_charge['code'] = charge_1st_row.xpath('.//td')[0].content.gsub(/\302\240/, ' ').strip
		hash_of_inmate_charge['severity'] = charge_1st_row.xpath('.//td')[1].content.gsub(/\302\240/, ' ').strip
			
		# we assume that the row, with the same index in the charge_2nd_rows array will be the description of the charge
		# listed in charge_1st_rows
			
		hash_of_inmate_charge['description'] = charge_2nd_rows[charge_row_index].xpath('.//td')[0].content.gsub(/\302\240/, ' ').strip
		
		
		# push this hash on to the array of inmate charges:
		inmate['charges'] << hash_of_inmate_charge	
		
	end
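The error check that the comments above skip could look roughly like this. It's only a sketch: check_charge_rows is a hypothetical helper, and plain arrays stand in for the two NodeSets so that it runs without Nokogiri.

```ruby
# Hypothetical helper: complain loudly if the two lists of rows don't pair up
def check_charge_rows(first_rows, second_rows)
  if first_rows.length != second_rows.length
    raise "Expected one description row per charge row, but got " \
          "#{first_rows.length} charge rows and #{second_rows.length} description rows"
  end
  true
end

# Stand-ins for charge_1st_rows and charge_2nd_rows; one description is missing
charge_1st_rows = ['code/severity row A', 'code/severity row B']
charge_2nd_rows = ['description row A']

mismatch_message = nil
begin
  check_charge_rows(charge_1st_rows, charge_2nd_rows)
rescue RuntimeError => e
  mismatch_message = e.message
  puts "Skipping this inmate's charges: #{mismatch_message}"
end
```

Whether you skip the inmate, as here, or stop the whole script is a judgment call; the important part is finding out that your assumption about the page was wrong.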

Well, we've collected all the relevant inmate information, and if our assumptions were right, each of the inmate's charges. We've reached the end of the loop that examines each row in the main inmate listing. Our script will go on to the next inmate and collect his/her info. And so on until it has reached the end of the list. Here's all the code so far:

		require 'rubygems'
		require 'nokogiri'
		require 'open-uri'
		inmates_array = []
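		base_url='' 		# empty: reads the mirrored HTML files from the current directory; use 'http://danwin.com/static/jail-list/' to fetch them from the web
		inmate_listing = Nokogiri::HTML(open("#{base_url}current_listing.cfm.html"))

		inmate_rows = inmate_listing.xpath("//td[@class='content']/table")[0].xpath(".//tr").to_a[1..-1]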
		inmate_rows.each_index do |i|
			inmate_row = inmate_rows[i]		
			inmates_array[i] = {}
			inmate = inmates_array[i]


			list_columns = inmate_row.xpath('./td')		
			the_inmate_name =  list_columns[0].content.gsub(/\302\240/, ' ').strip.split(',')
			inmate['last_name'] = the_inmate_name[0]					# the name before the comma

			inmate['first_name'] = the_inmate_name[1].split(' ')[0]		# the name after the comma, but before the next space
			inmate['middle_name'] = the_inmate_name[1].split(' ')[1..-1].join(' ') if the_inmate_name[1].split(' ').length > 1	# only when something follows the first name



			inmate['sex'] = list_columns[1].content		
			inmate['dob'] = list_columns[2].content

			inmate['intake_time'] = list_columns[3].content


			if list_columns[0].xpath('./a').length == 0 
				inmate['release_date'] = list_columns[4].content.gsub(/\302\240/, ' ').split(' ')[1]
			else

				inmate_link = list_columns[0].xpath('./a')[0]["href"] 
				inmate_page = Nokogiri::HTML(open("#{base_url}#{inmate_link}"))
				content_table_rows = inmate_page.xpath("//td[@class='content']/table/tr")

		    if content_table_rows.length > 0


			  	inmate["xref"] = content_table_rows[2].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
		  		inmate['booking_number'] = content_table_rows[3].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
		  		inmate['arresting_agency'] = content_table_rows[13].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
		  		inmate['total_bail'] = content_table_rows[16].xpath("./td")[2].content.gsub(/\302\240/, ' ').gsub(/\s|\n|\r/, ' ').strip

		  		table_of_charges = content_table_rows[15].xpath("./td")[2]
		  		inmate['charges'] = []

		  		charge_1st_rows =  table_of_charges.xpath(".//tr[td[@class='cellTopLeft']]")
		  		charge_2nd_rows = table_of_charges.xpath(".//tr[td[@class='cellBottom']]")

		  		charge_1st_rows.collect{|x| x}.each_index do |charge_row_index|

		  			hash_of_inmate_charge = {}

		  			charge_1st_row = charge_1st_rows[charge_row_index]
		  			hash_of_inmate_charge['code'] = charge_1st_row.xpath('.//td')[0].content.gsub(/\302\240/, ' ').strip
		  			hash_of_inmate_charge['severity'] = charge_1st_row.xpath('.//td')[1].content.gsub(/\302\240/, ' ').strip
		  			hash_of_inmate_charge['description'] = charge_2nd_rows[charge_row_index].xpath('.//td')[0].content.gsub(/\302\240/, ' ').strip

		  			# push this hash on to the array of inmate charges:
		  			inmate['charges'] << hash_of_inmate_charge	

		  		end
				end # end if content_table_rows

			end
		end
	

Storing your Data into a File

At this point in your script, all your carefully collected data is in memory. When the script finishes execution, it disappears. That defeats the purpose of any way of tracking data. So let's store it in a persistent way...my choice would be in some kind of database, like MySQL or SQLite. But for our purposes, we can quickly learn the methods to store this information in a tab-delimited file that can be opened as an Excel spreadsheet.

We will be using Ruby's File class:


			##write to file
			File.open("inmates.txt", 'w'){ |f| 

				f.write("first_name\tmiddle_name\tlast_name\tsex\tdob\tintaketime\trelease_date\txref\tbooking_number\tarresting_agency\ttotal_bail\n")

				inmates_array.each do |inmate|

			f.write("#{inmate['first_name']}\t#{inmate['middle_name']}\t#{inmate['last_name']}\t#{inmate['sex']}\t#{inmate['dob']}\t#{inmate['intake_time']}\t#{inmate['release_date']}\t#{inmate['xref']}\t#{inmate['booking_number']}\t#{inmate['arresting_agency']}\t#{inmate['total_bail']}\n")

				end
			}

A quick explanation. The File class has the open method, to which we pass in two arguments: the name of the file we want to write to, and the mode. In this case, we're using 'w', which stands for "write" mode. The curly braces set off the code that gets executed while this File is open, with the variable f referring to the actual file.

File also has an instance method called write, which takes in a String as an argument to write to the open file.

Backslash-t will write a tab, and backslash-n will write a newline character.
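As a stand-alone aside, here's a minimal sketch of the same File.open pattern that you can run without scraping anything. It writes two tab-delimited lines to a throwaway file in a temporary directory (sample.txt and the record are made up), then reads them back to show that the tabs and newlines ended up where we wanted them.

```ruby
require 'tmpdir'

rows_read = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, 'sample.txt')   # hypothetical file name

  File.open(path, 'w') { |f|
    f.write("first_name\tlast_name\n")  # header row
    f.write("JOHN\tDOE\n")              # one made-up record
  }  # the file is closed automatically when the block ends

  # Each line becomes an array of columns once we split on the tab character
  rows_read = File.readlines(path).map { |line| line.chomp.split("\t") }
end

puts rows_read.inspect
```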

The next block of code is similar to the first...but it refers to a "charges.txt" file. Remember that each inmate could have more than one charge to his/her name. The following file lists every charge, but also lists the xref key to tie back into inmates.txt. For convenience's sake, we're also going to print out the inmate name and the inmate's total bail on each line.


			File.open("charges.txt",'w'){ |f|
			  f.write("name\txref\ttotal_bail\tcode\tseverity\tdescription\n")

			  inmates_array.each do |inmate|	  
				  if inmate['charges']
				    inmate['charges'].each do |charge|
			  	    f.write("#{inmate['first_name']} #{inmate['last_name']}\t#{inmate['xref']}\t#{inmate['total_bail']}\t#{charge['code']}\t#{charge['severity']}\t#{charge['description']}\n")
			      end
			    end
				end

			}
		

Printing out the inmate's name and total bail, although redundant, allows us to quickly skim the list to see if there were any unusual crimes connected to unusual amounts of bail (note that the jail site does not break down bail amounts per charge).
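Since the data is also sitting in inmates_array, you don't have to skim by eye. Here's a rough sketch of finding the inmate with the largest total bail. Two assumptions to flag: I'm guessing that the bail text looks something like "$10,000.00" (check the real pages before trusting this), and bail_to_number is a hypothetical helper. The inmates are made up.

```ruby
# Made-up sample in the same shape as inmates_array
inmates_array = [
  { 'first_name' => 'JOHN', 'last_name' => 'DOE', 'total_bail' => '$10,000.00' },
  { 'first_name' => 'JANE', 'last_name' => 'ROE', 'total_bail' => '$250,000.00' },
  { 'first_name' => 'JIM',  'last_name' => 'POE', 'total_bail' => nil }   # e.g. a released inmate
]

# Hypothetical helper: keep only digits and the decimal point, then convert.
# nil or non-numeric text comes out as 0.0.
def bail_to_number(bail_text)
  bail_text.to_s.gsub(/[^\d.]/, '').to_f
end

highest = inmates_array.max_by { |inmate| bail_to_number(inmate['total_bail']) }
puts "#{highest['first_name']} #{highest['last_name']}: #{highest['total_bail']}"
```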

Putting it all together for the real world

The above code, put all together, will execute cleanly and compile some nice text files for you, especially if you've saved the package of HTML files onto your hard drive. But in the real world, you'll be targeting an internet server, which may not like you hitting it at a rate of five times per second, or which may intermittently fail.

To deal with this, I've added a call to Ruby's sleep method, which pauses script execution for a given number of seconds. I've also thrown in some error-handling. Here's the basic structure:

		# some code
		begin
			# risky code here
			# The Ruby interpreter will watch the code that gets executed within the begin branch...if something goes wrong, it's going to execute code in the following rescue branch
	
		rescue
			# the begin-branch messed up, time to run some other code
			puts "An error happened!"
		else
			# this code gets executed if the begin-branch worked fine
		ensure
			# this code in the ensure branch (which is optional) runs no matter what.
			puts "We're done with our error handling"
		end
	
		

Read more about error-handling here.
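Here's one way sleep and begin/rescue can work together: try the risky step, and if it fails, pause and try again a limited number of times. This is a sketch, not the approach the final script takes (that one just skips the inmate). with_retries is a hypothetical helper, and the "fetch" is simulated so the example runs without a network connection.

```ruby
# Hypothetical helper: run the block, retrying up to max_attempts times
def with_retries(max_attempts, pause_seconds)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError => e
    if attempts < max_attempts
      puts "Attempt #{attempts} failed (#{e.message}); sleeping #{pause_seconds} seconds"
      sleep pause_seconds
      retry   # jump back to the top of the begin branch
    else
      puts "Giving up after #{attempts} attempts"
      nil
    end
  end
end

# Simulated flaky fetch: fails on the first two attempts, then succeeds
result = with_retries(3, 0) do |attempt|
  raise "connection dropped" if attempt < 3
  "page fetched on attempt #{attempt}"
end

puts result
```

In a real run you would put the open call inside the block and pass a pause of a few seconds rather than 0.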

And finally, I'm going to make a few alterations to the script so that it'll run repeatedly, once every half hour (essentially, by sleeping a half hour after going through the list). This is the crudest way to schedule a script, but it'll work for now. It will also use another method of File: readlines.

Each half hour, it's likely that the list of inmates will be the same. So a crude way to reduce the number of repeat listings is to check the inmates.txt file (using the match method) to see if a given inmate's name is already in there. This gets slower as inmates.txt grows. Like I said, it's crude. I prefer using a database, which is a topic outside the scope of this tutorial.

So I've taken the code above and split it into five parts:

  1. the process_inmate_row method - This method takes in a single row from the list of inmates and reads the basic information, including name, sex, and date of birth. It takes in as its second argument the entire text of inmates.txt and sees if inmates.txt already contains the name. If so, it will return nil. If not, it will return a hash of inmate data.

    Note: As said previously, constantly searching the entire inmates.txt file is incredibly inefficient. And, what happens if two John Smiths are arrested in the same time period? The name-check will fail to differentiate inmates of similar names (an even better match method would involve using the date of birth). But I leave it as an exercise for you to develop a more efficient method, which could involve a database. Or storing the name columns of inmates.txt into an array.

    But the reason why we're doing the name-check is to save us the time of entering an inmate's page. And, of course, to not fill the inmates.txt file with duplicate entries.

  2. the process_inmate_page_link method - The code that fetches an inmate's individual page and then processes the extra data, including the total bail amount and charges, is done here. It returns a hash of the inmate data.
  3. the write_to_file method - This code invokes the File.open methods and, for each inmate and charge, writes a tab-delimited line to the inmates.txt and charges.txt files
  4. the check_the_site method - This is the master method. It retrieves the list of inmates from the jail site and then on each row of inmate data, calls all the previously defined methods. It also has some basic error handling. If something happens, like your internet connection drops in the middle of a page retrieval, it skips the current inmate and moves on. This is better than just crashing.
  5. The main execution loop - All the code previously written out as methods will do nothing unless you actually invoke the methods. So we initialize a variable, called hours, to zero and while that is less than 24, we run the check_the_site method. After check_the_site finishes, hours is incremented and the script sleeps for an hour (3600 seconds).
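The note under item 1 leaves a more efficient name check as an exercise. One possible direction, sketched under assumptions: read inmates.txt once with readlines, keep the three name columns of every line in a Set, and ask the Set whether a name is already known. load_known_names is a hypothetical helper, the sample rows are made up, and this is still a name-only check, so it has the same two-John-Smiths problem unless you add the date of birth to the key.

```ruby
require 'set'
require 'tmpdir'

# Hypothetical helper: collect "first\tmiddle\tlast" keys from an existing inmates.txt
def load_known_names(path)
  return Set.new unless File.exist?(path)

  keys = File.readlines(path).map do |line|
    # -1 keeps empty columns, so a missing middle name still takes up a slot
    line.chomp.split("\t", -1)[0, 3].join("\t")
  end
  Set.new(keys)   # a header row, if present, just becomes a key no inmate will match
end

known = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, 'inmates.txt')
  File.open(path, 'w') { |f|
    f.write("first_name\tmiddle_name\tlast_name\tsex\n")
    f.write("JOHN\t\tDOE\tM\n")   # made-up record with no middle name
  }
  known = load_known_names(path)
end

puts known.include?("JOHN\t\tDOE")   # already recorded
puts known.include?("JANE\t\tROE")   # not seen before
```

Looking up a key in a Set stays fast as the file grows, whereas matching against the full text gets slower with every line added.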
Here's the final code, which will be reading from the mirrored archive list I've provided here. So obviously, running the main collection loop more than once is pointless as my list is static...but at least it's practice. You can download a zipped archive of the files here.

require 'rubygems'
require 'nokogiri'
require 'open-uri'


def process_inmate_row(inmate_row, inmate_text)
  
	list_columns = inmate_row.xpath('./td')		
	inmate = {}
	the_inmate_name =  list_columns[0].content.gsub(/\302\240/, ' ').strip.split(',')
	inmate['last_name'] = the_inmate_name[0]					# the name before the comma  
	inmate['first_name'] = the_inmate_name[1].split(' ')[0]		# the name after the comma, but before the next space
	inmate['middle_name'] = the_inmate_name[1].split(' ')[1..-1].join(' ') if the_inmate_name[1].split(' ').length > 1	# only when something follows the first name
  
  # at this point, we can determine if the inmate is already in our textfile
  
  name_to_match="#{inmate['first_name']}\t#{inmate['middle_name']}\t#{inmate['last_name']}"  
   # remember that in the text file, we tab-delimited the name, so we have to match that pattern
	
  
   if inmate_text.match(name_to_match)
     puts "NOT adding inmate #{name_to_match} to inmates txt, as it already exists"
     inmate = nil
     # the method that invoked process_inmate_row will only add the inmate if it is not nil
     # we DON'T want this inmate added, so that's why we're setting it to nil
		
   else  
     
     	puts "Adding inmate #{name_to_match} to inmates txt"
 		  inmate['sex'] = list_columns[1].content		
   	  inmate['dob'] = list_columns[2].content
   	  inmate['intake_time'] = list_columns[3].content
	    puts "Basic info of inmate: #{inmate['first_name']} #{inmate['last_name']}: #{inmate['dob']}"
       
  end
  
  
  return inmate
  
end


def process_inmate_page_link(inmate_link)
  	inmate_page = Nokogiri::HTML(open(inmate_link))
		content_table_rows = inmate_page.xpath("//td[@class='content']/table/tr")

    more_inmate_stuff = {}
    
    if content_table_rows.length > 0
      
	  	more_inmate_stuff["xref"] = content_table_rows[2].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
  		
  		more_inmate_stuff['booking_number'] = content_table_rows[3].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
  		more_inmate_stuff['arresting_agency'] = content_table_rows[13].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
  		more_inmate_stuff['total_bail'] = content_table_rows[16].xpath("./td")[2].content.gsub(/\302\240/, ' ').gsub(/\s|\n|\r/, ' ').strip

  		puts "Found more inmate info, total-bail: #{more_inmate_stuff['total_bail']} arresting-agency: #{more_inmate_stuff['arresting_agency']}"


  		table_of_charges = content_table_rows[15].xpath("./td")[2]
  		more_inmate_stuff['charges'] = []

  		charge_1st_rows =  table_of_charges.xpath(".//tr[td[@class='cellTopLeft']]")
  		
  		puts "Number of charges: #{charge_1st_rows.length}"
  		charge_2nd_rows = table_of_charges.xpath(".//tr[td[@class='cellBottom']]")

  		charge_1st_rows.collect{|x| x}.each_index do |charge_row_index|

  			hash_of_inmate_charge = {}

  			charge_1st_row = charge_1st_rows[charge_row_index]
  			hash_of_inmate_charge['code'] = charge_1st_row.xpath('.//td')[0].content.gsub(/\302\240/, ' ').strip
  			hash_of_inmate_charge['severity'] = charge_1st_row.xpath('.//td')[1].content.gsub(/\302\240/, ' ').strip
  			hash_of_inmate_charge['description'] = charge_2nd_rows[charge_row_index].xpath('.//td')[0].content.gsub(/\302\240/, ' ').strip

  			# push this hash on to the array of inmate charges:
  			more_inmate_stuff['charges'] << hash_of_inmate_charge	
    
        # to_a gives an array of [key, value] pairs, and join flattens them into one line
        puts hash_of_inmate_charge.to_a.join(" | ")
  		end
  		
    else
      puts "Could not find more inmate info"
    end # end if content_table_rows
  
    return more_inmate_stuff
end

def write_to_file(inmate)

    ##write to file
    puts "Writing to inmates.txt"
    
    # note that we use the 'a+' mode here, which appends new input onto the end of an existing file (or creates a new one if it doesn't exist) instead of overwriting it
    # Obviously, we don't want to keep overwriting inmates.txt if we intend it to be a persistent record of the inmate log
    
    File.open("inmates.txt", 'a+'){ |f| 
      # we don't want to repeatedly print the column headers, so only write them while the file is still empty
      # the last column, recorded_at, labels the Time.now value that ends every row
      f.write("first_name\tmiddle_name\tlast_name\tsex\tdob\tintake_time\trelease_date\txref\tbooking_number\tarresting_agency\ttotal_bail\trecorded_at\n") unless File.size(f) > 0
      f.write("#{inmate['first_name']}\t#{inmate['middle_name']}\t#{inmate['last_name']}\t#{inmate['sex']}\t#{inmate['dob']}\t#{inmate['intake_time']}\t#{inmate['release_date']}\t#{inmate['xref']}\t#{inmate['booking_number']}\t#{inmate['arresting_agency']}\t#{inmate['total_bail']}\t#{Time.now}\n")
   
    }
   
    puts "Writing to charges.txt"

    File.open("charges.txt",'a+'){ |f|
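      # we don't want to repeatedly print the column headers, so only write them while the file is still empty
      # the last column, recorded_at, labels the Time.now value that ends every row
      f.write("name\txref\ttotal_bail\tcode\tseverity\tdescription\trecorded_at\n") unless File.size(f) > 0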
      
  	  if inmate['charges']
  	    inmate['charges'].each do |charge|
  	      puts "Writing charge: #{charge['description']}"
    	    f.write("#{inmate['first_name']} #{inmate['last_name']}\t#{inmate['xref']}\t#{inmate['total_bail']}\t#{charge['code']}\t#{charge['severity']}\t#{charge['description']}\t#{Time.now}\n")
        end
      end
  
    }
end
  
  

def check_the_site(base_url, index_url)
  # read the contents of inmates.txt into a variable so that we can check to see if an inmate already exists
   inmate_text = File.exist?("inmates.txt") ? File.read("inmates.txt") : ''
   inmates_added_count = 0 # just a piece of info we want to keep track of. We'll increment this number on each successful add
   
    
    begin	
      # URI.open comes from open-uri; on Ruby 3 and later, a bare open() no longer fetches URLs
      inmate_listing = Nokogiri::HTML(URI.open("#{base_url}#{index_url}"))
    rescue Exception=>e
      puts "Oops, had a problem getting the inmates list at #{Time.now}"
      return nil #get out of here.
    end
      
    # convert the rows to an array and drop the first one, which is the table's header row
    inmate_rows = inmate_listing.xpath("//td[@class='content']/table")[0].xpath(".//tr").to_a[1..-1]
    puts "There are #{inmate_rows.length} rows to process"
    inmate_rows.each_index do |i|
  
      puts "\nProcessing inmate row: #{i}"
      inmate_row = inmate_rows[i]
      
      begin
        # The following code is potentially risky; we're making calls to process_inmate_row and process_inmate_page_link, two methods that could potentially throw an error if the data is improperly formatted or if the website refuses to send data
        
        # I've set up some rudimentary error handling to notify you of an error, but to keep chugging along to the next row
        
        inmate = process_inmate_row(inmate_rows[i], inmate_text)
        
        # process_inmate_row will return a hash of inmate data 
        # BUT, it will return nil if it turns out this inmate already exists
        # so here's another if branch to check for that
        
        if inmate.nil?
          # do nothing
        else  
          # inmate was not nil, so let's continue
          list_columns = inmate_row.xpath('./td')		
        	if list_columns[0].xpath('./a').length == 0 
        		inmate['release_date'] = list_columns[4].content.gsub(/\302\240/, ' ').split(' ')[1]
        		puts "inmate was released on #{inmate['release_date']}"
        	else
        	  inmate_link = list_columns[0].xpath('./a')[0]["href"] 
        	  inmate_link = "#{base_url}#{inmate_link}"
        	  puts "Fetching: #{inmate_link}"
            more_inmate_attributes = process_inmate_page_link(inmate_link)
            inmate.merge!(more_inmate_attributes)
          end
    
          
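        end # end of the if inmate.nil? branch
      rescue Timeout::Error => e 
        # this has to come before the catch-all rescue below, or it would never be reached
        puts "Had a timeout error: #{e}"
        sleep(10)
      rescue StandardError => e
        # StandardError rather than Exception, so that Ctrl-C can still stop the script
        puts "Oops, had a problem getting data from inmate row #{i}, Error: #{e}"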
      else
        # got all the info for the inmate, so let's add him/her to the file
        
        unless inmate.nil?
          # remember that inmate was set to nil if it already existed in the text file
          # we don't want to write it to the file again in such a case, hence the 'unless'
          write_to_file(inmate)
          inmates_added_count += 1
          puts "We successfully queried the site, so let's sleep a second"
          sleep 1
        end

        
        
      end
    
    end
	
	  # reached the end, let's print a summary:
	  puts "#{Time.now}: Out of #{inmate_rows.length}, we added #{inmates_added_count} inmates"
	
end

   


hours = 0
BASE_URL='http://danwin.com/static/jail-list/'

while(hours < 24)
  puts "Checking the site (#{hours} out of 24 times):"
  puts "***********************"
  check_the_site(BASE_URL, 'current_listing.cfm.html')
  # run the code that hits the site, processes the links, and appends any new inmates to the text files
  
  
  
  hours += 1 # increment the counter, or this will run forever...
  puts "sleeping till next iteration"
  
  sleep_count = 0
  while(sleep_count < 1800)
    sleep(1) # sleep one second at a time; 1800 of these make a 30-minute wait (use 3600 for a full hour)
    sleep_count +=1
    puts "Will check again in #{(1800-sleep_count)/60} minutes" if sleep_count%60==0
  end
  
  
end

    

4/4/2010: This lesson remains unfinished, but the above code should execute. From it, you should end up with text files that show, at a glance, some of the more interesting circumstances under which these inmates were arrested. There are various kinds of analysis you could do over the long term. But figuring out why some inmates have bail set at $1,000,000 isn't easy; you need to know their prior criminal records too, which is what we hope to look up in the third tutorial in this series.
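
As one small example of that kind of analysis, here is a rough sketch that counts charges by severity. It assumes the column order that write_to_file uses for charges.txt (name, xref, total_bail, code, severity, description). The method name tally_by_severity, the sample rows, and the severity codes "F" and "M" are all made up for illustration; they are not taken from the jail site.

```ruby
# A sketch, not part of the scraper above: tally charges by severity.
# Assumes tab-delimited lines laid out like charges.txt:
#   name, xref, total_bail, code, severity, description
def tally_by_severity(lines)
  counts = Hash.new(0) # any severity we haven't seen yet starts at zero
  lines.each do |line|
    columns = line.chomp.split("\t")
    next if columns.length < 6      # skip blank or malformed lines
    next if columns[0] == 'name'    # skip the header row
    counts[columns[4]] += 1         # the severity is the fifth column
  end
  counts
end

# Made-up sample rows (NOT real data), in the same layout as charges.txt
SAMPLE_LINES = [
  "name\txref\ttotal_bail\tcode\tseverity\tdescription\n",
  "Jane Doe\t1234\t$5,000\tPC 459\tF\tBURGLARY\n",
  "Jane Doe\t1234\t$5,000\tPC 148\tM\tRESISTING ARREST\n",
  "John Roe\t5678\t$1,000\tVC 23152\tM\tDUI\n"
]

# print the most common severity first
tally_by_severity(SAMPLE_LINES).sort_by { |severity, count| -count }.each do |severity, count|
  puts "#{severity}: #{count}"
end
```

To try it on the real output, pass in File.readlines("charges.txt") instead of SAMPLE_LINES once the scraper has run at least once.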

I'm a programmer journalist, currently teaching computational journalism at Stanford University. I'm trying to do my new blogging at blog.danwin.com.