This is part 2 of a 4-part series in introductory coding for journalists. Go here for the first lesson. This lesson and code will still be verbose, but will have a lot less hand-holding than the previous one.
DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.
A note about privacy: This tutorial uses files that I archived from a real-world jail website. Though booking records are public record, I make no claims about the legal proceedings involving the inmates who happened to be in jail when I took my snapshot. For all I know, they could have all been wrongfully arrested and therefore don’t deserve to have their name attached in online perpetuity to erroneous charges (even if the site only purports to record who was arrested and when, and not any legal conclusions). For that reason, I’ve redacted the last names of the inmates and randomized their birthdates.
The Cops Reporter and the Log
If you’re a daily cops reporter, calling the police station to ask for the list of last night’s arrests is probably part of your job. Many papers have some kind of cops blotter where arrested suspects are listed, and both online and in print it is usually one of a paper’s top features. The St. Petersburg Times has a modern version of the feature, complete with mugshots and stats summaries.
Arrest logs have sometimes been criticized for being little more than voyeurism (here’s a discussion over the St. Pete’s mugshot site). But knowing who your law officers are arresting, and why, is essential to a nice, free society (and for a fair and efficient police force). And the more data you have as a reporter, the better you’ll be able to cover your beat.
Most pro-active police departments will announce when they’ve made high-profile arrests. But relying on the police to tell you which arrests are the most noteworthy kind of begs the question, and doesn’t give you the whole picture of arrest activity. Most states consider arrest logs to be public information (not that that stops some jurisdictions from hiding them). But a paper list or a PDF is hard to analyze. Luckily, some police departments are putting their work on the Web. They might be willing to send you a spreadsheet of arrest activity, but what if you wanted up-to-the-hour information, so that you could be aware of:
- Suspected crimes that fall between egregious and infamous (non-fatal assaults, robberies, car jackings, etc.)
- An abnormally large number of arrests at a given time
- Unusual types of suspected crimes at a given time
This is where the web-scraping you learned in my last tutorial gets useful. You’re going to have an automated way of collecting the latest arrests news, in an ordered fashion (so that you could, for example, find the inmate with the largest bail at a given time), and you’ll save yourself and your friendly police PIO tedious paper shuffling and typing.
I’m going to base my lesson on this sheriff department’s jail system. I’ve mirrored a snapshot of their site here (zip file here), so I recommend you run your scripts on my mirror (root directory: https://danwin.com/static/jail-list/) before doing a real-world test.
The jail web site has these characteristics:
- At this page is a list of every person booked in the last 24 hours
- The list typically has 100 to 200 inmates at a time
- Most entries in that list contain a link to an inmate’s page containing data including name, DOB, bail, charges, booking time.
- Each inmate has a unique identifying number called X-REF
- Not all entries have a link; inmates who have been released have only their names listed
The site is pretty useful and user-friendly. However, it’s hard to quickly glean any useful information from the main list. You have to click through each individual entry to find out why someone was jailed. The purpose of the following lesson is to automate that process so you can efficiently get the big picture of a jail’s activity.
Program flow will go something like this:
- Create two text files: one to store the list of inmates (inmates.txt), one to store the list of charges (charges.txt)
- Open the inmate listing page
- Collect each list entry
- If the list entry is not a link (i.e. the inmate has been released), record the release date
- Else, if the list entry is a link, open it
- Fetch first name, middle name, last name, xref, intake time, and DOB of an inmate
- Fetch and parse list of charges
- Fetch the bail amount
- In an each loop, for each inmate entry we collected above:
- Output inmate information, in tab-delimited format, into inmates.txt, including the XREF.
- Output the charges associated with the inmate into charges.txt. Each charge will take up one line, and the XREF of the inmate will also be included so as to provide a key to the associated inmate
File I/O
We didn’t cover opening and writing to an external text file in the last lesson, so here’s how it goes, briefly. Using Ruby’s File class (a subclass of IO), we’re going to create two files, inmates.txt and charges.txt, and write to them what we find on the jail’s website. We’ll be using the variables inmates_file and charges_file to refer to the external files.
To open the files and set the variables, use the File class’s new method, which takes two parameters: a string designating the file name, and a string designating the mode, which in this case will be “a”: write-only, appending to the end of the file (read about the various modes here).
inmates_file = File.new('inmates.txt', 'a')
charges_file = File.new('charges.txt', 'a')
If these files don’t already exist, they will now. If they did, the ‘a’ mode will append new content to the end of the file.
To write something to the file, use the puts method, which writes whatever string you supply to it as one line in the file (we’ve used this method without the IO class, in which case it outputs to the screen):
charges_file.puts("Adding a new line of text to the charges file.")
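To see the append behavior for yourself without touching the real files, here is a minimal sketch. The helper name append_line and the scratch file name are made up for this illustration; only File.new, puts, and close are the calls discussed above.

```ruby
require 'tmpdir'

# append_line is a hypothetical helper name, used only for this illustration
def append_line(path, text)
  file = File.new(path, 'a') # 'a' creates the file if needed, otherwise appends
  file.puts(text)            # puts adds the trailing newline for us
  file.close                 # closing flushes any buffered text to disk
end

# Work inside a throwaway directory so nothing is left behind
Dir.mktmpdir do |dir|
  path = File.join(dir, 'charges_demo.txt')
  append_line(path, 'first line')
  append_line(path, 'second line') # lands after the first line, not on top of it
  print File.read(path)
end
```

Closing the file (or using the block form of File.open, which we'll meet later) matters: text written with puts may sit in a buffer until the file is closed.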
While we’re setting up, let’s create an array of hashes, with each hash object holding an inmate and his/her information. We don’t have to do this…we could just output to the file each inmate record as we get to it, but this will allow us some flexibility later. All we have to do is initialize the array:
inmates_array = []
Open the inmate listing page
Now let’s fetch the inmates listing. We’ll be using Nokogiri in the same fashion we did in the last lesson, beginning by requiring the nokogiri and open-uri libraries, then using the Open-URI’s open method to fetch the page, and then Nokogiri’s HTML class to wrap up the page in a parsable format.
require 'rubygems'
require 'nokogiri'
require 'open-uri'

base_url = 'https://danwin.com/static/jail-list/' # all links on the list will be relative to this address
# URI.open is how open-uri fetches a web address in current versions of Ruby
inmate_listing = Nokogiri::HTML(URI.open("#{base_url}current_listing.cfm.html"))
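Note the #{base_url} inside the double-quoted string: this is string interpolation, which inserts a variable’s value into a string. These two lines produce the same result:
a_combined_string = "Hello " + string2
a_combined_string = "Hello #{string2}"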
Read more about Ruby’s string interpolation here.
Let’s visit the page with a browser and examine the structure. The list is an HTML table, with each row containing several columns, the first column being the inmate’s full name and, if the inmate hasn’t been released, a link to his/her booking page.
If you inspect the HTML closely, you’ll see that this page is composed of several tables. What we want is the table contained inside the <td> element with a class of “content.”
So we’ll collect all the table rows, using Nokogiri’s xpath method, and iterate through them using an each loop. We’re going to use a variation of an each loop called each_index, which provides the numerical index of the current iteration we’re on.
inmate_rows = inmate_listing.xpath("//td[@class='content']/table")[0].xpath(".//tr").to_a[1..-1]
The XPath syntax here is looking for a td element with class=’content’, then the table inside of that. There’s more than one, but the first one on the page has the data. From that, we gather all the rows (tr) within it. We call the to_a method to convert the result into an array, since Nokogiri’s xpath method returns a NodeSet, which won’t have the each_index method; [1..-1] then keeps everything but the first row. each_index loops through an array, just like each, but it provides the index of the current iteration.
inmate_rows.each_index do |i|
  inmate_row = inmate_rows[i]
  inmates_array[i] = {}
  inmate = inmates_array[i]
  # each row has a set of columns with the inmate info
  list_columns = inmate_row.xpath('./td')
Because we know we’re on the ith row, we can also initialize the ith index in inmates_array as a hash to store the ith inmate’s information. Remember that each element in the inmates_array is going to be a hash of information.
Let’s use the variable named inmate as a shorthand way to refer to this position in the inmates_array. Each time we iterate through the loop, inmate will refer to the next spot in the inmates_array.
This is easier to type out 10 times than inmates_array[i].
Before we get to visiting the individual inmate pages, let’s just collect the name and other information readily available here.
Each name consists of a String in this format: last_name, first_name middle_name
So let’s use the String split method. First to split the string by comma; this will give us an array with the first element being what’s on the left side of the comma. Splitting the second element of that array, with a space, will give us another array, consisting of a first name and middle name.
  # remember that you need to call Nokogiri's content method to get the text, as a String, inside a tag
  the_inmate_name = list_columns[0].content.gsub(/\302\240/, ' ').strip.split(',')
  inmate['last_name'] = the_inmate_name[0] # the name before the comma
  given_names = the_inmate_name[1].split(' ') # everything after the comma, split on spaces
  inmate['first_name'] = given_names[0] # the name after the comma, but before the next space
  inmate['middle_name'] = given_names[1..-1].join(' ') if given_names.length > 1
I’m going to be using this method call after each use of content: gsub(/\302\240/, ' ').strip
Not all entries have a middle name. So we use the if given_names.length > 1 conditional statement to tell Ruby to skip this line if given_names holds only a first name.
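If you want to try the name-splitting logic without fetching any pages, here is a small standalone sketch. The helper name parse_inmate_name and the sample names are made up; it assumes names arrive in the "last_name, first_name middle_name" format described above.

```ruby
# parse_inmate_name is a hypothetical helper; the sample names are made up
def parse_inmate_name(raw)
  last, given = raw.strip.split(',', 2) # split on the first comma only
  given_names = given.to_s.split(' ')   # an empty array when nothing follows the comma
  {
    'last_name'   => last.to_s.strip,
    'first_name'  => given_names[0],
    # only set a middle name when something follows the first name
    'middle_name' => given_names.length > 1 ? given_names[1..-1].join(' ') : nil
  }
end

p parse_inmate_name('DOE, JOHN QUINCY')
p parse_inmate_name('DOE, JANE')
```

Returning nil for a missing middle name is a convenient choice here, because nil interpolates into a string as nothing at all, which leaves the middle-name column empty when we write the tab-delimited file.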
  # Moving on to the next table cell, which will be the 1 spot in list_columns
  inmate['sex'] = list_columns[1].content
  # next cell, DOB
  inmate['dob'] = list_columns[2].content
  # next cell, booking time
  inmate['intake_time'] = list_columns[3].content

  # let's go back to the first column to see if it contained a link
  if list_columns[0].xpath('./a').length == 0 # if there was no link, there would be 0 links returned
    # No link to visit, so this must have been a released inmate. Let's grab his/her release date
    # which comes in the pattern "Released mm/dd/yyyy"...so we'll split the string and capture the second term
    inmate['release_date'] = list_columns[4].content.gsub(/\302\240/, ' ').split(' ')[1]
  else
    # visit link
    # we'll get to this subroutine in the next section
  end
end
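A note on .gsub(/\302\240/, ' '): the bytes \302\240 are the UTF-8 encoding of the non-breaking space (&nbsp; in HTML), which strip alone does not remove. The gsub call swaps each one for an ordinary space.
Read more about this from Vita Ara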
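The gsub(/\302\240/, ' ') call that shows up after every use of content gets repetitive, so one option is to wrap it in a tiny helper. This is only a sketch: clean_cell is a made-up name, the sample string is invented, and it assumes the text is UTF-8, where the non-breaking space can be written as "\u00A0".

```ruby
# clean_cell is a hypothetical helper name
def clean_cell(text)
  # U+00A0 is the non-breaking space (&nbsp;); replace it first, because
  # strip on its own only trims ordinary whitespace
  text.gsub("\u00A0", ' ').strip
end

# a made-up cell value padded with non-breaking spaces
puts clean_cell("\u00A0Released 04/01/2010\u00A0")
```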
OK, that should’ve given you a refresher on arrays, hashes, XPath, and string manipulation. Now we’ll handle the case of when the first list_column array item does contain a link. It will involve fetching the page from that link and then more XPathing to pick out the wanted data.
At this time, go to the inmate list page and click on one of the inmate pages in the browser.
There’s a lot more information here; what will be most relevant to us right now is the X-Reference Number, charges, and bail. This next section of code will fit into the else branch of our previous section of code.
    # visit link (remember that the xpath method returns an array-like NodeSet, so we have to explicitly refer to
    # the 0th index to get the link)
    inmate_link = list_columns[0].xpath('./a')[0]["href"]

    # remember that we set base_url to contain the site's base address. we append
    # inmate_link to it to get the absolute address to the inmate page
    inmate_page = Nokogiri::HTML(URI.open("#{base_url}#{inmate_link}"))

    # everything is inside a <td> with a class="content" attribute, so let's set a variable
    # to hold the table rows inside
    content_table_rows = inmate_page.xpath("//td[@class='content']/table/tr")

    # the xref number appears to be in the third row and in the third cell
    # again, we're still using the inmate variable to hold the data associated with an inmate
    inmate["xref"] = content_table_rows[2].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
    # the strip method removes leading and trailing whitespace, such as tabs and carriage returns

    inmate['booking_number'] = content_table_rows[3].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
    inmate['arresting_agency'] = content_table_rows[13].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
    inmate['total_bail'] = content_table_rows[16].xpath("./td")[2].content.gsub(/(\302\240)|\s|\n|\r/, ' ').strip
OK, so we collected the basic info about each inmate. Now, we want to collect the charges leveled against them. This is a little bit trickier. If you inspect the table-cell containing the charges, you’ll see that the charge listing itself is a table. The first row of the table lists the case number and type of arrest (warrant, or fresh pickup). Below that is a list of charges, with each charge taking up two rows, like so:
1st Row | 1st Cell: Charge code (i.e. PC 459) | 2nd Cell: Charge severity (i.e. Felony) |
2nd Row: | Charge description (i.e. “Burglary”) |
For most of the inmate listings, this is immediately followed by another row listing the bail amount.
However, there are a few inmates who are held on more than one charge. And there are some who are being held on multiple charges stemming from multiple warrants, such as this person here, who appears to have racked up a number of public nuisance accusations, including fare evasion and prohibited public drinking. In his case, the charge listing is one row after another, and each row could mention the case, the agency that issued the warrant, the charge, or the bail amount per warrant.
My point here is that you won’t be able to predict that the third row, for instance, always contains the charge code and severity. But using Inspect Element, we see that the table cells containing the code, severity, and description have class attributes “cellTopLeft”, “cellTopMiddle” and “cellBottom”, respectively. The bail amount per case is in the cell with class “cellBail”…but we’re not interested in bail per case, so we’ll ignore it.
We’re going to loop through the rows inside this table, and if a row contains a td cell of class “cellTopLeft”, we know that this row will contain the code and severity of a charge. We’re going to assume that the row immediately following it has a cell with class “cellBottom,” which contains the description.
Processing this sub-table of charges will require its own loop. And since each inmate could have more than one charge, we need to store “charges” inside our inmate hash…charges will point to an array. And each item in the charges array will itself be a hash, with keys of “code”, “severity”, and “description.”
Confusing? Well, here’s a quick diagram of what we have so far, in terms of variables:
inmates => an array of Hashes...
  inmate = inmates[index] (each inmate is a Hash)
    => inmate['first_name'] => inmate's first name
    => inmate['last_name'] => inmate's last name
    => inmate['xref'] => inmate's xref
    ... all the other attributes
    => inmate['charges'] => an array of hashes
      charge = inmate['charges'][charge_index] (each charge is a Hash)
        charge['code'] => charge's code
        charge['severity'] => charge's severity
        charge['description'] => charge's description
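To make the diagram concrete, here is a hedged sketch of the same structure built by hand, plus the kind of question it lets you answer (who has the largest bail?). The names, xrefs, charge codes other than the PC 459 example, and dollar amounts are all made up, and bail_to_number is a hypothetical helper that assumes bail arrives as a string such as "$25,000.00".

```ruby
# Made-up sample data in the same shape as the diagram above
inmates = [
  { 'first_name' => 'JANE', 'xref' => '0000001', 'total_bail' => '$5,000.00',
    'charges' => [
      { 'code' => 'PC 459', 'severity' => 'Felony', 'description' => 'Burglary' }
    ] },
  { 'first_name' => 'JOHN', 'xref' => '0000002', 'total_bail' => '$25,000.00',
    'charges' => [
      { 'code' => 'PC 459', 'severity' => 'Felony', 'description' => 'Burglary' },
      { 'code' => 'XX 000', 'severity' => 'Misdemeanor', 'description' => 'Example charge' }
    ] }
]

# Hypothetical helper: strips "$" and "," so bail can be compared as a number
def bail_to_number(bail)
  bail.to_s.gsub(/[^\d.]/, '').to_f
end

# max_by returns the element for which the block gives the largest value
highest = inmates.max_by { |inmate| bail_to_number(inmate['total_bail']) }
puts "#{highest['first_name']} (xref #{highest['xref']}): #{highest['total_bail']}"
highest['charges'].each do |charge|
  puts "  #{charge['code']} | #{charge['severity']} | #{charge['description']}"
end
```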
The loop to fill out that charge array is as follows:
      # first, grab the entire table of charges that exists in the 16th row of the main content table
      table_of_charges = content_table_rows[15].xpath("./td")[2]

      # and give this inmate an array of charges
      inmate['charges'] = []

      # Now, collect all rows that have a td with class "cellTopLeft"
      charge_1st_rows = table_of_charges.xpath(".//tr[td[@class='cellTopLeft']]")
      # Now, collect all rows that have a td with class "cellBottom"
      charge_2nd_rows = table_of_charges.xpath(".//tr[td[@class='cellBottom']]")

      # OK, you should do some basic error checking here. We expect the arrays of charge_1st_rows and charge_2nd_rows to have
      # equal length, since each charge has a code, severity and description, right?
      # If not, that means our assumption was wrong, and you should do something...like exit the script and re-examine your
      # datasource and assumptions about it. But I'll skip that for now

      # to_a turns the NodeSet into an array so that we can call each_index on it
      charge_1st_rows.to_a.each_index do |charge_row_index|
        # we found a row with a charge, so let's create a new hash that will hold the charge's attributes
        hash_of_inmate_charge = {}
        charge_1st_row = charge_1st_rows[charge_row_index]
        hash_of_inmate_charge['code'] = charge_1st_row.xpath('.//td')[0].content.gsub(/\302\240/, ' ').strip
        hash_of_inmate_charge['severity'] = charge_1st_row.xpath('.//td')[1].content.gsub(/\302\240/, ' ').strip

        # we assume that the row with the same index in the charge_2nd_rows array will be the description of the charge
        # listed in charge_1st_rows; that row's first cell holds the description
        hash_of_inmate_charge['description'] = charge_2nd_rows[charge_row_index].xpath('.//td')[0].content.gsub(/\302\240/, ' ').strip

        # push this hash on to the array of inmate charges:
        inmate['charges'] << hash_of_inmate_charge
      end
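The error check that the comments above skip could look something like the following sketch. The helper name pair_charge_rows is made up, and the sample uses plain strings standing in for table rows; with real Nokogiri results you would call to_a on each NodeSet before passing it in.

```ruby
# pair_charge_rows is a hypothetical helper; it works on any two arrays
def pair_charge_rows(first_rows, second_rows)
  # Our assumption is one description row per code/severity row.
  # If the counts differ, stop instead of silently mismatching charges.
  unless first_rows.length == second_rows.length
    raise ArgumentError,
          "expected matching rows, got #{first_rows.length} and #{second_rows.length}"
  end
  first_rows.zip(second_rows) # [[first_row, second_row], ...]
end

# plain strings standing in for the two kinds of table rows
pairs = pair_charge_rows(['PC 459 / Felony'], ['Burglary'])
pairs.each { |top, bottom| puts "#{top} => #{bottom}" }

begin
  pair_charge_rows(['PC 459 / Felony'], [])
rescue ArgumentError => e
  puts "Assumption broken: #{e.message}"
end
```

Raising an error here is deliberate: a crash you notice is better than a charges file where descriptions are quietly attached to the wrong codes.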
Well, we've collected all the relevant inmate information, and if our assumptions were right, each of the inmate's charges. We've reached the end of the loop that examines each row in the main inmate listing. Our script will go onto the next inmate and collect his/her info. And so on until it has reached the end of the list. Here's all the code so far:
require 'rubygems'
require 'nokogiri'
require 'open-uri'

inmates_array = []
base_url = 'https://danwin.com/static/jail-list/'
inmate_listing = Nokogiri::HTML(URI.open("#{base_url}current_listing.cfm.html"))

# to_a converts the NodeSet to an array; [1..-1] keeps everything but the first row
inmate_rows = inmate_listing.xpath("//td[@class='content']/table")[0].xpath(".//tr").to_a[1..-1]

inmate_rows.each_index do |i|
  inmate_row = inmate_rows[i]
  inmates_array[i] = {}
  inmate = inmates_array[i]
  list_columns = inmate_row.xpath('./td')

  the_inmate_name = list_columns[0].content.gsub(/\302\240/, ' ').strip.split(',')
  inmate['last_name'] = the_inmate_name[0] # the name before the comma
  given_names = the_inmate_name[1].split(' ') # everything after the comma, split on spaces
  inmate['first_name'] = given_names[0] # the name after the comma, but before the next space
  inmate['middle_name'] = given_names[1..-1].join(' ') if given_names.length > 1

  inmate['sex'] = list_columns[1].content
  inmate['dob'] = list_columns[2].content
  inmate['intake_time'] = list_columns[3].content

  if list_columns[0].xpath('./a').length == 0
    inmate['release_date'] = list_columns[4].content.gsub(/\302\240/, ' ').split(' ')[1]
  else
    inmate_link = list_columns[0].xpath('./a')[0]["href"]
    inmate_page = Nokogiri::HTML(URI.open("#{base_url}#{inmate_link}"))
    content_table_rows = inmate_page.xpath("//td[@class='content']/table/tr")

    if content_table_rows.length > 0
      inmate["xref"] = content_table_rows[2].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
      inmate['booking_number'] = content_table_rows[3].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
      inmate['arresting_agency'] = content_table_rows[13].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
      inmate['total_bail'] = content_table_rows[16].xpath("./td")[2].content.gsub(/\302\240/, ' ').gsub(/\s|\n|\r/, ' ').strip

      table_of_charges = content_table_rows[15].xpath("./td")[2]
      inmate['charges'] = []
      charge_1st_rows = table_of_charges.xpath(".//tr[td[@class='cellTopLeft']]")
      charge_2nd_rows = table_of_charges.xpath(".//tr[td[@class='cellBottom']]")

      charge_1st_rows.to_a.each_index do |charge_row_index|
        hash_of_inmate_charge = {}
        charge_1st_row = charge_1st_rows[charge_row_index]
        hash_of_inmate_charge['code'] = charge_1st_row.xpath('.//td')[0].content.gsub(/\302\240/, ' ').strip
        hash_of_inmate_charge['severity'] = charge_1st_row.xpath('.//td')[1].content.gsub(/\302\240/, ' ').strip
        hash_of_inmate_charge['description'] = charge_2nd_rows[charge_row_index].xpath('.//td')[0].content.gsub(/\302\240/, ' ').strip

        # push this hash on to the array of inmate charges:
        inmate['charges'] << hash_of_inmate_charge
      end
    end # end if content_table_rows
  end
end
Storing your Data into a File
At this point in your script, all your carefully collected data is in memory. When the script finishes execution, it disappears. That defeats the purpose of any way of tracking data. So let's store it in a persistent way...my choice would be in some kind of database, like MySQL or SQLite. But for our purposes, we can quickly learn the methods to store this information in a tab-delimited file that can be opened as an Excel spreadsheet.
We will be using Ruby's File class:
## write to file
File.open("inmates.txt", 'w'){ |f|
  # the first line holds the column headers
  f.write("first_name\tmiddle_name\tlast_name\tsex\tdob\tintaketime\trelease_date\txref\tbooking_number\tarresting_agency\ttotal_bail\n")
  inmates_array.each do |inmate|
    f.write("#{inmate['first_name']}\t#{inmate['middle_name']}\t#{inmate['last_name']}\t#{inmate['sex']}\t#{inmate['dob']}\t#{inmate['intake_time']}\t#{inmate['release_date']}\t#{inmate['xref']}\t#{inmate['booking_number']}\t#{inmate['arresting_agency']}\t#{inmate['total_bail']}\n")
  end
}
A quick explanation. The File class has the open method, to which we pass in two arguments: the name of the file we want to write to, and the mode. In this case, we're using 'w', which stands for "write" mode and overwrites any existing file of that name. The curly braces set off the code that gets executed while this File is open, with the variable f referring to the actual file. The file is closed automatically when the block ends.
File also has an instance method called write, which takes in a String as an argument to write to the open file.
Backslash-t will write a tab, and backslash-n will write a newline character.
The next block of code is similar to the first...but it refers to a "charges.txt" file. Remember that each inmate could have more than one charge to his/her name. The following file lists every charge, but also lists the xref key to tie back into inmates.txt. For convenience sake, we're also going to print out the inmate name and the inmate's total bail on each line.
File.open("charges.txt",'w'){ |f|
  f.write("name\txref\ttotal_bail\tcode\tseverity\tdescription\n")
  inmates_array.each do |inmate|
    if inmate['charges']
      inmate['charges'].each do |charge|
        f.write("#{inmate['first_name']} #{inmate['last_name']}\t#{inmate['xref']}\t#{inmate['total_bail']}\t#{charge['code']}\t#{charge['severity']}\t#{charge['description']}\n")
      end
    end
  end
}
Printing out the inmate's name and total bail, although redundant, allows us to quickly skim the list to see if there were any unusual crimes connected to unusual amounts of bail (note that the jail site does not break down bail amounts per charge).
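Once charges.txt exists, the xref column is what lets you tie charges back to inmates. As a rough sketch of reading the file back, the following counts charges per xref. The helper name charges_per_xref and the sample lines are made up; it assumes the column order written above (name, xref, total_bail, code, severity, description) with a header on the first line.

```ruby
# charges_per_xref is a hypothetical helper. It expects the lines of a
# charges file whose first line is the column header.
def charges_per_xref(lines)
  counts = Hash.new(0)           # missing keys start at zero
  lines.drop(1).each do |line|   # drop(1) skips the header row
    fields = line.chomp.split("\t")
    counts[fields[1]] += 1       # xref is the second column
  end
  counts
end

# made-up sample lines; in practice you would use File.readlines("charges.txt")
sample_lines = [
  "name\txref\ttotal_bail\tcode\tseverity\tdescription\n",
  "JOHN DOE\t0000002\t$25,000.00\tPC 459\tFelony\tBurglary\n",
  "JOHN DOE\t0000002\t$25,000.00\tXX 000\tMisdemeanor\tExample charge\n",
  "JANE DOE\t0000001\t$5,000.00\tPC 459\tFelony\tBurglary\n"
]

charges_per_xref(sample_lines).each do |xref, count|
  puts "#{xref}: #{count} charge(s)"
end
```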
Putting it all together for the real world
The above code, put all together, will execute cleanly and compile some nice text files for you, especially if you've saved the package of HTML files onto your hard drive. But in the real world, you'll be targeting an internet server, which may not like you hitting it at a rate of five times per second, or which may intermittently fail.
To deal with this, I've added a call to Ruby's sleep method, which pauses script execution for a given number of seconds. I've also thrown in some error-handling. Here's the basic structure:
# some code
begin
  # risky code here
  # The Ruby interpreter will watch the code that gets executed within the begin branch...
  # if something goes wrong, it's going to execute code in the following rescue branch
rescue
  # the begin-branch messed up, time to run some other code
  puts "An error happened!"
else
  # this code gets executed if the begin-branch worked fine
ensure
  # this code in the ensure branch (which is optional) runs no matter what.
  puts "We're done with our error handling"
end
Read more about error-handling here.
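One common variation on this structure, not used in the script below, is to retry a failed step a few times before giving up, using Ruby's retry keyword. Here is a minimal sketch; with_retries is a made-up helper name, and the "flaky" block simply simulates a request that fails twice before succeeding.

```ruby
# with_retries is a hypothetical helper: it runs the given block, and if the
# block raises an error, waits and tries again, up to max_attempts times
def with_retries(max_attempts, wait_seconds)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError => e
    if attempts < max_attempts
      puts "Attempt #{attempts} failed (#{e.message}), retrying"
      sleep(wait_seconds)
      retry # jumps back to the top of the begin block
    end
    raise   # out of attempts: re-raise the last error
  end
end

# simulate a page fetch that fails twice, then works
result = with_retries(3, 0) do |attempt|
  raise 'connection dropped' if attempt < 3
  "page contents"
end
puts result
```

In a real scraper you would pass a wait of a few seconds rather than 0, so that you are not hammering a server that is already struggling.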
And finally, I'm going to make a few alterations to the script so that it'll run repeatedly, every half hour (essentially, by sleeping a half hour after going through the list). This is the crudest way to schedule a script, but it'll work for now. It will also use another instance method of File: readlines.
Each half hour, it's likely that the list of inmates will be largely the same. So a crude way to reduce the number of repeat listings is to check the inmates.txt file (using the match method) to see if a given inmate's name is in there. This gets slower as inmates.txt grows. Like I said, it's crude. I prefer using a database, which is a topic outside the scope of this tutorial.
So I've taken the code above and split it into five parts:
- the process_inmate_row method - This method takes in a single row from the list of inmates and reads the basic information, including name, sex, and date of birth. It takes in as its second argument the entire text of inmates.txt and checks whether inmates.txt already contains the name. If so, it returns nil. If not, it returns a hash of inmate data.
Note: As said previously, constantly searching the entire inmates.txt file is incredibly inefficient. And, what happens if two John Smiths are arrested in the same time period? The name-check will fail to differentiate inmates of similar names (an even better match method would involve using the date of birth). But I leave it as an exercise for you to develop a more efficient method, which could involve a database. Or storing the name columns of inmates.txt into an array.
But the reason why we're doing the name-check is to save us the time of entering an inmate's page. And, of course, to not fill the inmates.txt file with duplicate entries.
- the process_inmate_page_link method - The code that fetches an inmate's individual page and then processes the extra data, including the total bail amount and charges, is done here. It returns a hash of the inmate data.
- the write_to_file method - This code invokes the File.open methods and, for each inmate and charge, writes a tab-delimited line to the inmates.txt and charges.txt files
- the check_the_site method - This is the master method. It retrieves the list of inmates from the jail site and then, on each row of inmate data, calls all the previously defined methods. It also has some basic error handling. If something happens, like your internet connection dropping in the middle of a page retrieval, it skips the current inmate and moves on. This is better than just crashing.
- The main execution loop - All the code previously written out as methods will do nothing unless you actually invoke the methods. So we initialize a variable, called hours, to zero and while that is less than 24, we run the check_the_site method. After check_the_site finishes, hours is incremented and the script sleeps for half an hour (1800 seconds), so despite its name, hours really counts half-hour passes.
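The note in the list above leaves a more efficient duplicate check as an exercise. One possible direction, sketched here under assumptions, is to load the name and date-of-birth columns of inmates.txt into a Set once and look inmates up in that. The helper names known_inmate_keys and inmate_key are made up, the sample line is invented, and the column order is assumed to be the one written earlier (first_name, middle_name, last_name, sex, dob, ...). The script that follows does not use this; it sticks with the plain name match.

```ruby
require 'set'

# Hypothetical helper: builds a Set of "first|middle|last|dob" keys from the
# lines of inmates.txt. A header line, if present, just adds one harmless key.
def known_inmate_keys(lines)
  keys = Set.new
  lines.each do |line|
    fields = line.chomp.split("\t", -1) # -1 keeps empty trailing columns
    keys << [fields[0], fields[1], fields[2], fields[4]].join('|')
  end
  keys
end

# Hypothetical helper: builds the same kind of key from an inmate hash
def inmate_key(inmate)
  [inmate['first_name'], inmate['middle_name'], inmate['last_name'], inmate['dob']].join('|')
end

# a made-up line in the inmates.txt format (the middle name is empty)
lines = ["JOHN\t\tDOE\tM\t01/01/1980\t04/01/2010 10:00\n"]
seen = known_inmate_keys(lines)

same_person  = { 'first_name' => 'JOHN', 'last_name' => 'DOE', 'dob' => '01/01/1980' }
other_person = { 'first_name' => 'JOHN', 'last_name' => 'DOE', 'dob' => '02/02/1975' }
puts seen.include?(inmate_key(same_person))
puts seen.include?(inmate_key(other_person))
```

A Set lookup stays fast no matter how long the file gets, and including the date of birth in the key is one way to tell two John Smiths apart.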
require 'rubygems'
require 'nokogiri'
require 'open-uri'

def process_inmate_row(inmate_row, inmate_text)
  list_columns = inmate_row.xpath('./td')
  inmate = {}
  the_inmate_name = list_columns[0].content.gsub(/\302\240/, ' ').strip.split(',')
  inmate['last_name'] = the_inmate_name[0] # the name before the comma
  given_names = the_inmate_name[1].split(' ') # everything after the comma, split on spaces
  inmate['first_name'] = given_names[0] # the name after the comma, but before the next space
  inmate['middle_name'] = given_names[1..-1].join(' ') if given_names.length > 1

  # at this point, we can determine if the inmate is already in our textfile
  # remember that in the text file, we tab-delimited the name, so we have to match that pattern
  name_to_match = "#{inmate['first_name']}\t#{inmate['middle_name']}\t#{inmate['last_name']}"

  if inmate_text.match(name_to_match)
    puts "NOT adding inmate #{name_to_match} to inmates txt, as it already exists"
    # the method that invoked process_inmate_row will only add the inmate if it is not nil
    # we DON'T want this inmate added, so that's why we're setting it to nil
    inmate = nil
  else
    puts "Adding inmate #{name_to_match} to inmates txt"
    inmate['sex'] = list_columns[1].content
    inmate['dob'] = list_columns[2].content
    inmate['intake_time'] = list_columns[3].content
    puts "Basic info of inmate: #{inmate['first_name']} #{inmate['last_name']}: #{inmate['dob']}"
  end

  return inmate
end

def process_inmate_page_link(inmate_link)
  inmate_page = Nokogiri::HTML(URI.open(inmate_link))
  content_table_rows = inmate_page.xpath("//td[@class='content']/table/tr")
  more_inmate_stuff = {}

  if content_table_rows.length > 0
    more_inmate_stuff["xref"] = content_table_rows[2].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
    more_inmate_stuff['booking_number'] = content_table_rows[3].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
    more_inmate_stuff['arresting_agency'] = content_table_rows[13].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
    more_inmate_stuff['total_bail'] = content_table_rows[16].xpath("./td")[2].content.gsub(/\302\240/, ' ').gsub(/\s|\n|\r/, ' ').strip
    puts "Found more inmate info, total-bail: #{more_inmate_stuff['total_bail']} arresting-agency: #{more_inmate_stuff['arresting_agency']}"

    table_of_charges = content_table_rows[15].xpath("./td")[2]
    more_inmate_stuff['charges'] = []
    charge_1st_rows = table_of_charges.xpath(".//tr[td[@class='cellTopLeft']]")
    puts "Number of charges: #{charge_1st_rows.length}"
    charge_2nd_rows = table_of_charges.xpath(".//tr[td[@class='cellBottom']]")

    charge_1st_rows.to_a.each_index do |charge_row_index|
      hash_of_inmate_charge = {}
      charge_1st_row = charge_1st_rows[charge_row_index]
      hash_of_inmate_charge['code'] = charge_1st_row.xpath('.//td')[0].content.gsub(/\302\240/, ' ').strip
      hash_of_inmate_charge['severity'] = charge_1st_row.xpath('.//td')[1].content.gsub(/\302\240/, ' ').strip
      hash_of_inmate_charge['description'] = charge_2nd_rows[charge_row_index].xpath('.//td')[0].content.gsub(/\302\240/, ' ').strip

      # push this hash on to the array of inmate charges:
      more_inmate_stuff['charges'] << hash_of_inmate_charge
      puts hash_of_inmate_charge.values.join(" | ")
    end
  else
    puts "Could not find more inmate info"
  end # end if content_table_rows

  return more_inmate_stuff
end

def write_to_file(inmate)
  ## write to file
  puts "Writing to inmates.txt"
  # note that we use the 'a+' mode here, which will append new input onto the end of an existing file
  # (or create a new one if it doesn't exist), instead of overwriting it
  # Obviously, we don't want to keep overwriting inmates.txt if we intend it to be a persistent record of the inmate log
  File.open("inmates.txt", 'a+'){ |f|
    # we don't want to repeatedly print the column headers, so only write them when the file is still empty
    f.write("first_name\tmiddle_name\tlast_name\tsex\tdob\tintaketime\trelease_date\txref\tbooking_number\tarresting_agency\ttotal_bail\n") unless File.size(f) > 0
    f.write("#{inmate['first_name']}\t#{inmate['middle_name']}\t#{inmate['last_name']}\t#{inmate['sex']}\t#{inmate['dob']}\t#{inmate['intake_time']}\t#{inmate['release_date']}\t#{inmate['xref']}\t#{inmate['booking_number']}\t#{inmate['arresting_agency']}\t#{inmate['total_bail']}\t#{Time.now}\n")
  }

  puts "Writing to charges.txt"
  File.open("charges.txt",'a+'){ |f|
    # again, only write the column headers when the file is still empty
    f.write("name\txref\ttotal_bail\tcode\tseverity\tdescription\n") unless File.size(f) > 0
    if inmate['charges']
      inmate['charges'].each do |charge|
        puts "Writing charge: #{charge['description']}"
        f.write("#{inmate['first_name']} #{inmate['last_name']}\t#{inmate['xref']}\t#{inmate['total_bail']}\t#{charge['code']}\t#{charge['severity']}\t#{charge['description']}\t#{Time.now}\n")
      end
    end
  }
end

def check_the_site(base_url, index_url)
  # read the contents of inmates.txt into a variable so that we can check to see if an inmate already exists
  inmate_text = File.exist?("inmates.txt") ? File.open("inmates.txt", 'r') { |f| f.readlines.join } : ''
  inmates_added_count = 0 # just a piece of info we want to keep track of. We'll increment this number on each successful add

  begin
    inmate_listing = Nokogiri::HTML(URI.open("#{base_url}#{index_url}"))
  rescue StandardError => e
    puts "Oops, had a problem getting the inmates list at #{Time.now}: #{e}"
    return nil # get out of here.
  end

  inmate_rows = inmate_listing.xpath("//td[@class='content']/table")[0].xpath(".//tr").to_a[1..-1]
  puts "There are #{inmate_rows.length} rows to process"

  inmate_rows.each_index do |i|
    puts "\nProcessing inmate row: #{i}"
    inmate_row = inmate_rows[i]
    begin
      # The following code is potentially risky; we're making calls to process_inmate_row and process_inmate_page_link,
      # two methods that could potentially throw an error if the data is improperly formatted or if the website refuses to send data
      # I've set up some rudimentary error handling to notify you of an error, but to keep chugging along to the next row
      inmate = process_inmate_row(inmate_rows[i], inmate_text)
      # process_inmate_row will return a hash of inmate data
      # BUT, it will return nil if it turns out this inmate already exists
      # so here's another if branch to check for that
      if inmate.nil?
        # do nothing
      else
        # inmate was not nil, so let's continue
        list_columns = inmate_row.xpath('./td')
        if list_columns[0].xpath('./a').length == 0
          inmate['release_date'] = list_columns[4].content.gsub(/\302\240/, ' ').split(' ')[1]
          puts "inmate was released on #{inmate['release_date']}"
        else
          inmate_link = list_columns[0].xpath('./a')[0]["href"]
          inmate_link = "#{base_url}#{inmate_link}"
          puts "Fetching: #{inmate_link}"
          more_inmate_attributes = process_inmate_page_link(inmate_link)
          inmate.merge!(more_inmate_attributes)
        end
      end # end of the if inmate.nil? branch
    rescue Timeout::Error => e
      # the more specific error has to come first, or the general rescue below would catch it
      puts "Had a timeout error: #{e}"
      sleep(10)
    rescue StandardError => e
      puts "Oops, had a problem getting data from inmate row #{i}, Error: #{e}"
    else
      # no errors: remember that inmate was set to nil if it already existed in the text file
      # we don't want to add it to the file in such a case, hence the 'unless'
      unless inmate.nil?
        write_to_file(inmate)
        inmates_added_count += 1
        puts "We successfully queried the site, so let's sleep a second"
        sleep 1
      end
    end
  end

  # reached the end, let's print a summary:
  puts "#{Time.now}: Out of #{inmate_rows.length}, we added #{inmates_added_count} inmates"
end

hours = 0
BASE_URL = 'https://danwin.com/static/jail-list/'

while(hours < 24)
  puts "Checking the site (#{hours} out of 24 times):"
  puts "***********************"
  check_the_site(BASE_URL, 'current_listing.cfm.html') # run the code that hits the site and processes the links
  hours += 1 # increment the counter, or this will run forever...
  puts "sleeping till next iteration"
  sleep_count = 0
  while(sleep_count < 1800)
    sleep(1) # sleep one second at a time, 1800 times in all (half an hour)
    sleep_count += 1
    puts "Will check again in #{(1800-sleep_count)/60} minutes" if sleep_count%60==0
  end
end
4/4/2010: This lesson remains unfinished, but the above code should execute. From it, you should have text files that, at a glance, will tell you some of the more interesting circumstances under which this set of inmates was arrested. There are various kinds of analysis you could do on a long-term basis. But trying to figure out why some inmates have bail set at $1,000,000 isn't easy; you need to know their prior criminal record too...which is what we hope to do in the third tutorial in this series.