danwin.com » tutorial

dataist blog: An inspiring case for journalists learning to code

Dan Nguyen — Wed, 16 Feb 2011 13:00:32 +0000

About a year ago I threw up a long, rambling guide hoping to teach non-programming journalists some practical code. Looking back at it, it seems inadequate. Actually, I misspoke, I haven’t looked back at it because I’m sure I’ll just spend the next few hours cringing. For example, what a dumb idea it was to put everything from “What is HTML” to actual Ruby scraping code all in a gigantic, badly formatted post.

The series of articles has gotten a fair number of hits, but I don’t know how many people were able to stumble through it. Last week, though, I noticed a trackback from dataist, a new “blog about data exploration” by Finnish journo Jens Finnäs. He writes that he has “almost no prior programming experience” but, after going through my tutorials and checking out Scraperwiki, was able to produce this cool network graph of the Ratata blog network after about “two days of trial and error”:

Mapping of Ratata blogging network by Jens Finnäs of dataist.wordpress.com

I hope other non-coders who are still intimidated by the thought of learning programming are inspired by Finnäs’s example. Becoming good at coding is not a trivial task. But even the first steps can teach a non-coder some profound lessons about data that are important enough on their own. And if you’re a curious type with a question you want to answer, you’ll soon figure out a way to put something together, as in Finnäs’s case.

ProPublica’s Dollars for Docs project originated in part from this Pfizer-scraping lesson I added on to my programming tutorial: I needed a timely example of public data that wasn’t as useful as it should be.

My colleagues Charles Ornstein and Tracy Weber may not be programmers (yet), but they are experienced enough with data to know its worth as an investigative resource, and turned an exercise in transparency into a focused and effective investigation. It’s not trivial to find a story in data. Besides being able to do Access queries themselves, C&T knew both the limitations of the data (for example, it’s difficult to make comparisons between the companies because of different reporting periods) and its possibilities, such as the cross-checking of names en masse from the payment lists with state and federal doctor databases.

Their investigation into the poor regulation of California nurses – a collaboration with the LA Times that was a Pulitzer finalist in the Public Service category – was similarly data-oriented. They (and the LA Times’ Maloy Moore and Doug Smith) had been diligently building a database of thousands of nurses – including their disciplinary records and the time it took for the nursing board to act – which made my part in building a site to graphically represent the data extremely simple.

The point of all this is: don’t put off your personal data-training because you think it requires a computer science degree, or that you have to become great at it in order for it to be useful. Even if, after a week of learning, you can barely put together a script to alphabetize your tweets, you’ll likely gain enough insight into how data is made structured and useful to help in just about every other aspect of your reporting repertoire.

In fact, just knowing to avoid taking notes like this:

Colonel Mustard used the revolver in the library? (not library)
Miss Scarlet used the Candlestick in the dining room? (not Scarlet)
“Mrs. Peacock, in the dining room, with the ~~revolver~~? “
“Colonel Mustard, rope, ~~conservatory~~?”
Mustard? Dining room? Rope (nope)?
“Was it Mrs. Peacock with the ~~candlestick~~, inside the dining room?”

And instead, recording them like this:

Who/What?	Role?	Ruled out?
Mustard	Suspect	N
Scarlet	Suspect	Y
Peacock	Suspect	N
Revolver	Weapon	Y
Candlestick	Weapon	Y
Rope	Weapon	Y
Conservatory	Place	Y
Dining Room	Place	N
Library	Place	Y

…will make you a significantly more effective reporter, as well as position you to have your reporting and research become much more ready for thorough analysis and online projects.
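To make the payoff concrete, here is a minimal Ruby sketch, assuming the notes were typed as tab-delimited text in the same layout as the table above. The constant NOTES and the methods parse_notes and remaining_by_role are names made up for this illustration.

```ruby
# Hypothetical notes, kept in the same tab-delimited layout as the table above.
NOTES = <<~TSV
  Who/What?\tRole?\tRuled out?
  Mustard\tSuspect\tN
  Scarlet\tSuspect\tY
  Peacock\tSuspect\tN
  Revolver\tWeapon\tY
  Candlestick\tWeapon\tY
  Rope\tWeapon\tY
  Conservatory\tPlace\tY
  Dining Room\tPlace\tN
  Library\tPlace\tY
TSV

# Turn the tab-delimited text into an array of hashes keyed by column header.
def parse_notes(tsv)
  header, *records = tsv.lines.map { |line| line.chomp.split("\t") }
  records.map { |fields| header.zip(fields).to_h }
end

# Group everything that has NOT been ruled out by its role.
def remaining_by_role(rows)
  rows.reject { |row| row['Ruled out?'] == 'Y' }
      .group_by { |row| row['Role?'] }
      .transform_values { |group| group.map { |row| row['Who/What?'] } }
end

remaining_by_role(parse_notes(NOTES)).each do |role, names|
  puts "#{role}: #{names.join(', ')}"
end
```

The free-form notes would need a human to re-read every line; the structured version answers "what is still possible?" with a filter and a group-by.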

There’s a motherlode of programming resources available through a single Google search. My high school journalism teacher told us that if you want to do journalism, don’t major in it, just do it. I think the same can be said for programming. I’m glad I chose a computer field as an undergraduate so that I’m familiar with the theory. But if you have a career in reporting or research, you have real-world data-needs that most undergrads don’t. I’ve found that having those goals and needing to accomplish them has pushed my coding expertise far quicker than did any coursework.

If you aren’t set on learning to program, but want to get a better grasp of data, I recommend learning:

Regular expressions – a set of character patterns, easily printable on a cheat-sheet for memorization, that you use in a text-editor’s Find and Replace dialog to turn a chunk of text into something you can put into a spreadsheet, as well as clean up the data entries themselves. Regular-expressions.info is the most complete resource I’ve found. A cheat-sheet can be found here. Wikipedia has a list of some simple use cases.
Google Refine – A spreadsheet-like program that makes easy the task of cleaning and normalizing messy data. Ever go through campaign contribution records and wish you could easily group together and count as one, all the variations of “Jon J. Doe”, “Jonathan J. Doe”, “Jon Johnson Doe”, “JON J DOE”, etc.? Refine will do that. Refine developer David Huynh has an excellent screencast demonstrating Refine’s power. I wrote a guide as part of the Dollars for Docs tutorials. Even if you know Excel like a pro – which I do not – Refine may make your data-life much more enjoyable.
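As a rough illustration of the regular-expression item above, here is a sketch in Ruby rather than a text editor's Find and Replace dialog; the pattern itself would work in either. The input sentences, the constant DONATION_PATTERN and the method to_row are all invented for this example.

```ruby
# Hypothetical free-text lines of the form "<name> gave $<amount> on <m>/<d>/<yyyy>".
DONATION_PATTERN = %r{\A(?<name>.+?) gave \$(?<amount>[\d,]+) on (?<month>\d{1,2})/(?<day>\d{1,2})/(?<year>\d{4})\z}

# Convert one line of text into a tab-delimited spreadsheet row,
# or return nil when the line does not fit the pattern.
def to_row(line)
  m = DONATION_PATTERN.match(line.strip)
  return nil unless m

  amount = m[:amount].delete(',')   # "1,500" -> "1500"
  date = format('%04d-%02d-%02d', m[:year].to_i, m[:month].to_i, m[:day].to_i)
  [m[:name], amount, date].join("\t")
end

[
  'Jon J. Doe gave $1,500 on 3/1/2010',
  'JONATHAN J DOE gave $250 on 3/15/2010',
  'a line that does not match'
].each do |line|
  row = to_row(line)
  puts(row || "SKIPPED: #{line}")
end
```

Lines that do not match are reported rather than silently dropped, which is usually what you want when cleaning data by pattern.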

If you want to learn coding from the ground up, here’s a short list of places to start:

Lifehacker’s “Full Beginner’s Guide” – a four-day guide that goes from the very basics to writing a simple guessing game. It’s in Javascript, but as you’ll hear plenty of times from veterans, it really doesn’t matter what language you start out with.
The Pragmatic Programmer’s Guide to Programming Ruby – this covers an older version of Ruby, but is still a great comprehensive, browser-friendly book.
Learn to Program (also in Ruby) by Chris Pine – Written in 2004, this is still an elegant beginner’s guide
Invent Your Own Computer Games With Python – You may not be interested in writing game software, but the same programming techniques apply in that field as they do anywhere else. This guide covers all the fundamentals and gives you great project examples.
ScraperWiki has a massive collection of web-scraping scripts for your perusal, and is where the dataist’s Finnäs learned from example. ScraperWiki has a set of python tutorials, too.
Here’s a giant list of free programming books.
Visit the learnprogramming subforum in Reddit to find a small, but active community of beginners who aren’t afraid to start the most basic of discussions with the forum’s programming experts. StackOverflow is the single best site for specific questions or problems; often, you can Google your exact problem and a relevant StackOverflow discussion will be at the top.
And you can always refer back to my four-part programming tutorial from last year, which aims to cover HTML to writing Ruby to scrape websites. I also wrote a series of tutorials (with complete code) on how I collected data for Dollars for Docs, including how to scrape from websites, Flash applications, PDFs, and even image files (the solution is specific to one kind of format, so I will gladly welcome anyone else to generalize it).

The post dataist blog: An inspiring case for journalists learning to code appeared first on danwin.com.

Pfizer Data Redux

Dan Nguyen — Wed, 28 Apr 2010 14:22:36 +0000

Updated the code and results in my guide on how to scrape Pfizer’s list of payments to doctors. It now contains a more normalized file that has a line for every doctor and payment. The aggregate totals changed marginally.


Coding for Journalists 101 : A four-part series

Dan Nguyen — Tue, 06 Apr 2010 13:51:40 +0000

Photo by Nico Cavallotto on Flickr

Update, January 2012: Everything…yes, everything, is superseded by my free online book, The Bastards Book of Ruby, which is a much more complete walkthrough of basic programming principles with far more practical and up-to-date examples and projects than what you’ll find here.

I’m only keeping this old walkthrough up as a historical reference. I’m sure the code is so ugly that I’m not going to even try re-reading it.

So check it out: The Bastards Book of Ruby

-Dan

—

Update, Dec. 30, 2010: I published a series of data collection and cleaning guides for ProPublica, to describe what I did for our Dollars for Docs project. There is a guide for Pfizer which supersedes the one I originally posted here.

So a little while ago, I set out to write some tutorials that would guide the non-coding-but-computer-savvy journalist through enough programming fundamentals so that he/she could write a web scraper to collect data from public websites. A “little while” turned out to be more than a month-and-a-half. I actually wrote most of it in a week and then forgot about it. The timeliness of the fourth lesson, which shows how to help Pfizer in its mission to be more transparent, compelled me to just publish them in incomplete form. There are probably inconsistencies in the writing and some of the code examples, but the final code sections at the end of each tutorial do seem to execute as expected.

As the tutorials are aimed at people who aren’t experienced at programming, the code is pretty verbose, pedantic, and in some cases, a little inefficient. It was my attempt to make the code as readable as possible, and I welcome any suggested edits.

DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.

Tutorial 1: Go from knowing nothing to scraping Web pages. In an hour. Hopefully


Tutorial 2: Scraping a County Jail Website to Find Out Who’s in Jail – This uses all the concepts from the first tutorial and applies them to something that a cops reporter might actually want to try out.

Tutorial 3: Who’s Been in Jail Before: Cross-checking the jail logs with the court system with Ruby’s Mechanize – This lesson introduces you to another Ruby library that allows you to automate the filling-out of forms so that you can access online databases, in this case, California criminal case histories to see if current inmates are repeat-alleged-offenders.

Tutorial 4: Improving Pfizer’s Dollars-to-Doctors Pay List – Last week, Pfizer released a list of nearly 5,000 doctors and medical institutions that it made $35 million in consulting and expense payments. Fun. Unfortunately, the list, as it initially existed online, is just about useless to anyone wanting to examine trends. This tutorial provides a script to make the list more interesting to journalists.


Coding for Journalists 104: Pfizer’s Doctor Payments; Making a Better List

Dan Nguyen — Tue, 06 Apr 2010 13:50:19 +0000

Update (12/30): So about an eon later, I’ve updated this by writing a guide for ProPublica. Heed that one. This one will remain in its obsolete state.

Update (4/28): Replaced the code and result files. Still haven’t written out a thorough explainer of what’s going on here.

Update (4/19): After revisiting this script, I see that it fails to capture some of the payments to doctors associated with entities. I’m going to rework this script and post an update soon.

So the world’s largest drug maker, Pfizer, decided to tell everyone which doctors it paid to speak and consult on its behalf in the latter half of 2009. These doctors are the same ones who, from time to time, recommend the use of Pfizer products.

From the NYT:

Pfizer, the world’s largest drug maker, said Wednesday that it paid about $20 million to 4,500 doctors and other medical professionals for consulting and speaking on its behalf in the last six months of 2009, its first public accounting of payments to the people who decide which drugs to recommend. Pfizer also paid $15.3 million to 250 academic medical centers and other research groups for clinical trials in the same period.

A spokeswoman for Pfizer, Kristen E. Neese, said most of the disclosures were required by an integrity agreement that the company signed in August to settle a federal investigation into the illegal promotion of drugs for off-label uses.

So, not an entirely altruistic release of information. But it’s out there nonetheless. You can view their list here. Jump to my results here

Not bad at first glance. However, on further examination, it’s clear that the list is nearly useless unless you intend to click through all 480 pages manually, or, if you have a doctor in mind and you only care about that one doctor’s relationship. As a journalist, you probably have other questions. Such as:

Which doctor received the most?
What was the largest kind of expenditure?
Were there any unusually large single-item payments?

None of these questions are answerable unless you have the list in a spreadsheet. As I mentioned in earlier lessons…there are cases when the information is freely available, but the provider hasn’t made it easy to analyze. Technically, they are fulfilling their requirement to be “transparent.”

I’ll give them the benefit of the doubt that they truly want this list to be as accessible and visible as possible…I tried emailing them to ask for the list as a single spreadsheet, but the email function was broken. So, let’s just write some code to save them some work and to get our answers a little quicker.

This is part of a four-part series on web-scraping for journalists. As of Apr. 5, 2010, it was published a bit incomplete because I wanted to post a timely solution to the recent Pfizer doctor payments list release, but the code at the bottom of each tutorial should execute properly. The code examples are meant for reference and I make no claims to the accuracy of the results. Contact dan@danwin.com if you have any questions, or leave a comment below.

DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.

The Code

The following code uses the same nokogiri strategies in the past three lessons. But here are the specific considerations that we have to make for Pfizer’s list:

The base url is: http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?enPdNm=All&iPageNo=1 The most interesting parameter is iPageNo, at the end. If you replace ‘1’ with any number, you’ll see you can progress through the list. There appear to be 486 pages.
Each page has a table of data with id #hcpPayments. The rows of data aren’t very normalized. For example, each “Entity Paid” can have many services/activities listed, with each of those items having another name attached to it. Then there are “cash” and “non-cash” values, which may or may not be numeric (“—” apparently means 0). There’s no easy css selector to grab each entity…but it seems that we can safely assume that if the first table column has a name (and the second and third contain city and state) that this is a new entity.
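The two rules of thumb above (a filled-in name, city and state means a new entity; a dash means zero) can be sketched on plain arrays of cell text, without any HTML parsing. This is a simplified illustration, not the script used later in this post: the helper names and the sample rows are invented, and it assumes the cell text has already had stray whitespace removed.

```ruby
# True when a cell has no visible text.
def blank_cell?(text)
  text.nil? || text.strip.empty?
end

# A row starts a new entity when name, city and state (columns 0..2) are all filled in.
def starts_new_entity?(cells)
  cells[0..2].none? { |cell| blank_cell?(cell) }
end

# Turn "$1,250" into 1250 and a dash (or anything without digits) into 0.
# Anything after a decimal point is dropped so cents are not glued onto the dollars.
def amount_to_i(text)
  text.to_s.split('.').first.to_s.gsub(/[^0-9]/, '').to_i
end

# Hypothetical rows: name, city, state, service, cash, non-cash.
rows = [
  ['SMITH, JOHN', 'ALBANY', 'NY', 'Professional Advising', '$1,250', '—'],
  ['', '', '', 'Expert-Led Forums', '$500', '$75']
]

rows.each do |cells|
  label = starts_new_entity?(cells) ? 'new entity' : 'same entity'
  puts "#{label}: #{cells[3]} cash=#{amount_to_i(cells[4])} non-cash=#{amount_to_i(cells[5])}"
end
```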

These are the steps we’ll take:

Download pages 1 to 486 of the list (each page has 10 entries)
Run a method that gathers all the doctor names from the pages we just downloaded onto our hard drive
From that list of doctors, query the Pfizer site and gather the individual payments to every doctor.

At the top, I’ve written a few convenience methods to deal with strings. Two other helpers are included:

get_doc_query is a function we call to extract the doctor name from the links on the site.

puts_error is a quick function to log any errors we might have.

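					# Libraries the script relies on: nokogiri parses the HTML, open-uri lets open() fetch URLs
					# (on Ruby 3.0 and later, call URI.open instead of open for URLs)
					require 'rubygems'
					require 'nokogiri'
					require 'open-uri'

					# Some general functions to deal with strings
					class String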

					  alias_method :old_strip, :strip

					  def strip
						  self.old_strip.gsub(/^[\302\240|\s]*|[\302\240|\s]*$/, '').gsub(/[\r\n]/, " ")
					  end

					  def strip_for_num
					    self.strip.gsub(/[^0-9]/, '')
					  end

					  def blank?
						respond_to?(:empty?) ? empty? : !self
					  end
					end
					
					
					END_PAGE=486
					BASE_URL=''
					DOC_QUERY_URL='http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?hcpdisplayName='


					def get_doc_query(str)
					  str.match(/hcpdisplayName\=(.+)/)[1]
					end

					def puts_error(str)
					  err = "#{Time.now}: #{str}"
					  puts err
					  File.open("pfizer_error_log.txt", 'a+'){|f| f.puts(err)}
					end

I found it easiest to download all the pages onto the hard drive first, using something like CURL, and then run the following code on it.
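If you would rather stay in Ruby than use curl, the download step might look something like the sketch below. It is only an outline under stated assumptions: page_url, mirror_pages and the fetcher argument are hypothetical names, and the actual HTTP call is passed in as a lambda so the logic can be tried without touching Pfizer's site. Files are saved as 1.html, 2.html and so on, which is the naming the processing code in this post expects.

```ruby
require 'fileutils'

LIST_URL = 'http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?enPdNm=All&iPageNo='

# Build the URL for one page of the list.
def page_url(page_number)
  "#{LIST_URL}#{page_number}"
end

# Save each page as "<dir>/<n>.html", skipping files that already exist,
# and pause between requests to be polite to the server.
# fetcher is anything that responds to call(url) and returns the page body.
def mirror_pages(page_numbers, dir, pause: 1, fetcher:)
  FileUtils.mkdir_p(dir)
  page_numbers.map do |n|
    path = File.join(dir, "#{n}.html")
    unless File.exist?(path)
      File.write(path, fetcher.call(page_url(n)))
      sleep pause
    end
    path
  end
end

# A real run might look like this (not executed here; needs network access):
#   require 'open-uri'
#   mirror_pages(1..486, 'pages', fetcher: ->(url) { URI.open(url).read })
```

Because existing files are skipped, an interrupted run can be restarted without downloading everything again.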

process_local_pages is a method that will iterate through every page (you can set BASE_URL to either your hard drive if you’ve downloaded all the pages yourself, or to the Pfizer page), run process_row, and store all the doctor names and payees into separate files, as well as hold all the total amounts

The three resulting files that you get are:

pfizer_doctors.txt – Every doctor name listed. We will use this in the next step to query each doctor individually on Pfizer’s site
pfizer_entities.txt – A list of every payment made to Entities
pfizer_entity_totals.txt – A list of the total payments made to Entities



						def process_row(row, i, current_entity, arrays)  

						  tds = row.css('td').collect{|r| r.text.strip}

						   if !tds[3].blank? 
						     if !tds[1].blank?
						     # new entity
						     puts tds[0]
							     current_entity = {:name=>tds[0],:city=>tds[1], :state=>tds[2], :page=>i, :services=>[]} 
							     arrays[:entities].push(current_entity) if arrays[:entities]
						  	   current_class = row['class']
							   end

						     if tds[3].match(/Total/)
						       arrays[:totals].push([current_entity[:name], tds[4].strip_for_num, tds[5].strip_for_num].join("\t")) if arrays[:totals]

						     else
						        # new service
						   	   services_td = row.css('td')[3]
						   	   service_name = services_td.css("ul li a")[0].text.strip 
						   	   puts "#{current_entity[:name]}\t#{service_name}" 
						   	   current_entity[:services].push([service_name, tds[4].strip_for_num, tds[5].strip_for_num]) 

						   	   arrays[:doctors].push(services_td.css("ul li ul li a").map{|a| get_doc_query(a['href']) }.uniq) if arrays[:doctors]
						     end
						   elsif tds.reject{|t| t.blank?}.length == 0
						     #blank row
						   else
						     puts_error "Page #{i}: Encountered a row and didn't know what to do with it: #{tds.join("\t")}"
						   end

						   return current_entity
						end





						def process_local_pages

						  doctors_arr = []
						  entities_arr = []
						  totals_arr =[]

						  for i in 1..END_PAGE
						    begin
						  	   page = Nokogiri::HTML(open("#{BASE_URL}#{i}.html"))

						    	 count1, count2 = page.css('#pagination td.alignRight').last.text.match(/([0-9]{1,}) - ([0-9]{1,})/)[1..2].map{|c| c.to_i}
						    	 count = count2-count1+1

						    	 puts_error("Page #{i} WARNING: Pagination count is bad") if count < 0
						    	 puts("Page #{i}: #{count1} to #{count2}")

						    	 rows = page.css('#hcpPayments tbody tr')

						    	 current_entity=nil

						    	 rows.each do |row|  	   
						    	   current_entity= process_row(row, i, current_entity, {:doctors=>doctors_arr, :entities=>entities_arr, :totals=>totals_arr})
						       end

						     rescue StandardError=>e
						  	   # e.message works on every Ruby version; Exception#to_str was removed after Ruby 1.8
						  	   puts_error "Oops, had a problem getting the #{i}-page: #{[e.message, e.backtrace.map{|b| "\n\t#{b}"}].join("\n")}"
						     end
						  end

						  File.open("pfizer_doctors.txt", 'w'){|f|
						    # each row pushed an array of names, so flatten before removing duplicates
						    doctors_arr.flatten.uniq.each do |d|
						        f.puts(d)
						    end
						  }

						  File.open("pfizer_entities.txt", 'w'){|f|
						    entities_arr.each do |e|
						      e[:services].each do |s|
						        f.puts("#{e[:name]}\t#{e[:page]}\t#{e[:city]}\t#{e[:state]}\t#{s[0]}\t#{s[1]}\t#{s[2]}")
						      end  
						    end
						  }


						  File.open("pfizer_entity_totals.txt", 'w'){|f|
						    totals_arr.uniq.each do |d|
						        f.puts(d)
						    end
						  }
						end

process_doctor is what we run after we’ve compiled the list of doctor names that show up on the Pfizer list. Each doctor has his/her own page with detailed spending. The data rows are roughly in the same format as the main list, so we reuse process_row.


						def process_doctor(r, time='')
						  begin
						    url = "#{DOC_QUERY_URL}#{r}"
						    page = Nokogiri::HTML(open("#{url}"))
						  rescue StandardError=>e
							   puts_error "Oops, had a problem getting the #{r}-entry: #{[e.message, e.backtrace.map{|b| "\n\t#{b}"}].join("\n")}"
							   return [] # nothing to parse if the page could not be loaded
						  end

						  rows = page.css('#hcpPayments tbody tr')
						  entities_arr = []
						  current_entity=nil

						   rows.each do |row|  	   
						     current_entity= process_row(row, '', current_entity, {:entities=>entities_arr})
						   end


						   name = r.split('+')
						   puts_error("Should've been a last name at #{r}") if !name[0].match(/,$/)
						   name = "#{name[0].gsub(/,$/, '')}\t#{name[1..-1].join(' ')}"

						   vals=[]
						   entities_arr.each do |e| 
						     e[:services].each do |s|
						       vals.push("#{name}\t#{e[:name]}\t#{e[:page]}\t#{e[:city]}\t#{e[:state]}\t#{s[0]}\t#{s[1]}\t#{s[2]}\t#{url}\t#{time}")
						    end
						   end

						  vals.each{|val| File.open("pfizer_doctor_details.txt", "a"){ |f| 
						    f.puts val
						  }}

						  puts vals
						  return vals
						end

process_doctor_pages is just a function that calls process_doctor for each name in the pfizer_doctors.txt we previously gathered

The final result is pfizer_doctor_details.txt, which contains a line for every payment to every doctor.

						def process_doctor_pages
						  time = Time.now

						  File.open("pfizer_doctors.txt", 'r'){|f|
						     f.readlines.each do |r|
						        vals = process_doctor(r.chomp, time) # chomp drops the newline that readlines keeps
						     end 
						  }
						end
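With one line per payment, a question like "Which doctor received the most?" becomes a short aggregation. The sketch below assumes the column order written by process_doctor above (last name, first name, entity, page, city, state, service, cash, non-cash, url, time); the sample lines and the method name cash_totals_by_doctor are made up for illustration and are not real results.

```ruby
CASH_COLUMN = 7 # position of the cash amount in each tab-delimited line

# Sum the cash column per doctor and return [name, total] pairs, largest first.
def cash_totals_by_doctor(lines)
  totals = Hash.new(0)
  lines.each do |line|
    fields = line.chomp.split("\t", -1) # -1 keeps empty trailing columns
    doctor = "#{fields[0]}, #{fields[1]}"
    totals[doctor] += fields[CASH_COLUMN].to_i
  end
  totals.sort_by { |_name, total| -total }
end

# Hypothetical lines in the same layout as pfizer_doctor_details.txt.
SAMPLE_DETAILS = [
  "Doe\tJane\tDoe, Jane\t\tAlbany\tNY\tProfessional Advising\t1500\t\thttp://example.com/a\t2010-04-28",
  "Doe\tJane\tDoe, Jane\t\tAlbany\tNY\tExpert-Led Forums\t250\t75\thttp://example.com/a\t2010-04-28",
  "Roe\tRichard\tRoe, Richard\t\tAustin\tTX\tExpert-Led Forums\t900\t\thttp://example.com/b\t2010-04-28"
].freeze

cash_totals_by_doctor(SAMPLE_DETAILS).each do |name, total|
  puts "#{name}\t#{total}"
end
```

To run it on the real file, pass File.readlines("pfizer_doctor_details.txt") instead of the sample array.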

The Results

After Googling the top-Pfizer-paid-doctor on the list (Gerald Michael Sacks for ~$150K), I came across the Pharma Conduct blog, which had already posted partial aggregations of the list, including the top 5 doctors, complete with profiles and pics.

As Pharma Conduct has already been on the ball, I’ll defer to its analysis. It has some good background here on how lame pharma companies have been in past releases of data. Overall, Pharma Conduct is less than impressed with Pfizer:

Despite reporting more information than some its peers, Pfizer’s interface is still very limited. For one, to use the search filtering, you must know a physician’s first name and last name, as well as the state where the payment was made. Also, the data cannot be sorted by payment amount, which is a big limitation. Pfizer should be given credit for releasing the information and being so thorough. However, by releasing it in a format that is not really amenable to data analysis and is more suited to simply looking up results one physician at a time, I echo John Mack’s sentiment, namely, that this data is translucent, but not transparent.


Coding for Journalists 103: Who’s been in jail before: Cross-checking the jail log with the court system; Use Ruby’s mechanize to fill out a form

Dan Nguyen — Tue, 06 Apr 2010 13:40:53 +0000

This is part of a four-part series on web-scraping for journalists. As of Apr. 5, 2010, it was published a bit incomplete because I wanted to post a timely solution to the recent Pfizer doctor payments list release, but the code at the bottom of each tutorial should execute properly. The code examples are meant for reference and I make no claims to the accuracy of the results. Contact dan@danwin.com if you have any questions, or leave a comment below.

DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.

In particular, with lesson 3, I skipped basically any explanation of the code. I hope to get around to it later.

Going to Court

In the last lesson, we learned how to write a script that would record who was in jail at a given hour. This could yield some interesting stories for a crime reporter, including spates of arrests for notable crimes and inmates who are held with $1,000,000 bail for relatively minor crimes. However, an even more interesting angle would be to check the inmates’ prior records, to get a glimpse of the recidivism rate, for example.

Sacramento Superior Court allows users to search by not just names, but by the unique ID number given to inmates by Sacramento-area jurisdictions. This makes it pretty easy to link current inmates to court records.

However, the techniques we used in past lessons to automate the data collection won’t work here. As you can see in the above picture, you have to fill out a form. That’s not something any of the code we’ve written previously will do. Luckily, that’s where Ruby’s mechanize comes in.

Ruby Mechanize

Go to the mechanize library homepage to learn how to install it as a Ruby gem. It requires that nokogiri is installed, which you should’ve done if you’ve made it this far into my tutorials.

There are some basic examples on the project page, but you’re going to have to read some of the technical documentation to learn some of mechanize’s commands.

Here’s a code example we’ll be using:

search_form['txtXref']='00112233'
result_page_form = search_form.submit

search_form refers to a mechanize Form object. In that HTML form is a textfield with a name of ‘txtXref’. The array notation we used above is setting that textfield to the value ‘00112233’.

Then, using mechanize’s Form object’s submit method, we submit the form just as if we had clicked the “Submit” button on a webpage.

That’s the basic theory.
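To see roughly what "submitting" means, it can help to look at the name=value text a browser sends for a form. The sketch below is not mechanize's own code; it only uses Ruby's standard uri library to encode a few of the field names from this form, and form_body is a name invented for the example. The real form also carries hidden fields, which mechanize fills in for you.

```ruby
require 'uri'

# Encode form fields the way a browser does when it posts a form.
def form_body(fields)
  URI.encode_www_form(fields)
end

# A few of the search form's fields, with only the X-REF filled in.
body = form_body({ 'txtLastName' => '', 'txtFirstName' => '', 'txtXref' => '00112233' })
puts body
```

Setting search_form['txtXref'] changes one of those values, and submit sends the whole set to the server in one request.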

The Code

Note: The following code works, if you have an inmates.txt file from the last lesson (use this one if you don’t; keep in mind that the last names and birthdates have been changed/redacted). However, it’s very rudimentary, with no error-checking at all. Still, it’ll give you a couple tab-delimited files that will list an inmate’s past charges and past sentences served, with XREF being the key that links those files to inmates.txt.

Remember that you’re accessing a live site here. This script pauses for 1 second after each inmate lookup…there should be no reason to be more frequent about it.

This tutorial will be updated in the future.

require 'rubygems'
require 'mechanize'
search_url='https://services.saccourt.com/indexsearchnew/CriminalSearchV2.aspx'
xrefs = File.open("inmates.txt", 'r').readlines().map{|x| x.split("\t")[7].match(/[0-9]+/).to_s}.uniq

# open datafile


a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

search_page = a.get(search_url) 
search_form = search_page.form_with(:name=>'frmCriminalSearch')

#show the fieldnames
search_form.fields.map {|f| f.name}
#=> ["__EVENTTARGET", "__EVENTARGUMENT", "__VIEWSTATE", "txtLastName", "txtFirstName", "txtDOB", "txtXref", "txtCaseNumber", "lstCaseType"]

search_form.buttons.map{|m| m.name}
# => ["btnFindByName", "btnFindByNumber"]


xrefs.each do |xref|
  puts "\nFinding info for xref: #{xref}"
  search_form['txtXref']=xref
  search_form.field_with(:name=>'lstCaseType').options[1].select
  result_page_form = search_form.submit.forms.first
  case_buttons = result_page_form.buttons[1..-2]

  puts "There are #{case_buttons.length} cases to check:"
  case_buttons.each do |cb|
    file_page = result_page_form.click_button(cb)
    file_page = file_page.parser
  
    charges_arr = []
    sentences_arr =[]
    charge_rows = file_page.css('#dgDispositionCharges tr')
  
    if charge_rows.length > 0
      puts "Charges: "
      charge_rows[1..-1].each do |cr|
        ctd = cr.css('td').map{|td| td.text}
        charges_arr << {:plea=>ctd[1], :charge=>ctd[2], :date=>ctd[4], :severity=>ctd[5]}
        # print just the values of the hash we stored for this charge
        puts "\t - #{charges_arr.last.values.join("\t")}"
      end  
    end
  
    sentence_rows = file_page.css('#dgSentenceSummary tr')
  
    if sentence_rows.length > 0
      puts "Sentences: "
      sentence_rows[1..-1].each do |sr|
        sentences_arr << sr.css('td').map{|td| td.text}.join("\t")
        puts "\t - #{sentences_arr.last}"
      end
    end
    
    
    File.open("court_charges.txt",'a+'){ |f|

      charges_arr.each do |c|
        f.puts("#{xref}\t#{c[:plea]}\t#{c[:charge]}\t#{c[:date]}\t#{c[:severity]}")
      end
    }

    File.open("sentences.txt", 'a+'){ |f| 
      sentences_arr.each do |c|
        f.puts("#{xref}\t#{c}")
      end
    }
    
    
    
  
  end #done checking a case entry
  
  puts "Done with #{xref}, sleeping"
  sleep 1
  
  
end
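Once court_charges.txt exists, the XREF column is what ties it back to inmates.txt. Here is a minimal sketch of that join, assuming the layouts used above (XREF in the eighth column of inmates.txt, and xref, plea, charge, date, severity in court_charges.txt). The method names and all sample values are hypothetical placeholders.

```ruby
XREF_COLUMN = 7 # position of the X-REF field in each inmates.txt line

# Group past charges by X-REF: { "1234567" => ["Charge A", "Charge B"] }
def charges_by_xref(charge_lines)
  grouped = Hash.new { |hash, key| hash[key] = [] }
  charge_lines.each do |line|
    xref, _plea, charge = line.chomp.split("\t", -1)
    grouped[xref] << charge
  end
  grouped
end

# Return [xref, charges] for every current inmate who has earlier charges on file.
def inmates_with_priors(inmate_lines, charge_lines)
  grouped = charges_by_xref(charge_lines)
  xrefs = inmate_lines.map { |line| line.chomp.split("\t")[XREF_COLUMN].to_s[/[0-9]+/] }
  xrefs.compact.uniq.select { |xref| grouped.key?(xref) }
       .map { |xref| [xref, grouped[xref]] }
end

# Placeholder data: seven filler columns, then the X-REF.
SAMPLE_INMATES = [
  (['-'] * 7 + ['1234567']).join("\t"),
  (['-'] * 7 + ['7654321']).join("\t")
].freeze

SAMPLE_CHARGES = [
  "1234567\tGuilty\tCharge A\t01/02/2008\tFelony",
  "1234567\tNolo\tCharge B\t03/04/2009\tMisdemeanor"
].freeze

inmates_with_priors(SAMPLE_INMATES, SAMPLE_CHARGES).each do |xref, charges|
  puts "#{xref}: #{charges.length} prior charge(s): #{charges.join(', ')}"
end
```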


Coding for Journalists 102: Who’s in Jail Now: Collecting info from a county jail site

Dan Nguyen — Tue, 06 Apr 2010 13:30:51 +0000

This is part 2 of a 4-part series in introductory coding for journalists. Go here for the first lesson. This lesson and code will still be verbose, but will have a lot less hand-holding than the previous one.

DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.

A note about privacy: This tutorial uses files that I archived from a real-world jail website. Though booking records are public record, I make no claims about the legal proceedings involving the inmates who happened to be in jail when I took my snapshot. For all I know, they could have all been wrongfully arrested and therefore don’t deserve to have their name attached in online perpetuity to erroneous charges (even if the site only purports to record who was arrested and when, and not any legal conclusions). For that reason, I’ve redacted the last names of the inmates and randomized their birthdates.

The Cops Reporter and the Log

If you’re a daily cops reporter, calling the police station to ask for the list of last night’s arrests is probably part of your job. Many papers have some kind of cops blotter where arrested suspects are listed, and both online and in print it is usually one of a paper’s top features. The St. Petersburg Times has a modern version of the feature, complete with mugshots and stats summaries.

Arrest logs have sometimes been criticized for being little more than voyeurism (here’s a discussion over the St. Pete’s mugshot site). But knowing who your law officers are arresting, and why, is essential to a nice, free society (and for a fair and efficient police force). And the more data you have as a reporter, the better you’ll be able to cover your beat.

Most pro-active police departments will announce when they’ve made high-profile arrests. But relying on the police to tell you which arrests are the most noteworthy kind of begs the question, and doesn’t give the whole picture of arrest activity. Most states consider arrest logs to be public information (not that that stops some jurisdictions from hiding them). But a paper list or a PDF is hard to analyze. Luckily, some police departments are putting their work on the Web. They might be willing to send you a spreadsheet of arrest activity, but what if you wanted up-to-the-hour information, so that you could be aware of:

Suspected crimes that fall between egregious and infamous (non-fatal assaults, robberies, car jackings, etc.)
An abnormally large number of arrests at a given time
Unusual types of suspected crimes at a given time

This is where the web-scraping you learned in my last tutorial gets useful. You’re going to have an automated way of collecting the latest arrests news, in an ordered fashion (so that you could, for example, find the inmate with the largest bail at a given time), and you’ll save yourself and your friendly police PIO tedious paper shuffling and typing.

I’m going to base my lesson on this sheriff department’s jail system. I’ve mirrored a snapshot of their site here (zip file here), so I recommend you run your scripts on my mirror (root directory: https://danwin.com/static/jail-list/) before doing a real-world test.

The jail web site has these characteristics:

At this page is a list of every person booked in the last 24 hours
The list typically has 100 to 200 inmates at a time
Most entries in that list contain a link to an inmate’s page containing data including name, DOB, bail, charges, booking time.
Each inmate has a unique identifying number called X-REF
Not all entries have a link; inmates who have been released have only their names listed

The site is pretty useful and user-friendly. However, it’s hard to quickly glean any useful information from the main list. You have to click through each individual entry to find out why someone was jailed. The purpose of the following lesson is to automate that process so you can efficiently get the big picture of a jail’s activity.

Program flow will go something like this:

Create two text files: one to store the list of inmates (inmates.txt), one to store the list of charges (charges.txt)
Open the inmate listing page
Collect each list entry
If the list entry is not a link (i.e. the inmate has been released):
    Fetch first name, middle name, last name, intake time and release date
Else, if the list entry is a link, open it:
    Fetch first name, middle name, last name, xref, intake time, and DOB of the inmate
    Fetch and parse the list of charges
    Fetch the bail amount
In an each loop, for each inmate entry we collected above:
    Output the inmate information, in tab-delimited format, into inmates.txt, including the XREF.
    Output the charges associated with the inmate into charges.txt. Each charge will take up one line, and the XREF of the inmate will also be included so as to provide a key to the associated inmate.
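The steps above can be sketched in a few lines before any real scraping happens. This is only a rough outline under my own assumptions: the rows are made up, and the method name collect_inmates and the keys :name and :link are hypothetical stand-ins for what the real page will give us.

```ruby
# A rough outline of the program flow, using made-up rows instead of a live page.
# collect_inmates, :name and :link are hypothetical names for this sketch only.
def collect_inmates(list_entries)
  list_entries.map do |entry|
    inmate = { 'name' => entry[:name] }
    if entry[:link].nil?
      # no link: the inmate has been released, so the listing is all we get
      inmate['released'] = true
    else
      # link present: this is where the script would open the inmate's page
      inmate['released'] = false
      inmate['page'] = entry[:link]
    end
    inmate
  end
end

sample_entries = [
  { :name => 'DOE, JOHN Q', :link => 'inmate_1.html' },
  { :name => 'ROE, JANE',   :link => nil }
]

collect_inmates(sample_entries).each do |inmate|
  puts "#{inmate['name']}\t#{inmate['released'] ? 'released' : 'in custody'}"
end
```

The real script fills in each branch with Nokogiri calls, but the shape stays the same: one hash per row, and one decision per row based on whether there is a link.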

File I/O

We didn’t cover opening and writing to an external text file in the last lesson. So here’s how it goes briefly: Using Ruby’s IO class, we’re going to create two files, inmates.txt and charges.txt, and write to them what we find on the jail’s website. We’ll be using the variables inmates_file and charges_file to refer to the external files.

To open the files and set the variables, use the File class’s new method (File is a kind of IO), which takes in two parameters: a string designating the file name, and a string designating the mode, which in this case will be “a”: append mode, which is write-only and starts writing at the end of the file (read about the various modes here).

inmates_file = File.new('inmates.txt', 'a')
charges_file = File.new('charges.txt', 'a')

If these files don’t already exist, they will now. If they did, the ‘a’ mode will append new content to the end of the file.

To write something to the file, use the puts method, which writes whatever string you supply to it as one line in the file (we’ve used this method without the IO class, in which case it outputs to the screen):

charges_file.puts("Adding a new line of text to the charges file.")
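Here is a small sketch of append mode in action, if you want to see it before pointing it at real data. It writes to a throwaway temporary directory, and the helper name append_line is my own invention. Closing the file is not something the lesson's code does explicitly; I've added it here because it guarantees the text is flushed to disk.

```ruby
require 'tmpdir'

# Appends one line to the file at path, creating the file if it doesn't exist.
# append_line is a hypothetical helper name, not part of Ruby.
def append_line(path, text)
  file = File.new(path, 'a')   # 'a' = append mode
  file.puts(text)              # puts adds the newline for us
  file.close                   # closing flushes the text to disk
end

# Try it in a temporary directory so nothing real gets touched
Dir.mktmpdir do |dir|
  path = File.join(dir, 'charges.txt')
  append_line(path, 'first line')
  append_line(path, 'second line')   # lands after the first line, not on top of it
  puts File.read(path)
end
```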

While we’re setting up, let’s create an array of hashes, with each hash object holding an inmate and his/her information. We don’t have to do this…we could just output to the file each inmate record as we get to it, but this will allow us some flexibility later. All we have to do is initialize the array:

inmates_array = []
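To show the kind of flexibility I mean, here is a sketch with two invented inmates. The names, the bail figures, the 'bail' key and the highest_bail helper are all made up for illustration. Because everything sits in one array, you can ask a question of the whole set before writing a single line to a file.

```ruby
inmates_array = []

# each inmate is a hash; these names and values are invented for illustration
inmates_array << { 'first_name' => 'JOHN', 'bail' => 5000 }
inmates_array << { 'first_name' => 'JANE', 'bail' => 25000 }

# hypothetical helper: returns the hash with the largest 'bail' value
def highest_bail(inmates)
  inmates.max_by { |inmate| inmate['bail'] }
end

puts highest_bail(inmates_array)['first_name']   # result: JANE
```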

Open the inmate listing page

Now let’s fetch the inmates listing. We’ll be using Nokogiri in the same fashion we did in the last lesson, beginning by requiring the nokogiri and open-uri libraries, then using the Open-URI’s open method to fetch the page, and then Nokogiri’s HTML class to wrap up the page in a parsable format.

require 'rubygems'
require 'nokogiri'
require 'open-uri'
		
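base_url = 'https://danwin.com/static/jail-list/' # all links on the list will be relative to this address
# On current Ruby versions, open-uri fetches pages through URI.open; older versions used a bare open(...)
inmate_listing = Nokogiri::HTML(URI.open("#{base_url}current_listing.cfm.html"))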

A reminder. The construct #{something_here}, when put inside a double-quoted string, will treat something_here as an actual value of the variable something_here, not just the string. This is called string interpolation. The two following expressions, the latter using interpolation, are equivalent, though the latter will not throw an error if string2 happens to not be a String.

a_combined_string = "Hello " + string2
a_combined_string = "Hello #{string2}"

Read more about Ruby’s string interpolation here.
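If you'd like to see the difference for yourself, here is a minimal sketch. The two method names are made up, and the begin/rescue wrapper is only there to keep the error from stopping the script.

```ruby
def greet_with_plus(thing)
  "Hello " + thing        # raises TypeError unless thing is a String
end

def greet_with_interpolation(thing)
  "Hello #{thing}"        # calls to_s on thing, so any value works
end

puts greet_with_interpolation("world")
puts greet_with_interpolation(42)

begin
  greet_with_plus(42)
rescue TypeError => e
  puts "Concatenation failed: #{e.message}"
end
```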

Let’s visit the page with a browser and examine the structure. The list is an HTML table, with each row containing several columns, the first column being the inmate’s full name and, if the inmate hasn’t been released, a link to his/her booking page.

If you inspect the HTML closely, you’ll see that this page is composed of several tables. What we want is the table contained inside the element with a class of “content”.

So we’ll collect all the table rows, using Nokogiri’s xpath method, and iterate through them using an each loop. We’re going to use a variation of an each loop called each_index, which provides the numerical index of the current iteration we’re on.

	inmate_rows = inmate_listing.xpath("//td[@class='content']/table")[0].xpath(".//tr").to_a[1..-1]

The XPath syntax here is looking for a td element with class='content', then the table inside of that. There’s more than one, but the first one on the page has the data. From that, we gather all the rows (tr) and keep everything from the second row on ([1..-1]). We call the to_a method first to convert the result into a plain array, since Nokogiri’s xpath method returns a NodeSet, which won’t have the each_index method. (A bare collect with no block returns an Enumerator on current Ruby versions, and an Enumerator can’t be sliced with [1..-1].) each_index loops through an array, just like each, but it provides the index of the current iteration.
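To see each_index on its own, here is a sketch with a plain array of strings standing in for the table rows. The row contents and the helper name without_first_row are made up; a real run would have Nokogiri nodes in the array instead.

```ruby
# stand-in for the table rows; a real run would have Nokogiri nodes here
rows = ['row zero', 'first inmate', 'second inmate']

# hypothetical helper: keep everything from the second element on
def without_first_row(all_rows)
  all_rows.to_a[1..-1]
end

data_rows = without_first_row(rows)

# each_index hands us the position, and we look up the element ourselves
data_rows.each_index do |i|
  puts "#{i}: #{data_rows[i]}"
end
```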

	inmate_rows.each_index do |i|
		inmate_row = inmate_rows[i]
		inmates_array[i] = {}
		inmate = inmates_array[i]

		# each row has a set of columns with the inmate info
		list_columns = inmate_row.xpath('./td')

Because we know we’re on the ith row, we can also initialize the ith index in inmates_array as a hash to store the ith inmate’s information. Remember that each element in the inmates_array is going to be a hash of information.

Let’s use the variable named inmate as a shorthand way to refer to this position in the inmates_array. Each time we iterate through the loop, inmate will refer to the next spot in the inmates_array.

This is easier to type out 10 times than inmates_array[i].

Before we get to visiting the individual inmate pages, let’s just collect the name and other information readily available here.

Each name consists of a String in this format: last_name, first_name middle_name

So let’s use the String split method. First we split the string by the comma; this gives us an array whose first element is what’s on the left side of the comma. Splitting the second element of that array by spaces gives us another array, consisting of a first name and any middle names.

		
		
		
		# remember that you need to call Nokogiri's content method to get the text, as a String, between a tag	
		the_inmate_name =  list_columns[0].content.gsub(/\302\240/, ' ').strip.split(',')
		
		inmate['last_name'] = the_inmate_name[0]					# the name before the comma
		inmate['first_name'] = the_inmate_name[1].split(' ')[0]		# the name after the comma, but before the next space
		inmate['middle_name'] = the_inmate_name[1].split(' ')[1..-1].join(' ')	# anything after the first name; empty if there is no middle name

I’m going to be using this method call after each use of content: gsub(/\302\240/, ' ').strip

Not all entries have a middle name. When there isn’t one, the [1..-1] slice after the split simply comes back empty, so this line is safe to run on every row.
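The name-splitting steps can be tried out on their own, away from the scraper. This is a sketch with invented names, and parse_name is a hypothetical helper, not something the jail script defines. It joins any middle names into one string so a missing middle name comes out as an empty string.

```ruby
# hypothetical helper: splits "LAST, FIRST MIDDLE" into its parts
def parse_name(full_name)
  last, rest = full_name.strip.split(',')   # left and right of the comma
  given = rest.to_s.split(' ')              # first name, then any middle names
  {
    'last_name'   => last,
    'first_name'  => given[0],
    'middle_name' => given[1..-1].to_a.join(' ')   # empty string if no middle name
  }
end

puts parse_name('DOE, JOHN QUINCY').inspect
puts parse_name('ROE, JANE').inspect
```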

		
		# Moving on to the next table cell, which will be the 1 spot in list_columns
		inmate['sex'] = list_columns[1].content
		
		
		# next cell, DOB
		inmate['dob'] = list_columns[2].content
			
		# next cell, booking time
		inmate['intake_time'] = list_columns[3].content
		
	
		
		
		# let's go back to the first column to see if it contained a link
		if list_columns[0].xpath('./a').length == 0  # if there was no link, there would be 0 links returned
			
			# No link to visit, so this must have been a released inmate. Let's grab his/her release date 
			# which comes in the pattern "Released mm/dd/yyyy"...so we'll split the string and capture the second term

			inmate['release_date'] = list_columns[4].content.gsub(/\302\240/, ' ').split(' ')[1]
			
		else
		
			# visit link
			# we'll get to this subroutine in the next section
			
			
		end
	end

I call the gsub method to cleanse the strings of data. This particular website uses &nbsp; (a non-breaking space) to form a space character, and Nokogiri treats these differently than normal space characters, so strip doesn’t work as intended. So this method call is used frequently:
.gsub(/\302\240/, ' ')
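A quick way to convince yourself that strip really does ignore these characters: the sketch below builds a string with a non-breaking space on each end. It writes the character as "\u00A0" rather than the octal bytes \302\240 (those two bytes are the UTF-8 encoding of the same character), and the helper name clean is my own.

```ruby
NBSP = "\u00A0"   # the character that &nbsp; turns into

# hypothetical helper: swap non-breaking spaces for normal ones, then trim
def clean(text)
  text.gsub(NBSP, ' ').strip
end

raw = "#{NBSP}DOE, JOHN#{NBSP}"
puts raw.strip.length    # strip alone leaves both non-breaking spaces in place
puts clean(raw).length   # two characters shorter once they are replaced
puts clean(raw)
```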

Storing your Data into a File

At this point in your script, all your carefully collected data is in memory. When the script finishes execution, it disappears. That defeats the purpose of any way of tracking data. So let's store it in a persistent way...my choice would be in some kind of database, like MySQL or SQLite. But for our purposes, we can quickly learn the methods to store this information in a tab-delimited file that can be opened as an Excel spreadsheet.

We will be using Ruby's File class:


			## write to file
			File.open("inmates.txt", 'w'){ |f|

				f.write("first_name\tmiddle_name\tlast_name\tsex\tdob\tintaketime\trelease_date\txref\tbooking_number\tarresting_agency\ttotal_bail\n")

				inmates_array.each do |inmate|
					f.write("#{inmate['first_name']}\t#{inmate['middle_name']}\t#{inmate['last_name']}\t#{inmate['sex']}\t#{inmate['dob']}\t#{inmate['intake_time']}\t#{inmate['release_date']}\t#{inmate['xref']}\t#{inmate['booking_number']}\t#{inmate['arresting_agency']}\t#{inmate['total_bail']}\n")
				end
			}

A quick explanation. The File class has the open method, to which we pass in two arguments: the name of the file we want to write to, and the mode. In this case, we're using 'w', which stands for "write" mode. The curly braces set off the code that gets executed while this File is open, with the variable f referring to the actual file.

File also has an instance method called write, which takes in a String as an argument to write to the open file.

Backslash-t will write a tab, and backslash-n will write a newline character.

The next block of code is similar to the first, but it refers to a "charges.txt" file. Remember that each inmate could have more than one charge to his/her name. The following file lists every charge, but also lists the xref key to tie back into inmates.txt. For convenience's sake, we're also going to print out the inmate name and the inmate's total bail on each line.


			File.open("charges.txt", 'w'){ |f|
				f.write("name\txref\ttotal_bail\tcode\tseverity\tdescription\n")

				inmates_array.each do |inmate|
					if inmate['charges']
						inmate['charges'].each do |charge|
							f.write("#{inmate['first_name']} #{inmate['last_name']}\t#{inmate['xref']}\t#{inmate['total_bail']}\t#{charge['code']}\t#{charge['severity']}\t#{charge['description']}\n")
						end
					end
				end
			}

Printing out the inmate's name and total bail, although redundant, allows us to quickly skim the list to see if there were any unusual crimes connected to unusual amounts of bail (note that the jail site does not break down bail amounts per charge).
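Both file-writing blocks follow the same pattern: a header line, then one tab-joined line per record. Here is that pattern pulled out into one small method, as a sketch only. write_tsv is a name I made up, the sample records are invented, and the file goes into a temporary directory so you can run it without clobbering anything.

```ruby
require 'tmpdir'

# Writes an array of hashes to path as tab-delimited text, header row first.
# write_tsv is a hypothetical helper, not part of the lesson's script.
def write_tsv(path, rows, columns)
  File.open(path, 'w') do |f|
    f.write(columns.join("\t") + "\n")
    rows.each do |row|
      # a missing value becomes an empty cell, so the columns stay lined up
      f.write(columns.map { |col| row[col] }.join("\t") + "\n")
    end
  end
end

sample = [
  { 'first_name' => 'JOHN', 'last_name' => 'DOE', 'total_bail' => 5000 },
  { 'first_name' => 'JANE', 'last_name' => 'ROE' }   # no bail recorded
]

Dir.mktmpdir do |dir|
  path = File.join(dir, 'inmates.txt')
  write_tsv(path, sample, ['first_name', 'last_name', 'total_bail'])
  puts File.read(path)
end
```

Listing the columns once, in an array, means the header and the data rows can't drift out of order, which is easy to do when both are typed out as long strings.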

Putting it all together for the real world

The above code, put all together, will execute cleanly and compile some nice text files for you, especially if you've saved the package of HTML files onto your hard drive. But in the real world, you'll be targeting an internet server, which may not like you hitting it at a rate of five times per second, or which may intermittently fail.

To deal with this, I've added a call to Ruby's sleep method, which pauses script execution for a given number of seconds. I've also thrown in some error-handling. Here's the basic structure:

		# some code
		begin
			# risky code here
			# The Ruby interpreter will watch the code that gets executed within the begin branch...if something goes wrong, it's going to execute code in the following rescue branch
	
		rescue
			# the begin-branch messed up, time to run some other code
			puts "An error happened!"
		else
			# this code gets executed if the begin-branch worked fine
		ensure
			# this code in the ensure branch (which is optional) runs no matter what.
			puts "We're done with our error handling"
		end
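To make that structure concrete, here is one way sleep and rescue could be combined, as a sketch under my own assumptions rather than the exact code from the finished script. The method name with_retries, the default of three attempts and the two-second pause are all made up, and the "server" is faked so the example runs without a network connection.

```ruby
# Runs the block, retrying after a pause if it raises an error.
# with_retries and both default values are hypothetical, for this sketch only.
def with_retries(max_attempts = 3, pause_seconds = 2)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError => e
    puts "Attempt #{attempts} failed: #{e.message}"
    if attempts < max_attempts
      sleep(pause_seconds)   # be polite: give the server a moment
      retry                  # jump back to the begin line
    end
    raise                    # out of attempts, so pass the error along
  end
end

# Fake a flaky server: fails twice, then succeeds
calls = 0
page = with_retries(3, 0.1) do
  calls += 1
  raise "connection reset" if calls < 3
  "<html>ok</html>"
end
puts page
```

In a real scraper, the block would hold the call that fetches a page, and you would probably also want a sleep between successful requests so you aren't hammering the server.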

Coding for Journalists 101: Go from knowing nothing to scraping Web pages. In an hour. Hopefully.

Dan Nguyen — Tue, 06 Apr 2010 12:40:34 +0000

UPDATE (12/1/2011): Ever since writing this guide, I’ve wanted to put together a site that is focused both on teaching the basics of programming and showing examples of practical code. I finally got around to making it: The Bastards Book of Ruby.

I’ve since learned that trying to teach the fundamentals of programming in one blog post is completely dumb. Also, I hope I’m a better coder now than I was a year and a half ago when I first wrote this guide. Check it out and let me know what you think:

http://ruby.bastardsbook.com

Someone asked in this online chat for journalists: I want to program/code, but where does a non-programmer journalist begin?

My colleague Jeff Larson gave what I believe is the most practical and professionally-useful answer: web-scraping (jump to my summary of web-scraping here, or read this more authoritative source).

This is my attempt to walk someone through the most basic computer science theory so that he/she can begin collecting data in an automated way off of web pages, which I think is one of the most useful (and time-saving) tools available to today’s journalist. And thanks to the countless hours of work by generous coders, the tools are already there to make this within the grasp of a beginning programmer.

You just have to know where the tools are and how to pick them up.

Click here for this page’s table of contents. Or jump to the theory lesson. Or to the programming exercise. Or, if you already know what functions and variables are, and have Ruby installed, go straight to two of my walkthroughs of building a real-world journalistic-minded web scraper: Scraping a jail site, and scraping Pfizer’s doctor payment list.

Or, read on for some more exposition:

Who this post is for

His Girl Friday

You’re a journalist who knows almost nothing about computers beyond using them to connect to the Internets, email, and cheat on Facebook scrabble. This is not entirely trivial; if you’re able to do this without typing your password and SSN into a phishing site, you’re (sadly) a step ahead of most of the Internet populace. OK, it’ll also help if you’re familiar enough with your operating system (Windows or Mac…I’m assuming anyone using Linux won’t even need this tutorial) to know how to install programs.

Anyone who has taken a semester of computer science will scoff at how I’ve simplified even the basic fundamentals of programming…and they’d be right…but my goal is just to get you into the basics to write some useful code immediately. You’re going to have to make the effort yourself to learn the topics in-depth.

Thankfully, coding is something that provides immediate success and failure. You hit Ctrl-R, your script runs, and in five seconds or less, you’ll learn if you did right. The more you fumble, the more you learn. And getting around an error no longer requires owning a reference library.

The roadmap

This tutorial aims to walk you through the bare essentials of HTML, programming theory and tools so that you can do something very practical: build an automatic process to gather data from websites. I made this lesson into one giant page so you can see for yourself, in one glance, the number of words (about 9,000) it takes to touch upon what is essentially one semester in a first-level computer science course. Also, I have no ads to sell.

Here’s what will happen if you read this entire page:

Learn a little HTML
Install Firefox and Firebug
Install Ruby, a programming language
Learn some programming theory
Write a script
Execute the script

Jump to the table of contents or read some more blab.

What is web-scraping and how it’s important to journalists

Web-scraping (also called screen-scraping) is the automated process of collecting the *useful* data off of a webpage. This is made possible because of the design of HTML, which, when done right, puts this data in as predictable a format as an Excel spreadsheet…sans the convenient interface, keyboard shortcuts, and Clippy. So you have to write your own tool tailored to the structure of a webpage.

The importance of data collection should be obvious to a journalist. Used to be, if you wanted a set of data…such as the list of restaurant inspections so you could do a regression analysis of failed tests with respect to neighborhood income levels, you’d ask them for the data, sue them if they said no, and if you were on the right side of the law, they’d grudgingly hand you a chunk of ordered text that you could eventually put into a spreadsheet.

But now, it’s possible that a public-information officer will just point you to the public website and say, there it is. And it’s not always a case of them being ignorant/disdainful of laws that oblige them to give the dataset, in electronic form, that backs the website. From their viewpoint, the information is there for any idiot with an Internet connection to ask for, so what are you whining about?

At this point, you can either go through a weeks-long argument over emails and phone messages that ends with their legal counsel compelling the PI officer to hand over the data. Or, if keeping your story idea secret isn’t a priority, you could explain what your intent is, and why you need a whole dataset to see if a trend exists. Either way, you might have almost another week or so of waiting for the PIO to successfully wrangle their tech people (and legal staff, who need to vet the released data for any confidential info) into giving you the data in a nice comma-delimited format.

So, if their website already has the information you need (although, often, the web display omits record keys and such that are useful), why not write a script in 15 minutes to grab it? Also, even if data is released willingly, it’s not always at a convenient pace. If a website is updated faster than a PIO can send you email attachments, then scraping the website on a nightly basis will save both of you headaches.

And some types of information are just not FOIA-able. My former colleague Brian Boyer, now news-apps chief at the Tribune, created ProPublica’s ChangeTracker, built on a web-scraping service, to check when and how the White House changes its website. The request, “Hey, can you tell me all the times you’ve changed text on your website, what the text originally was, and what you changed it to” is not something a PIO could easily fulfill, even if he/she wanted to.

Web-scraping sometimes has bad connotations…because this is how various members of royalty find your email address in order to tell you that they are a distant family relation with $10,000,000US that they desperately want to give to you. So yes, you could use it for ugly purposes. My response is that if that’s your ultimate goal, you are way behind the game, and you will probably suffer a humiliating karmic fate, either in your online or real life.

On the other hand, there are innumerable sets of public, useful data that no one has gotten around to mapping out and collecting in a useful format. So let’s get to it.

DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.

The task

Thomas Jefferson lived to be 83, according to Wikipedia

When you get through this tutorial, you will be able to answer the question: According to Wikipedia, what is the average age of U.S. Presidents whose last names have more than six characters? Not an important question, but it is on the same order of difficulty as, say, scraping a county jail’s booking list to find the inmates with the largest bail amount and charge list, and how many are repeat-offenders…which are the second and third lessons.

HTML

HTML is what makes web pages not just a stream of characters. Why did that “not” in the previous sentence appear bold? Because I wrapped the word “not” in bold tags. The raw code is: <b>not</b>

The design and theory of HTML are topics that could consume the rest of your waking life. For now, its relevance to us is that with HTML, web pages have structure. And with structure, a web-scraper can reliably collect the useful bits of data as it would from columns of a spreadsheet.

W3Schools is the best place to get a primer on HTML.

Here is a h1 headline

Here is a h4 headline

OK, one more critically important thing about tags. They can have attributes.

Let’s say I wanted to make something not only be a headline (i.e. bold large text), but the color red. There are many ways to do this, but let me show you the most simple (if not totally standards-compliant) way to illustrate the simplest form of an attribute:

An attribute consists of: the name of the attribute, an equals sign, and then the value of that attribute enclosed in quotation marks. Like so: attribute=”this_is_the_attributes_value”

<h1 color="red">This is a headline</h1>

That starting tag, <h1 color="red">, is where attributes go: after the tag name, h1, and before the closing right-angle-bracket. The name of the attribute, color, is followed by an = sign. Then quotation marks (or single quotes; either way, they have to match, as they would when you write down someone’s quote, or someone quoting a quote) enclosing the value of the attribute. In this case, red.

HTML Errors

Couple of things to keep in mind. Tags come in pairs. When things look funny on a hand-coded webpage, usually it’s because the coder didn’t provide a closing tag to his starting tag. Here’s a properly tagged sentence:

<b>This sentence is meant to be bold.</b> <i>This sentence is just in italics.</i>

Results in: This sentence is meant to be bold. This sentence is just in italics.

In this sentence, I didn’t provide a closing bold tag, and so the bold part overlaps into the italics sentence, making a bold AND italicized sentence:

<b>This sentence is meant to be bold. <i>This sentence is just in italics.</i>

Results in: This sentence is meant to be bold. This sentence is just in italics.

Also, close the tags in the reverse of the order you opened them, so that the most recently opened tag gets closed first. The following is not properly-structured HTML, because the closing bold tag comes after the opening italics tag:

<b>This sentence is meant to be bold. <i></b>This sentence is just in italics.</i>

Sometimes browsers will compensate for coder-error and interpret this in a way that doesn’t look awful. But you just need to know that this violates a principle of HTML…and pages that you scrape that aren’t well structured may give strange results even if you’ve written a logically-designed scraper.

Hyperlinks

Hyperlinks are those (depending on a website’s style) underlined words that, upon clicking, send you to a whole different page. They are nothing more than special tags with an important attribute.

The tagged hyperlink makes the word “link” a clickable link that goes to Google. The href attribute describes where the link sends you:

This <a href="http://www.google.com">link</a> has many answers

Results in:

This link has many answers.

Want to try some tags and hyperlinks yourself? Use W3Schools interactive editor.

Firefox and Firebug

As I wrote earlier, HTML structures the data you want. But you need to know how it’s structured, and so you need to know the designer’s blueprint. Not to get in a browser war, but just to make things easier on me, you can’t go wrong by first downloading Firefox, the free open-source browser by Mozilla.

Now go to any website, right click on an empty space, and click “View Source” in the submenu. You’ll likely see something like this:

That’s the raw HTML. You might eventually get to the point where HTML is what the Matrix is to Neo. But let’s make it as painless as possible. Firefox has many plugins, including one called Firebug, which makes it very easy to dissect code. Get it here.

Firebug, a plugin for Firefox

Double-click on one of the sample headlines in this tutorial to highlight it. Then right-click to open the submenu, then click “Inspect Element“. This should bring up a Firebug panel that lets you see the HTML that made that headline. This saves you from having to search through the entire source to find that headline, just to see the tags that wrap it.

Like I said, in order to successfully web-scrape, you’re going to have to know how the elements (the paragraphs, headlines, and links) were structured. Firebug is a tool that helps pinpoint the elements you want to know about.

Programming Basics

A good way to annoy a programmer is to say something like, “Yeah, I have some programming experience: I’ve been writing HTML for two weeks now.” Writing HTML is not programming, any more than operating a stereo equalizer makes you a classically-trained guitarist. HTML is a way to describe and present content, but you’re not running any kind of computerized task.

So, I went through the basics of HTML so you’d be familiar with the content that you’d be collecting. Now we’ll learn the basics of how to program a script that will actually collect that content.

Installing Ruby

What is Ruby? It’s a programming language. And like a spoken language, once you’ve learned one, you’ve learned the fundamentals (i.e. the concepts of verbs, nouns, sentences, etc.) that allow you to try out all the other ones. Ruby is also the basis for Ruby on Rails, a very popular framework that many developers use to build data-driven websites. But right now, we’re collecting data from websites, not building them.

I’ve purposely been brief here. Installing Ruby and its libraries may be the most frustrating aspect of this lesson, and I have little more insight to it than, “I have a Mac w/ Leopard, and it came with it”

Installation instructions for Ruby are here…if you’re on a Mac OS X with Leopard or better, you should be good to go. Hopefully, the one-click installer for Windows should be easy enough to install (check the Enable RubyGems and SciTE boxes).

The One-Click Ruby Installer for Windows

More specifically, Ruby is an interpreted language…so I use the phrase “Ruby interpreter” to refer to the program that reads your script, makes sense of it, and executes it. Read more about this definition at Wikipedia.

The Ruby Interactive Prompt (IRB)

If you belong to the target-audience of this tutorial, you probably have been able to get your computer to perform tasks (such as, ‘Open my web browser’) with your mouse-clicking. Programming means you’re going to be typing out lines of code that execute tasks. Your web-scraping is essentially going to be a sequence of such commands, i.e. a script.

But why wait until you get a complete script when we can start executing commands right now? This is where Ruby’s Interactive Prompt (IRB) comes in. In its simplest form of operation, the IRB waits for you to type in a line of code, then for you to hit “Enter/Return”, and then it will run your command, provided it makes sense.

On Windows, press the Windows key and R together to bring up the Run… prompt. Type in ‘cmd’ and hit Enter. Then, in the command window, type in ‘irb’. On the Mac, go to Applications => Utilities => Terminal. At the command line, type in ‘irb’.

Interactive Ruby prompt

Now that you’re here, type in the following:

1+6 #result: 7

Congrats. You just wrote a one-line script to figure out what one plus six is.

Note: In Ruby, the pound sign ‘#’ designates the code following it to be a comment; I will use this convention in the code boxes to mark what your result after a command should be.

Let’s also learn a common Ruby command: puts. It simply outputs what comes after it (actually, not quite that simple, but you’ll learn soon in the next section)…I’ll be using this in the script to output results.

puts "Hello World" #result: Hello World

Read more about the command-line interpreter.

Strings

Let’s say you want to be a little more narrative about the above 1+6 calculation. Try writing out those numbers and enclosing them in quotation marks. Like so:

"One"+"Six" # result: "OneSix"

Your answer won’t be “Seven”, but “OneSix”. Why? To human eyes and ears, 1+6 and “One”+“Six” might be the same. But in Ruby, and most other programming languages, the computer interprets the latter command as joining two words, i.e. strings, together.

Strings can be enclosed in either double-quotes or single-quotes. However, double-quotes, in Ruby and other languages, allow for some important manipulation, called string interpolation. Good to know for later. Just make sure that whatever you use, the first mark matches the second.

In the programming-world, “six” is fundamentally different than 6. “Six” is what Ruby considers a String. 6 is a Number.

So what happens when you try to add “Six”, the string, to 6, the number?

"Six"+6

TypeError: can't convert Fixnum into String
	from (irb):2:in `+'
	from (irb):2
	from /usr/local/bin/irb:12:in `<main>'

Congrats, it’s your first of many, many times of making the Ruby interpreter choke. In the case of numbers and strings, it only knows how to add like items together.
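If you really do want to combine a string and a number, you have to convert one into the other first. The sketch below uses Ruby's built-in to_i (string to whole number) and to_s (anything to string); the two method names wrapped around them are made up for illustration.

```ruby
def add_as_numbers(text, number)
  text.to_i + number       # "6".to_i is the number 6
end

def add_as_strings(text, number)
  text + number.to_s       # 6.to_s is the string "6"
end

puts add_as_numbers("6", 6)     # result: 12
puts add_as_strings("Six", 6)   # result: Six6
```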

The takeaway from this is that, for our purposes, anything in quotation marks is a string. Even a number in quotation marks is no longer a number. You’ll get the same above error if you try:

"6"+6

The quotation marks make all the difference, just as they do in the journalism world. For example:

The governor is a scumbag who molests staffers on taxpayer-dime
by Dan Nguyen, Newswire, Inc.

Whistleblower: “The governor is a scumbag who molests staffers on taxpayer-dime”
by Dan Nguyen, Newswire, Inc.

Variables

OK, you now know that you shouldn’t add strings to numbers, and you’re perfectly content to add strings to create results like “eightzero”. What if you tire of typing quotation marks?

eight+zero # NameError: undefined local variable or method `eight' ...

What happened here? Well, without quotation marks, eight and zero are no longer considered strings. In their unquoted form, they are considered variables that hold some kind of value.

Think back to algebra when you were asked to solve “x+1=6”. You weren’t supposed to interpret that as:
the letter x added to the number 1 equals 6

The x is a stand-in for the value 5. x could’ve been a, b or y.

(Forgot what algebra was? Try this great primer, “The Joy of X” by the NYT’s Opinionator)

So, to make eight+zero understandable by the Ruby interpreter, you must assign those two terms values. So, try:

eight=8
zero=0
eight+zero # result: 8

Now, eight+zero is the same as 8+0.

Enter the following into the IRB:

zero=1
eight+zero # result: 9

You should get 9 as the result. The variable eight is still 8. But you assigned zero the value of 1. Therefore, you were asking the interpreter to execute 8+1.

Here’s what you should grok by now: unquoted words are considered to be variables, and they are undefined until you’ve assigned them a value. And the name of the variable is completely independent and unrelated to its actual value. Thus, nine="nine" makes as much sense in Ruby as this_variable_has_a_value_that_is_not_nine_dang_it="nine"

Obviously, since you can name your variables just about anything (stick to a series of lowercase letters, numbers and underscores, with no spaces or hyphens), name them something that is related to their actual value, so that your code is more readable.

At this point, we’ve run through a lot of programming concepts. But if you don’t understand how the above examples, and the following:

one = 1
one = 2 # assigning the variable named one to another value
one + one # result: 4

…then pause for a moment. It’s not a trivial topic, but it is critical to understand it at least at this level. Go here for more discussion on variables.

By the way, arithmetic symbols, such as + and -, are called operators. A statement like 4+5 is an expression. I’ll avoid, or mangle, the terminology throughout the lesson.

Comparison operators

Let’s say you’ve written a bunch of code and forgot whether you set the variable eight to “eight” or 8. How to test that? Well…typing in eight and hitting ‘Enter’ is the easy way…but now’s a good time to learn the concept of a comparison.

We already know that =, the equals sign, is something that assigns a value: what’s on the right of the = is set as the value of the variable on the left side.

So what’s a double equals sign == mean?

Write this sequence of code:

eight="eight"
eight==8 # result: false

The second line of code, translated into English, is you telling the interpreter:

The value of the variable named eight is the number 8

To which the computer responds: false

Here, Ruby is telling you that the string “eight”, to which the variable eight was assigned, is not equal to the number 8.

That fits with what we learned from vainly trying to add "eight"+8: Ruby treats a string and a number as two different things. Evaluating eight=="eight" will yield the value of true.

Note: true and false are not variable names. They are reserved words that are values in themselves. So, this will result in an error: true = “A string I’d like to assign the value named true”. However, replacing that equals sign with a double equals sign, ==, will return a result of false.
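If the difference between = and == still feels slippery, here is a small sketch you can paste into IRB. It reuses the variable name from above; .class is a built-in method that reports what kind of value you are holding, and != is the "not equal" comparison, which I haven't mentioned yet.

```ruby
eight = "eight"          # one equals sign: assignment

puts eight == 8          # false: a String is never equal to a number
puts eight == "eight"    # true
puts eight != 8          # true: != means "is not equal to"
puts eight.class         # String

eight = 8                # reassign the same variable to a number
puts eight == 8          # true
```

Notice that the comparison never changes the variable; only the single equals sign does.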

Arrays

Think of an Array as something that contains a sequence of other variables and values. In Ruby, and most other languages, arrays are set off by square brackets, [ and ].

Here’s the easiest way to initialize an Array:

an_empty_array = []
array_with_numbers=[1,2,3,4]

Above, I’ve assigned two variables the values of two different arrays. The first, an_empty_array, is empty. The second, array_with_numbers, is filled with four numbers. You could’ve written out four lines of code, assigning four different variables respectively with the numbers 1 through 4. With an array, you essentially have one variable referring to 4 values.

How do you access the individual values? Use the name of the variable, and then the index (consider the index as the address of the element you want), set off by square brackets. In this fashion, the square brackets denote that the variable they follow is an array, while the value inside them is the index/address. Such as:

array_with_numbers[0]

In Ruby, the first element of an array has an index of 0. So the above line would give you the value of 1. array_with_numbers[3] would get you 4. The index 4 in array_with_numbers would get you an empty (nil) value.

Arrays can contain other variables too, like so:

an_empty_array = []
array_with_numbers=[1,2,3,4, an_empty_array]

array_with_numbers[4] would now yield [], an empty array, which is the value of the variable named an_empty_array

More about Arrays here.
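Here is a short sketch of array indexing to try in IRB. Everything in it is built-in Ruby; the negative index and the << operator are small extras that go slightly beyond what I described above.

```ruby
array_with_numbers = [1, 2, 3, 4]

puts array_with_numbers[0]          # 1, the first element
puts array_with_numbers[-1]         # 4, a negative index counts from the end
puts array_with_numbers.length      # 4
puts array_with_numbers[4].inspect  # nil, nothing lives at index 4

array_with_numbers << 5             # << appends a value to the end
puts array_with_numbers.inspect     # [1, 2, 3, 4, 5]
```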

Hashes

OK, I’m going to make another vast simplification of a programming object: Hashes can be considered Arrays in which the indexes are strings, not numbers. Hashes are denoted by curly brackets.

a_hash = {"one"=>1, "two"=>2, "three"=>3}

Note the convention of => which assigns a value to an index (the correct term, actually, is key) of the hash. So:

a_hash["two"] # result: 2

It’s not important right now to understand the full differences and capabilities of Arrays and Hashes, but you’ll be seeing this notation in the script we write.

Read more about Hashes here.
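A quick sketch of reading from and adding to a hash, using the same a_hash as above. Asking for a key that doesn't exist gives you nil rather than an error, which is worth knowing before it surprises you.

```ruby
a_hash = {"one" => 1, "two" => 2, "three" => 3}

puts a_hash["two"]            # 2
puts a_hash["four"].inspect   # nil, there is no such key yet

a_hash["four"] = 4            # add a new key/value pair
puts a_hash["four"]           # 4
puts a_hash.keys.inspect      # ["one", "two", "three", "four"]
```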

Conditional Branches

So far, we’ve been typing in single line commands. Your final script is going to be a long list of commands telling the computer to:

Go to Wikipedia’s listing of each U.S. President’s page (i.e. a list of links to each page)

Visit, via hyperlink, each page belonging to a president whose last name is longer than six letters

Grab the president’s age from each individual page, if that president is dead

Average those ages

Our criteria for inclusion means we have to come up with some way to not visit, say, John Adams’s Wikipedia page. And to not include a living president’s age. So inside our script, there’s going to be a section of code telling the computer to go into a webpage…but that code should only execute if the length of a President’s last name is greater than 6.

That’s where the if conditional branch comes in. Without getting too far past the basics, here’s the simplified code:

president = "John Adams"
last_name_length = 5 # I manually set this variable for now; in your actual script, you'll find this value programmatically

if last_name_length > 6
  # then go to his wikipedia page...and while we're in this branch of code, let's print something
  puts "Entering a page"
else
  # OK, don't go there. But let's print out a statement
  puts "This name is too short"
end
# result: "This name is too short"

What the above section of code is essentially saying is that if the value of the variable last_name_length is greater than 6, then do what was in between if and else. Otherwise, completely skip what was there and go to what’s between the else and end

The else is optional…if you want, you could do nothing if the conditional statement (if last_name_length > 6) isn’t satisfied. The end is required; it tells Ruby that that’s the end of that optional branch of code that started with the if.

Up till now, our series of commands have been straight-forward: the interpreter executes one line after another. Introducing the if statement has introduced a fork in the road; if the condition in the if statement isn’t met, the interpreter skips past that if block.

The if statement is the simplest of such conditional branches. All you need to know for now is that there’s a way to tell the Ruby interpreter to execute a certain bit of code if a condition is met. Read more about it here.

Methods

I’m really going to be brief here. Think of methods as a set of commands that are useful enough to run more than once.

Out of bad habit, I’ll use the term function as a synonym for method. They’re the same concept, except method is a kind of function, the explanation of which requires me getting into object-oriented programming. Which I don’t want to right now.

Let’s say I need to take two numbers, multiply them together, subtract 5 from the product, and then add the result to itself. In code, that would be:

# initialize the variables:
a = 10
b = 20
# now make each step its own line
c = a * b
c = c - 5
c = c+c # result: 390

Well, that could’ve been one line, without using the placeholder variable named c, like so:

(a*b)-5 + (a*b)-5

If I need to run this more than once, it’s a bit annoying to type out each time we want to run that series of commands, so let’s define a function called my_funny_equation

def my_funny_equation(first_argument, second_argument)
  answer = (first_argument*second_argument)-5 + (first_argument*second_argument)-5
  return answer
end

Inside the parentheses, following my_funny_equation are the arguments, the values that you want the method to work with.

The takeaway here is that I’ve encapsulated my series of commands into a block of code. The variable names, arbitrarily named first_argument, second_argument, and answer, are references that only exist within that block of code which defines the method my_funny_equation.

Now that this method is defined, I can do:

my_funny_equation(10, 12) # result: 230
my_funny_equation( 4, 5) # result: 30
answer+10 # result: (Ruby will choke here)

Why does the third command choke? Again, answer exists only within the little world defined in the my_funny_equation method, between the def and end lines. It has no value outside of the method definition. This is called function scope, a topic outside of, well, the scope of this simplified tutorial. Read more about scope here.
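To see scope in action without crashing your session, here is a sketch that catches the error instead. The begin/rescue wrapper is something I haven't covered; treat it as a safety net that lets the script keep running after Ruby chokes. The variable named result is my own addition for illustration.

```ruby
def my_funny_equation(first_argument, second_argument)
  answer = (first_argument * second_argument) - 5 + (first_argument * second_argument) - 5
  return answer
end

result = my_funny_equation(4, 5)   # capture the returned value in a new variable
puts result                        # 30
puts result + 10                   # 40

begin
  answer + 10                      # answer only exists inside the method
rescue NameError => error
  puts "Ruby choked: #{error.class}"   # Ruby choked: NameError
end
```

The practical lesson: if you want to keep what a method produced, assign the method call to a variable of your own.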

OK, the above was just introducing you to the concept of a method/function. The kind of methods we’ll be dealing with in our script are called instance methods. These methods belong to something…an actual number, for example. 6 is an instance of an Integer (a kind of number). “Six” is an instance of a String.

Example:

The number 2.67 is considered by Ruby to be of the class Float…that is, a number with a floating decimal point.

More specifically, 2.67 is an instance of a Float. So is 4.777. And so is 8.999.

What if I wanted to go about rounding a Float number? Well, luckily, Ruby has built in instance methods that do this. The basic structure is the instance, followed by the method’s name…as follows:

instance.method_name

The method for rounding a Float is called “round”. So, to round 2.67, we do:

2.67.round
>> 3

This is a little confusing because of the two periods. Just have faith that the Ruby interpreter knows the difference; it sees the first “dot” as a decimal point defining the number. The second “dot” tells it that we want to access the built-in Float method called round.

One more example, let’s work with arrays.

Let’s say we have:

an_array= [1,2,3,4,5,6]

I want to make an array that consists of the first three elements of *any* array. Luckily, Ruby arrays have a built-in method called slice.

an_array.slice(0,3)
>> [1,2,3]

So, slice is the name of an instance method of things that are Arrays. Inside the parentheses are two arguments, the first denotes the element to start out at (in this case, 0, since we want the first element), the second denotes how many elements to include in this sub-array (3).

What was the point of all of this? In our final code, you’ll be seeing calls to methods. Someone already wrote the method that, say, collects all the text of a webpage and stores it into a variable for you. But you need to know the name of that method and how to invoke it.
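Here are a few more instance methods to poke at in IRB, as a rough sketch. All of them are built into Ruby; floor, length, upcase and reverse are extras I picked just to show that the instance.method_name pattern is the same everywhere.

```ruby
puts 2.67.round        # 3
puts 2.67.floor        # 2, floor always rounds down
puts "Six".length      # 3
puts "Six".upcase      # SIX

an_array = [1, 2, 3, 4, 5, 6]
puts an_array.slice(0, 3).inspect   # [1, 2, 3]
puts an_array.slice(3, 2).inspect   # [4, 5]
puts an_array.reverse.inspect       # [6, 5, 4, 3, 2, 1]
```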

Writing Your Script

OK, now we get past the fundamentals and into things that will really solve your problems. It wasn’t important to have intimate knowledge of the previous concepts, but just to know they exist.

But how can you, knowing just the basics, do something as complicated as connect to a series of web pages, collect their content, pick the exact points of needed data, and arrange them in a useful structure? Because other programmers have abstracted all these functions in such a way that we could do this series of tasks in just a few lines.

I’m going to write out an extremely-verbose way of performing these tasks to make each step clear…but as you get better, you’ll find ways to minimize your typing.

Here’s the list of steps we’ll be doing, in somewhat plain English:

1) Grab the contents of the presidents list

2) From that list, grab each president’s name

3) Determine if the last name is longer than 6 characters

4) If so, fetch the link to the president’s page and open it

5) Grab the age from the president’s page

6) Add up the data you gathered

Before doing any of the above steps, we’re going to download a Ruby library that makes the above tasks trivially easy (that is, compared to starting from scratch)…

Nokogiri

I won’t get into what “gems” are in relation to the Ruby programming language; just think of them as pre-packaged functions and code that you can easily download and re-use for your own scripts.

Complete instructions can be found at the nokogiri homepage. You may run into a lot of errors…my advice is to copy part of that error and Google it with that and “nokogiri,” and hopefully you’ll get an answer.

Hopefully, it’s as simple as going into your command-line console (exit the interpreter if you’re in there) and typing:

>> gem install libxml-ruby
>> gem install nokogiri

What is nokogiri? It’s a library of code that makes it easy to parse a webpage. Remember when you right-clicked on a webpage to view source, and how painful of a task it would be to collect, say, what the text of the third headline is…on 100 different pages? Nokogiri essentially allows you to do this with a couple lines of code. Check out the homepage here.

Step One: Fetch the Contents From the Presidents List

Let’s try Nokogiri out. Open your ruby interpreter and type in the following commands; these first lines invoke the method require, which will give your script access to the required libraries of code, including nokogiri:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

This next line will fetch the contents of Wikipedia’s list of U.S. Presidents

list_of_presidents = Nokogiri::HTML(open('http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States'))

I’m going to quickly deconstruct this line:

Nokogiri::HTML is a method in the Nokogiri library that takes the raw HTML of a page and parses it into a structure we can search.

open is the method that fetches the page. It doesn’t belong to Nokogiri; it comes from open-uri, one of the libraries we required above, which teaches Ruby’s ordinary open method to accept web addresses as well as file names. (On Ruby 3.0 and later, you have to write URI.open instead of open.)

‘http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States’ is the string that holds the address of the page we want. The open method needs this to…well…know what to open.

list_of_presidents is the variable that holds whatever Nokogiri::HTML hands back.

OK, that one line, maybe the most complicated line we’ve written so far, just did a whole lot for you.

Using a method called open (which takes in a web page address as an argument), it opened a connection with Wikipedia, performed the Internet protocols necessary to exchange information, and copied the content of the target page. Nokogiri::HTML then wrapped it all up in a Nokogiri data structure for later manipulation. We are pointing to this data structure with the variable list_of_presidents.

Let’s try to grab the contents of the second h2 tag (i.e. the second, secondary headline)

list_of_presidents.xpath('//h2')[1].content
=> "Presidents"

Running scripts from Text Editors or the Command Line

Running Ruby commands from the Interactive Ruby prompt is nice and all, for quick feedback. But from here on out, we’ll be writing a full-on script with a few dozen lines of code. So, it’ll be easier if you create a new text file with a file extension of .rb … something like, myfirstscript.rb to put your code in.

You should be using a text-editor for this…something better than Notepad, at least.

For Macs, there’s the free and excellent TextWrangler. If you’re willing to spend some money, TextMate is what I use and it’s worth the $55. A free 30-day trial can be downloaded here.

For Windows, the one-click Ruby installer includes the free SciTE4. Also, there’s the free Komodo Edit. For $35, there’s the “Textmate on Windows”, E-TextEditor (free trial here)

Some of these text editors have a shortcut-key that allows you to run the script. For example, SciTE uses F5. Note how the output is conveniently displayed to the side:

Writing a Ruby script in SciTE for Windows

There’s also the old-fashioned command line, from which you ran IRB. Navigate to the directory that you saved your file in. Then type “ruby whatever_your_file_name_is.rb”:

Running a script from the Windows command line

OK, here’s another high-level programming construct we’ll superficially try to cover…

XPath

XPath is a syntax used to address parts of HTML documents. It allows you, for example, to find all text that’s between headline, italics, paragraph, or whatever tags you want. You could also do something as specific as “Find the third link in every paragraph.”

From Zvon.org, how to select all 'BBB' nodes using XPath

It’s another field of knowledge that you could spend your life memorizing. For our purposes, you just need to know that it’s a way to pinpoint an element, or a set of elements, in an HTML document.

list_of_presidents.xpath('//h2')[1].content #result: "Presidents"

Let’s dissect the above nokogiri command. list_of_presidents was a variable holding a Nokogiri data structure…essentially, the entirety of the Wikipedia page in a format that the Nokogiri library can understand.

xpath, then, is an instance method of this data structure, that takes a string as an argument. That string contains XPath syntax.

The string, in the above example, is “//h2”. In XPath syntax (check out W3Schools for a primer), the double-slashes // tell the parser to look anywhere in the document. h2 is the specific tag (a level-2 headline) that we want. And [1] denotes that the result of the xpath method is an array, of which we want the value at the 1st index (technically, the second value of that array…remember that an array’s index starts at the 0th index). And content is an instance method of what was in that 1st index: a nokogiri data structure. content, in this case, pulls what was in those h2 tags: “Presidents”.

The above line could’ve been broken down into:

a = list_of_presidents
a = a.xpath('//h2')
a = a[1]
a = a.content
#result: "Presidents"

That was a very simple XPath query. Another one could be:

list_of_presidents.xpath('//p/a[4]')

Unlike arrays, XPath notation does not start at 0, so 1 refers to the 1st element and a[4] to the 4th hyperlink (a tag). That notation is contained within the string; outside of the string, normal Ruby array indexing applies, so:

list_of_presidents.xpath('//p/a[4]')[0]

…would refer to the first element of the array of fourth-hyperlinks that were inside p tags.

This will find the 4th hyperlink in each paragraph. If you try it out, you’ll get an array containing two elements…which makes sense, as there are only two paragraphs on this page (therefore, there can only be two fourth-in-a-paragraph hyperlinks)

Step 2: From a Table of Data, Fetch the President’s Name

At this time, it’s worth looking at how Wikipedia lists its presidents:

Wikipedia's List of Presidents of the United States

This is an HTML table. Each row appears to contain one president (there are sub-rows, which we’ll ignore, corresponding to each term). In the third column (the second column is the actual image file) are two important pieces of data for us: the president’s name and a link to that president’s Wikipedia page.

Remember that we wanted the age of each president. Unfortunately, that’s not listed on this table, so we’ll have to visit each page, where, presumably, an age is listed.

Visit w3Schools for a quick primer on HTML tables. But to be brief: tr designates a row and td designates a cell within that row (in effect, a column). Let’s put our installation of Firefox’s Firebug to use. Let’s confirm that the info we want, a president’s name, is indeed in the third column.

Right click on the hyperlink of John Adams and select Inspect Element. The Firebug panel should pop up like so, showing that the third td element contains “John Adams”. More specifically, it contains the text “John Adams” in between a tags, which we learned mark off a hyperlink. This will be important in the next step…

Using Firebug to find out the element containing "John Adams"

Adapting from our previous line of code using XPath, let’s try this:

those_columns = list_of_presidents.xpath("//tr/td[3]")

That XPath notation will find us every third td (column) that is enclosed in a tr tag (row). That should spit out a large array of Nokogiri elements (as many as there are presidents).

We want the first of those, which is addressed in the 0th-index of that array…

those_columns[0].content # result is: "George Washington[2][3][4][5]"

So we got a name…but what’s with the bracketed numbers? If you look at the Wikipedia list again, you’ll see that those numbers are links to footnotes. Useful, but not to us. So how to extract just the name? Remember that each president’s name is enclosed in an a (hyperlink) tag. And it’s the first hyperlink. So let’s make our previous XPath a little more complex:

george_washingtons_name = list_of_presidents.xpath("//tr/td[3]/a[1]")[0].content # .content pulls the text out from between the a tags
=> "George Washington"

We’re now asking for the 1st (a[1], in XPath notation, is asking for the first a tag) hyperlink, in the third column (td), in each row (tr). The result is the string “George Washington”.

Step 3: Determine if the Last Name Is Longer Than 6 Characters

OK, now we have a name; how do we programmatically determine the length of the last name (remember, our goal is to search all presidents with last names with more than 6 letters)?

The split and length methods of String

First, let’s get the last name. It’s reasonable to assume that the last word in each string (“Bush” in “George W. Bush”) is the last name. Each word is set off by a space. So we are going to use a String instance method called split, which will take a string and divide it into separate pieces, using a character we specify. The result is an Array of strings.

So:

the_last_name = george_washingtons_name.split(' ')[-1] # Result: "Washington"

The above line can be described thus: Take the string inside the variable george_washingtons_name

Split it at every instance of a space

Return the last element (the -1 index of an array returns the last element. -2 would return the second-to-last)

The result: “Washington”, from the string “George Washington”, is assigned to the variable the_last_name.
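Here is a sketch of the same idea wrapped in a method so you can try it on several names. last_name_of is a hypothetical helper of my own; it isn't part of the final script. Note the weakness of the "last word is the last name" assumption: Martin Van Buren comes out as just “Buren”.

```ruby
# last_name_of is a hypothetical helper, not part of the final script
def last_name_of(full_name)
  full_name.split(' ')[-1]
end

puts "George W. Bush".split(' ').inspect   # ["George", "W.", "Bush"]
puts last_name_of("George Washington")     # Washington
puts last_name_of("Martin Van Buren")      # Buren
puts last_name_of("Washington")            # Washington
```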

Now, this is when we finally use the conditional branch statement if

the_last_name.length > 6 # result: true

if the_last_name.length > 6
  puts("Yep, greater than 6")
end
# result: Yep, greater than 6

length is an instance method of Strings. In the first bit of code, we basically asked: is the length of the_last_name greater than 6. The interpreter says, true

In the second bit of code, we defined a branch statement, saying to print “Yep, greater than 6” if the condition in the if statement (the_last_name.length > 6) was true. It was.

Step 4: If So, Fetch the Link to the President’s Page and Open It

Here’s the code, in verbose form, that we’ve taken to get here…plus a few more lines that flesh out how we want the script to actually execute.

# open the required libraries
require 'rubygems'
require 'nokogiri'
require 'open-uri'

# Using nokogiri, fetch Wikipedia's list of presidents page
list_of_presidents = Nokogiri::HTML(open('http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States'))

# Using another nokogiri method, grab the third column from every row, and from those, grab the first hyperlink (which contains the prez's name)
an_array_of_links = list_of_presidents.xpath("//tr/td[3]/a[1]")

So we dealt with George Washington’s name…but we want to deal with an array of presidential names. On each element, we want to execute the same operation (see if length of last name is greater than 6 letters, if so, fetch the link).

We’re going to use something called an each loop.

an_array_of_links.each do |link_to_test|
  # This above statement can be read as: for each element in an_array_of_links, do
  # the following code (until the end line)
  # And as you go through each element, the variable used to reference the element will be named "link_to_test"

  last_name = link_to_test.content.split(' ')[-1] # remember that between the a tags was the president's name, with the last word being the last name

  if last_name.length > 6
    the_link_to_the_presidents_page = link_to_test["href"]
    # We'll get to this part in the next section...
  end
end # OK, we're at the end of the each loop. Go back to the top

I’m not going to dissect this. It’s enough to know that each is a method of an Array, and the code inside each do and end is executed for each element of an Array.
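If you’d like to see an each loop run without any web pages involved, here is a sketch that works on a plain array of strings. long_last_names and keepers are hypothetical names I made up for this illustration; the loop body is the same last-name test we use in the script.

```ruby
# long_last_names is a hypothetical helper, used here only for illustration
def long_last_names(names)
  keepers = []
  names.each do |name|
    last_name = name.split(' ')[-1]
    if last_name.length > 6
      keepers << last_name
    else
      puts "#{last_name} is not longer than 6 letters"
    end
  end
  keepers
end

names = ["George Washington", "John Adams", "Thomas Jefferson"]
puts long_last_names(names).inspect   # ["Washington", "Jefferson"]
```

The code between do and end runs three times here, once per name, with the variable name pointing at a different string on each pass.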

OK, using the code above, we are looping through all the presidents’ names and page links. On each name, we’re testing the length of the last name. And if the last name is longer than 6 letters…we’re going to open the link and grab the president’s age.

So:

if last_name.length > 6
  the_link_to_the_presidents_page = link_to_test["href"]
  # OK, the value of href is going to be something like "/wiki/George_Washington". That's an address relative to the Wikipedia site
  # so we need to prepend "http://en.wikipedia.org" to have a valid address...
  the_link_to_the_presidents_page = "http://en.wikipedia.org"+the_link_to_the_presidents_page

  # now let's fetch that page
  the_presidents_page = Nokogiri::HTML(open(the_link_to_the_presidents_page))

  # ... OK, now what?
end

Step 5: Grab the age from the president’s page

All right, so the_presidents_page now holds all the html inside one of the presidents’ pages. We need to scope it out to find the XPath necessary to fetch the age of the president.

Let’s take a look at George Washington’s page. More specifically, look at the sidebar to the right, which contains his vital statistics:

George Washington's Wikipedia Sidebar

As you can see, the age is listed, next to the “Died” line.

Using Firebug to check out the structure tells us that the sidebar is a table, and the death date is in the cell that immediately follows the cell containing the text “Died”.

Firebug Inspection of George Washington Sidebar

OK, we’re going to have to use XPath to target those specific cells. Let’s test it out on George Washington’s page. I’m just going to provide you the XPath syntax; you’re welcome to read W3School’s tutorial to figure why it works:

george = Nokogiri::HTML(open('http://en.wikipedia.org/wiki/George_Washington'))
death_date = george.xpath("//th[contains(text(), 'Died')]/following-sibling::*")[0].content
# => "December 14, 1799 (aged 67)Mount Vernon, Virginia,\nUnited States"

(Some references to the syntax above: contains, following-sibling)

Well, death_date contains more than we wanted. How do we just get the 67 from the aged 67 part? There’s no html tag that sets 67 off (our job would have been so easy if the 67 had been wrapped in a tag of its own).

The last new topic you’ll learn in order to complete the task is regular expressions.

Regular Expressions, aka regexes

Again, like HTML and XPath, regular expressions aren’t “programming”, but it’s a universe of syntax that requires entire books to describe. Put simply, regular expressions allow you to grab strings of text that match a pattern.

From regular-expressions.info, how to match HTML tags

In this case, the pattern I want is: a number, two to three digits long, that comes after the word “aged ”

I won’t go into the specifics here…I’ve found that you can learn regular expressions with a little reading and trial and error. In this case, the pattern I want, in regex terms, is /aged.+?([0-9]+)/ (note: although the text on the Wikipedia page reads something like “aged 67”, the space in between is a special HTML character, hence, the .+? used to capture it in the reg ex…don’t worry, that last sentence will make perfect sense when you someday understand reg exes.).

In descriptive English, this pattern is going to capture (what’s in the parentheses) any digits from 0-9 that follow the character sequence aged. The forward-slashes denote the beginning and end of the regex.

Again, a regular expression is a syntax, not an actual programming function. So we need to call Ruby’s instance method, match, which executes a text-search based on the syntax of regular expression that you passed into it. Like so:

death_date = george.xpath("//th[contains(text(), 'Died')]/following-sibling::*")[0].content
age_at_death = death_date.match(/aged.+?([0-9]+)/)[1]

As you can guess, match returns something that behaves like an array. I don’t want to explain the match method in full here, but the 0th element contains the entire match, which would be “aged 67”, and the 1st element returns what was in between the parentheses of my regular expression…the pattern for a multi-digit number, i.e. 67. (If the pattern isn’t found at all, match returns nil instead.) Again, you just have to learn about reg exes for this to make more sense.
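Here is a sketch for experimenting with the pattern on strings you type yourself. age_from is a hypothetical helper I made up for this example, and "\u00A0" is how you write that special non-breaking space inside a Ruby string.

```ruby
# age_from is a hypothetical helper; \u00A0 is the non-breaking space
def age_from(text)
  found = text.match(/aged.+?([0-9]+)/)
  if found
    found[1].to_i   # [1] is the first parenthesized group; to_i turns "67" into 67
  else
    nil             # match gives back nil when the pattern is absent
  end
end

puts age_from("December 14, 1799 (aged\u00A067)")   # 67
puts age_from("July 4, 1826 (aged 83)")             # 83
puts age_from("no age here").inspect                # nil
```

Checking for nil before reaching for [1] is the important design choice here; calling [1] directly on a failed match would make Ruby choke.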

You don’t have to be a programmer to appreciate regular expressions. Ever do find and replace in a text editor? Let’s say you have a bunch of text with numbers sprinkled through…and those numbers were supposed to have $ signs in front of them. There’s no simple find-and-replace that can replace every group of numbers (9, 12.3, 0.55) with ($9, $12.3, $0.55); but in text-editors that support regexes, you could do such a replacement in one command. This is pretty invaluable if you’ve ever had to clean up “dirty” comma-delimited files.

Bookmark regular-expressions.info and save yourself a lot of time in learning about reg exes.

Step 6: Add up the data you gathered

So now we’ve gotten to our goal: retrieving a president’s age from his Wikipedia page. Now we just need to add it all up and take the average.

Here’s the remaining things we have to do, in narrative form:
Before we go into each president’s page, we need a variable to hold the sum of all the ages (total_age). And we’ll need a variable to keep track of how many presidents’ ages we’ve retrieved (prez_count). However, not every page is going to have an age…since not all former presidents have passed away. So, if the “age” datapoint exists, add it to the total_age variable. And increment prez_count. If not, then do nothing, and go on to the next president until we’ve gone through all the presidents.

Once we’ve finished looping through the pages of presidents, divide total_age by prez_count. And we’re done.
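The bookkeeping described above can be sketched on its own, with no web pages involved. average_age is a hypothetical helper, and the list of ages is sample data I typed in, with nil standing in for a president who has no “aged” datapoint.

```ruby
# average_age is a hypothetical helper; nil stands in for a living president
def average_age(ages)
  total_age = 0
  prez_count = 0
  ages.each do |age|
    if age                      # skip the nil entries
      total_age += age.to_i     # the ages arrive as strings, so convert them
      prez_count += 1
    end
  end
  total_age / prez_count.to_f   # to_f keeps the fractional part of the average
end

puts average_age(["67", "83", nil, "85", "78"])   # 78.25
```

Without the to_f, Ruby would divide two whole numbers and silently throw away everything after the decimal point.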

The complete script

The final code is as follows (I’ve added several puts statements to notify you where in the execution the script is…it should take less than 2 minutes):

require 'rubygems'
require 'nokogiri'
require 'open-uri'

list_of_presidents = Nokogiri::HTML(open('http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States'))
an_array_of_links = list_of_presidents.xpath("//tr/td[3]/a[1]")

## These two variables will be added to throughout the execution of the script
## At the end, they'll have the answers
prez_count = 0
total_age = 0

an_array_of_links.each do |link_to_test|
  last_name = link_to_test.content.split(' ')[-1]

  if last_name.length > 6
    the_link_to_the_presidents_page = link_to_test["href"]
    the_link_to_the_presidents_page = "http://en.wikipedia.org" + the_link_to_the_presidents_page
    prez_page = Nokogiri::HTML(open(the_link_to_the_presidents_page))
    puts "Entering the page: #{the_link_to_the_presidents_page}"

    death_date = prez_page.xpath("//th[contains(text(), 'Died')]/following-sibling::*")

    if death_date && death_date[0]
      # Doing something like `if some_variable_name` is basically asking, "Does some_variable_name have any value?".
      # It will return false if some_variable_name has been set to false or to nil (Ruby's "nothing" value).
      # Anything else, even 0 or an empty string, counts as true.
      # The double ampersand && functions as an "AND", requiring that two conditional tests be true before entering the if-statement's true branch

      # match hands back nil if the pattern isn't found, so we save the whole result and test it before digging into it
      age_at_death = death_date[0].content.match(/aged.+?([0-9]+)/)

      if age_at_death
        # we only get here if there was a "Died" table cell AND a text pattern similar to: "aged XX"
        puts "Age of #{link_to_test.content} is: #{age_at_death[1]}"
        total_age += age_at_death[1].to_i # technically, age_at_death[1] is a String. to_i will make it a Number so we can safely add it to total_age
        prez_count += 1
      end # end of the if age_at_death
    end # end of the if death_date...

  else
    # we reach this branch of code if last_name was 6 letters or shorter. Let's print a debug message to notify us:
    puts "#{last_name} is not longer than 6 letters"
  end # end of the if last_name.length > 6
end # OK, we're at the end of the each loop. Go back to the top

# if we got here, we're out of the loop, and total_age and prez_count have the right values. So:
the_final_value = total_age/prez_count.to_f # to_f converts an integer to a decimal number, so we'll get partial years for the average
puts "#{prez_count} presidents were counted, their age totaling: #{total_age}."
puts "The average of their ages is #{the_final_value}"

As of Feb. 2010, running that script produces this output:

Entering the page: http://en.wikipedia.org/wiki/George_Washington
Age of George Washington is: 67
Adams is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/Thomas_Jefferson
Age of Thomas Jefferson is: 83
Entering the page: http://en.wikipedia.org/wiki/James_Madison
Age of James Madison is: 85
Monroe is not longer than 6 letters
Adams is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/Andrew_Jackson
Age of Andrew Jackson is: 78
Buren is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/William_Henry_Harrison
Age of William Henry Harrison is: 68
Tyler is not longer than 6 letters
Polk is not longer than 6 letters
Taylor is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/Millard_Fillmore
Age of Millard Fillmore is: 74
Pierce is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/James_Buchanan
Age of James Buchanan is: 77
Entering the page: http://en.wikipedia.org/wiki/Abraham_Lincoln
Age of Abraham Lincoln is: 56
Entering the page: http://en.wikipedia.org/wiki/Andrew_Johnson
Age of Andrew Johnson is: 66
Grant is not longer than 6 letters
Hayes is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/James_A._Garfield
Age of James A. Garfield is: 49
Arthur is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/Grover_Cleveland
Age of Grover Cleveland is: 71
Entering the page: http://en.wikipedia.org/wiki/Benjamin_Harrison
Age of Benjamin Harrison is: 67
Entering the page: http://en.wikipedia.org/wiki/Grover_Cleveland
Age of Grover Cleveland is: 71
Entering the page: http://en.wikipedia.org/wiki/William_McKinley
Age of William McKinley is: 58
Entering the page: http://en.wikipedia.org/wiki/Theodore_Roosevelt
Age of Theodore Roosevelt is: 60
Taft is not longer than 6 letters
Wilson is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/Warren_G._Harding
Age of Warren G. Harding is: 57
Entering the page: http://en.wikipedia.org/wiki/Calvin_Coolidge
Age of Calvin Coolidge is: 60
Hoover is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/Franklin_D._Roosevelt
Age of Franklin D. Roosevelt is: 63
Truman is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/Dwight_D._Eisenhower
Age of Dwight D. Eisenhower is: 78
Entering the page: http://en.wikipedia.org/wiki/John_F._Kennedy
Age of John F. Kennedy is: 46
Entering the page: http://en.wikipedia.org/wiki/Lyndon_B._Johnson
Age of Lyndon B. Johnson is: 64
Nixon is not longer than 6 letters
Ford is not longer than 6 letters
Carter is not longer than 6 letters
Reagan is not longer than 6 letters
Bush is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/Bill_Clinton
Bush is not longer than 6 letters
Obama is not longer than 6 letters
21 presidents were counted, their age totaling: 1398.
The average of their ages is 66.5714285714286
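
The age-extraction and averaging steps can be tried on their own, without touching Wikipedia. Here is a minimal sketch, assuming "Died" cell text shaped like the hand-typed samples below. The `extract_age` helper name and the sample strings are made up for illustration and are not part of the script above. It shows why the result of `match` should be checked before asking for `[1]`, and what `to_f` changes in the final division.

```ruby
# A hypothetical helper: pull the number that follows "aged" out of a string.
# Returns an Integer, or nil when the pattern is not found.
def extract_age(text)
  age_match = text.match(/aged.+?([0-9]+)/)
  # match returns nil when nothing matches, and nil[1] would crash the script
  return nil unless age_match
  age_match[1].to_i
end

# Hand-typed sample strings, shaped roughly like a "Died" table cell
samples = [
  "December 14, 1799 (aged 67)",
  "July 4, 1826 (aged 83)",
  "June 28, 1836 (aged 85)",
  "no age listed here"
]

# compact throws away the nil left behind by the string with no age in it
ages = samples.map { |text| extract_age(text) }.compact

total_age = ages.inject(0) { |sum, age| sum + age }
average   = total_age / ages.length.to_f   # to_f forces decimal division

puts "Ages found: #{ages.inspect}"
puts "Average with to_f: #{average}"
puts "Average without to_f: #{total_age / ages.length}"  # whole-number division drops the fraction
```

Try deleting the `return nil unless age_match` line and running it again: the fourth sample will stop the script with an error, which is the same thing a president's page without an "aged XX" pattern would do.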

The End?

Well, congratulations…you accomplished a trivial task, but you learned a set of methods that you can apply to much more important goals. If you’re a complete newbie to programming, hopefully this tutorial has given you a glimpse of what’s involved, and of how, once you firm up your programming fundamentals, you can get real work done.

But I need to stress that this tutorial simplified things as much as possible…at the cost of best-practices programming. I chose Wikipedia as a target because it’s a reasonably well-structured, high-traffic site that has an ethos of making volumes of information available for the public good.

The script that we just wrote is a naive little child that gets what it wants as fast as it wants. In the real world, many sites that you attempt to scrape will not be so forgiving. Some sites will block you, or fail to connect, if you try to read a hundred pages at once. Some sites will have horrific HTML that will require much more complicated XPath and regular expression syntax. Sometimes, your internet connection might drop. Any of these can cause the above script to die an ugly and premature death. Or even worse: it can collect bad data without you ever knowing it was erroneous.
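
To make the "dropped connection" and "site blocks you" problems a little more concrete, here is a rough sketch of one common approach: try a page a few times, pausing between tries, before giving up. The `fetch_with_retries` name, its arguments, and the default numbers are all assumptions made for this illustration. It also assumes Ruby 2.5 or later, where open-uri provides `URI.open`.

```ruby
require 'open-uri'

# A hypothetical helper: try to download a page a few times, pausing between tries.
# fetcher is whatever actually does the download. If none is given, it falls
# back to open-uri's URI.open (on older Rubies, plain open(url) played that role).
def fetch_with_retries(url, attempts = 3, pause = 2, fetcher = nil)
  fetcher ||= lambda { |u| URI.open(u).read }
  tries = 0
  begin
    tries += 1
    fetcher.call(url)
  rescue StandardError => e
    if tries < attempts
      puts "Attempt #{tries} failed (#{e.class}), waiting #{pause} seconds"
      sleep(pause)
      retry   # jump back to the begin line and try again
    end
    puts "Giving up on #{url} after #{tries} attempts"
    nil       # the caller gets nil and can decide what to do next
  end
end

# How it might be used inside the each loop (left commented out so that
# running this file doesn't hit the network):
#
#   html = fetch_with_retries(the_link_to_the_presidents_page)
#   prez_page = Nokogiri::HTML(html) if html
#   sleep(1)   # a short rest between pages is easier on the site
```

Passing the downloader in as an argument is a design choice for this sketch: it lets you practice with a fake downloader that fails on purpose, so you can watch the retry logic work without waiting for a real site to misbehave.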

All of these problems are solvable, but like any task, it takes experience that comes from trying and failing. Hopefully, this tutorial at least shows you how easy it is to try.
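
As for bad data, one modest defense is to sanity-check each value as it comes in, so that a scraping mistake gets flagged and doesn't quietly skew the average. A minimal sketch, assuming ages arrive as strings; the `plausible_age?` helper, the sample values, and the 20-to-120 range are illustrative choices, not anything from the script above.

```ruby
# A hypothetical check: is this scraped value a believable age at death?
def plausible_age?(raw_value)
  # must be digits and nothing else (nil and "" fail here too)
  return false unless raw_value.to_s =~ /\A[0-9]+\z/
  age = raw_value.to_i
  age >= 20 && age <= 120   # illustrative bounds, adjust for your own data
end

# Made-up examples of what a scraper might hand back on a bad day
scraped_values = ["67", "83", "1826", "", nil, "aged"]

# partition splits the array in two: values that passed, and values that didn't
good, suspect = scraped_values.partition { |value| plausible_age?(value) }

puts "Keeping: #{good.inspect}"
puts "Needs a human look: #{suspect.inspect}"
```

Here "1826" is the kind of error that is easy to miss: a regex that grabbed the year of death in place of the age would still produce a number, and the script would happily add it to the total.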

Other resources:

Tutorials on HTML, CSS, XPath and a bunch of other useful topics, from W3Schools

About.com’s guide to Ruby

Installing Ruby

The Little Book of Ruby – A free e-book from SapphireSteel Software

Ruby Standard Library Documentation

Nokogiri Tutorial

Video tutorial of Nokogiri Screen Scraping from Railscasts

Ruby Examples from Wikipedia

XPath tutorial from Zvon

Regular-Expressions.info, pretty much the best regular expression resource online

A printable cheat-sheet for regular expressions

See my four-part series on web-scraping for journalists here.

The post Coding for Journalists 101: Go from knowing nothing to scraping Web pages. In an hour. Hopefully. appeared first on danwin.com.

1st Row	1st Cell: Charge code (e.g. PC 459)	2nd Cell: Charge severity (e.g. Felony)
2nd Row:	Charge description (e.g. “Burglary”)
