Coding for Journalists 101: Go from knowing nothing to scraping Web pages. In an hour. Hopefully.

UPDATE (12/1/2011): Ever since writing this guide, I’ve wanted to put together a site that is focused both on teaching the basics of programming and showing examples of practical code. I finally got around to making it: The Bastards Book of Ruby.

I’ve since learned that trying to teach the fundamentals of programming in one blog post is completely dumb. Also, I hope I’m a better coder now than I was a year and a half ago when I first wrote this guide. Check it out and let me know what you think:

http://ruby.bastardsbook.com

Someone asked in this online chat for journalists: I want to program/code, but where does a non-programmer journalist begin?

My colleague Jeff Larson gave what I believe is the most practical and professionally-useful answer: web-scraping (jump to my summary of web-scraping here, or read this more authoritative source).

This is my attempt to walk someone through the most basic computer science theory so that he/she can begin collecting data in an automated way off of web pages, which I think is one of the most useful (and time-saving) tools available to today’s journalist. And thanks to the countless hours of work by generous coders, the tools are already there to make this within the grasp of a beginning programmer.

You just have to know where the tools are and how to pick them up.

Click here for this page’s table of contents. Or jump to the theory lesson. Or to the programming exercise. Or, if you already know what functions and variables are, and have Ruby installed, go straight to two of my walkthroughs of building a real-world journalistic-minded web scraper: Scraping a jail site, and scraping Pfizer’s doctor payment list.

Or, read on for some more exposition:

Who this post is for

His Girl Friday

You’re a journalist who knows almost nothing about computers beyond using them to connect to the Internets, email, and cheat on Facebook scrabble. This is not entirely trivial; if you’re able to do this without typing your password and SSN into a phishing site, you’re (sadly) a step ahead of most of the Internet populace. OK, it’ll also help if you’re familiar enough with your operating system (Windows or Mac…I’m assuming anyone using Linux won’t even need this tutorial) to know how to install programs.

Anyone who has taken a semester of computer science will scoff at how I’ve simplified even the basic fundamentals of programming…and they’d be right…but my goal is just to get you into the basics to write some useful code immediately. You’re going to have to make the effort yourself to learn the topics in-depth.

Thankfully, coding is something that provides immediate success and failure. You hit Ctrl-R, your script runs, and in five seconds or less, you’ll learn if you did right. The more you fumble, the more you learn. And getting around an error no longer requires owning a reference library.

The roadmap

This tutorial aims to walk you through the bare essentials of HTML, programming theory and tools so that you can do something very practical: build an automatic process to gather data from websites. I made this lesson into one giant page so you can see for yourself, in one glance, the number of words (about 9,000) it takes to touch upon what is essentially one semester in a first-level computer science course. Also, I have no ads to sell.

Here’s what will happen if you read this entire page:

  1. Learn a little HTML
  2. Install Firefox and Firebug
  3. Install Ruby, a programming language
  4. Learn some programming theory
  5. Write a script
  6. Execute the script

Jump to the table of contents or read some more blab.

What web-scraping is and why it’s important to journalists

Web-scraping (also called screen-scraping) is the automated process of collecting the *useful* data off of a webpage. This is made possible because of the design of HTML, which, when done right, puts this data in as predictable a format as an Excel spreadsheet…sans the convenient interface, keyboard shortcuts, and Clippy. So you have to write your own tool tailored to the structure of a webpage.

The importance of data collection should be obvious to a journalist. Used to be, if you wanted a set of data…such as the list of restaurant inspections so you could do a regression analysis of failed tests with respect to neighborhood income levels, you’d ask them for the data, sue them if they said no, and if you were on the right side of the law, they’d grudgingly hand you a chunk of ordered text that you could eventually put into a spreadsheet.

But now, it’s possible that a public-information officer will just point you to the public website and say, there it is. And it’s not always a case of them being ignorant/disdainful of laws that oblige them to give the dataset, in electronic form, that backs the website. From their viewpoint, the information is there for any idiot with an Internet connection to ask for, so what are you whining about?

At this point, you can go through a weeks-long argument through emails and phone messages that ends with their legal counsel compelling the PI officer to hand over the data. Or, if keeping your story idea secret isn’t a priority, you could explain what your intent is, and why you need a whole dataset to see if a trend exists. Either way, you might have another week or so of waiting for the PIO to successfully wrangle their tech people (and legal staff, who need to vet the released data for any confidential info) into giving you the data in a nice comma-delimited format.

So, if their website already has the information you need (although, often, the web display omits record keys and such that are useful), why not write a script in 15 minutes to grab it? Also, even if data is released willingly, it’s not always at a convenient pace. If a website is updated faster than a PIO can send you email attachments, then scraping the website on a nightly basis will save both of you headaches.

And some types of information are just not FOIA-able. My former colleague Brian Boyer, now news-apps chief at the Tribune, created ProPublica’s ChangeTracker, built on a web-scraping service, to check when and how the White House changes its website. The request, “Hey, can you tell me all the times you’ve changed text on your website, what the text originally was, and what you changed it to” is not something a PIO could easily fulfill, even if he/she wanted to.

Web-scraping sometimes has bad connotations…because this is how various members of royalty find your email address in order to tell you that they are a distant family relation with $10,000,000US that they desperately want to give to you. So yes, you could use it for ugly purposes. My response is that if that’s your ultimate goal, you are way behind the game, and you will probably suffer a humiliating karmic fate, either in your online or real life.

On the other hand, there are innumerable sets of public, useful data that no one has gotten around to mapping out and collecting in a useful format. So let’s get to it.

This is part of a four-part series on web-scraping for journalists. As of Apr. 5, 2010, it was published a bit incomplete because I wanted to post a timely solution to the recent Pfizer doctor payments list release, but the code at the bottom of each tutorial should execute properly. The code examples are meant for reference and I make no claims to the accuracy of the results. Contact dan@danwin.com if you have any questions, or leave a comment below.

DISCLAIMER: The code, data files, and results are meant for reference and example only. You use it at your own risk.

The task

Thomas Jefferson lived to be 83, according to Wikipedia

When you get through this tutorial, you will be able to answer the question: According to Wikipedia, what is the average age of U.S. Presidents whose last names have more than six characters? Not an important question, but it is on the same order of difficulty as, say, scraping a county jail’s booking list to find the inmates with the largest bail amount and charge list, and how many are repeat-offenders…which are the second and third lessons.

HTML

HTML is what makes web pages not just a stream of characters. Why did that “not” in the previous sentence appear bold? Because I wrapped the word “not” in tags. The raw code is: <b>not</b>

The design and theory of HTML are topics that could consume the rest of your waking life. For now, its relevance to us is that with HTML, web pages have structure. And with structure, a web-scraper can reliably collect the useful bits of data as it would from columns of a spreadsheet.

W3Schools is the best place to get a primer on HTML.

Tags

Tags are themselves contained in angle brackets (< and >) and come in pairs. The end tag is denoted by a forward slash: /.

So, anything between these tags – <i> </i> – will appear in italics.

To make something a headline, use <h1> </h1> tags. You can replace that ‘1’ with numbers 2 through 6, with ‘1’ being the most prominent kind of headline.

Here is a h1 headline

Here is a h4 headline

OK, one more critically important thing about tags. They can have attributes.

Let’s say I wanted to make something not only a headline (i.e. bold large text), but also red. There are many ways to do this; the simplest one to type also happens to illustrate the most basic form of an attribute:

An attribute consists of: the name of the attribute, an equals sign, and then the value of that attribute enclosed in quotation marks. Like so: attribute="this_is_the_attributes_value"

<h1 style="color: red">This is a headline</h1>

Attributes go inside the starting tag, after the tag name, h1, and before the closing right-angle-bracket. The name of the attribute, style, is followed by an = sign. Then come quotation marks (double or single; either way, they have to match, as they would when you write down someone’s quote, or someone quoting a quote) enclosing the value of the attribute. In this case, color: red.

HTML Errors

Couple of things to keep in mind. Tags come in pairs. When things look funny on a hand-coded webpage, usually it’s because the coder didn’t provide a closing tag to his starting tag. Here’s a properly tagged sentence:

<b>This sentence is meant to be bold.</b> <i>This sentence is just in italics.</i>

Results in: This sentence is meant to be bold. This sentence is just in italics.

In this sentence, I didn’t provide a closing bold tag, and so the bold part overlaps into the italics sentence, making a bold AND italicized sentence:

<b>This sentence is meant to be bold. <i>This sentence is just in italics.</i>

Results in: This sentence is meant to be bold. This sentence is just in italics.

Also, close the tags in the order they come in…I don’t know how to concisely explain this point, but the following is not properly-structured HTML. The part in red denotes how the closing-bold-tag should NOT come after the opening italics tag:

<b>This sentence is meant to be bold. <i></b>This sentence is just in italics.</i>

Sometimes browsers will compensate for coder-error and interpret this in a way that doesn’t look awful. But you just need to know that this violates a principle of HTML…and pages that you scrape that aren’t well structured may give strange results even if you’ve written a logically-designed scraper.

Hyperlinks

Hyperlinks are those (depending on a website’s style) underlined words that, upon clicking, send you to a whole different page. They are nothing more than special tags with an important attribute.

The tagged hyperlink makes the word “link” a clickable link that goes to Google. The href attribute describes where the link sends you:

This <a href="http://google.com">link</a> has many answers

Results in:

This link has many answers.

Want to try some tags and hyperlinks yourself? Use W3Schools’ interactive editor.

Firefox and Firebug

As I wrote earlier, HTML structures the data you want. But you need to know how it’s structured, and so you need to know the designer’s blueprint. Not to get in a browser war, but just to make things easier on me, you can’t go wrong by first downloading Firefox, the free open-source browser by Mozilla.

Now go to any website, right click on an empty space, and click “View Source” in the submenu. You’ll likely see something like this:

That’s the raw HTML. You might eventually get to the point where HTML is what the Matrix is to Neo. But let’s make it as painless as possible. Firefox has many plugins, including one called Firebug, which makes it very easy to dissect code. Get it here.

Firebug, a plugin for Firefox

Double-click on one of the sample headlines in this tutorial to highlight it. Then right-click to open the submenu, then click “Inspect Element”. This should bring up a Firebug panel that lets you see the HTML that made that headline. This saves you from having to search through the entire source to find that headline, just to see the tags that wrap it.

Like I said, in order to successfully web-scrape, you’re going to have to know how the elements – the paragraphs, headlines, and links – were structured. Firebug is a tool that helps pinpoint the elements you want to know about.

Programming Basics

A good way to annoy a programmer is to say something like, “Yeah, I have some programming experience: I’ve been writing HTML for two weeks now.” Writing HTML is not programming, any more than operating a stereo equalizer makes you a classically-trained guitarist. HTML is a way to describe and present content, but you’re not running any kind of computerized task.

So, I went through the basics of HTML so you’d be familiar with the content that you’d be collecting. Now we’ll learn the basics of how to program a script that will actually collect that content.

Installing Ruby

What is Ruby? It’s a programming language. And like a spoken language, once you’ve learned one, you’ve learned the fundamentals (i.e. the concepts of verbs, nouns, sentences, etc.) that allow you to try out all the other ones. Ruby is also the basis for Ruby on Rails, a very popular framework that many developers use to build data-driven websites. But right now, we’re collecting data from websites, not building them.

I’ve purposely been brief here. Installing Ruby and its libraries may be the most frustrating aspect of this lesson, and I have little more insight to it than, “I have a Mac w/ Leopard, and it came with it”

Installation instructions for Ruby are here…if you’re on Mac OS X with Leopard or better, you should be good to go. Hopefully, the one-click installer for Windows should be easy enough to install (check the Enable RubyGems and SciTE boxes).

The One-Click Ruby Installer for Windows

More specifically, Ruby is an interpreted language…so I use the phrase “Ruby interpreter” to refer to the program that reads your script, makes sense of it, and executes it. Read more about this definition at Wikipedia.

The Ruby Interactive Prompt (IRB)

If you belong to the target-audience of this tutorial, you probably have been able to get your computer to perform tasks (such as, ‘Open my web browser’) with your mouse-clicking. Programming means you’re going to be typing out lines of code that execute tasks. Your web-scraping is essentially going to be a sequence of such commands, i.e. a script.

But why wait until you get a complete script when we can start executing commands right now? This is where Ruby’s Interactive Prompt (IRB) comes in. In its simplest form of operation, the IRB waits for you to type in a line of code, then for you to hit “Enter/Return”, and then it will run your command, provided it makes sense.

On Windows, press the Windows key + R to bring up the Run… prompt. Type in ‘cmd’ and hit Enter. Then, at the command line, type in ‘irb’. On the Mac, go to Applications=>Terminal. At the command line, type in ‘irb’.

Interactive Ruby prompt

Now that you’re here, type in the following:

1+6
	#result: 7

Congrats. You just wrote a one-line script to figure out what one plus six is.

Note: In Ruby, the pound sign ‘#’ designates the code following it to be a comment; I will use this convention in the code boxes to mark what your result after a command should be.

Let’s also learn a common Ruby command: puts. It simply outputs what comes after it (actually, not quite that simple, but you’ll learn soon in the next section)…I’ll be using this in the script to output results.

		puts "Hello World"
		#result: Hello World

Read more about the command-line interpreter.

Strings

Let’s say you want to be a little more narrative about the above 1+6 calculation. Try writing out those numbers and enclosing them in quotation marks. Like so:

	"One"+"Six"
	# result: "OneSix"

Your answer won’t be “Seven”, but “OneSix”. Why? To human eyes and ears, 1+6 and “One”+”Six” might be the same. But in Ruby, and most other programming languages, the computer interprets the latter command to be joining two words, i.e. strings together.

Strings can be enclosed in either double-quotes or single-quotes. However, double-quotes in Ruby and other languages, allow for some important manipulation, called string interpolation. Good to know for later. Just make sure whatever you use, the first mark matches the second.
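Here’s a tiny sketch of what string interpolation looks like, in case you want to see it now and not later. Inside double quotes, Ruby runs whatever code sits between #{ and } and drops the result into the string; inside single quotes, nothing of the sort happens:

```ruby
# Double quotes: the 1 + 6 between #{ and } is calculated first.
puts "One plus six is #{1 + 6}"
# result: One plus six is 7

# Single quotes: the same characters are left exactly as typed.
puts 'One plus six is #{1 + 6}'
# result: One plus six is #{1 + 6}
```

This is handy later on for printing out a label next to a value you’ve scraped.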

In the programming-world, “six” is fundamentally different than 6. “Six” is what Ruby considers a String. 6 is a Number.

So what happens when you try to add “Six”, the string, to 6, the number?

"Six"+6
TypeError: can't convert Fixnum into String
	from (irb):2:in `+'
	from (irb):2
	from /usr/local/bin/irb:12:in `<main>'

Congrats, it’s your first of many, many times of making the Ruby interpreter choke. In the case of numbers and strings, it only knows how to add like items together.

The takeaway from this is that, for our purposes, anything in quotation marks is a string. Even a number in quotation marks is no longer a number. You’ll get the same above error if you try:

"6"+6

The quotation marks make all the difference, just as they do in the journalism world. For example:

The governor is a scumbag who molests staffers on taxpayer-dime
by Dan Nguyen, Newswire, Inc.

Whistleblower: “The governor is a scumbag who molests staffers on taxpayer-dime”
by Dan Nguyen, Newswire, Inc.
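Back to Ruby: when you really do need to add a quoted number to a real one, you have to convert one side first. Here’s a quick sketch using two of Ruby’s built-in conversion methods, to_i (to integer) and to_s (to string). The dot notation is covered in the Methods section further down; for now, just notice which side gets converted:

```ruby
# Turn the string "6" into the number 6, then add the two numbers.
puts "6".to_i + 6
# result: 12

# Or turn the number 6 into the string "6", then join the two strings.
puts "6" + 6.to_s
# result: 66
```

Same characters, two very different answers, depending on whether Ruby sees numbers or strings.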

Variables

OK, you now know that you shouldn’t add strings to numbers, and you’re perfectly content to add strings to create results like “eightzero”. What if you tire of typing quotation marks?

	eight+zero
	# NameError: undefined local variable or method `eight' ...

What happened here? Well, without quotation marks, eight and zero are no longer considered strings. In their unquoted form, they are considered variables that hold some kind of value.

Think back to algebra when you were asked to solve “x+1=6”. You weren’t supposed to interpret that as:
the letter x added to the number 1 equals 6

The x is a stand-in for the value 5. x could’ve been a, b or y.

(Forgot what algebra was? Try this great primer, “The Joy of X” by the NYT’s Opinionator)

So, to make eight+zero understandable by the Ruby interpreter, you must assign those two terms values. So, try:

eight=8
zero=0
eight+zero
# result: 8

Now, eight+zero is the same as 8+0.

Enter the following into the IRB:

zero=1
eight+zero
# result: 9

You should get 9 as the result. The variable eight is still 8. But you assigned zero the value of 1. Therefore, you were asking the interpreter to execute 8+1.

Here’s what you should grok by now: unquoted words are considered to be variables, and they mean nothing to Ruby until you’ve assigned them a value. And the name of the variable is completely independent and unrelated to its actual value. Thus, nine="nine" makes as much sense in Ruby as this_variable_has_a_value_that_is_not_nine_dang_it="nine".

Obviously, since you can name your variables just about anything (stick to a series of lowercase letters, numbers, and underscores, with no spaces or hyphens), name them something that is related to their actual value, so that your code is more readable.

At this point, we’ve run through a lot of programming concepts. But if you don’t understand the above examples, and the following:

one = 1
one = 2  # assigning the variable named one to another value
one + one
# result: 4

…then pause for a moment. It’s not a trivial topic, but it is critical to understand it at least at this level. Go here for more discussion on variables.

By the way, arithmetic symbols, such as + and -, are called operators. A statement like 4+5 is an expression. I’ll avoid, or mangle, the terminology throughout the lesson.

Comparison operators

Let’s say you’ve written a bunch of code and forgot whether you set the variable eight to “eight” or 8. How to test that? Well…typing in eight and hitting ‘Enter’ is the easy way…but now’s a good time to learn the concept of a comparison.

We already know that =, the equals sign, is something that assigns a value: what’s on the right of the = is set as the value of the variable on the left side.

So what’s a double equals sign == mean?

Write this sequence of code:

eight="eight"
eight==8
# result: false

The second line of code, translated into English, is you telling the interpreter:

The value of the variable named eight is the number 8

To which the computer responds: false

Here, Ruby is telling you that the string “eight”, to which the variable eight was assigned, is not equal to the number 8.

That matches what we learned from vainly trying to add “eight”+8: Ruby treats strings and numbers as different things. Evaluating eight=="eight" will yield the value true.

Note: true and false are not variable names. They are reserved words that are values in themselves. So, this will result in an error: true = “A string I’d like to assign the value named true”. However, replacing that equals sign with a double equals sign, ==, will return a result of false.
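The double equals sign has a few siblings that you’ll run into sooner or later. A short sketch, using a variable I’ve set to the number 8 for the occasion:

```ruby
eight = 8

puts eight == 8      # true:  same number
puts eight == "8"    # false: a number is never equal to a string
puts eight != 9      # true:  != means "is not equal to"
puts eight > 6       # true:  greater than
puts eight <= 6      # false: less than or equal to
```

Every one of these comparisons boils down to either true or false, which is exactly what the conditional branches further down need.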

Arrays

Think of an Array as something that contains a sequence of other variables and values. In Ruby, and most other languages, arrays are set off by square brackets, [ and ].

Here’s the easiest way to initialize an Array:

an_empty_array = []
array_with_numbers=[1,2,3,4]

Above, I’ve assigned two variables the values of two different arrays. The first, an_empty_array, is empty. The second, array_with_numbers, is filled with four numbers. You could’ve written out four lines of code, assigning four different variables respectively with the numbers 1 through 4. With an array, you essentially have one variable referring to 4 values.

How do you access the individual values? Use the name of the variable, and then the index. Consider the index as the address of the element you want, set off by square brackets (in this fashion, the square brackets denote that the variable they follow is an array, while the value inside them is the index/address). Such as:

array_with_numbers[0]

In Ruby, the first element of an array has an index of 0. So the above line would give you the value of 1. array_with_numbers[3] would get you 4. The index 4 in array_with_numbers would get you an empty (nil) value.

Arrays can contain other variables too, like so:

an_empty_array = []
array_with_numbers=[1,2,3,4, an_empty_array]

array_with_numbers[4] would now yield [], an empty array, which is the value of the variable named an_empty_array

More about Arrays here.
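Here’s a slightly bigger sketch you can paste into the IRB to poke at an array. The variable name presidents is just one I picked; length and << are built-in array features (the dot notation gets explained in the Methods section):

```ruby
presidents = ["Washington", "Adams", "Jefferson"]

puts presidents[0]            # Washington (the first element sits at index 0)
puts presidents.length        # 3
puts presidents[-1]           # Jefferson (negative indexes count from the end)

presidents << "Madison"       # << appends a value to the end of the array
puts presidents.length        # 4
puts presidents[3]            # Madison
puts presidents[10].inspect   # nil (nothing lives at that index)
```

That << trick is how a scraper typically builds up its list of results, one found item at a time.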

Hashes

OK, I’m going to make another vast simplification of a programming object: Hashes can be considered Arrays in which the indexes are strings, not numbers. Hashes are denoted by curly brackets.

a_hash = {"one"=>1, "two"=>2, "three"=>3}

Note the convention of => which assigns a value to an index (the correct term, actually, is key) of the hash. So:

a_hash["two"]
# result: 2

It’s not important right now to understand the full differences and capabilities of Arrays and Hashes, but you’ll be seeing this notation in the script we write.

Read more about Hashes here.
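A hash gets more interesting when the keys mean something. In this sketch the keys are last names and the values are ages at death, typed in by hand as sample data (the variable name ages is my own choice):

```ruby
# Keys are last names; values are ages at death.
ages = {"Washington" => 67, "Jefferson" => 83}

puts ages["Jefferson"]        # 83
ages["Lincoln"] = 56          # add a new key and its value
puts ages.length              # 3
puts ages["Nobody"].inspect   # nil (there is no such key)
puts ages.keys.inspect        # ["Washington", "Jefferson", "Lincoln"]
```

Asking for a key that doesn’t exist gets you nil, just like asking an array for an index it doesn’t have.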

Conditional Branches

So far, we’ve been typing in single line commands. Your final script is going to be a long list of commands telling the computer to:

  1. Go to Wikipedia’s listing of each U.S. President’s page (i.e. a list of links to each page)
  2. Visit, via hyperlink, each page belonging to a president whose last name is longer than six letters
  3. Grab the president’s age from each individual page, if that president is dead
  4. Average those ages

Our criteria for inclusion means we have to come up with some way to not visit, say, John Adams’s Wikipedia page. And to not include a living president’s age. So inside our script, there’s going to be a section of code telling the computer to go into a webpage…but that code should only execute if the length of a President’s last name is greater than 6.

That’s where the if conditional branch comes in. Without getting too far past the basics, here’s the simplified code:

president = "John Adams"
last_name_length = 5  # I manually set this variable for now; in your actual script, you'll find this value programmatically 

if last_name_length > 6
 # then go to his wikipedia page...and while we're in this branch of code, let's print something
 puts "Entering a page"
else
 #OK, don't go there. But let's print out a statement
 puts "This name is too short"
end

# result: "This name is too short"

What the above section of code is essentially saying is that if the value of the variable last_name_length is greater than 6, then do what was in between if and else. Otherwise, completely skip what was there and go to what’s between the else and end

The else is optional…if you want, you could do nothing if the conditional statement (if last_name_length > 6) isn’t satisfied. The end is required; it tells Ruby that that’s the end of that optional branch of code that started with the if.

Up till now, our series of commands have been straight-forward: the interpreter executes one line after another. Introducing the if statement has introduced a fork in the road; if the condition in the if statement isn’t met, the interpreter skips past that if block.

The if statement is the simplest of such conditional branches. All you need to know for now is that there’s a way to tell the Ruby interpreter to execute a certain bit of code if a condition is met. Read more about it here.
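For the curious, here’s one way the length could be found programmatically. This is only a sketch, and it assumes the last name is whatever comes after the final space, which would get “Martin Van Buren” wrong. split, last, and length are built-in Ruby methods:

```ruby
president = "John Adams"

# split breaks the string apart at the spaces: ["John", "Adams"]
# last picks the final piece: "Adams"
last_name = president.split.last
last_name_length = last_name.length

if last_name_length > 6
  puts "Entering a page"
else
  puts "This name is too short"
end
# result: This name is too short
```

Change the first line to "Thomas Jefferson" and the other branch runs.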

Methods

I’m really going to be brief here. Think of methods as a set of commands that are useful enough to run more than once.

Out of bad habit, I’ll use the term function as a synonym for method. They’re the same concept, except method is a kind of function, the explanation of which requires me getting into object-oriented programming. Which I don’t want to right now.

Let’s say I need to take two numbers, multiply them together, subtract 5 from the product, and then add the result to itself. In code, that would be:

#initialize the variables:
a = 10
b = 20

#now make each step its own line
c = a * b
c = c - 5
c = c+c
# result: 390

Well, that could’ve been one line, without using the placeholder variable named c, like so:

(a*b)-5 + (a*b)-5

If I need to run this more than once, it’s a bit annoying to type out each time we want to run that series of commands, so let’s define a function called my_funny_equation

def my_funny_equation(first_argument, second_argument)
  answer = (first_argument*second_argument)-5 + (first_argument*second_argument)-5
  return answer
end

Inside the parentheses, following my_funny_equation are the arguments, the values that you want the method to work with.

The takeaway here is that I’ve encapsulated my series of commands into a block of code. The variable names, arbitrarily named first_argument, second_argument, and answer, are references that only exist within that block of code which defines the method my_funny_equation.

Now that this method is defined, I can do:

my_funny_equation(10, 12)
# result: 230

my_funny_equation( 4, 5)
# result: 30

answer+10
# result: (Ruby will choke here)

Why does the third command choke? Again, answer exists only within the little world defined in the my_funny_equation method, between the def and end lines. It has no value outside of the method definition. This is called function scope, a topic outside of, well, the scope of this simplified tutorial. Read more about scope here.
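So how do you get at the answer from outside the method? You catch what the method returns in a variable of your own. A minimal sketch (result is a name I made up; the begin/rescue part is Ruby’s way of catching an error so the script can keep going):

```ruby
def my_funny_equation(first_argument, second_argument)
  answer = (first_argument * second_argument) - 5 + (first_argument * second_argument) - 5
  return answer
end

# Store the returned value in a variable that lives outside the method.
result = my_funny_equation(4, 5)
puts result + 10
# result: 40

# The method's own variable, answer, still does not exist out here.
begin
  answer + 10
rescue NameError
  puts "answer does not exist outside the method"
end
```

The value made it out of the method; the variable name did not.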

OK, the above was just introducing you to the concept of a method/function. The kind of methods we’ll be dealing with in our script are called instance methods. These methods belong to something…an actual number, for example. 6 is an instance of a Number. “Six” is an instance of a String

Example:

The number 2.67 is considered by Ruby to be of the class Float…that is, a number with a floating decimal point.

More specifically, 2.67 is an instance of a Float. So is 4.777. And so is 8.999.

What if I wanted to go about rounding a Float number? Well, luckily, Ruby has built in instance methods that do this. The basic structure is the instance, followed by the method’s name…as follows:

instance.method_name

The method for rounding a Float is called “round”. So, to round 2.67, we do:

2.67.round
# result: 3

This is a little confusing because of the two periods. Just be faithful that the Ruby interpreter knows the difference; it sees the first “dot” as a decimal point defining the number. The second “dot” tells it that we want to access the built-in Float method called round.
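Floats come with a few other built-in instance methods in the same family as round. A quick sketch to try in the IRB:

```ruby
puts 2.67.round      # 3   (nearest whole number)
puts 2.67.floor      # 2   (always rounds down)
puts 2.67.ceil       # 3   (always rounds up)
puts 2.67.round(1)   # 2.7 (keep one decimal place)
puts 2.67.class      # Float
```

Notice that round can take an argument inside parentheses, just like the methods we defined ourselves.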

One more example, let’s work with arrays.

Let’s say we have:

an_array= [1,2,3,4,5,6]

I want to make an array that consists of the first three elements of *any* array. Luckily, Ruby arrays have a built-in method called slice.

an_array.slice(0,3)
# result: [1,2,3]

So, slice is the name of an instance method of things that are Arrays. Inside the parentheses are two arguments, the first denotes the element to start out at (in this case, 0, since we want the first element), the second denotes how many elements to include in this sub-array (3).
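To tie this back to the earlier part about defining your own methods, here’s a sketch that wraps slice so it really does work on *any* array. The name first_three is made up for this illustration:

```ruby
# first_three is a made-up name; slice is the built-in that does the work.
def first_three(any_array)
  return any_array.slice(0, 3)
end

puts first_three([1, 2, 3, 4, 5, 6]).inspect   # [1, 2, 3]
puts first_three(["a", "b"]).inspect           # ["a", "b"] (shorter arrays come back whole)
puts first_three([]).inspect                   # []
```

A method you define yourself and an instance method that Ruby provides can work together like this all day long.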

What was the point of all of this? In our final code, you’ll be seeing calls to methods. Someone already wrote the method that, say, collects all the text of a webpage and stores it into a variable for you. But you need to know the name of that method and how to invoke it.

Writing Your Script

OK, now we get past the fundamentals and into things that will really solve your problems. It wasn’t important to have intimate knowledge of the previous concepts, but just to know they exist.

But how can you, knowing just the basics, do something as complicated as connect to a series of web pages, collect their content, pick the exact points of needed data, and arrange them in a useful structure? Because other programmers have abstracted all these functions in such a way that we could do this series of tasks in just a few lines.

I’m going to write out an extremely-verbose way of performing these tasks to make each step clear…but as you get better, you’ll find ways to minimize your typing.


Here’s the list of steps we’ll be doing, in somewhat plain English:

  1. Grab the contents of the presidents list
  2. From that list, grab each president’s name
  3. Determine if the last name is longer than 6 characters
  4. If so, fetch the link to the president’s page and open it
  5. Grab the age from the president’s page
  6. Add up the data you gathered
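To see that the non-web parts of this list are already within your reach, here’s a rough sketch that skips the fetching entirely and runs the name test and the averaging on four presidents whose ages I typed in by hand. The variable names are my own, and each … do … end is a loop, which runs the code inside it once for every item:

```ruby
# Hand-typed sample data: full name => age at death.
# (In the real script, these values come from Wikipedia.)
presidents = {
  "George Washington" => 67,
  "John Adams"        => 90,
  "Thomas Jefferson"  => 83,
  "Abraham Lincoln"   => 56
}

ages = []

# Keep the age only if the last name has more than six characters.
presidents.each do |full_name, age|
  last_name = full_name.split.last
  if last_name.length > 6
    ages << age
  end
end

# Add up the ages that made the cut.
total = 0
ages.each do |age|
  total = total + age
end

# to_f turns the total into a Float so the division keeps its decimals.
average = total.to_f / ages.length
puts "Average age: #{average.round(1)}"
# result: Average age: 68.7
```

Everything that’s left is about getting those names and ages off of web pages instead of typing them in.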

Before doing any of the above steps, we’re going to download a Ruby library that makes the above tasks trivially easy (that is, compared to starting from scratch)…

Nokogiri

I won’t get into what “gems” are in relation to the Ruby programming language; just think of them as pre-packaged functions and code that you can easily download and re-use for your own scripts.

Complete instructions can be found at the nokogiri homepage. You may run into a lot of errors…my advice is to copy part of that error and Google it with that and “nokogiri,” and hopefully you’ll get an answer.

Hopefully, it’s as simple as going into your command-line console (exit the interpreter if you’re in there) and typing:

>> gem install libxml-ruby
>> gem install nokogiri

What is nokogiri? It’s a library of code that makes it easy to parse a webpage. Remember when you right-clicked on a webpage to view source, and how painful of a task it would be to collect, say, what the text of the third headline is…on 100 different pages? Nokogiri essentially allows you to do this with a couple lines of code. Check out the homepage here.

Step One: Fetch the Contents From the Presidents List

Let’s try Nokogiri out. Open your ruby interpreter and type in the following commands; these first lines invoke the method require, which will give your script access to the required libraries of code, including nokogiri:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

This next line will fetch the contents of Wikipedia’s list of U.S. Presidents:

	list_of_presidents = Nokogiri::HTML(open('http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States'))

I’m going to quickly deconstruct this line:

  • Nokogiri::HTML is the method that does the parsing. The Nokogiri:: part specifies that we want something that lives in the Nokogiri library; HTML tells it to treat whatever it receives as an HTML document.
  • open is the name of the method that fetches the page. It does not belong to Nokogiri; it comes from open-uri, one of the libraries we required above, which lets open accept a web address and not just a file name. (On newer versions of Ruby, 3.0 and later, you would write URI.open instead.)
  • ‘http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States’ is the string that holds the address of the page we want. The open method needs this to…well…know what to open.
  • list_of_presidents is the variable that will hold the result: the parsed page that Nokogiri::HTML hands back.

OK, that one line, maybe the most complicated line we’ve written so far, just did a whole lot for you.

Using the open method (which takes a web page address as an argument), it opened a connection with Wikipedia, performed the Internet protocols necessary to exchange information, and copied the content of the target page. Nokogiri::HTML then wrapped it all up in a Nokogiri data structure for later manipulation. We are pointing to this data structure with the variable list_of_presidents.

Let’s try to grab the contents of the second h2 tag (i.e. the second, secondary headline)

list_of_presidents.xpath('//h2')[1].content
=>Presidents

Running scripts from Text Editors or the Command Line

Running Ruby commands from the Interactive Ruby prompt is nice and all, for quick feedback. But from here on out, we’ll be writing a full-on script with a few dozen lines of code. So, it’ll be easier if you create a new text file with a file extension of .rb … something like, myfirstscript.rb to put your code in.

You should be using a text-editor for this…something better than Notepad, at least.

For Macs, there’s the free and excellent TextWrangler. If you’re willing to spend some money, TextMate is what I use and it’s worth the $55. A free 30-day trial can be downloaded here.

For Windows, the one-click Ruby installer includes the free SciTE4. Also, there’s the free Komodo Edit. For $35, there’s the “Textmate on Windows”, E-TextEditor (free trial here)

Some of these text editors have a shortcut-key that allows you to run the script. For example, SciTE uses F5. Note how the output is conveniently displayed to the side:

Writing a Ruby script in SciTE for Windows

There’s also the old-fashioned command line, from which you ran IRB. Navigate to the directory that you saved your file in. Then type “ruby whatever_your_file_name_is.rb”:

Running a script from the Windows command line

OK, here’s another high-level programming construct we’ll superficially try to cover…

XPath

XPath is a syntax used to address parts of HTML documents. It allows you, for example, to find all text that’s between headline, italics, paragraph, or whatever tags you want. You could also do something as specific as “Find the third link in every paragraph.”

From Zvon.org, how to select all 'BBB' nodes using XPath

It’s another field of knowledge that you could spend a lifetime memorizing. For our purposes, you just need to know that it’s a way to pinpoint an element, or a set of elements, in an HTML document.

list_of_presidents.xpath('//h2')[1].content
#result: "Presidents"

Let’s dissect the above nokogiri command. list_of_presidents was a variable holding a Nokogiri data structure…essentially, the entirety of the Wikipedia page in a format that the Nokogiri library can understand.

xpath, then, is an instance method of this data structure, that takes a string as an argument. That string contains XPath syntax.

The string, in the above example, is “//h2”. In XPath syntax (check out W3Schools for a primer), the double-slash // tells the parser to look anywhere in the document. h2 is the specific tag – a level-2 headline – that we want. The [1] works because the result of the xpath method behaves like an array, of which we want the value at index 1 (technically, the second value of that array…remember that an array’s index starts at 0). And content is an instance method of what was at that index: a Nokogiri data structure. content, in this case, pulls what was in those h2 tags: “Presidents”.

The above line could’ve been broken down into:

a = list_of_presidents
a = a.xpath('//h2')
a = a[1]
a = a.content
#result: "Presidents"

That was a very simple XPath query. Another one could be:

list_of_presidents.xpath('//p/a[4]')

This will find the 4th hyperlink (<a> tag) in each paragraph. Unlike arrays, XPath notation does not start at 0, so 1 refers to the 1st element and a[4] refers to the 4th. If you try it out, you’ll get an array containing two elements…which makes sense, as there are only two paragraphs on this page (therefore, there can only be two fourth-in-a-paragraph hyperlinks).

Note that the XPath notation is contained within the string; anything after the closing parenthesis is ordinary Ruby array indexing again. So:

list_of_presidents.xpath('//p/a[4]')[0]

…would refer to the first element of the array of fourth-hyperlinks that were inside p tags.

Step 2: From a Table of Data, Fetch the President’s Name

At this time, it’s worth looking at how Wikipedia lists its presidents:

Wikipedia's List of Presidents of the United States

This is an HTML table. Each row appears to contain one president (there are sub-rows, which we’ll ignore, corresponding to each term). In the third column (the second column is the actual image file) are two important pieces of data for us: the president’s name and a link to that president’s Wikipedia page.

Remember that we wanted the age of each president. Unfortunately, that’s not listed on this table, so we’ll have to visit each page, where, presumably, an age is listed.

Visit w3Schools for a quick primer on HTML tables. But to be brief: tr designates a row and td designates a column. Let’s put our installation of Firefox’s Firebug to use. Let’s confirm that the info we want – a president’s name – is indeed in the third column.

Right click on the hyperlink of John Adams and select Inspect Element. The Firebug panel should pop-up like so, showing that the third <td> element contains “John Adams”. More specifically, it contains the text “John Adams” in between <a> tags, which we learned marks off a hyperlink. This will be important in the next step…

Using Firebug to find out the element containing "John Adams"

Adapting from our previous line of code using XPath, let’s try this:

those_columns = list_of_presidents.xpath("//tr/td[3]")

That XPath notation will find us every third <td> (column) that is enclosed in a <tr> tag (row). That should spit out a large array of Nokogiri elements (as many as there are presidents).

We want the first of those, which is addressed in the 0th-index of that array…

those_columns[0].content
# result is: "George Washington[2][3][4][5]"

So we got a name…but what’s with the bracketed numbers? If you look at the Wikipedia list again, you’ll see that those numbers are links to footnotes. Useful, but not to us. So how to extract just the name? Remember that each president’s name is enclosed in an a (hyperlink) tag. And it’s the first hyperlink. So let’s make our previous XPath a little more complex:

george_washingtons_name = list_of_presidents.xpath("//tr/td[3]/a[1]")[0].content
=>"George Washington"

We’re now asking for the 1st (a[1], in XPath notation, is asking for the first a tag) hyperlink, in the third column (td), in each row (tr). The [0] picks the first of those hyperlinks, and content pulls out the text between its tags. The result is the string “George Washington”.

Step 3: Determine if the Last Name Is Longer Than 6 Characters

OK, now we have a name; how do we programmatically determine the length of the last name (remember, our goal is to search all presidents with last names with more than 6 letters)?

The split and length methods of String

First, let’s get the last name. It’s reasonable to assume that the last word in each string (“Bush” in “George W. Bush”) is the last name. Each word is set off by a space. So we are going to use a String instance method called split, which will take a string and divide it into separate pieces, using a character we specify. The result is an Array of strings.

So:

the_last_name = george_washingtons_name.split(' ')[-1]
# Result: "Washington"

The above line can be described thus:

  1. Take the string inside the variable george_washingtons_name
  2. Split it at every instance of a space
  3. Return the last element (the -1 index of an array returns the last element; -2 would return the second-to-last)

The result, “Washington” from the string “George Washington”, is assigned to the variable the_last_name.
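
Here’s the same idea as a small sketch you can paste into the interpreter; it needs no libraries at all. The method name last_name_of is made up for this example.

```ruby
# last_name_of is a hypothetical helper wrapping the split-and-take-last idea
def last_name_of(full_name)
  full_name.split(' ')[-1]
end

puts "George W. Bush".split(' ').inspect   # the Array that split returns
puts last_name_of("George W. Bush")
puts last_name_of("Martin Van Buren")      # only "Buren": the shortcut is imperfect
```

Taking the last word is an assumption, not a rule of names…which is why Van Buren shows up as plain “Buren” in the script’s output later on.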

Now, this is when we finally use the conditional branch statement if

the_last_name.length > 6
# result: true
if the_last_name.length > 6
 puts("Yep, greater than 6")
end
# result: Yep, greater than 6

length is an instance method of Strings. In the first bit of code, we basically asked: is the length of the_last_name greater than 6. The interpreter says, true

In the second bit of code, we defined a branch statement, saying to print “Yep, greater than 6″ if the condition in the if statement (the_last_name.length > 6) was true. It was.
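
One detail worth testing for yourself is the boundary: “longer than 6” means 7 or more, so a 6-letter name doesn’t pass. A quick sketch, using a made-up method name (long_last_name?) and a default limit of 6 that I’m assuming for illustration:

```ruby
# long_last_name? is a hypothetical helper; the name and the default limit
# of 6 are assumptions made for this sketch.
def long_last_name?(full_name, limit = 6)
  last_name = full_name.split(' ')[-1]
  last_name.length > limit    # true only when strictly longer than the limit
end

if long_last_name?("Andrew Jackson")
  puts "Jackson has more than 6 letters"
end

if long_last_name?("Woodrow Wilson")
  puts "This line never prints: Wilson has exactly 6 letters"
end
```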

Step 4: If So, Fetch the Link to the President’s Page and Open It

Here’s the code, in verbose form, that we’ve taken to get here…plus a few more lines that flesh out how we want the script to actually execute.

	# open the required libraries
	require 'rubygems'
	require 'nokogiri'
	require 'open-uri'

	# Using nokogiri, fetch Wikipedia's list of presidents page
	list_of_presidents = Nokogiri::HTML(open('http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States'))

	# Using another nokogiri method, grab the third column from every row, and from those, grab the first hyperlink (which contains the prez's name)
	an_array_of_links = list_of_presidents.xpath("//tr/td[3]/a[1]")

So we dealt with George Washington’s name…but we want to deal with an array of presidential names. On each element, we want to execute the same operation (see if length of last name is greater than 6 letters, if so, fetch the link).

We’re going to use something called an each loop.

		count = 0

		an_array_of_links.each do |link_to_test|
		# This above statement can be read as: for each element in an_array_of_links, do
		# the following code (until the end line)
		# And as you go through each element, the variable used to reference the element will be named "link_to_test"

		   last_name = link_to_test.content.split(' ')[-1]   #remember that between the <a> tags was the president's name, with the last word being the last  name
			if last_name.length > 6
				the_link_to_the_presidents_page = link_to_test["href"]
				# We'll get to this part in the next section...
			end

		end
		# OK, we're at the end of the each loop. Go back to the top

I’m not going to dissect this. It’s enough to know that each is a method of an Array, and the code inside each do and end is executed for each element of an Array.
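
If the each loop still feels abstract, it may help to try it on an ordinary Array of strings first, with no web pages involved. In this sketch, the list of names and the method name names_with_long_last_names are made up for illustration.

```ruby
# names_with_long_last_names is a hypothetical helper; it walks an Array of
# full names with each and keeps those whose last word has more than 6 letters.
def names_with_long_last_names(names)
  keepers = []
  names.each do |name|
    last_name = name.split(' ')[-1]
    if last_name.length > 6
      keepers << name          # << appends to the end of an Array
    end
  end
  keepers
end

sample = ["George Washington", "John Adams", "Thomas Jefferson", "James Monroe"]
puts names_with_long_last_names(sample).inspect
```

The loop over an_array_of_links works the same way; the only difference is that each element there is a Nokogiri element instead of a plain string.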

OK, using the code above, we are looping through all the presidents’ names and page links. On each name, we’re testing the length of the last name. And if the last name is longer than 6 letters…we’re going to open the link and grab the president’s age.

So:

	if last_name.length > 6
		the_link_to_the_presidents_page = link_to_test["href"] 

		# OK, the value of href is going to be something like "/wiki/George_Washington". That's an address relative to the Wikipedia site
		# so we need to prepend "http://en.wikipedia.org" to have a valid address...

		the_link_to_the_presidents_page = "http://en.wikipedia.org"+the_link_to_the_presidents_page

		# now let's fetch that page

		the_presidents_page = Nokogiri::HTML(open(the_link_to_the_presidents_page))

		# ... OK, now what?

	end
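
Gluing the two strings together with + works fine here. As an alternative sketch, Ruby’s standard uri library has URI.join, which builds the full address for you; absolute_link is a hypothetical helper name used only in this example.

```ruby
require 'uri'

# absolute_link is a hypothetical helper: it resolves a site-relative href
# against a base address. URI.join comes from Ruby's standard uri library.
def absolute_link(base, href)
  URI.join(base, href).to_s
end

puts absolute_link("http://en.wikipedia.org", "/wiki/George_Washington")
puts absolute_link("http://en.wikipedia.org", "http://example.com/elsewhere")
```

The second call shows the one advantage over plain +: if an href is already a complete address, URI.join leaves it alone instead of producing a doubled-up URL.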

Step 5: Grab the age from the president’s page

All right, so the_presidents_page now holds all the HTML inside one president’s page. We need to scope it out to find the XPath necessary to fetch the age of the president.

Let’s take a look at George Washington’s page. More specifically, look at the sidebar to the right, which contains his vital statistics:

George Washington's Wikipedia Sidebar

As you can see, the age is listed, next to the “Died” line.

Using Firebug to check out the structure tells us that the sidebar is a table, and the death date is in the <td> cell that immediately follows the <th> cell containing the text “Died”.

Firebug Inspection of George Washington Sidebar

OK, we’re going to have to use XPath to target those specific cells. Let’s test it out on George Washington’s page. I’m just going to provide you the XPath syntax; you’re welcome to read W3School’s tutorial to figure out why it works:

	george = Nokogiri::HTML(open('http://en.wikipedia.org/wiki/George_Washington'))
	death_date = george.xpath("//th[contains(text(), 'Died')]/following-sibling::*")[0].content 

	# => "December 14, 1799 (aged 67)Mount Vernon, Virginia,\nUnited States"

(Some references to the syntax above: contains, following-sibling)

Well, death_date contains more than we wanted. How do we just get the 67 from the aged 67 part? There’s no html tag that sets 67 off (our job would have been so easy if it had been <age>67</age>).

The last new topic you’ll learn in order to complete the task is regular expressions.

Regular Expressions, aka regexes

Again, like HTML and XPath, regular expressions aren’t “programming”, but it’s a universe of syntax that requires entire books to describe. Put simply, regular expressions allow you to grab strings of text that match a pattern.

From regular-expressions.info, how to match HTML tags

In this case, the pattern I want is: a number, either two-to-three digits long, that is after the word “aged “

I won’t go into the specifics here…I’ve found that you can learn regular expressions with a little reading and trial and error. In this case, the pattern I want, in regex terms, is /aged.+?([0-9]+)/ (note: although the text on the Wikipedia page reads something like “aged 67″, the space in between is a special HTML character, hence, the .+? used to capture it in the reg ex…don’t worry, that last sentence will make perfect sense when you someday understand reg exes.).

In descriptive English, this pattern is going to capture (what’s in the parentheses) any digits from 0-9 that follow the character sequence aged. The forward-slashes denote the beginning and end of the regex.

Again, a regular expression is a syntax, not an actual programming function. So we need to call match, an instance method of String, which executes a text search based on the regular expression that you pass into it. Like so:

death_date = george.xpath("//th[contains(text(), 'Died')]/following-sibling::*")[0].content
age_at_death = death_date.match(/aged.+?([0-9]+)/)[1]

As you can guess, match returns something you can index like an array (or nil, if nothing matched). I don’t want to explain the match method in full here, but the 0th element contains the entire match, which would be “aged 67”, and the 1st element returns what was in between the parentheses of my regular expression…the pattern for a multi-digit number, i.e. 67. Again, you just have to learn about reg exes for this to make more sense.
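
You can try the pattern without fetching anything, since match works on any string. In this sketch, age_from is a hypothetical helper name, and the "\u00A0" in the sample text is a non-breaking space that I’m using to stand in for the special HTML character mentioned above.

```ruby
# age_from is a hypothetical helper built around the same regular expression.
# It returns the captured digits as an Integer, or nil when nothing matches.
def age_from(text)
  found = text.match(/aged.+?([0-9]+)/)   # a match object, or nil
  if found
    found[1].to_i                          # [1] is the first parenthesized group
  else
    nil
  end
end

puts age_from("December 14, 1799 (aged\u00A067)Mount Vernon, Virginia")
puts age_from("July 4, 1826").inspect     # no "aged" in this text
```

Checking for nil before asking for [1] is the important habit here: calling [1] on a failed match would stop the script with an error.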

You don’t have to be a programmer to appreciate regular expressions. Ever do find and replace in a text editor? Let’s say you have a bunch of text with numbers sprinkled through…and those numbers were supposed to have $ signs in front of them. There’s no simple find-and-replace that can replace every group of numbers (9, 12.3, 0.55) with ($9, $12.3, $0.55); but in text-editors that support regexes, you could do such a replacement in one command. This is pretty invaluable if you’ve ever had to clean up “dirty” comma-delimited files.
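
The dollar-sign cleanup described above can be sketched in Ruby with gsub, the String method for regex-based find-and-replace. The sample sentence and the method name add_dollar_signs are made up for illustration.

```ruby
# add_dollar_signs is a hypothetical helper. The pattern matches a run of
# digits, optionally followed by a decimal point and more digits.
def add_dollar_signs(text)
  text.gsub(/(\d+(?:\.\d+)?)/, '$\1')   # \1 re-inserts whatever was matched
end

puts add_dollar_signs("Lunch was 9, the cab was 12.3, and the gum was 0.55.")
```

Note that this blindly prefixes every number it sees, dates and phone numbers included…so on real data you’d want a pickier pattern.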

Bookmark regular-expressions.info and save yourself a lot of time in learning about reg exes.

Step 6: Add up the data you gathered

So now we’ve gotten to our goal: retrieving a president’s age from his Wikipedia page. Now we just need to add it all up and take the average.

Here’s the remaining things we have to do, in narrative form:
Before we go into each president’s page, we need a variable to hold the sum of all the ages (total_age). And we’ll need a variable to keep track of how many presidents’ ages we’ve retrieved (prez_count). However, not every page is going to have an age…since not all former presidents have passed away. So, if the “age” datapoint exists, add it to the total_age variable. And increment prez_count. If not, then do nothing, and go on to the next president until we’ve gone through all the presidents.

Once we’ve finished looping through the pages of presidents, divide total_age by prez_count. And we’re done.
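
Stripped of all the web-fetching, the bookkeeping looks something like the sketch below. The list of ages is made up (nil stands for a page where no age was found), and average_age is a hypothetical helper name.

```ruby
# average_age is a hypothetical helper. It takes an Array in which each entry
# is either an age as a String ("67") or nil (no age was found on that page).
def average_age(ages)
  total_age  = 0
  prez_count = 0
  ages.each do |age|
    if age                       # skip the nils
      total_age  += age.to_i     # to_i turns the String "67" into the number 67
      prez_count += 1
    end
  end
  total_age / prez_count.to_f    # to_f avoids whole-number division
end

puts average_age(["67", "83", nil, "78"])
```

Only three of the four entries count toward the average, which is exactly why we need prez_count instead of just using the length of the list.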

The complete script

The final code is as follows (I’ve added several puts statements to notify you where in the execution the script is…it should take less than 2 minutes):

	require 'rubygems'
	require 'nokogiri'
	require 'open-uri'

	list_of_presidents = Nokogiri::HTML(open('http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States'))

	an_array_of_links = list_of_presidents.xpath("//tr/td[3]/a[1]")

	## These two variables will be added to throughout the execution of the script
	## At the end, they'll have the answers

	prez_count = 0
	total_age = 0

	an_array_of_links.each do |link_to_test|

		last_name = link_to_test.content.split(' ')[-1]

		if last_name.length > 6
			the_link_to_the_presidents_page = link_to_test["href"]
			the_link_to_the_presidents_page = "http://en.wikipedia.org" + the_link_to_the_presidents_page
			prez_page = Nokogiri::HTML(open(the_link_to_the_presidents_page))

			puts "Entering the page: #{the_link_to_the_presidents_page}"

			death_date = prez_page.xpath("//th[contains(text(), 'Died')]/following-sibling::*")

			if death_date && death_date[0]
				# Doing something like `if some_variable_name` is basically asking, "Does some_variable_name have a value?"
				# The test fails only if some_variable_name is false or nil (Ruby's "nothing here" value); 0 and an empty string both count as true
				# death_date[0] is nil when the page has no "Died" cell, as with a living president
				# The double ampersand && functions as an "AND", requiring that both conditional tests be true before entering the if-statement's true branch

				age_at_death = death_date[0].content.match(/aged.+?([0-9]+)/)
				if age_at_death
					# we only get here if there was a "Died" table cell AND a text pattern similar to: "aged XX"
					puts "Age of #{link_to_test.content} is: #{age_at_death[1]}"
					total_age += age_at_death[1].to_i  # technically, age_at_death[1] is a String. to_i will make it a Number so we can safely add it to total_age
					prez_count += 1
				end # end of the if age_at_death
			end # end of the if death_date...
		else
			# we reach this branch of code if last_name was 6 letters or shorter. Let's print a debug message to notify us:
			puts "#{last_name} is not longer than 6 letters"
		end # end of the if last_name.length > 6

	end # OK, we're at the end of the each loop. Go back to the top

	# if we got here, we're out of the loop, and total_age and prez_count have the right values. So:
	the_final_value = total_age/prez_count.to_f  # to_f converts an integer to a decimal number, so we'll get partial years for the average
	puts "#{prez_count} presidents were counted, their age totaling: #{total_age}."
	puts "The average of their ages is #{the_final_value}"

As of Feb. 2010, running that script produces this output:

Entering the page: http://en.wikipedia.org/wiki/George_Washington
Age of George Washington is: 67
Adams is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/Thomas_Jefferson
Age of Thomas Jefferson is: 83
Entering the page: http://en.wikipedia.org/wiki/James_Madison
Age of James Madison is: 85
Monroe is not longer than 6 letters
Adams is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/Andrew_Jackson
Age of Andrew Jackson is: 78
Buren is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/William_Henry_Harrison
Age of William Henry Harrison is: 68
Tyler is not longer than 6 letters
Polk is not longer than 6 letters
Taylor is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/Millard_Fillmore
Age of Millard Fillmore is: 74
Pierce is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/James_Buchanan
Age of James Buchanan is: 77
Entering the page: http://en.wikipedia.org/wiki/Abraham_Lincoln
Age of Abraham Lincoln is: 56
Entering the page: http://en.wikipedia.org/wiki/Andrew_Johnson
Age of Andrew Johnson is: 66
Grant is not longer than 6 letters
Hayes is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/James_A._Garfield
Age of James A. Garfield is: 49
Arthur is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/Grover_Cleveland
Age of Grover Cleveland is: 71
Entering the page: http://en.wikipedia.org/wiki/Benjamin_Harrison
Age of Benjamin Harrison is: 67
Entering the page: http://en.wikipedia.org/wiki/Grover_Cleveland
Age of Grover Cleveland is: 71
Entering the page: http://en.wikipedia.org/wiki/William_McKinley
Age of William McKinley is: 58
Entering the page: http://en.wikipedia.org/wiki/Theodore_Roosevelt
Age of Theodore Roosevelt is: 60
Taft is not longer than 6 letters
Wilson is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/Warren_G._Harding
Age of Warren G. Harding is: 57
Entering the page: http://en.wikipedia.org/wiki/Calvin_Coolidge
Age of Calvin Coolidge is: 60
Hoover is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/Franklin_D._Roosevelt
Age of Franklin D. Roosevelt is: 63
Truman is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/Dwight_D._Eisenhower
Age of Dwight D. Eisenhower is: 78
Entering the page: http://en.wikipedia.org/wiki/John_F._Kennedy
Age of John F. Kennedy is: 46
Entering the page: http://en.wikipedia.org/wiki/Lyndon_B._Johnson
Age of Lyndon B. Johnson is: 64
Nixon is not longer than 6 letters
Ford is not longer than 6 letters
Carter is not longer than 6 letters
Reagan is not longer than 6 letters
Bush is not longer than 6 letters
Entering the page: http://en.wikipedia.org/wiki/Bill_Clinton
Bush is not longer than 6 letters
Obama is not longer than 6 letters
21 presidents were counted, their age totaling: 1398.
The average of their ages is 66.5714285714286

The End?

Well, congratulations…you accomplished a trivial task, but you learned a set of methods that you can apply to much more important goals. If you’re a complete newbie to programming, hopefully this tutorial has given you a glimpse of what’s involved. And how, once you firm up your programming fundamentals, you can get real work done.

But I need to stress that this tutorial simplified things as much as possible…at the cost of best-practices programming. I chose Wikipedia as a target because it’s a reasonably well-structured, high-traffic site that has an ethos of making volumes of information available for the public good.

The script that we just wrote is a naive little child that gets what it wants as fast as it wants. In the real world, many sites that you attempt to scrape will not be so forgiving. Some sites will block you, or fail to connect, if you try to read a hundred pages at once. Some sites will have horrific HTML that will require much more complicated XPath and regular expression syntax. Sometimes, your internet connection might drop. Any of these will bring the above script to an ugly and premature halt. Or even worse: it will collect bad data that you won’t know is erroneous.
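
As one small taste of what “more forgiving” might look like, here’s a sketch of a retry wrapper: it runs a piece of code and, if that code fails, waits a moment and tries again. The method name with_retries and its default numbers are my own invention for this example; nothing like it is built into Nokogiri or open-uri.

```ruby
# with_retries is a hypothetical helper, not part of Nokogiri or open-uri.
# It runs the given block and, if the block raises an error, waits and retries.
def with_retries(attempts = 3, pause_seconds = 2)
  tries = 0
  begin
    tries += 1
    yield
  rescue StandardError
    raise if tries >= attempts   # out of attempts: let the error through
    sleep(pause_seconds)         # be polite: wait before asking again
    retry
  end
end

# Usage sketch (commented out, since it needs a live connection):
# page = with_retries { Nokogiri::HTML(open('http://en.wikipedia.org/wiki/George_Washington')) }
```

A pause between requests is also just good manners toward the site you’re scraping, even when nothing is going wrong.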

All of these problems are solvable, but like any task, it takes experience that comes from trying and failing. Hopefully, this tutorial at least shows you how easy it is to try.

Other resources:

See my four-part series on web-scraping for journalists here.

I'm a programmer journalist, currently teaching computational journalism at Stanford University. I'm trying to do my new blogging at blog.danwin.com.