<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>danwin.com &#187; tutorial</title>
	<atom:link href="https://danwin.com/tag/tutorial/feed/" rel="self" type="application/rss+xml" />
	<link>https://danwin.com</link>
	<description>Words, photos, and code by Dan Nguyen. The &#039;g&#039; is mostly silent.</description>
	<lastBuildDate>Thu, 21 Nov 2019 12:29:57 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.2.39</generator>
	<item>
		<title>dataist blog: An inspiring case for journalists learning to code</title>
		<link>https://danwin.com/2011/02/dataist-blog-an-inspiring-case-for-journalists-learning-to-code/</link>
		<comments>https://danwin.com/2011/02/dataist-blog-an-inspiring-case-for-journalists-learning-to-code/#comments</comments>
		<pubDate>Wed, 16 Feb 2011 13:00:32 +0000</pubDate>
		<dc:creator><![CDATA[Dan Nguyen]]></dc:creator>
				<category><![CDATA[thoughts]]></category>
		<category><![CDATA[works]]></category>
		<category><![CDATA[coding]]></category>
		<category><![CDATA[Dollars for Docs]]></category>
		<category><![CDATA[journalism]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[propublica]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">https://danwin.com/?p=1582</guid>
		<description><![CDATA[<p>About a year ago I threw up a long, rambling guide hoping to teach non-programming journalists some practical code. Looking back at it, it seems inadequate. Actually, I misspoke, I haven&#8217;t looked back at it because I&#8217;m sure I&#8217;ll just spend the next few hours cringing. For example, what a dumb idea it was to [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://danwin.com/2011/02/dataist-blog-an-inspiring-case-for-journalists-learning-to-code/">dataist blog: An inspiring case for journalists learning to code</a> appeared first on <a rel="nofollow" href="https://danwin.com">danwin.com</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p><a href="https://danwin.com/2011/02/dataist-blog-an-inspiring-case-for-journalists-learning-to-code/pills-keyboard-300x200/" rel="attachment wp-att-1596"><img src="https://danwin.com/words/wp-content/uploads/2011/02/pills-keyboard-300x200.jpg" alt="" title="pills-keyboard-300x200" width="300" height="200" class="alignleft size-full wp-image-1596" /></a> About a year ago <a href="https://danwin.com/coding-for-journalists-a-four-part-series/">I threw up a long, rambling guide</a> hoping to teach non-programming journalists some practical code. Looking back at it, it seems inadequate. Actually, I misspoke, I haven&#8217;t looked back at it because I&#8217;m sure I&#8217;ll just spend the next few hours cringing. For example, what a dumb idea it was to put everything from <a href="https://danwin.com/works/coding-for-journalists-go-from-a-know-nothing-to-web-scraper-in-an-hour-hopefully/">&#8220;What is HTML&#8221; to actual Ruby scraping code all in a gigantic, badly formatted post</a>.</p>
<p>The series of articles has gotten a fair number of hits, but I don&#8217;t know how many people were able to stumble through it. Last week, though, I noticed this <a href="http://dataist.wordpress.com/2011/02/05/mapping-ratata-whos-hot/">recent trackback from dataist</a>, a new &#8220;blog about data exploration&#8221; by Finnish journo <a href="http://jensfinnas.com/">Jens Finnäs</a>. He writes that he has &#8220;almost no prior programming experience&#8221; but, after going through my tutorials and checking out <a href="http://scraperwiki.com/">Scraperwiki</a>, was <a href="http://dataist.wordpress.com/2011/02/05/mapping-ratata-whos-hot/">able to produce this cool network graph of the Ratata blog network after about &#8220;two days of trial and error&#8221;:</a></p>
<div id="attachment_1597" style="width: 510px" class="wp-caption aligncenter"><a href="http://dataist.wordpress.com/2011/02/05/mapping-ratata-whos-hot/"><img src="https://danwin.com/words/wp-content/uploads/2011/02/dataist-pdf.gif" alt="Mapping of Ratata blogging network by Jens Finnäs of dataist.wordpress.com" title="Mapping of Ratata blogging network by Jens Finnäs of dataist.wordpress.com" width="500" height="311" class="size-full wp-image-1597" /></a><p class="wp-caption-text">Mapping of Ratata blogging network by Jens Finnäs of dataist.wordpress.com</p></div>
<p>I hope other non-coders who are still intimidated by the thought of learning programming are inspired by Finnas&#8217;s <a href="http://dataist.wordpress.com/2011/02/05/mapping-ratata-whos-hot/">example</a>. Becoming good at coding is not a trivial task. But even the first steps of it can teach a non-coder some profound lessons about data important enough on their own. And if you&#8217;re a curious-type with a question you want to answer, you&#8217;ll soon figure out a way to put something together, as in Finnas&#8217;s case.</p>
<p>ProPublica&#8217;s <a href="http://projects.propublica.org/docdollars/">Dollars for Docs project</a> originated in part from this <a href="https://danwin.com/2010/04/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/">Pfizer-scraping lesson</a> I added on to my <a href="https://danwin.com/coding-for-journalists-a-four-part-series/">programming tutorial</a>: I needed a timely example of public data that wasn&#8217;t as useful as it should be.</p>
<p>My colleagues Charles Ornstein and Tracy Weber may not be programmers (yet), but they are experienced enough with data to know its worth as an investigative resource, and turned an <a href="http://www.propublica.org/nerds/item/the-coders-cause-in-dollars-for-docs">exercise</a> in transparency into a <a href="http://projects.propublica.org/docdollars">focused and effective investigation</a>. It&#8217;s not trivial to find a story in data. Besides being able to do Access queries themselves, C&#038;T knew both the limitations of the data (for example, it&#8217;s difficult to make comparisons between the companies because of <a href="http://projects.propublica.org/docdollars/payment_reports">different reporting periods</a>) and its possibilities, such as the cross-checking of names en masse from the payment lists with state and federal doctor databases.</p>
<p>Their <a href="http://www.propublica.org/series/nurses">investigation into the poor regulation of California nurses</a> &ndash; a collaboration with the LA Times that was a <a href="http://www.pulitzer.org/citation/2010-Public-Service">Pulitzer finalist in the Public Service category</a> &ndash; was similarly data-oriented. They (and the LA Times&#8217; Maloy Moore and Doug Smith) had been diligently building a database of thousands of nurses &ndash; including their disciplinary records and the time it took for the nursing board to act &ndash; which made my part in <a href="http://projects.propublica.org/nurses">building a site</a> to graphically represent the data extremely simple.</p>
<p>The point of all this is: don&#8217;t put off your personal data-training because you think it requires a computer science degree, or that you have to become great at it in order for it to be useful. Even if, after a week of learning, you can barely put together a script to alphabetize your tweets, you&#8217;ll likely gain enough insight into how data is structured and made useful, which will aid in just about every other aspect of your reporting repertoire.</p>
<p>In fact, just knowing to avoid taking notes like this:</p>
<blockquote><p>
Colonel Mustard used the revolver in the library? (not library)<br />
Miss Scarlet used the Candlestick in the dining room? (not Scarlet)<br />
&#8220;Mrs. Peacock, in the dining room, with the <s>revolver</s>? &#8220;<br />
&#8220;Colonel Mustard, rope, <s>conservatory</s>?&#8221;<br />
Mustard? Dining room? Rope (nope)?<br />
&#8220;Was it Mrs. Peacock with the <s>candlestick</s>, inside the dining room?&#8221;
</p></blockquote>
<p>And instead, recording them like this:</p>
<table>
<thead>
<tr>
<th>Who/What?</th>
<th>Role?</th>
<th>Ruled out?</th>
</tr>
</thead>
<tbody>
<tr>
<td>Mustard</td>
<td>Suspect</td>
<td>N</td>
</tr>
<tr>
<td>Scarlet</td>
<td>Suspect</td>
<td>Y</td>
</tr>
<tr>
<td>Peacock</td>
<td>Suspect</td>
<td>N</td>
</tr>
<tr>
<td>Revolver</td>
<td>Weapon</td>
<td>Y</td>
</tr>
<tr>
<td>Candlestick</td>
<td>Weapon</td>
<td>Y</td>
</tr>
<tr>
<td>Rope</td>
<td>Weapon</td>
<td>Y</td>
</tr>
<tr>
<td>Conservatory</td>
<td>Place</td>
<td>Y</td>
</tr>
<tr>
<td>Dining Room</td>
<td>Place</td>
<td>N</td>
</tr>
<tr>
<td>Library</td>
<td>Place</td>
<td>Y</td>
</tr>
</tbody>
</table>
<p>&#8230;will make you a significantly more effective reporter, as well as position you to have your reporting and research become much more ready for thorough analysis and online projects.</p>
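<p>To make that concrete, here is a minimal sketch of the same table as rows that a short Ruby script can filter. The field names (<strong>name</strong>, <strong>role</strong>, <strong>ruled_out</strong>) and the helper <strong>still_possible</strong> are invented for this illustration; the rows themselves are copied from the table above.</p>
<pre name="code" class="ruby">
# The note-taking table, one hash per row.
# The keys :name, :role and :ruled_out are made up for this sketch.
NOTES = [
  { name: "Mustard",      role: "Suspect", ruled_out: false },
  { name: "Scarlet",      role: "Suspect", ruled_out: true  },
  { name: "Peacock",      role: "Suspect", ruled_out: false },
  { name: "Revolver",     role: "Weapon",  ruled_out: true  },
  { name: "Candlestick",  role: "Weapon",  ruled_out: true  },
  { name: "Rope",         role: "Weapon",  ruled_out: true  },
  { name: "Conservatory", role: "Place",   ruled_out: true  },
  { name: "Dining Room",  role: "Place",   ruled_out: false },
  { name: "Library",      role: "Place",   ruled_out: true  }
]

# Names of everything with the given role that has not been ruled out yet
def still_possible(rows, role)
  rows.select { |r| r[:role] == role }
      .reject { |r| r[:ruled_out] }
      .map { |r| r[:name] }
end

puts still_possible(NOTES, "Suspect").join(", ")
puts still_possible(NOTES, "Place").join(", ")
</pre>
<p>With the rows as entered, this prints &#8220;Mustard, Peacock&#8221; and then &#8220;Dining Room&#8221;. Getting the same answer out of the free-form notes would mean re-reading every line.</p>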
<p>There&#8217;s a motherlode of programming resources available through a single Google search. My high school journalism teacher told us that if you want to do journalism, don&#8217;t major in it, just do it. I think the same can be said for programming. I&#8217;m glad I chose a computer field as an undergraduate so that I&#8217;m familiar with the theory. But if you have a career in reporting or research, you have real-world data-needs that most undergrads don&#8217;t. I&#8217;ve found that having those goals and needing to accomplish them has pushed my coding expertise far quicker than did any coursework.</p>
<p>If you aren&#8217;t set on learning to program, but want to get a better grasp of data, I recommend learning:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Regular_expression">Regular expressions</a> &#8211; a set of character patterns, easily printable on a cheat-sheet for memorization, that you use in a text-editor&#8217;s <em>Find and Replace</em> dialog to turn a chunk of text into something you can put into a spreadsheet, as well as clean up the data entries themselves. <a href="http://www.regular-expressions.info/">Regular-expressions.info</a> is the most complete resource I&#8217;ve found. A cheat-sheet can be <a href="http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/">found here</a>. <a href="http://en.wikipedia.org/wiki/Regular_expression">Wikipedia</a> has a list of some simple use cases.</li>
<li>
<a href="http://code.google.com/p/google-refine/">Google Refine</a> &#8211; A spreadsheet-like program that makes it easy to clean and normalize messy data. Ever go through campaign contribution records and wish you could easily group together, and count as one, all the variations of &#8220;Jon J. Doe&#8221;, &#8220;Jonathan J. Doe&#8221;, &#8220;Jon Johnson Doe&#8221;, &#8220;JON J DOE&#8221;, etc.? Refine will do that. Refine developer David Huynh has an <a href="http://www.youtube.com/watch?v=yNccGtn3Wb0&#038;feature=player_embedded">excellent screencast</a> demonstrating Refine&#8217;s power. I wrote a guide as <a href="http://www.propublica.org/nerds/item/using-google-refine-for-data-cleaning">part of the Dollars for Docs tutorials</a>. Even if you know Excel like a pro &ndash; which I do not &ndash; Refine may make your data-life much more enjoyable.</li>
</ul>
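<p>As a rough illustration of the regular-expression idea, here is the same kind of pattern written in Ruby rather than typed into a Find and Replace dialog. The sample lines, the names in them, and the helper <strong>to_row</strong> are all invented for this sketch; it also assumes that no name contains a tab, since the output is tab-separated for pasting into a spreadsheet.</p>
<pre name="code" class="ruby">
# Hypothetical free-form notes; names and amounts are invented for this example.
RAW = [
  "Jon J. Doe gave $1,500 on 2010-03-01",
  "JON J DOE gave $250 on 2010-03-15",
  "Mary Q. Public gave $75 on 2010-04-02"
]

# Three captured groups: the name, the dollar amount, and the date
LINE_PATTERN = /\A(.+?) gave \$([\d,]+) on (\d{4}-\d{2}-\d{2})\z/

# Returns [name, amount, date] for a matching line, or nil for anything else
def to_row(line)
  m = LINE_PATTERN.match(line)
  return nil if m.nil?
  [m[1], m[2].delete(","), m[3]]
end

rows = RAW.map { |line| to_row(line) }.compact

# One tab-separated line per record
rows.each { |row| puts row.join("\t") }
</pre>
<p>Note that the pattern only splits the text into columns. It does nothing to reconcile &#8220;Jon J. Doe&#8221; with &#8220;JON J DOE&#8221;; that grouping step is the kind of job Refine is described as handling.</p>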
<p>If you want to learn coding from the ground up, here&#8217;s a short list of places to start:</p>
<ul>
<li><a href="http://lifehacker.com/#!5744113/learn-to-code-the-full-beginners-guide">Lifehacker&#8217;s &#8220;Full Beginner&#8217;s Guide&#8221;</a> &#8211; a four-day guide that goes from the very basics to writing a simple guessing game. It&#8217;s in JavaScript, but as you&#8217;ll hear plenty of times from veterans, it really doesn&#8217;t matter what language you start out with.
</li>
<li><a href="http://www.ruby-doc.org/docs/ProgrammingRuby/">The Pragmatic Programmer&#8217;s Guide to Programming Ruby</a> &#8211; this covers an older version of Ruby, but is still a great comprehensive, browser-friendly book.
</li>
<li><a href="http://pine.fm/LearnToProgram/">Learn to Program (also in Ruby) by Chris Pine</a> &#8211; Written in 2004, this is still an elegant beginner&#8217;s guide
</li>
<li><a href="http://inventwithpython.com/chapters/">Invent Your Own Computer Games With Python</a> &#8211; You may not be interested in writing game software, but the same programming techniques apply in that field as they do anywhere else. This guide covers all the fundamentals and gives you great project examples.
</li>
<li><a href="http://scraperwiki.com/">ScraperWiki</a> has a massive collection of web-scraping scripts for your perusal, and is where the dataist&#8217;s Finnäs learned from example. ScraperWiki has a set of <a href="http://scraperwiki.com/help/tutorials/python/">python tutorials</a>, too.
</li>
<li>Here&#8217;s a <a href="http://www.e-booksdirectory.com/programming.php">giant list of free programming books</a>.
</li>
<li>Visit the <a href="http://www.reddit.com/r/learnprogramming">learnprogramming subforum in Reddit</a> to find a small, but active community of beginners who aren&#8217;t afraid to start the most basic of discussions with the forum&#8217;s programming experts. <a href="http://stackoverflow.com/">StackOverflow</a> is the single best site for specific questions or problems; often, you can Google your exact problem and a relevant StackOverflow discussion will be at the top.
</li>
<li>And you can always refer back to my <a href="https://danwin.com/coding-for-journalists-a-four-part-series/">four-part programming tutorial from last year</a>, which aims to cover HTML to writing Ruby to scrape websites. I <a href="http://www.propublica.org/nerds/item/doc-dollars-guides-collecting-the-data">also wrote a series of tutorials (with complete code) on how I collected data for Dollars for Docs</a>, including how to scrape from websites, Flash applications, PDFs, and even image files (the solution is specific to one kind of format, so I will gladly welcome anyone else to generalize it).
</li>
</ul>
<p>The post <a rel="nofollow" href="https://danwin.com/2011/02/dataist-blog-an-inspiring-case-for-journalists-learning-to-code/">dataist blog: An inspiring case for journalists learning to code</a> appeared first on <a rel="nofollow" href="https://danwin.com">danwin.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>https://danwin.com/2011/02/dataist-blog-an-inspiring-case-for-journalists-learning-to-code/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Pfizer Data Redux</title>
		<link>https://danwin.com/2010/04/pfizer-data-redux/</link>
		<comments>https://danwin.com/2010/04/pfizer-data-redux/#comments</comments>
		<pubDate>Wed, 28 Apr 2010 14:22:36 +0000</pubDate>
		<dc:creator><![CDATA[Dan Nguyen]]></dc:creator>
				<category><![CDATA[works]]></category>
		<category><![CDATA[coding]]></category>
		<category><![CDATA[doctors]]></category>
		<category><![CDATA[journalists]]></category>
		<category><![CDATA[pfizer]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">https://danwin.com/?p=763</guid>
		<description><![CDATA[<p>Updated the code and results for my guide on how to scrape Pfizer&#8217;s list of payments to doctors. It now contains a more normalized file that has a line for every doctor and payment. The aggregate totals changed marginally.</p>
<p>The post <a rel="nofollow" href="https://danwin.com/2010/04/pfizer-data-redux/">Pfizer Data Redux</a> appeared first on <a rel="nofollow" href="https://danwin.com">danwin.com</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>Updated the code and results for my <a href="https://danwin.com/works/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/">guide on how to scrape Pfizer&#8217;s list of payments to doctors</a>. It now contains a more normalized file that has a line for every doctor and payment. The aggregate totals changed marginally.</p>
<p>The post <a rel="nofollow" href="https://danwin.com/2010/04/pfizer-data-redux/">Pfizer Data Redux</a> appeared first on <a rel="nofollow" href="https://danwin.com">danwin.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>https://danwin.com/2010/04/pfizer-data-redux/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Coding for Journalists 101 : A four-part series</title>
		<link>https://danwin.com/2010/04/coding-for-journalists-101-a-four-part-series/</link>
		<comments>https://danwin.com/2010/04/coding-for-journalists-101-a-four-part-series/#comments</comments>
		<pubDate>Tue, 06 Apr 2010 13:51:40 +0000</pubDate>
		<dc:creator><![CDATA[Dan Nguyen]]></dc:creator>
				<category><![CDATA[works]]></category>
		<category><![CDATA[coding]]></category>
		<category><![CDATA[journalism]]></category>
		<category><![CDATA[pfizer]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[web scraping]]></category>

		<guid isPermaLink="false">https://danwin.com/?p=661</guid>
		<description><![CDATA[<p>Update, January 2012: Everything&#8230;yes, everything, is superseded by my free online book, The Bastards Book of Ruby, which is a much more complete walkthrough of basic programming principles with far more practical and up-to-date examples and projects than what you&#8217;ll find here. I&#8217;m only keeping this old walkthrough up as a historical reference. I&#8217;m sure [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://danwin.com/2010/04/coding-for-journalists-101-a-four-part-series/">Coding for Journalists 101 : A four-part series</a> appeared first on <a rel="nofollow" href="https://danwin.com">danwin.com</a>.</p>
]]></description>
				<content:encoded><![CDATA[<div id="attachment_663" style="width: 510px" class="wp-caption aligncenter"><a href="http://www.flickr.com/photos/nicocavallotto/363251198/"><img src="https://danwin.com/words/wp-content/uploads/2010/04/363251198_9537fe7c6d.jpg" alt="nico.cavallotto" title="nico.cavallotto 363251198_9537fe7c6d" width="500" height="357" class="size-full wp-image-663" /></a><p class="wp-caption-text">Photo by Nico Cavallotto on Flickr</p></div>
<p><strong>Update, January 2012:</strong> Everything&#8230;yes, everything, is superseded by my free online book, <a href="http://ruby.bastardsbook.com">The Bastards Book of Ruby</a>, which is a much more complete walkthrough of basic programming principles with far more practical and up-to-date examples and projects than what you&#8217;ll find here. </p>
<p>I&#8217;m only keeping this old walkthrough up as a historical reference. I&#8217;m sure the code is so ugly that I&#8217;m not going to even try re-reading it.</p>
<p>So check it out: <a href="http://ruby.bastardsbook.com">The Bastards Book of Ruby</a></p>
<p>-Dan</p>
<p>&#8212;</p>
<p><strong>Update, Dec. 30, 2010:</strong> I published <a href="http://www.propublica.org/nerds/item/doc-dollars-guides-collecting-the-data">a series of data collection and cleaning guides for ProPublica</a>, to describe what I did for our Dollars for Docs project. There is a <a href="http://www.propublica.org/nerds/item/scraping-websites">guide for Pfizer which supersedes the one I originally posted here</a>.</p>
<p>So a little while ago, I set out to write some tutorials that would guide the non-coding-but-computer-savvy journalist through enough programming fundamentals so that he/she could write a web scraper to collect data from public websites. A &#8220;little while&#8221; turned out to be more than a month-and-a-half. I actually wrote most of it in a week and then forgot about it. The timeliness of the fourth lesson, which shows <a href="https://danwin.com/works/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/">how to help Pfizer in its mission to be more transparent</a>, compelled me to just publish them in incomplete form. There are probably inconsistencies in the writing and some of the code examples, but the final code sections at the end of each tutorial do seem to execute as expected.</p>
<p>As the tutorials are aimed at people who aren&#8217;t experienced in programming, the code is pretty verbose, pedantic, and in some cases, a little inefficient. It was my attempt to make the code as readable as possible, and I very much welcome suggested edits.</p>
<p><strong>DISCLAIMER:</strong> <em>The code, data files, and results are meant for reference and example only. You use it at your own risk.</em></p>
<ul>
<strong>Tutorial 1: <a href="https://danwin.com/works/coding-for-journalists-go-from-a-know-nothing-to-web-scraper-in-an-hour-hopefully/">Go from knowing nothing to scraping Web pages. In an hour. Hopefully</a></strong> &#8211; A massive, sprawling tutorial that attempts to take you from learning what HTML is, to the definition of an &#8220;if <del datetime="2010-04-06T18:25:14+00:00">loop</del> statement&#8221;, and finally, to using a Ruby library to scrape some information from Wikipedia. It may be too confusing for total neophytes and laughably basic for self-taught programmers. But at least you can kind of see, from beginning to end, one roadmap on going from nothing to something in the programming world.</p>
<p><strong>Tutorial 2: <a href="https://danwin.com/works/coding-for-journalists-102-collecting-info-from-a-county-jail-site/">Scraping a County Jail Website to Find Out Who&#8217;s in Jail </a></strong> &#8211; This uses all the concepts from the first tutorial and applies them to something that a cops reporter might actually want to try out.</p>
<p><strong>Tutorial 3: <a href="https://danwin.com/works/coding-for-journalists-part-3-cross-checking-the-jail-log-with-the-court-system-use-rubys-mechanize-to-fill-out-a-form/">Who&#8217;s Been in Jail Before: Cross-checking the jail logs with the court system with Ruby&#8217;s Mechanize</a></strong> &#8211; This lesson introduces you to another Ruby library that allows you to automate the filling-out of forms so that you can access online databases, in this case, California criminal case histories to see if current inmates are repeat-alleged-offenders.</p>
<p><strong>Tutorial 4: <a href="https://danwin.com/works/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/">Improving Pfizer&#8217;s Dollars-to-Doctors Pay List</a></strong> &#8211; Last week, <strong>Pfizer</strong> <a href="http://www.nytimes.com/2010/04/01/business/01payments.html">released a list of nearly 5,000 doctors and medical institutions</a> to which it made $35 million in consulting and expense payments. Fun. Unfortunately, the list, <a href="http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp">as it initially existed online</a>, is just about useless to anyone wanting to examine trends. This tutorial provides a script to make the list more interesting to journalists.
</ul>
<p>The post <a rel="nofollow" href="https://danwin.com/2010/04/coding-for-journalists-101-a-four-part-series/">Coding for Journalists 101 : A four-part series</a> appeared first on <a rel="nofollow" href="https://danwin.com">danwin.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>https://danwin.com/2010/04/coding-for-journalists-101-a-four-part-series/feed/</wfw:commentRss>
		<slash:comments>25</slash:comments>
		</item>
		<item>
		<title>Coding for Journalists 104: Pfizer&#8217;s Doctor Payments; Making a Better List</title>
		<link>https://danwin.com/2010/04/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/</link>
		<comments>https://danwin.com/2010/04/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/#comments</comments>
		<pubDate>Tue, 06 Apr 2010 13:50:19 +0000</pubDate>
		<dc:creator><![CDATA[Dan Nguyen]]></dc:creator>
				<category><![CDATA[works]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[journalism]]></category>
		<category><![CDATA[pfizer]]></category>
		<category><![CDATA[scraper]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">https://danwin.com/?p=643</guid>
		<description><![CDATA[<p>Update (12/30): So about an eon later, I&#8217;ve updated this by writing a guide for ProPublica. Heed that one. This one will remain in its obsolete state. Update (4/28): Replaced the code and result files. Still haven&#8217;t written out a thorough explainer of what&#8217;s going on here. Update (4/19): After revisiting this script, I see [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://danwin.com/2010/04/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/">Coding for Journalists 104: Pfizer&#8217;s Doctor Payments; Making a Better List</a> appeared first on <a rel="nofollow" href="https://danwin.com">danwin.com</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p><strong>Update (12/30): So about an eon later, <a href="http://www.propublica.org/nerds/item/scraping-websites">I&#8217;ve updated this by writing a guide for ProPublica</a>. Heed that one. This one will remain in its obsolete state.</strong></p>
<p><strong>Update (4/28): Replaced the code and result files. Still haven&#8217;t written out a thorough explainer of what&#8217;s going on here.</strong></p>
<p><strong>Update (4/19): After revisiting this script, I see that it fails to capture some of the payments to doctors associated with entities. I&#8217;m going to rework this script and post an update soon.</strong></p>
<p>So the world&#8217;s largest drug maker, <strong>Pfizer</strong>, decided to tell everyone which doctors it paid to speak and consult on its behalf in the latter half of 2009. These doctors are the same ones who, from time to time, recommend the use of Pfizer products.</p>
<p> <a href="http://www.nytimes.com/2010/04/01/business/01payments.html">From the NYT</a>:</p>
<blockquote><p>
				Pfizer, the world&#8217;s largest drug maker, said Wednesday that it paid about $20 million to 4,500 doctors and other medical professionals for consulting and speaking on its behalf in the last six months of 2009, its first public accounting of payments to the people who decide which drugs to recommend. Pfizer also paid $15.3 million to 250 academic medical centers and other research groups for clinical trials in the same period.</p>
<p> A spokeswoman for Pfizer, Kristen E. Neese, said <strong>most of the disclosures were required by an integrity agreement that the company signed in August to settle a federal investigation into the illegal promotion of drugs for off-label uses</strong>.
			</p></blockquote>
<p>
So, not an entirely altruistic release of information. But it&#8217;s out there nonetheless. You can <a href="http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp">view their list here</a>. <strong>Jump to <a href="#results">my results here</a></strong><br />
<br />
<a href="http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp"><img src="https://danwin.com/words/wp-content/uploads/2010/04/pfizer-list.gif" alt="" title="pfizer-list" width="917"  class="aligncenter size-full wp-image-677"></a> Not bad at first glance. However, on further examination, it&#8217;s clear that the list is nearly useless unless you intend to click through all 486 pages manually, or unless you have one doctor in mind and only care about that doctor&#8217;s relationship. As a journalist, you probably have other questions. Such as:</p>
<ul>
<li>Which doctor received the most?
				</li>
<li>What was the largest kind of expenditure?
				</li>
<li>Were there any unusually large single-item payments?
				</li>
</ul>
<p>None of these questions are answerable unless you have the list in a spreadsheet. As I mentioned in earlier lessons&#8230;there are cases when the information is freely available, but the provider hasn&#8217;t made it easy to analyze. Technically, they are fulfilling their requirement to be &#8220;transparent.&#8221; </p>
<p>I&#8217;ll give them the benefit of the doubt that they truly want this list to be as accessible and visible as possible&#8230;I tried emailing them to ask for the list as a single spreadsheet, but the email function was broken. So, let&#8217;s just write some code to save them some work and to get our answers a little quicker.<br />
<span id="more-643"></span></p>
<link rel='stylesheet' href='https://danwin.com/css/code.css' type='text/css' media='all'>
<div class="code-doc">
<div class='over-note' style='font-size: 12pt; color: #a44; border: 1px solid black; margin: 20px; padding: 20px;'>This is part of a <a href="https://danwin.com/works/coding-for-journalists-101-a-four-part-series/">four-part series on web-scraping for journalists</a>. As of <strong>Apr. 5, 2010</strong>, it was published a bit incomplete because I wanted to post a timely solution to the <a href="https://danwin.com/works/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/">recent Pfizer doctor payments list release</a>, but the code at the bottom of each tutorial should execute properly. The code examples are meant for reference and I make no claims to the accuracy of the results. Contact <a href="mailto:dan@danwin.com">dan@danwin.com</a> if you have any questions, or leave a comment below.</p>
<p><strong>DISCLAIMER:</strong> <em>The code, data files, and results are meant for reference and example only. You use it at your own risk.</em></p>
</div>
<div class="sec">
<h2>
					The Code<br />
				</h2>
<p>The following code uses the same nokogiri strategies in the past three lessons. But here are the specific considerations that we have to make for Pfizer&#8217;s list:</p>
<ul>
<li>The base URL is: <a href="http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?enPdNm=All&amp;iPageNo=1">http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?enPdNm=All&amp;<strong>iPageNo=1</strong></a> The most interesting parameter, <strong>iPageNo</strong>, is bolded. If you replace &#8216;1&#8217; with any number, you&#8217;ll see you can progress through the list. There appear to be <a href="http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?enPdNm=All&amp;iPageNo=486">486 pages</a>.
					</li>
<li>Each page has a table of data with id <strong>#hcpPayments</strong>. The rows of data aren&#8217;t very normalized. For example, each &#8220;Entity Paid&#8221; can have many services/activities listed, with each of those items having another name attached to it. Then there are &#8220;cash&#8221; and &#8220;non-cash&#8221; values, which may or may not be numeric (&#8220;&#8212;&#8221; apparently means 0). There&#8217;s no easy CSS selector to grab each entity&#8230;but it seems that we can safely assume that if the first table column has a name (and the second and third contain city and state), this is a new entity.
					</li>
<p>
						These are the steps we&#8217;ll take:</p>
<ul>
<li>Download pages 1 to 486 of the list (each page has 10 entries)</li>
<li>Run a method that gathers all the doctor names from the pages we just downloaded onto our hard drive</li>
<li>From that list of doctors, query the Pfizer site and gather the individual payments to every doctor.</li>
</ul>
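<p>The two row-level rules described above (a dash means 0, and a row with name, city and state filled in starts a new entity) can be sketched roughly as follows. This is a standalone illustration and not the script used in this tutorial: the helper names <strong>parse_amount</strong> and <strong>new_entity_row?</strong>, the sample rows, and the column order are all assumptions, and amounts are assumed to be whole dollars.</p>
<pre name="code" class="ruby">
# Assumption: a cell holds either a whole-dollar figure such as "$1,250"
# or a dash placeholder. Anything with no digits at all counts as 0.
def parse_amount(cell)
  digits = cell.to_s.gsub(/[^0-9]/, '')
  digits.empty? ? 0 : digits.to_i
end

# Assumption: a row starts a new entity when its first three cells
# (name, city, state) are all filled in; continuation rows leave them empty.
def new_entity_row?(cells)
  cells.first(3).count { |c| !c.to_s.strip.empty? } == 3
end

# Invented sample rows: name, city, state, activity, cash, non-cash.
# "\u2014" is the long dash the list uses as a placeholder.
rows = [
  ["SMITH, JOHN", "ALBANY", "NY", "Expert-Led Forums", "$1,250", "\u2014"],
  ["",            "",       "",   "Meals",             "\u2014", "$85"]
]

rows.each do |cells|
  kind = new_entity_row?(cells) ? "new entity" : "continuation"
  puts "#{kind}: cash=#{parse_amount(cells[4])} non-cash=#{parse_amount(cells[5])}"
end
</pre>
<p>If the real list turns out to include cents, stripping every non-digit would inflate the amounts a hundredfold, so that assumption is worth checking against a few pages by eye.</p>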
<div class='sec'>
<p>	At the top, I&#8217;ve written a few convenience methods to deal with strings. Also included is <strong>get_doc_query</strong>, a function we call to extract the doctor name from the links on the site.
					</p>
<p><strong>puts_error</strong> is a quick function to log any errors we might have.</p>
<pre name="code" class="ruby">
						# Libraries used by the methods below
						require 'rubygems'
						require 'nokogiri'
						require 'open-uri'

						# Some general functions to deal with strings
					class String

					  alias_method :old_strip, :strip

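					  # Also trims non-breaking spaces (bytes \302\240 in UTF-8) and turns line breaks into spaces.
					  # The "|" is left out of the brackets because inside [] it would match a literal pipe.
					  def strip
						  self.old_strip.gsub(/^[\302\240\s]*|[\302\240\s]*$/, '').gsub(/[\r\n]/, " ")
					  end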

					  def strip_for_num
					    self.strip.gsub(/[^0-9]/, '')
					  end

					  def blank?
						respond_to?(:empty?) ? empty? : !self
					  end
					end
					
					
					END_PAGE=486
					BASE_URL=''
					DOC_QUERY_URL='http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?hcpdisplayName='


					def get_doc_query(str)
					  str.match(/hcpdisplayName\=(.+)/)[1]
					end

					def puts_error(str)
					  err = "#{Time.now}: #{str}"
					  puts err
					  File.open("pfizer_error_log.txt", 'a+'){|f| f.puts(err)}
					end
					
					
						</pre>
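<p>To see what the number clean-up does to a payment cell, here&#8217;s a minimal sketch. <strong>amount_to_i</strong> is a hypothetical helper, not part of the script above, and it assumes whole-dollar amounts: stripping every non-digit (as <strong>strip_for_num</strong> does) would also throw away a decimal point.</p>

```ruby
# Hypothetical helper (not in the original script): turn the text of a
# payment cell into an Integer, treating a dash or empty cell as zero.
def amount_to_i(text)
  digits = text.to_s.gsub(/[^0-9]/, '')   # same idea as strip_for_num
  digits.empty? ? 0 : digits.to_i
end

puts amount_to_i("$1,500")   # a dollar figure with a comma
puts amount_to_i("\u2014")   # the long dash used in the table
```

<p>A dash or an empty cell comes out as 0, which matches the guess above that &#8220;&#8212;&#8221; means zero.</p>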
</p></div>
<div class='sec'>
<p>I found it easiest to download all the pages onto the hard drive first, using something like <a href='http://en.wikipedia.org/wiki/CURL'>cURL</a>, and then run the following code on them.</p>
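<p>If you&#8217;d rather do the download step in Ruby, it might look something like this sketch. <strong>list_page_url</strong> and <strong>download_page</strong> are hypothetical helpers, and I&#8217;m assuming each page gets saved as <em>1.html</em>, <em>2.html</em> and so on, since those are the file names the local-page code reads.</p>

```ruby
require 'net/http'
require 'uri'

# Listing URL from the notes above; only the iPageNo value changes.
LIST_URL = 'http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?enPdNm=All&iPageNo='

# Hypothetical helper: the URL of the nth listing page.
def list_page_url(page_no)
  "#{LIST_URL}#{page_no}"
end

# Hypothetical helper: fetch one listing page and save it as "<n>.html"
# inside dir.
def download_page(page_no, dir = '.')
  html = Net::HTTP.get(URI.parse(list_page_url(page_no)))
  File.open(File.join(dir, "#{page_no}.html"), 'w') { |f| f.write(html) }
end

# Uncomment to fetch every page, pausing between requests:
# (1..486).each { |n| download_page(n); sleep 2 }
```

<p>The loop is commented out on purpose, so nothing hits the live site until you decide to run it.</p>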
<p><strong>process_local_pages</strong> is a method that iterates through every page (you can set BASE_URL to either your hard drive, if you&#8217;ve downloaded all the pages yourself, or to the Pfizer page), runs <strong>process_row</strong> on each table row, and stores the doctor names, the payments to each entity, and the entity totals in separate files.</p>
<p>The three resulting files are:</p>
<ul>
<li><strong>pfizer_doctors.txt</strong> &#8211; Every doctor name listed. We will use this in the next step to query each doctor individually on Pfizer&#8217;s site.</li>
<li><strong>pfizer_entities.txt</strong> &#8211; A list of every payment made to Entities</li>
<li><strong>pfizer_entity_totals.txt</strong> &#8211; A list of the total payments made to Entities</li>
</ul>
<pre name="code" class="ruby">


						def process_row(row, i, current_entity, arrays)  

						  tds = row.css('td').collect{|r| r.text.strip}

						   if !tds[3].blank? 
						     if !tds[1].blank?
						     # new entity
						     puts tds[0]
							     current_entity = {:name=>tds[0],:city=>tds[1], :state=>tds[2], :page=>i, :services=>[]} 
							     arrays[:entities].push(current_entity) if arrays[:entities]
						  	   current_class = row['class']
							   end

						     if tds[3].match(/Total/)
						       arrays[:totals].push([current_entity[:name], tds[4].strip_for_num, tds[5].strip_for_num].join("\t")) if arrays[:totals]

						     else
						        # new service
						   	   services_td = row.css('td')[3]
						   	   service_name = services_td.css("ul li a")[0].text.strip 
						   	   puts "#{current_entity[:name]}\t#{service_name}" 
						   	   current_entity[:services].push([service_name, tds[4].strip_for_num, tds[5].strip_for_num]) 

						   	   arrays[:doctors].push(services_td.css("ul li ul li a").map{|a| get_doc_query(a['href']) }.uniq) if arrays[:doctors]
						     end
						   elsif tds.reject{|t| t.blank?}.length == 0
						     #blank row
						   else
						     puts_error "Page #{i}: Encountered a row and didn't know what to do with it: #{tds.join("\t")}"
						   end

						   return current_entity
						end





						def process_local_pages

						  doctors_arr = []
						  entities_arr = []
						  totals_arr =[]

						  for i in 1..END_PAGE
						    begin
						  	   page = Nokogiri::HTML(open("#{BASE_URL}#{i}.html"))

						    	 count1, count2 = page.css('#pagination td.alignRight').last.text.match(/([0-9]{1,}) - ([0-9]{1,})/)[1..2].map{|c| c.to_i}
						    	 count = count2-count1+1

						    	 puts_error("Page #{i} WARNING: Pagination count is bad") if count < 0
						    	 puts("Page #{i}: #{count1} to #{count2}")

						    	 rows = page.css('#hcpPayments tbody tr')

						    	 current_entity=nil

						    	 rows.each do |row|  	   
						    	   current_entity= process_row(row, i, current_entity, {:doctors=>doctors_arr, :entities=>entities_arr, :totals=>totals_arr})
						       end

						     rescue Exception=>e
						  	   # e.message is the portable way to get the error text
						  	   puts_error "Oops, had a problem getting the #{i}-page: #{[e.message, e.backtrace.map{|b| "\n\t#{b}"}].join("\n")}"
						     end
						  end

						  File.open("pfizer_doctors.txt", 'w'){|f|
						    # each row pushed an array of names, so flatten before removing duplicates
						    doctors_arr.flatten.uniq.each do |d|
						        f.puts(d)
						    end
						  }

						  File.open("pfizer_entities.txt", 'w'){|f|
						    entities_arr.each do |e|
						      e[:services].each do |s|
						        f.puts("#{e[:name]}\t#{e[:page]}\t#{e[:city]}\t#{e[:state]}\t#{s[0]}\t#{s[1]}\t#{s[2]}")
						      end  
						    end
						  }


						  File.open("pfizer_entity_totals.txt", 'w'){|f|
						    totals_arr.uniq.each do |d|
						        f.puts(d)
						    end
						  }
						end

					</pre>
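<p>The row logic is easier to follow without the HTML in the way. Here&#8217;s a simplified sketch of the same idea on plain arrays: a row with a city starts a new entity, a &#8220;Total&#8221; row closes it out, and anything else is a service. The sample rows are invented, and <strong>group_rows</strong> is a hypothetical stand-in for <strong>process_row</strong>, not a replacement for it.</p>

```ruby
# Invented sample data. Each inner array stands for the text of one
# table row: [entity, city, state, service, cash, non-cash]
SAMPLE_ROWS = [
  ['EXAMPLE CLINIC', 'AUSTIN', 'TX', 'Speaking', '$1,000', "\u2014"],
  ['',               '',       '',   'Meals',    '$50',    '$25'],
  ['',               '',       '',   'Total',    '$1,050', '$25'],
  ['',               '',       '',   '',         '',       ''],
  ['OTHER GROUP',    'RENO',   'NV', 'Research', '$9,000', "\u2014"]
]

# Hypothetical, simplified stand-in for process_row: group rows by entity.
def group_rows(rows)
  entities = []
  current = nil
  rows.each do |tds|
    next if tds.all? { |t| t.strip.empty? }     # blank spacer row
    unless tds[1].strip.empty?                  # city present: new entity
      current = { :name => tds[0], :city => tds[1], :state => tds[2],
                  :services => [], :total => nil }
      entities << current
    end
    amounts = tds[4..5].map { |t| t.gsub(/[^0-9]/, '') }
    if tds[3] =~ /Total/
      current[:total] = amounts                 # the entity's summary row
    else
      current[:services] << ([tds[3]] + amounts)
    end
  end
  entities
end

group_rows(SAMPLE_ROWS).each do |e|
  puts "#{e[:name]}: #{e[:services].length} service(s)"
end
```

<p>Note that the first row of an entity counts twice, once as the entity and once as its first service, just as it does in the real method.</p>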
</p></div>
<div class='sec'>
<p><strong>process_doctor</strong> is what we run after we&#8217;ve compiled the list of doctor names that show up on the Pfizer list. Each doctor has his/her own page with detailed spending. The data rows are in roughly the same format as the main list, so we reuse <strong>process_row</strong>.</p>
<pre name="code" class="ruby">

						def process_doctor(r, time='')
						  begin
						    url = "#{DOC_QUERY_URL}#{r}"
						    page = Nokogiri::HTML(open("#{url}"))
						  rescue Exception=>e
							   puts_error "Oops, had a problem getting the #{r}-entry: #{[e.message, e.backtrace.map{|b| "\n\t#{b}"}].join("\n")}"
							   return []  # no page to parse, so skip this doctor
						  end

						  rows = page.css('#hcpPayments tbody tr')
						  entities_arr = []
						  current_entity=nil

						   rows.each do |row|  	   
						     current_entity= process_row(row, '', current_entity, {:entities=>entities_arr})
						   end


						   name = r.split('+')
						   puts_error("Should've been a last name at #{r}") if !name[0].match(/,$/)
						   name = "#{name[0].gsub(/,$/, '')}\t#{name[1..-1].join(' ')}"

						   vals=[]
						   entities_arr.each do |e| 
						     e[:services].each do |s|
						       vals.push("#{name}\t#{e[:name]}\t#{e[:page]}\t#{e[:city]}\t#{e[:state]}\t#{s[0]}\t#{s[1]}\t#{s[2]}\t#{url}\t#{time}")
						    end
						   end

						  vals.each{|val| File.open("pfizer_doctor_details.txt", "a"){ |f| 
						    f.puts val
						  }}

						  puts vals
						  return vals
						end


					</pre>
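<p>The name handling is the fiddly part, so here it is on its own as a small sketch. <strong>split_doctor_name</strong> is a hypothetical helper that strings together the regular expression from <strong>get_doc_query</strong> and the split-and-join steps from <strong>process_doctor</strong>; it assumes the link always carries the name as <em>LAST,+FIRST+MIDDLE</em>.</p>

```ruby
# Hypothetical helper: pull the doctor query out of a link and split it
# into [last_name, given_names], using the same steps as the script.
def split_doctor_name(href)
  query = href.match(/hcpdisplayName\=(.+)/)[1]   # e.g. "LAST,+FIRST+MIDDLE"
  parts = query.split('+')
  [parts[0].gsub(/,$/, ''), parts[1..-1].join(' ')]
end

href = 'http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?hcpdisplayName=SACKS,+GERALD+MICHAEL'
last_name, given_names = split_doctor_name(href)
puts "#{last_name}\t#{given_names}"
```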
</p></div>
<div class='sec'>
<p><strong>process_doctor_pages</strong> is just a function that calls <strong>process_doctor</strong> for each name in the <strong>pfizer_doctors.txt</strong> file we previously gathered.</p>
<p>The final result is pfizer_doctor_details.txt, which contains a line for every payment to every doctor.</p>
<pre name="code" class="ruby">
						def process_doctor_pages
						  time = Time.now

						  File.open("pfizer_doctors.txt", 'r'){|f|
						     f.readlines.each do |r|
						        # drop the trailing newline so it doesn't end up in the URL
						        vals = process_doctor(r.strip, time)
						     end 
						  }
						end		

					</pre>
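<p>Once you have that file, ranking the doctors is a short step. This is only a sketch under a few assumptions: the sample lines are invented, <strong>rank_doctors</strong> is a hypothetical helper, each line is treated as one distinct payment, and the cash and non-cash columns are taken to be the 8th and 9th tab-separated fields, which is the order <strong>process_doctor</strong> writes them in.</p>

```ruby
# Invented sample lines in the column order process_doctor writes:
# last, given names, entity, page, city, state, service, cash, non-cash, url, time
SAMPLE_DETAILS = [
  "DOE\tJANE\tEXAMPLE CLINIC\t\tAUSTIN\tTX\tSpeaking\t1000\t\turl\ttime",
  "DOE\tJANE\tEXAMPLE CLINIC\t\tAUSTIN\tTX\tMeals\t50\t25\turl\ttime",
  "ROE\tRICHARD\tOTHER GROUP\t\tRENO\tNV\tResearch\t9000\t\turl\ttime"
]

# Hypothetical helper: add up cash and non-cash values per doctor and
# return [name, total] pairs, largest total first.
def rank_doctors(lines)
  totals = Hash.new(0)
  lines.each do |line|
    cols = line.chomp.split("\t")
    name = "#{cols[0]}, #{cols[1]}"
    totals[name] += cols[7].to_i + cols[8].to_i   # an empty cell counts as 0
  end
  totals.sort_by { |_name, total| -total }
end

rank_doctors(SAMPLE_DETAILS).each { |name, total| puts "#{name}\t#{total}" }
```

<p>To run it on the real thing, pass in <strong>File.readlines("pfizer_doctor_details.txt")</strong> instead of the sample lines.</p>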
</p></div>
</p></div>
<div class='sec'>
<h2><a name="results"></a><br />
					The Results</h2>
<p>				After Googling the top-Pfizer-paid-doctor on the list (<a href="http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?hcpdisplayName=SACKS,+GERALD+MICHAEL">Gerald Michael Sacks for ~$150K</a>), I came across the <a href='http://blog.pharmaconduct.org/'>Pharma Conduct</a> blog, which had <a href='http://blog.pharmaconduct.org/2010/04/who-were-top-5-recipients-of-money-from.html?src=PharmaConduct+20100403'>already posted partial aggregations of the list</a>, including the <a href='http://blog.pharmaconduct.org/2010/04/which-doctors-received-highest.html?src=PharmaConduct+20100405'>top 5 doctors</a>, complete with profiles and pics.</p>
<p>				As Pharma Conduct has already been on the ball, I&#8217;ll defer to its analysis. It has some good background here on how lame pharma companies have been in <a href='http://blog.pharmaconduct.org/2010/02/pharma-gets-failing-grades-for-initial.html'>past releases of data</a>. Overall, Pharma Conduct is <a href='http://blog.pharmaconduct.org/2010/03/pfizer-releases-payments-to-physicians.html'>less-than impressed</a> with Pfizer:</p>
<blockquote><p>
				Despite reporting more information than some its peers, Pfizer&#8217;s interface is still very limited.  For one, to use the search filtering, you must know a physician&#8217;s first name and last name, as well as the state where the payment was made.  Also, the data cannot be sorted by payment amount, which is a big limitation.  Pfizer should be given credit for releasing the information and being so thorough.  However, by releasing it in a format that is not really amenable to data analysis and is more suited to simply looking up results one physician at a time, I echo John Mack&#8217;s sentiment, namely, that this data is translucent, but not transparent.	</p></blockquote></div>
<p>The post <a rel="nofollow" href="https://danwin.com/2010/04/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/">Coding for Journalists 104: Pfizer&#8217;s Doctor Payments; Making a Better List</a> appeared first on <a rel="nofollow" href="https://danwin.com">danwin.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>https://danwin.com/2010/04/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Coding for Journalists 103: Who&#8217;s been in jail before: Cross-checking the jail log with the court system; Use Ruby&#8217;s mechanize to fill out a form</title>
		<link>https://danwin.com/2010/04/coding-for-journalists-part-3-cross-checking-the-jail-log-with-the-court-system-use-rubys-mechanize-to-fill-out-a-form/</link>
		<comments>https://danwin.com/2010/04/coding-for-journalists-part-3-cross-checking-the-jail-log-with-the-court-system-use-rubys-mechanize-to-fill-out-a-form/#comments</comments>
		<pubDate>Tue, 06 Apr 2010 13:40:53 +0000</pubDate>
		<dc:creator><![CDATA[Dan Nguyen]]></dc:creator>
				<category><![CDATA[works]]></category>
		<category><![CDATA[coding]]></category>
		<category><![CDATA[courts]]></category>
		<category><![CDATA[journalism]]></category>
		<category><![CDATA[mechanize]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">https://danwin.com/?p=584</guid>
		<description><![CDATA[<p>This is part of a four-part series on web-scraping for journalists. As of Apr. 5, 2010, it was published a bit incomplete because I wanted to post a timely solution to the recent Pfizer doctor payments list release, but the code at the bottom of each tutorial should execute properly. The code examples are [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://danwin.com/2010/04/coding-for-journalists-part-3-cross-checking-the-jail-log-with-the-court-system-use-rubys-mechanize-to-fill-out-a-form/">Coding for Journalists 103: Who&#8217;s been in jail before: Cross-checking the jail log with the court system; Use Ruby&#8217;s mechanize to fill out a form</a> appeared first on <a rel="nofollow" href="https://danwin.com">danwin.com</a>.</p>
]]></description>
				<content:encoded><![CDATA[<div class='over-note' style='font-size: 12pt; color: #a44; border: 1px solid black; margin: 20px; padding: 20px;'>This is part of a <a href="https://danwin.com/works/coding-for-journalists-101-a-four-part-series/">four-part series on web-scraping for journalists</a>. As of <strong>Apr. 5, 2010</strong>, it was published a bit incomplete because I wanted to post a timely solution to the <a href="https://danwin.com/works/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/">recent Pfizer doctor payments list release</a>, but the code at the bottom of each tutorial should execute properly. The code examples are meant for reference and I make no claims to the accuracy of the results. Contact <a href="mailto:dan@danwin.com">dan@danwin.com</a> if you have any questions, or leave a comment below.</p>
<p><strong>DISCLAIMER:</strong> <em>The code, data files, and results are meant for reference and example only. You use it at your own risk.</em></p>
<p><b>In particular, with lesson 3</b>, I skipped basically any explanation of the code. I hope to get around to it later.</p>
</div>
<h2>Going to Court</h2>
<p>In the <a href="https://danwin.com/works/coding-for-journalists-101-a-four-part-series/">last lesson</a>, we learned how to write a script that would record who was in jail at a given hour. This could yield some interesting stories for a crime reporter, including spates of arrests for notable crimes and inmates who are held with $1,000,000 bail for relatively minor crimes. However, an even more interesting angle would be to check the inmates&#8217; prior records, to get a glimpse of the recidivism rate, for example.</p>
<p><a href="https://services.saccourt.com/indexsearchnew/CaseType.aspx">Sacramento Superior Court</a> allows users to search not just by name but also by the unique ID number given to inmates by Sacramento-area jurisdictions. This makes it pretty easy to link current inmates to court records.</p>
<p><a href="https://danwin.com/words/wp-content/uploads/2010/04/small-court-page.gif"><img src="https://danwin.com/words/wp-content/uploads/2010/04/small-court-page.gif" alt="" title="small-court-page" width="500"  class="size-full wp-image-672" /></a><br />
</p>
<p>However, the techniques we used in past lessons to automate the data collection won&#8217;t work here. As you can see in the above picture, you have to fill out a form. That&#8217;s not something any of the code we&#8217;ve written previously will do. Luckily, that&#8217;s where Ruby&#8217;s <strong>mechanize</strong> comes in.</p>
<p><span id="more-584"></span></p>
<div class="code-doc">
<link rel='stylesheet' href='https://danwin.com/css/code.css' type='text/css' media='all' />
<div class='sec'>
<h2>Ruby Mechanize</h2>
<p>Go to the <a href="http://mechanize.rubyforge.org/mechanize/">mechanize library homepage</a> to learn how to install it as a Ruby gem. It requires that <a href="http://nokogiri.rubyforge.org/">nokogiri</a> is installed, which you should&#8217;ve done if you&#8217;ve made it this far into my tutorials.</p>
<p>There are some <a href="http://mechanize.rubyforge.org/mechanize/EXAMPLES_rdoc.html">basic examples on the project page</a>, but you&#8217;re going to have to read some of the technical documentation to learn some of mechanize&#8217;s commands.</p>
<p>Here&#8217;s a code example we&#8217;ll be using:</p>
<pre class="ruby" name="code">
search_form['txtXref']='00112233'
result_page_form = search_form.submit
</pre>
<p><b>search_form</b> refers to a mechanize Form object. In that HTML form is a textfield with a name of &#8216;txtXref&#8217;. The array notation we used above is setting that textfield to the value &#8216;00112233&#8217;.</p>
<p>Then, using mechanize&#8217;s Form object&#8217;s <b>submit</b> method, we submit the form just as if we had clicked the &#8220;Submit&#8221; button on a webpage.</p>
<p>That&#8217;s the basic theory.</p>
</div>
<div class='sec'>
<h2>The Code</h2>
<p>Note: The following code works, if you have an inmates.txt file from the last lesson (<a href="https://danwin.com/static/jail-list/inmates.txt">use this one if you don&#8217;t</a>; keep in mind that the last names and birthdates have been changed/redacted). However, it&#8217;s very rudimentary, with no error-checking at all. Still, it&#8217;ll give you a couple tab-delimited files that will list an inmate&#8217;s past charges and past sentences served, with XREF being the key that links those files to inmates.txt.</p>
<p>Remember that you&#8217;re accessing a live site here. This script pauses for 1 second after it finishes each inmate&#8230;there should be no reason to hit the site more frequently than that.</p>
<p>This tutorial will be updated in the future.</p>
<pre name="code" class="ruby">
require 'rubygems'
require 'mechanize'
search_url='https://services.saccourt.com/indexsearchnew/CriminalSearchV2.aspx'
xrefs = File.open("inmates.txt", 'r').readlines().map{|x| x.split("\t")[7].match(/[0-9]+/).to_s}.uniq

# open datafile


a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

search_page = a.get(search_url) 
search_form = search_page.form_with(:name=>'frmCriminalSearch')

#show the fieldnames
search_form.fields.map {|f| f.name}
#=> ["__EVENTTARGET", "__EVENTARGUMENT", "__VIEWSTATE", "txtLastName", "txtFirstName", "txtDOB", "txtXref", "txtCaseNumber", "lstCaseType"]

search_form.buttons.map{|m| m.name}
# => ["btnFindByName", "btnFindByNumber"]


xrefs.each do |xref|
  puts "\nFinding info for xref: #{xref}"
  search_form['txtXref']=xref
  search_form.field_with(:name=>'lstCaseType').options[1].select
  result_page_form = search_form.submit.forms.first
  case_buttons = result_page_form.buttons[1..-2]

  puts "There are #{case_buttons.length} cases to check:"
  case_buttons.each do |cb|
    file_page = result_page_form.click_button(cb)
    file_page = file_page.parser
  
    charges_arr = []
    sentences_arr =[]
    charge_rows = file_page.css('#dgDispositionCharges tr')
  
    if charge_rows.length > 0
    puts "Charges: "
      charge_rows[1..-1].each do |cr|
        ctd = cr.css('td').map{|td| td.text}
        charges_arr << {:plea=>ctd[1], :charge=>ctd[2], :date=>ctd[4], :severity=>ctd[5]}
        # values gives the plea, charge, date and severity as an array we can join
        puts "\t - #{charges_arr.last.values.join("\t")}"
      end  
    end
  
    sentence_rows = file_page.css('#dgSentenceSummary tr')
  
    if sentence_rows.length > 0
      puts "Sentences: "
      sentence_rows[1..-1].each do |sr|
        sentences_arr << sr.css('td').map{|td| td.text}.join("\t")
        puts "\t - #{sentences_arr.last}"
      end
    end
    
    
    File.open("court_charges.txt",'a+'){ |f|

      charges_arr.each do |c|
        f.puts("#{xref}\t#{c[:plea]}\t#{c[:charge]}\t#{c[:date]}\t#{c[:severity]}")
      end
    }

    File.open("sentences.txt", 'a+'){ |f| 
      sentences_arr.each do |c|
        f.puts("#{xref}\t#{c}")
      end
    }
    
    
    
  
  end #done checking a case entry
  
  puts "Done with #{xref}, sleeping"
  sleep 1
  
  
end  

 

 
 
</pre>
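<p>With the charges file in hand, a first stab at the recidivism question could be as simple as counting lines per XREF. This is a sketch only: the sample lines (including the charge codes) are invented, and <strong>charges_per_xref</strong> is a hypothetical helper that assumes the XREF is the first tab-separated field, as written by the script above.</p>

```ruby
# Invented sample lines in the layout written to court_charges.txt:
# xref, plea, charge, date, severity
SAMPLE_CHARGES = [
  "00112233\tNolo\tPC 459\t01/02/2008\tFelony",
  "00112233\tGuilty\tVC 23152\t03/04/2009\tMisdemeanor",
  "00445566\tNolo\tPC 484\t05/06/2009\tMisdemeanor"
]

# Hypothetical helper: how many past charges are on file for each XREF.
def charges_per_xref(lines)
  counts = Hash.new(0)
  lines.each { |line| counts[line.split("\t")[0]] += 1 }
  counts
end

charges_per_xref(SAMPLE_CHARGES).each do |xref, count|
  puts "#{xref}\t#{count}"
end
```

<p>Since XREF is also a column in inmates.txt, the same key lets you attach these counts back to the names on the jail list.</p>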
</div>
</div>
<p>The post <a rel="nofollow" href="https://danwin.com/2010/04/coding-for-journalists-part-3-cross-checking-the-jail-log-with-the-court-system-use-rubys-mechanize-to-fill-out-a-form/">Coding for Journalists 103: Who&#8217;s been in jail before: Cross-checking the jail log with the court system; Use Ruby&#8217;s mechanize to fill out a form</a> appeared first on <a rel="nofollow" href="https://danwin.com">danwin.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>https://danwin.com/2010/04/coding-for-journalists-part-3-cross-checking-the-jail-log-with-the-court-system-use-rubys-mechanize-to-fill-out-a-form/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Coding for Journalists 102: Who&#8217;s in Jail Now: Collecting info from a county jail site</title>
		<link>https://danwin.com/2010/04/coding-for-journalists-102-collecting-info-from-a-county-jail-site/</link>
		<comments>https://danwin.com/2010/04/coding-for-journalists-102-collecting-info-from-a-county-jail-site/#comments</comments>
		<pubDate>Tue, 06 Apr 2010 13:30:51 +0000</pubDate>
		<dc:creator><![CDATA[Dan Nguyen]]></dc:creator>
				<category><![CDATA[works]]></category>
		<category><![CDATA[crime]]></category>
		<category><![CDATA[jail]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[web scraping]]></category>

		<guid isPermaLink="false">https://danwin.com/?p=485</guid>
		<description><![CDATA[<p>This is part 2 of a 4-part series in introductory coding for journalists. Go here for the first lesson. This lesson and code will still be verbose, but will have a lot less hand-holding than the previous one. This is part of a four-part series on web-scraping for journalists. As of Apr. 5, 2010, it [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://danwin.com/2010/04/coding-for-journalists-102-collecting-info-from-a-county-jail-site/">Coding for Journalists 102: Who&#8217;s in Jail Now: Collecting info from a county jail site</a> appeared first on <a rel="nofollow" href="https://danwin.com">danwin.com</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>This is <a href="https://danwin.com/works/coding-for-journalists-101-a-four-part-series/">part 2 of a 4-part series</a> in introductory coding for journalists. <a href="https://danwin.com/works/coding-for-journalists-go-from-a-know-nothing-to-web-scraper-in-an-hour-hopefully/">Go here for the first lesson</a>. This lesson and code will still be verbose, but will have a lot less hand-holding than the previous one.</p>
<p><span id="more-485"></span></p>
<link rel='stylesheet' href='https://danwin.com/css/code.css' type='text/css' media='all' />
<div class="code-doc">
<div class='over-note' style='font-size: 12pt; color: #a44; border: 1px solid black; margin: 20px; padding: 20px;'>This is part of a <a href="https://danwin.com/works/coding-for-journalists-101-a-four-part-series/">four-part series on web-scraping for journalists</a>. As of <strong>Apr. 5, 2010</strong>, it was published a bit incomplete because I wanted to post a timely solution to the <a href="https://danwin.com/works/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/">recent Pfizer doctor payments list release</a>, but the code at the bottom of each tutorial should execute properly. The code examples are meant for reference and I make no claims to the accuracy of the results. Contact <a href="mailto:dan@danwin.com">dan@danwin.com</a> if you have any questions, or leave a comment below.</p>
<p><strong>DISCLAIMER:</strong> <em>The code, data files, and results are meant for reference and example only. You use it at your own risk.</em></p>
</div>
<p><b>A note about privacy</b>: This tutorial uses files that I archived from a real-world jail website. Though booking records are public record, I make no claims about the legal proceedings involving the inmates who happened to be in jail when I took my snapshot. For all I know, they could have all been wrongfully arrested and therefore don&#8217;t deserve to have their name attached in online perpetuity to erroneous charges (even if the site only purports to record who was arrested and when, and not any legal conclusions). For that reason, I&#8217;ve redacted the last names of the inmates and randomized their birthdates.</p>
<div class='sec'>
<h2>The Cops Reporter and the Log</h2>
<p>If you&#8217;re a daily cops reporter, calling the police station to ask for the list of last night&#8217;s arrests is probably part of your job. Many papers have some kind of cops blotter where arrested suspects are listed, and both online and in print it is usually one of a paper&#8217;s top features. The St. Petersburg Times has a modern version of the feature, <a href="http://mugshots.tampabay.com/">complete with mugshots and stats summaries</a>.</p>
<p>Arrest logs have sometimes been criticized for being little more than voyeurism (<a href="http://www.poynter.org/column.asp?id=101&#038;aid=161525">here&#8217;s a discussion over the St. Pete&#8217;s mugshot site</a>). But knowing who your law officers are arresting, and why, is essential to a nice, free society (and for a fair and efficient police force). And the more data you have as a reporter, the better you&#8217;ll be able to cover your beat.</p>
<p>Most pro-active police departments will announce when they&#8217;ve made high-profile arrests. But relying on the police to tell you what the most noteworthy arrests are kind of begs the question, and doesn&#8217;t give the whole picture of arrest activity. Most states consider arrest logs to be public information (not that that <a href="http://www.arundelmuckraker.com/storyview.asp?storyID=59">stops some jurisdictions from hiding them</a>). But a paper list or a PDF is hard to analyze. Luckily, some police departments are putting their work on the Web. They might be willing to send you a spreadsheet of arrest activity, but what if you wanted up-to-the-hour information, so that you could be aware of:</p>
<ol>
<li>Suspected crimes that fall between egregious and infamous (non-fatal assaults, robberies, car jackings, etc.)</li>
<li>An abnormally large number of arrests at a given time</li>
<li>Unusual types of suspected crimes at a given time</li>
</ol>
</div>
<div class='sec'>
<p>This is where the web-scraping you learned in my last tutorial gets useful. You&#8217;re going to have an automated way of collecting the latest arrests news, in an ordered fashion (so that you could, for example, find the inmate with the largest bail at a given time), and you&#8217;ll save yourself and your friendly police PIO tedious paper shuffling and typing.</p>
<p>I&#8217;m going to base my lesson on <a href="http://www.sacsheriff.com/inmate_information/">this sheriff department&#8217;s jail system</a>. I&#8217;ve mirrored a snapshot of their site <a href="https://danwin.com/static/jail-list/current_listing.cfm.html">here</a> (zip file <a href="https://danwin.com/static/jail-list/jail-list.zip">here</a>), so I recommend you run your scripts on my mirror (root directory: <a href="https://danwin.com/static/jail-list/current_listing.cfm.html">https://danwin.com/static/jail-list/</a>) before doing a real-world test. </p>
<p>The jail web site has these characteristics:</p>
<ul>
<li>At this page is a list of every person booked in the last 24 hours</li>
<li>The list typically has 100 to 200 inmates at a time</li>
<li>Most entries in that list contain a link to an inmate&#8217;s page containing data including name, DOB, bail, charges, booking time.</li>
<li>Each inmate has a unique identifying number called X-REF</li>
<li>Not all entries have a link; inmates who have been released have only their names listed</li>
</ul>
<p>The site is pretty useful and user-friendly. However, it&#8217;s hard to quickly glean any useful information from the main list. You have to click through each individual entry to find out why someone was jailed. <strong>The purpose of the following lesson is to automate that process so you can efficiently get the big picture of a jail&#8217;s activity.</strong></p>
<p>Program flow will go something like this:</p>
<ol>
<li><a href="#t_file_io">Create two text files</a>: one to store the list of inmates (inmates.txt), one to store the list of charges (charges.txt)</li>
<li><a href='#t_open_list'>Open the inmate listing page</a></li>
<li>Collect each list entry</li>
<li>If list entry is not a link (i.e. inmate has been released)</li>
<ol>
<li><a href="#t_ifnotlinkfetch">Fetch first name, middle name, last name, intake time and release date</a></li>
</ol>
<li>Else, if the list entry is a link, open it</li>
<ol>
<li>Fetch first name, middle name, last name, xref, intake time, and DOB of an inmate</li>
<li>Fetch and parse list of charges</li>
<li>Fetch the bail amount</li>
</ol>
<li>In an each loop, for each inmate entry we collected above:	</li>
<ol>
<li> Output inmate information, in tab-delimited format, into <strong>inmates.txt</strong>, including the XREF.</li>
<li> Output the charges associated with the inmate into <strong>charges.txt</strong>. Each charge will take up one line, and the XREF of the inmate will also be included to provide a key to the associated inmate </li>
</ol>
</ol>
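<p>The last step of that flow, the output, can be sketched without touching the website at all. Everything here is an assumption made for illustration: the entries are invented, <strong>write_records</strong> is a hypothetical helper, and StringIO objects stand in for inmates.txt and charges.txt so you can see the result on screen.</p>

```ruby
require 'stringio'

# Invented entries standing in for what the scraper would collect.
# A nil :xref marks a released inmate whose row had no link.
SAMPLE_ENTRIES = [
  { :xref => '00112233', :last => 'DOE', :first => 'JOHN',
    :charges => ['PC 459', 'PC 484'] },
  { :xref => nil, :last => 'ROE', :first => 'JANE', :charges => [] }
]

# Hypothetical helper: one tab-delimited line per inmate, and one line
# per charge with the inmate's XREF as the key.
def write_records(entries, inmates_out, charges_out)
  entries.each do |inmate|
    inmates_out.puts([inmate[:xref], inmate[:last], inmate[:first]].join("\t"))
    inmate[:charges].each do |charge|
      charges_out.puts("#{inmate[:xref]}\t#{charge}")
    end
  end
end

inmates_out = StringIO.new   # stand-in for inmates.txt
charges_out = StringIO.new   # stand-in for charges.txt
write_records(SAMPLE_ENTRIES, inmates_out, charges_out)
puts inmates_out.string
puts charges_out.string
```

<p>Keeping the charges in their own file, keyed by XREF, means an inmate with five charges doesn&#8217;t need five copies of his or her name and birthdate.</p>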
<h3><a name="t_file_io"></a>File I/O</h3>
<p>We didn&#8217;t cover opening and writing to an external text file in the last lesson. So here&#8217;s how it goes briefly: Using Ruby&#8217;s <a href="http://ruby-doc.org/core/classes/IO.html">IO class</a>, we&#8217;re going to create two files, inmates.txt and charges.txt, and write to them what we find on the jail&#8217;s website. We&#8217;ll be using the variables <b>inmates_file</b> and <b>charges_file</b> to refer to the external files. </p>
<p>To open the files and set the variables, use the <b><a href="http://ruby-doc.org/core/classes/IO.html#M002238">new</a></b> method of File (a subclass of IO), which takes two parameters: a string designating the file name, and a string designating the mode&#8230;which in this case will be &#8220;a&#8221;: write-only, with new content appended to the end of the file (read about the <a href="http://ruby-doc.org/core/classes/IO.html">various modes here</a>).</p>
<pre name="code" class="ruby">
inmates_file = File.new('inmates.txt', 'a')
charges_file = File.new('charges.txt', 'a')
</pre>
<p>If these files don&#8217;t already exist, they will now. If they did, the &#8216;a&#8217; mode will append new content to the end of the file.</p>
<p>To write something to the file, use the <b>puts</b> method, which writes whatever string you supply to it as one line in the file (we&#8217;ve used this method without the IO class, in which case it outputs to the screen):</p>
<pre name="code" class="ruby">
charges_file.puts("Adding a new line of text to the charges file.")
</pre>
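<p>One thing <b>File.new</b> won&#8217;t do is close the file for you. A common alternative is the block form of <b>File.open</b>, which closes the file when the block ends. Here&#8217;s a small sketch of append mode using it; <strong>append_line</strong> is a hypothetical helper, and the file lives in a temporary folder so nothing is left behind.</p>

```ruby
require 'tmpdir'

# Hypothetical helper: add one line to the end of a file. The block form
# of File.open closes the file as soon as the block finishes.
def append_line(path, text)
  File.open(path, 'a') { |f| f.puts(text) }
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'charges.txt')
  append_line(path, 'first line')
  append_line(path, 'second line')   # lands after the first, not over it
  puts File.read(path)
end
```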
<p>While we&#8217;re setting up, let&#8217;s create an array of hashes, with each hash object holding an inmate and his/her information. We don&#8217;t have to do this&#8230;we could just output to the file each inmate record as we get to it, but this will allow us some flexibility later. All we have to do is initialize the array:</p>
<pre name="code" class="ruby">
inmates_array = []
</pre>
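<p>Here&#8217;s the kind of flexibility I mean, as a sketch: with every inmate held in one array, a question like &#8220;who has the highest bail right now?&#8221; is a one-liner. The records are invented, <strong>highest_bail</strong> is a hypothetical helper, and I&#8217;m assuming the bail has already been converted to an integer.</p>

```ruby
# Invented records; :bail values are whole dollars.
SAMPLE_INMATES = [
  { :first => 'JOHN',    :last => 'DOE',   :bail => 10_000 },
  { :first => 'JANE',    :last => 'ROE',   :bail => 1_000_000 },
  { :first => 'RICHARD', :last => 'MILES', :bail => 0 }
]

# Hypothetical helper: the record with the largest bail.
def highest_bail(inmates)
  inmates.max_by { |inmate| inmate[:bail] }
end

top = highest_bail(SAMPLE_INMATES)
puts "#{top[:last]}, #{top[:first]}: #{top[:bail]}"
```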
<h3><a name="t_open_list"></a>Open the inmate listing page</h3>
<p>Now let&#8217;s fetch the inmates listing. We&#8217;ll be using Nokogiri in the same fashion we did in the <a href="https://danwin.com/thoughts/coding-for-journalists-go-from-a-know-nothing-to-web-scraper-in-an-hour-hopefully/#topic_nokogiri">last lesson</a>, beginning by requiring the nokogiri and open-uri libraries, then using the Open-URI&#8217;s <b><a href="http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/">open</a></b> method to fetch the page, and then Nokogiri&#8217;s <a href="http://nokogiri.rubyforge.org/nokogiri/Nokogiri/HTML/Document.html">HTML class</a> to wrap up the page in a parsable format.</p>
<pre name="code" class="ruby">
require 'rubygems'
require 'nokogiri'
require 'open-uri'
		
base_url='https://danwin.com/static/jail-list/' # all links on the list will be relative to this address		
inmate_listing = Nokogiri::HTML(open("#{base_url}current_listing.cfm.html"))
</pre>
<div class="note">A reminder. The construct <b>#{something_here}</b>, when put inside a double-quoted string, will treat <b>something_here</b> as an actual value of the variable <b>something_here</b>, not just the string. This is called <em>string interpolation</em>. The two following expressions, the latter using interpolation, are equivalent, though the latter will not throw an error if string2 happens to not be a String.</p>
<p>	a_combined_string = &#8220;Hello &#8221; + string2<br />
	a_combined_string = &#8220;Hello #{string2}&#8221;</p>
<p>Read more about Ruby&#8217;s <a href="http://en.wikibooks.org/wiki/Ruby_Programming/Syntax/Literals#Interpolation">string interpolation here</a>.
</div>
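<p>Here is that difference in a form you can run. The two method names are made up for this example; the number 42 plays the part of a value that isn&#8217;t a String.</p>

```ruby
# Interpolation converts the value to a string for us.
def greet_interpolated(value)
  "Hello #{value}"
end

# Concatenation with + only accepts a String on the right-hand side.
def greet_concatenated(value)
  "Hello " + value
end

puts greet_interpolated(42)

begin
  puts greet_concatenated(42)
rescue TypeError => e
  puts "TypeError: #{e.message}"
end
```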
<p>Let&#8217;s visit the page with a browser and examine the structure. The list is an HTML table, with each row containing several columns, the first column being the inmate&#8217;s full name and, if the inmate hasn&#8217;t been released, a link to his/her booking page.</p>
<p>If you inspect the HTML closely, you&#8217;ll see that this page is composed of several tables. What we want is the table contained inside the &lt;td&gt; element with a class of &#8220;content&#8221;.</p>
<p>So we&#8217;ll collect all the table rows, using Nokogiri&#8217;s xpath method, and iterate through them using an each loop. We&#8217;re going to use a variation of an each loop called <b>each_index</b>, which provides the numerical index of the current iteration we&#8217;re on.</p>
<p><a name="t_ifnotlinkfetch"></a></p>
<pre name="code" class="ruby">
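	# collect needs a block to return an Array; without one, newer Rubies return an Enumerator, which has no []
	inmate_rows = inmate_listing.xpath("//td[@class='content']/table")[0].xpath(".//tr").collect{|x| x}[1..-1]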
</pre>
<p>	The XPath syntax here is looking for a td element with class=&#8217;content&#8217;, then the table inside of that. There&#8217;s more than one such table, but the first one on the page has the data. From that table, we gather all the rows (<b>tr</b>). We call the collect method to convert the result into an array, since Nokogiri&#8217;s xpath method returns a <em>NodeSet</em>, which won&#8217;t have the <b>each_index</b> method, and the <b>[1..-1]</b> slice drops the first row. <strong>each_index</strong> loops through an array, just like each, but it provides the index of the current iteration instead of the element itself.</p>
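<p>To see what each_index hands you, here&#8217;s a minimal sketch that uses plain strings as stand-ins for the table rows (the sample names and both method names are invented). It also shows each_with_index, a standard Enumerable method that yields the element and the index together; I&#8217;m only assuming it would be a comfortable alternative here, the lesson itself sticks with each_index.</p>

```ruby
# Stand-in data: plain strings instead of Nokogiri table rows
rows = ["header row", "DOE, JOHN", "ROE, JANE"]

# Drop the first row with [1..-1], as the lesson does
data_rows = rows[1..-1]

# each_index yields only the position; we look the row up ourselves
def label_rows(data_rows)
  labels = []
  data_rows.each_index do |i|
    labels << "#{i}: #{data_rows[i]}"
  end
  labels
end

# each_with_index yields the row and its position together
def label_rows_with_index(data_rows)
  labels = []
  data_rows.each_with_index do |row, i|
    labels << "#{i}: #{row}"
  end
  labels
end

puts label_rows(data_rows)
puts label_rows_with_index(data_rows)
```

<p>Both methods build the same list, so which one you use is mostly a matter of taste.</p>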
<pre name="code" class="ruby">
	inmate_rows.each_index do |i|
		inmate_row = inmate_rows[i]
		inmates_array[i] = {}
		inmate = inmates_array[i]

		# each row has a set of columns with the inmate info
		list_columns = inmate_row.xpath('./td')
</pre>
<p>Because we know we&#8217;re on the ith row, we can also initialize the ith index in inmates_array as a hash to store the ith inmate&#8217;s information. Remember that each element in the inmates_array is going to be a hash of information.</p>
<p>Let&#8217;s use the variable named inmate as a shorthand way to refer to this position in the inmates_array. Each time we iterate through the loop, <strong>inmate</strong> will refer to the next spot in the inmates_array.</p>
<p>This is easier to type out 10 times than inmates_array[i].</p>
<p>Before we get to visiting the individual inmate pages, let&#8217;s just collect the name and other information readily available here.</p>
<p>Each name consists of a String in this format: <em>last_name</em>, <em>first_name</em> <em>middle_name</em></p>
<p>So let&#8217;s use the String split method, twice. Splitting the string by the comma gives us an array whose first element is what&#8217;s on the left side of the comma. Splitting the second element of that array by a space gives us <em>another</em> array, consisting of the first name and any middle names.</p>
<pre name="code" class="ruby">
		
		
		
		# remember that you need to call Nokogiri's content method to get the text, as a String, inside a tag
		the_inmate_name =  list_columns[0].content.gsub(/\302\240/, ' ').strip.split(',')
		
		inmate['last_name'] = the_inmate_name[0]					# the name before the comma
		inmate['first_name'] = the_inmate_name[1].split(' ')[0]		# the name after the comma, but before the next space
		# anything after the first name is treated as the middle name(s), joined back into one String
		inmate['middle_name'] = the_inmate_name[1].split(' ')[1..-1].join(' ') if the_inmate_name[1].split(' ').length > 1
		
		
</pre>
<p>I&#8217;m going to be using this method call after each use of <b>content</b>: gsub(/\302\240/, ' ').strip </p>
<p>Not all entries have a middle name. So we use the <em>if <strong>the_inmate_name[1].split(' ')</strong>.length > 1</em> conditional statement to tell Ruby to skip this line if nothing follows the first name. (Checking the length of the_inmate_name itself wouldn&#8217;t work: splitting on the comma gives two elements whether or not there is a middle name.)</p>
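<p>If the two splits are hard to picture, here&#8217;s a standalone sketch you can run without the jail site. The helper name split_inmate_name and the sample names are my own inventions, and unlike the lesson&#8217;s code it returns an empty string, rather than nothing, when there is no middle name.</p>

```ruby
# split_inmate_name is a hypothetical helper, not part of the lesson's script.
# It expects a name in the form "LAST, FIRST MIDDLE".
def split_inmate_name(full_name)
  # First split: on the comma, into at most two pieces
  last_name, given_names = full_name.strip.split(',', 2)

  # Second split: on spaces, into first name and any middle names
  given_parts = given_names.to_s.split(' ')

  {
    'last_name'   => last_name,
    'first_name'  => given_parts[0],
    'middle_name' => given_parts[1..-1].to_a.join(' ')
  }
end

p split_inmate_name("DOE, JOHN QUINCY")
p split_inmate_name("ROE, JANE")
```

<p>The to_a call is there because slicing an empty array with [1..-1] gives nil, which would otherwise break the join.</p>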
<pre name="code" class="ruby">
		
		# Moving on to the next table cell, which will be the 1 spot in list_columns
		inmate['sex'] = list_columns[1].content
		
		
		# next cell, DOB
		inmate['dob'] = list_columns[2].content
			
		# next cell, booking time
		inmate['intake_time'] = list_columns[3].content
		
	
		
		
		# let's go back to the first column to see if it contained a link
		if list_columns[0].xpath('./a').length == 0  # if there was no link, there would be 0 links returned
			
			# No link to visit, so this must have been a released inmate. Let's grab his/her release date 
			# which comes in the pattern "Released mm/dd/yyyy"...so we'll split the string and capture the second term

			inmate['release_date'] = list_columns[4].content.gsub(/\302\240/, ' ').split(' ')[1]
			
		else
		
			# visit link
			# we'll get to this subroutine in the next section
			
			
		end
	end
</pre>
<div class='note'>
<p>	I make a method call named <b>gsub</b> to cleanse the strings of data. This particular website uses <strong>&amp;nbsp;</strong> (non-breaking space) in place of ordinary spaces, and Nokogiri returns these as a different character than a normal space, so <strong>strip</strong> doesn&#8217;t work as intended. So this method call is used frequently:<br />
	.gsub(/\302\240/, ' ')</p>
<p>	Read more about this from <a href="http://www.vitarara.org/cms/hpricot_to_nokogiri_day_1">Vita Rara</a>.</p>
</div>
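<p>You can watch <strong>strip</strong> ignore a non-breaking space without fetching anything. This is a sketch under one assumption: that the page text arrives as UTF-8, where the non-breaking space is the character U+00A0 (the two bytes \302\240). The constant NBSP, the helper clean_cell, and the sample cell are all made up for the demonstration.</p>

```ruby
# NBSP and clean_cell are hypothetical names used only in this sketch
NBSP = "\u00A0"   # the non-breaking space character, U+00A0

# Swap non-breaking spaces for ordinary ones, then strip as usual
def clean_cell(text)
  text.gsub(NBSP, ' ').strip
end

cell = "#{NBSP}Felony#{NBSP}"    # made-up cell content

puts cell.strip.length           # 8: strip alone leaves both non-breaking spaces
puts clean_cell(cell).length     # 6: just "Felony"
```

<p>Passing a String to gsub, as I do here, is just a different spelling of the same replacement the lesson does with a regular expression.</p>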
<p>OK, that should&#8217;ve given you a refresher on arrays, hashes, XPath, and string manipulation. Now we&#8217;ll handle the case when the first item of the <b>list_columns</b> array does contain a link. It will involve fetching the page from that link and then more XPathing to pick out the wanted data.</p>
<p>At this time, go to the <a href="https://danwin.com/static/jail-list/current_listing.cfm.html">inmate list page</a> and click on one of the inmate pages in the browser.</p>
<p>There&#8217;s a lot more information here; what will be most relevant to us right now is the X-Reference Number, charges, and bail. This next section of code will fit into the <b>else</b> branch of our <a href="#t_ifnotlinkfetch">previous section of code</a>.</p>
<pre name="code" class="ruby">
	# visit link (remember that the xpath method returns an array, so we have to explicitly refer to
	# the 0th index to get the link)
	inmate_link = list_columns[0].xpath('./a')[0]["href"] 
	
	# remember that we set base_url to contain the site's base address. we append 
	# inmate_link to it to get the absolute address to the inmate page
	inmate_page = Nokogiri::HTML(open("#{base_url}#{inmate_link}"))
	
	# everything is inside a &lt;td&gt; with a class="content" attribute, so let's set a variable
	# to hold the table rows inside
	
	content_table_rows = inmate_page.xpath("//td[@class='content']/table/tr")
	
	# the xref number appears to be in the third row and in the third cell
	# again, we're still using the inmate variable to hold the data associated with an inmate
	
	inmate["xref"] = content_table_rows[2].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
	# the strip method removes leading and trailing whitespace, such as spaces, tabs and carriage returns
	
	inmate['booking_number'] = content_table_rows[3].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
	inmate['arresting_agency'] = content_table_rows[13].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
	inmate['total_bail'] = content_table_rows[16].xpath("./td")[2].content.gsub(/(\302\240)|\s|\n|\r/, ' ').strip
</pre>
<div class='note'>
	Total bail gets an extra <strong>gsub</strong> condition because there are a few cases where carriage returns are in the table cell, which causes issues when we later try to import the result into a tab-delimited file/spreadsheet.
</div>
<p>OK, so we collected the basic info about each inmate. Now, we want to collect the charges leveled against them. This is a little bit trickier. If you inspect the table-cell containing the charges, you&#8217;ll see that the charge listing itself is a table. The first row of the table lists the case number and type of arrest (warrant, or fresh pickup). Below that is a list of charges, with each charge taking up two rows, like so:</p>
<table>
<tr>
<td>1st Row:</td>
<td>1st Cell: Charge code (e.g. PC 459)</td>
<td>2nd Cell: Charge severity (e.g. Felony)</td>
</tr>
<tr>
<td>2nd Row:</td>
<td colspan='2'>Charge description (e.g. &#8220;Burglary&#8221;)</td>
</tr>
</table>
<p>For most of the inmate listings, this is immediately followed by another row listing the bail amount.</p>
<p><strong>However</strong>, there are a few inmates who are held on more than one charge. And there are some who are being held on multiple charges stemming from multiple warrants, such as one person who appears to have racked up a number of public nuisance accusations, including fare evasion and prohibited public drinking. In his case, the charge listing is one row after another, and each row could mention the case, the agency that issued the warrant, the charge, or the bail amount per warrant.</p>
<p>My point here is that you won&#8217;t be able to predict that the third row, for instance, always contains the charge code and severity. But using <b>Inspect Element</b>, we see that the table cells containing the code, severity, and description have class attributes &#8220;cellTopLeft&#8221;, &#8220;cellTopMiddle&#8221; and &#8220;cellBottom&#8221;, respectively. The bail amount per case is in the cell with class &#8220;cellBail&#8221;&#8230;but we&#8217;re not interested in bail per case, so we&#8217;ll ignore it.</p>
<p>We&#8217;re going to loop through the rows inside this table, and if a row contains a td cell of class &#8220;cellTopLeft&#8221;, we know that this row will contain the code and severity of a charge. We&#8217;re going to assume that the row immediately following it has a cell with class &#8220;cellBottom,&#8221; which contains the description.</p>
<p>Processing this sub-table of charges will require its own loop. And since each inmate could have more than one charge, we need to store <b>&#8220;charges&#8221;</b> inside our <b>inmate</b> hash&#8230;<b>charges</b> will point to an array. And each item in the <b>charges</b> array will itself be a hash, with keys of &#8220;code&#8221;, &#8220;severity&#8221;, and &#8220;description.&#8221;</p>
<p>Confusing? Well, here&#8217;s a quick diagram of what we have so far, in terms of variables:</p>
<pre>
inmates		=> an array of Hashes...
				inmate = inmates[index] (each inmate is a Hash)
			=> inmate['first_name'] => inmate's first name
			=> inmate['last_name']  => inmate's last name
			=> inmate['xref'] 		=> inmate's xref
			... all the other attributes
			=> inmate['charges']  =>  an array of hashes
						charge = inmate['charges'][charge_index] (each charge is a Hash)
						charge['code']			=> charge's code
						charge['severity']			=> charge's severity
						charge['description']	=> charge's description
</pre>
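<p>Here is the same structure written out as literal Ruby, in case seeing it built by hand helps. Only the shape matches the diagram: the names, the xref value, and the second charge are placeholders I made up, and build_sample_inmates is a hypothetical helper, not something the scraper uses.</p>

```ruby
# build_sample_inmates is a hypothetical helper that builds the structure
# from the diagram by hand, with placeholder values
def build_sample_inmates
  inmates = []                      # an array of Hashes

  inmate = {
    'first_name' => 'JOHN',
    'last_name'  => 'DOE',
    'xref'       => '0000000',
    'charges'    => []              # an array of Hashes, one per charge
  }

  inmate['charges'] << { 'code' => 'PC 459', 'severity' => 'Felony', 'description' => 'Burglary' }
  inmate['charges'] << { 'code' => 'XX 000', 'severity' => 'Misdemeanor', 'description' => 'Placeholder charge' }

  inmates << inmate
  inmates
end

build_sample_inmates.each do |inmate|
  inmate['charges'].each do |charge|
    puts "#{inmate['first_name']} #{inmate['last_name']}: #{charge['code']} (#{charge['severity']}) #{charge['description']}"
  end
end
```

<p>Reading it back takes one loop per level: the outer one over inmates, the inner one over that inmate&#8217;s charges.</p>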
<p>The loop to fill out that charge array is as follows:</p>
<pre name="code" class="ruby">
	# first, grab the entire table of charges that exists in the 16th row of the main content table
	table_of_charges = content_table_rows[15].xpath("./td")[2]
	
	# and give this inmate an array of charges
	inmate['charges'] = []
	
	# Now, collect all rows that have a td with class "cellTopLeft"
	charge_1st_rows =  table_of_charges.xpath(".//tr[td[@class='cellTopLeft']]")
	
	# Now, collect all rows that have a td with class "cellBottom"
	charge_2nd_rows = table_of_charges.xpath(".//tr[td[@class='cellBottom']]")
	
	# OK, you should do some basic error checking here. We expect the arrays of charge_1st_rows and charge_2nd_rows to have
	# equal length, since each charge has a code, severity and description, right?
	
	# If not, that means our assumption was wrong, and you should do something...like exit the script and re-examine your
	# datasource and assumptions about it. But I'll skip that for now
	
	charge_1st_rows.collect{|x| x}.each_index do |charge_row_index|
	
		# we found a row with a charge, so let's create a new hash that will hold the charge's attributes
		hash_of_inmate_charge = {}
		
		charge_1st_row = charge_1st_rows[charge_row_index]
		hash_of_inmate_charge['code'] = charge_1st_row.xpath('.//td')[0].content.gsub(/\302\240/, ' ').strip
		hash_of_inmate_charge['severity'] = charge_1st_row.xpath('.//td')[1].content.gsub(/\302\240/, ' ').strip
			
		# we assume that the row, with the same index in the charge_2nd_rows array will be the description of the charge
		# listed in charge_1st_rows
			
		hash_of_inmate_charge['description'] = charge_2nd_rows[charge_row_index].xpath('.//td')[0].content.gsub(/\302\240/, ' ').strip
		
		
		# push this hash on to the array of inmate charges:
		inmate['charges'] << hash_of_inmate_charge	
		
	end
</pre>
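<p>The error checking that the comments in that block skip over could look something like the sketch below. I&#8217;m using plain arrays as stand-ins for the two sets of rows, and pair_charge_rows is a name I invented; the idea is simply to refuse to pair codes with descriptions when the counts don&#8217;t match.</p>

```ruby
# pair_charge_rows is a hypothetical helper; plain arrays stand in for the
# rows found with the cellTopLeft and cellBottom XPath queries
def pair_charge_rows(first_rows, second_rows)
  if first_rows.length != second_rows.length
    raise ArgumentError, "expected equal row counts, got #{first_rows.length} and #{second_rows.length}"
  end

  # zip lines the two arrays up by position: [[first, second], ...]
  first_rows.zip(second_rows)
end

codes        = ['PC 459 / Felony', 'XX 000 / Misdemeanor']   # made-up sample rows
descriptions = ['Burglary', 'Placeholder charge']

pair_charge_rows(codes, descriptions).each do |code_row, description_row|
  puts "#{code_row}: #{description_row}"
end

begin
  pair_charge_rows(codes, ['Burglary'])
rescue ArgumentError => e
  puts "Assumption failed: #{e.message}"
end
```

<p>Raising an error is only one choice. Inside a long-running scraper you might prefer to log the mismatch and skip that inmate.</p>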
<p>	Well, we've collected all the relevant inmate information and, if our assumptions were right, each of the inmate's charges. We've reached the end of the loop that examines each row in the main inmate listing. Our script will go on to the next inmate and collect his/her info, and so on until it has reached the end of the list. Here's all the code so far:</p>
<pre name="code" class="ruby">
		require 'rubygems'
		require 'nokogiri'
		require 'open-uri'
		inmates_array = []
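		base_url='' 		# left empty here, which reads the saved HTML files from the current directory; use 'https://danwin.com/static/jail-list/' to fetch them from the web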
		inmate_listing = Nokogiri::HTML(open("#{base_url}current_listing.cfm.html"))

		inmate_rows = inmate_listing.xpath("//td[@class='content']/table")[0].xpath(".//tr").collect{|x| x}[1..-1]
		inmate_rows.each_index do |i|
			inmate_row = inmate_rows[i]		
			inmates_array[i] = {}
			inmate = inmates_array[i]


			list_columns = inmate_row.xpath('./td')		
			the_inmate_name =  list_columns[0].content.gsub(/\302\240/, ' ').strip.split(',')
			inmate['last_name'] = the_inmate_name[0]					# the name before the comma

			inmate['first_name'] = the_inmate_name[1].split(' ')[0]		# the name after the comma, but before the next space
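			# anything after the first name is treated as the middle name(s), joined back into one String
			inmate['middle_name'] = the_inmate_name[1].split(' ')[1..-1].join(' ') if the_inmate_name[1].split(' ').length > 1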



			inmate['sex'] = list_columns[1].content		
			inmate['dob'] = list_columns[2].content

			inmate['intake_time'] = list_columns[3].content


			if list_columns[0].xpath('./a').length == 0 
				inmate['release_date'] = list_columns[4].content.gsub(/\302\240/, ' ').split(' ')[1]
			else

				inmate_link = list_columns[0].xpath('./a')[0]["href"] 
				inmate_page = Nokogiri::HTML(open("#{base_url}#{inmate_link}"))
				content_table_rows = inmate_page.xpath("//td[@class='content']/table/tr")

		    if content_table_rows.length > 0


			  	inmate["xref"] = content_table_rows[2].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
		  		inmate['booking_number'] = content_table_rows[3].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
		  		inmate['arresting_agency'] = content_table_rows[13].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
		  		inmate['total_bail'] = content_table_rows[16].xpath("./td")[2].content.gsub(/\302\240/, ' ').gsub(/\s|\n|\r/, ' ').strip

		  		table_of_charges = content_table_rows[15].xpath("./td")[2]
		  		inmate['charges'] = []

		  		charge_1st_rows =  table_of_charges.xpath(".//tr[td[@class='cellTopLeft']]")
		  		charge_2nd_rows = table_of_charges.xpath(".//tr[td[@class='cellBottom']]")

		  		charge_1st_rows.collect{|x| x}.each_index do |charge_row_index|

		  			hash_of_inmate_charge = {}

		  			charge_1st_row = charge_1st_rows[charge_row_index]
		  			hash_of_inmate_charge['code'] = charge_1st_row.xpath('.//td')[0].content.gsub(/\302\240/, ' ').strip
		  			hash_of_inmate_charge['severity'] = charge_1st_row.xpath('.//td')[1].content.gsub(/\302\240/, ' ').strip
		  			hash_of_inmate_charge['description'] = charge_2nd_rows[charge_row_index].xpath('.//td')[0].content.gsub(/\302\240/, ' ').strip

		  			# push this hash on to the array of inmate charges:
		  			inmate['charges'] << hash_of_inmate_charge	

		  		end
				end # end if content_table_rows

			end
		end
	</pre>
</p></div>
<div class="sec">
<h3><a name="#topic_file"></a>Storing your Data into a File</h3>
<p>		At this point in your script, all your carefully collected data is in memory. When the script finishes execution, it disappears, which defeats the purpose of tracking data at all. So let's store it in a persistent way. My choice would be some kind of database, like MySQL or SQLite, but for our purposes, we can quickly learn the methods to store this information in a tab-delimited file that can be opened as a spreadsheet in Excel.</p>
<p>		We will be using Ruby's <a href="http://ruby-doc.org/core/classes/File.html#M002579">File class</a>:</p>
<pre name="code" class="ruby">

			##write to file
			File.open("inmates.txt", 'w'){ |f| 

				f.write("first_name\tmiddle_name\tlast_name\tsex\tdob\tintaketime\trelease_date\txref\tbooking_number\tarresting_agency\ttotal_bail\n")

				inmates_array.each do |inmate|

			f.write("#{inmate['first_name']}\t#{inmate['middle_name']}\t#{inmate['last_name']}\t#{inmate['sex']}\t#{inmate['dob']}\t#{inmate['intake_time']}\t#{inmate['release_date']}\t#{inmate['xref']}\t#{inmate['booking_number']}\t#{inmate['arresting_agency']}\t#{inmate['total_bail']}\n")

				end
			}
</pre>
<p>A quick explanation. The <b>File</b> class has the <b>open</b> method, to which we pass in two arguments: the name of the file we want to write to, and the <em>mode</em>. In this case, we're using 'w', which stands for "write" mode. The curly braces set off the code that gets executed while this File is open, with the variable <strong>f</strong> referring to the actual file.</p>
<p>File also has an instance method called <b>write</b>, which takes in a String as an argument to write to the open file.</p>
<p>Backslash-t will write a <b>tab</b>, and backslash-n will write a newline character.</p>
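<p>If you want to try File.open and write without touching your real output files, here&#8217;s a small sketch that writes to a temporary directory and reads the result back. The method name write_inmates, the file name, and the sample inmate are all invented, and I&#8217;ve cut the columns down to three to keep it short.</p>

```ruby
require 'tmpdir'

# write_inmates is a hypothetical, cut-down version of the lesson's writer:
# one header line, then one tab-delimited line per inmate
def write_inmates(path, inmates)
  File.open(path, 'w') do |f|
    f.write("first_name\tlast_name\tsex\n")
    inmates.each do |inmate|
      f.write("#{inmate['first_name']}\t#{inmate['last_name']}\t#{inmate['sex']}\n")
    end
  end
end

sample = [{ 'first_name' => 'JOHN', 'last_name' => 'DOE', 'sex' => 'M' }]

# Dir.mktmpdir creates a scratch directory and removes it after the block
Dir.mktmpdir do |dir|
  path = File.join(dir, 'inmates_sample.txt')
  write_inmates(path, sample)
  puts File.read(path)
end
```

<p>The do...end block after File.open does the same job as the curly braces in the lesson: the file is closed automatically when the block ends.</p>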
<p>The next block of code is similar to the first...but it refers to a "charges.txt" file. Remember that each inmate could have more than one charge to his/her name. The following file lists every charge, but also lists the xref key to tie back into inmates.txt. For convenience's sake, we're also going to print out the inmate name and the inmate's <em>total bail</em> on each line.</p>
<pre name="code" class="ruby">

			File.open("charges.txt",'w'){ |f|
				f.write("name\txref\ttotal_bail\tcode\tseverity\tdescription\n")

				inmates_array.each do |inmate|
					if inmate['charges']
						inmate['charges'].each do |charge|
							f.write("#{inmate['first_name']} #{inmate['last_name']}\t#{inmate['xref']}\t#{inmate['total_bail']}\t#{charge['code']}\t#{charge['severity']}\t#{charge['description']}\n")
						end
					end
				end

			}
		</pre>
<p>Printing out the inmate's name and total bail, although redundant, allows us to quickly skim the list to see if there were any unusual crimes connected to unusual amounts of bail (note that the jail site does not break down bail amounts per charge).</p></div>
<div class="sec">
<h3><a name="#topic_realworld"></a> Putting it all together for the real world</h3>
<p>		The above code, put all together, will execute cleanly and compile some nice text files for you, especially if you've saved the package of HTML files onto your hard drive. But in the real world, you'll be targeting an internet server, which may not like you hitting it at a rate of five times per second, or which may intermittently fail.</p>
<p>		To deal with this, I've added a call to Ruby's <a href="http://ruby-doc.org/core/classes/Kernel.html#M005972">sleep</a> method, which pauses script execution for a given number of seconds. I've also thrown in some <a href="http://ruby.activeventure.com/programmingruby/book/tut_exceptions.html">error-handling.</a> Here's the basic structure:</p>
<pre name="code" class="ruby">
		# some code
		begin
			# risky code here
			# The Ruby interpreter will watch the code that gets executed within the begin branch...if something goes wrong, it's going to execute code in the following rescue branch
	
		rescue
			# the begin-branch messed up, time to run some other code
			puts "An error happened!"
		else
			# this code gets executed if the begin-branch worked fine
		ensure
			# this code in the ensure branch (which is optional) runs no matter what.
			puts "We're done with our error handling"
		end
	
		</pre>
<p>		Read more about <a href="http://ruby.activeventure.com/programmingruby/book/tut_exceptions.html">error-handling here</a>.</p>
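<p>To see which branches actually run, here&#8217;s a sketch that records each branch in an array as it passes through. fetch_with_handling is a made-up name, and the block you pass in stands for any risky step, such as fetching a page. One choice of mine to flag: I rescue StandardError, which is also what a bare rescue catches, rather than everything.</p>

```ruby
# fetch_with_handling is a hypothetical helper; the block passed to it
# stands for any risky step, such as fetching a page
def fetch_with_handling
  log = []
  begin
    result = yield
  rescue StandardError => e
    # runs only if the block raised an error
    log << "rescue: #{e.message}"
    result = nil
  else
    # runs only if the block finished without an error
    log << "else: no error"
  ensure
    # runs in both cases
    log << "ensure: always runs"
  end
  [result, log]
end

p fetch_with_handling { "page contents" }
p fetch_with_handling { raise IOError, "connection dropped" }
```

<p>The first call passes through else and ensure; the second passes through rescue and ensure, and hands back nil instead of a result.</p>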
<p>		And finally, I'm going to make a few alterations to the script so that it'll run repeatedly, once every half hour (essentially, by sleeping a half hour after going through the list). This is the crudest way to schedule a script, but it'll work for now. It will also use another instance method of <b>File</b>: <b>readlines</b>.</p>
<p>		Each half hour, it's likely that most of the list of inmates will be the same. So a crude way to reduce the number of repeat listings is to check the inmates.txt file (using the <b>match</b> method) to see if a given inmate's name is already in there. This gets slower as inmates.txt grows. Like I said, it's crude. I prefer using a database, which is a topic outside the scope of this tutorial.</p>
<p>		So I've taken the code above and split it into five parts:</p>
<ol>
<li>the <strong>process_inmate_row</strong> method - This method takes in a single row from the list of inmates and reads the basic information, including name, sex, and date of birth. It takes in as its second argument the entire text of inmates.txt and checks whether inmates.txt already contains the name. If so, it returns nil. If not, it returns a hash of inmate data.
<p>Note: As said previously, constantly searching the entire inmates.txt file is incredibly inefficient. And what happens if two John Smiths are arrested in the same time period? The name-check will fail to differentiate inmates with the same name (a better match would also use the date of birth). But I leave it as an exercise for you to develop a more efficient method, which could involve a database, or storing the name columns of inmates.txt in an array.</p>
<p>But the reason why we're doing the name-check is to save us the time of entering an inmate's page. And, of course, to not fill the inmates.txt file with duplicate entries.</p>
				</li>
<li>the <strong>process_inmate_page_link</strong> method - The code that fetches an inmate's individual page and then processes the extra data, including the total bail amount and charges, is done here. It returns a hash of the inmate data. </li>
<li>the <strong>write_to_file</strong> method - This code invokes the File.open methods and, for each inmate and charge, writes a tab-delimited line to the inmates.txt and charges.txt files</li>
<li>the <strong>check_the_site</strong> method - This is the master method. It retrieves the list of inmates from the jail site and then, on each row of inmate data, calls all the previously defined methods. It also has some basic error handling. If something happens, like your internet connection dropping in the middle of a page retrieval, it skips the current inmate and moves on. This is better than just crashing.</li>
<li>The main execution loop - All the code previously written out as methods will <b>do nothing</b> unless you actually invoke the methods. So we initialize a variable, called <b>hours</b>, to zero and while that is less than 24, we run the <strong>check_the_site</strong> method. After <strong>check_the_site</strong> finishes, <strong>hours</strong> is incremented and the script sleeps for half an hour (1800 seconds). Despite its name, <strong>hours</strong> really counts half-hour passes.</li>
</ol></div>
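<p>As for the exercise mentioned in the note on process_inmate_row, one possible direction is sketched below: read the name and date-of-birth columns into a Set once, then ask the Set instead of searching the whole text each time. This is only a sketch under assumptions. inmate_key and load_seen_keys are names I made up, the sample lines and dates are invented, and it relies on the column order the lesson writes (first name, middle name, last name, sex, date of birth).</p>

```ruby
require 'set'

# inmate_key and load_seen_keys are hypothetical helpers for this sketch
def inmate_key(first_name, last_name, dob)
  [first_name, last_name, dob].join('|')
end

# Build a Set of keys from the lines of a tab-delimited inmates file.
# Assumed column order: first_name, middle_name, last_name, sex, dob, ...
def load_seen_keys(lines)
  seen = Set.new
  lines.each do |line|
    columns = line.chomp.split("\t")
    next if columns[0] == 'first_name'    # skip the header row
    seen << inmate_key(columns[0], columns[2], columns[4])
  end
  seen
end

# Made-up lines standing in for File.readlines("inmates.txt")
lines = [
  "first_name\tmiddle_name\tlast_name\tsex\tdob\n",
  "JOHN\t\tDOE\tM\t01/01/1980\n"
]

seen = load_seen_keys(lines)
puts seen.include?(inmate_key('JOHN', 'DOE', '01/01/1980'))   # already recorded
puts seen.include?(inmate_key('JOHN', 'DOE', '02/02/1990'))   # same name, different person
```

<p>Including the date of birth in the key is what lets two John Does count as different people.</p>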
<div class='sec'>
<p>Here's the final code, which reads from the mirrored archive list I've provided <a href="https://danwin.com/static/jail-list/current_listing.cfm.html">here</a>. So obviously, running the main collection loop more than once is pointless as my list is static...but at least it's practice. You can download a <a href="https://danwin.com/static/jail-list/jail-list.zip">zipped archive</a> of the files here.</p>
<pre name='code' class='ruby'>
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'timeout' # defines Timeout::Error, which check_the_site rescues


def process_inmate_row(inmate_row, inmate_text)
  
	list_columns = inmate_row.xpath('./td')		
	inmate = {}
	the_inmate_name =  list_columns[0].content.gsub(/\302\240/, ' ').strip.split(',')
	inmate['last_name'] = the_inmate_name[0]					# the name before the comma  
	inmate['first_name'] = the_inmate_name[1].split(' ')[0]		# the name after the comma, but before the next space
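	# anything after the first name is treated as the middle name(s), joined back into one String
	inmate['middle_name'] = the_inmate_name[1].split(' ')[1..-1].join(' ') if the_inmate_name[1].split(' ').length > 1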
  
  # at this point, we can determine if the inmate is already in our textfile
  
  name_to_match="#{inmate['first_name']}\t#{inmate['middle_name']}\t#{inmate['last_name']}"  
   # remember that in the text file, we tab-delimited the name, so we have to match that pattern
	
  
   if inmate_text.match(name_to_match)
     puts "NOT adding inmate #{name_to_match} to inmates txt, as it already exists"
     inmate = nil
     # the method that invoked process_inmate_row will only add the inmate if it is not nil
     # we DON'T want this inmate added, so that's why we're setting it to nil
		
   else  
     
     	puts "Adding inmate #{name_to_match} to inmates txt"
 		  inmate['sex'] = list_columns[1].content		
   	  inmate['dob'] = list_columns[2].content
   	  inmate['intake_time'] = list_columns[3].content
	    puts "Basic info of inmate: #{inmate['first_name']} #{inmate['last_name']}: #{inmate['dob']}"
       
  end
  
  
  return inmate
  
end


def process_inmate_page_link(inmate_link)
  	inmate_page = Nokogiri::HTML(open(inmate_link))
		content_table_rows = inmate_page.xpath("//td[@class='content']/table/tr")

    more_inmate_stuff = {}
    
    if content_table_rows.length > 0
      
	  	more_inmate_stuff["xref"] = content_table_rows[2].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
  		
  		more_inmate_stuff['booking_number'] = content_table_rows[3].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
  		more_inmate_stuff['arresting_agency'] = content_table_rows[13].xpath("./td")[2].content.gsub(/\302\240/, ' ').strip
  		more_inmate_stuff['total_bail'] = content_table_rows[16].xpath("./td")[2].content.gsub(/\302\240/, ' ').gsub(/\s|\n|\r/, ' ').strip

  		puts "Found more inmate info, total-bail: #{more_inmate_stuff['total_bail']} arresting-agency: #{more_inmate_stuff['arresting_agency']}"


  		table_of_charges = content_table_rows[15].xpath("./td")[2]
  		more_inmate_stuff['charges'] = []

  		charge_1st_rows =  table_of_charges.xpath(".//tr[td[@class='cellTopLeft']]")
  		
  		puts "Number of charges: #{charge_1st_rows.length}"
  		charge_2nd_rows = table_of_charges.xpath(".//tr[td[@class='cellBottom']]")

  		charge_1st_rows.collect{|x| x}.each_index do |charge_row_index|

  			hash_of_inmate_charge = {}

  			charge_1st_row = charge_1st_rows[charge_row_index]
  			hash_of_inmate_charge['code'] = charge_1st_row.xpath('.//td')[0].content.gsub(/\302\240/, ' ').strip
  			hash_of_inmate_charge['severity'] = charge_1st_row.xpath('.//td')[1].content.gsub(/\302\240/, ' ').strip
  			hash_of_inmate_charge['description'] = charge_2nd_rows[charge_row_index].xpath('.//td')[0].content.gsub(/\302\240/, ' ').strip

  			# push this hash on to the array of inmate charges:
  			more_inmate_stuff['charges'] << hash_of_inmate_charge	
    
        # to_a turns the hash into key/value pairs; join flattens them into one line
        puts hash_of_inmate_charge.to_a.join(" | ")
  		end
  		
  	else
  	  puts "Could not find more inmate info"
		end # end if content_table_rows
  
    return more_inmate_stuff
end

def write_to_file(inmate)

    ##write to file
    puts "Writing to inmates.txt"
    
    # note that we use the 'a+' mode here, which will append new input onto the end of an existing file (or create a new one if it doesn't exist), instead of overwriting it
    # Obviously, we don't want to keep overwriting inmates.txt if we intend it to be a persistent record of the inmate log
    
    File.open("inmates.txt", 'a+'){ |f| 
	f.write("first_name\tmiddle_name\tlast_name\tsex\tdob\tintaketime\trelease_date\txref\tbooking_number\tarresting_agency\ttotal_bail\n") unless File.size(f) > 0 
      # we don't want to repeatedly print the column headers    
      f.write("#{inmate['first_name']}\t#{inmate['middle_name']}\t#{inmate['last_name']}\t#{inmate['sex']}\t#{inmate['dob']}\t#{inmate['intake_time']}\t#{inmate['release_date']}\t#{inmate['xref']}\t#{inmate['booking_number']}\t#{inmate['arresting_agency']}\t#{inmate['total_bail']}\t#{Time.now}\n")
   
    }
   
    puts "Writing to charges.txt"

    File.open("charges.txt",'a+'){ |f|
      f.write("name\txref\ttotal_bail\tcode\tseverity\tdescription\n") unless File.size(f) > 0 
      # we don't want to repeatedly print the column headers
      
  	  if inmate['charges']
  	    inmate['charges'].each do |charge|
  	      puts "Writing charge: #{charge['description']}"
    	    f.write("#{inmate['first_name']} #{inmate['last_name']}\t#{inmate['xref']}\t#{inmate['total_bail']}\t#{charge['code']}\t#{charge['severity']}\t#{charge['description']}\t#{Time.now}\n")
        end
      end
  
    }
end
  
  

def check_the_site(base_url, index_url)
  # read the contents of inmates.txt into a variable so that we can check to see if an inmate already exists
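   # File.exist? replaces the older File.exists?, and the block form closes the file once it has been read
   inmate_text = File.exist?("inmates.txt") ? File.open("inmates.txt", 'r') { |f| f.readlines.join } : ''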
   inmates_added_count = 0 # just a piece of info we want to keep track of. We'll increment this number on each successful add
   
    
    begin	
      inmate_listing = Nokogiri::HTML(open("#{base_url}#{index_url}"))
    rescue Exception=>e
      puts "Oops, had a problem getting the inmates list at #{Time.now}"
      return nil #get out of here.
    end
      
    inmate_rows = inmate_listing.xpath("//td[@class='content']/table")[0].xpath(".//tr").collect{|x| x}[1..-1]
    puts "There are #{inmate_rows.length} rows to process"
    inmate_rows.each_index do |i|
  
      puts "\nProcessing inmate row: #{i}"
      inmate_row = inmate_rows[i]
      
      begin
        # The following code is potentially risky; we're making calls to process_inmate_row and process_inmate_page_link, two methods that could potentially throw an error if the data is improperly formatted or if the website refuses to send data
        
        # I've set up some rudimentary error handling to notify you of an error, but to keep chugging along to the next row
        
        inmate = process_inmate_row(inmate_rows[i], inmate_text)
        
        # process_inmate_row will return a hash of inmate data 
        # BUT, it will return nil if it turns out this inmate already exists
        # so here's another if branch to check for that
        
        if inmate.nil?
          # do nothing
        else  
          # inmate was not blank, so let's continue
          list_columns = inmate_row.xpath('./td')		
        	if list_columns[0].xpath('./a').length == 0 
        		inmate['release_date'] = list_columns[4].content.gsub(/\302\240/, ' ').split(' ')[1]
        		puts "inmate was released on #{inmate['release_date']}"
        	else
        	  inmate_link = list_columns[0].xpath('./a')[0]["href"] 
        	  inmate_link = "#{base_url}#{inmate_link}"
        	  puts "Fetching: #{inmate_link}"
            more_inmate_attributes = process_inmate_page_link(inmate_link)
            inmate.merge!(more_inmate_attributes)
          end
    
          
        end # end of the if inmate.nil? branch
      rescue Timeout::Error => e
        # this clause has to come before the catch-all rescue below, or it would never be reached
        puts "Had a timeout error: #{e}"
        sleep(10)
      rescue Exception=>e
        puts "Oops, had a problem getting data from inmate row #{i}, Error: #{e}"
      else
        # got all the info for the inmate, so let's add him/her to the file

        unless inmate.nil?
          # remember that inmate was set to nil if it already existed in the text file
          # we don't want to add it to the file again in such a case, hence the 'unless'
          write_to_file(inmate)
          inmates_added_count+=1
          puts "We successfully queried the site, so let's sleep a second"
          sleep 1
        end

        
        
      end
    
    end
	
	  # reached the end, let's print a summary:
	  puts "#{Time.now}: Out of #{inmate_rows.length}, we added #{inmates_added_count} inmates"
	
end

   


hours = 0
BASE_URL='https://danwin.com/static/jail-list/'

while(hours < 24)
  puts "Checking the site (#{hours} out of 24 times):"
  puts "***********************"
  check_the_site(BASE_URL, 'current_listing.cfm.html')
  #run the code that hits the site and processes the links...each new inmate is appended to the text files along the way
  
  
  
  hours += 1 # increment the counter, or this will run forever...
  puts "sleeping till next iteration"
  
  sleep_count = 0
  while(sleep_count < 1800)
    sleep(1) # sleep one second at a time; 1800 of these add up to 30 minutes
    sleep_count +=1
    puts "Will check again in #{(1800-sleep_count)/60} minutes" if sleep_count%60==0
  end
  
  
end

    
</pre>
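<p>One detail of the script above is worth a closer look: when a block of code has several <strong>rescue</strong> clauses, Ruby tries them from top to bottom and runs only the first one that matches. So a specific error such as <strong>Timeout::Error</strong> has to be listed before a catch-all. Here is a minimal sketch of that ordering; the method name <strong>describe_failure</strong> is made up for illustration and is not part of the scraper:</p>

```ruby
require 'timeout'

# Hypothetical helper (not part of the scraper): runs the block it is given
# and reports which rescue clause, if any, handled the error.
def describe_failure
  yield
rescue Timeout::Error => e
  # The specific error is listed first, so timeouts land here.
  "timeout: #{e.message}"
rescue StandardError => e
  # Every other ordinary error falls through to this catch-all.
  "other error: #{e.message}"
else
  # Runs only when the block raised nothing.
  "no error"
end

puts describe_failure { raise Timeout::Error, "site took too long" }
puts describe_failure { raise ArgumentError, "bad row" }
puts describe_failure { 1 + 1 }
```

<p>If the two rescue clauses were swapped, the catch-all would match first in current versions of Ruby (where <strong>Timeout::Error</strong> is a kind of <strong>StandardError</strong>), and the timeout branch would never run.</p>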
</div>
<p><b>4/4/2010:</b> This lesson remains unfinished, but the above code should execute. From it, you should have text files that, at a glance, will tell you some of the more interesting circumstances under which this set of inmates was arrested. There are various kinds of analysis you could do on a long-term basis. But trying to figure out why some inmates have bail set at $1,000,000 isn't easy; you need to know their prior criminal record too...which is <a href="https://danwin.com/works/coding-for-journalists-part-3-cross-checking-the-jail-log-with-the-court-system-use-rubys-mechanize-to-fill-out-a-form/">what we hope to do in the third tutorial in this series</a>.</p>
</div>
<p>The post <a rel="nofollow" href="https://danwin.com/2010/04/coding-for-journalists-102-collecting-info-from-a-county-jail-site/">Coding for Journalists 102: Who&#8217;s in Jail Now: Collecting info from a county jail site</a> appeared first on <a rel="nofollow" href="https://danwin.com">danwin.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>https://danwin.com/2010/04/coding-for-journalists-102-collecting-info-from-a-county-jail-site/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Coding for Journalists 101: Go from knowing nothing to scraping Web pages. In an hour. Hopefully.</title>
		<link>https://danwin.com/2010/04/coding-for-journalists-go-from-a-know-nothing-to-web-scraper-in-an-hour-hopefully/</link>
		<comments>https://danwin.com/2010/04/coding-for-journalists-go-from-a-know-nothing-to-web-scraper-in-an-hour-hopefully/#comments</comments>
		<pubDate>Tue, 06 Apr 2010 12:40:34 +0000</pubDate>
		<dc:creator><![CDATA[Dan Nguyen]]></dc:creator>
				<category><![CDATA[works]]></category>
		<category><![CDATA[html]]></category>
		<category><![CDATA[journalism]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[web scraping]]></category>

		<guid isPermaLink="false">https://danwin.com/?p=436</guid>
		<description><![CDATA[<p>UPDATE (12/1/2011): Ever since writing this guide, I&#8217;ve wanted to put together a site that is focused both on teaching the basics of programming and showing examples of practical code. I finally got around to making it: The Bastards Book of Ruby. I&#8217;ve since learned that trying to teach the fundamentals of programming in one [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://danwin.com/2010/04/coding-for-journalists-go-from-a-know-nothing-to-web-scraper-in-an-hour-hopefully/">Coding for Journalists 101: Go from knowing nothing to scraping Web pages. In an hour. Hopefully.</a> appeared first on <a rel="nofollow" href="https://danwin.com">danwin.com</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p><em><strong>UPDATE (12/1/2011)</strong>: Ever since writing this guide, I&#8217;ve wanted to put together a site that is focused both on teaching the basics of programming and showing examples of practical code. I finally got around to making it: <a href="http://ruby.bastardsbook.com">The Bastards Book of Ruby</a>.</p>
<p>I&#8217;ve since learned that trying to teach the fundamentals of programming in one blog post is completely dumb. Also, I hope I&#8217;m a better coder now than I was a year and a half ago when I first wrote this guide. Check it out and let me know what you think:</p>
<p><a href="http://ruby.bastardsbook.com">http://ruby.bastardsbook.com</a></em></p>
<div class="sec">Someone asked in this <a href="http://wjchat.webjournalist.org/?page_id=50">online chat for journalists</a>: I want to program/code, but where does a non-programmer journalist begin?</p>
<p>My colleague <a href="http://twitter.com/thejefflarson">Jeff Larson</a> gave what I believe is the most practical and professionally-useful answer: <strong>web-scraping</strong> (jump to my summary of web-scraping here, or read this more authoritative source).</p>
<p>This is my attempt to walk someone through the most basic computer science theory so that he/she can begin collecting data in an automated way off of web pages, which I think is one of the most useful (and time-saving) tools available to today&#8217;s journalist. And thanks to the countless hours of work by generous coders, the tools are already there to make this within the grasp of a beginning programmer.</p>
<p>You just have to know where the tools are and how to pick them up.</p>
<p>Click here for this page&#8217;s <a href="#toc">table of contents</a>. Or jump to the <a href="#topic_html">theory lesson</a>. Or to the <a href="#topic_writing_your_script">programming exercise</a>. Or, if you already know what functions and variables are, and have Ruby installed, go straight to two of my walkthroughs of building a real-world journalistic-minded web scraper: Scraping <a href="https://danwin.com/works/coding-for-journalists-102-collecting-info-from-a-county-jail-site/">a jail site</a>, and scraping <a href="https://danwin.com/works/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/">Pfizer&#8217;s doctor payment list</a>.</p>
<p>Or, read on for some more exposition:
</p></div>
<p><span id="more-436"></span></p>
<div class="code-doc">
<link rel='stylesheet' href='https://danwin.com/css/code.css' type='text/css' media='all' />
<div class="sec">
<h3>Who this post is for</h3>
<div id="attachment_649" style="width: 410px" class="wp-caption alignleft"><img class="size-full wp-image-649" title="His-Girl-Friday" src="https://danwin.com/words/wp-content/uploads/2010/03/His-Girl-Friday.jpg" alt="His Girl Friday" width="400" height="301" /><p class="wp-caption-text">His Girl Friday</p></div>
<p>You&#8217;re a journalist who knows almost nothing about computers beyond using them to connect to the Internets, email, and cheat on Facebook scrabble. This is not entirely trivial; if you&#8217;re able to do this without typing your password and SSN into a phishing site, you&#8217;re (sadly) a step ahead of most of the Internet populace. OK, it&#8217;ll also help if you&#8217;re familiar enough with your operating system (Windows or Mac&#8230;I&#8217;m assuming anyone using Linux won&#8217;t even need this tutorial) to know how to install programs.</p>
<p>Anyone who has taken a semester of computer science will scoff at how I&#8217;ve simplified even the basic fundamentals of programming&#8230;and they&#8217;d be right&#8230;but my goal is just to get you into the basics to write some useful code immediately. You&#8217;re going to have to make the effort yourself to learn the topics in-depth.</p>
<p>Thankfully, coding is something that provides immediate success and failure. You hit <strong>Ctrl-R</strong>, your script runs, and in five seconds or less, you&#8217;ll learn if you did right. The more you fumble, the more you learn. And getting around an error <a href="http://lmgtfy.com/">no longer requires owning a reference library</a>.</p>
</div>
<div class="sec">
<h3>The roadmap</h3>
<p>This tutorial aims to walk you through the bare essentials of HTML, programming theory and tools so that you can do something very practical: build an automatic process to gather data from websites. I made this lesson into one giant page so you can see for yourself, in one glance, the number of words (about 9,000) it takes to touch upon what is essentially one semester in a first-level computer science course. Also, I have no ads to sell.</p>
<p>Here&#8217;s what will happen if you read this entire page:</p>
<ol>
<li>Learn a little HTML</li>
<li>Install Firefox+Firebug</li>
<li>Install Ruby, a programming language</li>
<li>Learn some programming theory</li>
<li>Write a script</li>
<li>Execute the script</li>
</ol>
<p>Jump to the <a href="#toc">table of contents</a> or read some more blab.</p>
</div>
<div class="sec">
<h3>What web-scraping is and why it&#8217;s important to journalists</h3>
<p><strong>Web-scraping</strong> (also called <strong>screen-scraping</strong>) is the automated process of collecting the *useful* data off of a webpage. This is made possible because of the design of HTML, which, when done right, puts this data in as predictable a format as an Excel spreadsheet&#8230;sans the convenient interface, keyboard shortcuts, and Clippy. So you have to write your own tool tailored to the structure of a webpage.</p>
<p>The importance of data collection should be obvious to a journalist. Used to be, if you wanted a set of data&#8230;such as the list of restaurant inspections so you could do a regression analysis of failed tests with respect to neighborhood income levels&#8230;you&#8217;d ask the agency for the data, sue them if they said no, and if you were on the right side of the law, they&#8217;d grudgingly hand you a chunk of ordered text that you could eventually put into a spreadsheet.</p>
<p>But now, it&#8217;s possible that a public-information officer will just point you to the public website and say, there it is. And it&#8217;s not always a case of them being ignorant/disdainful of laws that oblige them to give the dataset, in electronic form, that backs the website. From their viewpoint, the information is there for any idiot with an Internet connection to ask for, so what are you whining about?</p>
<p>At this point, you can go through a weeks-long argument through emails and phone messages that ends with their legal counsel compelling the PI officer to hand over the data. Or, if keeping your story idea secret isn&#8217;t a priority, you could explain what your intent is, and why you need a whole dataset to see if a trend exists. Either way, you might still have another week or so of waiting for the PIO to successfully wrangle their tech people (and legal staff, who need to vet the released data for any confidential info) into giving you the data in a nice comma-delimited format.</p>
<p>So, if their website already has the information you need (although, often, the web display omits record keys and such that are useful), why not write a script in 15 minutes to grab it? Also, even if data is released willingly, it&#8217;s not always at a convenient pace. If a website is updated faster than a PIO can send you email attachments, then scraping the website on a nightly basis will save both of you headaches.</p>
<p>And some types of information are just not FOIA-able. My former colleague <a href="http://hackerjournalist.net/">Brian Boyer</a>, now news-apps chief at the Tribune, created <a href="http://www.propublica.org/ion/changetracker">ProPublica&#8217;s ChangeTracker</a>, built on a web-scraping service, to check when and how the White House changes its website. The request, &#8220;Hey, can you tell me all the times you&#8217;ve changed text on your website, what the text originally was, and what you changed it to&#8221; is not something a PIO could easily fulfill, even if he/she wanted to.</p>
<p>Web-scraping sometimes has bad connotations&#8230;because this is how various members of royalty find your email address in order to tell you that they are a distant family relation with $10,000,000US that they desperately want to give to you. So yes, you could use it for ugly purposes. My response is that if that&#8217;s your ultimate goal, you are way behind the game, and you will probably suffer a humiliating karmic fate, either in your online or real life.</p>
<p>On the other hand, there are innumerable sets of public, useful data that no one has gotten around to mapping out and collecting in a useful format. So let&#8217;s get to it.</p>
</div>
<div class="over-note" style="font-size: 12pt; color: #a44; border: 1px solid black; margin: 20px; padding: 20px;">
<p>This is part of a <a href="https://danwin.com/works/coding-for-journalists-101-a-four-part-series/">four-part series on web-scraping for journalists</a>. As of <strong>Apr. 5, 2010</strong>, it was published a bit incomplete because I wanted to post a timely solution to the <a href="https://danwin.com/works/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/">recent Pfizer doctor payments list release</a>, but the code at the bottom of each tutorial should execute properly. The code examples are meant for reference and I make no claims to the accuracy of the results. Contact <a href="mailto:dan@danwin.com">dan@danwin.com</a> if you have any questions, or leave a comment below.</p>
<p><strong>DISCLAIMER:</strong> <em>The code, data files, and results are meant for reference and example only. You use it at your own risk.</em></p>
</div>
<div class="sec">
<h3>The task</h3>
<div id="attachment_652" style="width: 160px" class="wp-caption alignleft"><img class="size-thumbnail wp-image-652" title="jefferson-thomas" src="https://danwin.com/words/wp-content/uploads/2010/03/jefferson-thomas-150x150.jpg" alt="" width="150" height="150" /><p class="wp-caption-text">Thomas Jefferson lived to be 83, according to Wikipedia</p></div>
<p><strong>When you get through this tutorial, you will be able to answer the question:</strong> According to <strong>Wikipedia</strong>, what is the <em>average age of U.S. Presidents whose last names have <strong>more than six characters</strong></em>? Not an important question, but it is on the same order of difficulty as, say, <a href="https://danwin.com/works/coding-for-journalists-102-collecting-info-from-a-county-jail-site/">scraping a county jail&#8217;s booking list</a> to find the inmates with the largest bail amount and charge list, and <a href="https://danwin.com/works/coding-for-journalists-part-3-cross-checking-the-jail-log-with-the-court-system-use-rubys-mechanize-to-fill-out-a-form/">how many are repeat-offenders</a>&#8230;which are the second and third lessons.</p>
</div>
<div class="sec">
<h3><a name="toc"></a>Table of Contents</h3>
<ol>
<li><a href="#topic_html">The basics of HTML</a>
<ul>
<li><a href="#topic_tags">Tags</a></li>
<li><a href="#topic_links">Hyperlinks</a></li>
<li><a href="#topic_firefox">Install Firefox and Firebug</a></li>
</ul>
</li>
<li><a href="#topic_programming">Programming Basics</a>
<ul>
<li><a href="#topic_ruby">Installing Ruby</a></li>
<li><a href="#topic_irb">The Ruby Interactive Prompt</a></li>
<li><a href="#topic_string">Strings</a></li>
<li><a href="#topic_variables">Variables</a></li>
<li><a href="#topic_comparisons">Comparison Operators</a></li>
<li><a href="#topic_conditionals">Conditional Branches</a></li>
<li><a href="#topic_methods">Methods</a></li>
</ul>
</li>
<li><a href="#topic_writing_your_script">Writing Your Script</a>
<ul>
<li><a href="#topic_nokogiri">Nokogiri &#8211; a Ruby parser of HTML</a></li>
<li><a href="#topic_commandline">Running scripts from Text Editors or the Command Line</a></li>
<li><a href="#topic_step_one">Step 1: Fetch the Presidents List</a></li>
<li><a href="#topic_xpath">XPath</a></li>
<li><a href="#topic_step_two">Step 2: From a Table of Data, Fetch the President&#8217;s Name</a></li>
<li><a href="#topic_step_three">Step 3: Determine if the Last Name Is Longer Than 6 Characters</a></li>
<li><a href="#topic_step_four">Step 4: If So, Fetch President&#8217;s Page</a></li>
<li><a href="#topic_step_five">Step 5: Grab the age from the president&#8217;s page</a></li>
<li><a href="#topic_regex">Regular Expressions</a></li>
<li><a href="#topic_step_six">Step 6: Add up the data you gathered</a></li>
<li><a href="#topic_complete">The complete script</a></li>
</ul>
</li>
</ol>
</div>
<div class="sec">
<h2><a name="topic_html"></a>HTML</h2>
<p>HTML is what makes web pages <strong>not</strong> just a stream of characters. Why did that &#8220;not&#8221; in the previous sentence appear <strong>bold</strong>? Because I wrapped the word &#8220;not&#8221; in <em>tags</em>. The raw code is: &lt;b&gt;not&lt;/b&gt;</p>
<p>The design and theory of HTML are topics that could consume the rest of your waking life. For now, its relevance to us is that with HTML, web pages have structure. And with structure, a web-scraper can reliably collect the useful bits of data as it would from columns of a spreadsheet.</p>
<p><a href="http://www.w3schools.com/html/html_intro.asp">W3Schools</a> is the best place to get a primer on HTML.</p>
</div>
<div class="sec">
<h3><a name="topic_tags"></a>Tags</h3>
<p>Tags are themselves contained in angle brackets (&lt; and &gt;) and come in pairs. The end tag is denoted by a forward slash: <strong>/</strong>.</p>
<p>So, anything between these tags &#8211; &lt;i&gt; &lt;/i&gt; &#8211; will appear in <em>italics</em>.</p>
<p>To make something a headline, use <strong>&lt;h1&gt; &lt;/h1&gt;</strong> tags. You can replace that &#8216;1&#8217; with numbers 2 through 6, with &#8216;1&#8217; being the most prominent kind of headline.</p>
<h1>Here is a h1 headline</h1>
<h4>Here is a h4 headline</h4>
<p>OK, one more critically important thing about tags. They can have <strong>attributes</strong>.</p>
<p>Let&#8217;s say I wanted to make something not only a headline (i.e. bold large text), but also the color <span>red</span>. There are many ways to do this, but let me show you the simplest (if not totally standards-compliant) way, which also illustrates the most basic form of an attribute:</p>
<p>An attribute consists of: the name of the attribute, an equals sign, and then the value of that attribute enclosed in quotation marks. Like so: <em>attribute</em>=&#8221;this_is_the_attributes_value&#8221;</p>
<pre class="ruby" name="code">&lt;h1 color="red"&gt;This is a headline&lt;/h1&gt;</pre>
<p>Attributes go in the starting tag &#8211; &lt;h1&gt; &#8211; after the tag name, <strong>h1</strong>, and before the closing right-angle-bracket. The name of the attribute, <strong>color</strong>, is followed by an <strong>=</strong> sign. Then come quotation marks (or single quotes; either way, they have to match, as they would when you write down someone&#8217;s quote, or someone quoting a quote) enclosing the <strong>value</strong> of the attribute. In this case, <strong>red</strong>.</p>
</div>
<div class="sec">
<h4>HTML Errors</h4>
<p>Couple of things to keep in mind. Tags come in pairs. When things look funny on a hand-coded webpage, usually it&#8217;s because the coder didn&#8217;t provide a closing tag to his starting tag. Here&#8217;s a properly tagged sentence:</p>
<pre class="ruby" name="code">&lt;b&gt;This sentence is meant to be bold.&lt;/b&gt; &lt;i&gt;This sentence is just in italics.&lt;/i&gt;</pre>
<p>Results in: <strong>This sentence is meant to be bold.</strong> <em>This sentence is just in italics.</em></p>
<p>In this sentence, I didn&#8217;t provide a closing bold tag, and so the bold part overlaps into the italics sentence, making a bold AND italicized sentence:</p>
<pre class="ruby" name="code">&lt;b&gt;This sentence is meant to be bold. &lt;i&gt;This sentence is just in italics.&lt;/i&gt;</pre>
<p>Results in: <strong>This sentence is meant to be bold. <em>This sentence is just in italics.</em></strong></p>
<p>Also, close the tags in the order they come in&#8230;I don&#8217;t know how to concisely explain this point, but the following is not properly-structured HTML. The part in red denotes how the closing-bold-tag should NOT come after the opening italics tag:</p>
<pre class="ruby" name="code">&lt;b&gt;This sentence is meant to be bold. <span style="color: red;">&lt;i&gt;&lt;/b&gt;</span>This sentence is just in italics.&lt;/i&gt;</pre>
<p>Sometimes browsers will compensate for coder-error and interpret this in a way that doesn&#8217;t look awful. But you just need to know that this violates a principle of HTML&#8230;and pages that you scrape that aren&#8217;t well structured may give strange results even if you&#8217;ve written a logically-designed scraper.</p>
</div>
<div class="sec">
<h3><a name="topic_links"></a>Hyperlinks</h3>
<p><strong>Hyperlinks</strong> are those (depending on a website&#8217;s style) underlined words that, upon clicking, send you to a whole different page. They are nothing more than special tags with an important attribute.</p>
<p>The tagged hyperlink makes the word &#8220;link&#8221; a clickable link that goes to Google. The <strong>href</strong> attribute describes where the link sends you:</p>
<pre class="ruby" name="code">This &lt;a href="http://google.com"&gt;link&lt;/a&gt; has many answers</pre>
<p>Results in:</p>
<p>This <a href="http://google.com">link</a> has many answers.</p>
<p>Want to try some tags and hyperlinks yourself? Use <a href="http://www.w3schools.com/html/tryit.asp?filename=tryhtml_links">W3Schools interactive editor</a>.</p>
<h3><a name="topic_firefox"></a>Firefox and Firebug</h3>
<p>As I wrote earlier, HTML structures the data you want. But you need to know how it&#8217;s structured, and so you need to know the designer&#8217;s blueprint. Not to get in a browser war, but just to make things easier on me, you can&#8217;t go wrong by first <a href="http://firefox.com">downloading Firefox</a>, the free open-source browser by Mozilla.</p>
<p>Now go to any website, right click on an empty space, and click &#8220;<strong>View Source</strong>&#8221; in the submenu. You&#8217;ll likely see something like this:</p>
<p><a href="https://danwin.com/words/wp-content/uploads/2010/02/Screen-shot-2010-03-13-at-9.27.18-PM.png"><img class="aligncenter size-medium wp-image-522" title="Screen shot 2010-03-13 at 9.27.18 PM" src="https://danwin.com/words/wp-content/uploads/2010/02/Screen-shot-2010-03-13-at-9.27.18-PM-500x256.png" alt="" width="500" height="256" /></a></p>
<p>That&#8217;s the raw HTML. You might eventually get to the point where HTML is what the Matrix is to Neo. But let&#8217;s make it as painless as possible. Firefox has many plugins, including one called Firebug, which makes it very easy to dissect code. <a href="http://getfirebug.com/">Get it here</a>.</p>
<div id="attachment_650" style="width: 160px" class="wp-caption alignleft"><img class="size-thumbnail wp-image-650" title="firebug-large" src="https://danwin.com/words/wp-content/uploads/2010/03/firebug-large-150x150.png" alt="Firebug, a plugin for Firefox" width="150" height="150" /><p class="wp-caption-text">Firebug, a plugin for Firefox</p></div>
<p>Double-click on one of the sample headlines in this tutorial to highlight it. Then right-click to open the submenu, then click &#8220;<strong>Inspect Element</strong>&#8221;. This should bring up a Firebug panel that lets you see the HTML that made that headline. This saves you from having to search through the entire source to find that headline, just to see the tags that wrap it.</p>
<p>Like I said, in order to successfully web-scrape, you&#8217;re going to have to know how the elements &#8211; the paragraphs, headlines, and links &#8211; were structured. <strong><a href="http://www.getfirebug.com/">Firebug</a></strong> is a tool that helps pinpoint the elements you want to know about.</p>
</div>
<div class="sec">
<h2><a name="topic_programming"></a>Programming Basics</h2>
<p>A good way to annoy a programmer is to say something like, &#8220;Yeah, I have some programming experience: I&#8217;ve been writing HTML for two weeks now.&#8221; Writing HTML is <strong>not</strong> programming, any more than operating a stereo equalizer makes you a classically-trained guitarist. HTML is a way to describe and present content, but you&#8217;re not running any kind of computerized task.</p>
<p>So, I went through the basics of HTML so you&#8217;d be familiar with the content that you&#8217;d be collecting. Now we&#8217;ll learn the basics of how to program a script that will actually collect that content.</p>
</div>
<div class="sec">
<h3><a name="topic_ruby"></a>Installing Ruby</h3>
<p>What is Ruby? It&#8217;s a programming language. And like a spoken language, once you&#8217;ve learned one, you&#8217;ve learned the fundamentals (i.e. the concepts of verbs, nouns, sentences, etc.) that allow you to try out all the other ones. Ruby is also the basis for <a href="http://rubyonrails.org/">Ruby on Rails</a>, a very popular framework that many developers use to build data-driven websites. But right now, we&#8217;re collecting data from websites, not building them.</p>
<div class="note">I&#8217;ve purposely been brief here. Installing Ruby and its libraries may be the most frustrating aspect of this lesson, and I have little more insight to it than, &#8220;I have a Mac w/ Leopard, and it came with it&#8221;</div>
<p>Installation instructions for <a href="http://www.ruby-lang.org/en/downloads/">Ruby are here</a>&#8230;if you&#8217;re on Mac OS X with Leopard or better, you should be good to go. Hopefully, the <a href="ftp://ftp.ruby-lang.org//pub/ruby/binaries/mswin32/ruby-1.9.1-p376-i386-mswin32.zip">one-click installer for Windows</a> should be easy enough to install (check the <strong>Enable RubyGems</strong> and <strong>SciTE</strong> boxes).</p>
<div id="attachment_545" style="width: 798px" class="wp-caption aligncenter"><a href="https://danwin.com/words/wp-content/uploads/2010/02/windows-ruby-installer.png"><img class="size-full wp-image-545" title="windows-ruby-installer" src="https://danwin.com/words/wp-content/uploads/2010/02/windows-ruby-installer.png" alt="" width="788" height="629" /></a><p class="wp-caption-text">The One-Click Ruby Installer for Windows</p></div>
<div class="note">More specifically, Ruby is an interpreted language&#8230;so I use the phrase &#8220;Ruby interpreter&#8221; to refer to the program that reads your script, makes sense of it, and executes it. <a href="http://en.wikipedia.org/wiki/Interpreter_%28computing%29">Read more about this definition at Wikipedia</a>.</div>
</div>
<div class="sec">
<h3><a name="topic_irb"></a>The Ruby Interactive Prompt (IRB)</h3>
<p>If you belong to the target-audience of this tutorial, you probably have been able to get your computer to perform tasks (such as, &#8216;Open my web browser&#8217;) with your mouse-clicking. Programming means you&#8217;re going to be typing out lines of code that execute tasks. Your web-scraping is essentially going to be a sequence of such commands, i.e. a script.</p>
<p>But why wait until you get a complete script when we can start executing commands right now? This is where <strong>Ruby&#8217;s Interactive Prompt (IRB) comes in</strong>. In its simplest form of operation, the IRB waits for you to type in a line of code, then for you to hit &#8220;Enter/Return&#8221;, and then it will run your command, provided it makes sense.</p>
<p>On Windows, press the Windows key and R together (or choose Run&#8230; from the Start menu) to bring up the Run&#8230; prompt. Type in &#8216;cmd&#8217; and hit Enter to open a command window. Then type in &#8216;irb&#8217;. On the Mac, go to Applications=&gt;Terminal. At the command line, type in &#8216;irb&#8217;.</p>
<div id="attachment_523" style="width: 433px" class="wp-caption aligncenter"><a href="https://danwin.com/words/wp-content/uploads/2010/02/irb-screenshot.gif"><img class="size-full wp-image-523" title="irb-screenshot" src="https://danwin.com/words/wp-content/uploads/2010/02/irb-screenshot.gif" alt="Interactive Ruby prompt" width="423" height="267" /></a><p class="wp-caption-text">Interactive Ruby prompt</p></div>
<p>Now that you&#8217;re here, type in the following:</p>
<pre class="ruby" name="code">1+6
	#result: 7</pre>
<p>Congrats. You just wrote a one-line script to figure out what one plus six is.</p>
<div class="note"><strong>Note:</strong> In Ruby, the pound sign &#8216;#&#8217; designates the code following it to be a <strong>comment</strong>; I will use this convention in the code boxes to mark what your result after a command should be.</div>
<p>Let&#8217;s also learn a common Ruby command: <strong>puts</strong>. It simply outputs what comes after it (actually, not quite that simple, but you&#8217;ll learn soon in the next section)&#8230;I&#8217;ll be using this in the script to output results.</p>
<pre class="ruby" name="code">		puts "Hello World"
		#result: Hello World</pre>
<p>Read more about the <a href="http://ruby.about.com/od/tutorials/a/commandline_4.htm">command-line interpreter.</a></p>
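<p>One reason <strong>puts</strong> is &#8220;not quite that simple&#8221; can be seen with a small sketch you can paste into the IRB. It assumes nothing beyond Ruby itself, and the variable name <strong>returned</strong> is only for illustration. The point is that <strong>puts</strong> prints its text as a side effect, while the value it hands back is <strong>nil</strong> (Ruby&#8217;s word for &#8220;nothing&#8221;), which is why the IRB shows <strong>=&gt; nil</strong> after each <strong>puts</strong>:</p>

```ruby
# puts prints its argument, followed by a line break
puts "Hello World"

# Given a number, puts converts it to text before printing it
puts 1 + 6

# The value puts gives back is nil, not the text it printed
returned = puts("Hello again")
puts returned.inspect
# prints: nil
```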
</div>
<div class="sec">
<h3><a name="topic_string"></a>Strings</h3>
<p>Let&#8217;s say you want to be a little more narrative about the above <strong>1+6</strong> calculation. Try writing out those numbers and enclosing them in <strong>quotation marks</strong>. Like so:</p>
<pre class="ruby" name="code">	"One"+"Six"
	# result: "OneSix"</pre>
<p>Your answer won&#8217;t be &#8220;Seven&#8221;, but &#8220;OneSix&#8221;. Why? To human eyes and ears, <strong>1+6</strong> and <strong>&#8220;One&#8221;+&#8220;Six&#8221;</strong> might be the same. But in Ruby, and most other programming languages, the computer interprets the latter command to be joining two <strong>words</strong>, i.e. <strong>strings</strong> together.</p>
<div class="note">Strings can be enclosed in either double-quotes or single-quotes. However, double-quotes in Ruby and other languages, allow for some important manipulation, called <strong>string interpolation</strong>. Good to know for later. Just make sure whatever you use, the first mark matches the second.</div>
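<p>As a quick preview of <strong>string interpolation</strong>, here is a minimal sketch to try in the IRB. Inside double-quotes, Ruby replaces <strong>#{ }</strong> with the result of whatever is between the braces; inside single-quotes, the same characters are left exactly as typed:</p>

```ruby
# Double quotes: Ruby evaluates what is inside #{ } and inserts the result
puts "One plus six is #{1 + 6}"
# prints: One plus six is 7

# Single quotes: the text is left exactly as typed
puts 'One plus six is #{1 + 6}'
# prints: One plus six is #{1 + 6}
```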
<p>In the programming-world, &#8220;six&#8221; is fundamentally different from 6. &#8220;Six&#8221; is what Ruby considers a <strong>String</strong>. 6 is a <strong>Number</strong> (a whole number, which this version of Ruby calls a <strong>Fixnum</strong>).</p>
<p>So what happens when you try to add <strong>&#8220;Six&#8221;</strong>, the string, to <strong>6</strong>, the number?</p>
<pre class="ruby" name="code">"Six"+6</pre>
<pre class="ruby" name="code">TypeError: can't convert Fixnum into String
	from (irb):2:in `+'
	from (irb):2
	from /usr/local/bin/irb:12:in `'</pre>
<p>Congrats, it&#8217;s your first of many, many times of making the Ruby interpreter choke. In the case of numbers and strings, it only knows how to add like items together.</p>
<p>The takeaway from this is that, for our purposes, anything in quotation marks is a string. Even a number in quotation marks is no longer a number. You&#8217;ll get the same error as above if you try:</p>
<pre class="ruby" name="code">"6"+6</pre>
<p>The quotation marks make all the difference, just as they do in the journalism world. For example:</p>
<p><em>The governor is a scumbag who molests staffers on taxpayer-dime</em><br />
<em>by Dan Nguyen, Newswire, Inc.</em></p>
<p><em>Whistleblower: &#8220;The governor is a scumbag who molests staffers on taxpayer-dime&#8221;</em><br />
<em>by Dan Nguyen, Newswire, Inc.</em></p>
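<p>Back in Ruby: if you really do want to combine a string and a number, you have to convert one of them first so that both are the same kind of thing. A minimal sketch using two conversions built into Ruby, <strong>to_s</strong> (&#8220;to string&#8221;) and <strong>to_i</strong> (&#8220;to integer&#8221;):</p>

```ruby
# Turn the number into a string, then join the two strings
puts "Six" + 6.to_s
# prints: Six6

# Turn the string "6" into a number, then add the two numbers
puts "6".to_i + 6
# prints: 12

# A string that does not start with digits converts to 0
puts "Six".to_i
# prints: 0
```

<p>That last result is worth remembering when you scrape: text that merely describes a number will quietly become 0 rather than cause an error.</p>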
</div>
<div class="sec">
<h3><a name="topic_variables"></a>Variables</h3>
<p>OK, you now know that you shouldn&#8217;t add strings to numbers, and you&#8217;re perfectly content to add strings to create results like &#8220;eightzero&#8221;. What if you tire of typing quotation marks?</p>
<pre class="ruby" name="code">	eight+zero
	# NameError: undefined local variable or method `eight' ...</pre>
<p>What happened here? Well, without quotation marks, <em>eight</em> and <em>zero</em> are no longer considered <strong>strings</strong>. In their unquoted form, they are considered <strong>variables</strong> that hold some kind of <strong>value</strong>.</p>
<p>Think back to algebra when you were asked to solve &#8220;x+1=6&#8221;. You weren&#8217;t supposed to interpret that as:<br />
<em>the letter x added to the number 1 equals 6</em></p>
<p>The <strong>x</strong> is a stand-in for the value <strong>5</strong>. <strong>x</strong> could&#8217;ve been <strong>a</strong>, <strong>b</strong> or <strong>y</strong>.</p>
<div class="note">(Forgot what algebra was? Try this great primer, &#8220;<a href="http://opinionator.blogs.nytimes.com/2010/02/28/the-joy-of-x/">The Joy of X</a>&#8221; by the NYT&#8217;s Opinionator)</div>
<p>So, to make <strong>eight+zero</strong> understandable by the Ruby interpreter, you must assign those two terms values. So, try:</p>
<pre class="ruby" name="code">eight=8
zero=0
eight+zero
# result: 8</pre>
<p>Now, <strong>eight+zero</strong> is the same as <strong>8+0</strong>.</p>
<p>Enter the following into the IRB:</p>
<pre class="ruby" name="code">zero=1
eight+zero
# result: 9</pre>
<p>You should get <strong>9</strong> as the result. The variable <strong>eight</strong> is still <strong>8</strong>. But you assigned <strong>zero</strong> the value of <strong>1</strong>. Therefore, you were asking the interpreter to execute <strong>8+1</strong>.</p>
<p>Here&#8217;s what you should grok by now: unquoted words are considered to be variables, and they are undefined (Ruby raises an error, as it did above) until you&#8217;ve assigned them a value. And the name of a variable is completely independent of, and unrelated to, its actual value. Thus, <strong>nine=&quot;nine&quot;</strong> makes as much sense in Ruby as <strong>this_variable_has_a_value_that_is_not_nine_dang_it=&quot;nine&quot;</strong>.</p>
<p>Since you can name your variables just about anything (stick to lowercase letters, numbers and underscores, with no spaces or hyphens, and don&#8217;t start with a number), name them something related to their actual value, so that your code is more readable.</p>
<p>At this point, we&#8217;ve run through a lot of programming concepts. But if you don&#8217;t understand the above examples, and the following:</p>
<pre class="ruby" name="code">one = 1
one = 2  # assigning the variable named one to another value
one + one
# result: 4</pre>
<p>&#8230;then <strong>pause for a moment.</strong> It&#8217;s not a trivial topic, but it is critical to understand it at least at this level. Go here for more discussion on variables.</p>
<div class="note">By the way, arithmetic symbols, such as <strong>+</strong> and <strong>&#8211;</strong>, are called <strong>operators</strong>. A statement like <strong>4+5</strong> is an <strong>expression</strong>. I&#8217;ll avoid, or mangle, the terminology throughout the lesson.</div>
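<p>To tie this back to the strings from earlier, here is a minimal sketch (the variable names <strong>joined</strong> and <strong>total</strong> are arbitrary, chosen just for this illustration) showing that the same variable names can hold strings or numbers, and that <strong>+</strong> works on whatever values they currently hold:</p>

```ruby
# Variables holding strings: + joins them together
eight = "eight"
zero = "zero"
joined = eight + zero
puts joined
# result: eightzero

# Reassign the same names to numbers: now + does arithmetic
eight = 8
zero = 0
total = eight + zero
puts total
# result: 8
```

<p>Notice that the quotation marks only appear where the values are assigned; after that, you work with the bare variable names.</p>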
</div>
<div class="sec">
<h3><a name="topic_comparisons"></a>Comparison operators</h3>
<p>Let&#8217;s say you&#8217;ve written a bunch of code and forgot whether you set the variable <strong>eight</strong> to &#8220;eight&#8221; or 8. How to test that? Well&#8230;typing in <strong>eight</strong> and hitting &#8216;Enter&#8217; is the easy way&#8230;but now&#8217;s a good time to learn the concept of a comparison.</p>
<p>We already know that <strong>=</strong>, the equals sign, is something that <em>assigns</em> a value: what&#8217;s on the right of the  <strong>=</strong> is set as the value of the <strong>variable</strong> on the left side.</p>
<p>So what&#8217;s a double equals sign <strong>==</strong> mean?</p>
<p>Write this sequence of code:</p>
<pre class="ruby" name="code">eight="eight"
eight==8
# result: false</pre>
<p>The second line of code, translated into English, is you telling the interpreter:</p>
<p><em>The value of the variable named <strong>eight</strong> is the number <strong>8</strong></em></p>
<p>To which the computer responds: <strong>false</strong></p>
<p>Here, Ruby is telling you that the string <strong>&#8220;eight&#8221;</strong>, to which the variable <strong>eight</strong> was assigned, is not equal to the number <strong>8</strong>.</p>
<p>We already know this is how Ruby sees things from vainly trying to add <strong>&quot;eight&quot;+8</strong>. Evaluating <strong>eight==&quot;eight&quot;</strong>, on the other hand, will yield the value <strong>true</strong>.</p>
<div class="note">Note: <strong>true</strong> and <strong>false</strong> are not variable names. They are reserved words that are values in themselves. So, this will result in an error: true = &#8220;A string I&#8217;d like to assign the value named true&#8221;. However, replacing that equals sign with a double equals sign, <strong>==</strong>, will return a result of <strong>false</strong>.</div>
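<p>If you do end up with a number trapped inside a string, a comparison can still be made to work. The sketch below assumes two built-in Ruby methods that this tutorial hasn&#8217;t covered yet, <strong>to_i</strong> (string to integer) and <strong>to_s</strong> (number to string); the variable names are made up for the example:</p>

```ruby
six_as_string = "6"
six_as_number = 6

puts six_as_string == six_as_number
# result: false (a string is never equal to a number)

# to_i asks the string for its integer value
puts six_as_string.to_i == six_as_number
# result: true

# to_s asks the number for its string form, so + can join two strings
puts six_as_string + six_as_number.to_s
# result: 66
```

<p>That last result is the string &#8220;66&#8221;, not the number 12, because both sides of the <strong>+</strong> were strings.</p>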
</div>
<div class="sec">
<h3><a name="topic_array"></a>Arrays</h3>
<p>Think of an <strong>Array</strong> as something that contains a sequence of other variables and values. In Ruby, and most other languages, arrays are set off by square brackets, <strong>[</strong> and <strong>]</strong>.</p>
<p>Here&#8217;s the easiest way to initialize an Array:</p>
<pre class="ruby" name="code">an_empty_array = []
array_with_numbers=[1,2,3,4]</pre>
<p>Above, I&#8217;ve assigned two variables the values of two different arrays. The first, <strong>an_empty_array</strong>, is empty. The second, <strong>array_with_numbers</strong>, is filled with four numbers. You could&#8217;ve written out four lines of code, assigning four different variables respectively with the numbers 1 through 4. With an array, you essentially have one variable referring to 4 values.</p>
<p>How do you access the individual values? Use the name of the variable, followed by the <strong>index</strong> (think of the index as the address of the element you want), set off by square brackets. Here the square brackets denote that the variable they follow is an array, while the value inside them is the index/address. Such as:</p>
<pre class="ruby" name="code">array_with_numbers[0]</pre>
<p>In Ruby, <strong>the first element of an array has an index of 0</strong>. So the above line would give you the value of <strong>1</strong>. <strong>array_with_numbers[3] </strong>would get you <strong>4</strong>. The index <strong>4</strong> in <strong>array_with_numbers</strong> would get you an empty (<strong>nil</strong>) value.</p>
<p>Arrays can contain other variables too, like so:</p>
<pre class="ruby" name="code">an_empty_array = []
array_with_numbers=[1,2,3,4, an_empty_array]</pre>
<p><span class="var">array_with_numbers[4]</span> would now yield <strong>[]</strong>, an empty array, which is the value of the variable named <span class="var">an_empty_array</span></p>
<p>More about <a href="http://ruby-doc.org/core/classes/Array.html">Arrays</a> here.</p>
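<p>Here is a short sketch you can paste into IRB to check the indexing rules described above. It uses <strong>length</strong>, a built-in Array method (methods are explained further down), and <strong>inspect</strong>, which just makes <strong>nil</strong> visible when printed:</p>

```ruby
array_with_numbers = [1, 2, 3, 4]

puts array_with_numbers[0]
# result: 1
puts array_with_numbers[3]
# result: 4

# Asking for an index past the end gives nil rather than an error
puts array_with_numbers[4].inspect
# result: nil

# length counts the elements; the last valid index is length - 1
puts array_with_numbers.length
# result: 4
puts array_with_numbers[array_with_numbers.length - 1]
# result: 4
```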
<h3>Hashes</h3>
<p>OK, I&#8217;m going to make another vast simplification of a programming object: <strong>Hashes</strong> can be considered Arrays in which the indexes are <strong>strings</strong>, not numbers. Hashes are denoted by curly brackets.</p>
<pre class="ruby" name="code">a_hash = {"one"=&gt;1, "two"=&gt;2, "three"=&gt;3}</pre>
<p>Note the convention of <strong>=&gt;</strong> which assigns a value to an index (the correct term, actually, is <strong>key</strong>) of the hash. So:</p>
<pre class="ruby" name="code">a_hash["two"]
# result: 2</pre>
<p>It&#8217;s not important right now to understand the full differences and capabilities of Arrays and Hashes, but you&#8217;ll be seeing this notation in the script we write.</p>
<p>Read more about <a href="http://ruby-doc.org/core/classes/Hash.html">Hashes here</a>.</p>
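<p>As a rough sketch of how a hash might be used in a script like ours, the example below stores a few ages keyed by last name. The hash name <strong>ages</strong> and the sample entries are only there for illustration:</p>

```ruby
# Illustrative data: age at death, keyed by last name
ages = {"Washington" => 67, "Adams" => 90}

puts ages["Adams"]
# result: 90

# Add a new key/value pair by assigning to a key that doesn't exist yet
ages["Jefferson"] = 83
puts ages.length
# result: 3

# Looking up a key that was never set gives nil, like an out-of-range array index
puts ages["Lincoln"].inspect
# result: nil
```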
</div>
<div class="sec">
<h3><a name="topic_conditionals"></a>Conditional Branches</h3>
<p>So far, we&#8217;ve been typing in single line commands. Your final script is going to be a long list of commands telling the computer to:</p>
<ol>
<li>Go to Wikipedia&#8217;s listing of each U.S. President&#8217;s page (i.e. a list of links to each page)</li>
<li><strong>Visit, via hyperlink, each page belonging to a president whose last name is longer than six letters</strong></li>
<li>Grab the president&#8217;s age from each individual page, if that president is dead</li>
<li>Average those ages</li>
</ol>
<p>Our criteria for inclusion means we have to come up with some way to <strong>not visit</strong>, say, John Adams&#8217;s Wikipedia page. And to not include a living president&#8217;s age. So inside our script, there&#8217;s going to be a section of code telling the computer to go into a webpage&#8230;but that code should only execute <strong>if</strong> the length of a President&#8217;s last name is greater than 6.</p>
<p>That&#8217;s where the <strong>if</strong> conditional branch comes in. Without getting too far past the basics, here&#8217;s the simplified code:</p>
<pre class="ruby" name="code">president = "John Adams"
last_name_length = 5  # I manually set this variable for now; in your actual script, you'll find this value programmatically 

if last_name_length &gt; 6
 # <em>then go to his wikipedia page...and while we're in this branch of code, let's print something</em>
 puts "Entering a page"
else
 #<em>OK, don't go there. But let's print out a statement</em>
 puts "This name is too short"
end

# result: "This name is too short"</pre>
<p>What the above section of code is essentially saying is that if the value of the variable <span class="var">last_name_length</span> is greater than 6, then do what was in between <strong>if</strong> and <strong>else</strong>. Otherwise, completely <em>skip</em> what was there and go to what&#8217;s between the <strong>else</strong> and <strong>end</strong></p>
<p>The <span class="res">else</span> is optional&#8230;if you want, you could do <em>nothing</em> if the conditional statement (if last_name_length &gt; 6) isn&#8217;t satisfied. The <strong>end</strong> <em>is required</em>; it tells Ruby that that&#8217;s the end of that optional branch of code that started with the <strong>if</strong>.</p>
<p>Up till now, our series of commands have been straight-forward: the interpreter executes one line after another. Introducing the <strong>if</strong> statement has introduced a fork in the road; if the condition in the <strong>if</strong> statement isn&#8217;t met, the interpreter skips past that <strong>if</strong> block.</p>
<p>The <strong>if</strong> statement is the simplest of such conditional branches. All you need to know for now is that there&#8217;s a way to tell the Ruby interpreter to execute a certain bit of code only if a condition is met. <a href="http://en.wikibooks.org/wiki/Ruby_Programming/Syntax/Control_Structures">Read more about it here</a>.</p>
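<p>To see what an <strong>if</strong> without an <strong>else</strong> looks like, here is a minimal sketch. The counter variable <strong>pages_entered</strong> is made up for this example, and the name lengths are still typed in by hand:</p>

```ruby
pages_entered = 0

last_name_length = 10   # "Washington", counted by hand for now
if last_name_length > 6
  puts "Entering a page"
  pages_entered = pages_entered + 1
end

last_name_length = 5    # "Adams"
if last_name_length > 6
  puts "Entering a page"
  pages_entered = pages_entered + 1
end

puts pages_entered
# result: 1
```

<p>The second <strong>if</strong> block is skipped entirely, so nothing is printed for it and the counter only goes up once.</p>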
</div>
<div class="sec">
<h3><a name="topic_methods"></a>Methods</h3>
<p>I&#8217;m really going to be brief here. Think of methods as a set of commands that are useful enough to run more than once.</p>
<div class="note">Out of bad habit, I&#8217;ll use the term <strong>function</strong> as a synonym for <strong>method</strong>. They&#8217;re the same concept, except method is a kind of function, the explanation of which requires me getting into object-oriented programming. Which I don&#8217;t want to right now.</div>
<p>Let&#8217;s say I need to take two numbers, multiply them together, subtract 5 from the product, and then add the result to itself. In code, that would be:</p>
<pre class="ruby" name="code">#initialize the variables:
a = 10
b = 20

#now make each step its own line
c = a * b
c = c - 5
c = c+c
# result: 390</pre>
<p>Well, that could&#8217;ve been one line, without using the placeholder variable named <strong>c</strong>, like so:</p>
<pre class="ruby" name="code">(a*b)-5 + (a*b)-5</pre>
<p>If I need to run this more than once, it&#8217;s a bit annoying to type out each time we want to run that series of commands, so let&#8217;s define a function called <span class="foo">my_funny_equation</span></p>
<pre class="ruby" name="code">def my_funny_equation(first_argument, second_argument)
  answer = (first_argument*second_argument)-5 + (first_argument*second_argument)-5
  return answer
end</pre>
<p>Inside the parentheses, following <strong>my_funny_equation</strong> are the <em><strong>arguments</strong></em>, the values that you want the method to work with.</p>
<p>The takeaway here is that I&#8217;ve encapsulated my series of commands into a block of code. The variable names, arbitrarily named <strong>first_argument</strong>, <strong>second_argument</strong>, and <strong>answer</strong>, are references that only exist within that block of code which defines the method <strong>my_funny_equation</strong>.</p>
<p>Now that this method is defined, I can do:</p>
<pre class="ruby" name="code">my_funny_equation(10, 12)
# result: 230

my_funny_equation( 4, 5)
# result: 30

answer+10
# result: (Ruby will choke here)</pre>
<p>Why does the third command choke? Again, <strong>answer</strong> exists only within the little world defined in the my_funny_equation method, between the <strong>def</strong> and <strong>end</strong> lines. It has no value outside of the method definition. This is called function scope, a topic outside of, well, the scope of this simplified tutorial. Read more about <a href="http://en.wikipedia.org/wiki/Scope_%28programming%29">scope here</a>.</p>
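<p>If you want to watch the scope rule in action without your session stopping on the error, something like the following sketch should work. It leans on <strong>begin</strong>/<strong>rescue</strong>, Ruby&#8217;s way of catching errors, which this tutorial doesn&#8217;t otherwise cover; the names <strong>result</strong>, <strong>choked</strong> and <strong>error</strong> are arbitrary:</p>

```ruby
def my_funny_equation(first_argument, second_argument)
  answer = (first_argument * second_argument) - 5 + (first_argument * second_argument) - 5
  return answer
end

# The value leaves the method only through return
result = my_funny_equation(4, 5)
puts result
# result: 30

# answer itself is invisible out here, so Ruby raises a NameError
choked = false
begin
  answer + 10
rescue NameError => error
  choked = true
  puts "Ruby choked: " + error.class.to_s
end
# result: Ruby choked: NameError
```

<p>To use the method&#8217;s answer later on, store what it returns in a variable of your own, as <strong>result</strong> does here.</p>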
<p>OK, the above was just introducing you to the concept of a method/function. The kind of methods we&#8217;ll be dealing with in our script are called <strong>instance methods</strong>. These methods belong to something&#8230;an actual number, for example. <strong>6</strong> is an <em>instance</em> of an <strong>Integer</strong> (Ruby&#8217;s class for whole numbers). <strong>&#8220;Six&#8221;</strong> is an <em>instance</em> of a <strong>String</strong>.</p>
<p><strong>Example:</strong></p>
<p>The number <strong>2.67</strong> is considered by Ruby to be of the class <strong><a href="http://ruby-doc.org/core-1.9/classes/Float.html">Float</a></strong>&#8230;that is, a number with a floating decimal point.</p>
<p>More specifically, <strong>2.67</strong> is an <em>instance</em> of a Float. So is 4.777. And so is 8.999.</p>
<p>What if I wanted to go about rounding a Float number? Well, luckily, Ruby has built in <strong>instance methods</strong> that do this. The basic structure is the instance, followed by the method&#8217;s name&#8230;as follows:</p>
<p><em>instance</em>.<em>method_name</em></p>
<p>The method for rounding a Float is called &#8220;round&#8221;. So, to round 2.67, we do:</p>
<pre class="ruby" name="code">2.67.round
# result: 3</pre>
<p>This is a little confusing because of the two periods. Just have faith that the Ruby interpreter knows the difference; it sees the first &#8220;dot&#8221; as a decimal point defining the number. The second &#8220;dot&#8221; tells it that we want to access the built-in Float method called <strong>round</strong>.</p>
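<p>Since every Float is an instance of the same class, the same method is available on all of them. A quick sketch using the other numbers mentioned above, plus one that rounds down:</p>

```ruby
puts 2.67.round
# result: 3
puts 4.777.round
# result: 5
puts 8.999.round
# result: 9

# round goes to the nearest whole number, so it can go down as well as up
puts 2.4.round
# result: 2
```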
<p>One more example, let&#8217;s work with <a href="#topic_array">arrays</a>.</p>
<p>Let&#8217;s say we have:</p>
<pre class="ruby" name="code">an_array= [1,2,3,4,5,6]</pre>
<p>I want to make an array that consists of the first three elements of *any* array. Luckily, Ruby arrays have a built-in method called <a href="http://ruby-doc.org/core/classes/Array.html#M002221">slice</a>.</p>
<pre class="ruby" name="code">an_array.slice(0,3)
# result: [1, 2, 3]</pre>
<p>So, <strong>slice</strong> is the name of an <em>instance method of things that are Arrays</em>. Inside the parentheses are two arguments, the first denotes the element to start out at (in this case, 0, since we want the first element), the second denotes how many elements to include in this sub-array (3).</p>
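<p>Combining this with the earlier section on defining your own methods, one way to get &#8220;the first three elements of any array&#8221; might look like the sketch below. The method name <strong>first_three</strong> and its argument name are invented for this example; <strong>slice</strong> is the built-in method doing the real work:</p>

```ruby
# first_three is a made-up name; slice is the built-in Array method doing the work
def first_three(any_array)
  return any_array.slice(0, 3)
end

puts first_three([1, 2, 3, 4, 5, 6]).inspect
# result: [1, 2, 3]
puts first_three(["a", "b", "c", "d"]).inspect
# result: ["a", "b", "c"]

# If the array has fewer than three elements, slice returns what is there
puts first_three([9, 8]).inspect
# result: [9, 8]
```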
<p>What was the point of all of this? In our final code, you&#8217;ll be seeing calls to methods. Someone already wrote the method that, say, collects all the text of a webpage and stores it into a variable for you. But you need to know the name of that method and how to invoke it.</p>
</div>
<div class="sec">
<h2><a name="topic_writing_your_script"></a>Writing Your Script</h2>
<p>OK, now we get past the fundamentals and into things that will really solve your problems. It wasn&#8217;t important to have intimate knowledge of the previous concepts, but just to know they exist.</p>
<p>But how can you, knowing just the basics, do something as complicated as connect to a series of web pages, collect their content,  pick the exact points of needed data, and arrange them in a useful structure? Because other programmers have abstracted all these functions in such a way that we could do this series of tasks in just a few lines.</p>
<p>I&#8217;m going to write out an extremely-verbose way of performing these tasks to make each step clear&#8230;but as you get better, you&#8217;ll find ways to minimize your typing.</p>
<p><a name="steplist"></a><br />
<strong>Here&#8217;s the list of steps we&#8217;ll be doing</strong>, in somewhat plain English:</p>
<ul>
<li>1) Grab the contents of the presidents list</li>
<li>2) From that list, grab each president&#8217;s name</li>
<li>3) Determine if the last name is longer than 6 characters</li>
<li>4) If so, fetch the link to the president&#8217;s page and open it</li>
<li>5) Grab the age from the president&#8217;s page</li>
<li>6) Add up the data you gathered</li>
</ul>
</div>
<p>Before doing any of the above steps, we&#8217;re going to download a Ruby library that makes the above tasks trivially easy (that is, compared to starting from scratch)&#8230;</p>
<div class="sec">
<h3><a name="topic_nokogiri"></a>Nokogiri</h3>
<p>I won&#8217;t get into what &#8220;<a href="http://docs.rubygems.org/read/chapter/1">gems</a>&#8221; are in relation to the Ruby programming language; just think of them as pre-packaged functions and code that you can easily download and re-use for your own scripts.</p>
<p>Complete instructions can be found at the nokogiri homepage. You may run into a lot of errors&#8230;my advice is to copy part of the error message and Google it along with the word &#8220;nokogiri&#8221;; hopefully you&#8217;ll find an answer.</p>
<p>Hopefully, it&#8217;s as simple as going into your command-line console (exit the interpreter if you&#8217;re in there) and typing:</p>
<pre>&gt;&gt; gem install libxml-ruby
&gt;&gt; gem install nokogiri</pre>
<p>What is <a href="http://nokogiri.org/">nokogiri</a>? It&#8217;s a library of code that makes it easy to parse a webpage. Remember when you right-clicked on a webpage to view source, and how painful of a task it would be to collect, say, what the text of the third headline is&#8230;on 100 different pages? Nokogiri essentially allows you to do this with a couple lines of code. Check out the <a href="http://nokogiri.org/">homepage here</a>.</p>
</div>
<div class="sec">
<h4><a name="topic_step_one"></a>Step One: Fetch the Contents From the Presidents List</h4>
<p>Let&#8217;s try Nokogiri out. Open your ruby interpreter and type in the following commands; these first lines invoke the method <strong>require</strong>, which will give your script access to the required libraries of code, including nokogiri:</p>
<pre class="ruby" name="code">require 'rubygems'
require 'nokogiri'
require 'open-uri'</pre>
<p>This next line will fetch the contents of <a href="http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States">Wikipedia&#8217;s list of U.S. Presidents</a></p>
<pre class="ruby" name="code">	list_of_presidents = Nokogiri::HTML(open('http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States'))</pre>
<p>I&#8217;m going to quickly deconstruct this line:</p>
<ul>
<li><strong>Nokogiri::HTML</strong> specifies that we want the method named <strong>HTML</strong> that lives in the <strong>Nokogiri</strong> library. It takes raw HTML and parses it.</li>
<li><strong>open</strong> is the method that actually fetches the page. It doesn&#8217;t belong to Nokogiri; it comes from the <strong>open-uri</strong> library we required above, which lets <strong>open</strong> accept a web address. (On Ruby 3.0 and later, this call has to be written as <strong>URI.open</strong>.)</li>
<li>&#8216;http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States&#8217; is the string that holds the address of the page we want. The <strong>open</strong> method needs this to&#8230;well&#8230;know what to open.</li>
<li><strong>list_of_presidents</strong> is the variable that ends up holding what <strong>Nokogiri::HTML</strong> returns: the parsed page.</li>
</ul>
<p>OK, that one line, maybe the most complicated line we&#8217;ve written so far, just did a whole lot for you.</p>
<p>Using the <strong>open</strong> method (which takes a web page address as an argument), that line opened a connection with Wikipedia, performed the Internet protocols necessary to exchange information, and copied the content of the target page. <strong>Nokogiri::HTML</strong> then wrapped it all up in a <a href="http://nokogiri.org/Nokogiri/HTML/Document.html">Nokogiri data structure</a> for later manipulation. We are pointing to this data structure with the variable <span class="var">list_of_presidents</span>.</p>
<p>Let&#8217;s try to grab the contents of the second <strong>h2</strong> tag (i.e. the second, secondary headline)</p>
<pre class="ruby" name="code">list_of_presidents.xpath('//h2')[1].content
# result: "Presidents"</pre>
</div>
<div class="sec">
<h3><a name="topic_commandline"></a>Running scripts from Text Editors or the Command Line</h3>
<p>Running Ruby commands from the <a href="#topic_irb">Interactive Ruby prompt</a> is nice and all, for quick feedback. But from here on out, we&#8217;ll be writing a full-on script with a few dozen lines of code. So, it&#8217;ll be easier if you create a new text file with a file extension of .rb &#8230; something like, <strong>myfirstscript.rb</strong> to put your code in.</p>
<p>You should be using a <strong>text-editor</strong> for this&#8230;something better than Notepad, at least.</p>
<p>For <strong>Macs</strong>, there&#8217;s the free and excellent <a href="http://www.barebones.com/products/TextWrangler/">TextWrangler</a>. If you&#8217;re willing to spend some money, <a href="http://macromates.com/">TextMate</a> is what I use and it&#8217;s worth the $55. A free 30-day trial can be <a href="http://license.macromates.com/">downloaded here</a>.</p>
<p>For <strong>Windows</strong>, the one-click Ruby installer includes the free <a href="http://scintilla.org/SciTE.html">SciTE</a>. Also, there&#8217;s the free <a href="http://www.activestate.com/komodo_edit/">Komodo Edit</a>. For $35, there&#8217;s the &#8220;Textmate on Windows&#8221;, <a href="http://www.e-texteditor.com/">E-TextEditor</a> (<a href="http://www.e-texteditor.com/download/e_setup.exe">free trial here</a>).</p>
<p>Some of these text editors have a shortcut-key that allows you to run the script. For example, SciTE uses <strong>F5</strong>. Note how the output is conveniently displayed to the side:</p>
<div id="attachment_546" style="width: 654px" class="wp-caption aligncenter"><a href="https://danwin.com/words/wp-content/uploads/2010/02/windows-scite.png"><img class="size-full wp-image-546" title="windows-scite" src="https://danwin.com/words/wp-content/uploads/2010/02/windows-scite.png" alt="Writing a Ruby script in SciTE for Windows" width="644" height="378" /></a><p class="wp-caption-text">Writing a Ruby script in SciTE for Windows</p></div>
<p>There&#8217;s also the old-fashioned command line, from which you ran IRB. Navigate to the directory that you saved your file in. Then type &#8220;ruby <em>whatever_your_file_name_is.rb</em>&#8221;:</p>
<div id="attachment_547" style="width: 712px" class="wp-caption aligncenter"><a href="https://danwin.com/words/wp-content/uploads/2010/02/windows-cmd-line-ruby.png"><img class="size-full wp-image-547" title="windows-cmd-line-ruby" src="https://danwin.com/words/wp-content/uploads/2010/02/windows-cmd-line-ruby.png" alt="Running a script from the Windows command line" width="702" height="392" /></a><p class="wp-caption-text">Running a script from the Windows command line</p></div>
</div>
<div class="sec"><p>OK, here&#8217;s another high-level programming construct we&#8217;ll superficially try to cover&#8230;</p>
<h3><a name="topic_xpath"></a>XPath</h3>
<p><strong>XPath</strong> is a syntax used to address parts of HTML documents. It allows you, for example, to find all text that&#8217;s between headline, italics, paragraph, or whatever tags you want. You could also do something as specific as &#8220;Find the third link in every paragraph.&#8221;</p>
<div id="attachment_548" style="width: 370px" class="wp-caption aligncenter"><a href="https://danwin.com/words/wp-content/uploads/2010/02/xpath-bbb.jpg"><img class="size-full wp-image-548" title="xpath-bbb" src="https://danwin.com/words/wp-content/uploads/2010/02/xpath-bbb.jpg" alt="From Zvon.org, how to select all 'BBB' nodes using XPath" width="360" height="300" /></a><p class="wp-caption-text">From Zvon.org, how to select all &#39;BBB&#39; nodes using XPath</p></div>
<p>It&#8217;s another field of knowledge that you could spend your life memorizing. For our purposes, you just need to know that it&#8217;s a way to pinpoint an element, or a set of elements, in an HTML document.</p>
<pre class="ruby" name="code">list_of_presidents.xpath('//h2')[1].content
#result: "Presidents"</pre>
<p>Let&#8217;s dissect the above nokogiri command. <strong>list_of_presidents</strong> was a variable holding a Nokogiri data structure&#8230;essentially, the entirety of the Wikipedia page in a format that the Nokogiri library can understand.</p>
<p><strong>xpath</strong>, then, is an <a href="#topic_methods">instance method</a> of this data structure that takes a string as an argument. That string contains XPath syntax.</p>
<p>The string, in the above example, is <strong>&#8220;//h2&#8221;</strong>. In XPath syntax (check out <a href="http://www.w3schools.com/XPath/xpath_syntax.asp">W3Schools for a primer</a>), the double-slashes <strong>//</strong> tell the parser to look anywhere in the document. <strong>h2</strong> is the specific tag &#8211; a level-2 headline &#8211; that we want. And <strong>[1]</strong> denotes that the result of the <strong>xpath</strong> method is an array, of which we want the value at index 1 (technically, the second value of that array&#8230;remember that an array&#8217;s index starts at the <a href="http://blog.nicksieger.com/articles/2006/07/27/why-does-an-array-index-start-at-0-not-1">0th index</a>). And <strong>content</strong> is an instance method of what was at index 1: a nokogiri data structure. <strong>content</strong>, in this case, pulls what was in those <strong>h2</strong> tags: &#8220;<strong>Presidents</strong>&#8221;.</p>
<p>The above line could&#8217;ve been broken down into:</p>
<pre class="ruby" name="code">a = list_of_presidents
a = a.xpath('//h2')
a = a[1]
a = a.content
#result: "Presidents"</pre>
<p>That was a very simple XPath query. Another one could be:</p>
<pre class="ruby" name="code">list_of_presidents.xpath('//p/a[4]')</pre>
<div class="note"><p>Unlike Ruby arrays, XPath positions do <strong>not</strong> start at <strong>0</strong>. So <strong>a[1]</strong> refers to the first hyperlink (<strong>&lt;a&gt; tag</strong>) and <strong>a[4]</strong> to the fourth. That numbering only applies to the notation contained within the XPath string:</p>
<p><strong>list_of_presidents</strong>.xpath('//p/a[4]')[0]</p>
<p>&#8230;would refer to the first element of the array of fourth-hyperlinks that were inside <strong>p</strong> tags.</p></div>
<p>This will find the <em>4th</em> hyperlink in each <strong>p</strong>aragraph. If you try it out, you&#8217;ll get an array containing two elements&#8230;which makes sense, as there are only two paragraphs on this page (therefore, there can only be two <strong>fourth-in-a-paragraph</strong> hyperlinks)</p>
</div>
<div class="sec">
<h4><a name="topic_step_two"></a>Step 2: From a Table of Data, Fetch the President&#8217;s Name</h4>
<p>At this time, it&#8217;s worth looking at how Wikipedia lists its presidents:</p>
<div id="attachment_525" style="width: 1034px" class="wp-caption aligncenter"><a href="https://danwin.com/words/wp-content/uploads/2010/02/Screen-shot-2010-03-13-at-9.32.35-PM.png"><img class="size-large wp-image-525" title="Screen shot 2010-03-13 at 9.32.35 PM" src="https://danwin.com/words/wp-content/uploads/2010/02/Screen-shot-2010-03-13-at-9.32.35-PM-1024x370.png" alt="Wikipedia's List of Presidents of the United States" width="1024" height="370" /></a><p class="wp-caption-text">Wikipedia&#39;s List of Presidents of the United States</p></div>
<p>This is an HTML table. Each row appears to contain one president (there are sub-rows, which we&#8217;ll ignore, corresponding to each term). In the third column (the second column is the actual image file) are two important pieces of data for us: the president&#8217;s name and a link to that president&#8217;s Wikipedia page.</p>
<p>Remember that we wanted the age of each president. Unfortunately, that&#8217;s not listed on this table, so we&#8217;ll have to visit each page, where, presumably, an age is listed.</p>
<p>Visit w3Schools for a <a href="http://www.w3schools.com/html/html_tables.asp">quick primer on HTML tables</a>. But to be brief: <strong>tr</strong> designates a row and <strong>td</strong> designates a cell within that row (in effect, a column). Let&#8217;s put our installation of Firefox&#8217;s Firebug to use and confirm that the info we want &#8211; a president&#8217;s name &#8211; is indeed in the third column.</p>
<p>Right click on the hyperlink of John Adams and select <strong>Inspect Element</strong>. The Firebug panel should pop-up like so, showing that the third <strong>&lt;td&gt;</strong> element contains &#8220;John Adams&#8221;. More specifically, it contains the text &#8220;John Adams&#8221; in between &lt;a&gt; tags, <a href="#topic_link">which we learned marks off a hyperlink</a>. This will be important in the next step&#8230;</p>
<p><a name="john_adams_firebug"></a></p>
<div id="attachment_445" style="width: 510px" class="wp-caption aligncenter"><a href="https://danwin.com/words/wp-content/uploads/2010/02/john-adams-firebug.gif"><img class="size-medium wp-image-445" title="john-adams-firebug" src="https://danwin.com/words/wp-content/uploads/2010/02/john-adams-firebug-500x344.gif" alt="" width="500" height="344" /></a><p class="wp-caption-text">Using Firebug to find out the element containing &quot;John Adams&quot;</p></div>
<p>Adapting <a href="#topic_xpath">from our previous line of code</a> using XPath, let&#8217;s try this:</p>
<pre class="ruby" name="code">those_columns = list_of_presidents.xpath("//tr/td[3]")</pre>
<p>That XPath notation will find us the third <strong>&lt;td&gt;</strong> (column) inside every <strong>&lt;tr&gt;</strong> tag (row). That should spit out a large <strong>array</strong> of Nokogiri elements (as many as there are presidents).</p>
<p>We want the first of those, which is addressed in the 0th-index of that array&#8230;</p>
<pre class="ruby" name="code">those_columns[0].content
# result is: "George Washington[2][3][4][5]"</pre>
<p>So we got a name&#8230;but what&#8217;s with the bracketed numbers? If you look at the Wikipedia list again, you&#8217;ll see that those numbers are links to footnotes. Useful, but not to us. So how to extract just the name? Remember that <a href="#john_adams_firebug">each president&#8217;s name</a> is enclosed in a <strong>a</strong> (hyperlink) tag. And it&#8217;s the first hyperlink. So let&#8217;s make our previous XPath a little more complex:</p>
<pre class="ruby" name="code">george_washingtons_name = list_of_presidents.xpath("//tr/td[3]/a[1]")[0].content
# result: "George Washington"</pre>
<p>We&#8217;re now asking for the <strong>1st</strong> (a[1], in XPath notation, is asking for the first <strong>a</strong> tag) hyperlink, in the third column (<strong>td</strong>), in each row (<strong>tr</strong>). The result is the string <strong>&#8220;George Washington&#8221;</strong>.</p>
<h4><a name="topic_step_three"></a>Step 3: Determine if the Last Name Is Longer Than 6 Characters</h4>
<p>OK, now we have a name; how do we programmatically determine the length of the last name (remember, our goal is to search all presidents with last names with more than 6 letters)?</p>
<p><strong>The split and length methods of String</strong></p>
<p>First, let&#8217;s get the last name. It&#8217;s reasonable to assume that the last word in each string (&#8220;Bush&#8221; in &#8220;George W. Bush&#8221;) is the last name. Each word is set off by a <strong>space</strong>. So we are going to use a String instance method called <a href="http://ruby-doc.org/core/classes/String.html#M000803">split</a>, which will take a string and divide it into separate pieces, using a character we specify. The result is an Array of strings.</p>
<p>So:</p>
<pre class="ruby" name="code">the_last_name = george_washingtons_name.split(' ')[-1]
# Result: "Washington"</pre>
<p>The above line can be described thus:</p>
<ol>
<li>Take the string inside the variable <strong>george_washingtons_name</strong></li>
<li>Split it at every instance of a <strong>space</strong></li>
<li>Return the last element (the -1 index of an array returns the last element. -2 would return the second-to-last)</li>
</ol>
<p>The result is: &#8220;Washington&#8221; from the string &#8220;George Washington&#8221; is assigned to the variable <strong>the_last_name</strong></p>
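<p>If you want to see <strong>split</strong> and negative indexes in action before moving on, here&#8217;s a small sketch to paste into irb. The sample names are just strings typed in by hand (nothing is fetched from Wikipedia), and note the assumption baked into our approach: &#8220;Martin Van Buren&#8221; comes out as plain &#8220;Buren&#8221;.</p>
<pre class="ruby" name="code">words = "William Henry Harrison".split(' ')
puts words.inspect   # prints: ["William", "Henry", "Harrison"]
puts words[-1]       # prints: Harrison
puts words[-2]       # prints: Henry

# Our "last word is the last name" assumption is only an approximation:
van_buren_last = "Martin Van Buren".split(' ')[-1]
puts van_buren_last  # prints: Buren</pre>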
<p>Now, this is when we finally use the conditional branch statement <strong>if</strong>:</p>
<pre class="ruby" name="code">the_last_name.length &gt; 6
# result: true</pre>
<pre class="ruby" name="code">if the_last_name.length &gt; 6
 puts("Yep, greater than 6")
end
# result: Yep, greater than 6</pre>
<p><strong>length</strong> is an instance method of Strings. In the first bit of code, we basically asked: is the length of <strong>the_last_name</strong> greater than <strong>6</strong>? The interpreter says <strong>true</strong>.</p>
<p>In the second bit of code, we defined a branch statement, saying to print &#8220;Yep, greater than 6&#8221; if the condition in the <strong>if</strong> statement (<strong>the_last_name.length &gt; 6</strong>) was <strong>true</strong>. It was.</p>
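<p>What happens when the condition is false? A minimal sketch, using a hand-typed last name and a made-up variable called <strong>verdict</strong>: an <strong>else</strong> branch gives the interpreter something to do in that case.</p>
<pre class="ruby" name="code">the_last_name = "Adams"

if the_last_name.length &gt; 6
  verdict = "Yep, greater than 6"
else
  # this branch runs whenever the condition above is false
  verdict = "Nope, only #{the_last_name.length} letters"
end

puts(verdict)
# prints: Nope, only 5 letters</pre>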
</div>
<div class="sec">
<h4><a name="topic_step_four"></a>Step 4: If So, Fetch the Link to the President&#8217;s Page and Open It</h4>
<p>Here&#8217;s the code, in verbose form, that we&#8217;ve taken to get here&#8230;plus a few more lines that flesh out how we want the script to actually execute.</p>
<pre class="ruby" name="code">	# open the required libraries
	require 'rubygems'
	require 'nokogiri'
	require 'open-uri'

	# Using nokogiri, fetch Wikipedia's list of presidents page
	list_of_presidents = Nokogiri::HTML(open('http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States'))

	# Using another nokogiri method, grab the third column from every row, and from those, grab the first hyperlink (which contains the prez's name)
	an_array_of_links = list_of_presidents.xpath("//tr/td[3]/a[1]")</pre>
<p>So we dealt with George Washington&#8217;s name&#8230;but we want to deal with an array of presidential names. On each element, we want to execute the same operation (see if length of last name is greater than 6 letters, if so, fetch the link).</p>
<p>We&#8217;re going to use something called an <strong>each</strong> loop.</p>
<pre class="ruby" name="code">		count = 0

		an_array_of_links.each do |link_to_test|
		# This above statement can be read as: for each element in an_array_of_links, do
		# the following code (until the end line)
		# And as you go through each element, the variable used to reference the element will be named "link_to_test"

		   last_name = link_to_test.content.split(' ')[-1]   #remember that between the &lt;a&gt; tags was the president's name, with the last word being the last  name
			if last_name.length &gt; 6
				the_link_to_the_presidents_page = link_to_test["href"]
				# We'll get to this part in the next section...
			end

		end
		# OK, we're at the end of the each loop. Go back to the top</pre>
<p>I&#8217;m not going to dissect this. It&#8217;s enough to know that <strong>each</strong> is a method of an Array, and the code inside <strong> each do</strong> and <strong>end</strong> is executed for each element of an Array.</p>
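<p>If the loop feels abstract, here&#8217;s roughly the same idea with no web pages involved: a sketch that runs <strong>each</strong> over a hand-typed array of names. The array and the <strong>long_last_names</strong> variable are made up for illustration.</p>
<pre class="ruby" name="code">names = ["George Washington", "John Adams", "Thomas Jefferson"]
long_last_names = []

names.each do |name|
  # this code runs once per element; "name" holds the current element
  last_name = name.split(' ')[-1]
  if last_name.length &gt; 6
    long_last_names.push(last_name)
  end
end

puts long_last_names.inspect
# prints: ["Washington", "Jefferson"]</pre>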
</div>
<div class="sec">
<p>OK, using the code above, we are looping through all the presidents&#8217; names and page links. On each name, we&#8217;re testing the length of the last name. And if the last name is longer than 6 letters&#8230;we&#8217;re going to open the link and grab the president&#8217;s age.</p>
<p>So:</p>
<pre class="ruby" name="code">	if last_name.length &gt; 6
		the_link_to_the_presidents_page = link_to_test["href"] 

		# OK, the value of href is going to be something like "/wiki/George_Washington". That's an address relative to the Wikipedia site
		# so we need to prepend "http://en.wikipedia.org" to have a valid address...

		the_link_to_the_presidents_page = "http://en.wikipedia.org"+the_link_to_the_presidents_page

		# now let's fetch that page

		the_presidents_page = Nokogiri::HTML(open(the_link_to_the_presidents_page))

		# ... OK, now what?

	end</pre>
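<p>Gluing the two strings together works fine here. As an aside, Ruby&#8217;s standard <strong>uri</strong> library can do the same job with <strong>URI.join</strong>; this sketch just compares the two approaches on a hand-typed link and doesn&#8217;t fetch anything:</p>
<pre class="ruby" name="code">require 'uri'

relative_link = "/wiki/George_Washington"

# Plain string concatenation, as in the code above
by_hand = "http://en.wikipedia.org" + relative_link

# URI.join resolves the relative path against the base address
joined = URI.join("http://en.wikipedia.org", relative_link).to_s

puts by_hand
puts joined
# both print: http://en.wikipedia.org/wiki/George_Washington</pre>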
<h4><a name="topic_step_five"></a>Step 5: Grab the age from the president&#8217;s page</h4>
<p>All right, so <strong>the_presidents_page</strong> now holds all the HTML inside one of the presidents&#8217; pages. We need to scope it out to find the XPath necessary to fetch the age of the president.</p>
<p>Let&#8217;s take a look at George Washington&#8217;s page. More specifically, look at the sidebar to the right, which contains his vital statistics:</p>
<div id="attachment_462" style="width: 534px" class="wp-caption aligncenter"><a href="https://danwin.com/words/wp-content/uploads/2010/02/george_washington_sidebar_wiki.gif"><img class="size-full wp-image-462" title="george_washington_sidebar_wiki" src="https://danwin.com/words/wp-content/uploads/2010/02/george_washington_sidebar_wiki.gif" alt="" width="524" height="461" /></a><p class="wp-caption-text">George Washington&#39;s Wikipedia Sidebar</p></div>
<p>As you can see, the age is listed, next to the &#8220;Died&#8221; line.</p>
<p>Using Firebug to check out the structure tells us that the sidebar is a table, and the death date is in the <strong>&lt;td&gt;</strong> cell that immediately follows the <strong>&lt;th&gt;</strong> cell containing the text &#8220;Died&#8221;.</p>
<div id="attachment_463" style="width: 531px" class="wp-caption aligncenter"><a href="https://danwin.com/words/wp-content/uploads/2010/02/george_washington_firebug.gif"><img class="size-full wp-image-463" title="Firebug Inspection of George Washington Sidebar" src="https://danwin.com/words/wp-content/uploads/2010/02/george_washington_firebug.gif" alt="Firebug Inspection of George Washington Sidebar" width="521" height="226" /></a><p class="wp-caption-text">Firebug Inspection of George Washington Sidebar</p></div>
<p>OK, we&#8217;re going to have to use XPath to target those specific cells. Let&#8217;s test it out on George Washington&#8217;s page. I&#8217;m just going to provide you the XPath syntax; you&#8217;re welcome to read <a href="http://www.w3schools.com/XPath/xpath_syntax.asp">W3Schools&#8217; tutorial</a> to figure out why it works:</p>
<pre class="ruby" name="code">	george = Nokogiri::HTML(open('http://en.wikipedia.org/wiki/George_Washington'))
	death_date = george.xpath("//th[contains(text(), 'Died')]/following-sibling::*")[0].content 

	# =&gt; "December 14, 1799 (aged 67)Mount Vernon, Virginia,\nUnited States"</pre>
<p>(Some references to the syntax above: <a href="http://bytes.com/topic/net/answers/176666-text-string-finder">contains</a>, <a href="http://www.zvon.org/xxl/XPathTutorial/Output/example15.html">following-sibling</a>)</p>
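<p>If you&#8217;d like to poke at that XPath without hitting Wikipedia, here&#8217;s a sketch that runs it against a tiny table typed in by hand. The HTML is a made-up, stripped-down stand-in for the sidebar, not the real page markup:</p>
<pre class="ruby" name="code">require 'nokogiri'

# A hand-written stand-in for the sidebar table
sidebar_html = "&lt;table&gt;" +
  "&lt;tr&gt;&lt;th&gt;Born&lt;/th&gt;&lt;td&gt;February 22, 1732&lt;/td&gt;&lt;/tr&gt;" +
  "&lt;tr&gt;&lt;th&gt;Died&lt;/th&gt;&lt;td&gt;December 14, 1799 (aged 67)&lt;/td&gt;&lt;/tr&gt;" +
  "&lt;/table&gt;"

sidebar = Nokogiri::HTML(sidebar_html)

# Find the th whose text contains "Died", then take whatever sits next to it in the same row
cells_after_died = sidebar.xpath("//th[contains(text(), 'Died')]/following-sibling::*")

puts cells_after_died.length       # prints: 1
puts cells_after_died[0].content   # prints: December 14, 1799 (aged 67)</pre>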
<p>Well, <strong>death_date</strong> contains more than we wanted. How do we just get the <strong>67</strong> from the <strong>aged 67</strong> part? There&#8217;s no html tag that sets <strong>67</strong> off (our job would have been so easy if it had been &lt;age&gt;67&lt;/age&gt;).</p>
<p>The last new topic you&#8217;ll learn in order to complete the task is <strong>regular expressions</strong>.</p>
</div>
<div class="sec">
<h3><a name="topic_regex"></a>Regular Expressions, aka regexes</h3>
<p>Again, like HTML and XPath, regular expressions aren&#8217;t &#8220;programming&#8221;, but it&#8217;s a universe of syntax that requires entire books to describe. Put simply, regular expressions allow you to grab strings of text that match a pattern.</p>
<div id="attachment_550" style="width: 824px" class="wp-caption aligncenter"><a href="http://www.regular-expressions.info/examples.html"><img class="size-full wp-image-550" title="Untitled-4" src="https://danwin.com/words/wp-content/uploads/2010/02/Untitled-4.png" alt="From regular-expressions.info, how to match HTML tags" width="814" height="251" /></a><p class="wp-caption-text">From regular-expressions.info, how to match HTML tags</p></div>
<p>In this case, the pattern I want is: <em>a number, <strong>two to three digits long</strong>, that comes <strong>after</strong> the word <strong>&#8220;aged &#8221;</strong></em></p>
<p>I won&#8217;t go into the specifics here&#8230;I&#8217;ve found that you can learn regular expressions with a little reading and trial and error. In this case, the pattern I want, in regex terms, is <strong>/aged.+?([0-9]+)/</strong> (note: although the text on the Wikipedia page reads something like &#8220;aged 67&#8221;, the space in between is a special HTML character, hence, the <strong>.+?</strong> used to capture it in the reg ex&#8230;don&#8217;t worry, that last sentence will make perfect sense when you someday understand reg exes.).</p>
<p>In descriptive English, this pattern is going to capture (what&#8217;s in the parentheses) any digits from 0-9 that follow the character sequence <strong>aged</strong>. The forward-slashes denote the beginning and end of the regex.</p>
<p>Again, a regular expression is a syntax, not an actual programming function. So we need to call String&#8217;s instance method, <a href="http://ruby-doc.org/core/classes/String.html#M000778">match</a>, which executes a text search based on the regular expression that you pass into it. Like so:</p>
<pre class="ruby" name="code">death_date = george.xpath("//th[contains(text(), 'Died')]/following-sibling::*")[0].content
age_at_death = <strong>death_date.match(/aged.+?([0-9]+)/)[1]</strong></pre>
<p>As you can guess, <a href="http://ruby-doc.org/core/classes/String.html#M000778">match</a> returns something you can index like an array (technically a MatchData object, or <strong>nil</strong> if nothing matched). I don&#8217;t want to explain the <strong>match</strong> method in full here, but the 0th element contains the entire match, which would be &#8220;aged 67&#8221;, and the 1st element returns what was in between the parentheses of my regular expression&#8230;the pattern for a multi-digit number, i.e. 67. Again, you just have to learn about reg exes for this to make more sense.</p>
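<p>Here&#8217;s a sketch you can run without fetching any page. The sample strings are typed in by hand; the <strong>\u00a0</strong> is my stand-in for the special space character mentioned above. It also shows what happens when the pattern isn&#8217;t found at all:</p>
<pre class="ruby" name="code">death_text = "December 14, 1799 (aged\u00a067)"  # \u00a0 is a non-breaking space

result = death_text.match(/aged.+?([0-9]+)/)
puts result[0].inspect   # the whole match, special space included
puts result[1]           # prints: 67

# When the pattern isn't in the string, match returns nil instead
no_age = "no age mentioned here".match(/aged.+?([0-9]+)/)
puts no_age.inspect      # prints: nil</pre>
<p>That <strong>nil</strong> matters: asking for <strong>[1]</strong> on it would stop the script with an error, so it&#8217;s worth checking that the match exists first.</p>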
<div class="note">You don&#8217;t have to be a programmer to appreciate regular expressions. Ever do find and replace in a text editor? Let&#8217;s say you have a bunch of text with numbers sprinkled through&#8230;and those numbers were supposed to have $ signs in front of them. There&#8217;s no simple find-and-replace that can replace every group of numbers (9, 12.3, 0.55) with ($9, $12.3, $0.55); but in text-editors that support regexes, you could do such a replacement in one command. This is pretty invaluable if you&#8217;ve ever had to clean up &#8220;dirty&#8221; comma-delimited files.</div>
<p>Bookmark <a href="http://www.regular-expressions.info/">regular-expressions.info</a> and save yourself a lot of time in learning about reg exes.</p>
</div>
<div class="sec">
<h4><a name="topic_step_six"></a>Step 6: Add up the data you gathered</h4>
<p>So now we&#8217;ve gotten to our goal: retrieving a president&#8217;s age from his Wikipedia page. Now we just need to add it all up and take the average.</p>
<p>Here are the remaining things we have to do, in narrative form:<br />
Before we go into each president&#8217;s page, we need a variable to hold the sum of all the ages (<strong>total_age</strong>). And we&#8217;ll need a variable to keep track of how many presidents&#8217; ages we&#8217;ve retrieved (<strong>prez_count</strong>). However, not every page is going to have an age&#8230;since not all former presidents have passed away. So, if the &#8220;age&#8221; datapoint exists, add it to the <strong>total_age</strong> variable and increment <strong>prez_count</strong>. If not, then do nothing, and go on to the next president until we&#8217;ve gone through all the presidents.</p>
<p>Once we&#8217;ve finished looping through the pages of presidents, divide <strong>total_age</strong> by <strong>prez_count</strong>. And we&#8217;re done.</p>
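<p>That narrative, boiled down to a sketch with no web pages involved. The <strong>ages_found</strong> array is a made-up stand-in for what we&#8217;d pull off each page, with <strong>nil</strong> marking a page that had no age on it:</p>
<pre class="ruby" name="code">ages_found = ["67", nil, "78"]   # hypothetical results; nil means no age was found
total_age = 0
prez_count = 0

ages_found.each do |age|
  if age
    total_age += age.to_i   # the ages arrive as Strings, so convert before adding
    prez_count += 1
  end
end

average = total_age / prez_count.to_f

puts total_age                # prints: 145
puts total_age / prez_count   # prints: 72 (dividing two whole numbers drops the remainder)
puts average                  # prints: 72.5</pre>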
</div>
<div class="sec">
<h2><a name="topic_complete"></a>The complete script</h2>
<p>The final code is as follows (I&#8217;ve added several <strong>puts</strong> statements to notify you where in the execution the script is&#8230;it should take less than 2 minutes):</p>
<pre class="ruby" name="code">	require 'rubygems'
	require 'nokogiri'
	require 'open-uri'

	list_of_presidents = Nokogiri::HTML(open('http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States'))

	an_array_of_links = list_of_presidents.xpath("//tr/td[3]/a[1]")

	## These two variables will be added to throughout the execution of the script
	## At the end, they'll have the answers

	prez_count = 0
	total_age = 0

	an_array_of_links.each do |link_to_test|  

	   last_name = link_to_test.content.split(' ')[-1]   

		if last_name.length &gt; 6
			the_link_to_the_presidents_page = link_to_test["href"]
			the_link_to_the_presidents_page = "http://en.wikipedia.org" + the_link_to_the_presidents_page
			prez_page = Nokogiri::HTML(open(the_link_to_the_presidents_page))

			puts "Entering the page: #{the_link_to_the_presidents_page}"

			death_date = prez_page.xpath("//th[contains(text(), 'Died')]/following-sibling::*")

      if death_date &amp;&amp; death_date[0]
        # Doing something like `if some_variable_name` is basically asking, "Does some_variable_name have any value?".
        # The test fails only if some_variable_name is false or nil (nil is what you get when nothing was found). Anything else, even 0 or an empty string, counts as true
        # The double ampersand &amp;&amp; functions as an "AND", requiring that two conditional tests be true before entering the if-statement's true branch

			  age_at_death = death_date[0].content.match(/aged.+?([0-9]+)/)
  	  		if age_at_death
  	  		  # we only get here if there was a "Died" table cell AND a text pattern similar to: "aged XX"
  	  		  puts "Age of #{link_to_test.content} is: #{age_at_death[1]}"
  	  			total_age += age_at_death[1].to_i  # age_at_death[1] is a String holding the captured digits. to_i will make it a Number so we can safely add it to total_age
  	  			prez_count += 1
  	  		end #end of the if age_at_death
  	  end # end of the if death_date...
	  else
	    # we reach this branch of code if last_name was 6 letters or shorter. Let's print a debug message to notify us:
	    puts "#{last_name} is not longer than 6 letters"
		end #end of the if last_name.length &gt; 6

	end # OK, we're at the end of the each loop. Go back to the top

	# if we got here, we're out of the loop, and total_age and prez_count have the right values. So:
	the_final_value = total_age/prez_count.to_f  # to_f converts an integer to a decimal number, so we'll get partial years for the average
	puts "#{prez_count} presidents were counted, their age totaling: #{total_age}."
	puts "The average of their ages is #{the_final_value}"</pre>
<p>As of Feb. 2010, running that script produces this output:</p>
<blockquote style="height: 300px; overflow: auto;"><p>Entering the page: http://en.wikipedia.org/wiki/George_Washington<br />
Age of George Washington is: 67<br />
Adams is not longer than 6 letters<br />
Entering the page: http://en.wikipedia.org/wiki/Thomas_Jefferson<br />
Age of Thomas Jefferson is: 83<br />
Entering the page: http://en.wikipedia.org/wiki/James_Madison<br />
Age of James Madison is: 85<br />
Monroe is not longer than 6 letters<br />
Adams is not longer than 6 letters<br />
Entering the page: http://en.wikipedia.org/wiki/Andrew_Jackson<br />
Age of Andrew Jackson is: 78<br />
Buren is not longer than 6 letters<br />
Entering the page: http://en.wikipedia.org/wiki/William_Henry_Harrison<br />
Age of William Henry Harrison is: 68<br />
Tyler is not longer than 6 letters<br />
Polk is not longer than 6 letters<br />
Taylor is not longer than 6 letters<br />
Entering the page: http://en.wikipedia.org/wiki/Millard_Fillmore<br />
Age of Millard Fillmore is: 74<br />
Pierce is not longer than 6 letters<br />
Entering the page: http://en.wikipedia.org/wiki/James_Buchanan<br />
Age of James Buchanan is: 77<br />
Entering the page: http://en.wikipedia.org/wiki/Abraham_Lincoln<br />
Age of Abraham Lincoln is: 56<br />
Entering the page: http://en.wikipedia.org/wiki/Andrew_Johnson<br />
Age of Andrew Johnson is: 66<br />
Grant is not longer than 6 letters<br />
Hayes is not longer than 6 letters<br />
Entering the page: http://en.wikipedia.org/wiki/James_A._Garfield<br />
Age of James A. Garfield is: 49<br />
Arthur is not longer than 6 letters<br />
Entering the page: http://en.wikipedia.org/wiki/Grover_Cleveland<br />
Age of Grover Cleveland is: 71<br />
Entering the page: http://en.wikipedia.org/wiki/Benjamin_Harrison<br />
Age of Benjamin Harrison is: 67<br />
Entering the page: http://en.wikipedia.org/wiki/Grover_Cleveland<br />
Age of Grover Cleveland is: 71<br />
Entering the page: http://en.wikipedia.org/wiki/William_McKinley<br />
Age of William McKinley is: 58<br />
Entering the page: http://en.wikipedia.org/wiki/Theodore_Roosevelt<br />
Age of Theodore Roosevelt is: 60<br />
Taft is not longer than 6 letters<br />
Wilson is not longer than 6 letters<br />
Entering the page: http://en.wikipedia.org/wiki/Warren_G._Harding<br />
Age of Warren G. Harding is: 57<br />
Entering the page: http://en.wikipedia.org/wiki/Calvin_Coolidge<br />
Age of Calvin Coolidge is: 60<br />
Hoover is not longer than 6 letters<br />
Entering the page: http://en.wikipedia.org/wiki/Franklin_D._Roosevelt<br />
Age of Franklin D. Roosevelt is: 63<br />
Truman is not longer than 6 letters<br />
Entering the page: http://en.wikipedia.org/wiki/Dwight_D._Eisenhower<br />
Age of Dwight D. Eisenhower is: 78<br />
Entering the page: http://en.wikipedia.org/wiki/John_F._Kennedy<br />
Age of John F. Kennedy is: 46<br />
Entering the page: http://en.wikipedia.org/wiki/Lyndon_B._Johnson<br />
Age of Lyndon B. Johnson is: 64<br />
Nixon is not longer than 6 letters<br />
Ford is not longer than 6 letters<br />
Carter is not longer than 6 letters<br />
Reagan is not longer than 6 letters<br />
Bush is not longer than 6 letters<br />
Entering the page: http://en.wikipedia.org/wiki/Bill_Clinton<br />
Bush is not longer than 6 letters<br />
Obama is not longer than 6 letters<br />
21 presidents were counted, their age totaling: 1398.<br />
The average of their ages is 66.5714285714286</p></blockquote>
</div>
<div class="sec">
<h3><a name="topic_end"></a>The End?</h3>
<p>Well, congratulations&#8230;you accomplished a trivial task, but you learned a set of methods that you can apply to much more important goals. If you&#8217;re a complete newbie to programming, hopefully this tutorial has given you a glimpse of what&#8217;s involved. And how, once you firm up your programming fundamentals, you can get real work done.</p>
<p>But I need to stress that this tutorial simplified things as much as possible&#8230;at the cost of best-practices programming. I chose Wikipedia as a target because it&#8217;s a reasonably well-structured, high-traffic site that has an ethos of making volumes of information available for the public good.</p>
<p>The script that we just wrote is a naive little child that gets what it wants as fast as it wants. In the real world, many sites that you attempt to scrape will not be so forgiving. Some sites will block you, or fail to connect, if you try to read a hundred pages at once. Some sites will have horrific HTML that will require much more complicated XPath and regular expression syntax. Sometimes, your internet connection might drop. All of this will cause the above script to die an ugly and premature death. Or even worse: collect bad data that you won&#8217;t know was erroneous.</p>
<p>All of these problems are solvable, but like any task, it takes experience that comes from trying and failing. Hopefully, this tutorial at least shows you how easy it is to try.</p>
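<p>To give a taste of what a fix can look like, here&#8217;s one possible sketch for the dropped-connection problem: a helper that retries a block of code a few times, pausing in between. The name <strong>with_retries</strong> and its options are my own invention, not part of Ruby or Nokogiri, and the &#8220;flaky&#8221; block below just pretends to be a page fetch that fails once.</p>
<pre class="ruby" name="code"># Hypothetical helper: run the given block, and if it raises an error,
# wait a bit and try again, up to max_attempts times in total
def with_retries(max_attempts: 3, wait_seconds: 1)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError
    raise if attempts == max_attempts   # out of tries, so let the error through
    sleep(wait_seconds)                 # be polite and pause before trying again
    retry
  end
end

# A stand-in for a fetch that fails the first time and works the second
calls = 0
result = with_retries(max_attempts: 3, wait_seconds: 0) do
  calls += 1
  raise "connection dropped" if calls == 1
  "page contents"
end

puts result   # prints: page contents
puts calls    # prints: 2</pre>
<p>In the real script, the block would wrap the line that opens each president&#8217;s page.</p>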
<p><strong>Other resources:</strong></p>
<ul>
<li>Tutorials on <a href="http://www.w3schools.com/html/default.asp">HTML</a>, <a href="http://www.w3schools.com/css/default.asp">CSS</a>, <a href="http://www.w3schools.com/xpath/default.asp">XPath</a> and a bunch of other useful topics,  from W3Schools</li>
<li><a href="http://ruby.about.com/">About.com&#8217;s guide to Ruby</a></li>
<li><a href="http://www.ruby-lang.org/en/downloads/">Installing Ruby</a></li>
<li><a href="http://www.sapphiresteel.com/The-Little-Book-Of-Ruby">The Little Book of Ruby</a> &#8211; A free e-book from SapphireSteel Software</li>
<li><a href="http://stdlib.rubyonrails.org">Ruby Standard Library Documentation</a></li>
<li><a href="http://nokogiri.org/tutorials">Nokogiri Tutorial</a></li>
<li><a href="http://railscasts.com/episodes/190-screen-scraping-with-nokogiri">Video tutorial of Nokogiri Screen Scraping</a> from Railscasts</li>
<li><a href="http://en.wikipedia.org/wiki/Ruby_%28programming_language%29#Examples">Ruby Examples</a> from Wikipedia</li>
<li><a href="http://www.zvon.org/xxl/XPathTutorial/General/examples.html">XPath tutorial</a> from Zvon</li>
<li><a href="http://www.regular-expressions.info/">Regular-Expressions.info</a>, pretty much the best regular expression resource online</li>
<li><a href="http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/">A printable cheat-sheet for regular expressions</a></li>
</ul>
</div>
</div>
<p>See my <a href="https://danwin.com/works/coding-for-journalists-101-a-four-part-series/">four-part series on web-scraping for journalists here</a>.</p>
<p>The post <a rel="nofollow" href="https://danwin.com/2010/04/coding-for-journalists-go-from-a-know-nothing-to-web-scraper-in-an-hour-hopefully/">Coding for Journalists 101: Go from knowing nothing to scraping Web pages. In an hour. Hopefully.</a> appeared first on <a rel="nofollow" href="https://danwin.com">danwin.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>https://danwin.com/2010/04/coding-for-journalists-go-from-a-know-nothing-to-web-scraper-in-an-hour-hopefully/feed/</wfw:commentRss>
		<slash:comments>22</slash:comments>
		</item>
	</channel>
</rss>
