<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>danwin.com &#187; coding</title>
	<atom:link href="https://danwin.com/tag/coding/feed/" rel="self" type="application/rss+xml" />
	<link>https://danwin.com</link>
	<description>Words, photos, and code by Dan Nguyen. The &#039;g&#039; is mostly silent.</description>
	<lastBuildDate>Thu, 21 Nov 2019 12:29:57 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.2.39</generator>
	<item>
		<title>dataist blog: An inspiring case for journalists learning to code</title>
		<link>https://danwin.com/2011/02/dataist-blog-an-inspiring-case-for-journalists-learning-to-code/</link>
		<comments>https://danwin.com/2011/02/dataist-blog-an-inspiring-case-for-journalists-learning-to-code/#comments</comments>
		<pubDate>Wed, 16 Feb 2011 13:00:32 +0000</pubDate>
		<dc:creator><![CDATA[Dan Nguyen]]></dc:creator>
				<category><![CDATA[thoughts]]></category>
		<category><![CDATA[works]]></category>
		<category><![CDATA[coding]]></category>
		<category><![CDATA[Dollars for Docs]]></category>
		<category><![CDATA[journalism]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[propublica]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">https://danwin.com/?p=1582</guid>
		<description><![CDATA[<p>About a year ago I threw up a long, rambling guide hoping to teach non-programming journalists some practical code. Looking back at it, it seems inadequate. Actually, I misspoke, I haven&#8217;t looked back at it because I&#8217;m sure I&#8217;ll just spend the next few hours cringing. For example, what a dumb idea it was to [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://danwin.com/2011/02/dataist-blog-an-inspiring-case-for-journalists-learning-to-code/">dataist blog: An inspiring case for journalists learning to code</a> appeared first on <a rel="nofollow" href="https://danwin.com">danwin.com</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p><a href="https://danwin.com/2011/02/dataist-blog-an-inspiring-case-for-journalists-learning-to-code/pills-keyboard-300x200/" rel="attachment wp-att-1596"><img src="https://danwin.com/words/wp-content/uploads/2011/02/pills-keyboard-300x200.jpg" alt="" title="pills-keyboard-300x200" width="300" height="200" class="alignleft size-full wp-image-1596" /></a> About a year ago <a href="https://danwin.com/coding-for-journalists-a-four-part-series/">I threw up a long, rambling guide</a> hoping to teach non-programming journalists some practical code. Looking back at it, it seems inadequate. Actually, I misspoke, I haven&#8217;t looked back at it because I&#8217;m sure I&#8217;ll just spend the next few hours cringing. For example, what a dumb idea it was to put everything from <a href="https://danwin.com/works/coding-for-journalists-go-from-a-know-nothing-to-web-scraper-in-an-hour-hopefully/">&#8220;What is HTML&#8221; to actual Ruby scraping code all in a gigantic, badly formatted post</a>.</p>
<p>The series of articles has gotten a fair number of hits but I don&#8217;t know how many people were able to stumble through it. Last week, though, I noticed this <a href="http://dataist.wordpress.com/2011/02/05/mapping-ratata-whos-hot/">recent trackback from dataist</a>, a new &#8220;blog about data exploration&#8221; by Finnish journo <a href="http://jensfinnas.com/">Jens Finnäs</a>. He writes that he has &#8220;almost no prior programming experience&#8221; but, after going through my tutorials and checking out <a href="http://scraperwiki.com/">ScraperWiki</a>, was <a href="http://dataist.wordpress.com/2011/02/05/mapping-ratata-whos-hot/">able to produce this cool network graph of the Ratata blog network after about &#8220;two days of trial and error&#8221;:</a></p>
<div id="attachment_1597" style="width: 510px" class="wp-caption aligncenter"><a href="http://dataist.wordpress.com/2011/02/05/mapping-ratata-whos-hot/"><img src="https://danwin.com/words/wp-content/uploads/2011/02/dataist-pdf.gif" alt="Mapping of Ratata blogging network by Jens Finnäs of dataist.wordpress.com" title="Mapping of Ratata blogging network by Jens Finnäs of dataist.wordpress.com" width="500" height="311" class="size-full wp-image-1597" /></a><p class="wp-caption-text">Mapping of Ratata blogging network by Jens Finnäs of dataist.wordpress.com</p></div>
<p>I hope other non-coders who are still intimidated by the thought of learning programming are inspired by Finnäs&#8217;s <a href="http://dataist.wordpress.com/2011/02/05/mapping-ratata-whos-hot/">example</a>. Becoming good at coding is not a trivial task. But even the first steps can teach a non-coder some profound lessons about data, lessons that are important enough on their own. And if you&#8217;re a curious type with a question you want to answer, you&#8217;ll soon figure out a way to put something together, as Finnäs did.</p>
<p>ProPublica&#8217;s <a href="http://projects.propublica.org/docdollars/">Dollars for Docs project</a> originated in part from this <a href="https://danwin.com/2010/04/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/">Pfizer-scraping lesson</a> I added on to my <a href="https://danwin.com/coding-for-journalists-a-four-part-series/">programming tutorial</a>: I needed a timely example of public data that wasn&#8217;t as useful as it should be.</p>
<p>My colleagues Charles Ornstein and Tracy Weber may not be programmers (yet), but they are experienced enough with data to know its worth as an investigative resource, and turned an <a href="http://www.propublica.org/nerds/item/the-coders-cause-in-dollars-for-docs">exercise</a> in transparency into a <a href="http://projects.propublica.org/docdollars">focused and effective investigation</a>. It&#8217;s not trivial to find a story in data. Besides being able to do Access queries themselves, C&#038;T knew both the limitations of the data (for example, it&#8217;s difficult to make comparisons between the companies because of <a href="http://projects.propublica.org/docdollars/payment_reports">different reporting periods</a>) and its possibilities, such as the cross-checking of names en masse from the payment lists with state and federal doctor databases.</p>
<p>Their <a href="http://www.propublica.org/series/nurses">investigation into the poor regulation of California nurses</a> &ndash; a collaboration with the LA Times that was a <a href="http://www.pulitzer.org/citation/2010-Public-Service">Pulitzer finalist in the Public Service category</a> &ndash; was similarly data-oriented. They (and the LA Times&#8217; Maloy Moore and Doug Smith) had been diligently building a database of thousands of nurses &ndash; including their disciplinary records and the time it took for the nursing board to act &ndash; which made my part in <a href="http://projects.propublica.org/nurses">building a site</a> to graphically represent the data extremely simple.</p>
<p>The point of all this is: don&#8217;t put off your personal data-training because you think it requires a computer science degree, or that you have to become great at it in order for it to be useful. Even if, after a week of learning, you can barely put together a script to alphabetize your tweets, you&#8217;ll likely gain enough insight into how data is structured and made useful to help with just about every other aspect of your reporting repertoire.</p>
<p>In fact, just knowing to avoid taking notes like this:</p>
<blockquote><p>
Colonel Mustard used the revolver in the library? (not library)<br />
Miss Scarlet used the Candlestick in the dining room? (not Scarlet)<br />
&#8220;Mrs. Peacock, in the dining room, with the <s>revolver</s>? &#8220;<br />
&#8220;Colonel Mustard, rope, <s>conservatory</s>?&#8221;<br />
Mustard? Dining room? Rope (nope)?<br />
&#8220;Was it Mrs. Peacock with the <s>candlestick</s>, inside the dining room?&#8221;
</p></blockquote>
<p>And instead, recording them like this:</p>
<table>
<thead>
<tr>
<th>Who/What?</th>
<th>Role?</th>
<th>Ruled out?</th>
</tr>
</thead>
<tbody>
<tr>
<td>Mustard</td>
<td>Suspect</td>
<td>N</td>
</tr>
<tr>
<td>Scarlet</td>
<td>Suspect</td>
<td>Y</td>
</tr>
<tr>
<td>Peacock</td>
<td>Suspect</td>
<td>N</td>
</tr>
<tr>
<td>Revolver</td>
<td>Weapon</td>
<td>Y</td>
</tr>
<tr>
<td>Candlestick</td>
<td>Weapon</td>
<td>Y</td>
</tr>
<tr>
<td>Rope</td>
<td>Weapon</td>
<td>Y</td>
</tr>
<tr>
<td>Conservatory</td>
<td>Place</td>
<td>Y</td>
</tr>
<tr>
<td>Dining Room</td>
<td>Place</td>
<td>N</td>
</tr>
<tr>
<td>Library</td>
<td>Place</td>
<td>Y</td>
</tr>
</tbody>
</table>
<p>&#8230;will make you a significantly more effective reporter, as well as position you to have your reporting and research become much more ready for thorough analysis and online projects.</p>
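<p>To make the payoff concrete, here is a minimal sketch of the same table as Ruby data. The <b>Clue</b> struct and the variable names are invented for illustration; the only point is that once each note is a row with the same fields, a question like &#8220;what is still in play?&#8221; becomes a short filter instead of a re-read of the whole notebook.</p>
<pre name="code" class="ruby">
# Each note becomes one row with the same three fields as the table above.
# (Clue is a made-up name for this sketch, not part of any library.)
Clue = Struct.new(:name, :role, :ruled_out)

clues = [
  Clue.new("Mustard",      "Suspect", false),
  Clue.new("Scarlet",      "Suspect", true),
  Clue.new("Peacock",      "Suspect", false),
  Clue.new("Revolver",     "Weapon",  true),
  Clue.new("Candlestick",  "Weapon",  true),
  Clue.new("Rope",         "Weapon",  true),
  Clue.new("Conservatory", "Place",   true),
  Clue.new("Dining Room",  "Place",   false),
  Clue.new("Library",      "Place",   true)
]

# Keep only the rows that have not been ruled out, then group them by role.
remaining = clues.reject { |c| c.ruled_out }.group_by { |c| c.role }

remaining.each do |role, rows|
  puts "#{role}: #{rows.map { |c| c.name }.join(', ')}"
end
</pre>
<p>With the table exactly as written above, every weapon is marked as ruled out, so no weapon shows up in the result. That kind of gap is easy to miss in freeform notes and hard to miss in rows.</p>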
<p>There&#8217;s a motherlode of programming resources available through a single Google search. My high school journalism teacher told us that if you want to do journalism, don&#8217;t major in it, just do it. I think the same can be said for programming. I&#8217;m glad I chose a computer field as an undergraduate so that I&#8217;m familiar with the theory. But if you have a career in reporting or research, you have real-world data-needs that most undergrads don&#8217;t. I&#8217;ve found that having those goals and needing to accomplish them has pushed my coding expertise along far more quickly than any coursework did.</p>
<p>If you aren&#8217;t set on learning to program, but want to get a better grasp of data, I recommend learning:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Regular_expression">Regular expressions</a> &#8211; a set of character patterns, easily printable on a cheat-sheet for memorization, that you use in a text-editor&#8217;s <em>Find and Replace</em> dialog to turn a chunk of text into something you can put into a spreadsheet, as well as clean up the data entries themselves. <a href="http://www.regular-expressions.info/">Regular-expressions.info</a> is the most complete resource I&#8217;ve found. A cheat-sheet can be <a href="http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/">found here</a>. <a href="http://en.wikipedia.org/wiki/Regular_expression">Wikipedia</a> has a list of some simple use cases.</li>
<li>
<a href="http://code.google.com/p/google-refine/">Google Refine</a> &#8211; A spreadsheet-like program that makes easy the task of cleaning and normalizing messy data. Ever go through campaign contribution records and wish you could easily group together and count as one, all the variations of &#8220;Jon J. Doe&#8221;, &#8220;Jonathan J. Doe&#8221;, &#8220;Jon Johnson Doe&#8221;, &#8220;JON J DOE&#8221;, etc.? Refine will do that. Refine developer David Huynh has an <a href="http://www.youtube.com/watch?v=yNccGtn3Wb0&#038;feature=player_embedded">excellent screencast</a> demonstrating Refine&#8217;s power. I wrote a guide as <a href="http://www.propublica.org/nerds/item/using-google-refine-for-data-cleaning">part of the Dollars for Docs tutorials</a>. Even if you know Excel like a pro &ndash; which I do not &ndash; Refine may make your data-life much more enjoyable.</li>
</ul>
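<p>Both ideas in the list above can be sketched in a few lines of Ruby. The first half uses a regular expression with capture groups to turn a line of free text into a tab-delimited row; the second groups name variants under a shared key. The input line format and the <b>name_key</b> helper are hypothetical, and the key is only a crude stand-in for the kind of grouping Refine offers, not Refine&#8217;s own method: it catches differences in case, punctuation and word order, but not nicknames such as &#8220;Jon&#8221; versus &#8220;Jonathan&#8221;.</p>
<pre name="code" class="ruby">
# 1. Capture groups pull the fields out of a line of free text.
#    The line format is made up for this example.
line = "Doe, Jon J. gave $1,500 on 03/01/2010"
pattern = /\A(.+?), (.+?) gave \$([\d,]+) on (\d{2})\/(\d{2})\/(\d{4})\z/

m = line.match(pattern)
# last name, first name, amount without commas, date as YYYY-MM-DD
row = [m[1], m[2], m[3].delete(","), "#{m[6]}-#{m[4]}-#{m[5]}"].join("\t")
puts row

# 2. A crude normalizing key: lowercase, drop punctuation, sort the words.
#    (name_key is a hypothetical helper, not a Refine function.)
def name_key(name)
  name.downcase.gsub(/[^a-z\s]/, "").split.sort.join(" ")
end

names = ["Jon J. Doe", "JON J DOE", "Doe, Jon J.", "Jonathan J. Doe"]
groups = names.group_by { |n| name_key(n) }

groups.each do |key, variants|
  puts "#{key}: #{variants.inspect}"
end
</pre>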
<p>If you want to learn coding from the ground up, here&#8217;s a short list of places to start:</p>
<ul>
<li><a href="http://lifehacker.com/#!5744113/learn-to-code-the-full-beginners-guide">Lifehacker&#8217;s &#8220;Full Beginner&#8217;s Guide&#8221;</a> &#8211; a four-day guide that goes from the very basics to writing a simple guessing game. It&#8217;s in JavaScript, but as you&#8217;ll hear plenty of times from veterans, it really doesn&#8217;t matter what language you start out with.
</li>
<li><a href="http://www.ruby-doc.org/docs/ProgrammingRuby/">The Pragmatic Programmer&#8217;s Guide to Programming Ruby</a> &#8211; this covers an older version of Ruby, but is still a great comprehensive, browser-friendly book.
</li>
<li><a href="http://pine.fm/LearnToProgram/">Learn to Program (also in Ruby) by Chris Pine</a> &#8211; Written in 2004, this is still an elegant beginner&#8217;s guide.
</li>
<li><a href="http://inventwithpython.com/chapters/">Invent Your Own Computer Games With Python</a> &#8211; You may not be interested in writing game software, but the same programming techniques apply in that field as they do anywhere else. This guide covers all the fundamentals and gives you great project examples.
</li>
<li><a href="http://scraperwiki.com/">ScraperWiki</a> has a massive collection of web-scraping scripts for your perusal, and is where the dataist&#8217;s Finnäs learned from example. ScraperWiki has a set of <a href="http://scraperwiki.com/help/tutorials/python/">Python tutorials</a>, too.
</li>
<li>Here&#8217;s a <a href="http://www.e-booksdirectory.com/programming.php">giant list of free programming books</a>.
</li>
<li>Visit the <a href="http://www.reddit.com/r/learnprogramming">learnprogramming subforum in Reddit</a> to find a small, but active community of beginners who aren&#8217;t afraid to start the most basic of discussions with the forum&#8217;s programming experts. <a href="http://stackoverflow.com/">StackOverflow</a> is the single best site for specific questions or problems; often, you can Google your exact problem and a relevant StackOverflow discussion will be at the top.
</li>
<li>And you can always refer back to my <a href="https://danwin.com/coding-for-journalists-a-four-part-series/">four-part programming tutorial from last year</a>, which aims to cover everything from HTML to writing Ruby that scrapes websites. I <a href="http://www.propublica.org/nerds/item/doc-dollars-guides-collecting-the-data">also wrote a series of tutorials (with complete code) on how I collected data for Dollars for Docs</a>, including how to scrape from websites, Flash applications, PDFs, and even image files (the solution is specific to one kind of format, so I will gladly welcome anyone else to generalize it).
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>https://danwin.com/2011/02/dataist-blog-an-inspiring-case-for-journalists-learning-to-code/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Pfizer Data Redux</title>
		<link>https://danwin.com/2010/04/pfizer-data-redux/</link>
		<comments>https://danwin.com/2010/04/pfizer-data-redux/#comments</comments>
		<pubDate>Wed, 28 Apr 2010 14:22:36 +0000</pubDate>
		<dc:creator><![CDATA[Dan Nguyen]]></dc:creator>
				<category><![CDATA[works]]></category>
		<category><![CDATA[coding]]></category>
		<category><![CDATA[doctors]]></category>
		<category><![CDATA[journalists]]></category>
		<category><![CDATA[pfizer]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">https://danwin.com/?p=763</guid>
		<description><![CDATA[<p>Updated the code and results to my guide on how to scrape Pfizer&#8217;s list of payments to doctors. It now contains a more normalized file that has a line for every doctor and payment. The aggregate totals changed marginally.</p>
<p>The post <a rel="nofollow" href="https://danwin.com/2010/04/pfizer-data-redux/">Pfizer Data Redux</a> appeared first on <a rel="nofollow" href="https://danwin.com">danwin.com</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>Updated the code and results to my <a href="https://danwin.com/works/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/">guide on how to scrape Pfizer&#8217;s list of payments to doctors</a>. It now contains a more normalized file that has a line for every doctor and payment. The aggregate totals changed marginally.</p>
]]></content:encoded>
			<wfw:commentRss>https://danwin.com/2010/04/pfizer-data-redux/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Coding for Journalists 101 : A four-part series</title>
		<link>https://danwin.com/2010/04/coding-for-journalists-101-a-four-part-series/</link>
		<comments>https://danwin.com/2010/04/coding-for-journalists-101-a-four-part-series/#comments</comments>
		<pubDate>Tue, 06 Apr 2010 13:51:40 +0000</pubDate>
		<dc:creator><![CDATA[Dan Nguyen]]></dc:creator>
				<category><![CDATA[works]]></category>
		<category><![CDATA[coding]]></category>
		<category><![CDATA[journalism]]></category>
		<category><![CDATA[pfizer]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[web scraping]]></category>

		<guid isPermaLink="false">https://danwin.com/?p=661</guid>
		<description><![CDATA[<p>Update, January 2012: Everything&#8230;yes, everything, is superseded by my free online book, The Bastards Book of Ruby, which is a much more complete walkthrough of basic programming principles with far more practical and up-to-date examples and projects than what you&#8217;ll find here. I&#8217;m only keeping this old walkthrough up as a historical reference. I&#8217;m sure [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://danwin.com/2010/04/coding-for-journalists-101-a-four-part-series/">Coding for Journalists 101 : A four-part series</a> appeared first on <a rel="nofollow" href="https://danwin.com">danwin.com</a>.</p>
]]></description>
				<content:encoded><![CDATA[<div id="attachment_663" style="width: 510px" class="wp-caption aligncenter"><a href="http://www.flickr.com/photos/nicocavallotto/363251198/"><img src="https://danwin.com/words/wp-content/uploads/2010/04/363251198_9537fe7c6d.jpg" alt="nico.cavallotto" title="nico.cavallotto 363251198_9537fe7c6d" width="500" height="357" class="size-full wp-image-663" /></a><p class="wp-caption-text">Photo by Nico Cavallotto on Flickr</p></div>
<p><strong>Update, January 2012:</strong> Everything&#8230;yes, everything, is superseded by my free online book, <a href="http://ruby.bastardsbook.com">The Bastards Book of Ruby</a>, which is a much more complete walkthrough of basic programming principles with far more practical and up-to-date examples and projects than what you&#8217;ll find here. </p>
<p>I&#8217;m only keeping this old walkthrough up as a historical reference. I&#8217;m sure the code is so ugly that I&#8217;m not going to even try re-reading it.</p>
<p>So check it out: <a href="http://ruby.bastardsbook.com">The Bastards Book of Ruby</a></p>
<p>-Dan</p>
<p>&#8212;</p>
<p><strong>Update, Dec. 30, 2010:</strong> I published <a href="http://www.propublica.org/nerds/item/doc-dollars-guides-collecting-the-data">a series of data collection and cleaning guides for ProPublica</a>, to describe what I did for our Dollars for Docs project. There is a <a href="http://www.propublica.org/nerds/item/scraping-websites">guide for Pfizer which supersedes the one I originally posted here</a>.</p>
<p>So a little while ago, I set out to write some tutorials that would guide the non-coding-but-computer-savvy journalist through enough programming fundamentals so that he/she could write a web scraper to collect data from public websites. A &#8220;little while&#8221; turned out to be more than a month-and-a-half. I actually wrote most of it in a week and then forgot about it. The timeliness of the fourth lesson, which shows <a href="https://danwin.com/works/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/">how to help Pfizer in its mission to be more transparent</a>, compelled me to just publish them in incomplete form. There are probably inconsistencies in the writing and some of the code examples, but the final code sections at the end of each tutorial do seem to execute as expected.</p>
<p>As the tutorials are aimed at people who aren&#8217;t experienced programmers, the code is pretty verbose, pedantic, and in some cases, a little inefficient. It was my attempt to make the code as readable as possible, and I very much welcome suggested edits.</p>
<p><strong>DISCLAIMER:</strong> <em>The code, data files, and results are meant for reference and example only. You use it at your own risk.</em></p>
<ul>
<li><strong>Tutorial 1: <a href="https://danwin.com/works/coding-for-journalists-go-from-a-know-nothing-to-web-scraper-in-an-hour-hopefully/">Go from knowing nothing to scraping Web pages. In an hour. Hopefully</a></strong> &#8211; A massive, sprawling tutorial that attempts to take you from learning what HTML is, to the definition of an &#8220;if <del datetime="2010-04-06T18:25:14+00:00">loop</del> statement&#8221;, and finally, to using a Ruby library to scrape some information from Wikipedia. It may be too confusing for total neophytes and laughably basic for self-taught programmers. But at least you can kind of see, from beginning to end, one roadmap for going from nothing to something in the programming world.</li>
<li><strong>Tutorial 2: <a href="https://danwin.com/works/coding-for-journalists-102-collecting-info-from-a-county-jail-site/">Scraping a County Jail Website to Find Out Who&#8217;s in Jail</a></strong> &#8211; This uses all the concepts from the first tutorial and applies them to something that a cops reporter might actually want to try out.</li>
<li><strong>Tutorial 3: <a href="https://danwin.com/works/coding-for-journalists-part-3-cross-checking-the-jail-log-with-the-court-system-use-rubys-mechanize-to-fill-out-a-form/">Who&#8217;s Been in Jail Before: Cross-checking the jail logs with the court system with Ruby&#8217;s Mechanize</a></strong> &#8211; This lesson introduces you to another Ruby library that allows you to automate the filling-out of forms so that you can access online databases, in this case California criminal case histories, to see if current inmates are repeat-alleged-offenders.</li>
<li><strong>Tutorial 4: <a href="https://danwin.com/works/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/">Improving Pfizer&#8217;s Dollars-to-Doctors Pay List</a></strong> &#8211; Last week, <strong>Pfizer</strong> <a href="http://www.nytimes.com/2010/04/01/business/01payments.html">released a list of nearly 5,000 doctors and medical institutions</a> to which it made $35 million in consulting and expense payments. Fun. Unfortunately, the list, <a href="http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp">as it initially existed online</a>, is just about useless to anyone wanting to examine trends. This tutorial provides a script to make the list more interesting to journalists.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>https://danwin.com/2010/04/coding-for-journalists-101-a-four-part-series/feed/</wfw:commentRss>
		<slash:comments>25</slash:comments>
		</item>
		<item>
		<title>Coding for Journalists 103: Who&#8217;s been in jail before: Cross-checking the jail log with the court system; Use Ruby&#8217;s mechanize to fill out a form</title>
		<link>https://danwin.com/2010/04/coding-for-journalists-part-3-cross-checking-the-jail-log-with-the-court-system-use-rubys-mechanize-to-fill-out-a-form/</link>
		<comments>https://danwin.com/2010/04/coding-for-journalists-part-3-cross-checking-the-jail-log-with-the-court-system-use-rubys-mechanize-to-fill-out-a-form/#comments</comments>
		<pubDate>Tue, 06 Apr 2010 13:40:53 +0000</pubDate>
		<dc:creator><![CDATA[Dan Nguyen]]></dc:creator>
				<category><![CDATA[works]]></category>
		<category><![CDATA[coding]]></category>
		<category><![CDATA[courts]]></category>
		<category><![CDATA[journalism]]></category>
		<category><![CDATA[mechanize]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">https://danwin.com/?p=584</guid>
		<description><![CDATA[<p>This is part of a four-part series on web-scraping for journalists. As of Apr. 5, 2010, it was published a bit incomplete because I wanted to post a timely solution to the recent Pfizer doctor payments list release, but the code at the bottom of each tutorial should execute properly. The code examples are [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://danwin.com/2010/04/coding-for-journalists-part-3-cross-checking-the-jail-log-with-the-court-system-use-rubys-mechanize-to-fill-out-a-form/">Coding for Journalists 103: Who&#8217;s been in jail before: Cross-checking the jail log with the court system; Use Ruby&#8217;s mechanize to fill out a form</a> appeared first on <a rel="nofollow" href="https://danwin.com">danwin.com</a>.</p>
]]></description>
				<content:encoded><![CDATA[<div class='over-note' style='font-size: 12pt; color: #a44; border: 1px solid black; margin: 20px; padding: 20px;'>This is part of a <a href="https://danwin.com/works/coding-for-journalists-101-a-four-part-series/">four-part series on web-scraping for journalists</a>. As of <strong>Apr. 5, 2010</strong>, it was published a bit incomplete because I wanted to post a timely solution to the <a href="https://danwin.com/works/pfizer-web-scraping-for-journalists-part-4-pfizers-doctor-payments/">recent Pfizer doctor payments list release</a>, but the code at the bottom of each tutorial should execute properly. The code examples are meant for reference and I make no claims to the accuracy of the results. Contact <a href="mailto:dan@danwin.com">dan@danwin.com</a> if you have any questions, or leave a comment below.</p>
<p><strong>DISCLAIMER:</strong> <em>The code, data files, and results are meant for reference and example only. You use it at your own risk.</em></p>
<p><b>In particular, with lesson 3</b>, I skipped basically any explanation of the code. I hope to get around to it later.</p>
</div>
<h2>Going to Court</h2>
<p>In the <a href="https://danwin.com/works/coding-for-journalists-101-a-four-part-series/">last lesson</a>, we learned how to write a script that would record who was in jail at a given hour. This could yield some interesting stories for a crime reporter, including spates of arrests for notable crimes and inmates who are held with $1,000,000 bail for relatively minor crimes. However, an even more interesting angle would be to check the inmates&#8217; prior records, to get a glimpse of the recidivism rate, for example.</p>
<p><a href="https://services.saccourt.com/indexsearchnew/CaseType.aspx">Sacramento Superior Court</a> allows users to search not just by name, but also by the unique ID number given to inmates by Sacramento-area jurisdictions. This makes it pretty easy to link current inmates to court records.</p>
<p><a href="https://danwin.com/words/wp-content/uploads/2010/04/small-court-page.gif"><img src="https://danwin.com/words/wp-content/uploads/2010/04/small-court-page.gif" alt="" title="small-court-page" width="500"  class="size-full wp-image-672" /></a><br />
</p>
<p>However, the techniques we used in past lessons to automate the data collection won&#8217;t work here. As you can see in the above picture, you have to fill out a form. That&#8217;s not something any of the code we&#8217;ve written previously will do. Luckily, that&#8217;s where Ruby&#8217;s <strong>mechanize</strong> comes in.</p>
<p><span id="more-584"></span></p>
<div class="code-doc">
<link rel='stylesheet' href='https://danwin.com/css/code.css' type='text/css' media='all' />
<div class='sec'>
<h2>Ruby Mechanize</h2>
<p>Go to the <a href="http://mechanize.rubyforge.org/mechanize/">mechanize library homepage</a> to learn how to install it as a Ruby gem. It requires that <a href="http://nokogiri.rubyforge.org/">nokogiri</a> is installed, which you should&#8217;ve done if you&#8217;ve made it this far into my tutorials.</p>
<p>There are some <a href="http://mechanize.rubyforge.org/mechanize/EXAMPLES_rdoc.html">basic examples on the project page</a>, but you&#8217;re going to have to read some of the technical documentation to learn some of mechanize&#8217;s commands.</p>
<p>Here&#8217;s a code example we&#8217;ll be using:</p>
<pre class="ruby" name="code">
search_form['txtXref']='00112233'
result_page_form = search_form.submit
</pre>
<p><b>search_form</b> refers to a mechanize Form object. In that HTML form is a text field with a name of &#8216;txtXref&#8217;. The bracket notation used above sets that text field to the value &#8216;00112233&#8217;.</p>
<p>Then, using mechanize&#8217;s Form object&#8217;s <b>submit</b> method, we submit the form just as if we had clicked the &#8220;Submit&#8221; button on a webpage.</p>
<p>That&#8217;s the basic theory.</p>
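<p>If it helps to see what &#8220;filling out a form&#8221; amounts to underneath, here is a rough sketch that uses only Ruby&#8217;s standard <b>uri</b> library, not mechanize. It assumes an ordinary POST form: a set of name/value pairs that gets URL-encoded into the request body. The two field names are taken from the court search form used in this lesson; the real form also carries hidden fields such as __VIEWSTATE, which mechanize keeps track of for you and which this sketch leaves out.</p>
<pre name="code" class="ruby">
require 'uri'

# A form is, at bottom, a set of name/value pairs.
# Only two of the real form's fields are shown; the values are placeholders.
fields = {
  'txtLastName' => '',
  'txtXref'     => ''
}

# The same idea as: search_form['txtXref'] = '00112233'
fields['txtXref'] = '00112233'

# Submitting the form URL-encodes the pairs into the body of the request.
body = URI.encode_www_form(fields)
puts body
</pre>
<p>Mechanize does this bookkeeping, plus cookies and the hidden fields, so that the two lines shown earlier are all you have to write.</p>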
</div>
<div class='sec'>
<h2>The Code</h2>
<p>Note: The following code works if you have an inmates.txt file from the last lesson (<a href="https://danwin.com/static/jail-list/inmates.txt">use this one if you don&#8217;t</a>; keep in mind that the last names and birthdates have been changed/redacted). However, it&#8217;s very rudimentary, with no error-checking at all. Still, it&#8217;ll give you a couple of tab-delimited files that list an inmate&#8217;s past charges and past sentences served, with XREF being the key that links those files to inmates.txt.</p>
<p>Remember that you&#8217;re accessing a live site here. This script pauses for 2 seconds after each access&#8230;there should be no reason to be more frequent about it.</p>
<p>This tutorial will be updated in the future.</p>
<pre name="code" class="ruby">
require 'rubygems'
require 'mechanize'
search_url='https://services.saccourt.com/indexsearchnew/CriminalSearchV2.aspx'
# Read the data file and pull out each inmate's XREF:
# the 8th tab-separated column, digits only, with duplicates removed
xrefs = File.readlines("inmates.txt").map{|x| x.split("\t")[7].match(/[0-9]+/).to_s}.uniq



a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

search_page = a.get(search_url) 
search_form = search_page.form_with(:name=>'frmCriminalSearch')

#show the fieldnames
search_form.fields.map {|f| f.name}
#=> ["__EVENTTARGET", "__EVENTARGUMENT", "__VIEWSTATE", "txtLastName", "txtFirstName", "txtDOB", "txtXref", "txtCaseNumber", "lstCaseType"]

search_form.buttons.map{|m| m.name}
# => ["btnFindByName", "btnFindByNumber"]


xrefs.each do |xref|
  puts "\nFinding info for xref: #{xref}"
  search_form['txtXref']=xref
  search_form.field_with(:name=>'lstCaseType').options[1].select
  result_page_form = search_form.submit.forms.first
  case_buttons = result_page_form.buttons[1..-2]

  puts "There are #{case_buttons.length} cases to check:"
  case_buttons.each do |cb|
    file_page = result_page_form.click_button(cb)
    file_page = file_page.parser
  
    charges_arr = []
    sentences_arr =[]
    charge_rows = file_page.css('#dgDispositionCharges tr')
  
    if charge_rows.length > 0
      puts "Charges: "
      # skip the header row of the table
      charge_rows[1..-1].each do |cr|
        ctd = cr.css('td').map{|td| td.text}
        charges_arr << {:plea=>ctd[1], :charge=>ctd[2], :date=>ctd[4], :severity=>ctd[5]}
        # print the values of the hash just added (plea, charge, date, severity)
        puts "\t - #{charges_arr.last.values.join("\t")}"
      end
    end
  
    sentence_rows = file_page.css('#dgSentenceSummary tr')
  
    if sentence_rows.length > 0
      puts "Sentences: "
      sentence_rows[1..-1].each do |sr|
        sentences_arr << sr.css('td').map{|td| td.text}.join("\t")
        puts "\t - #{sentences_arr.last}"
      end
    end
    
    
    File.open("court_charges.txt",'a+'){ |f|

      charges_arr.each do |c|
        f.puts("#{xref}\t#{c[:plea]}\t#{c[:charge]}\t#{c[:date]}\t#{c[:severity]}")
      end
    }

    File.open("sentences.txt", 'a+'){ |f| 
      sentences_arr.each do |c|
        f.puts("#{xref}\t#{c}")
      end
    }
    
    
    
  
  end #done checking a case entry
  
  puts "Done with #{xref}, sleeping"
  sleep 2 # pause 2 seconds between lookups to go easy on the live site
  
  
end  

 

 
 
</pre>
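<p>Once the script has run, XREF is what ties the output back to inmates.txt. The sketch below shows one way that link could be used to count prior charges per inmate. It works on a few made-up rows held in memory instead of the real files: the column order for the charges follows what the script above writes (xref, plea, charge, date, severity), and the only thing assumed about inmates.txt is that the XREF sits in the 8th tab-separated column, as in the script. The <b>xref_of</b> helper and all of the sample values are hypothetical.</p>
<pre name="code" class="ruby">
# Made-up rows standing in for court_charges.txt
# columns: xref, plea, charge, date, severity
charge_lines = [
  "1234567\tplea A\tcharge A\t01/02/2009\tseverity A",
  "1234567\tplea B\tcharge B\t03/04/2007\tseverity B",
  "7654321\tplea C\tcharge C\t05/06/2008\tseverity C"
]

# Made-up rows standing in for inmates.txt: seven filler columns,
# then the XREF column, then a name
inmate_lines = [
  (["-"] * 7 + ["XREF 1234567", "DOE"]).join("\t"),
  (["-"] * 7 + ["XREF 7654321", "ROE"]).join("\t"),
  (["-"] * 7 + ["XREF 1111111", "POE"]).join("\t")
]

# Same extraction as the script (8th column, digits only), except that a
# missing column gives an empty string instead of raising an error.
def xref_of(line)
  line.split("\t")[7].to_s[/[0-9]+/].to_s
end

# Index the charges by their first column, the XREF
charges_by_xref = charge_lines.map { |l| l.split("\t") }.group_by { |cols| cols[0] }

# For each inmate, look up how many prior charges share the same XREF
prior_counts = {}
inmate_lines.each do |line|
  xref = xref_of(line)
  prior_counts[xref] = charges_by_xref.fetch(xref, []).length
  puts "#{xref}: #{prior_counts[xref]} prior charge(s)"
end
</pre>
<p>To try it on real output, swapping the sample arrays for File.readlines on the two files should be all that is needed, though that has not been tested against the live data.</p>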
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>https://danwin.com/2010/04/coding-for-journalists-part-3-cross-checking-the-jail-log-with-the-court-system-use-rubys-mechanize-to-fill-out-a-form/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
