Category Archives: works

actual works, projects

Soft-launching Stanford’s Computational Journalism Lab

Last week, my Stanford colleagues and I launched the website for the Computational Journalism Lab. It’s a soft-launch, as the lab isn’t a physical lab, but more of an umbrella for the computational work and meetups that we are planning, such as a Computational Journalism conference in 2016, our collaboration in the California Civic Data Coalition, and of course, our coursework.

Also, as I’ve mentioned before on this blog, pretty much all of my future blogging is going to happen at, which is built in Jekyll. Not coincidentally, the Computational Journalism Lab site is also built in Jekyll — see its Github here.

A guide to using Github for non-developers

I’m constantly being asked by friends to help them with their websites, and I’m constantly not at all enthusiastic to do it. I mean, I enjoy helping friends out and creating things, but web development is not at all the “fun part.” It’s a complex field, but more annoyingly, it’s difficult to scaffold a site so that a web novice can maintain it. So you either have to settle for being the site’s maintenance person in perpetuity, or accept that your friends will waste countless hours hacking and breaking a brittle, barely visited website.

Github Pages has been a great and convenient way to publish websites. So I’ve been telling my non-dev friends, hey, just create a Github account and publish away! Unfortunately, while there are many great Github and Git resources, all of them presume that you actually want to use the many cool collaborative, developer-focused features of Git/Github. Whereas I want my non-dev friends just to piggyback off of Github to quickly build a website from scratch.

So in the past month, I’ve slowly been putting together a guide that is as basic as possible, even to the point of showing which buttons to click, and explaining how HTML is different than raw text. Check it out here: Build a Web Portfolio from Scratch with Github Pages.

Check out the Reddit discussion here. To my surprise, even aspiring developers have found it useful, even though the guide is aimed at people who do not intend to be web developers.

Creating this guide isn’t an act of altruism for me, though. It’s another way to experiment with online publishing, namely, how to reduce the friction between thinking of things to write about and getting them onto the Web. I stuck to using Jekyll but kind of wish I had gone with using Middleman. In any case, I feel much further along in having a refined CMS-workflow than I did with the Bastards Books and with my Small Data Journalism site, which is also built on Jekyll.

Github activity for 2013: 1,500+ commits, 52-day-streak

Just noticed that because of that damned Thanksgiving break, my latest Github commit streak ended at 52 days in a row. My public profile doesn’t show most of my commits as I’ve been using Github for Skift company projects, but I’ve been tentatively trying to design some open-source utilities. One of them is yearbook, yet another attempt (for me) to wrap face-recognition and mugshot-cropping into a convenient command-line interface. Another gem, which I just started on today, is spodunk, probably a futile attempt to make Google Spreadsheets more of a CMS. And of course, there’s small-data-journalism, which contains all the content for my Small Data Journalism site.

In the heat-graph of my private and public Github commits, you can see how relatively intense my commit activity has been in the past couple of weeks. The fruit of that labor is SkiftIQ, a site for better tracking how the travel industry performs and behaves on social media.

Earlier this year I made it a goal every day to contribute to someone else’s project, even if it was simple grammar and spelling fixes. I’ve fallen away from that but not for lack of interest: being able to wade through a large codebase and isolate (and sometimes fix) even just minor bugs is probably the best programming/critical-thinking exercise I’ve put myself through…and sometimes it actually helps other people.

small data Journalism – practical lessons of data journalism

Next week, my short course on data journalism at New York University (SCPS) begins. It’s only five weeks long, but I wanted the material and readings I used for the class to be accessible for anyone. You can check it out at, a site I built using Jekyll and which will serve as the home for future musings and data projects.

“Better know a developer” at AAJA 2013

I had the privilege of being on a panel with the New York Times’s Chase Davis and former YouTube designer Hong Qu at this year’s Asian American Journalists Association convention.

The panel was titled “Better Know a Developer,” and my part of it was to discuss how non-programming journalists can work best with programmers.

You can see the slides here. The advice boils down to: Don’t believe in magic. Think about how you would do it yourself. And use a spreadsheet.

Ruby MiniTest Cheat Sheet, Unit and Spec reference

Ruby’s standard testing suite, MiniTest, is in dire need of a quick-and-handy reference for its syntax. I’ve put one together comparing the unit syntax (assert/refute) and the spec syntax (must/wont). You can see it below or download the Google Spreadsheet I made and roll your own sheet (HTML, XLS).

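Test syntax

Each entry lists the unit-style methods, the spec-style methods, the arguments, and examples.

Unit: assert_empty, refute_empty
Spec: must_be_empty, wont_be_empty
Arguments: obj, msg=nil
assert_empty []
refute_empty [1,2,3]

Unit: assert_equal, refute_equal
Spec: must_equal, wont_equal
Arguments: exp, act, msg=nil
assert_equal 2, 2
refute_equal 2,1
2.must_equal 2
2.wont_equal 1

Unit: assert_in_delta, refute_in_delta
Spec: must_be_within_delta, wont_be_within_delta
Arguments: exp, act, dlt=0.001, msg=nil
assert_in_delta 2012, 2010, 2
refute_in_delta 2012, 3012, 2
2012.must_be_within_delta 2010, 2
2012.wont_be_within_delta 3012, 2

Spec: must_be_close_to, wont_be_close_to (same checks as must_be_within_delta and wont_be_within_delta)
Arguments: act, dlt=0.001, msg=nil
2012.must_be_close_to 2010, 2
2012.wont_be_close_to 3012, 2

Unit: assert_in_epsilon, refute_in_epsilon
Spec: must_be_within_epsilon, wont_be_within_epsilon
Arguments: a, b, eps=0.001, msg=nil
assert_in_epsilon 1.0, 1.02, 0.05
refute_in_epsilon 1.0, 1.06, 0.05
1.0.must_be_within_epsilon 1.02, 0.05
1.0.wont_be_within_epsilon 1.06, 0.05

Unit: assert_includes, refute_includes
Spec: must_include, wont_include
Arguments: collection, obj, msg=nil
assert_includes [1, 2], 2
refute_includes [1, 2], 3
[1, 2].must_include 2
[1, 2].wont_include 3

Unit: assert_instance_of, refute_instance_of
Spec: must_be_instance_of, wont_be_instance_of
Arguments: klass, obj, msg=nil
assert_instance_of String, "bar"
refute_instance_of String, 100
"bar".must_be_instance_of String
100.wont_be_instance_of String

Unit: assert_kind_of, refute_kind_of
Spec: must_be_kind_of, wont_be_kind_of
Arguments: klass, obj, msg=nil
assert_kind_of Numeric, 100
refute_kind_of Numeric, "bar"
100.must_be_kind_of Numeric
"bar".wont_be_kind_of Numeric

Unit: assert_match, refute_match
Spec: must_match, wont_match
Arguments: exp, act, msg=nil
assert_match /\d/, "42"
refute_match /\d/, "foo"
"42".must_match /\d/
"foo".wont_match /\d/

Unit: assert_nil, refute_nil
Spec: must_be_nil, wont_be_nil
Arguments: obj, msg=nil
assert_nil [].first
refute_nil [1].first

Unit: assert_operator, refute_operator
Spec: must_be, wont_be
Arguments: o1, op, o2, msg=nil
assert_operator 1, :<, 2
refute_operator 1, :>, 2
1.must_be :<, 2
1.wont_be :>, 2

Unit: assert_output
Spec: must_output
Arguments: stdout = nil, stderr = nil
assert_output("hi\n"){ puts "hi" }
proc { puts "hi" }.must_output "hi\n"

Unit: assert_raises
Spec: must_raise
assert_raises(NoMethodError){ nil! }
proc { nil! }.must_raise NoMethodError

Unit: assert_respond_to, refute_respond_to
Spec: must_respond_to, wont_respond_to
Arguments: obj, meth, msg=nil
assert_respond_to "foo",:empty?
refute_respond_to 100, :empty?
"foo".must_respond_to :empty?
100.wont_respond_to :empty?

Unit: assert_same, refute_same
Spec: must_be_same_as, wont_be_same_as
Arguments: exp, act, msg=nil
assert_same :foo, :foo
refute_same ['foo'], ['foo']
:foo.must_be_same_as :foo
['foo'].wont_be_same_as ['foo']

Unit: assert_silent
Spec: must_be_silent
assert_silent{ 1 + 1 }
proc { 1 + 1 }.must_be_silent

Unit: assert_throws
Spec: must_throw
Arguments: sym, msg=nil
assert_throws(:up){ throw :up }
proc { throw :up }.must_throw :up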
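To see the two syntaxes side by side in a file you can actually run, here is a minimal sketch. It assumes a Minitest 5.x install, where the constant is spelled `Minitest` and test classes inherit from `Minitest::Test`. It also wraps spec values in `_()`, which newer releases prefer over calling `must_equal` directly on the object as the cheat sheet does. The class and description names are made up for illustration.

```ruby
require "minitest/autorun"

# Unit style: assert_* / refute_* inside a Minitest::Test subclass.
# (CheatSheetUnitTest is a hypothetical name.)
class CheatSheetUnitTest < Minitest::Test
  def test_equality
    assert_equal 2, 1 + 1
    refute_equal 2, 1
  end

  def test_collections
    assert_includes [1, 2], 2
    assert_empty []
  end

  def test_errors
    assert_raises(NoMethodError) { nil.no_such_method }
  end
end

# Spec style: must_* / wont_* inside describe/it blocks.
# _() wraps the value (or block) under test; this is an assumption
# about your Minitest version, not something the cheat sheet uses.
describe "cheat sheet, spec style" do
  it "checks equality" do
    _(1 + 1).must_equal 2
    _(1).wont_equal 2
  end

  it "checks collections" do
    _([1, 2]).must_include 2
    _([]).must_be_empty
  end

  it "checks errors" do
    _ { nil.no_such_method }.must_raise NoMethodError
  end
end
```

Save it as something like `cheat_sheet_test.rb` and run it with `ruby cheat_sheet_test.rb`; `minitest/autorun` takes care of running both the unit class and the spec block.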

Test Setup

Unit: setup(), teardown()
Spec: before(type = nil, &block), after(type = nil, &block)
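As a rough illustration of how the setup hooks pair up, the sketch below builds the same fixture in both styles. It assumes Minitest 5.x (`Minitest::Test`, `_()` expectations); the class name and the `@stack` fixture are hypothetical.

```ruby
require "minitest/autorun"

# Unit style: setup runs before every test, teardown after every test.
class StackUnitTest < Minitest::Test
  def setup
    @stack = [1, 2, 3]
  end

  def teardown
    @stack.clear
  end

  def test_pop_returns_last_item
    assert_equal 3, @stack.pop
  end

  def test_each_test_gets_a_fresh_stack
    # The pop in the other test does not leak into this one.
    assert_equal 3, @stack.size
  end
end

# Spec style: before/after blocks play the same role.
describe "stack, spec style" do
  before { @stack = [1, 2, 3] }
  after  { @stack.clear }

  it "pops the last item" do
    _(@stack.pop).must_equal 3
  end

  it "gets a fresh stack for each example" do
    _(@stack.size).must_equal 3
  end
end
```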


via MiniTest::Mock

  • expect(name, retval, args=[]) – Expect that method name is called, optionally with args or a blk, and returns retval.


    @mock.expect(:meaning_of_life, 42)
    @mock.meaning_of_life # => 42
    @mock.expect(:do_something_with, true, [some_obj, true])
    @mock.do_something_with(some_obj, true) # => true
    @mock.expect(:do_something_else, true) do |a1, a2|
      a1 == "buggs" && a2 == :bunny
    end
  • verify – Verify that all methods were called as expected. Raises MockExpectationError if the mock object was not called as expected.

Other syntax

  • def flunk(msg=nil)
  • def pass(msg=nil)
  • def skip(msg=nil, bt=caller)
  • def it (desc="anonymous", &block)
  • i_suck_and_my_tests_are_order_dependent!() – Call this at the top of your tests when you absolutely positively need to have ordered tests. In doing so, you’re admitting that you suck and your tests are weak. (TestCase public class method)
  • parallelize_me!() – Call this at the top of your tests when you want to run your tests in parallel. In doing so, you’re admitting that you rule and your tests are awesome.
  • make_my_diffs_pretty!() – Make diffs for this TestCase use pretty_inspect so that diff in assert_equal can be more detailed. NOTE: this is much slower than the regular inspect but much more usable for complex objects.



The Bastards Book of Regular Expressions

Well, I’m not quite done with my promised revision of the Bastards Book of Ruby. Or of Photography…but I’ve decided, oh what the hell, I should write something about regular expressions.

Actually, there is some method to this madness. As part of the process of updating the Ruby book, I realized I needed to spin off some of the larger, non-Ruby related topics. So, at some point, there will be mini-books about HTML and SQL. Regular expressions, as I keep telling people who want to deal with data, are incredibly important, even if you think you never want to learn programming. Hopefully this mini-book will make a strong case for learning regexes.

The second motive is that I’ve been looking for an HTML/text-to-PDF workflow. So this is my experiment with Leanpub, which promises to turn a set of Markdown files into PDF/mobi/etc., while handling the selling process. I don’t expect to sell any copies of the BBoRegexes, but I hope to get a lot of insight into the mechanics behind Leanpub and whether it presents a viable way for me to publish my other projects.

Check out the Leanpub homepage for my tentatively titled book, The Bastards Book of Regular Expressions. Or, you could just read the mega-chapter on regexes in my Ruby book.

First-year traffic stats for the Bastards Book of Ruby (and Photography): 140,000 unique visitors

The majority of this photo’s Flickr pageviews comes from the Meet Your Web Inspector chapter, probably from readers who click the photo to see what exactly the hell is going on. Here’s the story: One day, a queen bee decided to take residence on a Chinatown mailbox. Her hive decided to follow her, causing the city to block off the corner so the bee inspectors could pluck her out. My friend from China who was visiting NYC for the first time said to me later that this was the most exciting thing she saw in New York all week. Coincidentally, the web inspector chapter is probably the most useful part (for beginners) of my programming book.

Last year, I wanted programming to be more accessible. So I published a rough draft of what I called the “Bastards Book of Ruby” and then never added to it again. It’s interesting to see how much actual traction it got. Here’s a screenshot of the Google Analytics visitors overview:

Google Analytics Pageviews for

Caveat: Unfortunately, I never figured out how to get Google Analytics to do multiple subdomains, so this report includes the July traffic boost from the Bastards Book of Photography, which, day-to-day, is not as popular as the Ruby book.

In absolute terms, fewer than half a million pageviews in a year is not impressive for a free website (this 2 min. video I took of some guys playing Super Mario Brothers in the subway is already at 250K visits). But given that it’s a book about programming and that each single page consists of a “chapter” and is – in retrospect – way too long (the Ruby book is about 75,000 words altogether), I’ll pretend that they’re more “substantial” page views. At least a few visitors saved pages for offline viewing and really, after you’ve gone through the chapters you care about, there’s not much reason to return to the book since I never really updated it.

However, according to the % New Visits metric, there’s been a steady increase in percentage of new visitors:

Google Analytics % of new visits to

Some traffic high points:

  • 12/5/2011 – Bastards Book of Ruby is released. It made it to Hacker News’s front page: 11,566 pageviews
  • 12/22/2011 – Once in a while, either I or someone else would submit individual chapters to various sites. The chapter I had about scraping the Putnam County Sheriff’s Office got a few upvotes on HN: 6,528 pageviews
  • 5/16/2012 – The Ruby book makes it to the top of HN, possibly in response to Jeff Atwood’s “Please Don’t Learn to Code” published the day before: 38,359 pageviews
  • 6/21/2012 – The Bastards Book of Photography is published. It took me about two weeks to put together and I wanted to try out Octopress as a CMS to replace the hack Rails-to-static-file system I wrote for the Ruby book. It made the front of HN as well as Reddit’s photography and howto subreddits: 45,519 pageviews

In the last few months, the site has averaged 3,000 to 4,000 visitors per week.

Promotional work and referrals

The BBoR was aimed at journalism professionals trying to learn coding, but I didn’t formally pitch it beyond tweeting about it, listing it on my online bios, and mentioning it on the NICAR mailing list. Other hacker journalists were kind enough to give it a few shout-outs.

Most of the referrals overall came from technical audiences such as Hacker News and Reddit’s technical subreddits (again, the photo book numbers are mixed in here, which I think accounts for most of the social media traffic):

  1. 38,817
  2. 16,297
  3. 7,062
  4. 6,372
  5. 4,630
  6. 3,704
  7. 2,540
  8. 2,476
  9. 1,348
  10. 1,015

The most-read topics

The most popular section of the Ruby book is the five-chapter series I wrote on web scraping, which is of particular interest to journalists dealing with cruddy government websites that will never have an API. Parsing HTML with Nokogiri is the most popular individual chapter.

In my own cookie-wiped Google Search, the book is the top result for “ruby web scraping”.

Here are the top 20 search terms that don’t include the word “bastard”. Apparently the Bastards Book doesn’t rank high at all for any photography search terms:

  1. alter positions in a list ruby
  2. ruby io safe io video stream
  3. parse image path with ruby
  4. putnam county jail log
  5. ruby mechanize
  6. nokogiri
  7. ruby web crawler
  8. `
  9. ruby if else
  10. web scraping ruby
  11. finding curly bracket special characters in excel
  12. ruby nokogiri
  13. ruby collections
  14. ruby parse html
  15. ruby web scraping
  16. text editor using wrong version of ruby
  17. nokogiri book
  18. ruby open html
  19. how to run a saved program in ruby
  20. ruby inline if

The respective covers of the books.

General interest in programming

Besides fixing typos and errors, I never did fulfill the promise of making major updates (to either of the books) this year and I’ve rarely mentioned the book after its first month – except in discussions about journalism and programming, which are pretty rare in general. To my surprise though, daily traffic has been generally steady. As I mentioned earlier, the two books receive about 3 to 4 thousand visitors weekly. When the Ruby book peaked on Hacker News in May, the average jumped from 1,500 visitors to 2,500 visitors. The current average has been the status quo since the photo book was released in July.

It would seem that the photo book accounts for the majority of the difference. But anecdotally, I get thank-you emails and tweets every week about the Ruby book and almost never hear feedback on the photo book. Sure, there are plenty of in-depth photography guides in comparison to programming books. But there are far fewer aspiring programmers than photographers.

I hope to help change that and so have been working on a major update to the BBoR (including converting it to PDF form, probably the most requested feature). Sooner rather than later, hopefully, and as my off-work hours permit.

Sometime after March 2011 was when I started thinking about writing a programming book. This was after I had tried teaching the first learn-to-code class at General Assembly, an 8PM Thursday class called: “Coding for Beginners: Data Mashing with APIs”. Here’s the description I wrote up for the class:

Students will learn the fundamentals of programming by creating simple yet powerful scripts to collect and organize the data found in Web services such as Twitter, Google Maps, and Foursquare. The class will walk through sample Ruby code to understand the basic theory of programming, including variables, methods, arrays and loops. At the end, students will be able to write a fully-functional custom script to access and scrape website data.

This class is intended for absolute newbies and those beginning to learn how to code. Laptops are optional. Code for the lesson will be provided so that students can follow along during class and after. Prerequisites: None.

(Before you spit out your coffee, I do reflect below about how hilariously absurd this synopsis is, in retrospect)

Jenny 8. Lee (at Hacks/Hackers) introduced me to GA, but I remember the GA coordinator and I both thinking, “Who the hell actually wants to learn basic programming?” I don’t know what the average price for a GA class was then. But I know when we set the price at $30 for a 2-3 hour class, I thought it was too high and we’d end up with a pretty bare turnout.

And I was wrong: it was the fastest-selling class in GA’s then-young history and sold out in a day. And thankfully, the interest doesn’t seem to be a fluke: GA today regularly holds beginners’ programming classes, ranging in price from $175 for an afternoon to $3,000 for an eight-week Rails course (including Ruby fundamentals).

My own class didn’t go terribly well because, as I’ve found out since then, it’s kind of difficult to cover in 3 hours the programming fundamentals (variables, methods, if statements, for-loops) needed to create a mashup from the Twitter and Foursquare APIs. It takes real teachers a semester to cover that material, and my audience was mostly unfamiliar with the command line.

So the Ruby book was an attempt to create a resource that might actually be helpful for aspiring coders. Less than half a million pageviews might not be much. But it’s been gratifying to see the Ruby book continue to be used even in its messy draft form by beginners who are incredibly committed to learning new things in life. I hope the next revision of the Ruby book will be even more useful to them.

Other resources: Who knows when I’ll actually finish. In the meantime, a lot of great programming resources have come out this past year, too many to list so I’ll just point out the Ruby ones. Free resources include Zed Shaw’s Ruby version of his Learn to Code the Hard Way and Codecademy’s Ruby track. Sau Sheong Chang’s Exploring Everyday Things with R and Ruby is one of my favorite books I’ve bought this year and it follows a philosophy similar to mine: use code to do creative, real-world problem solving.

How to convert Access .mdb files to .csv or SQL using Mac OS X

MDBLite on the Mac App Store

A life-changing app for data enthusiasts.

Update: Travis Swicegood, of the Texas Tribune, pointed out that mdbtools has a homebrew recipe (brew install mdbtools), which avoids this thorny problem. While waiting for the homebrew recipe to install, though, I found this mdb-sqlite project, which uses the Java library Jackcess to allow command-line conversion, something the solution I originally posted below lacks. Still, it was the most productive $1.99 I’ve spent in a while.

MDBLite, $1.99 in the Mac App Store, allows you to convert Microsoft Access mdb files into SQL, CSV, or SQLite databases.

For the past few years, I’ve kept an old Windows XP laptop around just to open Access databases to convert them to Excel or CSV. I only found out about MDBLite after digging through some obscure discussion groups that mentioned it in passing. The entire purpose of this blog post is to inform all other poor souls who use Macs but must still deal with government data. If that is the most important thing my blog provides for the Internet, I’d still be proud.

So far, it works as well as advertised. Which is pretty amazing if you’ve ever had to deal with Access conversions.
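If you go the mdbtools route from the update at the top of this post, the command-line conversion can be scripted. What follows is only a sketch, assuming `brew install mdbtools` has put the `mdb-tables` and `mdb-export` commands on your PATH. The helper names (`mdb_tables_command`, `mdb_export_command`, `export_all_tables`) are hypothetical, and I haven’t tried it against every flavor of Access file.

```ruby
require "open3"

# Hypothetical helper: the command that lists table names, one per line
# (-1 is mdb-tables' one-table-per-line option).
def mdb_tables_command(mdb_path)
  ["mdb-tables", "-1", mdb_path]
end

# Hypothetical helper: the command that dumps one table as CSV to stdout.
def mdb_export_command(mdb_path, table)
  ["mdb-export", mdb_path, table]
end

# Writes one CSV file per table into out_dir and returns the paths written.
# Assumes mdbtools is installed; raises if either command fails.
def export_all_tables(mdb_path, out_dir)
  tables, status = Open3.capture2(*mdb_tables_command(mdb_path))
  raise "mdb-tables failed for #{mdb_path}" unless status.success?

  tables.split("\n").reject(&:empty?).map do |table|
    csv, export_status = Open3.capture2(*mdb_export_command(mdb_path, table))
    raise "mdb-export failed for #{table}" unless export_status.success?

    out_path = File.join(out_dir, "#{table}.csv")
    File.write(out_path, csv)
    out_path
  end
end

# Usage: ruby mdb_to_csv.rb database.mdb output_dir
if __FILE__ == $PROGRAM_NAME && ARGV.size == 2
  puts export_all_tables(ARGV[0], ARGV[1])
end
```

Passing the commands as arrays rather than one interpolated string keeps file and table names with spaces in them from being mangled by the shell, which matters with government data.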

Bastards Book of Ruby has a Hacker News revival

The Hacker News traffic spike for the Bastards Book of Ruby

I’ve procrastinated in updating my book of practical Ruby coding. But the site got an unexpected boost in interest and traffic when someone posted it to Hacker News this past week, possibly in response to the “Please Don’t Learn to Code” debate started by Jeff Atwood.

Sidenote: The Bastards Book did reach the front page when I submitted its introductory essay, aptly titled “Programming is for Anyone.” That sprawling essay needs to be revised but I believe in it even more.

The HN posting reached the top, something I couldn’t get it to do back when I originally posted the draft. It was encouraging to see the need for something like this out there and makes me want to jump back into this as a summer project. I’ve definitely thought of many more examples to include and have hopefully become a better writer.

The main “fix” will be moving it from my totally-overkill Ruby-on-Rails system, structuring the book’s handmade HTML code into something simple enough for Markdown, and pushing it to Github. I’ve since gotten familiar with Jekyll, which is mostly painless with the jekyll-bootstrap gem.