Soft-launching Stanford’s Computational Journalism Lab

Last week, my Stanford colleagues and I launched the website for the Computational Journalism Lab. It’s a soft-launch, as the lab isn’t a physical lab, but more of an umbrella for the computational work and meetups that we are planning, such as a Computational Journalism conference in 2016, our collaboration in the California Civic Data Coalition, and of course, our coursework.

Also, as I’ve mentioned before on this blog, pretty much all of my future blogging is going to happen at blog.danwin.com, which is built in Jekyll. Not coincidentally, the Computational Journalism Lab site is also built in Jekyll — see its Github here.

Goodbye WordPress; hello Jekyll and Markdown

WordPress is great software, but the fact that it takes many minutes to just publish a simple, HTML-error-ridden post has kept me from updating this blog. So I’m planning on doing future blogging at blog.danwin.com. It takes a little time to set up the Jekyll instance, and I’ll over-obsess over the design details. But at least I’ll be improving my hacking skills, as opposed to struggling with the monolith that is WordPress. At some point, I’ll figure out a way to intelligently migrate posts from danwin.com. But if all I end up doing is writing new posts to blog.danwin.com in good ol’ Markdown, that’s still a victory for me.

The big, ignored business of extorting undocumented migrants

I've been a long-time New Yorker subscriber but it's been a long time since I've sat down with the print edition to read a story. Today, I sat still long enough to finish Sarah Stillman's stunningly well-reported piece, Where are the Children? in the April 27, 2015 edition.

The article is brimming with examples of the dire impact of unintended consequences, the most macro of which is how U.S. efforts to secure the border have led to a new economy for those who prey on the smuggling of migrants:

â€œItâ€™s exactly like Prohibitionâ€”exactly like bootlegging,â€ Terry Goddard told me recently. As the mayor of Phoenix during the nineteen-eighties and Arizonaâ€™s attorney general from 2003 to 2011, Goddard had presided over the explosion in border-security measures, aggressively seeking to eliminate stash houses where migrants were held for ransom. But he discovered that the source of the problem went much deeper than individual smugglers. Arizonaâ€™s harsh anti-immigrant laws made undocumented victims afraid to coÃ¶perate with law enforcement on prosecutions, and, as long as the country continued to rely on immigrant labor while giving workers few avenues for legal entry, extortionists would have access to a consistent supply of prey. â€œYou can push down the practice in Arizona,â€ he said, of stash-house extortions, â€œand it will pop up elsewhere.â€ In recent years, â€œelsewhereâ€ has come to mean the Rio Grande Valley, in Texasâ€”the Godoy boysâ€™ planned point of entry into the country.

Even as lengthy as Stillman's article is, there are passages that themselves could be feature-length articles:

A year and a half before Brayan and Robinson Godoy travelled north, I arrived at Mexicoâ€™s border with Guatemala, in the state of Tabasco, to join a group of nearly forty Central American women on a bus trip to search for their children, spouses, and relatives, many of whom had vanished en route to the U.S. During the next three weeks, we travelled three thousand miles along Mexicoâ€™s migrant trail, tracing the same path north to Texas that awaited the Godoys, before we looped back south, through the countryâ€™s interior kidnapping hubs. At morgues, hospitals, shelters, and mass graves, we looked for clues to the whereabouts of the missing.

Today, NPR published an interview with Stillman, and a commenter eloquently summed up the inverse relationship between the importance of Stillman's work and the public interest:

Judging by the dearth of comments here, one would conclude that a) this is not a very important issue for most NPR readers/listeners, and b) those who have paid attention, have little or no compassion for the human condition, and a gross misunderstanding about undocumented aliens.

Unix for Journalism – Computational Methods in the Civic Sphere

I’ve been too busy with teaching and other things to blog…most of my writing is currently being done for my class, Computational Methods in the Civic Sphere, which is aimed at students in the Stanford Journalism Masters program, but open to any student. It has about 30 students currently, few with any programming experience, and they’re all learning how to do things from the command-line.

While I do love the command-line, my intention is to use the command-line’s step-by-step nature, and the Unix philosophy of “Do one thing and do it well” to show how complex computational tasks can be broken down to a series of discrete, explainable-to-an-eight-year-old steps…though the design and sum of those steps can be intimidating.

One of my favorite lessons I’ve written so far: Montage the world from the command-line with Google and Instagram, which explains how to combine data from two APIs to create a montage of any location in the world…and starting just from the humble command-line.

How to compile OpenCV 2.4.10 on Ubuntu 14.04 and 14.10

For my upcoming Computational Methods in the Civic Sphere class at Stanford, I wanted my students to have access to OpenCV so that they could explore computer-vision algorithms, such as face detection with Haar classifiers.

On the Stanford FarmShare machines (which run on Ubuntu 13.10), I had trouble getting their installation of OpenCV working, but was able to use the
Anaconda distribution to install both Python 2.7.8 and OpenCV 2.4.9.1 via the Binstar package repo.

Briefly, here are the instructions:

Get the Anaconda download link
curl (*the-anaconda-URL-script*) -o /tmp/anaconda-install.sh && bash /tmp/anaconda-install.sh
conda install binstar
conda install -c https://conda.binstar.org/menpo opencv

Note: For Mac users for whom `brew install opencv` isn’t working: Anaconda worked well enough for me, though I had to install from a different pacakge repo:

conda install -c https://conda.binstar.org/jjhelmus opencv

The Anaconda system, which I hadn't used before but find really convenient, automatically upgrades/downgrades the necessary dependencies (such as numpy).

Using Anaconda works fine on fresh Ubuntu installs (I tested on AWS and Digital Ocean), but I wanted to see if I could compile it from source just in case I couldn't use Anaconda. This ended up being a very painful time of wading through blog articles and Github issues. Admittedly, I'm not at all an expert at *nix administration, but it's obvious there's a lot of incomplete and varying answers out there.

The help.ubuntu.docs on OpenCV are the most extensive, but right at the top, they state:

Ubuntu's latest incarnation, Utopic Unicorn, comes with a new version of libav, and opencv sources will fail to build with this new library version. Likewise, some packages required by the script no longer exist (libxine-dev, ffmpeg) in the standard repositories. The procedures and script described below will therefore not work at least since Ubuntu 14.10!

The removal of ffmpeg from the official Ubuntu package repo is, from what I can tell, the main source of errors when trying to compile OpenCV for Ubuntu 14.04/14.10. Many of the instructions deal with getting ffmpeg from a personal-package-archive and then trying to build OpenCV. That approach didn't work for me, but admittedly, I didn't test out all the possible variables (such as version of ffmpeg).

In the end, what worked was to simply just set the flag to build without ffmpeg:

  cmake [etc] -D WITH_FFMPEG=OFF

I've created a gist to build out all the software I want for my class machines, but here are the relevant parts for OpenCV:

sudo apt-get update && sudo apt-get -y upgrade
sudo apt-get -y dist-upgrade && sudo apt-get -y autoremove

# build developer tools. Some of these are probably non-pertinent
sudo apt-get install -y git-core curl zlib1g-dev build-essential \
     libssl-dev libreadline-dev libyaml-dev libsqlite3-dev \
     libxml2-dev libxslt1-dev libcurl4-openssl-dev \
     python-software-properties

# numpy is a dependency for OpenCV, so most of these other
# packages are probably optional
sudo apt-get install -y python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose
## Other scientific libraries (obviously not needed for OpenCV)
pip install -U scikit-learn
pip install -U nltk

### opencv from source
# first, installing some utilities
sudo apt-get install -y qt-sdk unzip
OPENCV_VER=2.4.10
curl "http://fossies.org/linux/misc/opencv-${OPENCV_VER}.zip" -o opencv-${OPENCV_VER}.zip
unzip "opencv-${OPENCV_VER}.zip" && cd "opencv-${OPENCV_VER}"
mkdir build && cd build
# build without ffmpeg
cmake -D WITH_TBB=ON -D BUILD_NEW_PYTHON_SUPPORT=ON \
      -D WITH_V4L=ON -D INSTALL_C_EXAMPLES=ON \
      -D INSTALL_PYTHON_EXAMPLES=ON -D BUILD_EXAMPLES=ON \
      -D WITH_QT=ON -D WITH_OPENGL=ON -D WITH_VTK=ON \
      -D WITH_FFMPEG=OFF ..

A recurring issue I had come across – I didn't test it myself, but just saw it in the variety of speculation regarding the difficulty of building OpenCV – is that building with a Python other than the system's Python would cause problems. So, for what it's worth, the above process works with 14.04's Python 2.7.6, and 14.10's 2.7.8. I'm not much of a Python user myself so I don't know much about best practices regarding environment…pyenv works pretty effortlessly (that is, it works just like rbenv), but I didn't try it in relation to building OpenCV.

Also, this isn't the bare minimum…I'm not sure what dev tools or which cmake flags are are absolutely needed, or if qt-sdk is needed if you don't build with Qt support. But it works, so hopefully anyone Googling this issue will be able to make some progress.

Note: Other things I tried that did not work on clean installs of Ubuntu 14.04/14.10:

Sebastian Montabone's writeup
sudo apt-get build-dep opencv
sudo apt-get install libopencv-dev (as per StackOverflow)

The Python code needed to do simple face-detection looks something like this (based off of examples from OpenCV-Python and Practical Python and OpenCV:

(You can find pre=built XML classifiers at the OpenCV repo)


import cv2
face_cascade_path = '/YOUR/PATH/TO/haarcascade_frontalface_default.xml'
face_cascade = cv2.CascadeClassifier(face_cascade_path)

scale_factor = 1.1
min_neighbors = 3
min_size = (30, 30)
flags = cv2.cv.CV_HAAR_SCALE_IMAGE

# load the image
image_path = "YOUR/PATH/TO/image.jpg"
image = cv2.imread(image_path)

# this does the work
rects = face_cascade.detectMultiScale(image, scaleFactor = scale_factor,
  minNeighbors = min_neighbors, minSize = min_size, flags = flags)

for( x, y, w, h ) in rects:
  cv2.rectangle(image, (x, y), (x + w, y + h), (255, 255, 0), 2)

cv2.imwrite("YOUR/PATH/TO/output.jpg", image)

The Computer Science of Tomorrow, Today, and the Past

“The thing you are doing has likely been done before. And that might seem depressing, but I think it’s the most wonderful thing ever. Because it means an education in computer science is worth something.”

The quote above comes from an informative and entertaining talk that John Graham-Cumming gave at OSCON in 2013, in which he points out that there hasn’t been much new in computing since 1983, what with wireless networking first implemented in 1971, markup languages in the 1960s, and, as pictured above, “hypertext with clickable links, 1967″.

Because progress has largely consisted of performance and interface improvements, it is comforting to know that applying yourself to the knowledge of computing is nearly as vital and timeless a pursuit as math and literacy. Fittingly, I saw this video after someone linked to it on Hacker News, in response to a 1964 Atlantic article I linked to: Martin Greenberger’s “Computers of Tomorrow”.

In his 50-year-old essay, Greenberger effectively predicts, the Internet, net neutrality, cloud computing, and the automation of the New York Stock Exchange. But the best line is the essay’s last line, which aligns with Graham-Cumming’s optimism about human knowledge in computing:

By 2000 AD man should have a much better comprehension of himself and his system, not because he will be innately any smarter than he is today, but because he will have learned to use imaginatively the most powerful amplifier of intelligence yet devised.

Graham-Cumming’s talk is available on SlideShare too.

MySQL (and SQLite) for Data Journalists

My first task since joining Stanford was to create the Public Affairs Data Journalism I, a required course for all students in the graduate program. As public records and government workings deserve their own class, I didn’t know for sure if it’d be worth teaching SQL to my students, most of whom hadn’t gone beyond Excel.

But after running out of patience with the finicky nature ofÂ spreadsheet GUIs, I decided to unload a bevy of SQL syntax on my studentsÂ earlier this month. They picked it up so quickly that last week, I based their midterm almost entirely on evaluating their SQL prowess, and I can say with some admiration, they now have more knowledge of SQL than I did after a year or so of self-learning…even though for many of them, this is their first time learning a programming language in the context of journalism.

I’ve been creatingÂ tutorials for their convenience, and you can use them too. Because I’m dealing with a variety of operating systems, from Windows XP to OSX 10.6 to 10.9, I decided to give them the option of doing the lessons in MySQL or SQLite…and it wasn’t too frustrating, though I spent more time than I’d like creating multiplatform datasets and lessons.

I’ll write more about my thoughts on teaching SQL in a longer post, but I can say that I am most definitely now a believer in moving past spreadsheets to SQL’s expressive way of data querying.

Preparing eggs and programming

As an egg fan, I loved this Times dining article about a “tasting expedition” of the high- and low-brow egg dishes in New York. As a programmer, there were two passages that stuck out to me about the nature of skill, complexity, and genius behind cooking (and programming):

â€œIn the French Laundry book, no one step is very difficult,â€ [author Michael Ruhlman] said. â€œThere are just so many that it takes technique to its farthest reaches.â€ For instance, Mr. Keller insists that fava beans be peeled before cooking. â€œIf youâ€™re good, it takes 20 seconds per bean,â€ Mr. Ruhlman said. â€œSomeone in his kitchen put a batch of them in the water once it lost its boil. Thomas [Keller] said, â€˜Get rid of those.â€™ That guy didnâ€™t last.â€

This next passage comes after the Times writer and Ruhlman visit Aldea in the Flatiron district to try George Mendes’ “signature Knoll Krest Farm Egg with bacalao (salt cod), olive and potato.”

After we left, I expressed surprise that so much effort went into a dish billed on the menu as a â€œsnack.â€ Mr. Ruhlman nodded. â€œWorking as a chef can be mind-numbingly boring,â€ he said. â€œThe reason dishes are so good is not because someone is a genius, but because he or she has done it a thousand times. They are looking to keep their minds active and energetic.â€

I couldn’t describe programming better myself: no one line is difficult, its the order and arrangement of thousands of steps that make a useful program. And you don’t have to be a genius, but because programming inherently involves repetitive processes, you have to keep your mind alive, and be continuously observant and critical of the patterns you come across.

Everything is (even more) broken

Tech journalist Quinn Norton believes Everything is Broken in computing and in computer security. And so do I. But I’ve rarely disagreed so strongly with someone over something we both ostensibly agree on.

Part of the problem is that Norton’s essay is a bit of a pointless sprawl. I agree completely that the “average piece-of-shit Windows desktop is so complex that no one person on Earth really knows what all of it is doing, or how.” And that this complacency is a bad thing. However, Norton then goes on to list a bunch of government-led security attacks, such as the NSA-Snowden revelations and Stuxnet, in such a way that her message is inescapably, “Windows is bad because the government wants it so.” Or, as Norton puts it, “The NSA is doing so well because software is bullshit.”

Or, maybe the NSA (yes, the same NSA that hired someone who very publicly flouted government surveillance to be their systems admin) is “doing so well” because our political status quo chooses to fund and enable it, and exploiting weaknesses in software is just one tool in the NSA’s politically-supported mission? In which case, improving yourÂ software is a very indirect, and mostly ineffective way (including for reasons inherent to software), if you wanted to diminish the NSA’s surveillance power.

This conflatingÂ of cause and effect is reflected in how Norton obviously understands how and why software is flawed, but somehow manages to draw the wrong conclusions.Â For me, the most disagreeable part of Norton’s essay is at the end:

Computers donâ€™t serve the needs of both privacy and coordination not because itâ€™s somehow mathematically impossible. There are plenty of schemes that could federate or safely encrypt our data, plenty of ways we could regain privacy and make our computers work better by default. It isnâ€™t happening now because we havenâ€™t demanded that it should, not because no one is clever enough to make that happen.

This notion that “if only those programmers got their priorities in order, things would be good” is so ass-backwards that I believe Norton’s well-intentioned essay ends up being unintentionally harmful. Even a Manhattan Project of the world’s most diligent and ethical programmers would still be bound by the thesisÂ from of Alan Turing and Alonzo Church, that some computational problemsÂ basicallyÂ are “mathematically impossible.”Â While I don’t have the computer science chops or patience to write out a proof, but I would humbly submit that the kind of program needed to provide predictable security for all the kinds of wondrous, but unpredictable things humans want to communicate, could be reduced to aÂ Entscheidungsproblem.

So not only is “everything broken”, but there are things broken in such a way that they can’t be fixed in the way weÂ wantÂ them to be fixed, just like the proverbial cake we want to eat and have.Â We’re never going to get a Facebook that makes it possible to find, within milliseconds, 5 select friends, out of a userbase of 1 billion spread out across the world, and share with them an intimately personal photo in such a way thatÂ only those five friends will see it and ensure that they never share it in such a way that a potential employer, 5 years from now, might come across it — and to provide such privacy that doesn’t severely impede the convenience and power of social sharing.

The problem is not a horde of incompetent, inhuman programmers at Facebook. It’s not the NSA that pulling the levers here. It’s not the corporate-industrial complex that seeks to strip away our privacy for commercial greed.Â The problem is usÂ — and by us, I mean what Norton describes as the “â€Šthe normal people living their lives under all this insanity” — and our natural desire to wieldÂ this amazing power. But unlessÂ the range of human thought, action, and desire becomes so limited that it can be summed by a Turing machine, then we must acceptÂ that power and privacy involve trade-offs that not just software companies, but thatÂ we, “the normals”,Â have to make. We have to choose to limit our dependance on systems that are never truly “fixed” in the way humans want them to be.

There’s a whole essay’s worthÂ of tangental argument about how we, “the normals,” have toÂ raise ourÂ standard of computing literacy, that we must teach the computer, and not the other way around, but I don’t think it’s fair for me to critique Norton’s essay for being sprawling by writing an even more sprawling piece of my own. But what I find most ironic in Norton’s piece is the distorted concept ofÂ agency; her notionÂ that Facebook and Google are not all-powerful, and in fact, “live about a week from total ruin all the time” if only “the normals” would rise up and protest so that those otherwise clever software developers would prove old man Turing wrong.

To put it another way, imagine a literary critic writing an essay about how the state of society’s literacy is “just fucked” because look at how well such Tom Clancy and the Twilight series have sold, despite their derivative, formulaic content. And that publishers and authors would produce more intellectually-edifying books, if only readers everywhere would rise up and demand those intellectually-edifying books to be written. Yes, those very same readers who caused those popular bestsellers to be bestsellers in the first place.

This begging the question is obviously not Norton’s intent. And again, I can’t argue against the notion that “everything is broken” and that everyone needs to be much more aware of it. But I think Norton’s need to cram every hot-tech issue into her critique, that we are all getting hacked because NSA/Stuxnet, ends upÂ conveying a solution that is even less useful than had it been your typical angry, non-actionable essay.

Our complex addiction to medical spending – the New Yorker on the “pain-pills problem”

What we extravagantly spend on healthcare has become even more a pressing topic with the recent release of Medicare spending data â€“ the most detailed dataset yet made public â€“ and of course, the ongoing implementation of Obamacare. Last week, The New Yorker’s Rachel Aviv brought focus to a microlevel of medical spending: a doctor who thought he could save the most rejected of patients, and who now will spend up to 33 years in prison for “the unlawful distribution of controlled substances” that led to the deaths of several patients.

Unfortunately, Aviv’s article, titled “Prescription for Disaster; The heartland’s pain-pills problem” is behind a paywall. Here’s part of the abstract:

In 2005, the medical examiner in Wichita, Kansas, noticed a cluster of deaths that were unusually similar in nature: in three years, sixteen men and women, between the ages of twenty-two and fifty-two, had died in their sleep. In the hours before they lost consciousness, they had been sluggish and dopey, struggling to stay awake. A few had complained of chest pain. â€œI canâ€™t catch my breath,â€ one kept saying. All of them had taken painkillers prescribed by a family practice called the Schneider Medical Clinic.

On September 13, 2005, Schneider arrived at work to find the clinic cordoned off with police tape…Agents from the Kansas Bureau of Investigation and the Drug Enforcement Administration led Schneider into one of the clinicâ€™s fourteen exam rooms and asked him why he had been prescribing so many opioid painkillers.

He responded that sixty per cent of his patients suffered from chronic pain, and few other physicians in the area would treat them. The agents wrote, â€œHe tries to believe his patients when they describe their health problems and he will believe them until they prove themselves wrong.â€ When asked how many of his patients had died, Schneider said that he didnâ€™t know.

Aviv’s article is powerful, moreso because it managed to cover an impressive number of dysfunctional systems while detailing the very human aspect of failure. Dr. Schneider, as Aviv portrays him, is almost the archetype of the ideal heartland doctor. He was a manager of the local grocery’s meat department until he became inspired by how his hospital treated his daughter for pneumonia. He became the first in his family to graduate from college; his daughter tells Aviv that Schneider ‘was “never comfortable with the level of status” that came with the job.’

But Dr. Schneider’s humility and kind-heartedness ran into an ill-timed storm of palliative care research, social dysfunction, and market forces. After he opened his own practice, Dr. Schneider told Aviv that:

Pharmaceutical reps came in and enlightened me that it was O.K. to treat chronic pain because there is no real cure. They had all sorts of studies showing that the long-acting medications were appropriate.

Other doctors in Wichita sent their unwanted patients to Dr. Schneider. And “nearly a dozen sales representatives” would visit him each day, taking him out to meals and cluttering his office with branded gifts. I looked for Dr. Schneider’s name in ProPublica’s Dollars for Docs database, but his clinical work happened well before the wave of financial disclosures that came in 2007. Cephalon, which would later become notorious and criminally charged for illegally marketing its narcotics, was a frequent patron of Dr. Schneider’s. From Aviv’s report:

The company sent Schneider’s physician assistant to New York for an “Actiq consultants meeting”; it paid for her to stay at the W hotel and to ride a boat on the Hudson. In 2003, Schneider was sent to an Actiq conference in New Orleans, sponsored by Cephalon. He said that a specialist told him, “You could stick multiple Actiq suckers in your mouth and your rear end and you still wouldn’t overdose. It’s clinically impossible”

People shocked by the revelation of financial ties between doctors and drug companies often assume (sometimes without enough justification, in my opinion) that the doctors are traitors to the Hippocratic Oath and humanity. But Aviv’s report describes a doctor who is so Pollyannish that a prison guard chides him for talking to The New Yorker and Aviv: “you know she’s just going to tear you apart,” Schneider apparently confides to Aviv.

There’s more going on here than just the chase for money by the drug companies, or the naivetÃ©/cravenness of the doctors who prescribe the drugs. There’s the huge issue of palliative care â€“ how do we know whether patients really “need” painkillers? â€“ and the pressure of politics, including the role of the D.E.A. and patient advocates, and of course, how much government should subsidize health care at all. There’s even the peripheral issue of electronic medical records and bureaucracy; Dr. Schneider’s clinic was so poorly managed that patients, who were rejected by one of the clinic’s doctors, would simply sign up with another doctor who worked at Schneider’s clinic, thanks to the clinic’s sloppy record keeping. It didn’t help that the clinic took in so many patients that “appointments were generally scheduled every ten minutes.”

It’s worth picking up a print copy â€“ or even subscribing â€“ just to read Aviv’s article on Dr. Schneider. It reveals the astonishingly heart-breaking complexity behind medical spending, and yet, even pushing the limits of the longform article format, it barely begins to describe the depth of that complexity.