Category Archives: thoughts

Thoughts, musings, etc.

ProPublica at Netexplo

A few weeks ago, I had the honor of joining my colleagues Charlie Ornstein and Tracy Weber in Paris to receive a Netexplo award for our work with Dollars for Docs. Check out the presentation video they prepared for the awards ceremony (held at UNESCO), featuring us as bobbleheads.

The easiest way to explain Netexplo is to repeat what one of the organizers told me: it hopes to be the South by Southwest of Paris. Check out the quirky trophy we got:

Netexplo trophy

Check out the other great entries in this year’s ceremony.

This was my first trip to Paris so of course I took photos like a shutterbug tourist. You can view them on my Flickr account:

Sony Alpha NEX-7: Paris - Eiffel Tower

Centre Pompidou, Musée National d'Art Moderne

Tuileries Garden

The Eiffel Tower, as seen from the Trocadéro.

Because of a typo, the government needs to keep your private data 10 times longer?

Yesterday the Obama administration approved new rules to greatly extend the time – from 180 days to 1,826 days (5 years) – that domestic intelligence services can retain American citizens’ private information. Citizens are eligible to be part of this federal data warehouse even when “there is no suspicion that they are tied to terrorism.”

As Charlie Savage in the New York Times reports:

Intelligence officials on Thursday said the new rules have been under development for about 18 months, and grew out of reviews launched after the failure to connect the dots about Umar Farouk Abdulmutallab, the “underwear bomber,” before his Dec. 25, 2009, attempt to bomb a Detroit-bound airliner.

After the failed attack, government agencies discovered they had intercepted communications by Al Qaeda in the Arabian Peninsula and received a report from a United States Consulate in Nigeria that could have identified the attacker, if the information had been compiled ahead of time.

The case of the “underwear bomber” is a strange justification for this expansion of data storage, because the 2009 Christmas terror attempt nearly succeeded thanks to a series of what seem like common human errors, not an information drought.

Shortly after the underwear bomber incident, the White House released a report examining how our vast intelligence network failed to prevent Abdulmutallab, the bomber, from boarding a flight from Amsterdam to Detroit.

One of the critical failures? Someone at the State Department, when sending information about Abdulmutallab to the National Counterterrorism Center, misspelled his name. Even though his father alerted American intelligence officials a full month before the attempted attack, our sophisticated surveillance system was partially stymied by a single misplaced letter.

As Foreign Policy reported in 2010:

State called an impromptu press briefing late Thursday evening to address the issue. The tone of the briefing was combative, as reporters pressed the “senior administration official” for details about the misspelling that he seemed not to want to give up. But here’s what we learned.

Someone (they won’t say who) at the State Department (presumably at the U.S. Embassy in Nigeria) did check to see if Abdulmutallab had a visa (they won’t say exactly when). That person was working off the Visas Viper cable originally sent from the embassy to the NCTC, which had the name wrong.

“There was a dropped letter in that — there was a misspelling,” the official said. “They checked the system. It didn’t come back positive. And so for a while, no one knew that this person had a visa.” (They won’t say for how long)

The chain of failures is more complicated than that, but the fact that a typo was a big enough wrench to warrant special mention in the White House review is an indication that the government’s surveillance systems, despite the work of its data architects, engineers and scientists, were compromised by some pretty banal problems, like not having spell-check capability.

In fact, the White House report goes out of its way to assert that the information-sharing problems that failed to prevent the 9/11 attacks “have now, 8 years later, largely been overcome.” Information about Abdulmutallab (again, his own father met with U.S. officials to warn them of his son a month ahead of the attack), his association with Al Qaeda, and Al Qaeda’s attack planning, “was available to all-source analysts at the CIA and the NCTC prior to the attempted attack.”

In other words, the 9/11 attack was possible because government agencies wouldn’t share information with each other. Now they are happily sharing information with each other; they just aren’t diligently looking at it.

So the best solution is to enact a ten-fold increase in the legal time limit for storing American citizens’ data?

It sounds like the government’s ability to detect terrorists would be greatly improved by better, more user-friendly software and adherence to data-handling standards. The ability to catch slight misspellings and do fuzzy data matches is something that Facebook and Google users have enjoyed for years; hell, the basic concept and a consumer-friendly implementation have been in Microsoft Word for about 20 years. Have software overhauls been enacted before deciding that the government needs more of its citizens’ private information? Or does the review of such technical details and policies seem too unsexy and pedantic for our intelligence bureaucracy?
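
To be concrete about what “fuzzy matching” means here, below is a toy Ruby sketch – my own illustration, not anything resembling an actual watchlist system – of the decades-old edit-distance technique that would flag a name differing from a watchlist entry by a single dropped letter:

def levenshtein(a, b)
  a = a.downcase
  b = b.downcase
  # build a (a.length + 1) x (b.length + 1) table of edit distances
  rows = Array.new(a.length + 1) { |i| [i] + [0] * b.length }
  (0..b.length).each { |j| rows[0][j] = j }
  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      rows[i][j] = [
        rows[i - 1][j] + 1,        # deletion
        rows[i][j - 1] + 1,        # insertion
        rows[i - 1][j - 1] + cost  # substitution
      ].min
    end
  end
  rows[a.length][b.length]
end

puts levenshtein("Abdulmutallab", "Abdulmutalab")   # => 1, i.e. one dropped letter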

The Times article also mentions that the guidelines call for more duplication of entire databases…which is a bit confusing. I’m assuming that this doesn’t refer to making backup copies (in case of a hard drive failure), but to a method of data-sharing between analysts. This is how the Times describes it:

The guidelines are also expected to result in the center making more copies of entire databases and “data mining them” using complex algorithms to search for patterns that could indicate a threat.

Hopefully, this doesn’t mean that database files are being copied and passed around so that each department can have their own copy of another department’s data. This would seem to introduce a few major logistical issues: namely, how do you know the copy you have contains the latest data? Remember that the typo in Abdulmutallab’s name was one mistake that helped spawn a series of snafus. Are we going to have an incident in which a terrorist slips through because an analyst forgot to update his/her copy of a database before mining it? Also, there’s the possibility that some of these data copies might end up lying around long after their 5-year limit.

There have been several reports of how intelligence agencies now suffer from too much data, to the point where analysts are “drowning in the data.” If that is ever cited as the reason an attack went unprevented, I hope the proposed reform is not “more data.”

Tools to get to the precipice of programming

I’m not a master programmer, but it’s been so long since I did my first “Hello World” that I don’t remember how people first grok the point of programming (for me, it was to get a good grade in programming class).

So when teaching non-programmers the value of code, I’m hoping there’s an even friendlier, shallower first step than the many zero-to-coder references out there, including Zed Shaw’s excellent Learn Code the Hard Way series.

Not only should this first step be “easy”, it should be nearly ubiquitous, free to use, and most importantly: immediately beneficial to both beginners and experts. The point here is not to teach coding, per se, but to get beginners to a precipice of great things. So that when they stand at the edge, they can at least see something to program towards, even if the end goal is simply labor-aversion, i.e. “I don’t want to copy-and-paste 100 web page tables by hand.”

Here are a few tools I’ve tried:

Inspecting a cat photo

1. Using the web inspector – I’ve never seen the point of taking an in-depth HTML class (unless you want to become a full-time web designer/developer, and even then…) because so few non-techies even grasp that webpages are (largely) text, external multimedia assets (such as photos and videos), and the text that describes where those assets come from. To them, editing a webpage is as arcane as compiling a binary.

Nothing breaks that illusion better than the web inspector. Its basic element inspector and network panel immediately illustrate the “magic” behind the web. As a bonus, with regular, casual use, the inspector can teach you the HTML and CSS vocabulary if you do intend to become a developer. It’s hard to think of another tool that is as ubiquitous and easy to use as the web inspector, yet as immensely useful to beginner and expert alike.

Its uses are immediate, especially for anyone who’s ever wanted to download a video from YouTube. I’ve shown journalists how this simple-to-use tool helped me in my investigative reporting when I needed to find an XML file that was obfuscated behind a Flash object.

In a hands-on class I taught, a student asked “So how do I get that XML into Excel?” – and that’s when you can begin to describe the joy of a basic for loop.
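
For the record, that for loop might look something like this in Ruby – the URL and element names here are invented, and the Nokogiri parsing gem is assumed to be installed:

require 'open-uri'
require 'nokogiri'   # a widely used HTML/XML parsing gem
require 'csv'

# Hypothetical XML found via the web inspector; the URL and tag names are made up
xml = Nokogiri::XML(URI.open("http://example.com/hidden-data.xml"))

CSV.open("data.csv", "w") do |csv|
  csv << ["name", "amount"]                 # header row for Excel
  xml.css("record").each do |rec|           # the basic for loop: one <record> per CSV row
    csv << [rec.at_css("name").text, rec.at_css("amount").text]
  end
end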

Here’s an overview of a hands-on web session I taught at NICAR12. Here’s the guide I wrote for my ProPublica project. And here’s the first of a multi-part introduction to the web inspector.

Refine WH Visitors

2. Google Refine – Refine is spreadsheet-like software that allows you to easily explore and clean data: the most common example is resolving varied entries (“JOHN F KENNEDY”, “John F. Kennedy”, “Jack Kennedy”, “John Fitzgerald Kennedy”) into one (“John F. Kennedy”). Given that so many great investigative stories and data projects start with “How many times does this person’s name appear in this messy database?”, its uses are immediate and obvious.

Refine is an open-source tool that runs out of the web browser, yet its point-and-click interface is so powerful that I’m happy to take my data out of my scripted workflow in order to use Refine’s features on it. Not only can you use regular expressions to help filter/clean your data, you can write full-on scripts, making Refine a pretty good environment for showing some basic concepts of code (such as variables and functions).
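
For a sense of how Refine’s clustering works under the hood, here is a loose re-creation in Ruby of its “fingerprint” idea – my own simplification, not Refine’s actual code:

# Normalize each name to a key; names that collide on the same key
# are probably the same entity.
def fingerprint(name)
  name.downcase
      .gsub(/[^a-z0-9\s]/, "")   # strip punctuation
      .split                     # break into tokens
      .uniq
      .sort
      .join(" ")
end

names = ["JOHN F KENNEDY", "John F. Kennedy", "Kennedy, John F."]
puts names.group_by { |n| fingerprint(n) }.inspect
# all three collapse onto the single key "f john kennedy"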

I wrote a guide showing how Refine was essential for one of my investigative data projects. Refine’s official video tutorial is also a great place to start.

3. Regular Expressions – Maybe it’s because my own comp-sci curriculum skipped regexes, leaving me to figure out their worth much, much later than I should have. But I really try to push learning regexes every time the following questions are asked:

  • In Excel, how do I split this “last_name, first_name middle_name” column into three different columns?
  • In Excel, how do I get all these date formats to be the same?
  • In Excel, how do I extract the zip code from this address field?

…and so on. The use of LEFT, TRIM, RIGHT, etc. functions always seems to be much more convoluted than the regex needed to do this kind of simple parsing. And while regexes aren’t the answer to every parsing problem, they sure deliver a lot of return for the investment (which can start with a simple cheat sheet next to your computer).
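
For illustration, here is roughly what answers to the first and third questions look like as Ruby regexes (the sample strings are made up):

# Splitting a "last_name, first_name middle_name" field into three parts
name = "Kennedy, John Fitzgerald"
last, first, middle = name.match(/(\w+),\s*(\w+)\s*(\w*)/).captures
# => ["Kennedy", "John", "Fitzgerald"]

# Extracting a zip code from an address field
address = "30 Rockefeller Plaza, New York, NY 10112"
zip = address[/\b\d{5}(?:-\d{4})?\b/]
# => "10112"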

Regular-expressions.info has always been one of my favorite references. Zed Shaw is also writing a book on regexes. I’ve also written a lengthy tutorial on regexes.

So none of these tools or concepts involve programming…yet. But they’re immediately useful on their own, opening new doors to useful data just enough to interest beginners in going further. In that sense, I think these tools make for an inviting introduction to learning programming.

Code, Don’t Tell: Programming as an Essential Journalism Skill

(tl;dr: this started out as a short post about how all of journalism can benefit from learning to code. It is now a massive rant that maybe I’ll split up later.)

This post is inspired by a recent discussion on the NICAR (National Institute of Computer Assisted Reporting) mailing list, in which a journalism professor asked how her students should position themselves for a newspaper’s web developer job. The answer I suggested was: have them learn programming and have them publish projects online, on their own, that they can later show an employer.

But I’m becoming more convinced that programming – a decent grasp of it, not make-the-next-Facebook level – is an essential skill for all journalists, even ones that never intend to produce a webpage in their career. And for students, or any aspiring journalist, I think I can make the case that programming is absolutely the most important skill to learn in school (along with honing your interviewing, research and writing skills at the school paper/radio/TV station) if you want to improve your chances for a serious journalism career.

#

Hersh and Bamford

A few years ago, I attended a panel on investigative reporting that featured Seymour Hersh – the Pulitzer Prize-winning reporter who exposed the My Lai massacre – and James Bamford, a former Navy intelligence analyst who is well-respected for writing books that managed to penetrate the workings of even the super-secret NSA (affectionately known as the No Such Agency).

Seymour Hersh

The discussion turned to the use of the Freedom of Information Act, a law that reporters wield to get sensitive, unpublished documents from the federal government. Given that the NSA isn’t known for being chatty, Bamford explained how his stories were put together through exhaustive uses of FOIA.

When an audience member asked Hersh how often he used FOIA, his response – and I’m quoting from memory here – was:

“Why the fuck would I FOIA documents?”

I can’t read Hersh’s mind, but I’m guessing that he wasn’t wholesale dismissing the importance of FOIA, which has been essential in countless investigative stories.

He probably meant that: He’s Seymour Hersh. He exposed the My Lai massacre. He’s a regular contributor to the New Yorker. The kind of stories he writes for the New Yorker involve the type of people who wouldn’t be caught dead making a statement that would ever be reprinted in a document subject to a FOIA request.

And even if they were FOIAble, those requests take time (sometimes many years) and involve countless lawyers and legal wrangling. In the meantime, he’s up to his eyeballs in secret officials who, for some reason or other, are eager to spill their secrets to him, because he’s Seymour Hersh.

So, why the fuck would he FOIA documents?

Bamford doesn’t have quite that brand power and his targets are likely more reticent. But he’s learned – possibly through his Navy days – that there are plenty of important secrets in the stacks of documents that have been deemed fit for public consumption. It’s not always obvious, even to intelligence officials, how a mass of innocuous information can inadvertently reveal big secrets.

So, back to the subject of aspiring journalists: to them, the already-employed journalists are the Seymour Hershes. These journalists have established themselves and their beats, which they can focus on full-time because they’re earning a salary to do so. Their phones contain the cell numbers of all the important officials who won’t ignore their 9 p.m. Sunday night calls. When they write something, a large number of trees, barrels of ink, and/or corporate-purchased bandwidth are readily expended to make it known.

On the other end of the spectrum, the aspiring journalists are the Bamfords. Between work shifts, they have the same right as Joe Public to attend meetings and leave inquiries with contact@yourcity.gov. But for them, even the local police department might as well be the NSA. They aren’t going to be privy to hush-hush phone calls or let past the murder scene tape.

If this is your current vantage point, even if you intend to be a Hersh-style reporter, you’re going to have to Bamford your way into a field that has an increasing amount of noise and a corresponding shrinkage in paid, established positions.

Given this situation, I can think of no better strategy than to learn programming. This is a skill that not only makes every other journalism skill (writing, researching, publishing) more efficient, but can, like Bamford’s relentless FOIAs, reveal stories that non-programming journalists will never be able to tell – and in an unfortunate number of cases, never even conceive of.

Learning the Hard Way

Zed Shaw, who isn’t a journalist but is renowned for both his code contributions and his widely-read (and free!) how-to-program books, puts it this way:

“Programming as a profession is only moderately interesting. It can be a good job, but you could make about the same money and be happier running a fast food joint. You’re much better off using code as your secret weapon in another profession.

People who can code in the world of technology companies are a dime a dozen and get no respect. People who can code in biology, medicine, government, sociology, physics, history, and mathematics are respected and can do amazing things to advance those disciplines.

Note that he doesn’t have many romantic notions about programming as a profession. However, programming is something bigger than just a job: it’s an essential, game-changing skill.

Code, don’t tell

“Show, don’t tell” is how my high school journalism teacher taught us how to write. Instead of telling the reader something:

James Smith is one of the toughest football players on the team.

show it, through observed evidence:

When the halftime whistle blew, James Smith walked to the sidelines and collapsed. He later was told that the neck pain he played with through the second quarter was caused by a fracture in his neck.

I guess “Code, don’t tell” doesn’t really make sense; it’s just my made-up way of saying that we have fantastically more ways than ever to tell – blog posts, retweets, status updates, auto-aggregations and other forms of repurposing – but we’re little better equipped to find and develop the actual stories. Programming is a skill that cuts through the noise, allows for the analysis and reporting of new, substantive information sources, and even provides a way to create innovative story-telling forms (i.e. the web developer’s role).

So to follow my high school teacher’s advice, here’s an overview of my two most successful journalism projects so far, both done at ProPublica. As I explain later, both more or less originated from me sitting on my couch, being annoyed by what I saw as a lack of transparency. The first one, SOPA Opera, was initially self-published and probably could have been done entirely from the couch. The other, Dollars for Docs, was a full out effort by my colleagues and me. But it was programming-driven at every phase.

#

SOPA Opera

I won’t rehash the debate over this now-dead Internet regulation law, but the inspiration for the SOPA Opera news app was simply: I had read plenty of debate about SOPA for months. But when I wanted to see just which legislators actually supported it and their reasons for doing so, there wasn’t yet a great resource for that.

If you know about the official legislative site, THOMAS, and are familiar with its navigation, you could at least find the list of sponsors. But good luck trawling the Congressional committee sites to find transcripts and testimony related to the law. It goes without saying that a list of opponents doesn’t exist and is beyond the official scope of THOMAS anyway.

So SOPA Opera, boiled down, is a pretty pedantic concept: “Hey, here’s a list of Congressmembers and what I’ve found out so far about their positions on SOPA.”

In other words, changing this:

SOPA sponsors, on THOMAS

To this:

The gist of SOPA Opera could be done without any programming whatsoever. You could even build a static image in Photoshop and upload it onto the Internet. So what role did programming play in this? It made it very easy to gather the already-available information, which included: the official list of sponsors, the boilerplate biographical and district information on every Congressmember (including their mug shots), and contribution data from the Center for Responsive Politics.

No exaggeration: a decent programmer could build a nice site from this data in about half-an-hour. The jazzy part of the site – the dynamic sorting of the list – was already built and offered as a free plugin to use (courtesy David DeSandro). It’s entirely possible to create SOPA Opera by hand, given a few days and an infinite amount of patience.
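
To make that half-hour claim a bit more concrete, here is a bare-bones sketch of the approach – not the actual SOPA Opera code, and it assumes you’ve already compiled a lawmakers.csv with name, party and position columns:

require 'csv'
require 'erb'

# lawmakers.csv is assumed to have name, party and position columns
lawmakers = CSV.read("lawmakers.csv", headers: true)

template = ERB.new <<~HTML
  <ul>
  <% lawmakers.each do |row| %>
    <li class="<%= row['position'] %>">
      <%= row['name'] %> (<%= row['party'] %>) &mdash; <%= row['position'] %>
    </li>
  <% end %>
  </ul>
HTML

File.write("index.html", template.result(binding))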

So programming allowed me to save my time and energy for the actual reporting. I thought about building a scraper to go through each Congressmember’s Facebook and Twitter page to search for the term “SOPA.” But until the blackout, most lawmakers had nothing to say on the topic. So at first, my “research” largely involved typing “SOPA [some congressmember’s name]” into Google News and usually finding nothing.

When the SOPA issue blew up during the Jan. 18 blackout, I didn’t have to do any searching, as Congressmembers pretty much rushed to make known their opposition to SOPA. SOPA Opera was designed in a way to make it easy for constituents to look up their representative and, if I had no information about him/her, tell me what they found out after talking to their representative. Or, in a few cases, Congressional staffers contacted me directly.

SOPA Opera easily broke the single-day traffic record at ProPublica. This was mostly due to blackout-participating sites like Craigslist that directed their traffic to us as a reference. Clearly, what caused the seismic shift on the SOPA debate were the mega-sites that coordinated the millions of emails and phone calls to lawmakers, and SOPA Opera was an indirect beneficiary of the increased public interest.

But I believe that SOPA Opera made at least one important contribution to the debate: it made very clear the level and characteristics of support enjoyed by SOPA. One thing that the THOMAS listing of sponsors fails to do is note the political parties of the lawmakers. I felt that this was a critical piece and it was easy to get and display. The result: many visitors to SOPA Opera who had believed that SOPA was a diabolical scheme by [whatever-party-they-oppose] were shocked at how SOPA’s support was so bi-partisan and broad.

I heard from a number of people who had been highly energized about the anti-SOPA debate yet were completely shocked that Sen. Al Franken – automatically assumed to be on the side of “Internet freedom” – was, in fact, a sponsor of SOPA’s counterpart in the Senate. This was no state secret: Sen. Franken has been passionate and outspoken in his support and was one of the few who didn’t back away after political support collapsed post-blackout.

SOPA Opera’s success probably owes less to my skill than to the dismal state of accessibility in our legislative process. This goes to show that even when you arrive extremely late to the game, it’s possible to make a significant impact by simply having an idea of how things can be better. This applies in just about any situation and profession. Programming just makes it much easier to push your creation forward.

Some technical details on self-publishing

I don’t want to dwell too much on the web-side of things, as that is just one specific use of programming. But to get back on the topic of how someone can position themselves for a web-related media job, SOPA Opera is a really excellent example of the potential in self-publishing.

SOPA Opera spent about a week on a domain (sopaopera.org) that I purchased for $10. It didn’t bear the ProPublica brand then, and I didn’t have time to promote it beyond a few tweets and submitting it to Hacker News and Reddit.

But in just a week, sopaopera.org had racked up about 150,000 pageviews before we migrated it to ProPublica:

That’s not a huge number in itself, and traffic to it increased exponentially under ProPublica’s umbrella. But it had gained enough notice that prominent sites were linking to it. The problem I had at the start – Googling a random lawmaker’s name and the term “SOPA” and finding absolutely nothing – was solved, as Google indexed all of the auto-generated lawmaker pages on SOPA Opera highly. At least one Congressmember’s office emailed me to update his page.

Not bad for a holiday break project and $10, using free resources and tools that are available to anyone with a computer. Back when I applied for newspaper jobs, I had to carry a portfolio of cut-out newspaper articles to show editors that at least a few people (my college paper and the one newspaper I interned at) had been willing to waste trees and ink on me. You were out of luck if all you had were a bunch of links to blog posts.

The mindset is different today: articles published on a traditional publication’s website, or at an online-only organization like the Huffington Post, can count as legit clips. But I’d like to think that showing a full-blown website that includes not only traditional reporting content, but examples of how visual and interface design can tell a new story, as well as being able to provide concrete metrics (pageviews, referring links) of impact, would be even more impressive to today’s news editors.

#

Dollars for Docs

Like SOPA Opera, Dollars for Docs (aka “D4D”) is late to its respective topic. It has long been accepted practice for medical companies to pay doctors to promote their products, not much different from a notable athlete who endorses a shoe that she considers to be the best for her sport. But in recent years, lawmakers and regulators have called for more transparency of these financial ties to prevent cases in which doctors are unduly influenced by their benefactors.

Data on company-to-doctor payments is at least two decades old: Minnesota enacted a law in 1993 requiring companies to disclose their payments. However, that “data” came in the form of paper records that had to be hand-entered into a computer – after, of course, you visited the records’ actual storage location and photocopied each page at 25 cents a pop. For that reason, the records went unexamined for at least a decade until Dr. Joseph Ross (now at Yale University) and the Public Citizen advocacy group collected and analyzed them.

In 2007, they published their findings in the Journal of the American Medical Association, with the conclusion that the payment records were “compromised by incomplete disclosure as well as insufficient access”:

  • In Vermont (which enacted a public disclosure law in 2001), most disclosures were redacted for “trade secrets” reasons. Of the publicly disclosed payments, 75% lacked information identifying the recipient.
  • In Minnesota, many of the companies had years in which they reported nothing.
  • The “public” disclosures were pretty much inaccessible to the public. Dr. Ross and Public Citizen had to go to court to get the Vermont records.

Dr. Ross told me that after his study was published, not only was it apparent that the public was in the dark, but doctors themselves had no idea that the data were even being collected. The Minnesota pharmacy board was subsequently so swamped by requests from other researchers, hospitals and litigants that it began publishing the disclosures online.

At around the same time, the New York Times published its own investigation using the Minnesota records. The Times analyzed both the company disclosures and Medicaid payments to Minnesota psychiatrists and found that during a period in which company payments to Minnesota psychiatrists increased, antipsychotic drug prescriptions for children jumped more than 900 percent.

The Times investigation (also in 2007) sparked a large political fight in which U.S. Senate investigators targeted prominent psychiatrists whose work had expanded the use of antipsychotic drugs as treatment for children.

The end result of this was a proposed federal law to mandate these payment disclosures nationwide. This law was later folded into the 2010 health care reform package. By 2013, the federal government will publish a database of these disclosures.

Couch database

The idea for D4D was sparked by something I wrote for my blog one evening. I was writing some programming tutorials to show journalists, well, how programming could be used for everyday reporting, and I needed a current example. I came across this Times article, Pfizer Gives Details on Payments to Doctors, which reported that Pfizer was fulfilling the terms of a legal settlement by publishing a searchable database of the health professionals it paid. At that point, it was the fourth drug company to have disclosed its payments in advance of the 2013 law – most of the others had done so also as part of settling their own lawsuits.

At that point, I knew virtually nothing about the issue. But what I did see was that Pfizer’s site seemed unnecessarily cumbersome. Though the disclosures were mandated, Pfizer’s site made it difficult to do simple analyses such as finding the professionals who had received substantial amounts or even the sum of the database’s payments. So I wrote a scraper and published the code and data for others to use.
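
A scraper like that is usually just a loop over paginated result pages. What follows is only the general shape of the idea – the URL, parameters and markup are invented for illustration, not Pfizer’s actual site, and the Nokogiri gem is assumed:

require 'open-uri'
require 'nokogiri'
require 'csv'

CSV.open("payments.csv", "w") do |csv|
  csv << ["doctor", "category", "amount"]
  (1..500).each do |page_num|                       # walk the paginated results
    url = "http://example.com/payments?page=#{page_num}"
    page = Nokogiri::HTML(URI.open(url))
    rows = page.css("table#results tr").drop(1)     # skip the header row
    break if rows.empty?                            # stop when the pages run out
    rows.each do |tr|
      csv << tr.css("td").map { |td| td.text.strip }
    end
  end
end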

Data-scraping and pharmaceutical payments isn’t a high-traffic topic, but the blog post caught the eyes of a few important people. My colleagues Charles Ornstein and Tracy Weber, both Pulitzer Prize-winning health reporters, were well-informed about the issue but didn’t know how feasible it would be to do a broad analysis of the data. I think I would eventually have written a scraper and published the delimited data for every drug company that had so far been required to disclose (this article in the Times, Data on Fees to Doctors Is Called Hard to Parse, was a particular inspiration), but Charlie and Tracy knew how to turn it into a strong investigation.

Another side-benefit of self-publishing: a reporter at PBS had also been working to collect and parse the data. He noticed my Pfizer post, though, and rather than being competitors, PBS teamed up with ProPublica in conducting the investigation.

#

Done stories are never dead. They aren’t even done.

Even with the groundbreaking work already done by Dr. Ross, Public Citizen and the Times in 2007, the subsequent Senate investigations, and the impending official database in 2013, there was still room for D4D to become a valuable, innovative investigation. It had considerable impact on the debate, prompting companies and medical institutions to change their disclosure and conflict-of-interest policies. D4D is currently the most-viewed resource at ProPublica.

You can read our series coverage here.

The most obvious way that D4D differed from previous investigations is that it looked at the available nationwide set of disclosures, not just Minnesota’s, which opened up a number of new data-driven angles.

So even on a well-trodden story, there are still countless angles when you combine a keen reporter’s instinct with an ability to collect data. A prevalent theme in the journalism industry – especially in the age of Twitter – is the drive to be first. I understand that it’s important in a ratings-sweep context, and it certainly gets the blood going when you’re competing for a story, but I’ve never thought that it was good for journalism or particularly useful to news consumers.

This is why I enjoy data-driven journalism. On any given topic, there are so many valid and substantial ways to cross-examine the evidence behind a story and produce meaningful stories. And as time passes, the analysis only becomes more interesting, not less, because more and more data is added to the picture. Data-driven journalism can be done through simple use of Excel. But programming, as I explain in the next section, can vastly increase the opportunities and depth of these analyses.

A sidenote: The hardest part of D4D was the logistics, which were only manageable through programming. It’s too boring to go into detail, but D4D required a collaborative reporting process more disciplined than “just send that Word doc as an attachment.” I don’t foresee a project of D4D’s scale being attempted by many other organizations, because of the difficulty in managing all the moving parts.

Never attribute to malice that which…

The most interesting reaction I got from D4D came not from doctors, but from researchers, compliance officers, and even federal investigators whose job it was to monitor these disclosures. They were thrilled that D4D made it so easy for them to check up on things. What surprised me was that I had just assumed that everyone who had a professional stake in overseeing these disclosures had already collected the data themselves. The scraping-and-collecting of the company reports was by far the easiest part of D4D, and even if you couldn’t program, you could at least hack together a system of copying-and-pasting from the various data sources, if such information was vital to your job.

The truth is that the concept of “user interface” is as critical to an investigation as it is in separating successful tech startups from their clunky, failed competitors. I occasionally get asked for advice by researchers on their own projects. What stalls a surprising number of interesting investigative projects and analyses is not something as malicious as a shady CEO or the threat of a lawsuit, but problems as benign as: a company has, over the years, posted hundreds of data files, all zipped and scattered across many webpages. Is it possible to somehow download them all, unzip them, and put them into a database (or Excel) just so I can find out if someone’s name is in there? Or: if I could only fix the few places where the agency screwed up in outputting this comma-delimited data file, I could analyze it in Excel.

If you can’t program, this isn’t a trivial problem: working with ten such files is an inconvenience. When there are 100 files, the momentum for an inquiry might just stop dead, especially if the inquiry arises from curiosity instead of certainty (think of how many great stories and investigations have come out of such casual inquiries). However, with some basic programming, the difference between organizing 10 files and 10,000 files is a matter of milliseconds. A programmer thus has the power not only to work with already-normalized datasets and produce interesting stories, but also to (efficiently) create datasets that otherwise would never have been examined.
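
For the curious, here is what the “hundreds of zipped files” chore looks like as a Ruby sketch – example.com and the link pattern are placeholders, and it shells out to the standard unzip utility rather than relying on any particular Ruby gem for decompression:

require 'open-uri'
require 'nokogiri'
require 'fileutils'

FileUtils.mkdir_p "downloads"

index = Nokogiri::HTML(URI.open("http://example.com/data-archive"))
zip_links = index.css("a[href$='.zip']").map { |a| a["href"] }  # assumes the links are absolute URLs

zip_links.each do |href|
  fname = File.join("downloads", File.basename(href))
  File.open(fname, "wb") { |f| f.write(URI.open(href).read) }   # download the zip
  system("unzip", "-o", fname, "-d", "downloads")               # shell out to unzip it
end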

To reiterate: there are an astonishing number of stories and inquiries that are derailed by what are trivial technical issues to any half-competent programmer. This is both alarming from a civic perspective, yet extremely exciting if you’re someone with the right skills at the right time, as you’ll never want for ideas.

#

A practical road to programming

OK, enough abstract talk. The amazing promise of programming is that there are so many opportunities. This leads to its biggest problem when you’re trying to learn it: there are too many places to start.

This section contains some advice. It may not be the best advice for everyone, but at least everything I mention below is absolutely free to use and to learn from.

The Basics

If you haven’t already, create a Twitter account. Stop kvetching about how “no one wants to read what I ate for breakfast” because that casually implies that people would want to read your 100,000 word opus, as soon as you finish it. They won’t.
But having a Twitter account provides one avenue to spread your work and just as importantly, a channel to learn from people who aren’t just tweeting about their breakfasts.

Get a Dropbox. Get used to putting stuff on the cloud. Not your sensitive documents, but things like e-reference books and datasets and code. This is much better (and in some ways, more secure) than emailing things to yourself.

Create a Google account. Even if you don’t use it for email, Google Documents is extremely useful. And you may find uses for the other parts of Google’s ecosystem.

Get a second or third browser: If you’re paranoid about Twitter/Facebook/Google cookies tracking you, then use one browser to handle those accounts and another browser to do all your other web-browsing.

Data stuff

If you don’t have Excel, you can download OpenOffice’s capable suite. That said, Google Docs is probably the easiest way to get into keeping spreadsheets, with the added bonus of being in the cloud, which makes it easy to collaborate and to hook your programming into. Again, be cautious about putting very sensitive data there. But I’d argue that the cloud is still safer than keeping everything on a stealable MacBook.

Google Refine: this was a project formerly known as Gridworks. It runs in the browser, but unlike Google Docs, you don’t need a Google account or an Internet connection to use it. It’s similar to a spreadsheet except that you won’t use it to calculate an average/sum of a column or to make charts. It’s for cleaning data: to quickly determine that “John F. Kennedy”, “Jack F. Kennedy”, “John Fitzgerald Kennedy” and “J.F. Kennedy” are all the same person. There has been some investigative data work that would not have been possible without this tool. Check out the video introduction here; I’ve also written a tutorial at ProPublica.

Given the number of important stories that basically boil down to finding someone’s name several times in a database, it’s a little amazing to me that every serious reporter hasn’t at least tried Google Refine.

Programming

Don’t get stalled by trying to figure out which is the best language. The three currently most popular – Ruby, Python and JavaScript – will serve your needs well, and you’ll find it relatively easy to pick up the other two after learning one.

That said, there’s one big difference: Ruby and Python are more general-purpose scripting languages. You can use them to sort your files, process (and build) a database, and even build a full-on website (you may have heard of Ruby on Rails and Django).

JavaScript is most typically used for web interactivity in the browser, everything from animating buttons to full-fledged applications. Because it’s in every browser, it takes no work to try it out and to produce interactive bits. It takes a little more work to set up JS to do things like web-scraping or local file processing.

JS has an additional advantage in that there are many interactive tutorials that you can access through your browser. Codecademy is one of the best known ones.

Programming resources

Zed Shaw’s “Learn Python the Hard Way” is one of the most popular beginner-level (and free) ebooks. There is also a Ruby version.

A little self-promotion: for people who best learn through practical projects, I’ve been working on my own Ruby beginner’s guide, tentatively titled the Bastards Book of Ruby. It’s a work in progress but you’ll find some ideas on starter projects to work towards (a good start is writing a script to download and store all your tweets).
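
Here is a skeleton of that starter project. Twitter’s API and authentication requirements have changed many times over the years, so treat the URL below as a stand-in for “some endpoint or export that returns your tweets as JSON”; the field names follow Twitter’s traditional JSON format, and the sqlite3 gem is assumed to be installed:

require 'open-uri'
require 'json'
require 'sqlite3'   # third-party gem

db = SQLite3::Database.new("tweets.db")
db.execute "CREATE TABLE IF NOT EXISTS tweets (id TEXT PRIMARY KEY, created_at TEXT, text TEXT)"

# Placeholder URL; swap in whatever endpoint or exported file holds your tweets as JSON
tweets = JSON.parse(URI.open("http://example.com/my-tweets.json").read)

tweets.each do |t|
  # INSERT OR IGNORE lets you re-run the script without duplicating rows
  db.execute("INSERT OR IGNORE INTO tweets (id, created_at, text) VALUES (?, ?, ?)",
             [t["id_str"], t["created_at"], t["text"]])
end

puts "#{db.get_first_value('SELECT COUNT(*) FROM tweets')} tweets stored"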

HTML

Don’t learn HTML. That is, don’t take a course in HTML. Learn enough to know that the HTML behind a webpage is just plain text. And learn enough to understand how:

<a target="_blank" href="http://en.wikipedia.org/wiki/HTML">Wikipedia's entry on HTML</a>

Creates a link that takes you to Wikipedia in a new window, like this: Wikipedia’s entry on HTML

That’s basically enough to get the concept of HTML (and the idea of meta-information) and to begin scraping webpages. One of the fastest ways to learn as you go is to get acquainted with your web browser’s inspector.
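
As a taste of that scraping payoff, here is a minimal Ruby snippet (using the Nokogiri parsing gem) that treats the tag above as the plain text it is and pulls out its pieces:

require 'nokogiri'

html = %q{<a target="_blank" href="http://en.wikipedia.org/wiki/HTML">Wikipedia's entry on HTML</a>}
link = Nokogiri::HTML(html).at_css("a")

puts link.text      # => Wikipedia's entry on HTML
puts link["href"]   # => http://en.wikipedia.org/wiki/HTML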

#

People to learn from

I’m not a particularly inspiring example of a journo-coder: I took up computer engineering because I was afraid there wouldn’t be many journalism jobs, and I kind of half-stumbled into combining reporting with code because it’s easy to learn programming fundamentals during college. This is why, if you’re a college student now, I strongly advise you to pick up programming at a time when learning is your main job in life.

Much more impressive to me are people who were doing well in their day jobs but decided to pick up programming in their spare hours – and then returned to do their day jobs with newfound inspiration and possibilities.

John Keefe (WNYC) – About a year-and-a-half ago, I remember John coming to Hacks/Hackers events to watch people code and to continually apologize for having to ask what he thought were dumb questions. In an incredibly short time, Keefe learned enough hacking to produce some great, creative apps and now heads WNYC’s data team, and also leads the discussion among news orgs on how to modernize the way we do things like election coverage.

Zach Sims – Sims is a co-founder of Codecademy. He was a poli-sci major who ventured into tech entrepreneurship but was frustrated that his lack of technical skills hindered his work. He learned programming on his own and with a co-founder, created Codecademy to teach others how to program. Codecademy itself is one of the hottest recent startups.

Neil Saunders – I stumbled across Neil Saunders’ blog while looking for R + Ruby examples. His blog is titled “What You’re Doing is Rather Desperate”, inspired by the reaction of a colleague who was apparently unimpressed with his use of programming in his bioscience job.
It’s a misconception that scientifically-minded professionals also know how to program. In fact, some don’t even have basic computer skills. Saunders not only publishes his code, but shows how others in his field can greatly improve their research with programming skills.

Kaitlyn Trigger – As this TechCrunch article puts it, Kaitlyn Trigger was a poli-sci major who “never took any computer classes.” She is together with Instagram co-founder Mike Krieger but had been frustrated that she didn’t understand his work. So she picked up Learn Python the Hard Way, learned the Python-based web framework Django, and created Lovestagram as a Valentine’s Day present.

It’s not just a cute story – learning Python and Django and making something within 2 months in your spare time is a pretty incredible achievement. It’s an awesome example of how having a project in mind can really help you learn code.

Matt Waite – Waite was an award-winning newspaper reporter before becoming a web developer. He went on, as a web developer, to win the most prestigious of journalism prizes: a Pulitzer for PolitiFact. He now teaches at the University of Nebraska-Lincoln and keeps a blog related to his work with journalism students.

Woody Allen: Every step is part of the writing process

Woody Allen (2006), photo by Colin Swan

One of the best books I’ve picked up recently is Eric Lax’s Conversations with Woody Allen: His Films, the Movies, and Moviemaking, which is basically a 400+ page interview, spanning decades, between the author and Allen. I’m a fair-weather fan myself – I’ve only seen a few of his movies – but I’ve always admired his relentless pursuit of his art, even when some of it seems to just be screwball comedy.

The book is divided into 8 parts for different facets of Allen’s work, including “Writing It”, “Shooting, Sets, Locations” and “Directing.” The following excerpt comes from the “Editing” part and in it, Allen talks about how he sees every step of filmmaking as part of the writing process (emphasis added):

[Eric Lax]: You’re involved with the details of every step of a film, and I’ve noticed that you do not delegate any part of its creation, even assembling a first cut from takes you’ve already selected.

[Woody Allen]: To me the movie is a handmade product. I was watching a documentary on editing on television the other day and many wonderful filmmakers were on and wonderful editors and everyone was talking briefly about how they edit. Years ago, they would turn it over to an editor. Or there are people I know who finish shooting and go away for a vacation and let the editor do a draft; then they come back and they check it out and do their changes.

I can’t do that. It would be unthinkable for me not to be in on every inch of movie – and this is not out of any sort of ego or sense of having to control; I just can’t imagine it any other way. How could I not be in on the editing, on the scoring, because I feel that the whole project is one big writing project?

You may not be writing with a typewriter once you get past the script phase, but when you’re picking locations and casting and on the set, you’re really writing. You’re writing with film, and you’re writing with film when you edit it together and you put some music in. This is all part of the writing process for me.

Lax, Eric (2009-08-12). Conversations with Woody Allen (p. 284). Random House, Inc. Kindle Edition.

I feel the exact same way about any kind of modern storytelling. Whether it’s done as a photo essay, movie, or news application/website, each step of the process can profoundly affect and be affected by your editorial vision. Back in the day of traditional journalism, it’s possible that you could have one person do just the interviewing and research and another person put it into story form. But the feedback in that process – an unexpectedly emotional interview that alters what you previously thought the story arc should be – would be almost entirely lost.

Google’s search has been dumbed down for the novice and the solipsistic

In response to Google’s latest plan to combine all your usage data on all of its platforms (GMail, Youtube, etc.) into one tidy user-and-advertiser-friendly package, I’m mostly sitting on the fence. This is because I’ve always assumed everything I type into Google Search will inextricably be linked to my personal GMail account…so I try not to search for anything job/life-sensitive in the same browser that I use GMail for.

But even before this policy, Google’s vanilla search (not the one inside Google+) has noticeably gotten too personalized. Not in a creepy sense, but in a you’re-too-dumb-to-figure-out-an-address-bar way. And this is not a good feature for us non-novice Internet users.

For example, I’ve been in an admittedly petty, losing competition with the younger, better-muscled Dan Nguyen for the top of Google’s search results. My identity (this blog, danwin.com) has always come in second place or lower…unless I perform a search for my name while logged into my Google/GMail account:

Me on Google Search. I am logged in on the left browser, logged out on the right.

The problem isn’t that my blog shows up first for my little search universe. It’s that my Google+ profile is on top, pushing all the other search results below the fold.

This seems really un-useful to me. The link to my own Google+ profile already occupies the top-left corner of my browser every time I visit a Google-owned site. I don’t need another prominent link to it. But I’ll give Google the benefit of the doubt here; they’re making the reasonable guess that someone who is searching for their own name is just looking for their own stuff…though conveniently, Google thinks the most important stuff about the searcher happens to be the searcher’s Google+ profile.

So here’s a more general example. I do a lot of photography and am always interested in what other people are doing. So here’s a search for “times square photos” in normal search (image search seems to behave the same way logged in or out):

'times square photos' on Google Search. I am logged in on the left browser, logged out on the right.

I generally love how Google automatically includes multimedia when relevant; for example, I rarely go to Google Maps now because typing in an address in the general search box, like “50 broadway” will bring up a nice local map. But in the case of “times square photos,” Google automatically assumes that I’m most interested in my own Times Square photos.

I may be a little solipsistic, but this is going overboard. And it seems counter-productive. If I’m the type of user to continually look up different kind of photos and all I see right away are my own photos, my search universe is going to be slightly duller.

Wasn’t the original assumption of search that the user is looking for something he/she doesn’t currently know? Like, the hours of my favorite bookstore. Doing that search pulls up a helpful sidebox, with the hours, next to the search results:

The Strand's opening hours

This is fantastic. And I do appreciate Google catering to my caveman phrasing of the question, especially when I’m on a mobile device.

But in the case of my example photo and name search, Google has gone a step too far in dumbing things down.

My hypothesis is that they are catering to the legion of users who get to yahoo.com by going to Google and typing in “Yahoo.” I imagine Google’s massive analytics system has told them that this is how many users get to GMail, as opposed to typing in gmail.com.

Google seems to be making this apply to every kind of search: when I type in a search query for “dan nguyen” or “times square photos”, Google checks to see if these are terms in my Google profile. If so, it pushes them to the top of the search pile, because I must be one of those idiots who doesn’t realize that the Dan+ in the top left corner is how I get to my Google profile, or who is too lazy to go to Flickr to look up my own Times Square photos.

The kicker is that that assumption contradicts my behavior. If I’m a user who was technical enough to figure out how to fill out my Google profile and properly link up third-party accounts…aren’t I the type of user who’s technical enough to get to my own Flickr photos by myself?

Searching for my own name is stupid, and kind of an edge case. But what if I’m working on a business site and have linked it (and/or its Google+ page) to my profile? And then I’m constantly doing searches to see how well that site is doing in SEO and SiteRank compared to similarly named/themed sites? Since I’m not in that situation, I can only guess: but will I have to use a separate browser just to get a reliable, business-savvy search?

I realize that this dumbing-down “feature” is the kind of thing that has to be auto-opt-in for its target audience. But I can think of a slightly non-intrusive way to make it manually opt-in. If what I really want are my own Times Square photos, then wait for me to prepend a “my” to the query. I’d think even the novice users could get into this habit.

A Million Pageviews, Thousands of Dollars Poorer, and Still Countlessly Richer.

Snowball fight in Times Square, Manhattan, New York

Update: This post rambled longer than I intended it to and I forgot that I had meant to include some observations on what I’ve noticed about Flickr’s traffic pattern. I’ve added some grafs to the bottom of this post.

My Flickr account hit 1,000,000 pageviews this weekend. Two years ago, I bought a Pro account shortly after the above photo of some punk kid throwing a snowball at me in Times Square was posted on Flickr’s blog. Since then I set my account to share all of my photos under the Creative Commons Non-commercial license (but I’ve let anyone who asks use them for free).

My account was on track to have 500K pageviews by October (of this past year) but then this photo of pilots marching on Wall Street hit Reddit and attracted 150K views all by itself, so then a million total views seemed just around the corner :).

Net Profit

Mermaid Parade 2010, Coney Island

I was paid $120 for this photo, which was used in New York’s campaign to remind people that they can’t smoke in Coney Island (or any other public park).


So how much have I gained monetarily in these two years of paying for a Flickr Pro account?

Two publications offered a total of $135 for my work. Minus the two years of Pro fees ($25 times 2 years) and that comes to about $80. If I spent at minimum 1 minute to shoot, edit, process, and upload each of my ~3,100 photos, I made a rate of $1.50/hour for my work.

Of course, I’ve spent much more time than one minute per photo. And I’ve taken far more than 3,100 photos (I probably have 15 to 20 times as many stored on my backup drives). And of course, thousands of dollars for my photo equipment, including repairs and replacements. So:

  • + $135 from publications
  • – $50 for Flickr Pro fees
  • – $8,000 (and change) for Canon 5D Mark 2, Canon S90, lenses, repairs from constant use in the rain/snow/etc.

So doing the math…I’m several thousands of dollars in the hole.

Gains

Monetarily, my photography is a large loss for me. I’m lucky enough to have a job (and, for better or worse, no car or mortgage and few other hobbies to pay for) to subsidize it. So why do I keep doing it and, in general, giving away my work for free?

Well, there is always the promise of potential gain:

  • I made $1,000 (mostly to cover expenses) to shoot a friend’s wedding because his fiancée liked the work I posted on my Facebook account…but weddings are so much work that I’ve decided to avoid shooting them if I can help it.
  • I’ve also taken photos for my job at ProPublica, including this portrait for a story that was published in the Washington Post. I’m not employed specifically to take photos, but it’s nice to be able to do it on office time.
  • I also now have a large cache of stock photos to use for the random sites I build. For example, I used the Times Square snowball photo to illustrate a programming lesson on image manipulation and face-recognition technology.
  • Even if my photos were up to professional par, I’m not the type to declare (in person) to others, “Hey, one of my hobbies is photography. Look at these pictures I took.” Flickr/Facebook/Tumblr is a nice passive-humblebrag way to show this side passion to others. And I’ve made a few good friends and new opportunities because of the visibility of my work.

In the scheme of things, a million pageviews is not a lot for two years…a photo might get that in a few days if it’s a popular enough meme. And pageviews have only a slight correlation to actual artistic merit (neither the snowball photo nor the pilots photo above is my favorite of its series). But it’s amazing and humbling to think that – if the average visitor who stumbles on my account might look at 4 photos – something I’ve done as a hobby might have reached nearly a quarter million people (not counting the times when sites take advantage of the CC licensing and reprint my photos).

Having any kind of audience, no matter how casual, is necessary practice for improving my art if I were to ever try to become a paid professional photographer. So that’s one important way that I’m getting something from my online publishing.

Photos are as free as the photographer wants them to be

My personal milestone coincidentally comes after the posting of two highly-linked-to articles on the costs of a photo: This Photograph is Not Free by John Mueller and This Photograph is Free by Tristan Nitot. They both make good points (Mueller’s response to Nitot is nuanced and deserves to also be considered).

Mueller and Nitot aren’t necessarily at odds with each other, so there’s not much for me to add. Photos are worth good money. To cater to a client, to buy the (extra) professional equipment, to spend more time in editing and post-processing (besides cropping, color-correction and contrast, I don’t do much else to my photos), to take more time to be there at an assignment – this is all most definitely worth charging for.

And that is precisely why I don’t put the effort into marketing or selling mine. The money isn’t worth taking that amount of time and energy from what I currently consider my main work and passion. However, what I’ve gotten so far from my photography – the extra incentive to explore the great city I live in, the countless friends and memories, and of course, the photos to look back on and reuse for whatever I want – the $8,000 deficit is easily covered by that. Having the option to easily share my photos to (hopefully) inspire and entertain others is icing.

One more side-benefit of using a public publishing system like Flickr: I couldn’t devise a better way to organize and browse my own work with minimal effort. And I’m often rediscovering what I considered to be throwaway photos because others find them interesting.

Here are a few other photos I’ve taken over the years that were either frequently-viewed or considered “interesting” by Flickr’s bizarre algorithm:

Jumping for joy during New York blizzard, Times Square
The Cat is the Hat
Sunset over Battery Park and Statue of Liberty
Woman in white, pilots
Pushing a Taxi - New York Blizzard Snowstorm Thundersnow Blaaaaagh
Lightning strikes the Empire State Building
Brooklyn Bridge photographer-tourist, Photo of
Atrium, Museum of Natural History
Union Square Show
Casting Couch (#NYFW Spring 2012)
Williamsburg: Beautiful dogs
New York Snow Blizzard 2011, Lone Man on the Brooklyn Bridge
Ground Zero NY celebrates news of Osama bin Laden's death
Grand Central Moncler NYFW Flash Mob Dancin
Broadway Rainstorm
Towers of Light 9/11
Manhattanhenge from a Taxi

A few more observations on Flickr pageviews: it’s hard to say whether 1,000,000 pageviews is a lot, especially considering the number of photos I’ve uploaded in total. Before the pilots-on-Wall-Street photo, I averaged about 200-500 pageviews a day. After that, I put more effort into maintaining my account and regularly uploading photos. Now, on a given day when I don’t upload anything particularly interesting, the account averages about 1,500 views.

Search engines bring very little traffic. So other than what (lack of) interest my photos have for the general Internet, I think my upload-and-forget mindset towards my account also limits my pageviews. I have a good friend on Flickr who gets far fewer pageviews but gets far more comments than I do. I rarely comment on my contacts’ photos and barely participate in the various groups.

I’m disconnected enough from the Flickr social scene that I only have a very vague understanding of how its Explore section works. Besides the blog, the Explore collection is the best way to get seen on Flickr. It features “interesting” photos as determined by an algorithm that, as best I can tell, is affected by some kind of in-group metric.

I’ve only had three photos make it to Explore: the snowball fight in Times Square, the lightning hitting the Empire State Building, and this one where my subway train got stuck and we had to walk out of the tunnel. The pilots photo did not make it to Explore, so I’m guessing that sheer traffic (particularly when a huge portion of it comes from one link on Reddit) is not necessarily a prime factor in getting noticed by Flickr’s algorithm.

The SOPA Debate and How It’s Affected by Congress’s Understanding of Child Porn

Rep. Lamar Smith, chairman of the House Judiciary Committee and SOPA sponsor

Update (1/22/2012): SOPA was indefinitely postponed by Rep. Lamar Smith on Friday (PIPA is likewise stalled). Rep. Smith also has another Internet rights bill on deck, though: The Protecting Children from Internet Pornographers Act of 2011, which mandates that Internet services store customer data for up to 18 months to make it easier for law enforcement to investigate their customers for child porn trafficking. This proposed bill is discussed in the latter half of this post, including how its level of support is similar to (and different from) SOPA’s.

H.R. 1981 has made it farther than SOPA did. It made it out of the Judiciary Committee (which is chaired by Rep. Lamar Smith and also handled SOPA) with a 19-10 vote in July of last year and was placed on the Union Calendar. Compare H.R. 1981’s progress to SOPA’s. H.R. 1981 has 39 cosponsors, compared to SOPA’s original 31. Read the text of H.R. 1981.

One thing I’ve learned from the whole SOPA affair is how obscure our lawmaking process is even in this digital age. The SOPA Opera site I put up doesn’t do anything but display publicly available information: which legislators support/oppose SOPA and why. But it still got a strong reaction from users, possibly because they misunderstand our government’s general grasp of technology issues.

Sen. Al Franken is one of the co-sponsors for PROTECT-IP, the Senate's version of SOPA

The most common refrain I saw was: “I cannot believe that Rep/Senator [insert name] is for SOPA! [insert optional expletive].” In particular, “Al Franken” was a frequently invoked name because his fervent advocacy on net neutrality seemed to make the Minnesota senator, in many of his supporters’ opinions, an obvious enemy of SOPA. In fact, one emailer accused me of being out to slander Franken, even though the official record shows that Franken has spoken strongly for PROTECT-IP (the Senate version of SOPA) and even co-sponsored it.

So there’s been a fair amount of confusion as to what mindset is responsible for SOPA. Since party lines can’t be used to determine the rightness/wrongness of SOPA, fingers have been pointed at the money trail: SOPA’s proponents reportedly receive far more money from media/entertainment-affiliated donors than they do from the tech industry. The opposite trend exists for the opponents.

It’s impossible, of course, to know exactly what’s in our legislators’ minds. But a key moment during the Nov. 16 House Judiciary hearing on SOPA suggests that their opinions may be rooted less in malice/greed (if you’re of the anti-SOPA persuasion) than in something far more prosaic: their level of technological comprehension.

You can watch the entire, incredibly-inconvenient-to-access webcast at the House Judiciary’s hearing page. I’ve excerpted a specific clip in which Rep. Tom Marino (R-PA) is asking Katherine Oyama (Google’s copyright lawyer) about why Google can stop child porn but not online piracy:

REP. MARINO: I want to thank Google for what it did for child pornography – getting it off the website. I was a prosecutor for 18 years and I find it commendable and I put those people away. So if you can do that with child pornography, why can you not do it [with] these rogue websites [The Pirate Bay, et al.]? Why not hire some whiz kids out of college to come in and monitor this and work for the company to take these off?

My daughter who is 16 and my son who is 12, we love to get on the Internet and we download music and we pay for it. And I get to a site and I say this is a new one, this is good, we can get some music here. And my daughter says Dad, don’t go near that one. It’s illegal, it’s free, and given the fact that you’re on Judiciary, I don’t think you should be doing that…Maybe we need to hire her [laugh]…but, why not?


OYAMA: The two problems are similar in that they’re both very serious problems; they’re both things that we all should be working to fight against. But they’re very different in how you go about combatting it. So for child porn, we are able to design a machine that is able to detect child porn. You can detect certain colors that would show up in pornography, you can detect flesh tones. You can have manual review, where someone would look at the content and they would say this is child porn and this shouldn’t appear.

We can’t do that for copyright just on our own. Because any video, any clip of content, it’s going to appear to the user to be the same thing. So you need to know from the rights holder…have you licensed it, have you authorized it, or is this infringement?


REP. MARINO: I only have a limited amount of time here and I appreciate your answer. But we have the technology, Google has the technology, we have the brainpower in this country, we certainly can figure it out.

The subject of child pornography is so awful that it’s little wonder that no one really thinks about how it’s actually detected and stopped. As it turns out, it’s not at all complicated.

When I was a college reporter, I had the idea to drive down to the county district attorney’s office and go through all the search warrants. Search warrants become part of the public record, but district attorneys can seal them if police worry that details in an affidavit or search warrant would jeopardize an investigation. I wanted to count how many times this was done at the county DA, because some major cases had been sealed for months. And I wondered if the DA was overzealous in keeping private what should be the people’s business.

But there were plenty of big cases among the unsealed warrants. I went to college in a small town but there was a bizarre, seemingly constant stream of students being charged with child porn possession. Either college students were becoming particularly perverse or the campus police happened to be crack cyber-sleuths in rooting out the purveyors.

I don’t know about the former, but I learned that the police were not particularly skilled at hacking, based on their notes in the search warrants. In fact, finding the suspects was comically easy because of the unique setup of our college network. Everyone in the dorms had an ethernet hookup but there was no Google, Napster or BitTorrent at the time. So one of the students built a search engine that allowed any student to search the shared files of every other student. And since Windows apparently made this file sharing a default (and at the time, 90+ percent of students’ computers were PCs), the student population had inadvertent access to a huge breadth of files, including MP3s and copied movies and even homework papers.

So to find out if anyone had child porn, the police could just log onto the search engine and type in the appropriate search terms. But the police didn’t even have to do this. Other students would stumble upon someone’s porn collection (you had the option of exploring anyone’s entire shared folder, not just files that came up on the search) and report it. The filenames were all the sickening indication needed to suspect someone of possession.

Google’s Oyama alludes to more technically sophisticated ways of detecting it, but the concept is just as simple as it was at my college: no matter how it’s found, child pornography is easy to categorize as such because of its characteristics, whether it’s the filename or the images themselves. In fact, it’s not even necessary for a human to view a suspected file to know (with high mathematical probability) that it contains the purported illegal content.
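
To make this concrete, here’s a minimal sketch (in Python) of how a service might flag known contraband files without anyone having to view them: compute a digest of each file and check it against a database of digests of already-confirmed material. The hash value and folder path below are made up for illustration, and real systems rely on perceptual hashes (Microsoft’s PhotoDNA, for example) that survive resizing and re-encoding, whereas the plain SHA-256 used here only catches byte-for-byte copies.

    import hashlib
    from pathlib import Path

    # Hypothetical digests of files already confirmed to be illegal content.
    KNOWN_BAD_HASHES = {
        "0f1e2d3c4b5a69788796a5b4c3d2e1f00112233445566778899aabbccddeeff",  # made-up example
    }

    def sha256_of(path: Path) -> str:
        """Return the hex SHA-256 digest of a file, read in 64 KB chunks."""
        digest = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def flag_suspect_files(shared_folder: Path):
        """Yield files whose digest matches a known-bad hash; no human has to open them."""
        for path in shared_folder.rglob("*"):
            if path.is_file() and sha256_of(path) in KNOWN_BAD_HASHES:
                yield path

    # Usage (hypothetical folder):
    # for hit in flag_suspect_files(Path("/srv/shared")):
    #     print("flagged:", hit)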

If you’ve ever used Shazam or any of the other song-recognition services, you’ve put this concept into practice. When you hold up a phone to identify a song playing over the bar’s speakers, it’s not as if your phone dials up one of Shazam’s resident music experts who then responds with her guess of the song. The Shazam app looks for certain peaks in the audio (as well as their spacing, i.e. the song’s rhythm) to generate a “fingerprint” of the song, and then compares it against Shazam’s master database of song “fingerprints”.

No human actually has to “listen” to the song. This is not a new technological concept; it’s as old as, well, the fingerprint database used by law enforcement.
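
For the curious, here’s a toy sketch of that idea in Python. It is emphatically not Shazam’s actual algorithm: it just slices the audio into fixed windows, keeps the loudest frequency in each slice as a (time, frequency) peak, and scores a candidate clip by how many of its peaks line up with a reference fingerprint. The window size and tolerance are arbitrary numbers chosen for illustration; real services hash constellations of several peaks so that matches survive noise and compression.

    import numpy as np

    def crude_fingerprint(samples, rate, window=4096):
        """Reduce a 1-D array of audio samples to a sparse list of
        (time_in_seconds, dominant_frequency_hz) pairs."""
        peaks = []
        for start in range(0, len(samples) - window, window):
            spectrum = np.abs(np.fft.rfft(samples[start:start + window]))
            loudest_bin = int(np.argmax(spectrum))   # strongest frequency in this slice
            peaks.append((start / rate, loudest_bin * rate / window))
        return peaks

    def similarity(candidate, reference, tolerance_hz=10.0):
        """Fraction of candidate peaks whose frequency lines up with some reference peak."""
        ref_freqs = [freq for _, freq in reference]
        hits = sum(any(abs(freq - rf) <= tolerance_hz for rf in ref_freqs)
                   for _, freq in candidate)
        return hits / max(len(candidate), 1)

Note that nothing in either function knows (or cares) who owns the recording or where the copy came from; that is exactly the information a licensing check would need.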

So what Rep. Marino essentially wants is for Google to build a Shazam-like service that doesn’t just identify a song by “listening” to it, but also determines whether whoever is playing that song has the legal right to do so. Thus, this anti-pirate-Shazam would have to determine from the musical signature of a song such things as whether it came from an iTunes or Amazon MP3 or a CD. And not only that, it would have to determine whether the MP3 or CD is a legal or illegal copy.

In a more physical sense, this is like demanding a machine that can determine from a photograph of your handbag whether it’s a cheap knockoff, and whether you actually own that bag – as opposed to having stolen it, or having bought it from someone who did steal it.

I’m not a particularly skilled engineer but I can’t fathom how this would be done and neither can Google, apparently. But Rep. Marino and at least a few others on the House Judiciary committee have more faith in Google’s technical prowess and they don’t believe that Google is doing enough.

And frankly, I can’t blame them.

From their apparently non-technical vantage point, what they see is that Google is an amazing company that seems to have no limit to its capabilities. It can instantly scour billions of webpages. It can plot in seconds the driving route from Des Moines to Oaxaca, Mexico. And at some point, it might even make a car that drives that route all by itself.

And Google has demonstrated the power to stop evil acts, because it has effectively prevented the spread of child porn in its search engine and other networks. Child porn is a terrible evil; software/media piracy less so. It stands to reason – in a non-technical person’s thinking – that anyone who can stop a great evil must surely be able to stop a lesser evil.

And so, to continue this line of reasoning, if Google doesn’t stop a lesser evil such as illegal MP3 distribution, then it must be because it doesn’t care enough. Or, as some House members noted, Google is loath to take action because it makes money off of sites that trade in ill-gotten intellectual property.

So you can see how one’s position on SOPA may be inspired not so much by devotion to an industry as by one’s understanding (or lack thereof) of the technological tradeoffs and hurdles.

Rep. Marino and at least a few of his colleagues see this as something within the realm of technological possibility for Google’s wizards, if only they had some legal incentive. Google and other SOPA opponents see that the problem SOPA ostensibly tackles is not one that can be solved with any amount of technological expertise. Thus, each side can be as anti-online-piracy/pro-intellectual-property as the other and yet fight fiercely over SOPA.

Smith’s anti-child porn, database-building bill

Though SOPA has taken the spotlight, there is another Internet-related bill on the House Judiciary’s agenda. It’s H.R. 1981, a.k.a. The Protecting Children from Internet Pornographers Act of 2011, which proposes a mandate that Internet sites keep track of their users’ IP information for up to 18 months, to make it easier to investigate Internet crimes – such as downloading child pornography.

H.R. 1981 was introduced by House Judiciary Chairman Rep. Lamar Smith (R-Tex.), who is, of course, the legislator who introduced SOPA. And like SOPA, the support for H.R. 1981 is non-partisan, because child pornography is neither a Republican nor a Democratic cause.

And also like SOPA, the opposition to H.R. 1981 crosses party lines. Among the most vocal opponents of the child porn bill is the Judiciary Committee’s ranking member, Rep. John Conyers (D-MI). Is it because he is in the pocket of the child porn lobby? No; Conyers argues that even though child porn is bad, H.R. 1981 relies on using technology in a way that is neither practical nor ethical. From CNET:

“The bill is mislabeled,” said Rep. John Conyers of Michigan, the senior Democrat on the panel. “This is not protecting children from Internet pornography. It’s creating a database for everybody in this country for a lot of other purposes.”

Rep. John Conyers (D-MI)

Rep. Conyers apparently understands that just because a law purports to fight something as evil (and, of course, politically unpopular) as child pornography doesn’t mean that the law’s actual implementation will be sound.

So when the wrong-to-be-righted is online piracy – i.e. SOPA – what is Conyers’ stance? He is one of its most vocal supporters:

The Internet has regrettably become a cash-cow for the criminals and organized crime cartels who profit from digital piracy and counterfeit products. Millions of American jobs are at stake because of these crimes.

Is it because Conyers is in the pocket of big media? Or that he hates the First Amendment? That’s not an easily apparent conclusion judging from his past votes and legislative history.

It’s of course possible that Conyers takes this particular stance on SOPA because SOPA, all things considered, happens to be a practical and fair law in the way that H.R. 1981 isn’t.

But a more cynical viewpoint is that Conyers’s technological understanding of one bill does not carry over to the other. Everyone has been screwed over at some point by a massive, faceless database, so it’s easy to be fearful of online databases – in fact, the less you know about computers, the more concerned you’ll be about the misuse of databases.

The technological issues underlying SOPA are arguably far more complex, though, and it’s not clear – as evidenced by Rep. Marino’s line of questioning – that Congressmembers, whether they support or oppose SOPA, have a full understanding of them.

As it stands, though, SOPA had 31 cosponsors in its heyday. H.R. 1981 has 39. It will be interesting to see if this bill by Rep. Smith will face any residual backlash after what happened with SOPA.

The Irrationality of Price Anchoring

Roulette wheel, by Conor Ogle from London, UK

tl;dr summary – people will unconsciously anchor their judgment on a random number given to them, even if that number was fabricated in front of their own eyes.

If you’ve never heard the term price anchor, you’ve undoubtedly experienced it anytime you’ve had to negotiate a purchase, whether it was for a used car, an online auction, or even cheap souvenirs at a Chinatown bazaar.

Even if you manage to haggle the seller down to half the initial stated price, don’t be too quick to congratulate yourself; the merchant might have gotten exactly what he/she wanted despite the discount, if the initial asking price was inflated enough.

And the kicker is: even if you knew the initial asking price was too high, you still might have been fooled into overpaying.

Daniel Kahneman, a psychologist who won the Nobel Prize for Economics, conducted a famous experiment in which college students were influenced by a random number that they knew to be random:

Amos and I once rigged a wheel of fortune. It was marked from 0 to 100, but we had it built so that it would stop only at 10 or 65. We recruited students of the University of Oregon as participants in our experiment. One of us would stand in front of a small group, spin the wheel, and ask them to write down the number on which the wheel stopped, which of course was either 10 or 65.

We then asked them two questions:

  • Is the percentage of African nations among UN members larger or smaller than the number you just wrote?
  • What is your best guess of the percentage of African nations in the UN?

The spin of a wheel of fortune – even one that is not rigged – cannot possibly yield useful information about anything, and the participants in our experiment should simply have ignored it. But they did not ignore it. The average estimates of those who saw 10 and 65 were 25% and 45%, respectively.

Kahneman, Daniel (2011-10-25). Thinking, Fast and Slow (p. 120). Macmillan.

In his recently published book (well worth purchasing, BTW), Thinking, Fast and Slow, Kahneman reflects that “we were not the first to observe the effects of anchors, but our experiment was the first demonstration of its absurdity: people’s judgments were influenced by an obviously uninformative number.”

This lesson is something to keep in mind as a content creator, particularly in a digital age in which it’s very easy to distribute your work for free. The fact that so many developers sell their apps for free (or “freemium”) creates a price anchor that makes even 99 cents seem too much for a high-quality game (see The Oatmeal’s excellent comic on this phenomenon).

Kahneman’s experiment suggests that some people might be fooled into paying more than they otherwise would, but in today’s interconnected world, where price comparisons are instant, it’s harder to get away with irrationally high prices. The more pressing concern is that pricing too low can cause customers to assign an irrationally low value to your product.

Rather than get offended when people demand that your product be free, see it as a quirk of human psychology – people can’t help but be fooled by even random numbers – and respond accordingly. You may just have to adjust your product or its pitch to emphasize its uniqueness before you have the freedom to set a palatable price anchor.