Tag Archives: data

Peter Norvig on cleverness and having enough data

I’m reposting this entry, nearly verbatim, from The Blog of Justice, which picks out a keen quote from Google/Stanford genius Peter Norvig, delivered during his Google Tech Talk on August 9, 2011.

“And it was fun looking at the comments, because you’d see things like ‘well, I’m throwing in this naive Bayes now, but I’m gonna come back and fix it up and come up with something better later.’ And the comment would be from 2006. [laughter] And I think what that says is, when you have enough data, sometimes, you don’t have to be too clever about coming up with the best algorithm.”

I think about this insight more often than I’d like to admit, in those frequent situations where you end up spending more time on a clever, graceful solution because you look down on the banal work of finding and gathering data (or, in the classical pre-computer world, fact-finding and research).

But I also think about it in the context of people who are clever but don’t have enough data to justify a “big data” solution. There’s an unfortunate tendency among these clever-but-non-tech-savvy types to think that, once someone tells them how to use a magical computer program, they’ll be able to finish their work.

The flaw here is, well, if you don’t have enough data (i.e., no more than a few thousand data points or observations), then no computer program will help you find any worthwhile insight. The greater tragedy is that, since the datasets involved here are small, these clever people could’ve done their work just fine without waiting for the computerized solution.

So yes, having lots of data can make up for a lack of cleverness, because computers are great at data processing. But if you’re in the opposite situation – a clever person without a lot of data – don’t overlook your cleverness.

Because of a typo, the government needs to keep your private data 10 times longer?

Yesterday the Obama administration approved new rules to greatly extend the time – from 180 days to 1,826 days (5 years) – that domestic intelligence services can retain American citizens’ private information. Citizens are eligible to be part of this federal data warehouse even when “there is no suspicion that they are tied to terrorism.”

As Charlie Savage reports in the New York Times:

Intelligence officials on Thursday said the new rules have been under development for about 18 months, and grew out of reviews launched after the failure to connect the dots about Umar Farouk Abdulmutallab, the “underwear bomber,” before his Dec. 25, 2009, attempt to bomb a Detroit-bound airliner.

After the failed attack, government agencies discovered they had intercepted communications by Al Qaeda in the Arabian Peninsula and received a report from a United States Consulate in Nigeria that could have identified the attacker, if the information had been compiled ahead of time.

The case of the “underwear bomber” is a strange justification for this expansion of data storage, because the 2009 Christmas terror attempt nearly succeeded thanks to a series of what seem like common human errors, not an information drought.

Shortly after the underwear bomber incident, the White House released a report examining how our vast intelligence network failed to prevent Abdulmutallab, the bomber, from boarding a flight from Amsterdam to Detroit.

One of the critical failures? Someone at the State Department, when sending information about Abdulmutallab to the National Counterterrorism Center, misspelled his name. Even though his father alerted American intelligence officials a full month before the attempted attack, our sophisticated surveillance system was partially stymied by a single misplaced letter.

As Foreign Policy reported in 2010:

State called an impromptu press briefing late Thursday evening to address the issue. The tone of the briefing was combative, as reporters pressed the “senior administration official” for details about the misspelling that he seemed not to want to give up. But here’s what we learned.

Someone (they won’t say who) at the State Department (presumably at the U.S. Embassy in Nigeria) did check to see if Abdulmutallab had a visa (they won’t say exactly when). That person was working off the Visas Viper cable originally sent from the embassy to the NCTC, which had the name wrong.

“There was a dropped letter in that — there was a misspelling,” the official said. “They checked the system. It didn’t come back positive. And so for a while, no one knew that this person had a visa.” (They won’t say for how long.)

The chain of failures is more complicated than that, but the fact that a typo was a big enough wrench to warrant special mention in the White House review is an indication that the government’s surveillance systems, despite the work of its data architects, engineers and scientists, were compromised by some pretty banal problems, like the lack of spell-check capability.

In fact, the White House report goes out of its way to assert that the information-sharing problems blamed for the failure to prevent the 9/11 attacks “have now, 8 years later, largely been overcome.” Information about Abdulmutallab (again, his own father met with U.S. officials to warn them about his son a month ahead of the attack), his association with Al Qaeda, and Al Qaeda’s attack planning “was available to all-source analysts at the CIA and the NCTC prior to the attempted attack.”

In other words, the 9/11 attacks were possible because government agencies wouldn’t share information with each other. Now they are happily sharing information with each other; they just aren’t diligently looking at it.

So the best solution is to enact a ten-fold increase in the legal time limit for storing American citizens’ data?

It sounds like the government’s ability to detect terrorists would be greatly improved by more user-friendly software and adherence to data-handling standards. The ability to catch slight misspellings and do fuzzy data matches is something that Facebook and Google users have enjoyed for years; hell, the basic concept has had a consumer-friendly implementation in Microsoft Word for about two decades. Were software overhauls enacted before deciding that the government needs more of its citizens’ private information? Or does the review of such technical details and policies seem too unsexy and pedantic for our intelligence bureaucracy?
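To make that concrete, here’s a minimal sketch of the idea using nothing fancier than Python’s standard-library difflib. The misspelled record is invented (the actual dropped letter in the cable was never made public), and the 0.9 threshold is arbitrary; the point is simply that an exact lookup fails on a one-letter discrepancy where a similarity score would flag the near-match for human review.

```python
import difflib

# Hypothetical watchlist record with a dropped letter; the real
# misspelling in the Visas Viper cable was never disclosed.
cable_records = ["Umar Farouk Abdulmutalab"]
query = "Umar Farouk Abdulmutallab"  # the correct spelling

# An exact lookup fails outright on the one-letter discrepancy.
print(query in cable_records)  # False

# A similarity ratio flags the near-match instead of silently missing it.
for record in cable_records:
    score = difflib.SequenceMatcher(None, query.lower(), record.lower()).ratio()
    if score >= 0.9:  # arbitrary threshold for this sketch
        print(f"Possible match ({score:.2f}): {record}")
```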

The Times article also mentions that the guidelines call for more duplication of entire databases…which is a bit confusing. I’m assuming that this doesn’t refer to making backup copies (in case of a hard drive failure), but to a method of data-sharing between analysts. This is how the Times describes it:

The guidelines are also expected to result in the center making more copies of entire databases and “data mining them” using complex algorithms to search for patterns that could indicate a threat.

Hopefully, this doesn’t mean that database files are being copied and passed around so that each department can have its own copy of another department’s data. This would seem to introduce a few major logistical issues: namely, how do you know the copy you have contains the latest data? Remember that the typo in Abdulmutallab’s name was one mistake that helped spawn a series of snafus. Are we going to have an incident in which a terrorist slips through because an analyst forgot to update his/her copy of a database before mining it? Also, there’s the possibility that some of these data copies might end up lying around long after their 5-year limit.
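If analysts really are mining local copies, the bare-minimum safeguard would be a freshness check before each run. Here’s a trivial sketch of that idea; every name and the one-day policy below are purely hypothetical.

```python
from datetime import datetime, timedelta

MAX_STALENESS = timedelta(days=1)  # hypothetical freshness policy

def safe_to_mine(snapshot_taken: datetime, source_last_updated: datetime) -> bool:
    """Only mine a copy that reflects the source's most recent update."""
    return snapshot_taken >= source_last_updated - MAX_STALENESS

# A copy made before the source's latest update gets rejected,
# forcing a re-sync instead of an analysis run on stale data.
print(safe_to_mine(datetime(2012, 3, 20), datetime(2012, 3, 22)))  # False
```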

There have been several reports of how intelligence agencies now suffer from too much data, to the point where analysts are “drowning in the data.” If a future attack goes unprevented and that is the reason cited, I hope the proposed reform is not “more data.”

Coding for Journalists 104: Pfizer’s Doctor Payments; Making a Better List

Update (12/30): So about an eon later, I’ve updated this by writing a guide for ProPublica. Heed that one. This one will remain in its obsolete state.

Update (4/28): Replaced the code and result files. Still haven’t written out a thorough explainer of what’s going on here.

Update (4/19): After revisiting this script, I see that it fails to capture some of the payments to doctors associated with entities. I’m going to rework this script and post an update soon.

So the world’s largest drug maker, Pfizer, decided to tell everyone which doctors they’ve been paying to speak and consult on their behalf in the latter half of 2009. These doctors are the same ones who, from time to time, recommend the use of Pfizer products.

From the NYT:

Pfizer, the world’s largest drug maker, said Wednesday that it paid about $20 million to 4,500 doctors and other medical professionals for consulting and speaking on its behalf in the last six months of 2009, its first public accounting of payments to the people who decide which drugs to recommend. Pfizer also paid $15.3 million to 250 academic medical centers and other research groups for clinical trials in the same period.

A spokeswoman for Pfizer, Kristen E. Neese, said most of the disclosures were required by an integrity agreement that the company signed in August to settle a federal investigation into the illegal promotion of drugs for off-label uses.

So, not an entirely altruistic release of information. But it’s out there nonetheless. You can view their list here. Jump to my results here.

Not bad at first glance. On further examination, however, it’s clear that the list is nearly useless unless you intend to click through all 480 pages manually, or unless you have one particular doctor in mind and care only about that doctor’s relationship. As a journalist, you probably have other questions, such as:

  • Which doctor received the most?
  • What was the largest kind of expenditure?
  • Were there any unusually large single-item payments?

None of these questions are answerable unless you have the list in a spreadsheet. As I mentioned in earlier lessons…there are cases when the information is freely available, but the provider hasn’t made it easy to analyze. Technically, they are fulfilling their requirement to be “transparent.”
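For a sense of scale: once the list is in a flat file, each of those questions becomes a one-liner. Here’s a sketch using pandas, with the file name and column names assumed (they’d depend on how you structure the scrape).

```python
import pandas as pd

# Assumed columns: doctor, category, amount (as numbers, not "$1,500" strings)
payments = pd.read_csv("pfizer_payments.csv")

# Which doctor received the most?
print(payments.groupby("doctor")["amount"].sum().idxmax())

# What was the largest kind of expenditure?
print(payments.groupby("category")["amount"].sum().idxmax())

# Were there any unusually large single-item payments?
print(payments.nlargest(10, "amount"))
```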

I’ll give them the benefit of the doubt that they truly want this list to be as accessible and visible as possible…I tried emailing them to ask for the list as a single spreadsheet, but the email function was broken. So, let’s just write some code to save them some work and to get our answers a little quicker.
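As a rough sketch of the general approach (not the finished script from the updates above), you’d page through the list and flatten each table row into the CSV assumed in the pandas sketch. The URL pattern and HTML selectors below are invented; adapt them to whatever the live pages actually serve.

```python
import csv
import requests
from bs4 import BeautifulSoup

# Hypothetical URL pattern; the real paging parameter will differ.
BASE_URL = "http://www.pfizer.com/hcp_payments?page={}"

with open("pfizer_payments.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["doctor", "category", "amount"])
    for page in range(1, 481):  # the list spans 480 pages
        html = requests.get(BASE_URL.format(page)).text
        soup = BeautifulSoup(html, "html.parser")
        # "table.payments" is a guess at the markup; [1:] skips the header row.
        for row in soup.select("table.payments tr")[1:]:
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if cells:
                writer.writerow(cells)
```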