Monthly Archives: December 2009

Using PDFTOTEXT to convert a batch of PDFs to text and splitting them by page

I can’t believe how hard it was to find this (also, I know basically nothing about bash scripting), so maybe the next person who Googles this will find this post and save themselves a few minutes:

(replace ‘999’ with the number of pages in a document)

for f in *.PDF;
do
    for i in {1..999};
    do
        pdftotext -f "$i" -l "$i" -layout "$f" "${f%.PDF}_$i.txt";
    done;
done

Or:
for f in *.PDF; do for i in {1..999}; do pdftotext -f "$i" -l "$i" -layout "$f" "${f%.PDF}_$i.txt"; done; done

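The above script tells pdftotext to take every .PDF file in the current directory and convert each page into a separate text file named original_file_name_pagenumber.txt. The -f and -l options set the first and last page to convert, so giving both the same number extracts a single page.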
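If you don't want to hardcode the page count, here is a rough sketch that looks it up per file. It assumes pdfinfo is installed (it normally ships in the same Xpdf/Poppler package as pdftotext) and that it prints a "Pages:" line. The function names page_count and split_pdf_pages are just names I made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: look up each PDF's real page count instead of hardcoding 999.
# Assumes pdfinfo is available next to pdftotext and prints a "Pages:" line.

# Print the page count of one PDF, taken from pdfinfo's "Pages:" line.
page_count() {
  pdfinfo "$1" | awk '/^Pages:/ {print $2}'
}

# Write one text file per page: name_1.txt, name_2.txt, ...
# (split_pdf_pages is a hypothetical helper name.)
split_pdf_pages() {
  local f=$1 n i
  n=$(page_count "$f")
  # If pdfinfo printed nothing, n is empty, counts as 0, and the loop is skipped.
  for ((i = 1; i <= n; i++)); do
    pdftotext -f "$i" -l "$i" -layout "$f" "${f%.PDF}_$i.txt"
  done
}

for f in *.PDF; do
  [ -e "$f" ] || continue   # skip the literal "*.PDF" when nothing matches
  split_pdf_pages "$f"
done
```

As with the original, the glob is case-sensitive, so files ending in lowercase .pdf need the pattern and the ${f%.PDF} suffix changed to match.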

22 million Bush White House e-mails found

From the AP:

Computer technicians have found 22 million missing White House e-mails from the administration of President George W. Bush and the Obama administration is searching for dozens more days’ worth of potentially lost e-mail from the Bush years, according to two groups that filed suit over the failure by the Bush White House to install an electronic record keeping system.

Read more: http://www.sfgate.com/cgi-bin/article.cgi?f=/n/a/2009/12/14/national/w120825S68.DTL&tsp=1#ixzz0ZhjtALTt

An interesting tidbit from a Jan. 2009 Talking Points Memo article:

But it doesn’t sound like we’ll get everything. The new email system that the White House switched to four years ago allowed all staff members to access storage files and delete messages — unlike the previous system, which was designed to preserve all messages containing official business. Fuchs said that the White House has still declined to make a forensic copy of the records, so any emails that were deleted likely won’t be recovered. And since we’re talking about millions of emails, it may be impossible to know what we don’t have.

Facebook CEO Mark Zuckerberg Gets Caught With His Privacy Pants Down

UPDATE: This commenter notes that Hill is a friend of a friend of Zuckerberg’s, which is a different level of privacy than the whole world. Hill did note that a previous look into Zuckerberg’s profile showed it to be private (though she may have made the mutual friend since then).

True/Slant’s Kashmir Hill catches Facebook CEO Mark Zuckerberg not quite grokking his own brainchild’s privacy policies:

Facebook CEO Mark Zuckerberg either missed that article or doesn’t care. Back in October, I checked the Facebook profiles of the Facebook executive team, and found their privacy settings to be quite high.

Well, that’s changed. His profile is now on uber-public settings. I can see his wall, his photo albums, and his events calendar. Zuckerberg recently became a fan of Taylor Swift, uploaded graphic photos of “The Great Goat Roast of 2009” three months ago, and plans to attend the Facebook holiday party on Friday night. I can even tell you where it’s going to be held.

You can check out his profile here.

I think it’s obvious that Zuckerberg did NOT intend for all his photos to get out there. He’s kept his profile public (possibly to save face, though in the before/after pics, his wall reads like a list of press releases), but his photo albums are now hidden:

Before Kashmir Hill’s article:

Mark-Zuckerbergs-profile-privacy-settings-low

After:

Mark Zuckerberg's profile, now with more privacy!

Also related: Reuters financial blogger Felix Salmon, and many others, had his friend list scraped and posted by a rival financial-laws activist site.

NPR: Charles Babbage’s ‘Difference Engine’, a computer designed in the steam age

“I wish to God these calculations had been executed by steam!”

Charles Babbage

– Charles Babbage, the 19th-century man who, with better luck and political skill, could’ve brought the information age to the Victorians.

From this fascinating NPR story (“A 19th-Century Mathematician Finally Proves Himself”):

Charles Babbage, the man whom many consider to be the father of modern computing, never got to complete any of his life’s work. The Victorian gentleman was a brilliant mathematician, but he wasn’t very good at politics and fundraising, so he never got the financial backing to finish any of his elaborate machine designs. For decades, even his fans weren’t certain whether his computing machines would have worked.

But Doron Swade, a former curator at the Science Museum in London, has proven that Babbage wasn’t just an eccentric dreamer. Using nothing but materials that would have been available to Babbage in the 1840s, Swade and a group of engineers successfully built Babbage’s Difference Engine — and a version is now on display at the Computer History Museum in Mountain View, Calif.

The Difference Engine fills half a gallery and stands taller than most men. It’s 5 tons of cast iron, steel and bronze woven together from 8,000 distinct parts. Though it looks like it could be a sculpture, the machine is essentially a giant calculator. Tim Robinson, a docent at the museum, says it’s “the first automatic calculating machine.”

This engine — made from 162-year-old designs — doesn’t have a power pack; it has a hand crank. Robinson works up a sweat as he turns it. “As long as you keep turning that crank, it will produce entirely new results,” he says.

First shots with Canon S90

First shots with Canon S90, originally uploaded by zokuga.

So I caved in and bought the Canon S90. I’ve always wanted to get one of the Powershot G-series but held back because of the price and the not-quite-compact size. The S90 was just small enough, and Ken Rockwell gave it such a rave review (“World’s Best Pocket Camera”) that I thought, what the hell.

So far, I haven’t regretted it. The shutter lag and the lack of a viewfinder and selective focus are worth putting up with for a camera that lets me take decent-quality images on a whim. I’m getting self-conscious about carrying the 5D2 around town. I’m still not a very deliberate photographer, so taking snapshots on the street with a $4,000 camera kit just feels ridiculous. Being able to take a random shot on the street or in the bar and not feel like a pretentious doof is pretty liberating. None of these shots are particularly interesting, but it’s a good test run.

First shots with Canon S90

Bad Nurses, and Our Tragic Inability to Track Them

Get rich in the temp nursing business

On Sunday, my ProPublica colleagues Tracy Weber and Charles Ornstein, in conjunction with the Los Angeles Times, put out a story examining the lack of standards in the temp nursing agency industry, a dangerous situation considering California’s desperate shortage of nursing staff.

Emboldened by a chronic nursing shortage and scant regulation, the firms vie for their share of a free-wheeling, $4-billion industry. Some have become havens for nurses who hopscotch from place to place to avoid the consequences of their misconduct. (see related story: A ‘Crazy’ Way for an Industry to Operate)

A joint investigation with the Los Angeles Times found dozens of instances in which staffing agencies skimped on background checks or ignored warnings from hospitals about sub-par nurses on their payrolls. Some hired nurses sight unseen, without even conducting an interview.

The gist of the problem: California lacks virtually any kind of tracking of errant temp nurses. This nurse, for example, was accused of stealing drugs from at least six hospitals, suffered a drug-induced seizure on the job, and had his Minnesota nursing license suspended before California got around to filing an accusation against him. Two years later, after a few more reported incidents of drug theft, the California registered nursing board finally revoked his license when he didn’t make his hearing on time.

Charlie and Tracy have been covering this story since before they joined ProPublica; LATimers Maloy Moore and Doug Smith contributed a massive amount of the essential research and data analysis. This temp nurses chapter is just another consequence of what appears to be awful record-keeping and sloth by the various oversight bodies.

My own contribution to the coverage was small. The most notable part was this Ruby on Rails site I built to catalogue the sanctioned nurses, a relatively minor task compared to actually collecting and parsing the data (i.e. reading through all the PDF files for the buried information). The site is pretty simple, allowing users to see at a glance the numbers of disciplined nurses by various categories, including year and type of discipline. I was a little skeptical of doing it at first, just because the CA nursing board does have a searchable and functional database of its own.

Theoretically (well, if it weren’t the case that the records themselves are often incomplete, so that criminal nurses come up with a clean sheet), any member of the public could look up their own nurses’ records and avoid the bad ones. But the meat of Charlie and Tracy’s story is the numbers: 1,254 days on average to discipline a nurse (compared to 173 for Texas). 1,706 days before one nurse, who was kicked out of a drug-recovery program and considered a threat to public safety, had even an accusation filed against her. Our site makes it evident that hard numbers, not just heartbreaking anecdotes, argue against California’s regulatory status quo.

A screenshot from our sanctioned nurses database

The reporters on this story put in months of time manually tabulating the data to come up with the thrust of their stories. Sadly, all of these numbers and statistical conclusions were probably right under the nursing board’s nose. The regulators apparently track dates and types of accusations and disciplines for each nurse. A few simple database queries would’ve quickly uncovered the glaring delays and bottlenecks in the system (e.g. SELECT AVG(TO_DAYS(`date_discipline`) - TO_DAYS(`date_initial_complaint`)) AS average_delay FROM `disciplinary_actions`).
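For anyone without a database handy, the same average-delay arithmetic can be roughed out on the command line. This is only a sketch, and it uses awk rather than SQL. It assumes a hypothetical headerless CSV with one "date_initial_complaint,date_discipline" pair per line, both as YYYY-MM-DD, which is not how the board's or our data is actually stored. The function name average_delay is made up for the example.

```shell
#!/usr/bin/env bash
# Sketch: average days between complaint and discipline, computed from a CSV.
# Assumed (hypothetical) input: "date_initial_complaint,date_discipline" per
# line, both as YYYY-MM-DD, with no header row. Reads files or standard input.

average_delay() {
  awk -F, '
    # Julian day number for a Gregorian date, so subtraction gives whole days.
    function jdn(y, m, d,    a, yy, mm, j) {
      a  = int((14 - m) / 12)
      yy = y + 4800 - a
      mm = m + 12 * a - 3
      j  = d + int((153 * mm + 2) / 5) + 365 * yy
      j += int(yy / 4) - int(yy / 100) + int(yy / 400) - 32045
      return j
    }
    # Convert a YYYY-MM-DD string into its day number.
    function days(s,    p) {
      split(s, p, "-")
      return jdn(p[1] + 0, p[2] + 0, p[3] + 0)
    }
    NF == 2 { total += days($2) - days($1); n++ }
    END     { if (n) printf "%.1f\n", total / n }
  ' "$@"
}
```

Calling average_delay on a file of date pairs prints the mean delay in days to one decimal place. Like AVG over the TO_DAYS difference in the query above, it averages the number of days between the two dates on each row.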

A day after Charlie and Tracy’s initial story in July 2009, Gov. Schwarzenegger sacked a majority of the registered nursing board, and new regulations include making public the restrictions on a nurse’s license. Read ProPublica’s complete coverage on California’s flawed oversight of health-care workers here.

Max Baucus and Michael Steele: That ain’t legal either, dude.


NYT: Baucus Acknowledges Recommending a Woman He Was Dating

“Today’s report that Senator Max Baucus used his Senate office to advance a taxpayer-funded appointment for his staff-member girlfriend raises a whole host of ethical questions,” Mr. Steele said.

I agree that this doesn’t look like best behavior. But Dude, “girlfriend” is not the preferred nomenclature for a 53-year-old woman. “Lady-friend,” please.

FriendFeed, Deepest Sender, FriendFeed Activity Widget

Still configuring the blog… I opened a FriendFeed account so I could gather my various Flickr, Twitter, and Del.icio.us items into the sidebar, using the FriendFeed Activity Widget. I’m also now using Deepest Sender, a Firefox extension for posting to WordPress, so I don’t have to use the slow-as-molasses admin. I also switched to Thematic’s 3-column layout, with the middle column reserved for the FriendFeed detritus, to make my site look more active than it is.