beginners.rb |
|
---|---|
Coding for Absolute Beginners (presented at NICAR-2011 at 4pm in one of the rooms)
by Dan Nguyen @ ProPublica (twitter: @dancow)
Learn a little Ruby code, do a little web-scraping and make a webpage that’ll find us something to do in Raleigh (or wherever you’re at).
Trying to learn how to code, but think that it’s beyond your grasp? For the most part, what makes programming seem hard is that it requires typing, and even one small typo can cause confusion in the program. But you don’t have to do anything conceptually hard or think out of the box. The mechanical part is easy, so you can save your out-of-the-box thinking for your ideas.
This tutorial assumes you’re using Ruby 1.8+: http://rubyonrails.org/download
Pick up a free text-editor like SciTE for Windows or TextWrangler for Mac. Windows: http://bit.ly/SciTE Mac: http://bit.ly/textwrangler
The raw textfile from which this tutorial snippet was generated can be found here: http://gist.github.com/845131 In fact, you can just open that up, paste it into your text-editor, run it (see disclaimer first), and you will have a webpage.
Here is the final output (varies on what address you’re using). This is actually one of the most entertaining webpages I have ever put together, and all it required was the concepts covered here.
To execute code for this tutorial, all you need to do is open up your text-editor, type in some code, and: Windows, SciTE: hit F5. Mac, TextWrangler: go to the menubar, find the #!, select Run.
Disclaimer: I wrote this during breakfast. I may have made errors in terminology, theory, and taste, and I cannot guarantee executing the steps here will not explode your computer (maybe you actually should read the code), but the code worked for me, so there. |
|
Say Hello |
|
The traditional first line of code. You just did a lot there. |
puts "Hello world!" |
Take a look at how this is done in other languages: http://bit.ly/helloworldcode Languages like Ruby and Python make it easy to start getting things done. |
|
The Method
The first word in that code is puts. This is what’s called a method in Ruby (also referred to as a function in other languages). Its name is short for “put string,” and it works like “print”; in this case, it’s printing to your screen. |
|
The String
That “Hello world!” is two words, but in our code it’s basically one thing: a String. Strings are used to hold characters, whether it be a single word or an entire book. |
puts "Four score and seven years ago our fathers
brought forth on this continent, a new nation,
conceived in Liberty, and dedicated to the proposition
that all men are created equal." |
Quotation marks, single or double, signify the beginning and end of a string. |
puts 'Hello, "world"'
puts "'Allo, world" |
How would you use double-quotes inside double-quotes, for instance, when trying to describe someone speaking? Use the backslash \ character. |
puts "That fellow said, \"Hello World\"" |
You will use the backslash frequently, as it is how we signify in our code that the following character is special. In this case, we want the quotation mark to not be treated like a normal quotation mark, which would prematurely end our sentence. |
|
Variables
Think of these as pointers to, or labels for, actual values. |
hello_world = "Hello, World"
puts hello_world |
They don’t exist until you declare them. You can name them just about anything you want with alphanumeric characters, as long as they aren’t words already special to Ruby, like if or end. To be safe, let’s just use lowercase alphabetical characters and underscores. |
one_plus_one = 3
puts one_plus_one |
You can see that the equals sign, =, is used to assign values. And that’s about as much introduction as we need to actually do something. |
|
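What’s to do in this town?
Go to your browser and open this address: http://maps.googleapis.com/maps/api/geocode/xml?address=500+Fayetteville+Street,+Raleigh,+NC&sensor=false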
|
|
|
|
|
|
If you saw something like this, with or without tags…
|
|
…then you’ve just used Google’s Geocoding API. You can read the details here: http://code.google.com/apis/maps/documentation/geocoding/ |
|
The API
Let’s look at the structure of that url real quick:
This is the address of the site where it exists on the Internets: http://maps.googleapis.com/maps/api/geocode/xml
The question mark (?) signifies the end of the address and the beginning of the parameters that we use to tell Google what exactly we want.
address=500+Fayetteville+Street,+Raleigh,+NC is the first key-value pair of parameters. The key in this case is “address”. The value, i.e. the address we’re interested in, is “500+Fayetteville+Street,+Raleigh,+NC”. The plus-sign is just how we signify a space-character for the URL.
The other key-value pair here is sensor=false and is just a parameter that Google requires. But notice the ampersand (&) in front of it. That is what delimits the different key-value pairs in a request string. |
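To see those pieces from inside Ruby, here is a small sketch that takes the same URL apart with the standard uri library. The variable names (example_url, parsed, params) are made up for illustration, and nothing here talks to Google; it only splits the string.

```ruby
require 'uri'

# The request URL described above, stored as one String.
example_url = "http://maps.googleapis.com/maps/api/geocode/xml?address=500+Fayetteville+Street,+Raleigh,+NC&sensor=false"

parsed = URI.parse(example_url)

# Everything before the question mark: where the API lives.
puts parsed.host
puts parsed.path

# Everything after the question mark: the key-value pairs, still joined by "&".
puts parsed.query

# decode_www_form splits on "&" and "=", and turns each "+" back into a space.
params = URI.decode_www_form(parsed.query)
params.each do |key, value|
  puts "#{key} => #{value}"
end
```

The last loop should print the address with ordinary spaces in it, which is a quick way to check that the plus signs really were standing in for spaces.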
|
Back to Strings and Variables
OK, back to coding. Let’s put the address into a variable: |
my_address='500+Fayetteville+Street,+Raleigh,+NC' |
We’re going to use a new String trick to manage the url: |
the_url = "http://maps.googleapis.com/maps/api/geocode/xml?sensor=false&address="+my_address |
As you can guess, that plus sign joins the value of my_address onto the end of the string, and the combined string is what gets stored in the_url. Notice that the order of parameters for Google’s API is not important. But if you like order, you could also do this: |
the_url = "http://maps.googleapis.com/maps/api/geocode/xml?address="+my_address+"&sensor=false" |
It works, but this is not considered particularly elegant. For most situations, we’d rather do this: |
the_url = "http://maps.googleapis.com/maps/api/geocode/xml?address=#{my_address}&sensor=false" |
We’ll be using this trick later. Basically, what’s inside the #{} gets interpreted and output as a string; in this case, the value of my_address. If you didn’t have the pound-sign-and-curly-brackets around my_address, you’d be asking Google’s API for the geocoding information of “my_address,” which would just be silly. Mess around with puts to see what happens when you change the value of my_address and reassign the value to the_url, or when you remove the curly-brackets. Also, for this to work, the string must be in double-quotes. |
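Here is that experiment written out as a minimal sketch. The query strings are shortened to keep the lines readable, and the second address is just an arbitrary example picked for this illustration.

```ruby
my_address = '500+Fayetteville+Street,+Raleigh,+NC'

# Double quotes: Ruby evaluates what is inside the curly brackets and drops the result in.
with_double_quotes = "address=#{my_address}&sensor=false"

# Single quotes: the pound sign and brackets are kept as plain characters.
with_single_quotes = 'address=#{my_address}&sensor=false'

# No brackets at all: the literal word my_address is sent, not the variable's value.
without_brackets = "address=my_address&sensor=false"

puts with_double_quotes
puts with_single_quotes
puts without_brackets

# Changing my_address afterwards does not reach back into a string that was already built.
my_address = '1600+Pennsylvania+Ave,+Washington,+DC'
puts with_double_quotes

# To pick up the new value, the string has to be built again.
rebuilt = "address=#{my_address}&sensor=false"
puts rebuilt
```

The fourth puts is the one worth watching: it still shows the Raleigh address, which is why the_url has to be reassigned whenever my_address changes.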
|
The Call, in Code
OK, finally, we’re going to do something. After we do this: |
require 'open-uri' |
Why is programming “easy,” relatively speaking? Because other programmers have done the hard work of making libraries of code and methods that they’ve wrapped up in a single word. With require we’ve called up a library that makes retrieving web-files as easy as this: (you did set my_address and the_url, right?) |
google_result = open(the_url)
puts google_result.read |
You should see the XML you got when visiting the API through your browser. Ta-da, that’s kind of useful… A few notes before moving on: 1. The google_result variable is not in itself a string. If you try puts sans the read: |
puts google_result |
You get #<StringIO:0x10063ade8>. It’s not just a string, it’s something called a StringIO. For now, we just care about the contents, which can be read through a method called read. If you really care what makes up StringIO, require the ‘pp’ library and do: |
require 'pp'
pp google_result |
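If you want to poke at a StringIO without making a web request, the sketch below builds one by hand. The XML fragment is made up for illustration and is not a real Google response. One assumption worth flagging: this tutorial targets Ruby 1.8-era open-uri; on Ruby 3.0 and later, open-uri no longer adds URL support to the bare open method, and the equivalent call there is URI.open(the_url).

```ruby
require 'stringio'

# A stand-in for what open(the_url) hands back, built by hand so no network is needed.
fake_result = StringIO.new("<lat>35.7732548</lat><lng>-78.6399158</lng>")

puts fake_result.class          # StringIO, not String

first_read = fake_result.read   # the whole contents, as a String
puts first_read

# A StringIO remembers how far it has been read, like a bookmark in a file.
second_read = fake_result.read  # already at the end, so this is an empty String
puts second_read.empty?

# rewind moves the bookmark back to the start, so read works again.
fake_result.rewind
third_read = fake_result.read
puts third_read == first_read
```

This is why it is handy to store the result of read in its own variable right away, rather than calling read on the same object in several places.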
|
|
That Dot
One more very important thing: You probably noticed the dot we used in google_result.read. Basically, the dot calls the method read, which belongs to google_result (read is one of those things that StringIO has that String does not). google_result is referred to as the method’s receiver. That’s just some lingo. All methods in Ruby have a receiver. |
|
Sidenote: What is the receiver of open and puts? Something called the Kernel. If you wanted to be verbose, you could do: |
Kernel.puts "I AM THE RECEIVER, SAYS KERNEL" |
And when you require ‘open-uri’, you’re teaching the Kernel’s open method how to fetch URLs. Not doing that require and trying to call open on a URL would give you an error message. BTW, if you thought the code above was complicated, try looking at what’s behind open-uri and seeing what you got to skip past. Heck, look at the code behind puts. |
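You can ask Ruby itself about receivers. This is only a sketch for exploring; method, owner, and respond_to? are standard Ruby, while the sample strings and variable names are made up.

```ruby
require 'stringio'

# Ask Ruby where puts is defined. The answer is Kernel.
puts_owner = method(:puts).owner
puts puts_owner

# respond_to? asks a receiver whether it has a given method.
string_can_read   = "plain text".respond_to?(:read)
stringio_can_read = StringIO.new("plain text").respond_to?(:read)
puts string_can_read      # a String has no read method
puts stringio_can_read    # a StringIO does

# The dot sends the method to the receiver on its left; here the receiver is a String.
shouted = "hello world".upcase
puts shouted
```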
|
FourSquare API
Enough theory lecture, let’s make use of our data. Let’s store whatever you got for latitude and longitude
into variables. There are several lat/long pairs in the xml,
but any pair is fine. Ideally, you want the pair nested
in between the |
my_lat = 35.7732548
my_lng = -78.6399158 |
Let’s take a look at the (old) Foursquare API.
This is the API call for nearby tips. You can see that tips is part of the address and filter specifies the…filter…i.e. we want tips near us. Now we just have to give it the lat/long coordinates to search nearby. Let’s use that string trick with the curly brackets: |
fsquare_url="http://api.foursquare.com/v1/tips?geolat=#{my_lat}&geolong=#{my_lng}&filter=nearby" |
|
|
Calling FourSquare
Let’s make the call: |
fsquare_result = open(fsquare_url)
fsquare_xml_string = fsquare_result.read |
If you were successful, and depending on what coordinates you used, you should’ve gotten back a big chunk of XML relating to tips. Remember that you have to call fsquare_result’s read method to actually refer to the XML string. |
|
Method-makin’
OK, so we haven’t really done anything useful yet, anything beyond what you could’ve done by hitting up these API urls in your browser yourself. But now we’re going to see how a little code can save us a load of time. With a few lines of code, we will build our own webpage from this FourSquare XML.
First, we’re going to make our own method. This is actually something I’d rather not have to do now, but since I don’t know what programming environment you are in, I can’t assume that you can install new libraries (if you’re on your own computer, download the Nokogiri Rubygem now; it will stay with you for the rest of your web-scraping/document-processing career). Without further ado: |
class String
  # Returns an Array holding the text found between each <tag>...</tag> pair.
  def xml_tag_getter(some_tag_word)
    self.scan(/<#{some_tag_word}\>(.+?)<\/#{some_tag_word}>/).map{|p| p[0] if p}
  end
end |
There’s a lot to explain here, including the object-oriented nature that underlies all of Ruby and what “object” and “oriented” mean. It’s beyond the scope of this lesson. But I’ll just give you the upshot: I’m giving strings a new method called xml_tag_getter, and it wants a string (the name of the tag to look for). If you don’t understand the self.scan… line, don’t worry about it now. It’s not code worth revisiting, but it will save us some time in this specific use case; when you get home, install Nokogiri. If everything in this method is totally incomprehensible to you, it’s probably because you don’t know regular expressions. You could quit this lesson now, learn them at regular-expressions.info, and be much more empowered. |
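For the curious, here is the self.scan line taken apart step by step on a tiny hand-written sample. The sample XML and the variable names are made up for illustration; the pattern is the same idea as in xml_tag_getter, written without the method around it.

```ruby
# A made-up XML string with two tip elements.
sample_xml = "<tips><tip><text>Try the BBQ</text></tip><tip><text>Get coffee</text></tip></tips>"

tag = "tip"

# The pattern reads: an opening tag, then as few characters as possible, then the closing tag.
# The parentheses mark the part we want to keep.
pattern = /<#{tag}>(.+?)<\/#{tag}>/

# scan finds every match and returns an Array of Arrays, one inner Array per match.
matches = sample_xml.scan(pattern)
puts matches.inspect

# map pulls the first (and only) captured piece out of each inner Array.
tips = matches.map { |captures| captures[0] }
puts tips.inspect

# One limitation: the dot does not match a line break unless the pattern ends with m.
multi_line = "<tip>line one\nline two</tip>"
without_m = multi_line.scan(/<tip>(.+?)<\/tip>/).length
with_m    = multi_line.scan(/<tip>(.+?)<\/tip>/m).length
puts without_m
puts with_m
```

That last limitation is one reason a real XML parser such as Nokogiri is the better long-term tool; the regular expression is a shortcut that happens to be good enough for this exercise.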
|
Arrays
Let’s put xml_tag_getter to use. If you’ve skimmed over the contents we
put into fsquare_xml_string, you might have noticed that there were many repeating elements, in particular the <tip> elements. |
|
Using the xml_tag_getter method, we can pull these elements out: |
fsquare_array = fsquare_xml_string.xml_tag_getter('tip') |
The result of this is a data structure we haven’t used yet, the Array. Think of an array as a list of other data-structures; in this case, it’s a list of separate strings. We can access each individual string with the square bracket notation: |
puts fsquare_array[0]
puts fsquare_array[1] |
This should output the first and second elements of the array, with each element being a String. In our case, each string contains what was between each pair of <tip> tags. |
|
To find out the length of the array, you simply call its length method: |
puts "This array has #{fsquare_array.length} elements" |
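If you don’t have Foursquare data handy, a hand-made array works just as well for practice. The three tips below are invented stand-ins for what fsquare_array would hold.

```ruby
# A hand-made array of three made-up tips.
sample_tips = ["Try the BBQ", "Get coffee", "Walk the greenway"]

first_tip = sample_tips[0]     # counting starts at zero
third_tip = sample_tips[2]
last_tip  = sample_tips[-1]    # negative numbers count back from the end
missing   = sample_tips[3]     # past the end gives nil rather than an error

puts first_tip
puts third_tip
puts last_tip
puts missing.inspect
puts "This array has #{sample_tips.length} elements"
```

The zero-based counting is the part that trips people up most often: an array with three elements has positions 0, 1 and 2.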
|
|
The Loop
So why are we using an Array? We knew that fsquare_array had multiple |
|
There are many ways to get through an array. You could find its length, and then repeat whatever operation you want to do on each of its elements that many times. Obviously, there’s a better way to do that, the each method: |
fsquare_array.each do |tip_xml|
  puts "This is a tip:\n#{tip_xml}\n\n\n\n"
end |
This should’ve printed out each tip; the backslash-n is a special character for printing a new line. A few things to mind here. each travels through the array and does whatever you specify in between the do and the end keywords. The word in between the pipe-characters is the variable name used to refer to the array element that the each method is currently on. |
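To make the comparison concrete, here is the count-it-yourself approach next to each, using the same kind of made-up sample array. each_with_index is a standard relative of each that also hands you the position.

```ruby
sample_tips = ["Try the BBQ", "Get coffee", "Walk the greenway"]

# The long way: keep a counter by hand and use the square brackets.
index = 0
while index < sample_tips.length
  puts "Tip #{index + 1}: #{sample_tips[index]}"
  index = index + 1
end

# The each way: no counter to manage, no length to look up.
sample_tips.each do |tip|
  puts "Tip: #{tip}"
end

# each_with_index supplies the position as a second variable between the pipes.
numbered = []
sample_tips.each_with_index do |tip, position|
  numbered << "#{position + 1}. #{tip}"
end
puts numbered
```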
|
Making our own webpage
Hopefully you can tell that we’re finally close to doing something useful with code. The array and its each method give us a way to process an arbitrarily-sized chunk of data without having to repeat ourselves. How many tips did Foursquare give us? Let the computer worry about it.
So far, the xml string (remember that it is still a String; if you had Nokogiri, you could put it in an XML object and make life easier) hasn’t been useful to us. So let’s use the emerging technology called Hypertext Markup Language to make something pretty with it.
We’re going to reuse the xml_tag_getter method. If it worked on the big fsquare_xml_string string, it ought to work on all the separate <tip> strings. If you read the output, you probably saw that each tip has tags for not just the content of the tip, but the name and photo of each tip’s author. So what we’ll do is print out an easy-to-read webpage from this XML. If you know HTML, great. If not, it’s OK, we’re just making more strings. |
|
The File
One more Ruby data structure to learn: the File. You’re probably sick of reading as puts puts it. And it’s pretty ephemeral: as soon as you close your editor or its output screen, the output disappears. So let’s write to an actual textfile with the File: |
outputfile = File.open("fsquare_output.html", 'w') |
What we’ve done here is open a new file, with a name that we specified in the first string. The second string, ‘w’, tells the File.open method that we’re writing to the new file. Two things to note:
1. File has its own open method, different from the one ‘open-uri’ bestowed upon Kernel.
2. The above filename will dump the file wherever you ran your script. To find it more easily, you might want to use something like File.expand_path("~/Desktop/foursquare_output.html") as the name; the expand_path call is needed because File.open does not expand the ~ by itself. |
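Here is a small, self-cleaning sketch of writing and reading a file. It uses the block form of File.open, which closes the file for you, and a temporary directory from the standard tmpdir library so nothing is left behind. The file name and its contents are made up for illustration.

```ruby
require 'tmpdir'

contents_read_back = nil

# Dir.mktmpdir creates a throwaway directory and removes it when the block ends.
Dir.mktmpdir do |dir|
  path = File.join(dir, "sample_output.html")

  # Block form: the file is closed automatically at the matching end.
  File.open(path, 'w') do |file|
    file.puts "<p>Hello from a file</p>"
  end

  # 'w' starts from an empty file each time; 'a' appends to what is already there.
  File.open(path, 'a') do |file|
    file.puts "<p>A second line</p>"
  end

  # File.read returns the whole file as one String.
  contents_read_back = File.read(path)
end

puts contents_read_back
```

The tutorial code opens the file without a block and calls close at the very end instead; both styles work, the block form just makes it harder to forget the close.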
|
The webpage-making code |
|
Here it is, everything we’ve learned, all put together to make a webpage. It assumes that the XML contents are still in fsquare_xml_string, that my_address still has the address you looked up, and that xml_tag_getter has been defined. |
outputfile = File.open("fsquare_output.html", 'w') |
Writing the opening html for our file. Hey, File has its own puts method, too! This one prints to the file. |
outputfile.puts "<html><body align=\"center\"><title>NICAR: My Very First Webpage</title><table align=\"left\" width=\"800\" cellspacing=\"10\" cellpadding=\"15\">"
outputfile.puts "<h1 style=\"color:#00aaff\"><blink>My Very First Webpage-from-an-API-Reading-Ruby-script</blink></h1>"
outputfile.puts "<h2>Things to do near #{my_address} according to Foursquare</h2>" |
Now we do the loop…Note that I skip defining fsquare_array on its own line: We know that xml_tag_getter gives us an Array, and we can jump right into invoking its each method |
fsquare_xml_string.xml_tag_getter('tip').each do |tip_xml| |
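This line creates an HTML table row and the first cell, the tipper’s name. Because xml_tag_getter returns an Array, we call join to turn the result back into a plain String; on Ruby 1.9 and later, an Array dropped straight into #{} would show up with its brackets and quotation marks: |
  outputfile.puts "<tr><td>#{tip_xml.xml_tag_getter("firstname").join} #{tip_xml.xml_tag_getter("lastname").join}"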
|
We took that curly-bracket notation to the extreme here (not really). We’re not only calling a method inside the string, we’re using a String inside the string. Whoa. Now we’re printing out the tipper’s image and tip; each join turns the Array from xml_tag_getter into a plain String: |
  outputfile.puts "<br><img src=\"#{tip_xml.xml_tag_getter("photo").join}\" ></td>"
  outputfile.puts "<td>
<strong>#{tip_xml.xml_tag_getter("fullpathname").join} </strong> #{tip_xml.xml_tag_getter("name").join}<br>
<strong>Distance:</strong> #{tip_xml.xml_tag_getter("distance").join}<br>#{tip_xml.xml_tag_getter("address").join}
<p>#{tip_xml.xml_tag_getter("text").join}</p></td><td>" |
As a bonus, we’ll hit up the Google Static Maps API. |
|
And hell, we’ll throw in a bonus fundamental Ruby concept, the if statement. Basically, if something is true, do it. If not, don’t. |
  if !my_lat.nil? && !my_lng.nil? && !tip_xml.xml_tag_getter("geolong").empty?
|
I don’t know for sure whether you still have these variables defined. If not, then the following code never executes. nil means nothing. We’re asking “do my_lat AND my_lng both hold a value? AND does the geolong tag in the XML contain something?” The exclamation mark flips the true/false condition we’re testing, so !my_lat.nil? reads as “my_lat is not nil.” So if they ARE existent and non-empty, this statement executes. Confusing? That’s OK, this was just a bonus. |
    outputfile.puts "<img src=\"http://maps.google.com/maps/api/staticmap?size=250x150&sensor=false&markers=color:red%7Clabel:A%7C#{my_lat},#{my_lng}&markers=color:blue%7Clabel:B%7C#{tip_xml.xml_tag_getter("geolat").join},#{tip_xml.xml_tag_getter("geolong").join}\" />"
|
This is the end of the if statement |
  end
  outputfile.puts "</td></tr> <!-- the end of this table row -->" |
And we’re done with this xml node; back to the top of the loop until it’s done. |
end |
Print out the closing tags |
outputfile.puts "</table></body></html>" |
And one more method name to learn: close. We’re closing the file and we’re done. |
outputfile.close |
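If the if statement inside that loop went by too fast, here is the same decision in miniature. This is a sketch under stated assumptions: tag_contents is a hypothetical helper that does the job of xml_tag_getter as a plain method, the sample XML is made up, and the rows are collected in an array instead of being written to a file.

```ruby
# Hypothetical helper: same idea as xml_tag_getter, without reopening String.
def tag_contents(xml, tag)
  xml.scan(/<#{tag}>(.+?)<\/#{tag}>/m).map { |captures| captures[0] }
end

# Two made-up tips: the first has coordinates, the second does not.
sample_xml = "<tips>" \
  "<tip><text>Try the BBQ</text><geolat>35.77</geolat><geolong>-78.64</geolong></tip>" \
  "<tip><text>Get coffee</text></tip>" \
  "</tips>"

my_lat = 35.7732548
my_lng = -78.6399158

rows = []
tag_contents(sample_xml, "tip").each do |tip_xml|
  row = "<tr><td>#{tag_contents(tip_xml, 'text').join}</td><td>"

  # All three checks must be true for the first branch to run.
  if !my_lat.nil? && !my_lng.nil? && !tag_contents(tip_xml, "geolong").empty?
    row << "map from #{my_lat},#{my_lng} to #{tag_contents(tip_xml, 'geolat').join},#{tag_contents(tip_xml, 'geolong').join}"
  else
    # else is the companion to if: it runs when the condition is false.
    row << "no map"
  end

  row << "</td></tr>"
  rows << row
end

puts rows
```

Only the first row gets a map description, because the second tip has no geolong tag and the empty? check fails for it.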
That’s the end of these notes. Learn these things, and maybe more, at my 4PM NICAR session on Saturday. If you thought you learned a lot now, wait till you experience it through Powerpoint slides.
Special thanks to Jeremy Ashkenas of DocumentCloud, who wrote the quick-and-easy documentation-generator that made this HTML version, rtomayko for porting it to Ruby, and to IRE, NICAR, ProPublica, and the many others making it possible to improve journalism. |
|