Scraping Makes You Feel Like a Small God
Housekeeping
Questions?
Scraping
What does it mean to think like a computer? What are alternatives to scraping?
Ruby
What is it? Do you have it? Check by typing “ruby -v” at your terminal (plain “ruby” just sits and waits for input). If you get something like “command not found”, download Ruby. Windows friends
We also need a library called Watir. If you’re lucky, typing “gem install watir” at your terminal will install it.
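If you are not sure whether the install worked, one way to check from Ruby itself is to ask RubyGems. This is a minimal sketch; the helper name gem_installed? is made up for this example.

```ruby
# Returns true if a gem with the given name is installed for this Ruby.
# gem_installed? is a hypothetical helper name, not part of Watir or RubyGems.
def gem_installed?(name)
  Gem::Specification.find_all_by_name(name).any?
end

if gem_installed?('watir')
  puts 'watir is installed'
else
  puts 'watir is missing; try "gem install watir"'
end
```

You can paste this into irb or save it in a file and run it with ruby.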
Watir
Watir is at its best when web pages are really complicated to scrape. But it works well for easier stuff too. Let’s follow their intro example.
We can use Ruby interactively (a line at a time) with the command ‘irb’, or run an entire script at once with ‘ruby whatever.rb’ (where whatever.rb contains commands like the ones below).
require 'watir'
browser = Watir::Browser.new :chrome
browser.goto 'http://bit.ly/watir-example'
browser.text_field(:name => 'entry.1000000').set 'Amanda'
browser.text_field(:name => 'entry.1000001').set "I come here from Australia. \n The weather is great here."
browser.radio(:value => 'Watir').set
browser.checkbox(:value => 'Ruby').set
browser.checkbox(:value => 'Python').set
browser.checkbox(:value => 'Python').clear
browser.select_list(:name => 'entry.1000004').select 'Chrome'
# your exercise: figure out how to click a button about how happy you are.
browser.button(:name => 'submit').click
puts browser.text.include? 'Your response has been recorded.'
puts browser.title == 'Thanks!'
browser.close
Let’s try something you might really want to do.
We’ve used Watir to scrape power outage maps, fill out this form 100,000 times, get lists of donors in Indian elections, and grab thousands of art auction results in China.
WNYC collects NYC school attendance data every day. Having it lets them respond to news, and it also feeds some enterprise reporting.
require 'watir'
browser = Watir::Browser.new :chrome
browser.goto 'http://schools.nyc.gov/AboutUs/schools/data/Attendance.htm'
puts browser.span(:id => 'doecontrol_middlecentercontainer_a_attendance_lblPcnt').text
browser.link(:text => 'Microsoft Excel').click
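Collecting a number every day only pays off if you save it somewhere. Here is a rough sketch of one way to do that: append the date and the scraped text to a CSV file. The function name log_attendance, the file name, and the sample values are assumptions for illustration, not WNYC’s actual setup.

```ruby
require 'csv'
require 'date'

# Append one row (date, value) to a CSV file, creating the file if needed.
# log_attendance, path and percent_text are hypothetical names for this sketch.
def log_attendance(path, percent_text, date = Date.today)
  CSV.open(path, 'a') do |csv|
    csv << [date.iso8601, percent_text.strip]
  end
end

# In the real script, the second argument would be the span text scraped above,
# for example: log_attendance('attendance.csv', percent)
```

Run once a day, this builds up a file with one line per day that opens directly in Excel.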
Could Watir have helped us a couple of weeks ago?
Maybe, but this is probably overkill.
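# http://stackoverflow.com/questions/1259009/watir-image-processing
# the old watir-webdriver gem was merged into watir, so require 'watir'
require 'watir'
require 'open-uri'
b = Watir::Browser.new :chrome
b.goto "http://projects.propublica.org/drug-labels/"
sleep 3  # give the page time to load its images
images = b.images(:src => /projectx\/labels/)
images.each do |img|
  url = img.src
  # keep only the part of the URL after the last slash
  name = url.gsub(/.*\//, '')
  File.open(name, 'wb') do |f|
    # URI.open comes from open-uri; a bare open(url) stopped fetching URLs in Ruby 3.0
    f.write URI.open(url).read
  end
end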
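The script above names each file by stripping everything up to the last slash in the image URL. A small sketch of what that does, next to a path-based alternative that also drops any query string; the URL and the helper name filename_from_url are made up for illustration.

```ruby
require 'uri'

# Derive a local file name from an image URL by taking the last path segment.
# filename_from_url is a hypothetical helper name for this sketch.
def filename_from_url(url)
  File.basename(URI.parse(url).path)
end

url = 'http://example.com/projectx/labels/sample.jpg?size=large' # made-up URL
puts url.gsub(/.*\//, '')    # the regex approach keeps "?size=large" in the name
puts filename_from_url(url)  # the path-based approach drops it
```

For URLs without a query string the two approaches give the same name, so the regex is fine for a quick job.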
What about something one of you really does want to do?
Part of Jacqui’s project requires knowing where the top-rated beers on BeerAdvocate.com are made.
require 'watir'
browser = Watir::Browser.new :chrome
browser.goto 'http://www.beeradvocate.com/lists/top/'
table = browser.table(:index => 1)
links = table.links()
urls = links.map(&:href)
stop = urls.length
# visit every third link; the exclusive range (...) stops before urls[stop], which does not exist
(0...stop).step(3) do |i|
  begin
    browser.goto urls[i]
    sleep 1
    name = browser.h1.text
    place = browser.link(:href => /place/).text
    style = browser.link(:href => /style/).text
    puts name + "," + place + "," + style
  rescue
    puts "uhoh " + urls[i]
  end
end
browser.close
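One catch with gluing fields together with commas: if a name or place contains a comma itself, the columns shift when you open the output in Excel. A possible fix, sketched here with Ruby’s csv library, is to let it do the quoting. The helper name beer_row and the sample values are made up.

```ruby
require 'csv'

# Build one CSV line from the scraped fields, quoting any field that contains a comma.
# beer_row is a hypothetical helper name for this sketch.
def beer_row(name, place, style)
  [name, place, style].to_csv
end

# Made-up values for illustration only.
puts beer_row('Example Stout', 'Example Brewing Co., Inc.', 'Stout')
```

In the loop above, this would stand in for the line that builds the string with + and ",".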
Use our new knowledge.
Answer Jacqui’s question (you will probably need your Excel skills). A hint: “Inc.” is not a state.
Time permitting
Gather Bro and community ratings for the top-rated beers, and come up with a lede based on this information.