NYU — Spring 2015 / Week 6

Scraping Makes You Feel Like a Small God

Housekeeping

Questions?

Scraping

What does it mean to think like a computer? What are alternatives to scraping?

Ruby

What is it? Do you have it? Check by typing “ruby -v” at your terminal. If it prints a version number, you have it. If you get something like “command not found”, download Ruby. Windows friends

We also need a library called watir. If you’re lucky, typing “gem install watir” at your terminal will install it.

Watir

Watir is at its best when web pages are really complicated to scrape. But it works well for easier stuff too. Let’s follow their intro example.

We can use Ruby interactively (a line at a time) with the command ‘irb’, or run an entire script at once with ‘ruby whatever.rb’ (where whatever.rb contains commands like the ones below).

require 'watir' 

browser = Watir::Browser.new :chrome
browser.goto 'http://bit.ly/watir-example'

browser.text_field(:name => 'entry.1000000').set 'Amanda'

browser.text_field(:name => 'entry.1000001').set "I come here from Australia. \n The weather is great here."
browser.radio(:value => 'Watir').set

browser.checkbox(:value => 'Ruby').set
browser.checkbox(:value => 'Python').set
browser.checkbox(:value => 'Python').clear

browser.select_list(:name => 'entry.1000004').select 'Chrome'

# your exercise: figure out how to click a button about how happy you are.

browser.button(:name => 'submit').click

puts browser.text.include? 'Your response has been recorded.'
puts browser.title == 'Thanks!'

browser.close

Let’s try something you might really want to do.

We’ve used watir to scrape power outage maps, fill out this form 100,000 times, get lists of donors in Indian elections, and grab thousands of art auction results in China.

WNYC collects NYC school attendance data every day. Having it lets them respond to news, and it also feeds some enterprise reporting.

require 'watir' 

browser = Watir::Browser.new :chrome

browser.goto 'http://schools.nyc.gov/AboutUs/schools/data/Attendance.htm'

puts browser.span(:id => 'doecontrol_middlecentercontainer_a_attendance_lblPcnt').text

# click returns nothing worth printing, so no puts here
browser.link(:text => 'Microsoft Excel').click
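
To keep a daily record, as WNYC does, the scraped text has to become a number at some point. A minimal sketch, assuming the span's text contains a figure such as "91.2%"; the helper name parse_percent and the sample strings are hypothetical, and the real page's wording may differ.

```ruby
# Pull the first number out of a string such as "91.2%" and return it
# as a Float, or nil if the string contains no number.
def parse_percent(text)
  match = text[/\d+(?:\.\d+)?/]
  match && match.to_f
end

puts parse_percent('91.2%')
puts parse_percent('Attendance: 89 %')
p parse_percent('n/a')
```

In the script above, this would wrap the text returned by the browser.span(...) call.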

Could watir have helped us a couple of weeks ago?

Maybe, but this is probably overkill.

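# http://stackoverflow.com/questions/1259009/watir-image-processing
require 'watir'
require 'open-uri'

b = Watir::Browser.new :chrome

b.goto "http://projects.propublica.org/drug-labels/"
sleep 3 # give the page a moment to load its images

# Collect every image whose src matches the label path
images = b.images(:src => /projectx\/labels/)

images.each do |img|
  url = img.src
  # Strip everything up to the last slash, leaving just the file name
  name = url.gsub(/.*\//, '')

  # Write in binary mode. URI.open replaces the bare open(url),
  # which no longer fetches URLs as of Ruby 3.0.
  File.open(name, 'wb') do |f|
    f.write URI.open(url).read
  end
end

b.close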
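
The gsub in the loop above works because .* is greedy: it swallows everything up to the last slash, leaving the file name. A sketch of an alternative using Ruby's standard uri library, which also drops any query string; the helper name filename_from_url and the example.com URL are made up for illustration.

```ruby
require 'uri'

# Return just the file name portion of a URL's path.
def filename_from_url(url)
  File.basename(URI.parse(url).path)
end

url = 'http://example.com/projectx/labels/abc.jpg?size=large'
puts url.gsub(/.*\//, '')     # the regex approach keeps the query string
puts filename_from_url(url)   # the URI approach returns the file name only
```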

What about something one of you really does want to do?

Part of Jacqui’s project requires knowing where the top-rated beers on BeerAdvocate.com are made.

require 'watir' 

browser = Watir::Browser.new :chrome
browser.goto 'http://www.beeradvocate.com/lists/top/'

table = browser.table(:index => 1)
links = table.links()
urls = links.map(&:href)
stop = urls.length

# Three dots: stop before urls.length, since urls[stop] would be nil
(0...stop).step(3) do |i|

  begin
      browser.goto urls[i]
      sleep 1
      name = browser.h1.text
      place = browser.link(:href => /place/).text
      style = browser.link(:href => /style/).text
      puts name + "," + place + "," + style
  rescue 
     puts "uhoh " + urls[i]
  end

end

browser.close
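
The loop above visits every third link. A small sketch of that indexing in plain Ruby, assuming (this is a guess about the page, not something stated here) that each table row contributes three links and only the first is the beer's own page. The URLs are placeholders. The three-dot range stops before the list's length; a two-dot range would also yield the index one past the end, where the element is nil.

```ruby
# Placeholder URLs standing in for the hrefs collected from the table.
urls = %w[
  http://example.com/beer/1 http://example.com/brewery/1 http://example.com/style/1
  http://example.com/beer/2 http://example.com/brewery/2 http://example.com/style/2
]

# Pick elements 0, 3, 6, ... without running past the end of the list.
def every_third(list)
  (0...list.length).step(3).map { |i| list[i] }
end

puts every_third(urls)

# each_slice expresses the same idea without index arithmetic.
puts urls.each_slice(3).map(&:first) == every_third(urls)
```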

Use our new knowledge.

Answer Jacqui’s question (you will probably need your Excel skills). A hint: “Inc.” is not a state.
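
One plausible reading of the hint: the script above joins its fields with bare commas, so a name that itself contains a comma (say, one ending in ", Inc.") spills into the next column when the output is opened in Excel. A minimal sketch of one way around that, using Ruby's standard csv library to quote such fields. The sample row is invented for illustration and is not real BeerAdvocate data.

```ruby
require 'csv'

# Hypothetical row: the comma inside the name is the troublemaker.
name  = 'Example Stout (Example Brewery, Inc.)'
place = 'Vermont'
style = 'Stout'

naive  = name + "," + place + "," + style
quoted = CSV.generate_line([name, place, style])

puts naive.split(",").length        # counts the pieces a plain comma split produces
puts CSV.parse_line(quoted).length  # counts the fields a CSV reader recovers
puts quoted
```

The naive line splits into more pieces than there are fields, while the quoted line round-trips to exactly three.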

Time permitting

Gather Bro and community ratings for the top-rated beers, and come up with a lede based on this information.