Saturday, May 27, 2017

Web scraping with R

Since this is a football blog written by a self-professed stats nerd, I have to write a little about the progress I'm making with my efforts at web scraping. Given my limited ability to program with HTML and something called "CSS," it has been a bit of an uphill battle... not unlike my social life for the last 31 years... but I digress.

This is code that I found online:

#Loading the rvest package
library('rvest')

#Specifying the URL of the desired website to be scraped
url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'

#Reading the HTML code from the website
webpage <- read_html(url)

#Using CSS selectors to scrape the rankings section
rank_data_html <- html_nodes(webpage,'.text-primary')

#Converting the ranking data to text
rank_data <- html_text(rank_data_html)

#Let's have a look at the rankings
head(rank_data)
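
One small gotcha I ran into: the rankings come back as character strings like "1.", "2.", and so on, so if you want to do anything numeric with them you have to clean them up first. Here's a minimal sketch, assuming the '.text-primary' nodes really do return text in that form:

#The rankings come back as character strings ("1.", "2.", ...),
#so strip the period and convert to numeric
rank_data <- as.numeric(gsub("\\.", "", rank_data))

#Sanity check: should now be the numbers 1 through 100
head(rank_data)
length(rank_data)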
 

The MOST important part of all this code is figuring out the right CSS selector. Some folks have developed browser extensions for Chrome or Firefox to help with this. I went about it in a couple of different ways:
1. In Mozilla Firefox, hit Ctrl + Shift + I to open the developer tools.
2. Scroll through the HTML, hover over the code, and see what each element points at on the page.
3. Figure out the CSS selector and plug it into the html_nodes() call.

The slightly easier way is to...
1. Right-click on something on the webpage and click "Inspect Element."
2. This should bring you to the inspector at the exact line you want.
3. Right-click on that line of code.
4. Go Copy --> CSS Selector, then plug the result straight into html_nodes() (see the sketch below).
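
For example, if you right-click one of the movie titles on that same IMDB page and copy its selector, you get something you can drop right into html_nodes(). Here's a rough sketch; the '.lister-item-header a' selector is my assumption about IMDB's markup, so substitute whatever your inspector actually copies for you:

#Using a copied CSS selector to scrape the movie titles
#'.lister-item-header a' is a placeholder selector -- swap in
#whatever your browser's inspector gave you
title_data_html <- html_nodes(webpage, '.lister-item-header a')

#Converting the title data to text, same as before
title_data <- html_text(title_data_html)

#Let's have a look at the titles
head(title_data)

#Bundling the scraped vectors into a data frame
#(this assumes both vectors came back the same length, 100 here)
movies <- data.frame(rank = rank_data, title = title_data)
head(movies)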

Your web scraping pain is just a tiny bit less.

You're welcome.
