webscraping assignment

# install.packages('rvest')

library('rvest')

## Loading required package: xml2

url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'

webpage <- read_html(url)

#Using CSS selectors to scrape the rankings section
rank_data_html <- html_nodes(webpage,'.text-primary')

#Converting the ranking data to text
rank_data <- html_text(rank_data_html)

#Let's have a look at the rankings
head(rank_data)

## [1] "1." "2." "3." "4." "5." "6."

#Data-Preprocessing: Converting rankings to numerical
rank_data<-as.numeric(rank_data)

#Let's have another look at the rankings
head(rank_data)

## [1] 1 2 3 4 5 6

#Using CSS selectors to scrape the title section
title_data_html <- html_nodes(webpage,'.lister-item-header a')

#Converting the title data to text
title_data <- html_text(title_data_html)

#Let's have a look at the title
head(title_data)

## [1] "Suicide Squad"  "Moonlight"      "Rogue One"      "The Handmaiden"
## [5] "Split"          "La La Land"

#Using CSS selectors to scrape the Movie runtime section
runtime_data_html <- html_nodes(webpage,'.text-muted .runtime')

#Converting the runtime data to text
runtime_data <- html_text(runtime_data_html)

#Let's have a look at the runtime
head(runtime_data)

## [1] "123 min" "111 min" "133 min" "145 min" "117 min" "128 min"

#Data-Preprocessing: removing mins and converting it to numerical

runtime_data<-gsub(" min","",runtime_data)
runtime_data<-as.numeric(runtime_data)

#Let's have another look at the runtime data
head(runtime_data)

## [1] 123 111 133 145 117 128

#Using CSS selectors to scrape the Movie genre section
genre_data_html <- html_nodes(webpage,'.genre')

#Converting the genre data to text
genre_data <- html_text(genre_data_html)

#Let's have a look at the genre
head(genre_data)

## [1] "\nAction, Adventure, Fantasy            "
## [2] "\nDrama            "                     
## [3] "\nAction, Adventure, Sci-Fi            " 
## [4] "\nDrama, Romance, Thriller            "  
## [5] "\nHorror, Thriller            "          
## [6] "\nComedy, Drama, Music            "

#Data-Preprocessing: removing \n
genre_data<-gsub("\n","",genre_data)

#Data-Preprocessing: removing excess spaces
genre_data<-gsub(" ","",genre_data)

#taking only the first genre of each movie
genre_data<-gsub(",.*","",genre_data)

#Converting each genre from text to factor
genre_data<-as.factor(genre_data)

#Let's have another look at the genre data
head(genre_data)

## [1] Action Drama  Action Drama  Horror Comedy
## 9 Levels: Action Adventure Animation Biography Comedy Crime ... Mystery
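The genre cleanup above can also be done with `trimws()` and `sub()`. This is only a sketch on made-up strings shaped like the scraped output (`raw_genre` is a hypothetical stand-in, not the real scrape). Unlike removing every space, `trimws()` only touches the ends of each string, so a genre name containing a space would survive.

```r
# Hypothetical raw strings shaped like the scraped genre text
raw_genre <- c("\nAction, Adventure, Fantasy            ",
               "\nDrama            ",
               "\nComedy, Drama, Music            ")

# trimws() strips leading/trailing whitespace, including the newline
clean_genre <- trimws(raw_genre)

# Keep only the text before the first comma (the first genre listed)
first_genre <- sub(",.*", "", clean_genre)

first_genre
```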

#Using CSS selectors to scrape the IMDB rating section
rating_data_html <- html_nodes(webpage,'.ratings-imdb-rating strong')

#Converting the ratings data to text
rating_data <- html_text(rating_data_html)

#Let's have a look at the ratings
head(rating_data)

## [1] "6.0" "7.4" "7.8" "8.1" "7.3" "8.0"

#Data-Preprocessing: converting ratings to numerical
rating_data<-as.numeric(rating_data)

#Let's have another look at the ratings data
head(rating_data)

## [1] 6.0 7.4 7.8 8.1 7.3 8.0

#Using CSS selectors to scrape the votes section
votes_data_html <- html_nodes(webpage,'.sort-num_votes-visible span:nth-child(2)')

#Converting the votes data to text
votes_data <- html_text(votes_data_html)

#Let's have a look at the votes data
head(votes_data)

## [1] "551,821" "237,197" "492,296" "85,384"  "378,205" "451,254"

#Data-Preprocessing: removing commas
votes_data<-gsub(",","",votes_data)

#Data-Preprocessing: converting votes to numerical
votes_data<-as.numeric(votes_data)

#Let's have another look at the votes data
head(votes_data)

## [1] 551821 237197 492296  85384 378205 451254
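The runtime and votes fields both follow the same pattern: strip the non-numeric characters, then call `as.numeric()`. A small helper could cover both. `to_number` is a hypothetical name for illustration, and the sketch assumes the strings contain a single plain number with no negative signs or exponents.

```r
# Hypothetical helper: drop everything except digits and the decimal point,
# then convert the remaining text to a number
to_number <- function(x) {
  as.numeric(gsub("[^0-9.]", "", x))
}

votes_demo   <- to_number(c("551,821", "85,384"))
runtime_demo <- to_number(c("123 min", "111 min"))

votes_demo
runtime_demo
```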

I can’t believe the above one worked. I had to manipulate it.

In hopes that I can get the two required plots to work, I’m ignoring the fields that aren’t part of those two graphs. If I’m successful, maybe I’ll try to go back and practice with all the ones I ignored.
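One check that may help before combining: `data.frame()` stops with an error when its columns have different numbers of rows, which can happen if a selector misses a value for some movies. The sketch below uses made-up vectors (`fields` is hypothetical, not the scraped data) to show how to compare lengths first.

```r
# Hypothetical vectors standing in for scraped fields;
# Runtime is deliberately one element short
fields <- list(
  Rank    = 1:3,
  Title   = c("Movie A", "Movie B", "Movie C"),
  Runtime = c(123, 111)
)

# lengths() reports the length of each element of the list
field_lengths <- lengths(fields)
field_lengths

# TRUE only when every field has the same number of values
same_length <- length(unique(field_lengths)) == 1
same_length
```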

#Combining all the lists to form a data frame
movies_df <- data.frame(Rank = rank_data,
                        Title = title_data,
                        Runtime = runtime_data,
                        Genre = genre_data,
                        Rating = rating_data,
                        Votes = votes_data)

#Structure of the data frame

str(movies_df)

## 'data.frame':    100 obs. of  6 variables:
##  $ Rank   : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title  : Factor w/ 100 levels "10 Cloverfield Lane",..: 64 51 58 75 62 42 33 24 10 50 ...
##  $ Runtime: num  123 111 133 145 117 128 139 108 151 107 ...
##  $ Genre  : Factor w/ 9 levels "Action","Adventure",..: 1 7 1 7 8 5 4 1 1 3 ...
##  $ Rating : num  6 7.4 7.8 8.1 7.3 8 8.1 8 6.5 7.6 ...
##  $ Votes  : num  551821 237197 492296 85384 378205 ...

library('ggplot2')

qplot(data = movies_df,Runtime,fill = Genre,bins = 30)

Question 1: Based on the above data (and scanning the IMDb website), American Honey, from the Adventure genre, had the longest runtime. I’m not sure, though, how we’re supposed to identify the movie just by looking at the chart above. (I’m also not a big fan of the colors used here.)
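Since the histogram does not show titles, the longest movie can be looked up from the data frame instead. This is a minimal sketch on a small made-up data frame (`demo_df` and its values are hypothetical); the same `which.max()` call should work on `movies_df`.

```r
# Hypothetical stand-in for movies_df with only the needed columns
demo_df <- data.frame(
  Title   = c("Movie A", "Movie B", "Movie C"),
  Genre   = c("Action", "Adventure", "Drama"),
  Runtime = c(123, 163, 145),
  stringsAsFactors = FALSE
)

# which.max() returns the row index of the largest runtime
longest <- demo_df[which.max(demo_df$Runtime), ]
longest
```

If several movies tie for the longest runtime, `which.max()` only returns the first one.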

ggplot(movies_df, aes(x = Runtime, y = Rating)) +
  geom_point(aes(size = Votes, col = Genre))

Question 2: Based on the above data, in the Runtime of 130 - 160 minutes, which genre had the highest votes?

Again, I’m not liking the colors here, but the largest dots appear to be in the Action genre.
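Reading dot sizes off the plot is hard, so the answer can be checked numerically. The sketch below filters to runtimes of 130 to 160 minutes and totals the votes per genre. It runs on made-up rows (`demo_df` is hypothetical), and it assumes "highest votes" means the largest total per genre; using `FUN = max` instead would find the genre of the single most-voted movie.

```r
# Hypothetical stand-in for movies_df
demo_df <- data.frame(
  Genre   = c("Action", "Action", "Drama", "Comedy"),
  Runtime = c(133, 151, 145, 108),
  Votes   = c(492296, 400000, 85384, 300000)
)

# Keep only movies with a runtime between 130 and 160 minutes
in_range <- subset(demo_df, Runtime >= 130 & Runtime <= 160)

# Total votes per genre within that runtime range
votes_by_genre <- aggregate(Votes ~ Genre, data = in_range, FUN = sum)

# Genre with the largest total
top <- votes_by_genre[which.max(votes_by_genre$Votes), ]
top
```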

Don Allen

10/23/2019