week8WebScrapingLab

source

Notice fig_caption: yes in the title block source:

This permits code chunks, plots, and output to be captioned by including the caption in the leader of the code chunk:
```{r figN,fig.cap=“\label{fig:figN}Source: snapshot of http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature”}

In this case it is the url of IMDB search that we are scraping

Load the libraries we will be using:

library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ───────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(RColorBrewer)

library(rvest)

## Loading required package: xml2

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:purrr':
## 
##     pluck

## The following object is masked from 'package:readr':
## 
##     guess_encoding

Download and unzip saved IMDB search results from GitHub

IMDB search ‘http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature’ results previously saved, zipped, and uploaded to GitHub. We now download them from GitHub and unzip them in the R session’s temporary directory.

gitUrl<-"https://github.com/sdutky/mcData110/raw/master/webScrapingLab/webScrapingLab.zip"
setwd(tempdir())
download.file(gitUrl,"webScrapingLab.zip",method = "auto")
system("unzip -o -q webScrapingLab.zip")
# on windows, try the following:
# system('compact /u "webScrapingLab.zip" /i /Q /f')
# c:\\wimdows\\system32 must be on PATH

Read the local copy of the webpage using the rvest library

“webScrapingLab.html” is a copy of the saved “Feature Film, Released between 2016-01-01 and 2016-12-31 (Sorted by Popularity Ascending) - IMDb.html”

library('rvest')

webpage <- read_html(paste(tempdir(),"/webScrapingLab.html", sep=""))

Now, we’ll be scraping the following data from this website.

Rank: The rank of the film from 1 to 100 on the list of 100 most popular feature films released in 2016.
Title: The title of the feature film.
Description: The description of the feature film.
Runtime: The duration of the feature film.
Genre: The genre of the feature film,
Rating: The IMDb rating of the feature film.
Metascore: The metascore on IMDb website for the feature film.
Votes: Votes cast in favor of the feature film.
Gross_Earning_in_Mil: The gross earnings of the feature film in millions.
Director: The first, main director of the feature film.
Actor: The main actor in the feature film.

Process and clean data:

Scrape each field, use gsub to remove extraneous punctuation such as newlines and trailing spaces. If there are fewer than 100 values scraped, insert NA’s where there are missing values. Coerce numeric strings to numeric values.
add the values on to tibble IMDB.

create function to return text scraped from webpage as selected by CSS:

scrapeCssText<-function(webPage,cssSelector) {
#Using CSS selectors to scrape the HTML of the section of interest
html <- html_nodes(webPage, cssSelector)

#Convert and return the html data as a text vector
html_text(html)
}

Rank: The rank of the film from 1 to 100 on the list of 100 most popular feature films released in 2016.

rank_data <- scrapeCssText(webpage,'.text-primary')


#Let's have a look at the rankings
summary(rank_data)

##    Length     Class      Mode 
##       100 character character

head(rank_data)

## [1] "1." "2." "3." "4." "5." "6."

# coerce to numeric
rank_data<-as.integer(rank_data)

IMDB<-tibble( rank=rank_data)

Title: The title of the feature film.

title_data <- scrapeCssText(webpage,'.lister-item-header a')


#Let's have a look at the titles
summary(title_data)

##    Length     Class      Mode 
##       100 character character

head(title_data)

## [1] "Suicide Squad"                     
## [2] "La La Land"                        
## [3] "War Dogs"                          
## [4] "Split"                             
## [5] "Batman: The Killing Joke"          
## [6] "Batman v Superman: Dawn of Justice"

IMDB$title<-as.character(title_data)

Description: The description of the feature film.

description_data <- scrapeCssText(webpage,'.ratings-bar+ .text-muted')


#Let's have a look at the description
summary(description_data)

##    Length     Class      Mode 
##       100 character character

head(description_data)

## [1] "\n    A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive task force. Their first mission: save the world from the apocalypse."           
## [2] "\n    While navigating their careers in Los Angeles, a pianist and an actress fall in love while attempting to reconcile their aspirations for the future."                                          
## [3] "\n    Loosely based on the true story of two young men, David Packouz and Efraim Diveroli, who won a three hundred million dollar contract from the Pentagon to arm America's allies in Afghanistan."
## [4] "\n    Three girls are kidnapped by a man with a diagnosed 23 distinct personalities. They must try to escape before the apparent emergence of a frightful new 24th."                                 
## [5] "\n    As Batman hunts for the escaped Joker, the Clown Prince of Crime attacks the Gordon family to prove a diabolical point mirroring his own fall into madness."                                   
## [6] "\n    Fearing that the actions of Superman are left unchecked, Batman takes on the Man of Steel, while the world wrestles with what kind of a hero it really needs."

# get rid of leading newlines and spaces
description_data<-gsub("^\\n *","",description_data)

IMDB$description<-as.character(description_data)

Runtime: The duration of the feature film.

runtime_data <- scrapeCssText(webpage,'.runtime')


#Let's have a look at the runtimes
summary(runtime_data)

##    Length     Class      Mode 
##       100 character character

head(runtime_data)

## [1] "123 min" "128 min" "114 min" "117 min" "76 min"  "151 min"

# get rid of all non-numerics, then convert to integer
runtime_data<-as.integer(gsub("[^0-9]*","",runtime_data))
IMDB$runtime<-runtime_data

Genre: The genre of the feature film,

genre_data <- scrapeCssText(webpage,'.genre')

summary(genre_data)

##    Length     Class      Mode 
##       100 character character

head(genre_data)

## [1] "\nAction, Adventure, Fantasy            "
## [2] "\nComedy, Drama, Music            "      
## [3] "\nBiography, Comedy, Crime            "  
## [4] "\nHorror, Thriller            "          
## [5] "\nAnimation, Action, Crime            "  
## [6] "\nAction, Adventure, Sci-Fi            "

# Clean up genre data by removing leading newline and trailing spaces
genre_data<-gsub("^\\n","",genre_data)
genre_data<-gsub(" *$","",genre_data)
# keep only the first category, eliminating everything starting with the comma
genre_data<-gsub(",.*$","",genre_data)

IMDB$genre<-as.character(genre_data)

Rating: The IMDb rating of the feature film.

rating_data <- scrapeCssText(webpage,'.ratings-imdb-rating strong')
summary(rating_data)

##    Length     Class      Mode 
##       100 character character

head(rating_data)

## [1] "6.0" "8.0" "7.1" "7.3" "6.4" "6.5"

IMDB$rating<-as.numeric(rating_data)

Metascore: The metascore on IMDb website for the feature film.

metascore_data <- scrapeCssText(webpage,'.metascore')
summary(metascore_data)

##    Length     Class      Mode 
##        94 character character

head(metascore_data)

## [1] "40        " "93        " "57        " "62        " "44        "
## [6] "65        "

# eliminate non-numerics, coerce to integer
metascore_data<-as.integer(gsub("[^0-9]*","",metascore_data))
# insert NA's where they belong
metascore_data<-c(metascore_data[1:4],NA,
                  metascore_data[5:19],NA,
                  metascore_data[20:34],NA,
                  metascore_data[35:42],NA,
                  metascore_data[43:63],NA,
                  metascore_data[64:65],NA,
                  metascore_data[66:94])
IMDB$metascore<-metascore_data

Votes: Votes cast in favor of the feature film.

votes_data <- scrapeCssText(webpage,'.sort-num_votes-visible span:nth-child(2)')
summary(votes_data)

##    Length     Class      Mode 
##       100 character character

head(votes_data)

## [1] "550,883" "450,506" "159,584" "377,291" "44,553"  "587,788"

# get rid of comma, coerce to integer
votes_data<-as.integer(gsub(",","",votes_data))
IMDB$votes<-votes_data

Gross_Earning_in_Mil: The gross earnings of the feature film in millions.

gross_data <- scrapeCssText(webpage,'.ghost~ .text-muted+ span')
summary(gross_data)

##    Length     Class      Mode 
##        92 character character

head(gross_data)

## [1] "$325.10M" "$151.10M" "$43.02M"  "$138.29M" "$3.78M"   "$330.36M"

# get rid of non-numerics but keep decimal point and coerce to numeric
gross_data<-as.numeric(gsub("[^0-9.]*","",gross_data))
# insert NA's where they belong:
gross_data<-c( gross_data[1:20],NA,
               gross_data[21:35],NA,
               gross_data[36],NA,NA,
               gross_data[37:61],NA,
               gross_data[62:64],NA,
               gross_data[65],NA,
               gross_data[66:90],NA,
               gross_data[91:92]
)
IMDB$gross<-gross_data

Director: The first, main director of the feature film.

director_data <- scrapeCssText(webpage,'.text-muted+ p a:nth-child(1)')
summary(director_data)

##    Length     Class      Mode 
##       100 character character

head(director_data)

## [1] "David Ayer"         "Damien Chazelle"    "Todd Phillips"     
## [4] "M. Night Shyamalan" "Sam Liu"            "Zack Snyder"

IMDB$director<-as.character(director_data)

Actor: The main actor in the feature film.

actor_data <- scrapeCssText(webpage,'.lister-item-content .ghost+ a')
summary(actor_data)

##    Length     Class      Mode 
##       100 character character

head(actor_data)

## [1] "Will Smith"   "Ryan Gosling" "Jonah Hill"   "James McAvoy"
## [5] "Kevin Conroy" "Ben Affleck"

IMDB$actor<-as.character(actor_data)

display the vitals of IMDB

str(IMDB)

## Classes 'tbl_df', 'tbl' and 'data.frame':    100 obs. of  11 variables:
##  $ rank       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ title      : chr  "Suicide Squad" "La La Land" "War Dogs" "Split" ...
##  $ description: chr  "A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive "| __truncated__ "While navigating their careers in Los Angeles, a pianist and an actress fall in love while attempting to reconc"| __truncated__ "Loosely based on the true story of two young men, David Packouz and Efraim Diveroli, who won a three hundred mi"| __truncated__ "Three girls are kidnapped by a man with a diagnosed 23 distinct personalities. They must try to escape before t"| __truncated__ ...
##  $ runtime    : int  123 128 114 117 76 151 108 133 107 147 ...
##  $ genre      : chr  "Action" "Comedy" "Biography" "Horror" ...
##  $ rating     : num  6 8 7.1 7.3 6.4 6.5 8 7.8 7.6 7.8 ...
##  $ metascore  : int  40 93 57 62 NA 44 65 65 81 75 ...
##  $ votes      : int  550883 450506 159584 377291 44553 587788 841926 491520 234017 605293 ...
##  $ gross      : num  325.1 151.1 43.02 138.29 3.78 ...
##  $ director   : chr  "David Ayer" "Damien Chazelle" "Todd Phillips" "M. Night Shyamalan" ...
##  $ actor      : chr  "Will Smith" "Ryan Gosling" "Jonah Hill" "James McAvoy" ...

summary(IMDB)

##       rank           title           description           runtime     
##  Min.   :  1.00   Length:100         Length:100         Min.   : 76.0  
##  1st Qu.: 25.75   Class :character   Class :character   1st Qu.:101.5  
##  Median : 50.50   Mode  :character   Mode  :character   Median :111.5  
##  Mean   : 50.50                                         Mean   :113.2  
##  3rd Qu.: 75.25                                         3rd Qu.:123.0  
##  Max.   :100.00                                         Max.   :161.0  
##                                                                        
##     genre               rating        metascore         votes       
##  Length:100         Min.   :5.000   Min.   :21.00   Min.   :  2596  
##  Class :character   1st Qu.:6.200   1st Qu.:47.00   1st Qu.: 76882  
##  Mode  :character   Median :6.800   Median :60.00   Median :129876  
##                     Mean   :6.785   Mean   :59.16   Mean   :171233  
##                     3rd Qu.:7.400   3rd Qu.:71.75   3rd Qu.:212759  
##                     Max.   :8.400   Max.   :99.00   Max.   :841926  
##                                     NA's   :6                       
##      gross          director            actor          
##  Min.   :  0.04   Length:100         Length:100        
##  1st Qu.: 14.27   Class :character   Class :character  
##  Median : 53.98   Mode  :character   Mode  :character  
##  Mean   : 95.55                                        
##  3rd Qu.:125.46                                        
##  Max.   :532.18                                        
##  NA's   :8

IMDB

## # A tibble: 100 x 11
##     rank title description runtime genre rating metascore  votes  gross
##    <int> <chr> <chr>         <int> <chr>  <dbl>     <int>  <int>  <dbl>
##  1     1 Suic… A secret g…     123 Acti…    6          40 550883 325.  
##  2     2 La L… While navi…     128 Come…    8          93 450506 151.  
##  3     3 War … Loosely ba…     114 Biog…    7.1        57 159584  43.0 
##  4     4 Split Three girl…     117 Horr…    7.3        62 377291 138.  
##  5     5 Batm… As Batman …      76 Anim…    6.4        NA  44553   3.78
##  6     6 Batm… Fearing th…     151 Acti…    6.5        44 587788 330.  
##  7     7 Dead… A wisecrac…     108 Acti…    8          65 841926 363.  
##  8     8 Rogu… The daught…     133 Acti…    7.8        65 491520 532.  
##  9     9 Moana In Ancient…     107 Anim…    7.6        81 234017 249.  
## 10    10 Capt… Political …     147 Acti…    7.8        75 605293 408.  
## # … with 90 more rows, and 2 more variables: director <chr>, actor <chr>

Plots:

Saurav Kaushik’s qplot:

qplot(data = IMDB,runtime,fill = genre,bins = 30)

My version using ggplot with geom_histogram with ggplotly:

gh<-ggplot(data=IMDB,aes( x=runtime,fill=genre))+
  labs(title="Histogram of the Top 100 Films of 2016\nby Runtime, Color Filled by Genre")+
  theme(plot.title = element_text(hjust=0,size=10))+
  scale_fill_brewer(palette = "Set1")+
  geom_histogram(color="white",bins=30)
ggplotly(gh)

Source: snapshot of http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature

My version, take 2: add more information to plotly tooltips

makeHist<-IMDB # make new copy for putting histogram together

# compute bin breaks for geom_histogram
breaks<-hist(makeHist$runtime,breaks="Scott", plot=FALSE)$breaks 
makeHist$bin<-cut(makeHist$runtime,breaks,labels=FALSE) # assign each row to a bin

makeHist<- makeHist %>%
  mutate(bin=as.character(bin)) %>%
  group_by(  bin,genre) %>%
  mutate(meanRuntime=mean(runtime),meanRank=mean(rank),
         minRuntime=min(runtime),maxRuntime=max(runtime),
            meanGross=mean(gross,na.rm=TRUE),meanRating=mean(rating)) %>%
  select(runtime,bin,genre,meanRuntime,minRuntime,maxRuntime,
         meanRank,meanGross,meanRating) %>%
  mutate(meanGross=replace(meanGross,is.na(meanGross),0)) %>%
  arrange(bin,genre)

gh<-ggplot(data=makeHist,aes( x=meanRuntime,fill=genre,
      labelMark="______________", labelBin=bin,
      label0=meanRuntime,labelMax=maxRuntime,
      labelMin=minRuntime,label1=meanRank,
      label2=meanGross,label3=meanRating
  ))+
  labs(title="Histogram of the Top 100 Films of 2016\nby Runtime, Color Filled by Genre")+
  theme(plot.title = element_text(hjust=0,size=10))+
  scale_fill_brewer(palette = "Set1")+
  geom_histogram(color="white",breaks=breaks)
#gh
ggplotly(gh)

Source: snapshot of http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature

Question 1: Based on the above data, which movie from which Genre had the longest runtime?

answer<-IMDB %>%
  filter(runtime>150) %>%
  select(title,genre,runtime) %>%
  filter(runtime>=max(runtime))

answer

## # A tibble: 2 x 3
##   title   genre  runtime
##   <chr>   <chr>    <int>
## 1 Silence Drama      161
## 2 Dangal  Action     161

Saurav Kaushik’s point plot:

ggplot(IMDB,aes(x=runtime,y=rating))+
geom_point(aes(size=votes,col=genre))

My version: using ggplotly and geom_rug to add rug plots to see the distributions of the variables on the x and y axes.

gp<-ggplot(IMDB)+
  aes(x=runtime,y=rating, size=votes,fill=genre,
      label0="-------",label1=rank, label2=title, label4=runtime, label5=genre, label6=rating, label7=metascore, label8=votes, label9=gross, label10=director, label11=actor)+
  labs(title="Point Plot of the Top 100 Films of 2016\nsized by votes and color filled by genre")+
  theme(plot.title = element_text(hjust=0,size=10))+
  guides(size=FALSE)+ # supress legend for size/votes
  scale_color_brewer(palette = "Set1")+
  scale_fill_brewer(palette = "Set1")+
geom_point(col="white")+
  geom_rug(inherit.aes = FALSE,
           data=IMDB,
           aes(x=runtime,y=rating,color=genre))
ggplotly(gp)

Source: snapshot of http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature

#gp

# Diagnostice for submitting to Stack Overflow
#dput(IMDB[,c("runtime","title","rating","votes","genre","rank","metascore","gross","director","actor")])

Question 2: Based on the above data, in the Runtime of 130-160 mins, which genre has the highest votes?

answer<- IMDB %>%
  filter(runtime>129 & runtime<161) %>%
  group_by(genre) %>%
  summarise(sumVotes=sum(votes)) %>%
  arrange(desc(sumVotes)) 

answer[1,]

## # A tibble: 1 x 2
##   genre  sumVotes
##   <chr>     <int>
## 1 Action  2330847

Saurav Kaushik’s point plot of runtime vs gross:

ggplot(IMDB,aes(x=runtime,y=gross))+
geom_point(aes(size=rating,col=genre))

## Warning: Removed 8 rows containing missing values (geom_point).

My version: again with ggplotly and geom_rug

gp<-ggplot(IMDB)+
  aes(x=runtime,y=gross, size=rating,color=genre,
      label0="-------",label1=rank, label2=title, label4=runtime, label5=genre, label6=rating, label7=metascore, label8=votes, label9=gross, label10=director, label11=actor)+
  labs(title="Point Plot of the Top 100 Films of 2016\nsized by rating and color filled by genre")+
  theme(plot.title = element_text(hjust=0,size=10))+
  guides(size=FALSE)+ # supress legend for size/rating
  scale_color_brewer(palette = "Set1")+
  scale_fill_brewer(palette = "Set1")+
geom_point()+
  geom_rug(inherit.aes = FALSE,data=IMDB,aes(x=runtime,y=gross,color=genre))
ggplotly(gp)

Source: snapshot of http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature

Question 3: Based on the above data, across all genres which genre has the highest average gross earnings in runtime 100 to 120

answer<-IMDB %>%
  filter(runtime>99 & runtime<121) %>%
  group_by(genre) %>%
  summarise(meanGross=mean(gross,na.rm=TRUE)) %>%
  arrange(desc(meanGross))

answer[1,]

## # A tibble: 1 x 2
##   genre     meanGross
##   <chr>         <dbl>
## 1 Adventure       364

Mystery

I can’t explain what happens with the fill and boundary colors of my two point plots above using colorbrewer above.

The code is essentially identical for both plots, however the first call to geom_point requires ‘col=“white”’ to avoid having the points surrounded by a black boundary line. Including ‘col=“white”’ in the second call to geom_point turns the points completely white.

Additionally I don’t understand why the legend key labels take the format (label,1).

week8WebScrapingLab

Steve Dutky

10/20/2019

source

Notice fig_caption: yes in the title block source:

Load the libraries we will be using:

Download and unzip saved IMDB search results from GitHub

Read the local copy of the webpage using the rvest library

Now, we’ll be scraping the following data from this website.

Process and clean data:

create function to return text scraped from webpage as selected by CSS:

Rank: The rank of the film from 1 to 100 on the list of 100 most popular feature films released in 2016.

Title: The title of the feature film.

Description: The description of the feature film.

Runtime: The duration of the feature film.

Genre: The genre of the feature film,

Rating: The IMDb rating of the feature film.

Metascore: The metascore on IMDb website for the feature film.

Votes: Votes cast in favor of the feature film.

Gross_Earning_in_Mil: The gross earnings of the feature film in millions.

Director: The first, main director of the feature film.

Actor: The main actor in the feature film.

display the vitals of IMDB

Plots:

Saurav Kaushik’s qplot:

My version using ggplot with geom_histogram with ggplotly:

My version, take 2: add more information to plotly tooltips

Question 1: Based on the above data, which movie from which Genre had the longest runtime?

Saurav Kaushik’s point plot:

My version: using ggplotly and geom_rug to add rug plots to see the distributions of the variables on the x and y axes.

Question 2: Based on the above data, in the Runtime of 130-160 mins, which genre has the highest votes?

Saurav Kaushik’s point plot of runtime vs gross:

My version: again with ggplotly and geom_rug

Question 3: Based on the above data, across all genres which genre has the highest average gross earnings in runtime 100 to 120

Mystery