This permits code chunks, plots, and output to be captioned by including the caption in the leader of the code chunk:
```{r figN,fig.cap=“\label{fig:figN}Source: snapshot of http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature”}
In this case it is the url of IMDB search that we are scraping
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(RColorBrewer)
library(rvest)
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
IMDB search ‘http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature’ results previously saved, zipped, and uploaded to GitHub. We now download them from GitHub and unzip them in the R session’s temporary directory.
gitUrl<-"https://github.com/sdutky/mcData110/raw/master/webScrapingLab/webScrapingLab.zip"
setwd(tempdir())
download.file(gitUrl,"webScrapingLab.zip",method = "auto")
system("unzip -o -q webScrapingLab.zip")
# on windows, try the following:
# system('compact /u "webScrapingLab.zip" /i /Q /f')
# c:\\wimdows\\system32 must be on PATH
“webScrapingLab.html” is a copy of the saved “Feature Film, Released between 2016-01-01 and 2016-12-31 (Sorted by Popularity Ascending) - IMDb.html”
library('rvest')
webpage <- read_html(paste(tempdir(),"/webScrapingLab.html", sep=""))
Rank: The rank of the film from 1 to 100 on the list of 100 most popular feature films released in 2016.
Title: The title of the feature film.
Description: The description of the feature film.
Runtime: The duration of the feature film.
Genre: The genre of the feature film,
Rating: The IMDb rating of the feature film.
Metascore: The metascore on IMDb website for the feature film.
Votes: Votes cast in favor of the feature film.
Gross_Earning_in_Mil: The gross earnings of the feature film in millions.
Director: The first, main director of the feature film.
Actor: The main actor in the feature film.
Scrape each field, use gsub to remove extraneous punctuation such as newlines and trailing spaces. If there are fewer than 100 values scraped, insert NA’s where there are missing values. Coerce numeric strings to numeric values.
add the values on to tibble IMDB.
scrapeCssText<-function(webPage,cssSelector) {
#Using CSS selectors to scrape the HTML of the section of interest
html <- html_nodes(webPage, cssSelector)
#Convert and return the html data as a text vector
html_text(html)
}
rank_data <- scrapeCssText(webpage,'.text-primary')
#Let's have a look at the rankings
summary(rank_data)
## Length Class Mode
## 100 character character
head(rank_data)
## [1] "1." "2." "3." "4." "5." "6."
# coerce to numeric
rank_data<-as.integer(rank_data)
IMDB<-tibble( rank=rank_data)
title_data <- scrapeCssText(webpage,'.lister-item-header a')
#Let's have a look at the titles
summary(title_data)
## Length Class Mode
## 100 character character
head(title_data)
## [1] "Suicide Squad"
## [2] "La La Land"
## [3] "War Dogs"
## [4] "Split"
## [5] "Batman: The Killing Joke"
## [6] "Batman v Superman: Dawn of Justice"
IMDB$title<-as.character(title_data)
description_data <- scrapeCssText(webpage,'.ratings-bar+ .text-muted')
#Let's have a look at the description
summary(description_data)
## Length Class Mode
## 100 character character
head(description_data)
## [1] "\n A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive task force. Their first mission: save the world from the apocalypse."
## [2] "\n While navigating their careers in Los Angeles, a pianist and an actress fall in love while attempting to reconcile their aspirations for the future."
## [3] "\n Loosely based on the true story of two young men, David Packouz and Efraim Diveroli, who won a three hundred million dollar contract from the Pentagon to arm America's allies in Afghanistan."
## [4] "\n Three girls are kidnapped by a man with a diagnosed 23 distinct personalities. They must try to escape before the apparent emergence of a frightful new 24th."
## [5] "\n As Batman hunts for the escaped Joker, the Clown Prince of Crime attacks the Gordon family to prove a diabolical point mirroring his own fall into madness."
## [6] "\n Fearing that the actions of Superman are left unchecked, Batman takes on the Man of Steel, while the world wrestles with what kind of a hero it really needs."
# get rid of leading newlines and spaces
description_data<-gsub("^\\n *","",description_data)
IMDB$description<-as.character(description_data)
runtime_data <- scrapeCssText(webpage,'.runtime')
#Let's have a look at the runtimes
summary(runtime_data)
## Length Class Mode
## 100 character character
head(runtime_data)
## [1] "123 min" "128 min" "114 min" "117 min" "76 min" "151 min"
# get rid of all non-numerics, then convert to integer
runtime_data<-as.integer(gsub("[^0-9]*","",runtime_data))
IMDB$runtime<-runtime_data
genre_data <- scrapeCssText(webpage,'.genre')
summary(genre_data)
## Length Class Mode
## 100 character character
head(genre_data)
## [1] "\nAction, Adventure, Fantasy "
## [2] "\nComedy, Drama, Music "
## [3] "\nBiography, Comedy, Crime "
## [4] "\nHorror, Thriller "
## [5] "\nAnimation, Action, Crime "
## [6] "\nAction, Adventure, Sci-Fi "
# Clean up genre data by removing leading newline and trailing spaces
genre_data<-gsub("^\\n","",genre_data)
genre_data<-gsub(" *$","",genre_data)
# keep only the first category, eliminating everything starting with the comma
genre_data<-gsub(",.*$","",genre_data)
IMDB$genre<-as.character(genre_data)
rating_data <- scrapeCssText(webpage,'.ratings-imdb-rating strong')
summary(rating_data)
## Length Class Mode
## 100 character character
head(rating_data)
## [1] "6.0" "8.0" "7.1" "7.3" "6.4" "6.5"
IMDB$rating<-as.numeric(rating_data)
metascore_data <- scrapeCssText(webpage,'.metascore')
summary(metascore_data)
## Length Class Mode
## 94 character character
head(metascore_data)
## [1] "40 " "93 " "57 " "62 " "44 "
## [6] "65 "
# eliminate non-numerics, coerce to integer
metascore_data<-as.integer(gsub("[^0-9]*","",metascore_data))
# insert NA's where they belong
metascore_data<-c(metascore_data[1:4],NA,
metascore_data[5:19],NA,
metascore_data[20:34],NA,
metascore_data[35:42],NA,
metascore_data[43:63],NA,
metascore_data[64:65],NA,
metascore_data[66:94])
IMDB$metascore<-metascore_data
votes_data <- scrapeCssText(webpage,'.sort-num_votes-visible span:nth-child(2)')
summary(votes_data)
## Length Class Mode
## 100 character character
head(votes_data)
## [1] "550,883" "450,506" "159,584" "377,291" "44,553" "587,788"
# get rid of comma, coerce to integer
votes_data<-as.integer(gsub(",","",votes_data))
IMDB$votes<-votes_data
gross_data <- scrapeCssText(webpage,'.ghost~ .text-muted+ span')
summary(gross_data)
## Length Class Mode
## 92 character character
head(gross_data)
## [1] "$325.10M" "$151.10M" "$43.02M" "$138.29M" "$3.78M" "$330.36M"
# get rid of non-numerics but keep decimal point and coerce to numeric
gross_data<-as.numeric(gsub("[^0-9.]*","",gross_data))
# insert NA's where they belong:
gross_data<-c( gross_data[1:20],NA,
gross_data[21:35],NA,
gross_data[36],NA,NA,
gross_data[37:61],NA,
gross_data[62:64],NA,
gross_data[65],NA,
gross_data[66:90],NA,
gross_data[91:92]
)
IMDB$gross<-gross_data
director_data <- scrapeCssText(webpage,'.text-muted+ p a:nth-child(1)')
summary(director_data)
## Length Class Mode
## 100 character character
head(director_data)
## [1] "David Ayer" "Damien Chazelle" "Todd Phillips"
## [4] "M. Night Shyamalan" "Sam Liu" "Zack Snyder"
IMDB$director<-as.character(director_data)
actor_data <- scrapeCssText(webpage,'.lister-item-content .ghost+ a')
summary(actor_data)
## Length Class Mode
## 100 character character
head(actor_data)
## [1] "Will Smith" "Ryan Gosling" "Jonah Hill" "James McAvoy"
## [5] "Kevin Conroy" "Ben Affleck"
IMDB$actor<-as.character(actor_data)
str(IMDB)
## Classes 'tbl_df', 'tbl' and 'data.frame': 100 obs. of 11 variables:
## $ rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ title : chr "Suicide Squad" "La La Land" "War Dogs" "Split" ...
## $ description: chr "A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive "| __truncated__ "While navigating their careers in Los Angeles, a pianist and an actress fall in love while attempting to reconc"| __truncated__ "Loosely based on the true story of two young men, David Packouz and Efraim Diveroli, who won a three hundred mi"| __truncated__ "Three girls are kidnapped by a man with a diagnosed 23 distinct personalities. They must try to escape before t"| __truncated__ ...
## $ runtime : int 123 128 114 117 76 151 108 133 107 147 ...
## $ genre : chr "Action" "Comedy" "Biography" "Horror" ...
## $ rating : num 6 8 7.1 7.3 6.4 6.5 8 7.8 7.6 7.8 ...
## $ metascore : int 40 93 57 62 NA 44 65 65 81 75 ...
## $ votes : int 550883 450506 159584 377291 44553 587788 841926 491520 234017 605293 ...
## $ gross : num 325.1 151.1 43.02 138.29 3.78 ...
## $ director : chr "David Ayer" "Damien Chazelle" "Todd Phillips" "M. Night Shyamalan" ...
## $ actor : chr "Will Smith" "Ryan Gosling" "Jonah Hill" "James McAvoy" ...
summary(IMDB)
## rank title description runtime
## Min. : 1.00 Length:100 Length:100 Min. : 76.0
## 1st Qu.: 25.75 Class :character Class :character 1st Qu.:101.5
## Median : 50.50 Mode :character Mode :character Median :111.5
## Mean : 50.50 Mean :113.2
## 3rd Qu.: 75.25 3rd Qu.:123.0
## Max. :100.00 Max. :161.0
##
## genre rating metascore votes
## Length:100 Min. :5.000 Min. :21.00 Min. : 2596
## Class :character 1st Qu.:6.200 1st Qu.:47.00 1st Qu.: 76882
## Mode :character Median :6.800 Median :60.00 Median :129876
## Mean :6.785 Mean :59.16 Mean :171233
## 3rd Qu.:7.400 3rd Qu.:71.75 3rd Qu.:212759
## Max. :8.400 Max. :99.00 Max. :841926
## NA's :6
## gross director actor
## Min. : 0.04 Length:100 Length:100
## 1st Qu.: 14.27 Class :character Class :character
## Median : 53.98 Mode :character Mode :character
## Mean : 95.55
## 3rd Qu.:125.46
## Max. :532.18
## NA's :8
IMDB
## # A tibble: 100 x 11
## rank title description runtime genre rating metascore votes gross
## <int> <chr> <chr> <int> <chr> <dbl> <int> <int> <dbl>
## 1 1 Suic… A secret g… 123 Acti… 6 40 550883 325.
## 2 2 La L… While navi… 128 Come… 8 93 450506 151.
## 3 3 War … Loosely ba… 114 Biog… 7.1 57 159584 43.0
## 4 4 Split Three girl… 117 Horr… 7.3 62 377291 138.
## 5 5 Batm… As Batman … 76 Anim… 6.4 NA 44553 3.78
## 6 6 Batm… Fearing th… 151 Acti… 6.5 44 587788 330.
## 7 7 Dead… A wisecrac… 108 Acti… 8 65 841926 363.
## 8 8 Rogu… The daught… 133 Acti… 7.8 65 491520 532.
## 9 9 Moana In Ancient… 107 Anim… 7.6 81 234017 249.
## 10 10 Capt… Political … 147 Acti… 7.8 75 605293 408.
## # … with 90 more rows, and 2 more variables: director <chr>, actor <chr>
qplot(data = IMDB,runtime,fill = genre,bins = 30)
gh<-ggplot(data=IMDB,aes( x=runtime,fill=genre))+
labs(title="Histogram of the Top 100 Films of 2016\nby Runtime, Color Filled by Genre")+
theme(plot.title = element_text(hjust=0,size=10))+
scale_fill_brewer(palette = "Set1")+
geom_histogram(color="white",bins=30)
ggplotly(gh)
makeHist<-IMDB # make new copy for putting histogram together
# compute bin breaks for geom_histogram
breaks<-hist(makeHist$runtime,breaks="Scott", plot=FALSE)$breaks
makeHist$bin<-cut(makeHist$runtime,breaks,labels=FALSE) # assign each row to a bin
makeHist<- makeHist %>%
mutate(bin=as.character(bin)) %>%
group_by( bin,genre) %>%
mutate(meanRuntime=mean(runtime),meanRank=mean(rank),
minRuntime=min(runtime),maxRuntime=max(runtime),
meanGross=mean(gross,na.rm=TRUE),meanRating=mean(rating)) %>%
select(runtime,bin,genre,meanRuntime,minRuntime,maxRuntime,
meanRank,meanGross,meanRating) %>%
mutate(meanGross=replace(meanGross,is.na(meanGross),0)) %>%
arrange(bin,genre)
gh<-ggplot(data=makeHist,aes( x=meanRuntime,fill=genre,
labelMark="______________", labelBin=bin,
label0=meanRuntime,labelMax=maxRuntime,
labelMin=minRuntime,label1=meanRank,
label2=meanGross,label3=meanRating
))+
labs(title="Histogram of the Top 100 Films of 2016\nby Runtime, Color Filled by Genre")+
theme(plot.title = element_text(hjust=0,size=10))+
scale_fill_brewer(palette = "Set1")+
geom_histogram(color="white",breaks=breaks)
#gh
ggplotly(gh)
answer<-IMDB %>%
filter(runtime>150) %>%
select(title,genre,runtime) %>%
filter(runtime>=max(runtime))
answer
## # A tibble: 2 x 3
## title genre runtime
## <chr> <chr> <int>
## 1 Silence Drama 161
## 2 Dangal Action 161
ggplot(IMDB,aes(x=runtime,y=rating))+
geom_point(aes(size=votes,col=genre))
gp<-ggplot(IMDB)+
aes(x=runtime,y=rating, size=votes,fill=genre,
label0="-------",label1=rank, label2=title, label4=runtime, label5=genre, label6=rating, label7=metascore, label8=votes, label9=gross, label10=director, label11=actor)+
labs(title="Point Plot of the Top 100 Films of 2016\nsized by votes and color filled by genre")+
theme(plot.title = element_text(hjust=0,size=10))+
guides(size=FALSE)+ # supress legend for size/votes
scale_color_brewer(palette = "Set1")+
scale_fill_brewer(palette = "Set1")+
geom_point(col="white")+
geom_rug(inherit.aes = FALSE,
data=IMDB,
aes(x=runtime,y=rating,color=genre))
ggplotly(gp)
Source: snapshot of http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature
#gp
# Diagnostice for submitting to Stack Overflow
#dput(IMDB[,c("runtime","title","rating","votes","genre","rank","metascore","gross","director","actor")])
answer<- IMDB %>%
filter(runtime>129 & runtime<161) %>%
group_by(genre) %>%
summarise(sumVotes=sum(votes)) %>%
arrange(desc(sumVotes))
answer[1,]
## # A tibble: 1 x 2
## genre sumVotes
## <chr> <int>
## 1 Action 2330847
ggplot(IMDB,aes(x=runtime,y=gross))+
geom_point(aes(size=rating,col=genre))
## Warning: Removed 8 rows containing missing values (geom_point).
gp<-ggplot(IMDB)+
aes(x=runtime,y=gross, size=rating,color=genre,
label0="-------",label1=rank, label2=title, label4=runtime, label5=genre, label6=rating, label7=metascore, label8=votes, label9=gross, label10=director, label11=actor)+
labs(title="Point Plot of the Top 100 Films of 2016\nsized by rating and color filled by genre")+
theme(plot.title = element_text(hjust=0,size=10))+
guides(size=FALSE)+ # supress legend for size/rating
scale_color_brewer(palette = "Set1")+
scale_fill_brewer(palette = "Set1")+
geom_point()+
geom_rug(inherit.aes = FALSE,data=IMDB,aes(x=runtime,y=gross,color=genre))
ggplotly(gp)
answer<-IMDB %>%
filter(runtime>99 & runtime<121) %>%
group_by(genre) %>%
summarise(meanGross=mean(gross,na.rm=TRUE)) %>%
arrange(desc(meanGross))
answer[1,]
## # A tibble: 1 x 2
## genre meanGross
## <chr> <dbl>
## 1 Adventure 364
I can’t explain what happens with the fill and boundary colors of my two point plots above using colorbrewer above.
The code is essentially identical for both plots, however the first call to geom_point requires ‘col=“white”’ to avoid having the points surrounded by a black boundary line. Including ‘col=“white”’ in the second call to geom_point turns the points completely white.
Additionally I don’t understand why the legend key labels take the format (label,1).