By: Lawrence Lo
Email: ollecnerwal180@gmail.com
Web scraping is the extraction of data from websites for use in further analysis. This presentation demonstrates web scraping in R using the rvest package. We will scrape thousands of data points across multiple web pages from Metacritic, one of the largest online aggregators of media product reviews.
Although some website data appears conveniently in a table-like format, we will be scraping data scattered across unconventional formats. For this demonstration we will extract data on the top user-rated music albums from Metacritic. The layout of the website appears below:
Metacritic top 100 user rated albums
From this site, we can see the following information for each album:

- the album name
- the artist name
- the user score
- the Metacritic score (metascore)
- the genre(s)
- the release date
We will extract all of this information and conform it to a data frame that can be analyzed and further manipulated.
This presentation will cover the following topics:

- CSS selectors and the SelectorGadget tool
- extracting data from a single web page with rvest
- handling missing data during extraction
- automating the extraction across multiple web pages
- basic descriptive and inferential analysis of the scraped data
Cascading Style Sheets (CSS) selectors are patterns used to target specific HTML elements on a web page. Selector tools outside of R help us gather the specific CSS information necessary for the rvest functions. A popular one demonstrated here is SelectorGadget, which can be installed as a Google Chrome extension and uses a simple point-and-click interface.
For example, using SelectorGadget on the album names produces the following screen:
Selector gadget highlighting album names
In the small SelectorGadget interface overlaid on the website, we can see the “.product_title a” CSS text appear. This text will need to be input into R in order to extract the album names from the web page.
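To make the connection between a selector and the underlying markup concrete, here is a minimal sketch using rvest’s minimal_html helper on a made-up HTML fragment. The class names follow the selectors above, but the markup itself is hypothetical, not Metacritic’s actual source:

```r
library(rvest)

# Hypothetical fragment shaped like an album listing
fragment <- minimal_html('
  <div class="product_title"><a href="/album-one">Album One</a></div>
  <div class="product_artist"> - Artist One</div>
')

# ".product_title a" matches <a> tags nested inside elements
# carrying the class "product_title"
fragment %>% html_nodes(".product_title a") %>% html_text()
#> [1] "Album One"
```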
Here is another example where the user scores have been selected:
Selector gadget highlighting user scores
Now, the CSS text is “.positive”; this will be used to extract the user ratings.
Now that we can use SelectorGadget to find the CSS information we need, let’s put it to use within R.
Before we begin, let’s take a look at the R packages we will be using:

- rvest: web scraping functions
- dplyr: data manipulation and the %>% pipe operator
- stringr: string manipulation
- ggplot2: data visualization
First, we’ll load these packages into R.
```r
library(rvest)
library(dplyr)
library(stringr)
library(ggplot2)
```
The read_html function takes the url of the web page we will be scraping. Let’s assign this to the object, ‘wp’.
```r
wp <- read_html("https://www.metacritic.com/browse/albums/release-date/available/userscore?view=detailed&page=0")
```
In the first example of using SelectorGadget, we found the CSS field for album names was “.product_title a”. We can input this into the html_nodes function as shown below to extract the album names. Here we assign the album names to the object, ‘album_name’, and take a look at the first few entries.
```r
album_name <- wp %>%
  html_nodes(".product_title a") %>%
  html_text()
head(album_name)
```
```
[1] "Scarlet's Walk" "Good Kid, ...
```
Using the selectorgadget to highlight the artist names produces the CSS field text, “.product_artist”. Let’s assign this data to the object, ‘artist_name’.
```r
artist_name <- wp %>%
  html_nodes(".product_artist") %>%
  html_text()
head(artist_name)
```
```
[1] " - Tori Amos"      " - Kendrick Lamar" " - Kendrick Lamar"
[4] " - Kenny Chesney"  " - Led Zeppelin"   " - Blur"
```
Notice that each artist has a " - " prefix. Web scraping often introduces small blemishes like this in our data. We can eliminate it with the following modification to the code:
```r
artist_name <- wp %>%
  html_nodes(".product_artist") %>%
  html_text() %>%
  str_replace(" - ", "")
head(artist_name)
```
```
[1] "Tori Amos"      "Kendrick Lamar" "Kendrick Lamar" "Kenny Chesney"
[5] "Led Zeppelin"   "Blur"
```
Let’s extract the user scores into the object, ‘user_score’.
```r
user_score <- wp %>%
  html_nodes(".positive") %>%
  html_text()
head(user_score)
```
```
[1] "9.1" "9.1" "9.1" "9.0" "9.0" "9.0"
```
From glancing at the first few entries, we see that these have been extracted as character strings. Let’s modify our code to convert them into numeric values.
```r
user_score <- wp %>%
  html_nodes(".positive") %>%
  html_text() %>%
  as.numeric()
head(user_score)
```
```
[1] 9.1 9.1 9.1 9.0 9.0 9.0
```
The Metacritic scores (metascores) can be handled in a similar way.
```r
meta_score <- wp %>%
  html_nodes(".textscore") %>%
  html_text() %>%
  as.numeric()
head(meta_score)
```
```
[1] 76 91 96 67 97 88
```
The genre data, however, arrives in a messy format.
```r
genre <- wp %>%
  html_nodes(".genre .data") %>%
  html_text()
head(genre)
```
```
[1] "\n            Rock, ...
```
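Before cleaning this up, here is a quick illustration of what str_squish will do, applied to a made-up string shaped like the raw genre text:

```r
# str_squish trims leading/trailing whitespace and collapses internal
# runs of whitespace (including newlines) into single spaces
str_squish("\n        Rock, Pop/Rock\n    ")
#> [1] "Rock, Pop/Rock"
```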
The following code will achieve two things. First, it will format the data to be more compatible as an entry within a data frame cell. Second, it will extract only the primary genre for ease of further analysis.
```r
primary_genre <- wp %>%
  html_nodes(".genre .data") %>%
  html_text() %>%
  str_squish() %>%
  str_replace(",.*", "")  # keep only the first (primary) genre listed
head(primary_genre)
```

```
[1] "Rock"    "Rap"     "Rap"     "Country" "Rock"    "Rock"
```
When using SelectorGadget, it indicates a count of how many elements have been selected. Upon selecting the genre data, it is quickly noticeable that 4 of the 100 displayed albums are without a genre listing. We can confirm this discrepancy in R.
```r
# Compare the vector lengths; with 4 missing genre listings we
# expect 100 album names but only 96 genres
data.frame(album_name_length = length(album_name),
           genre_length = length(primary_genre))
```
Unfortunately, the extraction does not notice missing fields, so it cannot place missing-data indicators where they belong. This is a common hurdle in web scraping, and overcoming it requires some flexibility in R.
We could visually inspect the web page for missing genre entries and place the missing values in our data manually, but an automated approach will let us handle other web pages without manual inspection. After using SelectorGadget to highlight several fields at once, we can extract a larger subset including the release date, genre, and metascore.
```r
test1 <- wp %>%
  html_nodes(".data") %>%
  html_text() %>%
  str_squish()
head(test1)
```

```
[1] "Oct 29, ...
```
Converting this into a numeric vector, we should notice a repeating (NA, NA, numeric) pattern: the date and genre strings coerce to NA, while the metascore parses as a number.
```r
test2 <- as.numeric(test1)
```

```
Warning: NAs introduced by coercion
```

```r
head(test2)
```

```
[1] NA NA 76 NA NA 91
```
An album with a missing genre breaks this pattern. Using this fact, we can run the following for-loop routine to insert an NA for each missing genre at the proper location.
```r
# Identify the count of missing genres
NAcount <- 100 - length(primary_genre)

# Store the positions where NAs need to be inserted
NAwhich <- rep(0, NAcount)

# Loop for finding breaks in the (NA, NA, numeric) pattern
for (i in 1:NAcount) {
  for (j in 0:100) {
    # cat(i, j, !is.na(test2[3*j + 2]), "\n")
    if (!is.na(test2[3*j + 2])) {
      test2 <- append(test2, NA, 3*j)
      NAwhich[i] <- j + 1
      break
    }
  }
}

# Insert NAs into primary_genre in the correct order
for (i in NAwhich) {
  primary_genre <- append(primary_genre, NA, i - 1)
}
```
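As a quick sanity check (assuming the 4 missing listings noted above), the genre vector should now have one entry per displayed album:

```r
length(primary_genre)      # expected: 100
sum(is.na(primary_genre))  # expected: 4, one per missing genre listing
```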
The release date will be our last extracted variable.
```r
release_date <- wp %>%
  html_nodes(".release_date .data") %>%
  html_text()
head(release_date)
```

```
[1] "Oct 29, ...
```
From these release date character strings, let’s extract the year into a separate numeric variable for our later analysis.
```r
release_year <- release_date %>%
  str_replace(".*, ", "") %>%  # drop everything through ", ", leaving the year
  as.numeric()
head(release_year)
```
```
[1] 2002 2012 2015 2004 2003 2000
```
Finally, let’s enter all this scraped data into a data frame.
```r
data_music <- data.frame(album_name, artist_name, user_score, meta_score,
                         primary_genre, release_date, release_year)
head(data_music)
```
Now we have a data frame with all of our web scraped data.
Now that we have demonstrated extracting data variables from a single web page, let’s generalize this method to extract similar data across multiple web pages.
First, we will collect all of the individual steps above into a function that takes a URL as input and produces the desired data frame as output.
```r
meta_music_webscrape <- function(url){

  wp <- read_html(url)

  album_name  <- wp %>% html_nodes(".product_title a") %>% html_text()

  artist_name <- wp %>% html_nodes(".product_artist") %>% html_text() %>%
    str_replace(" - ", "")

  user_score  <- wp %>% html_nodes(".positive") %>% html_text() %>% as.numeric()

  meta_score  <- wp %>% html_nodes(".textscore") %>% html_text() %>% as.numeric()

  primary_genre <- wp %>% html_nodes(".genre .data") %>% html_text() %>%
    str_squish() %>% str_replace(",.*", "")

  # Re-insert NAs for albums with a missing genre listing, using the
  # (NA, NA, numeric) pattern described above
  test2 <- wp %>% html_nodes(".data") %>% html_text() %>% str_squish() %>% as.numeric()
  NAcount <- length(album_name) - length(primary_genre)
  if (NAcount > 0) {
    NAwhich <- rep(0, NAcount)
    for (i in 1:NAcount) {
      for (j in 0:length(album_name)) {
        if (!is.na(test2[3*j + 2])) {
          test2 <- append(test2, NA, 3*j)
          NAwhich[i] <- j + 1
          break
        }
      }
    }
    for (i in NAwhich) primary_genre <- append(primary_genre, NA, i - 1)
  }

  release_date <- wp %>% html_nodes(".release_date .data") %>% html_text()

  release_year <- release_date %>% str_replace(".*, ", "") %>% as.numeric()

  data.frame(album_name, artist_name, user_score, meta_score,
             primary_genre, release_date, release_year)
}
```
Let’s test this custom function on the same web page to confirm that we get the same data frame.
```r
url1 <- "https://www.metacritic.com/browse/albums/release-date/available/userscore?view=detailed&page=0"
out1 <- meta_music_webscrape(url1)
```

```
Warning: NAs introduced by coercion
```

```r
head(out1)
```
Now let’s test it on a second, similar web page, which lists the 101st through 200th top user-rated albums.
```r
url2 <- "https://www.metacritic.com/browse/albums/release-date/available/userscore?view=detailed&page=1"
out2 <- meta_music_webscrape(url2)
```

```
Warning: NAs introduced by coercion
```

```r
head(out2)
```
Now let’s automate this function over 10 web pages, collecting data on the top 1000 user-rated albums. This can be done by applying the custom function within a for-loop, using the paste0 function to build sequential page URLs. We’ll call this data frame, ‘top1000’.
```r
# Create an empty data frame with the desired structure
top1000 <- data.frame(album_name = character(),
                      artist_name = character(),
                      user_score = numeric(),
                      meta_score = numeric(),
                      primary_genre = character(),
                      release_date = character(),
                      release_year = numeric())

# Scrape pages 0 through 9 and stack the results
for (i in 0:9) {
  top1000 <- rbind(top1000, meta_music_webscrape(
    paste0("https://www.metacritic.com/browse/albums/release-date/available/userscore?view=detailed&page=", i)))
}
```
```
Warning: NAs introduced by coercion (repeated for each of the 10 pages)
```
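As an aside, the same collection step can be written more compactly with lapply and do.call in place of the explicit loop; a sketch assuming the same URL pattern:

```r
# Build the ten page URLs, scrape each, then row-bind the results
urls <- paste0("https://www.metacritic.com/browse/albums/release-date/available/userscore?view=detailed&page=", 0:9)
top1000_alt <- do.call(rbind, lapply(urls, meta_music_webscrape))
```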
Let’s check the structure of this data frame.
```r
str(top1000)
```

```
'data.frame':	1000 obs. of  7 variables:
 $ album_name : Factor w/ 996 levels "()","10, ...
```
Now that we have extracted all this data, let’s do some simple analysis.
Let’s begin by looking at some summary statistics of the ‘user_score’ variable.
```r
summary(top1000$user_score)
```

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  8.700   8.800   8.800   8.828   8.900   9.800 
```
```r
paste("standard deviation:", sd(top1000$user_score))
```

```
[1] "standard deviation: 0.0727492436555354"
```
Let’s plot the distribution of this variable.
```r
ggplot(top1000, aes(x = user_score)) +
  geom_bar() +
  labs(title = "Distribution of user scores")
```
Now the same summary for the Metacritic scores.
```r
summary(top1000$meta_score)
```

```
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 47.00   72.00   78.00   77.06   82.25  100.00 
```
```r
paste("standard deviation:", sd(top1000$meta_score))
```

```
[1] "standard deviation: 8.33352288072719"
```
```r
ggplot(top1000, aes(x = meta_score)) +
  geom_histogram() +
  labs(title = "Distribution of metacritic scores")
```
What does the distribution of primary genres look like?
```r
table(top1000$primary_genre)
```
```
               Country                  Dance             Electronic 
                    13                     17                     66 
                  Folk            Heavy Metal                  Indie 
                     8                      3                    269 
                  Jazz                    Pop               Pop/Rock 
                     3                     43                    150 
                   R&B                    Rap                   Rock 
                    24                     73                    270 
 [There is no word to      Adult Alternative             Album Rock 
                     1                      7                      1 
  Alternative Pop/Rock           Altternative             Doom Metal 
                     3                      1                      1 
            Indie Rock             Industrial             Soundtrack 
                     2                      1                      3 
                 World            Alternative       Alternative-Folk 
                     1                      4                      1 
Alternative/Indie Rock                Britpop              Classical 
                     2                      1                      2 
                   Emo      Singer/Songwriter                   Soul 
                     4                      2                      1 
    Adult Contemporary              Americana            Avant-Garde 
                     3                      1                      2 
                Gospel              Post-Rock           Southern Rap 
                     1                      1                      1 
              Trip-Hop              Bluegrass           Experimental 
                     1                      2                      1 
```
As we can see, several minor genres have very few entries. Let’s collapse this variable into an alternate form with only the top 5 genres and an ‘Other’ category capturing the remaining genres.
```r
# Identify the genres outside the top 5 by frequency
Other_genres <- setdiff(unique(top1000$primary_genre),
                        names(sort(table(top1000$primary_genre), decreasing = T)[1:5]))

# Collapse those genres into a new "Other" factor level
top1000$primary_genre2 <- top1000$primary_genre
levels(top1000$primary_genre2) <- c(levels(top1000$primary_genre2), "Other")
top1000$primary_genre2[top1000$primary_genre2 %in% Other_genres] <- as.factor("Other")
top1000$primary_genre2 <- droplevels(top1000$primary_genre2)
table(top1000$primary_genre2)
```
```
Electronic      Indie   Pop/Rock        Rap       Rock      Other 
        66        269        150         73        270        172 
```
Now let’s visualize this table as a bar plot.
```r
ggplot(top1000, aes(x = primary_genre2)) +
  geom_bar() +
  geom_text(aes(y = ..count.. - 10,
                label = paste0(round(prop.table(..count..), 4) * 100, "%")),
            stat = "count", position = position_dodge(.1), size = 3) +
  labs(title = "Distribution of primary genres")
```
It appears that ‘Rock’ and ‘Indie’ are the most common genres within the top 1000 user-rated albums.
How about the distribution of album release years?
```r
table(top1000$release_year)
```

```
1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 
   6   50   73   72   87   88   65   49   47   72   82   41   65   65   51 
2014 2015 2016 2017 
  43   24   16    4 
```
```r
ggplot(top1000, aes(x = release_year)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Distribution of release years")
```
It appears that the years contributing the most top-rated albums are 2003, 2004, and 2009.
Now for some basic inferential statistics examining relationships between these variables.
What is the relationship between user score and the metacritic score?
```r
paste("Correlation of user score and meta score:",
      cor(top1000$user_score, top1000$meta_score))
```

```
[1] "Correlation of user score and meta score: 0.172502580668426"
```
```r
ggplot(top1000, aes(x = user_score, y = meta_score)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Scatterplot of user and metacritic scores")
```
Given the small correlation coefficient of .17 and the diffuse scatter of the points, the relationship between user and Metacritic scores appears weak at best.
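If we want a formal test to go with that coefficient, cor.test reports the test statistic, p-value, and a confidence interval for the correlation:

```r
# Pearson correlation test between user scores and metascores
cor.test(top1000$user_score, top1000$meta_score)
```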
Let’s run a few regression models to look at some genre by rating relationships. Within these top 1000 albums, is there a difference in the average user rating by genre type?
```r
fit1 <- lm(user_score ~ relevel(primary_genre2, ref = "Other"), data = top1000)
summary(fit1)
```
```
Call:
lm(formula = user_score ~ relevel(primary_genre2, ref = "Other"), 
    data = top1000)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.13836 -0.02907 -0.02825  0.07093  0.97067 

Coefficients:
                                                   Estimate Std. Error  t value Pr(>|t|)    
(Intercept)                                       8.8290698  0.0055522 1590.197   <2e-16 ***
relevel(primary_genre2, ref = "Other")Electronic -0.0108879  0.0105434   -1.033    0.302    
relevel(primary_genre2, ref = "Other")Indie      -0.0008170  0.0071090   -0.115    0.909    
relevel(primary_genre2, ref = "Other")Pop/Rock    0.0002636  0.0081348    0.032    0.974    
relevel(primary_genre2, ref = "Other")Rap         0.0092864  0.0101715    0.913    0.361    
relevel(primary_genre2, ref = "Other")Rock       -0.0038846  0.0071038   -0.547    0.585    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.07282 on 994 degrees of freedom
Multiple R-squared:  0.003172,	Adjusted R-squared:  -0.001842 
F-statistic: 0.6326 on 5 and 994 DF,  p-value: 0.6749
```
It appears as if there are no statistically significant differences in user ratings by genre type.
How about genre differences in Metacritic scores?
```r
fit2 <- lm(meta_score ~ relevel(primary_genre2, ref = "Other"), data = top1000)
summary(fit2)
```
```
Call:
lm(formula = meta_score ~ relevel(primary_genre2, ref = "Other"), 
    data = top1000)

Residuals:
     Min       1Q   Median       3Q      Max 
-29.1233  -5.0523   0.3365   5.0669  25.9444 

Coefficients:
                                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                                       77.0523     0.6174 124.803  < 2e-16 ***
relevel(primary_genre2, ref = "Other")Electronic  -0.6584     1.1724  -0.562 0.574537    
relevel(primary_genre2, ref = "Other")Indie        1.8808     0.7905   2.379 0.017539 *  
relevel(primary_genre2, ref = "Other")Pop/Rock     0.8810     0.9046   0.974 0.330319    
relevel(primary_genre2, ref = "Other")Rap          3.0710     1.1311   2.715 0.006740 ** 
relevel(primary_genre2, ref = "Other")Rock        -2.9968     0.7899  -3.794 0.000157 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.097 on 994 degrees of freedom
Multiple R-squared:  0.06068,	Adjusted R-squared:  0.05595 
F-statistic: 12.84 on 5 and 994 DF,  p-value: 4.038e-12
```
It appears as if there are some statistically significant differences here; let’s visualize this.
```r
ggplot(top1000, aes(x = primary_genre2, y = meta_score)) +
  geom_boxplot() +
  labs(title = "Boxplot of metacritic score by primary genre")
```
Within these top 1000 albums, the Metacritic scores seem to favor the rap and indie genres while disfavoring rock.
In this presentation we collected data on the top 1000 user-rated albums from Metacritic. We first introduced the SelectorGadget tool and the R packages necessary for web scraping. Second, we demonstrated basic use of these functions for extracting individual variables from a single web page. We then automated this process to collect large quantities of data over multiple web pages. Finally, we carried out some basic descriptive and inferential statistics on the scraped data.
Here are some things we found out about the top 1000 user-rated albums:

- User scores are tightly clustered around a median of 8.8 (SD of about 0.07), while metascores vary much more widely (median 78, SD of about 8.3).
- Rock and Indie are by far the most common primary genres.
- The years 2003, 2004, and 2009 contributed the most albums to the list.
- The correlation between user scores and metascores is weak (r of about .17).
- Average user scores do not differ significantly by genre, but metascores do: rap and indie albums score higher, while rock albums score lower.