Unit 2: Web Scraping, Extra Credit

Introduction

This lab follows the tutorial “Beginner’s Guide on Web Scraping in R (using rvest) with hands-on example.”

“Webscraping is a technique for converting the data present in unstructured format (HTML tags) over the web to the structured format which can easily be accessed and used.” For this assignment, we will examine the Taylor Swift discography page on Wikipedia using the “rvest” package and the SelectorGadget Chrome extension.

Scraping a webpage using R

To begin, we read the HTML of the web page we are interested in so that we can scrape the data we want from it.

#Accessing the rvest library
library("rvest")
## Warning: package 'rvest' was built under R version 3.6.1
## Loading required package: xml2
#Specifying the url for desired website to be scraped
url <- 'https://en.wikipedia.org/wiki/Taylor_Swift_discography'

#Reading the HTML code from the website
webpage <- read_html(url)

Here are the data we are interested in under Studio Albums:

* Title
* Release Date
* Peak US Chart Position
* World Sales
* US Sales

Scraping the Studio Album Titles

After using the CSS SelectorGadget to select only the Title for Studio Albums, I will use the following code to get all of the titles and save them into the variable title_data.

#Using CSS selectors to scrape the titles section
title_data_html <- html_nodes(webpage,'.plainrowheaders:nth-child(10) i a')

#Converting the title data to text
title_data <- html_text(title_data_html)

#Let's have a look at the titles
title_data
## [1] "Taylor Swift" "Fearless"     "Speak Now"    "Red"         
## [5] "1989"         "Reputation"   "Lover"

This looks correct: the discography page listed seven studio albums at the time of scraping, including the not-yet-released Lover.
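
To make the selector logic easier to follow, here is a minimal sketch that runs the same rvest calls on a small inline HTML string instead of the live page. The toy markup and the names toy_html, toy_page, toy_titles, and toy_peaks are assumptions made for illustration; they are not the real Wikipedia markup.

```r
library(rvest)

# Toy HTML standing in for a discography table (NOT the real Wikipedia markup)
toy_html <- '<table class="plainrowheaders">
  <tr><th scope="row"><i><a href="#">Album One</a></i></th><td>5</td></tr>
  <tr><th scope="row"><i><a href="#">Album Two</a></i></th><td>1</td></tr>
</table>'

# read_html() also accepts a literal HTML string, so no network access is needed
toy_page <- read_html(toy_html)

# ".plainrowheaders i a": links inside italics inside the table
toy_titles <- html_text(html_nodes(toy_page, ".plainrowheaders i a"))
toy_titles

# "th + td": the cell immediately following each row header
toy_peaks <- html_text(html_nodes(toy_page, ".plainrowheaders th + td"))
toy_peaks
```

Selectors produced by SelectorGadget, such as the nth-child(10) used above, depend on the page layout at the time of scraping, so they may need to be regenerated if the page changes.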

Scraping the Release Dates

After using the CSS SelectorGadget to select only the Release Dates for Studio Albums, I will use the following code to get all of the release dates and save them into the variable release_data.

#Using CSS selectors to scrape the release date section
release_data_html <- html_nodes(webpage,'.plainrowheaders:nth-child(10) th+ td li:nth-child(1)')

#Converting the release date to text
release_data <- html_text(release_data_html)

#Let's have a look at the release dates
head(release_data)
## [1] "Released: October 24, 2006"  "Released: November 11, 2008"
## [3] "Released: October 25, 2010"  "Released: October 22, 2012" 
## [5] "Released: October 27, 2014"  "Released: November 10, 2017"

For the release_data, we will clean it by removing the “Released: ” prefix from each value, removing the commas, and converting to the Date format.

#Data-Preprocessing: removing excess words
release_data<-gsub("Released: ","",release_data)

#Data-Preprocessing: removing commas
release_data<-gsub(",","",release_data)

#Data-Preprocessing: converting to date format
release_data2<-as.Date(release_data, format = "%B %d %Y")

#Let's have another look at the release date data
head(release_data2)
## [1] "2006-10-24" "2008-11-11" "2010-10-25" "2012-10-22" "2014-10-27"
## [6] "2017-11-10"
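
One caveat with the approach above is that the "%B" format in as.Date() matches month names in the current locale, so the conversion can return NA on a non-English system. The sketch below is one possible locale-independent alternative using only base R. The helper parse_release_date and the sample strings are hypothetical and assume the "Released: Month D, YYYY" shape shown in the output above.

```r
# Hypothetical helper (not part of the lab): parse "Released: Month D, YYYY"
parse_release_date <- function(x) {
  # Drop the prefix and any surrounding whitespace or newlines
  x <- trimws(sub("^Released:[[:space:]]*", "", x))
  # Capture month name, day, and year; non-matching strings give character(0)
  parts <- regmatches(x, regexec("^([A-Za-z]+) ([0-9]{1,2}), ([0-9]{4})$", x))
  iso <- vapply(parts, function(p) {
    # month.name is a base R constant holding the English month names
    month <- if (length(p) == 4) match(p[2], month.name) else NA_integer_
    if (is.na(month)) return(NA_character_)
    sprintf("%s-%02d-%02d", p[4], month, as.integer(p[3]))
  }, character(1))
  # Numeric ISO dates parse the same way in every locale
  as.Date(iso, format = "%Y-%m-%d")
}

toy_dates <- parse_release_date(c("Released: October 24, 2006",
                                  "Released: November 11, 2008",
                                  "To be released\n"))
toy_dates
```

Strings that do not match the expected shape come back as NA instead of raising an error.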

Scraping the US Chart Positions

After using the CSS SelectorGadget to select only the peak US Chart Positions for Studio Albums, I will use the following code to save them into the variable US_data.

#Using CSS selectors to scrape the peak US Chart Positions section
US_data_html <- html_nodes(webpage,'.plainrowheaders:nth-child(10) td:nth-child(3)')

#Converting the US chart data to text
US_data <- html_text(US_data_html)

#Let's have a look at the US chart data
US_data
## [1] "5"                "1"                "1"               
## [4] "1"                "1"                "1"               
## [7] "To be released\n"

For the US_data, we will clean it by inserting an NA value for the 7th album, which has yet to be released, and then converting the values to numeric.

#Data-Preprocessing: changing string to NA
is.na(US_data) <- US_data == "To be released\n"

#Data-Preprocessing: converting to numeric
US_data <- as.numeric(US_data)


#Let's have another look at the US chart data
US_data
## [1]  5  1  1  1  1  1 NA
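
The following sketch shows why the placeholder is flagged as NA before the conversion. The vector raw_peaks is made-up sample data that mirrors the shape of the scraped values. Calling as.numeric() directly on the placeholder also produces NA, but it raises a coercion warning, whereas flagging it first keeps the step explicit and silent.

```r
# Sample values mirroring the scraped chart positions (illustrative only)
raw_peaks <- c("5", "1", "To be released\n")

# Direct conversion: the placeholder becomes NA, but R emits a coercion warning
direct <- suppressWarnings(as.numeric(raw_peaks))

# Flagging the placeholder first makes the intent explicit and avoids the warning;
# trimws() guards against the trailing newline
flagged <- raw_peaks
is.na(flagged) <- trimws(flagged) == "To be released"
clean_peaks <- as.numeric(flagged)
clean_peaks
```

Both routes give the same numeric vector; the difference is only whether the NA is introduced deliberately or as a side effect.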

Scraping the World Sales

After using the CSS SelectorGadget to select only the World Sales for Studio Albums, I will use the following code to save them into the variable worldsales_data.

#Using CSS selectors to scrape the World Sales section
worldsales_data_html <- html_nodes(webpage,'.table-no2 , .plainrowheaders:nth-child(10) td:nth-child(13) li:nth-child(1)')

#Converting the World Sales data to text
worldsales_data <- html_text(worldsales_data_html)

#Let's have a look at the World Sales data
worldsales_data
## [1] "WW: 5,500,000[A]"  "WW: 12,000,000[C]" "WW: 5,000,000[G]" 
## [4] "WW: 6,000,000[J]"  "WW: 10,100,000[N]" "WW: 4,500,000[T]" 
## [7] "To be released\n"

For the worldsales_data, we will clean it by removing all non-numeric characters, inserting an NA value for the 7th album, and converting it to a numeric data type.

#Data-Preprocessing: changing string to NA
is.na(worldsales_data) <- worldsales_data == "To be released\n"

#Data-Preprocessing: removing nondigits
worldsales_data<-gsub("\\D","",worldsales_data)

#Data-Preprocessing: converting to numeric
worldsales_data<-as.numeric(worldsales_data)

#Let's have another look at the World Sales data
worldsales_data
## [1]  5500000 12000000  5000000  6000000 10100000  4500000       NA
#Converting to millions
worldsales_data2<-worldsales_data/1000000
worldsales_data2
## [1]  5.5 12.0  5.0  6.0 10.1  4.5   NA
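
Removing every non-digit works here because the footnote markers are letters such as [A]. If a marker were numeric, for example [12], its digits would be appended to the sales figure. The sketch below is one way to guard against that by stripping bracketed markers first. The helper parse_sales_millions and the sample strings are hypothetical.

```r
# Hypothetical helper (not part of the lab): sales strings -> millions
parse_sales_millions <- function(x) {
  # Drop bracketed footnote markers such as [A] or [12] before touching digits
  x <- gsub("\\[[^]]*\\]", "", x)
  # Keep digits only; "WW: 5,500,000" becomes "5500000"
  digits <- gsub("[^0-9]", "", x)
  # Strings with no digits at all (e.g. "To be released") become NA
  digits[!nzchar(digits)] <- NA
  as.numeric(digits) / 1e6
}

toy_sales <- parse_sales_millions(c("WW: 5,500,000[A]",
                                    "US: 2,230,000[12]",
                                    "To be released\n"))
toy_sales
```

The same helper can be applied to both the World Sales and US Sales strings, since the "WW:" and "US:" prefixes contain no digits.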

Scraping the US Sales

After using the CSS SelectorGadget to select only the US Sales for Studio Albums, I will use the following code to save them into the variable USsales_data.

#Using CSS selectors to scrape the US Sales section
USsales_data_html <- html_nodes(webpage,'.table-no2 , caption+ tbody td:nth-child(13) li:nth-child(2)')

#Converting the US Sales data to text
USsales_data <- html_text(USsales_data_html)

#Let's have a look at the US Sales data
USsales_data
## [1] "US: 5,720,000[B]" "US: 7,180,000[D]" "US: 4,680,000[H]"
## [4] "US: 4,450,000[K]" "US: 6,190,000[O]" "US: 2,230,000[U]"
## [7] "To be released\n"

For the USsales_data, we will clean it by removing all non-numeric characters, inserting an NA value for the 7th album, and converting it to a numeric data type.

#Data-Preprocessing: changing string to NA
is.na(USsales_data) <- USsales_data == "To be released\n"

#Data-Preprocessing: removing nondigits
USsales_data<-gsub("\\D","",USsales_data)

#Data-Preprocessing: converting to numeric
USsales_data2<-as.numeric(USsales_data)

#Let's have another look at the US Sales data
USsales_data2
## [1] 5720000 7180000 4680000 4450000 6190000 2230000      NA
#Converting to millions
USsales_data3<-USsales_data2/1000000
USsales_data3
## [1] 5.72 7.18 4.68 4.45 6.19 2.23   NA

Adding album length

album_length <- c("40:28", "53:41",  "67:29", "65:11", "48:41", "55:38", "54:04")
library("lubridate")
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
seconds <- as.numeric(as.period(ms(album_length), unit = "sec"))
seconds
## [1] 2428 3221 4049 3911 2921 3338 3244
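
As a cross-check on the lubridate result, the same "mm:ss" to seconds conversion can be sketched in base R: minutes times 60 plus seconds. The helper to_seconds is hypothetical and assumes every value has exactly one colon.

```r
# Hypothetical helper (not part of the lab): "mm:ss" strings -> total seconds
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts,
         function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
         numeric(1))
}

# First two album lengths from above: 40*60 + 28 and 53*60 + 41
check_seconds <- to_seconds(c("40:28", "53:41"))
check_seconds
```

These values agree with the first two entries of the seconds vector printed above (2428 and 3221).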

Creating a Dataframe

Now that the five variables we are interested in have been scraped and the album lengths have been entered by hand, we will construct a data frame and inspect its structure.

#Combining all the vectors to form a data frame
ts_df<-data.frame(Album = title_data, Release_Date = release_data2, US_Chart_Peak = US_data, World_Sales = worldsales_data2, US_Sales = USsales_data3, Length = album_length)

#Creating factors for the titles to order albums by release date (oldest to newest)
ts_df$Album <- factor(ts_df$Album,levels = c("Taylor Swift", "Fearless", "Speak Now", "Red", "1989", "Reputation", "Lover"))

#Structure of the data frame
str(ts_df)
## 'data.frame':    7 obs. of  6 variables:
##  $ Album        : Factor w/ 7 levels "Taylor Swift",..: 1 2 3 4 5 6 7
##  $ Release_Date : Date, format: "2006-10-24" "2008-11-11" ...
##  $ US_Chart_Peak: num  5 1 1 1 1 1 NA
##  $ World_Sales  : num  5.5 12 5 6 10.1 4.5 NA
##  $ US_Sales     : num  5.72 7.18 4.68 4.45 6.19 2.23 NA
##  $ Length       : Factor w/ 7 levels "40:28","48:41",..: 1 3 7 6 2 5 4
ts_df
##          Album Release_Date US_Chart_Peak World_Sales US_Sales Length
## 1 Taylor Swift   2006-10-24             5         5.5     5.72  40:28
## 2     Fearless   2008-11-11             1        12.0     7.18  53:41
## 3    Speak Now   2010-10-25             1         5.0     4.68  67:29
## 4          Red   2012-10-22             1         6.0     4.45  65:11
## 5         1989   2014-10-27             1        10.1     6.19  48:41
## 6   Reputation   2017-11-10             1         4.5     2.23  55:38
## 7        Lover         <NA>            NA          NA       NA  54:04
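
The Length column shows up as a Factor in the structure above because, in R versions before 4.0.0, data.frame() converts character vectors to factors by default. The sketch below uses made-up data (toy_df) to show how stringsAsFactors = FALSE keeps strings as characters, and how explicit factor levels fix the display order of a column.

```r
# Illustrative data only; album names are placeholders
toy_df <- data.frame(
  Album  = c("Second Album", "First Album"),
  Length = c("53:41", "40:28"),
  stringsAsFactors = FALSE  # keep strings as character in every R version
)

# Explicit levels control the order used by plots and summaries,
# instead of the default alphabetical order
toy_df$Album <- factor(toy_df$Album,
                       levels = c("Second Album", "First Album"))

levels(toy_df$Album)
class(toy_df$Length)
```

Without the explicit levels, "First Album" would sort before "Second Album" alphabetically, which is why the lab sets the album levels by hand.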

Analyzing Scraped Data from the Web

Barplot for Taylor Swift U.S. Studio Album Sales
#Initial Barplot
library('ggplot2')
ggplot(data=ts_df, aes(x=Album, y=US_Sales, fill=Album)) +
  geom_bar(stat = "identity") +
  labs(title = "Taylor Swift U.S. Studio Album Sales", x = "Album Title", y = "U.S. Sales (Millions)")
## Warning: Removed 1 rows containing missing values (position_stack).

Barplot for Taylor Swift Worldwide Studio Album Sales
#Initial Barplot
library('ggplot2')
ggplot(data=ts_df, aes(x=Album, y=World_Sales, fill=Album)) +
  geom_bar(stat = "identity") +
  labs(title = "Taylor Swift World Studio Album Sales", x = "Album Title", y = "World Sales (Millions)") +
  scale_y_continuous(breaks = seq(0, 13, by = 1))
## Warning: Removed 1 rows containing missing values (position_stack).

Creating a New Data Frame with Pronoun Info

#importing the pronoun csv file
ts_pronouns<-read.csv("tspronouns.csv", stringsAsFactors = FALSE)

#merging the two data frames by the variable "Album"
ts_full <- merge(ts_df, ts_pronouns, by="Album")

#Data-Preprocessing: changing empty string to NA
is.na(ts_full$Singles) <- ts_full$Singles == ""
is.na(ts_full$Type) <- ts_full$Type == ""

#looking at the data
head(ts_full, 20)
##       Album Release_Date US_Chart_Peak World_Sales US_Sales Length
## 1      1989   2014-10-27             1        10.1     6.19  48:41
## 2      1989   2014-10-27             1        10.1     6.19  48:41
## 3      1989   2014-10-27             1        10.1     6.19  48:41
## 4      1989   2014-10-27             1        10.1     6.19  48:41
## 5      1989   2014-10-27             1        10.1     6.19  48:41
## 6      1989   2014-10-27             1        10.1     6.19  48:41
## 7      1989   2014-10-27             1        10.1     6.19  48:41
## 8      1989   2014-10-27             1        10.1     6.19  48:41
## 9      1989   2014-10-27             1        10.1     6.19  48:41
## 10     1989   2014-10-27             1        10.1     6.19  48:41
## 11     1989   2014-10-27             1        10.1     6.19  48:41
## 12     1989   2014-10-27             1        10.1     6.19  48:41
## 13     1989   2014-10-27             1        10.1     6.19  48:41
## 14     1989   2014-10-27             1        10.1     6.19  48:41
## 15     1989   2014-10-27             1        10.1     6.19  48:41
## 16     1989   2014-10-27             1        10.1     6.19  48:41
## 17 Fearless   2008-11-11             1        12.0     7.18  53:41
## 18 Fearless   2008-11-11             1        12.0     7.18  53:41
## 19 Fearless   2008-11-11             1        12.0     7.18  53:41
## 20 Fearless   2008-11-11             1        12.0     7.18  53:41
##                          Song         Type Singles
## 1         Welcome to New York Non Romantic    <NA>
## 2                 Blank Space Non Romantic  Single
## 3                       Style Male Pronoun  Single
## 4            Out of the Woods      Neutral  Single
## 5  All You Had to Do Was Stay      Neutral    <NA>
## 6                Shake It Off Non Romantic  Single
## 7            I Wish You Would      Neutral    <NA>
## 8                   Bad Blood      Neutral  Single
## 9              Wildest Dreams Male Pronoun  Single
## 10       How You Get the Girl      Neutral    <NA>
## 11                  This Love      Neutral    <NA>
## 12              I Know Places      Neutral    <NA>
## 13                      Clean      Neutral    <NA>
## 14                 Wonderland      Neutral    <NA>
## 15            You Are In Love Male Pronoun    <NA>
## 16              New Romantics      Neutral  Single
## 17             Jump Then Fall      Neutral    <NA>
## 18                Untouchable      Neutral    <NA>
## 19           Forever & Always Male Pronoun    <NA>
## 20      Come In With The Rain      Neutral    <NA>
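
The merged output repeats each album's values once per song because merge() performs a one-to-many join on Album. The sketch below uses made-up data frames (albums and songs are hypothetical names) to show this, and to show that the default join drops albums that have no matching songs unless all.x = TRUE is set.

```r
# Illustrative data only: three albums, songs for two of them
albums <- data.frame(Album = c("Alpha", "Beta", "Gamma"),
                     Sales = c(5.5, 12.0, NA),
                     stringsAsFactors = FALSE)
songs <- data.frame(Album = c("Alpha", "Alpha", "Beta"),
                    Song  = c("One", "Two", "Three"),
                    stringsAsFactors = FALSE)

# Default (inner) join: one row per song; "Gamma" disappears
inner <- merge(albums, songs, by = "Album")

# Left join: every album is kept; "Gamma" gets NA for Song
left <- merge(albums, songs, by = "Album", all.x = TRUE)

nrow(inner)
nrow(left)
```

Comparing the number of distinct albums before and after the merge is a quick way to confirm that no album was silently dropped.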

Stacked Barplot for Taylor Swift Pronouns
#Count Barplot for TS Pronouns by Album
pronouns2 <- ggplot(ts_full, aes(x=Album, fill=Type)) + 
  geom_bar(stat="count") +
  labs(title = "Taylor Swift Pronoun Usage by Album", x = "Album Title", y = "Number of Pronouns Used") +
  scale_y_continuous(breaks = seq(0, 20, by = 5))
pronouns2

#Pronoun Barplot by Percent
pronouns <- ggplot(ts_full, aes(fill=Type, x=Album)) + 
  geom_bar(stat="count", position="fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "% of TS Pronoun Usage by Album", x = "Album Title", y = "% of Pronouns Used")
pronouns
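
The numbers behind the two bar plots can also be checked in a table. Each row of the merged data is one song, so the bars count songs per Type within each album. The sketch below uses made-up data (toy_songs) to show the counts with table() and the per-album shares with prop.table(), which is the same calculation that position="fill" displays.

```r
# Illustrative data only: one row per song
toy_songs <- data.frame(
  Album = c("Alpha", "Alpha", "Alpha", "Alpha", "Beta", "Beta"),
  Type  = c("Neutral", "Neutral", "Male Pronoun", "Non Romantic",
            "Neutral", "Male Pronoun"),
  stringsAsFactors = FALSE
)

# Songs per Type within each album (what the count barplot shows)
counts <- table(toy_songs$Album, toy_songs$Type)

# Row-wise shares, so each album sums to 1 (what the percent barplot shows)
shares <- prop.table(counts, margin = 1)
round(shares * 100, 1)
```

Applying the same two calls to ts_full$Album and ts_full$Type would give the values plotted above.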