Setup

  1. Load required libraries
library(RCurl)
library(xml2)
library(rvest)
library(dplyr)
library(stringr)
library(tidyr)
library(knitr)
library(ggplot2)

Dataset #1: Wikipedia Table of Simpsons Seasons

1.1 Scrape HTML from web using data from HTML using rvest package

The read_html function in the rvest package takes a URL and creates an object in R containing all the HTML code used to write the website. Creating a local object is the first step to parsing scraped web data.

# Read the HTML file
html <- read_html("https://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes")

1.2 Extract data from HTML using rvest package

Using the rvest package, we can:

  1. Extract all of the html “table” nodes into an R table called “Simpsons.table”

    • This will create a table where each row is the HTML code starting with “<table …”"
  2. Create an R table with the from the row of Simpsons.table, which contains the information on the Seasons

    • html_table will produce a list with length 1–this is not useful for our analysis

    • To create a data frame, we will use the as.data.frame() function in the pipe

#(Credit for inspiration for code from StackOverflow: https://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package))
Simpsons.tables<-html_nodes(html, "table")
Seasons.raw <- html_table(Simpsons.tables[1], fill = TRUE) %>% as.data.frame() %>% glimpse()
## Observations: 32
## Variables: 9
## $ Season                <chr> "Season", "", "", "", "", "", "", "", ""...
## $ Season.1              <chr> "Season", "1", "2", "3", "4", "5", "6", ...
## $ Episodes              <chr> "Episodes", "13", "22", "24", "22", "22"...
## $ Episodes.1            <chr> "Episodes", "13", "22", "24", "22", "22"...
## $ Originally.aired      <chr> "Originally aired", "December 17, 1989 (...
## $ Originally.aired.1    <chr> "Originally aired", "May 13, 1990 (1990-...
## $ Nielsen.ratings.19.   <chr> "Nielsen ratings[19]", "13.4m households...
## $ Nielsen.ratings.19..1 <chr> "First aired", "30th", "38th[20]", "33rd...
## $ Nielsen.ratings.19..2 <chr> "Last aired", "14.5", "N/A", "N/A", "13....
head(Seasons.raw)
##   Season Season.1 Episodes Episodes.1                Originally.aired
## 1 Season   Season Episodes   Episodes                Originally aired
## 2               1       13         13  December 17, 1989 (1989-12-17)
## 3               2       22         22   October 11, 1990 (1990-10-11)
## 4               3       24         24 September 19, 1991 (1991-09-19)
## 5               4       22         22 September 24, 1992 (1992-09-24)
## 6               5       22         22 September 30, 1993 (1993-09-30)
##             Originally.aired.1      Nielsen.ratings.19.
## 1             Originally aired      Nielsen ratings[19]
## 2    May 13, 1990 (1990-05-13) 13.4m households[n1][13]
## 3   July 11, 1991 (1991-07-11) 12.2m households[n1][n2]
## 4 August 27, 1992 (1992-08-27) 12.0m households[n1][n3]
## 5    May 13, 1993 (1993-05-13) 12.1m households[n1][22]
## 6    May 19, 1994 (1994-05-19) 10.5m households[n1][n4]
##   Nielsen.ratings.19..1 Nielsen.ratings.19..2
## 1           First aired            Last aired
## 2                  30th                  14.5
## 3              38th[20]                   N/A
## 4              33rd[21]                   N/A
## 5                  30th                  13.0
## 6              53rd[23]                   N/A

1.3 Clean data using dplyr package

There are a few errors generated by the extraction and original format of the table that require cleaning to produce a tidy data set:

  1. The first column is empty, and there is an extra “Episodes” column (resulting from the formatting with an extra merged header column at row 19)

  2. The “Film” row is not information about a season of the TV show—this row will be removed.

  3. The first row is really column names and not all of the column names are correctly extracted from the HTMl—this is because the original table uses a nested header

  4. Several columns have extra text, which were links to footnotes in the original webpage

  5. The ratings column lists information with two different units—these should be split into two different variables

  6. Dates are given as text and as YYYY-MM-DD in the same column.

  7. Several columns give numeric information that is classified as character

We will tidy these errors one at a time in the code below.

1.3.1 Drop empty, duplicate columns

Using the dplyr package, we can drop the empty column and rename the columns correctly.

Seasons <- Seasons.raw %>% select(-Season, -Episodes.1) %>% glimpse()
## Observations: 32
## Variables: 7
## $ Season.1              <chr> "Season", "1", "2", "3", "4", "5", "6", ...
## $ Episodes              <chr> "Episodes", "13", "22", "24", "22", "22"...
## $ Originally.aired      <chr> "Originally aired", "December 17, 1989 (...
## $ Originally.aired.1    <chr> "Originally aired", "May 13, 1990 (1990-...
## $ Nielsen.ratings.19.   <chr> "Nielsen ratings[19]", "13.4m households...
## $ Nielsen.ratings.19..1 <chr> "First aired", "30th", "38th[20]", "33rd...
## $ Nielsen.ratings.19..2 <chr> "Last aired", "14.5", "N/A", "N/A", "13....

1.3.2 Drop extraneous “Film” row

#Drop "Film"" row using filter from dplyr and the "not" operater "!"
Seasons <- Seasons %>% filter(Episodes!="Film")

1.3.3 Correct column names, drop duplicate column names row

The second header column in the original table contains some more relevant variable information, but is not exactly correct either. This was coded as the first row of data in our extaction process. We will drop the dupilicate first row, then clean up any remaining visible errros manually.

#Drop first row
Seasons = Seasons[-1,]

#Manually correct remaining column names using position, due to duplicate column names
names(Seasons)
## [1] "Season.1"              "Episodes"              "Originally.aired"     
## [4] "Originally.aired.1"    "Nielsen.ratings.19."   "Nielsen.ratings.19..1"
## [7] "Nielsen.ratings.19..2"
Seasons <- rename(Seasons, 
       Season = 1,
       "First aired" = 3,
       "Last aired" = 4,
       Ratings = 5,
       Rank = 6,
       Ratings.Score = 7
       )  
names(Seasons)
## [1] "Season"        "Episodes"      "First aired"   "Last aired"   
## [5] "Ratings"       "Rank"          "Ratings.Score"

1.3.4 Remove footnote references from table cells

This can be done using a RegEx and str_remove_all() from the stringr pacakge. Because the footnotes are coded with square brackets, however, this is tricky!

Double backslashes are needed to escape each bracket: "\\[..\\]"

Seasons$Ratings <- str_remove_all(Seasons$Ratings, "(\\[)..\\]")
Seasons$Rank <- str_remove_all(Seasons$Rank, "(\\[)..\\]")

1.3.5 Split Ratings column into two variables: Households and Viewers

Because househoulds and viewers are entirely different types of measurements, I would like to split the “Ratings” column into 2, where each row has “NA” for the data type not measured. We can do this by first extracting the data type from the “Ratings” column, and then using the tidyr package to spread the data into two columns, “Household Ratings” and “Viewers.”

Lastly, because this is numeric data, we will remove all string information from the columns and reclassify the data using the as.numeric function.

Seasons <- Seasons %>% 
  mutate(
    Ratings.Type = str_extract(Ratings, "(households)|(viewers)"),
    Ratings = str_remove(Ratings, "(households)|(viewers)"))

Seasons <- Seasons %>% 
#Extract numeric information from Ratings
  mutate(Ratings = as.numeric(str_extract(Ratings, "[[:digit:]]+(\\.)[[:digit:]]"))) %>% 
#Split Ratings into columns by rating type
  spread(Ratings.Type, Ratings) %>% 
#Rename columns
  rename(Ratings.Millions.Households = households, Ratings.Millions.Viewers = viewers) %>% 
#Drop empty column
  select(-9) %>% 
  glimpse()
## Observations: 30
## Variables: 8
## $ Season                      <chr> "1", "10", "11", "12", "13", "14",...
## $ Episodes                    <chr> "13", "23", "22", "21", "22", "22"...
## $ `First aired`               <chr> "December 17, 1989 (1989-12-17)", ...
## $ `Last aired`                <chr> "May 13, 1990 (1990-05-13)", "May ...
## $ Rank                        <chr> "30th", "46th", "44th", "21st", "3...
## $ Ratings.Score               <chr> "14.5", "N/A", "N/A", "N/A", "N/A"...
## $ Ratings.Millions.Households <dbl> 13.4, 7.9, 8.2, NA, NA, NA, NA, NA...
## $ Ratings.Millions.Viewers    <dbl> NA, NA, NA, 14.7, 12.4, 13.4, 10.6...

1.3.6 Extract date information from columns including dates

Lastly, we will remove the duplicate date information in the two columns with dates, and reclassify the date information as a date with as.Date( )

Seasons <- Seasons %>% 
#Extract numeric information from Ratings
  mutate(
    `First aired` = as.Date(str_extract(`First aired`, "[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}")),
    `Last aired` = as.Date(str_extract(`Last aired`, "[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}"))
    ) 

1.3.7 Reclassify numeric data from text

Seasons <- Seasons %>% 
  mutate(Rank = as.numeric(str_extract(Rank, "[[:digit:]]+")),
         Season = as.numeric(Season),
         Episodes = as.numeric(Episodes),
         Ratings.Score = as.numeric(Ratings.Score)
  ) %>% 
#Arrange by Season, now that data is properly classified
  arrange(Season)
## Warning in evalq(as.numeric(Episodes), <environment>): NAs introduced by
## coercion
## Warning in evalq(as.numeric(Ratings.Score), <environment>): NAs introduced
## by coercion

1.4 Cleaned data table

kable(Seasons, caption = "Simposons Seasons, via Wikipedia")
Simposons Seasons, via Wikipedia
Season Episodes First aired Last aired Rank Ratings.Score Ratings.Millions.Households Ratings.Millions.Viewers
1 13 1989-12-17 1990-05-13 30 14.5 13.4 NA
2 22 1990-10-11 1991-07-11 38 NA 12.2 NA
3 24 1991-09-19 1992-08-27 33 NA 12.0 NA
4 22 1992-09-24 1993-05-13 30 13.0 12.1 NA
5 22 1993-09-30 1994-05-19 53 NA 10.5 NA
6 25 1994-09-04 1995-05-21 67 NA 9.0 NA
7 25 1995-09-17 1996-05-19 75 NA 8.0 NA
8 25 1996-10-27 1997-05-18 53 NA 8.6 NA
9 25 1997-09-21 1998-05-17 30 9.2 9.1 NA
10 23 1998-08-23 1999-05-16 46 NA 7.9 NA
11 22 1999-09-26 2000-05-21 44 NA 8.2 NA
12 21 2000-11-01 2001-05-20 21 NA NA 14.7
13 22 2001-11-06 2002-05-22 30 NA NA 12.4
14 22 2002-11-03 2003-05-18 25 NA NA 13.4
15 22 2003-11-02 2004-05-23 42 NA NA 10.6
16 21 2004-11-07 2005-05-15 52 NA NA 9.6
17 22 2005-09-11 2006-05-21 62 3.2 NA 9.1
18 22 2006-09-10 2007-05-20 60 4.1 NA 8.6
19 20 2007-09-23 2008-05-18 87 NA NA 8.0
20 21 2008-09-28 2009-05-17 77 NA NA 6.9
21 23 2009-09-27 2010-05-23 61 3.4 NA 7.2
22 22 2010-09-26 2011-05-22 65 3.3 NA 7.3
23 22 2011-09-25 2012-05-20 69 3.3 NA 7.0
24 22 2012-09-30 2013-05-19 70 2.9 NA 6.3
25 22 2013-09-29 2014-05-18 81 NA NA 5.6
26 22 2014-09-28 2015-05-17 100 2.6 NA 5.6
27 22 2015-09-27 2016-05-22 102 2.1 NA 4.7
28 22 2016-09-25 2017-05-21 92 2.1 NA 4.8
29 21 2017-10-01 2018-05-20 122 1.7 NA 4.1
30 NA 2018-09-30 NA NA NA NA NA

1.6 Analysis

Whew! What a lot of work. Now that our data is clean, we can perform some analysis.

How has The Simpsons’s rank changed with the change in ratings? For simplicity, we will look at ratings scored as millions of households and millions of viewers, separately.

Per the tables below, it looks like—as expected— the higher the number of ratings, the lower (better) thank national ranking. Interestingly, this relationship is much tighted when looking at viewers than households—perhaps why Nielson switched metrics!

Seasons %>% 
  filter(!is.na(Ratings.Millions.Households)) %>% 
  ggplot(aes(x=Ratings.Millions.Households, y=Rank))+
  geom_jitter()+
  geom_smooth(method=lm)+
  labs(
    x="Nielson Ratings (Millions of Households)",
    title="Relationship between Nielson Ratings", 
    subtitle="by Household and National Ratings Ranking",
    caption="Source: 'https://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes'")

Seasons %>% 
  filter(!is.na(Ratings.Millions.Viewers)) %>% 
  ggplot(aes(x=Ratings.Millions.Viewers, y=Rank))+
  geom_jitter()+
  geom_smooth(method=lm)+
  labs(
    x="Nielson Ratings (Millions of Viewers)",
    title="Relationship between Nielson Ratings", 
    subtitle="by Viewers and National Ratings Ranking",
    caption="Source: 'https://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes'")

Dataset #2: San Francisco Food Trucks

2.1 Read data from CSV

First, we will read the data from a CSV.

food.raw <- read.csv("https://raw.githubusercontent.com/aliceafriedman/DATA607_Proj2/master/Mobile_Food_Facility_Permit.csv", sep=",", header = TRUE)

2.2 Tidy Data

This data is fairly tidy already. One column which could reasonably be split are the days and hours a food truck is open, currently located in one column, “dayshours”.

library(tidyr)
food <- food.raw %>% separate(dayshours, c("Days", "Hours"), sep=":")
## Warning: Expected 2 pieces. Additional pieces discarded in 24 rows [15, 33,
## 46, 61, 112, 158, 185, 234, 264, 280, 283, 284, 314, 322, 339, 353, 376,
## 378, 391, 417, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 60 rows
## [2, 3, 5, 10, 11, 12, 13, 14, 16, 17, 39, 41, 47, 48, 49, 51, 55, 56, 57,
## 119, ...].

We can also replace data coded as “Mo/Tu/We/Th/Fr” as “Mo-Fr” for consistency.

food <- food %>% 
  mutate(
    Days = str_replace_all(food$Days, "Mo/Tu/We/Th/Fr", "Mo-Fr"),
    Days = str_replace_all(food$Days, "Mo/Mo/Mo/Mo/Mo", "Mo"))

More tidying could be done–this data would best be split into 14 columns, one for hours open each day and one for closing hour each day, where days closed will just be NA. Just this small amount of tidying will allow us to do the below analysis.

2.3 Analysis

How many food trucks are open each combination of days?

Open <- food %>% group_by(Days) %>% summarize(n=n()) %>% arrange(desc(n))
kable(Open)
Days n
Mo-Fr 348
60
Mo-Su 43
Mo/We/Fr 22
Mo 19
Sa-Su 13
Mo/Tu/We/Th/Fr 12
Tu/Th 11
Mo-We 10
Tu/We/Th 10
Su 9
Sa 8
Th/Fr/Sa 6
Fr 5
Su/Fr/Sa 5
Tu/Th/Fr 5
Mo/We 4
Mo/Tu/Th 3
We/Th/Fr 3
Mo/Fr 2
Mo/Th/Fr 2
Mo/Tu/Fr 2
Mo/Tu/We/Th 2
Tu/Th/Sa 2
We/Th 2
Mo/Mo/Tu/Tu/We 1
Mo/Tu/Sa 1
Mo/Tu/Th/Fr 1
Mo/Tu/We/Fr 1
Su/We 1
Su/We/Sa 1
Th/Fr 1
Tu/Fr 1
Tu/Sa 1
Tu/We 1
Tu/We/Fr 1
We 1
We/Fr/Sa 1