library(tidyverse)
library(rvest)
library(gtools)
library(stringi)Super Bowl Data
Import Libraries
Scrape Data
Use read_html to scrape the data from the website.
url <- "https://www.espn.com/nfl/superbowl/history/winners"
webpage <- read_html(url)Extract Data
Use the html_nodes and html_table to extract HTML table element and conver it to dataframe.
super_bowl_table <- html_nodes(webpage,'table')
super_bowl_df <- html_table(super_bowl_table)[[1]]
head(super_bowl_df)# A tibble: 6 × 4
X1 X2 X3 X4
<chr> <chr> <chr> <chr>
1 Super Bowl Winners and Results Super Bowl Winners and Results Super Bow… Supe…
2 NO. DATE SITE RESU…
3 I Jan. 15, 1967 Los Angel… Gree…
4 II Jan. 14, 1968 Orange Bo… Gree…
5 III Jan. 12, 1969 Orange Bo… New …
6 IV Jan. 11, 1970 Tulane St… Kans…
Remove the first 2 rows and set column names
super_bowl_df <- super_bowl_df[-c(1,2),]
colnames(super_bowl_df) <- c("Super Bowl", "Date", "Venue", "Results")
head(super_bowl_df)# A tibble: 6 × 4
`Super Bowl` Date Venue Results
<chr> <chr> <chr> <chr>
1 I Jan. 15, 1967 Los Angeles Memorial Coliseum Green Bay 35, Kansas…
2 II Jan. 14, 1968 Orange Bowl (Miami) Green Bay 33, Oaklan…
3 III Jan. 12, 1969 Orange Bowl (Miami) New York Jets 16, Ba…
4 IV Jan. 11, 1970 Tulane Stadium (New Orleans) Kansas City 23, Minn…
5 V Jan. 17, 1971 Orange Bowl (Miami) Baltimore 16, Dallas…
6 VI Jan. 16, 1972 Tulane Stadium (New Orleans) Dallas 24, Miami 3
Convert Roman numerals to standard format and convert date to standard format
super_bowl_df$`Super Bowl` <- roman2int(super_bowl_df$`Super Bowl`)Data Cleaning
Separate the results column into 2 columns
super_bowl_df <- super_bowl_df %>%
separate("Results", into = c("Winner", "Loser"), sep = ",")
head(super_bowl_df)# A tibble: 6 × 5
`Super Bowl` Date Venue Winner Loser
<int> <chr> <chr> <chr> <chr>
1 1 Jan. 15, 1967 Los Angeles Memorial Coliseum Green Bay 35 " Ka…
2 2 Jan. 14, 1968 Orange Bowl (Miami) Green Bay 33 " Oa…
3 3 Jan. 12, 1969 Orange Bowl (Miami) New York Jets … " Ba…
4 4 Jan. 11, 1970 Tulane Stadium (New Orleans) Kansas City 23 " Mi…
5 5 Jan. 17, 1971 Orange Bowl (Miami) Baltimore 16 " Da…
6 6 Jan. 16, 1972 Tulane Stadium (New Orleans) Dallas 24 " Mi…
Split off the scores from the winner and loser columns
pattern = " \\d+$"
super_bowl_df <- super_bowl_df %>%
mutate(Winner = str_trim(Winner),
Loser = str_trim(Loser),
Winner_Score = as.numeric(stri_extract_last_regex(Winner, pattern)),
Loser_Score = as.numeric(stri_extract_last_regex(Loser, pattern)),
Winner = stri_replace_last_regex(Winner, pattern, ""),
Loser = stri_replace_last_regex(Loser, pattern, ""))
head(super_bowl_df)# A tibble: 6 × 7
`Super Bowl` Date Venue Winner Loser Winner_Score Loser_Score
<int> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1 Jan. 15, 1967 Los Angeles … Green… Kans… 35 10
2 2 Jan. 14, 1968 Orange Bowl … Green… Oakl… 33 14
3 3 Jan. 12, 1969 Orange Bowl … New Y… Balt… 16 7
4 4 Jan. 11, 1970 Tulane Stadi… Kansa… Minn… 23 7
5 5 Jan. 17, 1971 Orange Bowl … Balti… Dall… 16 13
6 6 Jan. 16, 1972 Tulane Stadi… Dallas Miami 24 3