Super Bowl Data

Author

Rashad Long

Import Libraries

library(tidyverse)
library(rvest)
library(gtools)
library(stringi)

Scrape Data

Use read_html to scrape the data from the website.

url <- "https://www.espn.com/nfl/superbowl/history/winners"
webpage <- read_html(url)

Extract Data

Use the html_nodes and html_table to extract HTML table element and conver it to dataframe.

super_bowl_table <- html_nodes(webpage,'table')
super_bowl_df <- html_table(super_bowl_table)[[1]]
head(super_bowl_df)
# A tibble: 6 × 4
  X1                             X2                             X3         X4   
  <chr>                          <chr>                          <chr>      <chr>
1 Super Bowl Winners and Results Super Bowl Winners and Results Super Bow… Supe…
2 NO.                            DATE                           SITE       RESU…
3 I                              Jan. 15, 1967                  Los Angel… Gree…
4 II                             Jan. 14, 1968                  Orange Bo… Gree…
5 III                            Jan. 12, 1969                  Orange Bo… New …
6 IV                             Jan. 11, 1970                  Tulane St… Kans…

Remove the first 2 rows and set column names

super_bowl_df <- super_bowl_df[-c(1,2),]
colnames(super_bowl_df) <- c("Super Bowl", "Date", "Venue", "Results")
head(super_bowl_df)
# A tibble: 6 × 4
  `Super Bowl` Date          Venue                         Results              
  <chr>        <chr>         <chr>                         <chr>                
1 I            Jan. 15, 1967 Los Angeles Memorial Coliseum Green Bay 35, Kansas…
2 II           Jan. 14, 1968 Orange Bowl (Miami)           Green Bay 33, Oaklan…
3 III          Jan. 12, 1969 Orange Bowl (Miami)           New York Jets 16, Ba…
4 IV           Jan. 11, 1970 Tulane Stadium (New Orleans)  Kansas City 23, Minn…
5 V            Jan. 17, 1971 Orange Bowl (Miami)           Baltimore 16, Dallas…
6 VI           Jan. 16, 1972 Tulane Stadium (New Orleans)  Dallas 24, Miami 3   

Convert Roman numerals to standard format and convert date to standard format

super_bowl_df$`Super Bowl` <- roman2int(super_bowl_df$`Super Bowl`)

Data Cleaning

Separate the results column into 2 columns

super_bowl_df <- super_bowl_df %>%
  separate("Results", into = c("Winner", "Loser"), sep = ",")
head(super_bowl_df)
# A tibble: 6 × 5
  `Super Bowl` Date          Venue                         Winner          Loser
         <int> <chr>         <chr>                         <chr>           <chr>
1            1 Jan. 15, 1967 Los Angeles Memorial Coliseum Green Bay 35    " Ka…
2            2 Jan. 14, 1968 Orange Bowl (Miami)           Green Bay 33    " Oa…
3            3 Jan. 12, 1969 Orange Bowl (Miami)           New York Jets … " Ba…
4            4 Jan. 11, 1970 Tulane Stadium (New Orleans)  Kansas City 23  " Mi…
5            5 Jan. 17, 1971 Orange Bowl (Miami)           Baltimore 16    " Da…
6            6 Jan. 16, 1972 Tulane Stadium (New Orleans)  Dallas 24       " Mi…

Split off the scores from the winner and loser columns

pattern = " \\d+$"

super_bowl_df <- super_bowl_df %>%
  mutate(Winner = str_trim(Winner),
         Loser = str_trim(Loser),
         Winner_Score = as.numeric(stri_extract_last_regex(Winner, pattern)),
         Loser_Score = as.numeric(stri_extract_last_regex(Loser, pattern)),
         Winner = stri_replace_last_regex(Winner, pattern, ""),
         Loser = stri_replace_last_regex(Loser, pattern, ""))

head(super_bowl_df)
# A tibble: 6 × 7
  `Super Bowl` Date          Venue         Winner Loser Winner_Score Loser_Score
         <int> <chr>         <chr>         <chr>  <chr>        <dbl>       <dbl>
1            1 Jan. 15, 1967 Los Angeles … Green… Kans…           35          10
2            2 Jan. 14, 1968 Orange Bowl … Green… Oakl…           33          14
3            3 Jan. 12, 1969 Orange Bowl … New Y… Balt…           16           7
4            4 Jan. 11, 1970 Tulane Stadi… Kansa… Minn…           23           7
5            5 Jan. 17, 1971 Orange Bowl … Balti… Dall…           16          13
6            6 Jan. 16, 1972 Tulane Stadi… Dallas Miami           24           3