library(stringr)
library(dplyr)
library(tidyr)
library(zoo)
library(ggplot2)
library(knitr)
library(rvest)
library(tibble)
We start by using rvest to scrape the data from espn.com. We can see that the table needs a bit of cleaning.
url <- "http://www.espn.com/nfl/superbowl/history/mvps"
url_html <- url %>% read_html()
raw_df <- url_html %>% html_table(fill=T) %>% .[[1]]
raw_df %>% head() %>% kable()
| X1 | X2 | X3 |
|---|---|---|
| Super Bowl Most Valuable Players | Super Bowl Most Valuable Players | Super Bowl Most Valuable Players |
| NO. | PLAYER | HIGHLIGHTS |
| I | Bart Starr, QB, Green Bay | Two touchdown passes |
| II | Bart Starr, QB, Green Bay | 202 yards passing, 1 TD |
| III | Joe Namath, QB, New York Jets | 206 yards passing |
| IV | Len Dawson, QB, Kansas City | 142 yards passing, 1 TD |
We begin the cleaning process by turning the second row into column names and removing the non-data rows. We then convert the roman numerals into regular integers with help of the as.roman() function in base R. Finally, we use the separate() function to easily turn the PLAYER column into three separate columns of player, position, and team.
sb_mvp_df <- raw_df
names(sb_mvp_df) <- sb_mvp_df[2,] %>% tolower() %>% gsub("[[:punct:]]", "", .)
sb_mvp_df <- sb_mvp_df %>% slice(-c(1:2)) %>% as_tibble()
sb_mvp_df <- sb_mvp_df %>% mutate(sb_number = as.integer(as.roman(no))) %>% select(sb_number, player, highlights)
sb_mvp_df <- sb_mvp_df %>% separate(player, into=c("player", "position", "team"), sep=", ")
sb_mvp_df %>% head() %>% kable()
| sb_number | player | position | team | highlights |
|---|---|---|---|---|
| 1 | Bart Starr | QB | Green Bay | Two touchdown passes |
| 2 | Bart Starr | QB | Green Bay | 202 yards passing, 1 TD |
| 3 | Joe Namath | QB | New York Jets | 206 yards passing |
| 4 | Len Dawson | QB | Kansas City | 142 yards passing, 1 TD |
| 5 | Chuck Howley | LB | Dallas | Two interceptions, fumble recovery |
| 6 | Roger Staubach | QB | Dallas | 119 yards passing, 2 TDs |
While this data presented a nice opportunity to tidy using some less common functions, it does not present a particularly rich opportunity for analysis. Nevertheless, we will look at a few obvious trends.
Which position won MVP the most times?
sb_mvp_grp_pos_df <- sb_mvp_df %>%
group_by(position) %>%
summarise(n = length(position)) %>%
ungroup() %>% arrange(desc(n))
sb_mvp_grp_pos_df %>% kable()
| position | n |
|---|---|
| QB | 29 |
| WR | 7 |
| RB | 6 |
| LB | 4 |
| CB | 2 |
| DE | 1 |
| DL | 1 |
| FB | 1 |
| KR | 1 |
| S | 1 |
sb_mvp_grp_pos_df %>%
ggplot(aes(x=position, y=n, fill=position)) +
scale_x_discrete(limits=sb_mvp_grp_pos_df$position) +
geom_bar(stat="identity") +
labs(title = "Number of Super Bowl MVPs by Position", x="Position", y="Count") +
theme(legend.position = "none")
Quarterback is over four times more prevalent than the next position.
Which Quarterback had the most passing yards?
To answer this question we can filter the table for QBs, then use regular expressions on the text in the highlights column.
qb_df <- suppressWarnings( sb_mvp_df %>%
filter(position=="QB") %>%
mutate(passing_yds = str_extract_all(highlights, "[[:digit:]]+ yards") %>% gsub(" yards", "", .) %>% as.integer(.)))
qb_df %>% arrange(desc(passing_yds)) %>% head() %>% kable()
| sb_number | player | position | team | highlights | passing_yds |
|---|---|---|---|---|---|
| 51 | Tom Brady | QB | New England | 43-of-62, 466 yards, 2 TDs | 466 |
| 34 | Kurt Warner | QB | St. Louis | 414 yards passing, 2 TDs | 414 |
| 52 | Nick Foles | QB | Philadelphia | 28-of-43, 373 yards, 3 TDs, 1 TD reception | 373 |
| 38 | Tom Brady | QB | New England | 354 yards passing, 3 TDs | 354 |
| 22 | Doug Williams | QB | Washington | 340 yards passing, 4 TDs | 340 |
| 33 | John Elway | QB | Denver | 336 yards passing, 1 TD | 336 |
Tom Brady had the most, with 466 yards in Super Bowl 51.
Which player had the most Touchdowns?
We can do a similar regular expression match on the full table to extract touchdown information.
suppressWarnings( sb_mvp_df %>%
mutate(TD = str_extract_all(highlights, "[[:digit:]]+ TD") %>% gsub(" TD", "", .) %>% as.integer(.)) %>%
arrange(desc(TD)) ) %>% head() %>% kable()
| sb_number | player | position | team | highlights | TD |
|---|---|---|---|---|---|
| 29 | Steve Young | QB | San Francisco | 325 yards passing, 6 TDs | 6 |
| 24 | Joe Montana | QB | San Francisco | 297 yards passing, 5 TDs | 5 |
| 13 | Terry Bradshaw | QB | Pittsburgh | 318 yards passing, 4 TDs | 4 |
| 22 | Doug Williams | QB | Washington | 340 yards passing, 4 TDs | 4 |
| 27 | Troy Aikman | QB | Dallas | 273 yards passing, 4 TDs | 4 |
| 49 | Tom Brady | QB | New England | 37-of-50, 328 yards, 4 TDs | 4 |
Steve Young had 6 touchdowns in Super Bowl 29.
While this data did not necessarily lend itself to complex analysis, it did offer the opportunity to tidy and analyze using functions like separate(), as.roman(), and various regular expressions.