Load libraries

library(stringr)
library(dplyr)
library(tidyr)
library(zoo)
library(ggplot2)
library(knitr)
library(rvest)
library(tibble)

Get the data

We start by using rvest to scrape the data from espn.com. We can see that the table needs a bit of cleaning.

url <- "http://www.espn.com/nfl/superbowl/history/mvps"
url_html <- url %>% read_html()
raw_df <- url_html %>% html_table(fill=T) %>% .[[1]]
raw_df %>% head() %>% kable()
X1 X2 X3
Super Bowl Most Valuable Players Super Bowl Most Valuable Players Super Bowl Most Valuable Players
NO. PLAYER HIGHLIGHTS
I Bart Starr, QB, Green Bay Two touchdown passes
II Bart Starr, QB, Green Bay 202 yards passing, 1 TD
III Joe Namath, QB, New York Jets 206 yards passing
IV Len Dawson, QB, Kansas City 142 yards passing, 1 TD

Tidy the data

We begin the cleaning process by turning the second row into column names and removing the non-data rows. We then convert the roman numerals into regular integers with help of the as.roman() function in base R. Finally, we use the separate() function to easily turn the PLAYER column into three separate columns of player, position, and team.

sb_mvp_df <- raw_df
names(sb_mvp_df) <- sb_mvp_df[2,] %>% tolower() %>% gsub("[[:punct:]]", "", .)
sb_mvp_df <- sb_mvp_df %>% slice(-c(1:2)) %>% as_tibble()
sb_mvp_df <- sb_mvp_df %>% mutate(sb_number = as.integer(as.roman(no))) %>% select(sb_number, player, highlights)
sb_mvp_df <- sb_mvp_df %>% separate(player, into=c("player", "position", "team"), sep=", ")
sb_mvp_df %>% head() %>% kable()
sb_number player position team highlights
1 Bart Starr QB Green Bay Two touchdown passes
2 Bart Starr QB Green Bay 202 yards passing, 1 TD
3 Joe Namath QB New York Jets 206 yards passing
4 Len Dawson QB Kansas City 142 yards passing, 1 TD
5 Chuck Howley LB Dallas Two interceptions, fumble recovery
6 Roger Staubach QB Dallas 119 yards passing, 2 TDs

Analysis

While this data presented a nice opportunity to tidy using some less common functions, it does not present a particularly rich opportunity for analysis. Nevertheless, we will look at a few obvious trends.


Which position won MVP the most times?

sb_mvp_grp_pos_df <- sb_mvp_df %>% 
  group_by(position) %>% 
  summarise(n = length(position)) %>% 
  ungroup() %>% arrange(desc(n))
sb_mvp_grp_pos_df %>% kable()
position n
QB 29
WR 7
RB 6
LB 4
CB 2
DE 1
DL 1
FB 1
KR 1
S 1
sb_mvp_grp_pos_df %>% 
  ggplot(aes(x=position, y=n, fill=position)) + 
  scale_x_discrete(limits=sb_mvp_grp_pos_df$position) + 
  geom_bar(stat="identity") + 
  labs(title = "Number of Super Bowl MVPs by Position", x="Position", y="Count") + 
  theme(legend.position = "none")

Quarterback is over four times more prevalent than the next position.


Which Quarterback had the most passing yards?

To answer this question we can filter the table for QBs, then use regular expressions on the text in the highlights column.

qb_df <- suppressWarnings( sb_mvp_df %>% 
  filter(position=="QB") %>% 
  mutate(passing_yds = str_extract_all(highlights, "[[:digit:]]+ yards") %>% gsub(" yards", "", .) %>% as.integer(.)))
qb_df %>% arrange(desc(passing_yds)) %>% head() %>% kable()
sb_number player position team highlights passing_yds
51 Tom Brady QB New England 43-of-62, 466 yards, 2 TDs 466
34 Kurt Warner QB St. Louis 414 yards passing, 2 TDs 414
52 Nick Foles QB Philadelphia 28-of-43, 373 yards, 3 TDs, 1 TD reception 373
38 Tom Brady QB New England 354 yards passing, 3 TDs 354
22 Doug Williams QB Washington 340 yards passing, 4 TDs 340
33 John Elway QB Denver 336 yards passing, 1 TD 336

Tom Brady had the most, with 466 yards in Super Bowl 51.


Which player had the most Touchdowns?

We can do a similar regular expression match on the full table to extract touchdown information.

suppressWarnings( sb_mvp_df %>% 
  mutate(TD = str_extract_all(highlights, "[[:digit:]]+ TD") %>% gsub(" TD", "", .) %>% as.integer(.)) %>% 
  arrange(desc(TD)) ) %>% head() %>% kable()
sb_number player position team highlights TD
29 Steve Young QB San Francisco 325 yards passing, 6 TDs 6
24 Joe Montana QB San Francisco 297 yards passing, 5 TDs 5
13 Terry Bradshaw QB Pittsburgh 318 yards passing, 4 TDs 4
22 Doug Williams QB Washington 340 yards passing, 4 TDs 4
27 Troy Aikman QB Dallas 273 yards passing, 4 TDs 4
49 Tom Brady QB New England 37-of-50, 328 yards, 4 TDs 4

Steve Young had 6 touchdowns in Super Bowl 29.


Summary

While this data did not necessarily lend itself to complex analysis, it did offer the opportunity to tidy and analyze using functions like separate(), as.roman(), and various regular expressions.