CUNY 607

Load libraries

library(stringr)
library(dplyr)
library(tidyr)
library(zoo)
library(ggplot2)
library(knitr)
library(rvest)
library(tibble)

Get the data

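We start by using rvest to scrape the MVP table from espn.com. The raw result needs some cleaning: the columns have placeholder names (X1 to X3), and the first two rows hold the table title and the real column headers rather than data.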

# Page listing every Super Bowl MVP
url <- "http://www.espn.com/nfl/superbowl/history/mvps"
url_html <- url %>% read_html()
# Parse every table on the page and keep the first one
raw_df <- url_html %>% html_table(fill = TRUE) %>% .[[1]]
raw_df %>% head() %>% kable()

X1	X2	X3
Super Bowl Most Valuable Players	Super Bowl Most Valuable Players	Super Bowl Most Valuable Players
NO.	PLAYER	HIGHLIGHTS
I	Bart Starr, QB, Green Bay	Two touchdown passes
II	Bart Starr, QB, Green Bay	202 yards passing, 1 TD
III	Joe Namath, QB, New York Jets	206 yards passing
IV	Len Dawson, QB, Kansas City	142 yards passing, 1 TD
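
The stray title and header rows follow from how the table is marked up. The sketch below is a self-contained illustration, not the ESPN page itself: the HTML string is made up, and it assumes the real table likewise uses plain td cells with a title cell spanning all three columns. Under that assumption html_table() finds no header row, names the columns X1 to X3, and repeats the spanning title in every column.

```r
library(rvest)

# Hypothetical stand-in for the scraped page: a title cell spanning three
# columns, a header row built from <td> (not <th>) cells, and one data row.
toy_html <- '
<table>
  <tr><td colspan="3">Super Bowl Most Valuable Players</td></tr>
  <tr><td>NO.</td><td>PLAYER</td><td>HIGHLIGHTS</td></tr>
  <tr><td>I</td><td>Bart Starr, QB, Green Bay</td><td>Two touchdown passes</td></tr>
</table>'

# read_html() accepts literal HTML; html_table() returns one table per <table>
toy_df <- read_html(toy_html) %>% html_table() %>% .[[1]]

# Without <th> cells, every row is treated as data
names(toy_df)
toy_df
```

This is why the tidying step has to promote the second row to column names by hand.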

Tidy the data

We begin by promoting the second row to column names and dropping the two non-data rows. We then convert the Roman numerals to integers with the as.roman() function from the utils package, which ships with R. Finally, we use separate() from tidyr to split the PLAYER column into three columns: player, position, and team.

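sb_mvp_df <- raw_df
# Row 2 holds the real headers: lower-case them and strip punctuation ("NO." becomes "no")
names(sb_mvp_df) <- unlist(sb_mvp_df[2, ], use.names = FALSE) %>% tolower() %>% gsub("[[:punct:]]", "", .)
# Drop the title row and the header row
sb_mvp_df <- sb_mvp_df %>% slice(-c(1:2)) %>% as_tibble()
# Convert the Roman numeral to an integer
sb_mvp_df <- sb_mvp_df %>% mutate(sb_number = as.integer(as.roman(no))) %>% select(sb_number, player, highlights)
# Split "Name, POS, Team" into three columns
sb_mvp_df <- sb_mvp_df %>% separate(player, into = c("player", "position", "team"), sep = ", ")
sb_mvp_df %>% head() %>% kable()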

sb_number	player	position	team	highlights
1	Bart Starr	QB	Green Bay	Two touchdown passes
2	Bart Starr	QB	Green Bay	202 yards passing, 1 TD
3	Joe Namath	QB	New York Jets	206 yards passing
4	Len Dawson	QB	Kansas City	142 yards passing, 1 TD
5	Chuck Howley	LB	Dallas	Two interceptions, fumble recovery
6	Roger Staubach	QB	Dallas	119 yards passing, 2 TDs
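
To see the two key tidying calls in isolation, here is a minimal sketch on three rows copied from the raw table above. The toy_df and toy_tidy names are made up for the illustration.

```r
library(dplyr)
library(tidyr)

# Three rows copied from the raw table shown earlier
toy_df <- tibble(
  no = c("I", "III", "IV"),
  player = c("Bart Starr, QB, Green Bay",
             "Joe Namath, QB, New York Jets",
             "Len Dawson, QB, Kansas City")
)

toy_tidy <- toy_df %>%
  # as.roman() parses the numeral; as.integer() turns it into a plain number
  mutate(sb_number = as.integer(as.roman(no))) %>%
  # Split on comma plus space so "New York Jets" stays in one piece
  separate(player, into = c("player", "position", "team"), sep = ", ") %>%
  select(sb_number, player, position, team)

toy_tidy
```

Splitting on ", " rather than on whitespace is what keeps multi-word team names intact.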

Analysis

While this data presented a nice opportunity to tidy using some less common functions, it does not present a particularly rich opportunity for analysis. Nevertheless, we will look at a few obvious trends.

Which position won MVP the most times?

# Count MVP awards per position, most frequent first
sb_mvp_grp_pos_df <- sb_mvp_df %>% 
  count(position, sort = TRUE)
sb_mvp_grp_pos_df %>% kable()

position	n
QB	29
WR	7
RB	6
LB	4
CB	2
DE	1
DL	1
FB	1
KR	1
S	1

sb_mvp_grp_pos_df %>% 
  ggplot(aes(x = position, y = n, fill = position)) + 
  # Order the bars by count rather than alphabetically
  scale_x_discrete(limits = sb_mvp_grp_pos_df$position) + 
  geom_col() + 
  labs(title = "Number of Super Bowl MVPs by Position", x = "Position", y = "Count") + 
  theme(legend.position = "none")

Quarterbacks have won the award 29 times, more than four times as often as the next position, wide receiver (7).
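
The comparison can be checked with a little arithmetic. The sketch below re-enters the counts from the table above by hand (the pos_counts and pos_share names are made up for the illustration) and computes each position's share of all awards along with the ratio to the next row.

```r
library(dplyr)

# Counts copied from the position table above
pos_counts <- tibble(
  position = c("QB", "WR", "RB", "LB", "CB", "DE", "DL", "FB", "KR", "S"),
  n = c(29, 7, 6, 4, 2, 1, 1, 1, 1, 1)
)

pos_share <- pos_counts %>%
  mutate(share = n / sum(n),           # fraction of all awards in the table
         ratio_to_next = n / lead(n))  # this row's count relative to the next row

pos_share %>% head(3)
```

Quarterbacks account for 29 of the 53 awards in the table, a little over half.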

Which quarterback had the most passing yards?

To answer this question, we filter the table for QBs and use a regular expression to pull the yardage figure out of the free-text highlights column.

qb_df <- sb_mvp_df %>% 
  filter(position == "QB") %>% 
  # Take the first "<n> yards" figure per row (assumed to be the passing yards); NA if none
  mutate(passing_yds = as.integer(str_extract(highlights, "[[:digit:]]+(?= yards)")))
qb_df %>% arrange(desc(passing_yds)) %>% head() %>% kable()

sb_number	player	position	team	highlights	passing_yds
51	Tom Brady	QB	New England	43-of-62, 466 yards, 2 TDs	466
34	Kurt Warner	QB	St. Louis	414 yards passing, 2 TDs	414
52	Nick Foles	QB	Philadelphia	28-of-43, 373 yards, 3 TDs, 1 TD reception	373
38	Tom Brady	QB	New England	354 yards passing, 3 TDs	354
22	Doug Williams	QB	Washington	340 yards passing, 4 TDs	340
33	John Elway	QB	Denver	336 yards passing, 1 TD	336

Tom Brady had the most, with 466 yards in Super Bowl 51.
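
As a standalone illustration of the pattern matching, here is a minimal sketch on a few highlight strings copied from the tables above. It uses str_extract(), which returns only the first match per string. That is one reasonable way to do the extraction, and it assumes the first yardage figure in a highlight is the passing total.

```r
library(stringr)

# Highlight strings copied from the output above
highlights <- c("43-of-62, 466 yards, 2 TDs",
                "414 yards passing, 2 TDs",
                "28-of-43, 373 yards, 3 TDs, 1 TD reception",
                "Two touchdown passes")

# str_extract() returns the first match per string (NA when there is none);
# the lookahead (?= yards) keeps the digits and leaves out the word itself.
yds <- as.integer(str_extract(highlights, "[[:digit:]]+(?= yards)"))
yds
```

The completion counts ("43-of-62") are skipped because they are not followed by " yards", and the last string yields NA because it contains no yardage at all.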

Which player had the most touchdowns?

We can run a similar regular expression match on the full table to extract the touchdown counts.

sb_mvp_df %>% 
  # First "<n> TD" figure per row; counts that are spelled out give NA
  mutate(TD = as.integer(str_extract(highlights, "[[:digit:]]+(?= TD)"))) %>% 
  arrange(desc(TD)) %>% head() %>% kable()

sb_number	player	position	team	highlights	TD
29	Steve Young	QB	San Francisco	325 yards passing, 6 TDs	6
24	Joe Montana	QB	San Francisco	297 yards passing, 5 TDs	5
13	Terry Bradshaw	QB	Pittsburgh	318 yards passing, 4 TDs	4
22	Doug Williams	QB	Washington	340 yards passing, 4 TDs	4
27	Troy Aikman	QB	Dallas	273 yards passing, 4 TDs	4
49	Tom Brady	QB	New England	37-of-50, 328 yards, 4 TDs	4

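Steve Young had the most, with 6 touchdowns in Super Bowl 29. Note that the pattern only matches counts written as digits, so an entry such as Bart Starr's "Two touchdown passes" is not counted.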
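
Some highlights list more than one touchdown figure, for example Nick Foles's "3 TDs, 1 TD reception". The sketch below shows one possible way to total every numeric figure in a string. Whether passing and receiving touchdowns should be added together is a judgment call, so treat this as an assumption rather than the method used for the table above. The td_matches and total_td names are made up for the illustration.

```r
library(stringr)

# Highlight strings copied from the tables above
highlights <- c("325 yards passing, 6 TDs",
                "28-of-43, 373 yards, 3 TDs, 1 TD reception",
                "Two touchdown passes")

# str_extract_all() returns one character vector of matches per string
td_matches <- str_extract_all(highlights, "[[:digit:]]+(?= TD)")

# Sum the matches within each string; strings with no numeric "TD" become NA
total_td <- vapply(td_matches, function(m) {
  if (length(m) == 0) NA_integer_ else sum(as.integer(m))
}, integer(1))
total_td
```

Spelled-out counts still come back as NA here. Handling them would need a separate lookup from number words to digits.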

Summary

While this data did not lend itself to complex analysis, it did offer the opportunity to tidy and analyze using tools such as separate(), as.roman(), and regular expressions.


Project 2c: Super Bowl MVPs

mehtablocker

March 09, 2019
