library(readr)
library(ggplot2)
library(dplyr)
library(tibble)
library(lubridate)
library(reshape2)
library(tidyr)
library(knitr)
library(plotly)
library(DT)
This sections contains all the procedures followed in getting the data analysis ready. Each step has been explained and the codes have been given
Original data (attendance, games and standings) were obtained from Pro Football Reference website and downloaded and cited through Github Rfordatascience
# Read in original data sets from Github
attendance <- as_tibble(read.csv("https://raw.githubusercontent.com/angiezhangty/nfl-analysis/master/attendance.csv", stringsAsFactors = FALSE))
games<- as_tibble(read.csv("https://raw.githubusercontent.com/angiezhangty/nfl-analysis/master/games.csv", stringsAsFactors = FALSE))
standings <- as_tibble(read.csv("https://raw.githubusercontent.com/angiezhangty/nfl-analysis/master/standings.csv", stringsAsFactors = FALSE))
All data cleaning steps were performed in RStudio in a R script. We used a full join first on games and attendance by team_name (attendance) = home_team_name (games), year and week. R automatically merged team_name into home_team_name and we saved the result into a table named “nfl”.
Then we used a left join on nfl and standing by year and home_team_name (nfl) = team_name (standings) and saved the result into a new table named “nfl_cleaned_data”. And we started examining each variable in the new data set. Since we have 2 team columns in the data and they are exactly the same as home_team_city, we deleted them and kept home_team_city instead.
Next we started renaming column names: We changed day into weekday. We separated date into month and day, combined them with year to create game_date, and changed the class of game_date into Date. Then we changed total, home, away to total_attendance, home_attendance, away_attendance. We also replaced some values: We changed playoffs values: Playoffs to 1, No Playoffs to 0, and converted it into a numeric variable.
We renamed sb_winner to SuperBowl_winner and did the same thing to replace values: Won Superbowl to 1, No Superbowl to 0 and converted it into a numeric variable.
Last, we saved the final file into our local folder and named it “nfl_cleaned_data”. However, we did not use the data cleaning steps in this R Markdown report to save time but we have listed the detailed steps below.
#change attendance week column to character
attendance$week <- as.character(attendance$week)
#full join games and attendance data together
nfl<- full_join(games,attendance,by = c("home_team_name"="team_name", "year", "week"))
#delete duplicate team column
nfl <- select(nfl,-"team")
#join new cleaned data to standings data
nfl_cleaned_data<- left_join(nfl,standings,by=c("home_team_name"="team_name","year","home_team_city"="team"))
#rename column name "day" to "weekday"
names(nfl_cleaned_data)[7] <- "weekday"
#separate data column to month and day columns
nfl_cleaned_data <-separate(nfl_cleaned_data, date, c("month", "day"), sep = " ")
#unite year,month and day columns to game_data column
nfl_cleaned_data <- unite(nfl_cleaned_data, game_date, year, month, day, sep = "-")
#change game_date to standard time format
nfl_cleaned_data$game_date <- as.Date(nfl_cleaned_data$game_date, format = "%Y-%B-%d")
#add game_year columns
nfl_cleaned_data$game_year <- year(nfl_cleaned_data$game_date)
#rename columns name
names(nfl_cleaned_data)[19:21] <- c("total_attendance", "home_attendance","away_attendance")
names(nfl_cleaned_data)[34] <- c("superbowl_winner")
#replace no playoffs to 0, and playoffs to 1 and change to numeric data type
nfl_cleaned_data$playoffs <- str_replace(nfl_cleaned_data$playoffs, "No Playoffs", "0")
nfl_cleaned_data$playoffs <- str_replace(nfl_cleaned_data$playoffs, "Playoffs", "1")
nfl_cleaned_data$playoffs <- as.numeric(nfl_cleaned_data$playoffs)
#replace no Superbowl to 0, and Won Superbowl to 1 and change to numeric data type
nfl_cleaned_data$superbowl_winner <- str_replace(nfl_cleaned_data$superbowl_winner, "No Superbowl", "0")
nfl_cleaned_data$superbowl_winner <- str_replace(nfl_cleaned_data$superbowl_winner, "Won Superbowl", "1")
nfl_cleaned_data$superbowl_winner <- as.numeric(nfl_cleaned_data$superbowl_winner)
After the data cleaning, we once again checked the number of rows and columns, as shown in the code below.
colnames(nfl_cleaned_data)
## [1] "game_date" "week" "home_team"
## [4] "away_team" "winner" "tie"
## [7] "weekday" "time" "pts_win"
## [10] "pts_loss" "yds_win" "turnovers_win"
## [13] "yds_loss" "turnovers_loss" "home_team_name"
## [16] "home_team_city" "away_team_name" "away_team_city"
## [19] "total_attendance" "home_attendance" "away_attendance"
## [22] "weekly_attendance" "wins" "loss"
## [25] "points_for" "points_against" "points_differential"
## [28] "margin_of_victory" "strength_of_schedule" "simple_rating"
## [31] "offensive_ranking" "defensive_ranking" "playoffs"
## [34] "superbowl_winner" "game_year"
dim(nfl_cleaned_data)
## [1] 11066 35
-There are a lot of NA’s in the data set after we performed full join and left join with the original data, but we think it should stay NA to be true to our data.
-Week variable: Since there are Wildcard, Division and ConfChamp in the playoffs, we decided to keep the variable as a character.
-Tie variable: only very few games had values in there because ties are only allowed in the regular season and we decided to keep them that way.
-We believe that all the other changes that we performed are for the benefit of the analysis, but we will come back to change it if we believe it is better to do so.
# View data with DT package
datatable(head(nfl_cleaned_data, 50))
In this section, we used a variety of packages, functions and graphical methods to explore the NFL data set.
# Here we are looking at the overall attendance trend for all NFL teams from 2000 to 2019.
annual_total <- nfl_cleaned_data%>%
drop_na('weekly_attendance', 'game_year')%>%
group_by(game_year)%>%
summarize(total = sum(weekly_attendance))
annual_total_plot <- ggplot(annual_total, aes(x=game_year,y= total/1000000))+
geom_point(size=3)+
geom_line()+
labs(title ='Total NFL Regular Games Attendance by Year',x="Year",y="Attendance (Millions)",caption="Source: Pro-football Reference") +
theme_classic()+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
scale_x_continuous(breaks = seq(2000, 2019, 1))+
annotate(geom = "point", x = 2016, y = 17.788671, colour = "orange", size = 4) +
annotate(geom = "point", x = 2016, y = 17.788671) +
annotate(geom = "text", size=3,x = 2015, y = 17.788671, label = "New team; Los Angeles Rams added
to NFL hence an increase in attendance", hjust = "right")+
annotate(geom = "point", x = 2017, y = 17.253425, colour = "orange", size = 4) +
annotate(geom = "point", x = 2017, y = 17.253425) +
annotate(geom = "curve", x = 2015, y = 16.9, xend = 2017, yend = 17.23,curvature = .3, arrow = arrow(length = unit(2, "mm")) ) +
annotate(geom = "text", size=3,x = 2015, y = 16.8, label = "Drop in attendence is due to Chargers after
their move from a Qualcomm Stadium that
had a maximum capacity of of 70,4561 people
to LA Dignity Health Sports Park which holds
a max capacity of 27,000 people" , hjust = "right")+
annotate("rect", xmin = 1999.5, xmax = 2001.5, ymin = 16.2, ymax = 16.5,alpha = .2)+
annotate(geom = "curve", x = 2003, y = 16.4, xend = 2001.5, yend = 16.4,curvature = .0, arrow = arrow(length = unit(2, "mm")) ) +
annotate(geom = "text", size=3,x = 2003, y = 16.4, label = "New England Patriots 1st superbowl win and the new team
Houston Texans joining NFL leads to increase in
attendence in 2002." , hjust = "left")
annual_total_plot
From the graph below, the Dallas Cowboys, New York Giants, New York Jets and Washington Redskins have the highest total attendance over the years whereas the Los Angeles Chargers and Los Angeles Rams have the lowest attendance. It should be noted that the Chargers and the Rams are still new in NFL since they joined in 2016.
# In this section we are looking at the ranking of the teams attendance.
team_trend <- nfl_cleaned_data%>%
drop_na('weekly_attendance','home_team')%>%
group_by(home_team)%>%
summarize(total = sum(weekly_attendance))
top5team<-team_trend%>%
arrange(desc(total))%>%
head(4)
bottom2team<-team_trend%>%
arrange(total)%>%
head(2)
team_trend_graph <- ggplot(team_trend, aes(x=home_team,y= total/1000000))+
geom_point(size=3)+
geom_point(data=top5team,color="red",size=3)+
geom_point(data=bottom2team,color="blue",size=3)+
theme_classic()+
theme(axis.text.x = element_text(angle = 90,size = 10))+
labs(title ='Total Regular Season Attendance for Each NFL Team Since 2000',x="Team",y="Attendance (Millions)")+
annotate("rect", xmin = 'Kansas City Chiefs', xmax = 'Miami Dolphins', ymin = .0, ymax = 3,alpha = .2)+
annotate(geom = "curve", x = 'Indianapolis Colts', y = 2.5, xend = 'Kansas City Chiefs', yend = 2,curvature = .0, arrow = arrow(length = unit(2, "mm")) ) +
annotate(geom = "text", x = 'Indianapolis Colts', y = 2.5, label = 'New teams in NFL' , hjust = "right")
team_trend_graph
Some interesting things we observed were that the Dallas Cowboys had continuously high attendance record based on the boxplots below and the Arizona Cardinals had the lowest weekly attendance ever compared to other teams.
# In this section we are looking at each team's attendance
team_box <- nfl_cleaned_data%>%
drop_na('weekly_attendance','home_team')
team_trend_box <- ggplot(team_box, aes(x=home_team,y= weekly_attendance/1000))+
geom_boxplot()+
labs(title ='NFL Weekly Games Attendance by Team',x="Home team",y="Weekly Attendance (Thousands)") +
theme_classic()+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
team_trend_box
From the graph we can see that for most teams, winning the Superbowl did not increase their regular season attendance significantly except for the Patriots in 2001. The 1st win brought the team fame and popularity making their attendance data go through the roof.
# We created a sbwinner data frame to showcase the list of Superbowl winners and their attendance data for each year from 2000 to 2019
sbwinner<- nfl_cleaned_data %>%
filter(superbowl_winner == 1) %>%
group_by(game_year, home_team) %>%
summarize(attendance = mean(total_attendance, na.rm = TRUE))
kable(sbwinner, col.names = c("Game Year", "Team Name", "Total Attendance"), align = "ccc")
| Game Year | Team Name | Total Attendance |
|---|---|---|
| 2000 | Baltimore Ravens | 1062373 |
| 2001 | New England Patriots | 977717 |
| 2002 | Tampa Bay Buccaneers | 1044920 |
| 2003 | New England Patriots | 1127515 |
| 2004 | New England Patriots | 1108210 |
| 2005 | Pittsburgh Steelers | 1048739 |
| 2006 | Indianapolis Colts | 1028634 |
| 2007 | New York Giants | 1197501 |
| 2008 | Pittsburgh Steelers | 1076998 |
| 2009 | New Orleans Saints | 1116009 |
| 2010 | Green Bay Packers | 1122668 |
| 2011 | New York Giants | 1232157 |
| 2012 | Baltimore Ravens | 1109404 |
| 2013 | Seattle Seahawks | 1094133 |
| 2014 | New England Patriots | 1111983 |
| 2015 | Denver Broncos | 1141878 |
| 2016 | New England Patriots | 1094772 |
| 2017 | Philadelphia Eagles | 1117141 |
| 2018 | New England Patriots | 1068530 |
| 2019 | Kansas City Chiefs | 1115771 |
# This is a for loop to create a list to get a list of team names and the corresponding attendance data
team <- list()
next_year <- list()
for (x in unique(sbwinner$home_team)) {
temp <- sbwinner[sbwinner$home_team == x,]
for (y in unique(temp$game_year)) {
team <- c(team, x)
y = y + 1
next_year <- c(next_year, y)
}
}
# We put these 2 lists into a tibble df using dplyr and inner joined the new list with the nfl_cleaned_data to eventually find the attendance data
next_year_list <- tibble(next_year = as.numeric(next_year), team = as.character(team)) %>%
arrange(next_year)
next_sbwinner <- right_join(nfl_cleaned_data, next_year_list, by =c("game_year"="next_year", "home_team"="team")) %>%
group_by(game_year, home_team) %>%
select(next_game_year = game_year, alt_home_team = home_team, total_attendance) %>%
summarise(next_attendance = mean(total_attendance, na.rm = TRUE))
# Now we can combine the sbwinner and next_sbwinner together to get a consolidated table to show the differences
sbwinner_combine <- tibble(sbwinner, next_sbwinner) %>%
select(-next_game_year, -alt_home_team)
view(sbwinner_combine)
# We plotted the sbwinner_combine graph as well
sbwinner_graph <- ggplot(data = sbwinner_combine) +
geom_col(aes(x=game_year, y=attendance/1000000, alpha=0.3, fill=home_team), show.legend = FALSE) +
geom_line(aes(x=game_year, y=next_attendance/1000000)) +
geom_point(aes(x=game_year, y=next_attendance/1000000), size=4) +
geom_text(aes(x=game_year, y=attendance/1000000, label=home_team), angle=90, size=3, hjust=2) +
annotate("text", x=2016, y=1.3, size = 3, label="Note: The dots represent the year subsequent
to winning the Superbowl") +
labs(x="Year", y="Total Attendance") +
scale_x_continuous(breaks = seq(2000, 2019, 1)) +
scale_y_continuous(breaks = seq(0.0, 1.5, 0.1)) +
ggtitle("Superbowl Winners Attendance From 2000 to 2020") +
theme_classic()+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
sbwinner_graph
New England Patriots
As we can see, in 2001, the Patriots had the lowest regular season attendance until they won the Superbowl and attendance suddenly skyrocketed. Then people started expecting the Patriots to win again next year and they did. Back then, they did not think the Patriots could win 2 years in a row so attendance fell a little in 2004, after which attendance fluctuated for a while until they won again in 2014. We can see the rebound in the subsequent years after they won each game. But we can definitely see a gradual decline since 2015.
# So we selected New England Patriots' attendance data for every year because they are a special case
nep_attendence <- nfl_cleaned_data %>%
filter(home_team_name == "Patriots") %>%
group_by(game_year,superbowl_winner) %>%
summarize(attendance = mean(total_attendance, na.rm = TRUE))
# We filtered out the years where New England Patriots won a Superbowl
nep_superbowl<- nep_attendence %>%
filter(superbowl_winner == 1)
# We built a line chart with emphasis on the Patriot's winning years
nep_trends <- ggplot(nep_attendence, aes(x=game_year, y=attendance/1000000)) +
geom_line() +
geom_point(data=nep_superbowl, aes(x=game_year, y=attendance/1000000),color= 'red', size=3) +
geom_text(data=nep_superbowl, aes(label=attendance), size = 3, vjust = 2) +
annotate("text", x=2016, y=1.14, label="Note: The dots represent the year the Patriots won the Superbowl") +
labs(x="Year",y= "Total Attendance (Millions)", title="New England Patriots Attendance From 2000 - 2019") +
scale_x_continuous(breaks = seq(2000, 2019, 1)) +
theme_classic()+ theme(axis.text.x = element_text(angle = 90, vjust = 0.45))
nep_trends
We are wondering if advancing to the Division games could increase regular attendance the following year.
# We got the teams attendance data for each year
teams_attendence <- nfl_cleaned_data %>%
group_by(game_year, home_team) %>%
summarize(attendance = mean(total_attendance, na.rm = TRUE))
# We got the team names that entered division games
division_teams <- nfl_cleaned_data %>%
filter(week =="Division") %>%
group_by(game_year)%>%
select(home_team)
# We got team attendance of every team that entered division games
division_teams_attendance<- left_join(division_teams,teams_attendence,by=c("game_year"="game_year","home_team"="home_team"))
# We used another for loop for the same purpose as last time
teamD <- list()
next_yearD <- list()
for (x in unique(division_teams_attendance$home_team)) {
temp <- division_teams_attendance[division_teams_attendance$home_team == x,]
for (y in unique(temp$game_year)) {
teamD <- c(teamD, x)
y = y + 1
next_yearD <- c(next_yearD, y)
}
}
# We put these 2 lists into a tibble df using dplyr and inner joined the new list with the nfl_cleaned_data to eventually find the attendance data
next_year_listD <- tibble(next_year = as.numeric(next_yearD), team = as.character(teamD)) %>%
arrange(next_yearD)
# We got team next year attendance that entered division games
division_team_attendance1<-left_join(next_year_listD,nfl_cleaned_data,by= c("team"="home_team","next_year"="game_year"))%>%
group_by(next_year,team)%>%
summarise(next_attendance=mean(total_attendance,na.rm = TRUE))
# We got team attendance of this year and next year
division_combine <- tibble(division_teams_attendance, division_team_attendance1) %>%
select(-team)
# We melted divison team attendance
data.m <- melt(division_combine, id=c("game_year", "home_team"), measure.vars = c("attendance", "next_attendance"))
# And got bar chart for division team
division_combine_trend<-
ggplot(data.m, aes(x = home_team, y = value/1000000)) +
geom_bar(aes(fill = variable), stat="identity", width=0.5, position = position_dodge(width=0.5))+
theme_classic()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))+
scale_fill_manual(values=c("#999999", "#E69F00"))+
labs(title = "Entering the Divison Games on Team Attendance Trends", x ="Team Name",y = "Total Attendance (Millions)")
division_combine_trend
According to this plot, as we can see, the teams that entered the division games did not see a great increase in their total attendance for the following year. What interested us was that for the Dallas Cowboys, after entering the Division games, their attendance for the following year decreased quite a lot. We do not know the exact reason why but we were wondering if it had anything to do with their long-time drought for a team that only entered the division games twice in the past 20 years but considered themselves America’s team.
In order to understand the impact of NFL game attendance on the revenue generated, we needed to have a revenue data set from the attendances broken down by the team and game played. We could not find that information available to the public.
The three data sets that we had, had no unique ids, thus making it hard to joining into one data set without creating many records with NA values.
Also, the team had to formulate the problem statement for the project. This meant that we had to limit the questions only to those that can be answered by the available data. Thus, the team had to do an initial data exploration prior to setting up the problem statement and make our own assumptions based on our limited knowledge in football.