Data Wrangling Midterm Project

Introduction

Motivation

The motivation behind choosing this dataset is largely centered around our love for sports, in particular football. With Cincinnati football on the rise, both collegiately and professionally, we thought it would be interesting to dive into the attendance data of the National Football League. As Bengals fans, we’ve experienced several years of below average football, and anticipate that the attendance numbers will be reflective of our performance on the field. We think the results will be beneficial to not only every day fans, but also NFL teams around the league. The Cincinnati Bengals organization is widely known as one of the cheapest in the league, and our hope is through the results of our analysis they will recognize that in order to improve attendance in this city, our team’s performance must improve.

Data and Methodology

For this project, we will be utilizing summary statistics in R to initially explore the various variables within our dataset in order to uncover new information and insights surrounding NFL stadium attendance from the 2000 season through the 2019 season. This analysis will lead us to a greater understanding of the variables affecting attendance at NFL games.

The NFL dataset includes data sourced from Pro Football Reference team standings. In total, the NFL Attendance data is comprised of three unique datasets:

Attendance - 8 variables
Standings - 15 variables
Games - 19 variables

Planned Analysis

With the data we have selected, there are many options for analysis regarding the variables surrounding attendance at NFL games. Through our exploration, we aim to analyze:

Past Game Attendance - This analysis will explore to what degree past game attendance for NFL teams in years prior can be an indicator of future game attendance.
Team Standing - This analysis will examine how individual NFL team rankings in standing tables prior to said team’s games influence game attendance.
Team Game Performance - This analysis will delve into how various metrics recorded over prior games, such as total points, influence the attendance recorded at games.
Attendance Factors - This analysis will allow us find correlation between different variables and possibly build a regression to predict fan attendance based on different factors.

Analysis Impact

As addressed earlier, our analysis of this data will prove to be useful not only for every day football fans, but for NFL executives as well. In the most simplistic manner, our research will provide in-depth analysis of the leaders in attendance over the time frame. With that being said, we believe our research can prove to be more beneficial. In professional sports, front offices are constantly using cost-benefit analysis to determine whether it is worth it to spend more money in order to improve their team, or whether they should maintain their current roster regardless of talent. While this data does not dive into the metrics of how much they should spend on players, our analysis of the correlation between performance variables and attendance will allow them to see if it’s worth it to consider investing in better players/coaches to improve the team. Additionally, by diving into trends in points, win/loss percentage, offensive ranking, and defensive ranking, they will be able to determine where their teams have been lacking, and address those issues accordingly.

Packages Used

The project description mentions there are thousands of packages in R that allow for analysis past base R, and this project requires a variety of those packages. The packages below are packages that provide the best use and results while cleaning, understanding, and visualizing the data, but as we continue our analysis we anticipate multiple packages being added.

Many of our packages provide different purposes, as some below are involved in the Data Preparation Section while others are planned to be used in the Exploratory Data Analysis section. Packages such as dplyr and tidyverse were used for data cleaning and manipulation to dive deeper into the data, while packages such as ggplot2 and tibble are planned to be used to make good visualizations to allow us to understand the data and results.

While many are packages are standard because of our growing experience and education in R, we look forward to growing our knowledge on new packages and uses of ones we have already worked with. Packages loaded for this project so far are displayed below.

library(dplyr) #for data cleaning
library(tidyverse) #for data cleaning and data manipulation
library(DT) #for creating and viewing data tables
library(knitr) #for dynamic report generation
library(skimr) # for displaying summary statistics
library(ggplot2) # for data visualization and plotting
library(tibble) # for data visualization and data manipulation

Data Preparation

Data Source

Data selected for this project originally came from our professor, Tianhai Zu, among three other datasets. The data is from GitHub. The GitHub page titled ‘NFL Stadium Attendance’ credits Pro Football Reference for the data, specifically their Team Standings page.

This data is made publicly available by Sports Reference, which makes websites for different sports including both basic and sabermetric statistics and resources for sports fans everywhere. Pro Football Reference includes the full statistical history of the NFL including seasons, teams, players, coaches, drafts, and more since 1960.

Overview

The data includes data for all teams and games played from the beginning of the 2000-2001 NFL season through the end of the 2019-2020 NFL season. The dataset contains three separate tables:

attendance - contains 8 variables and 10,846 observations that list the weekly and seasonal attendance for every week for every team.
standings - contains 15 variables and 638 observations that display the record and basic stats and standings for every team each season.
games - contains 19 variables and 5,324 observations containing the game results and stats for every game from the 2000-2001 season through the 2019-2020 season.

Any of the three tables could be joined relatively nicely by joining on “year”, “team_name”, and “team”.

Missing data is denoted by NA, but only two variables overall contain NAs. Both variables are necessary to the data, as one is weekly_attendance that is only NA when a team is on its by week (when the team does not play) and the other is tie, when is NA unless there is a tie.

Data Importing and Cleaning

Our first step was to set the working directory and import the three data sets.

setwd("/Users/drewgreiner/Library/Mobile Documents/com~apple~CloudDocs/5th Year/Data Wrangling/Project/Midterm") #Sets working directory

# NFL Attendance Data
attendance <- read.csv("/Users/drewgreiner/Documents/nfl/attendance.csv") #imports attendance table
standings <- read.csv("/Users/drewgreiner/Documents/nfl/standings.csv") #imports standings table
games <- read.csv("/Users/drewgreiner/Documents/nfl/games.csv") #imports games table

Joining the Datasets

# Join the data relatively nicely with dplyr
nfl_attendance <- left_join(attendance, standings, games, by = c("year", "team_name", "team"))

Table Cleaning and Summarizing

attendance Table

The original attendance dataset contains 10,846 observations and eight variables. In the data, there are two character variables - team and team_name. There also six numeric variables - year, total, home, away, week, weekly_attendance. The data spans from the 2000 season through the 2019 season, and only for weeks 1 through 17 are included, with no playoff attendance data. A view of the original attendance dataset is available below.

# View the attendance data set for the 2000 Arizona Cardinals
datatable(head(attendance, 17))

Below is a data dictionary for the attendance dataset.

Variable Name	Variable Data Type	Variable Desciption
team	character	City or state in which the team originates
team_name	character	Name or mascot of the team
year	integer	Year
total	integer	Total attendance per season
home	integer	Total attendance at home games per season
away	integer	Total attendance at away games per season
week	integer	Week in which game was played
weekly_attendance	integer	Attendance for given week

As mentioned above, there are missing values in the attendance dataset in the weekly_attendance column. Also mentioned above, these NA’s make sense as each team has a bye week in the NFL in which they do not play. In order to be consistent with counts and to not skew the data, we have omitted bye weeks below, resulting in 638 fewer observations.

colSums(is.na(attendance)) # Show the number of NA per column in attendance

##              team         team_name              year             total 
##                 0                 0                 0                 0 
##              home              away              week weekly_attendance 
##                 0                 0                 0               638

attendance <- na.omit(attendance)
colSums(is.na(attendance)) # Show the NA have been omitted

##              team         team_name              year             total 
##                 0                 0                 0                 0 
##              home              away              week weekly_attendance 
##                 0                 0                 0                 0

We also decided to change some of the variable names in the original attendance dataset in order to better define and describe the variables. We made name changes to four variables - team, total, home, away.

attendance <- attendance %>% rename(
  team_city = team,
  total_season_attendance = total,
  season_home_attendance = home,
  season_away_attendance = away
)
names(attendance)

## [1] "team_city"               "team_name"              
## [3] "year"                    "total_season_attendance"
## [5] "season_home_attendance"  "season_away_attendance" 
## [7] "week"                    "weekly_attendance"

The final step in cleaning the attendance dataset is creating two new datasets, one with only weekly attendance information and one with total attendance information (including total_season_attendance, season_home_attendance, season_away_attendance).

weekly_attendance <- attendance[c(1:3, 7:8)] # Removing season totals for attendance
datatable(head(weekly_attendance,16)) # Viewing the new weekly_attendance dataset

total_attendance <- attendance[c(1:6)] # Removing weekly attendance numbers
total_attendance <- total_attendance[!duplicated(attendance), ] # Removing weekly duplicates
datatable(head(total_attendance,31)) # Viewing the new total_attendance dataset

Lastly, you can see a full summary of the cleaned attendance table.

skim(attendance)

Data summary
Name	attendance
Number of rows	10208
Number of columns	8
_______________________
Column type frequency:
character	2
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
team_city	0	1	5	13	0	32	0
team_name	0	1	4	10	0	32	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	2009.53	5.75	2000	2005.0	2010	2015.00	2019	▇▇▇▇▇
total_season_attendance	1	1080910.03	72877.18	760644	1040509.0	1081090	1123230.00	1322087	▁▁▇▆▁
season_home_attendance	1	540455.01	66774.84	202687	504360.0	543185	578342.00	741775	▁▁▅▇▁
season_away_attendance	1	540455.01	25509.40	450295	524974.0	541757	557741.00	601655	▁▂▇▇▂
week	1	9.10	4.99	1	5.0	9	13.00	17	▇▅▅▆▇
weekly_attendance	1	67556.88	9022.02	23127	63245.5	68334	72544.75	105121	▁▁▇▃▁

standings Table

The original standings data set contains 6,638 observations and 15 variables. In the data, there are four character variables - team, team_name, playoffs, sb_winner. There are also 11 numeric variables - year, wins, loss, points_for, points_against, points_differential, margin_of_victory, strength_of_schedule, simple_rating, offensive_ranking, defensive_ranking. The data spans from the 2000 season through the 2019 season. A view of the original standings dataset is available below.

# View the standings data set for the 2000 season
datatable(head(standings, 31))

Below is a data dictionary for the standings dataset.

Variable Name	Variable Data Type	Variable Desciption
team	character	City or state in which the team originates
team_name	character	Name or mascot of the team
year	integer	Year
wins	integer	Total wins per season (0 to 16)
loss	integer	Total losses per season (0 to 16)
points_for	integer	Total points the team scored per season
points_against	integer	Total points the opponent scored on the team per season
points_differential	integer	The difference between the total points for the team and against the team
margin_of_victory	numeric	Points differential divided by the total number of games per season
strength_of_schedule	numeric	Difficulty of schedule based on opponent records
simple_rating	numeric	A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System)
offensive_ranking	numeric	A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System)
defensive_ranking	numeric	A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System)
playoffs	character	Stating whether or not the team made it to the playoffs
sb_winner	character	Stating whether or not the team won the Super Bowl for the season

Next, we wanted to check for any missing values in the columns of the standings dataset. There were no missing value or NA value in the standings dataset, as seen below.

colSums(is.na(standings)) # Show the number of NA per column in standings

##                 team            team_name                 year 
##                    0                    0                    0 
##                 wins                 loss           points_for 
##                    0                    0                    0 
##       points_against  points_differential    margin_of_victory 
##                    0                    0                    0 
## strength_of_schedule        simple_rating    offensive_ranking 
##                    0                    0                    0 
##    defensive_ranking             playoffs            sb_winner 
##                    0                    0                    0

Looking at the last two character variables, we decided to change them to binary variables, and both only have two different values, as shown below.

unique(standings$playoffs) # Show the unique values for playoffs

## [1] "Playoffs"    "No Playoffs"

unique(standings$sb_winner) # Show the unique values for sb_winner

## [1] "No Superbowl"  "Won Superbowl"

For the playoffs variable, we changed the “Playoffs” value to 1, and “No Playoffs” to 0.

standings$playoffs[standings$playoffs == "Playoffs"] <- "1"
standings$playoffs[standings$playoffs == "No Playoffs"] <- "0"

standings$playoffs <- as.numeric(standings$playoffs) # Changing the binary values to numerics

For the sb_winner variable, we changed the “Won Superbowl” value to 1, and “No Superbowl” to 0.

standings$sb_winner[standings$sb_winner == "Won Superbowl"] <- "1"
standings$sb_winner[standings$sb_winner == "No Superbowl"] <- "0"

standings$sb_winner <- as.numeric(standings$sb_winner) # Changing the binary values to numerics

Below you can see the new updated unique values for playoffs and sb_winner.

unique(standings$playoffs) # Show the new unique values for playoffs

## [1] 1 0

unique(standings$sb_winner) # Show the new unique values for sb_winner

## [1] 0 1

With the data now cleaned after checking for missing values and changing two variables to binaries, a view of the cleaned standings dataset is below.

datatable(head(standings,31)) # Viewing the cleaned standings dataset

Lastly, you can see a full summary of the cleaned standings table.

skim(standings)

Data summary
Name	standings
Number of rows	638
Number of columns	15
_______________________
Column type frequency:
character	2
numeric	13
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
team	0	1	5	13	0	32	0
team_name	0	1	4	10	0	32	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	2009.53	5.76	2000.0	2005.00	2010.0	2014.75	2019.0	▇▇▇▇▇
wins	1	7.98	3.08	0.0	6.00	8.0	10.00	16.0	▂▆▇▆▂
loss	1	7.98	3.08	0.0	6.00	8.0	10.00	16.0	▂▆▇▆▂
points_for	1	350.28	71.40	161.0	299.00	348.0	396.00	606.0	▂▇▇▂▁
points_against	1	350.28	59.55	165.0	310.00	347.0	391.50	517.0	▁▃▇▆▁
points_differential	1	0.00	101.09	-261.0	-75.00	1.5	72.75	315.0	▂▆▇▅▁
margin_of_victory	1	0.00	6.32	-16.3	-4.70	0.1	4.57	19.7	▂▆▇▅▁
strength_of_schedule	1	0.00	1.63	-4.6	-1.10	0.0	1.20	4.3	▁▅▇▅▁
simple_rating	1	0.00	6.20	-17.4	-4.47	0.0	4.50	20.1	▁▆▇▅▁
offensive_ranking	1	0.00	4.34	-11.7	-3.18	0.0	2.70	15.9	▁▇▇▂▁
defensive_ranking	1	0.00	3.57	-9.8	-2.40	0.1	2.50	9.8	▁▅▇▅▁
playoffs	1	0.38	0.48	0.0	0.00	0.0	1.00	1.0	▇▁▁▁▅
sb_winner	1	0.03	0.17	0.0	0.00	0.0	0.00	1.0	▇▁▁▁▁

games Table

The original games data set contains 5,324 observations and 19 variables. In the data, there are 12 character variables - week, home_team, away_team, winner, tie, day, date, time, home_team_name, home_team_city, away_team_name, away_team_city. There are also seven numeric variables - year, pts_win, pts_loss, yds_win, turnovers_win, yds_loss, turnovers_loss. The data spans from the 2000 season through the 2019 season. A view of the original standings dataset is available below.

# View the games data set for the Week 1 of the 2000 season
datatable(head(games,16))

Below is a data dictionary for the games dataset.

Variable Name	Variable Data Type	Variable Desciption
year	integer	Year
week	character	Week of the season in which the game was played
home_team	character	Home team for the game
away_team	character	Away team for the game
winner	character	Winner of the game
tie	character	Was there a tie? (if so, the other team will be listed in this column)
day	character	Day of the week in which the game was played
date	character	Date of the game
time	character	Time of the day in which the game was played
pts_win	integer	Number of points the winning team scored
pts_loss	integer	Number of points the losing team scored
yds_win	integer	Total number of yards the winning team had
turnovers_win	integer	Total number of turnovers the winning team had
yds_loss	integer	Total number of yards the losing team had
turnovers_loss	integer	Total number of turnovers the losing team had
home_team_name	character	Name or mascot of the winning team
home_team_city	character	City of the winning team
away_team_name	character	Name or mascot of the losing team
away_team_city	character	City of the losing team

Based on looking at the above dataset, we removed the last four columns (home_team_name, home_team_city, away_team_name, away_team_city) as we felt they were repetitive.

names(games) # Viewing the names of the columns in the games dataset

##  [1] "year"           "week"           "home_team"      "away_team"     
##  [5] "winner"         "tie"            "day"            "date"          
##  [9] "time"           "pts_win"        "pts_loss"       "yds_win"       
## [13] "turnovers_win"  "yds_loss"       "turnovers_loss" "home_team_name"
## [17] "home_team_city" "away_team_name" "away_team_city"

games <- games[-c(16:19)] # Removing the repeitive columns from the games dataset
names(games) # Viewing the names of the columns in the updated games dataset

##  [1] "year"           "week"           "home_team"      "away_team"     
##  [5] "winner"         "tie"            "day"            "date"          
##  [9] "time"           "pts_win"        "pts_loss"       "yds_win"       
## [13] "turnovers_win"  "yds_loss"       "turnovers_loss"

As mentioned above, there are missing values in the games dataset in the tie column. Also mentioned above, these NA’s make sense as there are very few ties in the NFL. Knowing that tie contains all of the NA, we also must look at the unique values for tie.

colSums(is.na(games)) # Show the number of NA per column in games

##           year           week      home_team      away_team         winner 
##              0              0              0              0              0 
##            tie            day           date           time        pts_win 
##           5314              0              0              0              0 
##       pts_loss        yds_win  turnovers_win       yds_loss turnovers_loss 
##              0              0              0              0              0

unique(games$tie) # Viewing the unique values in the tie column in games

## [1] NA                   "Atlanta Falcons"    "Cincinnati Bengals"
## [4] "St. Louis Rams"     "Green Bay Packers"  "Carolina Panthers" 
## [7] "Arizona Cardinals"  "Cleveland Browns"

games$winner[games$tie != is.na(games$tie)] <- "Tie" # Entering tie into winner column for ties
games <- games[-c(6)] # Removing the ties column
colSums(is.na(games)) # Checking the number of NA per column in games

##           year           week      home_team      away_team         winner 
##              0              0              0              0              0 
##            day           date           time        pts_win       pts_loss 
##              0              0              0              0              0 
##        yds_win  turnovers_win       yds_loss turnovers_loss 
##              0              0              0              0

The final change needed is changing the week variable from a character variable to a numeric variable. This is done below.

games$week <- as.numeric(games$week) # Changing week to numeric

With the data now cleaned after checking for missing values and changing variables, a view of the cleaned games dataset is below.

datatable(head(games,16)) # Viewing the cleaned games dataset

Lastly, you can see a full summary of the cleaned games table.

skim(games)

Data summary
Name	games
Number of rows	5324
Number of columns	14
_______________________
Column type frequency:
character	6
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
home_team	1	13	20	34
away_team	1	13	20	34
winner	1	3	20	35
day	1	3	3	7
date	1	9	12	154
time	1	8	8	187

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1.00	2009.53	5.75	2000	2005	2010	2015	2019	▇▇▇▇▇
week	220	0.96	9.10	4.99	1	5	9	13	17	▇▅▅▆▇
pts_win	0	1.00	27.78	8.83	3	21	27	34	62	▁▇▇▂▁
pts_loss	0	1.00	16.09	8.14	0	10	16	21	51	▆▇▅▁▁
yds_win	0	1.00	361.64	78.58	47	308	361	415	653	▁▂▇▃▁
turnovers_win	0	1.00	1.08	1.04	0	0	1	2	7	▇▂▁▁▁
yds_loss	0	1.00	309.08	84.50	26	251	306	366	613	▁▅▇▃▁
turnovers_loss	0	1.00	2.17	1.42	0	1	2	3	8	▆▇▂▁▁

Exploratory Data Analysis

Discovery

For the discovery of new information in the data, we plan to utilize all three datasets in our analysis, which includes data on NFL attendance, game statistics, and standings. Our examination of attendance data will allow us to understand how standings and team game performance impact NFL attendance. This will allow us to find the following information surrounding attendance and team performance:

NFL teams who have had the most attendance over the seasons since 2000, looking at different summary statistics such as mean, median, total, etc.
Trends in points (mean, median, total, etc.) over the seasons for each team
Trends in win/loss percentage, offensive_ranking, defensive_ranking, and playoff berth by team over the seasons
Strength of divisions in relation to other divisions by comparing divisions’ total wins, playoff berths, and super bowl wins

We also plan to analyze how performance variables found in the standings dataset (such as wins, losses, offensive_ranking, defensive_ranking, and playoff berth) correlate to the team’s attendance data over the seasons. We plan to do this by creating a correlation matrix between the variables total (total attendance), wins, loss, offensive_ranking, defensive_ranking, playoffs, and sb_winner. We can also use the correlation functionality in R to find the exact correlation values across the board.

Plots and Tables

For the initial attendance analysis, plots and tables will be very prevalent in our results. For offensive_ranking, defensive_ranking, playoff, sb_winner, wins, points_for, points_against, and other variables, we plan to output tables showing the frequency of each response in relation to weekly_attendance over time. We will also utilize bar plots to visualize the frequency of each over time. A correlation matrix will be essential for our attendance analysis in visualizing the correlation between various variables, including most variables in our standings dataset. We are also looking to visualize the strength of each division in relation to one another. This could be through a plot of the count of different summary statistics of wins or participation in the playoffs by division.

Required Learning

Most of what is needed for our analysis has been covered in this course or in one of the other courses in R that we have already taken. The main learning that we will need to do before completing the project is a deeper dive into joins. We have an interesting challenge to face: there is not a common variable in all the datasets that can be used as a primary/foreign key (issues in the games data set with home vs away and combined team_name and team_city). We are looking for ways to create a common key using the variables that we have. This will be one of our required learnings. We will also dig a little deeper into plotting inside of R to ensure our plots come out as clean and professional.

Machine Learning Techniques

The incorporation of linear regression into our analysis will depend on preliminary results of the correlation values when we examine variables’ correlation with attendance. If we find that there is a strong correlation between any of the variables, that would prompt us to create a regression model to find more information on how attendance is typically affected by other variables present. One such correlation we hope will be worth looking into is the correlation between weekly_attendance and some variables in our standings dataset (i.e. wins).

Data Wrangling Midterm Project - NFL Attendance

Andrew Greiner, Daniel Keating, Charles MacKenzie, Samuel Martin, and Lucas Schirr

2021-11-04

Introduction

Motivation

Data and Methodology

Planned Analysis

Analysis Impact

Packages Used

Data Preparation

Data Source

Overview

Data Importing and Cleaning

Joining the Datasets

Table Cleaning and Summarizing

attendance Table

standings Table

games Table

Exploratory Data Analysis

Discovery

Plots and Tables

Required Learning

Machine Learning Techniques