The motivation behind choosing this dataset is largely centered around our love for sports, in particular football. With Cincinnati football on the rise, both collegiately and professionally, we thought it would be interesting to dive into the attendance data of the National Football League. As Bengals fans, we’ve experienced several years of below average football, and anticipate that the attendance numbers will be reflective of our performance on the field. We think the results will be beneficial to not only every day fans, but also NFL teams around the league. The Cincinnati Bengals organization is widely known as one of the cheapest in the league, and our hope is through the results of our analysis they will recognize that in order to improve attendance in this city, our team’s performance must improve.
For this project, we will be utilizing summary statistics in R to initially explore the various variables within our dataset in order to uncover new information and insights surrounding NFL stadium attendance from the 2000 season through the 2019 season. This analysis will lead us to a greater understanding of the variables affecting attendance at NFL games.
The NFL dataset includes data sourced from Pro Football Reference team standings. In total, the NFL Attendance data is comprised of three unique datasets:
With the data we have selected, there are many options for analysis regarding the variables surrounding attendance at NFL games. Through our exploration, we aim to analyze:
As addressed earlier, our analysis of this data will prove to be useful not only for every day football fans, but for NFL executives as well. In the most simplistic manner, our research will provide in-depth analysis of the leaders in attendance over the time frame. With that being said, we believe our research can prove to be more beneficial. In professional sports, front offices are constantly using cost-benefit analysis to determine whether it is worth it to spend more money in order to improve their team, or whether they should maintain their current roster regardless of talent. While this data does not dive into the metrics of how much they should spend on players, our analysis of the correlation between performance variables and attendance will allow them to see if it’s worth it to consider investing in better players/coaches to improve the team. Additionally, by diving into trends in points, win/loss percentage, offensive ranking, and defensive ranking, they will be able to determine where their teams have been lacking, and address those issues accordingly.
The project description mentions there are thousands of packages in R that allow for analysis past base R, and this project requires a variety of those packages. The packages below are packages that provide the best use and results while cleaning, understanding, and visualizing the data, but as we continue our analysis we anticipate multiple packages being added.
Many of our packages provide different purposes, as some below are involved in the Data Preparation Section while others are planned to be used in the Exploratory Data Analysis section. Packages such as dplyr and tidyverse were used for data cleaning and manipulation to dive deeper into the data, while packages such as ggplot2 and tibble are planned to be used to make good visualizations to allow us to understand the data and results.
While many are packages are standard because of our growing experience and education in R, we look forward to growing our knowledge on new packages and uses of ones we have already worked with. Packages loaded for this project so far are displayed below.
library(dplyr) #for data cleaning
library(tidyverse) #for data cleaning and data manipulation
library(DT) #for creating and viewing data tables
library(knitr) #for dynamic report generation
library(skimr) # for displaying summary statistics
library(ggplot2) # for data visualization and plotting
library(tibble) # for data visualization and data manipulation
Data selected for this project originally came from our professor, Tianhai Zu, among three other datasets. The data is from GitHub. The GitHub page titled ‘NFL Stadium Attendance’ credits Pro Football Reference for the data, specifically their Team Standings page.
This data is made publicly available by Sports Reference, which makes websites for different sports including both basic and sabermetric statistics and resources for sports fans everywhere. Pro Football Reference includes the full statistical history of the NFL including seasons, teams, players, coaches, drafts, and more since 1960.
The data includes data for all teams and games played from the beginning of the 2000-2001 NFL season through the end of the 2019-2020 NFL season. The dataset contains three separate tables:
Any of the three tables could be joined relatively nicely by joining on “year”, “team_name”, and “team”.
Missing data is denoted by NA, but only two variables overall contain NAs. Both variables are necessary to the data, as one is weekly_attendance that is only NA when a team is on its by week (when the team does not play) and the other is tie, when is NA unless there is a tie.
Our first step was to set the working directory and import the three data sets.
setwd("/Users/drewgreiner/Library/Mobile Documents/com~apple~CloudDocs/5th Year/Data Wrangling/Project/Midterm") #Sets working directory
# NFL Attendance Data
attendance <- read.csv("/Users/drewgreiner/Documents/nfl/attendance.csv") #imports attendance table
standings <- read.csv("/Users/drewgreiner/Documents/nfl/standings.csv") #imports standings table
games <- read.csv("/Users/drewgreiner/Documents/nfl/games.csv") #imports games table
# Join the data relatively nicely with dplyr
nfl_attendance <- left_join(attendance, standings, games, by = c("year", "team_name", "team"))
The original attendance dataset contains 10,846 observations and eight variables. In the data, there are two character variables - team and team_name. There also six numeric variables - year, total, home, away, week, weekly_attendance. The data spans from the 2000 season through the 2019 season, and only for weeks 1 through 17 are included, with no playoff attendance data. A view of the original attendance dataset is available below.
# View the attendance data set for the 2000 Arizona Cardinals
datatable(head(attendance, 17))
Below is a data dictionary for the attendance dataset.
| Variable Name | Variable Data Type | Variable Desciption |
|---|---|---|
| team | character | City or state in which the team originates |
| team_name | character | Name or mascot of the team |
| year | integer | Year |
| total | integer | Total attendance per season |
| home | integer | Total attendance at home games per season |
| away | integer | Total attendance at away games per season |
| week | integer | Week in which game was played |
| weekly_attendance | integer | Attendance for given week |
As mentioned above, there are missing values in the attendance dataset in the weekly_attendance column. Also mentioned above, these NA’s make sense as each team has a bye week in the NFL in which they do not play. In order to be consistent with counts and to not skew the data, we have omitted bye weeks below, resulting in 638 fewer observations.
colSums(is.na(attendance)) # Show the number of NA per column in attendance
## team team_name year total
## 0 0 0 0
## home away week weekly_attendance
## 0 0 0 638
attendance <- na.omit(attendance)
colSums(is.na(attendance)) # Show the NA have been omitted
## team team_name year total
## 0 0 0 0
## home away week weekly_attendance
## 0 0 0 0
We also decided to change some of the variable names in the original attendance dataset in order to better define and describe the variables. We made name changes to four variables - team, total, home, away.
attendance <- attendance %>% rename(
team_city = team,
total_season_attendance = total,
season_home_attendance = home,
season_away_attendance = away
)
names(attendance)
## [1] "team_city" "team_name"
## [3] "year" "total_season_attendance"
## [5] "season_home_attendance" "season_away_attendance"
## [7] "week" "weekly_attendance"
The final step in cleaning the attendance dataset is creating two new datasets, one with only weekly attendance information and one with total attendance information (including total_season_attendance, season_home_attendance, season_away_attendance).
weekly_attendance <- attendance[c(1:3, 7:8)] # Removing season totals for attendance
datatable(head(weekly_attendance,16)) # Viewing the new weekly_attendance dataset
total_attendance <- attendance[c(1:6)] # Removing weekly attendance numbers
total_attendance <- total_attendance[!duplicated(attendance), ] # Removing weekly duplicates
datatable(head(total_attendance,31)) # Viewing the new total_attendance dataset
Lastly, you can see a full summary of the cleaned attendance table.
skim(attendance)
| Name | attendance |
| Number of rows | 10208 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| team_city | 0 | 1 | 5 | 13 | 0 | 32 | 0 |
| team_name | 0 | 1 | 4 | 10 | 0 | 32 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2009.53 | 5.75 | 2000 | 2005.0 | 2010 | 2015.00 | 2019 | ▇▇▇▇▇ |
| total_season_attendance | 0 | 1 | 1080910.03 | 72877.18 | 760644 | 1040509.0 | 1081090 | 1123230.00 | 1322087 | ▁▁▇▆▁ |
| season_home_attendance | 0 | 1 | 540455.01 | 66774.84 | 202687 | 504360.0 | 543185 | 578342.00 | 741775 | ▁▁▅▇▁ |
| season_away_attendance | 0 | 1 | 540455.01 | 25509.40 | 450295 | 524974.0 | 541757 | 557741.00 | 601655 | ▁▂▇▇▂ |
| week | 0 | 1 | 9.10 | 4.99 | 1 | 5.0 | 9 | 13.00 | 17 | ▇▅▅▆▇ |
| weekly_attendance | 0 | 1 | 67556.88 | 9022.02 | 23127 | 63245.5 | 68334 | 72544.75 | 105121 | ▁▁▇▃▁ |
The original standings data set contains 6,638 observations and 15 variables. In the data, there are four character variables - team, team_name, playoffs, sb_winner. There are also 11 numeric variables - year, wins, loss, points_for, points_against, points_differential, margin_of_victory, strength_of_schedule, simple_rating, offensive_ranking, defensive_ranking. The data spans from the 2000 season through the 2019 season. A view of the original standings dataset is available below.
# View the standings data set for the 2000 season
datatable(head(standings, 31))
Below is a data dictionary for the standings dataset.
| Variable Name | Variable Data Type | Variable Desciption |
|---|---|---|
| team | character | City or state in which the team originates |
| team_name | character | Name or mascot of the team |
| year | integer | Year |
| wins | integer | Total wins per season (0 to 16) |
| loss | integer | Total losses per season (0 to 16) |
| points_for | integer | Total points the team scored per season |
| points_against | integer | Total points the opponent scored on the team per season |
| points_differential | integer | The difference between the total points for the team and against the team |
| margin_of_victory | numeric | Points differential divided by the total number of games per season |
| strength_of_schedule | numeric | Difficulty of schedule based on opponent records |
| simple_rating | numeric | A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System) |
| offensive_ranking | numeric | A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System) |
| defensive_ranking | numeric | A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System) |
| playoffs | character | Stating whether or not the team made it to the playoffs |
| sb_winner | character | Stating whether or not the team won the Super Bowl for the season |
Next, we wanted to check for any missing values in the columns of the standings dataset. There were no missing value or NA value in the standings dataset, as seen below.
colSums(is.na(standings)) # Show the number of NA per column in standings
## team team_name year
## 0 0 0
## wins loss points_for
## 0 0 0
## points_against points_differential margin_of_victory
## 0 0 0
## strength_of_schedule simple_rating offensive_ranking
## 0 0 0
## defensive_ranking playoffs sb_winner
## 0 0 0
Looking at the last two character variables, we decided to change them to binary variables, and both only have two different values, as shown below.
unique(standings$playoffs) # Show the unique values for playoffs
## [1] "Playoffs" "No Playoffs"
unique(standings$sb_winner) # Show the unique values for sb_winner
## [1] "No Superbowl" "Won Superbowl"
For the playoffs variable, we changed the “Playoffs” value to 1, and “No Playoffs” to 0.
standings$playoffs[standings$playoffs == "Playoffs"] <- "1"
standings$playoffs[standings$playoffs == "No Playoffs"] <- "0"
standings$playoffs <- as.numeric(standings$playoffs) # Changing the binary values to numerics
For the sb_winner variable, we changed the “Won Superbowl” value to 1, and “No Superbowl” to 0.
standings$sb_winner[standings$sb_winner == "Won Superbowl"] <- "1"
standings$sb_winner[standings$sb_winner == "No Superbowl"] <- "0"
standings$sb_winner <- as.numeric(standings$sb_winner) # Changing the binary values to numerics
Below you can see the new updated unique values for playoffs and sb_winner.
unique(standings$playoffs) # Show the new unique values for playoffs
## [1] 1 0
unique(standings$sb_winner) # Show the new unique values for sb_winner
## [1] 0 1
With the data now cleaned after checking for missing values and changing two variables to binaries, a view of the cleaned standings dataset is below.
datatable(head(standings,31)) # Viewing the cleaned standings dataset
Lastly, you can see a full summary of the cleaned standings table.
skim(standings)
| Name | standings |
| Number of rows | 638 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| team | 0 | 1 | 5 | 13 | 0 | 32 | 0 |
| team_name | 0 | 1 | 4 | 10 | 0 | 32 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2009.53 | 5.76 | 2000.0 | 2005.00 | 2010.0 | 2014.75 | 2019.0 | ▇▇▇▇▇ |
| wins | 0 | 1 | 7.98 | 3.08 | 0.0 | 6.00 | 8.0 | 10.00 | 16.0 | ▂▆▇▆▂ |
| loss | 0 | 1 | 7.98 | 3.08 | 0.0 | 6.00 | 8.0 | 10.00 | 16.0 | ▂▆▇▆▂ |
| points_for | 0 | 1 | 350.28 | 71.40 | 161.0 | 299.00 | 348.0 | 396.00 | 606.0 | ▂▇▇▂▁ |
| points_against | 0 | 1 | 350.28 | 59.55 | 165.0 | 310.00 | 347.0 | 391.50 | 517.0 | ▁▃▇▆▁ |
| points_differential | 0 | 1 | 0.00 | 101.09 | -261.0 | -75.00 | 1.5 | 72.75 | 315.0 | ▂▆▇▅▁ |
| margin_of_victory | 0 | 1 | 0.00 | 6.32 | -16.3 | -4.70 | 0.1 | 4.57 | 19.7 | ▂▆▇▅▁ |
| strength_of_schedule | 0 | 1 | 0.00 | 1.63 | -4.6 | -1.10 | 0.0 | 1.20 | 4.3 | ▁▅▇▅▁ |
| simple_rating | 0 | 1 | 0.00 | 6.20 | -17.4 | -4.47 | 0.0 | 4.50 | 20.1 | ▁▆▇▅▁ |
| offensive_ranking | 0 | 1 | 0.00 | 4.34 | -11.7 | -3.18 | 0.0 | 2.70 | 15.9 | ▁▇▇▂▁ |
| defensive_ranking | 0 | 1 | 0.00 | 3.57 | -9.8 | -2.40 | 0.1 | 2.50 | 9.8 | ▁▅▇▅▁ |
| playoffs | 0 | 1 | 0.38 | 0.48 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 | ▇▁▁▁▅ |
| sb_winner | 0 | 1 | 0.03 | 0.17 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 | ▇▁▁▁▁ |
The original games data set contains 5,324 observations and 19 variables. In the data, there are 12 character variables - week, home_team, away_team, winner, tie, day, date, time, home_team_name, home_team_city, away_team_name, away_team_city. There are also seven numeric variables - year, pts_win, pts_loss, yds_win, turnovers_win, yds_loss, turnovers_loss. The data spans from the 2000 season through the 2019 season. A view of the original standings dataset is available below.
# View the games data set for the Week 1 of the 2000 season
datatable(head(games,16))
Below is a data dictionary for the games dataset.
| Variable Name | Variable Data Type | Variable Desciption |
|---|---|---|
| year | integer | Year |
| week | character | Week of the season in which the game was played |
| home_team | character | Home team for the game |
| away_team | character | Away team for the game |
| winner | character | Winner of the game |
| tie | character | Was there a tie? (if so, the other team will be listed in this column) |
| day | character | Day of the week in which the game was played |
| date | character | Date of the game |
| time | character | Time of the day in which the game was played |
| pts_win | integer | Number of points the winning team scored |
| pts_loss | integer | Number of points the losing team scored |
| yds_win | integer | Total number of yards the winning team had |
| turnovers_win | integer | Total number of turnovers the winning team had |
| yds_loss | integer | Total number of yards the losing team had |
| turnovers_loss | integer | Total number of turnovers the losing team had |
| home_team_name | character | Name or mascot of the winning team |
| home_team_city | character | City of the winning team |
| away_team_name | character | Name or mascot of the losing team |
| away_team_city | character | City of the losing team |
Based on looking at the above dataset, we removed the last four columns (home_team_name, home_team_city, away_team_name, away_team_city) as we felt they were repetitive.
names(games) # Viewing the names of the columns in the games dataset
## [1] "year" "week" "home_team" "away_team"
## [5] "winner" "tie" "day" "date"
## [9] "time" "pts_win" "pts_loss" "yds_win"
## [13] "turnovers_win" "yds_loss" "turnovers_loss" "home_team_name"
## [17] "home_team_city" "away_team_name" "away_team_city"
games <- games[-c(16:19)] # Removing the repeitive columns from the games dataset
names(games) # Viewing the names of the columns in the updated games dataset
## [1] "year" "week" "home_team" "away_team"
## [5] "winner" "tie" "day" "date"
## [9] "time" "pts_win" "pts_loss" "yds_win"
## [13] "turnovers_win" "yds_loss" "turnovers_loss"
As mentioned above, there are missing values in the games dataset in the tie column. Also mentioned above, these NA’s make sense as there are very few ties in the NFL. Knowing that tie contains all of the NA, we also must look at the unique values for tie.
colSums(is.na(games)) # Show the number of NA per column in games
## year week home_team away_team winner
## 0 0 0 0 0
## tie day date time pts_win
## 5314 0 0 0 0
## pts_loss yds_win turnovers_win yds_loss turnovers_loss
## 0 0 0 0 0
unique(games$tie) # Viewing the unique values in the tie column in games
## [1] NA "Atlanta Falcons" "Cincinnati Bengals"
## [4] "St. Louis Rams" "Green Bay Packers" "Carolina Panthers"
## [7] "Arizona Cardinals" "Cleveland Browns"
games$winner[games$tie != is.na(games$tie)] <- "Tie" # Entering tie into winner column for ties
games <- games[-c(6)] # Removing the ties column
colSums(is.na(games)) # Checking the number of NA per column in games
## year week home_team away_team winner
## 0 0 0 0 0
## day date time pts_win pts_loss
## 0 0 0 0 0
## yds_win turnovers_win yds_loss turnovers_loss
## 0 0 0 0
The final change needed is changing the week variable from a character variable to a numeric variable. This is done below.
games$week <- as.numeric(games$week) # Changing week to numeric
With the data now cleaned after checking for missing values and changing variables, a view of the cleaned games dataset is below.
datatable(head(games,16)) # Viewing the cleaned games dataset
Lastly, you can see a full summary of the cleaned games table.
skim(games)
| Name | games |
| Number of rows | 5324 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| home_team | 0 | 1 | 13 | 20 | 0 | 34 | 0 |
| away_team | 0 | 1 | 13 | 20 | 0 | 34 | 0 |
| winner | 0 | 1 | 3 | 20 | 0 | 35 | 0 |
| day | 0 | 1 | 3 | 3 | 0 | 7 | 0 |
| date | 0 | 1 | 9 | 12 | 0 | 154 | 0 |
| time | 0 | 1 | 8 | 8 | 0 | 187 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1.00 | 2009.53 | 5.75 | 2000 | 2005 | 2010 | 2015 | 2019 | ▇▇▇▇▇ |
| week | 220 | 0.96 | 9.10 | 4.99 | 1 | 5 | 9 | 13 | 17 | ▇▅▅▆▇ |
| pts_win | 0 | 1.00 | 27.78 | 8.83 | 3 | 21 | 27 | 34 | 62 | ▁▇▇▂▁ |
| pts_loss | 0 | 1.00 | 16.09 | 8.14 | 0 | 10 | 16 | 21 | 51 | ▆▇▅▁▁ |
| yds_win | 0 | 1.00 | 361.64 | 78.58 | 47 | 308 | 361 | 415 | 653 | ▁▂▇▃▁ |
| turnovers_win | 0 | 1.00 | 1.08 | 1.04 | 0 | 0 | 1 | 2 | 7 | ▇▂▁▁▁ |
| yds_loss | 0 | 1.00 | 309.08 | 84.50 | 26 | 251 | 306 | 366 | 613 | ▁▅▇▃▁ |
| turnovers_loss | 0 | 1.00 | 2.17 | 1.42 | 0 | 1 | 2 | 3 | 8 | ▆▇▂▁▁ |
For the discovery of new information in the data, we plan to utilize all three datasets in our analysis, which includes data on NFL attendance, game statistics, and standings. Our examination of attendance data will allow us to understand how standings and team game performance impact NFL attendance. This will allow us to find the following information surrounding attendance and team performance:
offensive_ranking, defensive_ranking, and playoff berth by team over the seasonsWe also plan to analyze how performance variables found in the standings dataset (such as wins, losses, offensive_ranking, defensive_ranking, and playoff berth) correlate to the team’s attendance data over the seasons. We plan to do this by creating a correlation matrix between the variables total (total attendance), wins, loss, offensive_ranking, defensive_ranking, playoffs, and sb_winner. We can also use the correlation functionality in R to find the exact correlation values across the board.
For the initial attendance analysis, plots and tables will be very prevalent in our results. For offensive_ranking, defensive_ranking, playoff, sb_winner, wins, points_for, points_against, and other variables, we plan to output tables showing the frequency of each response in relation to weekly_attendance over time. We will also utilize bar plots to visualize the frequency of each over time. A correlation matrix will be essential for our attendance analysis in visualizing the correlation between various variables, including most variables in our standings dataset. We are also looking to visualize the strength of each division in relation to one another. This could be through a plot of the count of different summary statistics of wins or participation in the playoffs by division.
Most of what is needed for our analysis has been covered in this course or in one of the other courses in R that we have already taken. The main learning that we will need to do before completing the project is a deeper dive into joins. We have an interesting challenge to face: there is not a common variable in all the datasets that can be used as a primary/foreign key (issues in the games data set with home vs away and combined team_name and team_city). We are looking for ways to create a common key using the variables that we have. This will be one of our required learnings. We will also dig a little deeper into plotting inside of R to ensure our plots come out as clean and professional.
The incorporation of linear regression into our analysis will depend on preliminary results of the correlation values when we examine variables’ correlation with attendance. If we find that there is a strong correlation between any of the variables, that would prompt us to create a regression model to find more information on how attendance is typically affected by other variables present. One such correlation we hope will be worth looking into is the correlation between weekly_attendance and some variables in our standings dataset (i.e. wins).