Introduction

The NFL continues to grow in popularity, and understanding the factors that influence attendance at home games is vital for teams looking to increase revenue and attract more fans. In this report, we analyze how game outcomes and team standings affect attendance at home games. We use two data sets, attendance and standings, to find out how each team’s standing and performance during the season relate to its home attendance. First, we clean the data sets, which includes handling missing data and outliers. Then, we join the data sets and analyze the result.

This analysis can help NFL teams by showing which factors are associated with higher attendance, which in turn can help increase revenue. Overall, this can help the NFL create a better experience for fans. The results may also show teams how to prepare for bigger crowds during winning seasons: they may need to hire more security, add more vendors, add more seats, and so on.

Packages

## **Packages Required**
library(tidyverse) ## Easy way to load the core tidyverse packages
library(ggplot2) ## Create visualizations
library(dplyr) ## Manipulate data
library(magrittr) ## Pipe operators
library(gtsummary) ## Descriptive statistics and summary tables

Data Preparation

# Read the data from CSV files
setwd("/Users/mussab/Desktop/Data Managment Class/Week_5/nfl")
standings_df <- read_csv("standings.csv")
games_df <- read_csv("games.csv")
attendance_df <- read_csv("attendance.csv")

Initial Data

The code in this section gives an overview of each data set, including the number of variables and peculiarities such as missing values. The str function shows the structure of the data, and the summary function provides summary statistics for each variable, which helps identify missing values and other characteristics of the data.
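As a minimal sketch of that kind of overview, the snippet below runs str, summary, and a missing-value count on a small made-up data frame. The name `toy_attendance` and all of its values are illustrative assumptions, not rows from attendance.csv.

```r
# Hypothetical stand-in for a few rows of attendance.csv (values are made up)
toy_attendance <- data.frame(
  team = c("Arizona", "Arizona", "Arizona"),
  year = c(2000, 2000, 2000),
  week = c(1, 2, 3),
  weekly_attendance = c(77000, NA, 66000)  # NA stands for a bye week
)

str(toy_attendance)      # column types and a preview of the values
summary(toy_attendance)  # per-column summaries, including NA counts

# Count the missing values in each column
na_counts <- colSums(is.na(toy_attendance))
na_counts
```

The same three calls can be pointed at standings_df or attendance_df once the CSV files have been read in.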

Each file below contains a distinct data set from the NFL Attendance Data. The attendance.csv data set records weekly attendance for each team over the 17-week NFL season, along with home and away attendance totals. In the weekly_attendance column, NA is recorded for a team’s bye week, indicating that the team did not play that week.

The games.csv data set records, for each game, the points scored by the winning and losing teams and the total yards each team gained. We decided not to use games.csv in this analysis because it records individual game details, and the information it provides overlaps with what is in standings.csv. Using standings.csv lets us analyze season-long team performance instead of game-by-game results.

As in attendance.csv, an NA value means that no game was played that week.

The standings.csv file shows the number of wins and losses for each NFL team in each season, along with the margin of victory, point differential, and whether the team made the playoffs.

The purpose of these data sets is to provide insight into attendance patterns and into whether a team’s standing affects its home game attendance.

Cleaning Data

# Standings Data
standings_df %>% select(!c(team_name, team)) %>% tbl_summary() # Descriptive summary of each variable, including the number of observations
Characteristic N = 638¹
year 2,010.0 (2,005.0, 2,014.8)
wins 8.0 (6.0, 10.0)
loss 8.0 (6.0, 10.0)
points_for 348 (299, 396)
points_against 347 (310, 392)
points_differential 2 (-75, 73)
margin_of_victory 0 (-5, 5)
strength_of_schedule 0.00 (-1.10, 1.20)
simple_rating 0 (-4, 5)
offensive_ranking 0.0 (-3.2, 2.7)
defensive_ranking 0.1 (-2.4, 2.5)
playoffs
    No Playoffs 398 (62%)
    Playoffs 240 (38%)
sb_winner
    No Superbowl 618 (97%)
    Won Superbowl 20 (3.1%)
¹ Median (IQR); n (%)
ncol(standings_df) # Number of variables in standings_df
## [1] 15
ncol(attendance_df) # Number of variables in attendance_df
## [1] 8
# Attendance Data
attendance_df %>% select(!c(team_name)) %>% tbl_summary() # Descriptive summary of each variable, including the number of observations
Characteristic N = 10,846¹
team
    Arizona 340 (3.1%)
    Atlanta 340 (3.1%)
    Baltimore 340 (3.1%)
    Buffalo 340 (3.1%)
    Carolina 340 (3.1%)
    Chicago 340 (3.1%)
    Cincinnati 340 (3.1%)
    Cleveland 340 (3.1%)
    Dallas 340 (3.1%)
    Denver 340 (3.1%)
    Detroit 340 (3.1%)
    Green Bay 340 (3.1%)
    Houston 306 (2.8%)
    Indianapolis 340 (3.1%)
    Jacksonville 340 (3.1%)
    Kansas City 340 (3.1%)
    Los Angeles 119 (1.1%)
    Miami 340 (3.1%)
    Minnesota 340 (3.1%)
    New England 340 (3.1%)
    New Orleans 340 (3.1%)
    New York 680 (6.3%)
    Oakland 340 (3.1%)
    Philadelphia 340 (3.1%)
    Pittsburgh 340 (3.1%)
    San Diego 289 (2.7%)
    San Francisco 340 (3.1%)
    Seattle 340 (3.1%)
    St. Louis 272 (2.5%)
    Tampa Bay 340 (3.1%)
    Tennessee 340 (3.1%)
    Washington 340 (3.1%)
year 2,010.0 (2,005.0, 2,015.0)
total 1,081,090 (1,040,509, 1,123,230)
home 543,185 (504,360, 578,342)
away 541,757 (524,974, 557,741)
week 9.0 (5.0, 13.0)
weekly_attendance 68,334 (63,246, 72,545)
    Unknown 638
¹ n (%); Median (IQR)

The standings data set has 638 observations and 15 variables. The attendance data set has 10,846 observations and 8 variables.

As the attendance summary shows, some teams have fewer observations than others. This happens because some teams started play after 2000, as the Houston Texans did, and others relocated: the San Diego Chargers became the Los Angeles Chargers, and the St. Louis Rams became the Los Angeles Rams.
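One way to see which teams have shorter histories is to count rows per team and look at the first and last season each team appears in. The sketch below uses a made-up data frame (`toy_attendance` and its values are assumptions for illustration only).

```r
library(dplyr)

# Hypothetical rows: one team present from 2000 and one that appears later
toy_attendance <- data.frame(
  team = c(rep("Arizona", 4), rep("Houston", 2)),
  year = c(2000, 2000, 2001, 2001, 2002, 2002),
  week = c(1, 2, 1, 2, 1, 2)
)

# Rows per team, plus the first and last season in which each team appears
team_coverage <- toy_attendance %>%
  group_by(team) %>%
  summarise(
    n_rows = n(),
    first_year = min(year),
    last_year = max(year),
    .groups = "drop"
  ) %>%
  arrange(n_rows)

team_coverage
```

Teams with a late first_year or an early last_year are the ones with fewer observations.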

colSums(is.na(attendance_df))
## combining team and team_name variables
attendance_df <- attendance_df %>% 
  mutate(team_name = paste(team, team_name, sep = " ")) %>%
  select(-team)

# Collapse the weekly rows into one row per team and year, averaging weekly attendance
attendance_df <- attendance_df %>%
  group_by(team_name, year, total, home, away) %>%
  summarise(weekly_attendance = mean(weekly_attendance, na.rm = TRUE),
            .groups = "drop")

## combining team and team_name variables
standings_df <- standings_df %>% 
  mutate(team_name = paste(team, team_name, sep = " ")) %>%
  select(-team)


# Remove any remaining duplicate rows (the result must be assigned to take effect)
attendance_df <- distinct(attendance_df)

# Remove rows with missing values
standings_df <- standings_df %>% na.omit()

In the initial attendance data set, the only missing values were in the weekly_attendance column, which is expected because an NA there means no game was played during that week.

We decided to reshape the attendance data set. Instead of keeping the attendance for every week, we made the data reflect the average weekly attendance per team per year. This change supports analysis of attendance trends over time and gives us a more manageable data set.

There were no missing values in the initial standings data set. In both data sets, we combined the team and team_name columns to remove redundancy.
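The reshaping described here can be sketched on a tiny made-up example. `toy_weekly` and its values are assumptions for illustration; the bye week is an NA and is skipped by na.rm = TRUE.

```r
library(dplyr)

# Hypothetical weekly rows for one team (values are made up)
toy_weekly <- data.frame(
  team_name = rep("Arizona Cardinals", 4),
  year = c(2000, 2000, 2000, 2001),
  week = c(1, 2, 3, 1),
  weekly_attendance = c(60000, NA, 70000, 64000)  # NA is a bye week
)

# One row per team and year, holding the average weekly attendance
toy_yearly <- toy_weekly %>%
  group_by(team_name, year) %>%
  summarise(
    weekly_attendance = mean(weekly_attendance, na.rm = TRUE),
    .groups = "drop"
  )

toy_yearly
```

Without na.rm = TRUE, any season containing a bye week would average to NA.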

# checking missing values
colSums(is.na(standings_df))
##            team_name                 year                 wins 
##                    0                    0                    0 
##                 loss           points_for       points_against 
##                    0                    0                    0 
##  points_differential    margin_of_victory strength_of_schedule 
##                    0                    0                    0 
##        simple_rating    offensive_ranking    defensive_ranking 
##                    0                    0                    0 
##             playoffs            sb_winner 
##                    0                    0
# Merge the standings and attendance data on team name and year
merged_data <- standings_df %>% inner_join(attendance_df, by = c("team_name", "year"))

We merged the standings and attendance data sets on team_name and year so that each row pairs a team’s season performance with its attendance for that season.
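Because an inner join silently drops rows whose keys do not match, it can be worth checking for unmatched team and year pairs. The sketch below shows the idea on made-up data; the data frames and their values are assumptions, not rows from the real files.

```r
library(dplyr)

# Hypothetical standings and attendance rows (values are made up)
toy_standings <- data.frame(
  team_name = c("Dallas Cowboys", "Houston Texans", "Miami Dolphins"),
  year = c(2005, 2005, 2005),
  wins = c(9, 2, 9)
)
toy_attendance <- data.frame(
  team_name = c("Dallas Cowboys", "Houston Texans"),
  year = c(2005, 2005),
  home = c(500000, 560000)
)

# Rows present in both tables
toy_merged <- inner_join(toy_standings, toy_attendance,
                         by = c("team_name", "year"))

# Standings rows with no matching attendance row
unmatched <- anti_join(toy_standings, toy_attendance,
                       by = c("team_name", "year"))

nrow(toy_merged)
unmatched
```

An empty result from anti_join suggests that the pasted team names line up across both data sets.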

Clean Dataset (First 10 Rows)

head(standings_df, 10)
head(attendance_df, 10)

Summary About Variables

summary(standings_df[c( "wins", "loss", "points_for", "points_against", "margin_of_victory", "playoffs", "offensive_ranking")])
##       wins             loss          points_for    points_against 
##  Min.   : 0.000   Min.   : 0.000   Min.   :161.0   Min.   :165.0  
##  1st Qu.: 6.000   1st Qu.: 6.000   1st Qu.:299.0   1st Qu.:310.0  
##  Median : 8.000   Median : 8.000   Median :348.0   Median :347.0  
##  Mean   : 7.984   Mean   : 7.984   Mean   :350.3   Mean   :350.3  
##  3rd Qu.:10.000   3rd Qu.:10.000   3rd Qu.:396.0   3rd Qu.:391.5  
##  Max.   :16.000   Max.   :16.000   Max.   :606.0   Max.   :517.0  
##  margin_of_victory      playoffs         offensive_ranking   
##  Min.   :-16.300000   Length:638         Min.   :-11.700000  
##  1st Qu.: -4.700000   Class :character   1st Qu.: -3.175000  
##  Median :  0.100000   Mode  :character   Median :  0.000000  
##  Mean   : -0.001881                      Mean   : -0.000157  
##  3rd Qu.:  4.575000                      3rd Qu.:  2.700000  
##  Max.   : 19.700000                      Max.   : 15.900000

The standings data set provides information on each team’s yearly performance. It includes essential data points such as the number of wins and losses, which are key indicators of team success. It also offers other important statistics, such as the offensive ranking, which indicates the quality of a team’s offensive performance, and whether a team made the playoffs. Finally, it provides the margin of victory for each team in a given year, which shows how dominant the team was. Together, these statistics describe each team’s overall performance.

summary(attendance_df[c("total", "home", "away", "weekly_attendance")])
##      total              home             away        weekly_attendance
##  Min.   : 760644   Min.   :202687   Min.   :450295   Min.   :47540    
##  1st Qu.:1040611   1st Qu.:504405   1st Qu.:524983   1st Qu.:65038    
##  Median :1081090   Median :543185   Median :541757   Median :67568    
##  Mean   :1080910   Mean   :540455   Mean   :540455   Mean   :67557    
##  3rd Qu.:1123187   3rd Qu.:578339   3rd Qu.:557700   3rd Qu.:70199    
##  Max.   :1322087   Max.   :741775   Max.   :601655   Max.   :82630

The attendance data set contains information about attendance at NFL games. It provides each team’s average weekly attendance and the total number of fans who attended its home games.

Both data sets span the years 2000-2019.

How We Plan To Analyze Our Data

We think data visualization is the best way to answer our question, using bar charts, box plots, or histograms. We plan to combine the separate data frames to compare and analyze the data. For example, we plan to merge the standings and attendance data frames to analyze how a team’s performance in the standings correlates with attendance. This will let us explore how offensive performance and margin of victory relate to attendance. We plan to analyze the data by team and see how attendance changes with specific variables and over time.

We plan to use histograms, bar charts, and scatter plots to illustrate our question. This will help us find trends and correlations between variables.
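Alongside the plots, a correlation coefficient gives a single number for how strongly two variables move together. The sketch below is only an illustration on made-up values; `toy_merged` is a hypothetical stand-in for the merged data.

```r
# Hypothetical team seasons (values are made up for illustration)
toy_merged <- data.frame(
  wins = c(4, 6, 8, 10, 12),
  weekly_attendance = c(64000, 66000, 65000, 69000, 70000)
)

# Pearson correlation between wins and average weekly attendance
wins_attendance_cor <- cor(toy_merged$wins, toy_merged$weekly_attendance)
wins_attendance_cor
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.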

Data Visualizations

All of these visualizations use data from the years 2000-2019.

# Average total home attendance for each combination of wins and playoff status
average_attendance <- merged_data %>%
  group_by(wins, playoffs) %>%
  summarise(average_home_attendance = mean(home), .groups = "drop")

ggplot(average_attendance, aes(x = wins, y = average_home_attendance, col = playoffs)) +
  geom_point() +
  labs(title = "Figure 1: Relationship between Wins and Attendance",
       x = "Wins",
       y = "Average Total Home Attendance")

This scatter plot shows that teams with more wins tend to have higher attendance. One outlier is the 0-win team, a record that rarely occurs in the NFL. The data suggest that the more a team wins, the more its attendance increases, though only slightly.
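To put a number on a slight increase like this, one option is a simple linear regression of attendance on wins. The sketch below fits one to made-up values; `toy_merged` is hypothetical, and the slope it produces says nothing about the real data.

```r
# Hypothetical team seasons (values are made up for illustration)
toy_merged <- data.frame(
  wins = c(4, 6, 8, 10, 12),
  weekly_attendance = c(64000, 66000, 65000, 69000, 70000)
)

# Simple linear regression: average weekly attendance as a function of wins
fit <- lm(weekly_attendance ~ wins, data = toy_merged)

# The slope estimates the change in attendance per additional win
fans_per_win <- coef(fit)[["wins"]]
fans_per_win
```

Running the same model on merged_data would give an estimate in fans per additional win, which is easier to interpret than the scatter plot alone.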

Furthermore, we decided to dive deeper and analyze how the team’s performance may affect the attendance rates.

# Treat wins as a category so that each win total gets its own box
ggplot(merged_data, aes(x = factor(wins), y = weekly_attendance, color = playoffs)) +
  geom_boxplot() +
  labs(title = "Figure 2: Relationship between Wins and Weekly Attendance",
       x = "Wins",
       y = "Average Weekly Attendance") +
  scale_y_continuous(breaks = seq(0, 90000, by = 5000),  # Breaks at intervals of 5,000
                     labels = scales::comma_format(scale = 1))  # Format labels with commas

The box plot above illustrates the relationship between average weekly attendance and the number of wins. It suggests that attendance at home games decreases as the number of games lost increases.


# Average total home attendance by playoff status
average_attendance2 <- merged_data %>%
  group_by(playoffs) %>%
  summarise(average_home_attendance = mean(home))

ggplot(average_attendance2, aes(x = playoffs, y = average_home_attendance, fill = playoffs)) +
  geom_col(na.rm = TRUE) +
  geom_text(aes(label = round(average_home_attendance, 1)),
            vjust = -0.4, hjust = .5) +  # Place the text labels just above the bars
  labs(title = "Figure 3: How Making the Playoffs Affects Home Game Attendance",
       x = "Playoffs",
       y = "Average Home Attendance") +
  scale_y_continuous(breaks = seq(0, 600000, by = 50000),  # Breaks at intervals of 50,000
                     labels = scales::comma_format(scale = 1))  # Format labels with commas

We then looked at how making or missing the playoffs affects home attendance. As the chart shows, teams that made the playoffs had slightly higher attendance, but the difference is not big enough to support a definitive conclusion. Next, we looked at how in-game performance may affect attendance.
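When two group averages are this close, a two-sample test is one way to judge whether the gap could be due to chance. The sketch below runs a Welch t-test on made-up values; `toy_merged` is hypothetical, and this test is a suggestion rather than part of the analysis above.

```r
# Hypothetical season totals of home attendance (values are made up)
toy_merged <- data.frame(
  playoffs = rep(c("Playoffs", "No Playoffs"), each = 5),
  home = c(560000, 548000, 571000, 539000, 552000,
           541000, 530000, 556000, 528000, 545000)
)

# Welch two-sample t-test comparing mean home attendance between the groups
playoff_test <- t.test(home ~ playoffs, data = toy_merged)

playoff_test$estimate  # the two group means
playoff_test$p.value   # a small p-value suggests the gap is unlikely to be chance
```

Seasons from the same team are not independent, so the p-value from such a test on the real data would only be a rough guide.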

We decided to see how offensive efficiency may affect attendance rates, as people love watching high-scoring games these days.

# Average total home attendance for each offensive ranking value
average_attendance2 <- merged_data %>%
  group_by(offensive_ranking) %>%
  summarise(average_home_attendance = mean(home))

ggplot(average_attendance2, aes(x = offensive_ranking, y = average_home_attendance)) +
  geom_col() +
  scale_x_continuous("Offensive Ranking") +
  scale_y_continuous("Average Home Game Attendance", labels = scales::comma_format()) +
  ggtitle("Figure 4: How Offensive Ranking Affects Home Attendance")

From this graph, we learned that offensive ranking did not appear to affect home attendance. So, we graphed a bar chart of average home attendance by team from 2000-2019 to see which teams were at the top.
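Because offensive_ranking is a continuous value, grouping by it directly produces roughly one bar per team season. As a cross-check, the values can be binned into a few ranges first. The sketch below does this on made-up data; `toy_merged` and the bin edges are assumptions chosen for illustration.

```r
library(dplyr)

# Hypothetical team seasons (values are made up for illustration)
toy_merged <- data.frame(
  offensive_ranking = c(-6.1, -2.4, -0.3, 0.8, 3.5, 7.2),
  home = c(520000, 540000, 535000, 545000, 550000, 530000)
)

# Bin the continuous ranking, then average home attendance within each bin
binned_attendance <- toy_merged %>%
  mutate(offense_bin = cut(offensive_ranking,
                           breaks = c(-Inf, -3, 0, 3, Inf),
                           labels = c("below -3", "-3 to 0", "0 to 3", "above 3"))) %>%
  group_by(offense_bin) %>%
  summarise(average_home_attendance = mean(home),
            n_seasons = n(),
            .groups = "drop")

binned_attendance
```

Plotting the binned averages gives a handful of bars that are easier to compare than hundreds of narrow ones.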

### Comparing points scored (points_for) with total home attendance (home)
ggplot(merged_data, aes(x = points_for, y = home)) +
  geom_point(alpha = .5) +
  labs(title = "Figure 5: Relationship between points_for and Home Attendance",
       x = "points_for",
       y = "home")

## Find the average home attendance by team
average_attendance2 <- merged_data %>%
  group_by(team_name) %>%
  summarise(home1 = mean(home))
## Arrange the data from highest to lowest attendance
average_attendance2 <- average_attendance2 %>%
  arrange(-home1)

## Average wins per team from 2000-2019
average_attendance4 <- merged_data %>%
  group_by(team_name) %>%
  summarise(win1 = mean(wins))

merged_data1 <- average_attendance2 %>% inner_join(average_attendance4, by = c("team_name"))

ggplot(merged_data1, aes(y = reorder(team_name, home1), x = home1, fill = win1)) +
  geom_col() +
  geom_text(aes(label = round(win1, 1)), vjust = .5, hjust = -.3) +  # Label each bar with average wins
  scale_fill_gradient(limits = c(6, 12), low = "blue", high = "red") +
  labs(title = "Figure 6: Relationship Between Teams' Attendance and Their Winning Performance",
       x = "Average Home Attendance",
       y = "Team") +
  scale_x_continuous(labels = scales::comma_format())

The data show that teams in big markets like New York, Dallas, and Washington, D.C. consistently draw the largest crowds for their home games. Interestingly, the main factor in attendance appears to be market size rather than the team’s on-field performance. While winning may slightly increase attendance, the most important factor appears to be how big a city the team plays in.

Summary

We analyzed how game outcomes and team standings in the NFL affect attendance at home games. We used two data sets, attendance and standings, to find out how each team’s standing and performance during the season relate to its home attendance. First, we cleaned the data sets, which included handling missing data and outliers. Then we joined the data sets and analyzed the result.

The overall insight from this data is that a team’s strong performance may increase its attendance at home games. NFL teams may see an increase of roughly 5,000 to 12,000 fans at home games over the whole season, which is an improvement but only a small one for a sport as big as the NFL. We also found that the teams at the top in attendance were from big cities such as New York and Dallas.

This gives NFL teams insight into other ways to grow their fanbase beyond winning more, for example making their stadiums more accessible or being more active on social media. Overall, the data showed little correlation between how a team performed in games and attendance at its home games.

One limitation is that the data sets did not cover other factors that may affect attendance at NFL games, such as social media and TV. Including them would have let us compare factors and see which has the greatest impact on attendance.