Introduction

The problem we are addressing is how NFl game outcomes and team standings affect attendance at home games. This is interesting because we are all sports fan and love watching the NFL.

Methodolgy

We plan plan looking at the 3 data sets which are the attendance, standings, and games to find out how each teams standings and the outcomes of games affect attendances at home games. First we will clean out the data sets such as handling the missing data, and any outliers, and more. Then we will join the the data sets together so we can analyze it.

Why this is Important?

This will help the NFl teams because our analysis will show teams what will increase there attendance rates which will help increase there revenue. Overall this will help the NFl create a better experience for fans.

Packages

## **Packages Required**
library(tidyverse)

# Read the data from CSV files
setwd("/Users/mussab/Desktop/Data Managment Class/Week_5/nfl")
standings_df <- read_csv("standings.csv")
games_df <- read_csv("games.csv")
attendance_df <- read_csv("attendance.csv")

Each package below contains a unique data set from NFL Attendance Data. The attendance.csv data set displays weekly attendance numbers for each team’s city throughout the 17-week NFL season, including both home and away attendance. In the weekly_attendance column, NA is recorded for certain weeks when there is a bye week, indicating that the team did not have a game during that week.

The games.csv data set displays the points scored by the winning team and the points scored by the losing team, the amount of points scored for each team, and total yards each team has gained. Similar to attendance.csv, the NA in the column for means that no game was played that week.

The standings.csv file shows the number of wins and losses for each NFL team, including the margin of victory, points differential, and whether or not the team has made the playoffs.

The purpose of each package is to research and provide insights into attendance patterns and whether the outcome of a team’s standings affects home game attendance

We are loading the tidyverse package because it will allows read the files, clean the files and also analyze the files by creating different type of graphs and plots.

Initial Data

This code will provide you with an overview of each data set, including the number of variables and peculiarities related to missing values. The str function shows the data structure, and the summary function provides summary statistics for the variables, which can help you identify missing values and other characteristics of the data.

Cleaning Data

# Standings Data
str(standings_df)  # Display the structure of the data
## spc_tbl_ [638 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ team                : chr [1:638] "Miami" "Indianapolis" "New York" "Buffalo" ...
##  $ team_name           : chr [1:638] "Dolphins" "Colts" "Jets" "Bills" ...
##  $ year                : num [1:638] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ wins                : num [1:638] 11 10 9 8 5 13 12 9 7 4 ...
##  $ loss                : num [1:638] 5 6 7 8 11 3 4 7 9 12 ...
##  $ points_for          : num [1:638] 323 429 321 315 276 346 333 321 367 185 ...
##  $ points_against      : num [1:638] 226 326 321 350 338 191 165 255 327 359 ...
##  $ points_differential : num [1:638] 97 103 0 -35 -62 155 168 66 40 -174 ...
##  $ margin_of_victory   : num [1:638] 6.1 6.4 0 -2.2 -3.9 9.7 10.5 4.1 2.5 -10.9 ...
##  $ strength_of_schedule: num [1:638] 1 1.5 3.5 2.2 1.4 -1.3 -2.5 -0.2 -1.4 0.4 ...
##  $ simple_rating       : num [1:638] 7.1 7.9 3.5 0 -2.5 8.3 8 3.9 1.1 -10.5 ...
##  $ offensive_ranking   : num [1:638] 0 7.1 1.4 0.5 -2.7 1.5 0 0.6 3.2 -8.1 ...
##  $ defensive_ranking   : num [1:638] 7.1 0.8 2.2 -0.5 0.2 6.8 8 3.3 -2.1 -2.4 ...
##  $ playoffs            : chr [1:638] "Playoffs" "Playoffs" "No Playoffs" "No Playoffs" ...
##  $ sb_winner           : chr [1:638] "No Superbowl" "No Superbowl" "No Superbowl" "No Superbowl" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   team = col_character(),
##   ..   team_name = col_character(),
##   ..   year = col_double(),
##   ..   wins = col_double(),
##   ..   loss = col_double(),
##   ..   points_for = col_double(),
##   ..   points_against = col_double(),
##   ..   points_differential = col_double(),
##   ..   margin_of_victory = col_double(),
##   ..   strength_of_schedule = col_double(),
##   ..   simple_rating = col_double(),
##   ..   offensive_ranking = col_double(),
##   ..   defensive_ranking = col_double(),
##   ..   playoffs = col_character(),
##   ..   sb_winner = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(standings_df)  # Generate a summary of the data
##      team            team_name              year           wins       
##  Length:638         Length:638         Min.   :2000   Min.   : 0.000  
##  Class :character   Class :character   1st Qu.:2005   1st Qu.: 6.000  
##  Mode  :character   Mode  :character   Median :2010   Median : 8.000  
##                                        Mean   :2010   Mean   : 7.984  
##                                        3rd Qu.:2015   3rd Qu.:10.000  
##                                        Max.   :2019   Max.   :16.000  
##       loss          points_for    points_against  points_differential
##  Min.   : 0.000   Min.   :161.0   Min.   :165.0   Min.   :-261.00    
##  1st Qu.: 6.000   1st Qu.:299.0   1st Qu.:310.0   1st Qu.: -75.00    
##  Median : 8.000   Median :348.0   Median :347.0   Median :   1.50    
##  Mean   : 7.984   Mean   :350.3   Mean   :350.3   Mean   :   0.00    
##  3rd Qu.:10.000   3rd Qu.:396.0   3rd Qu.:391.5   3rd Qu.:  72.75    
##  Max.   :16.000   Max.   :606.0   Max.   :517.0   Max.   : 315.00    
##  margin_of_victory    strength_of_schedule simple_rating    
##  Min.   :-16.300000   Min.   :-4.600000    Min.   :-17.400  
##  1st Qu.: -4.700000   1st Qu.:-1.100000    1st Qu.: -4.475  
##  Median :  0.100000   Median : 0.000000    Median :  0.000  
##  Mean   : -0.001881   Mean   : 0.001097    Mean   :  0.000  
##  3rd Qu.:  4.575000   3rd Qu.: 1.200000    3rd Qu.:  4.500  
##  Max.   : 19.700000   Max.   : 4.300000    Max.   : 20.100  
##  offensive_ranking    defensive_ranking     playoffs          sb_winner        
##  Min.   :-11.700000   Min.   :-9.800000   Length:638         Length:638        
##  1st Qu.: -3.175000   1st Qu.:-2.400000   Class :character   Class :character  
##  Median :  0.000000   Median : 0.100000   Mode  :character   Mode  :character  
##  Mean   : -0.000157   Mean   :-0.001097                                        
##  3rd Qu.:  2.700000   3rd Qu.: 2.500000                                        
##  Max.   : 15.900000   Max.   : 9.800000
# Games Data
str(games_df)  # Display the structure of the data
## spc_tbl_ [5,324 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ year          : num [1:5324] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ week          : chr [1:5324] "1" "1" "1" "1" ...
##  $ home_team     : chr [1:5324] "Minnesota Vikings" "Kansas City Chiefs" "Washington Redskins" "Atlanta Falcons" ...
##  $ away_team     : chr [1:5324] "Chicago Bears" "Indianapolis Colts" "Carolina Panthers" "San Francisco 49ers" ...
##  $ winner        : chr [1:5324] "Minnesota Vikings" "Indianapolis Colts" "Washington Redskins" "Atlanta Falcons" ...
##  $ tie           : chr [1:5324] NA NA NA NA ...
##  $ day           : chr [1:5324] "Sun" "Sun" "Sun" "Sun" ...
##  $ date          : chr [1:5324] "September 3" "September 3" "September 3" "September 3" ...
##  $ time          : 'hms' num [1:5324] 13:00:00 13:00:00 13:01:00 13:02:00 ...
##   ..- attr(*, "units")= chr "secs"
##  $ pts_win       : num [1:5324] 30 27 20 36 16 27 21 14 21 41 ...
##  $ pts_loss      : num [1:5324] 27 14 17 28 0 7 16 10 16 14 ...
##  $ yds_win       : num [1:5324] 374 386 396 359 336 398 296 187 395 425 ...
##  $ turnovers_win : num [1:5324] 1 2 0 1 0 0 1 2 2 3 ...
##  $ yds_loss      : num [1:5324] 425 280 236 339 223 249 278 252 355 167 ...
##  $ turnovers_loss: num [1:5324] 1 1 1 1 1 1 1 3 4 2 ...
##  $ home_team_name: chr [1:5324] "Vikings" "Chiefs" "Redskins" "Falcons" ...
##  $ home_team_city: chr [1:5324] "Minnesota" "Kansas City" "Washington" "Atlanta" ...
##  $ away_team_name: chr [1:5324] "Bears" "Colts" "Panthers" "49ers" ...
##  $ away_team_city: chr [1:5324] "Chicago" "Indianapolis" "Carolina" "San Francisco" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   year = col_double(),
##   ..   week = col_character(),
##   ..   home_team = col_character(),
##   ..   away_team = col_character(),
##   ..   winner = col_character(),
##   ..   tie = col_character(),
##   ..   day = col_character(),
##   ..   date = col_character(),
##   ..   time = col_time(format = ""),
##   ..   pts_win = col_double(),
##   ..   pts_loss = col_double(),
##   ..   yds_win = col_double(),
##   ..   turnovers_win = col_double(),
##   ..   yds_loss = col_double(),
##   ..   turnovers_loss = col_double(),
##   ..   home_team_name = col_character(),
##   ..   home_team_city = col_character(),
##   ..   away_team_name = col_character(),
##   ..   away_team_city = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(games_df)  # Generate a summary of the data
##       year          week            home_team          away_team        
##  Min.   :2000   Length:5324        Length:5324        Length:5324       
##  1st Qu.:2005   Class :character   Class :character   Class :character  
##  Median :2010   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2010                                                           
##  3rd Qu.:2015                                                           
##  Max.   :2019                                                           
##     winner              tie                day                date          
##  Length:5324        Length:5324        Length:5324        Length:5324       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      time             pts_win         pts_loss        yds_win     
##  Length:5324       Min.   : 3.00   Min.   : 0.00   Min.   : 47.0  
##  Class1:hms        1st Qu.:21.00   1st Qu.:10.00   1st Qu.:308.0  
##  Class2:difftime   Median :27.00   Median :16.00   Median :361.0  
##  Mode  :numeric    Mean   :27.78   Mean   :16.09   Mean   :361.6  
##                    3rd Qu.:34.00   3rd Qu.:21.00   3rd Qu.:415.0  
##                    Max.   :62.00   Max.   :51.00   Max.   :653.0  
##  turnovers_win     yds_loss     turnovers_loss  home_team_name    
##  Min.   :0.00   Min.   : 26.0   Min.   :0.000   Length:5324       
##  1st Qu.:0.00   1st Qu.:251.0   1st Qu.:1.000   Class :character  
##  Median :1.00   Median :306.0   Median :2.000   Mode  :character  
##  Mean   :1.08   Mean   :309.1   Mean   :2.168                     
##  3rd Qu.:2.00   3rd Qu.:366.0   3rd Qu.:3.000                     
##  Max.   :7.00   Max.   :613.0   Max.   :8.000                     
##  home_team_city     away_team_name     away_team_city    
##  Length:5324        Length:5324        Length:5324       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
## 
# Attendance Data
str(attendance_df)  # Display the structure of the data
## spc_tbl_ [10,846 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ team             : chr [1:10846] "Arizona" "Arizona" "Arizona" "Arizona" ...
##  $ team_name        : chr [1:10846] "Cardinals" "Cardinals" "Cardinals" "Cardinals" ...
##  $ year             : num [1:10846] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ total            : num [1:10846] 893926 893926 893926 893926 893926 ...
##  $ home             : num [1:10846] 387475 387475 387475 387475 387475 ...
##  $ away             : num [1:10846] 506451 506451 506451 506451 506451 ...
##  $ week             : num [1:10846] 1 2 3 4 5 6 7 8 9 10 ...
##  $ weekly_attendance: num [1:10846] 77434 66009 NA 71801 66985 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   team = col_character(),
##   ..   team_name = col_character(),
##   ..   year = col_double(),
##   ..   total = col_double(),
##   ..   home = col_double(),
##   ..   away = col_double(),
##   ..   week = col_double(),
##   ..   weekly_attendance = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(attendance_df)  # Generate a summary of the data
##      team            team_name              year          total        
##  Length:10846       Length:10846       Min.   :2000   Min.   : 760644  
##  Class :character   Class :character   1st Qu.:2005   1st Qu.:1040509  
##  Mode  :character   Mode  :character   Median :2010   Median :1081090  
##                                        Mean   :2010   Mean   :1080910  
##                                        3rd Qu.:2015   3rd Qu.:1123230  
##                                        Max.   :2019   Max.   :1322087  
##                                                                        
##       home             away             week    weekly_attendance
##  Min.   :202687   Min.   :450295   Min.   : 1   Min.   : 23127   
##  1st Qu.:504360   1st Qu.:524974   1st Qu.: 5   1st Qu.: 63246   
##  Median :543185   Median :541757   Median : 9   Median : 68334   
##  Mean   :540455   Mean   :540455   Mean   : 9   Mean   : 67557   
##  3rd Qu.:578342   3rd Qu.:557741   3rd Qu.:13   3rd Qu.: 72545   
##  Max.   :741775   Max.   :601655   Max.   :17   Max.   :105121   
##                                                 NA's   :638
#Checking for missing values
colSums(is.na(games_df)) 
##           year           week      home_team      away_team         winner 
##              0              0              0              0              0 
##            tie            day           date           time        pts_win 
##           5314              0              0              0              0 
##       pts_loss        yds_win  turnovers_win       yds_loss turnovers_loss 
##              0              0              0              0              0 
## home_team_name home_team_city away_team_name away_team_city 
##              0              0              0              0
## Removing columns
games_df <- games_df %>%
  select(-home_team_name, -home_team_city, -away_team_name, -away_team_city, -time, -date, -day)

# Create a new variable
games_df <- games_df %>%
  mutate(home_win = ifelse(winner == home_team, 1, 0))

In our initial dataset, the only missing data in the tie column, which is expected as it represents whether the game ended in a tie or not. Since our analysis does not require information about tied games, this missing data isn’t a problem.

We decided to remove certain columns from the dataset as they contained repetitive information. Specifically, we eliminated the home_team_city, away_team_name, and away_team_city columns. These columns were redundant because we already had columns that explicitly indicated the home and away teams. We also removed the time, date, and day columns from the dataset, as they were irrelevant for the specific problem we intend to analyze.

colSums(is.na(attendance_df))
##              team         team_name              year             total 
##                 0                 0                 0                 0 
##              home              away              week weekly_attendance 
##                 0                 0                 0               638
## combining team and team_name variables
attendance_df <- attendance_df %>% 
  mutate(team_name = paste(team, team_name, sep = " ")) %>%
  select(-team)

# Remove duplicate rows
attendance_df <- attendance_df %>%
  group_by(team_name, year,total,home,away) %>%
  mutate(weekly_attendance = mean(weekly_attendance, na.rm = TRUE)) %>%
  select(-week) %>%
 distinct(weekly_attendance)

In the initial dataset, the only missing value was in the weekly attendance column, which is expected since it tells us that there was no game played during that specific week.

We decided to change the dataset. Instead of representing the weekly attendance rate for every week, we made the data reflect the average weekly attendance per year. This change allows analysis of attendance trends over time and provides us witha more manageable dataset.

# checking missing values
colSums(is.na(standings_df))
##                 team            team_name                 year 
##                    0                    0                    0 
##                 wins                 loss           points_for 
##                    0                    0                    0 
##       points_against  points_differential    margin_of_victory 
##                    0                    0                    0 
## strength_of_schedule        simple_rating    offensive_ranking 
##                    0                    0                    0 
##    defensive_ranking             playoffs            sb_winner 
##                    0                    0                    0
 ## combining team and team_name variables
standings_df <- standings_df %>% 
  mutate(team_name = paste(team, team_name, sep = " ")) %>%
  select(-team)

There was no missing value in the initial dataset. We combined the team and team name columns because we wanted to remove redundancy in the dataset.

Clean Dataset (First 10 Rows)

head(standings_df, 10)
head(games_df, 10)
head(attendance_df, 10)
summary(standings_df[c( "wins", "loss", "points_for", "points_against", "margin_of_victory", "playoffs", "offensive_ranking")])
##       wins             loss          points_for    points_against 
##  Min.   : 0.000   Min.   : 0.000   Min.   :161.0   Min.   :165.0  
##  1st Qu.: 6.000   1st Qu.: 6.000   1st Qu.:299.0   1st Qu.:310.0  
##  Median : 8.000   Median : 8.000   Median :348.0   Median :347.0  
##  Mean   : 7.984   Mean   : 7.984   Mean   :350.3   Mean   :350.3  
##  3rd Qu.:10.000   3rd Qu.:10.000   3rd Qu.:396.0   3rd Qu.:391.5  
##  Max.   :16.000   Max.   :16.000   Max.   :606.0   Max.   :517.0  
##  margin_of_victory      playoffs         offensive_ranking   
##  Min.   :-16.300000   Length:638         Min.   :-11.700000  
##  1st Qu.: -4.700000   Class :character   1st Qu.: -3.175000  
##  Median :  0.100000   Mode  :character   Median :  0.000000  
##  Mean   : -0.001881                      Mean   : -0.000157  
##  3rd Qu.:  4.575000                      3rd Qu.:  2.700000  
##  Max.   : 19.700000                      Max.   : 15.900000
summary(games_df[c("winner", "tie", "pts_win")])
##     winner              tie               pts_win     
##  Length:5324        Length:5324        Min.   : 3.00  
##  Class :character   Class :character   1st Qu.:21.00  
##  Mode  :character   Mode  :character   Median :27.00  
##                                        Mean   :27.78  
##                                        3rd Qu.:34.00  
##                                        Max.   :62.00
summary(attendance_df[c("total", "home", "away", "weekly_attendance")])
##      total              home             away        weekly_attendance
##  Min.   : 760644   Min.   :202687   Min.   :450295   Min.   :47540    
##  1st Qu.:1040611   1st Qu.:504405   1st Qu.:524983   1st Qu.:65038    
##  Median :1081090   Median :543185   Median :541757   Median :67568    
##  Mean   :1080910   Mean   :540455   Mean   :540455   Mean   :67557    
##  3rd Qu.:1123187   3rd Qu.:578339   3rd Qu.:557700   3rd Qu.:70199    
##  Max.   :1322087   Max.   :741775   Max.   :601655   Max.   :82630

Summary About Varaibles

The standings dataset provides information on each team’s yearly performance. It includes essential data points such as the number of wins and losses, which are key indicators of team success. The dataset offers other important statistics like offensive ratings, indicating the quality of a team’s offensive performance. The dataset contains information on if a team made the playoffs. It also provides the margin of victory for each team in a given year, which shows us how dominant the team was. All these statistics displays the teams overall performance.

The attendance data set contains information about the attendance at NFL games. It provides information on average weekly attendance for each team and the total number of fans who attended their home games.

The games data set provides information about NFL games, including the teams involved, and game results. It provides us statistics such s which team won and the scores of these games.

How We Plan To Anylaze Our Data

We think data visualization would be best choice to present the question,It could be bar charts, box plot even histogram. We plan on to combine separate data frames to compare and analyze our data. For example we plan to merge the standings and attendance data frames to anyalze how a team’s performance in the standings correlates with attendance rates. This will allow us to explore how offensive performance and margin of victory impact attendance rates.Also, we will combine the games and attendance data frames to explore how game outcomes influence attendance rates.

We plan on using histogram,bar chart, and scatter plots as a way to illustrate the our question. This will helps us find good trends and correlation between variables.

What We Need To Learn

We think we need to Learn how to report the histogram or any other type of figures,graph properly. Also learn more about how to manipulate data sets.