The problem we are addressing is how NFl game outcomes and team standings affect attendance at home games. This is interesting because we are all sports fan and love watching the NFL.
We plan plan looking at the 3 data sets which are the attendance, standings, and games to find out how each teams standings and the outcomes of games affect attendances at home games. First we will clean out the data sets such as handling the missing data, and any outliers, and more. Then we will join the the data sets together so we can analyze it.
This will help the NFl teams because our analysis will show teams what will increase there attendance rates which will help increase there revenue. Overall this will help the NFl create a better experience for fans.
## **Packages Required**
library(tidyverse)
# Read the data from CSV files
setwd("/Users/mussab/Desktop/Data Managment Class/Week_5/nfl")
standings_df <- read_csv("standings.csv")
games_df <- read_csv("games.csv")
attendance_df <- read_csv("attendance.csv")
Each package below contains a unique data set from NFL Attendance
Data. The attendance.csv
data set displays weekly
attendance numbers for each team’s city throughout the 17-week NFL
season, including both home and away attendance. In the
weekly_attendance
column, NA
is recorded for
certain weeks when there is a bye week, indicating that the team did not
have a game during that week.
The games.csv
data set displays the points scored by the
winning team and the points scored by the losing team, the amount of
points scored for each team, and total yards each team has gained.
Similar to attendance.csv
, the NA
in the
column for means that no game was played that week.
The standings.csv
file shows the number of wins and
losses for each NFL team, including the margin of victory, points
differential, and whether or not the team has made the playoffs.
The purpose of each package is to research and provide insights into attendance patterns and whether the outcome of a team’s standings affects home game attendance
We are loading the tidyverse package
because it will
allows read the files, clean the files and also analyze the files by
creating different type of graphs and plots.
This code will provide you with an overview of each data set, including the number of variables and peculiarities related to missing values. The str function shows the data structure, and the summary function provides summary statistics for the variables, which can help you identify missing values and other characteristics of the data.
# Standings Data
str(standings_df) # Display the structure of the data
## spc_tbl_ [638 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ team : chr [1:638] "Miami" "Indianapolis" "New York" "Buffalo" ...
## $ team_name : chr [1:638] "Dolphins" "Colts" "Jets" "Bills" ...
## $ year : num [1:638] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ wins : num [1:638] 11 10 9 8 5 13 12 9 7 4 ...
## $ loss : num [1:638] 5 6 7 8 11 3 4 7 9 12 ...
## $ points_for : num [1:638] 323 429 321 315 276 346 333 321 367 185 ...
## $ points_against : num [1:638] 226 326 321 350 338 191 165 255 327 359 ...
## $ points_differential : num [1:638] 97 103 0 -35 -62 155 168 66 40 -174 ...
## $ margin_of_victory : num [1:638] 6.1 6.4 0 -2.2 -3.9 9.7 10.5 4.1 2.5 -10.9 ...
## $ strength_of_schedule: num [1:638] 1 1.5 3.5 2.2 1.4 -1.3 -2.5 -0.2 -1.4 0.4 ...
## $ simple_rating : num [1:638] 7.1 7.9 3.5 0 -2.5 8.3 8 3.9 1.1 -10.5 ...
## $ offensive_ranking : num [1:638] 0 7.1 1.4 0.5 -2.7 1.5 0 0.6 3.2 -8.1 ...
## $ defensive_ranking : num [1:638] 7.1 0.8 2.2 -0.5 0.2 6.8 8 3.3 -2.1 -2.4 ...
## $ playoffs : chr [1:638] "Playoffs" "Playoffs" "No Playoffs" "No Playoffs" ...
## $ sb_winner : chr [1:638] "No Superbowl" "No Superbowl" "No Superbowl" "No Superbowl" ...
## - attr(*, "spec")=
## .. cols(
## .. team = col_character(),
## .. team_name = col_character(),
## .. year = col_double(),
## .. wins = col_double(),
## .. loss = col_double(),
## .. points_for = col_double(),
## .. points_against = col_double(),
## .. points_differential = col_double(),
## .. margin_of_victory = col_double(),
## .. strength_of_schedule = col_double(),
## .. simple_rating = col_double(),
## .. offensive_ranking = col_double(),
## .. defensive_ranking = col_double(),
## .. playoffs = col_character(),
## .. sb_winner = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(standings_df) # Generate a summary of the data
## team team_name year wins
## Length:638 Length:638 Min. :2000 Min. : 0.000
## Class :character Class :character 1st Qu.:2005 1st Qu.: 6.000
## Mode :character Mode :character Median :2010 Median : 8.000
## Mean :2010 Mean : 7.984
## 3rd Qu.:2015 3rd Qu.:10.000
## Max. :2019 Max. :16.000
## loss points_for points_against points_differential
## Min. : 0.000 Min. :161.0 Min. :165.0 Min. :-261.00
## 1st Qu.: 6.000 1st Qu.:299.0 1st Qu.:310.0 1st Qu.: -75.00
## Median : 8.000 Median :348.0 Median :347.0 Median : 1.50
## Mean : 7.984 Mean :350.3 Mean :350.3 Mean : 0.00
## 3rd Qu.:10.000 3rd Qu.:396.0 3rd Qu.:391.5 3rd Qu.: 72.75
## Max. :16.000 Max. :606.0 Max. :517.0 Max. : 315.00
## margin_of_victory strength_of_schedule simple_rating
## Min. :-16.300000 Min. :-4.600000 Min. :-17.400
## 1st Qu.: -4.700000 1st Qu.:-1.100000 1st Qu.: -4.475
## Median : 0.100000 Median : 0.000000 Median : 0.000
## Mean : -0.001881 Mean : 0.001097 Mean : 0.000
## 3rd Qu.: 4.575000 3rd Qu.: 1.200000 3rd Qu.: 4.500
## Max. : 19.700000 Max. : 4.300000 Max. : 20.100
## offensive_ranking defensive_ranking playoffs sb_winner
## Min. :-11.700000 Min. :-9.800000 Length:638 Length:638
## 1st Qu.: -3.175000 1st Qu.:-2.400000 Class :character Class :character
## Median : 0.000000 Median : 0.100000 Mode :character Mode :character
## Mean : -0.000157 Mean :-0.001097
## 3rd Qu.: 2.700000 3rd Qu.: 2.500000
## Max. : 15.900000 Max. : 9.800000
# Games Data
str(games_df) # Display the structure of the data
## spc_tbl_ [5,324 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ year : num [1:5324] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ week : chr [1:5324] "1" "1" "1" "1" ...
## $ home_team : chr [1:5324] "Minnesota Vikings" "Kansas City Chiefs" "Washington Redskins" "Atlanta Falcons" ...
## $ away_team : chr [1:5324] "Chicago Bears" "Indianapolis Colts" "Carolina Panthers" "San Francisco 49ers" ...
## $ winner : chr [1:5324] "Minnesota Vikings" "Indianapolis Colts" "Washington Redskins" "Atlanta Falcons" ...
## $ tie : chr [1:5324] NA NA NA NA ...
## $ day : chr [1:5324] "Sun" "Sun" "Sun" "Sun" ...
## $ date : chr [1:5324] "September 3" "September 3" "September 3" "September 3" ...
## $ time : 'hms' num [1:5324] 13:00:00 13:00:00 13:01:00 13:02:00 ...
## ..- attr(*, "units")= chr "secs"
## $ pts_win : num [1:5324] 30 27 20 36 16 27 21 14 21 41 ...
## $ pts_loss : num [1:5324] 27 14 17 28 0 7 16 10 16 14 ...
## $ yds_win : num [1:5324] 374 386 396 359 336 398 296 187 395 425 ...
## $ turnovers_win : num [1:5324] 1 2 0 1 0 0 1 2 2 3 ...
## $ yds_loss : num [1:5324] 425 280 236 339 223 249 278 252 355 167 ...
## $ turnovers_loss: num [1:5324] 1 1 1 1 1 1 1 3 4 2 ...
## $ home_team_name: chr [1:5324] "Vikings" "Chiefs" "Redskins" "Falcons" ...
## $ home_team_city: chr [1:5324] "Minnesota" "Kansas City" "Washington" "Atlanta" ...
## $ away_team_name: chr [1:5324] "Bears" "Colts" "Panthers" "49ers" ...
## $ away_team_city: chr [1:5324] "Chicago" "Indianapolis" "Carolina" "San Francisco" ...
## - attr(*, "spec")=
## .. cols(
## .. year = col_double(),
## .. week = col_character(),
## .. home_team = col_character(),
## .. away_team = col_character(),
## .. winner = col_character(),
## .. tie = col_character(),
## .. day = col_character(),
## .. date = col_character(),
## .. time = col_time(format = ""),
## .. pts_win = col_double(),
## .. pts_loss = col_double(),
## .. yds_win = col_double(),
## .. turnovers_win = col_double(),
## .. yds_loss = col_double(),
## .. turnovers_loss = col_double(),
## .. home_team_name = col_character(),
## .. home_team_city = col_character(),
## .. away_team_name = col_character(),
## .. away_team_city = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(games_df) # Generate a summary of the data
## year week home_team away_team
## Min. :2000 Length:5324 Length:5324 Length:5324
## 1st Qu.:2005 Class :character Class :character Class :character
## Median :2010 Mode :character Mode :character Mode :character
## Mean :2010
## 3rd Qu.:2015
## Max. :2019
## winner tie day date
## Length:5324 Length:5324 Length:5324 Length:5324
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## time pts_win pts_loss yds_win
## Length:5324 Min. : 3.00 Min. : 0.00 Min. : 47.0
## Class1:hms 1st Qu.:21.00 1st Qu.:10.00 1st Qu.:308.0
## Class2:difftime Median :27.00 Median :16.00 Median :361.0
## Mode :numeric Mean :27.78 Mean :16.09 Mean :361.6
## 3rd Qu.:34.00 3rd Qu.:21.00 3rd Qu.:415.0
## Max. :62.00 Max. :51.00 Max. :653.0
## turnovers_win yds_loss turnovers_loss home_team_name
## Min. :0.00 Min. : 26.0 Min. :0.000 Length:5324
## 1st Qu.:0.00 1st Qu.:251.0 1st Qu.:1.000 Class :character
## Median :1.00 Median :306.0 Median :2.000 Mode :character
## Mean :1.08 Mean :309.1 Mean :2.168
## 3rd Qu.:2.00 3rd Qu.:366.0 3rd Qu.:3.000
## Max. :7.00 Max. :613.0 Max. :8.000
## home_team_city away_team_name away_team_city
## Length:5324 Length:5324 Length:5324
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
# Attendance Data
str(attendance_df) # Display the structure of the data
## spc_tbl_ [10,846 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ team : chr [1:10846] "Arizona" "Arizona" "Arizona" "Arizona" ...
## $ team_name : chr [1:10846] "Cardinals" "Cardinals" "Cardinals" "Cardinals" ...
## $ year : num [1:10846] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ total : num [1:10846] 893926 893926 893926 893926 893926 ...
## $ home : num [1:10846] 387475 387475 387475 387475 387475 ...
## $ away : num [1:10846] 506451 506451 506451 506451 506451 ...
## $ week : num [1:10846] 1 2 3 4 5 6 7 8 9 10 ...
## $ weekly_attendance: num [1:10846] 77434 66009 NA 71801 66985 ...
## - attr(*, "spec")=
## .. cols(
## .. team = col_character(),
## .. team_name = col_character(),
## .. year = col_double(),
## .. total = col_double(),
## .. home = col_double(),
## .. away = col_double(),
## .. week = col_double(),
## .. weekly_attendance = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(attendance_df) # Generate a summary of the data
## team team_name year total
## Length:10846 Length:10846 Min. :2000 Min. : 760644
## Class :character Class :character 1st Qu.:2005 1st Qu.:1040509
## Mode :character Mode :character Median :2010 Median :1081090
## Mean :2010 Mean :1080910
## 3rd Qu.:2015 3rd Qu.:1123230
## Max. :2019 Max. :1322087
##
## home away week weekly_attendance
## Min. :202687 Min. :450295 Min. : 1 Min. : 23127
## 1st Qu.:504360 1st Qu.:524974 1st Qu.: 5 1st Qu.: 63246
## Median :543185 Median :541757 Median : 9 Median : 68334
## Mean :540455 Mean :540455 Mean : 9 Mean : 67557
## 3rd Qu.:578342 3rd Qu.:557741 3rd Qu.:13 3rd Qu.: 72545
## Max. :741775 Max. :601655 Max. :17 Max. :105121
## NA's :638
#Checking for missing values
colSums(is.na(games_df))
## year week home_team away_team winner
## 0 0 0 0 0
## tie day date time pts_win
## 5314 0 0 0 0
## pts_loss yds_win turnovers_win yds_loss turnovers_loss
## 0 0 0 0 0
## home_team_name home_team_city away_team_name away_team_city
## 0 0 0 0
## Removing columns
games_df <- games_df %>%
select(-home_team_name, -home_team_city, -away_team_name, -away_team_city, -time, -date, -day)
# Create a new variable
games_df <- games_df %>%
mutate(home_win = ifelse(winner == home_team, 1, 0))
In our initial dataset, the only missing data in the tie
column, which is expected as it represents whether the game ended in a
tie or not. Since our analysis does not require information about tied
games, this missing data isn’t a problem.
We decided to remove certain columns from the dataset as they
contained repetitive information. Specifically, we eliminated the
home_team_city
, away_team_name
, and
away_team_city
columns. These columns were redundant
because we already had columns that explicitly indicated the home and
away teams. We also removed the time
, date
,
and day
columns from the dataset, as they were irrelevant
for the specific problem we intend to analyze.
colSums(is.na(attendance_df))
## team team_name year total
## 0 0 0 0
## home away week weekly_attendance
## 0 0 0 638
## combining team and team_name variables
attendance_df <- attendance_df %>%
mutate(team_name = paste(team, team_name, sep = " ")) %>%
select(-team)
# Remove duplicate rows
attendance_df <- attendance_df %>%
group_by(team_name, year,total,home,away) %>%
mutate(weekly_attendance = mean(weekly_attendance, na.rm = TRUE)) %>%
select(-week) %>%
distinct(weekly_attendance)
In the initial dataset, the only missing value was in the
weekly attendance
column, which is expected since it tells
us that there was no game played during that specific week.
We decided to change the dataset. Instead of representing the weekly attendance rate for every week, we made the data reflect the average weekly attendance per year. This change allows analysis of attendance trends over time and provides us witha more manageable dataset.
# checking missing values
colSums(is.na(standings_df))
## team team_name year
## 0 0 0
## wins loss points_for
## 0 0 0
## points_against points_differential margin_of_victory
## 0 0 0
## strength_of_schedule simple_rating offensive_ranking
## 0 0 0
## defensive_ranking playoffs sb_winner
## 0 0 0
## combining team and team_name variables
standings_df <- standings_df %>%
mutate(team_name = paste(team, team_name, sep = " ")) %>%
select(-team)
There was no missing value in the initial dataset. We combined the
team
and team name
columns because we wanted
to remove redundancy in the dataset.
head(standings_df, 10)
head(games_df, 10)
head(attendance_df, 10)
summary(standings_df[c( "wins", "loss", "points_for", "points_against", "margin_of_victory", "playoffs", "offensive_ranking")])
## wins loss points_for points_against
## Min. : 0.000 Min. : 0.000 Min. :161.0 Min. :165.0
## 1st Qu.: 6.000 1st Qu.: 6.000 1st Qu.:299.0 1st Qu.:310.0
## Median : 8.000 Median : 8.000 Median :348.0 Median :347.0
## Mean : 7.984 Mean : 7.984 Mean :350.3 Mean :350.3
## 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.:396.0 3rd Qu.:391.5
## Max. :16.000 Max. :16.000 Max. :606.0 Max. :517.0
## margin_of_victory playoffs offensive_ranking
## Min. :-16.300000 Length:638 Min. :-11.700000
## 1st Qu.: -4.700000 Class :character 1st Qu.: -3.175000
## Median : 0.100000 Mode :character Median : 0.000000
## Mean : -0.001881 Mean : -0.000157
## 3rd Qu.: 4.575000 3rd Qu.: 2.700000
## Max. : 19.700000 Max. : 15.900000
summary(games_df[c("winner", "tie", "pts_win")])
## winner tie pts_win
## Length:5324 Length:5324 Min. : 3.00
## Class :character Class :character 1st Qu.:21.00
## Mode :character Mode :character Median :27.00
## Mean :27.78
## 3rd Qu.:34.00
## Max. :62.00
summary(attendance_df[c("total", "home", "away", "weekly_attendance")])
## total home away weekly_attendance
## Min. : 760644 Min. :202687 Min. :450295 Min. :47540
## 1st Qu.:1040611 1st Qu.:504405 1st Qu.:524983 1st Qu.:65038
## Median :1081090 Median :543185 Median :541757 Median :67568
## Mean :1080910 Mean :540455 Mean :540455 Mean :67557
## 3rd Qu.:1123187 3rd Qu.:578339 3rd Qu.:557700 3rd Qu.:70199
## Max. :1322087 Max. :741775 Max. :601655 Max. :82630
The standings dataset provides information on each team’s yearly
performance. It includes essential data points such as the number of
wins and losses
, which are key indicators of team success.
The dataset offers other important statistics like
offensive ratings
, indicating the quality of a team’s
offensive performance. The dataset contains information on if a team
made the playoffs
. It also provides the
margin of victory
for each team in a given year, which
shows us how dominant the team was. All these statistics displays the
teams overall performance.
The attendance data set contains information about the attendance at
NFL games. It provides information on average
weekly attendance
for each team and the total number of
fans who attended their home games.
The games data set provides information about NFL games, including the teams involved, and game results. It provides us statistics such s which team won and the scores of these games.
We think data visualization would be best choice to present the question,It could be bar charts, box plot even histogram. We plan on to combine separate data frames to compare and analyze our data. For example we plan to merge the standings and attendance data frames to anyalze how a team’s performance in the standings correlates with attendance rates. This will allow us to explore how offensive performance and margin of victory impact attendance rates.Also, we will combine the games and attendance data frames to explore how game outcomes influence attendance rates.
We plan on using histogram,bar chart, and scatter plots as a way to illustrate the our question. This will helps us find good trends and correlation between variables.
We think we need to Learn how to report the histogram or any other type of figures,graph properly. Also learn more about how to manipulate data sets.