Introduction

We will also test to see if a team’s performance correlates to fan attendance. We will inspect the data through time-series graphs to understand attendance over time. We are using correlation and linear regression to test if there is a relationship between team performance and fan attendance. Our work will benefit the consumer by giving them easy to analyze visuals and formal analysis of why attendance may fluctuate. For the final, time willing, we want to create an Rshiny interactive application for users to be able to explore the data on their own, and we want to create a predictive model to give the likelihood of a team winning their game each week.

The first data set is attendance data which gives us: team, team name, year, total attendance for the team for the total year, total home attendance for the year, total away attendance for the year, the week of the season, and weekly attendance for each team. The second data set is individual games data which gives us: year, week of the season, home team, away team, the winner of the game, if there was a tie, the day of the week of the game, the date of the game in month and day of month format, the time of the game in military time, points by the winner, points by the loser, total yards for the winner, turnovers by the winner, yards by the loser, turnovers by the loser, the home team’s city, and the away team’s city. The last of the three data sets is standings data which gives us: team, year, wins and losses by team in the year, points for the year, points against for the year, points differential, average margin of victory, strength of schedule, a simple rating for each team, offense ranking, defense ranking, if a team made the playoffs, if a team won the super bowl.

The attendance data was collected to understand how many fans attend each game. The games data gives background data for each game played between teams. The standings data was collected to give an overall snapshot picture of how a team performed within a year.

Required Packages

We leveraged the Tidyverse. We are specifically using dplyr, tidyr, and ggplot2. We are using tidyr concepts of each row being an observation and each column containing one piece of information. The dplyr package is being using for data manipulation. The pipe operator, “%>%”, allows us to connect multiple operations, such as group_by, summarize, select, etc. This keeps multiple operations in one line of code and keeps the code concise and very readable. We are using ggplot2 for high-quality visualizations.

Data Cleaning

Fortunately, these data sets are extremely clean. There are no real missing values within the data sets, however, there are “NAs” within the attendance data set, which represent a “Bye” week, and within the games data set, which represents there was not a tie (or winner if there was a tie). We still have to do data cleaning while we are manipulating and transforming the data, as we need to merge different pieces of information from each data table to get the necessary “slices and dices”.

Merge Data

# remove bye weeks
att <- att %>% filter(!is.na(weekly_attendance))
# get rid of post season (att does not have postseason data)
games <- games %>% filter(week %in% 1:17)
df1 <- games %>% select(-away_team_name)
df2 <- games %>% select(-home_team_name)
df1$H.A <- "home"
df2$H.A <- "away"
df1 <- df1 %>% rename(team = home_team_name)
df2 <- df2 %>% rename(team = away_team_name)
gm <- rbind(df1,df2) %>% select(-home_team_city,-away_team_city,-home_team,-away_team)
att <- att %>% mutate(id = paste(year,week,team_name))
gm <- gm %>% mutate(id = paste(year,week,team))
temp <- merge(att,gm,by = "id")
stand <- stand %>% mutate(id1 = paste(year,team))
temp$id <- NULL
stand <- stand %>% mutate(id = paste(year,team_name))
df <- temp %>% mutate(id = paste(year.x,team_name)) %>%
  merge(.,stand,by = "id")
df <- df %>% select(team_name.x, year, week.x, home, away, weekly_attendance, winner,
                    day, date, time, pts_win, pts_loss, yds_win, yds_loss, turnovers_win, turnovers_loss,
                    H.A, wins, loss, points_for, points_against, points_differential,
                    margin_of_victory, strength_of_schedule, simple_rating, offensive_ranking,
                    defensive_ranking, playoffs, sb_winner)
df <- df %>% rename(team = team_name.x,week = week.x)
head(df)
##    team year week   home   away weekly_attendance              winner day
## 1 49ers 2000    1 541964 515990             54626     Atlanta Falcons Sun
## 2 49ers 2000    3 541964 515990             65945      St. Louis Rams Sun
## 3 49ers 2000    2 541964 515990             66879   Carolina Panthers Sun
## 4 49ers 2000   14 541964 515990             57255 San Francisco 49ers Sun
## 5 49ers 2000   10 541964 515990             64900  New Orleans Saints Sun
## 6 49ers 2000   16 541964 515990             68306 San Francisco 49ers Sun
##           date     time pts_win pts_loss yds_win yds_loss turnovers_win
## 1  September 3 13:02:00      36       28     359      339             1
## 2 September 17 13:02:00      41       24     529      401             2
## 3 September 10 16:15:00      38       22     450      391             1
## 4   December 3 16:05:00      45       17     385      303             0
## 5   November 5 13:02:00      31       15     360      346             0
## 6  December 17 16:00:00      17        0     456      104             1
##   turnovers_loss  H.A wins loss points_for points_against points_differential
## 1              1 away    6   10        388            422                 -34
## 2              2 away    6   10        388            422                 -34
## 3              2 home    6   10        388            422                 -34
## 4              5 away    6   10        388            422                 -34
## 5              2 away    6   10        388            422                 -34
## 6              1 home    6   10        388            422                 -34
##   margin_of_victory strength_of_schedule simple_rating offensive_ranking
## 1              -2.1                 -1.7          -3.8               1.7
## 2              -2.1                 -1.7          -3.8               1.7
## 3              -2.1                 -1.7          -3.8               1.7
## 4              -2.1                 -1.7          -3.8               1.7
## 5              -2.1                 -1.7          -3.8               1.7
## 6              -2.1                 -1.7          -3.8               1.7
##   defensive_ranking    playoffs    sb_winner
## 1              -5.5 No Playoffs No Superbowl
## 2              -5.5 No Playoffs No Superbowl
## 3              -5.5 No Playoffs No Superbowl
## 4              -5.5 No Playoffs No Superbowl
## 5              -5.5 No Playoffs No Superbowl
## 6              -5.5 No Playoffs No Superbowl
conf.look <- data.frame(team = c(unique(as.character(df$team)))
                        ,conf = c("NFC","NFC","AFC","AFC","AFC","NFC","NFC",
                                "AFC","AFC","AFC","NFC","AFC","NFC","NFC","NFC",
                                "AFC","AFC","NFC","NFC","NFC","AFC","AFC",
                                "NFC","AFC","NFC","NFC","NFC","AFC","AFC",
                                "NFC","AFC","AFC"))
geo.look <- data.frame(team = c(unique(as.character(df$team)))
                       ,geo = c("west","north","east","west","north","south","west",
                               "west","west","south","east","east","east","south","east",
                               "south","east","north","north","south","east","west",
                               "west","north","east","south","west","north","south",
                               "north","north","south"))
df$conf <- conf.look[match(df$team, conf.look$team),]$conf
df$geo <- geo.look[match(df$team, geo.look$team),]$geo
df$tot.att <- df$home + df$away
df$att.per.gm <- df$tot.att / 16
df <- df %>% arrange(year, week, team)

Data Analysis

The three data sets, each in a vacuum, give good information, but to uncover insights we had to slice and dice the data in different ways. We also have had to create new tables from joining disparate pieces of information from each data set to other pieces from the other data sets. For example, we had to join individual team’s attendance per year from the attendance table to the team’s win and loss record from the standings table.

Visualizations

We will be using the ggplot2 library for visualizing the data. Visualiztions will give us quick insights into the data. The goal for the final will be to create interactive visualiztions, which the end consumer will be able to explore.

Weekly attendance by team by division

Weekly attendance by team over time

Conclusion

As we are not to the final product yet we have not reached final conclusions. However, we have already started finding interesting insights in the data regarding “loyal fans”, attendance and team performance, and more. We will have formalized final conclusions for the final submission.

Correlation: Attendance vs Team Record by Year
Team_Name Corr
Arizona Cardinals 0.4199511
Atlanta Falcons 0.1901702
Baltimore Ravens -0.0428109
Buffalo Bills 0.4651406
Carolina Panthers 0.5102904
Chicago Bears 0.2892674
Cincinnati Bengals 0.4491210
Cleveland Browns 0.4408702
Dallas Cowboys 0.2096577
Denver Broncos 0.4200890
Detroit Lions 0.3227719
Green Bay Packers -0.1622443
Houston Texans 0.4511456
Indianapolis Colts -0.1849546
Jacksonville Jaguars 0.1018813
Kansas City Chiefs 0.2790644
Los Angeles Chargers 0.0442534
Los Angeles Rams -0.7716305
Miami Dolphins 0.1356047
Minnesota Vikings 0.0110569
New England Patriots 0.5156219
New Orleans Saints 0.4942902
New York Giants 0.4425544
New York Jets 0.1868446
Oakland Raiders 0.2210064
Philadelphia Eagles -0.3473729
Pittsburgh Steelers 0.4464934
San Diego Chargers 0.6168275
San Francisco 49ers 0.2813740
Seattle Seahawks 0.2727016
St. Louis Rams 0.4889312
Tampa Bay Buccaneers 0.1277945
Tennessee Titans 0.0959730
Washington Redskins 0.2664414