More than Touchdowns: An NFL Data Analysis

Introduction

Breaking Down the NFL

From a young age, both of us have watched NFL football and cheered for our respective teams every week. We (Adam, a die-hard Bengals fan, and Katie, a lifelong Steelers fan) wanted to compare not only our own teams but also the league as a whole. The past twenty years have seen successes and failures from every NFL team. Looking at the datasets available for this project, we were instantly drawn to the NFL option, as it showcases many variables that affect an NFL team. We then asked, “What goes into winning an NFL game, and which teams are historically successful in the final standings?” Using the past 20 years’ worth of data, we sought to investigate this question.

Our Focus

We plan on using the functions in R to deliver overall summary statistics on games and standings. Additionally, we will use the data to develop potential correlations and plot respective data visualizations. Utilizing descriptive analysis of the past 20 years, we are looking to see if there can be predictive tendencies for NFL teams.

The NFL dataset contains three individual datasets:

NFL Attendance (nfl_attendance)
NFL Standings (nfl_standings)
NFL Games (nfl_games)

More detailed information about each dataset can be found in the Data Preparation tab.

Analytical Technique and Approach

The datasets contain loads of information for the NFL. With a wide range of variables, many options are available to analytically investigate the NFL. With the three datasets at hand, we looked to compare them to draw conclusions about team performances. To see if statistical significance or rational conclusions related to the NFL could be realized, the following situations were explored:

The Importance of Fan Attendance - This analysis will look into whether the number of fans in attendance correlates with a team’s success (number of games won). Additionally, it will compare how teams fare in home versus away games while keeping in mind the attendance at those games.
Standings over the Years - The NFL has two conferences: the American Football Conference (AFC) and National Football Conference (NFC). Each conference contains four divisions with four teams in each division. Each division then has a winner over the 16-game regular season. Our analysis will look into the qualities of the division winners and the attributes that separate teams high in the standings from teams in the lower portion. Furthermore, this approach will explore what separates Super Bowl champions from the 31 other teams each season.
Offense vs. Defense - The two main parts of an NFL team are the offense and defense. The goal for each team is to be great on both sides. However, this is rarely the case. Using individual game data and season-long statistics, we will give a thorough breakdown of how having a great offense or defense improves teams. We will also see whether a better offense or a better defense has been more critical to success over the years.
Individual Game Observations - The nfl_games dataset contains many variables for games. Turnovers, day of the week, points, etc. are shown for every match-up. The goal of this analysis is to find which variables correlate with winning or losing. With so many variables available, identifying which ones are significant will be essential for further understanding.

Moving Forward

The NFL is a multi-billion dollar industry. Millions of fans across the world cheer for these 32 teams every year. People are now looking for ways to understand the game better.

Coaches want to understand what makes a team more successful. Sports gamblers want to get an edge and make the correct picks based on more than just gut feelings. Fans want to know if their team is progressing in the right direction. This analysis is useful for all of these situations. Using descriptive analysis, past results can be better explained. As such, trends can be deduced to predict how NFL games and seasons will occur. Although no one can see into the future, understanding the data sheds a better light on the probability of certain results occurring in the NFL.

Packages Required

This project requires a variety of packages. Given there are over 10,000 packages in R, we want to focus on the ones that will provide us with the best results while cleaning and interpreting the data. Because we still consider ourselves very much novices, many of these packages are standard.

Some packages will be more useful than others. For example, ggplot2 allows for great visualizations that provide a better understanding of the data. Additionally, dplyr can drill deeper into the three datasets to reach conclusions that may be hidden at first. R has powerful functions that can help answer questions about massive datasets. Please see below for all of the packages loaded for this analysis:

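# Packages required
library(tidyverse) # Use to tidy data; also attaches dplyr, ggplot2, tibble, and readr
library(dplyr) # Use to manipulate data (already attached by tidyverse)
library(ggplot2) # Use to plot data and create visualizations (already attached by tidyverse)
library(tibble) # Use to manipulate and re-imagine data (already attached by tidyverse)
library(readr) # Use to import data cleanly and efficiently (already attached by tidyverse)
library(DT) # Use to create comprehensive data tables with HTML output
library(knitr) # Use for dynamic report generation
# Base R functions are always available, so the base package needs no library() call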

Data Preparation

The data was obtained from our professor, Tianhai Zu, for this class. He provided four different datasets from which to choose, and we chose the NFL Attendance Data option. This dataset can be found on GitHub. Reading the information on GitHub led us to the original sources of the data, which are Pro Football Reference Standings and Pro Football Reference Attendance.

The NFL dataset consists of three individual datasets - (1) nfl_attendance, (2) nfl_standings, and (3) nfl_games. We first merged the attendance and standings data into one dataframe called nfl_df, joining on year, team, and team_name; nfl_games does not carry these key columns, so it is kept separate. We decided it might be beneficial to have multiple frames of reference, some utilizing individual datasets, and another looking at the combined dataframe. Rather than using str() and summary() to show descriptive statistics for each variable, we decided to create comprehensive tables.

# Get working directory
getwd()

## [1] "C:/Users/katie/OneDrive - University of Cincinnati/FS20/Second Half/Data Wrangling (BANA 7025)/Final Project"

# Get the data
nfl_attendance <- readr::read_csv('attendance.csv')
nfl_standings <- readr::read_csv('standings.csv')
nfl_games <- readr::read_csv('games.csv')

# To use 2020 data you need to update tidytuesdayR from GitHub
# Install via devtools::install_github("thebioengineer/tidytuesdayR")

tuesdata <- tidytuesdayR::tt_load('2020-02-04')

## 
##  Downloading file 1 of 3: `attendance.csv`
##  Downloading file 2 of 3: `games.csv`
##  Downloading file 3 of 3: `standings.csv`

tuesdata <- tidytuesdayR::tt_load(2020, week = 6)

## 
##  Downloading file 1 of 3: `attendance.csv`
##  Downloading file 2 of 3: `games.csv`
##  Downloading file 3 of 3: `standings.csv`

attendance <- tuesdata$attendance

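# Join attendance and standings with dplyr; left_join() takes two tables per call
# (nfl_games has no team or team_name column, so it cannot be joined on these keys)
nfl_df <- dplyr::left_join(nfl_attendance, nfl_standings, by = c("year", "team_name", "team"))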
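
As a rough illustration of how this join behaves, here is a minimal sketch on made-up rows. The data frames att and st and all of their values are toy stand-ins, not the real files; the point is only that each left_join() call combines two tables and that both must contain the key columns.

```r
library(dplyr)

# Toy stand-ins for the attendance and standings tables (values are invented)
att <- data.frame(
  team = c("Cincinnati", "Cincinnati", "Pittsburgh"),
  team_name = c("Bengals", "Bengals", "Steelers"),
  year = c(2019, 2019, 2019),
  week = c(1, 2, 1),
  weekly_attendance = c(60000, 61000, 65000),
  stringsAsFactors = FALSE
)
st <- data.frame(
  team = c("Cincinnati", "Pittsburgh"),
  team_name = c("Bengals", "Steelers"),
  year = c(2019, 2019),
  wins = c(2, 8),
  stringsAsFactors = FALSE
)

# left_join() keeps every row of the left table and adds matching columns,
# so the season-level wins value repeats on each weekly row
joined <- left_join(att, st, by = c("year", "team_name", "team"))
print(joined)
```

Joining a third table would mean chaining another left_join() call with keys that the third table actually contains.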

Attendance

As mentioned above, the NFL Attendance data was obtained from Pro Football Reference. The original data contains 10,846 observations and eight variables. There are two character variables, team and team_name. There are six numeric variables: year, total, home, away, week, and weekly_attendance. The variables are described in the data dictionary below. The data covers the 2000 through 2019 seasons, and the values were recorded for each of the 17 weeks of the NFL regular season. See the ORIGINAL dataset below.

# Examine the structure of the dataset
datatable(head(nfl_attendance, 50))

# Create a data dictionary for attendance
var_names_att <- colnames(nfl_attendance)
var_types_att <- lapply(nfl_attendance, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_att <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Total attendance per season", "Total attendance at home games per season", "Total attendance at away games per season", "Week in which game was played", "Attendance for given week")
data_dict_att <- as_tibble(cbind(var_names_att, var_types_att, var_desciptions_att))
colnames(data_dict_att) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_att) # kable returns a single table for a single data object

Variable Name	Variable Data Type	Variable Description
team	character	City or state in which the team originates
team_name	character	Name or mascot of the team
year	numeric	Year
total	numeric	Total attendance per season
home	numeric	Total attendance at home games per season
away	numeric	Total attendance at away games per season
week	numeric	Week in which game was played
weekly_attendance	numeric	Attendance for given week
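
Because the same dictionary-building steps recur for every dataset in this report, they could be wrapped in a small function. The sketch below is one possible way to do that in base R; make_data_dict and its column names are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: build a data dictionary from any data frame
make_data_dict <- function(df, descriptions) {
  # One description is required per column
  stopifnot(length(descriptions) == ncol(df))
  data.frame(
    variable_name = names(df),
    # Collapse multi-class columns (e.g. "hms", "difftime") into one string
    variable_type = unname(vapply(
      df,
      function(col) paste(class(col), collapse = ", "),
      character(1)
    )),
    variable_description = descriptions,
    stringsAsFactors = FALSE
  )
}

# Toy data frame (invented values) to show the output shape
toy <- data.frame(team = "Cincinnati", year = 2019, stringsAsFactors = FALSE)
dict <- make_data_dict(toy, c("City or state of the team", "Season year"))
print(dict)
```

The result is an ordinary data frame, so it can be passed to kable() in the same way as the tibbles above.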

Looking at the missing data values, the only column in which missing values exist is weekly_attendance. This makes sense, as each NFL team has a bye week during the regular season; the 638 missing values match the 638 team-seasons in the standings data. We decided to omit these rows, as they represent weeks in which no game was played and would misrepresent the weekly trends for each team.

colSums(is.na(nfl_attendance)) # Find the number of missing values per column

##              team         team_name              year             total 
##                 0                 0                 0                 0 
##              home              away              week weekly_attendance 
##                 0                 0                 0               638

nfl_attendance <- na.omit(nfl_attendance)
colSums(is.na(nfl_attendance)) # Confirm there are no missing values

##              team         team_name              year             total 
##                 0                 0                 0                 0 
##              home              away              week weekly_attendance 
##                 0                 0                 0                 0

Looking at the original dataset above, we decided to first rename the columns to better describe the data.

nfl_attendance <- nfl_attendance %>% rename(
  team_location = team,
  total_attendance = total,
  total_home_attendance = home,
  total_away_attendance = away
)

Additionally, we split it into two dataframes, the first omitting the weekly data, and the second omitting the season totals. This decision was made largely to remove duplicates, and we knew it would make for better visualizations during the exploratory data analysis (EDA).

The first dataset, nfl_total_attendance, drops the two weekly columns, week and weekly_attendance. This dataset shows the season attendance totals for each team. The second dataset, nfl_weekly_attendance, drops the season total columns: total_attendance, total_home_attendance, and total_away_attendance.

nfl_total_attendance <- nfl_attendance[-c(7, 8)] # Remove weekly data
nfl_total_attendance <- nfl_total_attendance[!duplicated(nfl_total_attendance), ] # Remove duplicates
datatable(head(nfl_total_attendance, 50))

nfl_weekly_attendance <- nfl_attendance[-c(4, 5, 6)] # Remove season total attendance data
datatable(head(nfl_weekly_attendance, 50))
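
The same split can also be written with column names instead of positions, which keeps working if the column order ever changes. This is a sketch on a toy table with invented values; toy_att, toy_total, and toy_weekly are illustrative names only.

```r
library(dplyr)

# Toy weekly attendance rows (invented values) with repeated season totals
toy_att <- data.frame(
  team_name = c("Bengals", "Bengals", "Steelers", "Steelers"),
  year = c(2019, 2019, 2019, 2019),
  total_attendance = c(1000, 1000, 1200, 1200),
  week = c(1, 2, 1, 2),
  weekly_attendance = c(480, 520, 590, 610),
  stringsAsFactors = FALSE
)

# Season-level table: drop the weekly columns by name, then remove duplicates
toy_total <- toy_att %>%
  select(-week, -weekly_attendance) %>%
  distinct()

# Week-level table: drop the season total by name
toy_weekly <- toy_att %>%
  select(-total_attendance)

print(toy_total)
print(toy_weekly)
```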

Now, for a summary of the two datasets and associated tables of the CLEANED data, please see below.

# Examine the final summary and structure of the nfl_total_attendance dataset
datatable(head(nfl_total_attendance, 50))

# Create a data dictionary for nfl_total_attendance
var_names_att <- colnames(nfl_total_attendance)
var_types_att <- lapply(nfl_total_attendance, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_att <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Total attendance per season", "Total attendance at home games per season", "Total attendance at away games per season")
data_dict_total_att <- as_tibble(cbind(var_names_att, var_types_att, var_desciptions_att))
colnames(data_dict_total_att) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_total_att) # kable returns a single table for a single data object

Variable Name	Variable Data Type	Variable Description
team_location	character	City or state in which the team originates
team_name	character	Name or mascot of the team
year	numeric	Year
total_attendance	numeric	Total attendance per season
total_home_attendance	numeric	Total attendance at home games per season
total_away_attendance	numeric	Total attendance at away games per season

# Examine the final summary and structure of the nfl_weekly_attendance dataset
datatable(head(nfl_weekly_attendance, 50))

# Create a data dictionary for nfl_weekly_attendance
var_names_att <- colnames(nfl_weekly_attendance)
var_types_att <- lapply(nfl_weekly_attendance, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_att <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Week in which game was played", "Attendance for given week")
data_dict_weekly_att <- as_tibble(cbind(var_names_att, var_types_att, var_desciptions_att))
colnames(data_dict_weekly_att) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_weekly_att) # kable returns a single table for a single data object

Variable Name	Variable Data Type	Variable Description
team_location	character	City or state in which the team originates
team_name	character	Name or mascot of the team
year	numeric	Year
week	numeric	Week in which game was played
weekly_attendance	numeric	Attendance for given week

Standings

Similar to above, the NFL Standings data was obtained from Pro Football Reference. The original data contains 638 observations and 15 variables. There are four character variables: team, team_name, playoffs, and sb_winner. There are 11 numeric variables: year, wins, loss, points_for, points_against, points_differential, margin_of_victory, strength_of_schedule, simple_rating, offensive_ranking, and defensive_ranking. The variables are described in the data dictionary below. The data covers the 2000 through 2019 seasons. See the ORIGINAL NFL Standings data below.

# Examine the structure of the dataset
datatable(head(nfl_standings, 50))

# Create a data dictionary for standings
var_names_st <- colnames(nfl_standings)
var_types_st <- lapply(nfl_standings, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_st <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Total wins per season (0 to 16)", "Total losses per season (0 to 16)", "Total points the team scored per season", "Total points the opponent scored on the team per season", "The difference between the total points for the team and against the team", "Points differential divided by the total number of games per season", "Difficulty of schedule based on opponent records", "A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System)", "A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System)", "A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System)", "Stating whether or not the team made it to the playoffs", "Stating whether or not the team won the Super Bowl for the season")
data_dict_st <- as_tibble(cbind(var_names_st, var_types_st, var_desciptions_st))
colnames(data_dict_st) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_st) # kable returns a single table for a single data object

Variable Name	Variable Data Type	Variable Description
team	character	City or state in which the team originates
team_name	character	Name or mascot of the team
year	numeric	Year
wins	numeric	Total wins per season (0 to 16)
loss	numeric	Total losses per season (0 to 16)
points_for	numeric	Total points the team scored per season
points_against	numeric	Total points the opponent scored on the team per season
points_differential	numeric	The difference between the total points for the team and against the team
margin_of_victory	numeric	Points differential divided by the total number of games per season
strength_of_schedule	numeric	Difficulty of schedule based on opponent records
simple_rating	numeric	A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System)
offensive_ranking	numeric	A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System)
defensive_ranking	numeric	A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System)
playoffs	character	Stating whether or not the team made it to the playoffs
sb_winner	character	Stating whether or not the team won the Super Bowl for the season

Looking at the above dataset, we first decided to change the column names to better describe the data.

nfl_standings <- nfl_standings %>% rename(
  team_location = team,
  total_wins = wins,
  total_losses = loss
)

It is important to note as well that a few of the variables are calculated values. The points differential is calculated as: points_differential = points_for - points_against. Additionally, the margin of victory is calculated as: margin_of_victory = (points_for - points_against) / games_played.
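
A small worked example may help with the margin of victory. The numbers below are invented for a hypothetical 16-game season; the last line shows why the parentheses around the difference matter.

```r
# Toy season line (invented numbers) for a 16-game season
points_for <- 400
points_against <- 320
games_played <- 16

# Points differential: points scored minus points allowed
points_differential <- points_for - points_against

# Margin of victory: the differential averaged over games played
margin_of_victory <- points_differential / games_played

# Without parentheses, division is evaluated first and the result changes
without_parentheses <- points_for - points_against / games_played

print(c(points_differential, margin_of_victory, without_parentheses))
```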

Lastly, the simple_rating is calculated by: \[SRS = MoV + SoS = OSRS + DSRS\]

In layman’s terms, the simple rating is equal to the margin of victory plus the strength of schedule. This is also equal to the offensive simple rating (OSRS) plus the defensive simple rating (DSRS), which appear in the data as offensive_ranking and defensive_ranking.
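
The identity can be checked numerically. The sketch below uses invented values chosen to satisfy it; on the real columns, a comparison with a tolerance would likely be needed, on the assumption that the published ratings are rounded.

```r
# Toy ratings (invented values) chosen to satisfy the identity
mov <- 5.0    # margin of victory
sos <- -1.2   # strength of schedule
osrs <- 3.1   # offensive rating
dsrs <- 0.7   # defensive rating

# Simple rating from margin of victory and strength of schedule
srs <- mov + sos

# all.equal() compares with a small numeric tolerance rather than exactly
identity_holds <- isTRUE(all.equal(srs, osrs + dsrs))
print(identity_holds)
```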

Next, we wanted to see what the sum of missing values was per column. As evident below, there are no missing values.

colSums(is.na(nfl_standings))

##        team_location            team_name                 year 
##                    0                    0                    0 
##           total_wins         total_losses           points_for 
##                    0                    0                    0 
##       points_against  points_differential    margin_of_victory 
##                    0                    0                    0 
## strength_of_schedule        simple_rating    offensive_ranking 
##                    0                    0                    0 
##    defensive_ranking             playoffs            sb_winner 
##                    0                    0                    0

Moving forward, we decided to change both the playoffs and sb_winner to binary variables. This is because they both only have two unique values.

unique(nfl_standings$playoffs, incomparables = FALSE) # View the unique values for the playoffs column

## [1] "Playoffs"    "No Playoffs"

unique(nfl_standings$sb_winner, incomparables = FALSE) # View the unique values for the sb_winner column

## [1] "No Superbowl"  "Won Superbowl"

Knowing this, we changed the two columns to binary variables. For the playoffs column, a value of one stands for “Playoffs”, and a value of zero stands for “No Playoffs”.

nfl_standings$playoffs[nfl_standings$playoffs == "Playoffs"] <- "1"
nfl_standings$playoffs[nfl_standings$playoffs == "No Playoffs"] <- "0"

nfl_standings$playoffs <- as.numeric(nfl_standings$playoffs)

For the sb_winner column, a value of one denotes “Won Superbowl”, and a value of zero denotes “No Superbowl”.

nfl_standings$sb_winner[nfl_standings$sb_winner == "Won Superbowl"] <- "1"
nfl_standings$sb_winner[nfl_standings$sb_winner == "No Superbowl"] <- "0"

nfl_standings$sb_winner <- as.numeric(nfl_standings$sb_winner)
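
For a two-valued column, the same recoding can be sketched in one step by comparing against the "yes" label and converting the logical result to an integer. The vector below is toy input, and playoffs_flag is an illustrative name.

```r
# Toy input mirroring the two labels in the playoffs column
playoffs <- c("Playoffs", "No Playoffs", "Playoffs")

# TRUE/FALSE from the comparison becomes 1/0 under as.integer()
playoffs_flag <- as.integer(playoffs == "Playoffs")
print(playoffs_flag)
```

One design consideration: any unexpected label would silently become 0 with this approach, so checking unique() first, as done above, remains a useful step.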

Now, for a summary of the dataset and associated table of the data, please see the CLEANED dataset below.

# Examine the structure of the dataset
datatable(head(nfl_standings, 50))

# Create a data dictionary for standings
var_names_st <- colnames(nfl_standings)
var_types_st <- lapply(nfl_standings, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_st <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Total wins per season (0 to 16)", "Total losses per season (0 to 16)", "Total points the team scored per season", "Total points the opponent scored on the team per season", "The difference between the total points for the team and against the team", "Points differential divided by the total number of games per season", "Difficulty of schedule based on opponent records", "A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System)", "A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System)", "A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System)", "Stating whether or not the team made it to the playoffs", "Stating whether or not the team won the Super Bowl for the season")
data_dict_st <- as_tibble(cbind(var_names_st, var_types_st, var_desciptions_st))
colnames(data_dict_st) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_st) # kable returns a single table for a single data object

Variable Name	Variable Data Type	Variable Description
team_location	character	City or state in which the team originates
team_name	character	Name or mascot of the team
year	numeric	Year
total_wins	numeric	Total wins per season (0 to 16)
total_losses	numeric	Total losses per season (0 to 16)
points_for	numeric	Total points the team scored per season
points_against	numeric	Total points the opponent scored on the team per season
points_differential	numeric	The difference between the total points for the team and against the team
margin_of_victory	numeric	Points differential divided by the total number of games per season
strength_of_schedule	numeric	Difficulty of schedule based on opponent records
simple_rating	numeric	A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System)
offensive_ranking	numeric	A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System)
defensive_ranking	numeric	A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System)
playoffs	numeric	Stating whether or not the team made it to the playoffs
sb_winner	numeric	Stating whether or not the team won the Super Bowl for the season

Games

Once again, the NFL Games data was obtained from Pro Football Reference. The original data contains 5,324 observations and 19 variables. There are 11 character variables: week, home_team, away_team, winner, tie, day, date, home_team_name, home_team_city, away_team_name, and away_team_city. There are seven numeric variables: year, pts_win, pts_loss, yds_win, turnovers_win, yds_loss, and turnovers_loss. The remaining variable, time, is stored as a time of day (class hms). See the ORIGINAL dataset below.

# Examine the structure of the dataset
datatable(head(nfl_games, 50))

# Create a data dictionary for games
var_names_games <- colnames(nfl_games)
var_types_games <- lapply(nfl_games, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_games <- c("Year", "Week of the season in which the game was played", "Home team for the game", "Away team for the game", "Winner of the game", "Was there a tie?  (if so, the other team will be listed in this column)", "Day of the week in which the game was played", "Date of the game", "Time of the day in which the game was played", "Number of points the winning team scored", "Number of points the losing team scored", "Total number of yards the winning team had", "Total number of turnovers the winning team had", "Total number of yards the losing team had", "Total number of turnovers the losing team had", "Name or mascot of the home team", "City of the home team", "Name or mascot of the away team", "City of the away team")
data_dict_games <- as_tibble(cbind(var_names_games, var_types_games, var_desciptions_games))
colnames(data_dict_games) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_games) # kable returns a single table for a single data object

Variable Name	Variable Data Type	Variable Description
year	numeric	Year
week	character	Week of the season in which the game was played
home_team	character	Home team for the game
away_team	character	Away team for the game
winner	character	Winner of the game
tie	character	Was there a tie? (if so, the other team will be listed in this column)
day	character	Day of the week in which the game was played
date	character	Date of the game
time	hms , difftime	Time of the day in which the game was played
pts_win	numeric	Number of points the winning team scored
pts_loss	numeric	Number of points the losing team scored
yds_win	numeric	Total number of yards the winning team had
turnovers_win	numeric	Total number of turnovers the winning team had
yds_loss	numeric	Total number of yards the losing team had
turnovers_loss	numeric	Total number of turnovers the losing team had
home_team_name	character	Name or mascot of the home team
home_team_city	character	City of the home team
away_team_name	character	Name or mascot of the away team
away_team_city	character	City of the away team

Looking at the above dataset, the first step we took to clean the data was to remove the last four columns, as they repeat the team information already given in home_team and away_team.

names(nfl_games)

##  [1] "year"           "week"           "home_team"      "away_team"     
##  [5] "winner"         "tie"            "day"            "date"          
##  [9] "time"           "pts_win"        "pts_loss"       "yds_win"       
## [13] "turnovers_win"  "yds_loss"       "turnovers_loss" "home_team_name"
## [17] "home_team_city" "away_team_name" "away_team_city"

nfl_games <- nfl_games[-c(16, 17, 18, 19)] # Remove redundant columns
names(nfl_games)

##  [1] "year"           "week"           "home_team"      "away_team"     
##  [5] "winner"         "tie"            "day"            "date"          
##  [9] "time"           "pts_win"        "pts_loss"       "yds_win"       
## [13] "turnovers_win"  "yds_loss"       "turnovers_loss"

Then, we changed the week column to be numeric.

nfl_games$week <- as.numeric(nfl_games$week)
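
One side effect of this conversion is worth illustrating: as.numeric() turns any label that is not a number into NA. The sketch below uses toy values; the labels "WildCard" and "SuperBowl" are assumed examples of playoff round names and may not match the file exactly.

```r
# Toy week values: two regular-season weeks and two assumed playoff labels
week_chr <- c("1", "17", "WildCard", "SuperBowl")

# Non-numeric labels become NA (the coercion warning is silenced here)
week_num <- suppressWarnings(as.numeric(week_chr))

# A flag like this would preserve the playoff information the NA hides
is_playoff <- is.na(week_num)

print(week_num)
print(is_playoff)
```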

Looking at missing values, two columns contain them: week and tie. The 220 missing values in week were introduced by the numeric conversion above, because playoff games are labeled by round rather than by week number. The missing values in tie make sense, as very few NFL games result in a tie.

Next, a tie was denoted by listing one team name in the winner column and the opposing team name in the tie column. To fix this, we identified every game that resulted in a tie. Then, for these specific games, we changed the value in the winner column to “Tie”. The tie column was then removed.

colSums(is.na(nfl_games))

##           year           week      home_team      away_team         winner 
##              0            220              0              0              0 
##            tie            day           date           time        pts_win 
##           5314              0              0              0              0 
##       pts_loss        yds_win  turnovers_win       yds_loss turnovers_loss 
##              0              0              0              0              0

unique(nfl_games$tie, incomparables = FALSE)

## [1] NA                   "Atlanta Falcons"    "Cincinnati Bengals"
## [4] "St. Louis Rams"     "Green Bay Packers"  "Carolina Panthers" 
## [7] "Arizona Cardinals"  "Cleveland Browns"

nfl_games$winner[!is.na(nfl_games$tie)] <- "Tie" # Mark games that have a team listed in the tie column
nfl_games <- nfl_games[-c(6)] # Remove the tie column
colSums(is.na(nfl_games)) # Confirm only the playoff week values remain missing

##           year           week      home_team      away_team         winner 
##              0            220              0              0              0 
##            day           date           time        pts_win       pts_loss 
##              0              0              0              0              0 
##        yds_win  turnovers_win       yds_loss turnovers_loss 
##              0              0              0              0
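
The tie handling described above can be sketched on a toy table. The games data frame and its rows are invented for illustration; the logic selects rows where the tie column is not missing, overwrites the winner, and then drops the column.

```r
# Toy games table (invented rows): the second game ended in a tie
games <- data.frame(
  winner = c("Pittsburgh Steelers", "Cincinnati Bengals", "Cleveland Browns"),
  tie = c(NA, "Carolina Panthers", NA),
  stringsAsFactors = FALSE
)

# Rows with a team recorded in the tie column are tied games
games$winner[!is.na(games$tie)] <- "Tie"

# The tie column is no longer needed once winner carries the information
games$tie <- NULL

print(games)
```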

To view the summary and structure of the CLEANED data:

# Examine the structure of the dataset
datatable(head(nfl_games, 50))

# Create a data dictionary for games
var_names_games <- colnames(nfl_games)
var_types_games <- lapply(nfl_games, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_games <- c("Year", "Week of the season in which the game was played", "Home team for the game", "Away team for the game", "Winner of the game", "Day of the week in which the game was played", "Date of the game", "Time of the day in which the game was played", "Number of points the winning team scored", "Number of points the losing team scored", "Total number of yards the winning team had", "Total number of turnovers the winning team had", "Total number of yards the losing team had", "Total number of turnovers the losing team had")
data_dict_games <- as_tibble(cbind(var_names_games, var_types_games, var_desciptions_games))
colnames(data_dict_games) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_games) # kable returns a single table for a single data object

Variable Name	Variable Data Type	Variable Description
year	numeric	Year
week	numeric	Week of the season in which the game was played
home_team	character	Home team for the game
away_team	character	Away team for the game
winner	character	Winner of the game
day	character	Day of the week in which the game was played
date	character	Date of the game
time	hms , difftime	Time of the day in which the game was played
pts_win	numeric	Number of points the winning team scored
pts_loss	numeric	Number of points the losing team scored
yds_win	numeric	Total number of yards the winning team had
turnovers_win	numeric	Total number of turnovers the winning team had
yds_loss	numeric	Total number of yards the losing team had
turnovers_loss	numeric	Total number of turnovers the losing team had

Proposed Exploratory Data Analysis

Initial Discoveries and Changes

With a large amount of NFL data in front of us, exploring the data is a great challenge. Drilling down into the data can be done in many ways using R.

To get an understanding of our data, we will run statistics to analyze the datasets and variables within each dataset. This will give us the base needed to uncover new information. Then, we will create new variables and modify the datasets as needed. For example, we split nfl_attendance into nfl_total_attendance and nfl_weekly_attendance as seen in Data Preparation. The original dataset was repetitive in the total numbers as each observation presented the weekly numbers for the year with the same yearly totals. As a result, splitting these datasets creates a more digestible format for analyzing NFL attendance. Data Preparation contains additional changes to the datasets and cleaning procedures completed.

Uncovering New Information

As detailed in the Introduction, there are four ways that the data will be analyzed from an overall standpoint. Graphical depictions as well as tabular representations will give viewers an understanding of the significance of our findings. These graphs will include histograms and boxplots for initial exploration of the data. Then, ggplot2 will provide phenomenal scatter plots to visualize findings.

Developing correlations between the datasets is also a critical part of our focus. Linear regression models and correlation tests will be crucial to uncovering this new information. The R functions lm and cor are two ways for us to produce these findings.
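
As a minimal sketch of how lm and cor might be used here, the example below relates season wins to points differential. The six rows are invented toy values, not results from the real standings data.

```r
# Toy season lines (invented values): points differential and total wins
toy <- data.frame(
  points_differential = c(-120, -60, -10, 30, 80, 150),
  total_wins = c(3, 5, 7, 9, 11, 13)
)

# Pearson correlation between the two variables
r <- cor(toy$points_differential, toy$total_wins)

# Simple linear regression of wins on points differential
fit <- lm(total_wins ~ points_differential, data = toy)
slope <- unname(coef(fit)["points_differential"])

print(r)
print(coef(fit))
```

Calling summary(fit) would add standard errors and an R-squared value to the coefficients shown here.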

Learning the Answers

There is a lot of uncertainty with our planned exploratory data analysis. Two findings are necessary for the final project to be successful.

We must investigate whether the foci of our investigation have statistical significance - utilizing p-values and other measures of significance will be necessary for this portion
We must graph the data in plots that people can follow, so that they understand what conclusions are being made
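
For the significance portion, one option is cor.test from base R's stats package, which reports a p-value for a correlation. The sketch below runs it on invented toy values, so the resulting numbers carry no meaning for the real data.

```r
# Toy vectors (invented values): points differential and total wins
points_differential <- c(-120, -60, -10, 30, 80, 150)
total_wins <- c(3, 5, 7, 9, 11, 13)

# Test the null hypothesis that the true correlation is zero
result <- cor.test(points_differential, total_wins)

# The estimate is the sample correlation; p.value is the test's p-value
print(result$estimate)
print(result$p.value)
```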

This requires extensive knowledge of the packages in R. Although we have a rudimentary understanding of these functions at the moment, we will seek to grasp the available tools in dplyr, tidyverse, ggplot2, etc. to be able to present a logical and thorough project.

With the rest of the semester, we hope to develop further expertise in R. Understanding regression analysis, plotting, complicated data preparation, etc. will prepare us to make the final project informative for all viewers. We are excited to continue learning new R functions and finish this project with additional findings and conclusions!