Package used for the analysis are:
library(readr)
library(lubridate)
library(tidyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(knitr)
library(tibble)
library(data.table)
This part is to prepare the data for the analysis.
Original data (attendance, games and standings) were obtained from Pro Football Reference website and downloaded and cited through Github Rfordatascience
attendance <- read.csv("~/R/NFL dataset/attendance.csv")
games <- read.csv("~/R/NFL dataset/games.csv")
standings <- read.csv("~/R/NFL dataset/standings.csv")
Cleaning Steps
When looking at the NFL data sets such as attendance, games and standings, “attendance” and “games” data were recorded every week while the “standings” data was recorded every year. So, the “attendance” and “games” data were left-joined together because we wanted to return the rows of the “games” and not the “attendance” data. Before the “games” data was left-joined to “attendance”, the type of the variables “week” and time were changed.
Some variables names were changed for clarity when manipulating the data set. To solve the business problem some variables that were not needed for the analysis were removed.
When we left-joined the “games” table to “attendance’ by week, year, home_team-name and team_name, some missing values were generated in the process and this was because of the”week" variable in the “games” table. The “week” variable in the “games” table has 16 weeks(listed 1,2,3 until 16) and confChamp, Division, SuperBowl and WildCard while the “week” variable in the attendance table has only those 16 weeks. The missing values, NA, come from those features that the “week” variable in the “games” table has. We decided to leave those missing values in the dataset because they can be replaced by their values if there is attendance record of the confChamp, Division, SuperBowl and WildCard.
For the “standings” data, some variables were renamed for clarity between each variable. Also, we removed some variables that were not part of the analysis. The “Superbowl” variable has “won superbowl” and “No superbowl”. We changed “won superbowl” and “No superbowl” into binary variables, 1 and 0 respectively. We changed into binary variable to facilitate calculations and avoid dealing with character type of variables which would have made the analysis a bit difficult.
#Convert "week" variable type from the attendance file to character. We want "week" variable from games file to have the same type as the "week" variable from attendance file to be able to merge the tables by week.
attendance$week <- as.character(attendance$week)
#Change the type of the variable "time"
games$time <- hms(games$time)
class(games$time)
#We decided to combine only games and attendance because the data was recorded every week while standings data was recorded every year.
#We did a left_join because we want record from games file
nfl<- left_join(games,attendance,by = c("home_team_name"="team_name","year","week"))
#Change some columns names for clarity
names(nfl)[7] <- "weekday"
names(nfl)[22:23] <- c( "home_attendance", "away_attendance")
#Remove columns that are not needed to solve the Business Problem
nfl <- select(nfl, -c("team", "tie","home_team_name","home_team_city","away_team_name","away_team_city","yds_win","yds_loss", "turnovers_win","turnovers_loss","total","weekly_attendance"))
#Rename pts_wins and pts_loss
names(nfl)[9:10] <- c("points_wins","points_loss")
# Check if we have NA from the data NFL
is.na (nfl)
#Modify Standings file separately
standings1 <- standings
#Change names of columns to ease comprehension
names(standings1)[15] <- "superbowl_winner"
names(standings1)[1] <- "team_city"
names(standings1)[6:7] <- c("offensive_points", "defensive_points")
#Remove columns that are not needed to solve the business problem
standings1 <- select(standings1, -c("points_differential","margin_of_victory","simple_rating","offensive_ranking","defensive_ranking","playoffs"))
#Rename columns wins and loss
names(standings1)[4:5] <- c("game_wins", "game_loss")
#Change No Superbowl to 0 and won Superbowl to 1 to ease computation during the analysis
standings1$superbowl_winner <- str_replace(standings1$superbowl_winner, "No Superbowl", "0")
standings1$superbowl_winner <- str_replace(standings1$superbowl_winner, "Won Superbowl", "1")
standings1$superbowl_winner <- as.numeric(standings1$superbowl_winner)
# Check if we have NA from the data standings1
is.na(standings1)
NFL Cleaned Data
# Read the cleaned data
NFL_cleaned_data <- read.csv("~/R/NFL dataset/NFL_cleaned_data.csv")
datatable(head(NFL_cleaned_data, 25))
Standings Cleaned Data
Standings_cleaned_data <- read.csv("~/R/NFL dataset/Standings_cleaned_data.csv")
datatable(head(Standings_cleaned_data,25))
This analysis is performed by using graphs and tables to answer the business problem.
The table below shows the total of wins when teams were playing at home and when they were playing away. It reveals that there are a lot of home games victories compared to away games.
#Does a home game give a team more of advantage over an away team?
##Identify if the game that a team wins is a home_game or an away_game
team_home_away <- tibble(NFL_cleaned_data)%>%
mutate(home_or_away = if_else(winner==home_team, "home","away"))
#Change away to 0 and home to 1 for computation
team_home_away1<-team_home_away
team_home_away1$home_or_away <- str_replace(team_home_away1$home_or_away, "away", "0")
team_home_away1$home_or_away <- str_replace(team_home_away1$home_or_away, "home", "1")
team_home_away1$home_or_away <- as.numeric(team_home_away1$home_or_away)
##Create a variable team_home_winner to select all the winners that are home_team and wins the home_game
team_home_winner <- team_home_away1 %>%
filter (home_or_away== 1)%>%
select(home_team,home_or_away)%>%
gather(key,value1,-home_team) %>%
group_by (home_team) %>%
summarize(total_home_winner = sum(value1))
#Change away to 1 and home to 0 for computation
team_home_away2<-team_home_away
team_home_away2$home_or_away <- str_replace(team_home_away2$home_or_away, "away", "1")
team_home_away2$home_or_away <- str_replace(team_home_away2$home_or_away, "home", "0")
team_home_away2$home_or_away <- as.numeric(team_home_away2$home_or_away)
##Create a variable team_away_winner to select all the winners that are away_team and wins the away_game
team_away_winner <- team_home_away2 %>%
filter (home_or_away== 1)%>%
select(away_team,home_or_away)%>%
gather(key,value2,-away_team) %>%
group_by(away_team) %>%
summarize(total_away_winner= sum(value2))
##Create a table that shows the total of game wins, and either home or away game
home_game <- sum(team_home_winner$total_home_winner)
totalhomegame <- data.frame(home_game, row.names="Home game")
away_game <- sum(team_away_winner$total_away_winner)
totalawaygame <- data.frame(away_game, row.names = "Away game")
names(totalhomegame)<-names(totalawaygame)
table_home_away <- rbind(totalhomegame,totalawaygame)
names(table_home_away)[1] <-"Total Wins"
kable(table_home_away)
| Total Wins | |
|---|---|
| Home game | 3048 |
| Away game | 2276 |
The next graph shows the total home and away games won by each team from 2000-2020. The majority of teams win more at home except Los Angeles Chargers and Los Angeles Rams who won more away games than home games.
# team_home_winner and team_away_winner were combined and the away_team was removed
combine_home_away <- tibble(team_home_winner, team_away_winner)%>%
select(-away_team)
#we gather all the colums except home_team.The function gather() collapses multiple columns into key-value pairs. It produces a “long” data format from a “wide” one
home_away_data <- gather(combine_home_away, variable, value,-home_team)
#Plot of home_games and away_games total for each team
home_away_plot <- ggplot(home_away_data, aes(home_team,value,fill= variable))+ geom_col(position=position_dodge(width=0.5))+ scale_fill_manual(values=c("deepskyblue2","gray46"))+ theme_classic()+ theme(axis.text.x = element_text(angle = 90, vjust = 1)) + labs(title= "Total Home and Away Games Won by each Team", x="Team",y="Total of Games Won")
home_away_plot
Home and away attendances were looked into it to check whether attendances are contributing in winning a home or an away game. The graph below indicates that home games have more attendance than away games. Teams who play at home have more advantage than teams who played away. And this might be because they have a lot of people to support them.
#Look into home and away attendances to see if they play a role in the team ability to win a home or an away game
yearly_home_attendance <- team_home_away%>%
filter(home_or_away=="home")%>%
select(year,home_attendance)%>%
group_by(year)%>%
summarize(total_home_attendance=sum(home_attendance[!is.na(home_attendance)]))
yearly_away_attendance <- team_home_away%>%
filter(home_or_away=="away")%>%
select(year,away_attendance)%>%
group_by(year)%>%
summarize(total_away_attendance=sum(away_attendance[!is.na(away_attendance)]))
yearly_home_attendance_plot <- ggplot(yearly_home_attendance)+ geom_line(aes(year,total_home_attendance/1000000, colour="Home Game"), show.legend = TRUE) +geom_line(data=yearly_away_attendance,aes(year,total_away_attendance/1000000, colour="Away Game"),show.legend = TRUE)+ theme_classic() + theme(axis.text.x =element_text(angle= 90, vjust=1),legend.position = "right")+ labs(title ="Annual Total Home and Away Attendance", x="Years", y= "Total Attendance(millions)", colour="Attendance")+ scale_colour_manual(values=c("Home Game"="blue","Away Game"="orange")) +scale_x_continuous(breaks =seq(2000,2020,1))+ scale_y_continuous(breaks = seq(0,90,5))
yearly_home_attendance_plot
This chart shows the total team wins through the years 2000 to 2020. Over the past 20 years the New England Patriots have won the most games overall and the Cleveland Browns have won the least. As noticed below, the Chargers and Rams have two different colors and that is because the both recently switch cities. Both teams moved to the city of Los Angeles, California.
#table of total wins for each city
city_total_wins <- Standings_cleaned_data %>%
select(team_name, team_city, game_wins) %>%
gather(key, value, -team_name, -team_city) %>%
group_by(team_name, team_city) %>%
summarise(total_wins = sum(value))
#column chart for total wins for each city over all years
city_total_wins_plot <- ggplot(city_total_wins, aes(x = team_name, y = total_wins, fill = team_city)) +geom_col(show.legend = FALSE) +geom_text(aes(x=team_name, y=total_wins, label=team_city), angle=90, size=3, hjust=2) +labs(x="City of Team", y="Total Wins") +ggtitle("Total Team Wins from 2000 to 2020") + theme_classic()+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
city_total_wins_plot
Below shows the average wins per season from 2000 to 2020. Again, the New England Patriots are at the top of the list averaging 11 wins per season. Also, the Cleveland Browns are at the bottom of the list with only 4 wins per season.
#creating table for average team wins
city_average_wins <- Standings_cleaned_data %>%
select(team_name, team_city, game_wins) %>%
gather(key, value, -team_name, -team_city) %>%
group_by(team_name) %>%
summarise(average_wins = as.integer(mean(value)))
#creating column chart for average team wins
city_average_wins_plot <- ggplot(city_average_wins, aes(x = team_name, y = average_wins, fill = team_name)) +geom_col(show.legend = FALSE) + geom_text(aes(x=team_name, y=average_wins, label=average_wins), angle=90, size=3, hjust=4)+ labs(x="NFL Team", y="Average Wins") +ggtitle("Average Team Wins from 2000 to 2020")+ theme_classic()+ theme(axis.text.x = element_text(angle = 90, vjust =0.5))
city_average_wins_plot
The average number of points scored by a team per year is shown below. the New England Patriots are number one on the list with New Orleans Saints and Green Bay Packers close behind. The Cleveland Browns are again at the bottom of the list and is the only team averaging less than 300 points a season.
#table of average offensive points over the seasons
city_avg_offensive_points <- Standings_cleaned_data %>%
select(team_name, team_city, offensive_points) %>%
gather(key, value, -team_name, -team_city) %>%
group_by(team_name) %>%
summarise(average_offensive_points = as.integer(mean(value)))
#creating column chart for average points scored each year
offense_plot <- ggplot(city_avg_offensive_points, aes(x = team_name, y = average_offensive_points, fill = team_name)) +geom_col(show.legend = FALSE) +geom_text(aes(x=team_name, y=average_offensive_points, label=average_offensive_points), angle=90, size=3, hjust=4) +labs(x="NFL Team", y="Average Points Scored") +ggtitle("Average Points Scored per Team from 2000 to 2020") +theme_classic()+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
offense_plot
Below we have the average number of points given up by a team over the seasons. The Baltimore Ravens have given up the least amount of points on average and are followed by the New England Patriots and Pittsburgh Steelers. The Detroit Lions have given up the most points over seasons 2000 to 2020 and are followed closely by the Oakland Raiders and Los Angeles Rams.
#table of average points given up by teams over the seasons
city_avg_defensive_points <- Standings_cleaned_data %>%
select(team_name, team_city, defensive_points) %>%
gather(key, value, -team_name, -team_city) %>%
group_by(team_name) %>%
summarise(average_defensive_points = as.integer(mean(value)))
#creating column chart for average points given up by teams each year
defense_plot <- ggplot(city_avg_defensive_points, aes(x = team_name, y = average_defensive_points, fill = team_name)) + geom_col(show.legend = FALSE) +geom_text(aes(x=team_name, y=average_defensive_points, label=average_defensive_points), angle=90, size=3, hjust=4) +labs(x="NFL Team", y="Average Points Given Up") +ggtitle("Average Points Given Up per Team from 2000 to 2020") +theme_classic()+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
defense_plot
The graph below shows all of the Superbowl winners from 2000 to 2019 and how many times they won. As guessed, the patriots have won the most Superbowl with a total of 6 in the past 19 years. Baltimore, New York Giants, and Pittsburgh have won two Superbowl. Denver, Green Bay, Indianapolis, Kansas City, New Orleans, Philadelphia, Seattle, and Tampa Bay have all won once. Only 12, out of 32,teams have won the Superbowl in the past 19 years and New England has almost 1/3 of them.
#creating table for cities with 1 or more superbowls
SBWins <- Standings_cleaned_data %>%
select(team_city, superbowl_winner) %>%
gather(key, value, -team_city) %>%
group_by(team_city) %>%
summarise(SB_wins = sum(value)) %>%
filter(SB_wins >= 1)
#creating column chart for cities with 1 or more superbowls
SBWins_plot <- ggplot(SBWins, aes(x = team_city, y = SB_wins, fill = team_city)) +geom_col(show.legend = FALSE) +geom_text(aes(x=team_city, y=SB_wins, label=SB_wins), vjust=2) +labs(x="City of Team", y="Number of Superbowls") +ggtitle("Total Superbowls Won from 2000 to 2020") +
theme_classic()+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
SBWins_plot
The graph shows the teams who won the Superbowl from 2000 to 2020. New England Patriots has won Superbowl for two consecutive years, in 2003 and 2004.
#Winning the Superbowl next Year ?
##Create a variable SBWinner to select the teams that win the SuperBowl
SBWinner <- Standings_cleaned_data %>%
filter (superbowl_winner ==1)
#plot year vs game_wins (help to also know if the number of game wins matters in winning the Superbowl)
SBWinner_plot = ggplot(SBWinner,aes(year,game_wins, fill= team_name))+ geom_col(show.legend=FALSE) + geom_text(aes(year, game_wins, label=team_name), angle =90, size=3, hjust=3)+labs(x= "Year", y="Total Game Won") +scale_x_continuous(breaks= seq(2000,2019,1))+ scale_y_continuous(breaks = seq(0, 16, 1)) + ggtitle("Superbowl Winners from 2000 to 2019") +theme_classic() + theme (axis.text = element_text(angle=90, vjust=0.5))
SBWinner_plot
New England Patriots
Since New England Patriots won the Superbowl twice in a row, we wanted to know if winning the most games in the season help them in winning the Superbowl the next year. The number of games won by each team in 2003 is shown in the graph below. The purple points are the teams with the most games won and the red point is the total of games won by New England Patriots. Among the five teams who won the most games,the New England Patriots won the most. In 2003, New England Patriots has won the most games, 14 games over 16, and won the Superbowl.
#Check if the total of game wins matters in winning the Superbowl in 2003
team2003 <- Standings_cleaned_data%>%
filter(year==2003)%>%
group_by(team_name)%>%
mutate(total = sum(game_wins))
patriots_2003 <- team2003 %>%
filter (year == 2003, team_name == "Patriots") %>%
summarize(total = sum(game_wins))
high5_win_point <- team2003 %>%
arrange (desc(total))%>%
head (5)
Patriot2003_plot = ggplot(team2003, aes(team_name, total))+geom_point(size=3)+geom_point(data = high5_win_point,color = "purple", size=3, show.legend = TRUE)+ geom_point(data=patriots_2003,color ="red", size=3, show.legend=TRUE) + theme_classic () + theme (axis.text = element_text(angle=90, size=6))+ scale_y_continuous(breaks = seq(0, 16, 1)) + labs(title= "Patriots Wins Compared to the other Teams in 2003", x= "Teams", y= " Total Game Won")+ annotate("text", x="Giants",y=15, label="*Note*: the red point is the total games won by New England Patriots in 2003") + annotate("text", x="Giants",y=16, label="*Note*: the purple points represent the team with most games won in 2003")
Patriot2003_plot
The graph shows the number of games won by each team in 2004. New England Patriots belongs to the teams who won the most games in 2004. Patriots is the second team who won the most games, 14 games over 16, in 2004. Despite that New England Patriots did not win the most games in 2014, the team won the Superbowl again in 2014.
#Check if the total of game won matters in winning the Superbowl in 2004
team2004 <- Standings_cleaned_data%>%
filter(year==2004)%>%
group_by(team_name)%>%
mutate(total = sum(game_wins))
patriots_2004 <- team2004 %>%
filter (year == 2004, team_name == "Patriots") %>%
summarize(total = sum(game_wins))
high5_win_point <- team2004 %>%
arrange (desc(total))%>%
head (5)
Patriot2004_plot = ggplot(team2004, aes(team_name, total))+geom_point(size=3)+ geom_point(data = high5_win_point,color = "purple", size=3)+ geom_point(data=patriots_2004,color ="red", size=3)+theme_classic () + theme (axis.text = element_text(angle=90, size=6)) +scale_y_continuous(breaks = seq(0, 16, 1)) + labs(title= "Patriots Wins Compared to the other Teams in 2004", x= "Teams", y= " Total Game Won")+ annotate("text", x="Giants",y=16, label="*Note*: the red point is the total games won by New England Patriots in 2004") + annotate("text", x="Giants",y=17, label="*Note*: the purple points represent the team with most games won in 2004")
Patriot2004_plot
—
To understand more about a teams success, we needed more information about individual players and the coaching staff. This would have helped to understand why teams offenses or defenses were good or bad. This then would help understand why certain teams have success over the others. We could not find that information available to the public.
Injuries also play a big part in team success. If key player(s) are injured on a team, that could make a teams success lower than other teams that remain healthy. We could not find that information available to the public.
It was difficult to link all three data sets provided because there weren’t any unique ids and the way the data was recorded.
The analysis was limited to questions based on what could be answered from the available data. This made the data understanding very important to find what problems/questions could possibly be solved/answered, given limited football/NFL knowledge.