Ever since the departure of Sir Alex Ferguson (pictured below) on the 8th of May 2013, my local football team - a team supported by a global fan base of millions (>70 Million followers on Facebook) - began a sudden and painful decline after years of back to back success in multiple tournaments.
Since the inception of the Premier League in 1992, Manchester United have amassed a total of 13 trophies out of 28 attempts, more than any other football club.
Sir Alex Ferguson (front row, centre)
Assignment 1 of ‘What’s Your Story’ (McDonald, 2021) involved an introduction to a set of questions to analyse a data-set of sports statistics specifically for Manchester United Football Club (Manchester United Football Club, 2021), and utilised open source data from a UK based football historical data website (Football Data Website, 2021a). Football match statistics (from the past 10 Premier League Seasons) were cleaned and wrangled with the aim of answering the following questions:
Q1) How has Manchester United’s form changed over the last 10 years and with changes of 4 managers? Highlight spikes in the data linked to historical events.
Q2) Was there a specific moment where the overall team performance changed?
Q3) How well has the team “travelled” over the last 10 years? Use a scatterplot to highlight performance based on goals scored at home or away.
During the data wrangling process, several issues were encountered in building a final product during Assignment 1:
Merging of .csv files (due to inconsistent date formats in individual files).
Missing data for certain statistics.
By using a set of different functions than those available for Assignment 1 and using predictive models a more complete set of investigations have been performed with some additional questions proposed for Assignment 2. Ideally, the statistics should reflect any positive moves the football club made (e.g. player purchases in positions of weakness). Now that some new functions have been incorporated, the performance analysis can be applied to the data-set for any Premier League team. The additional questions proposed are as follows:
Q4) How does Manchester United’s Win-Percentage compare to their closest rivals?
Q5) Can a linear regression model be used to predict the relationship between Win-Percentage per Season and Goals per Season based on the last 10 Manchester United seasons?
Q6) What is the strength of the linear relationship between Win-Percentage per Season and Goals per Season for Manchester United and their closest rivals?
These questions are answered with relevant plots, results and discussions.
The following abbreviations are listed as they appear in the raw data-set. The abbreviations are explained on the raw data source website (Football Data Website, 2021b). For ease of use to the coder, these column-name-abbreviations have been renamed to their long form in the data wrangling process.
All data used in this assignment comes from a UK based football historical data website Football-Data (2020) (Football Data Website, 2021a). Football-Data acknowledge Xscores, Sportinglife, BBC Football, ESPN Soccer and Flashscore in the compilation of their results and odds files. All data is made available for download in CSV format.
Looking specifically at the English Premier League, the website has raw data dating back to the 1994. I have specifically chosen 2009-2020 to include 4 seasons of Sir Alex Ferguson’s managerial time. Other data-sets for earlier seasons might need to be included if a different analysis were to be done for a different Premier League team - this is all dependent on what needs to be analysed.
From the Data folder, the following files have been incorporated into Assignment 2:
The following packaged were utlised for Assignment 2.
All coding coding, plots and outputs were generated in Rmarkdown files using R Studio using the R programming language (R Core Team, 2020).
In order to do a season-by-season analysis and explore the data-set, all 11 CSV files need to be joined to form a complete data-set.
#Load tidyverse
library(tidyverse)
# Load readxl package to read in CSV files
library(readxl)
library(knitr)
# Read file location of data into object "path"
path <- "../../Data/Assignment 2"
csv_files <- list.files(path) # List files into object csv_files
# List all files in path
csv_files <- fs::dir_ls(path, regexp = "\\.csv$") #Read all files extensions "csv"
prem_all <- csv_files %>%
map_df(~read_csv(.x, col_types = cols(.default = "c"))) # Use map_df to apply a function to elements of a list, and then bind the dataframes together into prem_all
# This is a tidyverse solution which will bind the csv files into a data frame at the end.
my_data_all <- as_tibble(prem_all)
# Print my_data_all
print(my_data_all)
## # A tibble: 4,180 x 139
## Div Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR Referee
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 E0 2009-0~ Aston Vi~ Wigan 0 2 A 0 1 A M Clatt~
## 2 E0 2009-0~ Blackburn Man City 0 2 A 0 1 A M Dean
## 3 E0 2009-0~ Bolton Sunderl~ 0 1 A 0 1 A A Marri~
## 4 E0 2009-0~ Chelsea Hull 2 1 H 1 1 D A Wiley
## 5 E0 2009-0~ Everton Arsenal 1 6 A 0 3 A M Halsey
## 6 E0 2009-0~ Portsmou~ Fulham 0 1 A 0 1 A M Atkin~
## 7 E0 2009-0~ Stoke Burnley 2 0 H 2 0 H S Benne~
## 8 E0 2009-0~ Wolves West Ham 0 2 A 0 1 A C Foy
## 9 E0 2009-0~ Man Unit~ Birming~ 1 0 H 1 0 H L Mason
## 10 E0 2009-0~ Tottenham Liverpo~ 2 1 H 1 0 H P Dowd
## # ... with 4,170 more rows, and 128 more variables: HS <chr>, AS <chr>,
## # HST <chr>, AST <chr>, HF <chr>, AF <chr>, HC <chr>, AC <chr>, HY <chr>,
## # AY <chr>, HR <chr>, AR <chr>, B365H <chr>, B365D <chr>, B365A <chr>,
## # BWH <chr>, BWD <chr>, BWA <chr>, GBH <chr>, GBD <chr>, GBA <chr>,
## # IWH <chr>, IWD <chr>, IWA <chr>, LBH <chr>, LBD <chr>, LBA <chr>,
## # SBH <chr>, SBD <chr>, SBA <chr>, WHH <chr>, WHD <chr>, WHA <chr>,
## # SJH <chr>, SJD <chr>, SJA <chr>, VCH <chr>, VCD <chr>, VCA <chr>,
## # BSH <chr>, BSD <chr>, BSA <chr>, Bb1X2 <chr>, BbMxH <chr>, BbAvH <chr>,
## # BbMxD <chr>, BbAvD <chr>, BbMxA <chr>, BbAvA <chr>, BbOU <chr>,
## # BbMx>2.5 <chr>, BbAv>2.5 <chr>, BbMx<2.5 <chr>, BbAv<2.5 <chr>, BbAH <chr>,
## # BbAHh <chr>, BbMxAHH <chr>, BbAvAHH <chr>, BbMxAHA <chr>, BbAvAHA <chr>,
## # PSH <chr>, PSD <chr>, PSA <chr>, PSCH <chr>, PSCD <chr>, PSCA <chr>,
## # Time <chr>, MaxH <chr>, MaxD <chr>, MaxA <chr>, AvgH <chr>, AvgD <chr>,
## # AvgA <chr>, B365>2.5 <chr>, B365<2.5 <chr>, P>2.5 <chr>, P<2.5 <chr>,
## # Max>2.5 <chr>, Max<2.5 <chr>, Avg>2.5 <chr>, Avg<2.5 <chr>, AHh <chr>,
## # B365AHH <chr>, B365AHA <chr>, PAHH <chr>, PAHA <chr>, MaxAHH <chr>,
## # MaxAHA <chr>, AvgAHH <chr>, AvgAHA <chr>, B365CH <chr>, B365CD <chr>,
## # B365CA <chr>, BWCH <chr>, BWCD <chr>, BWCA <chr>, IWCH <chr>, IWCD <chr>,
## # IWCA <chr>, WHCH <chr>, ...
Using the map_df function from the tidyverse (Wickham et al., 2021), all 11 csv files have been successfully joined. This was a loading issue in Assignment 1. The join here is successful as the map_df function converts the data into “character” format. This can be observed using the visdat function (Tierney, 2017).
# Load visdat library to see data types for each variable.
library(visdat)
vis_dat(my_data_all)
The successful join shows much of the data variables contain missing values. This would be a potential problem if the purpose of this assignment was to use the betting statistics that are available in this data-set. Some variables could have their values imputed, and some would need to be dropped. The vis_miss function calculates the percentage of missing data.
# Load gridExtra library
library(gridExtra)
library(grid)
miss1 <- vis_miss(my_data_all)
# Select relevant variables for Assignment 2
my_data <- prem_all %>% select(1:23)
miss2 <- vis_miss(my_data)
grid.arrange(miss1, miss2, ncol = 2)
The majority of the 48.8% % of missing data in the data-set is due to only more recent football season data containing newer statistics. As much of the data-set contains information for betting odds, these will not be needing these going forward. select from the dyplr package (R. F. Hadley Wickham, Henry, & Müller, 2021) is used to select the required columns. In the selected data (right), all relevant data variables are present, therefore no N/A data analysis will be required, or any imputations of data points.
In Assignment 1, certain variables were originally numeric format and character format. As these were all converted during the import data process, the mutate_at function can be used to filter out the chosen columns and convert them to numeric format.
my_data <- my_data %>%
mutate_at(c(5:6, 8:9,12:23), as.numeric)
Further wrangling of the data is required to change the Date variable format. Although the column has been unified during the data join, some observations are in %d/%m/%y, %d/%m/%Y or %Y-%m-%d. Using a for-loop function, and using the lubridate library from the tidyverse package (Wickham et al., 2021), the Date column can be placed in a Date format, in a consistent presentation format (%Y-%m-%d).
# Load lubridate library to wrangle the date data
library(lubridate)
# Use solution (From https://stackoverflow.com/questions/13764514/how-to-change-multiple-date-formats-in-same-column)
multidate <- function(my_data, formats){
a<-list()
for(i in 1:length(formats)){
a[[i]]<- as.Date(my_data,format=formats[i])
a[[1]][!is.na(a[[i]])]<-a[[i]][!is.na(a[[i]])]
}
a[[1]]
}
my_data$Date <- multidate(my_data$Date,
c("%d/%m/%y","%Y-%m-%d","%d/%m/%Y"))
my_data$Date <- format(as.Date(my_data$Date, format = "%Y-%m-%d"), "%y-%m-%d")
my_data$Date <- format(as.Date(my_data$Date, format = "%y-%m-%d"), "%Y-%m-%d")
# Convert Date column to Date format
my_data <- my_data %>% mutate_at(c(2), as.Date)
To help display information based on performance for each season, an extra variable is needed to filter out results by Season. This involves filtering the dates and using the mutate function to create a new column to mark the start and finish of a football season. The process to do this uses the stringr function (Wickham, 2019) and is structured as follows:
# Load stringr library
library(stringr)
# Create new blank variable for "Season"
my_data <- mutate(my_data, Season = '', .after = Div)
# Create a __data.frame__ for each start date, end date and season, then use the as.POSIX() function to store date information
dates <- data.frame(Start = c("2009-08-01", "2010-08-01", "2011-08-01", "2012-08-01", "2013-08-01", "2014-08-01", "2015-08-01", "2016-08-01", "2017-08-01", "2018-08-01", "2019-08-01"),
End = c("2010-05-31", "2011-05-31", "2012-05-31", "2013-05-31", "2014-05-31", "2015-05-31", "2016-05-31", "2017-05-31", "2018-05-31", "2019-05-31", "2020-07-31"),
Label = c("09/10", "10/11", "11/12", "12/13", "13/14", "14/15", "15/16", "16/17", "17/18", "18/19", "19/20"),
stringsAsFactors = FALSE)
dates$Start <- as.POSIXct(as.Date(dates$Start))
dates$End <- as.POSIXct(as.Date(dates$End))
# Fill the Season column with "Label" using function(x)
my_data$Season<- sapply(my_data$Date, function(x) {
out <- dates$Label[x >= dates$Start & x <= dates$End]
if (length(out) == 0) "B" else out
})
To further simplify wrangling of the data, the rename function from the dyplr package (R. F. Hadley Wickham, Henry, & Müller, 2021) can be used to break out the variable abbreviations. visdat can then show the cleaned up data-set.
# Load dplyr library
library(dplyr)
# Rename abbreviated column variable names
my_data <- my_data %>%
rename(
"FTHomeGoals" = FTHG,
"FTAwayGoals" = FTAG,
"FTResult" = FTR,
"HTHomeGoals" = HTHG,
"HTAwayGoals" = HTAG,
"HTResult" = HTR,
"HomeTeamShots" = HS,
"AwayTeamShots" = AS,
"HomeTeamShotsonTagret" = HST,
"AwayTeamShotsonTarget" = AST,
"HomeTeamFoulsCom" = HF,
"AwayTeamFoulsCom" = AF,
"HomeTeamCorners" = HC,
"AwayTeamCorners" = AC,
"HomeTeamYellowCards"= HY,
"AwayTeamYellowCards" = AY,
"HomeTeamRedCards" = HR,
"AwayTeamRedCards" = AR,
)
# Use visdat() to display cleaned up data-set
vis_dat(my_data)
At the end of the basic data wrangling process, the object “my_data” contains 4180 observations and has been reduced down to 24 variables with no missing data.
head(my_data)
The most appropriate plot to perform this analysis will be a bar plot to compare the wins, losses and draws for each season. This will give some idea of how Manchester United’s performances have changed over the course of 10 years. Some data wrangling is required to display this properly.
Of the 23 columns selected earlier, only a few will be required for this first plot. The gather function can be used to place any football match statistics into a single column, changing the data-set from wide form into long form. This function places the value of each statistic in a column “value.” With the data-set in long-form, the “statistic” variable is filtered on “FTResult” (full time result) which has a “value” of “H” (home win), “A” (away win) or “D” (draw). This can be converted into a wide format by creating 5 news columns “Home_Win,” “Home_Loss,” “Away_win,” “Away_Loss” and “draw,” filling each category with a “1” or “0” depending on the value in the value column, and then using the pivot_longer function to allow for summarising the Home_win, Away_win or Draw count.
To focus on Manchester United’s fixtures, the filter function can be used to filter out all fixtures where Manchester United played at Home, or Away:
# Setup "Team" object to allow for analysis of different teams
Team = "Man United"
# Use filter() function to filter the data-set based on a particular team
Team_filt <- my_data %>%
filter(HomeTeam %in% c(Team)|AwayTeam %in% c(Team))
# Create object Team_filt_long with new columns for football match results
Team_filt_long <- Team_filt %>%
gather(key = "Statistic", value = "value", -Div, -Season, -Date, -HomeTeam, -AwayTeam, -Referee) %>% # Convert to long-form by gathering all football-match variables
filter(Statistic == "FTResult") %>%
mutate(Home_Win = ifelse(`HomeTeam` == Team & `value` == "H", 1, 0)) %>%
mutate(Home_Draw = ifelse(`HomeTeam` == Team & `value` == "D", 1, 0)) %>%
mutate(Home_Loss = ifelse(`HomeTeam` == Team &`value` == "A", 1, 0)) %>%
mutate(Away_Win = ifelse(`AwayTeam` == Team & `value` == "A", 1, 0)) %>%
mutate(Away_Draw = ifelse(`AwayTeam` == Team & `value` == "D", 1, 0)) %>%
mutate(Away_Loss = ifelse(`AwayTeam` == Team & `value` == "H", 1, 0)) %>%
mutate(Win = Home_Win + Away_Win) %>%
mutate(Loss = Home_Loss + Away_Loss) %>%
mutate(Draw = Home_Draw + Away_Draw) %>%
pivot_longer(cols = Home_Win:Draw,
names_to = "Result",
values_to = "Count")
print(Team_filt_long)
## # A tibble: 3,762 x 10
## Div Season Date HomeTeam AwayTeam Referee Statistic value Result
## <chr> <chr> <date> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 E0 09/10 2009-08-16 Man United Birmingh~ L Mason FTResult H Home_Win
## 2 E0 09/10 2009-08-16 Man United Birmingh~ L Mason FTResult H Home_Dr~
## 3 E0 09/10 2009-08-16 Man United Birmingh~ L Mason FTResult H Home_Lo~
## 4 E0 09/10 2009-08-16 Man United Birmingh~ L Mason FTResult H Away_Win
## 5 E0 09/10 2009-08-16 Man United Birmingh~ L Mason FTResult H Away_Dr~
## 6 E0 09/10 2009-08-16 Man United Birmingh~ L Mason FTResult H Away_Lo~
## 7 E0 09/10 2009-08-16 Man United Birmingh~ L Mason FTResult H Win
## 8 E0 09/10 2009-08-16 Man United Birmingh~ L Mason FTResult H Loss
## 9 E0 09/10 2009-08-16 Man United Birmingh~ L Mason FTResult H Draw
## 10 E0 09/10 2009-08-19 Burnley Man Unit~ A Wiley FTResult H Home_Win
## # ... with 3,752 more rows, and 1 more variable: Count <dbl>
To display the first plot, the data needs to be grouped by Season and Result and then filtered on “Win,”“Draw,”“Loss.” By using the piping function, the mutate function is used to create a new column named “TotalCount” for “Win,”“Draw,”“Loss” for each Season. The final plot is an interactive bar graph created using the plotly package (Sievert, 2020) showing Season vs Total Count of wins, draws or losses.
# Load plotly() for interactive plots
library(plotly)
# Group by Season and Result, then filter for "Win","Draw","Loss"
by_seasonres1 <- Team_filt_long %>%
group_by(Season, Result) %>%
filter(Result %in% c("Win","Draw","Loss"))
# Sum the total count for each of the Results
Team_performance1 <- by_seasonres1 %>%
mutate(TotalCount = sum(Count, na.rm = TRUE)) %>%
ggplot(aes(x = Season, y = TotalCount, fill = Result)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Plot 1: Manchester United's Performances since 2009", x = "Season", y = "Number of Wins, Losses or Draws per Season", fill = "Result")+
scale_fill_brewer(palette="RdYlBu")+
scale_x_discrete(limits=rev)+
coord_flip() +
facet_wrap(~Result) +
theme(legend.position="top")
ggplotly(Team_performance1, height = 600, width = 800)
Plot 1 highlights how from 2009-2013, wins were much higher than in the most recent seasons. Any seasons post 2013 show a sharp increase in both losses and draws. These observations coincide with a change in manager from Sir Alex Ferguson to the tenure of David Moyes in 13/14. A spike in wins can be observed in the 17/18 season when Jose Mourinho lead the team to a 2nd place finish in the league.
To better understand some aspects of team performance, the same plot used to answer question 1 can be converted to show an absolute value - this displays the results as a ratio of 100%.
# Group by Season and Result, then filter for "Win","Draw","Loss"
by_seasonres2 <- Team_filt_long %>%
group_by(Season, Result) %>%
filter(Result %in% c("Win","Draw","Loss"))
# Bar-plot the results using absolute value for comparison
Team_performance2 <- by_seasonres2 %>%
mutate(total_count = sum(Count, na.rm = TRUE))%>%
ggplot(aes(x = Season, y = total_count, fill = Result)) +
geom_bar(stat = "identity", position = "fill") + # Change out position = "dodge" to "fill" to show in in terms of 100%
labs(title = "Plot 2: Manchester United's Performances since 2009 - Win:Loss:Draw Ratio", y = "Percentage")+
scale_x_discrete(limits=rev)+
coord_flip()+
scale_fill_brewer(palette="RdYlBu")
ggplotly(Team_performance2, height = 600, width = 800)
Using an absolute scale, the plot used in question 2 shows in which season the greatest percentage of wins to losses ratio occurred. The 11/12 and 12/13 seasons were Manchester United’s best season of the last 10 years, with the most recorded wins. The 13/14 season is an outlier in this plot, showing a severe reduction in team performances (highest number of total losses in 10 years). This plot helps answer the question of what specific moment overall team performance changed.
To help understand how well has the team “travelled” over the last 10 years, a scatterplot can be used to highlight performance based on goals scored at home or away. By using the piping function we can use the mutate function to create 2 new columns named “HomeGoalsFor,” “AwayGoalsFor” and “TotalGoals” for total home goals and total away goals for each season. These are calculated using the sum function.
# Filter team for analysis - choose Manchester United here
Team = "Man United"
# Create new columns for results and put into object "Team_filt_long2"
Team_filt_long2 <- Team_filt %>%
select(Div, Season, Date, HomeTeam, AwayTeam, FTResult, FTHomeGoals, FTAwayGoals) %>%
mutate(HomeGoalsFor = ifelse(HomeTeam == Team, FTHomeGoals, 0)) %>%
mutate(HomeGoalsConceded = ifelse(HomeTeam == Team, FTAwayGoals, 0)) %>%
mutate(AwayGoalsFor = ifelse(AwayTeam == Team, FTAwayGoals, 0)) %>%
mutate(AwayGoalsConceded = ifelse(AwayTeam == Team, FTHomeGoals, 0 )) %>%
mutate(GoalsConceded = HomeGoalsConceded + AwayGoalsConceded) %>%
mutate(GoalsFor = rowSums(cbind(HomeGoalsFor,AwayGoalsFor)))
# Sum Total Home Goals and Total Away Goals, then plot
by_seasonres3 <- group_by(Team_filt_long2, Season)
Team_performance3 <- by_seasonres3 %>%
group_by(Season) %>%
mutate(TotalHomeGoalsFor = sum(HomeGoalsFor, na.rm = TRUE)) %>%
mutate(TotalAwayGoalsFor = sum(AwayGoalsFor, na.rm = TRUE)) %>%
mutate(TotalGoals = TotalHomeGoalsFor + TotalAwayGoalsFor) %>%
ggplot(aes(x = TotalHomeGoalsFor, y = TotalAwayGoalsFor, colour = Season)) +
geom_point(aes(size = TotalGoals)) +
scale_colour_brewer(palette = "Paired") +
labs(title = "Plot 3: Manchester United's Home Goals vs Away Goals scored since 2009", x = "Total Home Goals", y = "Total Away Goals")+
xlim(20, 60) + # x and y axis need to be the same scale to highlight and differences
ylim(20, 60) +
geom_vline(xintercept=43, linetype="dashed", color = "red")
ggplotly(Team_performance3, height = 600, width = 800)
This plot shows some correlations linked to exterior factors. For the majority of seasons over the last 10 years, Manchester United have been relatively poor travellers (consistently scoring < 40 goals per season). However in the 11/12 and 12/13 seasons the team had figured out how to maximise performance in away-day games. To the right of the dashed vertical geom_vline (set to 43), a clear differential can be seen between the team performances during and after the reign of Sir Alex Ferguson. The 18/19 and 19/20 season data-points also highlight that there have been recent improvements during the reign of manager Ole Gunnar Solskjær.
To compare “Win Percentage” over the last 10 years with those of other rival teams, an interactive lineplot can be used to highlight variance in performance, season by season. To yield this information from the data-set requires some extra wrangling. Total home wins and total away wins need to be calculated by being grouped into respective categories and stored in separate objects, then have key variables selected, then the objects can be joined using the right_join function from the tidyverse (Wickham et al., 2021).
# Calculate total Home wins
All_Team_filt_long_Home <- my_data %>%
gather(key = "Statistic", value = "value", -Div, -Season, -Date, -HomeTeam, -AwayTeam, -Referee) %>% # Convert to long-form by gathering all football-match variables
filter(Statistic == "FTResult") %>%
group_by(Season, HomeTeam) %>%
mutate(Home_Win = ifelse(`value` == "H", 1, 0)) %>%
mutate(Away_Win = ifelse(`value` == "A", 1, 0)) %>%
mutate(Away_Draw = ifelse(`value` == "D", 1, 0)) %>%
mutate(Total_Home_Wins = (sum(Home_Win, na.rm = TRUE)))
# Calculate total Away wins
All_Team_filt_long_Away <- my_data %>%
gather(key = "Statistic", value = "value", -Div, -Season, -Date, -HomeTeam, -AwayTeam, -Referee) %>% # Convert to long-form by gathering all football-match variables
filter(Statistic == "FTResult") %>%
group_by(Season, AwayTeam) %>%
mutate(Home_Win = ifelse(`value` == "H", 1, 0)) %>%
mutate(Away_Win = ifelse(`value` == "A", 1, 0)) %>%
mutate(Away_Draw = ifelse(`value` == "D", 1, 0)) %>%
mutate(Total_Away_Wins = (sum(Away_Win, na.rm = TRUE)))
# Select key variables from Away data
All_Team_filt_long_Away <- All_Team_filt_long_Away %>%
select(Season, Date, AwayTeam, Total_Away_Wins)
# Select key variables from Home data
All_Team_filt_long_Home <- All_Team_filt_long_Home %>%
select(Season, Date, HomeTeam, Total_Home_Wins)
# Join Home and Away data using right_join
All_Team_filt_long <- All_Team_filt_long_Home %>% right_join(All_Team_filt_long_Away, by = c("HomeTeam" = "AwayTeam"))
# Calculate wins as a percentage for the season
All_Team_filt_long_Sum <- All_Team_filt_long %>%
group_by(Season.x, Season.y, HomeTeam) %>%
mutate(Win_Percentage = ifelse(Season.x == Season.y, (((Total_Home_Wins + Total_Away_Wins)/38)*100), 0))
# Filter data-set for key rival teams of Manchester United
All_Team_filt_Long_Final <- All_Team_filt_long_Sum %>%
select(Season.x, HomeTeam, Win_Percentage) %>%
distinct(Season.x, HomeTeam, Win_Percentage) %>%
filter(Season.x == Season.y) %>%
filter(HomeTeam %in% c("Arsenal","Chelsea", "Leicester", "Liverpool", "Man City", "Man United", "Tottenham"))
# Plot a facet_wrap line plot of "Win PErcentage" per season for each team
Plot4 <- ggplot(All_Team_filt_Long_Final) +
geom_line(aes(x = Season.x, y = Win_Percentage, group = HomeTeam, color=HomeTeam), stat='identity')+
facet_wrap(~HomeTeam)+
labs(title = "Plot 4: Comparison of Win Percentage between Manchester United's Rivals", color = "Team", y = "Win Percentage (%)", x = "Premier League Season")+
theme(axis.text.x = element_text(angle=50, hjust=1))
ggplotly(Plot4, height = 600, width = 800)
Using geom_line helps show the evolution of team performance. It can be observed that Manchester United’s performances have not recovered since 12/13 with performances on an overall downward trajectory. Manchester City’s performances have been on an overall upwards trajectory, peaking in 17/18 and 18/19. Arsenal have entered a very turbulent period in recent seasons. Liverpool have experienced a steady rise in performances peaking in the 19/20 season when they won the Premier League. Tottenham have made some improvements but have never achieved better than 70% win percentage. Leicester City have had a dramatic entry into Premier League, winning the league in only their 2nd season (15/16). The outlier in this data-set can be observed in Chelsea’s performances when they finished 10th in the 15/16 season.
Firstly, a scatterplot of “Win Percentage” versus “Goals Conceded per season” needs to be constructed.
# Calculate Total Goals Conceded per Season
Team_filt_long3 <- Team_filt %>%
select(Div, Season, Date, HomeTeam, AwayTeam, FTResult, FTHomeGoals, FTAwayGoals) %>%
group_by(Season) %>%
mutate(HomeGoalsFor = ifelse(HomeTeam == Team, FTHomeGoals, 0)) %>%
mutate(HomeGoalsConceded = ifelse(HomeTeam == Team, FTAwayGoals, 0)) %>%
mutate(AwayGoalsFor = ifelse(AwayTeam == Team, FTAwayGoals, 0)) %>%
mutate(AwayGoalsConceded = ifelse(AwayTeam == Team, FTHomeGoals, 0 )) %>%
mutate(GoalsConceded = HomeGoalsConceded + AwayGoalsConceded) %>%
mutate(GoalsFor = rowSums(cbind(HomeGoalsFor,AwayGoalsFor))) %>%
mutate(TotalHomeGoalsFor = sum(HomeGoalsFor, na.rm = TRUE)) %>%
mutate(TotalAwayGoalsFor = sum(AwayGoalsFor, na.rm = TRUE)) %>%
mutate(TotalHomeGoalsConceded = sum(HomeGoalsConceded, na.rm = TRUE)) %>%
mutate(TotalAwayGoalsConceded = sum(AwayGoalsConceded, na.rm = TRUE)) %>%
mutate(TotalGoals = TotalHomeGoalsFor + TotalAwayGoalsFor) %>%
mutate(TotalGoalsConceded = TotalHomeGoalsConceded + TotalAwayGoalsConceded)
# Calculate Win Percentage per Season
WinPct <- All_Team_filt_long_Sum %>%
select(Season.x, HomeTeam, Win_Percentage) %>%
distinct(Season.x, HomeTeam, Win_Percentage) %>%
filter(Season.x == Season.y) %>%
filter(HomeTeam == Team)
# Join Total Goals and Win Percentage data using right_join
JoinWinPCtGoals <- Team_filt_long3 %>% right_join(WinPct, by = c("Season" = "Season.x"))
JoinWinPCtGoals2 <- JoinWinPCtGoals %>%
select(7:14, 19:20, 23) %>%
distinct(TotalGoals, TotalGoalsConceded, Win_Percentage)
A linear regression model then needs to be trained to model “Goals per Season” using the joined data-set of “Win Percentage” and “Goals per season.” This process involves using the broom package (David Robinson et al., 2021).
# Load the broom package
library(broom)
# Fit a linear regression model of TotalGoals on Win_Percentage using the joined data-set
mod1 <- lm(formula = TotalGoals ~ Win_Percentage , data = JoinWinPCtGoals2)
summary(mod1)
##
## Call:
## lm(formula = TotalGoals ~ Win_Percentage, data = JoinWinPCtGoals2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.639 -2.639 2.510 3.861 8.218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.3676 11.4457 0.556 0.591543
## Win_Percentage 1.0854 0.1931 5.622 0.000325 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.618 on 9 degrees of freedom
## Multiple R-squared: 0.7783, Adjusted R-squared: 0.7537
## F-statistic: 31.6 on 1 and 9 DF, p-value: 0.0003251
The summary output of the fitted linear regression model of “Goals per Season” on “Win Percentage” can be written in equation form: \[ \widehat{TotalGoals} = 6.3676 + 1.0854 * Win Percentage \] These 2 chosen variables are naturally interlinked as expected, hence a high R-squared value (0.7783) is to be expected. This value means “Win Percentage” helps to explain 78% of the variation in “Goals per Season.”
Using augment from the broom package (David Robinson et al., 2021) the residual values of error for the model can be pulled and analysed.
# Pass information about the model fit (augment) into object stat_data
stat_data = augment(mod1)
# Plot fitted versus residual values
s1 <- stat_data %>%
ggplot(aes(x = .fitted, y = .resid)) +
geom_point()+
geom_smooth()
# Histogram plot residual values
s2 <- stat_data %>%
ggplot(aes(x = .resid)) +
geom_histogram()
# Grid arrange the 2 plots.
grid.arrange(s1, s2, ncol = 2)
In the left plot, using the geom_smooth function displays a blue line showing the conditional mean when the fitted value is x. The right plot histogram displays the distribution of the residuals. As there are only 11 points of data in this example, it is difficult to assess if the model would be that accurate if more data points were added. It can be observed in the histogram of residuals on the right that the model is slightly left-skewed.
A the linear regression model of “Goals Conceded per Season” on “Win Percentage” should logically have a similar interlinked relationship with “Win Percentage” as “Goals per Season.”
# Fit a linear regression model of TotalGoals on Win_Percentage using the joined data-set
mod2 <- lm(formula = TotalGoalsConceded ~ Win_Percentage , data = JoinWinPCtGoals2)
summary(mod2)
##
## Call:
## lm(formula = TotalGoalsConceded ~ Win_Percentage, data = JoinWinPCtGoals2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.763 -4.720 -0.746 2.763 15.745
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.9175 13.6755 3.504 0.00668 **
## Win_Percentage -0.1933 0.2307 -0.838 0.42387
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.907 on 9 degrees of freedom
## Multiple R-squared: 0.07234, Adjusted R-squared: -0.03074
## F-statistic: 0.7018 on 1 and 9 DF, p-value: 0.4239
# Bind the 2 model's key data together with "rbind"
rbind(glance(mod1), glance(mod2)) %>%
select(r.squared, adj.r.squared, AIC, BIC)
## # A tibble: 2 x 4
## r.squared adj.r.squared AIC BIC
## <dbl> <dbl> <dbl> <dbl>
## 1 0.778 0.754 76.6 77.8
## 2 0.0723 -0.0307 80.5 81.7
By combining the summary of the models, it can be observed that the linear relationship for “Goals Conceded per Season” and “Win Percentage” has a lower “Goodness of fit” (Monash University, 2020) than that of “Goals per Season” and “Win Percentage.” We can quantify the relationship between “Goals Conceded per Season” and “Win Percentage” by showing that only 7% of the variation in “Goals Conceded per Season” can be explained by a fitted linear regression model.
# augment mod1 with JoinWinPCtGoals2 data
MU_mod <- augment(mod1, JoinWinPCtGoals2)
# Create a pltly geom_point plot with linerar regression model.
by_seasonres6 <- ggplot(data = MU_mod, aes(y = TotalGoals, x = Win_Percentage, colour = Season)) +
geom_point(aes(size = TotalGoals)) +
geom_line(aes(y=.fitted), colour="blue") +
scale_colour_brewer(palette = "Paired") +
labs(title = "Plot 5: Predicting Goals per Season with a Linear Regression of Win Percentage Comparison of Win Percentage", y = "Goals per Season", x = "Win Percentage (% for the Season)")
ggplotly(by_seasonres6)
As a predictive tool, the application of this model could be useful at team management level in terms of predicting the minimum number of goals required to achieve a certain position in the Premier League table, with previous Premier League success-markers in the prediction as a benchmark.
To compare the linear relationships between “Win Percentage” and “Total Goals Scored” per season of Manchester United’s main rivals over the last 10 years requires some wrangling. The Win Percentage" and “Total Goals Scored” are calculated using the sum function and joined using right_join.
# Sum Total Home Goals for all teams
All_Team_Goals_long_Home <- my_data %>%
select(Div, Season, Date, HomeTeam, AwayTeam, FTResult, FTHomeGoals, FTAwayGoals) %>%
group_by(Season, HomeTeam) %>%
mutate(Total_Home_Goals = (sum(FTHomeGoals, na.rm = TRUE)))
# Sum Total Away Goals for all teams
All_Team_Goals_long_Away <- my_data %>%
select(Div, Season, Date, HomeTeam, AwayTeam, FTResult, FTHomeGoals, FTAwayGoals) %>%
group_by(Season, AwayTeam) %>%
mutate(Total_Away_Goals = (sum(FTAwayGoals, na.rm = TRUE)))
# Select Relevant Columns from Away Goals Object
All_Team_Goals_long_Away <- All_Team_Goals_long_Away %>%
select(Season, Date, AwayTeam, Total_Away_Goals)
# Select Relevant Columns from Away Goals Object
All_Team_Goals_long_Home <- All_Team_Goals_long_Home %>%
select(Season, Date, HomeTeam, Total_Home_Goals)
# USe right_join() to join the 2 objects
All_Team_Goals_long <- All_Team_Goals_long_Home %>% right_join(All_Team_Goals_long_Away, by = c("HomeTeam" = "AwayTeam"))
# Combine Home and Away goals to cvreate a Total_Goals_Scored column
All_Team_Goals_long_Sum <- All_Team_Goals_long %>%
group_by(Season.x, Season.y, HomeTeam) %>%
select(1:4, 7) %>%
filter(Season.x == Season.y) %>%
mutate(Total_Goals_Scored = ifelse(Season.x == Season.y, (Total_Home_Goals + Total_Away_Goals), 0)) %>%
distinct(Season.x, Season.y, HomeTeam, Total_Goals_Scored)
# Call on Win Percentage data calculated in Plot 4
WinPctAll <- All_Team_filt_long_Sum %>%
select(Season.x, HomeTeam, Win_Percentage) %>%
distinct(Season.x, HomeTeam, Win_Percentage) %>%
filter(Season.x == Season.y)
# Combine Win Percentage data with Total_Goals_Scored and filter for"Arsenal","Chelsea", "Leicester", "Liverpool", "Man City", "Man United" and "Tottenham"
JoinWinPCtGoalsAll <- All_Team_Goals_long_Sum %>% right_join(WinPctAll, by = c("Season.x" = "Season.y"))
JoinWinPCtGoalsAll2 <- JoinWinPCtGoalsAll %>%
filter(Season.x == Season.y & HomeTeam.x == HomeTeam.y) %>%
select(2: 4, 7) %>%
filter(HomeTeam.x %in% c("Arsenal","Chelsea", "Leicester", "Liverpool", "Man City", "Man United", "Tottenham"))
# Calculate Win Percentage per Season
WinPct <- All_Team_filt_long_Sum %>%
select(Season.x, HomeTeam, Win_Percentage) %>%
distinct(Season.x, HomeTeam, Win_Percentage) %>%
filter(Season.x == Season.y) %>%
filter(HomeTeam == Team)
# Join Total Goals and Win Percentage data using right_join
JoinWinPCtGoals <- Team_filt_long3 %>% right_join(WinPct, by = c("Season" = "Season.x"))
JoinWinPCtGoals2 <- JoinWinPCtGoals %>%
select(7:14, 19:20, 23) %>%
distinct(TotalGoals, TotalGoalsConceded, Win_Percentage)
“Win Percentage” and “Total Goals Scored” per season filtered for Manchester United’s rivals (Arsenal, Chelsea, Leicester, Liverpool, Man City, Man United, Tottenham), a plot of a line graph grouped by team and its fitted model can be compared using the broom function.
# Load the broom package to bring in information from augment()
library(broom)
# Group by team, nest, use mutate to fit a model for each Team
by_team <- JoinWinPCtGoalsAll2 %>%
select(Season.x, HomeTeam.x, Total_Goals_Scored, Win_Percentage) %>%
group_by(HomeTeam.x) %>%
nest() %>%
mutate(
model = purrr::map(data, ~ lm(Total_Goals_Scored ~ Win_Percentage,
data = .))) %>%
ungroup()
# Unnest the model column to return the fitted values, residuals, etc.
team_model <- by_team %>%
mutate(model = map(model, broom::augment)) %>%
unnest(model)
# Line plot of "Win Percentage" versus "Total Goals Scored" for each team
m1 <- JoinWinPCtGoalsAll2 %>%
ggplot(aes(Win_Percentage, Total_Goals_Scored, group = HomeTeam.x, color = HomeTeam.x)) +
geom_line(alpha = 1) +
geom_point()+
labs(color = "Team", title = "Data", x = "Win Percentage (% for the Season)", y = "Goals per Season")
# Plot of fitted linear model for each team comparing the 2 variables.
m2 <- ggplot(team_model) +
geom_line(aes(x = Win_Percentage, y = .fitted, group = HomeTeam.x, color = HomeTeam.x), alpha = 1) +
labs(color = "Team", title = "Fitted Models", x = "Win Percentage (% for the Season)", y = "Prediction")
# Arrange above plots into 2-column grid
grid.arrange(m1, m2, ncol = 2)
The figure shows the “Goals per Season” of each of Manchester United’s main rivals against each season’s “Win Percentage” (left) and the fitted linear models of “Goals per Season” for each team’s “Win Percentage” (right). The small number of data points for each team representing 1 season each causes a messy line plot on the (left). The plot of fitted linear models shows no dramatic difference between the teams, this is to be expected as the teams being compared are all in the top 10 performing teams. Slopes and intercept would differ greatly between the top team and the lowest ranked team. A tidy faceted plot of the model fits can be seen below.
# Line plot of life expectancy for each country
m3 <- JoinWinPCtGoalsAll2 %>%
ggplot(aes(Win_Percentage, Total_Goals_Scored, group = HomeTeam.x, color = Season.x)) +
#geom_line(alpha = 1) +
geom_point(aes(size = Total_Goals_Scored))+ #+geom_text(aes(label=Season.x),hjust=0, vjust=0)+
facet_wrap(~HomeTeam.x)+
geom_smooth(method="lm", se=FALSE)+
labs(title = "Plot 6: Predicting Goals per Season with a Linear Regression of Win Percentage Comparison of Win Percentage", y = "Goals per Season", x = "Win Percentage (% for the Season)")
ggplotly(m3, height = 600, width = 800)
The estimated slope against the estimated intercept coefficient can be extracted (using the __unnest_, map and tidy functions) for each team’s fitted model for “Goals per Season” and “Win Percentage.” The relationship between a team’s “Win Percentage” and their output of “Goals per Season” can be explored which could offer commentary on the teams’ tactical approach in terms of being attacking minded or defensive.
# Unnest the model column in a tidy way
team_coefs <- by_team %>%
mutate(model = map(model, broom::tidy)) %>%
unnest(model)
# Wrangle the data - intercept and slope coefficient as columns
team_coefs <- team_coefs %>%
select(HomeTeam.x, data, term, estimate) %>%
pivot_wider(names_from = term, values_from = estimate) %>%
rename(intercept = `(Intercept)`)
# Scatter plot of estimated slope against intercept coefficient
sp <- ggplot(team_coefs, aes(x=intercept, y=Win_Percentage,
colour=HomeTeam.x, label=HomeTeam.x)) +
geom_point(alpha=0.5, size=2) +
labs(title = "Scatter plot of estimate slope and intercept coefficient",
x = "Intercept",
y = "Slope",
color = "Team") +
scale_color_brewer(palette = "Dark2")
# Extract the R-squared of each country's fitted linear model
team_fit <- by_team %>%
mutate(model = map(model, broom::glance)) %>%
unnest(model)
# Plot the R2 values as a histogram
r2 <- ggplot(team_fit, aes(x=r.squared, color = HomeTeam.x)) + #color = HomeTeam.x
geom_histogram(position="identity", alpha=0.5, bins=15, size=1) +
labs(title = "Distribution of R-squared from 7 linear models",
subtitle = "Models of each teams Goals per Season with Win Percentage as the explanatory variable",
x = "R-squared",
color = "Team")
# Arrange above plots into 2-column grid
grid.arrange(sp, r2, ncol = 2)
The scatter plot (left) of the estimated slope against the estimated intercept coefficient for every country shows that the relationship between the estimated coefficients is negative. The number of goals per season is generally higher for teams with a higher win percentage (which is expected). A lower angle of slope shows a model fit describing a team with increasing win percentage requiring a lower goals output to do so. This may reflect teams with a more defensive mindset and a lower attacking ability. Interestingly, the linear fits for Manchester United and Manchester City have high angle slopes and lower intercepts showing that with an increase in win percentage, a big increase in goals per season can be observed. Arsenal and Leicester have high intercepts and a low angle slope - this could reflect a more defensive mindset as there is a lower increase “Goals per Season” with increasing “Win Percentage” for these teams, who in more recent seasons have scored a lower number of goals than their main rivals.
The histogram plot (right) shows the distribution of goodness-of-fit (Monash University, 2020) for each team’s model. Coincidentally for this assignment, the model for Manchester United has the highest R-squared value (0.78), which means “Win Percentage” helps to explain 78% of the variation in “Goals per Season.” The R-squared value for Arsenal is much lower (0.49) - this lack of fit could be due to the erratic form and lack of consistency of their team performance over the last ten years. Using the Bayesian information criterion (BIC), a lower BIC means that a model is considered to be more likely to be the true model. Listing the BIC values for each team’s model (below) reveals that the R-squared value does not necessarily suggest a higher R-squared equals a true fit model. The model with the highest R-squared value and lowest BIC is Leicester City, whilst Chelsea has a low R-squared and a high BIC value. This means that on current trend, Leicester City’s model would be the best model to use in a predictive sense, however the linear fit for Leicester is based on a lower number of data points.
team_fit %>% select(HomeTeam.x, r.squared, adj.r.squared, AIC, BIC)
## # A tibble: 7 x 5
## HomeTeam.x r.squared adj.r.squared AIC BIC
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Chelsea 0.539 0.487 83.2 84.4
## 2 Man United 0.778 0.754 76.6 77.8
## 3 Tottenham 0.599 0.554 74.4 75.6
## 4 Liverpool 0.643 0.603 87.1 88.3
## 5 Arsenal 0.491 0.434 71.2 72.4
## 6 Man City 0.660 0.623 85.2 86.4
## 7 Leicester 0.768 0.710 40.2 39.6
With a thorough process of data wrangling and cleaning of Premier League statistics files, a cleaner, stream-lined and variable-appropriate data-set has been created for multi-team data analysis. By removing gambling statistical variables, there are no missing data points in the data-set. At the end of the basic data wrangling process, “my_data” contains 4180 observations and has been reduced down to 24 variables.
Question 1 looked at the relationship between time and results for Manchester United over the last 10 seasons and with changes of 4 managers? The bar chart plot was able to highlight periods where wins were much higher than in the most recent seasons (2009-2013). Any seasons post 2013 showed a sharp increase in both losses and draws, which coincides with management changes, and shows where some improvement was made in performance (17/18 season).
Question 2 used an absolute value plot (displaying the results as a ratio of 100%) to test specific moment where the overall performance changed. The 11/12 and 12/13 seasons were Manchester United’s best season of the last 10 years, with the most recorded wins. The 13/14 season is an outlier - showing the severe reduction in team performances (highest number of total losses in 10 years).
Question 3 used a scatterplot to show the relationship between total home goals and total away goals per season, to test how well has the team “travelled” over the last 10 Seasons. The plot highlights a clear differential between the team performances during and after the reign of Sir Alex Ferguson where performances had much higher home and away goals scored.
Question 4 used line plots to observe the Win Percentage for each of Manchester United’s main rivals. The spikes in the line plots, and their overall positive or negative trends correlate very well with known historical changes and external factors that affect team performance. Manchester United’s performances have been on an overall downward trajectory, whereas main rivals (Manchester City and Liverpool) have been on strong upward trajectories.
Question 5 tested if a linear regression model could be used to predict the relationship between “Win-Percentage” per Season and “Goals per Season” based on the last 10 Manchester United seasons using a scatterplot of “Win Percentage” versus “Goals per season.” The 2 chosen variables had a strong correlation (as would be expected) as shown by a high R-squared value. With only 11 data-points, it was difficult to assess if the model would be that accurate for a prediction outside of the one tested team (Manchester United). It can be observed in the histogram that the model is slightly left-skewed. A relationship between “Goals Conceded per Season” on “Win Percentage” was also investigated on the presumption that there would also be a close (inverse) relationship, however analysis showed that “Win Percentage” only helps to explain 7% of the variation in “Goals Conceded per Season.”
Question 6 applied the process of answering question 5 to Manchester United’s rival teams. When comparing the slopes and intercepts of fitted models, it could be observed that some teams would have an exponential increase in “Goals per Season” with increasing “Win Percentage,” whereas some team have high intercepts and a low angle slope - possibly reflecting a more defensive mindset in their strategies. When comparing the R-squared and BIC values for each model, the model for Leicester City had a high R-squared and low BIC value, indicating a higher likelihood that the this is a truer model fit, and thus might be a useful predictor for future “Goals per Season.”
This exercise has generated a better data-set to work with in order to investigate team performance. By incorporating linear regression models, and with further work it will be possible to use linear relationships between time and match statistics, or 2 differing match statistics as a predictive tool, eventually providing some potential value to football management or betting companies.