I set out to analyze whether other goalkeeping factors help to explain the number of goals goalkeepers have conceded so far in the 2021-22 season of the English Premier League.
The original data came from fbref.com, specifically the page https://fbref.com/en/comps/9/keepersadv/Premier-League-Stats#stats_keeper_adv. The data were current as of February 21, 2022. After downloading the file, I loaded it into R and renamed the columns of interest using the code below:
GK_Data <- read.csv('C:/Users/user/Documents/Edureka/GK_Data.csv')
names(GK_Data)[2] <- "Player"
names(GK_Data)[5] <- "Team"
names(GK_Data)[8] <- "Minutes_per_ninety"
names(GK_Data)[9] <- "Goals_Conceded"
names(GK_Data)[10] <- "PK_Conceded"
names(GK_Data)[11] <- "FK_Conceded"
names(GK_Data)[12] <- "CK_Conceded"
names(GK_Data)[18] <- "Passes_Completed_o_forty_yrds"
names(GK_Data)[19] <- "Passes_Attempted_o_forty"
names(GK_Data)[20] <- "Rate_of_Passes_Completed_o_forty"
names(GK_Data)[25] <- "Goal_Kicks_Attempted"
names(GK_Data)[27] <- "Avg_Distance_of_Goal_Kicks"
names(GK_Data)[28] <- "Opp_Crosses_Attempted"
names(GK_Data)[29] <- "Opp_Crosses_Caught"
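As an illustration only, the position-based renaming above can also be driven by a single lookup vector, which keeps every position and its new name in one place. This is a minimal sketch on a made-up five-column data frame; `toy` and `new_names` are hypothetical names, not part of the original analysis.

```r
# Toy data frame standing in for the downloaded file (values are made up)
toy <- data.frame(V1 = 1, V2 = "a", V3 = 2, V4 = 3, V5 = "x")

# Hypothetical lookup: the vector names are column positions, the values are new names
new_names <- c("2" = "Player", "5" = "Team")

# Apply every rename in a single assignment
idx <- as.integer(names(new_names))
names(toy)[idx] <- unname(new_names)

print(names(toy))
```

Columns whose positions are not listed in the lookup keep their original names.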
After the data were read into R, I cleaned and reshaped them so that only the columns needed for the analysis remained, converted the statistic columns to numeric, and then checked every column for missing values.
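# Drop the columns that are not needed for the analysis (selected by position)
GK_Data <- GK_Data[-c(1,3:4,6:7,13:17,21:24,26,30:34)]
# Drop the first row (assumed here to be a leftover header row from the export)
GK_Data <- GK_Data[-c(1),]
# Convert the statistic columns (3 to 14) to numeric
GK_Data1 <- apply(GK_Data[, 3:14], 2, as.numeric)
# Keep the remaining text columns (Player and Team)
GK_Data2 <- GK_Data[-c(3:14)]
# Recombine the text and numeric columns
GK_Data <- cbind(GK_Data2, GK_Data1)
# Check each column for missing values
apply(GK_Data, 2, function(x) any(is.na(x)))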
## Player Team
## FALSE FALSE
## Minutes_per_ninety Goals_Conceded
## FALSE FALSE
## PK_Conceded FK_Conceded
## FALSE FALSE
## CK_Conceded Passes_Completed_o_forty_yrds
## FALSE FALSE
## Passes_Attempted_o_forty Rate_of_Passes_Completed_o_forty
## FALSE FALSE
## Goal_Kicks_Attempted Avg_Distance_of_Goal_Kicks
## FALSE FALSE
## Opp_Crosses_Attempted Opp_Crosses_Caught
## FALSE FALSE
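The split, convert, and recombine steps above can also be done in place. The following is a minimal sketch on a made-up data frame, assuming (as in the code above) that the statistic columns arrive as text; `toy` and `num_cols` are hypothetical names used only for illustration.

```r
# Toy data frame standing in for GK_Data (values are made up for illustration)
toy <- data.frame(
  Player = c("A", "B"),
  Team = c("X", "Y"),
  Goals_Conceded = c("10", "7"),
  PK_Conceded = c("1", "0"),
  stringsAsFactors = FALSE
)

# Convert the chosen columns to numeric without splitting the data frame
num_cols <- 3:4
toy[num_cols] <- lapply(toy[num_cols], as.numeric)

# Count missing values per column; a non-zero count flags a failed conversion
na_by_col <- colSums(is.na(toy))
print(na_by_col)
```

Counting missing values with `colSums(is.na(...))` gives the number of problem cells per column rather than only a TRUE/FALSE flag.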
On inspecting the data set, I noticed that several players' names did not display correctly because the website lists them in their native spellings, with accented characters that in some cases were garbled on import. I corrected these manually, since each case was different and there were only a few of them. After this, calculations were carried out on the columns to be analyzed, starting with converting the pass completion rate from a percentage to a proportion.
GK_Data[GK_Data == "Martin Dúbravka"] <- "Martin Dubravka"
GK_Data[GK_Data == "Åukasz FabiaÅ„ski"] <- "Lukasz Fabianski"
GK_Data[GK_Data == "Ãlvaro Fernández"] <- "Alvaro Fernandez"
GK_Data[GK_Data == "Emiliano MartÃnez"] <- "Emiliano Martinez"
GK_Data[GK_Data == "José Sá"] <- "Jose Sa"
GK_Data[GK_Data == "Robert Sánchez"] <- "Robert Sanchez"
GK_Data$Rate_of_Passes_Completed_o_forty <- (GK_Data$Rate_of_Passes_Completed_o_forty/100)
Using some of the existing columns, two new columns were created to hold values that would be useful when plotting the correlogram: the rate of opposition crosses caught and the goals conceded per ninety minutes.
Rate_of_Crosses_Caught <- round((GK_Data$Opp_Crosses_Caught/GK_Data$Opp_Crosses_Attempted),2)
Goals_per_ninety <- round((GK_Data$Goals_Conceded/GK_Data$Minutes_per_ninety),2)
GK_Data <- cbind(GK_Data, Rate_of_Crosses_Caught)
GK_Data <- cbind(GK_Data, Goals_per_ninety)
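To make the two derived columns concrete, here is a small worked example with made-up numbers for two hypothetical goalkeepers (none of these values come from fbref). It assigns the new columns directly instead of using `cbind`, which is only a stylistic alternative.

```r
# Made-up values for two hypothetical goalkeepers (not taken from fbref)
toy <- data.frame(
  Goals_Conceded = c(30, 18),
  Minutes_per_ninety = c(20, 24),
  Opp_Crosses_Attempted = c(200, 150),
  Opp_Crosses_Caught = c(16, 9)
)

# Share of opposition crosses that the goalkeeper caught
toy$Rate_of_Crosses_Caught <- round(toy$Opp_Crosses_Caught / toy$Opp_Crosses_Attempted, 2)

# Goals conceded per ninety minutes played
toy$Goals_per_ninety <- round(toy$Goals_Conceded / toy$Minutes_per_ninety, 2)

print(toy)
```

Here the first goalkeeper catches 16 of 200 crosses (0.08) and concedes 30 goals over 20 nineties (1.5 per ninety). The second catches 9 of 150 crosses (0.06) and concedes 18 goals over 24 nineties (0.75 per ninety).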
With the data prepared, the correlogram was plotted. It showed little to no correlation between the number of goals conceded and any of the other factors examined: goals conceded from corner kicks, the passing accuracy of goal kicks, and the rate of crosses caught.
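The plotting code is not shown here, so the following is only a sketch of how such a correlation matrix could be computed and drawn. It assumes randomly generated stand-in data, and the optional plot assumes the `corrplot` package is installed. The column names mirror those used above, but the numbers are simulated and say nothing about the real result.

```r
# Simulated stand-in data (random values, NOT the real goalkeeper statistics)
set.seed(1)
toy <- data.frame(
  Goals_per_ninety = runif(20, 0.5, 2.5),
  CK_Conceded = rpois(20, 3),
  Rate_of_Passes_Completed_o_forty = runif(20, 0.2, 0.6),
  Rate_of_Crosses_Caught = runif(20, 0.02, 0.15)
)

# Pairwise Pearson correlations between all columns
corr_matrix <- round(cor(toy, use = "complete.obs"), 2)
print(corr_matrix)

# Optional: draw the correlogram if the corrplot package is available
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corr_matrix, method = "circle")
}
```

Values near 0 in the `Goals_per_ninety` row would correspond to the weak relationships described above, while values near 1 or -1 would indicate a strong one.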
Based on this analysis, it is therefore reasonable to conclude that a goalkeeper's other abilities (i.e., catching the ball from crosses and corners, and passing accuracy) have very little to do with the number of goals they concede. The goals conceded may be more indicative of the strength of both their teammates and the opposition.