library(rvest)
library(tidyverse)
library(lubridate)
library(deductive)
library(Hmisc)
library(outliers)
library(infotheo)
library(magrittr)
library(MVN)
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
The Relative Age Effect (RAE) was first researched in a sporting context by Barnsley, Thompson and Barnsley (1985). The RAE has been shown to have found that athletes with a greater relative age, those that are born in the first three months of the year, are more likely to be identified as talented athletes and be selected in sporting teams, because of their likely physical advantages, compared to athletes born in the later months of the year (Helsen, van Winckel & Williams, 2005).
There is limited published research into whether or not the RAE is present in Australian professional football (soccer), which makes the accurate pre-processing of publicly available data crucial for further analysis.
The data collected was scraped from the website www.Transfermarkt.com using each team’s unique teamcode to loop though the squad details and player statistics web pages to gather the required variables and merge them into a dataset for further pre-processing and analysis.
The merged data then had its data structure and variable types reviewed, with appropriate conversions made to factors and numeric variables.
Wickham and Grolemund’s (2016) three interrelated rules to make a tidy dataset were followed to separate the date of birth variable into new variables and the Goals and Assists variables were then mutated into new Per 90 minute variables.
The datasets missing values and obvious errors were identified and appropriately resolved using imputation methods and the numeric values were scanned for outliers.
Histogram, boxplot and scatterplots were used for univariate and bivariate analysis of outliers and appropriate capping and imputation methods were used to deal with outliers identified.
Lastly different variables were binned and normalised to allow comparative further analysis to take place across the dataset.
This dataset is a sample of Australian professional footbal (soccer) players who have appeared in the 2019/20 Season.
The data is based on the html data tables taken from the squad details and player statistics pages from each teams Transfermarkt.com page located at www.transfermarkt.com/a-league/startseite/wettbewerb/AUS1.
Although this dataset contains additional variables, only 16 are relevant for the purposes of this analysis. These variables are:
Team [Character]: Name of team
TeamCode [Character]: Uniquee Team ID number
Name [Character]: Name of player
Position [Factor]: Categorical variable with 16 factors (Goalkeeper, Right-Back, Centre-Back, Defender, Left-Back, Defensive Midfield, Central Midfield, Midfielder, Attacking Midfield, Second Striker, Right Winger, Right Midfield, Left Midfield, Left Winger, Centre-Forward, Forward)
Date of Birth [Date]: Date of birth of player
Height [Integer]: Numerical variable, height of player
Foot [Factor]: Categorical variable with 4 factors (left, both, right, unknown)
Age [Integer]: Numerical variable, age of player
Squad_Selection [Integer]: Numerical value, number of matchday squads selected in during season
Games_Played [Integer]: Numerical value, number of games played in season
Goals [Integer]: Numerical value, number of goals in season
Assists [Integer]: Numerical value, number of goals in season
Minutes_Played [Integer]: Numerical value, number of minutes played in season
The remaining variables, superfluous to this analysis, were dropped after reading the file.
teamcodes <- read_html("https://www.transfermarkt.com/a-league/startseite/wettbewerb/AUS1/plus/?saison_id=2019") %>%
html_nodes(".hide-for-pad .vereinprofil_tooltip") %>%
html_attr("href") %>%
data.frame() %>%
separate(.,.,into = c("blank","team","startseite","verein","teamcode","saison","year"), sep = "/") %>%
select(-1,-3,-4,-6,-7)
head(teamcodes, 3)
squadlist <- apply(teamcodes, 1, function(x){
teamcodes <- (x['teamcode'])
teamname <- (x['team'])
starturl <- "https://www.transfermarkt.com/sydney-fc/kader/verein/"
endurl <- "/plus/1/galerie/0?saison_id=2019"
theurl <- paste(starturl,teamcodes,endurl, sep = "")
squad <- read_html(theurl) %>%
html_nodes(.,"table") %>%
html_table(.,fill = TRUE) %>% .[2] %>%
data.frame() %>%
select(-1,-2,-3,-7,-8,-11,-12,-13,-14) %>%
drop_na() %>%
cbind(.,teamname) })
squad <- do.call(rbind.data.frame, squadlist)
names(squad) <- c("Name", "Position","DOB","Height","Foot","Team")
head(squad)
The same process and packages that were used to scrape the squad details were used for the player statsitics which come from a different url and provide a different dataset.
statslist <- apply(teamcodes, 1, function(x){
teamcodes <- (x['teamcode'])
teamname <- (x['team'])
starturl <- "https://www.transfermarkt.com/sydney-fc/leistungsdaten/verein/"
endurl <- "/plus/1?reldata=%262019"
theurl <- paste(starturl,teamcodes,endurl, sep = "")
stats <- read_html(theurl) %>%
html_nodes(.,"table") %>%
html_table(.,fill = TRUE) %>% .[2] %>%
data.frame() %>%
select(-1,-2,-3,-7,-12,-13,-14,-15,-16,-17,-19,-20,-21,-22,-23,-24,-25,-26,-27) %>%
drop_na() %>%
cbind(.,teamname) })
stats <- do.call(rbind.data.frame, statslist)
names(stats) <- c("Name", "Position","Age", "Squad_Selections", "Games_Played","Goals","Assists","Minutes_Played","Team")
head(stats, 3)
stats$Name %<>% str_replace_all(., "[:punct:]", "") %<>% str_trim()
squad$Name %<>% str_replace_all(., "[:punct:]", "") %<>% str_trim()
team_data <- stats %>% left_join(squad, by = c("Name","Position","Team"))
head(team_data, 3)
str(team_data)
## 'data.frame': 339 obs. of 12 variables:
## $ Name : chr "Tom GloverT Glover" "Dean BouzanisD Bouzanis" "Joe GauciJoe Gauci" "Jack HendryJ Hendry" ...
## $ Position : chr "Goalkeeper" "Goalkeeper" "Goalkeeper" "Centre-Back" ...
## $ Age : int 21 28 18 24 28 20 24 27 26 30 ...
## $ Squad_Selections: chr "25" "25" "11" "2" ...
## $ Games_Played : chr "15" "15" "Not used during this season" "2" ...
## $ Goals : chr "-" "-" "Not used during this season" "-" ...
## $ Assists : chr "-" "-" "Not used during this season" "-" ...
## $ Minutes_Played : chr "1.380'" "1.350'" "Not used during this season" "180'" ...
## $ Team : Factor w/ 11 levels "melbourne-heart-fc",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ DOB : chr "Dec 24, 1997 (21)" "Oct 2, 1990 (28)" "Jul 4, 2000 (18)" "May 7, 1995 (24)" ...
## $ Height : chr "1,90 m" "1,89 m" "m" "1,92 m" ...
## $ Foot : chr "-" "right" "-" "right" ...
attributes(team_data)
## $names
## [1] "Name" "Position" "Age" "Squad_Selections"
## [5] "Games_Played" "Goals" "Assists" "Minutes_Played"
## [9] "Team" "DOB" "Height" "Foot"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## [181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## [199] 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## [217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
## [235] 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252
## [253] 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270
## [271] 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## [289] 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306
## [307] 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324
## [325] 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339
##
## $class
## [1] "data.frame"
team_data$Position %<>% factor(
levels = c("Goalkeeper","Right-Back","Centre-Back","Defender","Left-Back","Defensive Midfield","Central Midfield","Midfielder","Attacking Midfield","Second Striker","Right Winger","Right Midfield","Left Midfield","Left Winger","Centre-Forward","Forward"),
labels = c("GK","DF","DF","DF","DF","MF","MF","MF","MF","MF","FW","FW","FW","FW","FW","FW"))
team_data$Foot %<>% factor(levels = c("left","both","right"))
team_data[,3:7] %<>% mutate_all(funs(str_replace_all(., "[:punct:]", "0")))
team_data[,8:12] %<>% mutate_all(funs(str_replace_all(., "[:punct:]", "")))
team_data[,3:8] <- sapply(team_data[,3:8],as.numeric)
team_data$Height %<>% str_replace_all(., "m","") %>% as.numeric()
team_data$DOB %<>% str_sub(., ,-4) %>% mdy()
str(team_data)
## 'data.frame': 339 obs. of 12 variables:
## $ Name : chr "Tom GloverT Glover" "Dean BouzanisD Bouzanis" "Joe GauciJoe Gauci" "Jack HendryJ Hendry" ...
## $ Position : Factor w/ 4 levels "GK","DF","MF",..: 1 1 1 2 2 2 2 2 2 2 ...
## $ Age : num 21 28 18 24 28 20 24 27 26 30 ...
## $ Squad_Selections: num 25 25 11 2 15 29 30 27 31 25 ...
## $ Games_Played : num 15 15 NA 2 9 23 24 26 28 24 ...
## $ Goals : num 0 0 NA 0 0 1 1 1 1 0 ...
## $ Assists : num 0 0 NA 0 1 0 1 0 1 2 ...
## $ Minutes_Played : num 1380 1350 NA 180 670 ...
## $ Team : chr "melbourneheartfc" "melbourneheartfc" "melbourneheartfc" "melbourneheartfc" ...
## $ DOB : Date, format: "1997-12-24" "1990-10-02" ...
## $ Height : num 190 189 NA 192 183 NA 176 193 187 178 ...
## $ Foot : chr NA "right" NA "right" ...
The team_data dataset does not comform to the tidy data principle as the DOB variable stores multiple variables that would be useful for analysing this data set.
# This is the R chunk for the Tidy & Manipulate Data I
team <- team_data %>% separate(DOB, into = c("BirthYear","BirthMonth","BirthDay"))
team[,11:13] <- sapply(team[,11:13],as.numeric)
head(team, 4)
After completing these steps we now meet the three interrelated rules which make a dataset tidy as itendified by Wickham and Grolemund (2016):
1. Each variable has its own column.
2. Each observation has its own row.
3. Each value has its own cell.
# This is the R chunk for the Tidy & Manipulate Data II
team %<>% mutate(., Goals_Per_90 = round((Goals / Minutes_Played) * 90,2))
team %<>% mutate(., Assists_Per_90 = round((Assists / Minutes_Played) * 90,2))
head(team, 4)
sapply(team, function(x) sum(is.na(x)))
## Name Position Age Squad_Selections
## 0 0 0 0
## Games_Played Goals Assists Minutes_Played
## 46 46 46 49
## Team BirthYear BirthMonth BirthDay
## 0 0 0 0
## Height Foot Goals_Per_90 Assists_Per_90
## 92 131 49 49
team[,c(5:8,15:16)] %<>% impute(., fun = 0)
team$Foot %<>% impute(., fun = "unknown")
team$Height %<>% impute(.,mean(team$Height, na.rm = TRUE))
sapply(team, function(x) sum(is.na(x)))
## Name Position Age Squad_Selections
## 0 0 0 0
## Games_Played Goals Assists Minutes_Played
## 0 0 0 0
## Team BirthYear BirthMonth BirthDay
## 0 0 0 0
## Height Foot Goals_Per_90 Assists_Per_90
## 0 0 0 0
The filter function frorm the {dplyr} package was used to find obvious errors in the Age and Height variables.
## Do the Age and Height fall between normal ranges
team %>% filter(., Age < 15 | Age > 45 | Height < 160 | Height > 210) %>% select(Name, Age, Height, Minutes_Played)
The height error found is likely a data entry or scraping error, so the which function from the {base} package is used to replace it with the mean.
team$Height[which(team$Height <50)] <- mean(team$Height, na.rm = TRUE)
The boxplot function from the {graphics} package was used to look at the univeriate distrbutions of all numeric variables to identify outliers.
par(mfrow=c(3,3))
team$Age %>% boxplot(xlab = "Age")
team$Squad_Selections %>% boxplot(xlab = "Squad Selections")
team$Games_Played %>% boxplot(xlab = "Games Played")
team$Height %>% as.numeric %>% boxplot(xlab = "Height")
team$Goals %>% boxplot(xlab = "Goals")
team$Assists %>% boxplot(xlab = "Assists")
team$Minutes_Played %>% boxplot(xlab = "Minutes Played")
team$Goals_Per_90 %>% boxplot(xlab = "Goals Per 90")
team$Assists_Per_90 %>% boxplot(xlab = "Assits Per 90")
The hist function from the {graphics} package was used to confirm a norrmal distrubtion across the Height variable.
The cap function created in the required packages section earlier is then used to cap the outliers below the lower limit to the 5th percentile and above the upper limit with the value of 95th percentile.
hist(team$Height, breaks = 10)
team$Height %<>% cap()
The boxplot function from the {graphics} package was used to look at the bivariate distributions of the Goals and Assists variables by Position.
par(mfrow=c(2,1))
boxplot(team$Goals ~ team$Position, main="Goals by Position", ylab = "Goals", xlab = "Position", horizontal = TRUE)
boxplot(team$Assists ~ team$Position, main="Assists by Position", ylab = "Assists", xlab = "Position", horizontal = TRUE)
The Positions data frame was created to provide a column including all the position variables.
The apply function from the {base} package was used to then loop though each position.
The cap function created in the required packages section earlier is then used to cap the outliers above the upper limit with the value of 95th percentile for each Positon.
team %<>% arrange(Position)
Positions <- data.frame(Pos = levels(team$Position))
GoalsCap <- apply(Positions, 1, function(x){
Pos <- (x['Pos'])
team %>% filter(Position == Pos) %>% select(Goals) %>% .$Goals %>% cap()})
team$Goals <- do.call(c, GoalsCap)
AssistCap <- apply(Positions, 1, function(x){
Pos <- (x['Pos'])
team %>% filter(Position == Pos) %>% select(Assists) %>% .$Assists %>% cap()})
team$Assists <- do.call(c, AssistCap)
The plot function from the {graphics} package was creat a scatter plot to look at the bivariate distributions of the Goals_Per_90 and Assists_Per_90 variables by Position.
team %>% plot(Goals_Per_90 ~ Minutes_Played, data = ., ylab="Goals Per 90", xlab="Minutes Played", main="Goals Per 90 by Minutes Played")
team %>% plot(Assists_Per_90 ~ Minutes_Played, data = ., ylab="Assists Per 90", xlab="Minutes Played", main="Assists Per 90 by Minutes Played")
Aftter observing an obvious error form eithr data entry or preprocessing, the mean function from the {base} package was used to impute the value with mean of both variables.
team$Assists_Per_90[which(team$Assists_Per_90 >2)] <- mean(team$Assists_Per_90, na.rm = TRUE)
team$Goals_Per_90[which(team$Goals_Per_90 >2)] <- mean(team$Goals_Per_90, na.rm = TRUE)
The same process for capping the outliers used for the Goals and Assists vairables by position is then used for the Per90 variables.
par(mfrow=c(2,1))
boxplot(team$Goals_Per_90 ~ team$Position, main="Goals Per 90 by Position", ylab = "Goals", xlab = "Position", horizontal = TRUE)
boxplot(team$Assists_Per_90 ~ team$Position, main="Assists Per 90 by Position", ylab = "Assists", xlab = "Position", horizontal = TRUE)
GoalsPer90Cap <- apply(Positions, 1, function(x){
Pos <- (x['Pos'])
team %>% filter(Position == Pos) %>% select(Goals_Per_90) %>% .$Goals_Per_90 %>% cap()})
team$Goals_Per_90 <- do.call(c, GoalsPer90Cap)
AssistPer90Cap <- apply(Positions, 1, function(x){
Pos <- (x['Pos'])
team %>% filter(Position == Pos) %>% select(Assists_Per_90) %>% .$Assists_Per_90 %>% cap()})
team$Assists_Per_90 <- do.call(c, AssistPer90Cap)
The discertize fuction from the {infotheo} package was used to discretise the numerical values.
BirthMonthQuartile <- discretize(team$BirthMonth, disc = "equalwidth", nbins=4)
BirthYearCycles <- discretize(team$BirthYear, disc = "equalwidth", nbins=3)
MinutesPlayedBins <- discretize(team$Minutes_Played, disc = "equalwidth", nbins=5)
names(BirthMonthQuartile) <- c("BirthMonthQuartile")
names(BirthYearCycles) <- c("BirthYearCycles")
names(MinutesPlayedBins) <- c("MinutesPlayedBins")
The Per90 variables were scaled to a fixed 0-1 range using the normalize function created earlier in the required packages section.
Goals_Per_90_Normalize <- data.frame(Goals_Per_90_Normalize = normalize(team$Goals_Per_90))
Assists_Per_90_Normalize <- data.frame(Assists_Per_90_Normalize = normalize(team$Goals_Per_90))
The new variables were then named and binded to the team data frame using the cbind function from the {base} package.
The final output is displayed.
team <- cbind(team, BirthMonthQuartile, BirthYearCycles, MinutesPlayedBins, Goals_Per_90_Normalize, Assists_Per_90_Normalize)
head(team)
Barnsley, R.H., Thompson A.H. and Barnsley P.E. (1985) Hockey success and birthdate: the relative age effect. Canadian Association for Health, Physical Education and Recreation 51(8), 23-80
Helsen, W.F., van Winckel, J, & Williams, A.M. (2005). The relative age effect in youth soccer across Europe. Journal of Sports Sciences, 23(6), 629–636. https://doi.org/10.1080/02640410400021310
Grolemund, Garrett, & Wickham, Hadley. (2016). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.