MATH2349 Data Wrangling

Required packages

library(rvest)
library(tidyverse)
library(lubridate)
library(deductive)
library(Hmisc)
library(outliers)
library(infotheo)
library(magrittr)
library(MVN)
cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}

Executive Summary

The Relative Age Effect (RAE) was first researched in a sporting context by Barnsley, Thompson and Barnsley (1985). The RAE has been shown to have found that athletes with a greater relative age, those that are born in the first three months of the year, are more likely to be identified as talented athletes and be selected in sporting teams, because of their likely physical advantages, compared to athletes born in the later months of the year (Helsen, van Winckel & Williams, 2005).

There is limited published research into whether or not the RAE is present in Australian professional football (soccer), which makes the accurate pre-processing of publicly available data crucial for further analysis.

The data collected was scraped from the website www.Transfermarkt.com using each team’s unique teamcode to loop though the squad details and player statistics web pages to gather the required variables and merge them into a dataset for further pre-processing and analysis.

The merged data then had its data structure and variable types reviewed, with appropriate conversions made to factors and numeric variables.

Wickham and Grolemund’s (2016) three interrelated rules to make a tidy dataset were followed to separate the date of birth variable into new variables and the Goals and Assists variables were then mutated into new Per 90 minute variables.

The datasets missing values and obvious errors were identified and appropriately resolved using imputation methods and the numeric values were scanned for outliers.

Histogram, boxplot and scatterplots were used for univariate and bivariate analysis of outliers and appropriate capping and imputation methods were used to deal with outliers identified.

Lastly different variables were binned and normalised to allow comparative further analysis to take place across the dataset.

Data

Data Description

This dataset is a sample of Australian professional footbal (soccer) players who have appeared in the 2019/20 Season.
The data is based on the html data tables taken from the squad details and player statistics pages from each teams Transfermarkt.com page located at www.transfermarkt.com/a-league/startseite/wettbewerb/AUS1.

Although this dataset contains additional variables, only 16 are relevant for the purposes of this analysis. These variables are:
Team [Character]: Name of team
TeamCode [Character]: Uniquee Team ID number
Name [Character]: Name of player
Position [Factor]: Categorical variable with 16 factors (Goalkeeper, Right-Back, Centre-Back, Defender, Left-Back, Defensive Midfield, Central Midfield, Midfielder, Attacking Midfield, Second Striker, Right Winger, Right Midfield, Left Midfield, Left Winger, Centre-Forward, Forward)
Date of Birth [Date]: Date of birth of player
Height [Integer]: Numerical variable, height of player
Foot [Factor]: Categorical variable with 4 factors (left, both, right, unknown)
Age [Integer]: Numerical variable, age of player
Squad_Selection [Integer]: Numerical value, number of matchday squads selected in during season
Games_Played [Integer]: Numerical value, number of games played in season
Goals [Integer]: Numerical value, number of goals in season
Assists [Integer]: Numerical value, number of goals in season
Minutes_Played [Integer]: Numerical value, number of minutes played in season

The remaining variables, superfluous to this analysis, were dropped after reading the file.

Scrape Team Codes

The read_html function from the {xml2} package was used to scrape the data.
The html_nodes function from the {rvest} package was used to extract the '.hide-for-pad .vereinprofil_tooltip' node set.
The html_attr function from the {rvest} package was used to extract the 'href' attributes which provides links to each team page.
The data.frame function from the {base} package was used to store the team links.
The separate function from the {tidyr} package was used to seperate the link into different columns in the data frame.
The select function from the {dplyr} package was used to remove the superfluous vaiables from the data frame.

teamcodes <- read_html("https://www.transfermarkt.com/a-league/startseite/wettbewerb/AUS1/plus/?saison_id=2019") %>%
  html_nodes(".hide-for-pad .vereinprofil_tooltip") %>%
  html_attr("href") %>% 
  data.frame() %>% 
  separate(.,.,into = c("blank","team","startseite","verein","teamcode","saison","year"), sep = "/") %>% 
  select(-1,-3,-4,-6,-7)
head(teamcodes, 3)

Scrape Squad Details

The apply function from the {base} package was used to loop through the teamcodes and teamname and scrape squad details for each team into a list.
The paste function form the {base} package was used to create the team's squad url by pasting the teamcode into the url.
The read_html function from the {xml2} package was used to scrape the squad details data.
The html_nodes function from the {rvest} package was used to extract the 'table' node set.
The html_table function from the {rvest} package was used to extract the 2nd table.
The data.frame function from the {base} package was used to store the squad data.
The select function from the {dplyr} package was used to remove the superfluous vaiables from the data frame.
The drop_na function from the {tidyr} package was used to remove rows containing missing values.
The cbind funciton from the {base} package was used to add the teamname to the data frame.
The do.call and rbind.data.frame funciton from the {base} package were used to bind each team's squad data into one data frame.
The names function from the {base} package was used to name each column in the squd data frame.
The final output of the 'squad' data frame was then displayed.

squadlist <- apply(teamcodes, 1, function(x){
  teamcodes <- (x['teamcode'])
  teamname <- (x['team'])
  
  starturl <- "https://www.transfermarkt.com/sydney-fc/kader/verein/"
  endurl <- "/plus/1/galerie/0?saison_id=2019"
  theurl <- paste(starturl,teamcodes,endurl, sep = "")
  squad <- read_html(theurl) %>%
    html_nodes(.,"table") %>%
    html_table(.,fill = TRUE)  %>% .[2] %>%
    data.frame() %>% 
    select(-1,-2,-3,-7,-8,-11,-12,-13,-14) %>%
    drop_na() %>%
    cbind(.,teamname) })
squad <- do.call(rbind.data.frame, squadlist)
names(squad) <- c("Name", "Position","DOB","Height","Foot","Team")
head(squad)

Scrape Player Statistics

The same process and packages that were used to scrape the squad details were used for the player statsitics which come from a different url and provide a different dataset.

statslist <- apply(teamcodes, 1, function(x){
  teamcodes <- (x['teamcode'])
  teamname <- (x['team'])
  starturl <- "https://www.transfermarkt.com/sydney-fc/leistungsdaten/verein/"
  endurl <- "/plus/1?reldata=%262019"
  theurl <- paste(starturl,teamcodes,endurl, sep = "")
  
stats <- read_html(theurl) %>%
         html_nodes(.,"table") %>%
         html_table(.,fill = TRUE)  %>% .[2] %>%
         data.frame() %>% 
         select(-1,-2,-3,-7,-12,-13,-14,-15,-16,-17,-19,-20,-21,-22,-23,-24,-25,-26,-27) %>%
         drop_na() %>%
         cbind(.,teamname) })

stats <- do.call(rbind.data.frame, statslist)
names(stats) <- c("Name", "Position","Age", "Squad_Selections", "Games_Played","Goals","Assists","Minutes_Played","Team")
head(stats, 3)

Merging Data Sets

The str_replace_all function from the {stringr} package was used to strip all punctuation and symbols from the Name variable.
The str_trim function from the {stringr} package was then used to remove whitespace from the Name variable.
The left_join function from the {dplyr} package was used as a mutating join to merge the variables from the stats and squad data frames.

stats$Name %<>% str_replace_all(., "[:punct:]", "") %<>% str_trim()
squad$Name %<>% str_replace_all(., "[:punct:]", "") %<>% str_trim()
team_data <- stats %>% left_join(squad, by = c("Name","Position","Team"))
head(team_data, 3)

Understand

Inspecting Data Structure

The str function from the {utils} package was used to check the data types of the 'team_data' data frame.
The attributes function from the {base} package was used to check the dimensions of the 'team_data' data frame.

str(team_data)

## 'data.frame':    339 obs. of  12 variables:
##  $ Name            : chr  "Tom GloverT Glover" "Dean BouzanisD Bouzanis" "Joe GauciJoe Gauci" "Jack HendryJ Hendry" ...
##  $ Position        : chr  "Goalkeeper" "Goalkeeper" "Goalkeeper" "Centre-Back" ...
##  $ Age             : int  21 28 18 24 28 20 24 27 26 30 ...
##  $ Squad_Selections: chr  "25" "25" "11" "2" ...
##  $ Games_Played    : chr  "15" "15" "Not used during this season" "2" ...
##  $ Goals           : chr  "-" "-" "Not used during this season" "-" ...
##  $ Assists         : chr  "-" "-" "Not used during this season" "-" ...
##  $ Minutes_Played  : chr  "1.380'" "1.350'" "Not used during this season" "180'" ...
##  $ Team            : Factor w/ 11 levels "melbourne-heart-fc",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ DOB             : chr  "Dec 24, 1997 (21)" "Oct 2, 1990 (28)" "Jul 4, 2000 (18)" "May 7, 1995 (24)" ...
##  $ Height          : chr  "1,90 m" "1,89 m" "m" "1,92 m" ...
##  $ Foot            : chr  "-" "right" "-" "right" ...

attributes(team_data)

## $names
##  [1] "Name"             "Position"         "Age"              "Squad_Selections"
##  [5] "Games_Played"     "Goals"            "Assists"          "Minutes_Played"  
##  [9] "Team"             "DOB"              "Height"           "Foot"            
## 
## $row.names
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## [181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## [199] 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## [217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
## [235] 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252
## [253] 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270
## [271] 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## [289] 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306
## [307] 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324
## [325] 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339
## 
## $class
## [1] "data.frame"

Applying Data Type Conversions

The factor function from the {base} package was used to coerce the Position variable into a factor, with labels and levels added.
The same function was then used to coerce the Foot variable into a factor.
The mutate_all and funs function from the {dplyr} package was used, along with the str_replace_all function from the {stringr} package, to replace the "-" used in the dataset with "0" for the Age, Squad_Selections, Games_Played, Goals and Assists variables.
The same functions were also then used to strip all punction and symbols from the Minutes_Played, Team, DOB, Height and Foot variables.
The sapply and as.numeric functions from the {base} package were used to coerce the Age, Squad_Selections, Games_Played, Goals, Assists and Mintues_Played variables into numeric data type.
The str_replace_all function from the {stringr} package was used to remove "m" in the height variable, and that variable was then coerceed into a numeric type using the as.numeric function.
The str_sub function from the {stringr} package was used to strip the last four characters of the DOB variable string (which displayed the age in brackets and was superfluous) and was then coerced into a Date type using the mdy function from the {lubridate} package.
The str function from the {utils} package was then used to check the data types of the 'team_data' data frame.

team_data$Position %<>% factor(
  levels = c("Goalkeeper","Right-Back","Centre-Back","Defender","Left-Back","Defensive Midfield","Central Midfield","Midfielder","Attacking Midfield","Second Striker","Right Winger","Right Midfield","Left Midfield","Left Winger","Centre-Forward","Forward"),
  labels = c("GK","DF","DF","DF","DF","MF","MF","MF","MF","MF","FW","FW","FW","FW","FW","FW"))
team_data$Foot %<>% factor(levels = c("left","both","right"))
team_data[,3:7] %<>% mutate_all(funs(str_replace_all(., "[:punct:]", "0")))
team_data[,8:12] %<>% mutate_all(funs(str_replace_all(., "[:punct:]", "")))
team_data[,3:8] <- sapply(team_data[,3:8],as.numeric)
team_data$Height %<>% str_replace_all(., "m","") %>% as.numeric()
team_data$DOB %<>% str_sub(., ,-4) %>% mdy()
str(team_data)

## 'data.frame':    339 obs. of  12 variables:
##  $ Name            : chr  "Tom GloverT Glover" "Dean BouzanisD Bouzanis" "Joe GauciJoe Gauci" "Jack HendryJ Hendry" ...
##  $ Position        : Factor w/ 4 levels "GK","DF","MF",..: 1 1 1 2 2 2 2 2 2 2 ...
##  $ Age             : num  21 28 18 24 28 20 24 27 26 30 ...
##  $ Squad_Selections: num  25 25 11 2 15 29 30 27 31 25 ...
##  $ Games_Played    : num  15 15 NA 2 9 23 24 26 28 24 ...
##  $ Goals           : num  0 0 NA 0 0 1 1 1 1 0 ...
##  $ Assists         : num  0 0 NA 0 1 0 1 0 1 2 ...
##  $ Minutes_Played  : num  1380 1350 NA 180 670 ...
##  $ Team            : chr  "melbourneheartfc" "melbourneheartfc" "melbourneheartfc" "melbourneheartfc" ...
##  $ DOB             : Date, format: "1997-12-24" "1990-10-02" ...
##  $ Height          : num  190 189 NA 192 183 NA 176 193 187 178 ...
##  $ Foot            : chr  NA "right" NA "right" ...

Tidy & Manipulate Data

Tidy Data Principles

The team_data dataset does not comform to the tidy data principle as the DOB variable stores multiple variables that would be useful for analysing this data set.

The separate function from the {tidyr} package was used to extract the DOB variable and store the the information into three new variables "BirthYear", "BirthMonth", "BirthDay".
The sapply and as.numeric functions from the {base} package were used to coerce the new variables into numeric type.

# This is the R chunk for the Tidy & Manipulate Data I 
team <- team_data %>% separate(DOB, into =  c("BirthYear","BirthMonth","BirthDay"))
team[,11:13] <- sapply(team[,11:13],as.numeric)
head(team, 4)

After completing these steps we now meet the three interrelated rules which make a dataset tidy as itendified by Wickham and Grolemund (2016):
1. Each variable has its own column.
2. Each observation has its own row.
3. Each value has its own cell.

Creating New Variables

The mutate function from the {dplyr} package was used to create a new variable "Goals_Per_90" by dividing the Goals by Minutes Played and then multiplying that by 90.
The same process was repeated to created a new "Assists_Per_90" vaiable from the Assists variable.

# This is the R chunk for the Tidy & Manipulate Data II 
team %<>% mutate(., Goals_Per_90 = round((Goals / Minutes_Played) * 90,2))
team %<>% mutate(., Assists_Per_90 = round((Assists / Minutes_Played) * 90,2))
head(team, 4)

Scan I

Finding Missing Values

The sapply, colSums and is.na functions from the {base} package were used to provide a count of how many rows had missing values (NAs).

sapply(team, function(x) sum(is.na(x)))

##             Name         Position              Age Squad_Selections 
##                0                0                0                0 
##     Games_Played            Goals          Assists   Minutes_Played 
##               46               46               46               49 
##             Team        BirthYear       BirthMonth         BirthDay 
##                0                0                0                0 
##           Height             Foot     Goals_Per_90   Assists_Per_90 
##               92              131               49               49

The impute function from the {Hmisc} package was used to fill in the NA values forr the numerical varirables with 0.
The impute function was then used to fill in the NA values for the Foot variable with "unknown" and the Height variable with the mean.

team[,c(5:8,15:16)] %<>% impute(., fun = 0)
team$Foot %<>% impute(., fun = "unknown")
team$Height %<>%  impute(.,mean(team$Height, na.rm = TRUE))
sapply(team, function(x) sum(is.na(x)))

##             Name         Position              Age Squad_Selections 
##                0                0                0                0 
##     Games_Played            Goals          Assists   Minutes_Played 
##                0                0                0                0 
##             Team        BirthYear       BirthMonth         BirthDay 
##                0                0                0                0 
##           Height             Foot     Goals_Per_90   Assists_Per_90 
##                0                0                0                0

Finding Obvious Errors

The filter function frorm the {dplyr} package was used to find obvious errors in the Age and Height variables.

## Do the Age and Height fall between normal ranges
team %>% filter(., Age < 15 | Age > 45 | Height < 160 | Height > 210) %>% select(Name, Age, Height, Minutes_Played)

The height error found is likely a data entry or scraping error, so the which function from the {base} package is used to replace it with the mean.

team$Height[which(team$Height <50)] <- mean(team$Height, na.rm = TRUE)

Scan II

Scanning for Outliers

The boxplot function from the {graphics} package was used to look at the univeriate distrbutions of all numeric variables to identify outliers.

par(mfrow=c(3,3))
team$Age %>% boxplot(xlab = "Age")
team$Squad_Selections %>% boxplot(xlab = "Squad Selections")
team$Games_Played %>% boxplot(xlab = "Games Played")
team$Height %>% as.numeric %>% boxplot(xlab = "Height")
team$Goals %>% boxplot(xlab = "Goals")
team$Assists %>% boxplot(xlab = "Assists")
team$Minutes_Played %>% boxplot(xlab = "Minutes Played")
team$Goals_Per_90 %>% boxplot(xlab = "Goals Per 90")
team$Assists_Per_90 %>% boxplot(xlab = "Assits Per 90")

Checking Normal Distribution

The hist function from the {graphics} package was used to confirm a norrmal distrubtion across the Height variable.
The cap function created in the required packages section earlier is then used to cap the outliers below the lower limit to the 5th percentile and above the upper limit with the value of 95th percentile.

hist(team$Height, breaks = 10)

team$Height %<>% cap()

Outlier Detection - Box Plots

The boxplot function from the {graphics} package was used to look at the bivariate distributions of the Goals and Assists variables by Position.

par(mfrow=c(2,1))
boxplot(team$Goals ~ team$Position, main="Goals by Position", ylab = "Goals", xlab = "Position", horizontal = TRUE)
boxplot(team$Assists ~ team$Position, main="Assists by Position", ylab = "Assists", xlab = "Position", horizontal = TRUE)

The Positions data frame was created to provide a column including all the position variables.
The apply function from the {base} package was used to then loop though each position.
The cap function created in the required packages section earlier is then used to cap the outliers above the upper limit with the value of 95th percentile for each Positon.

team %<>% arrange(Position)
Positions <- data.frame(Pos = levels(team$Position))

GoalsCap <- apply(Positions, 1, function(x){
  Pos <- (x['Pos'])
team %>% filter(Position == Pos) %>% select(Goals) %>% .$Goals %>% cap()})
team$Goals <- do.call(c, GoalsCap)

AssistCap <- apply(Positions, 1, function(x){
  Pos <- (x['Pos'])
team %>% filter(Position == Pos) %>% select(Assists) %>% .$Assists %>% cap()})
team$Assists <- do.call(c, AssistCap)

Outlier Detection - Scatter Plots

The plot function from the {graphics} package was creat a scatter plot to look at the bivariate distributions of the Goals_Per_90 and Assists_Per_90 variables by Position.

team %>% plot(Goals_Per_90 ~ Minutes_Played, data = ., ylab="Goals Per 90", xlab="Minutes Played", main="Goals Per 90 by Minutes Played")

team %>% plot(Assists_Per_90 ~ Minutes_Played, data = ., ylab="Assists Per 90", xlab="Minutes Played", main="Assists Per 90 by Minutes Played")

Aftter observing an obvious error form eithr data entry or preprocessing, the mean function from the {base} package was used to impute the value with mean of both variables.

team$Assists_Per_90[which(team$Assists_Per_90 >2)] <- mean(team$Assists_Per_90, na.rm = TRUE)
team$Goals_Per_90[which(team$Goals_Per_90 >2)] <- mean(team$Goals_Per_90, na.rm = TRUE)

The same process for capping the outliers used for the Goals and Assists vairables by position is then used for the Per90 variables.

par(mfrow=c(2,1))
boxplot(team$Goals_Per_90 ~ team$Position, main="Goals Per 90 by Position", ylab = "Goals", xlab = "Position", horizontal = TRUE)
boxplot(team$Assists_Per_90 ~ team$Position, main="Assists Per 90 by Position", ylab = "Assists", xlab = "Position", horizontal = TRUE)

GoalsPer90Cap <- apply(Positions, 1, function(x){
  Pos <- (x['Pos'])
team %>% filter(Position == Pos) %>% select(Goals_Per_90) %>% .$Goals_Per_90 %>% cap()})
team$Goals_Per_90 <- do.call(c, GoalsPer90Cap)

AssistPer90Cap <- apply(Positions, 1, function(x){
  Pos <- (x['Pos'])
team %>% filter(Position == Pos) %>% select(Assists_Per_90) %>% .$Assists_Per_90 %>% cap()})
team$Assists_Per_90 <- do.call(c, AssistPer90Cap)

Transform

Binning

The discertize fuction from the {infotheo} package was used to discretise the numerical values.

BirthMonthQuartile <- discretize(team$BirthMonth, disc = "equalwidth", nbins=4)
BirthYearCycles <- discretize(team$BirthYear, disc = "equalwidth", nbins=3)
MinutesPlayedBins <- discretize(team$Minutes_Played, disc = "equalwidth", nbins=5)

names(BirthMonthQuartile) <- c("BirthMonthQuartile")
names(BirthYearCycles) <- c("BirthYearCycles")
names(MinutesPlayedBins) <- c("MinutesPlayedBins")

Min - Max Normalisation

The Per90 variables were scaled to a fixed 0-1 range using the normalize function created earlier in the required packages section.

Goals_Per_90_Normalize <- data.frame(Goals_Per_90_Normalize = normalize(team$Goals_Per_90))
Assists_Per_90_Normalize <- data.frame(Assists_Per_90_Normalize = normalize(team$Goals_Per_90))

Final Output

The new variables were then named and binded to the team data frame using the cbind function from the {base} package.
The final output is displayed.

team <- cbind(team, BirthMonthQuartile, BirthYearCycles, MinutesPlayedBins, Goals_Per_90_Normalize, Assists_Per_90_Normalize)
head(team)

References

Barnsley, R.H., Thompson A.H. and Barnsley P.E. (1985) Hockey success and birthdate: the relative age effect. Canadian Association for Health, Physical Education and Recreation 51(8), 23-80
Helsen, W.F., van Winckel, J, & Williams, A.M. (2005). The relative age effect in youth soccer across Europe. Journal of Sports Sciences, 23(6), 629–636. https://doi.org/10.1080/02640410400021310
Grolemund, Garrett, & Wickham, Hadley. (2016). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.