Summary

Clustering analysis of average performance scores can be used to identify players with any desired proportions of offensive and defensive skills.

Reading and preparing data

library(RSQLite)
## Loading required package: DBI
#connect to database
con <- dbConnect(SQLite(), dbname="database.sqlite")

# list all tables
dbListTables(con)
## [1] "Country"         "League"          "Match"           "Player"         
## [5] "Player_Stats"    "Team"            "sqlite_sequence"
# import tables of interest
player=dbGetQuery(con,"SELECT * FROM player")
player_stats=dbGetQuery(con,"SELECT * FROM Player_Stats")

#merge tables
merged_table=merge(player,player_stats,by = "player_api_id")

#remove NA
merged_table=na.omit(merged_table)

names(merged_table)
##  [1] "player_api_id"        "id.x"                 "player_name"         
##  [4] "player_fifa_api_id.x" "birthday"             "height"              
##  [7] "weight"               "id.y"                 "player_fifa_api_id.y"
## [10] "date_stat"            "overall_rating"       "potential"           
## [13] "preferred_foot"       "attacking_work_rate"  "defensive_work_rate" 
## [16] "crossing"             "finishing"            "heading_accuracy"    
## [19] "short_passing"        "volleys"              "dribbling"           
## [22] "curve"                "free_kick_accuracy"   "long_passing"        
## [25] "ball_control"         "acceleration"         "sprint_speed"        
## [28] "agility"              "reactions"            "balance"             
## [31] "shot_power"           "jumping"              "stamina"             
## [34] "strength"             "long_shots"           "aggression"          
## [37] "interceptions"        "positioning"          "vision"              
## [40] "penalties"            "marking"              "standing_tackle"     
## [43] "sliding_tackle"       "gk_diving"            "gk_handling"         
## [46] "gk_kicking"           "gk_positioning"       "gk_reflexes"
#retain columns of interest
stats=merged_table[,-c(1,2,4,5,6,7,8,9,10,13,14,15)]
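
Dropping columns by position is fragile if the table layout ever changes. Here is a minimal sketch of the same selection done by name, run on a small made-up data frame. The `toy` table and its values are invented for illustration; only the column names are taken from the output above.

```r
# Toy stand-in for merged_table; all values are made up for illustration.
toy <- data.frame(
  player_api_id  = c(1, 2),
  player_name    = c("Player A", "Player B"),
  birthday       = c("1990-01-01", "1992-05-05"),
  overall_rating = c(70, 65),
  potential      = c(75, 72),
  preferred_foot = c("left", "right"),
  crossing       = c(60, 40),
  gk_diving      = c(10, 12)
)

# Identifier and categorical columns to drop, listed by name.
drop_cols <- c("player_api_id", "id.x", "player_fifa_api_id.x", "birthday",
               "height", "weight", "id.y", "player_fifa_api_id.y",
               "date_stat", "preferred_foot", "attacking_work_rate",
               "defensive_work_rate")

# setdiff() tolerates names that are absent from the table.
toy_stats <- toy[, setdiff(names(toy), drop_cols)]
names(toy_stats)
```

The names in `drop_cols` correspond to the positions removed above, so the retained columns should be the same: the player name followed by the numeric scores.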

Reducing dimensionality

I create vectors of average scores for each player. I then reduce the dimensionality with Principal Component Analysis.
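
As a toy illustration of these two steps, the sketch below averages repeated records per player and then checks how much variance each principal component captures. The data are simulated and the player labels are hypothetical; it is only meant to show the mechanics, not to reproduce the analysis.

```r
set.seed(1)

# Made-up repeated records: 6 players, 5 records each, 3 skills.
toy <- data.frame(
  player_name = rep(paste0("P", 1:6), each = 5),
  finishing   = rnorm(30, mean = rep(c(70, 68, 40, 42, 15, 14), each = 5), sd = 3),
  marking     = rnorm(30, mean = rep(c(30, 35, 75, 72, 12, 13), each = 5), sd = 3),
  gk_diving   = rnorm(30, mean = rep(c(10, 12, 11,  9, 78, 80), each = 5), sd = 3)
)

# One row of mean scores per player.
toy_avg <- aggregate(. ~ player_name, data = toy, FUN = mean)
rownames(toy_avg) <- toy_avg$player_name
toy_avg$player_name <- NULL

# PCA on standardized columns; summary() reports variance explained per PC.
toy_pca <- prcomp(toy_avg, scale. = TRUE)
summary(toy_pca)$importance["Proportion of Variance", ]
```

Looking at the proportion of variance is a quick way to judge whether keeping only the first two components is a reasonable simplification.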

source("setPowerPointStyle.R")
setPowerPointStyle()

#To get more robust stats, I remove players whose scores are available
#for <10 matches
bad_stats=names(which(table(stats$player_name)<10))

#logical indexing stays correct even when bad_stats is empty
stats=stats[!(stats$player_name %in% bad_stats), ]

#compute average scores for each player with >=10 matches 
avg_scores=apply(stats[,2:ncol(stats)],2,function(x)
  tapply(x,stats$player_name,mean))

#principal component analysis
avg_scores_pca=prcomp(avg_scores,scale. = TRUE)

#using a smooth scatter because plotting every single point would
#hide the global picture
smoothScatter(avg_scores_pca$x[,1:2],nrpoints = 0, 
              colramp = colorRampPalette(c("white", "gray5")))

The previous scatter reveals three major clusters. To define these clusters precisely, I apply a Gaussian mixture model with three components.
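
The number of components here is chosen by eye. A possible cross-check, sketched below on simulated data, is to let `Mclust` compare several component counts by BIC. The `toy` points are made up, and this is not part of the original analysis.

```r
library(mclust)
set.seed(42)

# Simulated 2-D points from three well-separated groups (made-up data).
toy <- rbind(
  cbind(rnorm(100, -4), rnorm(100, 0)),
  cbind(rnorm(100,  4), rnorm(100, 0)),
  cbind(rnorm(100,  0), rnorm(100, 6))
)
colnames(toy) <- c("PC1", "PC2")

# Fit mixtures with 1 to 5 components; Mclust keeps the best model by BIC.
fit <- Mclust(toy, G = 1:5)
fit$G          # number of components selected
fit$modelName  # covariance structure selected
```

Running the same call on the first two principal components would show whether BIC agrees with the visual impression of three clusters.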

Player segmentation using Gaussian Mixture Model

source("setPowerPointStyle.R")
setPowerPointStyle()

require(mclust)
## Loading required package: mclust
## Package 'mclust' version 5.1
## Type 'citation("mclust")' for citing this R package in publications.
xyMclust <- Mclust(data.frame(avg_scores_pca$x[,1],avg_scores_pca$x[,2]),G = 3)

plot(xyMclust,what = "uncertainty",xlab="PC1",ylab = "PC2")

player_class=xyMclust$classification

#number of players in each cluster
table(player_class)
## player_class
##    1    2    3 
## 3876 3154  675

The three clusters correspond to different roles in the team

To make sense of the three clusters, let’s first have a look at which players they contain.

source("setPowerPointStyle.R")
setPowerPointStyle()

#examples of players from cluster 1
names(player_class[player_class==1][1:5])
## [1] "Aaron Cresswell" "Aaron Galindo"   "Aaron Hughes"    "Aaron Meijers"  
## [5] "Aaron Ramsey"
#examples of players from cluster 2
names(player_class[player_class==2][1:5])
## [1] "Aaron Doran"  "Aaron Hunt"   "Aaron Lennon" "Aaron Mooy"  
## [5] "Aaron Niguez"
#examples of players from cluster 3
names(player_class[player_class==3][1:5])
## [1] "Abdoulaye Diallo" "Achille Coser"    "Adam Bogdan"     
## [4] "Adam Collin"      "Adam Federici"

I am not a soccer expert. Sad but true, I enjoy this analysis much more than an average match ;) Anyway, a Google search helped me figure out that these names point to specific player roles. We have: cluster 1 -> defensive players; cluster 2 -> offensive players; cluster 3 -> goal keepers.

player_class[which(player_class==1)]="defensive"
player_class[which(player_class==2)]="offensive"
player_class[which(player_class==3)]="goal_keeper"
player_class=factor(player_class, levels = c("defensive", "offensive", "goal_keeper"))

To cross-check the results, I looked for variables that are specific to each group of players. Here I show one illustrative example for each group.

source("setPowerPointStyle.R")
setPowerPointStyle()

#defensive
k=24
boxplot(avg_scores[,k]~player_class,main=colnames(avg_scores)[k])

#offensive
k=4
boxplot(avg_scores[,k]~player_class,main=colnames(avg_scores)[k])

#goal keeper 
k=31
boxplot(avg_scores[,k]~player_class,main=colnames(avg_scores)[k])

So we can see that these variables correlate well with the corresponding roles: interceptions for defensive players, finishing for offensive players, and diving for goal keepers. One disturbing thing: some defensive and offensive players still have relatively high average scores in goal-keeper skills such as diving. This may be due to misclassification, but it also raises some doubts about these scores…
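
The three variables above were picked by hand. A rough sketch of how such variables could be found automatically is given below, using simulated scores. The data, the `gap` measure, and the variable choice are assumptions made for illustration, not the method used in this analysis.

```r
set.seed(2)

# Made-up average scores for 30 players in three known groups.
grp <- factor(rep(c("defensive", "offensive", "goal_keeper"), each = 10),
              levels = c("defensive", "offensive", "goal_keeper"))
toy_scores <- cbind(
  interceptions = rnorm(30, mean = c(70, 35, 15)[grp], sd = 5),
  finishing     = rnorm(30, mean = c(30, 70, 12)[grp], sd = 5),
  gk_diving     = rnorm(30, mean = c(10, 10, 75)[grp], sd = 5),
  stamina       = rnorm(30, mean = 60, sd = 5)   # similar in all groups
)

# Mean of each variable within each group (rows: groups, cols: variables).
group_means <- apply(toy_scores, 2, function(x) tapply(x, grp, mean))

# For every group: how far its mean lies above the mean of the other two
# groups, in units of the variable's overall standard deviation.
gap <- sapply(levels(grp), function(g) {
  others <- colMeans(group_means[rownames(group_means) != g, , drop = FALSE])
  (group_means[g, ] - others) / apply(toy_scores, 2, sd)
})

# Most group-specific variable for each group.
best <- apply(gap, 2, function(x) names(which.max(x)))
best
```

On the real table, `toy_scores` and `grp` would be replaced by `avg_scores` and `player_class`. Selecting columns by name, for example `avg_scores[,"interceptions"]`, also avoids hard-coded indices in the boxplots.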

Players with markedly defined roles

One cool thing about soft assignments (such as those provided by Gaussian mixture models) is that we can find players with any desired proportions of defensive vs. offensive characteristics. First, let’s identify players with well-defined roles, that is, players with a very high probability of belonging to the group of defensive players, offensive players, or goal keepers.
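
To make "soft assignment" concrete, here is a minimal one-dimensional sketch of how membership probabilities follow from a fitted mixture via Bayes' rule. The two components and all their parameters are invented for illustration; the actual model is two-dimensional and fitted by `Mclust`.

```r
# Hypothetical 1-D mixture of two roles; all parameters are made up.
weights <- c(defensive = 0.5, offensive = 0.5)   # mixing proportions
means   <- c(defensive = -2,  offensive = 2)     # component means
sds     <- c(defensive = 1,   offensive = 1)     # component std. deviations

# Posterior membership probabilities for a score x (Bayes' rule):
# weighted component density, normalized over all components.
membership <- function(x) {
  dens <- weights * dnorm(x, mean = means, sd = sds)
  dens / sum(dens)
}

membership(-2)  # at the defensive mean: almost surely defensive
membership(0)   # halfway between the two means: 50% / 50%
```

The matrix `xyMclust$z` used below holds the same kind of probabilities, one row per player and one column per cluster.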

source("setPowerPointStyle.R")
setPowerPointStyle()

#cluster label and membership probabilities for each player
tot=data.frame(player_class,xyMclust$z[,1:3])
colnames(tot)[2:4]=c("prob_defensive","prob_offensive","prob_goal keeper")

defensive=tot[which(tot[,1]=="defensive"),]
offensive=tot[which(tot[,1]=="offensive"),]

#very defensive
head(defensive[order(defensive[,2],decreasing = TRUE),])
##                     player_class prob_defensive prob_offensive
## Kamil Glik             defensive      1.0000000   1.790649e-08
## Oswaldo Vizcarrondo    defensive      1.0000000   1.918247e-08
## Antonio Amaya          defensive      1.0000000   3.665562e-08
## Matheus Doria          defensive      1.0000000   4.787919e-08
## Richard Dunne          defensive      0.9999999   6.422006e-08
## Michael Dawson         defensive      0.9999999   6.882468e-08
##                     prob_goal keeper
## Kamil Glik              4.872341e-71
## Oswaldo Vizcarrondo     3.809482e-72
## Antonio Amaya           3.704699e-66
## Matheus Doria           5.026979e-70
## Richard Dunne           5.786958e-67
## Michael Dawson          1.762635e-75
#very offensive
head(offensive[order(offensive[,3],decreasing = TRUE),])
##                 player_class prob_defensive prob_offensive
## Zakaria Bakkali    offensive   6.355754e-11              1
## Marco Rojas        offensive   1.204555e-10              1
## Corentin Jean      offensive   5.382365e-10              1
## Ryo Miyaichi       offensive   1.105527e-09              1
## Genaro Snijders    offensive   1.596754e-09              1
## Adnan Januzaj      offensive   1.649873e-09              1
##                 prob_goal keeper
## Zakaria Bakkali     2.135584e-55
## Marco Rojas         6.141453e-43
## Corentin Jean       2.574687e-40
## Ryo Miyaichi        2.989940e-42
## Genaro Snijders     2.968936e-32
## Adnan Januzaj       1.213090e-48

I do not know enough about soccer to critically evaluate these results…

Borderline players

Now, instead, let’s look for “borderline players”, who turn out to be, for example, about 50% defensive and 50% offensive.

head(defensive[order(defensive[,2]),])
##                   player_class prob_defensive prob_offensive
## Tripy Makonda        defensive      0.5001102      0.4998898
## Morgan Amalfitano    defensive      0.5040046      0.4959954
## Wesley               defensive      0.5045928      0.4954072
## David Wilson         defensive      0.5051634      0.4948366
## Bernd Nehrig         defensive      0.5052580      0.4947420
## Pawel Sasin          defensive      0.5056551      0.4943449
##                   prob_goal keeper
## Tripy Makonda         1.279985e-46
## Morgan Amalfitano     2.741544e-59
## Wesley                4.803103e-60
## David Wilson          1.306375e-24
## Bernd Nehrig          1.289163e-47
## Pawel Sasin           1.064414e-34

By tuning the desired balance between defensive and offensive skills, a coach would have an objective criterion for selecting players.
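
As a sketch of how such a selection could work for an arbitrary mix, the helper below ranks players by how close their defensive probability is to a target value. The function `closest_to_mix` and the `toy_tot` table are hypothetical; in practice the `tot` table built above would take the place of the toy data.

```r
# Made-up membership probabilities for five hypothetical players.
toy_tot <- data.frame(
  prob_defensive = c(0.98, 0.72, 0.51, 0.33, 0.05),
  prob_offensive = c(0.02, 0.28, 0.49, 0.67, 0.95),
  row.names = c("Player A", "Player B", "Player C", "Player D", "Player E")
)

# Rank players by how close their defensive probability is to a target.
closest_to_mix <- function(tab, target_defensive, n = 3) {
  dist <- abs(tab$prob_defensive - target_defensive)
  head(tab[order(dist), , drop = FALSE], n)
}

# Two players closest to a 70% defensive / 30% offensive profile.
closest_to_mix(toy_tot, target_defensive = 0.7, n = 2)
```

With the real data it would probably make sense to first exclude goal keepers, since their probabilities for the two outfield roles are both close to zero.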