Abstract

Sports analytics has been used effectively in sports such as baseball and basketball, but research applying statistical learning techniques to soccer is limited. It is worth investigating whether these techniques can produce better and more insightful results in soccer analytics. In this paper, our aim is to perform descriptive as well as predictive analysis of soccer player performance. In soccer, it is common to rely on ratings by specialists to assess a player's performance; however, the specialists do not reveal the measures behind their ratings. We try to identify the most important attributes of player performance that determine the overall ratings, and in this way uncover the inherent knowledge the specialists use when assigning ratings to players.

A series of supervised classification and regression techniques was performed, including classification and regression decision trees, random forests, bagging, and boosting. In addition, the unsupervised techniques PCA and K-means were applied, along with a neural network evaluated with 10-fold cross-validation.

Research Question

“Does the dataset tell us which set of skills/attributes determines the overall performance rating of a soccer player, so that we can decide whether or not to buy that player?”

Research Hypothesis

“Soccer players with a certain set of skills/attributes are more likely to have a higher performance rating than other players.”

The Dataset

The soccer database consists of seven datasets with 183 attributes and 200,000 observations in total. For this project we used the Player dataset, with 11,060 player observations and 7 attributes, and the Player_Attributes dataset, with 183,978 observations and 42 attributes. The response variable determines whether or not a player is considered a high-rating player. Player and team attributes are sourced from EA Sports' FIFA video game series (http://sofifa.com/); the FIFA series and all FIFA assets are property of EA Sports.

Decision Tree (Classification)

require(tree)
## Loading required package: tree
require(ISLR)
## Loading required package: ISLR
player_com <- read.csv("Player_Attrib.csv")
PureDat <- na.omit(player_com)   # drop rows with missing values
# Label a player "Buy" when the overall rating exceeds 60 (numeric comparison)
Rating <- factor(ifelse(PureDat$overall_rating > 60, "Buy", "NoBuy"))
playerRate <- data.frame(PureDat, Rating)

Dividing the data into 50% training and 50% testing:

playerRate <- playerRate[,-6]   # drop the raw rating column used to build the label
set.seed(2)
observations <- dim(playerRate)[1]
train <- sample(1:observations, 0.5*observations)   # 50% of rows for training
test <- -train
trainData <- playerRate[train,]
testData <- playerRate[test,]
testing_outcome <- Rating[test]

TreeModel <- tree(Rating ~ ., trainData)

Plot of Classification Decision Tree:
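A minimal sketch of the plotting call, matching the one used for the regression tree below:

plot(TreeModel)                 # draw the fitted classification tree
text(TreeModel, pretty = 0)     # label the splits with attribute names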

Running the prediction model on the test data:

Tree_predict <- predict(TreeModel, testData, type = "class")
Mis.Ctree <- mean(Tree_predict != testing_outcome)
# Test misclassification rate of the classification tree:
Mis.Ctree
## [1] 0.06421815
Pruning the tree and testing its performance on the test dataset

set.seed(100)
cv_tree = cv.tree(TreeModel, FUN = prune.misclass)
cv_tree
## $size
## [1] 12  8  7  6  1
## 
## $dev
## [1]  5776  5776  5777  6348 10535
## 
## $k
## [1]  -Inf   0.0  10.0 569.0 833.4
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
plot(cv_tree$size,cv_tree$dev, type="b", xlab= 'Tree Size', ylab= '')

# We choose a tree size of 7 according to the elbow point in the graph
pruned_model <- prune.tree(TreeModel, best = 7)
plot(pruned_model)
text(pruned_model, pretty = 0)

Tree_predict.pruned <- predict(pruned_model, testData, type = "class")
Mis.Ptree <- mean(Tree_predict.pruned != testing_outcome)
# Misclassification rate of the classification tree after pruning:
Mis.Ptree
## [1] 0.08414562
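
A confusion matrix breaks the error down by class; a small addition not in the original output:

# Cross-tabulate predicted vs. actual labels on the test set
table(Predicted = Tree_predict.pruned, Actual = testing_outcome)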

Decision Tree (Regression)

player_com <- read.csv("Player_Attrib.csv")
PureDat <- na.omit(player_com)
set.seed(2)
observations <- dim(PureDat)[1]
train <- sample(1:observations, 0.5*observations)   # 50/50 split again
test <- -train
trainData <- PureDat[train,]
testData <- PureDat[test,]
testing_outcome <- PureDat$overall_rating[test]     # numeric response this time
TreeModel <- tree(overall_rating ~ ., trainData)
plot(TreeModel)
text(TreeModel, pretty = 0)

Tree_predict <- predict(TreeModel, testData)
MSE.Regtree <- mean((Tree_predict - testing_outcome)^2)
# Test MSE of the regression tree:
MSE.Regtree
## [1] 14.29938

Pruning the tree and testing its performance on the test dataset

set.seed(10)
cv_tree <- cv.tree(TreeModel, K = 10)   # 10-fold cross-validation
plot(cv_tree$size, cv_tree$dev, type = "b", xlab = 'Tree Size', ylab = '')

We choose a tree size of 7 according to the elbow point in the graph.

pruned_model <- prune.tree(TreeModel, best = 7)
plot(pruned_model)
text(pruned_model, pretty = 0)

Tree_predict.pruned <- predict(pruned_model, testData)
MSE.RegTreePrun <- mean((Tree_predict.pruned - testing_outcome)^2)
# Test MSE of the pruned regression tree:
MSE.RegTreePrun
## [1] 16.93663

Bagging, Random Forests, Boosting

Bagging

library(randomForest)
# Bagging is a random forest that considers all 41 predictors at every split
bag.player <- randomForest(overall_rating ~ ., data = trainData, mtry = 41, importance = TRUE)
bag.player
## 
## Call:
##  randomForest(formula = overall_rating ~ ., data = trainData,      mtry = 41, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 41
## 
##           Mean of squared residuals: 2.987862
##                     % Var explained: 93.8
summary(bag.player)
##                 Length Class  Mode     
## call               5   -none- call     
## type               1   -none- character
## predicted       9017   -none- numeric  
## mse              500   -none- numeric  
## rsq              500   -none- numeric  
## oob.times       9017   -none- numeric  
## importance        82   -none- numeric  
## importanceSD      41   -none- numeric  
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            11   -none- list     
## coefs              0   -none- NULL     
## y               9017   -none- numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call
predict.bagging <- predict(bag.player, newdata = testData)
MSE.bagging <- mean((predict.bagging - testing_outcome)^2)   # test MSE
MSE.bagging
## [1] 3.064333
plot(predict.bagging, testing_outcome)
abline(0, 1, col = "red")   # 45-degree line: perfect predictions fall on it
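
Since importance = TRUE was set, the fitted object can also rank the attributes; a small addition, output omitted:

importance(bag.player)              # %IncMSE and IncNodePurity per attribute
varImpPlot(bag.player, n.var = 10)  # plot the ten most important attributes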

Random Forests

set.seed (1)
rf.player <- randomForest(overall_rating ~ ., trainData, mtry = 41, importance = TRUE)
rf.player
## 
## Call:
##  randomForest(formula = overall_rating ~ ., data = trainData,      mtry = 41, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 41
## 
##           Mean of squared residuals: 2.975786
##                     % Var explained: 93.82
predict.rf <- predict(rf.player, testData)
plot(predict.rf, testing_outcome)
abline(0, 1, col = "red")

MSErf <- mean((predict.rf - testing_outcome)^2)
MSErf
## [1] 3.066119
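
Note that with mtry = 41 every predictor is considered at every split, so this fit is effectively the same as bagging. A conventional random forest for regression decorrelates its trees by sampling roughly p/3 predictors per split; a sketch of that variant (rf.player13 is a name introduced here, and its MSE is not part of the results above):

set.seed(1)
# mtry = 13 is roughly 41/3, the usual default for regression forests
rf.player13 <- randomForest(overall_rating ~ ., data = trainData, mtry = 13, importance = TRUE)
predict.rf13 <- predict(rf.player13, testData)
mean((predict.rf13 - testing_outcome)^2)   # test MSE for comparison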

Boosting

library(gbm)
set.seed(1)
# 5000 boosting iterations, depth-4 trees, shrinkage (learning rate) of 0.01
boost.player <- gbm(overall_rating ~ ., data = trainData, distribution = "gaussian",
                    n.trees = 5000, interaction.depth = 4, shrinkage = 0.01)
boost.player
## gbm(formula = overall_rating ~ ., distribution = "gaussian", 
##     data = trainData, n.trees = 5000, interaction.depth = 4, 
##     shrinkage = 0.01)
## A gradient boosted model with gaussian loss function.
## 5000 iterations were performed.
## There were 41 predictors of which 40 had non-zero influence.
# plot(boost.player, i.var = "reactions")   # partial dependence plot for one attribute
predict.boost <- predict(boost.player, testData, n.trees = 5000)
plot(predict.boost, testing_outcome)
abline(0, 1, col = "red")

MSE.boost <- mean((predict.boost - testing_outcome)^2)
MSE.boost
## [1] 2.043377
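
gbm can also rank the attributes by relative influence, which ties the boosted model back to the research question; a small addition, output omitted:

# summary.gbm returns each predictor's relative influence in the ensemble
infl <- summary(boost.player, n.trees = 5000, plotit = FALSE)
head(infl, 10)   # top ten attributes by relative influence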

PCA

PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. In our case, the first five components account for more than 80% of the variance.

Player_Attributes <- read.csv("Player_Attributes.csv")
PureDat <- na.omit(Player_Attributes)

pcadata <- PureDat[,-1:-5]    # drop the leading identifier/date columns

pcadata <- pcadata[,-3:-5]    # drop the categorical attribute columns

pca <- prcomp(pcadata, scale. = TRUE)   # standardize variables before PCA

summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     3.9739 2.3548 1.78321 1.38805 1.24921 0.92994
## Proportion of Variance 0.4512 0.1584 0.09085 0.05505 0.04459 0.02471
## Cumulative Proportion  0.4512 0.6096 0.70047 0.75552 0.80011 0.82482
##                            PC7     PC8    PC9    PC10    PC11    PC12
## Standard deviation     0.82940 0.75174 0.6720 0.63081 0.60798 0.56705
## Proportion of Variance 0.01965 0.01615 0.0129 0.01137 0.01056 0.00919
## Cumulative Proportion  0.84447 0.86062 0.8735 0.88489 0.89545 0.90464
##                           PC13    PC14    PC15    PC16   PC17    PC18
## Standard deviation     0.55789 0.53839 0.53167 0.49789 0.4769 0.45934
## Proportion of Variance 0.00889 0.00828 0.00808 0.00708 0.0065 0.00603
## Cumulative Proportion  0.91353 0.92181 0.92989 0.93697 0.9435 0.94950
##                           PC19    PC20    PC21    PC22    PC23    PC24
## Standard deviation     0.42929 0.42130 0.41440 0.41023 0.38759 0.37255
## Proportion of Variance 0.00527 0.00507 0.00491 0.00481 0.00429 0.00397
## Cumulative Proportion  0.95476 0.95983 0.96474 0.96955 0.97384 0.97781
##                           PC25    PC26    PC27    PC28    PC29    PC30
## Standard deviation     0.36133 0.32539 0.31974 0.29824 0.28911 0.23982
## Proportion of Variance 0.00373 0.00303 0.00292 0.00254 0.00239 0.00164
## Cumulative Proportion  0.98154 0.98456 0.98748 0.99002 0.99241 0.99406
##                           PC31    PC32    PC33    PC34    PC35
## Standard deviation     0.23464 0.21849 0.19771 0.18436 0.17942
## Proportion of Variance 0.00157 0.00136 0.00112 0.00097 0.00092
## Cumulative Proportion  0.99563 0.99699 0.99811 0.99908 1.00000
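
The cumulative proportions can also be pulled out directly; a small addition restating the summary above:

# Proportion of variance explained, accumulated across components
cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
cumvar[5]   # about 0.80: the first five components exceed 80%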
library("factoextra")
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
fviz_contrib(pca, choice = "var", axes = 1:2)

fviz_screeplot(pca, ncp=10)

fviz_pca_var(pca, col.var="contrib") +
  scale_color_gradient2(low="white", mid="blue", 
                        high="red", midpoint=50) + theme_minimal()

K-Means Clustering

K-means partitions the data into K distinct, non-overlapping clusters. Here k-means was run with 3 clusters, plotted in a two-dimensional projection of the data. The plot of the three clusters below shows the desired set of skills in red, the less desired attributes in green, and the goalkeeper attributes in blue.

library(cluster)
## Warning: package 'cluster' was built under R version 3.3.2
library(NbClust)
## Warning: package 'NbClust' was built under R version 3.3.2
scaleddata <- scale(pcadata)   # standardize attributes before clustering

set.seed(123)
km.res <- kmeans(scaleddata, 3, nstart = 25)   # 25 random starts for stability

fviz_cluster(km.res, data = scaleddata, geom = "point",
             stand = FALSE, ellipse.type = "norm")
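
NbClust is loaded above but not used; a quick sanity check on the choice of k = 3 is an elbow plot of the within-cluster sum of squares. A sketch introduced here, run on a subsample to keep it fast:

# Elbow plot of total within-cluster sum of squares for k = 1..10
idx <- sample(nrow(scaleddata), 2000)
fviz_nbclust(scaleddata[idx, ], kmeans, method = "wss", k.max = 10)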

Neural Network

Neural networks consist of multiple layers, and the signal path traverses from front to back. In back-propagation, the error observed at the output is propagated backwards through the network to adjust the weights of the earlier units; this is done during training, where the correct result is known [Wikipedia]. In our example we use the neural network as a classifier with the 11 inputs suggested by the PCA, a single hidden layer of two neurons (hidden = 2 below), and one output layer. A data error rate of 13.59% was reported after prediction.

require(neuralnet)
## Loading required package: neuralnet
Player_Attributes <- read.csv("Player_Attributes.csv")
PureDat <- na.omit(Player_Attributes)

# Binary label: 1 when the overall rating exceeds 60, else 0
Rating <- ifelse(PureDat$overall_rating > 60, "1", "0")
playerRate <- data.frame(PureDat, Rating)
playerRate <- playerRate[,-1:-5]   # drop the leading identifier/date columns
playerRate <- playerRate[,-3:-5]   # drop the categorical columns
playerRate <- playerRate[,-1]      # drop one more column not used as an input

observations <- dim(playerRate)[1]
set.seed(2)
# Train on a 10% subsample to keep training time manageable
NNplayerRate <- sample(1:observations, 0.1*observations)
NNplayerData <- playerRate[NNplayerRate,]
NNplayerData$Rating <- as.integer(levels(NNplayerData$Rating))[NNplayerData$Rating]   # factor -> 0/1 integer

nn <- neuralnet(Rating ~ potential + ball_control + standing_tackle + dribbling +
                  marking + sliding_tackle + short_passing + finishing + long_shots +
                  positioning + volleys,
                data = NNplayerData, stepmax = 1e+09, hidden = 2,
                learningrate = 0.1, err.fct = "ce", linear.output = FALSE)

nn$result.matrix
##                                             1
## error                       6445.567640393469
## reached.threshold              0.009641624018
## steps                        241.000000000000
## Intercept.to.1layhid1         -0.152952331640
## potential.to.1layhid1          0.118863169096
## ball_control.to.1layhid1       0.819030533196
## standing_tackle.to.1layhid1    0.406049198777
## dribbling.to.1layhid1         -0.550054096252
## marking.to.1layhid1           -0.898152717477
## sliding_tackle.to.1layhid1    -0.221742949149
## short_passing.to.1layhid1      0.102127045796
## finishing.to.1layhid1          0.925149767207
## long_shots.to.1layhid1         1.010477700362
## positioning.to.1layhid1        1.837084549566
## volleys.to.1layhid1            0.760471635096
## Intercept.to.1layhid2          0.162351369355
## potential.to.1layhid2          0.306421798913
## ball_control.to.1layhid2      -1.521894360663
## standing_tackle.to.1layhid2   -1.190251074383
## dribbling.to.1layhid2         -0.705679572282
## marking.to.1layhid2           -1.525087681825
## sliding_tackle.to.1layhid2     0.084303831524
## short_passing.to.1layhid2     -0.659104731183
## finishing.to.1layhid2         -0.014421031579
## long_shots.to.1layhid2         1.840475048273
## positioning.to.1layhid2        0.057007871606
## volleys.to.1layhid2            0.244852565076
## Intercept.to.Rating            0.911653532953
## 1layhid.1.to.Rating            1.126170350447
## 1layhid.2.to.Rating          -16.668528207840
pred.nn <- prediction(nn)   # neuralnet's prediction() reports the error on the training data
## Data Error:  13.59236701;
plot(nn, rep="best")
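
The figure above is the error on the training sample. A sketch of an out-of-sample check (the held-out data and names below are introduced here, not part of the original analysis):

# Score the rows that were not sampled for training
vars <- c("potential", "ball_control", "standing_tackle", "dribbling", "marking",
          "sliding_tackle", "short_passing", "finishing", "long_shots",
          "positioning", "volleys")
NNtestData <- playerRate[-NNplayerRate,]
NNtestData$Rating <- as.integer(levels(NNtestData$Rating))[NNtestData$Rating]
pred <- compute(nn, NNtestData[, vars])$net.result   # forward pass through the fitted net
mean(ifelse(pred > 0.5, 1, 0) != NNtestData$Rating)  # held-out misclassification rate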

Conclusion

It was an amazing experience to see how statistical learning techniques applied to soccer player data can identify the most important attributes of player performance, which determine the ratings assigned by soccer specialists. After performing a series of classification and regression experiments, we obtained the set of skills/attributes that contribute most to a high rating for a soccer player: potential, ball_control, standing_tackle, dribbling, marking, sliding_tackle, short_passing, finishing, long_shots, positioning, and volleys. The following table shows the test MSE of the regression methods:

                    Regression Method                 MSE
                    ======================================
                    Regression Tree:              14.29938
                    Regression Tree + Pruning:    16.93663
                    Bagging:                      3.064333
                    Random Forest:                3.066119
                    Boosting:                     2.043377

The following table shows the misclassification rates of the classification methods:

                    Classification Method                         Misclassification
                    ===============================================================
                    Classification Decision Tree:                   0.06421815
                    Classification Decision Tree + Pruning:         0.08414562

The neural network classifier used the 11 inputs suggested by the PCA, a single hidden layer of two neurons, and one output layer; after prediction we obtained a data error rate of 13.59%.

In Project 3 we will apply additional methods, such as Naïve Bayes, using Python.