Sports analytics has been used effectively in sports such as baseball and basketball, but research applying statistical learning techniques to soccer remains limited. There is a need to determine whether these techniques can produce better, more insightful results in soccer analytics. In this paper, our aim is to perform descriptive as well as predictive analysis of soccer players' performances. In soccer, it is common to rely on ratings by specialists to assess a player's performance, yet the specialists do not reveal the measures behind their ratings. We try to identify the attributes of a player's performance that most strongly determine the overall rating, and in doing so recover the implicit knowledge the specialists use when rating players.
A series of supervised classification and regression techniques was applied, including classification and regression decision trees, random forests, bagging, and boosting. In addition, unsupervised PCA and K-means clustering were applied, as well as a neural network with 10-fold cross-validation.
“Does the dataset tell us which set of skills/attributes determines the overall rating of a soccer player's performance, so that we can decide whether or not to buy that player?”
“Soccer players with a certain set of skills/attributes are more likely to have a higher performance rating than other players.”
The soccer database consists of seven datasets with 183 attributes and about 200,000 observations. For this project we used the Player dataset, with 11,060 player observations and 7 attributes, and the Player_Attributes dataset, with 183,978 observations and 42 attributes. The response variable indicates whether or not a player is considered a high-rating player. Player and team attributes are sourced from EA Sports' FIFA video game series (http://sofifa.com/); the FIFA series and all FIFA assets are property of EA Sports.
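If the source is the SQLite release of the database, the two tables can be pulled from it directly; the sketch below assumes the file is named database.sqlite and that the RSQLite package is available (the analysis below instead reads pre-exported CSV files).
# Sketch: load the Player and Player_Attributes tables from the SQLite release
library(RSQLite)
con <- dbConnect(SQLite(), "database.sqlite")
player        <- dbReadTable(con, "Player")
player_attrib <- dbReadTable(con, "Player_Attributes")
dbDisconnect(con)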
require(tree)
## Loading required package: tree
require(ISLR)
## Loading required package: ISLR
player_com <- read.csv("Player_Attrib.csv")
PureDat <- na.omit(player_com)
#str(PureDat)
Rating = ifelse(PureDat$overall_rating > 60, "Buy", "NoBuy")  # compare against the numeric threshold 60
##dim(Rating)
playerRate= data.frame(PureDat,Rating)
# Dividing the data into 50% training and 50% testing sets
playerRate=playerRate[,-6]
set.seed(2)
observations=dim(playerRate)[1]
train=sample(1:observations,0.5*observations)
##str(train)
test=-train
trainData=playerRate[train,]
testData=playerRate[test,]
testing_outcome=Rating[test]
##table(testing_outcome)
TreeModel = tree(Rating ~ ., trainData)
Plot of the classification decision tree:
Running the Prediction model on test data
Tree_predict= predict(TreeModel, testData, type= "class")
Mis.Ctree = mean(Tree_predict != testing_outcome)
# Misclassification rate of the classification tree:
Mis.Ctree
## [1] 0.06421815
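A confusion matrix gives a fuller picture of where this error rate comes from; a minimal sketch (not part of the original output):
# Cross-tabulate predicted vs. actual Buy/NoBuy labels on the test set
table(Predicted = Tree_predict, Actual = testing_outcome)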
# Pruning the tree and testing its performance on the test dataset
set.seed(100)
cv_tree = cv.tree(TreeModel, FUN = prune.misclass)
cv_tree
## $size
## [1] 12 8 7 6 1
##
## $dev
## [1] 5776 5776 5777 6348 10535
##
## $k
## [1] -Inf 0.0 10.0 569.0 833.4
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(cv_tree$size, cv_tree$dev, type = "b", xlab = 'Tree Size', ylab = '')
We choose a tree size of 7 according to the elbow point in the graph.
pruned_model= prune.tree(TreeModel, best=7)
plot(pruned_model)
text(pruned_model, pretty = 0)
Tree_predict.pruned = predict(pruned_model, testData, type = "class")
Mis.Ptree = mean(Tree_predict.pruned != testing_outcome)
# Misclassification rate of the classification tree after pruning:
Mis.Ptree
## [1] 0.08414562
player_com <- read.csv("Player_Attrib.csv")
PureDat <- na.omit(player_com)
set.seed(2)
#PureDat = PureDat[,-6]
#str(PureDat)
observations=dim(PureDat)[1]
train=sample(1:observations,0.5*observations)
test=-train
trainData=PureDat[train,]
testData=PureDat[test,]
attach(PureDat)
testing_outcome=overall_rating[test]
TreeModel= tree(overall_rating~.,trainData)
plot(TreeModel)
text(TreeModel, pretty = 0)
#str(PureDat)
Tree_predict= predict(TreeModel, testData)
MSE.Regtree = mean((Tree_predict - testing_outcome)^2)
# MSE of Regression Tree:
MSE.Regtree
## [1] 14.29938
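Before pruning, it is worth checking which attributes the regression tree actually split on, since this previews the variable-importance results later; a short sketch (its output is not part of the original run):
# summary() on a tree object reports the residual deviance and the variables used in the splits
summary(TreeModel)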
Pruning the tree and testing its performance on the test dataset:
set.seed(10)
cv_tree = cv.tree(TreeModel, K=10)
#names(cv_tree)
plot(cv_tree$size, cv_tree$dev, type = "b", xlab = 'Tree Size', ylab = '')
We choose a tree size of 7 according to the elbow point in the graph.
pruned_model= prune.tree(TreeModel, best=7)
plot(pruned_model)
text(pruned_model, pretty = 0)
Tree_predict.pruned = predict(pruned_model, testData)
MSE.RegTreePrun = mean((Tree_predict.pruned - testing_outcome)^2)
# MSE of the pruned regression tree:
MSE.RegTreePrun
## [1] 16.93663
library(randomForest)
bag.player = randomForest(overall_rating ~ ., data = trainData, mtry = 41, importance = TRUE)
bag.player
##
## Call:
## randomForest(formula = overall_rating ~ ., data = trainData, mtry = 41, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 41
##
## Mean of squared residuals: 2.987862
## % Var explained: 93.8
summary(bag.player)
## Length Class Mode
## call 5 -none- call
## type 1 -none- character
## predicted 9017 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 9017 -none- numeric
## importance 82 -none- numeric
## importanceSD 41 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 9017 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
predict.bagging= predict(bag.player,newdata = testData)
MSE.bagging=mean((predict.bagging - testing_outcome)^2)
MSE.bagging
## [1] 3.064333
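Because importance = TRUE was set, the fitted bagging model already stores permutation-based variable importance; a sketch of inspecting it (not shown in the original output):
# %IncMSE and IncNodePurity rank the attributes by their contribution to the fit
importance(bag.player)
varImpPlot(bag.player, n.var = 10, main = "Bagging: top 10 attributes")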
plot(predict.bagging,testing_outcome)
abline(0, 1, col = "red")
set.seed(1)
rf.player =randomForest(overall_rating~.,trainData , mtry=41, importance =TRUE)
rf.player
##
## Call:
## randomForest(formula = overall_rating ~ ., data = trainData, mtry = 41, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 41
##
## Mean of squared residuals: 2.975786
## % Var explained: 93.82
predict.rf = predict (rf.player , testData)
plot(predict.rf,testing_outcome)
abline(0, 1, col = "red")
MSErf = mean((predict.rf - testing_outcome)^2)
MSErf
## [1] 3.066119
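Note that with mtry = 41 (all predictors) this fit is essentially the same as the bagging model above, which is why the two test MSEs are so close. A conventional random forest for regression samples roughly p/3 predictors at each split; a hedged sketch of that variant (the mtry value is our assumption, and its MSE is not part of the original results):
set.seed(1)
rf.decor = randomForest(overall_rating ~ ., data = trainData,
                        mtry = floor(41/3),  # ~p/3 predictors tried at each split
                        importance = TRUE)
mean((predict(rf.decor, testData) - testing_outcome)^2)  # test MSE for comparison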
library (gbm)
set.seed (1)
boost.player = gbm(overall_rating~.,data=trainData, distribution="gaussian",n.trees =5000,interaction.depth =4,shrinkage = 0.01)
boost.player
## gbm(formula = overall_rating ~ ., distribution = "gaussian",
## data = trainData, n.trees = 5000, interaction.depth = 4,
## shrinkage = 0.01)
## A gradient boosted model with gaussian loss function.
## 5000 iterations were performed.
## There were 41 predictors of which 40 had non-zero influence.
#plot(boost.player, i = "reactions")  # partial dependence on the reactions attribute
predict.boost=predict(boost.player,testData,n.trees=5000)
plot(predict.boost,testing_outcome)
abline(0, 1, col = "red")
MSE.boost = mean((predict.boost - testing_outcome)^2)
MSE.boost
## [1] 2.043377
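gbm can also rank the predictors by relative influence, which complements the tree and forest importance measures; a brief sketch (not in the original output):
# Relative influence of each attribute in the boosted model
rel.inf = summary(boost.player, n.trees = 5000, plotit = FALSE)
head(rel.inf, 10)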
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. In our case the first five components account for more than 80% of the variance.
Player_Attributes <- read.csv("Player_Attributes.csv")
PureDat <- na.omit(Player_Attributes)
pcadata=PureDat[,-1:-5]
pcadata=pcadata[,-3:-5]
pca <- prcomp(pcadata, scale = T)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 3.9739 2.3548 1.78321 1.38805 1.24921 0.92994
## Proportion of Variance 0.4512 0.1584 0.09085 0.05505 0.04459 0.02471
## Cumulative Proportion 0.4512 0.6096 0.70047 0.75552 0.80011 0.82482
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.82940 0.75174 0.6720 0.63081 0.60798 0.56705
## Proportion of Variance 0.01965 0.01615 0.0129 0.01137 0.01056 0.00919
## Cumulative Proportion 0.84447 0.86062 0.8735 0.88489 0.89545 0.90464
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.55789 0.53839 0.53167 0.49789 0.4769 0.45934
## Proportion of Variance 0.00889 0.00828 0.00808 0.00708 0.0065 0.00603
## Cumulative Proportion 0.91353 0.92181 0.92989 0.93697 0.9435 0.94950
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.42929 0.42130 0.41440 0.41023 0.38759 0.37255
## Proportion of Variance 0.00527 0.00507 0.00491 0.00481 0.00429 0.00397
## Cumulative Proportion 0.95476 0.95983 0.96474 0.96955 0.97384 0.97781
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 0.36133 0.32539 0.31974 0.29824 0.28911 0.23982
## Proportion of Variance 0.00373 0.00303 0.00292 0.00254 0.00239 0.00164
## Cumulative Proportion 0.98154 0.98456 0.98748 0.99002 0.99241 0.99406
## PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.23464 0.21849 0.19771 0.18436 0.17942
## Proportion of Variance 0.00157 0.00136 0.00112 0.00097 0.00092
## Cumulative Proportion 0.99563 0.99699 0.99811 0.99908 1.00000
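Since the first five components explain roughly 80% of the variance, their scores can be kept as a compact representation of the skill attributes; a minimal sketch (the object name is ours):
# prcomp stores the component scores in pca$x
pc_scores = pca$x[, 1:5]
head(round(pc_scores, 2))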
library("factoextra")
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
fviz_contrib(pca, choice = "var", axes = 1:2)
fviz_screeplot(pca, ncp = 10)
fviz_pca_var(pca, col.var = "contrib") +
  scale_color_gradient2(low = "white", mid = "blue",
                        high = "red", midpoint = 50) +
  theme_minimal()
K-Means Clustering
K-means clusters the data into K distinct, non-overlapping clusters. Here k-means was performed with 3 clusters, visualized in a two-dimensional projection of the data. The plot of the three clusters shown below represents the desired set of skills in red, the less desired attributes in green, and the goalkeeper attributes in blue.
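The choice of k = 3 can be sanity-checked with the elbow heuristic on the total within-cluster sum of squares; a sketch under that assumption (not part of the original analysis):
# Total within-cluster sum of squares for k = 1..8 on the scaled attributes
# (this can take a while on the full dataset)
wss = sapply(1:8, function(k)
  kmeans(scale(pcadata), centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster SS")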
library(cluster)
## Warning: package 'cluster' was built under R version 3.3.2
library(NbClust)
## Warning: package 'NbClust' was built under R version 3.3.2
scaleddata=scale(pcadata)
set.seed(123)
km.res <- kmeans(scaleddata, 3, nstart = 25)
fviz_cluster(km.res, data = scaleddata, geom = "point",
             stand = FALSE, ellipse.type = "norm")
Neural networks consist of multiple layers, and the signal path traverses from front to back. Backpropagation uses the forward pass to adjust the weights of the earlier ("front") units, and this is typically done during training where the correct result is known [wiki]. In our example we use a neural network as a classifier with the 11 inputs suggested by the PCA, one hidden layer with two neurons, and one output unit. A data error rate of 13.59% was reported after prediction.
require(neuralnet)
## Loading required package: neuralnet
Player_Attributes <- read.csv("Player_Attributes.csv")
PureDat <- na.omit(Player_Attributes)
Rating = ifelse(PureDat$overall_rating > 60, "1", "0")  # compare against the numeric threshold 60
playerRate= data.frame(PureDat,Rating)
playerRate=playerRate[,-1:-5]
playerRate=playerRate[,-3:-5]
playerRate=playerRate[,-1]
observations=dim(playerRate)[1]
set.seed(2)
NNplayerRate=sample(1:observations,0.1*observations)
NNplayerData=playerRate[NNplayerRate,]
NNplayerData$Rating = as.integer(levels(NNplayerData$Rating))[NNplayerData$Rating]
nn <- neuralnet(Rating~potential+ball_control+standing_tackle+dribbling+marking+sliding_tackle+short_passing+finishing+long_shots+positioning+volleys,data=NNplayerData,stepmax = 1e+09, hidden = 2,learningrate = 0.1, err.fct = "ce", linear.output = FALSE)
nn$result.matrix
##                                        1
## error 6445.567640393469
## reached.threshold 0.009641624018
## steps 241.000000000000
## Intercept.to.1layhid1 -0.152952331640
## potential.to.1layhid1 0.118863169096
## ball_control.to.1layhid1 0.819030533196
## standing_tackle.to.1layhid1 0.406049198777
## dribbling.to.1layhid1 -0.550054096252
## marking.to.1layhid1 -0.898152717477
## sliding_tackle.to.1layhid1 -0.221742949149
## short_passing.to.1layhid1 0.102127045796
## finishing.to.1layhid1 0.925149767207
## long_shots.to.1layhid1 1.010477700362
## positioning.to.1layhid1 1.837084549566
## volleys.to.1layhid1 0.760471635096
## Intercept.to.1layhid2 0.162351369355
## potential.to.1layhid2 0.306421798913
## ball_control.to.1layhid2 -1.521894360663
## standing_tackle.to.1layhid2 -1.190251074383
## dribbling.to.1layhid2 -0.705679572282
## marking.to.1layhid2 -1.525087681825
## sliding_tackle.to.1layhid2 0.084303831524
## short_passing.to.1layhid2 -0.659104731183
## finishing.to.1layhid2 -0.014421031579
## long_shots.to.1layhid2 1.840475048273
## positioning.to.1layhid2 0.057007871606
## volleys.to.1layhid2 0.244852565076
## Intercept.to.Rating 0.911653532953
## 1layhid.1.to.Rating 1.126170350447
## 1layhid.2.to.Rating -16.668528207840
pred.nn <- prediction(nn)
## Data Error: 13.59236701;
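As a cross-check on that error rate, fitted probabilities can also be obtained with neuralnet's compute() and thresholded at 0.5 to build a confusion matrix on the same training sample; a sketch (the threshold and object names are ours, not from the original run):
nn.cov = NNplayerData[, c("potential","ball_control","standing_tackle","dribbling",
                          "marking","sliding_tackle","short_passing","finishing",
                          "long_shots","positioning","volleys")]
nn.prob = compute(nn, nn.cov)$net.result
table(Predicted = ifelse(nn.prob > 0.5, 1, 0), Observed = NNplayerData$Rating)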
plot(nn, rep = "best")
It was an amazing experience to see how statistical learning techniques applied to soccer player data can identify the most important attributes of a player's performance, which determine the ratings assigned by soccer specialists. After performing a series of classification and regression experiments we obtained the set of skills/attributes that contribute most to a high rating for a soccer player: potential, ball_control, standing_tackle, dribbling, marking, sliding_tackle, short_passing, finishing, long_shots, positioning, and volleys. The following table shows the MSE of the regression methods:
Regression Method              MSE
======================================
Regression Tree                14.29938
Regression Tree + Pruning      16.93663
Bagging                         3.064333
Random Forest                   2.975786
Boosting                        2.043377
The following table shows the misclassification rates of the classification methods:
Classification Method                      Misclassification
===============================================================
Classification Decision Tree               0.06421815
Classification Decision Tree + Pruning     0.08414562
The neural network classifier used the 11 inputs suggested by the PCA, one hidden layer with two neurons, and one output unit; after prediction we obtained a data error rate of 13.59%.
In project 3 we will apply additional methods, such as Naïve Bayes, using Python.