library(tidyverse)
library(rpart)
library(rattle)
library(class)
library(dplyr)
library(randomForest)
library(caret)
library(caretEnsemble)
library(lattice)
library(stringr)
library(ipred)
library(fastAdaboost)
library(MASS)
library(pls)
library(e1071)
This dataset was obtained directly from Kaggle and contains detailed attributes for every player in the latest edition of the FIFA 19 database. The attributes range from a player’s age to their ball control rating in the game. Most of the skill columns are rated on a scale from 1 to 100. Since the dataset contains many columns rating players on particular skills, the following questions arose:
1. Can we predict a player’s general position (Forward, Goalie, etc.) based on the player’s characteristics and skill ratings?
1a. What patterns/relationships exist between players of similar rankings or of certain positions?
1b. Which skill rating seems to have a larger impact on a player’s overall rating?
Before conducting any tests, we will take a quick look at the dataset.
soccer1<-read.csv("SoccerFIFA19.csv")
#Cleaning Data
soccer1$Weight<-str_replace_all(soccer1$Weight,"[:alpha:]","")
soccer1$Weight<-as.numeric(soccer1$Weight)
#soccer1[,23:56]%>%
# head(10)
#Adding group rankings
soccer.updated<-soccer1%>%
mutate(Pace= (Acceleration+SprintSpeed)/2,
Physical= (Aggression + Jumping+ Stamina+Strength)/4,
GoalKeeping= (GKDiving+GKHandling+GKKicking+GKPositioning+GKReflexes)/5)
soccer.prac<- soccer.updated%>%
mutate(rank.group = ifelse(Overall >= 90,
"90+",
ifelse(Overall >= 80 & Overall < 90,
"80-89",
ifelse(Overall >= 70 & Overall < 80,
"70-79",
ifelse(Overall >= 60 & Overall < 70,
"60-69",
ifelse(Overall >= 50 & Overall < 60,
"50-59",
ifelse(Overall >= 40 & Overall < 50,
"40-49",
"rank below 40")))))))
#Adding a binary column regarding whether a player's right foot is dominant
soccer.prac<-soccer.prac %>%
mutate(dom.right.foot = ifelse(Preferred.Foot == "Right",
1,0))
soccer.prac$Preferred.Foot<-as.character(soccer.prac$Preferred.Foot)
## Adding a more general breakdown of player positions- Forward, Mid-Fielder,Defensive Midfielder,Defender, or Goalie
soccer.prac<-soccer.prac%>%
mutate(gen.position= ifelse(Position == "GK",
"GK",
ifelse(Position == "RW"|Position=="RF"|Position == "LF"|Position == "CF"|Position== "ST"|Position=="LS"|Position=="RS"|Position=="LW",
"FWD",
ifelse(Position=="LM"|Position=="RCM"|Position=="CM"|Position=="CAM"|Position=="CDM"|Position=="RAM"|Position=="LAM"|Position=="RM"|Position=="LCM"|Position=="RDM"|Position=="LDM",
"MID",
ifelse( Position=="LWB"|Position=="LB"|Position=="LCB"|Position=="CB"|Position=="RCB"|Position=="RB"|Position=="RWB",
"DEF",
"none"))
)))
#soccer.prac[,c(17,60)]
soccer.updated<- soccer.prac%>%
na.omit()
#soccer.updated[,c(60,62,61,63)]
soccer.updated<-soccer.updated%>%
dplyr::select(-Acceleration,-SprintSpeed,-Aggression,-Jumping,-Stamina,-Strength,-GKDiving,-GKHandling,-GKKicking,-GKPositioning,-GKReflexes,-Value,-Wage,-Special,-Real.Face,-Release.Clause)
#soccer.new[,c(1,10,58)]
#creating a subset of the original with 503 players: 3 chosen and 500 selected randomly
soccer.new<-soccer.updated[c(1,2,3),]
rows<- sample(4:nrow(soccer.updated),
size = 500,
replace= FALSE)
soccer.new<- rbind(soccer.new,soccer.updated[rows,])
rownames(soccer.new) <- NULL
soccer.new<-soccer.new%>%
na.omit()
Note: Grouping players’ positions into categories such as Midfielder or Defender may add some bias to the analysis, as others may choose to group them differently (for example, grouping LDM or RDM with Defenders) or may choose to leave each player’s specific position alone. Additionally, we will use a subset of the full dataset.
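The nested ifelse() chains used for the grouping can be hard to audit. As a minimal sketch, assuming a small made-up data frame `toy` in place of soccer1, the same rating bins and position groups could be written with cut() and %in%. One behavioral difference to keep in mind: %in% sends a missing Position to "none", whereas == would return NA.

```r
# Toy data standing in for soccer1; the values are invented for illustration.
toy <- data.frame(
  Overall  = c(94, 85, 72, 61, 55, 47, 38),
  Position = c("RF", "GK", "CDM", "LCB", "ST", "RWB", NA),
  stringsAsFactors = FALSE
)

# cut() with right = FALSE builds half-open bins such as [40, 50) and [50, 60).
toy$rank.group <- cut(
  toy$Overall,
  breaks = c(-Inf, 40, 50, 60, 70, 80, 90, Inf),
  labels = c("rank below 40", "40-49", "50-59", "60-69",
             "70-79", "80-89", "90+"),
  right  = FALSE
)

# Lookup vectors keep each position group in one readable place.
fwd <- c("RW", "RF", "LF", "CF", "ST", "LS", "RS", "LW")
mid <- c("LM", "RCM", "CM", "CAM", "CDM", "RAM", "LAM", "RM",
         "LCM", "RDM", "LDM")
def <- c("LWB", "LB", "LCB", "CB", "RCB", "RB", "RWB")

toy$gen.position <- ifelse(toy$Position %in% "GK", "GK",
                    ifelse(toy$Position %in% fwd, "FWD",
                    ifelse(toy$Position %in% mid, "MID",
                    ifelse(toy$Position %in% def, "DEF", "none"))))

toy
```

Changing how LDM or RDM are grouped then only requires moving a label from one vector to another.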
ggplot(soccer.new,aes(Age))+
geom_histogram(color="blue",aes(fill=Age))+
ggtitle("Distribution based on Age")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(soccer.new,aes(gen.position))+
geom_bar(color="black",aes(fill=gen.position))+
ggtitle("Distribution by General Position")
ggplot(soccer.updated,aes(gen.position))+
geom_bar(color="black",aes(fill=gen.position))+
ggtitle("Distribution by General Position for all Players in Dataset")
ggplot(soccer.new,aes(rank.group))+
geom_bar(aes(fill=rank.group))+
ggtitle("Distribution by Overall Rating Group")
From the first graph, we can see that a majority of players are between the ages of 20 and 26. The second graph shows wide differences in the number of players counted for each position. Surprisingly, the full dataset (the third graph) shows roughly the same distribution of players across positions as the subset. Lastly, the final graph shows a roughly normal distribution of players across the overall rating groups.
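The visual comparison between the subset and the full dataset can also be checked numerically. The sketch below uses a simulated population (the class probabilities are invented, not taken from the FIFA data) and compares class proportions side by side, followed by a chi-squared goodness-of-fit test of the subset against the full proportions.

```r
set.seed(42)
positions <- c("DEF", "FWD", "GK", "MID")

# Hypothetical full population; the probabilities below are assumptions.
full <- factor(sample(positions, 5000, replace = TRUE,
                      prob = c(0.32, 0.19, 0.11, 0.38)),
               levels = positions)

# Random subset of 500, mirroring the sampling step in the report.
sub <- sample(full, 500)

# Proportion of each position in the full data and in the subset.
props <- rbind(full   = prop.table(table(full)),
               subset = prop.table(table(sub)))
round(props, 3)

# Goodness of fit: are the subset counts consistent with the full proportions?
gof <- chisq.test(table(sub), p = as.numeric(prop.table(table(full))))
gof$p.value
```

A large p-value would be consistent with the subset having the same position mix as the full data; it does not prove the two are identical.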
Before we dive into supervised learning techniques, let us begin by looking for any connections that may exist between the different players in terms of attributes.
set.seed(5)
x<-sample(1:nrow(soccer.new),20)
pca.arrests <- prcomp(soccer.new[x,]%>%
dplyr::select(Crossing,BallControl,Interceptions,GoalKeeping,Physical),
scale = TRUE)
#We can visualize how much variation in our data
#is explained by each PC
pca.arrests$sdev^2/sum(pca.arrests$sdev^2)
## [1] 0.70771162 0.19038356 0.05375391 0.03370873 0.01444219
soccer.new[x,c(1,13,47)]
## Name Position gen.position
## 322 M. Thorsby LDM MID
## 363 N. Zanellato CM MID
## 185 K. Lulić CAM MID
## 207 J. Obi LCM MID
## 203 A. Pyatov GK GK
## 377 J. Bajandouh CDM MID
## 297 J. Rasheed GK GK
## 213 A. Miranchuk CAM MID
## 222 B. Halliday RB DEF
## 71 F. Plach GK GK
## 403 J. Łoś ST FWD
## 387 S. Svendsen ST FWD
## 294 K. Benzema ST FWD
## 314 L. Krajnc CB DEF
## 431 Yang Jiawei RM MID
## 309 H. Rivera LM MID
## 332 C. Robles CDM MID
## 16 R. Manning LCM MID
## 410 F. Orlando RW FWD
## 128 F. Ranocchia CM MID
biplot(pca.arrests, scale = 0,cex=.7)
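The proportion of variance explained by each principal component can be turned into a cumulative figure to decide how many components are worth plotting. This is a minimal sketch that uses the built-in USArrests data as a stand-in for the player attributes; the 90% threshold is an arbitrary choice for illustration.

```r
# USArrests ships with R and stands in for the soccer attributes here.
pca <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance explained (PVE) by each component.
pve <- pca$sdev^2 / sum(pca$sdev^2)

# Cumulative PVE, and the smallest number of components reaching 90%.
cum.pve <- cumsum(pve)
n.keep  <- which(cum.pve >= 0.9)[1]

data.frame(PC = seq_along(pve), pve = round(pve, 3), cum.pve = round(cum.pve, 3))
n.keep
```

Applied to the output above, the first two components account for roughly 0.71 + 0.19 = 0.90 of the variance, which is why a two-dimensional biplot is a reasonable summary.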
We have decided to look at only a few attributes and a few players so as not to overwhelm the visualization. In the biplot, some of the players are clumped together or sit relatively close to other players. Referring back to our data frame, we notice that many of these players share the same general position, and even the same specific position, such as Center Back. Thus, we may assume that this visualization of attributes can be used as an indication of a player’s position, as well as of which other players they most resemble in terms of skills. With some background knowledge of soccer, this is not too surprising, as certain positions tend to specialize in a particular skill set. However, it raises the question of whether specific attributes have a larger say in what a player’s position may be. This leads us back to our two initial questions:
1. Can we predict a player’s general position (Forward, Goalie, etc.) based on the player’s characteristics and skill ratings?
1a. What patterns/relationships exist between players of similar positions?
Now that we have a general idea of our data, let us begin using different types of supervised learning techniques to answer these questions. We will begin with K-nearest neighbors.
KNN for Positions
## KNN with caret
trainControl1 <- trainControl(classProbs = TRUE)
knn.caret.pos<-train(factor(gen.position) ~ Age+ Overall+ Finishing+Dribbling+Crossing+HeadingAccuracy+ShortPassing+Volleys+Curve+FKAccuracy+LongPassing+BallControl+Pace+Agility+ Reactions+Balance+ShotPower+Physical+ LongShots+ Interceptions+ Positioning+Vision+ Penalties+ Composure+Marking+ StandingTackle+SlidingTackle+ GoalKeeping ,
data = soccer.new,
method="knn",
tuneLength = 12)
##KNN function-
set.seed(1)
class(soccer.new$Finishing)
## [1] "integer"
#summary(soccer.new$Agility)
#Let's look at how combinations of two variables relate to a player's general position.
Finishing.grid<- seq(from = 1, to= 100,by= 1 )
agility.grid<- seq(from= 1, to = 100, by= 1)
grid2<-expand.grid(Finishing.grid, agility.grid)
colnames(grid2)<- c("Finishing", "Agility")
predictions2<- knn(soccer.new%>%dplyr::select(Finishing,Agility),
grid2, soccer.new$gen.position,k=13)
grid.data2<-data.frame(grid2,predictions2)
##Graph 1 of knn
grid.data2%>%
ggplot(aes(x=Finishing,
y=Agility))+
geom_point(aes(color = factor(predictions2)),
size=2,
alpha=0.1)+
geom_point(data = soccer.new,
mapping = aes(x=Finishing,
y=Agility,
color=gen.position),
size=2)+guides(color=guide_legend(title= "Position"))+
ggtitle("KNN Actual and Predicted Positions for K = 13") +
xlab("Finishing Rating") +
ylab("Agility Rating")
##KNN 2nd graph - Position based on Physical and Pace
Physical.grid<- seq(from = 1, to= 100,by= 1 )
Pace.grid<- seq(from= 1, to = 100, by= 1)
grid3<-expand.grid(Physical.grid, Pace.grid)
colnames(grid3)<- c("Physical", "Pace")
predictions3<- knn(soccer.new%>%
dplyr::select(Physical,Pace),
grid3,
soccer.new$gen.position,
13)
grid.data3<-data.frame(grid3,predictions3)
grid.data3%>%
ggplot(aes(x=Physical,
y=Pace))+
geom_point(aes(color = factor(predictions3)),
size=2,
alpha=0.1)+
geom_point(data = soccer.new,
mapping = aes(x=Physical,
y=Pace,
color=gen.position),
size=2)+guides(color=guide_legend(title= "Position"))+
ggtitle("KNN Actual and Predicted Positions for K = 13") +
xlab("Physical Rating") +
ylab("Pace Rating")
Using caret’s train function with knn, we find that the optimal value is k = 13. With this value, our predictions do reasonably well for some pairs of variables. For example, the graph of the Finishing and Agility ratings shows less mixing between the predicted and actual positions of players. Although both variables in the first graph appear to play some part in determining a player’s position, Finishing appears to be the stronger indicator, as Forwards have higher ratings for this variable than Defenders and Goalies. For visualization, we are limited to two dimensions, so we must plot pairs of variables to get a sense of which attributes are stronger indicators of position. Overall, this provides some sense of what position a future player may be placed in.
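The choice of k can also be examined without caret. The sketch below is an illustration on the built-in iris data (not the soccer data): it standardizes the predictors and uses leave-one-out cross-validation through class::knn.cv to score a range of odd k values. The range 1 to 25 is an assumption.

```r
library(class)

set.seed(1)

# Standardize predictors so that no single variable dominates the distances.
X <- scale(iris[, 1:4])
y <- iris$Species

# Leave-one-out cross-validated accuracy for each candidate k.
ks  <- seq(1, 25, by = 2)
acc <- sapply(ks, function(k) mean(knn.cv(X, y, k = k) == y))

# The k with the highest cross-validated accuracy (first one in case of ties).
best.k <- ks[which.max(acc)]
data.frame(k = ks, accuracy = round(acc, 3))
best.k
```

The soccer ratings mostly share a 1 to 100 scale, so scaling matters less there, but Age and Overall sit on different ranges and would benefit from it.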
Now let us turn to decision trees.
##Decision Tree of rating based on players attributes
set.seed(2)
soccer.tree3<- rpart(factor(gen.position)~ Age +Overall+Finishing +Dribbling+ Crossing+ HeadingAccuracy+ ShortPassing+Volleys+Curve+FKAccuracy+LongPassing+BallControl+Pace+Agility+ Reactions+Balance+ShotPower+Physical+ LongShots+ Interceptions+ Positioning+Vision+ Penalties+ Composure+Marking+ StandingTackle+SlidingTackle+ GoalKeeping,
data=soccer.new)
fancyRpartPlot(soccer.tree3,cex=.6)
##Decision Tree made using caret
decision.caret.pos<- train(factor(gen.position)~ Age+ Overall+ Finishing+Dribbling+Crossing+HeadingAccuracy+ShortPassing+Volleys+Curve+FKAccuracy+LongPassing+BallControl+Pace+Agility+ Reactions+Balance+ShotPower+Physical+ LongShots+ Interceptions+ Positioning+Vision+ Penalties+ Composure+Marking+ StandingTackle+SlidingTackle+ GoalKeeping,
data=soccer.new,
method = "rpart",
tuneLength = 4)
#decision.caret.pos
Unlike with knn, we are able to see all the variables used in determining a player’s position (or at least the variables determined to be the most meaningful in categorizing a player). On the other hand, we are unable to see the predictions alongside the actual data, which would help indicate how accurate the predictions are. However, this is not necessarily bad. From this decision tree, it appears that certain variables, such as LongPassing, play a large role in determining whether a player will be categorized as a Forward or a Midfielder. This seems reasonable, as Forwards tend to be on the receiving end of long passes and are more focused on scoring goals. Overall, decision trees seem reasonable to use, but it is important to recall that they are prone to overfitting the data.
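One common way to limit the overfitting mentioned above is cost-complexity pruning. The following sketch, again on iris rather than the soccer data, grows a deliberately deep rpart tree and prunes it back to the complexity parameter with the lowest cross-validated error. The starting cp of 0.001 is an assumption.

```r
library(rpart)

set.seed(2)

# Grow a deliberately deep tree by using a small complexity parameter.
fit <- rpart(Species ~ ., data = iris, method = "class", cp = 0.001)

# cptable holds the cross-validated error (xerror) for each subtree size.
cp.tab  <- fit$cptable
best.cp <- cp.tab[which.min(cp.tab[, "xerror"]), "CP"]

# Prune back to the subtree with the lowest cross-validated error.
pruned <- prune(fit, cp = best.cp)

c(nodes.before = nrow(fit$frame), nodes.after = nrow(pruned$frame))
```

caret's method = "rpart" with tuneLength performs a similar search over cp values using its own resampling.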
Let us now look at Random Forests.
#Random Forest with caret
model.rf.pos <- train(factor(gen.position)~Age+Finishing+Dribbling+Crossing+HeadingAccuracy+ShortPassing+Volleys+Curve+FKAccuracy+LongPassing+BallControl+Pace+Agility+ Reactions+Balance+ShotPower+Physical+ LongShots+ Interceptions+ Positioning+Vision+ Penalties+ Composure+Marking+ StandingTackle+SlidingTackle+ GoalKeeping ,
data=soccer.new,
method = "ranger")
#random forest to create VarImpPlot
rf1.pos<- randomForest(factor(gen.position)~Age+ Finishing+Dribbling+Crossing+HeadingAccuracy+ShortPassing+Volleys+Curve+FKAccuracy+LongPassing+BallControl+Pace+Agility+ Reactions+Balance+ShotPower+Physical+ LongShots+ Interceptions+ Positioning+Vision+ Penalties+ Composure+Marking+ StandingTackle+SlidingTackle+ GoalKeeping,
data=soccer.new)
varImpPlot(rf1.pos, main= "Variable Importance in indicating Player position", cex=.7)
The visualization here tells a different story from the last two models. This plot shows the importance of each variable in predicting a player’s position. As with the previous models, there is some indication that a player’s position depends on certain variables. However, this plot shows the role of every variable included in the model and ranks them by their contribution to the prediction. From this, we see that SlidingTackle appears to be the variable that gives the most reliable indication of a player’s position.
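When randomForest is fit without importance = TRUE, varImpPlot reports the mean decrease in Gini impurity. A complementary, model-agnostic view is permutation importance: shuffle one predictor in held-out data and measure how much accuracy drops. The sketch below illustrates the idea with an rpart tree on iris; the model, data, split size, and 20 repetitions are all assumptions chosen to keep the example self-contained.

```r
library(rpart)

set.seed(10)

# Hold out 45 rows for evaluation.
test.rows <- sample(nrow(iris), 45)
train <- iris[-test.rows, ]
test  <- iris[test.rows, ]

fit <- rpart(Species ~ ., data = train, method = "class")

# Accuracy of the fitted model on a given data frame.
accuracy <- function(data) {
  mean(predict(fit, data, type = "class") == data$Species)
}
base.acc <- accuracy(test)

# For each predictor: shuffle it, re-score, and average the accuracy drop.
predictors <- setdiff(names(iris), "Species")
perm.importance <- sapply(predictors, function(v) {
  drops <- replicate(20, {
    shuffled <- test
    shuffled[[v]] <- sample(shuffled[[v]])
    base.acc - accuracy(shuffled)
  })
  mean(drops)
})

sort(perm.importance, decreasing = TRUE)
```

A predictor the model never uses gets an importance of exactly zero here, since shuffling it cannot change any prediction.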
Here, we will quickly look at an SVM.
svm1.rate <- svm(factor(gen.position) ~ Age+Finishing+Dribbling+Crossing+HeadingAccuracy+ShortPassing+Volleys+Curve+FKAccuracy+LongPassing+BallControl+Pace+Agility+ Reactions+Balance+ShotPower+Physical+ LongShots+ Interceptions+ Positioning+Vision+ Penalties+ Composure+Marking+ StandingTackle+SlidingTackle+ GoalKeeping,
data = soccer.new,
kernel = "linear")
#Making predictions with svm
pred.svm<-predict(svm1.rate, soccer.new)
#Let's check our confusion matrix
table(predict(svm1.rate, soccer.new), soccer.new$gen.position)
##
## DEF FWD GK MID
## DEF 161 0 0 14
## FWD 0 79 0 11
## GK 0 0 39 0
## MID 10 15 0 174
svm.caret.rate1<- train(factor(gen.position)~Age+Finishing+Dribbling+Crossing+HeadingAccuracy+ShortPassing+Volleys+Curve+FKAccuracy+LongPassing+BallControl+Pace+Agility+ Reactions+Balance+ShotPower+Physical+ LongShots+ Interceptions+ Positioning+Vision+ Penalties+ Composure+Marking+ StandingTackle+SlidingTackle+ GoalKeeping,
data = soccer.new,
method = "svmLinear",
tuneGrid = data.frame(C = c(.1,.3,1,10)))
Now, we will look at Linear Discriminant Analysis.
lda1.pos <- lda(gen.position~ Age+Finishing+Dribbling+Crossing+HeadingAccuracy+ShortPassing+Volleys+Curve+FKAccuracy+LongPassing+BallControl+Pace+Agility+ Reactions+Balance+ShotPower+Physical+ LongShots+ Interceptions+ Positioning+Vision+ Penalties+ Composure+Marking+ StandingTackle+SlidingTackle+ GoalKeeping,data=soccer.new)
#summary(lda1.pos)
#Make some predictions and visualize their classification
predictions5 <- predict(lda1.pos)$class
new.default <- data.frame(soccer.new, predictions5)
new.default %>%
ggplot(aes(x = Interceptions,
y = Finishing)) + geom_point(aes(color = predictions5))
lda1.pos$prior
## DEF FWD GK MID
## 0.33996024 0.18687873 0.07753479 0.39562624
lda.caret<-train(factor(gen.position) ~ Age+Finishing+Dribbling+Crossing+HeadingAccuracy+ShortPassing+Volleys+Curve+FKAccuracy+LongPassing+BallControl+Pace+Agility+ Reactions+Balance+ShotPower+Physical+ LongShots+ Interceptions+ Positioning+Vision+ Penalties+ Composure+Marking+ StandingTackle+SlidingTackle+ GoalKeeping,
data = soccer.new,
method = "lda")
This model has a similar visual output to knn. The groups cluster by their respective positions. However, unlike the knn graphs, this one gives no indication of whether the predictions are correct. We can see that Finishing and Interceptions do seem to be strong indicators of position, as was also seen in the variable importance plot of our Random Forest model. However, there are no insights into the other variables unless other graphs are created.
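Since the LDA scatterplot does not show whether predictions are right, a confusion table can fill that gap. This sketch uses iris as a stand-in and compares the resubstitution accuracy (predicting the rows the model was fit on) with the leave-one-out estimate that MASS::lda returns when CV = TRUE.

```r
library(MASS)

fit  <- lda(Species ~ ., data = iris)
pred <- predict(fit)$class

# Rows are predictions, columns are the true classes.
conf <- table(Predicted = pred, Actual = iris$Species)
conf

# Resubstitution accuracy: optimistic, since the same rows were used to fit.
resub.acc <- sum(diag(conf)) / sum(conf)

# Leave-one-out predictions: each row is classified by a model fit without it.
loo.pred <- lda(Species ~ ., data = iris, CV = TRUE)$class
loo.acc  <- mean(loo.pred == iris$Species)

c(resubstitution = resub.acc, leave.one.out = loo.acc)
```

Adding a column such as `pred == truth` to the plotting data frame and mapping it to point shape would bring the same information into the scatterplot.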
A deeper comparison:
#Confusion Matrix
confusionMatrix(knn.caret.pos, mode = "everything")
## Bootstrapped (25 reps) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction DEF FWD GK MID
## DEF 26.9 0.0 0.0 3.4
## FWD 0.0 14.8 0.0 2.6
## GK 0.0 0.0 7.9 0.0
## MID 6.0 4.6 0.0 33.8
##
## Accuracy (average) : 0.8338
confusionMatrix(decision.caret.pos, mode = "everything")
## Bootstrapped (25 reps) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction DEF FWD GK MID
## DEF 26.9 0.0 0.0 6.7
## FWD 0.0 12.6 0.0 4.1
## GK 0.1 0.0 7.7 0.0
## MID 6.4 6.6 0.0 28.7
##
## Accuracy (average) : 0.7599
confusionMatrix(model.rf.pos, mode = "everything")
## Bootstrapped (25 reps) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction DEF FWD GK MID
## DEF 29.9 0.0 0.0 4.2
## FWD 0.0 14.9 0.0 2.2
## GK 0.0 0.0 8.0 0.0
## MID 3.9 3.8 0.0 33.1
##
## Accuracy (average) : 0.8585
The output above shows the confusion matrices for the models. Overall, the accuracies of the models are relatively close to one another, with the exception of the decision tree model. It is interesting that the LDA, SVM, and Random Forest models are relatively similar in both their correct and their incorrect predictions. More specifically, they seem to make similar false predictions for the same positions. However, the SVM model appears to do the best, followed by Random Forest and, lastly, LDA. Meanwhile, there are some similarities between the confusion matrices of knn and the decision tree model, although it is no surprise that the decision tree makes more errors. In order to determine the “best” model(s), we will focus on the top three models for further investigation: Random Forest, LDA, and SVM. Lastly, we will consider each model’s Kappa value in addition to its accuracy. Although accuracy is a useful indicator of which model is “best”, it is also important to consider whether the predictions that turned out to be correct were due to the model or to random chance. This matters because of the distribution of players across positions. More specifically, we saw that Midfielders were the most numerous, followed by Defenders, and that there is a large gap between these two positions and the others.
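Kappa adjusts accuracy for the agreement expected by chance given the class frequencies. As a worked sketch, the counts below are copied from the linear SVM confusion table shown earlier (predictions on the same 503 rows the model was fit on, so the figures are optimistic); the variable names are ours.

```r
# Counts from the SVM confusion table above: rows = predicted, columns = actual.
labels <- c("DEF", "FWD", "GK", "MID")
conf <- matrix(c(161,  0,  0,  14,
                   0, 79,  0,  11,
                   0,  0, 39,   0,
                  10, 15,  0, 174),
               nrow = 4, byrow = TRUE,
               dimnames = list(Prediction = labels, Reference = labels))

n <- sum(conf)

# Observed agreement: share of players on the diagonal (plain accuracy).
observed <- sum(diag(conf)) / n

# Chance agreement: expected diagonal share if predictions and truth were
# independent, given the row and column totals.
expected <- sum(rowSums(conf) * colSums(conf)) / n^2

# Cohen's kappa rescales accuracy so that 0 means chance-level agreement.
kappa.hat <- (observed - expected) / (1 - expected)

c(accuracy = observed, chance = expected, kappa = kappa.hat)
```

Because Midfielders and Defenders make up most of the sample, the chance agreement is well above the 0.25 that four equally sized classes would give, which is why Kappa is noticeably lower than accuracy.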
We will now take a deeper look into the actual prediction of each model.
#Hold out 30% of the rows for testing; the remaining 70% are used for training
indexes<-sample(nrow(soccer.new),size = nrow(soccer.new)*.7)
soccer.train<-soccer.new[indexes,]
soccer.test <- soccer.new[-indexes,]
rf.soccer.model<- randomForest(factor(gen.position)~Age+ Finishing+Dribbling+Crossing+HeadingAccuracy+ShortPassing+Volleys+Curve+FKAccuracy+LongPassing+BallControl+Pace+Agility+ Reactions+Balance+ShotPower+Physical+ LongShots+ Interceptions+ Positioning+Vision+ Penalties+ Composure+Marking+ StandingTackle+SlidingTackle+ GoalKeeping,
data=soccer.train)
lda.soccer.model<-lda(gen.position~ Age+Finishing+Dribbling+Crossing+HeadingAccuracy+ShortPassing+Volleys+Curve+FKAccuracy+LongPassing+BallControl+Pace+Agility+ Reactions+Balance+ShotPower+Physical+ LongShots+ Interceptions+ Positioning+Vision+ Penalties+ Composure+Marking+ StandingTackle+SlidingTackle+ GoalKeeping,
data=soccer.train)
svm.soccer.model<-svm(factor(gen.position) ~ Age+Finishing+Dribbling+Crossing+HeadingAccuracy+ShortPassing+Volleys+Curve+FKAccuracy+LongPassing+BallControl+Pace+Agility+ Reactions+Balance+ShotPower+Physical+ LongShots+ Interceptions+ Positioning+Vision+ Penalties+ Composure+Marking+ StandingTackle+SlidingTackle+ GoalKeeping,
data = soccer.train,
kernel = "linear")
svm.soccer.model
##
## Call:
## svm(formula = factor(gen.position) ~ Age + Finishing + Dribbling +
## Crossing + HeadingAccuracy + ShortPassing + Volleys + Curve +
## FKAccuracy + LongPassing + BallControl + Pace + Agility + Reactions +
## Balance + ShotPower + Physical + LongShots + Interceptions +
## Positioning + Vision + Penalties + Composure + Marking + StandingTackle +
## SlidingTackle + GoalKeeping, data = soccer.train, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 124
#Making predictions on the held-out test rows with the models fit on soccer.train
rf.pred<-predict(rf.soccer.model,soccer.test)
pred.svm<-predict(svm.soccer.model, soccer.test)
pred.lda <- predict(lda.soccer.model, soccer.test)$class
model.data<-data.frame(rf.pred,
pred.svm,
pred.lda,
obs=1:length(rf.pred))
colors <- colorRampPalette(c("blue", "green", "yellow", "red"))
gathered.model.data <- model.data %>%
gather(key = "model",
value = "pred",
-obs)
gathered.model.data %>%
filter(obs <= 50) %>%
ggplot(aes(x = model,
y = obs)) +
geom_tile(aes(fill = factor(pred)),
color = "black")+scale_fill_brewer(palette="Dark2")
#the accuracy of each model, measured against the true positions of the test rows
true.vals <-soccer.test$gen.position
model.data <- data.frame(rf = rf.pred == true.vals,
lda= pred.lda == true.vals,
svm=pred.svm == true.vals,
obs = 1:length(rf.pred))
gathered.model.data1 <- model.data %>%
gather(key = "model",
value = "pred",
-obs)
gathered.model.data1 %>%
filter(obs <= 50) %>%
ggplot(aes(x = model,
y = obs)) +
geom_tile(aes(fill = factor(pred)),
color = "black")+scale_fill_brewer(palette="Dark2")
In the first graph, we are able to see the first 50 predictions that the three models made for each player’s position. There appear to be more differences between the Random Forest model’s predictions and the LDA model’s predictions than between either of them and SVM. When one model disagrees with the other two, SVM is consistently one of the two models that agree. Meanwhile, the second graph shows whether each model’s prediction matches the true position of each player. It appears that the Random Forest model predicted all of the first 50 players of the test set correctly. On the other hand, the other two models are less accurate, with a few false predictions. A random forest scored on rows it was trained on will usually look nearly perfect, so this comparison is only meaningful when the evaluated rows were held out from training. From this graph and the breakdown of the confusion matrices, we can begin to assume that our Random Forest model is the “best” model for predicting a future player’s position based on their attributes alone. However, let us look at the overall accuracy and, more specifically, the Kappa values of each model through the caret package.
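The distinction between training rows and held-out rows is easy to get wrong, so here is a minimal sketch of a stratified 70/30 split on iris (a stand-in for the soccer data), with accuracy reported separately on each part. The negative index is what keeps the two sets disjoint.

```r
library(MASS)

set.seed(3)

# Stratified split: sample 30% of the row indices within each class.
by.class  <- split(seq_len(nrow(iris)), iris$Species)
test.rows <- unlist(lapply(by.class, function(idx) {
  sample(idx, size = round(0.3 * length(idx)))
}))

train <- iris[-test.rows, ]  # negative index: every row NOT in the test set
test  <- iris[test.rows, ]

fit <- lda(Species ~ ., data = train)

# Accuracy on rows the model has seen versus rows it has not.
train.acc <- mean(predict(fit, train)$class == train$Species)
test.acc  <- mean(predict(fit, test)$class == test$Species)

c(train = train.acc, test = test.acc)
```

Only the test figure estimates how the model would do on a future player; the training figure is typically higher, sometimes much higher for flexible models such as random forests.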
library(lattice)
model.rf.pos
## Random Forest
##
## 503 samples
## 27 predictor
## 4 classes: 'DEF', 'FWD', 'GK', 'MID'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 503, 503, 503, 503, 503, 503, ...
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.8459665 0.7748837
## 2 extratrees 0.8465541 0.7746558
## 14 gini 0.8454638 0.7746201
## 14 extratrees 0.8566080 0.7905239
## 26 gini 0.8372849 0.7627554
## 26 extratrees 0.8584580 0.7931676
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 26, splitrule = extratrees
## and min.node.size = 1.
lda.caret
## Linear Discriminant Analysis
##
## 503 samples
## 27 predictor
## 4 classes: 'DEF', 'FWD', 'GK', 'MID'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 503, 503, 503, 503, 503, 503, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8599427 0.7960396
svm.caret.rate1
## Support Vector Machines with Linear Kernel
##
## 503 samples
## 27 predictor
## 4 classes: 'DEF', 'FWD', 'GK', 'MID'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 503, 503, 503, 503, 503, 503, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.1 0.8528128 0.7831194
## 0.3 0.8487055 0.7774382
## 1.0 0.8414224 0.7666905
## 10.0 0.8302358 0.7505808
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.1.
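From this we see that the LDA model (accuracy 0.860, Kappa 0.796) and our Random Forest model (accuracy 0.858, Kappa 0.793) are nearly tied, with the linear SVM slightly behind (accuracy 0.853, Kappa 0.783). Differences this small are likely within the noise of the 25 bootstrap resamples. Since the Random Forest is among the top performers and also ranks the variables by importance, we will declare our Random Forest model to be the “best” model for approaching our initial questions. Thus, let us take a closer look at our model’s predictions.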
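To judge whether a small gap in accuracy is meaningful, the two models can be scored on the same bootstrap resamples and the paired differences inspected. The sketch below does this by hand for LDA and a single rpart tree on iris; the models, data, and 25 repetitions are assumptions for illustration, and caret's own resampling is not reproduced here.

```r
library(MASS)
library(rpart)

set.seed(7)
n    <- nrow(iris)
reps <- 25

# For each resample: fit both models on the bootstrap sample and score
# them on the out-of-bag rows, so the comparison is paired.
res <- t(sapply(seq_len(reps), function(i) {
  in.bag <- sample(n, n, replace = TRUE)
  oob    <- setdiff(seq_len(n), in.bag)
  train  <- iris[in.bag, ]
  test   <- iris[oob, ]

  lda.fit  <- lda(Species ~ ., data = train)
  tree.fit <- rpart(Species ~ ., data = train, method = "class")

  c(lda  = mean(predict(lda.fit, test)$class == test$Species),
    tree = mean(predict(tree.fit, test, type = "class") == test$Species))
}))

# Mean accuracy per model, and the spread of the paired differences.
colMeans(res)
diffs <- res[, "lda"] - res[, "tree"]
c(mean.diff = mean(diffs), sd.diff = sd(diffs))
```

If the mean difference is small relative to its standard deviation across resamples, the ranking of the two models should not be taken too seriously.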
#Random Forest with caret
model.rf.pos <- train(factor(gen.position)~Age+Finishing+Dribbling+Crossing+HeadingAccuracy+ShortPassing+Volleys+Curve+FKAccuracy+LongPassing+BallControl+Pace+Agility+ Reactions+Balance+ShotPower+Physical+ LongShots+ Interceptions+ Positioning+Vision+ Penalties+ Composure+Marking+ StandingTackle+SlidingTackle+ GoalKeeping ,
data=soccer.new,
method = "ranger")
#random forest to create VarImpPlot
rf1.pos<- randomForest(factor(gen.position)~Age+ Finishing+Dribbling+Crossing+HeadingAccuracy+ShortPassing+Volleys+Curve+FKAccuracy+LongPassing+BallControl+Pace+Agility+ Reactions+Balance+ShotPower+Physical+ LongShots+ Interceptions+ Positioning+Vision+ Penalties+ Composure+Marking+ StandingTackle+SlidingTackle+ GoalKeeping,
data=soccer.new)
varImpPlot(rf1.pos, main= "Variable Importance in indicating Player position", cex=.7)
model.rf.pos
## Random Forest
##
## 503 samples
## 27 predictor
## 4 classes: 'DEF', 'FWD', 'GK', 'MID'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 503, 503, 503, 503, 503, 503, ...
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.8429451 0.7705198
## 2 extratrees 0.8431619 0.7701751
## 14 gini 0.8372689 0.7630167
## 14 extratrees 0.8501737 0.7811839
## 26 gini 0.8266335 0.7479682
## 26 extratrees 0.8489856 0.7796592
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 14, splitrule = extratrees
## and min.node.size = 1.
From this model, not only can we predict the position of a future player with roughly 85% accuracy, but we can also see which variables/attributes have the largest impact in deciding the position. As we can see from our graph, the top variables are SlidingTackle, StandingTackle, HeadingAccuracy, Finishing, and Interceptions. Overall, this is not surprising, as these attributes are key skills for particular positions. More specifically, the two types of tackling are skills of a Defender, with perhaps the exception of a few Midfielders. Additionally, a higher Finishing rating indicates that the player is most likely a Forward, as Forwards have more opportunities to attempt goals than Defenders or Goalies. This is not to say that the other attributes have no impact on the prediction of a player’s position, but they are attributes that players of different positions may share.
(It should be noted that the graphs of accuracy were based on only 50 players, and the dataset we worked with is a random sample of a much larger dataset. Thus, one should be wary of the bias this may introduce into the models and the final conclusions.)