Exploratory Analysis and Predictive Model Using Game Data from FIFA 2020

(1) Overview

I obtained a dataset that contains 100+ attributes for over 18,000 players from FIFA 20 and used it to conduct an exploratory analysis.

(2) Exploring the Game Data

Figure 1: The bar plot shows that most players are right-footed, which matches my experience that being left-footed is rare.
Figure 2: The scatter plot suggests that wages are positively correlated with market value. Players with higher market values are typically paid at a higher rate.

Summarizing the geographical composition of the data required some manual engineering. Using the “countrycode” package, I was able to classify the continent each player is from. A few nationalities could not be matched automatically, as the warning below shows, so I assigned those values manually.

## Warning in countrycode(sourcevar = data$nationality, origin = "country.name", : Some values were not matched unambiguously: Central African Rep., England, Kosovo, Northern Ireland, Scotland, Wales
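
The manual step for the unmatched names can be sketched in base R. This is a minimal illustration, not the original code: the `players` data frame and its `continent` column are hypothetical stand-ins for the result of the continent lookup, with `NA` where no match was found.

```r
# Hypothetical stand-in for the player data; `continent` mimics a lookup
# result that is NA for names the lookup could not match.
players <- data.frame(
  nationality = c("Brazil", "England", "Kosovo", "Wales", "Japan"),
  continent   = c("Americas", NA, NA, NA, "Asia"),
  stringsAsFactors = FALSE
)

# Manual assignments for the six names listed in the warning.
manual_continent <- c(
  "Central African Rep." = "Africa",
  "England"              = "Europe",
  "Kosovo"               = "Europe",
  "Northern Ireland"     = "Europe",
  "Scotland"             = "Europe",
  "Wales"                = "Europe"
)

# Fill only the rows that are still missing a continent.
missing <- is.na(players$continent)
players$continent[missing] <-
  unname(manual_continent[players$nationality[missing]])

table(players$continent)
```

Filling only the `NA` rows leaves every automatically matched value untouched, so the manual table needs to cover just the names in the warning.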

Figure 3: A majority of the players are European, which makes sense given that the four biggest leagues in the world are all European. Soccer is also huge in countries like Mexico, Brazil, and Argentina, which is why I would expect the Americas to be the second-largest group in FIFA.

Similar to the geographical analysis, I wanted to explore the positions of players in the game, which also required manual engineering of the data. First, I created a column that holds the specific (preferred) position the player occupies on the field. Second, I created a column that categorizes the player's position into one of the 4 general positions on the field: forward, midfield, defense, and goalkeeper.
Figure 4: The most common specific positions are center back (CB) and striker (ST), which are arguably among the most important positions on the field. But when we categorize the positions we find something interesting.
Figure 5: Once the positions are grouped, midfield turns out to be the largest category, even though it includes neither CB nor ST. Midfielders arguably influence the game more than any other position. Given that there are so many roles in midfield, it does not surprise me that this is the most heavily populated category.
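
The two position columns described above could be built roughly as follows. This is a sketch under assumptions: `player_positions` is an assumed column name holding a comma-separated list with the preferred position first, and the grouping of wide players is one possible choice, not necessarily the one behind Figure 5.

```r
# Hypothetical sample; `player_positions` is an assumed column name.
players <- data.frame(
  player_positions = c("RW, CF, ST", "ST, LW", "GK", "CB", "CM, CDM", "LB"),
  stringsAsFactors = FALSE
)

# Step 1: keep only the first (preferred) position.
players$position <- trimws(sub(",.*$", "", players$player_positions))

# Step 2: map each specific position to one of four general groups.
# Assumption: wingers count as forwards, wing-backs as defenders.
groups <- list(
  forward    = c("ST", "CF", "LW", "RW"),
  midfield   = c("CAM", "CM", "CDM", "LM", "RM"),
  defense    = c("CB", "LB", "RB", "LWB", "RWB"),
  goalkeeper = c("GK")
)
lookup <- setNames(
  rep(names(groups), lengths(groups)),
  unlist(groups, use.names = FALSE)
)
players$position_group <- unname(lookup[players$position])

table(players$position_group)
```

Keeping the mapping in a single named list makes it easy to see, and change, which specific positions fall into each general group.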

Creating a Model

So far my analysis has involved fairly basic questions. Next, I wanted to create a model that could predict a player's speed rating based on features such as height, weight, and athletic attributes. To get a basic understanding of the speed ratings, I created a histogram that illustrates their distribution. From my personal experience with the game, I know that goalkeepers have a null sprint speed attribute, so in order to create an accurate model I removed all goalkeepers from the dataframe I created.
Figure 6: As we can see, the data is approximately normally distributed, and very few players have sprint attributes over 85, which is considered world class. Similarly, a small percentage of players have speed attributes below 50.
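
A minimal sketch of the goalkeeper filter and histogram, using simulated ratings. The column names `position_group` and `speed` and the simulated values are assumptions for illustration only.

```r
# Hypothetical sample: goalkeepers have a missing speed rating.
set.seed(1)
players <- data.frame(
  position_group = c(rep("goalkeeper", 20), rep("midfield", 180)),
  speed          = c(rep(NA, 20), round(rnorm(180, mean = 65, sd = 10)))
)

# Drop goalkeepers so the model never sees a missing speed value.
outfield <- subset(players, position_group != "goalkeeper")

# Distribution of speed among the remaining players.
hist(outfield$speed, breaks = 20,
     main = "Speed rating (outfield players)", xlab = "Speed")

# Share of players at the two extremes discussed for Figure 6.
mean(outfield$speed > 85)
mean(outfield$speed < 50)
```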

I didn’t extract every attribute from the original dataset. I kept only the items that logically made sense and excluded anything skill-related, such as skills, heading, and shot power. I then created a dataframe to summarize the correlation between speed and these variables.

##   height weight   age pace physical acceleration agility reactions balance
## 1  -0.36  -0.33 -0.19 0.97    -0.15         0.88    0.65       0.1    0.48
##   jumping stamina strength
## 1    0.06    0.28    -0.27

I decided that any variable with a correlation between -.20 and .20 would not be worth including in my model, so age, physical, jumping, and reactions are excluded. I created the model below and rounded each prediction to the nearest whole number.
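
The selection rule can be checked directly against the correlations reported above. This sketch simply re-enters those values by hand; the variable names `cors`, `kept`, and `dropped` are illustrative.

```r
# Correlations with speed, copied from the table above.
cors <- c(height = -0.36, weight = -0.33, age = -0.19, pace = 0.97,
          physical = -0.15, acceleration = 0.88, agility = 0.65,
          reactions = 0.10, balance = 0.48, jumping = 0.06,
          stamina = 0.28, strength = -0.27)

threshold <- 0.20

# Keep predictors whose absolute correlation reaches the threshold.
kept    <- names(cors)[abs(cors) >= threshold]
dropped <- names(cors)[abs(cors) <  threshold]

kept
dropped

# Build the model formula from the kept predictors.
model_formula <- reformulate(kept, response = "speed")
model_formula
```

Filtering on the absolute value treats strong negative correlations, such as height and weight, the same way as strong positive ones.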

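library(caTools)  # provides sample.split()

set.seed(123)  # make the split reproducible
# Note: sample.split() expects a vector. Given a whole data frame it flags columns, and
# subset() recycles those flags over rows; pass stats$speed for a row-wise 75/25 split.
sample <- sample.split(stats, SplitRatio = .75)
train <- subset(stats, sample == TRUE)
test <- subset(stats, sample == FALSE)

# Linear model of speed on the predictors that passed the correlation filter
model <- lm(speed ~ height + weight + pace + acceleration + agility + balance + stamina + strength, data = train)
# Round predictions to whole rating points
test$prediction <- round(predict(model, test), 0)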

(3) Evaluating the Results

## [1] "The model predicted 2550 speed ratings correctly out of 4777"

## [1] "Approx. 53.38 %"

The model predicted the exact speed rating for roughly 53% of the test set. I was expecting a higher percentage, but on closer inspection the model is not as inaccurate as it seems. I took the difference between each prediction and the player's actual speed and ran a summary of those errors: the minimum is -1 and the maximum is 1. This indicates that even when the model misses on this test set, it is never off by more than 1 whole rating point.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.000000  0.000000  0.000000  0.002303  0.000000  1.000000
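
The two evaluation steps, exact-match accuracy and the error summary, can be sketched as follows. The `actual` and `prediction` vectors are made-up stand-ins for `test$speed` and `test$prediction`; only the final line uses the counts reported above.

```r
# Hypothetical ratings standing in for test$speed and test$prediction.
actual     <- c(70, 82, 65, 91, 58, 77)
prediction <- c(70, 81, 65, 92, 58, 77)

# Exact-match accuracy: share of rounded predictions equal to the true rating.
correct  <- sum(prediction == actual)
accuracy <- correct / length(actual)

# Signed error: how far off each prediction is, and in which direction.
error <- prediction - actual
summary(error)

# Share of predictions within one rating point of the truth.
within_one <- mean(abs(error) <= 1)

# The percentage reported above follows from the two counts.
round(2550 / 4777 * 100, 2)
```

Reporting the within-one share alongside exact-match accuracy gives a fairer picture when predictions are rounded to whole numbers, since a tiny miss near a rounding boundary counts as a full failure under exact matching.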

## 
## Call:
## lm(formula = speed ~ height + weight + pace + acceleration + 
##     agility + balance + stamina + strength, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.23883 -0.39861  0.00914  0.41631  1.25859 
## 
## Coefficients:
##                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)   1.733e-02  2.608e-01    0.066   0.9470    
## height       -1.145e-04  1.426e-03   -0.080   0.9360    
## weight       -2.232e-04  1.273e-03   -0.175   0.8608    
## pace          1.798e+00  1.627e-03 1105.128   <2e-16 ***
## acceleration -7.976e-01  1.631e-03 -488.955   <2e-16 ***
## agility      -5.605e-04  7.081e-04   -0.792   0.4286    
## balance      -3.739e-05  7.734e-04   -0.048   0.9614    
## stamina       8.525e-04  5.099e-04    1.672   0.0946 .  
## strength     -1.343e-04  6.523e-04   -0.206   0.8369    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5244 on 11456 degrees of freedom
## Multiple R-squared:  0.9979, Adjusted R-squared:  0.9979 
## F-statistic: 6.928e+05 on 8 and 11456 DF,  p-value: < 2.2e-16

Looking at the summary of the model, it is clear that pace and acceleration are the only significant variables. The rest of the variables have p-values greater than .05, indicating they’re not statistically significant. The model’s R-squared is almost 1, which indicates that the model’s independent variables explain almost 100% of the variance in the dependent variable. Perhaps if we exclude the insignificant variables we will have a higher rate of success.
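
One way to read the two significant coefficients is to rearrange the fitted equation for pace, ignoring the near-zero terms. The sketch below is only arithmetic on the estimates reported in the summary. It is offered as a possible explanation for the near-perfect fit, not something confirmed by the dataset's documentation.

```r
# Estimates taken from the model summary above.
b_pace <- 1.798
b_acc  <- -0.7976

# speed is roughly b_pace * pace + b_acc * acceleration, so
# pace  is roughly (1 / b_pace) * speed + (-b_acc / b_pace) * acceleration
w_speed <- 1 / b_pace
w_acc   <- -b_acc / b_pace

round(c(speed = w_speed, acceleration = w_acc), 3)

# How close the two weights come to summing to 1.
w_speed + w_acc
```

The two weights sum to almost exactly 1, which would be consistent with pace being a weighted average of speed and acceleration. If that is the case, the model is mostly inverting that relationship, and dropping pace would be a more demanding test of the remaining predictors.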

(4) Conclusion

This project was interesting to me, and I felt like I learned something even though I’m already a passionate soccer fan. We produced some high-level exploratory analytics and created a linear model that is pretty accurate. It would be fun to create a similar project using actual player data.