Prepare the library and FIFADataset

suppressWarnings(library(caTools))
suppressWarnings(library(ggplot2))
suppressWarnings(library(corrplot))
suppressWarnings(library(forcats))
suppressWarnings(library(devtools))
suppressWarnings(library(psych))
suppressWarnings(library(corrr))
suppressWarnings(library(PerformanceAnalytics))
suppressWarnings(library(e1071))
fifa <- read.csv("C:/Users/atan/Desktop/All/1 study/5 kaggle/fifa.csv", header=TRUE)

FIFA Players Euro Value Analysis

Background Info

As 2018 Summer is approaching, FIFA players become the center of puplics attention again.In the following graph, we will show the top 10 most valuable players in the world. From the graph, we begin to wonder why Neymar, L.Messi, L.Suarez and C.Ronaldo are the most expensive players? What are the key characteristics made them world class players?

player <- fifa[, c(2,4,17)]
player1<- player[order(-player$eur_value),]
player2<- player1[1:10,]# top 10 players
ggplot(data=player2, aes(x=reorder(name,eur_value), y=eur_value))+geom_bar(stat="identity", aes(fill=club))+theme(axis.text.x = element_text(angle = 60, hjust = 1)) +ggtitle("10 Most Valuable Soccer Players in 2018") + xlab("Players") +ylab("Players Euro Value")

After we do a simple correlation analysis, we have noticed a players euro market value is highly associated with players euro wage, internation reputation, overall score and potentail score. Although euro value has an extremely high correlation coefficient, those two variables are too smiliar to anlysis. In addition, internation reputation is too abstract to do analysis. Therefore we would investigate the relationship between the euro value and a player’s overall ability.

value <- fifa[, c(17,18,20,21,28)]
cvalue <- cor(value)
corrplot(cvalue,title = "Correlation Plot of Euro value")

Analysis on Euro Value VS players’ overall ability

We will extract the variables contain plays basic body features (age, height, weight) and play abilities together with the players euro value to inverstigate the relationship between them using SVM.

fifa1 <- fifa[, c(4, 7,10,11, 17:21,28 ,34:62)]
lmfifa1 <- svm(formula= eur_value ~ overall, data=fifa1, type="eps-regression")

ggplot()+ geom_point(aes(x=fifa1$overall, y=fifa1$eur_value), colour="green")+
geom_line(aes(x=fifa1$overall, y=predict(lmfifa1, newdat1=fifa1)),color="purple")+ggtitle('FIFA players Euro value vs Players overall score') +xlab("Players overall score") +ylab("Players Euro value")

The graph demonstrated there is a strong relationship between players overall score and euro values. the approximation line fits the data pretty well up to overall score=90. For players overall ability higher than 90, then their value is determined by other factors, we guess club might be an important factor.However, we will not look into this, instead, we will explore what determine a player’s overall score.

Analysis on players overall score

1. Correlation analysis

We willperform a correlation analysis to select highly correlated variables with overall score for players and plot the correlation map and heat map.

fifa2 <- fifa1[,-c(1,5,6,7,9,10)]
cfifa <- cor(fifa2)

cor1 <- head(round(cfifa,2))
cor1
##            age height_cm weight_kg overall crossing finishing
## age       1.00      0.08      0.22    0.46     0.14      0.07
## height_cm 0.08      1.00      0.77    0.04    -0.49     -0.36
## weight_kg 0.22      0.77      1.00    0.14    -0.41     -0.30
## overall   0.46      0.04      0.14    1.00     0.39      0.32
## crossing  0.14     -0.49     -0.41    0.39     1.00      0.65
## finishing 0.07     -0.36     -0.30    0.32     0.65      1.00
##           heading_accuracy short_passing volleys dribbling curve
## age                   0.14          0.13    0.13      0.01  0.15
## height_cm             0.01         -0.37   -0.35     -0.50 -0.44
## weight_kg             0.02         -0.31   -0.28     -0.43 -0.36
## overall               0.34          0.49    0.38      0.35  0.41
## crossing              0.47          0.81    0.69      0.85  0.83
## finishing             0.47          0.66    0.88      0.82  0.75
##           free_kick_accuracy long_passing ball_control acceleration
## age                     0.20         0.19         0.08        -0.15
## height_cm              -0.39        -0.33        -0.42        -0.54
## weight_kg              -0.31        -0.27        -0.36        -0.48
## overall                 0.40         0.47         0.45         0.20
## crossing                0.76         0.76         0.84         0.66
## finishing               0.69         0.50         0.78         0.60
##           sprint_speed agility reactions balance shot_power jumping
## age              -0.14   -0.02      0.46   -0.09       0.15    0.17
## height_cm        -0.47   -0.62     -0.02   -0.79      -0.29   -0.07
## weight_kg        -0.42   -0.55      0.07   -0.68      -0.20    0.00
## overall           0.22    0.26      0.84    0.10       0.43    0.27
## crossing          0.64    0.69      0.38    0.62       0.70    0.12
## finishing         0.59    0.63      0.32    0.51       0.80    0.08
##           stamina strength long_shots aggression interceptions positioning
## age          0.09     0.32       0.15       0.27          0.20        0.08
## height_cm   -0.29     0.54      -0.37      -0.05         -0.05       -0.43
## weight_kg   -0.24     0.61      -0.29       0.02         -0.03       -0.36
## overall      0.36     0.36       0.41       0.40          0.32        0.35
## crossing     0.66    -0.04       0.74       0.47          0.42        0.78
## finishing    0.50    -0.02       0.87       0.23         -0.04        0.88
##           vision penalties composure marking standing_tackle
## age         0.19      0.13      0.32    0.15            0.13
## height_cm  -0.37     -0.34     -0.19   -0.04           -0.05
## weight_kg  -0.30     -0.26     -0.12   -0.03           -0.05
## overall     0.48      0.33      0.63    0.24            0.25
## crossing    0.68      0.65      0.64    0.39            0.42
## finishing   0.69      0.83      0.59   -0.08           -0.05
##           sliding_tackle
## age                 0.11
## height_cm          -0.07
## weight_kg          -0.06
## overall             0.22
## crossing            0.40
## finishing          -0.09
# pick the highest correlation with overall
cfifa1<- cfifa[c(1,4,8,11:14,18,20,24,25,28,30), c(1,4,8,11:14,18,20,24,25,28,30)]

#correlation map
corrplot(cfifa1,title = "Correlation Plot for Overall")

#heatmap
col <- colorRampPalette(c("darkblue", "white", "darkorange"))(20)
heatmap(x=cfifa1, col=col, symm=TRUE)

#Correlation map details
test<- cfifa1[, -c(1)]
suppressWarnings(chart.Correlation(test, histogram = TRUE, pch=19 , method="pearson"))

2.Split training and testing dataset

#spliting the dataset with ratio 0.75 with highly correlated variables
fifa3<- fifa2[,c(1,4,11:14,18,20,24,25,28,30)]

value <- factor(fifa3$overall)
pairs(fifa3[,-2], col=value, upper.panel = NULL, pch=16, cex=0.5)

set.seed(124)
split = sample.split(fifa3$overall, SplitRatio = 0.75)
train = subset(fifa3, split == TRUE)
test = subset(fifa3, split == FALSE)

3. PCA analysis

From the correlation analysis we picked 10 most important variables, now we will perform PCA analysis to see if we can reduce the dimensions down and if so, what is the most important PCs.

pcafifa <-  train [, -c(1)]
pca.value <- train[, 1]

#Perform PCA
pca1 <-prcomp(na.omit(pcafifa), center=TRUE, scale.=TRUE , cor=TRUE)
## Warning: In prcomp.default(na.omit(pcafifa), center = TRUE, scale. = TRUE, 
##     cor = TRUE) :
##  extra argument 'cor' will be disregarded
summary(pca1)
## Importance of components%s:
##                          PC1    PC2     PC3     PC4    PC5     PC6     PC7
## Standard deviation     2.674 1.1552 0.90195 0.63045 0.5490 0.50663 0.47034
## Proportion of Variance 0.650 0.1213 0.07396 0.03613 0.0274 0.02333 0.02011
## Cumulative Proportion  0.650 0.7714 0.84533 0.88146 0.9089 0.93220 0.95231
##                           PC8     PC9    PC10    PC11
## Standard deviation     0.4048 0.39052 0.33448 0.31045
## Proportion of Variance 0.0149 0.01386 0.01017 0.00876
## Cumulative Proportion  0.9672 0.98107 0.99124 1.00000
#plot PCA
plot(pca1, type="l")

biplot(pca1 , choices=1:2, cex=.7, expand=3, xlim=c(-0.1, 0), ylim=c(-.07,0.03))

The plot shows the variances associated the PCs, we can see the first 3 PCs explained most of the variability in the data. Now we use the predict function on the test dataset just by 3 variables, reactions, composure and long passing.
#### 4.SVM Regression

Now we will perform SVM Regression, see how it fits the data

lmfifa <- svm(formula= overall ~ reactions+ composure + long_passing, data=train)
1. visualization of the training set
ggplot()+ geom_point(aes(x=train$reactions, y=train$overall), colour="pink")+ geom_line(aes(x=train$reactions, y=predict(lmfifa, newdat=train)),color="green")+ggtitle('FIFA players Overall ability vs Players Reactions in Training Set') +xlab("Players Reaction") +ylab("Players Overall")

2.Visiualization of test set
ggplot()+ geom_point(aes(x=test$reactions, y=test$overall), colour="cyan")+ geom_line(aes(x=test$reactions, y=predict(lmfifa, newdat=test)),color="coral")+ggtitle('FIFA players Overall ability vs Players Reactions in Test Set') +xlab("Players Reaction") +ylab("Players Overall")

As we can see from the test dataset graph, our SVM model did a pretty good predictions with Players Overall Scores.

Conclusion

For soccer players who want to increase their euro market value, with this data analysis, we can suggest them to improve their overall score first by practice compsure, long_passing and train their reactions in the playfield. Wish the players will perform wonderful games in Russia, 2018!!