Prepare the library and FIFADataset
suppressWarnings(library(caTools))
suppressWarnings(library(ggplot2))
suppressWarnings(library(corrplot))
suppressWarnings(library(forcats))
suppressWarnings(library(devtools))
suppressWarnings(library(psych))
suppressWarnings(library(corrr))
suppressWarnings(library(PerformanceAnalytics))
suppressWarnings(library(e1071))
fifa <- read.csv("C:/Users/atan/Desktop/All/1 study/5 kaggle/fifa.csv", header=TRUE)
As 2018 Summer is approaching, FIFA players become the center of puplics attention again.In the following graph, we will show the top 10 most valuable players in the world. From the graph, we begin to wonder why Neymar, L.Messi, L.Suarez and C.Ronaldo are the most expensive players? What are the key characteristics made them world class players?
player <- fifa[, c(2,4,17)]
player1<- player[order(-player$eur_value),]
player2<- player1[1:10,]# top 10 players
ggplot(data=player2, aes(x=reorder(name,eur_value), y=eur_value))+geom_bar(stat="identity", aes(fill=club))+theme(axis.text.x = element_text(angle = 60, hjust = 1)) +ggtitle("10 Most Valuable Soccer Players in 2018") + xlab("Players") +ylab("Players Euro Value")
After we do a simple correlation analysis, we have noticed a players euro market value is highly associated with players euro wage, internation reputation, overall score and potentail score. Although euro value has an extremely high correlation coefficient, those two variables are too smiliar to anlysis. In addition, internation reputation is too abstract to do analysis. Therefore we would investigate the relationship between the euro value and a player’s overall ability.
value <- fifa[, c(17,18,20,21,28)]
cvalue <- cor(value)
corrplot(cvalue,title = "Correlation Plot of Euro value")
We will extract the variables contain plays basic body features (age, height, weight) and play abilities together with the players euro value to inverstigate the relationship between them using SVM.
fifa1 <- fifa[, c(4, 7,10,11, 17:21,28 ,34:62)]
lmfifa1 <- svm(formula= eur_value ~ overall, data=fifa1, type="eps-regression")
ggplot()+ geom_point(aes(x=fifa1$overall, y=fifa1$eur_value), colour="green")+
geom_line(aes(x=fifa1$overall, y=predict(lmfifa1, newdat1=fifa1)),color="purple")+ggtitle('FIFA players Euro value vs Players overall score') +xlab("Players overall score") +ylab("Players Euro value")
The graph demonstrated there is a strong relationship between players overall score and euro values. the approximation line fits the data pretty well up to overall score=90. For players overall ability higher than 90, then their value is determined by other factors, we guess club might be an important factor.However, we will not look into this, instead, we will explore what determine a player’s overall score.
We willperform a correlation analysis to select highly correlated variables with overall score for players and plot the correlation map and heat map.
fifa2 <- fifa1[,-c(1,5,6,7,9,10)]
cfifa <- cor(fifa2)
cor1 <- head(round(cfifa,2))
cor1
## age height_cm weight_kg overall crossing finishing
## age 1.00 0.08 0.22 0.46 0.14 0.07
## height_cm 0.08 1.00 0.77 0.04 -0.49 -0.36
## weight_kg 0.22 0.77 1.00 0.14 -0.41 -0.30
## overall 0.46 0.04 0.14 1.00 0.39 0.32
## crossing 0.14 -0.49 -0.41 0.39 1.00 0.65
## finishing 0.07 -0.36 -0.30 0.32 0.65 1.00
## heading_accuracy short_passing volleys dribbling curve
## age 0.14 0.13 0.13 0.01 0.15
## height_cm 0.01 -0.37 -0.35 -0.50 -0.44
## weight_kg 0.02 -0.31 -0.28 -0.43 -0.36
## overall 0.34 0.49 0.38 0.35 0.41
## crossing 0.47 0.81 0.69 0.85 0.83
## finishing 0.47 0.66 0.88 0.82 0.75
## free_kick_accuracy long_passing ball_control acceleration
## age 0.20 0.19 0.08 -0.15
## height_cm -0.39 -0.33 -0.42 -0.54
## weight_kg -0.31 -0.27 -0.36 -0.48
## overall 0.40 0.47 0.45 0.20
## crossing 0.76 0.76 0.84 0.66
## finishing 0.69 0.50 0.78 0.60
## sprint_speed agility reactions balance shot_power jumping
## age -0.14 -0.02 0.46 -0.09 0.15 0.17
## height_cm -0.47 -0.62 -0.02 -0.79 -0.29 -0.07
## weight_kg -0.42 -0.55 0.07 -0.68 -0.20 0.00
## overall 0.22 0.26 0.84 0.10 0.43 0.27
## crossing 0.64 0.69 0.38 0.62 0.70 0.12
## finishing 0.59 0.63 0.32 0.51 0.80 0.08
## stamina strength long_shots aggression interceptions positioning
## age 0.09 0.32 0.15 0.27 0.20 0.08
## height_cm -0.29 0.54 -0.37 -0.05 -0.05 -0.43
## weight_kg -0.24 0.61 -0.29 0.02 -0.03 -0.36
## overall 0.36 0.36 0.41 0.40 0.32 0.35
## crossing 0.66 -0.04 0.74 0.47 0.42 0.78
## finishing 0.50 -0.02 0.87 0.23 -0.04 0.88
## vision penalties composure marking standing_tackle
## age 0.19 0.13 0.32 0.15 0.13
## height_cm -0.37 -0.34 -0.19 -0.04 -0.05
## weight_kg -0.30 -0.26 -0.12 -0.03 -0.05
## overall 0.48 0.33 0.63 0.24 0.25
## crossing 0.68 0.65 0.64 0.39 0.42
## finishing 0.69 0.83 0.59 -0.08 -0.05
## sliding_tackle
## age 0.11
## height_cm -0.07
## weight_kg -0.06
## overall 0.22
## crossing 0.40
## finishing -0.09
# pick the highest correlation with overall
cfifa1<- cfifa[c(1,4,8,11:14,18,20,24,25,28,30), c(1,4,8,11:14,18,20,24,25,28,30)]
#correlation map
corrplot(cfifa1,title = "Correlation Plot for Overall")
#heatmap
col <- colorRampPalette(c("darkblue", "white", "darkorange"))(20)
heatmap(x=cfifa1, col=col, symm=TRUE)
#Correlation map details
test<- cfifa1[, -c(1)]
suppressWarnings(chart.Correlation(test, histogram = TRUE, pch=19 , method="pearson"))
#spliting the dataset with ratio 0.75 with highly correlated variables
fifa3<- fifa2[,c(1,4,11:14,18,20,24,25,28,30)]
value <- factor(fifa3$overall)
pairs(fifa3[,-2], col=value, upper.panel = NULL, pch=16, cex=0.5)
set.seed(124)
split = sample.split(fifa3$overall, SplitRatio = 0.75)
train = subset(fifa3, split == TRUE)
test = subset(fifa3, split == FALSE)
From the correlation analysis we picked 10 most important variables, now we will perform PCA analysis to see if we can reduce the dimensions down and if so, what is the most important PCs.
pcafifa <- train [, -c(1)]
pca.value <- train[, 1]
#Perform PCA
pca1 <-prcomp(na.omit(pcafifa), center=TRUE, scale.=TRUE , cor=TRUE)
## Warning: In prcomp.default(na.omit(pcafifa), center = TRUE, scale. = TRUE,
## cor = TRUE) :
## extra argument 'cor' will be disregarded
summary(pca1)
## Importance of components%s:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.674 1.1552 0.90195 0.63045 0.5490 0.50663 0.47034
## Proportion of Variance 0.650 0.1213 0.07396 0.03613 0.0274 0.02333 0.02011
## Cumulative Proportion 0.650 0.7714 0.84533 0.88146 0.9089 0.93220 0.95231
## PC8 PC9 PC10 PC11
## Standard deviation 0.4048 0.39052 0.33448 0.31045
## Proportion of Variance 0.0149 0.01386 0.01017 0.00876
## Cumulative Proportion 0.9672 0.98107 0.99124 1.00000
#plot PCA
plot(pca1, type="l")
biplot(pca1 , choices=1:2, cex=.7, expand=3, xlim=c(-0.1, 0), ylim=c(-.07,0.03))
The plot shows the variances associated the PCs, we can see the first 3 PCs explained most of the variability in the data. Now we use the predict function on the test dataset just by 3 variables, reactions, composure and long passing.
#### 4.SVM Regression
Now we will perform SVM Regression, see how it fits the data
lmfifa <- svm(formula= overall ~ reactions+ composure + long_passing, data=train)
ggplot()+ geom_point(aes(x=train$reactions, y=train$overall), colour="pink")+ geom_line(aes(x=train$reactions, y=predict(lmfifa, newdat=train)),color="green")+ggtitle('FIFA players Overall ability vs Players Reactions in Training Set') +xlab("Players Reaction") +ylab("Players Overall")
ggplot()+ geom_point(aes(x=test$reactions, y=test$overall), colour="cyan")+ geom_line(aes(x=test$reactions, y=predict(lmfifa, newdat=test)),color="coral")+ggtitle('FIFA players Overall ability vs Players Reactions in Test Set') +xlab("Players Reaction") +ylab("Players Overall")
As we can see from the test dataset graph, our SVM model did a pretty good predictions with Players Overall Scores.
For soccer players who want to increase their euro market value, with this data analysis, we can suggest them to improve their overall score first by practice compsure, long_passing and train their reactions in the playfield. Wish the players will perform wonderful games in Russia, 2018!!