We are building a model to predict which soccer player is most likely to be voted Most Valuable Player. We started from a dataset of around 18,000 observations (one per player) and a second dataset of all teams that have ever played in the World Cup. We combined the two datasets and removed all teams and players with missing data, which left 10,079 observations of individual players and 62 variables. We loaded the data and used the following libraries in our analysis:
data <- read.csv("fpdataset-2.csv")
library(ggplot2)
library(corrplot)
library(ResourceSelection)
library(pscl)
library(pROC)
library(e1071)
library(pastecs)
library(leaps)
library(ISLR)
library(glmnet)
library(rpart)
library(randomForest)
library(rpart.plot)
str(data)
## 'data.frame': 10079 obs. of 56 variables:
## $ PIN : int 12857 12250 2043 4302 15102 1491 10335 12412 3126 1253 ...
## $ ID : int 158973 172112 198145 207454 213272 213432 213624 214660 214973 215069 ...
## $ Name : Factor w/ 9545 levels "A. Öztürk",..: 5187 5599 8073 7304 5170 2791 8288 5410 5469 7775 ...
## $ Age : int 36 33 25 27 21 26 31 23 25 26 ...
## $ Nationality : Factor w/ 22 levels "Argentina","Australia",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Total.Games : int 77 77 77 77 77 77 77 77 77 77 ...
## $ Wins : int 42 42 42 42 42 42 42 42 42 42 ...
## $ Draws : int 14 14 14 14 14 14 14 14 14 14 ...
## $ Losses : int 21 21 21 21 21 21 21 21 21 21 ...
## $ Goals : int 131 131 131 131 131 131 131 131 131 131 ...
## $ Goals.Against : int 84 84 84 84 84 84 84 84 84 84 ...
## $ Champions : int 2 2 2 2 2 2 2 2 2 2 ...
## $ Finals : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Win51 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ SemiFinals : int 4 4 4 4 4 4 4 4 4 4 ...
## $ SFYesNo : int 1 1 1 1 1 1 1 1 1 1 ...
## $ NewWins : int 56 56 56 56 56 56 56 56 56 56 ...
## $ WinOverLoss : num 2.67 2.67 2.67 2.67 2.67 ...
## $ Win.Perc : num 0.727 0.727 0.727 0.727 0.727 ...
## $ Overall : int 63 63 74 71 59 76 65 63 73 76 ...
## $ Potential : int 63 63 78 71 65 79 65 71 75 79 ...
## $ Club : Factor w/ 618 levels " SSV Jahn Regensburg",..: 480 181 22 22 50 502 316 557 415 416 ...
## $ Value.in.M.Euros : num 50 220 7.5 2.5 160 9.5 325 450 4.9 6.5 ...
## $ Wage.in.K.Euros : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Special : int 860 1649 1789 1773 1062 1751 1050 1759 1808 1201 ...
## $ Acceleration : int 21 67 74 79 49 70 44 75 79 55 ...
## $ Aggression : int 24 60 32 26 24 48 22 58 44 33 ...
## $ Agility : int 25 65 74 82 35 74 37 80 80 43 ...
## $ Balance : int 22 69 71 82 68 68 37 86 84 53 ...
## $ Ball.control : int 13 66 83 70 18 76 17 66 74 23 ...
## $ Composure : int 25 67 66 63 28 76 29 65 66 49 ...
## $ Crossing : int 19 65 59 73 12 49 12 64 75 18 ...
## $ Curve : int 15 70 81 72 12 46 17 62 73 17 ...
## $ Dribbling : int 11 58 75 72 17 69 14 68 74 20 ...
## $ Finishing : int 10 55 71 63 13 78 16 50 67 18 ...
## $ Free.kick.accuracy : int 17 66 59 69 21 40 12 61 70 17 ...
## $ Heading.accuracy : int 20 57 60 50 17 78 14 53 49 12 ...
## $ Interceptions : int 10 34 20 42 25 23 16 55 38 22 ...
## $ Jumping : int 37 64 49 72 58 73 59 75 59 67 ...
## $ Long.passing : int 15 63 59 69 26 53 24 61 72 13 ...
## $ Long.shots : int 10 64 72 60 15 76 14 54 62 13 ...
## $ Marking : int 14 25 26 30 18 14 12 60 35 12 ...
## $ Penalties : int 21 57 69 71 21 70 25 44 65 23 ...
## $ Positioning : int 11 60 81 66 15 75 11 54 68 14 ...
## $ Reactions : int 49 60 80 64 58 76 73 56 69 63 ...
## $ Short.passing : int 13 64 70 70 26 67 23 64 73 29 ...
## $ Shot.power : int 22 62 69 65 17 75 19 35 70 25 ...
## $ Sliding.tackle : int 15 22 27 29 13 23 15 63 29 13 ...
## $ Sprint.speed : int 17 67 74 78 42 74 47 74 77 43 ...
## $ Stamina : int 16 59 71 65 43 72 30 61 66 31 ...
## $ Standing.tackle : int 12 27 22 31 21 22 17 62 31 18 ...
## $ Strength : int 50 65 62 34 51 73 63 58 43 59 ...
## $ Vision : int 26 58 68 67 24 62 33 56 70 53 ...
## $ Volleys : int 13 55 71 70 14 68 18 36 66 15 ...
## $ Preferred.Positions: Factor w/ 614 levels "CAM ","CAM CB CDM ",..: 214 1 567 460 214 567 214 288 430 214 ...
## $ SPI : num 85.5 85.5 85.5 85.5 85.5 ...
attach(data)
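We dropped the identifier columns (PIN, ID, Name and Nationality) and converted the remaining 52 columns to numeric. The factor columns Club and Preferred.Positions are stored as their integer level codes after this step: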
Ndata <- data.frame(lapply(data[, -c(1:3,5)], as.numeric))
str(Ndata)
## 'data.frame': 10079 obs. of 52 variables:
## $ Age : num 36 33 25 27 21 26 31 23 25 26 ...
## $ Total.Games : num 77 77 77 77 77 77 77 77 77 77 ...
## $ Wins : num 42 42 42 42 42 42 42 42 42 42 ...
## $ Draws : num 14 14 14 14 14 14 14 14 14 14 ...
## $ Losses : num 21 21 21 21 21 21 21 21 21 21 ...
## $ Goals : num 131 131 131 131 131 131 131 131 131 131 ...
## $ Goals.Against : num 84 84 84 84 84 84 84 84 84 84 ...
## $ Champions : num 2 2 2 2 2 2 2 2 2 2 ...
## $ Finals : num 5 5 5 5 5 5 5 5 5 5 ...
## $ Win51 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ SemiFinals : num 4 4 4 4 4 4 4 4 4 4 ...
## $ SFYesNo : num 1 1 1 1 1 1 1 1 1 1 ...
## $ NewWins : num 56 56 56 56 56 56 56 56 56 56 ...
## $ WinOverLoss : num 2.67 2.67 2.67 2.67 2.67 ...
## $ Win.Perc : num 0.727 0.727 0.727 0.727 0.727 ...
## $ Overall : num 63 63 74 71 59 76 65 63 73 76 ...
## $ Potential : num 63 63 78 71 65 79 65 71 75 79 ...
## $ Club : num 480 181 22 22 50 502 316 557 415 416 ...
## $ Value.in.M.Euros : num 50 220 7.5 2.5 160 9.5 325 450 4.9 6.5 ...
## $ Wage.in.K.Euros : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Special : num 860 1649 1789 1773 1062 ...
## $ Acceleration : num 21 67 74 79 49 70 44 75 79 55 ...
## $ Aggression : num 24 60 32 26 24 48 22 58 44 33 ...
## $ Agility : num 25 65 74 82 35 74 37 80 80 43 ...
## $ Balance : num 22 69 71 82 68 68 37 86 84 53 ...
## $ Ball.control : num 13 66 83 70 18 76 17 66 74 23 ...
## $ Composure : num 25 67 66 63 28 76 29 65 66 49 ...
## $ Crossing : num 19 65 59 73 12 49 12 64 75 18 ...
## $ Curve : num 15 70 81 72 12 46 17 62 73 17 ...
## $ Dribbling : num 11 58 75 72 17 69 14 68 74 20 ...
## $ Finishing : num 10 55 71 63 13 78 16 50 67 18 ...
## $ Free.kick.accuracy : num 17 66 59 69 21 40 12 61 70 17 ...
## $ Heading.accuracy : num 20 57 60 50 17 78 14 53 49 12 ...
## $ Interceptions : num 10 34 20 42 25 23 16 55 38 22 ...
## $ Jumping : num 37 64 49 72 58 73 59 75 59 67 ...
## $ Long.passing : num 15 63 59 69 26 53 24 61 72 13 ...
## $ Long.shots : num 10 64 72 60 15 76 14 54 62 13 ...
## $ Marking : num 14 25 26 30 18 14 12 60 35 12 ...
## $ Penalties : num 21 57 69 71 21 70 25 44 65 23 ...
## $ Positioning : num 11 60 81 66 15 75 11 54 68 14 ...
## $ Reactions : num 49 60 80 64 58 76 73 56 69 63 ...
## $ Short.passing : num 13 64 70 70 26 67 23 64 73 29 ...
## $ Shot.power : num 22 62 69 65 17 75 19 35 70 25 ...
## $ Sliding.tackle : num 15 22 27 29 13 23 15 63 29 13 ...
## $ Sprint.speed : num 17 67 74 78 42 74 47 74 77 43 ...
## $ Stamina : num 16 59 71 65 43 72 30 61 66 31 ...
## $ Standing.tackle : num 12 27 22 31 21 22 17 62 31 18 ...
## $ Strength : num 50 65 62 34 51 73 63 58 43 59 ...
## $ Vision : num 26 58 68 67 24 62 33 56 70 53 ...
## $ Volleys : num 13 55 71 70 14 68 18 36 66 15 ...
## $ Preferred.Positions: num 214 1 567 460 214 567 214 288 430 214 ...
## $ SPI : num 85.5 85.5 85.5 85.5 85.5 ...
attach(Ndata)
## The following objects are masked from data:
##
## Acceleration, Age, Aggression, Agility, Balance, Ball.control,
## Champions, Club, Composure, Crossing, Curve, Draws, Dribbling,
## Finals, Finishing, Free.kick.accuracy, Goals, Goals.Against,
## Heading.accuracy, Interceptions, Jumping, Long.passing,
## Long.shots, Losses, Marking, NewWins, Overall, Penalties,
## Positioning, Potential, Preferred.Positions, Reactions,
## SemiFinals, SFYesNo, Short.passing, Shot.power,
## Sliding.tackle, Special, SPI, Sprint.speed, Stamina,
## Standing.tackle, Strength, Total.Games, Value.in.M.Euros,
## Vision, Volleys, Wage.in.K.Euros, Win.Perc, Win51,
## WinOverLoss, Wins
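The conversion above can be illustrated on a small made-up data frame. This is only a sketch: `toy` and its values are hypothetical and are not taken from the project's dataset. It shows that `as.numeric()` applied to a factor returns the position of each value in the factor's level list, which is why Club and Preferred.Positions show the same integers before and after the conversion.

```r
# Toy data (hypothetical, not the project's dataset):
# one factor column and one integer column.
toy <- data.frame(
  Club = factor(c("Roma", "Ajax", "Roma", "Porto")),
  Overall = c(70L, 65L, 80L, 75L)
)

# Same pattern as the conversion in the text: apply as.numeric() to every column.
toy_num <- data.frame(lapply(toy, as.numeric))

# For a factor, as.numeric() returns the level codes, i.e. the index of each
# value in levels(toy$Club), not a quantity measured on the player.
print(levels(toy$Club))
print(toy_num$Club)
print(toy_num$Overall)
```

Because the codes only reflect the ordering of the level names, any model that treats such a column as a number will read it as an ordered magnitude. Whether that is acceptable depends on how the column is used later in the analysis.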
We standardized the data with scale(), so that each variable has mean 0 and standard deviation 1:
SNdata <- data.frame(scale(Ndata))
str(SNdata)
## 'data.frame': 10079 obs. of 52 variables:
## $ Age : num 2.3445 1.702 -0.0114 0.417 -0.8681 ...
## $ Total.Games : num 1.01 1.01 1.01 1.01 1.01 ...
## $ Wins : num 0.924 0.924 0.924 0.924 0.924 ...
## $ Draws : num 0.478 0.478 0.478 0.478 0.478 ...
## $ Losses : num 1.14 1.14 1.14 1.14 1.14 ...
## $ Goals : num 0.853 0.853 0.853 0.853 0.853 ...
## $ Goals.Against : num 1.03 1.03 1.03 1.03 1.03 ...
## $ Champions : num 0.928 0.928 0.928 0.928 0.928 ...
## $ Finals : num 1.75 1.75 1.75 1.75 1.75 ...
## $ Win51 : num 0.388 0.388 0.388 0.388 0.388 ...
## $ SemiFinals : num 0.498 0.498 0.498 0.498 0.498 ...
## $ SFYesNo : num 0.503 0.503 0.503 0.503 0.503 ...
## $ NewWins : num 0.881 0.881 0.881 0.881 0.881 ...
## $ WinOverLoss : num 0.116 0.116 0.116 0.116 0.116 ...
## $ Win.Perc : num 0.413 0.413 0.413 0.413 0.413 ...
## $ Overall : num -0.489 -0.489 1.042 0.624 -1.046 ...
## $ Potential : num -1.366 -1.366 1.007 -0.101 -1.05 ...
## $ Club : num 0.945 -0.702 -1.577 -1.577 -1.423 ...
## $ Value.in.M.Euros : num -0.722 -0.129 -0.87 -0.888 -0.338 ...
## $ Wage.in.K.Euros : num -0.468 -0.468 -0.468 -0.468 -0.468 ...
## $ Special : num -2.662 0.192 0.699 0.641 -1.932 ...
## $ Acceleration : num -2.922 0.172 0.643 0.979 -1.039 ...
## $ Aggression : num -1.8 0.249 -1.345 -1.686 -1.8 ...
## $ Agility : num -2.597 0.115 0.725 1.267 -1.919 ...
## $ Balance : num -2.972 0.356 0.498 1.277 0.285 ...
## $ Ball.control : num -2.636 0.454 1.445 0.687 -2.344 ...
## $ Composure : num -2.503 0.701 0.624 0.396 -2.274 ...
## $ Crossing : num -1.663 0.797 0.476 1.225 -2.038 ...
## $ Curve : num -1.74 1.21 1.8 1.32 -1.9 ...
## $ Dribbling : num -2.286 0.152 1.033 0.877 -1.974 ...
## $ Finishing : num -1.796 0.493 1.306 0.9 -1.644 ...
## $ Free.kick.accuracy : num -1.495 1.282 0.885 1.452 -1.268 ...
## $ Heading.accuracy : num -1.826 0.268 0.438 -0.128 -1.996 ...
## $ Interceptions : num -1.756 -0.61 -1.279 -0.228 -1.04 ...
## $ Jumping : num -2.3554 -0.0666 -1.3382 0.6116 -0.5752 ...
## $ Long.passing : num -2.379 0.654 0.401 1.033 -1.684 ...
## $ Long.shots : num -1.897 0.871 1.281 0.666 -1.641 ...
## $ Marking : num -1.391 -0.886 -0.84 -0.656 -1.207 ...
## $ Penalties : num -1.771 0.492 1.246 1.372 -1.771 ...
## $ Positioning : num -1.962 0.534 1.603 0.84 -1.758 ...
## $ Reactions : num -1.389 -0.225 1.891 0.198 -0.436 ...
## $ Short.passing : num -2.977 0.357 0.749 0.749 -2.127 ...
## $ Shot.power : num -1.916 0.375 0.776 0.547 -2.202 ...
## $ Sliding.tackle : num -1.424 -1.102 -0.872 -0.78 -1.516 ...
## $ Sprint.speed : num -3.274 0.164 0.645 0.92 -1.555 ...
## $ Stamina : num -2.944 -0.251 0.5 0.124 -1.253 ...
## $ Standing.tackle : num -1.622 -0.942 -1.169 -0.761 -1.214 ...
## $ Strength : num -1.18535 -0.00175 -0.23847 -2.44786 -1.10644 ...
## $ Vision : num -1.848 0.338 1.021 0.952 -1.984 ...
## $ Volleys : num -1.688 0.672 1.571 1.515 -1.632 ...
## $ Preferred.Positions: num -0.247 -1.431 1.716 1.121 -0.247 ...
## $ SPI : num 0.275 0.275 0.275 0.275 0.275 ...
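As a quick check of what this scaling step does, the sketch below applies `scale()` to a small simulated data frame and compares it with the manual z-score formula. The data frame `toy` and its columns `a` and `b` are made up for illustration and are not part of the analysis.

```r
set.seed(1)  # reproducible toy data; values are simulated, not from the dataset

toy <- data.frame(
  a = rnorm(100, mean = 50, sd = 10),
  b = runif(100, min = 0, max = 1000)
)

# scale() with its defaults subtracts each column's mean and divides by its
# sample standard deviation.
scaled <- data.frame(scale(toy))

# The same transformation written out by hand for column a.
manual_a <- (toy$a - mean(toy$a)) / sd(toy$a)
stopifnot(isTRUE(all.equal(scaled$a, manual_a)))

# After scaling, every column has mean 0 and standard deviation 1
# (up to floating-point error).
print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))
```

Putting all variables on the same scale matters here because the raw columns range from fractions (Win.Perc) to values in the thousands (Special).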
We combined related skill variables into six composite scores by averaging them, following FIFA’s classification of skills:
SNdata$Attacking <- (SNdata$Crossing+SNdata$Finishing+SNdata$Heading.accuracy+SNdata$Short.passing+SNdata$Volleys)/5
SNdata$Skill <- (SNdata$Dribbling+SNdata$Curve+SNdata$Free.kick.accuracy+SNdata$Long.passing+SNdata$Ball.control)/5
SNdata$Movement <- (SNdata$Acceleration+SNdata$Sprint.speed+SNdata$Agility+SNdata$Reactions+SNdata$Balance)/5
SNdata$Power <- (SNdata$Shot.power+SNdata$Jumping+SNdata$Stamina+SNdata$Strength+SNdata$Long.shots)/5
SNdata$Mentality <- (SNdata$Aggression+SNdata$Interceptions+SNdata$Positioning+SNdata$Vision+SNdata$Penalties+SNdata$Composure)/6
SNdata$Defending <- (SNdata$Marking+SNdata$Standing.tackle+SNdata$Sliding.tackle)/3
str(SNdata)
## 'data.frame': 10079 obs. of 58 variables:
## $ Age : num 2.3445 1.702 -0.0114 0.417 -0.8681 ...
## $ Total.Games : num 1.01 1.01 1.01 1.01 1.01 ...
## $ Wins : num 0.924 0.924 0.924 0.924 0.924 ...
## $ Draws : num 0.478 0.478 0.478 0.478 0.478 ...
## $ Losses : num 1.14 1.14 1.14 1.14 1.14 ...
## $ Goals : num 0.853 0.853 0.853 0.853 0.853 ...
## $ Goals.Against : num 1.03 1.03 1.03 1.03 1.03 ...
## $ Champions : num 0.928 0.928 0.928 0.928 0.928 ...
## $ Finals : num 1.75 1.75 1.75 1.75 1.75 ...
## $ Win51 : num 0.388 0.388 0.388 0.388 0.388 ...
## $ SemiFinals : num 0.498 0.498 0.498 0.498 0.498 ...
## $ SFYesNo : num 0.503 0.503 0.503 0.503 0.503 ...
## $ NewWins : num 0.881 0.881 0.881 0.881 0.881 ...
## $ WinOverLoss : num 0.116 0.116 0.116 0.116 0.116 ...
## $ Win.Perc : num 0.413 0.413 0.413 0.413 0.413 ...
## $ Overall : num -0.489 -0.489 1.042 0.624 -1.046 ...
## $ Potential : num -1.366 -1.366 1.007 -0.101 -1.05 ...
## $ Club : num 0.945 -0.702 -1.577 -1.577 -1.423 ...
## $ Value.in.M.Euros : num -0.722 -0.129 -0.87 -0.888 -0.338 ...
## $ Wage.in.K.Euros : num -0.468 -0.468 -0.468 -0.468 -0.468 ...
## $ Special : num -2.662 0.192 0.699 0.641 -1.932 ...
## $ Acceleration : num -2.922 0.172 0.643 0.979 -1.039 ...
## $ Aggression : num -1.8 0.249 -1.345 -1.686 -1.8 ...
## $ Agility : num -2.597 0.115 0.725 1.267 -1.919 ...
## $ Balance : num -2.972 0.356 0.498 1.277 0.285 ...
## $ Ball.control : num -2.636 0.454 1.445 0.687 -2.344 ...
## $ Composure : num -2.503 0.701 0.624 0.396 -2.274 ...
## $ Crossing : num -1.663 0.797 0.476 1.225 -2.038 ...
## $ Curve : num -1.74 1.21 1.8 1.32 -1.9 ...
## $ Dribbling : num -2.286 0.152 1.033 0.877 -1.974 ...
## $ Finishing : num -1.796 0.493 1.306 0.9 -1.644 ...
## $ Free.kick.accuracy : num -1.495 1.282 0.885 1.452 -1.268 ...
## $ Heading.accuracy : num -1.826 0.268 0.438 -0.128 -1.996 ...
## $ Interceptions : num -1.756 -0.61 -1.279 -0.228 -1.04 ...
## $ Jumping : num -2.3554 -0.0666 -1.3382 0.6116 -0.5752 ...
## $ Long.passing : num -2.379 0.654 0.401 1.033 -1.684 ...
## $ Long.shots : num -1.897 0.871 1.281 0.666 -1.641 ...
## $ Marking : num -1.391 -0.886 -0.84 -0.656 -1.207 ...
## $ Penalties : num -1.771 0.492 1.246 1.372 -1.771 ...
## $ Positioning : num -1.962 0.534 1.603 0.84 -1.758 ...
## $ Reactions : num -1.389 -0.225 1.891 0.198 -0.436 ...
## $ Short.passing : num -2.977 0.357 0.749 0.749 -2.127 ...
## $ Shot.power : num -1.916 0.375 0.776 0.547 -2.202 ...
## $ Sliding.tackle : num -1.424 -1.102 -0.872 -0.78 -1.516 ...
## $ Sprint.speed : num -3.274 0.164 0.645 0.92 -1.555 ...
## $ Stamina : num -2.944 -0.251 0.5 0.124 -1.253 ...
## $ Standing.tackle : num -1.622 -0.942 -1.169 -0.761 -1.214 ...
## $ Strength : num -1.18535 -0.00175 -0.23847 -2.44786 -1.10644 ...
## $ Vision : num -1.848 0.338 1.021 0.952 -1.984 ...
## $ Volleys : num -1.688 0.672 1.571 1.515 -1.632 ...
## $ Preferred.Positions: num -0.247 -1.431 1.716 1.121 -0.247 ...
## $ SPI : num 0.275 0.275 0.275 0.275 0.275 ...
## $ Attacking : num -1.99 0.517 0.908 0.852 -1.887 ...
## $ Skill : num -2.11 0.75 1.11 1.07 -1.83 ...
## $ Movement : num -2.631 0.116 0.88 0.928 -0.933 ...
## $ Power : num -2.0596 0.1852 0.1961 -0.0999 -1.3556 ...
## $ Mentality : num -1.94 0.284 0.312 0.274 -1.771 ...
## $ Defending : num -1.479 -0.977 -0.96 -0.732 -1.313 ...
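The composite scores above are plain averages of standardized columns. An equivalent and slightly more compact way to write one of them uses `rowMeans()` over a vector of column names. The sketch below uses a made-up three-row data frame (`toy`) and a hypothetical helper vector `defending_cols`; it is an illustration of the pattern, not the code used in the analysis.

```r
# Toy standardized values (hypothetical, not from the dataset).
toy <- data.frame(
  Marking         = c(-1.0, 0.5, 1.2),
  Standing.tackle = c(-0.8, 0.1, 1.0),
  Sliding.tackle  = c(-1.2, 0.3, 0.8)
)

# Columns that make up the composite score.
defending_cols <- c("Marking", "Standing.tackle", "Sliding.tackle")

# rowMeans() averages the selected columns for each player (row).
toy$Defending <- unname(rowMeans(toy[, defending_cols]))

# Same result as summing the columns and dividing by their count.
manual <- (toy$Marking + toy$Standing.tackle + toy$Sliding.tackle) / 3
stopifnot(isTRUE(all.equal(toy$Defending, manual)))

print(toy)
```

Listing the columns in a vector keeps the divisor in sync with the number of variables, which avoids mistakes such as dividing six columns by five.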