We are building a model to predict which soccer player is most likely to be voted Most Valuable Player. To do this, we used a dataset of around 18,000 observations (one per player) and a second dataset of all teams that have ever played in the World Cup. We combined the two datasets and removed all teams and players with missing data. We ended up with 10,079 observations of individual players and 62 variables. The libraries we used in our analysis are the following:

data <- read.csv("fpdataset-2.csv")

library(ggplot2)
library(corrplot)
library(ResourceSelection)
library(pscl)
library(pROC)
library(e1071)
library(pastecs)
library(leaps)
library(ISLR)
library(glmnet)
library(rpart)
library(randomForest)
library(rpart.plot)

str(data)
## 'data.frame':    10079 obs. of  56 variables:
##  $ PIN                : int  12857 12250 2043 4302 15102 1491 10335 12412 3126 1253 ...
##  $ ID                 : int  158973 172112 198145 207454 213272 213432 213624 214660 214973 215069 ...
##  $ Name               : Factor w/ 9545 levels "A. Öztürk",..: 5187 5599 8073 7304 5170 2791 8288 5410 5469 7775 ...
##  $ Age                : int  36 33 25 27 21 26 31 23 25 26 ...
##  $ Nationality        : Factor w/ 22 levels "Argentina","Australia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Total.Games        : int  77 77 77 77 77 77 77 77 77 77 ...
##  $ Wins               : int  42 42 42 42 42 42 42 42 42 42 ...
##  $ Draws              : int  14 14 14 14 14 14 14 14 14 14 ...
##  $ Losses             : int  21 21 21 21 21 21 21 21 21 21 ...
##  $ Goals              : int  131 131 131 131 131 131 131 131 131 131 ...
##  $ Goals.Against      : int  84 84 84 84 84 84 84 84 84 84 ...
##  $ Champions          : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ Finals             : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Win51              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ SemiFinals         : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ SFYesNo            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ NewWins            : int  56 56 56 56 56 56 56 56 56 56 ...
##  $ WinOverLoss        : num  2.67 2.67 2.67 2.67 2.67 ...
##  $ Win.Perc           : num  0.727 0.727 0.727 0.727 0.727 ...
##  $ Overall            : int  63 63 74 71 59 76 65 63 73 76 ...
##  $ Potential          : int  63 63 78 71 65 79 65 71 75 79 ...
##  $ Club               : Factor w/ 618 levels " SSV Jahn Regensburg",..: 480 181 22 22 50 502 316 557 415 416 ...
##  $ Value.in.M.Euros   : num  50 220 7.5 2.5 160 9.5 325 450 4.9 6.5 ...
##  $ Wage.in.K.Euros    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Special            : int  860 1649 1789 1773 1062 1751 1050 1759 1808 1201 ...
##  $ Acceleration       : int  21 67 74 79 49 70 44 75 79 55 ...
##  $ Aggression         : int  24 60 32 26 24 48 22 58 44 33 ...
##  $ Agility            : int  25 65 74 82 35 74 37 80 80 43 ...
##  $ Balance            : int  22 69 71 82 68 68 37 86 84 53 ...
##  $ Ball.control       : int  13 66 83 70 18 76 17 66 74 23 ...
##  $ Composure          : int  25 67 66 63 28 76 29 65 66 49 ...
##  $ Crossing           : int  19 65 59 73 12 49 12 64 75 18 ...
##  $ Curve              : int  15 70 81 72 12 46 17 62 73 17 ...
##  $ Dribbling          : int  11 58 75 72 17 69 14 68 74 20 ...
##  $ Finishing          : int  10 55 71 63 13 78 16 50 67 18 ...
##  $ Free.kick.accuracy : int  17 66 59 69 21 40 12 61 70 17 ...
##  $ Heading.accuracy   : int  20 57 60 50 17 78 14 53 49 12 ...
##  $ Interceptions      : int  10 34 20 42 25 23 16 55 38 22 ...
##  $ Jumping            : int  37 64 49 72 58 73 59 75 59 67 ...
##  $ Long.passing       : int  15 63 59 69 26 53 24 61 72 13 ...
##  $ Long.shots         : int  10 64 72 60 15 76 14 54 62 13 ...
##  $ Marking            : int  14 25 26 30 18 14 12 60 35 12 ...
##  $ Penalties          : int  21 57 69 71 21 70 25 44 65 23 ...
##  $ Positioning        : int  11 60 81 66 15 75 11 54 68 14 ...
##  $ Reactions          : int  49 60 80 64 58 76 73 56 69 63 ...
##  $ Short.passing      : int  13 64 70 70 26 67 23 64 73 29 ...
##  $ Shot.power         : int  22 62 69 65 17 75 19 35 70 25 ...
##  $ Sliding.tackle     : int  15 22 27 29 13 23 15 63 29 13 ...
##  $ Sprint.speed       : int  17 67 74 78 42 74 47 74 77 43 ...
##  $ Stamina            : int  16 59 71 65 43 72 30 61 66 31 ...
##  $ Standing.tackle    : int  12 27 22 31 21 22 17 62 31 18 ...
##  $ Strength           : int  50 65 62 34 51 73 63 58 43 59 ...
##  $ Vision             : int  26 58 68 67 24 62 33 56 70 53 ...
##  $ Volleys            : int  13 55 71 70 14 68 18 36 66 15 ...
##  $ Preferred.Positions: Factor w/ 614 levels "CAM ","CAM CB CDM ",..: 214 1 567 460 214 567 214 288 430 214 ...
##  $ SPI                : num  85.5 85.5 85.5 85.5 85.5 ...
attach(data)
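
The merge and cleaning step described in the introduction is not shown in this report. A minimal sketch of how it might look is given here, assuming the two tables share a `Nationality` key. The data frames `players` and `teams` and all values in them are hypothetical toy data, not the original files.

```r
# Hypothetical toy inputs; the real player and team files are not shown in the report.
players <- data.frame(
  Name        = c("A", "B", "C", "D"),
  Nationality = c("Argentina", "Argentina", "Brazil", "Atlantis"),
  Overall     = c(63, NA, 74, 70),
  stringsAsFactors = FALSE
)
teams <- data.frame(
  Nationality = c("Argentina", "Brazil"),
  Total.Games = c(77, 104),
  stringsAsFactors = FALSE
)

# Inner join: players whose nation has no team record are dropped.
merged <- merge(players, teams, by = "Nationality")

# Keep only rows with no missing values in any column.
cleaned <- merged[complete.cases(merged), ]
print(cleaned)
```

With this toy input, player D is dropped by the join (no matching team) and player B is dropped by `complete.cases()` (missing `Overall`), which leaves two rows.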

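We dropped the identifier columns (PIN, ID, Name) and Nationality, then converted the remaining variables to numeric. The factor columns Club and Preferred.Positions become integer level codes in the process: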

Ndata <- data.frame(lapply(data[, -c(1:3,5)], as.numeric))
str(Ndata)
## 'data.frame':    10079 obs. of  52 variables:
##  $ Age                : num  36 33 25 27 21 26 31 23 25 26 ...
##  $ Total.Games        : num  77 77 77 77 77 77 77 77 77 77 ...
##  $ Wins               : num  42 42 42 42 42 42 42 42 42 42 ...
##  $ Draws              : num  14 14 14 14 14 14 14 14 14 14 ...
##  $ Losses             : num  21 21 21 21 21 21 21 21 21 21 ...
##  $ Goals              : num  131 131 131 131 131 131 131 131 131 131 ...
##  $ Goals.Against      : num  84 84 84 84 84 84 84 84 84 84 ...
##  $ Champions          : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ Finals             : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ Win51              : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ SemiFinals         : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ SFYesNo            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ NewWins            : num  56 56 56 56 56 56 56 56 56 56 ...
##  $ WinOverLoss        : num  2.67 2.67 2.67 2.67 2.67 ...
##  $ Win.Perc           : num  0.727 0.727 0.727 0.727 0.727 ...
##  $ Overall            : num  63 63 74 71 59 76 65 63 73 76 ...
##  $ Potential          : num  63 63 78 71 65 79 65 71 75 79 ...
##  $ Club               : num  480 181 22 22 50 502 316 557 415 416 ...
##  $ Value.in.M.Euros   : num  50 220 7.5 2.5 160 9.5 325 450 4.9 6.5 ...
##  $ Wage.in.K.Euros    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Special            : num  860 1649 1789 1773 1062 ...
##  $ Acceleration       : num  21 67 74 79 49 70 44 75 79 55 ...
##  $ Aggression         : num  24 60 32 26 24 48 22 58 44 33 ...
##  $ Agility            : num  25 65 74 82 35 74 37 80 80 43 ...
##  $ Balance            : num  22 69 71 82 68 68 37 86 84 53 ...
##  $ Ball.control       : num  13 66 83 70 18 76 17 66 74 23 ...
##  $ Composure          : num  25 67 66 63 28 76 29 65 66 49 ...
##  $ Crossing           : num  19 65 59 73 12 49 12 64 75 18 ...
##  $ Curve              : num  15 70 81 72 12 46 17 62 73 17 ...
##  $ Dribbling          : num  11 58 75 72 17 69 14 68 74 20 ...
##  $ Finishing          : num  10 55 71 63 13 78 16 50 67 18 ...
##  $ Free.kick.accuracy : num  17 66 59 69 21 40 12 61 70 17 ...
##  $ Heading.accuracy   : num  20 57 60 50 17 78 14 53 49 12 ...
##  $ Interceptions      : num  10 34 20 42 25 23 16 55 38 22 ...
##  $ Jumping            : num  37 64 49 72 58 73 59 75 59 67 ...
##  $ Long.passing       : num  15 63 59 69 26 53 24 61 72 13 ...
##  $ Long.shots         : num  10 64 72 60 15 76 14 54 62 13 ...
##  $ Marking            : num  14 25 26 30 18 14 12 60 35 12 ...
##  $ Penalties          : num  21 57 69 71 21 70 25 44 65 23 ...
##  $ Positioning        : num  11 60 81 66 15 75 11 54 68 14 ...
##  $ Reactions          : num  49 60 80 64 58 76 73 56 69 63 ...
##  $ Short.passing      : num  13 64 70 70 26 67 23 64 73 29 ...
##  $ Shot.power         : num  22 62 69 65 17 75 19 35 70 25 ...
##  $ Sliding.tackle     : num  15 22 27 29 13 23 15 63 29 13 ...
##  $ Sprint.speed       : num  17 67 74 78 42 74 47 74 77 43 ...
##  $ Stamina            : num  16 59 71 65 43 72 30 61 66 31 ...
##  $ Standing.tackle    : num  12 27 22 31 21 22 17 62 31 18 ...
##  $ Strength           : num  50 65 62 34 51 73 63 58 43 59 ...
##  $ Vision             : num  26 58 68 67 24 62 33 56 70 53 ...
##  $ Volleys            : num  13 55 71 70 14 68 18 36 66 15 ...
##  $ Preferred.Positions: num  214 1 567 460 214 567 214 288 430 214 ...
##  $ SPI                : num  85.5 85.5 85.5 85.5 85.5 ...
attach(Ndata)
## The following objects are masked from data:
## 
##     Acceleration, Age, Aggression, Agility, Balance, Ball.control,
##     Champions, Club, Composure, Crossing, Curve, Draws, Dribbling,
##     Finals, Finishing, Free.kick.accuracy, Goals, Goals.Against,
##     Heading.accuracy, Interceptions, Jumping, Long.passing,
##     Long.shots, Losses, Marking, NewWins, Overall, Penalties,
##     Positioning, Potential, Preferred.Positions, Reactions,
##     SemiFinals, SFYesNo, Short.passing, Shot.power,
##     Sliding.tackle, Special, SPI, Sprint.speed, Stamina,
##     Standing.tackle, Strength, Total.Games, Value.in.M.Euros,
##     Vision, Volleys, Wage.in.K.Euros, Win.Perc, Win51,
##     WinOverLoss, Wins
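
One detail of the conversion above is worth checking by hand: `as.numeric()` applied to a factor returns the internal level codes, not a measured quantity. The sketch below illustrates this with a hypothetical `club` factor (the club names are made up for the example), and shows dummy columns via `model.matrix()` as one possible alternative encoding.

```r
# Hypothetical factor, for illustration only.
club <- factor(c("Real Madrid", "Ajax", "Real Madrid", "Boca Juniors"))

levels(club)              # levels are sorted alphabetically
codes <- as.numeric(club) # integer codes that index into levels(club)
print(codes)

# Alternative sketch: one 0/1 indicator column per level, no implied ordering.
dummies <- model.matrix(~ club - 1)
print(dim(dummies))
```

The codes depend only on the alphabetical order of the labels, so distances between them (for example, after scaling) may not carry any meaning for a model.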

We standardized the data with scale(), so that each variable has mean 0 and standard deviation 1:

SNdata <- data.frame(scale(Ndata))
str(SNdata)
## 'data.frame':    10079 obs. of  52 variables:
##  $ Age                : num  2.3445 1.702 -0.0114 0.417 -0.8681 ...
##  $ Total.Games        : num  1.01 1.01 1.01 1.01 1.01 ...
##  $ Wins               : num  0.924 0.924 0.924 0.924 0.924 ...
##  $ Draws              : num  0.478 0.478 0.478 0.478 0.478 ...
##  $ Losses             : num  1.14 1.14 1.14 1.14 1.14 ...
##  $ Goals              : num  0.853 0.853 0.853 0.853 0.853 ...
##  $ Goals.Against      : num  1.03 1.03 1.03 1.03 1.03 ...
##  $ Champions          : num  0.928 0.928 0.928 0.928 0.928 ...
##  $ Finals             : num  1.75 1.75 1.75 1.75 1.75 ...
##  $ Win51              : num  0.388 0.388 0.388 0.388 0.388 ...
##  $ SemiFinals         : num  0.498 0.498 0.498 0.498 0.498 ...
##  $ SFYesNo            : num  0.503 0.503 0.503 0.503 0.503 ...
##  $ NewWins            : num  0.881 0.881 0.881 0.881 0.881 ...
##  $ WinOverLoss        : num  0.116 0.116 0.116 0.116 0.116 ...
##  $ Win.Perc           : num  0.413 0.413 0.413 0.413 0.413 ...
##  $ Overall            : num  -0.489 -0.489 1.042 0.624 -1.046 ...
##  $ Potential          : num  -1.366 -1.366 1.007 -0.101 -1.05 ...
##  $ Club               : num  0.945 -0.702 -1.577 -1.577 -1.423 ...
##  $ Value.in.M.Euros   : num  -0.722 -0.129 -0.87 -0.888 -0.338 ...
##  $ Wage.in.K.Euros    : num  -0.468 -0.468 -0.468 -0.468 -0.468 ...
##  $ Special            : num  -2.662 0.192 0.699 0.641 -1.932 ...
##  $ Acceleration       : num  -2.922 0.172 0.643 0.979 -1.039 ...
##  $ Aggression         : num  -1.8 0.249 -1.345 -1.686 -1.8 ...
##  $ Agility            : num  -2.597 0.115 0.725 1.267 -1.919 ...
##  $ Balance            : num  -2.972 0.356 0.498 1.277 0.285 ...
##  $ Ball.control       : num  -2.636 0.454 1.445 0.687 -2.344 ...
##  $ Composure          : num  -2.503 0.701 0.624 0.396 -2.274 ...
##  $ Crossing           : num  -1.663 0.797 0.476 1.225 -2.038 ...
##  $ Curve              : num  -1.74 1.21 1.8 1.32 -1.9 ...
##  $ Dribbling          : num  -2.286 0.152 1.033 0.877 -1.974 ...
##  $ Finishing          : num  -1.796 0.493 1.306 0.9 -1.644 ...
##  $ Free.kick.accuracy : num  -1.495 1.282 0.885 1.452 -1.268 ...
##  $ Heading.accuracy   : num  -1.826 0.268 0.438 -0.128 -1.996 ...
##  $ Interceptions      : num  -1.756 -0.61 -1.279 -0.228 -1.04 ...
##  $ Jumping            : num  -2.3554 -0.0666 -1.3382 0.6116 -0.5752 ...
##  $ Long.passing       : num  -2.379 0.654 0.401 1.033 -1.684 ...
##  $ Long.shots         : num  -1.897 0.871 1.281 0.666 -1.641 ...
##  $ Marking            : num  -1.391 -0.886 -0.84 -0.656 -1.207 ...
##  $ Penalties          : num  -1.771 0.492 1.246 1.372 -1.771 ...
##  $ Positioning        : num  -1.962 0.534 1.603 0.84 -1.758 ...
##  $ Reactions          : num  -1.389 -0.225 1.891 0.198 -0.436 ...
##  $ Short.passing      : num  -2.977 0.357 0.749 0.749 -2.127 ...
##  $ Shot.power         : num  -1.916 0.375 0.776 0.547 -2.202 ...
##  $ Sliding.tackle     : num  -1.424 -1.102 -0.872 -0.78 -1.516 ...
##  $ Sprint.speed       : num  -3.274 0.164 0.645 0.92 -1.555 ...
##  $ Stamina            : num  -2.944 -0.251 0.5 0.124 -1.253 ...
##  $ Standing.tackle    : num  -1.622 -0.942 -1.169 -0.761 -1.214 ...
##  $ Strength           : num  -1.18535 -0.00175 -0.23847 -2.44786 -1.10644 ...
##  $ Vision             : num  -1.848 0.338 1.021 0.952 -1.984 ...
##  $ Volleys            : num  -1.688 0.672 1.571 1.515 -1.632 ...
##  $ Preferred.Positions: num  -0.247 -1.431 1.716 1.121 -0.247 ...
##  $ SPI                : num  0.275 0.275 0.275 0.275 0.275 ...
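
As a quick check of what `scale()` does with its default arguments, the sketch below standardizes a small toy data frame and compares the result with the manual formula `(x - mean(x)) / sd(x)`. The `toy` data frame is an assumption for illustration; its values are the first five `Age` and `Overall` entries printed above, so the scaled numbers will differ from the full-data output.

```r
# Toy data: first five Age and Overall values from the str() output above.
toy <- data.frame(
  Age     = c(36, 33, 25, 27, 21),
  Overall = c(63, 63, 74, 71, 59)
)

# scale() centers each column on its mean and divides by its standard deviation.
scaled <- data.frame(scale(toy))

# The same transformation written out by hand for one column.
manual_age <- (toy$Age - mean(toy$Age)) / sd(toy$Age)

print(all.equal(scaled$Age, manual_age))
print(round(colMeans(scaled), 10))  # column means are 0 up to rounding
print(apply(scaled, 2, sd))         # column standard deviations are 1
```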

We averaged related skill variables into six composite scores, following FIFA’s classification of skills:

SNdata$Attacking <- (SNdata$Crossing+SNdata$Finishing+SNdata$Heading.accuracy+SNdata$Short.passing+SNdata$Volleys)/5
SNdata$Skill <- (SNdata$Dribbling+SNdata$Curve+SNdata$Free.kick.accuracy+SNdata$Long.passing+SNdata$Ball.control)/5
SNdata$Movement <- (SNdata$Acceleration+SNdata$Sprint.speed+SNdata$Agility+SNdata$Reactions+SNdata$Balance)/5
SNdata$Power <- (SNdata$Shot.power+SNdata$Jumping+SNdata$Stamina+SNdata$Strength+SNdata$Long.shots)/5
SNdata$Mentality <- (SNdata$Aggression+SNdata$Interceptions+SNdata$Positioning+SNdata$Vision+SNdata$Penalties+SNdata$Composure)/6
SNdata$Defending <- (SNdata$Marking+SNdata$Standing.tackle+SNdata$Sliding.tackle)/3
str(SNdata)
## 'data.frame':    10079 obs. of  58 variables:
##  $ Age                : num  2.3445 1.702 -0.0114 0.417 -0.8681 ...
##  $ Total.Games        : num  1.01 1.01 1.01 1.01 1.01 ...
##  $ Wins               : num  0.924 0.924 0.924 0.924 0.924 ...
##  $ Draws              : num  0.478 0.478 0.478 0.478 0.478 ...
##  $ Losses             : num  1.14 1.14 1.14 1.14 1.14 ...
##  $ Goals              : num  0.853 0.853 0.853 0.853 0.853 ...
##  $ Goals.Against      : num  1.03 1.03 1.03 1.03 1.03 ...
##  $ Champions          : num  0.928 0.928 0.928 0.928 0.928 ...
##  $ Finals             : num  1.75 1.75 1.75 1.75 1.75 ...
##  $ Win51              : num  0.388 0.388 0.388 0.388 0.388 ...
##  $ SemiFinals         : num  0.498 0.498 0.498 0.498 0.498 ...
##  $ SFYesNo            : num  0.503 0.503 0.503 0.503 0.503 ...
##  $ NewWins            : num  0.881 0.881 0.881 0.881 0.881 ...
##  $ WinOverLoss        : num  0.116 0.116 0.116 0.116 0.116 ...
##  $ Win.Perc           : num  0.413 0.413 0.413 0.413 0.413 ...
##  $ Overall            : num  -0.489 -0.489 1.042 0.624 -1.046 ...
##  $ Potential          : num  -1.366 -1.366 1.007 -0.101 -1.05 ...
##  $ Club               : num  0.945 -0.702 -1.577 -1.577 -1.423 ...
##  $ Value.in.M.Euros   : num  -0.722 -0.129 -0.87 -0.888 -0.338 ...
##  $ Wage.in.K.Euros    : num  -0.468 -0.468 -0.468 -0.468 -0.468 ...
##  $ Special            : num  -2.662 0.192 0.699 0.641 -1.932 ...
##  $ Acceleration       : num  -2.922 0.172 0.643 0.979 -1.039 ...
##  $ Aggression         : num  -1.8 0.249 -1.345 -1.686 -1.8 ...
##  $ Agility            : num  -2.597 0.115 0.725 1.267 -1.919 ...
##  $ Balance            : num  -2.972 0.356 0.498 1.277 0.285 ...
##  $ Ball.control       : num  -2.636 0.454 1.445 0.687 -2.344 ...
##  $ Composure          : num  -2.503 0.701 0.624 0.396 -2.274 ...
##  $ Crossing           : num  -1.663 0.797 0.476 1.225 -2.038 ...
##  $ Curve              : num  -1.74 1.21 1.8 1.32 -1.9 ...
##  $ Dribbling          : num  -2.286 0.152 1.033 0.877 -1.974 ...
##  $ Finishing          : num  -1.796 0.493 1.306 0.9 -1.644 ...
##  $ Free.kick.accuracy : num  -1.495 1.282 0.885 1.452 -1.268 ...
##  $ Heading.accuracy   : num  -1.826 0.268 0.438 -0.128 -1.996 ...
##  $ Interceptions      : num  -1.756 -0.61 -1.279 -0.228 -1.04 ...
##  $ Jumping            : num  -2.3554 -0.0666 -1.3382 0.6116 -0.5752 ...
##  $ Long.passing       : num  -2.379 0.654 0.401 1.033 -1.684 ...
##  $ Long.shots         : num  -1.897 0.871 1.281 0.666 -1.641 ...
##  $ Marking            : num  -1.391 -0.886 -0.84 -0.656 -1.207 ...
##  $ Penalties          : num  -1.771 0.492 1.246 1.372 -1.771 ...
##  $ Positioning        : num  -1.962 0.534 1.603 0.84 -1.758 ...
##  $ Reactions          : num  -1.389 -0.225 1.891 0.198 -0.436 ...
##  $ Short.passing      : num  -2.977 0.357 0.749 0.749 -2.127 ...
##  $ Shot.power         : num  -1.916 0.375 0.776 0.547 -2.202 ...
##  $ Sliding.tackle     : num  -1.424 -1.102 -0.872 -0.78 -1.516 ...
##  $ Sprint.speed       : num  -3.274 0.164 0.645 0.92 -1.555 ...
##  $ Stamina            : num  -2.944 -0.251 0.5 0.124 -1.253 ...
##  $ Standing.tackle    : num  -1.622 -0.942 -1.169 -0.761 -1.214 ...
##  $ Strength           : num  -1.18535 -0.00175 -0.23847 -2.44786 -1.10644 ...
##  $ Vision             : num  -1.848 0.338 1.021 0.952 -1.984 ...
##  $ Volleys            : num  -1.688 0.672 1.571 1.515 -1.632 ...
##  $ Preferred.Positions: num  -0.247 -1.431 1.716 1.121 -0.247 ...
##  $ SPI                : num  0.275 0.275 0.275 0.275 0.275 ...
##  $ Attacking          : num  -1.99 0.517 0.908 0.852 -1.887 ...
##  $ Skill              : num  -2.11 0.75 1.11 1.07 -1.83 ...
##  $ Movement           : num  -2.631 0.116 0.88 0.928 -0.933 ...
##  $ Power              : num  -2.0596 0.1852 0.1961 -0.0999 -1.3556 ...
##  $ Mentality          : num  -1.94 0.284 0.312 0.274 -1.771 ...
##  $ Defending          : num  -1.479 -0.977 -0.96 -0.732 -1.313 ...
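
The six composites above are plain row-wise means, so the same result could also be written with `rowMeans()` over a named list of column groups. The sketch below is one possible way to do that, not the code used in this analysis. The `toy` data frame and the `groups` list are assumptions for illustration; the values are the first two scaled entries of the three defending variables printed above.

```r
# Toy data: first two scaled values of the defending variables shown above.
toy <- data.frame(
  Marking         = c(-1.391, -0.886),
  Standing.tackle = c(-1.622, -0.942),
  Sliding.tackle  = c(-1.424, -1.102)
)

# Hypothetical mapping from composite name to its component columns.
groups <- list(
  Defending = c("Marking", "Standing.tackle", "Sliding.tackle")
)

# Add one composite column per group as the row-wise mean of its components.
for (g in names(groups)) {
  toy[[g]] <- rowMeans(toy[, groups[[g]]])
}

print(round(toy$Defending, 3))
```

Rounded to three decimals, the two values agree with the first two `Defending` entries in the output above (-1.479 and -0.977). Keeping the groups in a list also makes the divisor implicit, which avoids mismatches such as dividing a six-variable sum by 5.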
