The Data

library(dplyr)
library(rpart)
library(rpart.plot)


fifa <- read.csv("/home/rstudioshared/shared_files/data/FIFA2017.csv")
View(fifa)
fifa_starters <- fifa %>% filter(!(Club_Position %in% c("Sub", "Res")))
midfielders <- fifa_starters %>% filter(Club_Position %in% c("LM", "RM", "LCM", "RCM", "CM"))

You’ll still need to run the code that creates the RMSE and createFolds functions, but after that you can create and then use the more general cross-validation function below. [My advice: put all three of these functions into a .R file so that you have them handy for future labs.]

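cv_results_general <- function(data, model = lm(y ~ 1, data = data), actual, num_folds = 10, method = "RMSE") {
  # Split the row indices into num_folds held-out sets
  folds <- createFolds(actual, k = num_folds)
  r <- lapply(folds, function(x) {
    # Note: `model` was already fit on all of `data` before being passed in and is
    # not refit here, so `train` is unused and the test rows were seen during fitting.
    train <- data[-x, ]
    test <- data[x, ]
    temp.actual <- actual[x]
    pred <- predict(model, test)
    # LogLoss must be defined or loaded separately if method = "logloss" is used
    if (method == "logloss") {
      error <- LogLoss(temp.actual, pred)
    } else {
      error <- RMSE(temp.actual, pred)
    }
    return(error)
  })
  # Average the per-fold errors
  return(mean(unlist(r)))
}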
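
In the function above, `train` is created inside each fold but `predict` is called on the model exactly as it was passed in, which means the model has already been fit on every row of `data`. If you want each fold's error to come from a model that never saw that fold, one possible variant is sketched below. It is only a sketch under stated assumptions: `rmse_sketch` and `make_folds` are hypothetical stand-ins for the lab's own RMSE and createFolds functions (which may be written differently), `cv_results_refit` is a made-up name, and the toy data is invented so that the block runs without the FIFA file. The refit relies on base R's `update()`, which re-runs the model's stored call with a new `data` argument.

```r
# Hypothetical stand-ins for the lab's RMSE and createFolds helpers (assumptions).
rmse_sketch <- function(actual, pred) sqrt(mean((actual - pred)^2))

make_folds <- function(y, k = 10) {
  # Randomly assign each row index to one of k folds; returns a list of index vectors.
  fold_id <- sample(rep(seq_len(k), length.out = length(y)))
  split(seq_along(y), fold_id)
}

# Hypothetical variant of the cross-validation function: refit on each training fold.
cv_results_refit <- function(data, model, actual, num_folds = 10) {
  folds <- make_folds(actual, k = num_folds)
  errors <- sapply(folds, function(x) {
    train <- data[-x, ]
    test <- data[x, ]
    # Re-run the model's original call, swapping in the training rows only
    fit <- update(model, data = train)
    rmse_sketch(actual[x], predict(fit, test))
  })
  mean(errors)
}

# Toy data (made up) so the sketch runs on its own.
set.seed(1)
toy <- data.frame(x = runif(200))
toy$y <- 3 * toy$x + rnorm(200, sd = 0.5)

toy_rmse <- cv_results_refit(toy, model = lm(y ~ x, data = toy), actual = toy$y)
toy_rmse
```

With a refit inside each fold, very flexible models (for example a tree with a tiny `cp`) will typically score worse than they do when evaluated on rows they were trained on, so the two approaches can rank models differently. The rest of this lab continues with `cv_results_general`.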

Now, let’s put this function to use:

cv_results_general(midfielders, model=rpart(Rating ~ Skill_Moves, data=midfielders, cp=0.02),
                   actual=midfielders$Rating, num_folds=10)
## [1] 5.657422
cv_results_general(midfielders, model=lm(Rating ~ Skill_Moves, data=midfielders),
                   actual=midfielders$Rating, num_folds=10)
## [1] 5.645639
cv_results_general(midfielders, model=lm(Rating ~ Skill_Moves+Ball_Control+Dribbling+Marking+Sliding_Tackle+Vision+Speed, data=midfielders),
                   actual=midfielders$Rating, num_folds=10)
## [1] 2.533678
cv_results_general(midfielders, model=rpart(Rating ~ Skill_Moves+Ball_Control+Dribbling+Marking+Sliding_Tackle+Vision+Speed, data=midfielders),
                   actual=midfielders$Rating, num_folds=10)
## [1] 2.798789
cv_results_general(midfielders, model=rpart(Rating ~ Skill_Moves+Ball_Control+Dribbling+Marking+Sliding_Tackle+Vision+Speed, data=midfielders,cp=0.001),
                   actual=midfielders$Rating, num_folds=10)
## [1] 2.171067

Questions to Answer:

  1. What is the best model you can build to predict ratings and what is its RMSE?

  2. If you are limited to using only two variables to predict ratings, which two should you choose and how should you use them (in a linear model or in a decision tree)? And what is your RMSE?

Correlations

We might also be interested in the relationships between all of the ratings. The following code calculates the correlations between every pair of numeric ratings.

M <- cor(midfielders[,c(10,15,18:53)])
M
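
The call above picks columns by position, which depends on the exact column order of the file. As a rough illustration of an alternative, the sketch below selects columns by type and then flattens the correlation matrix into a sorted table of pairs, which can be easier to scan than the full matrix. The data frame `toy` and all of its values are made up, and selecting every numeric column is an assumption that may not match the set of columns chosen above.

```r
# Toy data frame (made up) with one non-numeric column.
set.seed(1)
toy <- data.frame(
  Name = letters[1:20],
  Rating = rnorm(20, mean = 70, sd = 5),
  Speed = rnorm(20, mean = 65, sd = 10),
  stringsAsFactors = FALSE
)
toy$Vision <- toy$Rating + rnorm(20, sd = 2)

# Keep only the numeric columns, then compute all pairwise correlations.
numeric_cols <- sapply(toy, is.numeric)
M_toy <- cor(toy[, numeric_cols])

# Flatten the upper triangle into one row per pair of variables.
idx <- which(upper.tri(M_toy), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(M_toy)[idx[, "row"]],
  var2 = colnames(M_toy)[idx[, "col"]],
  r = M_toy[idx]
)

# Sort so the strongest correlations (positive or negative) come first.
pair_table <- pair_table[order(-abs(pair_table$r)), ]
pair_table
```

Sorting by `pair_table$r` instead of `abs(pair_table$r)` would put the most negative correlations at one end and the most positive at the other.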

This information is easier to understand if we plot it. Try the following two plots:

library(corrplot)

corrplot(M)

corrplot(M, method="number", number.cex=0.5)

Questions to Answer:

  1. Which skills are most strongly correlated with player ratings?

  2. Of all pairs of skills, which are most strongly correlated with each other?

  3. Which pairs of skills have the strongest negative correlations between them?

Skills by Age

Finally, we might be interested in how skill ratings vary with age. Here’s a plot of overall rating by age.

library(ggplot2)

midfielders %>% ggplot(aes(Age, Rating)) + geom_point() + geom_smooth()
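
The same kind of plot can be drawn for several skills at once by first stacking the skill columns into a long format and then giving each skill its own panel. The following is a minimal sketch on made-up data: `toy`, its two skill columns and their relationship to age are all invented for illustration, and in the lab `midfielders` would take the place of `toy`. The plotting step is skipped if ggplot2 is not installed.

```r
# Toy data (made up); the trends here are invented, not taken from the FIFA file.
set.seed(1)
toy <- data.frame(Age = sample(18:35, 100, replace = TRUE))
toy$Speed <- 90 - toy$Age + rnorm(100, sd = 5)
toy$Vision <- 40 + toy$Age + rnorm(100, sd = 5)

# Stack the skill columns into long format: one row per player-skill pair.
skills <- c("Speed", "Vision")
long <- data.frame(
  Age = rep(toy$Age, times = length(skills)),
  Skill = rep(skills, each = nrow(toy)),
  Value = unlist(toy[skills], use.names = FALSE)
)

# One smoothed panel per skill (only if ggplot2 is available).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(Age, Value)) +
    geom_point(alpha = 0.3) +
    geom_smooth() +
    facet_wrap(~ Skill)
  print(p)
}
```

Adding more column names to `skills` adds more panels without changing the rest of the code.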

Questions to Answer:

  1. How do players’ Speed, Strength, Vision and Agility ratings vary with age?