The goal of today’s lab is to build a model, using either linear regression or a decision tree, that predicts FIFA 2017 overall ratings for midfielders from their individual skill ratings.

The Data

library(dplyr)


fifa <- read.csv("/home/rstudioshared/shared_files/data/FIFA2017.csv")
View(fifa)

Let’s first look at the club positions listed and eliminate substitutes and reserve players:

# Count players at each club position, most common first
fifa %>% count(Club_Position, sort = TRUE)

# Drop substitutes ("Sub") and reserves ("Res"), then recount to check
fifa_starters <- fifa %>% filter(!(Club_Position %in% c("Sub", "Res")))
View(fifa_starters %>% count(Club_Position, sort = TRUE))

Now, let’s make a data.frame with only midfielders.

midfielders <- fifa_starters %>% filter(Club_Position %in% c("LM", "RM", "LCM", "RCM", "CM"))

We can make a model to predict ratings using a few of the skill ratings as follows:

m <- lm(Rating ~ Skill_Moves+Ball_Control+Dribbling+Marking+Sliding_Tackle+Vision+Speed, data=midfielders)
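
Once a linear model is fitted, you will usually want to look at its coefficients and measure how far its predictions are from the actual ratings. The sketch below shows one way to do that. So that it runs without the CSV file, it uses a small simulated data frame called toy (a hypothetical stand-in for midfielders, with made-up numbers); the same calls should work on m and midfielders.

```r
# Simulated stand-in for `midfielders` (hypothetical data, NOT the FIFA file)
set.seed(1)
toy <- data.frame(Ball_Control = runif(200, 40, 90),
                  Vision       = runif(200, 40, 90))
toy$Rating <- 20 + 0.4 * toy$Ball_Control + 0.3 * toy$Vision + rnorm(200, sd = 3)

m_toy <- lm(Rating ~ Ball_Control + Vision, data = toy)

# Coefficients, standard errors and R-squared
summary(m_toy)

# Predicted rating for every row, and the in-sample root mean squared error
pred    <- predict(m_toy, newdata = toy)
rmse_in <- sqrt(mean((toy$Rating - pred)^2))
rmse_in
```

Keep in mind that this error is measured on the same rows the model was fitted to, so it will tend to look better than the model's performance on new players.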

Or make a decision tree to do the same with:

library(rpart); library(rpart.plot)

fit <- rpart(Rating ~ Skill_Moves+Ball_Control+Dribbling+Marking+Sliding_Tackle+Vision+Speed,data=midfielders)
prp(fit, type=1, fallen.leaves=TRUE, extra=1, cex=0.7)

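Remember that you can make a more complex tree by lowering the cp (complexity parameter) argument below rpart's default of 0.01, which lets the tree keep splits that improve the fit only slightly: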

fit <- rpart(Rating ~ Skill_Moves+Ball_Control+Dribbling+Marking+Sliding_Tackle+Vision+Speed,data=midfielders, cp=0.002)
prp(fit, type=1, fallen.leaves=TRUE, extra=1, cex=0.7)
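
To see the effect of cp without reading the plots, you can count the leaves of each tree. This is a minimal sketch on a simulated data frame toy (hypothetical, standing in for midfielders); n_leaves is a helper defined here for illustration, not an rpart function.

```r
library(rpart)

# Simulated stand-in for `midfielders` (hypothetical data, NOT the FIFA file)
set.seed(1)
toy <- data.frame(Ball_Control = runif(500, 40, 90),
                  Vision       = runif(500, 40, 90))
toy$Rating <- 20 + 0.4 * toy$Ball_Control + 0.3 * toy$Vision + rnorm(500, sd = 3)

fit_default <- rpart(Rating ~ Ball_Control + Vision, data = toy)              # cp = 0.01
fit_small   <- rpart(Rating ~ Ball_Control + Vision, data = toy, cp = 0.002)  # more splits allowed

# Illustrative helper: count the terminal nodes (leaves) of an rpart tree
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")

n_leaves(fit_default)
n_leaves(fit_small)

# Table of cp values with rpart's built-in cross-validated error (xerror)
printcp(fit_small)
```

A bigger tree always fits the training data at least as well, so the xerror column from printcp is the more useful guide to whether the extra splits are worth keeping.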

Of course, you aren’t limited to these variables, so take a look at the column names and pick out the ones you think might be good predictors.

colnames(midfielders)
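
One rough way to shortlist candidates is to rank the numeric columns by how strongly they correlate with Rating. The sketch below does this on a simulated data frame toy (hypothetical; the Name column and all values are made up for illustration). Replacing toy with midfielders should give the same kind of ranking for the real columns.

```r
# Simulated stand-in for `midfielders` (hypothetical data, NOT the FIFA file)
set.seed(1)
toy <- data.frame(Name         = paste0("player", 1:200),
                  Ball_Control = runif(200, 40, 90),
                  Vision       = runif(200, 40, 90),
                  Speed        = runif(200, 40, 90))   # unrelated to Rating in this toy data
toy$Rating <- 20 + 0.4 * toy$Ball_Control + 0.3 * toy$Vision + rnorm(200, sd = 3)

# Keep only the numeric columns, then correlate each one with Rating
num_cols <- sapply(toy, is.numeric)
cors     <- cor(toy[, num_cols], use = "complete.obs")[, "Rating"]

# Drop Rating's correlation with itself and sort by absolute strength
cors   <- cors[names(cors) != "Rating"]
ranked <- sort(abs(cors), decreasing = TRUE)
ranked
```

Correlation only captures straight-line relationships with one variable at a time, so treat the ranking as a starting point rather than a final choice of predictors.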

Challenge:

  1. Build two different models to predict the ratings of midfielders.
  2. Use cross-validation, described here, to determine which model performs better out of sample.
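
As a starting point for the second step, here is a minimal sketch of k-fold cross-validation comparing a linear model with a decision tree. It is one common way to set this up and not necessarily the procedure the lab has in mind. It runs on a simulated data frame toy (hypothetical, NOT the FIFA data); the choice of k = 5, the rmse helper and the two formulas are assumptions made for illustration.

```r
library(rpart)

# Simulated stand-in for `midfielders` (hypothetical data, NOT the FIFA file)
set.seed(42)
n   <- 300
toy <- data.frame(Ball_Control = runif(n, 40, 90),
                  Vision       = runif(n, 40, 90))
toy$Rating <- 20 + 0.4 * toy$Ball_Control + 0.3 * toy$Vision + rnorm(n, sd = 3)

k    <- 5
fold <- sample(rep(1:k, length.out = n))   # random fold label (1..k) for each row

# Illustrative helper: root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# For each fold: fit both models on the other folds, score them on the held-out fold
cv_rmse <- sapply(1:k, function(i) {
  train <- toy[fold != i, ]
  test  <- toy[fold == i, ]
  m_lm   <- lm(Rating ~ Ball_Control + Vision, data = train)
  m_tree <- rpart(Rating ~ Ball_Control + Vision, data = train)
  c(lm   = rmse(test$Rating, predict(m_lm,   newdata = test)),
    tree = rmse(test$Rating, predict(m_tree, newdata = test)))
})

# Average out-of-sample RMSE for each model (lower is better)
rowMeans(cv_rmse)
```

Because the toy ratings are generated from a linear formula, the linear model should be expected to do well here; that says nothing about which model wins on the real midfielders data, which is what the challenge asks you to find out.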