0. Instructions

1. Remove the (italicised) guidance text but keep the section headings.
2. Add as many chunks of R code as required.
3. Add descriptions of your analysis plans and explanations of your code and findings. Please be detailed and where you have made choices explain the rationale for them.
4. Write your report using RMarkdown. For guidance see a helpful blog or use the R Markdown cheatsheet which can be accessed from within RStudio by selecting Help > Cheatsheets > R Markdown Cheat Sheet.
5. Your report should be clearly and professionally presented with appropriate use of cited external sources. (5 marks)
6. It should also be easy to understand, with well-documented code following the principles of literate programming. (5 marks)

library(ltm)

## Warning: package 'ltm' was built under R version 3.6.3

## Loading required package: MASS

## Loading required package: msm

## Warning: package 'msm' was built under R version 3.6.3

## Loading required package: polycor

## Warning: package 'polycor' was built under R version 3.6.3

library(dplyr) # to run correlation coefficients with binary values

## Warning: package 'dplyr' was built under R version 3.6.3

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:MASS':
## 
##     select

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(validate)

## Warning: package 'validate' was built under R version 3.6.3

## 
## Attaching package: 'validate'

## The following object is masked from 'package:dplyr':
## 
##     expr

library(ggpubr) # for density plots

## Warning: package 'ggpubr' was built under R version 3.6.3

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.6.3

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:validate':
## 
##     expr

library(epiDisplay)

## Warning: package 'epiDisplay' was built under R version 3.6.3

## Loading required package: foreign

## Loading required package: survival

## Loading required package: nnet

## 
## Attaching package: 'epiDisplay'

## The following object is masked from 'package:ggplot2':
## 
##     alpha

library(tree)

## Warning: package 'tree' was built under R version 3.6.3

library(rpart.plot) #Special package  for trees graphical representention (http address below in the corresponding section)

## Loading required package: rpart

## Warning: package 'rpart' was built under R version 3.6.3

library(rpart)
library(MASS) #For LDA 
library(randomForest)

## Warning: package 'randomForest' was built under R version 3.6.3

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

#https://rdrr.io/cran/epiDisplay/

1. Organise and clean the data

1.1 Subset the data into the specific dataset allocated (5 marks)

Comments outside the r chunks are meant to explain the theoretical reasons behind the decision-making process. Comments inside the r-code are meant to ‘explain’ the code to the reader.

# Assign your student id into the variable SID, for example:
load("CS5801_football_analysis.Rda")

## Warning in load("CS5801_football_analysis.Rda"): strings not representable in
## native encoding will be translated to UTF-8

# Assign your student id into the variable SID, for example:
SID <- 2115480
set.seed(2000)
SIDoffset <- (SID %% 25) + 1    # Your SID mod 25 + 1
load("CS5801_football_analysis.Rda")

## Warning in load("CS5801_football_analysis.Rda"): strings not representable in
## native encoding will be translated to UTF-8

# Now subset the football data set
# Pick every 25th observation starting from your offset
# Put into your data frame named mydf (you can rename it)
mydf <- football.analysis[seq(from=SIDoffset,to=nrow(football.analysis),by=25),]

1.2 Data quality analysis (5 marks)

Most (if not all) statistical models are based on the assumption of healthy and valid data. As data scientisits we need to asssess the quality of our data beforehand. What to look first? As a general principle, good data should be: 1) Complete (no missing values, NAN or incomplete data), 2) Valid (it should be adequate for its purpose -i.e, if one of our variables is time it should not be measured, say in Km), and 3) Consistent (the rules of our data should be consistent troughout our data frame)

We start by checking that the type and values of data do match (i.e., qualitative and quantitative data) We later focus on missing values, NaN, and Incomplete data. Note that we first read the metadata to understand our data frame.

#check the structure of our data frame 
str(mydf)

## 'data.frame':    514 obs. of  17 variables:
##  $ sofifa_id       : int  203376 201942 184087 231866 209989 178562 205498 231443 205632 156616 ...
##  $ potential       : int  91 87 85 88 85 83 83 89 83 81 ...
##  $ wage_eur        : num  210000 190000 130000 145000 81000 55000 125000 175000 40000 54000 ...
##  $ age             : int  28 28 31 24 27 32 28 23 25 37 ...
##  $ height_cm       : int  193 181 186 191 185 174 180 178 188 170 ...
##  $ weight_kg       : int  92 76 81 82 77 71 68 67 84 72 ...
##  $ club_name       : chr  "Liverpool" "Liverpool" "Tottenham Hotspur" "Manchester City" ...
##  $ preferred_foot  : chr  "Right" "Right" "Right" "Right" ...
##  $ pace            : int  76 77 63 65 71 63 55 92 82 74 ...
##  $ shooting        : int  60 80 55 68 74 74 65 77 83 74 ...
##  $ passing         : int  71 81 72 77 81 84 84 77 78 82 ...
##  $ dribbling       : int  71 90 67 77 81 82 80 87 83 86 ...
##  $ defending       : int  91 61 87 82 82 76 72 36 63 30 ...
##  $ physic          : int  86 78 79 79 85 70 73 56 84 53 ...
##  $ power_strength  : int  92 74 79 79 82 67 66 51 81 56 ...
##  $ power_long_shots: int  64 77 58 76 85 81 62 76 79 71 ...
##  $ high.wage.ind   : int  1 1 1 1 1 1 1 1 1 1 ...

At first glance we can see that 3 of the 17 variables must be converted to factors: Club Name, Preferred_foot, and High.Wage.ind.

#Converting to factors 
mydf$club_name <-as.factor(mydf$club_name)
mydf$preferred_foot <- as.factor(mydf$preferred_foot)
mydf$high.wage.ind <-as.factor(mydf$high.wage.ind)
#check the levels of our factor variables
str(mydf)

## 'data.frame':    514 obs. of  17 variables:
##  $ sofifa_id       : int  203376 201942 184087 231866 209989 178562 205498 231443 205632 156616 ...
##  $ potential       : int  91 87 85 88 85 83 83 89 83 81 ...
##  $ wage_eur        : num  210000 190000 130000 145000 81000 55000 125000 175000 40000 54000 ...
##  $ age             : int  28 28 31 24 27 32 28 23 25 37 ...
##  $ height_cm       : int  193 181 186 191 185 174 180 178 188 170 ...
##  $ weight_kg       : int  92 76 81 82 77 71 68 67 84 72 ...
##  $ club_name       : Factor w/ 323 levels "1. FC Köln","1. FC Union Berlin",..: 188 188 293 192 35 24 64 110 256 136 ...
##  $ preferred_foot  : Factor w/ 3 levels "Left","right",..: 3 3 3 3 3 3 3 1 3 3 ...
##  $ pace            : int  76 77 63 65 71 63 55 92 82 74 ...
##  $ shooting        : int  60 80 55 68 74 74 65 77 83 74 ...
##  $ passing         : int  71 81 72 77 81 84 84 77 78 82 ...
##  $ dribbling       : int  71 90 67 77 81 82 80 87 83 86 ...
##  $ defending       : int  91 61 87 82 82 76 72 36 63 30 ...
##  $ physic          : int  86 78 79 79 85 70 73 56 84 53 ...
##  $ power_strength  : int  92 74 79 79 82 67 66 51 81 56 ...
##  $ power_long_shots: int  64 77 58 76 85 81 62 76 79 71 ...
##  $ high.wage.ind   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

Note that by checking again the structure of our data frame -after converting to factors-, we are able to see the number of levels for each factor. For instance, club name has 323 levels (hence, 323 assumed different footbal clubs), high wage industry two (it is a binary option like yes or no) and preferred foot 3 levels. The last seems to be a bit odd and it will need further checks.

#Check the number of levels 
levels(mydf$preferred_foot)

## [1] "Left"  "right" "Right"

subset(mydf, mydf$preferred_foot=="right") #entrance 9506 is the odd one.

#Convert factor back to character
mydf$preferred_foot<-as.character(mydf$preferred_foot)
#Correct the name 
mydf$preferred_foot[mydf$preferred_foot=="right"]<- "Right"
#Convert back to factor 
mydf$preferred_foot<-as.factor(mydf$preferred_foot)
levels(mydf$preferred_foot)

## [1] "Left"  "Right"

str(mydf)

## 'data.frame':    514 obs. of  17 variables:
##  $ sofifa_id       : int  203376 201942 184087 231866 209989 178562 205498 231443 205632 156616 ...
##  $ potential       : int  91 87 85 88 85 83 83 89 83 81 ...
##  $ wage_eur        : num  210000 190000 130000 145000 81000 55000 125000 175000 40000 54000 ...
##  $ age             : int  28 28 31 24 27 32 28 23 25 37 ...
##  $ height_cm       : int  193 181 186 191 185 174 180 178 188 170 ...
##  $ weight_kg       : int  92 76 81 82 77 71 68 67 84 72 ...
##  $ club_name       : Factor w/ 323 levels "1. FC Köln","1. FC Union Berlin",..: 188 188 293 192 35 24 64 110 256 136 ...
##  $ preferred_foot  : Factor w/ 2 levels "Left","Right": 2 2 2 2 2 2 2 1 2 2 ...
##  $ pace            : int  76 77 63 65 71 63 55 92 82 74 ...
##  $ shooting        : int  60 80 55 68 74 74 65 77 83 74 ...
##  $ passing         : int  71 81 72 77 81 84 84 77 78 82 ...
##  $ dribbling       : int  71 90 67 77 81 82 80 87 83 86 ...
##  $ defending       : int  91 61 87 82 82 76 72 36 63 30 ...
##  $ physic          : int  86 78 79 79 85 70 73 56 84 53 ...
##  $ power_strength  : int  92 74 79 79 82 67 66 51 81 56 ...
##  $ power_long_shots: int  64 77 58 76 85 81 62 76 79 71 ...
##  $ high.wage.ind   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

By now our dataframe has the right types (i.e, numeric or factors) and the right number of levels. We now start our data cleaning process.

1.3 Data cleaning (5 marks)

summary(mydf)

##    sofifa_id        potential        wage_eur             age       
##  Min.   : 49674   Min.   :51.00   Min.   :     1.7   Min.   :17.00  
##  1st Qu.:211299   1st Qu.:67.00   1st Qu.:  1000.0   1st Qu.:21.00  
##  Median :232626   Median :71.00   Median :  4000.0   Median :24.00  
##  Mean   :227062   Mean   :71.67   Mean   : 11608.5   Mean   :25.24  
##  3rd Qu.:247020   3rd Qu.:75.00   3rd Qu.: 11000.0   3rd Qu.:28.00  
##  Max.   :258947   Max.   :91.00   Max.   :210000.0   Max.   :78.00  
##                                                                     
##    height_cm       weight_kg                              club_name  
##  Min.   :158.0   Min.   : 56.00   KRC Genk                     :  5  
##  1st Qu.:176.0   1st Qu.: 70.00   Club Atlético Nacional Potosí:  4  
##  Median :180.0   Median : 74.00   Club Sportivo Luqueño        :  4  
##  Mean   :180.8   Mean   : 74.69   Deportivo Toluca             :  4  
##  3rd Qu.:185.0   3rd Qu.: 79.00   Kasimpasa SK                 :  4  
##  Max.   :220.0   Max.   :170.00   Liverpool                    :  4  
##                                   (Other)                      :489  
##  preferred_foot      pace           shooting        passing     
##  Left :133      Min.   :-80.00   Min.   :20.00   Min.   :25.00  
##  Right:381      1st Qu.: 60.25   1st Qu.:44.00   1st Qu.:52.00  
##                 Median : 68.00   Median :55.00   Median :58.00  
##                 Mean   : 66.50   Mean   :53.29   Mean   :57.64  
##                 3rd Qu.: 74.00   3rd Qu.:64.00   3rd Qu.:64.00  
##                 Max.   : 93.00   Max.   :83.00   Max.   :84.00  
##                                                                 
##    dribbling        defending         physic      power_strength
##  Min.   :-68.00   Min.   :15.00   Min.   :36.00   Min.   :29.0  
##  1st Qu.: 58.00   1st Qu.:36.00   1st Qu.:58.00   1st Qu.:58.0  
##  Median : 64.00   Median :56.00   Median :66.00   Median :67.0  
##  Mean   : 62.62   Mean   :51.54   Mean   :64.58   Mean   :65.5  
##  3rd Qu.: 69.00   3rd Qu.:64.00   3rd Qu.:72.00   3rd Qu.:74.0  
##  Max.   : 90.00   Max.   :91.00   Max.   :86.00   Max.   :93.0  
##                                                                 
##  power_long_shots high.wage.ind
##  Min.   :15.00    0:350        
##  1st Qu.:43.00    1:164        
##  Median :55.00                 
##  Mean   :52.31                 
##  3rd Qu.:64.00                 
##  Max.   :87.00                 
##

What type of data issues we are scrutinising for? Lets analyse each column, 1) ID. Non relevant to our Analysis. 2) Potential. Sort of overall classification. Should rank between 0 and 100 3) Wage in euros. Weekly average on euros. We have immediatly spotted a probably to low value 1.7 euros per week and a maximum of 210,000 euros per week (we may need to decide if the maximum is just an outlier, keep it or eliminate it) 4) Player age. Simple by eye balling the data here we see that we have an age of 78. Similar as above, we will need to decide whether to keep the data, replace it or eliminate it. 5) Height. Seems to be on average and the max heigth seems OK for, say a goal keeper. 6) Weight in kgs. 56 maybe a bit to low (but still possible for a person of short height), but the max weight seems way to off for a professional football player. 7) Peace attribute. It supposed to rank between 0 - 100 and we have found negative values. MUST INVESTIGATE 8) Shooting. Seems to be in scale 9) Passing. Seems to be in scale 10) Dribbling. As in peace, we have found negative values; MUST INVESTIGATE 11,12,13, and 14) all seems to be between the 0 - 100 scale.

While frustrating may look like, data scientists spends most of their time preparing and analysing data. Creating and running a model it is less time consuming once the data has been properly handled. This assignment it is not the exception and we will spend most of our time doing so (amongst other things)

Before creating the validators rules, lets try and give more sensible names to our variables

names(mydf)[1]<-"Id"
names(mydf)[3]<-"Wage"
names(mydf)[5]<-"Height"
names(mydf)[6]<-"Weight"
names(mydf)[7]<-"Club"
names(mydf)[8]<-"Foot"
names(mydf)[9]<-"Pace"
names(mydf)[10]<-"Shooting"
names(mydf)[11]<-"Passing"
names(mydf)[12]<-"Dribbling"
names(mydf)[13]<-"Defending"
names(mydf)[14]<-"Physic"
str(mydf)

## 'data.frame':    514 obs. of  17 variables:
##  $ Id              : int  203376 201942 184087 231866 209989 178562 205498 231443 205632 156616 ...
##  $ potential       : int  91 87 85 88 85 83 83 89 83 81 ...
##  $ Wage            : num  210000 190000 130000 145000 81000 55000 125000 175000 40000 54000 ...
##  $ age             : int  28 28 31 24 27 32 28 23 25 37 ...
##  $ Height          : int  193 181 186 191 185 174 180 178 188 170 ...
##  $ Weight          : int  92 76 81 82 77 71 68 67 84 72 ...
##  $ Club            : Factor w/ 323 levels "1. FC Köln","1. FC Union Berlin",..: 188 188 293 192 35 24 64 110 256 136 ...
##  $ Foot            : Factor w/ 2 levels "Left","Right": 2 2 2 2 2 2 2 1 2 2 ...
##  $ Pace            : int  76 77 63 65 71 63 55 92 82 74 ...
##  $ Shooting        : int  60 80 55 68 74 74 65 77 83 74 ...
##  $ Passing         : int  71 81 72 77 81 84 84 77 78 82 ...
##  $ Dribbling       : int  71 90 67 77 81 82 80 87 83 86 ...
##  $ Defending       : int  91 61 87 82 82 76 72 36 63 30 ...
##  $ Physic          : int  86 78 79 79 85 70 73 56 84 53 ...
##  $ power_strength  : int  92 74 79 79 82 67 66 51 81 56 ...
##  $ power_long_shots: int  64 77 58 76 85 81 62 76 79 71 ...
##  $ high.wage.ind   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

Lets create our rules

summary(mydf)

##        Id           potential          Wage               age       
##  Min.   : 49674   Min.   :51.00   Min.   :     1.7   Min.   :17.00  
##  1st Qu.:211299   1st Qu.:67.00   1st Qu.:  1000.0   1st Qu.:21.00  
##  Median :232626   Median :71.00   Median :  4000.0   Median :24.00  
##  Mean   :227062   Mean   :71.67   Mean   : 11608.5   Mean   :25.24  
##  3rd Qu.:247020   3rd Qu.:75.00   3rd Qu.: 11000.0   3rd Qu.:28.00  
##  Max.   :258947   Max.   :91.00   Max.   :210000.0   Max.   :78.00  
##                                                                     
##      Height          Weight                                  Club    
##  Min.   :158.0   Min.   : 56.00   KRC Genk                     :  5  
##  1st Qu.:176.0   1st Qu.: 70.00   Club Atlético Nacional Potosí:  4  
##  Median :180.0   Median : 74.00   Club Sportivo Luqueño        :  4  
##  Mean   :180.8   Mean   : 74.69   Deportivo Toluca             :  4  
##  3rd Qu.:185.0   3rd Qu.: 79.00   Kasimpasa SK                 :  4  
##  Max.   :220.0   Max.   :170.00   Liverpool                    :  4  
##                                   (Other)                      :489  
##     Foot          Pace           Shooting        Passing        Dribbling     
##  Left :133   Min.   :-80.00   Min.   :20.00   Min.   :25.00   Min.   :-68.00  
##  Right:381   1st Qu.: 60.25   1st Qu.:44.00   1st Qu.:52.00   1st Qu.: 58.00  
##              Median : 68.00   Median :55.00   Median :58.00   Median : 64.00  
##              Mean   : 66.50   Mean   :53.29   Mean   :57.64   Mean   : 62.62  
##              3rd Qu.: 74.00   3rd Qu.:64.00   3rd Qu.:64.00   3rd Qu.: 69.00  
##              Max.   : 93.00   Max.   :83.00   Max.   :84.00   Max.   : 90.00  
##                                                                               
##    Defending         Physic      power_strength power_long_shots high.wage.ind
##  Min.   :15.00   Min.   :36.00   Min.   :29.0   Min.   :15.00    0:350        
##  1st Qu.:36.00   1st Qu.:58.00   1st Qu.:58.0   1st Qu.:43.00    1:164        
##  Median :56.00   Median :66.00   Median :67.0   Median :55.00                 
##  Mean   :51.54   Mean   :64.58   Mean   :65.5   Mean   :52.31                 
##  3rd Qu.:64.00   3rd Qu.:72.00   3rd Qu.:74.0   3rd Qu.:64.00                 
##  Max.   :91.00   Max.   :86.00   Max.   :93.0   Max.   :87.00                 
##

mydf.rules <-validator(okwage=Wage >499, okage= age< 41, okweight= Weight <96, okPace=Pace >0, okDribbling= Dribbling >0)
# Here we have decided our rules upon inspection of the mydf tab. The lowest weekly wage (aside to our minima) is 500 euros a week; pace and dribbling must be positive real numbers, and weight has to be less or equal than 95 kgs (again, our second lowest value after our maxima. Finally, apart from an outlier on the age column, all players in the data set are 40 years of age or less)
quality_mydf <-confront(mydf, mydf.rules)
summary(quality_mydf)

Having identified the number of fails, we now replace them.

subset(mydf, mydf$Wage<499)

mean(mydf$Wage)

## [1] 11608.47

mydf$Wage[mydf$Wage<499]<- 11608
#we have here replace the very low value for the mean wage 

subset(mydf, mydf$age>40)

median(mydf$age)

## [1] 24

mean(mydf$age)

## [1] 25.24125

mydf$age[mydf$age>40]<-25
#both mean and median are very close 24 and 25: we will replace for the mean.

subset(mydf, mydf$Weight>96)

mean(mydf$Weight)

## [1] 74.68872

mydf$Weight[mydf$Weight>96]<-74.6
#Again, we replace the data for the mean 

#Finally, we do the same for Pace and Dribbling by replacing for the mean of each one. 
mean(mydf$Pace)

## [1] 66.49611

mean(mydf$Dribbling)

## [1] 62.61868

subset(mydf, mydf$Pace<0)

subset(mydf, mydf$Pace<0)

mydf$Pace[mydf$Pace<0]<- 66.5
mydf$Dribbling[mydf$Dribbling<0]<-63

#Finally, we run our quality control again 
quality_mydf <-confront(mydf, mydf.rules)
summary(quality_mydf)

#Check for NA values 
table(is.na(mydf))

## 
## FALSE 
##  8738

# we have no NA in our dataframe

Now all our data makes sense.

Finally we divide our data in training and validation sets (70 % and 30% respectively)

#Divide between training and validation set 
set.seed(1000)
dt = sort(sample(nrow(mydf), nrow(mydf)*.6))
train<-mydf[dt,]
test<-mydf[-dt,]

2. Exploratory Data Analysis (EDA)

2.1 EDA plan (5 marks)

We start by noting that we have a combination of quantiative and qualitative variables, which it may lead us to run either a ANCOVA analysis or multiple regression analysis. Running a tree analysis can be very useful to determine relations.

We also note that, High.Wage.Ind. Is a qualitative dummy variable (1=yes, 0=no) for the level of salary (above 8000 euros per week).

With this information, and just after the EDA phase, we may create two models: a multiple regression logistic model and a ANCOVA model.

BEfore creating any model, we start looking for correlation between the variables. Here in particular we want to understand the correlation between the dummy variables and our (potential) independent variables.

We start by running a biserial correlation (when one of the variables is a dichotomous variable).

biserial.cor(train$Pace, train$high.wage.ind)#here the y variable has to be the binary variable.

## [1] -0.1236025

biserial.cor(train$Shooting, train$high.wage.ind)

## [1] -0.3212304

biserial.cor(train$Passing, train$high.wage.ind)

## [1] -0.5240955

biserial.cor(train$Dribbling, train$high.wage.ind)

## [1] -0.5059521

biserial.cor(train$Defending, train$high.wage.ind)

## [1] -0.271064

biserial.cor(train$Physic, train$high.wage.ind)

## [1] -0.2974406

biserial.cor(train$power_strength, train$high.wage.ind)

## [1] -0.1580989

biserial.cor(train$power_long_shots, train$high.wage.ind)

## [1] -0.3130613

biserial.cor(train$Wage, train$high.wage.ind)

## [1] -0.5801487

Passing, Dribbling, and Wage seems to be close to a high correlation (above the 0.50 treshold). To what extent this may affect our analysis, we cannot say at this point, but we will keep an eye on these variables. Curious enough, all the variables seems to be negative correlated to the high salary variable (maybe as the salary increases players tend to relax more and their performance decreases??) We now look into the correlation between wage in euros and the other ‘ability’ variables.

cor.test(train$Wage, train$Pace)

## 
##  Pearson's product-moment correlation
## 
## data:  train$Wage and train$Pace
## t = 1.1992, df = 306, p-value = 0.2314
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04369658  0.17878763
## sample estimates:
##        cor 
## 0.06839584

cor.test(train$Wage, train$Shooting)

## 
##  Pearson's product-moment correlation
## 
## data:  train$Wage and train$Shooting
## t = 5.2385, df = 306, p-value = 3.018e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1809204 0.3862530
## sample estimates:
##       cor 
## 0.2868783

cor.test(train$Wage, train$Passing)

## 
##  Pearson's product-moment correlation
## 
## data:  train$Wage and train$Passing
## t = 10.008, df = 306, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4074362 0.5763553
## sample estimates:
##      cor 
## 0.496583

cor.test(train$Wage, train$Dribbling)

## 
##  Pearson's product-moment correlation
## 
## data:  train$Wage and train$Dribbling
## t = 9.0282, df = 306, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3656107 0.5425777
## sample estimates:
##       cor 
## 0.4586295

cor.test(train$Wage, train$Defending)

## 
##  Pearson's product-moment correlation
## 
## data:  train$Wage and train$Defending
## t = 5.9292, df = 306, p-value = 8.213e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2170366 0.4177791
## sample estimates:
##       cor 
## 0.3210087

cor.test(train$Wage, train$Physic)

## 
##  Pearson's product-moment correlation
## 
## data:  train$Wage and train$Physic
## t = 5.7233, df = 306, p-value = 2.491e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2063714 0.4085194
## sample estimates:
##       cor 
## 0.3109579

cor.test(train$Wage, train$power_strength)

## 
##  Pearson's product-moment correlation
## 
## data:  train$Wage and train$power_strength
## t = 3.0847, df = 306, p-value = 0.002224
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0631278 0.2799855
## sample estimates:
##      cor 
## 0.173661

cor.test(train$Wage, train$power_long_shots)

## 
##  Pearson's product-moment correlation
## 
## data:  train$Wage and train$power_long_shots
## t = 5.3252, df = 306, p-value = 1.957e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1855040 0.3902809
## sample estimates:
##       cor 
## 0.2912249

The null hypothesis in the Pearson Correlation test is that the correlation coefficient IS NOT significantly different from 0. A p value of less than 0.05 will reject the null hypothesis with a confidence level of 95% (in other words, the possibility of rejecting the null hypothesis whne true is just 5%)

None of the ‘abilities’ variables is highly correlated to the wage variable either and this is good: it is unlikely to find multicollinarity issues between the variables thay may affect our prediction.

Note that dribbling, passing, and shooting are slightly positively correlated to the wage variable, and while not to the extent to expect multicolinearity issues, it gives us a bit of a clue about our prediciton modelling: we may expecte to be significant in our analysis.

We have checked for multicollinearity issues and we have found none.

We will now check the normality assumption between our variables and we start doing so by plotting their densities.

A word of caution here: here we are checking for normality between our data NOT OUR MODEL.

We will check for noramlity again once we have a model to test.

Why we are checking then? If our data is normal distributed (or close to), the likelihood of having a normally distributed model increases.

Similarily, a well behaved variable has more chances of being statistically significant than one that it is not (but not neccesarily).

Regardless of the outcome of our model, a full and comprehensive (and unbias) analysis of our data (and hence familiarisation) is fundamental for reaching the right conclusions under the right model.

#Lets start plotting the densities and QQ plot -for normality - for each variable,
ggqqplot(train$potential, color="red", ggtheme = theme_cleveland())  + ggtitle("QQ Plot for Potential")

ggdensity(train$potential) + ggtitle("Density Plot for Potential") + geom_density(color =
                                                                                      "purple", fill = "#69b3a2")

ggqqplot(train$Wage, color="blue", ggtheme = theme_cleveland()) + ggtitle("QQ Plot for Wage")

ggdensity(train$Wage) + ggtitle("Density Plot for Wage") + geom_density(color =
                                                                            "purple", fill = "#69b3a2")

ggqqplot(train$age, color="green", ggtheme = theme_cleveland()) + ggtitle("QQ Plot for Age")

ggdensity(train$age) + ggtitle("Density Plot for Age") + geom_density(color =
                                                                          "purple", fill = "#69b3a2")

ggqqplot(train$Height, color="Black", ggtheme = theme_cleveland()) + ggtitle("QQ Plot for Height")

ggdensity(train$Height) + ggtitle("Density Plot for Height") + geom_density(color =
                                                                                "purple", fill = "#69b3a2")

ggqqplot(train$Pace, color="red", ggtheme = theme_cleveland()) + ggtitle("QQ Plot for Pace")

ggdensity(train$Pace) + ggtitle("Density Plot for Pace") + geom_density(color =
                                                                            "purple", fill = "#69b3a2")

ggqqplot(train$Shooting, color="blue", ggtheme = theme_cleveland()) + ggtitle("QQ Plot for Shooting")

ggdensity(train$Shooting) + ggtitle("Density Plot for Shooting") + geom_density(color =
                                                                                    "purple", fill = "#69b3a2")

ggqqplot(train$Passing, color="green", ggtheme = theme_cleveland()) + ggtitle("QQ Plot for Passing")

ggdensity(train$Passing) + ggtitle("Density Plot for Passing") + geom_density(color =
                                                                                  "purple", fill = "#69b3a2")

ggqqplot(train$Dribbling, color="black", ggtheme = theme_cleveland()) + ggtitle("QQ Plot for Dribbling")

ggdensity(train$Dribbling) + ggtitle("Density Plot for Dribbling") + geom_density(color =
                                                                                      "purple", fill = "#69b3a2")

ggqqplot(train$Defending, color="red", ggtheme = theme_cleveland()) + ggtitle("QQ Plot for Defending")

ggdensity(train$Defending) + ggtitle("Density Plot for Defending") + geom_density(color =
                                                                                      "purple", fill = "#69b3a2")

ggqqplot(train$Physic, color="blue", ggtheme = theme_cleveland()) + ggtitle("QQ Plot for Physic")

ggdensity(train$Physic) + ggtitle("Density Plot for Physic") + geom_density(color =
                                                                                "purple", fill = "#69b3a2")

ggqqplot(train$power_strength, color="green", ggtheme = theme_cleveland()) + ggtitle("QQ Plot for Power Strength")

ggdensity(train$power_strength) + ggtitle("Density Plot for Power Strength") + geom_density(color =
                                                                                                "purple", fill = "#69b3a2")

ggqqplot(train$power_long_shots, color="black", ggtheme = theme_cleveland()) + ggtitle("QQ Plot fro Power Long Shots")

ggdensity(train$power_long_shots) + ggtitle("Density Plot for Power Long Shots") + geom_density(color =
                                                                                              "purple", fill = "#69b3a2")

Potential, Height, and Weight are the only ones that seems to be normally distributed: we will soon check this numerically by running a Shapiro Wilkes Test (In any case normality is not an strong assumption in statistics: homoskedasticity and independence of the observations are far more important).

Once we have built our model we can check the assumptions of independence of the observations, homsekedasticity, and normality too.

Do to the presence of categorical variables in our data frame, we cannot run a Shapiro Wilkes Test for normality to our data set as a whole. What we can do is to momentarily remove the factors from our data set and then run the Shapiro Wilkes test on the numeric variables.

train_2 <- train[,-c(1,7,8,17)]
apply(train_2, 2, shapiro.test)

## $potential
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.9868, p-value = 0.006483
## 
## 
## $Wage
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.48674, p-value < 2.2e-16
## 
## 
## $age
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.96126, p-value = 2.649e-07
## 
## 
## $Height
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.99417, p-value = 0.2868
## 
## 
## $Weight
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.99332, p-value = 0.1881
## 
## 
## $Pace
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.97403, p-value = 2.294e-05
## 
## 
## $Shooting
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.97416, p-value = 2.411e-05
## 
## 
## $Passing
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.99008, p-value = 0.03486
## 
## 
## $Dribbling
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.97554, p-value = 4.165e-05
## 
## 
## $Defending
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.92862, p-value = 5.52e-11
## 
## 
## $Physic
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.97277, p-value = 1.413e-05
## 
## 
## $power_strength
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.97868, p-value = 0.0001523
## 
## 
## $power_long_shots
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.97774, p-value = 0.0001022

#Please note that we had to use the apply () function to the shapiro test otherwise using the shapiro.test () on its own it would have produced an error. This is becaause the Shapiro test works for vectors and not on data frame as ours.

For this test, the NULL hypothesis is that our data IS INDEED NORMALLY DISTRIBUTED. Hence if we found a p value less than 0.05 we REJECT the null hypothesis and we conclude that our data it is not normally distributed (if our p value is larger than 0.05 or larger we assume that it is normally distributed)

Apart from weight, height, potential and shooting (that are borderline normally distributed), none of the other variables seems to be close to a normal distribution.

This confirms are graphical plotting.

Again, normality at this point it is not a big deal: once we have our model we will produce a series of qq plots to check for normality of the model.

2.2 EDA and summary of results (10 marks)

We have already checked the underlying assumptions. We can now explore the data by, 1) Plotting the factor variables 2) Aggregate by functions the independent variables and the dependent variables to see any underlying relations 3) Plotting Trees with the hope of finding clues about the relation between the variables.

#The variables that we are interested in plotting are 'high.wage.ind', and 'foot'
plot(train$Foot)

table(train$Foot) # most players are right foot'ed (it will be nice to understand from these ones who are high earners..?)

## 
##  Left Right 
##    72   236

#Aggregate the variables by mean 
aggregate(train, by=list(train$high.wage), FUN=mean)

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

We have just calcualted the aggregate mean of all our variables and we haved grouped them by dividing those that are high earners (1), and those who not (0)

On average, high earners have a potential of 77 against the 70 of those that are no high earners.

The rest of the results are not what we expected: a clear difference in performance between the two groups.

Dribbling, defending, Physic, power strength, power_long_shots, and even age are quite close to each other.

Maybe the salary it is not directly related to the abilities themselves but with the financial resources of the club?

Or maybe with the managers expectations’?

Maybe high salary is a biased reaction to expectations or networking?

We will construct a model later on this assignment to check this and other assumptions.

To complete our exploratory analysis, We follow by plotting trees

#Here we use and adaptation of the rpart () function http://www.milbo.org/rpart-plot/prp.pdf
library(rpart.plot)
str(train)

## 'data.frame':    308 obs. of  17 variables:
##  $ Id              : int  201942 184087 231866 209989 178562 205498 205632 220971 251573 200724 ...
##  $ potential       : int  87 85 88 85 83 83 83 85 87 80 ...
##  $ Wage            : num  190000 130000 145000 81000 55000 125000 40000 95000 49000 135000 ...
##  $ age             : num  28 31 24 27 32 28 25 25 22 30 ...
##  $ Height          : int  181 186 191 185 174 180 188 172 173 180 ...
##  $ Weight          : num  76 81 82 77 71 68 84 64 68 76 ...
##  $ Club            : Factor w/ 323 levels "1. FC Köln","1. FC Union Berlin",..: 188 293 192 35 24 64 256 188 35 237 ...
##  $ Foot            : Factor w/ 2 levels "Left","Right": 2 2 2 2 2 2 2 2 1 2 ...
##  $ Pace            : num  77 63 65 71 63 55 82 71 84 78 ...
##  $ Shooting        : int  80 55 68 74 74 65 83 73 67 37 ...
##  $ Passing         : int  81 72 77 81 84 84 78 76 76 66 ...
##  $ Dribbling       : num  90 67 77 81 82 80 83 87 81 66 ...
##  $ Defending       : int  61 87 82 82 76 72 63 65 75 80 ...
##  $ Physic          : int  78 79 79 85 70 73 84 66 75 80 ...
##  $ power_strength  : int  74 79 79 82 67 66 81 57 69 79 ...
##  $ power_long_shots: int  77 58 76 85 81 62 79 73 68 28 ...
##  $ high.wage.ind   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

set.seed(567)
train_2 <- train[, c(-7,-1)] #we start by removing the ID column (number 1) because it is irrelevant and the club columns (number 7) due to having too many levels for out tree (our function will output an error message)

attach(train_2)

tree_model <- rpart(high.wage.ind ~., train_2)
rpart.plot(tree_model)

#Here the only significant variable is wage  when 33% of the footballers are high earners and the rest (67%) are not.
#the plot does not give us a lot of insight.

#Lets try something different 

tree_model_1 <-rpart(Wage~.,train_2)
rpart.plot(tree_model_1)

By using the ‘wage’ variable as our dependent variable we have a bit more of information on our tree, but a good bucnh of variables is missing from our tree.

still it is a very interesting model because wage is still our main variable in the sense that we would like to make wage predictions with out model. ON the down side, we cannot segragate, by looking at this tree, which ones belong to the high wages or not: seems to be just an analysis of the different types of abilities.

We try now a model were Potential is our main variable

set.seed(24)
tree_model_2 <-rpart(potential~.,train_2)
rpart.plot(tree_model_2)

Having the variable ‘potential’ as our dependent variable makes more sense: abilities are now divided according with the different types of abilities and the tree it is, indeed, a lot more complete (as in full)

Here the interpreation is as follow:

Those with a potential lower than 69% have a lower wage, compared with those with 76 potential that earn more (note that these number coincide with the mean of the aggregate output).

Of those on lower wage, 36% have a hight potential (70) and are older than 26 years old against the 21% remaining that have a potential of 66 and are less than 26 years old.

From these group of ‘older’ players (36%), 29% are less than 19 years old and just a 3% are older. From the group of younger ones (21% with a 66 of potential), dribbling is the most important variable with just a 9% with a score larger of 63 on dribbling. The remaining 11% that has a score in dribbling less than 63, 9% scored higher in physic.

The same interpretation applies to the RHS of the tree.

Overall, we see that when potential is the explanatory variable, wage and age are the two most important explanatory variables, followed by dribbling and passing.

Having this information in mind, we may want to construct two different models: one with a binary respone (high.indsutry.wage) and a multiple regression model with the Potential as the dependent variable.

2.3 Additional insights and issues (5 marks)

So far we have learned the following: There are no significant correlations within our variables for any of the two models that we have in mind: ANCOVA and Logistic. By plotting trees we have learned a bit more about both models and which variables were the most significants. We hope to find the same type of relation (or at least similar) when creating our regression models.

On the down side, most of our data was not normally distributed and this may be a problem when plotting our model diagnostic (qq plot, residuals vs leverage, etc)

On average, the score on the abilities of both group of earners (high an low) were quite similar. This was a bit dissapointing initially (we were expecting a stark difference between both groups) that was later confirmed by the little information provided by its respective tree plot.

While we still hope to find a good model that helps predict performance, and ultimately, belonging to the high earners group, we are open to the possibility of this not be the case.

3. Modelling

3.1 Build a model for player potential (5 Marks)

AS mentioned earlier, and based on our EDA and our research question, we will try and construct two models: one with a binary dependent variable and an ANOVA model.

For our model with a binary response we will use a glm model with a binomial family of errors with a logit link.

model.1 <- glm(train$high.wage.ind~., train, family = binomial)

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

As soon as we run the code on this model -on that includes all the variables- we encounter the following message: “algorithm did not convergeglm.fit: fitted probabilities numerically ) or 1 occurred”.

This eems to be due to perfect or quasi perfect separation.

Perfect separation means that our predictor variable correctly allocate all the variables in onegroup (almost like a perfect prediction).

We hence try removing the variables that we suspect are highly correlated. We now remove the ones that seems problematic or irrelevant: the ID column, the club column, and the wage column (probably correlated with the high.wage industry variable.)

set.seed(456)
model.1 <-
  glm(
    train$high.wage.ind ~ potential + age + Height + Weight + Pace +
      Shooting + Dribbling + Defending + Physic + power_long_shots + power_strength,
    train,
    family = binomial
  )

Wage was responsible for the ‘perfect prediction’ leading to the failure of our previous model. In other words, wage and high.wage.ind were highly correlated (as if they were the ‘same variable’ but divided in two…)

#We now perform variable selection by using the step () function  
set.seed(45)
summary(model.1)

## 
## Call:
## glm(formula = train$high.wage.ind ~ potential + age + Height + 
##     Weight + Pace + Shooting + Dribbling + Defending + Physic + 
##     power_long_shots + power_strength, family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5091  -0.4505  -0.1519   0.1768   2.5696  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -43.40948    9.27374  -4.681 2.86e-06 ***
## potential          0.37403    0.06319   5.919 3.24e-09 ***
## age                0.21590    0.06079   3.551 0.000383 ***
## Height            -0.05085    0.04694  -1.083 0.278656    
## Weight             0.09817    0.05647   1.738 0.082145 .  
## Pace               0.02667    0.02573   1.036 0.300016    
## Shooting           0.05775    0.05370   1.075 0.282247    
## Dribbling          0.12596    0.05356   2.352 0.018679 *  
## Defending          0.02859    0.02359   1.212 0.225537    
## Physic            -0.03509    0.07113  -0.493 0.621765    
## power_long_shots  -0.06729    0.03839  -1.753 0.079610 .  
## power_strength     0.04379    0.05191   0.843 0.398950    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 378.96  on 307  degrees of freedom
## Residual deviance: 176.97  on 296  degrees of freedom
## AIC: 200.97
## 
## Number of Fisher Scoring iterations: 6

model.2<- step(model.1)

## Start:  AIC=200.97
## train$high.wage.ind ~ potential + age + Height + Weight + Pace + 
##     Shooting + Dribbling + Defending + Physic + power_long_shots + 
##     power_strength
## 
##                    Df Deviance    AIC
## - Physic            1   177.21 199.21
## - power_strength    1   177.68 199.68
## - Pace              1   178.04 200.04
## - Shooting          1   178.13 200.13
## - Height            1   178.16 200.16
## - Defending         1   178.43 200.43
## <none>                  176.97 200.97
## - Weight            1   180.08 202.08
## - power_long_shots  1   180.14 202.14
## - Dribbling         1   182.96 204.96
## - age               1   190.92 212.92
## - potential         1   226.65 248.65
## 
## Step:  AIC=199.21
## train$high.wage.ind ~ potential + age + Height + Weight + Pace + 
##     Shooting + Dribbling + Defending + power_long_shots + power_strength
## 
##                    Df Deviance    AIC
## - power_strength    1   177.86 197.86
## - Pace              1   178.04 198.04
## - Shooting          1   178.19 198.19
## - Height            1   178.35 198.35
## - Defending         1   178.61 198.61
## <none>                  177.21 199.21
## - Weight            1   180.22 200.22
## - power_long_shots  1   180.25 200.25
## - Dribbling         1   183.51 203.51
## - age               1   190.93 210.93
## - potential         1   227.08 247.08
## 
## Step:  AIC=197.86
## train$high.wage.ind ~ potential + age + Height + Weight + Pace + 
##     Shooting + Dribbling + Defending + power_long_shots
## 
##                    Df Deviance    AIC
## - Height            1   178.66 196.66
## - Pace              1   179.01 197.01
## - Shooting          1   179.03 197.03
## - Defending         1   179.67 197.67
## <none>                  177.86 197.86
## - power_long_shots  1   181.23 199.23
## - Dribbling         1   183.65 201.65
## - Weight            1   184.23 202.23
## - age               1   194.93 212.93
## - potential         1   229.42 247.42
## 
## Step:  AIC=196.66
## train$high.wage.ind ~ potential + age + Weight + Pace + Shooting + 
##     Dribbling + Defending + power_long_shots
## 
##                    Df Deviance    AIC
## - Shooting          1   179.55 195.55
## - Pace              1   179.91 195.91
## - Defending         1   180.27 196.27
## <none>                  178.66 196.66
## - power_long_shots  1   181.60 197.60
## - Weight            1   184.99 200.99
## - Dribbling         1   185.54 201.54
## - age               1   195.31 211.31
## - potential         1   229.66 245.66
## 
## Step:  AIC=195.55
## train$high.wage.ind ~ potential + age + Weight + Pace + Dribbling + 
##     Defending + power_long_shots
## 
##                    Df Deviance    AIC
## - Defending         1   180.28 194.28
## - Pace              1   180.48 194.48
## <none>                  179.55 195.55
## - power_long_shots  1   182.29 196.29
## - Weight            1   187.20 201.20
## - Dribbling         1   188.55 202.55
## - age               1   198.11 212.11
## - potential         1   233.16 247.16
## 
## Step:  AIC=194.28
## train$high.wage.ind ~ potential + age + Weight + Pace + Dribbling + 
##     power_long_shots
## 
##                    Df Deviance    AIC
## - Pace              1   181.04 193.04
## <none>                  180.28 194.28
## - power_long_shots  1   184.25 196.25
## - Weight            1   188.25 200.25
## - Dribbling         1   189.62 201.62
## - age               1   204.06 216.06
## - potential         1   243.44 255.44
## 
## Step:  AIC=193.04
## train$high.wage.ind ~ potential + age + Weight + Dribbling + 
##     power_long_shots
## 
##                    Df Deviance    AIC
## <none>                  181.04 193.04
## - power_long_shots  1   185.37 195.37
## - Weight            1   188.25 198.25
## - Dribbling         1   195.05 205.05
## - age               1   206.04 216.04
## - potential         1   246.10 256.10

summary.glm(model.2)

## 
## Call:
## glm(formula = train$high.wage.ind ~ potential + age + Weight + 
##     Dribbling + power_long_shots, family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7195  -0.4301  -0.1706   0.2254   2.5974  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -48.49545    6.04614  -8.021 1.05e-15 ***
## potential          0.36935    0.05690   6.491 8.53e-11 ***
## age                0.22394    0.04848   4.619 3.86e-06 ***
## Weight             0.08938    0.03386   2.640 0.008301 ** 
## Dribbling          0.16276    0.04680   3.478 0.000505 ***
## power_long_shots  -0.04123    0.02022  -2.039 0.041480 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 378.96  on 307  degrees of freedom
## Residual deviance: 181.04  on 302  degrees of freedom
## AIC: 193.04
## 
## Number of Fisher Scoring iterations: 6

This is really good news: The difference between the null deviance and the residual deviance is big and, as a general rule, the biggest the difference, the better the model fit (null deviance is our model with JUST the intercept and residual deviance is our model including ALL the variables).

All our variables are highly significant.

OUr model is consistent with our findings in the Exploratory Analysis.

We now run a quick tree modelling on our model,

We still need to run our model on our validation set, but before moving on, we will run some plots to check the underlying assumptions of the model.

plot(model.2)

Our model seems to be close to a normal distribution (but not as musch as we would like) with no high levearge points: no outliers were found inside the Cook’s distance line (4/n by default)

At this point we are happy to move to the next stage: replicating our model in our validation set.

set.seed(345)
model.1.validation <-
  glm(
   test$high.wage.ind ~ potential + age + Height + Weight + Pace +
      Shooting + Dribbling + Defending + Physic + power_long_shots + power_strength,
    test,
    family = binomial
  )
model.2.validation<- step(model.1.validation)

## Start:  AIC=140.76
## test$high.wage.ind ~ potential + age + Height + Weight + Pace + 
##     Shooting + Dribbling + Defending + Physic + power_long_shots + 
##     power_strength
## 
##                    Df Deviance    AIC
## - Physic            1   116.82 138.82
## - Weight            1   116.92 138.92
## - power_strength    1   117.14 139.14
## - Pace              1   117.95 139.95
## - Height            1   118.05 140.05
## <none>                  116.76 140.76
## - Defending         1   119.79 141.79
## - Shooting          1   120.47 142.47
## - power_long_shots  1   122.32 144.32
## - Dribbling         1   126.62 148.62
## - age               1   129.64 151.64
## - potential         1   144.10 166.10
## 
## Step:  AIC=138.82
## test$high.wage.ind ~ potential + age + Height + Weight + Pace + 
##     Shooting + Dribbling + Defending + power_long_shots + power_strength
## 
##                    Df Deviance    AIC
## - Weight            1   117.03 137.03
## - power_strength    1   117.61 137.60
## - Height            1   118.13 138.13
## - Pace              1   118.31 138.31
## <none>                  116.82 138.82
## - Shooting          1   120.60 140.60
## - Defending         1   120.97 140.97
## - power_long_shots  1   122.88 142.88
## - Dribbling         1   126.65 146.65
## - age               1   129.64 149.64
## - potential         1   144.10 164.10
## 
## Step:  AIC=137.03
## test$high.wage.ind ~ potential + age + Height + Pace + Shooting + 
##     Dribbling + Defending + power_long_shots + power_strength
## 
##                    Df Deviance    AIC
## - Pace              1   118.50 136.50
## - Height            1   118.75 136.75
## <none>                  117.03 137.03
## - power_strength    1   119.74 137.74
## - Shooting          1   120.70 138.70
## - Defending         1   121.08 139.08
## - power_long_shots  1   123.00 141.00
## - Dribbling         1   126.71 144.71
## - age               1   130.13 148.13
## - potential         1   144.18 162.18
## 
## Step:  AIC=136.5
## test$high.wage.ind ~ potential + age + Height + Shooting + Dribbling + 
##     Defending + power_long_shots + power_strength
## 
##                    Df Deviance    AIC
## - Height            1   120.36 136.36
## <none>                  118.50 136.50
## - Shooting          1   121.29 137.29
## - power_strength    1   121.81 137.81
## - Defending         1   121.96 137.96
## - power_long_shots  1   123.10 139.10
## - Dribbling         1   127.22 143.22
## - age               1   133.53 149.53
## - potential         1   146.90 162.90
## 
## Step:  AIC=136.36
## test$high.wage.ind ~ potential + age + Shooting + Dribbling + 
##     Defending + power_long_shots + power_strength
## 
##                    Df Deviance    AIC
## <none>                  120.36 136.36
## - Shooting          1   123.63 137.63
## - Defending         1   123.81 137.81
## - power_long_shots  1   125.07 139.07
## - power_strength    1   126.17 140.17
## - Dribbling         1   127.25 141.25
## - age               1   134.67 148.67
## - potential         1   153.54 167.54

summary.glm(model.2.validation)

## 
## Call:
## glm(formula = test$high.wage.ind ~ potential + age + Shooting + 
##     Dribbling + Defending + power_long_shots + power_strength, 
##     family = binomial, data = test)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.78671  -0.43404  -0.09763   0.32179   2.46522  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -46.30521    6.82374  -6.786 1.15e-11 ***
## potential          0.33821    0.06975   4.849 1.24e-06 ***
## age                0.27254    0.07784   3.501 0.000463 ***
## Shooting           0.09904    0.05593   1.771 0.076611 .  
## Dribbling          0.12184    0.04827   2.524 0.011597 *  
## Defending          0.03812    0.02108   1.808 0.070627 .  
## power_long_shots  -0.09178    0.04375  -2.098 0.035917 *  
## power_strength     0.05223    0.02232   2.340 0.019289 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 264.05  on 205  degrees of freedom
## Residual deviance: 120.36  on 198  degrees of freedom
## AIC: 136.36
## 
## Number of Fisher Scoring iterations: 6

plot(model.2.validation)

Our model on the validation set perform worse than in our training set but still it has good features: apart from one variable, the rest of the explanatory variables are significant to a very good confidence level. Our model it is not particularily sensitive to high leverage points and it is close to a normal distrbution (but not as normal as we would like)

POtential is, by far, the most important variable when predicting the probability of belonging to the high.wage.industry group, followed by age.

While correct, it does not seem to be particularily useful in real life (we could have reached to the same conclusion without using this model)

Creating a model for the potential variable

Our next step will then, be to create a model where potential is our dependant variable.

set.seed(200)
lm.model <- lm(potential~., data=train_2)
summary.lm(lm.model)

## 
## Call:
## lm(formula = potential ~ ., data = train_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7362 -2.3355  0.0333  1.8792 11.6787 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.527e+01  8.251e+00   6.698 1.08e-10 ***
## Wage              4.347e-05  1.115e-05   3.899 0.000120 ***
## age              -7.004e-01  5.303e-02 -13.209  < 2e-16 ***
## Height            7.186e-02  4.908e-02   1.464 0.144237    
## Weight           -9.150e-02  5.417e-02  -1.689 0.092260 .  
## FootRight         7.451e-01  4.653e-01   1.601 0.110375    
## Pace             -5.145e-02  2.479e-02  -2.076 0.038794 *  
## Shooting          7.139e-02  5.362e-02   1.332 0.184058    
## Passing           1.316e-01  4.779e-02   2.754 0.006258 ** 
## Dribbling         1.999e-01  5.356e-02   3.732 0.000229 ***
## Defending         2.105e-02  2.502e-02   0.842 0.400681    
## Physic            1.485e-01  6.801e-02   2.183 0.029826 *  
## power_strength   -2.126e-02  4.966e-02  -0.428 0.668869    
## power_long_shots -8.026e-02  3.935e-02  -2.040 0.042260 *  
## high.wage.ind1    3.555e+00  5.652e-01   6.290 1.15e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.397 on 293 degrees of freedom
## Multiple R-squared:  0.7076, Adjusted R-squared:  0.6936 
## F-statistic: 50.65 on 14 and 293 DF,  p-value: < 2.2e-16

set.seed(567)
lm.model.1 <-step(lm.model)

## Start:  AIC=767.92
## potential ~ Wage + age + Height + Weight + Foot + Pace + Shooting + 
##     Passing + Dribbling + Defending + Physic + power_strength + 
##     power_long_shots + high.wage.ind
## 
##                    Df Sum of Sq    RSS    AIC
## - power_strength    1      2.12 3383.2 766.12
## - Defending         1      8.17 3389.3 766.67
## - Shooting          1     20.46 3401.6 767.78
## <none>                          3381.1 767.92
## - Height            1     24.74 3405.8 768.17
## - Foot              1     29.59 3410.7 768.61
## - Weight            1     32.92 3414.0 768.91
## - power_long_shots  1     48.02 3429.1 770.27
## - Pace              1     49.72 3430.8 770.42
## - Physic            1     54.99 3436.1 770.89
## - Passing           1     87.51 3468.6 773.80
## - Dribbling         1    160.68 3541.8 780.22
## - Wage              1    175.46 3556.6 781.51
## - high.wage.ind     1    456.52 3837.6 804.93
## - age               1   2013.30 5394.4 909.81
## 
## Step:  AIC=766.12
## potential ~ Wage + age + Height + Weight + Foot + Pace + Shooting + 
##     Passing + Dribbling + Defending + Physic + power_long_shots + 
##     high.wage.ind
## 
##                    Df Sum of Sq    RSS    AIC
## - Defending         1     12.41 3395.6 765.25
## - Shooting          1     21.20 3404.4 766.04
## <none>                          3383.2 766.12
## - Height            1     23.23 3406.5 766.22
## - Foot              1     29.92 3413.1 766.83
## - Weight            1     39.66 3422.9 767.71
## - Pace              1     47.73 3431.0 768.43
## - power_long_shots  1     47.80 3431.0 768.44
## - Passing           1     92.79 3476.0 772.45
## - Physic            1    127.33 3510.6 775.50
## - Dribbling         1    158.85 3542.1 778.25
## - Wage              1    175.07 3558.3 779.66
## - high.wage.ind     1    457.06 3840.3 803.15
## - age               1   2050.58 5433.8 910.05
## 
## Step:  AIC=765.25
## potential ~ Wage + age + Height + Weight + Foot + Pace + Shooting + 
##     Passing + Dribbling + Physic + power_long_shots + high.wage.ind
## 
##                    Df Sum of Sq    RSS    AIC
## - Shooting          1      9.40 3405.0 764.10
## <none>                          3395.6 765.25
## - Height            1     23.76 3419.4 765.39
## - Foot              1     32.27 3427.9 766.16
## - power_long_shots  1     37.75 3433.4 766.65
## - Weight            1     46.58 3442.2 767.44
## - Pace              1     63.46 3459.1 768.95
## - Passing           1    142.45 3538.1 775.90
## - Dribbling         1    166.04 3561.7 777.95
## - Wage              1    192.36 3588.0 780.22
## - Physic            1    234.56 3630.2 783.82
## - high.wage.ind     1    468.67 3864.3 803.07
## - age               1   2065.71 5461.3 909.61
## 
## Step:  AIC=764.1
## potential ~ Wage + age + Height + Weight + Foot + Pace + Passing + 
##     Dribbling + Physic + power_long_shots + high.wage.ind
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                          3405.0 764.10
## - Height            1     26.58 3431.6 764.49
## - Foot              1     35.72 3440.8 765.31
## - power_long_shots  1     41.25 3446.3 765.81
## - Weight            1     41.35 3446.4 765.81
## - Pace              1     63.46 3468.5 767.78
## - Passing           1    133.06 3538.1 773.90
## - Wage              1    192.77 3597.8 779.06
## - Dribbling         1    223.50 3628.5 781.68
## - Physic            1    225.42 3630.5 781.84
## - high.wage.ind     1    476.26 3881.3 802.42
## - age               1   2056.75 5461.8 907.63

summary(lm.model.1)

## 
## Call:
## lm(formula = potential ~ Wage + age + Height + Weight + Foot + 
##     Pace + Passing + Dribbling + Physic + power_long_shots + 
##     high.wage.ind, data = train_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8558 -2.3598 -0.0876  2.0254 11.7147 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.567e+01  7.978e+00   6.978 1.97e-11 ***
## Wage              4.511e-05  1.102e-05   4.094 5.48e-05 ***
## age              -7.024e-01  5.253e-02 -13.371  < 2e-16 ***
## Height            7.350e-02  4.835e-02   1.520 0.129551    
## Weight           -9.727e-02  5.131e-02  -1.896 0.058947 .  
## FootRight         8.139e-01  4.619e-01   1.762 0.079085 .  
## Pace             -5.461e-02  2.325e-02  -2.349 0.019499 *  
## Passing           1.430e-01  4.206e-02   3.401 0.000764 ***
## Dribbling         2.192e-01  4.973e-02   4.408 1.46e-05 ***
## Physic            1.394e-01  3.148e-02   4.427 1.35e-05 ***
## power_long_shots -3.990e-02  2.107e-02  -1.894 0.059242 .  
## high.wage.ind1    3.619e+00  5.625e-01   6.434 4.98e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.392 on 296 degrees of freedom
## Multiple R-squared:  0.7055, Adjusted R-squared:  0.6946 
## F-statistic: 64.48 on 11 and 296 DF,  p-value: < 2.2e-16

set.seed(853)
lm.model.2 <- lm(formula = potential ~ Wage + age + Foot + 
    Pace + Passing + Dribbling + Physic +  
    high.wage.ind, data = train_2)
summary(lm.model.2)

## 
## Call:
## lm(formula = potential ~ Wage + age + Foot + Pace + Passing + 
##     Dribbling + Physic + high.wage.ind, data = train_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9063 -2.4066 -0.1442  2.0998 12.2269 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.362e+01  2.319e+00  27.438  < 2e-16 ***
## Wage            4.906e-05  1.104e-05   4.444 1.25e-05 ***
## age            -7.083e-01  5.172e-02 -13.695  < 2e-16 ***
## FootRight       7.289e-01  4.637e-01   1.572 0.117054    
## Pace           -4.773e-02  2.165e-02  -2.205 0.028243 *  
## Passing         1.315e-01  4.085e-02   3.218 0.001430 ** 
## Dribbling       1.741e-01  4.540e-02   3.834 0.000153 ***
## Physic          1.273e-01  2.358e-02   5.399 1.37e-07 ***
## high.wage.ind1  3.701e+00  5.626e-01   6.578 2.13e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.424 on 299 degrees of freedom
## Multiple R-squared:  0.6968, Adjusted R-squared:  0.6887 
## F-statistic:  85.9 on 8 and 299 DF,  p-value: < 2.2e-16

set.seed(234)
lm.model.3 <- lm(formula = potential ~ Wage + age  + 
    Pace + Passing + Dribbling + Physic +  
    high.wage.ind, data = train_2)
summary(lm.model.3)

## 
## Call:
## lm(formula = potential ~ Wage + age + Pace + Passing + Dribbling + 
##     Physic + high.wage.ind, data = train_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7273 -2.3022 -0.0811  2.1240 12.4414 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.417e+01  2.297e+00  27.932  < 2e-16 ***
## Wage            4.997e-05  1.105e-05   4.521 8.86e-06 ***
## age            -7.059e-01  5.182e-02 -13.621  < 2e-16 ***
## Pace           -4.771e-02  2.170e-02  -2.198 0.028705 *  
## Passing         1.253e-01  4.076e-02   3.073 0.002312 ** 
## Dribbling       1.785e-01  4.543e-02   3.929 0.000106 ***
## Physic          1.276e-01  2.364e-02   5.401 1.35e-07 ***
## high.wage.ind1  3.684e+00  5.639e-01   6.533 2.76e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.433 on 300 degrees of freedom
## Multiple R-squared:  0.6943, Adjusted R-squared:  0.6872 
## F-statistic: 97.34 on 7 and 300 DF,  p-value: < 2.2e-16

set.seed(234)
lm.model.4 <- lm(formula = potential ~ Wage + age + Passing + Dribbling + Physic +  
    high.wage.ind, data = train_2)
summary(lm.model.4)

## 
## Call:
## lm(formula = potential ~ Wage + age + Passing + Dribbling + Physic + 
##     high.wage.ind, data = train_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7871 -2.2823 -0.1079  2.1976 12.3565 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.219e+01  2.127e+00  29.243  < 2e-16 ***
## Wage            5.245e-05  1.106e-05   4.741 3.29e-06 ***
## age            -6.743e-01  5.011e-02 -13.456  < 2e-16 ***
## Passing         1.407e-01  4.041e-02   3.481 0.000574 ***
## Dribbling       1.336e-01  4.085e-02   3.272 0.001194 ** 
## Physic          1.260e-01  2.377e-02   5.301 2.24e-07 ***
## high.wage.ind1  3.721e+00  5.672e-01   6.560 2.34e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.454 on 301 degrees of freedom
## Multiple R-squared:  0.6894, Adjusted R-squared:  0.6832 
## F-statistic: 111.3 on 6 and 301 DF,  p-value: < 2.2e-16

We will stop here. We have no more theoretical reasons to eliminate variables (as in the case of Wage for being highly correlated to potential) and now all sixs variables are highly significant (to a 99.9% confidence level) We try now our model in our validation set.

set.seed(45)
lm.model.4.validation <-lm(formula = potential ~ Wage + age + Passing + Dribbling + Physic + 
    high.wage.ind, data = test)
summary(lm.model.4.validation)

## 
## Call:
## lm(formula = potential ~ Wage + age + Passing + Dribbling + Physic + 
##     high.wage.ind, data = test)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.2917 -2.6137  0.1013  2.6351 12.2932 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.282e+01  2.977e+00  21.101  < 2e-16 ***
## Wage            6.932e-05  1.460e-05   4.747 3.93e-06 ***
## age            -7.072e-01  7.097e-02  -9.964  < 2e-16 ***
## Passing         1.565e-01  5.027e-02   3.114  0.00212 ** 
## Dribbling       9.798e-02  5.022e-02   1.951  0.05247 .  
## Physic          1.471e-01  3.259e-02   4.514 1.09e-05 ***
## high.wage.ind1  3.494e+00  7.901e-01   4.422 1.61e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.973 on 199 degrees of freedom
## Multiple R-squared:  0.623,  Adjusted R-squared:  0.6116 
## F-statistic:  54.8 on 6 and 199 DF,  p-value: < 2.2e-16

plot(lm.model.4.validation)

In reading our diagnostic plot we look for, Residuals vs Fitted: variance seems to be constant around the mean (homoskedasticity assumption checked!) Normal Q-Q plot: we can safely assume that it is normally distributed. Scale-Location: still check for homoskedaticity. Here we look for a (more or less) horizontal red line and that the spread around the line does not vary a lot. Residual vs Leverage: here we see how sensitive the residuals of our model (the difference between the observed and the predicted) are to high leverage points (imagine a leverage point at the end of our line changing the slope of our linear regression line). Here we check for heteroskedaticiy and non linearity. We also check if any leverage point is influencial or not (it has to be inside the Cook’s distance line, the red dotted one). We have none.

Our model is well behaved: it is normal, the variance is homogenous around the mean, not high leverage point and our plot confirms that out model best fit seems to be linear (red line in the residuals vs fit plot - but not perfect-) R squared has increased and, apart from Dribbling, the rest of the variables are highly significant.

3.2 Critique model using relevant diagnostics (5 marks)

So far we have created two relatively successful models: one that calculates the probability of belonging to the group of high industry wage earners in the football industry and another model that predicts, based on a number of variables of different degrees, total scores on the Potential variable (a sort of multivariate general score for footbal players…sort of like a total score x player).

Both models are sensitive to variations in data: more simply put, changes in data will produce relatively different results. ON the other hand, while our model may suffer of variance, it is biased towards this specific group and, as far as this data concerns, predictions should be accurate (i.e., bias vs variance dilemma)

Initially, our logistic model failed to produced an output: it took us a while to find the reason behind it.

Logistic models are very useful, but particularily sensitive to independence on the correlation matrix. As be increase the number of variables in our model, the likelihood of correlation between the variables increases. This is why in order to have a working model we had to remove a few variables.

Adding to this the fact that the model does not seem to be strongly normally dsitributed, and that models tend to perform worse on validation set, it is not a surprise that the model itself did not contribute with new information: from our EDA we knew about the role of the Potential variable on the high wage industry.

On the other hand, it did help to confirm our suspitions about the Potential variable and opened the door to the creation of our multiple regression model. Model 1 is useful to calculate the probability of belonging or not to the high earners group and model 2 is useful to identify which variables are statistically significant.

Our second model (the multiple regression one) performed much better in the validation set: the model was a lot more close to normal than the one in our training set; all the variables were highly significant apart from one. The model was also simpler: we have not used any interactions or log transformations (in part because there were not exccesively strong theoretical reasons to choose quadratic over linear relations)

3.3 Suggest improvements to your model (5 marks)

The data itself it is very interesting we could have tried other models with much less stringent conditions, such as lasso Regression (which basically performs variable selection).

Lasso Regression technique basically adds a penalty coefficient to our Residual Sums of Squares equation, forcing some coefficients (of our independent variables) to zero (in practice, performing variable selection)

Support vector machines techniques (used when we want to make classification when our decision boundaries are non linear) could have been very useful in here.

On the EDA side, clustering techniques can be very helpful in initial stages (but not particularily more efficient that trees which is the technique used here.)

We try interactions in our model and investigate if they provide better results (but note that we have choosen the one above based on the parsimony principle)

set.seed(1000)
interac.model.train <-
  lm(data=train,
    potential ~ I(age^2) + I(Wage^2) + I(Physic^2)+  Weight + Height + Pace + Shooting + Passing + Dribbling +
      Defending + power_long_shots + power_strength 
  )
#here we suspect that age, wage and physic may be better represented by an inverted U:  as we grow older the relation with potential decreases. The same applies to Physic and Wage - wage decreases as our performance goes down)
interac.model.train.1 <- step(interac.model.train)

## Start:  AIC=851.88
## potential ~ I(age^2) + I(Wage^2) + I(Physic^2) + Weight + Height + 
##     Pace + Shooting + Passing + Dribbling + Defending + power_long_shots + 
##     power_strength
## 
##                    Df Sum of Sq    RSS    AIC
## - power_strength    1      8.98 4507.6 850.50
## - Weight            1     22.88 4521.5 851.44
## <none>                          4498.7 851.88
## - Height            1     30.70 4529.3 851.98
## - Defending         1     34.07 4532.7 852.21
## - Shooting          1     70.00 4568.7 854.64
## - Pace              1     76.15 4574.8 855.05
## - I(Physic^2)       1     90.30 4589.0 856.00
## - Passing           1    113.94 4612.6 857.59
## - power_long_shots  1    138.24 4636.9 859.20
## - I(Wage^2)         1    186.76 4685.4 862.41
## - Dribbling         1    344.05 4842.7 872.58
## - I(age^2)          1   1871.58 6370.2 957.02
## 
## Step:  AIC=850.5
## potential ~ I(age^2) + I(Wage^2) + I(Physic^2) + Weight + Height + 
##     Pace + Shooting + Passing + Dribbling + Defending + power_long_shots
## 
##                    Df Sum of Sq    RSS    AIC
## - Height            1     26.89 4534.5 850.33
## <none>                          4507.6 850.50
## - Weight            1     32.44 4540.1 850.70
## - Defending         1     48.59 4556.2 851.80
## - Pace              1     68.81 4576.4 853.16
## - Shooting          1     72.43 4580.1 853.40
## - Passing           1    123.07 4630.7 856.79
## - power_long_shots  1    137.68 4645.3 857.76
## - I(Physic^2)       1    145.68 4653.3 858.29
## - I(Wage^2)         1    190.96 4698.6 861.27
## - Dribbling         1    339.43 4847.1 870.86
## - I(age^2)          1   1933.64 6441.3 958.44
## 
## Step:  AIC=850.33
## potential ~ I(age^2) + I(Wage^2) + I(Physic^2) + Weight + Pace + 
##     Shooting + Passing + Dribbling + Defending + power_long_shots
## 
##                    Df Sum of Sq    RSS    AIC
## - Weight            1     12.71 4547.2 849.19
## <none>                          4534.5 850.33
## - Defending         1     49.95 4584.5 851.70
## - Shooting          1     80.11 4614.6 853.72
## - Pace              1     93.67 4628.2 854.62
## - Passing           1    119.73 4654.2 856.35
## - power_long_shots  1    149.50 4684.0 858.32
## - I(Physic^2)       1    176.67 4711.2 860.10
## - I(Wage^2)         1    196.97 4731.5 861.42
## - Dribbling         1    329.08 4863.6 869.91
## - I(age^2)          1   2024.17 6558.7 962.00
## 
## Step:  AIC=849.19
## potential ~ I(age^2) + I(Wage^2) + I(Physic^2) + Pace + Shooting + 
##     Passing + Dribbling + Defending + power_long_shots
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                          4547.2 849.19
## - Defending         1     59.00 4606.2 851.16
## - Shooting          1     75.38 4622.6 852.25
## - Pace              1     81.44 4628.7 852.66
## - Passing           1    122.40 4669.6 855.37
## - power_long_shots  1    145.66 4692.9 856.90
## - I(Wage^2)         1    199.33 4746.6 860.40
## - I(Physic^2)       1    205.11 4752.3 860.78
## - Dribbling         1    348.79 4896.0 869.95
## - I(age^2)          1   2035.20 6582.4 961.11

summary.lm(interac.model.train.1)

## 
## Call:
## lm(formula = potential ~ I(age^2) + I(Wage^2) + I(Physic^2) + 
##     Pace + Shooting + Passing + Dribbling + Defending + power_long_shots, 
##     data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.084 -2.755 -0.168  2.365 10.372 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.009e+01  2.155e+00  23.244  < 2e-16 ***
## I(age^2)         -1.256e-02  1.088e-03 -11.549  < 2e-16 ***
## I(Wage^2)         2.875e-10  7.953e-11   3.614 0.000353 ***
## I(Physic^2)       9.538e-04  2.601e-04   3.666 0.000291 ***
## Pace             -5.928e-02  2.566e-02  -2.310 0.021563 *  
## Shooting          1.341e-01  6.033e-02   2.223 0.026988 *  
## Passing           1.525e-01  5.386e-02   2.832 0.004938 ** 
## Dribbling         2.857e-01  5.976e-02   4.781 2.75e-06 ***
## Defending         5.210e-02  2.650e-02   1.966 0.050191 .  
## power_long_shots -1.373e-01  4.443e-02  -3.090 0.002194 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.906 on 298 degrees of freedom
## Multiple R-squared:  0.6068, Adjusted R-squared:  0.5949 
## F-statistic: 51.09 on 9 and 298 DF,  p-value: < 2.2e-16

set.seed(234)
interac.model.train.2 <-
  lm(data=train,
    potential ~ I(age^2) + I(Wage^2) + I(Physic^2)+  Weight + Height  + Passing + Dribbling +
      power_long_shots + power_strength 
  )

summary.lm(interac.model.train.2)

## 
## Call:
## lm(formula = potential ~ I(age^2) + I(Wage^2) + I(Physic^2) + 
##     Weight + Height + Passing + Dribbling + power_long_shots + 
##     power_strength, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1785 -2.7020 -0.3872  2.5664 11.5730 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.237e+01  8.721e+00   3.711 0.000246 ***
## I(age^2)         -1.150e-02  1.087e-03 -10.578  < 2e-16 ***
## I(Wage^2)         3.086e-10  8.040e-11   3.838 0.000151 ***
## I(Physic^2)       1.412e-03  4.780e-04   2.955 0.003380 ** 
## Weight           -6.027e-02  6.260e-02  -0.963 0.336442    
## Height            1.228e-01  5.569e-02   2.204 0.028269 *  
## Passing           1.917e-01  4.878e-02   3.930 0.000106 ***
## Dribbling         2.684e-01  5.193e-02   5.168 4.34e-07 ***
## power_long_shots -5.756e-02  2.430e-02  -2.369 0.018486 *  
## power_strength   -3.218e-02  5.059e-02  -0.636 0.525207    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.965 on 298 degrees of freedom
## Multiple R-squared:  0.5949, Adjusted R-squared:  0.5827 
## F-statistic: 48.63 on 9 and 298 DF,  p-value: < 2.2e-16

set.seed(4567)
interac.model.train.3 <- lm(data=train,
    potential ~ I(age^2) + I(Wage^2) + I(Physic^2)+ Height  + Passing + Dribbling 
  )
summary.lm(interac.model.train.3)

## 
## Call:
## lm(formula = potential ~ I(age^2) + I(Wage^2) + I(Physic^2) + 
##     Height + Passing + Dribbling, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4957 -2.9707 -0.2527  2.3727 11.2482 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.880e+01  8.486e+00   4.572 7.07e-06 ***
## I(age^2)    -1.197e-02  1.066e-03 -11.234  < 2e-16 ***
## I(Wage^2)    3.394e-10  8.052e-11   4.215 3.31e-05 ***
## I(Physic^2)  1.120e-03  2.574e-04   4.353 1.84e-05 ***
## Height       6.493e-02  4.509e-02   1.440 0.150923    
## Passing      1.808e-01  4.621e-02   3.912 0.000113 ***
## Dribbling    2.145e-01  4.756e-02   4.509 9.33e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.005 on 301 degrees of freedom
## Multiple R-squared:  0.5825, Adjusted R-squared:  0.5742 
## F-statistic: 70.01 on 6 and 301 DF,  p-value: < 2.2e-16

set.seed(888)
interac.model.train.4<- lm(data=train,
    potential ~ I(age^2) + I(Wage^2) + I(Physic^2)+ Passing + Dribbling 
  )

summary.lm(interac.model.train.4)

## 
## Call:
## lm(formula = potential ~ I(age^2) + I(Wage^2) + I(Physic^2) + 
##     Passing + Dribbling, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1413 -2.9865 -0.4268  2.4123 11.5640 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.076e+01  1.735e+00  29.251  < 2e-16 ***
## I(age^2)    -1.210e-02  1.064e-03 -11.368  < 2e-16 ***
## I(Wage^2)    3.434e-10  8.061e-11   4.261 2.73e-05 ***
## I(Physic^2)  1.331e-03  2.122e-04   6.272 1.24e-09 ***
## Passing      1.771e-01  4.622e-02   3.831 0.000155 ***
## Dribbling    2.012e-01  4.674e-02   4.304 2.27e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.012 on 302 degrees of freedom
## Multiple R-squared:  0.5797, Adjusted R-squared:  0.5727 
## F-statistic:  83.3 on 5 and 302 DF,  p-value: < 2.2e-16

plot(interac.model.train.4)

Compared with our choosen model, this model has a slightly good R squared value and all its variables are statistically significant: Age, Wage, Physic, Passing and dribbling.

On the other hand, it has the disadvantage of being more complicated due to the interactions terms. Plotting wise, it seems to be more normally distributed than our model, and its variance seems to be homegeneous around the mean. One observation is inside the Cooks’ distance line and if we were to use this model, we must remove it first (otherwise it will distort the residual sum of squares)

we will try a last improvement to our model: this time taking the natural log of our dependent variable.

set.seed(1300)
log.model.train <-lm(formula = log(potential) ~ age + Pace + Passing + Dribbling + Physic + 
    power_long_shots, data = test, subset = -Wage)

summary.lm(log.model.train)

## 
## Call:
## lm(formula = log(potential) ~ age + Pace + Passing + Dribbling + 
##     Physic + power_long_shots, data = test, subset = -Wage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.15754 -0.03859 -0.00111  0.03629  0.15415 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.9407826  0.0477705  82.494  < 2e-16 ***
## age              -0.0090050  0.0011503  -7.828 2.87e-13 ***
## Pace             -0.0004355  0.0005599  -0.778 0.437607    
## Passing           0.0034402  0.0007987   4.307 2.60e-05 ***
## Dribbling         0.0040050  0.0011446   3.499 0.000576 ***
## Physic            0.0031646  0.0005034   6.287 2.01e-09 ***
## power_long_shots -0.0013386  0.0005071  -2.640 0.008958 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06323 on 199 degrees of freedom
## Multiple R-squared:  0.511,  Adjusted R-squared:  0.4963 
## F-statistic: 34.66 on 6 and 199 DF,  p-value: < 2.2e-16

#we now get rid of the Pace variable, 

set.seed(2344)
log.model.train <-lm(formula = log(potential) ~ age +  Passing + Dribbling + Physic + 
    power_long_shots, data = test, subset = -Wage)

summary.lm(log.model.train)

## 
## Call:
## lm(formula = log(potential) ~ age + Passing + Dribbling + Physic + 
##     power_long_shots, data = test, subset = -Wage)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.154021 -0.042438 -0.000225  0.035154  0.157621 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.9263053  0.0439517  89.332  < 2e-16 ***
## age              -0.0089470  0.0011468  -7.802 3.31e-13 ***
## Passing           0.0035285  0.0007898   4.467 1.32e-05 ***
## Dribbling         0.0034986  0.0009405   3.720 0.000259 ***
## Physic            0.0032565  0.0004888   6.662 2.55e-10 ***
## power_long_shots -0.0012499  0.0004936  -2.532 0.012109 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06317 on 200 degrees of freedom
## Multiple R-squared:  0.5096, Adjusted R-squared:  0.4973 
## F-statistic: 41.56 on 5 and 200 DF,  p-value: < 2.2e-16

#now we get rid of the power long shots 
set.seed(234)
log.model.train <-lm(formula = log(potential) ~ age +  Passing + Dribbling + Physic , data = test, subset = -Wage)

summary.lm(log.model.train)

## 
## Call:
## lm(formula = log(potential) ~ age + Passing + Dribbling + Physic, 
##     data = test, subset = -Wage)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.150582 -0.041249 -0.003104  0.036883  0.163460 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.9693884  0.0410664  96.658  < 2e-16 ***
## age         -0.0094863  0.0011419  -8.308 1.43e-14 ***
## Passing      0.0033202  0.0007960   4.171 4.51e-05 ***
## Dribbling    0.0021828  0.0007944   2.748  0.00654 ** 
## Physic       0.0032486  0.0004953   6.559 4.50e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06402 on 201 degrees of freedom
## Multiple R-squared:  0.4938, Adjusted R-squared:  0.4838 
## F-statistic: 49.03 on 4 and 201 DF,  p-value: < 2.2e-16

plot(log.model.train)

Our model is very well behaved in terms of homoskedasticity and normality. A word of caution on heteroskedascity vs homoskedasticity: while heteroskedasticiy does not influence the value of the coefficients in our regression line, it does make them much less precise (in other words, we may end up having a much smaller p value that we should). In dealing with hetersokedasticity we either remove or replace the variables that are causing it or we use weigthed regression (basically we give small weights to the variables responsable for the high variance)

There are no high leverage points in our log model.

We will now interpret the coefficients

set.seed(2000)
model.coef <- coef(log.model.train)
model.coef <- ((exp(model.coef)-1)*100) # here we are transforming the coefficients of our independent variables. We need to do this because our dependent variable is in logs, and the interpretation differs from the non log/linear relation. 
model.coef

##  (Intercept)          age      Passing    Dribbling       Physic 
## 5195.2137539   -0.9441404    0.3325703    0.2185213    0.3253851

For instance, in here one unit increase in score in Passing (while keeping the other variables constant), increases the potential by 33%; one unit increase in Physic does it by a 33%, and dribbling by a 22% (always keeping the other variables constant) On the down side, this model explains just a 47% of the variation in the dependent variable, which it is not great.

4. Extension work (10 marks)

4.1 Model the likelihood of a player having a weekly wage above 8000 Euro (using the high.wage.ind variable provided).

We start by going back to our logistic model

summary.glm(model.2.validation)

## 
## Call:
## glm(formula = test$high.wage.ind ~ potential + age + Shooting + 
##     Dribbling + Defending + power_long_shots + power_strength, 
##     family = binomial, data = test)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.78671  -0.43404  -0.09763   0.32179   2.46522  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -46.30521    6.82374  -6.786 1.15e-11 ***
## potential          0.33821    0.06975   4.849 1.24e-06 ***
## age                0.27254    0.07784   3.501 0.000463 ***
## Shooting           0.09904    0.05593   1.771 0.076611 .  
## Dribbling          0.12184    0.04827   2.524 0.011597 *  
## Defending          0.03812    0.02108   1.808 0.070627 .  
## power_long_shots  -0.09178    0.04375  -2.098 0.035917 *  
## power_strength     0.05223    0.02232   2.340 0.019289 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 264.05  on 205  degrees of freedom
## Residual deviance: 120.36  on 198  degrees of freedom
## AIC: 136.36
## 
## Number of Fisher Scoring iterations: 6

coeff.model <- coefficients(model.2.validation)

coeff.model

##      (Intercept)        potential              age         Shooting 
##     -46.30520531       0.33820531       0.27254203       0.09904285 
##        Dribbling        Defending power_long_shots   power_strength 
##       0.12184283       0.03811537      -0.09178418       0.05223001

# a few modifications ahead of our contingency table (we want the table to have Yes and No on each side )
test <- test %>% mutate(high.wage.ind = ifelse(high.wage.ind == 1, "Yes","No"))
test$high.wage.ind <- as.factor(test$high.wage.ind)

set.seed(2344)
probabilities <- model.2.validation %>% predict(test, type = "response")

predicted.classes <- ifelse(probabilities > 0.5, "Yes", "No")

#check the number of observations (here we our working on our validation set)
table(test$high.wage.ind)

## 
##  No Yes 
## 136  70

#Finally we create a table to assess our model 
#We have the observed values on our rows and our predicted values on column 
table(predicted.classes, test$high.wage.ind)

##                  
## predicted.classes  No Yes
##               No  125  13
##               Yes  11  57

Our model corrected predicted 125 No cases, but missclassified 11 cases. Similarily, corrected classified 57 high earners (Yes), but missclassified 13. This means that correctly classified the no cases 91% of the time (201/214) with a 9% of false negative and correctly classified the Yes cases 81% of the time and a 19% of false positive. While these numbers may seem encouragin, the problem here is that our sensitivity is low: 50%. We can try and adjust to be a bit more strict (50% is not better than chance alone)

set.seed(422)
predicted.classes <- ifelse(probabilities > 0.80, "Yes", "No")
table(predicted.classes, test$high.wage.ind)

##                  
## predicted.classes  No Yes
##               No  136  28
##               Yes   0  42

In here out of 136 observed No’s, 136were correctly predicted.

On the Yes side, 42 observations were correctly predicted using this model (60%) Considering that we are mainly interested in predicting who will became a high earner -YES- (we want to predict who will became a high earner rather than who won’t), our model has a very low predictive power when the sensitivity is high and a, relatively good prediction power when the sensitivity is low (but close to chance)

Having assessed the accuracy of our binary model, we proceed to write down our regression model.

The formula for multiple binary regression would be, p= exp(b0 +b1X1 +b2X2+…BnXn)/1+exp(b0+b1X1+b2X2+…bnXn)

#We can use the coefficients already calculated and in odds ratio.
coeff.model

##      (Intercept)        potential              age         Shooting 
##     -46.30520531       0.33820531       0.27254203       0.09904285 
##        Dribbling        Defending power_long_shots   power_strength 
##       0.12184283       0.03811537      -0.09178418       0.05223001

p= -48.5 + 0.37(Potential) + 0.22(Age) + 0.089(Weight) + 0.162(Dribbling) - 0.04(Power)/ 1+ ( -48.5 + 0.37(Potential) + 0.22(Age) + 0.089(Weight) + 0.162(Dribbling) 0.04(Power) )

We can also plot the distribution of the probabilities

hist(
  model.2.validation$fitted.values,
  xlab = "Probability of Being a High Earner",
  col = "red",
  main = "Distribution of the Predicted Probability",
  breaks = 10, 
  ylim = c(0, 170))

References

Add any references here. NB You can either do this manually or automatically with a .bib file (which then must be submitted along with your .Rmd file). See the RMarkdown documentation for guidance.

CS5801 Coursework Template Proforma

student-id

29 January, 2022