IPL Player Pricing: a case study to identify the regression model that best determines which independent variables contribute to the dependent variable (the price of an IPL player)

IIMK ADSM 2020-21 Batch-2, Group 6: Ramana, Venugopal, Siju, Vikesh, Abdul

Load required packages

library(caret)
## Warning: package 'caret' was built under R version 4.0.4
## Loading required package: lattice
## Loading required package: ggplot2
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.0.4
## Loading required package: Matrix
## Loaded glmnet 4.1-1
library(psych)
## Warning: package 'psych' was built under R version 4.0.4
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(readxl)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:psych':
## 
##     logit

Load the data

setwd("C:/_MyData_/IIMK/Assignment 3_IPL Players Pricing")

iplData <- read_excel("Pricing of players IMB381-XLS-ENG.xls", sheet = "Modified Data")

names(iplData)
##  [1] "Sl.NO."          "PLAYER NAME"     "L25"             "B25-35"         
##  [5] "A35"             "Country"         "Team"            "PLAYING ROLE"   
##  [9] "BAT"             "BOW"             "ALL"             "BAT*SR"         
## [13] "BOW*ECO"         "BOW*SR-BL"       "BAT*RUN-S"       "BOW*WK-I"       
## [17] "BAT*T-RUNS"      "BAT*ODI-RUNS"    "BOW*WK-O"        "T-RUNS"         
## [21] "T-WKTS"          "ODI-RUNS"        "ODI-SR-B"        "ODI-WKTS"       
## [25] "ODI-SR-BL"       "CAPTAINCY EXP"   "INDIA"           "AUSTRALIA"      
## [29] "OTHERS"          "MTS"             "ALL*SR-B"        "ALL*SR-BL"      
## [33] "ALL*ECON"        "RUNS-S"          "HS"              "AVE"            
## [37] "SR -B"           "SIXERS"          "RUNS-C"          "WKTS"           
## [41] "AVE-BL"          "ECON"            "SR -BL"          "Year"           
## [45] "Base Price(US$)" "Sold Price(US$)" "S_B Price"       "SQRT(S-B)"
# Create another data frame with only the required columns as mentioned in the case study's PDF
attach(iplData)
analysisData <- data.frame(L25, `B25-35`, A35, `RUNS-S`, `RUNS-C`, HS, AVE, `AVE-BL`, `SR -B`,
                           `SR -BL`, SIXERS, WKTS, ECON, `CAPTAINCY EXP`, `ODI-SR-B`, `ODI-SR-BL`,
                           `ODI-RUNS`, `ODI-WKTS`, `T-RUNS`, `T-WKTS`, BAT, BOW, ALL,
                           INDIA, AUSTRALIA, OTHERS, Year, `SQRT(S-B)`)
detach(iplData)
names(analysisData)
##  [1] "L25"           "B25.35"        "A35"           "RUNS.S"       
##  [5] "RUNS.C"        "HS"            "AVE"           "AVE.BL"       
##  [9] "SR..B"         "SR..BL"        "SIXERS"        "WKTS"         
## [13] "ECON"          "CAPTAINCY.EXP" "ODI.SR.B"      "ODI.SR.BL"    
## [17] "ODI.RUNS"      "ODI.WKTS"      "T.RUNS"        "T.WKTS"       
## [21] "BAT"           "BOW"           "ALL"           "INDIA"        
## [25] "AUSTRALIA"     "OTHERS"        "Year"          "SQRT.S.B."

Check Linear Regression and VIF for multicollinearity

lnModel <- lm(SQRT.S.B. ~ ., data = analysisData)
summary(lnModel)
## 
## Call:
## lm(formula = SQRT.S.B. ~ ., data = analysisData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -660.21 -180.79  -20.17  172.48  653.92 
## 
## Coefficients: (3 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -5.466e+04  3.939e+04  -1.388  0.16810   
## L25            1.977e+02  1.300e+02   1.521  0.13124   
## B25.35         5.671e+01  7.657e+01   0.741  0.46055   
## A35                   NA         NA      NA       NA   
## RUNS.S         1.689e-01  1.135e-01   1.488  0.13980   
## RUNS.C         1.251e-01  7.861e-02   1.592  0.11443   
## HS            -1.267e+00  1.773e+00  -0.714  0.47658   
## AVE            3.742e+00  5.089e+00   0.735  0.46373   
## AVE.BL         6.326e+00  7.159e+00   0.884  0.37896   
## SR..B         -8.626e-01  8.834e-01  -0.976  0.33107   
## SR..BL        -8.107e+00  9.862e+00  -0.822  0.41290   
## SIXERS         1.558e+00  2.279e+00   0.684  0.49573   
## WKTS           1.097e+00  9.992e-01   1.098  0.27483   
## ECON          -1.547e+00  7.452e+00  -0.208  0.83597   
## CAPTAINCY.EXP  1.481e+02  8.290e+01   1.787  0.07685 . 
## ODI.SR.B       5.022e-01  1.159e+00   0.433  0.66562   
## ODI.SR.BL     -2.326e+00  1.130e+00  -2.058  0.04204 * 
## ODI.RUNS       2.777e-02  2.097e-02   1.324  0.18842   
## ODI.WKTS       8.926e-01  5.197e-01   1.718  0.08880 . 
## T.RUNS        -3.139e-02  2.004e-02  -1.566  0.12030   
## T.WKTS        -3.327e-01  3.969e-01  -0.838  0.40389   
## BAT           -3.860e+01  9.942e+01  -0.388  0.69864   
## BOW           -5.220e+01  8.093e+01  -0.645  0.52031   
## ALL                   NA         NA      NA       NA   
## INDIA          2.092e+02  7.290e+01   2.869  0.00498 **
## AUSTRALIA      8.518e+01  7.869e+01   1.083  0.28148   
## OTHERS                NA         NA      NA       NA   
## Year           2.730e+01  1.960e+01   1.393  0.16658   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 265.1 on 105 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.518,  Adjusted R-squared:  0.4078 
## F-statistic: 4.702 on 24 and 105 DF,  p-value: 1.375e-08
anova(lnModel)
## Analysis of Variance Table
## 
## Response: SQRT.S.B.
##                Df  Sum Sq Mean Sq F value    Pr(>F)    
## L25             1  591152  591152  8.4107  0.004544 ** 
## B25.35          1  145411  145411  2.0689  0.153307    
## RUNS.S          1 3460455 3460455 49.2344 2.306e-10 ***
## RUNS.C          1 1319905 1319905 18.7792 3.372e-05 ***
## HS              1   68240   68240  0.9709  0.326720    
## AVE             1   20144   20144  0.2866  0.593535    
## AVE.BL          1   32983   32983  0.4693  0.494833    
## SR..B           1    2604    2604  0.0371  0.847732    
## SR..BL          1  121628  121628  1.7305  0.191214    
## SIXERS          1   58337   58337  0.8300  0.364360    
## WKTS            1   52152   52152  0.7420  0.390983    
## ECON            1   10560   10560  0.1502  0.699091    
## CAPTAINCY.EXP   1  155703  155703  2.2153  0.139646    
## ODI.SR.B        1   21092   21092  0.3001  0.584984    
## ODI.SR.BL       1  308208  308208  4.3851  0.038662 *  
## ODI.RUNS        1   18984   18984  0.2701  0.604362    
## ODI.WKTS        1  113317  113317  1.6122  0.206983    
## T.RUNS          1  459410  459410  6.5364  0.011999 *  
## T.WKTS          1  130339  130339  1.8544  0.176184    
## BAT             1   19459   19459  0.2769  0.599878    
## BOW             1     137     137  0.0019  0.964886    
## INDIA           1  565628  565628  8.0476  0.005469 ** 
## AUSTRALIA       1  118686  118686  1.6886  0.196627    
## Year            1  136373  136373  1.9403  0.166581    
## Residuals     105 7379964   70285                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# vif(lnModel)
# This call errors with "there are aliased coefficients in the model" and
# blocks knitting the HTML, so it is commented out. Aliased coefficients mean
# that some predictors are perfectly correlated.
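
One way around the error, as a sketch: base R's alias() lists the perfectly collinear terms, which by the NA rows in the summary above should be A35, ALL and OTHERS; dropping them lets vif() run (vifModel is an illustrative name, and the VIFs may still be large given the correlations examined next).

# List the aliased (perfectly collinear) terms, then refit without them
alias(lnModel)$Complete
vifModel <- lm(SQRT.S.B. ~ . - A35 - ALL - OTHERS, data = analysisData)
vif(vifModel)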

Findings

  • The linear regression output shows few significant IVs relative to the 24 estimated coefficients, and three coefficients are not defined because of singularities.

  • INDIA is the strongest predictor (p ≈ 0.005); ODI.SR.BL is significant at the 5% level, and CAPTAINCY.EXP and ODI.WKTS only at the 10% level.

  • vif() fails with "aliased coefficients", meaning some predictors are perfectly collinear. So, check the correlation between all the numerical predictors.

  • As there are 28 IVs, plotting them all in one graph would be unreadable. Hence we split them into two graphs of 14 variables each.

pairs.panels(analysisData[1:14])

pairs.panels(analysisData[15:28])

Findings

  • The pairs panels show that the following variables are strongly correlated:
    • A35 and B25.35: -0.83
    • HS and RUNS.S: 0.83
    • AVE and HS, RUNS.S: 0.88, 0.77
    • SR..BL and AVE.BL: 0.98
    • SIXERS and HS, RUNS.S: 0.79, 0.87
    • ECON and SR..BL, AVE.BL: 0.69, 0.69
    • T.RUNS and ODI.RUNS: 0.89
    • T.WKTS and ODI.WKTS: 0.82
    • OTHERS and INDIA: -0.71
  • With so many correlated variables, plain linear regression is unreliable, so we move to the Lasso, Ridge and Elastic-Net regression models. A programmatic cross-check of these correlations is sketched below.
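
Before partitioning, the visual reading above can be cross-checked programmatically. A minimal sketch using caret::findCorrelation, assuming all columns of analysisData are numeric (they are, since lm() accepted them) and with a cutoff of 0.75 as our own choice:

# Flag columns involved in pairs with |r| > 0.75, as a cross-check on the
# pairs.panels() reading above
corMat <- cor(analysisData, use = "pairwise.complete.obs")
findCorrelation(corMat, cutoff = 0.75, names = TRUE)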
# PARTITION THE DATA INTO TRAINING AND TEST
set.seed(1234)
index <- sample(2, nrow(analysisData), replace = TRUE, prob = c(0.70, 0.30))
trainData <- analysisData[index == 1, ]
testData  <- analysisData[index == 2, ]

# Custom control parameters:
# use 10-fold cross-validation and repeat it 5 times
customControl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
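
An alternative worth noting, sketched here only: caret's createDataPartition draws a split stratified on the outcome, which keeps the price distribution similar in both sets (trainIdx, stratTrain and stratTest are illustrative names; the rest of the analysis uses the random split above).

# Stratified 70/30 split on the outcome; na.omit() drops rows with missing
# values so the quantile grouping is well defined
completeData <- na.omit(analysisData)
set.seed(1234)
trainIdx <- createDataPartition(completeData$SQRT.S.B., p = 0.70, list = FALSE)
stratTrain <- completeData[trainIdx, ]
stratTest  <- completeData[-trainIdx, ]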

Run Linear Regression Model

set.seed(1234)

linearModel <- train(SQRT.S.B. ~ ., trainData, method = "lm", trControl = customControl)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## (the same warning repeats once per resample, 50 times in total)
linearModel$results
##   intercept     RMSE  Rsquared     MAE   RMSESD RsquaredSD    MAESD
## 1      TRUE 336.4882 0.2917763 275.102 91.55101  0.2285392 70.13945
linearModel
## Linear Regression 
## 
## 99 samples
## 27 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 89, 90, 88, 88, 89, 90, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE    
##   336.4882  0.2917763  275.102
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
summary(linearModel)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -545.05 -151.59  -28.63  163.15  665.80 
## 
## Coefficients: (3 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -4.174e+04  4.704e+04  -0.887  0.37776   
## L25            1.404e+02  1.626e+02   0.864  0.39057   
## B25.35        -4.947e+01  9.922e+01  -0.499  0.61955   
## A35                   NA         NA      NA       NA   
## RUNS.S         1.363e-01  1.325e-01   1.029  0.30671   
## RUNS.C         1.635e-01  8.985e-02   1.820  0.07288 . 
## HS            -1.747e+00  2.649e+00  -0.659  0.51163   
## AVE            3.581e+00  6.381e+00   0.561  0.57632   
## AVE.BL         7.864e+00  1.011e+01   0.778  0.43922   
## SR..B         -4.067e-01  1.037e+00  -0.392  0.69607   
## SR..BL        -1.016e+01  1.329e+01  -0.765  0.44697   
## SIXERS         2.141e+00  2.641e+00   0.811  0.42001   
## WKTS           5.883e-01  1.164e+00   0.505  0.61476   
## ECON          -3.427e+00  8.566e+00  -0.400  0.69024   
## CAPTAINCY.EXP  2.005e+02  9.952e+01   2.014  0.04762 * 
## ODI.SR.B       1.988e-01  1.426e+00   0.139  0.88945   
## ODI.SR.BL     -1.193e+00  1.521e+00  -0.784  0.43547   
## ODI.RUNS       3.942e-02  2.503e-02   1.575  0.11953   
## ODI.WKTS       5.995e-01  6.462e-01   0.928  0.35657   
## T.RUNS        -4.411e-02  2.377e-02  -1.856  0.06749 . 
## T.WKTS        -7.290e-02  5.186e-01  -0.141  0.88859   
## BAT           -3.827e+01  1.241e+02  -0.308  0.75869   
## BOW           -1.477e+01  9.679e+01  -0.153  0.87915   
## ALL                   NA         NA      NA       NA   
## INDIA          2.504e+02  8.558e+01   2.926  0.00456 **
## AUSTRALIA      1.624e+02  9.338e+01   1.739  0.08616 . 
## OTHERS                NA         NA      NA       NA   
## Year           2.086e+01  2.341e+01   0.891  0.37571   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 277.3 on 74 degrees of freedom
## Multiple R-squared:  0.5413, Adjusted R-squared:  0.3925 
## F-statistic: 3.638 on 24 and 74 DF,  p-value: 1.038e-05

Findings

  • The cross-validated RMSE is 336.4882, and few IVs are significant: INDIA at the 1% level (p ≈ 0.005) and CAPTAINCY.EXP at the 5% level.
  • The repeated rank-deficient-fit warnings confirm the perfectly collinear (aliased) predictors found earlier.

Run Ridge Regression Model

set.seed(1234)

ridgeModel <- train(SQRT.S.B. ~ ., trainData, method = "glmnet",
                    tuneGrid = expand.grid(alpha = 0, lambda = seq(0.0001, 1, length = 5)),
                    trControl = customControl)
ridgeModel
## glmnet 
## 
## 99 samples
## 27 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 89, 90, 88, 88, 89, 90, ... 
## Resampling results across tuning parameters:
## 
##   lambda    RMSE      Rsquared   MAE     
##   0.000100  316.3624  0.3192681  259.4783
##   0.250075  316.3624  0.3192681  259.4783
##   0.500050  316.3624  0.3192681  259.4783
##   0.750025  316.3624  0.3192681  259.4783
##   1.000000  316.3624  0.3192681  259.4783
## 
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 1.
plot(ridgeModel)

plot(ridgeModel$finalModel, xvar = "lambda", label = TRUE)

plot(ridgeModel$finalModel, xvar = "dev", label = TRUE)

plot(varImp(ridgeModel, scale = TRUE))

Findings of Ridge Regression

  • The cross-validated RMSE is 316.3624, with alpha = 0 and lambda = 1 selected. Note that RMSE is identical across the entire lambda grid, so lambda = 1 is a tie-break rather than a genuine optimum; the grid is too narrow for ridge on this outcome scale (a wider grid is sketched after this list).
  • The variable-importance plot ranks the following variables as the top contributors to the dependent variable SQRT.S.B.:
    • CAPTAINCY.EXP
    • L25
    • OTHERS
    • INDIA
    • B25.35
    • AUSTRALIA
    • BAT
    • Year
    • ALL
    • A35
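
A quick follow-up sketch, under the assumption that the flat RMSE row in the table above means the penalty never bites at these lambda values: widen the lambda grid (the log-spaced range is our own choice) and re-plot (ridgeWide is an illustrative name).

# Ridge with a wider, log-spaced lambda grid; RMSE should now vary with lambda
set.seed(1234)
ridgeWide <- train(SQRT.S.B. ~ ., trainData, method = "glmnet",
                   tuneGrid = expand.grid(alpha = 0,
                                          lambda = 10^seq(-2, 4, length = 25)),
                   trControl = customControl)
plot(ridgeWide)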

Run a Lasso Regression Model

#LASSO REGRESSION
set.seed(1234)

lassoModel <- train(SQRT.S.B. ~ ., trainData, method = "glmnet",
                    tuneGrid = expand.grid(alpha = 1, lambda = seq(0.0001, 1, length = 5)),
                    trControl = customControl)
lassoModel
## glmnet 
## 
## 99 samples
## 27 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 89, 90, 88, 88, 89, 90, ... 
## Resampling results across tuning parameters:
## 
##   lambda    RMSE      Rsquared   MAE     
##   0.000100  334.8978  0.2925710  273.9198
##   0.250075  333.1932  0.2933637  272.5206
##   0.500050  330.7415  0.2946883  270.4532
##   0.750025  328.5459  0.2956393  268.7194
##   1.000000  326.5725  0.2963811  267.4339
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 1.
plot(lassoModel)

plot(lassoModel$finalModel, xvar = "lambda", label = TRUE)

plot(lassoModel$finalModel, xvar = "dev", label = TRUE)

plot(varImp(lassoModel, scale = TRUE))

Findings of Lasso Regression

  • The lowest RMSE is 326.5725, with alpha = 1 and lambda = 1 selected. RMSE is still decreasing at the upper edge of the lambda grid, so a larger penalty might do slightly better.
  • The variable-importance plot ranks the following variables as the top contributors to the dependent variable SQRT.S.B. (a check of which coefficients the lasso drops entirely follows this list):
    • CAPTAINCY.EXP
    • OTHERS
    • L25
    • INDIA
    • B25.35
    • BAT
    • Year
    • ALL
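
Unlike ridge, the lasso performs variable selection by shrinking some coefficients exactly to zero. A one-line sketch to see which predictors survive at the selected penalty:

# Coefficients of the final lasso fit at the chosen lambda; rows shown as "."
# are exactly zero, i.e. dropped from the model
coef(lassoModel$finalModel, s = lassoModel$bestTune$lambda)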

Run Elastic-Net Regression Model

set.seed(1234)

elasticModel <- train(SQRT.S.B. ~ ., trainData, method = "glmnet",
                      tuneGrid = expand.grid(alpha = seq(0, 1, length = 10),
                                             lambda = seq(0.0001, 1, length = 5)),
                      trControl = customControl)
elasticModel
## glmnet 
## 
## 99 samples
## 27 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 89, 90, 88, 88, 89, 90, ... 
## Resampling results across tuning parameters:
## 
##   alpha      lambda    RMSE      Rsquared   MAE     
##   0.0000000  0.000100  316.3624  0.3192681  259.4783
##   0.0000000  0.250075  316.3624  0.3192681  259.4783
##   0.0000000  0.500050  316.3624  0.3192681  259.4783
##   0.0000000  0.750025  316.3624  0.3192681  259.4783
##   0.0000000  1.000000  316.3624  0.3192681  259.4783
##   0.1111111  0.000100  334.9100  0.2925636  273.9604
##   0.1111111  0.250075  334.3818  0.2928674  273.5494
##   0.1111111  0.500050  333.2412  0.2936938  272.6696
##   0.1111111  0.750025  332.1910  0.2944880  271.8173
##   0.1111111  1.000000  331.2245  0.2952417  271.0123
##   0.2222222  0.000100  334.9710  0.2925712  274.0026
##   0.2222222  0.250075  334.2067  0.2929751  273.4094
##   0.2222222  0.500050  332.8832  0.2938572  272.3630
##   0.2222222  0.750025  331.6641  0.2947287  271.3450
##   0.2222222  1.000000  330.5676  0.2955208  270.3934
##   0.3333333  0.000100  335.0047  0.2925156  274.0316
##   0.3333333  0.250075  334.0872  0.2929957  273.3073
##   0.3333333  0.500050  332.5908  0.2939777  272.0950
##   0.3333333  0.750025  331.2130  0.2948665  270.9078
##   0.3333333  1.000000  329.9909  0.2957272  269.8724
##   0.4444444  0.000100  334.9045  0.2926822  273.9539
##   0.4444444  0.250075  333.8838  0.2931465  273.1414
##   0.4444444  0.500050  332.2628  0.2941103  271.8068
##   0.4444444  0.750025  330.7539  0.2950362  270.4597
##   0.4444444  1.000000  329.4005  0.2959617  269.4271
##   0.5555556  0.000100  334.9206  0.2926477  273.9453
##   0.5555556  0.250075  333.7673  0.2932004  273.0255
##   0.5555556  0.500050  331.9730  0.2942200  271.5281
##   0.5555556  0.750025  330.3179  0.2952342  270.1220
##   0.5555556  1.000000  328.8457  0.2961641  268.9796
##   0.6666667  0.000100  334.9076  0.2926138  273.9296
##   0.6666667  0.250075  333.6333  0.2932368  272.9061
##   0.6666667  0.500050  331.6687  0.2943346  271.2313
##   0.6666667  0.750025  329.8882  0.2954054  269.8003
##   0.6666667  1.000000  328.2660  0.2962625  268.4927
##   0.7777778  0.000100  334.8844  0.2925911  273.9061
##   0.7777778  0.250075  333.4956  0.2932748  272.7869
##   0.7777778  0.500050  331.3579  0.2944575  270.9290
##   0.7777778  0.750025  329.4522  0.2955357  269.4618
##   0.7777778  1.000000  327.6948  0.2963300  268.1568
##   0.8888889  0.000100  334.9018  0.2925972  273.9160
##   0.8888889  0.250075  333.3437  0.2933123  272.6548
##   0.8888889  0.500050  331.0437  0.2945838  270.6667
##   0.8888889  0.750025  328.9870  0.2955903  269.0731
##   0.8888889  1.000000  327.1217  0.2963792  267.7821
##   1.0000000  0.000100  334.8978  0.2925710  273.9198
##   1.0000000  0.250075  333.1932  0.2933637  272.5206
##   1.0000000  0.500050  330.7415  0.2946883  270.4532
##   1.0000000  0.750025  328.5459  0.2956393  268.7194
##   1.0000000  1.000000  326.5725  0.2963811  267.4339
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 1.
plot(elasticModel)

plot(varImp(elasticModel, scale = TRUE))

Findings of Elastic-Net Regression

  • The lowest RMSE is 316.3624, with alpha = 0 and lambda = 1 selected; in other words, the elastic-net search collapses to the ridge model.
  • The variable-importance plot ranks the following variables as the top contributors to the dependent variable SQRT.S.B.:
    • CAPTAINCY.EXP
    • L25
    • OTHERS
    • INDIA
    • B25.35
    • AUSTRALIA
    • BAT
    • Year
    • ALL
    • A35

Compare the 4 models

modelList <- list(linear = linearModel, lasso = lassoModel, ridge = ridgeModel, elastic = elasticModel)
modelComparison <- resamples(modelList)
summary(modelComparison)
## 
## Call:
## summary.resamples(object = modelComparison)
## 
## Models: linear, lasso, ridge, elastic 
## Number of resamples: 50 
## 
## MAE 
##              Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## linear   94.90724 232.0795 267.3368 275.1020 309.4318 445.5510    0
## lasso    99.61402 229.8958 259.2751 267.4339 303.2953 432.1984    0
## ridge   104.05325 223.4995 252.6031 259.4783 295.6658 425.4602    0
## elastic 104.05325 223.4995 252.6031 259.4783 295.6658 425.4602    0
## 
## RMSE 
##             Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## linear  105.5140 271.2796 324.3533 336.4882 397.4152 577.1842    0
## lasso   113.3644 261.6002 315.3057 326.5725 386.4215 556.1973    0
## ridge   121.0110 258.4110 303.6545 316.3624 371.6800 546.0811    0
## elastic 121.0110 258.4110 303.6545 316.3624 371.6800 546.0811    0
## 
## Rsquared 
##                 Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## linear  2.511387e-05 0.1055304 0.2460802 0.2917763 0.4686587 0.9383782    0
## lasso   3.279814e-06 0.1050043 0.2593284 0.2963811 0.4651699 0.9350176    0
## ridge   5.674389e-05 0.1250421 0.2896259 0.3192681 0.4853880 0.9375848    0
## elastic 5.674389e-05 0.1250421 0.2896259 0.3192681 0.4853880 0.9375848    0

Findings from comparing the models

  • The Ridge model has the lowest mean RMSE of the four; the Elastic-Net tuned to alpha = 0, so its resamples are identical to the Ridge model's. A box-and-whisker view of the resampling distributions is sketched after this list.
  • The following variables are common to the three regularized models (Ridge, Lasso and Elastic-Net) as the strongest contributors to the pricing of a player:
    • Captaincy experience (CAPTAINCY.EXP)
    • Age under 25 years (L25) or between 25 and 35 years (B25.35)
    • Player being from India, Australia or another country (INDIA, AUSTRALIA, OTHERS)
    • Player being a batsman or an all-rounder (BAT, ALL)
    • The year in which the player joined the IPL (Year)
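
The summary table above can also be read off a plot; caret's resamples objects have lattice plot methods, so a minimal sketch:

# Box-and-whisker view of the 50 resampled RMSE values per model; the ridge
# and elastic-net boxes coincide
bwplot(modelComparison, metric = "RMSE")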

Find the Best Model

ridgeModel$bestTune
##   alpha lambda
## 5     0      1
lassoModel$bestTune
##   alpha lambda
## 5     1      1
elasticModel$bestTune
##   alpha lambda
## 5     0      1
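
The same selection can be made programmatically rather than by eye; a small sketch over the model list defined above (cvRMSE is an illustrative name):

# Lowest cross-validated RMSE achieved by each model across its tuning grid
cvRMSE <- sapply(modelList, function(m) min(m$results$RMSE))
cvRMSE
names(which.min(cvRMSE))   # "ridge" (tied with "elastic")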

Findings

  • With the best tune (alpha = 0 and lambda = 1), five variables dominate the importance ranking for pricing:
    • CAPTAINCY.EXP
    • L25
    • OTHERS
    • INDIA
    • B25.35

Examine the coefficients of Ridge Model

bestModel <- ridgeModel$finalModel
coef(bestModel, s = ridgeModel$bestTune$lambda)
## 28 x 1 sparse Matrix of class "dgCMatrix"
##                           1
## (Intercept)   -4.325920e+04
## L25            1.437221e+02
## B25.35        -3.825291e+01
## A35           -1.318471e+01
## RUNS.S         1.047020e-01
## RUNS.C         1.228217e-01
## HS            -7.727037e-01
## AVE            1.890611e+00
## AVE.BL         1.288837e+00
## SR..B         -3.599746e-01
## SR..BL        -1.261645e+00
## SIXERS         2.366044e+00
## WKTS           9.222959e-01
## ECON          -1.844224e+00
## CAPTAINCY.EXP  1.751146e+02
## ODI.SR.B       2.992131e-01
## ODI.SR.BL     -1.046566e+00
## ODI.RUNS       2.546127e-02
## ODI.WKTS       6.195488e-01
## T.RUNS        -2.562485e-02
## T.WKTS        -1.071907e-01
## BAT           -2.404488e+01
## BOW            4.258354e+00
## ALL            2.140721e+01
## INDIA          1.204468e+02
## AUSTRALIA      3.344905e+01
## OTHERS        -1.347188e+02
## Year           2.166016e+01

Prediction comparing the Training Data and Testing Data

predictionOne <- predict(ridgeModel, trainData)
sqrt(mean((trainData$SQRT.S.B. - predictionOne)^2))
## [1] 241.6788
predictionTwo <- predict(ridgeModel, testData)
# The original run referenced testData$SQRT.S.B.. (an extra trailing dot);
# that column does not exist, $ returned NULL and the RMSE evaluated to NaN:
# sqrt(mean((testData$SQRT.S.B..-predictionTwo)^2))
## [1] NaN
# With the column name corrected:
sqrt(mean((testData$SQRT.S.B. - predictionTwo)^2))
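
Since the outcome SQRT.S.B. appears to be the square root of (sold price - base price), judging by the source column SQRT(S-B), a prediction can be put back on the dollar scale by squaring it; a sketch under that reading (impliedPremium is an illustrative name):

# Back-transform the model-scale prediction to an implied price premium (US$)
impliedPremium <- predict(ridgeModel, testData)^2
head(impliedPremium)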

Conclusions

  • The Ridge model's RMSE on the training data is 241.6788, lower than its cross-validated estimate of 316.3624, as expected.
  • The most important variables contributing to the pricing, with their ridge coefficients (on the SQRT.S.B. scale), are:
    • Captaincy experience (CAPTAINCY.EXP) = 175.1
    • Age under 25 years (L25) = 143.7
    • Player from a country other than India or Australia (OTHERS) = -134.7
    • Player from India (INDIA) = 120.4
    • Age between 25 and 35 years (B25.35) = -38.3
  • The NaN originally reported for the test RMSE was caused by a typo in the column name (SQRT.S.B.. instead of SQRT.S.B.); with the name corrected, the test RMSE can be computed and compared against the training value.