Blog5: Model Selection

Jie Zou

2022-05-20

Data Structure After Feature Engineering

## Rows: 7,907
## Columns: 13
## $ year          <int> 2014, 2014, 2006, 2010, 2007, 2017, 2007, 2001, 2011, 20…
## $ selling_price <int> 450000, 370000, 158000, 225000, 130000, 440000, 96000, 4…
## $ km_driven     <int> 145500, 120000, 140000, 127000, 120000, 45000, 175000, 5…
## $ fuel          <chr> "Diesel", "Diesel", "Petrol", "Diesel", "Petrol", "Petro…
## $ seller_type   <chr> "Individual", "Individual", "Individual", "Individual", …
## $ transmission  <chr> "Manual", "Manual", "Manual", "Manual", "Manual", "Manua…
## $ seats         <int> 5, 5, 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 7, 5, 5, 5,…
## $ brand         <chr> "Maruti", "Skoda", "Honda", "Hyundai", "Maruti", "Hyunda…
## $ mileage_kmpl  <dbl> 23.40, 21.14, 17.70, 23.00, 16.10, 20.14, 17.30, 16.10, …
## $ max_power_bhp <dbl> 74.00, 103.52, 78.00, 90.00, 88.20, 81.86, 57.50, 37.00,…
## $ torque_nm     <dbl> 190.0, 250.0, 117.6, 215.6, 107.8, 113.0, 68.6, 59.0, 17…
## $ engine_cc     <int> 1248, 1498, 1497, 1396, 1298, 1197, 1061, 796, 1364, 139…
## $ own           <chr> "First", "Second", "Third", "First", "First", "First", "…

Numeric Correlations and Distributions

From the distributions, we see that:

  • km_driven, max_power_bhp, and the target (selling_price) are right-skewed, so they need to be transformed during modeling.
  • engine_cc has many peaks.
  • torque_nm looks bimodal, but its shape is irregular.
  • mileage_kmpl is roughly normal.
  • seats: five and seven seats are the two most common seating capacities.
  • year: most listings are for cars made after 2000, which suggests buyers are more interested in them.
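
The right skew noted above is the reason for the log transforms used in the models. Here is a minimal sketch of how a log transform pulls in a long right tail. It uses simulated data rather than the car listings, and the variable `km_sim` and the helper `skewness_moment()` are made up for illustration.

```r
# Simulated right-skewed values standing in for a variable like km_driven.
# This is illustrative data, not the dataset used in this post.
set.seed(42)
km_sim <- rlnorm(5000, meanlog = 11, sdlog = 0.8)

# Hypothetical helper: moment-based sample skewness
# (about 0 for a symmetric sample, positive for a long right tail).
skewness_moment <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness_moment(km_sim)       # clearly positive: long right tail
skew_log <- skewness_moment(log(km_sim))  # close to zero after the transform

print(c(raw = skew_raw, log = skew_log))
```

The same check could be run on the real columns, for example `skewness_moment(d.copy$km_driven)`, assuming they contain no missing or non-positive values.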

Categorical Distributions

Modeling

Full Model

lm0 <- lm(selling_price ~ . , data = d.copy)
## [1] "Adjusted R-squared: 0.854"
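
The post shows the printed adjusted R-squared but not the code that printed it. One way to produce a line in that format is sketched below. The helper name `report_adj_r2()` is hypothetical, and the built-in `mtcars` data stands in for `d.copy`.

```r
# Hypothetical helper: format a fitted lm's adjusted R-squared as a string.
report_adj_r2 <- function(fit, digits = 3) {
  adj <- summary(fit)$adj.r.squared
  paste("Adjusted R-squared:", round(adj, digits))
}

# mtcars ships with R; it is a stand-in here, not the post's data.
fit_demo <- lm(mpg ~ ., data = mtcars)
msg <- report_adj_r2(fit_demo)
print(msg)
```

Calling the same helper on each fitted model would make the four adjusted R-squared values easy to compare side by side.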

Feature Transformed Full Model

lm0.ft <- lm(
  selling_price ~ year + log(km_driven) + fuel + seller_type + transmission +
    brand + mileage_kmpl + log(max_power_bhp) + torque_nm + engine_cc + own,
  data = d.copy
)
## [1] "Adjusted R-squared: 0.853"

Target Transformed Full Model

lm0.tt <- lm(log(selling_price) ~ ., data = d.copy)
## [1] "Adjusted R-squared: 0.913"

Both Feature and Target Transformed Full Model

lm0.ftt <- lm0.ft %>% 
  update(log(selling_price)~., data = d.copy)
## [1] "Adjusted R-squared: 0.917"
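
In `update()`, a `.` on the right-hand side of the new formula means "keep the existing predictors", so only the response changes. A small sketch on simulated data checks this. The data frame `toy` and its columns are invented for illustration.

```r
# Simulated data; column names are illustrative only.
set.seed(1)
toy <- data.frame(x1 = runif(200, 1, 10), x2 = rnorm(200))
toy$price <- exp(1 + 0.3 * log(toy$x1) + 0.2 * toy$x2 + rnorm(200, sd = 0.1))

# Feature-transformed fit on the raw response.
fit_ft <- lm(price ~ log(x1) + x2, data = toy)

# Swap in a log response; "." reuses the right-hand side of fit_ft.
fit_ftt <- update(fit_ft, log(price) ~ .)

# Refitting from scratch gives the same coefficients.
fit_direct <- lm(log(price) ~ log(x1) + x2, data = toy)
print(formula(fit_ftt))
print(all.equal(coef(fit_ftt), coef(fit_direct)))
```

The piped form used in the post is the same call. Passing `data = d.copy` again is harmless, since `update()` already reuses the arguments of the original call.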

Backward Elimination Using AIC
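
The call that produced the trace in this section is not shown. The `Start: AIC=` and `Step: AIC=` layout is what `stats::step()` prints, so the call was presumably something like `step(lm0.ftt, direction = "backward")`; that is an assumption. Below is a self-contained sketch of the same procedure on simulated data, with all names illustrative.

```r
# Simulated data with two real predictors and one pure-noise predictor.
set.seed(7)
n <- 500
sim <- data.frame(
  age   = runif(n, 0, 20),
  power = runif(n, 40, 200),
  noise = rnorm(n)
)
sim$price <- exp(13 - 0.1 * sim$age + 0.7 * log(sim$power) + rnorm(n, sd = 0.25))

full <- lm(log(price) ~ age + log(power) + noise, data = sim)

# Backward elimination: at each step, drop the term whose removal lowers
# AIC the most, and stop when no removal lowers it further.
# trace = 0 suppresses the step-by-step printout.
reduced <- step(full, direction = "backward", trace = 0)

print(formula(reduced))
print(c(full = extractAIC(full)[2], reduced = extractAIC(reduced)[2]))
```

A noise predictor is usually, though not always, removed by this procedure; strong predictors such as `age` and `log(power)` here are kept.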

## Start:  AIC=-22580.38
## log(selling_price) ~ year + log(km_driven) + fuel + seller_type + 
##     transmission + brand + mileage_kmpl + log(max_power_bhp) + 
##     torque_nm + engine_cc + own
## 
##                      Df Sum of Sq     RSS    AIC
## - mileage_kmpl        1      0.01  449.17 -22582
## <none>                             449.16 -22580
## - seller_type         2      0.51  449.67 -22575
## - torque_nm           1      1.01  450.17 -22565
## - transmission        1      2.34  451.50 -22541
## - log(km_driven)      1      5.09  454.25 -22493
## - own                 4     13.75  462.91 -22350
## - fuel                3     18.32  467.48 -22270
## - engine_cc           1     19.23  468.39 -22251
## - log(max_power_bhp)  1     88.46  537.62 -21161
## - brand              30    258.62  707.78 -19045
## - year                1    578.06 1027.22 -16042
## 
## Step:  AIC=-22582.25
## log(selling_price) ~ year + log(km_driven) + fuel + seller_type + 
##     transmission + brand + log(max_power_bhp) + torque_nm + engine_cc + 
##     own
## 
##                      Df Sum of Sq     RSS    AIC
## <none>                             449.17 -22582
## - seller_type         2      0.52  449.68 -22577
## - torque_nm           1      1.00  450.17 -22567
## - transmission        1      2.35  451.52 -22543
## - log(km_driven)      1      5.15  454.32 -22494
## - own                 4     13.74  462.91 -22352
## - fuel                3     21.64  470.80 -22216
## - engine_cc           1     25.23  474.39 -22152
## - log(max_power_bhp)  1     88.97  538.14 -21155
## - brand              30    268.85  718.02 -18934
## - year                1    688.00 1137.17 -15240
## 
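
The AIC values in the trace above can be checked by hand. For a linear model, `step()` relies on `extractAIC()`, which computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The sketch below plugs in the rounded RSS values from the trace. The sample size n = 7906 and the coefficient counts (47 with `mileage_kmpl`, 46 without) are inferred from the final summary, which reports 7860 residual degrees of freedom for 46 coefficients, so small rounding differences are expected.

```r
# AIC as used by step() for lm fits: n * log(RSS / n) + 2 * edf.
aic_lm <- function(rss, n, edf) n * log(rss / n) + 2 * edf

n_obs <- 7906  # 7860 residual df + 46 coefficients in the final model

aic_start <- aic_lm(rss = 449.16, n = n_obs, edf = 47)  # with mileage_kmpl
aic_step  <- aic_lm(rss = 449.17, n = n_obs, edf = 46)  # mileage_kmpl dropped

print(round(c(start = aic_start, step = aic_step), 1))
```

Dropping `mileage_kmpl` barely changes the RSS but saves one parameter, which is why the AIC falls by about 2.
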
## Call:
## lm(formula = log(selling_price) ~ year + log(km_driven) + fuel + 
##     seller_type + transmission + brand + log(max_power_bhp) + 
##     torque_nm + engine_cc + own, data = d.copy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27667 -0.14534  0.00972  0.15144  1.74023 
## 
## Coefficients:
##                               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                 -2.111e+02  2.023e+00 -104.361  < 2e-16 ***
## year                         1.098e-01  1.000e-03  109.724  < 2e-16 ***
## log(km_driven)              -4.066e-02  4.281e-03   -9.497  < 2e-16 ***
## fuelDiesel                   2.853e-01  3.450e-02    8.268  < 2e-16 ***
## fuelLPG                      2.134e-01  5.253e-02    4.062 4.91e-05 ***
## fuelPetrol                   9.599e-02  3.374e-02    2.845 0.004454 ** 
## seller_typeIndividual       -2.421e-02  8.927e-03   -2.712 0.006702 ** 
## seller_typeTrustmark Dealer -4.071e-02  1.857e-02   -2.193 0.028362 *  
## transmissionManual          -7.115e-02  1.110e-02   -6.411 1.53e-10 ***
## brandAshok                  -4.477e-01  2.677e-01   -1.673 0.094461 .  
## brandAudi                    1.716e-01  1.275e-01    1.346 0.178398    
## brandBMW                     4.037e-01  1.235e-01    3.268 0.001088 ** 
## brandChevrolet              -6.271e-01  1.213e-01   -5.169 2.41e-07 ***
## brandDaewoo                  5.616e-02  1.833e-01    0.306 0.759344    
## brandDatsun                 -6.053e-01  1.242e-01   -4.873 1.12e-06 ***
## brandFiat                   -5.187e-01  1.261e-01   -4.114 3.92e-05 ***
## brandForce                  -5.186e-01  1.550e-01   -3.345 0.000827 ***
## brandFord                   -4.015e-01  1.209e-01   -3.320 0.000904 ***
## brandHonda                  -2.518e-01  1.210e-01   -2.080 0.037565 *  
## brandHyundai                -3.040e-01  1.207e-01   -2.519 0.011803 *  
## brandIsuzu                  -3.306e-01  1.614e-01   -2.049 0.040537 *  
## brandJaguar                  2.505e-01  1.252e-01    2.002 0.045361 *  
## brandJeep                   -8.108e-02  1.285e-01   -0.631 0.528168    
## brandKia                    -2.467e-01  1.700e-01   -1.451 0.146739    
## brandLand                    5.647e-01  1.558e-01    3.624 0.000292 ***
## brandLexus                   6.213e-01  1.285e-01    4.836 1.35e-06 ***
## brandMahindra               -3.894e-01  1.205e-01   -3.230 0.001242 ** 
## brandMaruti                 -2.218e-01  1.205e-01   -1.840 0.065765 .  
## brandMercedes-Benz           2.639e-01  1.257e-01    2.100 0.035793 *  
## brandMG                      3.187e-02  1.840e-01    0.173 0.862489    
## brandMitsubishi             -7.845e-02  1.361e-01   -0.577 0.564282    
## brandNissan                 -3.522e-01  1.233e-01   -2.857 0.004290 ** 
## brandOpel                   -1.125e-01  2.677e-01   -0.420 0.674280    
## brandRenault                -3.552e-01  1.216e-01   -2.922 0.003490 ** 
## brandSkoda                  -3.691e-01  1.227e-01   -3.008 0.002636 ** 
## brandTata                   -7.022e-01  1.206e-01   -5.823 6.02e-09 ***
## brandToyota                 -4.241e-02  1.208e-01   -0.351 0.725547    
## brandVolkswagen             -3.926e-01  1.218e-01   -3.224 0.001267 ** 
## brandVolvo                   2.039e-01  1.249e-01    1.633 0.102521    
## log(max_power_bhp)           7.366e-01  1.867e-02   39.458  < 2e-16 ***
## torque_nm                    3.788e-04  9.034e-05    4.193 2.78e-05 ***
## engine_cc                    2.464e-04  1.173e-05   21.011  < 2e-16 ***
## ownFourth & Above           -1.577e-01  2.008e-02   -7.851 4.68e-15 ***
## ownSecond                   -8.228e-02  7.009e-03  -11.740  < 2e-16 ***
## ownTest Drive Car            6.390e-01  1.097e-01    5.823 6.02e-09 ***
## ownThird                    -1.194e-01  1.202e-02   -9.934  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2391 on 7860 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.917,  Adjusted R-squared:  0.9166 
## F-statistic:  1931 on 45 and 7860 DF,  p-value: < 2.2e-16
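
Because the response is log(selling_price), the coefficients can be read as approximate percentage effects. The short sketch below uses three estimates copied from the table above. For an untransformed predictor or a dummy variable, exp(b) - 1 is the proportional change in price per one-unit change; for a logged predictor, the coefficient acts as an elasticity. The variable names are illustrative.

```r
# Estimates copied from the summary table above.
b_year   <- 1.098e-01   # year
b_manual <- -7.115e-02  # transmissionManual
b_power  <- 7.366e-01   # log(max_power_bhp)

# Level predictor or dummy: percent change in price per one-unit change.
pct_year   <- 100 * (exp(b_year) - 1)
pct_manual <- 100 * (exp(b_manual) - 1)

# Logged predictor: percent change in price for a 10% increase in power.
pct_power_10 <- 100 * (1.10^b_power - 1)

print(round(c(year = pct_year, manual = pct_manual, power_10pct = pct_power_10), 1))
```

Holding the other predictors fixed, each additional model year is associated with a price roughly 12% higher, a manual transmission with a price roughly 7% lower than the reference transmission level, and 10% more power with a price roughly 7% higher.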

Conclusion

We have five models here: the full model, the feature-transformed model, the target-transformed model, the feature- and target-transformed model, and the reduced model from backward elimination. Judged by goodness of fit, the feature- and target-transformed model has the highest adjusted R-squared (0.917), but the reduced model is nearly identical (0.9166). If I have to choose one, I will take the reduced model: it has the lowest AIC in the backward elimination (-22582.25 versus -22580.38), its adjusted R-squared is close to the highest, and having one fewer predictor helps reduce the chance of overfitting.