Data Structure After Feature Engineering
## Rows: 7,907
## Columns: 13
## $ year <int> 2014, 2014, 2006, 2010, 2007, 2017, 2007, 2001, 2011, 20…
## $ selling_price <int> 450000, 370000, 158000, 225000, 130000, 440000, 96000, 4…
## $ km_driven <int> 145500, 120000, 140000, 127000, 120000, 45000, 175000, 5…
## $ fuel <chr> "Diesel", "Diesel", "Petrol", "Diesel", "Petrol", "Petro…
## $ seller_type <chr> "Individual", "Individual", "Individual", "Individual", …
## $ transmission <chr> "Manual", "Manual", "Manual", "Manual", "Manual", "Manua…
## $ seats <int> 5, 5, 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 7, 5, 5, 5,…
## $ brand <chr> "Maruti", "Skoda", "Honda", "Hyundai", "Maruti", "Hyunda…
## $ mileage_kmpl <dbl> 23.40, 21.14, 17.70, 23.00, 16.10, 20.14, 17.30, 16.10, …
## $ max_power_bhp <dbl> 74.00, 103.52, 78.00, 90.00, 88.20, 81.86, 57.50, 37.00,…
## $ torque_nm <dbl> 190.0, 250.0, 117.6, 215.6, 107.8, 113.0, 68.6, 59.0, 17…
## $ engine_cc <int> 1248, 1498, 1497, 1396, 1298, 1197, 1061, 796, 1364, 139…
## $ own <chr> "First", "Second", "Third", "First", "First", "First", "…
Numeric Correlations and Distributions
From the distributions, we see that:
- km_driven, max_power_bhp and the target (selling_price) are right-skewed and need to be transformed before modeling.
- engine_cc has many peaks.
- torque_nm looks bimodal, though its shape is irregular.
- mileage_kmpl is roughly normal.
- seats: five and seven seats are the two most common seating capacities.
- year: most listings are for cars made after 2000, which suggests buyers are more interested in these.
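The skewness noted for km_driven, max_power_bhp and selling_price can also be checked numerically rather than by eye. The sketch below is only an illustration: `sample_skewness` is a hypothetical helper, and `km_sim` is simulated data standing in for a right-skewed column, not the actual used-car data.

```r
# Moment-based sample skewness (illustrative helper, not part of the original analysis).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Simulated stand-in for a right-skewed column such as km_driven.
set.seed(42)
km_sim <- rlnorm(1000, meanlog = 11, sdlog = 1)

skew_raw <- sample_skewness(km_sim)       # positive: long right tail
skew_log <- sample_skewness(log(km_sim))  # close to zero: roughly symmetric
print(c(raw = skew_raw, logged = skew_log))
```

A value well above zero points to a long right tail. If the log of the variable gives a value near zero, a log transform is a reasonable candidate for the model formula.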
Categorical Distributions
Modeling
Full Model
lm0 <- lm(selling_price ~ ., data = d.copy)
## [1] "Adjusted R-squared: 0.854"
Feature Transformed Full Model
lm0.ft <- lm(selling_price ~ year + log(km_driven) + fuel + seller_type +
               transmission + brand + mileage_kmpl + log(max_power_bhp) +
               torque_nm + engine_cc + own, data = d.copy)
## [1] "Adjusted R-squared: 0.853"
Target Transformed Full Model
lm0.tt <- lm(log(selling_price) ~ ., data = d.copy)
## [1] "Adjusted R-squared: 0.913"
Both Feature and Target Transformed Full Model
lm0.ftt <- lm0.ft %>%
  update(log(selling_price) ~ ., data = d.copy)
## [1] "Adjusted R-squared: 0.917"
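The `update()` call keeps the right-hand side of the earlier model (`~ .`) and swaps in a logged response. The sketch below shows that mechanism on R's built-in `mtcars` data, which is only a stand-in here. The `adj_r2` helper is hypothetical; it is one plausible way to produce the "Adjusted R-squared" lines printed in this section, not necessarily the code the report used.

```r
# Stand-in data: built-in mtcars, NOT the used-car data from this report.
fit_raw <- lm(mpg ~ wt + log(hp), data = mtcars)

# "~ ." reuses every predictor from fit_raw; only the response changes.
fit_log <- update(fit_raw, log(mpg) ~ .)

# Hypothetical helper: pull adjusted R-squared out of the model summary.
adj_r2 <- function(fit) summary(fit)$adj.r.squared

print(formula(fit_log))
print(paste("Adjusted R-squared:", round(adj_r2(fit_log), 3)))
```

Building the variant with `update()` rather than retyping the formula keeps the predictor set identical across models, so any difference in fit comes from the transformation alone.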
Backward Elimination Using AIC
## Start: AIC=-22580.38
## log(selling_price) ~ year + log(km_driven) + fuel + seller_type +
## transmission + brand + mileage_kmpl + log(max_power_bhp) +
## torque_nm + engine_cc + own
##
## Df Sum of Sq RSS AIC
## - mileage_kmpl 1 0.01 449.17 -22582
## <none> 449.16 -22580
## - seller_type 2 0.51 449.67 -22575
## - torque_nm 1 1.01 450.17 -22565
## - transmission 1 2.34 451.50 -22541
## - log(km_driven) 1 5.09 454.25 -22493
## - own 4 13.75 462.91 -22350
## - fuel 3 18.32 467.48 -22270
## - engine_cc 1 19.23 468.39 -22251
## - log(max_power_bhp) 1 88.46 537.62 -21161
## - brand 30 258.62 707.78 -19045
## - year 1 578.06 1027.22 -16042
##
## Step: AIC=-22582.25
## log(selling_price) ~ year + log(km_driven) + fuel + seller_type +
## transmission + brand + log(max_power_bhp) + torque_nm + engine_cc +
## own
##
## Df Sum of Sq RSS AIC
## <none> 449.17 -22582
## - seller_type 2 0.52 449.68 -22577
## - torque_nm 1 1.00 450.17 -22567
## - transmission 1 2.35 451.52 -22543
## - log(km_driven) 1 5.15 454.32 -22494
## - own 4 13.74 462.91 -22352
## - fuel 3 21.64 470.80 -22216
## - engine_cc 1 25.23 474.39 -22152
## - log(max_power_bhp) 1 88.97 538.14 -21155
## - brand 30 268.85 718.02 -18934
## - year 1 688.00 1137.17 -15240
##
## Call:
## lm(formula = log(selling_price) ~ year + log(km_driven) + fuel +
## seller_type + transmission + brand + log(max_power_bhp) +
## torque_nm + engine_cc + own, data = d.copy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27667 -0.14534 0.00972 0.15144 1.74023
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.111e+02 2.023e+00 -104.361 < 2e-16 ***
## year 1.098e-01 1.000e-03 109.724 < 2e-16 ***
## log(km_driven) -4.066e-02 4.281e-03 -9.497 < 2e-16 ***
## fuelDiesel 2.853e-01 3.450e-02 8.268 < 2e-16 ***
## fuelLPG 2.134e-01 5.253e-02 4.062 4.91e-05 ***
## fuelPetrol 9.599e-02 3.374e-02 2.845 0.004454 **
## seller_typeIndividual -2.421e-02 8.927e-03 -2.712 0.006702 **
## seller_typeTrustmark Dealer -4.071e-02 1.857e-02 -2.193 0.028362 *
## transmissionManual -7.115e-02 1.110e-02 -6.411 1.53e-10 ***
## brandAshok -4.477e-01 2.677e-01 -1.673 0.094461 .
## brandAudi 1.716e-01 1.275e-01 1.346 0.178398
## brandBMW 4.037e-01 1.235e-01 3.268 0.001088 **
## brandChevrolet -6.271e-01 1.213e-01 -5.169 2.41e-07 ***
## brandDaewoo 5.616e-02 1.833e-01 0.306 0.759344
## brandDatsun -6.053e-01 1.242e-01 -4.873 1.12e-06 ***
## brandFiat -5.187e-01 1.261e-01 -4.114 3.92e-05 ***
## brandForce -5.186e-01 1.550e-01 -3.345 0.000827 ***
## brandFord -4.015e-01 1.209e-01 -3.320 0.000904 ***
## brandHonda -2.518e-01 1.210e-01 -2.080 0.037565 *
## brandHyundai -3.040e-01 1.207e-01 -2.519 0.011803 *
## brandIsuzu -3.306e-01 1.614e-01 -2.049 0.040537 *
## brandJaguar 2.505e-01 1.252e-01 2.002 0.045361 *
## brandJeep -8.108e-02 1.285e-01 -0.631 0.528168
## brandKia -2.467e-01 1.700e-01 -1.451 0.146739
## brandLand 5.647e-01 1.558e-01 3.624 0.000292 ***
## brandLexus 6.213e-01 1.285e-01 4.836 1.35e-06 ***
## brandMahindra -3.894e-01 1.205e-01 -3.230 0.001242 **
## brandMaruti -2.218e-01 1.205e-01 -1.840 0.065765 .
## brandMercedes-Benz 2.639e-01 1.257e-01 2.100 0.035793 *
## brandMG 3.187e-02 1.840e-01 0.173 0.862489
## brandMitsubishi -7.845e-02 1.361e-01 -0.577 0.564282
## brandNissan -3.522e-01 1.233e-01 -2.857 0.004290 **
## brandOpel -1.125e-01 2.677e-01 -0.420 0.674280
## brandRenault -3.552e-01 1.216e-01 -2.922 0.003490 **
## brandSkoda -3.691e-01 1.227e-01 -3.008 0.002636 **
## brandTata -7.022e-01 1.206e-01 -5.823 6.02e-09 ***
## brandToyota -4.241e-02 1.208e-01 -0.351 0.725547
## brandVolkswagen -3.926e-01 1.218e-01 -3.224 0.001267 **
## brandVolvo 2.039e-01 1.249e-01 1.633 0.102521
## log(max_power_bhp) 7.366e-01 1.867e-02 39.458 < 2e-16 ***
## torque_nm 3.788e-04 9.034e-05 4.193 2.78e-05 ***
## engine_cc 2.464e-04 1.173e-05 21.011 < 2e-16 ***
## ownFourth & Above -1.577e-01 2.008e-02 -7.851 4.68e-15 ***
## ownSecond -8.228e-02 7.009e-03 -11.740 < 2e-16 ***
## ownTest Drive Car 6.390e-01 1.097e-01 5.823 6.02e-09 ***
## ownThird -1.194e-01 1.202e-02 -9.934 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2391 on 7860 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.917, Adjusted R-squared: 0.9166
## F-statistic: 1931 on 45 and 7860 DF, p-value: < 2.2e-16
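The trace above has the shape of output from `step()` in base R, which drops one term at a time for as long as AIC keeps falling. The sketch below runs the same kind of backward search on the built-in `mtcars` data (a stand-in, not the used-car data) and recomputes the AIC by hand. `aic_by_hand` is a hypothetical helper. The last line applies the same formula to figures read off the trace, assuming 7,906 rows were used and the starting model has 47 coefficients (my own count from the formula).

```r
# Stand-in data: built-in mtcars, NOT the used-car data from this report.
full <- lm(log(mpg) ~ wt + hp + qsec + drat, data = mtcars)

# Backward elimination: remove terms while doing so lowers AIC.
reduced <- step(full, direction = "backward", trace = 0)

# For lm fits, step() scores models with extractAIC():
#   n * log(RSS / n) + 2 * edf
aic_by_hand <- function(fit) {
  n <- nobs(fit)
  rss <- sum(residuals(fit)^2)
  edf <- n - df.residual(fit)  # number of estimated coefficients
  n * log(rss / n) + 2 * edf
}

print(c(by_hand = aic_by_hand(full), extractAIC = extractAIC(full)[2]))
print(formula(reduced))

# Same formula with the figures printed in the trace (assumed n and edf).
aic_report <- 7906 * log(449.16 / 7906) + 2 * 47
print(aic_report)
```

Under those assumptions the hand calculation lands at roughly -22580, in line with the `Start: AIC` line. Note that `AIC()` adds a constant to this value, so only differences between models fitted to the same data are meaningful.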
Conclusion
We have five models here: the full model, the feature-transformed model, the target-transformed model, the model with both features and target transformed, and the reduced model. Based on goodness of fit, the model with both features and target transformed has the highest adjusted R-squared (0.917). However, the difference between the reduced model and this "best" model is negligible (0.9166 vs. 0.917). If I had to choose one model, I would take the reduced model: its AIC is the lowest, its adjusted R-squared is close to the highest, and having fewer predictors helps reduce the chance of overfitting the data.
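One point to keep in mind when reading this comparison: adjusted R-squared for a model of log(selling_price) is measured on the log scale, so it is not strictly like-for-like with the value from a model of selling_price. A minimal sketch of one way to put both on the original scale follows, again using the built-in `mtcars` data as a stand-in. `r2_original_scale` is a hypothetical helper, and the plain `exp()` back-transform is a simplification.

```r
# Stand-in data: built-in mtcars, NOT the used-car data from this report.
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)
fit_log <- lm(log(mpg) ~ wt + hp, data = mtcars)

# R-squared on the original response scale: 1 - SSE / SST.
r2_original_scale <- function(observed, predicted) {
  1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)
}

pred_raw <- fitted(fit_raw)
# Naive back-transform; exp() of a log-scale fit tends to underestimate the mean.
pred_log <- exp(fitted(fit_log))

r2_raw <- r2_original_scale(mtcars$mpg, pred_raw)
r2_log <- r2_original_scale(mtcars$mpg, pred_log)
print(c(raw_model = r2_raw, log_model_back_transformed = r2_log))
```

If the log-response model still scores higher after back-transforming, the ranking reported above holds on the price scale as well.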