2024-11-14
## 'data.frame': 19820 obs. of 16 variables:
## $ selling_price : num 1.2 5.5 2.15 2.26 5.7 3.5 3.15 4.1 10.5 5.75 ...
## $ year : num 2012 2016 2010 2012 2015 ...
## $ km_driven : int 120000 20000 60000 37000 30000 35000 40000 17512 20000 70000 ...
## $ mileage : num 19.7 18.9 17 20.9 22.8 ...
## $ engine : num 796 1197 1197 998 1498 ...
## $ max_power : num 46.3 82 80 67.1 98.6 ...
## $ age : num 11 7 13 11 8 10 10 5 4 6 ...
## $ Individual : int 1 1 1 1 0 1 0 0 1 0 ...
## $ Trustmark.Dealer: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Diesel : int 0 0 0 0 1 0 0 0 0 1 ...
## $ Electric : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LPG : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Petrol : int 1 1 1 1 0 1 1 1 1 0 ...
## $ Manual : int 1 1 1 1 1 1 1 1 0 1 ...
## $ X5 : int 1 1 1 1 1 1 1 1 1 0 ...
## $ X.5 : int 0 0 0 0 0 0 0 0 0 1 ...
##
## Call:
## lm(formula = selling_price ~ year + km_driven + mileage + engine +
## max_power + age + Individual + Trustmark.Dealer + Diesel +
## Electric + LPG + Petrol + Manual + X5 + X.5, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.3047 -1.2475 -0.1508 1.0015 17.5771
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.080e+03 1.192e+01 -90.587 < 2e-16 ***
## year 5.393e-01 5.947e-03 90.687 < 2e-16 ***
## km_driven -4.348e-06 3.480e-07 -12.496 < 2e-16 ***
## mileage -1.517e-01 6.655e-03 -22.797 < 2e-16 ***
## engine 7.943e-04 7.827e-05 10.147 < 2e-16 ***
## max_power 4.772e-02 7.276e-04 65.582 < 2e-16 ***
## age NA NA NA NA
## Individual -4.113e-01 3.466e-02 -11.867 < 2e-16 ***
## Trustmark.Dealer -4.136e-01 1.683e-01 -2.457 0.014011 *
## Diesel 5.966e-01 1.338e-01 4.458 8.32e-06 ***
## Electric 9.373e+00 8.725e-01 10.742 < 2e-16 ***
## LPG 4.447e-02 3.183e-01 0.140 0.888902
## Petrol -1.328e+00 1.372e-01 -9.682 < 2e-16 ***
## Manual -2.624e+00 5.055e-02 -51.899 < 2e-16 ***
## X5 -4.475e-01 1.519e-01 -2.947 0.003216 **
## X.5 -5.604e-01 1.638e-01 -3.421 0.000626 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.286 on 19805 degrees of freedom
## Multiple R-squared: 0.7778, Adjusted R-squared: 0.7777
## F-statistic: 4953 on 14 and 19805 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = selling_price ~ km_driven + mileage + engine + max_power +
## age + Individual + Trustmark.Dealer + Diesel + Electric +
## LPG + Petrol + Manual + X5 + X.5, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.3047 -1.2475 -0.1508 1.0015 17.5771
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.157e+01 3.176e-01 36.444 < 2e-16 ***
## km_driven -4.348e-06 3.480e-07 -12.496 < 2e-16 ***
## mileage -1.517e-01 6.655e-03 -22.797 < 2e-16 ***
## engine 7.943e-04 7.827e-05 10.147 < 2e-16 ***
## max_power 4.772e-02 7.276e-04 65.582 < 2e-16 ***
## age -5.393e-01 5.947e-03 -90.687 < 2e-16 ***
## Individual -4.113e-01 3.466e-02 -11.867 < 2e-16 ***
## Trustmark.Dealer -4.136e-01 1.683e-01 -2.457 0.014011 *
## Diesel 5.966e-01 1.338e-01 4.458 8.32e-06 ***
## Electric 9.373e+00 8.725e-01 10.742 < 2e-16 ***
## LPG 4.447e-02 3.183e-01 0.140 0.888902
## Petrol -1.328e+00 1.372e-01 -9.682 < 2e-16 ***
## Manual -2.624e+00 5.055e-02 -51.899 < 2e-16 ***
## X5 -4.475e-01 1.519e-01 -2.947 0.003216 **
## X.5 -5.604e-01 1.638e-01 -3.421 0.000626 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.286 on 19805 degrees of freedom
## Multiple R-squared: 0.7778, Adjusted R-squared: 0.7777
## F-statistic: 4953 on 14 and 19805 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = selling_price ~ km_driven + mileage + engine + max_power +
## age + Individual + Trustmark.Dealer + Diesel + Electric +
## LPG + Manual + X5 + X.5, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.2540 -1.2574 -0.1638 1.0106 17.0683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.893e+00 2.666e-01 37.115 < 2e-16 ***
## km_driven -4.218e-06 3.485e-07 -12.102 < 2e-16 ***
## mileage -1.342e-01 6.420e-03 -20.906 < 2e-16 ***
## engine 8.770e-04 7.799e-05 11.246 < 2e-16 ***
## max_power 4.756e-02 7.292e-04 65.221 < 2e-16 ***
## age -5.365e-01 5.954e-03 -90.110 < 2e-16 ***
## Individual -4.304e-01 3.469e-02 -12.409 < 2e-16 ***
## Trustmark.Dealer -3.617e-01 1.686e-01 -2.145 0.03197 *
## Diesel 1.812e+00 4.641e-02 39.050 < 2e-16 ***
## Electric 9.776e+00 8.736e-01 11.191 < 2e-16 ***
## LPG 1.362e+00 2.885e-01 4.721 2.36e-06 ***
## Manual -2.626e+00 5.067e-02 -51.821 < 2e-16 ***
## X5 -5.004e-01 1.521e-01 -3.289 0.00101 **
## X.5 -5.372e-01 1.642e-01 -3.272 0.00107 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.291 on 19806 degrees of freedom
## Multiple R-squared: 0.7768, Adjusted R-squared: 0.7766
## F-statistic: 5302 on 13 and 19806 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = selling_price ~ km_driven + mileage + engine + max_power +
## age + Individual + Trustmark.Dealer + Diesel + Electric +
## LPG + Manual + X5, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9160 -1.2561 -0.1622 1.0082 17.2034
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.331e+00 2.039e-01 45.762 < 2e-16 ***
## km_driven -4.263e-06 3.483e-07 -12.239 < 2e-16 ***
## mileage -1.301e-01 6.296e-03 -20.661 < 2e-16 ***
## engine 8.469e-04 7.746e-05 10.933 < 2e-16 ***
## max_power 4.807e-02 7.123e-04 67.487 < 2e-16 ***
## age -5.326e-01 5.836e-03 -91.264 < 2e-16 ***
## Individual -4.294e-01 3.469e-02 -12.378 < 2e-16 ***
## Trustmark.Dealer -3.569e-01 1.687e-01 -2.116 0.0344 *
## Diesel 1.789e+00 4.588e-02 38.999 < 2e-16 ***
## Electric 9.620e+00 8.725e-01 11.026 < 2e-16 ***
## LPG 1.359e+00 2.885e-01 4.709 2.51e-06 ***
## Manual -2.640e+00 5.050e-02 -52.272 < 2e-16 ***
## X5 -3.867e-02 5.682e-02 -0.681 0.4961
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.292 on 19807 degrees of freedom
## Multiple R-squared: 0.7767, Adjusted R-squared: 0.7765
## F-statistic: 5740 on 12 and 19807 DF, p-value: < 2.2e-16
## km_driven mileage engine max_power
## 1.226651 2.874447 6.176241 4.022456
## age Individual Trustmark.Dealer Diesel
## 1.367855 1.081587 1.019400 2.033092
## Electric LPG Manual X5
## 1.162692 1.011332 1.535571 12.041870
## X.5
## 13.180752
## km_driven mileage engine max_power
## 1.224707 2.763159 6.089823 3.836490
## age Individual Trustmark.Dealer Diesel
## 1.313607 1.081500 1.019322 1.985553
## Electric LPG Manual X5
## 1.159214 1.011320 1.524591 1.678576
What is Multicollinearity?
- Multicollinearity occurs when two or more predictors in a model are highly correlated with one another, or when one is an exact function of another.
- Multicollinearity makes it difficult to distinguish which predictor is actually driving the response.
- Multicollinearity inflates the variance of the coefficient estimates, so they become unstable from sample to sample. This can contribute to overfitting and less reliable predictions.
Examples
- Height and Wingspan
- Pregnancy and Ovaries
- Left Foot Length and Right Foot Length

Problems with Multicollinearity
- Model Misspecification: Multicollinearity can lead to incorrect conclusions about the relationships between variables.
- High R² with Insignificant Coefficients:
  - The model as a whole can show a high R² value even though the correlated predictors look unimportant individually.
  - Individual coefficients (betas) may not be statistically significant because of their high standard errors.
- Difficult Interpretation: Multicollinearity makes it hard to determine the unique contribution of each predictor to the dependent variable.
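The loss of precision described in this list can be reproduced on simulated data. The following is a minimal sketch in base R, not the car-price analysis itself: `x1`, `x2`, and `y` are made-up variables, with `x2` built as a near copy of `x1`.

```r
# Simulated illustration (hypothetical data, not the car dataset)
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # x2 is almost a copy of x1
y  <- 1 + 2 * x1 + rnorm(n)     # only x1 truly drives y

fit_both <- lm(y ~ x1 + x2)     # both collinear predictors included
fit_one  <- lm(y ~ x1)          # one of the pair dropped

# Standard error of the x1 coefficient in each model
se_both <- summary(fit_both)$coefficients["x1", "Std. Error"]
se_one  <- summary(fit_one)$coefficients["x1", "Std. Error"]

# Overall fit of each model
r2_both <- summary(fit_both)$r.squared
r2_one  <- summary(fit_one)$r.squared

print(c(se_both = se_both, se_one = se_one))
print(c(r2_both = r2_both, r2_one = r2_one))
```

Comparing `se_both` with `se_one` shows how much precision the `x1` estimate loses when its near copy is in the model, while the two R² values stay close to each other.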
Equation of a Linear Regression Model
\[
y = \beta_0 + \beta_1 x + \epsilon
\]
- \(y\): Dependent variable (response)
- \(x\): Independent variable (predictor)
- \(\beta_0\): Intercept (value of \(y\) when \(x = 0\))
- \(\beta_1\): Slope (change in \(y\) for a one-unit change in \(x\))
- \(\epsilon\): Error term (captures unobserved factors and noise)
Scatterplot between two Multicollinear Variables Year and Age

- Age is derived from year (in the rows printed above, age = 2023 - year), hence their obvious multicollinearity
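When one predictor is an exact linear function of another, as the printed rows suggest age is of year, the regression cannot separate their effects. As a sketch, assume age equals a constant \(c\) minus year (the symbols \(c\), \(\beta_{\text{year}}\), and \(\beta_{\text{age}}\) are introduced here for illustration) and look only at those two terms of the model:

\[
\beta_{\text{year}}\,\text{year} + \beta_{\text{age}}\,\text{age}
= \beta_{\text{year}}\,\text{year} + \beta_{\text{age}}\,(c - \text{year})
= \beta_{\text{age}}\,c + (\beta_{\text{year}} - \beta_{\text{age}})\,\text{year}
\]

Only the difference \(\beta_{\text{year}} - \beta_{\text{age}}\) and a shift in the intercept can be estimated, so infinitely many coefficient pairs fit equally well. This is consistent with the summaries above: the model containing both variables reports NA for age, and the year coefficient of 0.5393 reappears as -0.5393 on age once year is removed.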
Testing for Multicollinearity
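A high R² and a significant F-statistic, combined with insignificant t-tests on the individual coefficients, signal multicollinearity.
In R, multicollinearity can be checked with the correlation matrix, the VIF, and the alias() function, which reports exact linear dependencies among the model terms.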
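Two of these checks can be sketched in base R as follows. The data frame `dat` and its columns are simulated stand-ins with made-up values, not the car data; `age` is deliberately constructed as an exact function of `year`. The sketch covers the correlation matrix and `alias()` only.

```r
# Simulated stand-in for the car data (hypothetical values)
set.seed(1)
n     <- 100
year  <- sample(2005:2020, n, replace = TRUE)
age   <- 2023 - year                        # exact linear function of year
km    <- rnorm(n, mean = 60000, sd = 20000)
price <- 0.5 * (year - 2000) - 1e-5 * km + rnorm(n)
dat   <- data.frame(price, year, age, km)

# 1. Correlation matrix of the predictors: look for entries near +1 or -1
cor_mat <- cor(dat[, c("year", "age", "km")])
print(round(cor_mat, 3))

# 2. alias() lists terms that are exact linear combinations of others
fit     <- lm(price ~ year + age + km, data = dat)
aliased <- alias(fit)$Complete
print(aliased)

# 3. lm() itself reports NA for the aliased coefficient
print(coef(fit))
```

A correlation matrix only reveals pairwise relationships, whereas `alias()` also catches a predictor that is a combination of several others, which is one reason to run both.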
Correlation Matrix
Solutions for Multicollinearity
- The typical solution and good practice for multicollinearity is dropping one of the variables
- Stepwise regression can help eliminate variables
- Variance Inflation Factor (VIF) shows the factor by which the variance of each coefficient is inflated, which helps identify the variables to drop.
- The square root of the VIF shows the degree to which the standard error is inflated due to collinearity
- Dimensionality reduction techniques such as PCA
- When dropping multicollinear variables, the standard error of the regression coefficients typically decreases, improving model reliability.
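The PCA option in the list above can be sketched with base R's `prcomp()`. This is an illustration under assumptions, not the approach taken in this analysis: `x1`, `x2`, and `y` are simulated, and keeping only the first component is a choice made for the example.

```r
# Simulated illustration (hypothetical data, not the car dataset)
set.seed(7)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # nearly collinear with x1
y  <- 1 + 2 * x1 + rnorm(n)

# Principal components of the standardized predictors
pca <- prcomp(cbind(x1, x2), center = TRUE, scale. = TRUE)

# Share of predictor variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 4))

# Regress on the first component only, replacing the collinear pair
pc1    <- pca$x[, 1]
fit_pc <- lm(y ~ pc1)
print(summary(fit_pc)$coefficients)
```

The trade-off is interpretability: the coefficient on `pc1` describes a blend of `x1` and `x2` rather than either original variable, so dropping a variable is often preferable when a clear reading of each predictor matters.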
What is Variance Inflation Factor (VIF)?
The Variance Inflation Factor (VIF) is calculated as:
\[
\text{VIF}(X_j) = \frac{1}{1 - R_j^2}
\]
where \(R_j^2\) is the coefficient of determination when regressing predictor \(X_j\) on all other predictors.
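The formula can be applied by hand with an auxiliary regression for each predictor. The following is a minimal sketch in base R; `manual_vif()` is a hypothetical helper written for this illustration, and the predictors are simulated. For a model whose terms are all single numeric columns, its results should agree with what a packaged `vif()` function reports.

```r
# Hypothetical helper: VIF of every column in a data frame of predictors
manual_vif <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    # Regress predictor v on all remaining predictors and take its R-squared
    aux <- lm(reformulate(others, response = v), data = predictors)
    r2  <- summary(aux)$r.squared
    1 / (1 - r2)
  })
}

# Simulated predictors (hypothetical): x2 tracks x1, x3 is unrelated
set.seed(123)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)
x3 <- rnorm(n)

vifs <- manual_vif(data.frame(x1, x2, x3))
print(round(vifs, 2))
```

Note that the response variable plays no role in the calculation: the VIF depends only on how the predictors relate to each other.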
Interpretation of VIF
- VIF = 1: No multicollinearity.
- VIF between 1 and 5: Moderate multicollinearity.
- VIF > 10: High multicollinearity, usually indicative of collinearity problems
VIF
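library(car)  # assumed source of vif(); the original does not show which package is loaded
vif_results <- vif(model2)  # one VIF per predictor in model2
print(vif_results)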
## km_driven mileage engine max_power
## 1.228493 3.103632 6.250776 4.024651
## age Individual Trustmark.Dealer Diesel
## 1.371166 1.085106 1.020435 16.985109
## Electric LPG Petrol Manual
## 1.165349 1.237391 17.839330 1.535600
## X5 X.5
## 12.057470 13.183576
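The VIF formula can be inverted to read these numbers. As a worked example using the Petrol value printed above (figures rounded):

\[
R_{\text{Petrol}}^2 = 1 - \frac{1}{\text{VIF}} = 1 - \frac{1}{17.839} \approx 0.944,
\qquad
\sqrt{\text{VIF}} = \sqrt{17.839} \approx 4.22
\]

About 94% of the variation in Petrol is explained by the other predictors, and its standard error is roughly 4.2 times what it would be without that overlap. In the earlier VIF output for the model that omits Petrol, the Diesel VIF falls from 16.99 to 2.03.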
VIF values barplot
