2024-11-14

## 'data.frame':    19820 obs. of  16 variables:
##  $ selling_price   : num  1.2 5.5 2.15 2.26 5.7 3.5 3.15 4.1 10.5 5.75 ...
##  $ year            : num  2012 2016 2010 2012 2015 ...
##  $ km_driven       : int  120000 20000 60000 37000 30000 35000 40000 17512 20000 70000 ...
##  $ mileage         : num  19.7 18.9 17 20.9 22.8 ...
##  $ engine          : num  796 1197 1197 998 1498 ...
##  $ max_power       : num  46.3 82 80 67.1 98.6 ...
##  $ age             : num  11 7 13 11 8 10 10 5 4 6 ...
##  $ Individual      : int  1 1 1 1 0 1 0 0 1 0 ...
##  $ Trustmark.Dealer: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Diesel          : int  0 0 0 0 1 0 0 0 0 1 ...
##  $ Electric        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LPG             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Petrol          : int  1 1 1 1 0 1 1 1 1 0 ...
##  $ Manual          : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ X5              : int  1 1 1 1 1 1 1 1 1 0 ...
##  $ X.5             : int  0 0 0 0 0 0 0 0 0 1 ...
## 
## Call:
## lm(formula = selling_price ~ year + km_driven + mileage + engine + 
##     max_power + age + Individual + Trustmark.Dealer + Diesel + 
##     Electric + LPG + Petrol + Manual + X5 + X.5, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.3047  -1.2475  -0.1508   1.0015  17.5771 
## 
## Coefficients: (1 not defined because of singularities)
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.080e+03  1.192e+01 -90.587  < 2e-16 ***
## year              5.393e-01  5.947e-03  90.687  < 2e-16 ***
## km_driven        -4.348e-06  3.480e-07 -12.496  < 2e-16 ***
## mileage          -1.517e-01  6.655e-03 -22.797  < 2e-16 ***
## engine            7.943e-04  7.827e-05  10.147  < 2e-16 ***
## max_power         4.772e-02  7.276e-04  65.582  < 2e-16 ***
## age                      NA         NA      NA       NA    
## Individual       -4.113e-01  3.466e-02 -11.867  < 2e-16 ***
## Trustmark.Dealer -4.136e-01  1.683e-01  -2.457 0.014011 *  
## Diesel            5.966e-01  1.338e-01   4.458 8.32e-06 ***
## Electric          9.373e+00  8.725e-01  10.742  < 2e-16 ***
## LPG               4.447e-02  3.183e-01   0.140 0.888902    
## Petrol           -1.328e+00  1.372e-01  -9.682  < 2e-16 ***
## Manual           -2.624e+00  5.055e-02 -51.899  < 2e-16 ***
## X5               -4.475e-01  1.519e-01  -2.947 0.003216 ** 
## X.5              -5.604e-01  1.638e-01  -3.421 0.000626 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.286 on 19805 degrees of freedom
## Multiple R-squared:  0.7778, Adjusted R-squared:  0.7777 
## F-statistic:  4953 on 14 and 19805 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = selling_price ~ km_driven + mileage + engine + max_power + 
##     age + Individual + Trustmark.Dealer + Diesel + Electric + 
##     LPG + Petrol + Manual + X5 + X.5, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.3047  -1.2475  -0.1508   1.0015  17.5771 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.157e+01  3.176e-01  36.444  < 2e-16 ***
## km_driven        -4.348e-06  3.480e-07 -12.496  < 2e-16 ***
## mileage          -1.517e-01  6.655e-03 -22.797  < 2e-16 ***
## engine            7.943e-04  7.827e-05  10.147  < 2e-16 ***
## max_power         4.772e-02  7.276e-04  65.582  < 2e-16 ***
## age              -5.393e-01  5.947e-03 -90.687  < 2e-16 ***
## Individual       -4.113e-01  3.466e-02 -11.867  < 2e-16 ***
## Trustmark.Dealer -4.136e-01  1.683e-01  -2.457 0.014011 *  
## Diesel            5.966e-01  1.338e-01   4.458 8.32e-06 ***
## Electric          9.373e+00  8.725e-01  10.742  < 2e-16 ***
## LPG               4.447e-02  3.183e-01   0.140 0.888902    
## Petrol           -1.328e+00  1.372e-01  -9.682  < 2e-16 ***
## Manual           -2.624e+00  5.055e-02 -51.899  < 2e-16 ***
## X5               -4.475e-01  1.519e-01  -2.947 0.003216 ** 
## X.5              -5.604e-01  1.638e-01  -3.421 0.000626 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.286 on 19805 degrees of freedom
## Multiple R-squared:  0.7778, Adjusted R-squared:  0.7777 
## F-statistic:  4953 on 14 and 19805 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = selling_price ~ km_driven + mileage + engine + max_power + 
##     age + Individual + Trustmark.Dealer + Diesel + Electric + 
##     LPG + Manual + X5 + X.5, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.2540  -1.2574  -0.1638   1.0106  17.0683 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       9.893e+00  2.666e-01  37.115  < 2e-16 ***
## km_driven        -4.218e-06  3.485e-07 -12.102  < 2e-16 ***
## mileage          -1.342e-01  6.420e-03 -20.906  < 2e-16 ***
## engine            8.770e-04  7.799e-05  11.246  < 2e-16 ***
## max_power         4.756e-02  7.292e-04  65.221  < 2e-16 ***
## age              -5.365e-01  5.954e-03 -90.110  < 2e-16 ***
## Individual       -4.304e-01  3.469e-02 -12.409  < 2e-16 ***
## Trustmark.Dealer -3.617e-01  1.686e-01  -2.145  0.03197 *  
## Diesel            1.812e+00  4.641e-02  39.050  < 2e-16 ***
## Electric          9.776e+00  8.736e-01  11.191  < 2e-16 ***
## LPG               1.362e+00  2.885e-01   4.721 2.36e-06 ***
## Manual           -2.626e+00  5.067e-02 -51.821  < 2e-16 ***
## X5               -5.004e-01  1.521e-01  -3.289  0.00101 ** 
## X.5              -5.372e-01  1.642e-01  -3.272  0.00107 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.291 on 19806 degrees of freedom
## Multiple R-squared:  0.7768, Adjusted R-squared:  0.7766 
## F-statistic:  5302 on 13 and 19806 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = selling_price ~ km_driven + mileage + engine + max_power + 
##     age + Individual + Trustmark.Dealer + Diesel + Electric + 
##     LPG + Manual + X5, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.9160  -1.2561  -0.1622   1.0082  17.2034 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       9.331e+00  2.039e-01  45.762  < 2e-16 ***
## km_driven        -4.263e-06  3.483e-07 -12.239  < 2e-16 ***
## mileage          -1.301e-01  6.296e-03 -20.661  < 2e-16 ***
## engine            8.469e-04  7.746e-05  10.933  < 2e-16 ***
## max_power         4.807e-02  7.123e-04  67.487  < 2e-16 ***
## age              -5.326e-01  5.836e-03 -91.264  < 2e-16 ***
## Individual       -4.294e-01  3.469e-02 -12.378  < 2e-16 ***
## Trustmark.Dealer -3.569e-01  1.687e-01  -2.116   0.0344 *  
## Diesel            1.789e+00  4.588e-02  38.999  < 2e-16 ***
## Electric          9.620e+00  8.725e-01  11.026  < 2e-16 ***
## LPG               1.359e+00  2.885e-01   4.709 2.51e-06 ***
## Manual           -2.640e+00  5.050e-02 -52.272  < 2e-16 ***
## X5               -3.867e-02  5.682e-02  -0.681   0.4961    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.292 on 19807 degrees of freedom
## Multiple R-squared:  0.7767, Adjusted R-squared:  0.7765 
## F-statistic:  5740 on 12 and 19807 DF,  p-value: < 2.2e-16
##        km_driven          mileage           engine        max_power 
##         1.226651         2.874447         6.176241         4.022456 
##              age       Individual Trustmark.Dealer           Diesel 
##         1.367855         1.081587         1.019400         2.033092 
##         Electric              LPG           Manual               X5 
##         1.162692         1.011332         1.535571        12.041870 
##              X.5 
##        13.180752
##        km_driven          mileage           engine        max_power 
##         1.224707         2.763159         6.089823         3.836490 
##              age       Individual Trustmark.Dealer           Diesel 
##         1.313607         1.081500         1.019322         1.985553 
##         Electric              LPG           Manual               X5 
##         1.159214         1.011320         1.524591         1.678576

What is Multicollinearity?

  • Multicollinearity occurs when two or more predictors in a model are highly correlated with one another, or are exact functions of one another
  • Multicollinearity makes it difficult to distinguish which variable is driving the response
  • Multicollinearity inflates the variance of the coefficient estimates, which makes them unstable and can contribute to overfitting and less reliable predictions on new data.

Examples

  • Height and Wingspan
  • Pregnancy and Ovaries
  • Left Foot Length and Right Foot Length

Problems with Multicollinearity

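  • Model Misspecification: Multicollinearity can lead to incorrect conclusions about the relationships between variables, because coefficient estimates shift substantially when a correlated predictor is added or removed.
  • High R² with Insignificant Coefficients:
    • Highly correlated independent variables can yield a model with a high R² even though individual coefficients look insignificant.
    • Individual coefficients (betas) may not be statistically significant because their standard errors are inflated. In the output above, LPG is insignificant (p = 0.889) while Petrol is in the model and significant (p < 0.001) once Petrol is dropped.
  • Difficult Interpretation: Multicollinearity makes it hard to determine the unique contribution of each predictor to the dependent variable.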

Equation of a Linear Regression Model

\[ y = \beta_0 + \beta_1 x + \epsilon \]

  • \(y\): Dependent variable (response)
  • \(x\): Independent variable (predictor)
  • \(\beta_0\): Intercept (value of \(y\) when \(x = 0\))
  • \(\beta_1\): Slope (change in \(y\) for a one-unit change in \(x\))
  • \(\epsilon\): Error term (captures unobserved factors and noise)
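As a minimal sketch of the equation above, the following simulates data from a simple linear model and fits it with `lm()`. The values chosen for \(\beta_0\) (2) and \(\beta_1\) (0.5), the sample size, and the noise level are illustrative assumptions, not values from the car data.

```r
# Simulate y = beta_0 + beta_1 * x + epsilon with assumed values
set.seed(1)
n   <- 200
x   <- runif(n, min = 0, max = 10)  # predictor
eps <- rnorm(n, sd = 1)             # error term
y   <- 2 + 0.5 * x + eps            # beta_0 = 2, beta_1 = 0.5 (illustrative)

# Ordinary least squares fit; the estimates should land near 2 and 0.5
fit <- lm(y ~ x)
coef(fit)
```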

Scatterplot between two Multicollinear Variables Year and Age

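  • Age and year are directly related: in the rows shown above, age = 2023 − year, so the two are perfectly collinear. This is why the first model reports NA for age ("1 not defined because of singularities").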
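The effect of including both variables can be reproduced on simulated data. This sketch assumes age is computed as 2023 − year, which matches the first rows of the data shown earlier; the simulated prices and the variable `price` are hypothetical.

```r
# Simulated stand-in for the car data (values are hypothetical)
set.seed(42)
n     <- 100
year  <- sample(2005:2020, n, replace = TRUE)
age   <- 2023 - year                         # exact linear function of year
price <- 0.5 * (year - 2000) + rnorm(n)

# With both predictors, lm() cannot estimate a separate effect for age
fit <- lm(price ~ year + age)
coef(fit)        # the age coefficient is NA

cor(year, age)   # exactly -1: a perfect negative linear relationship

# alias() reports which terms are exact linear combinations of others
alias(fit)
```

Because the two columns carry the same information, either one can be kept; the fitted values are identical and only the sign of the slope and the intercept change, as in the first two model summaries above.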

Testing for Multicollinearity

  • A high R² and a significant F-statistic, combined with insignificant t-tests on the individual coefficients, signal multicollinearity

  • In R, multicollinearity can be checked with the correlation matrix, the VIF, and alias(), which lists predictors that are exact linear combinations of others

Correlation Matrix
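A correlation matrix can be computed with base R's `cor()`. The sketch below uses a small simulated data frame whose column names mirror some of the predictors above; the data frame `cars_sim` and all of its values are assumptions for illustration, not the original data.

```r
# Simulated predictors (hypothetical values, names borrowed from the model)
set.seed(7)
n         <- 300
engine    <- rnorm(n, mean = 1400, sd = 400)
max_power <- 0.06 * engine + rnorm(n, sd = 10)  # built to track engine size
year      <- sample(2005:2020, n, replace = TRUE)
age       <- 2023 - year
cars_sim  <- data.frame(engine, max_power, year, age)

# Pairwise Pearson correlations between all predictors
corr <- cor(cars_sim)
round(corr, 2)

# List each pair once (upper triangle) where |r| exceeds 0.8
high <- which(abs(corr) > 0.8 & upper.tri(corr), arr.ind = TRUE)
data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = corr[high]
)
```

The 0.8 cutoff is an arbitrary choice for the sketch. A correlation matrix only reveals pairwise relationships, so a predictor that depends on several others jointly can still go unnoticed, which is where the VIF helps.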

Solutions for Multicollinearity

  • The typical solution, and good practice, for multicollinearity is dropping one of the correlated variables
  • Stepwise regression can help eliminate variables
  • Variance Inflation Factor (VIF) shows by what factor the variance of each coefficient is inflated due to collinearity
  • The square root of the VIF shows the factor by which the standard error is inflated
  • Dimensionality reduction techniques such as PCA
  • When dropping multicollinear variables, the standard errors of the remaining coefficients typically decrease, improving model reliability. In the output above, dropping Petrol lowers the standard error of Diesel from 0.134 to 0.046.
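The last point can be illustrated with a small simulation. This is a sketch on made-up data: `x2` is constructed as a noisy copy of `x1`, and the standard error of the `x1` coefficient is compared with and without `x2` in the model.

```r
# Two nearly identical predictors (simulated, illustrative values)
set.seed(123)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # almost a copy of x1
y  <- 1 + 2 * x1 + rnorm(n)

full    <- lm(y ~ x1 + x2)       # both collinear predictors
reduced <- lm(y ~ x1)            # x2 dropped

# Standard error of the x1 coefficient in each model
se_full    <- summary(full)$coefficients["x1", "Std. Error"]
se_reduced <- summary(reduced)$coefficients["x1", "Std. Error"]

c(full = se_full, reduced = se_reduced, ratio = se_full / se_reduced)
```

The ratio of the two standard errors is roughly the square root of the VIF of `x1` in the full model, so the closer `x2` is to `x1`, the larger the reduction from dropping it.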

What is Variance Inflation Factor (VIF)?

The Variance Inflation Factor (VIF) is calculated as:

\[ \text{VIF}(X_j) = \frac{1}{1 - R_j^2} \]

where \(R_j^2\) is the coefficient of determination when regressing predictor \(X_j\) on all other predictors.
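The formula can be applied directly in base R by regressing each predictor on the others. This is a minimal sketch on simulated predictors (`x1`, `x2`, `x3` are hypothetical); for a model with only single-column terms it should agree with what a packaged `vif()` function reports.

```r
# Simulated predictors: x2 is strongly related to x1, x3 is independent
set.seed(2024)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF(X_j) = 1 / (1 - R_j^2), with R_j^2 from regressing X_j on the rest
manual_vif <- sapply(names(X), function(j) {
  others <- setdiff(names(X), j)
  r2 <- summary(lm(reformulate(others, response = j), data = X))$r.squared
  1 / (1 - r2)
})
manual_vif

# Equivalent shortcut: the diagonal of the inverse correlation matrix
diag(solve(cor(X)))
```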

Interpretation of VIF

  • VIF = 1: No multicollinearity.
  • VIF between 1 and 5: Moderate multicollinearity.
  • VIF > 10: High multicollinearity, usually indicative of collinearity problems

VIF

library(car)  # assumed source of vif(); model2 is the lm() fit with age and Petrol
vif_results <- vif(model2)
print(vif_results)
##        km_driven          mileage           engine        max_power 
##         1.228493         3.103632         6.250776         4.024651 
##              age       Individual Trustmark.Dealer           Diesel 
##         1.371166         1.085106         1.020435        16.985109 
##         Electric              LPG           Petrol           Manual 
##         1.165349         1.237391        17.839330         1.535600 
##               X5              X.5 
##        12.057470        13.183576
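As a worked reading of one value, take the VIF reported for Petrol. Inverting the VIF formula gives the share of Petrol's variance explained by the other predictors, and the square root gives the inflation of its standard error:

\[ R_{\text{Petrol}}^2 = 1 - \frac{1}{\text{VIF}(\text{Petrol})} = 1 - \frac{1}{17.839} \approx 0.944, \qquad \sqrt{\text{VIF}(\text{Petrol})} = \sqrt{17.839} \approx 4.22 \]

So about 94% of the variation in Petrol is accounted for by the other predictors, and its standard error is roughly 4.2 times what it would be without collinearity. This is consistent with the earlier VIF output for the model without Petrol, where the VIF of Diesel falls from about 16.99 to about 2.03.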

VIF values barplot
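A bar plot of this kind can be sketched in base R. The values below are copied from the VIF output for model2 above; the colours, margins, and the dashed reference line at 10 are presentation choices assumed for this sketch and may differ from the original figure.

```r
# VIF values copied from the printed output for model2
vif_values <- c(
  km_driven = 1.228493, mileage = 3.103632, engine = 6.250776,
  max_power = 4.024651, age = 1.371166, Individual = 1.085106,
  Trustmark.Dealer = 1.020435, Diesel = 16.985109, Electric = 1.165349,
  LPG = 1.237391, Petrol = 17.839330, Manual = 1.535600,
  X5 = 12.057470, X.5 = 13.183576
)

# Widen the bottom margin so the rotated labels fit
op <- par(mar = c(8, 4, 3, 1))
mids <- barplot(
  vif_values,
  las  = 2,                                              # vertical labels
  col  = ifelse(vif_values > 10, "firebrick", "steelblue"),
  ylab = "VIF",
  main = "VIF values"
)
abline(h = 10, lty = 2)  # the VIF > 10 rule of thumb
par(op)

# Predictors above the threshold
names(vif_values)[vif_values > 10]
```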