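Libraries

library(ggplot2)  # ggplot(), geom_point(), geom_smooth()
library(dplyr)    # provides the %>% pipe used below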

Discussion 12:

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

This dataset contains information about used cars. It has one dependent variable (Selling_Price) and several independent variables.

The columns in the loaded dataset are as follows:

  • Car_Name
  • Year
  • Selling_Price
  • Present_Price
  • Kms_Driven
  • Fuel_Type
  • Seller_Type
  • Transmission
  • Owner

Here is the original link from Kaggle: https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho

First, I uploaded the dataset to my GitHub repository and then loaded it into R.

Load the dataset from GitHub

df <- read.csv("https://raw.githubusercontent.com/deepasharma06/Data-621-HW3/main/Car%20Data.csv")
head(df)
##        Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type
## 1          ritz 2014          3.35          5.59      27000    Petrol
## 2           sx4 2013          4.75          9.54      43000    Diesel
## 3          ciaz 2017          7.25          9.85       6900    Petrol
## 4       wagon r 2011          2.85          4.15       5200    Petrol
## 5         swift 2014          4.60          6.87      42450    Diesel
## 6 vitara brezza 2018          9.25          9.83       2071    Diesel
##   Seller_Type Transmission Owner
## 1      Dealer       Manual     0
## 2      Dealer       Manual     0
## 3      Dealer       Manual     0
## 4      Dealer       Manual     0
## 5      Dealer       Manual     0
## 6      Dealer       Manual     0

First, check for missing values in each column:

print(colSums(is.na(df)))
##      Car_Name          Year Selling_Price Present_Price    Kms_Driven 
##             0             0             0             0             0 
##     Fuel_Type   Seller_Type  Transmission         Owner 
##             0             0             0             0

There are no missing values in this dataset, so no imputation or row removal is needed.

Use the cor() function to measure how strongly two variables are related. Among independent variables, very high correlations would signal multicollinearity.

Check the correlation between Year and Selling_Price:

cor(df$Selling_Price, df$Year)
## [1] 0.236141
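
The value of about 0.24 above is a weak positive correlation between the response and a single predictor. To check whether the predictors are too highly correlated with each other, one option is a correlation matrix over the numeric predictors. The sketch below uses a hypothetical stand-in data frame (`toy`, with made-up values) so that it runs on its own; the 0.8 cutoff is an arbitrary rule of thumb, not a value from this analysis.

```r
# Hypothetical stand-in data with the same numeric column names as df;
# the values are randomly generated for illustration only.
set.seed(1)
n <- 50
toy <- data.frame(
  Year          = sample(2005:2018, n, replace = TRUE),
  Present_Price = round(runif(n, 1, 20), 2),
  Kms_Driven    = round(runif(n, 1000, 90000))
)

# Pairwise Pearson correlations among the numeric predictors
pred_cor <- cor(toy[, c("Year", "Present_Price", "Kms_Driven")])
print(round(pred_cor, 3))

# Flag predictor pairs whose absolute correlation exceeds the chosen cutoff
# (upper.tri() keeps each pair once and skips the diagonal)
high <- which(abs(pred_cor) > 0.8 & upper.tri(pred_cor), arr.ind = TRUE)
print(high)
```

With the real data, the same call would be applied to df in place of toy.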

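Check normality: use the hist() function to see whether the dependent variable is roughly normally distributed. (Strictly, the regression assumption concerns the residuals, which are checked later with a Q-Q plot.)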

hist(df$Selling_Price)

Check linearity using scatterplots of Selling_Price against Year (with a fitted regression line) and of Selling_Price against Kms_Driven.

df %>%
  ggplot(aes(x = Year, y = Selling_Price)) +
  geom_point() +
  geom_smooth(method = "lm", na.rm = TRUE) +
  labs(x = "Year", y = "Selling_Price", title = "Car selling price by year")

plot(Selling_Price ~ Year, data = df)

plot(Selling_Price ~ Kms_Driven, data = df)

Although the relationship between Selling_Price and Kms_Driven is less clear than the one with Year, it still appears roughly linear, so we can proceed with linear regression.

Multiple linear model with all variables

model1_Full <- lm(Selling_Price ~ ., data = df)
summary(model1_Full)
## 
## Call:
## lm(formula = Selling_Price ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7919 -0.3898  0.0000  0.3617  8.5244 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       -1.205e+03  1.145e+02 -10.524  < 2e-16 ***
## Car_NameActiva 3g                 -4.199e+00  2.078e+00  -2.021 0.044697 *  
## Car_NameActiva 4g                 -8.251e+00  2.284e+00  -3.612 0.000387 ***
## Car_Namealto 800                  -8.390e+00  2.390e+00  -3.511 0.000555 ***
## Car_Namealto k10                  -6.846e+00  1.943e+00  -3.524 0.000530 ***
## Car_Nameamaze                     -7.210e+00  1.900e+00  -3.795 0.000197 ***
## Car_NameBajaj  ct 100             -6.675e+00  2.198e+00  -3.036 0.002723 ** 
## Car_NameBajaj Avenger 150         -7.113e+00  2.208e+00  -3.221 0.001497 ** 
## Car_NameBajaj Avenger 150 street  -7.203e+00  2.208e+00  -3.262 0.001304 ** 
## Car_NameBajaj Avenger 220         -7.585e+00  1.855e+00  -4.090 6.31e-05 ***
## Car_NameBajaj Avenger 220 dtsi    -5.252e+00  1.891e+00  -2.778 0.006009 ** 
## Car_NameBajaj Avenger Street 220  -4.415e+00  2.150e+00  -2.053 0.041368 *  
## Car_NameBajaj Discover 100        -5.534e+00  2.173e+00  -2.547 0.011635 *  
## Car_NameBajaj Discover 125        -5.015e+00  2.041e+00  -2.457 0.014899 *  
## Car_NameBajaj Dominar 400         -7.504e+00  2.216e+00  -3.386 0.000857 ***
## Car_NameBajaj Pulsar  NS 200      -6.087e+00  2.179e+00  -2.793 0.005738 ** 
## Car_NameBajaj Pulsar 135 LS       -6.137e+00  2.183e+00  -2.811 0.005445 ** 
## Car_NameBajaj Pulsar 150          -3.363e+00  1.715e+00  -1.961 0.051345 .  
## Car_NameBajaj Pulsar 220 F        -5.498e+00  1.895e+00  -2.901 0.004147 ** 
## Car_NameBajaj Pulsar NS 200       -5.206e+00  1.788e+00  -2.912 0.004015 ** 
## Car_NameBajaj Pulsar RS200        -7.085e+00  2.204e+00  -3.214 0.001531 ** 
## Car_Namebaleno                    -7.484e+00  2.379e+00  -3.145 0.001919 ** 
## Car_Namebrio                      -6.515e+00  1.881e+00  -3.464 0.000655 ***
## Car_Namecamry                     -1.314e+01  3.307e+00  -3.972 0.000100 ***
## Car_Nameciaz                      -6.685e+00  1.882e+00  -3.553 0.000478 ***
## Car_Namecity                      -6.707e+00  1.846e+00  -3.633 0.000358 ***
## Car_Namecorolla                   -6.701e+00  2.344e+00  -2.858 0.004723 ** 
## Car_Namecorolla altis             -9.082e+00  1.932e+00  -4.701 4.90e-06 ***
## Car_Namecreta                     -4.689e+00  2.043e+00  -2.295 0.022788 *  
## Car_Namedzire                     -6.791e+00  1.941e+00  -3.499 0.000579 ***
## Car_Nameelantra                   -5.292e+00  2.120e+00  -2.496 0.013394 *  
## Car_Nameeon                       -7.729e+00  1.936e+00  -3.991 9.30e-05 ***
## Car_Nameertiga                    -7.135e+00  1.916e+00  -3.723 0.000257 ***
## Car_Nameetios cross               -7.713e+00  2.008e+00  -3.841 0.000166 ***
## Car_Nameetios g                   -7.137e+00  2.031e+00  -3.514 0.000548 ***
## Car_Nameetios gd                  -8.248e+00  2.347e+00  -3.514 0.000549 ***
## Car_Nameetios liva                -6.847e+00  1.941e+00  -3.527 0.000525 ***
## Car_Namefortuner                  -6.625e+00  2.249e+00  -2.946 0.003611 ** 
## Car_Namegrand i10                 -6.528e+00  1.900e+00  -3.435 0.000723 ***
## Car_NameHero  CBZ Xtreme          -2.647e+00  2.127e+00  -1.245 0.214773    
## Car_NameHero  Ignitor Disc        -5.966e+00  2.313e+00  -2.580 0.010626 *  
## Car_NameHero Extreme              -5.713e+00  1.905e+00  -2.999 0.003064 ** 
## Car_NameHero Glamour              -5.625e+00  2.172e+00  -2.590 0.010330 *  
## Car_NameHero Honda CBZ extreme    -4.156e+00  2.148e+00  -1.935 0.054471 .  
## Car_NameHero Honda Passion Pro    -4.746e+00  2.160e+00  -2.197 0.029168 *  
## Car_NameHero Hunk                 -2.301e+00  2.259e+00  -1.019 0.309667    
## Car_NameHero Passion Pro          -7.007e+00  1.937e+00  -3.618 0.000379 ***
## Car_NameHero Passion X pro        -7.108e+00  2.211e+00  -3.215 0.001526 ** 
## Car_NameHero Splender iSmart      -6.955e+00  1.936e+00  -3.592 0.000415 ***
## Car_NameHero Splender Plus        -7.180e+00  2.214e+00  -3.243 0.001392 ** 
## Car_NameHero Super Splendor       -6.951e-01  2.119e+00  -0.328 0.743257    
## Car_NameHonda Activa 125          -7.629e+00  2.268e+00  -3.364 0.000924 ***
## Car_NameHonda Activa 4G           -8.172e+00  2.028e+00  -4.030 7.98e-05 ***
## Car_NameHonda CB Hornet 160R      -7.575e+00  1.855e+00  -4.084 6.47e-05 ***
## Car_NameHonda CB Shine            -3.759e+00  1.866e+00  -2.015 0.045297 *  
## Car_NameHonda CB Trigger          -5.576e+00  2.172e+00  -2.567 0.011005 *  
## Car_NameHonda CB twister          -4.667e+00  1.882e+00  -2.480 0.013991 *  
## Car_NameHonda CB Unicorn          -6.690e+00  2.195e+00  -3.048 0.002624 ** 
## Car_NameHonda CBR 150             -5.871e+00  1.900e+00  -3.090 0.002296 ** 
## Car_NameHonda Dream Yuga          -7.826e+00  2.226e+00  -3.517 0.000544 ***
## Car_NameHonda Karizma             -3.843e+00  1.877e+00  -2.048 0.041905 *  
## Car_NameHyosung GT250R            -7.071e+00  2.307e+00  -3.065 0.002487 ** 
## Car_Namei10                       -5.319e+00  1.911e+00  -2.783 0.005910 ** 
## Car_Namei20                       -6.563e+00  1.866e+00  -3.518 0.000541 ***
## Car_Nameignis                     -7.560e+00  2.377e+00  -3.180 0.001713 ** 
## Car_Nameinnova                    -4.107e+00  1.902e+00  -2.159 0.032078 *  
## Car_Namejazz                      -6.970e+00  1.907e+00  -3.655 0.000331 ***
## Car_NameKTM 390 Duke              -6.436e+00  2.172e+00  -2.963 0.003425 ** 
## Car_NameKTM RC200                 -7.315e+00  1.940e+00  -3.770 0.000217 ***
## Car_NameKTM RC390                 -6.752e+00  2.182e+00  -3.094 0.002262 ** 
## Car_Nameland cruiser              -2.384e+01  5.915e+00  -4.030 7.98e-05 ***
## Car_NameMahindra Mojo XT300       -6.931e+00  2.203e+00  -3.146 0.001916 ** 
## Car_Nameomni                      -6.237e+00  2.343e+00  -2.662 0.008420 ** 
## Car_Nameritz                      -6.629e+00  1.942e+00  -3.414 0.000778 ***
## Car_NameRoyal Enfield Bullet 350  -7.031e+00  2.205e+00  -3.189 0.001664 ** 
## Car_NameRoyal Enfield Classic 350 -6.435e+00  1.701e+00  -3.784 0.000205 ***
## Car_NameRoyal Enfield Classic 500 -4.094e+00  1.866e+00  -2.194 0.029387 *  
## Car_NameRoyal Enfield Thunder 350 -5.879e+00  1.755e+00  -3.350 0.000969 ***
## Car_NameRoyal Enfield Thunder 500 -6.395e+00  1.813e+00  -3.527 0.000524 ***
## Car_Names cross                   -6.968e+00  2.347e+00  -2.969 0.003365 ** 
## Car_NameSuzuki Access 125         -3.027e+00  2.203e+00  -1.374 0.171025    
## Car_Nameswift                     -6.257e+00  1.840e+00  -3.401 0.000816 ***
## Car_Namesx4                       -5.708e+00  1.876e+00  -3.042 0.002670 ** 
## Car_NameTVS Apache RTR 160        -6.230e+00  1.816e+00  -3.430 0.000736 ***
## Car_NameTVS Apache RTR 180        -5.264e+00  1.893e+00  -2.780 0.005964 ** 
## Car_NameTVS Jupyter               -6.421e+00  2.242e+00  -2.864 0.004641 ** 
## Car_NameTVS Sport                 -7.785e+00  2.226e+00  -3.498 0.000581 ***
## Car_NameTVS Wego                  -4.102e+00  2.205e+00  -1.861 0.064278 .  
## Car_NameUM Renegade Mojave        -7.380e+00  2.214e+00  -3.333 0.001028 ** 
## Car_Nameverna                     -6.552e+00  1.843e+00  -3.555 0.000474 ***
## Car_Namevitara brezza             -6.871e+00  2.377e+00  -2.891 0.004271 ** 
## Car_Namewagon r                   -5.148e+00  1.952e+00  -2.637 0.009030 ** 
## Car_Namexcent                     -7.288e+00  2.041e+00  -3.571 0.000448 ***
## Car_NameYamaha Fazer              -6.202e+00  2.182e+00  -2.842 0.004957 ** 
## Car_NameYamaha FZ  v 2.0          -6.893e+00  1.933e+00  -3.567 0.000455 ***
## Car_NameYamaha FZ 16              -6.472e+00  2.194e+00  -2.950 0.003562 ** 
## Car_NameYamaha FZ S               -5.082e+00  2.164e+00  -2.349 0.019826 *  
## Car_NameYamaha FZ S V 2.0         -6.982e+00  1.838e+00  -3.799 0.000194 ***
## Year                               6.012e-01  5.699e-02  10.549  < 2e-16 ***
## Present_Price                      5.784e-01  6.338e-02   9.127  < 2e-16 ***
## Kms_Driven                        -4.608e-06  3.647e-06  -1.264 0.207916    
## Fuel_TypeDiesel                    2.424e+00  1.220e+00   1.987 0.048285 *  
## Fuel_TypePetrol                    1.749e+00  1.199e+00   1.459 0.146260    
## Seller_TypeIndividual             -1.134e+00  9.483e-01  -1.196 0.233095    
## TransmissionManual                -3.280e-01  4.356e-01  -0.753 0.452435    
## Owner                              2.718e-01  7.628e-01   0.356 0.721971    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.487 on 195 degrees of freedom
## Multiple R-squared:  0.9443, Adjusted R-squared:  0.9144 
## F-statistic: 31.51 on 105 and 195 DF,  p-value: < 2.2e-16
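
  • The 1Q and 3Q residuals are roughly equidistant from zero and the median is zero, which suggests the residuals are centered and roughly symmetric.

  • The model explains 94.43% of the variability in Selling_Price according to the R^2 value (adjusted R^2: 91.44%). Most of the 105 coefficients are Car_Name dummy variables, so part of this fit may come from overfitting to individual car models.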
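
The gap between the multiple and adjusted R-squared comes from the penalty for the number of predictors. As a rough check, the adjusted value can be recomputed from the figures printed in the summary output above, using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). The inputs are rounded as printed, so the result only approximately matches the reported 0.9144.

```r
# Values read from the summary(model1_Full) output
r2   <- 0.9443   # Multiple R-squared
p    <- 105      # number of estimated slopes (numerator DF of the F-statistic)
df_r <- 195      # residual degrees of freedom, equal to n - p - 1
n    <- df_r + p + 1   # implied number of observations

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_r
print(n)
print(round(adj_r2, 3))
```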

Multiple Linear regression with manually selected features

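Dropping Car_Name, Fuel_Type, and Owner leaves the predictors listed below. Of these, only Year and Present_Price were highly significant in model1_Full; Kms_Driven, Seller_TypeIndividual, and TransmissionManual were not (p = 0.208, 0.233, and 0.452), but they are kept in the reduced model.

  • Year
  • Present_Price
  • Kms_Driven
  • Seller_TypeIndividual
  • TransmissionManual

model2_selected <- lm(Selling_Price ~ Year + Present_Price + Kms_Driven + Seller_Type + Transmission, data = df)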
summary(model2_selected)
## 
## Call:
## lm(formula = Selling_Price ~ Year + Present_Price + Kms_Driven + 
##     Seller_Type + Transmission, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2829 -0.8487 -0.1967  0.6072 11.6630 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -9.004e+02  8.985e+01 -10.021  < 2e-16 ***
## Year                   4.485e-01  4.457e-02  10.063  < 2e-16 ***
## Present_Price          4.717e-01  1.587e-02  29.730  < 2e-16 ***
## Kms_Driven            -3.518e-06  3.387e-06  -1.039 0.299834    
## Seller_TypeIndividual -1.351e+00  2.682e-01  -5.038 8.22e-07 ***
## TransmissionManual    -1.253e+00  3.475e-01  -3.605 0.000367 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.88 on 295 degrees of freedom
## Multiple R-squared:  0.8655, Adjusted R-squared:  0.8632 
## F-statistic: 379.5 on 5 and 295 DF,  p-value: < 2.2e-16
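
The discussion prompt also asks for a quadratic term and a dichotomous vs. quantitative interaction term, which model2_selected does not contain. The following is only a sketch of how such a model could be specified, not a result from this analysis. It runs on a hypothetical data frame (`toy`) with a made-up response; `model3` and `Year_c` are illustrative names, and the choice of which variable to square and which to interact is an assumption.

```r
# Hypothetical stand-in data; values are simulated for illustration only
set.seed(42)
n <- 200
toy <- data.frame(
  Year          = sample(2005:2018, n, replace = TRUE),
  Present_Price = round(runif(n, 1, 20), 2),
  Transmission  = factor(sample(c("Automatic", "Manual"), n, replace = TRUE))
)

# Made-up response so the example runs without the CSV
toy$Selling_Price <- with(toy,
  0.5 * Present_Price + 0.01 * Present_Price^2 +
    0.4 * (Year - 2005) - 1.0 * (Transmission == "Manual") +
    rnorm(n, sd = 1))

# Centre Year so the intercept and main effects are easier to read
toy$Year_c <- toy$Year - mean(toy$Year)

model3 <- lm(
  Selling_Price ~ Present_Price + I(Present_Price^2) +  # quadratic term
    Transmission +                                       # dichotomous term
    Year_c + Transmission:Year_c,                        # dichotomous x quantitative interaction
  data = toy
)
print(summary(model3))
```

In a model of this form, the TransmissionManual:Year_c coefficient is the difference in the Year slope between manual and automatic cars, and the I(Present_Price^2) coefficient describes curvature in the Present_Price effect.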

Check for homoscedasticity

To check that the model is a good fit for the data and that the error variance does not change much across the fitted values, draw the standard diagnostic plots:

par(mfrow=c(2,2)) 
plot(model2_selected)

par(mfrow=c(1,1))

The residuals show no obvious bias in these plots, so the model appears to satisfy the assumption of homoscedasticity. The residual range (-8.28 to 11.66) does suggest a few outlying observations worth inspecting.
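
A visual check can be complemented with a formal test. The sketch below hand-computes a Breusch-Pagan style statistic (the studentized n * R^2 form) in base R: the squared residuals are regressed on the predictors, and n * R^2 is compared with a chi-squared distribution. It uses simulated, deliberately heteroscedastic data (`toy`, `fit`, and `aux` are hypothetical names) and is not a result for model2_selected.

```r
# Simulated data; the error spread grows with x1, so the data are
# heteroscedastic by construction
set.seed(7)
n <- 150
toy <- data.frame(x1 = runif(n, 0, 10), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(n, sd = 0.5 + 0.5 * toy$x1)

fit <- lm(y ~ x1 + x2, data = toy)

# Auxiliary regression: squared residuals on the same predictors
toy$res_sq <- resid(fit)^2
aux <- lm(res_sq ~ x1 + x2, data = toy)

# Test statistic n * R^2, compared with chi-squared on k degrees of freedom
k       <- length(coef(aux)) - 1
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = k, lower.tail = FALSE)
print(c(statistic = bp_stat, df = k, p.value = bp_p))
```

A small p-value is evidence against constant error variance. The same steps could be applied to model2_selected by using its residuals and its five predictors.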

Draw a Q-Q plot of the residuals to check normality:

# define residuals
res <- resid(model2_selected)

# create Q-Q plot for residuals
qqnorm(res)

# add a straight diagonal reference line to the plot
qqline(res, col = "red")

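Here we can see that the residuals follow a nearly normal distribution, so the linear model fits the data reasonably well. In model2_selected, Selling_Price is significantly related to Year, Present_Price, Seller_Type, and Transmission (adjusted R^2 = 0.8632), while Kms_Driven is not significant (p = 0.30).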
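
Reading a Q-Q plot is somewhat subjective, so a Shapiro-Wilk test on the residuals is one way to back it up. The sketch below runs shapiro.test() from base R on the residuals of a small simulated regression (`toy` and `fit` are hypothetical names). It illustrates the call only and says nothing about the residuals of model2_selected.

```r
# Simulated data with normally distributed errors, for illustration only
set.seed(3)
n <- 100
toy <- data.frame(x = runif(n, 0, 10))
toy$y <- 2 + 0.5 * toy$x + rnorm(n)

fit <- lm(y ~ x, data = toy)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(fit))
print(sw)
```

A small p-value would indicate a departure from normality. For the model above, the equivalent call would be shapiro.test(res).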