Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
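The dataset has nine columns: Car_Name, Year, Selling_Price, Present_Price, Kms_Driven, Fuel_Type, Seller_Type, Transmission, and Owner.
Here is the original link from Kaggle: https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho
First, I uploaded the dataset to my GitHub repository and loaded it into R: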
df<- read.csv("https://raw.githubusercontent.com/deepasharma06/Data-621-HW3/main/Car%20Data.csv")
head(df)
## Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type
## 1 ritz 2014 3.35 5.59 27000 Petrol
## 2 sx4 2013 4.75 9.54 43000 Diesel
## 3 ciaz 2017 7.25 9.85 6900 Petrol
## 4 wagon r 2011 2.85 4.15 5200 Petrol
## 5 swift 2014 4.60 6.87 42450 Diesel
## 6 vitara brezza 2018 9.25 9.83 2071 Diesel
## Seller_Type Transmission Owner
## 1 Dealer Manual 0
## 2 Dealer Manual 0
## 3 Dealer Manual 0
## 4 Dealer Manual 0
## 5 Dealer Manual 0
## 6 Dealer Manual 0
Before looking at the associations between variables, we check each column for missing values:
print(colSums(is.na(df)))
## Car_Name Year Selling_Price Present_Price Kms_Driven
## 0 0 0 0 0
## Fuel_Type Seller_Type Transmission Owner
## 0 0 0 0
There are no missing values in this dataset, so no imputation or row removal is needed. Next, we look at how Selling_Price relates to Year and Kms_Driven:
cor(df$Selling_Price, df$Year)
## [1] 0.236141
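The single correlation above covers only Selling_Price and Year. To see all pairwise associations at once, one option is a correlation matrix over the numeric columns. The sketch below uses a small synthetic data frame called toy (an assumption for illustration: the values are made up and only the column names are borrowed) so that it runs without downloading anything. On the real data the same two steps would be applied to df.

```r
# Synthetic stand-in for the car data (values are made up for illustration)
set.seed(1)
n <- 200
toy <- data.frame(
  Year          = sample(2005:2018, n, replace = TRUE),
  Present_Price = runif(n, 1, 20),
  Kms_Driven    = runif(n, 1000, 100000),
  Transmission  = sample(c("Automatic", "Manual"), n, replace = TRUE)
)
toy$Selling_Price <- 0.5 * toy$Present_Price +
  0.4 * (toy$Year - 2005) + rnorm(n)

# Keep only the numeric columns, then compute all pairwise Pearson correlations
num_cols <- sapply(toy, is.numeric)
cor_mat  <- cor(toy[, num_cols])
print(round(cor_mat, 2))
```

Character columns such as Transmission are dropped by the is.numeric filter, which is why they need to be examined separately (for example with boxplots).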
hist(df$Selling_Price)
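library(dplyr)    # provides the %>% pipe
library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth()
df %>%
  ggplot(aes(x = Year, y = Selling_Price)) +
  geom_point() +
  geom_smooth(method = 'lm', na.rm = TRUE) +
  labs(x = 'Year', y = 'Selling_Price', title = 'Car selling price by year')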
plot( Selling_Price~Year, data = df)
plot( Selling_Price~Kms_Driven, data = df)
Although the relationship between Selling_Price and Kms_Driven is a bit less clear, it still appears linear. We can proceed with linear regression.
model1_Full <- lm(Selling_Price~.,data = df)
summary(model1_Full)
##
## Call:
## lm(formula = Selling_Price ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7919 -0.3898 0.0000 0.3617 8.5244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.205e+03 1.145e+02 -10.524 < 2e-16 ***
## Car_NameActiva 3g -4.199e+00 2.078e+00 -2.021 0.044697 *
## Car_NameActiva 4g -8.251e+00 2.284e+00 -3.612 0.000387 ***
## Car_Namealto 800 -8.390e+00 2.390e+00 -3.511 0.000555 ***
## Car_Namealto k10 -6.846e+00 1.943e+00 -3.524 0.000530 ***
## Car_Nameamaze -7.210e+00 1.900e+00 -3.795 0.000197 ***
## Car_NameBajaj ct 100 -6.675e+00 2.198e+00 -3.036 0.002723 **
## Car_NameBajaj Avenger 150 -7.113e+00 2.208e+00 -3.221 0.001497 **
## Car_NameBajaj Avenger 150 street -7.203e+00 2.208e+00 -3.262 0.001304 **
## Car_NameBajaj Avenger 220 -7.585e+00 1.855e+00 -4.090 6.31e-05 ***
## Car_NameBajaj Avenger 220 dtsi -5.252e+00 1.891e+00 -2.778 0.006009 **
## Car_NameBajaj Avenger Street 220 -4.415e+00 2.150e+00 -2.053 0.041368 *
## Car_NameBajaj Discover 100 -5.534e+00 2.173e+00 -2.547 0.011635 *
## Car_NameBajaj Discover 125 -5.015e+00 2.041e+00 -2.457 0.014899 *
## Car_NameBajaj Dominar 400 -7.504e+00 2.216e+00 -3.386 0.000857 ***
## Car_NameBajaj Pulsar NS 200 -6.087e+00 2.179e+00 -2.793 0.005738 **
## Car_NameBajaj Pulsar 135 LS -6.137e+00 2.183e+00 -2.811 0.005445 **
## Car_NameBajaj Pulsar 150 -3.363e+00 1.715e+00 -1.961 0.051345 .
## Car_NameBajaj Pulsar 220 F -5.498e+00 1.895e+00 -2.901 0.004147 **
## Car_NameBajaj Pulsar NS 200 -5.206e+00 1.788e+00 -2.912 0.004015 **
## Car_NameBajaj Pulsar RS200 -7.085e+00 2.204e+00 -3.214 0.001531 **
## Car_Namebaleno -7.484e+00 2.379e+00 -3.145 0.001919 **
## Car_Namebrio -6.515e+00 1.881e+00 -3.464 0.000655 ***
## Car_Namecamry -1.314e+01 3.307e+00 -3.972 0.000100 ***
## Car_Nameciaz -6.685e+00 1.882e+00 -3.553 0.000478 ***
## Car_Namecity -6.707e+00 1.846e+00 -3.633 0.000358 ***
## Car_Namecorolla -6.701e+00 2.344e+00 -2.858 0.004723 **
## Car_Namecorolla altis -9.082e+00 1.932e+00 -4.701 4.90e-06 ***
## Car_Namecreta -4.689e+00 2.043e+00 -2.295 0.022788 *
## Car_Namedzire -6.791e+00 1.941e+00 -3.499 0.000579 ***
## Car_Nameelantra -5.292e+00 2.120e+00 -2.496 0.013394 *
## Car_Nameeon -7.729e+00 1.936e+00 -3.991 9.30e-05 ***
## Car_Nameertiga -7.135e+00 1.916e+00 -3.723 0.000257 ***
## Car_Nameetios cross -7.713e+00 2.008e+00 -3.841 0.000166 ***
## Car_Nameetios g -7.137e+00 2.031e+00 -3.514 0.000548 ***
## Car_Nameetios gd -8.248e+00 2.347e+00 -3.514 0.000549 ***
## Car_Nameetios liva -6.847e+00 1.941e+00 -3.527 0.000525 ***
## Car_Namefortuner -6.625e+00 2.249e+00 -2.946 0.003611 **
## Car_Namegrand i10 -6.528e+00 1.900e+00 -3.435 0.000723 ***
## Car_NameHero CBZ Xtreme -2.647e+00 2.127e+00 -1.245 0.214773
## Car_NameHero Ignitor Disc -5.966e+00 2.313e+00 -2.580 0.010626 *
## Car_NameHero Extreme -5.713e+00 1.905e+00 -2.999 0.003064 **
## Car_NameHero Glamour -5.625e+00 2.172e+00 -2.590 0.010330 *
## Car_NameHero Honda CBZ extreme -4.156e+00 2.148e+00 -1.935 0.054471 .
## Car_NameHero Honda Passion Pro -4.746e+00 2.160e+00 -2.197 0.029168 *
## Car_NameHero Hunk -2.301e+00 2.259e+00 -1.019 0.309667
## Car_NameHero Passion Pro -7.007e+00 1.937e+00 -3.618 0.000379 ***
## Car_NameHero Passion X pro -7.108e+00 2.211e+00 -3.215 0.001526 **
## Car_NameHero Splender iSmart -6.955e+00 1.936e+00 -3.592 0.000415 ***
## Car_NameHero Splender Plus -7.180e+00 2.214e+00 -3.243 0.001392 **
## Car_NameHero Super Splendor -6.951e-01 2.119e+00 -0.328 0.743257
## Car_NameHonda Activa 125 -7.629e+00 2.268e+00 -3.364 0.000924 ***
## Car_NameHonda Activa 4G -8.172e+00 2.028e+00 -4.030 7.98e-05 ***
## Car_NameHonda CB Hornet 160R -7.575e+00 1.855e+00 -4.084 6.47e-05 ***
## Car_NameHonda CB Shine -3.759e+00 1.866e+00 -2.015 0.045297 *
## Car_NameHonda CB Trigger -5.576e+00 2.172e+00 -2.567 0.011005 *
## Car_NameHonda CB twister -4.667e+00 1.882e+00 -2.480 0.013991 *
## Car_NameHonda CB Unicorn -6.690e+00 2.195e+00 -3.048 0.002624 **
## Car_NameHonda CBR 150 -5.871e+00 1.900e+00 -3.090 0.002296 **
## Car_NameHonda Dream Yuga -7.826e+00 2.226e+00 -3.517 0.000544 ***
## Car_NameHonda Karizma -3.843e+00 1.877e+00 -2.048 0.041905 *
## Car_NameHyosung GT250R -7.071e+00 2.307e+00 -3.065 0.002487 **
## Car_Namei10 -5.319e+00 1.911e+00 -2.783 0.005910 **
## Car_Namei20 -6.563e+00 1.866e+00 -3.518 0.000541 ***
## Car_Nameignis -7.560e+00 2.377e+00 -3.180 0.001713 **
## Car_Nameinnova -4.107e+00 1.902e+00 -2.159 0.032078 *
## Car_Namejazz -6.970e+00 1.907e+00 -3.655 0.000331 ***
## Car_NameKTM 390 Duke -6.436e+00 2.172e+00 -2.963 0.003425 **
## Car_NameKTM RC200 -7.315e+00 1.940e+00 -3.770 0.000217 ***
## Car_NameKTM RC390 -6.752e+00 2.182e+00 -3.094 0.002262 **
## Car_Nameland cruiser -2.384e+01 5.915e+00 -4.030 7.98e-05 ***
## Car_NameMahindra Mojo XT300 -6.931e+00 2.203e+00 -3.146 0.001916 **
## Car_Nameomni -6.237e+00 2.343e+00 -2.662 0.008420 **
## Car_Nameritz -6.629e+00 1.942e+00 -3.414 0.000778 ***
## Car_NameRoyal Enfield Bullet 350 -7.031e+00 2.205e+00 -3.189 0.001664 **
## Car_NameRoyal Enfield Classic 350 -6.435e+00 1.701e+00 -3.784 0.000205 ***
## Car_NameRoyal Enfield Classic 500 -4.094e+00 1.866e+00 -2.194 0.029387 *
## Car_NameRoyal Enfield Thunder 350 -5.879e+00 1.755e+00 -3.350 0.000969 ***
## Car_NameRoyal Enfield Thunder 500 -6.395e+00 1.813e+00 -3.527 0.000524 ***
## Car_Names cross -6.968e+00 2.347e+00 -2.969 0.003365 **
## Car_NameSuzuki Access 125 -3.027e+00 2.203e+00 -1.374 0.171025
## Car_Nameswift -6.257e+00 1.840e+00 -3.401 0.000816 ***
## Car_Namesx4 -5.708e+00 1.876e+00 -3.042 0.002670 **
## Car_NameTVS Apache RTR 160 -6.230e+00 1.816e+00 -3.430 0.000736 ***
## Car_NameTVS Apache RTR 180 -5.264e+00 1.893e+00 -2.780 0.005964 **
## Car_NameTVS Jupyter -6.421e+00 2.242e+00 -2.864 0.004641 **
## Car_NameTVS Sport -7.785e+00 2.226e+00 -3.498 0.000581 ***
## Car_NameTVS Wego -4.102e+00 2.205e+00 -1.861 0.064278 .
## Car_NameUM Renegade Mojave -7.380e+00 2.214e+00 -3.333 0.001028 **
## Car_Nameverna -6.552e+00 1.843e+00 -3.555 0.000474 ***
## Car_Namevitara brezza -6.871e+00 2.377e+00 -2.891 0.004271 **
## Car_Namewagon r -5.148e+00 1.952e+00 -2.637 0.009030 **
## Car_Namexcent -7.288e+00 2.041e+00 -3.571 0.000448 ***
## Car_NameYamaha Fazer -6.202e+00 2.182e+00 -2.842 0.004957 **
## Car_NameYamaha FZ v 2.0 -6.893e+00 1.933e+00 -3.567 0.000455 ***
## Car_NameYamaha FZ 16 -6.472e+00 2.194e+00 -2.950 0.003562 **
## Car_NameYamaha FZ S -5.082e+00 2.164e+00 -2.349 0.019826 *
## Car_NameYamaha FZ S V 2.0 -6.982e+00 1.838e+00 -3.799 0.000194 ***
## Year 6.012e-01 5.699e-02 10.549 < 2e-16 ***
## Present_Price 5.784e-01 6.338e-02 9.127 < 2e-16 ***
## Kms_Driven -4.608e-06 3.647e-06 -1.264 0.207916
## Fuel_TypeDiesel 2.424e+00 1.220e+00 1.987 0.048285 *
## Fuel_TypePetrol 1.749e+00 1.199e+00 1.459 0.146260
## Seller_TypeIndividual -1.134e+00 9.483e-01 -1.196 0.233095
## TransmissionManual -3.280e-01 4.356e-01 -0.753 0.452435
## Owner 2.718e-01 7.628e-01 0.356 0.721971
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.487 on 195 degrees of freedom
## Multiple R-squared: 0.9443, Adjusted R-squared: 0.9144
## F-statistic: 31.51 on 105 and 195 DF, p-value: < 2.2e-16
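The first and third quartiles of the residuals (-0.3898 and 0.3617) are roughly equidistant from zero and the median is zero, which suggests the residuals are centered and roughly symmetric around zero.
The model explains 94.43% of the variability in Selling_Price according to the multiple R^2 (adjusted R^2: 91.44%). Note that 97 of the 105 estimated slopes are Car_Name indicator terms, so the high R^2 partly reflects giving each car name its own intercept.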
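To make the R^2 figures concrete, here is a minimal sketch of how R^2 and adjusted R^2 are computed from the residuals. It uses a synthetic data frame called toy (an assumption for illustration, not the car data) so that it is self-contained. The last lines redo the adjustment with the numbers reported above: n = 301 rows (195 residual degrees of freedom plus 105 slopes plus the intercept) and p = 105 predictors.

```r
# Synthetic stand-in for the car data (values are made up for illustration)
set.seed(1)
n <- 200
toy <- data.frame(
  Year          = sample(2005:2018, n, replace = TRUE),
  Present_Price = runif(n, 1, 20),
  Transmission  = sample(c("Automatic", "Manual"), n, replace = TRUE)
)
toy$Selling_Price <- 0.5 * toy$Present_Price + 0.4 * (toy$Year - 2005) -
  1.0 * (toy$Transmission == "Manual") + rnorm(n)

fit <- lm(Selling_Price ~ Year + Present_Price + Transmission, data = toy)

# R^2 = 1 - RSS / TSS
rss <- sum(resid(fit)^2)
tss <- sum((toy$Selling_Price - mean(toy$Selling_Price))^2)
r2  <- 1 - rss / tss

# Adjusted R^2 penalizes the number of predictors p
p      <- length(coef(fit)) - 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same adjustment applied to the values reported for model1_Full
adj_reported <- 1 - (1 - 0.9443) * (301 - 1) / (301 - 105 - 1)
print(c(r2 = r2, adj_r2 = adj_r2, adj_reported = adj_reported))
```

The recomputed value for model1_Full lands close to the reported 0.9144; the small difference comes from using the rounded R^2.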
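Next, we fit a reduced model that drops Car_Name, Fuel_Type, and Owner and keeps Year, Present_Price, Kms_Driven, Seller_Type, and Transmission. Of these, only Year and Present_Price were highly significant in model1_Full: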
model2_selected <- lm(Selling_Price~ Year + Present_Price +Kms_Driven + Seller_Type + Transmission, data = df)
summary(model2_selected)
##
## Call:
## lm(formula = Selling_Price ~ Year + Present_Price + Kms_Driven +
## Seller_Type + Transmission, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2829 -0.8487 -0.1967 0.6072 11.6630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.004e+02 8.985e+01 -10.021 < 2e-16 ***
## Year 4.485e-01 4.457e-02 10.063 < 2e-16 ***
## Present_Price 4.717e-01 1.587e-02 29.730 < 2e-16 ***
## Kms_Driven -3.518e-06 3.387e-06 -1.039 0.299834
## Seller_TypeIndividual -1.351e+00 2.682e-01 -5.038 8.22e-07 ***
## TransmissionManual -1.253e+00 3.475e-01 -3.605 0.000367 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.88 on 295 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8632
## F-statistic: 379.5 on 5 and 295 DF, p-value: < 2.2e-16
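Reading the table above, the TransmissionManual estimate of -1.253 means that a manual car is predicted to sell for 1.253 less than an automatic car with the same Year, Present_Price, Kms_Driven, and Seller_Type. The sketch below checks that interpretation on a synthetic data frame called toy (an assumption for illustration, not the car data): the gap between two predictions that differ only in Transmission equals the dummy coefficient. The names car_auto and car_manual are hypothetical.

```r
# Synthetic stand-in for the car data (values are made up for illustration)
set.seed(1)
n <- 200
toy <- data.frame(
  Year          = sample(2005:2018, n, replace = TRUE),
  Present_Price = runif(n, 1, 20),
  Transmission  = sample(c("Automatic", "Manual"), n, replace = TRUE)
)
toy$Selling_Price <- 0.5 * toy$Present_Price + 0.4 * (toy$Year - 2005) -
  1.0 * (toy$Transmission == "Manual") + rnorm(n)

fit <- lm(Selling_Price ~ Year + Present_Price + Transmission, data = toy)

# Two hypothetical cars, identical except for the transmission
car_auto   <- data.frame(Year = 2015, Present_Price = 8,
                         Transmission = "Automatic")
car_manual <- transform(car_auto, Transmission = "Manual")

# The prediction gap is exactly the TransmissionManual coefficient
gap <- unname(predict(fit, car_manual) - predict(fit, car_auto))
print(c(gap = gap, coef = unname(coef(fit)["TransmissionManual"])))
```

The same reading applies to Seller_TypeIndividual, while the Year and Present_Price estimates are changes in predicted Selling_Price per one-unit increase with the other predictors held fixed.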
We should also check that our model is actually a good fit for the data, and that we don’t have large variation in the model error, by running this code:
par(mfrow=c(2,2))
plot(model2_selected)
par(mfrow=c(1,1))
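The residuals-vs-fitted panel shows no obvious bias in the residuals, and their spread looks roughly constant across the fitted values, so the model appears to satisfy the homoscedasticity assumption. The residual range (-8.28 to 11.66), however, suggests a few outlying cars that are worth inspecting.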
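A visual check of homoscedasticity can be complemented with a formal test. The sketch below hand-computes the studentized (Koenker) form of the Breusch-Pagan test using only base R: regress the squared residuals on the predictors and compare n times the auxiliary R^2 to a chi-squared distribution. It runs on a synthetic data frame called toy (an assumption for illustration, not the car data); this is one possible check, not the analysis performed above.

```r
# Synthetic stand-in for the car data (values are made up for illustration)
set.seed(1)
n <- 200
toy <- data.frame(
  Year          = sample(2005:2018, n, replace = TRUE),
  Present_Price = runif(n, 1, 20),
  Transmission  = sample(c("Automatic", "Manual"), n, replace = TRUE)
)
toy$Selling_Price <- 0.5 * toy$Present_Price + 0.4 * (toy$Year - 2005) -
  1.0 * (toy$Transmission == "Manual") + rnorm(n)

fit <- lm(Selling_Price ~ Year + Present_Price + Transmission, data = toy)

# Auxiliary regression: squared residuals on the same predictors
toy$res_sq <- resid(fit)^2
aux <- lm(res_sq ~ Year + Present_Price + Transmission, data = toy)

# Test statistic n * R^2 is approximately chi-squared under constant variance
lm_stat <- n * summary(aux)$r.squared
df_aux  <- length(coef(aux)) - 1
p_value <- pchisq(lm_stat, df = df_aux, lower.tail = FALSE)
print(c(statistic = lm_stat, df = df_aux, p_value = p_value))
```

A small p-value would be evidence against constant error variance; a large one means the test found no evidence of heteroscedasticity.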
# define residuals
res <- resid(model2_selected)
#create Q-Q plot for residuals
qqnorm(res)
#add a straight diagonal line to the plot
qqline(res, col = "red")
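Here we can see that the residuals of the model follow a nearly normal distribution, so we can say that the model fits the data reasonably well. In particular, the Year coefficient (0.4485, p < 2e-16) indicates a statistically significant positive relationship between Selling_Price and Year once the other predictors are held fixed.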
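The assignment also asks for a quadratic term, a dichotomous term, and a dichotomous vs. quantitative interaction. A minimal sketch of how such a model could be specified with R's formula syntax is shown below. It is fitted to a synthetic data frame called toy (an assumption for illustration: the values and the 0/1 column Manual are made up), so the estimates say nothing about the real car data.

```r
# Synthetic data with built-in curvature and an interaction (illustration only)
set.seed(42)
n <- 300
toy <- data.frame(
  Present_Price = runif(n, 1, 20),
  Manual        = rbinom(n, 1, 0.7)  # dichotomous: 1 = manual, 0 = automatic
)
toy$Selling_Price <- 1 + 0.9 * toy$Present_Price -
  0.015 * toy$Present_Price^2 - 0.5 * toy$Manual -
  0.1 * toy$Manual * toy$Present_Price + rnorm(n, sd = 0.5)

# I() protects ^ so it means "squared" rather than formula crossing;
# Present_Price:Manual adds the dichotomous-by-quantitative interaction
fit_q <- lm(Selling_Price ~ Present_Price + I(Present_Price^2) +
              Manual + Present_Price:Manual, data = toy)
b <- coef(fit_q)

# With a quadratic term the slope depends on the price level:
# d(Selling_Price)/d(Present_Price) = b1 + 2 * b2 * price (+ b4 if manual)
price        <- 10
slope_auto   <- unname(b["Present_Price"] + 2 * b["I(Present_Price^2)"] * price)
slope_manual <- slope_auto + unname(b["Present_Price:Manual"])
print(round(b, 3))
print(c(slope_auto = slope_auto, slope_manual = slope_manual))
```

In this specification the Manual coefficient is the gap between manual and automatic cars at Present_Price = 0, and the interaction coefficient is how much the Present_Price slope differs for manual cars.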