library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(dplyr)
library(ggrepel)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(patchwork)
library(broom)
library(lindia)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
options(scipen = 6)
cars24 <- read.delim('Cars24.csv', sep = ",")
head(cars24)
## Car.Brand Model Price Model.Year Location Fuel Driven..Kms.
## 1 Hyundai EonERA PLUS 330399 2016 Hyderabad Petrol 10674
## 2 Maruti Wagon R 1.0LXI 350199 2011 Hyderabad Petrol 20979
## 3 Maruti Alto K10LXI 229199 2011 Hyderabad Petrol 47330
## 4 Maruti RitzVXI BS IV 306399 2011 Hyderabad Petrol 19662
## 5 Tata NanoTWIST XTA 208699 2015 Hyderabad Petrol 11256
## 6 Maruti AltoLXI 249699 2012 Hyderabad Petrol 28434
## Gear Ownership EMI..monthly.
## 1 Manual 2 7350
## 2 Manual 1 7790
## 3 Manual 2 5098
## 4 Manual 1 6816
## 5 Automatic 1 4642
## 6 Manual 1 5554
For my model I am choosing Price as the response variable and Age, Driven..Kms., Fuel, Car.Brand and Gear as my explanatory variables. My data doesn’t have an Age column, so I need to calculate it from Model.Year; I also derive a multiple_owner indicator from Ownership.
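cars24 <- cars24 |>
  mutate(
    # 1 if the car has had more than one owner, 0 otherwise
    multiple_owner = ifelse(Ownership > 1, 1, 0),
    # age in years as of the run date (the output below was produced in 2024)
    Age = year(now()) - Model.Year
  )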
head(cars24)
## Car.Brand Model Price Model.Year Location Fuel Driven..Kms.
## 1 Hyundai EonERA PLUS 330399 2016 Hyderabad Petrol 10674
## 2 Maruti Wagon R 1.0LXI 350199 2011 Hyderabad Petrol 20979
## 3 Maruti Alto K10LXI 229199 2011 Hyderabad Petrol 47330
## 4 Maruti RitzVXI BS IV 306399 2011 Hyderabad Petrol 19662
## 5 Tata NanoTWIST XTA 208699 2015 Hyderabad Petrol 11256
## 6 Maruti AltoLXI 249699 2012 Hyderabad Petrol 28434
## Gear Ownership EMI..monthly. multiple_owner Age
## 1 Manual 2 7350 1 8
## 2 Manual 1 7790 0 13
## 3 Manual 2 5098 1 13
## 4 Manual 1 6816 0 13
## 5 Automatic 1 4642 0 9
## 6 Manual 1 5554 0 12
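Because `year(now())` depends on the day the notebook is run, the Age column changes from year to year. A minimal sketch of a reproducible version follows, assuming a fixed reference year of 2024 (inferred from the output above, where a 2016 car has Age 8). The names `cars_sample` and `reference_year` are illustrative and not part of the original analysis; the values are the six rows printed by `head(cars24)`.

```r
library(dplyr)

# The six rows shown by head(cars24) above, reduced to the columns needed here
cars_sample <- data.frame(
  Model.Year = c(2016, 2011, 2011, 2011, 2015, 2012),
  Ownership  = c(2, 1, 2, 1, 1, 1)
)

# Assumption: ages are measured relative to 2024, the year the output was produced
reference_year <- 2024

cars_sample <- cars_sample |>
  mutate(
    multiple_owner = as.integer(Ownership > 1),  # 1 = more than one owner
    Age = reference_year - Model.Year            # no dependence on the run date
  )

cars_sample
```

Fixing the reference year keeps the fitted coefficients stable when the notebook is re-knitted later.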
model <- lm(Price ~ Age + Driven..Kms. + Fuel + Car.Brand + Gear, data = cars24)
summary(model)
##
## Call:
## lm(formula = Price ~ Age + Driven..Kms. + Fuel + Car.Brand +
## Gear, data = cars24)
##
## Residuals:
## Min 1Q Median 3Q Max
## -756704 -104940 -6464 83627 5680165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1955024.69796 41229.81274 47.418 < 2e-16 ***
## Age -48443.41634 1052.12724 -46.043 < 2e-16 ***
## Driven..Kms. -0.36673 0.07718 -4.752 0.00000206473 ***
## FuelElectric -490779.89954 142599.07499 -3.442 0.000582 ***
## FuelPetrol -156927.97595 6612.76467 -23.731 < 2e-16 ***
## FuelPetrol + CNG -193512.35474 17327.35603 -11.168 < 2e-16 ***
## FuelPetrol + LPG -110244.47147 47765.80482 -2.308 0.021032 *
## Car.BrandBMW -195941.72525 56162.38697 -3.489 0.000489 ***
## Car.BrandChevrolet -848377.39293 53983.43476 -15.716 < 2e-16 ***
## Car.BrandDatsun -969620.04234 57999.66503 -16.718 < 2e-16 ***
## Car.BrandFiat -896325.39486 70624.30828 -12.691 < 2e-16 ***
## Car.BrandFord -713070.15639 43891.82961 -16.246 < 2e-16 ***
## Car.BrandHonda -658151.50901 41753.17137 -15.763 < 2e-16 ***
## Car.BrandHyundai -675327.20249 41118.88003 -16.424 < 2e-16 ***
## Car.BrandISUZU -600711.83155 204303.40918 -2.940 0.003292 **
## Car.BrandJaguar 711025.83290 122382.39961 5.810 0.00000000658 ***
## Car.BrandJeep 25766.85642 65667.88409 0.392 0.694791
## Car.BrandKIA 3681.37057 59931.42154 0.061 0.951022
## Car.BrandLandrover 496037.68363 98101.79221 5.056 0.00000044028 ***
## Car.BrandMahindra -542587.91975 43928.39699 -12.352 < 2e-16 ***
## Car.BrandMaruti -765594.25519 40915.97935 -18.711 < 2e-16 ***
## Car.BrandMercedes 94894.75325 52860.95859 1.795 0.072677 .
## Car.BrandMG 273353.01480 59794.92567 4.572 0.00000494132 ***
## Car.BrandMitsubishi 38291.83599 147174.20940 0.260 0.794734
## Car.BrandNissan -800446.80299 53406.92769 -14.988 < 2e-16 ***
## Car.BrandRenault -863830.97115 43515.04163 -19.851 < 2e-16 ***
## Car.BrandSkoda -598514.41920 49162.93144 -12.174 < 2e-16 ***
## Car.BrandSsangyong -541388.43661 122520.93653 -4.419 0.00001010585 ***
## Car.BrandTata -710298.48991 45746.64815 -15.527 < 2e-16 ***
## Car.BrandToyota -389903.17344 42123.61343 -9.256 < 2e-16 ***
## Car.BrandVolkswagen -725462.89905 43084.31377 -16.838 < 2e-16 ***
## Car.BrandVolvo -418205.96365 204250.21983 -2.048 0.040652 *
## GearManual -178640.53617 8998.98201 -19.851 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 200200 on 5885 degrees of freedom
## Multiple R-squared: 0.6165, Adjusted R-squared: 0.6145
## F-statistic: 295.7 on 32 and 5885 DF, p-value: < 2.2e-16
gg_resfitted(model) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Heteroscedasticity: The spread of residuals increases with the fitted values, which is a sign of heteroscedasticity. This means the error variance is not constant across the range of predictions, violating the constant-variance assumption of linear regression.
Outliers: A few points have very large residuals (the largest is about 5.68 million, per the model summary) and sit far from the main cluster around zero. These may be influential and could be affecting the model’s performance.
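One common response to a fan-shaped residual plot is to model the logarithm of the response instead of the raw value. The sketch below illustrates the idea on simulated data, not on the Cars24 data: the data frame `sim`, the helper `spread_trend()` and all parameter values are assumptions made up for the illustration. It compares how strongly the size of the residuals tracks the fitted values before and after the transform.

```r
set.seed(1)

# Simulated prices with multiplicative noise, so the spread grows with the price level
n <- 2000
sim <- data.frame(age = runif(n, min = 0, max = 15))
sim$price <- exp(14 - 0.1 * sim$age + rnorm(n, sd = 0.3))

fit_raw <- lm(price ~ age, data = sim)       # response on the original scale
fit_log <- lm(log(price) ~ age, data = sim)  # response on the log scale

# Correlation between absolute residuals and fitted values:
# a clearly positive value means the spread grows with the prediction
spread_trend <- function(fit) cor(abs(resid(fit)), fitted(fit))

trend_raw <- spread_trend(fit_raw)
trend_log <- spread_trend(fit_log)

c(raw = trend_raw, log = trend_log)
```

Whether a log transform helps for the car prices would need to be checked by refitting the model and redrawing the residual plot; note that coefficients of a log-scale model are read as approximate proportional changes rather than INR amounts.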
plots <- gg_resX(model, plot.all = FALSE)
# for each variable of interest ...
plots$Age
plots$Driven..Kms.
The spread of residuals is fairly even across ages with no clear pattern, which is a good sign: it suggests that the model captures the linear relationship between the age and the price of the car. A few residuals are much higher or lower than the rest, which suggests potential outliers.
The residuals appear to have more spread (variance) as Driven..Kms. increases, especially after about 200,000 km. This suggests heteroscedasticity in the data, violating the constant-variance assumption. There are also a few outliers that might influence the model.
gg_reshist(model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Usually we want to see a normal distribution of the residuals in a histogram of residuals. Instead we see a right-skewed distribution with a long right tail. This violates the linear regression assumption that the errors are normally distributed.
gg_qqplot(model)
The deviations at both ends suggest that the residuals are not normally distributed and have heavier tails than expected under a normal distribution. This violates the linear regression assumption of normally distributed residuals.
Let’s check the VIF of our model and see which independent variables have multicollinearity issues.
vif(model)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.378891 1 1.174262
## Driven..Kms. 1.577608 1 1.256029
## Fuel 1.461429 4 1.048570
## Car.Brand 1.496066 25 1.008089
## Gear 1.138662 1 1.067081
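As we can see, the GVIF for each of the variables is less than 2, far below the general rule-of-thumb threshold of 10. For Fuel and Car.Brand, which have more than one degree of freedom, the comparable figure is the GVIF^(1/(2*Df)) column, and it is also close to 1. This shows us that the model doesn’t have any variables with multicollinearity issues.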
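The third column of the `vif()` output is the GVIF raised to the power 1/(2*Df), which puts terms with different degrees of freedom on a common scale; squaring it gives a number comparable to an ordinary VIF. As a small worked check, the sketch below recomputes that column from the GVIF and Df values printed above. The vector names `gvif`, `df_terms` and `adjusted` are illustrative.

```r
# GVIF and Df values copied from the vif(model) output above
gvif <- c(
  Age          = 1.378891,
  Driven..Kms. = 1.577608,
  Fuel         = 1.461429,
  Car.Brand    = 1.496066,
  Gear         = 1.138662
)
df_terms <- c(Age = 1, Driven..Kms. = 1, Fuel = 4, Car.Brand = 25, Gear = 1)

# Scale-adjusted GVIF, as reported in the third column
adjusted <- gvif^(1 / (2 * df_terms))

# Squared value, comparable to the usual VIF rules of thumb
comparable_vif <- adjusted^2

round(cbind(gvif, df_terms, adjusted, comparable_vif), 6)
```

For the single-degree-of-freedom terms the squared value is just the GVIF itself, so the adjustment only matters for Fuel and Car.Brand.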
Next, I create a data frame that contains all the coefficients of the model and arrange them in descending order so that they are easier to read.
coeff_df <- data.frame(
  term = names(coef(model)),       # coefficient names
  estimate = unname(coef(model))   # coefficient values, without row names
)
coeff_df <- coeff_df |>
  arrange(desc(estimate))
head(coeff_df, 8)
## term estimate
## 1 (Intercept) 1955024.698
## 2 Car.BrandJaguar 711025.833
## 3 Car.BrandLandrover 496037.684
## 4 Car.BrandMG 273353.015
## 5 Car.BrandMercedes 94894.753
## 6 Car.BrandMitsubishi 38291.836
## 7 Car.BrandJeep 25766.856
## 8 Car.BrandKIA 3681.371
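As the coefficients above show, the brands Jaguar, Landrover, MG, Mercedes, Mitsubishi, Jeep and KIA have positive coefficients (the intercept tops the list but is not a brand effect). Each brand coefficient is a difference from the reference brand, which is the Car.Brand level that has no row in the coefficient table, with the other variables held fixed; it is not a base price. So cars of these brands are estimated to be priced about 711K, 496K, 273K, 95K, 38K, 26K and 4K INR higher, respectively, than a comparable car of the reference brand. Only the Jaguar, Landrover and MG differences are statistically significant at the 5% level; the Mercedes, Mitsubishi, Jeep and KIA coefficients are not (p-values of about 0.07, 0.79, 0.69 and 0.95). Most of these brands sit in the premium range, so it makes sense that their coefficients are positive.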
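In an `lm()` fit with default treatment contrasts, the coefficient of a factor level is the estimated difference from the reference level, and the intercept carries the reference level itself. The toy example below makes this concrete; the data frame `toy` and its brands A, B and C are invented for illustration and have nothing to do with the Cars24 data.

```r
# Toy data: three made-up brands with two prices each
toy <- data.frame(
  brand = factor(c("A", "A", "B", "B", "C", "C")),
  price = c(10, 12, 20, 22, 5, 7)
)

fit_toy <- lm(price ~ brand, data = toy)

# (Intercept) is the mean price of the reference brand A;
# brandB and brandC are the differences of their means from brand A
coef(fit_toy)

# Group means, for comparison with the coefficients
group_means <- tapply(toy$price, toy$brand, mean)
group_means

# Changing the reference level changes what every brand coefficient is compared to
toy$brand_ref_c <- relevel(toy$brand, ref = "C")
fit_ref_c <- lm(price ~ brand_ref_c, data = toy)
coef(fit_ref_c)
```

The same logic applies to the Fuel and Gear terms in the car model: each reported level is compared against the level that is missing from the coefficient table.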
coeff_df |>
filter(term == "Age")
## term estimate
## 1 Age -48443.42
This tells us that, holding the other variables constant, each additional year of age decreases the predicted price of the car by about INR 48,443, so the age of the car has a negative association with its price. For a brand new car Age is 0, so the age term contributes nothing and the predicted price is determined by the intercept and the other variables.
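As a small worked example of reading the slopes, the sketch below multiplies the Age and Driven..Kms. estimates from the model summary by a hypothetical change in each variable. The values `extra_years` and `extra_kms` are made up for illustration; only the two coefficients come from the output above.

```r
# Estimates copied from summary(model) above
b_age <- -48443.41634   # INR per additional year of age
b_kms <- -0.36673       # INR per additional kilometre driven

# Hypothetical comparison: a car 5 years older with 10,000 km more on the odometer
extra_years <- 5
extra_kms <- 10000

age_effect <- b_age * extra_years   # change in predicted price due to age
kms_effect <- b_kms * extra_kms     # change in predicted price due to distance

c(age_effect = age_effect, kms_effect = kms_effect, total = age_effect + kms_effect)
```

Both effects assume the other variables (brand, fuel type and gear) are unchanged, and they describe the model’s prediction rather than any individual car.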