Week 11 Assignment

Loading necessary libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(dplyr)
library(ggrepel)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(patchwork)
library(broom)
library(lindia)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
options(scipen = 6)
cars24 <- read.csv('Cars24.csv')

head(cars24)
##   Car.Brand          Model  Price Model.Year  Location   Fuel Driven..Kms.
## 1   Hyundai    EonERA PLUS 330399       2016 Hyderabad Petrol        10674
## 2    Maruti Wagon R 1.0LXI 350199       2011 Hyderabad Petrol        20979
## 3    Maruti    Alto K10LXI 229199       2011 Hyderabad Petrol        47330
## 4    Maruti  RitzVXI BS IV 306399       2011 Hyderabad Petrol        19662
## 5      Tata  NanoTWIST XTA 208699       2015 Hyderabad Petrol        11256
## 6    Maruti        AltoLXI 249699       2012 Hyderabad Petrol        28434
##        Gear Ownership EMI..monthly.
## 1    Manual         2          7350
## 2    Manual         1          7790
## 3    Manual         2          5098
## 4    Manual         1          6816
## 5 Automatic         1          4642
## 6    Manual         1          5554

Choosing Response and Explanatory variables

For my model I am choosing Price as the response variable and Age, Driven..Kms., Fuel, Car.Brand and Gear as my explanatory variables. The data has no Age column, so I derive it from Model.Year. I also derive a multiple_owner indicator from Ownership, although the model fitted below does not include it.

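cars24 <- cars24 |>
  mutate(
    # 1 if the car has had more than one owner, 0 otherwise
    multiple_owner = ifelse(Ownership > 1, 1, 0),
    # age in years as of the date the code is run
    Age = year(now()) - Model.Year
  )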

head(cars24)
##   Car.Brand          Model  Price Model.Year  Location   Fuel Driven..Kms.
## 1   Hyundai    EonERA PLUS 330399       2016 Hyderabad Petrol        10674
## 2    Maruti Wagon R 1.0LXI 350199       2011 Hyderabad Petrol        20979
## 3    Maruti    Alto K10LXI 229199       2011 Hyderabad Petrol        47330
## 4    Maruti  RitzVXI BS IV 306399       2011 Hyderabad Petrol        19662
## 5      Tata  NanoTWIST XTA 208699       2015 Hyderabad Petrol        11256
## 6    Maruti        AltoLXI 249699       2012 Hyderabad Petrol        28434
##        Gear Ownership EMI..monthly. multiple_owner Age
## 1    Manual         2          7350              1   8
## 2    Manual         1          7790              0  13
## 3    Manual         2          5098              1  13
## 4    Manual         1          6816              0  13
## 5 Automatic         1          4642              0   9
## 6    Manual         1          5554              0  12
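
Because year(now()) depends on the date the code is run, the Age column shifts by one every calendar year. The Age values printed above (8 for Model.Year 2016) correspond to a run in 2024. A minimal sketch of pinning the reference year instead, where ref_year and the toy data frame are assumptions made for illustration:

```r
# Hypothetical toy rows standing in for cars24; only Model.Year matters here.
toy <- data.frame(Model.Year = c(2016, 2011, 2015))

# Assumed reference year: the Age values printed above are consistent with 2024.
ref_year <- 2024

# Age is now the same no matter when the script is re-run.
toy$Age <- ref_year - toy$Model.Year
toy$Age
```

Fixing the year keeps the fitted coefficients reproducible when the report is knitted again later.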

Building Linear model

model <- lm(Price ~ Age + Driven..Kms. + Fuel + Car.Brand + Gear, data = cars24)

summary(model)
## 
## Call:
## lm(formula = Price ~ Age + Driven..Kms. + Fuel + Car.Brand + 
##     Gear, data = cars24)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -756704 -104940   -6464   83627 5680165 
## 
## Coefficients:
##                          Estimate    Std. Error t value      Pr(>|t|)    
## (Intercept)         1955024.69796   41229.81274  47.418       < 2e-16 ***
## Age                  -48443.41634    1052.12724 -46.043       < 2e-16 ***
## Driven..Kms.             -0.36673       0.07718  -4.752 0.00000206473 ***
## FuelElectric        -490779.89954  142599.07499  -3.442      0.000582 ***
## FuelPetrol          -156927.97595    6612.76467 -23.731       < 2e-16 ***
## FuelPetrol + CNG    -193512.35474   17327.35603 -11.168       < 2e-16 ***
## FuelPetrol + LPG    -110244.47147   47765.80482  -2.308      0.021032 *  
## Car.BrandBMW        -195941.72525   56162.38697  -3.489      0.000489 ***
## Car.BrandChevrolet  -848377.39293   53983.43476 -15.716       < 2e-16 ***
## Car.BrandDatsun     -969620.04234   57999.66503 -16.718       < 2e-16 ***
## Car.BrandFiat       -896325.39486   70624.30828 -12.691       < 2e-16 ***
## Car.BrandFord       -713070.15639   43891.82961 -16.246       < 2e-16 ***
## Car.BrandHonda      -658151.50901   41753.17137 -15.763       < 2e-16 ***
## Car.BrandHyundai    -675327.20249   41118.88003 -16.424       < 2e-16 ***
## Car.BrandISUZU      -600711.83155  204303.40918  -2.940      0.003292 ** 
## Car.BrandJaguar      711025.83290  122382.39961   5.810 0.00000000658 ***
## Car.BrandJeep         25766.85642   65667.88409   0.392      0.694791    
## Car.BrandKIA           3681.37057   59931.42154   0.061      0.951022    
## Car.BrandLandrover   496037.68363   98101.79221   5.056 0.00000044028 ***
## Car.BrandMahindra   -542587.91975   43928.39699 -12.352       < 2e-16 ***
## Car.BrandMaruti     -765594.25519   40915.97935 -18.711       < 2e-16 ***
## Car.BrandMercedes     94894.75325   52860.95859   1.795      0.072677 .  
## Car.BrandMG          273353.01480   59794.92567   4.572 0.00000494132 ***
## Car.BrandMitsubishi   38291.83599  147174.20940   0.260      0.794734    
## Car.BrandNissan     -800446.80299   53406.92769 -14.988       < 2e-16 ***
## Car.BrandRenault    -863830.97115   43515.04163 -19.851       < 2e-16 ***
## Car.BrandSkoda      -598514.41920   49162.93144 -12.174       < 2e-16 ***
## Car.BrandSsangyong  -541388.43661  122520.93653  -4.419 0.00001010585 ***
## Car.BrandTata       -710298.48991   45746.64815 -15.527       < 2e-16 ***
## Car.BrandToyota     -389903.17344   42123.61343  -9.256       < 2e-16 ***
## Car.BrandVolkswagen -725462.89905   43084.31377 -16.838       < 2e-16 ***
## Car.BrandVolvo      -418205.96365  204250.21983  -2.048      0.040652 *  
## GearManual          -178640.53617    8998.98201 -19.851       < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 200200 on 5885 degrees of freedom
## Multiple R-squared:  0.6165, Adjusted R-squared:  0.6145 
## F-statistic: 295.7 on 32 and 5885 DF,  p-value: < 2.2e-16

Diagnosing the model

Residual Vs Fitted value

gg_resfitted(model) +
    geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Issues

Heteroscedasticity: The spread of the residuals increases with the fitted values, which is a sign of heteroscedasticity. The error variance is therefore not constant across the range of predictions, which violates the constant-variance assumption of linear regression.

Outliers: Some points have large residuals and lie far from the main cluster around zero; the largest residual in the summary above is about 5.68 million INR. These points may be influential and could be distorting the fit.
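
The visual impression of heteroscedasticity can be backed by a formal check. The sketch below computes a studentized Breusch-Pagan statistic by hand on simulated data (not the Cars24 data); the variable names and the simulation settings are assumptions for illustration only.

```r
set.seed(1)

# Simulated data (not the Cars24 data): the error spread grows with x.
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictor.
aux <- lm(resid(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2, chi-square with 1 df here
# because the auxiliary regression has one predictor.
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p.value = bp_pval)
```

A small p-value is evidence against constant error variance. The car package loaded above also provides ncvTest() for a closely related score test, which could be applied to the fitted model directly.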

Residuals Vs X values

plots <- gg_resX(model, plot.all = FALSE)

# show the residual plot for each numeric predictor
plots$Age

plots$Driven..Kms.

Residual Vs Age

The spread of residuals is fairly even across ages, which is a good sign: it suggests the error variance does not change much with Age. If the residuals also stay centred on zero across ages, a linear term in Age is adequate. A few residuals are much higher or lower than the rest, which suggests potential outliers.

Residual Vs Driven..Kms.

The residuals show more spread (variance) as Driven..Kms. increases, most noticeably beyond about 200,000 km. This suggests heteroscedasticity, which violates the constant-variance assumption. There are also a few outliers that might influence the model.

Residual Histogram

gg_reshist(model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In a histogram of residuals we want to see an approximately normal distribution. Here the distribution is right-skewed with a long right tail, which violates the assumption of linear regression that the errors are normally distributed.

QQ Plot

gg_qqplot(model)

Issues

The deviations at both ends suggest that the residuals are not normally distributed and have heavier tails than a normal distribution would produce. This violates the assumption of normally distributed errors.
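
A common response to right-skewed residuals on a price variable is to model the logarithm of the price. The sketch below uses simulated data (not the Cars24 data) to compare the skewness of the residuals before and after the transformation; the skewness helper and all variable names are assumptions for illustration.

```r
set.seed(2)

# Simulated right-skewed "price" (not the Cars24 data): multiplicative errors.
n <- 500
age <- runif(n, 0, 15)
price <- exp(14 - 0.1 * age + rnorm(n, sd = 0.5))

fit_raw <- lm(price ~ age)
fit_log <- lm(log(price) ~ age)

# Sample skewness: near 0 for symmetric residuals, positive for a right tail.
skewness <- function(r) mean((r - mean(r))^3) / sd(r)^3

c(raw = skewness(resid(fit_raw)), log = skewness(resid(fit_log)))
```

If a log model were used on the real data, each coefficient would describe an approximate proportional change in price rather than a change in INR, so the interpretations below would need to be restated.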

Variance inflation factor

Let’s check the VIF of our model to see whether any explanatory variables have multicollinearity issues.

vif(model)
##                  GVIF Df GVIF^(1/(2*Df))
## Age          1.378891  1        1.174262
## Driven..Kms. 1.577608  1        1.256029
## Fuel         1.461429  4        1.048570
## Car.Brand    1.496066 25        1.008089
## Gear         1.138662  1        1.067081

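As we can see, every GVIF is below 2 and every GVIF^(1/(2*Df)) value is at most about 1.26. GVIF^(1/(2*Df)) is on the scale of the square root of a VIF, so the usual rule-of-thumb cutoff of 10 corresponds to about 3.16 on that scale, and all of our variables are far below it. This shows us that the model has no problematic multicollinearity.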
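
For a single numeric predictor, the VIF is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. A minimal sketch on simulated data (not the Cars24 data), with x1, x2 and x3 as hypothetical predictors:

```r
set.seed(3)

# Simulated predictors (not the Cars24 data): x2 is correlated with x1.
n <- 300
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n, sd = 0.8)
x3 <- rnorm(n)

# VIF for x1: regress it on the other predictors and use 1 / (1 - R^2).
r2_x1 <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1
```

A VIF of 1 means the predictor is unrelated to the others; the value grows as the predictor becomes easier to reconstruct from the rest.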

Interpreting Coefficients

I create a data frame that contains all the coefficients of the model and arrange it in descending order of estimate so that the largest effects are easy to see.

# one row per model term, with its estimated coefficient
coeff_df <- data.frame(
  term = names(coef(model)),
  estimate = unname(coef(model))
)

coeff_df <- coeff_df |>
  arrange(desc(estimate))
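
Sorting by estimate alone hides how uncertain each estimate is. The sketch below builds a table that also carries the p-values, using the built-in mtcars data as a stand-in; the model and the name coef_tab are illustrative and are not the model fitted above.

```r
# Stand-in model on built-in data; factor(cyl) plays the role of Car.Brand.
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# coef(summary(fit)) is a matrix of estimates, standard errors, t and p values.
cs <- coef(summary(fit))

coef_tab <- data.frame(
  term = rownames(cs),
  estimate = cs[, "Estimate"],
  p_value = cs[, "Pr(>|t|)"],
  row.names = NULL
)

# Largest estimates first, matching arrange(desc(estimate)).
coef_tab <- coef_tab[order(-coef_tab$estimate), ]
coef_tab
```

The broom package loaded above offers tidy(), which returns a similar table (term, estimate, std.error, statistic, p.value) in one call.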

Insight Gathered on Car Brand

head(coeff_df, 8)
##                  term    estimate
## 1         (Intercept) 1955024.698
## 2     Car.BrandJaguar  711025.833
## 3  Car.BrandLandrover  496037.684
## 4         Car.BrandMG  273353.015
## 5   Car.BrandMercedes   94894.753
## 6 Car.BrandMitsubishi   38291.836
## 7       Car.BrandJeep   25766.856
## 8        Car.BrandKIA    3681.371

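As we see in the above coefficients of the model, apart from the intercept, the brands Jaguar, Landrover, MG, Mercedes, Mitsubishi, Jeep and KIA have positive coefficients. A brand coefficient is not a base price. It is the expected difference in price relative to the reference brand (the factor level absorbed into the intercept, which does not appear in the coefficient table), holding Age, Driven..Kms., Fuel and Gear fixed. So a Jaguar is expected to cost about INR 711K more than an otherwise identical car of the reference brand, a Landrover about 496K more and an MG about 273K more. The estimates for Mercedes (95K), Mitsubishi (38K), Jeep (26K) and KIA (4K) are not significant at the 5% level (p = 0.073, 0.795, 0.695 and 0.951), so the model cannot distinguish these brands from the reference brand. Several of the brands with positive coefficients, such as Jaguar, Landrover and Mercedes, are in the luxury range, so the signs make sense.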
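
In a model with a categorical predictor, R drops the first factor level and reports every other level as a difference from it. The sketch below shows this on a hypothetical toy data set (not the Cars24 data) with made-up brands A, B and C.

```r
# Hypothetical toy data (not the Cars24 data): three brands, two cars each.
toy <- data.frame(
  brand = factor(c("A", "A", "B", "B", "C", "C")),
  price = c(10, 12, 20, 22, 5, 7)
)

# The first factor level ("A") is the reference and gets no coefficient.
fit <- lm(price ~ brand, data = toy)
coef(fit)

# brandB equals the mean price of B minus the mean price of A.
mean(toy$price[toy$brand == "B"]) - mean(toy$price[toy$brand == "A"])

# relevel() changes the reference, so every brand coefficient changes meaning.
toy$brand <- relevel(toy$brand, ref = "C")
coef(lm(price ~ brand, data = toy))
```

With "C" as the reference, the coefficients for A and B both become positive even though the data are unchanged, which is why the sign of a brand coefficient only has meaning relative to the reference brand.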

Insight Gathered on Age

coeff_df |>
  filter(term == "Age")
##   term  estimate
## 1  Age -48443.42

This tells us that each additional year of age lowers the expected price of the car by about INR 48,443, holding the other variables fixed, so age is negatively associated with price. For a brand new car Age is 0, so the age term contributes nothing and the predicted price comes from the intercept and the other terms alone.
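
To see how the coefficients combine, the sketch below computes a prediction by hand for a hypothetical car: 5 years old, 50,000 km, petrol, Maruti, manual gearbox. The coefficient values are copied from the summary output above; the car itself and the names b, pred and pred_older are assumptions for illustration.

```r
# Coefficients copied from the summary output above.
b <- c(
  intercept   = 1955024.69796,
  age         = -48443.41634,
  kms         = -0.36673,
  fuel_petrol = -156927.97595,
  maruti      = -765594.25519,
  manual      = -178640.53617
)

# Hypothetical car: 5 years old, 50,000 km, petrol, Maruti, manual gearbox.
pred <- b[["intercept"]] + b[["age"]] * 5 + b[["kms"]] * 50000 +
  b[["fuel_petrol"]] + b[["maruti"]] + b[["manual"]]
pred

# The same car one year older: the prediction drops by the Age coefficient.
pred_older <- pred + b[["age"]]
pred - pred_older
```

The predicted price is about INR 593,308, and ageing the car by one year with everything else unchanged lowers it by INR 48,443.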