Week 9 Assignment

Importing the dataset and libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(dplyr)
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(broom)
library(ggfortify)
library(lindia)

options(scipen = 6)

# default theme, unless otherwise noted
theme_set(theme_minimal())
df <- read.delim('Cars24.csv', sep = ',')

head(df)
##   Car.Brand          Model  Price Model.Year  Location   Fuel Driven..Kms.
## 1   Hyundai    EonERA PLUS 330399       2016 Hyderabad Petrol        10674
## 2    Maruti Wagon R 1.0LXI 350199       2011 Hyderabad Petrol        20979
## 3    Maruti    Alto K10LXI 229199       2011 Hyderabad Petrol        47330
## 4    Maruti  RitzVXI BS IV 306399       2011 Hyderabad Petrol        19662
## 5      Tata  NanoTWIST XTA 208699       2015 Hyderabad Petrol        11256
## 6    Maruti        AltoLXI 249699       2012 Hyderabad Petrol        28434
##        Gear Ownership EMI..monthly.
## 1    Manual         2          7350
## 2    Manual         1          7790
## 3    Manual         2          5098
## 4    Manual         1          6816
## 5 Automatic         1          4642
## 6    Manual         1          5554

Creating new column - Age

df$age = year(now()) - df$Model.Year

head(df)
##   Car.Brand          Model  Price Model.Year  Location   Fuel Driven..Kms.
## 1   Hyundai    EonERA PLUS 330399       2016 Hyderabad Petrol        10674
## 2    Maruti Wagon R 1.0LXI 350199       2011 Hyderabad Petrol        20979
## 3    Maruti    Alto K10LXI 229199       2011 Hyderabad Petrol        47330
## 4    Maruti  RitzVXI BS IV 306399       2011 Hyderabad Petrol        19662
## 5      Tata  NanoTWIST XTA 208699       2015 Hyderabad Petrol        11256
## 6    Maruti        AltoLXI 249699       2012 Hyderabad Petrol        28434
##        Gear Ownership EMI..monthly. age
## 1    Manual         2          7350   8
## 2    Manual         1          7790  13
## 3    Manual         2          5098  13
## 4    Manual         1          6816  13
## 5 Automatic         1          4642   9
## 6    Manual         1          5554  12

Last week’s Linear regression model

df |>
  ggplot(mapping = aes(x = age, y = Price)) +
  geom_point(color = 'darkblue') +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  geom_smooth(method = "lm", se = FALSE, color = 'black') +
  labs(x = "Age of the car",
       y = "Car Price (INR)",
       title = "Age Vs Price")
## `geom_smooth()` using formula = 'y ~ x'

The linear regression formula:\[ \begin{align} Price &= 993461.37 −50964.13×Age \end{align} \]

model <- lm(Price ~ age, data = df)
model$coefficients
## (Intercept)         age 
##  1015796.91   -52658.23

Adding few more variables into the model

Creating a new binary column Multiple owners. 1 says there is multiple owners for that car and 0 says there is only a single owner to the car

df <- mutate(df, multiple_owner = ifelse(Ownership > 1, 1, 0))

head(df)
##   Car.Brand          Model  Price Model.Year  Location   Fuel Driven..Kms.
## 1   Hyundai    EonERA PLUS 330399       2016 Hyderabad Petrol        10674
## 2    Maruti Wagon R 1.0LXI 350199       2011 Hyderabad Petrol        20979
## 3    Maruti    Alto K10LXI 229199       2011 Hyderabad Petrol        47330
## 4    Maruti  RitzVXI BS IV 306399       2011 Hyderabad Petrol        19662
## 5      Tata  NanoTWIST XTA 208699       2015 Hyderabad Petrol        11256
## 6    Maruti        AltoLXI 249699       2012 Hyderabad Petrol        28434
##        Gear Ownership EMI..monthly. age multiple_owner
## 1    Manual         2          7350   8              1
## 2    Manual         1          7790  13              0
## 3    Manual         2          5098  13              1
## 4    Manual         1          6816  13              0
## 5 Automatic         1          4642   9              0
## 6    Manual         1          5554  12              0
model_ <- lm(Price ~ age + multiple_owner + age:multiple_owner, df)

tidy(model_) |>
  select(term, estimate) |>
  mutate(estimate = round(estimate, 1))
## # A tibble: 4 × 2
##   term               estimate
##   <chr>                 <dbl>
## 1 (Intercept)        1039710.
## 2 age                 -56015.
## 3 multiple_owner      -79662.
## 4 age:multiple_owner   10387.

Variable addition explanation

Age

  • Interpretation: For every additional year of age, the price decreases by Rs.56,014.9. This suggests that older items depreciate significantly, which is a common trend in assets like cars or property.

  • Why include it: Age is a natural predictor of price depreciation over time, so it makes sense to include this in the model.

Multiple owner

  • Interpretation: If the car has had multiple owners (multiple_owner = 1), the price decreases by Rs.79,661.9 compared to a car with a single owner. This indicates that having more owners reduces the perceived value of the item, which aligns with the assumption that multiple owners could imply higher wear or lower trust in the product’s condition.

  • Why include it: Whether an item has had multiple owners can significantly affect the value of the car, which makes it an important variable to include.

Interaction term: age:multiple_owner

  • Interpretation: The positive interaction estimate of Rs.10,386.8 means that the negative effect of age on price is less severe for cars with multiple owners. In other words, for cars with multiple owners, the price decreases at a slower rate as they age compared to cars with a single owner.

  • Why include it: The interaction term allows the model to account for different rates of price depreciation based on ownership history. It reveals that the effect of age on price depends on whether the car had multiple owners, which would have been missed without this term.

Check for Multicollinearity

Let’s consider the features age (\(x_1\)) and Ownership (\(x_2\)). Again, if we’re interested in modeling the sales price for a car, then these are both explanatory variables.

df |>
  ggplot(mapping = aes(y = age, x = Ownership)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = 'blue') +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

There’s a linear relationship between these two, so we could write a sort of “sub-model” as:

\[ x_2 = m x_1 + b \]

Now, suppose we were to build a model that contained both of them:

\[ \begin{align*} \hat{y} &= \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 \\ &= \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2(mx_1 + b) \\ &= \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2mx_1 + \hat{\beta}_2b\\ \to &= (\hat{\beta}_0 + \hat{\beta}_2b) + (\hat{\beta}_1 + \hat{\beta}_2m)x_1 \end{align*} \]

So if we were to interpret \(\hat{\beta}_1\) (in the “original” model output) as the amount \(y\) increases with each unit change in \(x_1\), then we’d be mistaken. The truth of the matter (in this case) would be that on average, \(y\) increases by about \(\hat{\beta}_1 + \hat{\beta}_2m\) with each unit change in \(x_1\).

Evaluating the model

model_ <- lm(Price ~ age + multiple_owner, df)

Residuals vs. Fitted Values

gg_resfitted(model_) + 
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Insights:

  • The increasing spread of residuals (heteroscedasticity) is a concern because it indicates that your model might not be fully capturing the variance in car prices, particularly for more expensive cars.

  • The presence of potential outliers and a slight pattern in the residuals suggests that the model might not be the best fit for the data.

Residuals Vs X values

plots_ <- gg_resX(model_ , plot.all = FALSE)

plots_$age

  • The model performs reasonably well in predicting car prices across different ages, but it may have issues with older cars or certain age ranges where larger residuals exist.

  • The lack of a strong trend is a good sign, but the presence of outliers and slight heteroscedasticity means there could be room for model improvement.

plots_$multiple_owner

# note: multiple_owner is a binary valued column
# if the car has multiple owners - 1
# if the car has only one owner - 0

Cant’ really interpret anything seeing this graph. The residuals for both values of “multiple_owner” (0 and 1) exhibit high variability, spanning a wide range from approximately -8,000,000 to +8,000,000. This indicates that the model has considerable error in predicting the outcome for both groups.

Residual Histogram

gg_reshist(model_)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram shows a right-skewed distribution with a concentration of residuals around zero but a long tail extending to the right. This suggests that there are some cases where the model’s predicted values are substantially lower than the actual values.

QQ-Plots

gg_qqplot(model_)

This right tail indicates that the residuals are positively skewed, with several large positive residuals that are much higher than what would be expected in a normal distribution. This aligns with the histogram showing a long right tail.

The extreme points on the upper right confirm that there are outliers in the residuals, corresponding to large prediction errors in the positive direction

This Q-Q plot, along with the histogram, suggests that the residuals are not normally distributed due to positive skewness and outliers. This deviation points to areas where the model could be improved for more accurate predictions.

Cook’s D by Observation

gg_cooksd(model_, threshold = 'matlab')

Clearly there are many rows of data have a high influence on the model. And we see a few rows as extreme cases such as 1118, 1491, 3329, 3522, 3518, 4979. Let’s investigate these rows further by creating a plot.

Investigating high influence points

ggplot(data = slice(df, 
                    c(1118, 1491, 3329, 3522, 3518, 4979))) +
  geom_point(data = df,
             mapping = aes(x = age, y = Price)) +
  geom_point(mapping = aes(x = age, y = Price),
             color = 'darkred') +
  geom_text_repel(mapping = aes(x = age, 
                                y = Price,
                                label = Car.Brand),
             color = 'darkred') +
  labs(title="Investigating High Influence Points",
       subtitle="Label = Car Brand")

Looks like those outliers are luxury cars. This explains sort of explains the high price.

Model Output

model_$coefficients
##    (Intercept)            age multiple_owner 
##     1018509.25      -53669.80       27574.48

\[ \begin{align} Price &= 1018509.25 -53669.80×Age + 27574.48×(multiple owner) \end{align} \]

Model Interpretation

The above model says that the baseline price is approximately 1,018,509.25 INR when both age is zero and ownership type is single. The coefficients indicate how each predictor variable (Age and ownership type) affects the price.

Variable Age

The coefficient for Age (-53,669.80) indicates that as the age increases by one unit, the price is expected to decrease by approximately 53,669.80 INR, assuming all other variables remain constant. This negative coefficient suggests an inverse relationship between age and price, indicating that newer cars (lower age) tend to have higher values, while older ones generally depreciate in price.

Variable multiple_owner

The coefficient for the “multiple owner” variable (27,574.48) represents the increase in price associated with multiple ownership compared to single ownership. For a car that has multiple owners, the price is expected to be 27,574.48 INR higher than if it had a single owner, all else being equal. This is opposing to what is generally considered. And this poses new question which requires further analysis to answer.