library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(dplyr)
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(broom)
library(ggfortify)
library(lindia)
options(scipen = 6)
# default theme, unless otherwise noted
theme_set(theme_minimal())
df <- read.delim('Cars24.csv', sep = ',')
head(df)
## Car.Brand Model Price Model.Year Location Fuel Driven..Kms.
## 1 Hyundai EonERA PLUS 330399 2016 Hyderabad Petrol 10674
## 2 Maruti Wagon R 1.0LXI 350199 2011 Hyderabad Petrol 20979
## 3 Maruti Alto K10LXI 229199 2011 Hyderabad Petrol 47330
## 4 Maruti RitzVXI BS IV 306399 2011 Hyderabad Petrol 19662
## 5 Tata NanoTWIST XTA 208699 2015 Hyderabad Petrol 11256
## 6 Maruti AltoLXI 249699 2012 Hyderabad Petrol 28434
## Gear Ownership EMI..monthly.
## 1 Manual 2 7350
## 2 Manual 1 7790
## 3 Manual 2 5098
## 4 Manual 1 6816
## 5 Automatic 1 4642
## 6 Manual 1 5554
df$age = year(now()) - df$Model.Year
head(df)
## Car.Brand Model Price Model.Year Location Fuel Driven..Kms.
## 1 Hyundai EonERA PLUS 330399 2016 Hyderabad Petrol 10674
## 2 Maruti Wagon R 1.0LXI 350199 2011 Hyderabad Petrol 20979
## 3 Maruti Alto K10LXI 229199 2011 Hyderabad Petrol 47330
## 4 Maruti RitzVXI BS IV 306399 2011 Hyderabad Petrol 19662
## 5 Tata NanoTWIST XTA 208699 2015 Hyderabad Petrol 11256
## 6 Maruti AltoLXI 249699 2012 Hyderabad Petrol 28434
## Gear Ownership EMI..monthly. age
## 1 Manual 2 7350 8
## 2 Manual 1 7790 13
## 3 Manual 2 5098 13
## 4 Manual 1 6816 13
## 5 Automatic 1 4642 9
## 6 Manual 1 5554 12
df |>
ggplot(mapping = aes(x = age, y = Price)) +
geom_point(color = 'darkblue') +
scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
geom_smooth(method = "lm", se = FALSE, color = 'black') +
labs(x = "Age of the car",
y = "Car Price (INR)",
title = "Age Vs Price")
## `geom_smooth()` using formula = 'y ~ x'
The linear regression formula:\[ \begin{align} Price &= 993461.37 −50964.13×Age \end{align} \]
model <- lm(Price ~ age, data = df)
model$coefficients
## (Intercept) age
## 1015796.91 -52658.23
Creating a new binary column Multiple owners. 1 says there is multiple owners for that car and 0 says there is only a single owner to the car
df <- mutate(df, multiple_owner = ifelse(Ownership > 1, 1, 0))
head(df)
## Car.Brand Model Price Model.Year Location Fuel Driven..Kms.
## 1 Hyundai EonERA PLUS 330399 2016 Hyderabad Petrol 10674
## 2 Maruti Wagon R 1.0LXI 350199 2011 Hyderabad Petrol 20979
## 3 Maruti Alto K10LXI 229199 2011 Hyderabad Petrol 47330
## 4 Maruti RitzVXI BS IV 306399 2011 Hyderabad Petrol 19662
## 5 Tata NanoTWIST XTA 208699 2015 Hyderabad Petrol 11256
## 6 Maruti AltoLXI 249699 2012 Hyderabad Petrol 28434
## Gear Ownership EMI..monthly. age multiple_owner
## 1 Manual 2 7350 8 1
## 2 Manual 1 7790 13 0
## 3 Manual 2 5098 13 1
## 4 Manual 1 6816 13 0
## 5 Automatic 1 4642 9 0
## 6 Manual 1 5554 12 0
model_ <- lm(Price ~ age + multiple_owner + age:multiple_owner, df)
tidy(model_) |>
select(term, estimate) |>
mutate(estimate = round(estimate, 1))
## # A tibble: 4 × 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 1039710.
## 2 age -56015.
## 3 multiple_owner -79662.
## 4 age:multiple_owner 10387.
Interpretation: For every additional year of age, the price decreases by Rs.56,014.9. This suggests that older items depreciate significantly, which is a common trend in assets like cars or property.
Why include it: Age is a natural predictor of price depreciation over time, so it makes sense to include this in the model.
Interpretation: If the car has had
multiple owners (multiple_owner = 1), the
price decreases by Rs.79,661.9 compared to a car with a
single owner. This indicates that having more owners reduces the
perceived value of the item, which aligns with the assumption that
multiple owners could imply higher wear or lower trust in the product’s
condition.
Why include it: Whether an item has had multiple owners can significantly affect the value of the car, which makes it an important variable to include.
Interpretation: The positive interaction
estimate of Rs.10,386.8 means that the negative effect
of age on price is less severe for cars
with multiple owners. In other words, for cars with multiple owners, the
price decreases at a slower rate as they age compared to cars with a
single owner.
Why include it: The interaction term allows the model to account for different rates of price depreciation based on ownership history. It reveals that the effect of age on price depends on whether the car had multiple owners, which would have been missed without this term.
Let’s consider the features age (\(x_1\)) and Ownership (\(x_2\)). Again, if we’re interested in
modeling the sales price for a car, then these are both
explanatory variables.
df |>
ggplot(mapping = aes(y = age, x = Ownership)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = 'blue') +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
There’s a linear relationship between these two, so we could write a sort of “sub-model” as:
\[ x_2 = m x_1 + b \]
Now, suppose we were to build a model that contained both of them:
\[ \begin{align*} \hat{y} &= \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 \\ &= \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2(mx_1 + b) \\ &= \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2mx_1 + \hat{\beta}_2b\\ \to &= (\hat{\beta}_0 + \hat{\beta}_2b) + (\hat{\beta}_1 + \hat{\beta}_2m)x_1 \end{align*} \]
So if we were to interpret \(\hat{\beta}_1\) (in the “original” model output) as the amount \(y\) increases with each unit change in \(x_1\), then we’d be mistaken. The truth of the matter (in this case) would be that on average, \(y\) increases by about \(\hat{\beta}_1 + \hat{\beta}_2m\) with each unit change in \(x_1\).
model_ <- lm(Price ~ age + multiple_owner, df)
gg_resfitted(model_) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Insights:
The increasing spread of residuals (heteroscedasticity) is a concern because it indicates that your model might not be fully capturing the variance in car prices, particularly for more expensive cars.
The presence of potential outliers and a slight pattern in the residuals suggests that the model might not be the best fit for the data.
plots_ <- gg_resX(model_ , plot.all = FALSE)
plots_$age
The model performs reasonably well in predicting car prices across different ages, but it may have issues with older cars or certain age ranges where larger residuals exist.
The lack of a strong trend is a good sign, but the presence of outliers and slight heteroscedasticity means there could be room for model improvement.
plots_$multiple_owner
# note: multiple_owner is a binary valued column
# if the car has multiple owners - 1
# if the car has only one owner - 0
Cant’ really interpret anything seeing this graph. The residuals for both values of “multiple_owner” (0 and 1) exhibit high variability, spanning a wide range from approximately -8,000,000 to +8,000,000. This indicates that the model has considerable error in predicting the outcome for both groups.
gg_reshist(model_)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram shows a right-skewed distribution with a concentration of residuals around zero but a long tail extending to the right. This suggests that there are some cases where the model’s predicted values are substantially lower than the actual values.
gg_qqplot(model_)
This right tail indicates that the residuals are positively skewed, with several large positive residuals that are much higher than what would be expected in a normal distribution. This aligns with the histogram showing a long right tail.
The extreme points on the upper right confirm that there are outliers in the residuals, corresponding to large prediction errors in the positive direction
This Q-Q plot, along with the histogram, suggests that the residuals are not normally distributed due to positive skewness and outliers. This deviation points to areas where the model could be improved for more accurate predictions.
gg_cooksd(model_, threshold = 'matlab')
Clearly there are many rows of data have a high influence on the model. And we see a few rows as extreme cases such as 1118, 1491, 3329, 3522, 3518, 4979. Let’s investigate these rows further by creating a plot.
ggplot(data = slice(df,
c(1118, 1491, 3329, 3522, 3518, 4979))) +
geom_point(data = df,
mapping = aes(x = age, y = Price)) +
geom_point(mapping = aes(x = age, y = Price),
color = 'darkred') +
geom_text_repel(mapping = aes(x = age,
y = Price,
label = Car.Brand),
color = 'darkred') +
labs(title="Investigating High Influence Points",
subtitle="Label = Car Brand")
Looks like those outliers are luxury cars. This explains sort of explains the high price.
model_$coefficients
## (Intercept) age multiple_owner
## 1018509.25 -53669.80 27574.48
\[ \begin{align} Price &= 1018509.25 -53669.80×Age + 27574.48×(multiple owner) \end{align} \]
The above model says that the baseline price is approximately 1,018,509.25 INR when both age is zero and ownership type is single. The coefficients indicate how each predictor variable (Age and ownership type) affects the price.
The coefficient for Age (-53,669.80) indicates that as the age increases by one unit, the price is expected to decrease by approximately 53,669.80 INR, assuming all other variables remain constant. This negative coefficient suggests an inverse relationship between age and price, indicating that newer cars (lower age) tend to have higher values, while older ones generally depreciate in price.
The coefficient for the “multiple owner” variable (27,574.48) represents the increase in price associated with multiple ownership compared to single ownership. For a car that has multiple owners, the price is expected to be 27,574.48 INR higher than if it had a single owner, all else being equal. This is opposing to what is generally considered. And this poses new question which requires further analysis to answer.