The main idea behind using this dataset is to predict the selling price of a used car from its brand, model, age, number of previous owners, fuel type, kilometers driven and transmission type. The dataset already contains the selling price for every row, but consider the scenario of adding a new used car: someone would have to assess the car and decide what its selling price should be. I am trying to automate that assessment using linear regression.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(dplyr)
library(ggrepel)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(patchwork)
library(broom)
library(lindia)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(vtable)
## Loading required package: kableExtra
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
options(scipen = 6)
cars24 <- read.csv("Cars24.csv", na.strings = "")
head(cars24)
## car_brand model price year location fuel km_driven gear
## 1 Hyundai EonERA PLUS 330399 2016 Hyderabad Petrol 10674 Manual
## 2 Maruti Wagon R 1.0LXI 350199 2011 Hyderabad Petrol 20979 Manual
## 3 Maruti Alto K10LXI 229199 2011 Hyderabad Petrol 47330 Manual
## 4 Maruti RitzVXI BS IV 306399 2011 Hyderabad Petrol 19662 Manual
## 5 Tata NanoTWIST XTA 208699 2015 Hyderabad Petrol 11256 Automatic
## 6 Maruti AltoLXI 249699 2012 Hyderabad Petrol 28434 Manual
## ownership monthly_payment
## 1 2 7350
## 2 1 7790
## 3 2 5098
## 4 1 6816
## 5 1 4642
## 6 1 5554
For the purpose of this project, the selling price (price) is the dependent variable, also called the response variable. The independent variables are car_brand, model, year, fuel, km_driven, gear and ownership; these are also called explanatory variables.
sapply(cars24, function(x) sum(is.na(x)))
## car_brand model price year location
## 0 265 0 0 0
## fuel km_driven gear ownership monthly_payment
## 0 0 0 0 0
This shows that every cell has a value except in the model column. We will remove those rows before feeding the data to our model.
cars24 <- na.omit(cars24)
sapply(cars24, function(x) sum(is.na(x)))
## car_brand model price year location
## 0 0 0 0 0
## fuel km_driven gear ownership monthly_payment
## 0 0 0 0 0
Now we can explore each column to get a better understanding.
Creating a function that replicates the value_counts function from Python's pandas library.
value_counts <- function(df, col_name) {
  df |>
    group_by({{ col_name }}) |>
    summarise(n = n()) |>
    arrange(desc(n))
}
head(value_counts(cars24, car_brand),5)
## # A tibble: 5 × 2
## car_brand n
## <chr> <int>
## 1 Maruti 2744
## 2 Hyundai 1218
## 3 Honda 447
## 4 Toyota 286
## 5 Volkswagen 186
This shows that the top 5 selling brands in India are Maruti, Hyundai, Honda, Toyota and Volkswagen.
cars24 |>
  filter(car_brand %in% head(value_counts(cars24, car_brand), 5)$car_brand) |>
  ggplot() +
  geom_bar(mapping = aes(x = car_brand)) +
  labs(title = "Top 5 Selling brands",
       x = "Car Brand",
       y = "Count") +
  theme_hc()
head(value_counts(cars24, model),5)
## # A tibble: 5 × 2
## model n
## <chr> <int>
## 1 Alto 800LXI 184
## 2 Wagon R 1.0VXI 177
## 3 Alto K10VXI 156
## 4 SwiftVDI 137
## 5 SwiftVXI 135
cars24 |>
  filter(model %in% head(value_counts(cars24, model), 5)$model) |>
  ggplot() +
  geom_bar(mapping = aes(x = model, fill = car_brand)) +
  labs(title = "Top 5 Selling car models",
       x = "Car model",
       y = "Count") +
  theme_hc()
For applying linear regression, it is better to derive each car's age from the year column instead of using the year itself.
cars24$age <- year(now()) - cars24$year
head(cars24)
## car_brand model price year location fuel km_driven gear
## 1 Hyundai EonERA PLUS 330399 2016 Hyderabad Petrol 10674 Manual
## 2 Maruti Wagon R 1.0LXI 350199 2011 Hyderabad Petrol 20979 Manual
## 3 Maruti Alto K10LXI 229199 2011 Hyderabad Petrol 47330 Manual
## 4 Maruti RitzVXI BS IV 306399 2011 Hyderabad Petrol 19662 Manual
## 5 Tata NanoTWIST XTA 208699 2015 Hyderabad Petrol 11256 Automatic
## 6 Maruti AltoLXI 249699 2012 Hyderabad Petrol 28434 Manual
## ownership monthly_payment age
## 1 2 7350 8
## 2 1 7790 13
## 3 2 5098 13
## 4 1 6816 13
## 5 1 4642 9
## 6 1 5554 12
Now let's visualize the distribution of age.
head(value_counts(cars24, age),5)
## # A tibble: 5 × 2
## age n
## <dbl> <int>
## 1 9 707
## 2 10 662
## 3 7 638
## 4 8 609
## 5 12 586
cars24 |>
  ggplot() +
  geom_histogram(mapping = aes(x = age)) +
  labs(title = "Distribution of age",
       x = "Age",
       y = "Count") +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
value_counts(cars24, fuel)
## # A tibble: 5 × 2
## fuel n
## <chr> <int>
## 1 Petrol 3611
## 2 Diesel 1877
## 3 Petrol + CNG 146
## 4 Petrol + LPG 17
## 5 Electric 2
Let’s see it visually.
cars24 |>
  ggplot() +
  geom_bar(mapping = aes(x = fuel)) +
  labs(title = "Distribution of Fuel type",
       x = "Fuel type",
       y = "Count") +
  theme_minimal()
This plot makes it evident that Petrol and Diesel are the most common fuel types in India, followed by Petrol + CNG and Petrol + LPG. Electric vehicles are not yet popular in India.
Let's see the distribution of fuel type among the top 5 car brands in India.
cars24 |>
  filter(car_brand %in% head(value_counts(cars24, car_brand), 5)$car_brand) |>
  ggplot() +
  geom_bar(mapping = aes(x = car_brand, fill = fuel)) +
  labs(title = "Top 5 Selling brands",
       x = "Car Brand",
       y = "Count") +
  theme_minimal()
cars24 |>
  ggplot() +
  geom_histogram(mapping = aes(x = km_driven, fill = fuel), color = 'white') +
  labs(title = "Distribution of the variable km_driven",
       x = "Kilometers driven",
       y = "Count") +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
cars24 |>
  ggplot() +
  geom_bar(mapping = aes(x = gear, fill = fuel)) +
  labs(title = "Distribution of transmission type",
       x = "Transmission type",
       y = "Count") +
  theme_minimal()
From the above plot we can clearly see that manual transmission is more popular than automatic.
For the purpose of applying linear regression, we are going to transform the ownership column into a new column named multiple_owner, which takes the values 0 and 1: 0 means the car has had only a single owner, and 1 means it has had multiple owners.
Let’s transform this column.
cars24$multiple_owner <- ifelse(cars24$ownership > 1, 1, 0)
head(cars24)
## car_brand model price year location fuel km_driven gear
## 1 Hyundai EonERA PLUS 330399 2016 Hyderabad Petrol 10674 Manual
## 2 Maruti Wagon R 1.0LXI 350199 2011 Hyderabad Petrol 20979 Manual
## 3 Maruti Alto K10LXI 229199 2011 Hyderabad Petrol 47330 Manual
## 4 Maruti RitzVXI BS IV 306399 2011 Hyderabad Petrol 19662 Manual
## 5 Tata NanoTWIST XTA 208699 2015 Hyderabad Petrol 11256 Automatic
## 6 Maruti AltoLXI 249699 2012 Hyderabad Petrol 28434 Manual
## ownership monthly_payment age multiple_owner
## 1 2 7350 8 1
## 2 1 7790 13 0
## 3 2 5098 13 1
## 4 1 6816 13 0
## 5 1 4642 9 0
## 6 1 5554 12 0
cars24 |>
  ggplot() +
  geom_bar(mapping = aes(x = multiple_owner)) +
  labs(title = "Distribution of ownership type",
       x = "Ownership type",
       y = "Count") +
  scale_x_continuous(breaks = c(0, 1)) +
  theme_minimal()
cars24 |>
  ggplot() +
  geom_histogram(mapping = aes(x = price), color = 'white') +
  labs(title = "Distribution of the variable price",
       x = "Price (INR)",
       y = "Count") +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This distribution is heavily right-skewed with a long tail, which suggests there may be outliers that need to be addressed.
I know for a fact that the price of the car decreases as the age of the car increases. Let’s devise a hypothesis test to confirm this.
Null Hypothesis: There is no relationship between the age of a car and its average price, meaning the age of the car does not affect its price.
Alternative Hypothesis: There is a relationship between the age of a car and its average price, such that the price decreases as the car’s age increases.
test_model <- lm(price ~ age, data = cars24)
summary(test_model)
##
## Call:
## lm(formula = price ~ age, data = cars24)
##
## Residuals:
## Min 1Q Median 3Q Max
## -460089 -131985 -45685 55271 3215036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 993461 12150 81.77 <2e-16 ***
## age -50964 1218 -41.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 263100 on 5651 degrees of freedom
## Multiple R-squared: 0.2366, Adjusted R-squared: 0.2365
## F-statistic: 1752 on 1 and 5651 DF, p-value: < 2.2e-16
The p-value for this model is less than 2.2e-16, which is below the significance level of 0.05, so we can reject the null hypothesis: there is a relationship between the age and the price of the car. And since the coefficient of age in the model is negative, the relationship is a negative one.
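Since broom is already loaded, the same coefficient table could also be pulled into a tidy data frame; a minimal sketch (output not shown):
# Tidy summary of the fitted model: one row per term with its
# estimate, standard error, t statistic and p-value
tidy(test_model)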
cars24 |>
  ggplot(mapping = aes(x = age, y = price)) +
  geom_point(color = 'darkblue') +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  geom_smooth(method = "lm", se = FALSE, color = 'black') +
  labs(x = "Age of the car",
       y = "Car Price (INR)",
       title = "Age Vs Price") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
I know for a fact that the price of a car decreases as its km_driven increases. Let's devise a hypothesis test to confirm this.
Null Hypothesis: There is no relationship between the km_driven of a car and its average price.
Alternative Hypothesis: There is a relationship between the km_driven of a car and its average price.
test_model <- lm(price ~ km_driven, data = cars24)
summary(test_model)
##
## Call:
## lm(formula = price ~ km_driven, data = cars24)
##
## Residuals:
## Min 1Q Median 3Q Max
## -416457 -185675 -79840 92514 3023395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 540147.74177 7062.18070 76.485 < 2e-16 ***
## km_driven -0.54701 0.09469 -5.777 0.00000000803 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 300300 on 5651 degrees of freedom
## Multiple R-squared: 0.00587, Adjusted R-squared: 0.005694
## F-statistic: 33.37 on 1 and 5651 DF, p-value: 0.000000008034
The p-value for this model is 0.000000008034, which is below the significance level of 0.05, so we can reject the null hypothesis: there is a relationship between km_driven and the price of the car. And since the coefficient of km_driven in the model is negative, the relationship is a negative one.
cars24 |>
  ggplot(mapping = aes(x = km_driven, y = price)) +
  geom_point(color = 'darkblue') +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  geom_smooth(method = "lm", se = FALSE, color = 'black') +
  labs(x = "Distance driven by the car (km)",
       y = "Car Price (INR)",
       title = "km_driven Vs Price") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The categorical variable selected is fuel. The fuel column has five different values: Petrol, Diesel, Electric, Petrol + LPG and Petrol + CNG. Let's devise an ANOVA test to figure out whether the response variable differs across these five fuel classes.
Null Hypothesis : There is no significant difference in the mean car price across different fuel types.
Alternative Hypothesis : There is a significant difference in the mean car price across different fuel types.
Choosing the significance level (\(\alpha\)) to be 0.05.
m <- aov(price ~ fuel, data = cars24)
summary(m)
## Df Sum Sq Mean Sq F value Pr(>F)
## fuel 4 60061539128291 15015384782073 187.4 <2e-16 ***
## Residuals 5648 452479435103508 80113214431
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is less than \(\alpha\), we can reject the null hypothesis: there is a significant difference in mean car price across fuel types. This result suggests that a car's fuel type affects its price in the used car market; understanding this relationship can guide buyers, sellers, and dealers in setting or negotiating car prices more effectively.
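ANOVA only tells us that at least one fuel type's mean price differs; to see which pairs of fuel types differ, a Tukey HSD post-hoc test could be run on the same aov fit. A sketch (output not shown):
# Pairwise differences in mean price between fuel types,
# with p-values adjusted for multiple comparisons
TukeyHSD(m)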
cars24 |>
  filter(price < 4000000) |>
  ggplot() +
  geom_boxplot(mapping = aes(y = price, x = fuel)) +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  labs(x = "Fuel Type",
       y = "Car Price (INR)",
       title = "Fuel Type Vs Price")
I am going to apply one-hot encoding to this column, splitting it into five indicator columns: petrol, diesel, electric, lpg and cng. Note that combined values like Petrol + CNG switch on two indicators at once.
cars24 <- cars24 |>
  mutate(
    petrol = ifelse(grepl("petrol", fuel, ignore.case = TRUE), 1, 0),
    diesel = ifelse(grepl("diesel", fuel, ignore.case = TRUE), 1, 0),
    electric = ifelse(grepl("electric", fuel, ignore.case = TRUE), 1, 0),
    lpg = ifelse(grepl("lpg", fuel, ignore.case = TRUE), 1, 0),
    cng = ifelse(grepl("cng", fuel, ignore.case = TRUE), 1, 0)
  )
head(cars24)
## car_brand model price year location fuel km_driven gear
## 1 Hyundai EonERA PLUS 330399 2016 Hyderabad Petrol 10674 Manual
## 2 Maruti Wagon R 1.0LXI 350199 2011 Hyderabad Petrol 20979 Manual
## 3 Maruti Alto K10LXI 229199 2011 Hyderabad Petrol 47330 Manual
## 4 Maruti RitzVXI BS IV 306399 2011 Hyderabad Petrol 19662 Manual
## 5 Tata NanoTWIST XTA 208699 2015 Hyderabad Petrol 11256 Automatic
## 6 Maruti AltoLXI 249699 2012 Hyderabad Petrol 28434 Manual
## ownership monthly_payment age multiple_owner petrol diesel electric lpg cng
## 1 2 7350 8 1 1 0 0 0 0
## 2 1 7790 13 0 1 0 0 0 0
## 3 2 5098 13 1 1 0 0 0 0
## 4 1 6816 13 0 1 0 0 0 0
## 5 1 4642 9 0 1 0 0 0 0
## 6 1 5554 12 0 1 0 0 0 0
Just as I converted ownership into a binary column, I am going to convert the gear column into a binary one: a new column named manual, set to 1 for rows with manual gear and 0 for rows with automatic gear. I didn't use one-hot encoding here because this way I end up with one less column (and one less independent variable).
cars24$manual <- ifelse(grepl("manual", cars24$gear, ignore.case = TRUE), 1, 0)
head(cars24)
## car_brand model price year location fuel km_driven gear
## 1 Hyundai EonERA PLUS 330399 2016 Hyderabad Petrol 10674 Manual
## 2 Maruti Wagon R 1.0LXI 350199 2011 Hyderabad Petrol 20979 Manual
## 3 Maruti Alto K10LXI 229199 2011 Hyderabad Petrol 47330 Manual
## 4 Maruti RitzVXI BS IV 306399 2011 Hyderabad Petrol 19662 Manual
## 5 Tata NanoTWIST XTA 208699 2015 Hyderabad Petrol 11256 Automatic
## 6 Maruti AltoLXI 249699 2012 Hyderabad Petrol 28434 Manual
## ownership monthly_payment age multiple_owner petrol diesel electric lpg cng
## 1 2 7350 8 1 1 0 0 0 0
## 2 1 7790 13 0 1 0 0 0 0
## 3 2 5098 13 1 1 0 0 0 0
## 4 1 6816 13 0 1 0 0 0 0
## 5 1 4642 9 0 1 0 0 0 0
## 6 1 5554 12 0 1 0 0 0 0
## manual
## 1 1
## 2 1
## 3 1
## 4 1
## 5 0
## 6 1
Here I am not going to apply one-hot encoding, since I would end up with 26 more columns, one for each brand. Instead I am going to target-encode this column, replacing each car brand with the average selling price (the target value) associated with that brand.
cars24 <- cars24 |>
  group_by(car_brand) |>
  mutate(brand_new = mean(price, na.rm = TRUE)) |>
  ungroup()
head(cars24)
## # A tibble: 6 × 19
## car_brand model price year location fuel km_driven gear ownership
## <chr> <chr> <int> <int> <chr> <chr> <int> <chr> <int>
## 1 Hyundai EonERA PLUS 330399 2016 Hyderab… Petr… 10674 Manu… 2
## 2 Maruti Wagon R 1.0LXI 350199 2011 Hyderab… Petr… 20979 Manu… 1
## 3 Maruti Alto K10LXI 229199 2011 Hyderab… Petr… 47330 Manu… 2
## 4 Maruti RitzVXI BS IV 306399 2011 Hyderab… Petr… 19662 Manu… 1
## 5 Tata NanoTWIST XTA 208699 2015 Hyderab… Petr… 11256 Auto… 1
## 6 Maruti AltoLXI 249699 2012 Hyderab… Petr… 28434 Manu… 1
## # ℹ 10 more variables: monthly_payment <int>, age <dbl>, multiple_owner <dbl>,
## # petrol <dbl>, diesel <dbl>, electric <dbl>, lpg <dbl>, cng <dbl>,
## # manual <dbl>, brand_new <dbl>
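As a sanity check on the encoding, every brand should map to exactly one encoded value; a quick sketch (output not shown):
# Inspect the brand -> mean-price mapping created by the target encoding
cars24 |>
  distinct(car_brand, brand_new) |>
  arrange(desc(brand_new)) |>
  head(5)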
cars24 <- cars24 |>
  group_by(model) |>
  mutate(model_new = mean(price, na.rm = TRUE)) |>
  ungroup()
head(cars24)
## # A tibble: 6 × 20
## car_brand model price year location fuel km_driven gear ownership
## <chr> <chr> <int> <int> <chr> <chr> <int> <chr> <int>
## 1 Hyundai EonERA PLUS 330399 2016 Hyderab… Petr… 10674 Manu… 2
## 2 Maruti Wagon R 1.0LXI 350199 2011 Hyderab… Petr… 20979 Manu… 1
## 3 Maruti Alto K10LXI 229199 2011 Hyderab… Petr… 47330 Manu… 2
## 4 Maruti RitzVXI BS IV 306399 2011 Hyderab… Petr… 19662 Manu… 1
## 5 Tata NanoTWIST XTA 208699 2015 Hyderab… Petr… 11256 Auto… 1
## 6 Maruti AltoLXI 249699 2012 Hyderab… Petr… 28434 Manu… 1
## # ℹ 11 more variables: monthly_payment <int>, age <dbl>, multiple_owner <dbl>,
## # petrol <dbl>, diesel <dbl>, electric <dbl>, lpg <dbl>, cng <dbl>,
## # manual <dbl>, brand_new <dbl>, model_new <dbl>
Scaling puts the features on comparable ranges, which makes the fitted coefficients directly comparable. For a general sense of why we scale, consider the ranges below:
age ranges between 3 and 17
km_driven ranges between 179 and 912380
Since the features differ so massively in range, we will bring them onto a common scale.
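Specifically, we apply min-max scaling, which maps every numeric column onto the [0, 1] range:
\[ x' = \frac{x - \min(x)}{\max(x) - \min(x)} \]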
cars24_scaled <- cars24 |>
  mutate(across(where(is.numeric), ~ (. - min(.)) / (max(.) - min(.))))
head(cars24_scaled)
## # A tibble: 6 × 20
## car_brand model price year location fuel km_driven gear ownership
## <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 Hyundai EonERA PLUS 0.0703 0.643 Hyderab… Petr… 0.0113 Manu… 0.333
## 2 Maruti Wagon R 1.0LXI 0.0761 0.286 Hyderab… Petr… 0.0226 Manu… 0
## 3 Maruti Alto K10LXI 0.0406 0.286 Hyderab… Petr… 0.0515 Manu… 0.333
## 4 Maruti RitzVXI BS IV 0.0633 0.286 Hyderab… Petr… 0.0212 Manu… 0
## 5 Tata NanoTWIST XTA 0.0346 0.571 Hyderab… Petr… 0.0120 Auto… 0
## 6 Maruti AltoLXI 0.0466 0.357 Hyderab… Petr… 0.0308 Manu… 0
## # ℹ 11 more variables: monthly_payment <dbl>, age <dbl>, multiple_owner <dbl>,
## # petrol <dbl>, diesel <dbl>, electric <dbl>, lpg <dbl>, cng <dbl>,
## # manual <dbl>, brand_new <dbl>, model_new <dbl>
For the purpose of this project I am choosing the following explanatory variables. I am not choosing electric, because it appears in only two rows of the entire dataset and therefore carries almost no information. The target variable is price.
brand_new
age
km_driven
petrol
diesel
lpg
cng
manual
multiple_owner
set.seed(42) # For reproducibility
train_indices <- createDataPartition(cars24_scaled$price, p = 0.8, list = FALSE)
train_set <- cars24_scaled[train_indices, ]
test_set <- cars24_scaled[-train_indices, ]
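A quick sanity check that the split is roughly 80/20; a sketch (output not shown):
# Fraction of rows that landed in the training set; should be close to 0.8
nrow(train_set) / nrow(cars24_scaled)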
model1 <- lm(price ~ brand_new + age + petrol + diesel + lpg + cng + km_driven + manual + multiple_owner, data = train_set)
model1$coefficients
## (Intercept) brand_new age petrol diesel
## 0.062548726 0.443192610 -0.184582277 0.121840217 0.163061647
## lpg cng km_driven manual multiple_owner
## 0.015062922 -0.012241668 -0.067749398 -0.061311493 -0.001424012
The equation of the model is mentioned below:
\[ \begin{align} \text{Price} &= 0.062 + 0.443 \times \text{brand} - 0.184 \times \text{age} + 0.121 \times \text{petrol} \\ &\quad + 0.163 \times \text{diesel} + 0.015 \times \text{lpg} - 0.012 \times \text{cng} \\ &\quad - 0.067 \times \text{km_driven} - 0.061 \times \text{manual} - 0.001 \times \text{multiple_owner}\end{align} \]
vif_values <- vif(model1)
vif_df <- data.frame(
  Variable = names(vif_values),
  VIF = as.numeric(vif_values)
)
vif_df |>
  ggplot() +
  geom_bar(mapping = aes(x = VIF, y = Variable), stat = "identity", fill = "steelblue") + # Horizontal bars
  geom_vline(xintercept = 5, linetype = "dashed", color = "red", linewidth = 1) + # Reference line at VIF = 5
  labs(title = "VIF Values", x = "VIF", y = "Variable") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0), # Keep x-axis labels horizontal
        axis.text.y = element_text(angle = 0)) # Keep y-axis labels horizontal
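For reference, the VIF of predictor \(j\) is computed from the \(R^2\) obtained by regressing that predictor on all the other predictors, so a VIF of 5 corresponds to \(R_j^2 = 0.8\):
\[ \text{VIF}_j = \frac{1}{1 - R_j^2} \]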
The plot suggests that the variables petrol and diesel are heavily correlated. Let's remove one of them and see whether the variance inflation factors improve.
model1_new <- lm(price ~ brand_new + age + petrol + lpg + cng + km_driven + manual + multiple_owner, data = train_set)
model1_new$coefficients
## (Intercept) brand_new age petrol lpg
## 0.225644856 0.442652257 -0.184525079 -0.041110309 0.015077654
## cng km_driven manual multiple_owner
## -0.012251173 -0.067168255 -0.061407508 -0.001575917
vif_values <- vif(model1_new)
vif_df <- data.frame(
  Variable = names(vif_values),
  VIF = as.numeric(vif_values)
)
vif_df |>
  ggplot() +
  geom_bar(mapping = aes(x = VIF, y = Variable), stat = "identity", fill = "steelblue") +
  geom_vline(xintercept = 5, linetype = "dashed", color = "red", linewidth = 1) +
  labs(title = "VIF Values after removing variable diesel", x = "VIF", y = "Variable") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0),
        axis.text.y = element_text(angle = 0))
Since the VIF of every variable is below 5, there is no serious multicollinearity.
After removing the variable diesel, the equation of the new model is:
\[ \begin{align} Price &= 0.225 + 0.442 \times \text{brand} - 0.184 \times \text{age} - 0.041 \times \text{petrol} \\ &\quad + 0.015 \times \text{lpg} - 0.012 \times \text{cng} - 0.067 \times \text{km_driven} \\ &\quad - 0.061 \times \text{manual} - 0.001 \times \text{multiple_owner}\end{align} \]
coef_df <- data.frame(Predictor = names(coef(model1_new)[-1]), coefficient = coef(model1_new)[-1])
ggplot(coef_df, aes(x = reorder(Predictor, coefficient), y = coefficient)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() + # Flip for horizontal bars
  labs(
    title = "Linear Regression Coefficients",
    x = "Predictors",
    y = "Coefficient Value"
  ) +
  theme_minimal()
summary(model1_new)$r.squared
## [1] 0.6572958
The R-squared value is 0.657. This implies that approximately 66% of the variance in the dependent variable (car price) is explained by the independent variables (age, km_driven, brand, and so on) in the regression model. This value makes sense, because we haven't yet used variables such as model, location and EMI.
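For reference, \(R^2\) is the share of the variance in price that the model accounts for:
\[ R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2} \]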
gg_resfitted(model1_new) +
  geom_smooth(se = FALSE) +
  theme_minimal()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
gg_resX(model1_new, plot.all = FALSE)
## $brand_new
##
## $age
##
## $petrol
##
## $lpg
##
## $cng
##
## $km_driven
##
## $manual
##
## $multiple_owner
gg_reshist(model1_new) +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
gg_qqplot(model1_new) +
  theme_minimal()
# mean squared error
mse <- mean(model1_new$residuals ^ 2)
# root mean squared error
rmse <- sqrt(mse)
print(rmse)
## [1] 0.05099379
What is RMSE?
RMSE stands for Root Mean Squared Error, a commonly used metric for evaluating the performance of a regression model. It measures the average magnitude of the errors (the differences between predicted and actual values) in the same units as the target variable.
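Formally, for \(n\) observations with actual values \(y_i\) and predictions \(\hat{y}_i\):
\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]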
Interpretation:
If RMSE is 0, the model's predictions are perfect.
A lower RMSE indicates better performance, but it should be compared against other models or benchmarks for context.
Here the RMSE of 0.051 is measured on the min-max-scaled price, which ranges from 0 to 1, so it amounts to roughly 5% of the full price range; this suggests the model fits reasonably well.
# Make predictions on the test set
test_set$predicted_price <- predict(model1_new, newdata = test_set)
test_set <- test_set |>
  mutate(
    # Undo the min-max scaling: price ranges from 91000 to 3495000 INR
    price_ = price * (3495000 - 91000) + 91000,
    predicted_price_1 = predicted_price * (3495000 - 91000) + 91000
  )
test_set |>
  select(car_brand, price_, predicted_price_1) |>
  head(10)
## # A tibble: 10 × 3
## car_brand price_ predicted_price_1
## <chr> <dbl> <dbl>
## 1 Maruti 249699 285770.
## 2 Maruti 240599 240231.
## 3 Hyundai 401599 441285.
## 4 Maruti 241599 241317.
## 5 Maruti 356099 505645.
## 6 Hyundai 401699 402981.
## 7 Maruti 238099 284154.
## 8 Maruti 613499 551835.
## 9 Hyundai 222699 114447.
## 10 Hyundai 346399 258847.
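The training RMSE above can be complemented with an out-of-sample figure; a sketch that computes the test-set RMSE in rupee units using the rescaled columns (output not shown):
# RMSE on the held-out test set, expressed in INR
sqrt(mean((test_set$price_ - test_set$predicted_price_1)^2))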
set.seed(42) # For reproducibility
train_indices <- createDataPartition(cars24_scaled$price, p = 0.8, list = FALSE)
train_set_n <- cars24_scaled[train_indices, ]
test_set_n <- cars24_scaled[-train_indices, ]
model2 <- lm(price ~ brand_new + model_new + age + petrol + lpg + cng + km_driven + manual + multiple_owner + monthly_payment, data = train_set_n)
summary(model2)
##
## Call:
## lm(formula = price ~ brand_new + model_new + age + petrol + lpg +
## cng + km_driven + manual + multiple_owner + monthly_payment,
## data = train_set_n)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0000068354 -0.0000032991 0.0000000698 0.0000033353 0.0000067163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.000003171042 0.000000383605 -8.266 <2e-16 ***
## brand_new 0.000000237386 0.000000825224 0.288 0.774
## model_new -0.000000761144 0.000002504408 -0.304 0.761
## age 0.000000461986 0.000000383265 1.205 0.228
## petrol 0.000000008533 0.000000146433 0.058 0.954
## lpg -0.000000400379 0.000001111758 -0.360 0.719
## cng -0.000000230332 0.000000379372 -0.607 0.544
## km_driven -0.000001416548 0.000001499095 -0.945 0.345
## manual -0.000000186398 0.000000208804 -0.893 0.372
## multiple_owner 0.000000029390 0.000000135818 0.216 0.829
## monthly_payment 0.999998961918 0.000002509395 398502.000 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.000003824 on 4513 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.348e+11 on 10 and 4513 DF, p-value: < 2.2e-16
model2$coefficients
## (Intercept) brand_new model_new age
## -0.000003171041555 0.000000237385647 -0.000000761144379 0.000000461986113
## petrol lpg cng km_driven
## 0.000000008533082 -0.000000400379181 -0.000000230331561 -0.000001416547952
## manual multiple_owner monthly_payment
## -0.000000186398325 0.000000029389619 0.999998961918167
\[ \begin{align} Price &= -0.000003 + 0.0000002 \times \text{brand} - 0.0000008 \times \text{model} + 0.0000005 \times \text{age}\\ &\quad + 0.000000009 \times \text{petrol} - 0.0000004 \times \text{lpg} - 0.0000002 \times \text{cng} \\ &\quad - 0.000001 \times \text{km_driven} - 0.0000002 \times \text{manual} \\ &\quad + 0.00000003 \times \text{multiple_owner} + 0.999999 \times \text{monthly_payment} \end{align} \]
coef_df <- data.frame(Predictor = names(coef(model2)[-1]), coefficient = coef(model2)[-1])
ggplot(coef_df, aes(x = reorder(Predictor, coefficient), y = coefficient)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() + # Flip for horizontal bars
  labs(
    title = "Linear Regression Coefficients",
    x = "Predictors",
    y = "Coefficient Value"
  ) +
  theme_minimal()
model2_new <- lm(price ~ brand_new + model_new + age + petrol + lpg + cng + km_driven + manual + multiple_owner , data = train_set_n)
model2_new$coefficients
## (Intercept) brand_new model_new age petrol
## 0.047271441 0.044209548 0.894126803 -0.044030342 -0.006238419
## lpg cng km_driven manual multiple_owner
## 0.011383853 -0.006645599 -0.078064515 -0.004021122 -0.002311798
\[ \begin{align} Price &= 0.047 + 0.044 \times \text{brand} + 0.89 \times \text{model} - 0.044 \times \text{age} \\ &\quad - 0.006 \times \text{petrol} + 0.011 \times \text{lpg} - 0.006 \times \text{cng} \\ &\quad - 0.078 \times \text{km_driven} - 0.004 \times \text{manual} - 0.002 \times \text{multiple_owner}\end{align} \]
coef_df <- data.frame(Predictor = names(coef(model2_new)[-1]), coefficient = coef(model2_new)[-1])
ggplot(coef_df, aes(x = reorder(Predictor, coefficient), y = coefficient)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() + # Flip for horizontal bars
  labs(
    title = "Linear Regression Coefficients",
    x = "Predictors",
    y = "Coefficient Value"
  ) +
  theme_minimal()
vif_values <- vif(model2_new)
vif_df <- data.frame(
  Variable = names(vif_values),
  VIF = as.numeric(vif_values)
)
vif_df |>
  ggplot() +
  geom_bar(mapping = aes(x = VIF, y = Variable), stat = "identity", fill = "steelblue") + # Horizontal bars
  geom_vline(xintercept = 5, linetype = "dashed", color = "red", linewidth = 1) + # Reference line at VIF = 5
  labs(title = "VIF Values", x = "VIF", y = "Variable") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0),
        axis.text.y = element_text(angle = 0))
summary(model2_new)$r.squared
## [1] 0.9323682
# mean squared error
mse <- mean(model2_new$residuals ^ 2)
# root mean squared error
rmse <- sqrt(mse)
print(rmse)
## [1] 0.02265338
# Make predictions on the test set
test_set_n$predicted_price <- predict(model2_new, newdata = test_set_n)
test_set_n <- test_set_n |>
  mutate(
    # Undo the min-max scaling: price ranges from 91000 to 3495000 INR
    price_ = price * (3495000 - 91000) + 91000,
    predicted_price_2 = predicted_price * (3495000 - 91000) + 91000
  )
test_set_n |>
  select(car_brand, price_, predicted_price_2) |>
  head(10)
## # A tibble: 10 × 3
## car_brand price_ predicted_price_2
## <chr> <dbl> <dbl>
## 1 Maruti 249699 224233.
## 2 Maruti 240599 212745.
## 3 Hyundai 401599 395450.
## 4 Maruti 241599 214006.
## 5 Maruti 356099 338990.
## 6 Hyundai 401699 394006.
## 7 Maruti 238099 222355.
## 8 Maruti 613499 513521.
## 9 Hyundai 222699 155264.
## 10 Hyundai 346399 259326.
If you think about it, we need to know the price of a car before we can determine its monthly payment, and the plot below shows that the EMI (monthly_payment) and the price form an almost perfectly linear relationship. Using it as an explanatory variable would leak the target into the model, so we must drop it.
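A quick numeric check backs this up; a sketch (output not shown):
# Correlation between EMI and price; a value near 1 means monthly_payment
# is essentially a rescaled copy of the target
cor(cars24$monthly_payment, cars24$price)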
cars24 |>
  ggplot() +
  geom_point(mapping = aes(x = monthly_payment, y = price)) +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  labs(x = "Monthly payment",
       y = "Car price (INR)",
       title = "Monthly payment Vs Car Price") +
  theme_minimal()
gg_resfitted(model2_new) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
gg_reshist(model2_new)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
gg_qqplot(model2_new)