Problem Statement

The main idea I had behind using this dataset was to try and find some way to predict the selling price of a used car based on brand, model, age, no of previous owners, fuel type, kilometers driven and transmission type. In this dataset I have the selling price for all the rows, but if you consider a scenario like while adding a new row of a used car data, someone should make an assessment of the car and figure out what the selling price should be. I am trying to find a way to automate this part using linear regression.

Importing libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(dplyr)
library(ggrepel)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(patchwork)
library(broom)
library(lindia)
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(caret)

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(vtable)

## Loading required package: kableExtra
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

options(scipen = 6)

Importing data set

cars24 <- read.csv("Cars24.csv", na.strings = "") 

head(cars24)

##   car_brand          model  price year  location   fuel km_driven      gear
## 1   Hyundai    EonERA PLUS 330399 2016 Hyderabad Petrol     10674    Manual
## 2    Maruti Wagon R 1.0LXI 350199 2011 Hyderabad Petrol     20979    Manual
## 3    Maruti    Alto K10LXI 229199 2011 Hyderabad Petrol     47330    Manual
## 4    Maruti  RitzVXI BS IV 306399 2011 Hyderabad Petrol     19662    Manual
## 5      Tata  NanoTWIST XTA 208699 2015 Hyderabad Petrol     11256 Automatic
## 6    Maruti        AltoLXI 249699 2012 Hyderabad Petrol     28434    Manual
##   ownership monthly_payment
## 1         2            7350
## 2         1            7790
## 3         2            5098
## 4         1            6816
## 5         1            4642
## 6         1            5554

Assumptions made

For the purpose of this project, it is understood that selling price (price) is the dependent variable also called as Response variable.

And the independent variables are car_brand, model, year, fuel, km_driven, gear and ownership. Independent variables are also called as explanatory variable.

Initial EDA

Checking for null values

sapply(cars24, function(x) sum(is.na(x)))

##       car_brand           model           price            year        location 
##               0             265               0               0               0 
##            fuel       km_driven            gear       ownership monthly_payment 
##               0               0               0               0               0

This shows us that all the cells have some value, except for the column model. We will remove those rows before feeding it to our mode.

cars24 <- na.omit(cars24)
sapply(cars24, function(x) sum(is.na(x)))

##       car_brand           model           price            year        location 
##               0               0               0               0               0 
##            fuel       km_driven            gear       ownership monthly_payment 
##               0               0               0               0               0

Now we can proceed exploring each column to get a better understanding.

Exploring each column

Creating a function that outputs python equivalent value_count function in pandas.

value_counts <- function(df, col_name) {
  df |>
    group_by({{ col_name }}) |>
    summarise(n=n()) |>
    arrange(desc(n))
}

Column : car_brand

head(value_counts(cars24, car_brand),5)

## # A tibble: 5 × 2
##   car_brand      n
##   <chr>      <int>
## 1 Maruti      2744
## 2 Hyundai     1218
## 3 Honda        447
## 4 Toyota       286
## 5 Volkswagen   186

This shows us the top 5 selling brand in India is Maruti, Hyundai, Honda, Toyota and Volkswagen.

cars24|>
  filter(car_brand %in% head(value_counts(cars24, car_brand),5)$car_brand) |>
  ggplot() +
  geom_bar(mapping = aes(x = car_brand)) +
  labs(title = "Top 5 Selling brands",
       x = "Car Brand",
       y = "Count") +
  theme_hc()

Column : model

head(value_counts(cars24, model),5)

## # A tibble: 5 × 2
##   model              n
##   <chr>          <int>
## 1 Alto 800LXI      184
## 2 Wagon R 1.0VXI   177
## 3 Alto K10VXI      156
## 4 SwiftVDI         137
## 5 SwiftVXI         135

cars24|>
  filter(model %in% head(value_counts(cars24, model),5)$model) |>
  ggplot() +
  geom_bar(mapping = aes(x = model, fill = car_brand)) +
  labs(title = "Top 5 Selling car models",
       x = "Car model",
       y = "Count") +
  theme_hc()

Column : year

For applying linear regression it is better to find age of each car using the year, instead of using the year column itself.

cars24$age <- year(now()) - cars24$year

head(cars24)

##   car_brand          model  price year  location   fuel km_driven      gear
## 1   Hyundai    EonERA PLUS 330399 2016 Hyderabad Petrol     10674    Manual
## 2    Maruti Wagon R 1.0LXI 350199 2011 Hyderabad Petrol     20979    Manual
## 3    Maruti    Alto K10LXI 229199 2011 Hyderabad Petrol     47330    Manual
## 4    Maruti  RitzVXI BS IV 306399 2011 Hyderabad Petrol     19662    Manual
## 5      Tata  NanoTWIST XTA 208699 2015 Hyderabad Petrol     11256 Automatic
## 6    Maruti        AltoLXI 249699 2012 Hyderabad Petrol     28434    Manual
##   ownership monthly_payment age
## 1         2            7350   8
## 2         1            7790  13
## 3         2            5098  13
## 4         1            6816  13
## 5         1            4642   9
## 6         1            5554  12

Now, with age we’ll visualize it.

head(value_counts(cars24, age),5)

## # A tibble: 5 × 2
##     age     n
##   <dbl> <int>
## 1     9   707
## 2    10   662
## 3     7   638
## 4     8   609
## 5    12   586

cars24 |>
  ggplot() +
  geom_histogram(mapping = aes(x = age)) +
  labs(title = "Distribution of age",
       x = "Age",
       y = "Count") +
  theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Column : fuel

value_counts(cars24, fuel)

## # A tibble: 5 × 2
##   fuel             n
##   <chr>        <int>
## 1 Petrol        3611
## 2 Diesel        1877
## 3 Petrol + CNG   146
## 4 Petrol + LPG    17
## 5 Electric         2

Let’s see it visually.

cars24 |>
  ggplot() +
  geom_bar(mapping = aes(x = fuel)) +
  labs(title = "Distribution of Fuel type",
       x = "Fuel type",
       y = "Count") +
  theme_minimal()

With this plot it is evident that Petrol and Diesel are the most common fuel type used in India, followed by Petrol+CNG, Petrol+LPG. Electric vehicles are not that popular in India.

Lets see the distribution of fuel type among top 5 car brands in India

cars24 |>
  filter(car_brand %in% head(value_counts(cars24, car_brand),5)$car_brand) |>
  ggplot() +
  geom_bar(mapping = aes(x = car_brand, fill = fuel)) +
  labs(title = "Top 5 Selling brands",
       x = "Car Brand",
       y = "Count") +
  theme_minimal()

Column : km_driven

cars24 |>
  ggplot() +
  geom_histogram(mapping = aes(x = km_driven, fill = fuel), color = 'white') +
  labs(title = "Distribution of the variable km_driven",
       x = "Kilometers driven",
       y = "Count") +
  theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Column : gear

cars24 |>
  ggplot() +
  geom_bar(mapping = aes(x = gear, fill = fuel)) +
  labs(title = "Distribution of transmission type",
       x = "Fuel type",
       y = "Count") +
  theme_minimal()

From the above plot we can clearly see that the manual gear type is more popular than the automatic one.

Column : ownership

For the purpose of applying linear regression, we are going to transform the ownership column into a new column named multiple_owner. This column will have values 0 and 1. 0 mentions that the car doesn’t have multiple owner, that is, it has only single owner. And 1 depicts that the car has multiple owner.

Let’s transform this column.

cars24$multiple_owner <- ifelse(cars24$ownership > 1, 1, 0)

head(cars24)

##   car_brand          model  price year  location   fuel km_driven      gear
## 1   Hyundai    EonERA PLUS 330399 2016 Hyderabad Petrol     10674    Manual
## 2    Maruti Wagon R 1.0LXI 350199 2011 Hyderabad Petrol     20979    Manual
## 3    Maruti    Alto K10LXI 229199 2011 Hyderabad Petrol     47330    Manual
## 4    Maruti  RitzVXI BS IV 306399 2011 Hyderabad Petrol     19662    Manual
## 5      Tata  NanoTWIST XTA 208699 2015 Hyderabad Petrol     11256 Automatic
## 6    Maruti        AltoLXI 249699 2012 Hyderabad Petrol     28434    Manual
##   ownership monthly_payment age multiple_owner
## 1         2            7350   8              1
## 2         1            7790  13              0
## 3         2            5098  13              1
## 4         1            6816  13              0
## 5         1            4642   9              0
## 6         1            5554  12              0

cars24 |>
  ggplot() +
  geom_bar(mapping = aes(x = multiple_owner)) +
  labs(title = "Distribution of ownership type",
       x = "Ownership type",
       y = "Count") +
  scale_x_continuous(breaks = c(0, 1)) +
  theme_minimal()

Column : price

cars24 |>
  ggplot() +
  geom_histogram(mapping = aes(x = price), color = 'white') +
  labs(title = "Distribution of the variable km_driven",
       x = "Kilometers driven",
       y = "Count") +
  theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This plot is heavily right skewed and with a tail, which suggests us that there might be outliers that needs to be addressed.

Hypothesis Test

Test 1

I know for a fact that the price of the car decreases as the age of the car increases. Let’s devise a hypothesis test to confirm this.

Null Hypothesis: There is no relationship between the age of a car and its average price, meaning the age of the car does not affect its price.

Alternative Hypothesis: There is a relationship between the age of a car and its average price, such that the price decreases as the car’s age increases.

test_model <- lm(price ~ age, data = cars24)

summary(test_model)

## 
## Call:
## lm(formula = price ~ age, data = cars24)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -460089 -131985  -45685   55271 3215036 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   993461      12150   81.77   <2e-16 ***
## age           -50964       1218  -41.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 263100 on 5651 degrees of freedom
## Multiple R-squared:  0.2366, Adjusted R-squared:  0.2365 
## F-statistic:  1752 on 1 and 5651 DF,  p-value: < 2.2e-16

If you see the p-value for this model is 2.2 * e^-16, which is less than the significance value of 0.05. This means that we can reject our null hypothesis. That means there is a relationship between the age and price of the car. And by looking at the coefficient of age in the model it is evident that, it is a negative relationship since it is a negative coefficient.

cars24 |>
  ggplot(mapping = aes(x = age, y = price)) +
  geom_point(color = 'darkblue') +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  geom_smooth(method = "lm", se = FALSE, color = 'black') +
  labs(x = "Age of the car",
       y = "Car Price (INR)",
       title = "Age Vs Price") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Test 2

I know for a fact that the price of the car decreases as the km_driven of the car increases. Let’s devise a hypothesis test to confirm this.

Null Hypothesis: There is no relationship between the km_driven of a car and its average price.

Alternative Hypothesis: There is a relationship between the km_driven of a car and its average price.

test_model <- lm(price ~ km_driven, data = cars24)

summary(test_model)

## 
## Call:
## lm(formula = price ~ km_driven, data = cars24)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -416457 -185675  -79840   92514 3023395 
## 
## Coefficients:
##                 Estimate   Std. Error t value      Pr(>|t|)    
## (Intercept) 540147.74177   7062.18070  76.485       < 2e-16 ***
## km_driven       -0.54701      0.09469  -5.777 0.00000000803 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 300300 on 5651 degrees of freedom
## Multiple R-squared:  0.00587,    Adjusted R-squared:  0.005694 
## F-statistic: 33.37 on 1 and 5651 DF,  p-value: 0.000000008034

If you see the p-value for this model is 0.000000008034, which is less than the significance value of 0.05. This means that we can reject our null hypothesis. That means there is a relationship between the km_driven and price of the car. And by looking at the coefficient of age in the model it is evident that, it is a negative relationship since it is a negative coefficient.

cars24 |>
  ggplot(mapping = aes(x = km_driven, y = price)) +
  geom_point(color = 'darkblue') +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  geom_smooth(method = "lm", se = FALSE, color = 'black') +
  labs(x = "Distance driven by the car (km)",
       y = "Car Price (INR)",
       title = "km_driven Vs Price") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Test 3

The categorical variable selected is Fuel. The Fuel column has four different values, Petrol, Diesel, Electric, Petrol+LPG and Petrol+CNG. Let’s devise an ANOVA test to figure out whether there is any difference between the response variable between the four class of transmission.

Null Hypothesis : There is no significant difference in the mean car price across different fuel types.

Alternative Hypothesis : There is a significant difference in the mean car price across different fuel types.

Choosing the Significance value (\(\alpha\)) to be 0.05.

m <- aov(price ~ fuel, data = cars24)
summary(m)

##               Df          Sum Sq        Mean Sq F value Pr(>F)    
## fuel           4  60061539128291 15015384782073   187.4 <2e-16 ***
## Residuals   5648 452479435103508    80113214431                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p value is less than \(\alpha\), we can reject the null hypothesis. Which means that there is significant difference in the mean car price across different fuel types.

Since we rejected the null hypothesis, which means that there is enough evidence to conclude there is significant difference between the fuel types. This result suggests that the fuel type of the car impacts its price in the used car market. Understanding this relationship can guide buyers, sellers, and dealers in setting or negotiating car prices more effectively.

cars24 |>
  filter(price < 4000000) |>
  ggplot() +
  geom_boxplot(mapping = aes(y = price, x = fuel)) +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  labs(x = "Fuel Type",
       y = "Car Price (INR)",
       title = "Fuel Type Vs Price")

Preprocessing the data

Converting categorical variables into numerical variables

Column Fuel

I am going to apply one hot encoding to this column, ultimately splitting it into 5 different columns namely, diesel, petrol, electric, LPG and CNG.

cars24 <- cars24 |>
  mutate(
    petrol = ifelse(grepl("petrol", fuel, ignore.case = TRUE), 1, 0),
    diesel = ifelse(grepl("diesel", fuel, ignore.case = TRUE), 1, 0),
    electric = ifelse(grepl("electric", fuel, ignore.case = TRUE), 1, 0),
    lpg = ifelse(grepl("lpg", fuel, ignore.case = TRUE), 1, 0),
    cng = ifelse(grepl("cng", fuel, ignore.case = TRUE), 1, 0),
  )

head(cars24)

##   car_brand          model  price year  location   fuel km_driven      gear
## 1   Hyundai    EonERA PLUS 330399 2016 Hyderabad Petrol     10674    Manual
## 2    Maruti Wagon R 1.0LXI 350199 2011 Hyderabad Petrol     20979    Manual
## 3    Maruti    Alto K10LXI 229199 2011 Hyderabad Petrol     47330    Manual
## 4    Maruti  RitzVXI BS IV 306399 2011 Hyderabad Petrol     19662    Manual
## 5      Tata  NanoTWIST XTA 208699 2015 Hyderabad Petrol     11256 Automatic
## 6    Maruti        AltoLXI 249699 2012 Hyderabad Petrol     28434    Manual
##   ownership monthly_payment age multiple_owner petrol diesel electric lpg cng
## 1         2            7350   8              1      1      0        0   0   0
## 2         1            7790  13              0      1      0        0   0   0
## 3         2            5098  13              1      1      0        0   0   0
## 4         1            6816  13              0      1      0        0   0   0
## 5         1            4642   9              0      1      0        0   0   0
## 6         1            5554  12              0      1      0        0   0   0

Column gear

Like how I converted ownership into binary values, I am also going to convert gear column into a binary one. Creating a new column named manual, and going to update 1 for the rows with manual gear and 0 for rows with automatic gear. I didn’t use one hot encoding, because this way I’m getting one less column (one less independent variable as well).

cars24$manual <- ifelse(grepl("manual", cars24$gear, ignore.case = TRUE), 1, 0)

head(cars24)

##   car_brand          model  price year  location   fuel km_driven      gear
## 1   Hyundai    EonERA PLUS 330399 2016 Hyderabad Petrol     10674    Manual
## 2    Maruti Wagon R 1.0LXI 350199 2011 Hyderabad Petrol     20979    Manual
## 3    Maruti    Alto K10LXI 229199 2011 Hyderabad Petrol     47330    Manual
## 4    Maruti  RitzVXI BS IV 306399 2011 Hyderabad Petrol     19662    Manual
## 5      Tata  NanoTWIST XTA 208699 2015 Hyderabad Petrol     11256 Automatic
## 6    Maruti        AltoLXI 249699 2012 Hyderabad Petrol     28434    Manual
##   ownership monthly_payment age multiple_owner petrol diesel electric lpg cng
## 1         2            7350   8              1      1      0        0   0   0
## 2         1            7790  13              0      1      0        0   0   0
## 3         2            5098  13              1      1      0        0   0   0
## 4         1            6816  13              0      1      0        0   0   0
## 5         1            4642   9              0      1      0        0   0   0
## 6         1            5554  12              0      1      0        0   0   0
##   manual
## 1      1
## 2      1
## 3      1
## 4      1
## 5      0
## 6      1

Column Car Brand

Here I am not going to apply one hot encoding, since I will end up with 26 more columns, one for each brand. Instead I am gonna Target encode this column. This replaces the car brands with a number representing the average selling price (target value) associated with each category.

cars24 <- cars24 |> 
  group_by(car_brand) |>
  mutate(brand_new = mean(price, na.rm = TRUE)) |>
  ungroup()

head(cars24)

## # A tibble: 6 × 19
##   car_brand model           price  year location fuel  km_driven gear  ownership
##   <chr>     <chr>           <int> <int> <chr>    <chr>     <int> <chr>     <int>
## 1 Hyundai   EonERA PLUS    330399  2016 Hyderab… Petr…     10674 Manu…         2
## 2 Maruti    Wagon R 1.0LXI 350199  2011 Hyderab… Petr…     20979 Manu…         1
## 3 Maruti    Alto K10LXI    229199  2011 Hyderab… Petr…     47330 Manu…         2
## 4 Maruti    RitzVXI BS IV  306399  2011 Hyderab… Petr…     19662 Manu…         1
## 5 Tata      NanoTWIST XTA  208699  2015 Hyderab… Petr…     11256 Auto…         1
## 6 Maruti    AltoLXI        249699  2012 Hyderab… Petr…     28434 Manu…         1
## # ℹ 10 more variables: monthly_payment <int>, age <dbl>, multiple_owner <dbl>,
## #   petrol <dbl>, diesel <dbl>, electric <dbl>, lpg <dbl>, cng <dbl>,
## #   manual <dbl>, brand_new <dbl>

Column model

cars24 <- cars24 |> 
  group_by(model) |>
  mutate(model_new = mean(price, na.rm = TRUE)) |>
  ungroup()

head(cars24)

## # A tibble: 6 × 20
##   car_brand model           price  year location fuel  km_driven gear  ownership
##   <chr>     <chr>           <int> <int> <chr>    <chr>     <int> <chr>     <int>
## 1 Hyundai   EonERA PLUS    330399  2016 Hyderab… Petr…     10674 Manu…         2
## 2 Maruti    Wagon R 1.0LXI 350199  2011 Hyderab… Petr…     20979 Manu…         1
## 3 Maruti    Alto K10LXI    229199  2011 Hyderab… Petr…     47330 Manu…         2
## 4 Maruti    RitzVXI BS IV  306399  2011 Hyderab… Petr…     19662 Manu…         1
## 5 Tata      NanoTWIST XTA  208699  2015 Hyderab… Petr…     11256 Auto…         1
## 6 Maruti    AltoLXI        249699  2012 Hyderab… Petr…     28434 Manu…         1
## # ℹ 11 more variables: monthly_payment <int>, age <dbl>, multiple_owner <dbl>,
## #   petrol <dbl>, diesel <dbl>, electric <dbl>, lpg <dbl>, cng <dbl>,
## #   manual <dbl>, brand_new <dbl>, model_new <dbl>

Scaling

Scaling is essential for maintaining consistent relationship between the features and improving model performance. To provide a general understanding of why we do scaling consider the below example.

age ranges between 3 to 17
km_driven ranges between 179 to 912380

Since the features have a massive difference on their range, it is better to have these values in scale. Hence we are gonna scale our variables.

cars24_scaled <- cars24 |>
  mutate(across(where(is.numeric), ~ (. - min(.)) / (max(.) - min(.))))

head(cars24_scaled)

## # A tibble: 6 × 20
##   car_brand model           price  year location fuel  km_driven gear  ownership
##   <chr>     <chr>           <dbl> <dbl> <chr>    <chr>     <dbl> <chr>     <dbl>
## 1 Hyundai   EonERA PLUS    0.0703 0.643 Hyderab… Petr…    0.0113 Manu…     0.333
## 2 Maruti    Wagon R 1.0LXI 0.0761 0.286 Hyderab… Petr…    0.0226 Manu…     0    
## 3 Maruti    Alto K10LXI    0.0406 0.286 Hyderab… Petr…    0.0515 Manu…     0.333
## 4 Maruti    RitzVXI BS IV  0.0633 0.286 Hyderab… Petr…    0.0212 Manu…     0    
## 5 Tata      NanoTWIST XTA  0.0346 0.571 Hyderab… Petr…    0.0120 Auto…     0    
## 6 Maruti    AltoLXI        0.0466 0.357 Hyderab… Petr…    0.0308 Manu…     0    
## # ℹ 11 more variables: monthly_payment <dbl>, age <dbl>, multiple_owner <dbl>,
## #   petrol <dbl>, diesel <dbl>, electric <dbl>, lpg <dbl>, cng <dbl>,
## #   manual <dbl>, brand_new <dbl>, model_new <dbl>

Choosing necessary variables

Since for the purpose of this project I am choosing the following explanatory variables. Not choosing electric, because it has only two rows in the entire dataset, so it is redundant. And the target variable is price.

brand_new
age
km_driven
petrol
diesel
lpg
cng
manual
multiple_owner

Train test split

set.seed(42)  # For reproducibility
train_indices <- createDataPartition(cars24_scaled$price, p = 0.8, list = FALSE)

train_set <- cars24_scaled[train_indices, ]
test_set <- cars24_scaled[-train_indices, ]

Linear Regression Model

model1 <- lm(price ~ brand_new + age + petrol + diesel + lpg + cng + km_driven + manual + multiple_owner, data = train_set)

model1$coefficients

##    (Intercept)      brand_new            age         petrol         diesel 
##    0.062548726    0.443192610   -0.184582277    0.121840217    0.163061647 
##            lpg            cng      km_driven         manual multiple_owner 
##    0.015062922   -0.012241668   -0.067749398   -0.061311493   -0.001424012

The equation of the model is mentioned below:

\[ \begin{align} \text{Price} &= 0.062 + 0.443 \times \text{brand} - 0.184 \times \text{age} + 0.121 \times \text{petrol} \\ &\quad + 0.163 \times \text{diesel} + 0.015 \times \text{lpg} - 0.012 \times \text{cng} \\ &\quad - 0.067 \times \text{km_driven} - 0.061 \times \text{manual} - 0.001 \times \text{multiple_owner}\end{align} \]

Check for Multicollinearity

vif_values <- vif(model1)

vif_df <- data.frame(
  Variable = names(vif_values),
  VIF = as.numeric(vif_values)
)

vif_df |>
  ggplot() +
  geom_bar(mapping = aes(x = VIF, y = Variable), stat = "identity", fill = "steelblue") +  # Horizontal bar
  geom_vline(xintercept = 5, linetype = "dashed", color = "red", linewidth = 1) +  # Add vertical line at VIF = 5
  labs(title = "VIF Values", x = "VIF", y = "Variable") +  # Axis labels and title
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0),  # Keep x-axis labels horizontal
        axis.text.y = element_text(angle = 0))  # Keep y-axis labels horizontal

This suggests that the variables petrol and diesel are heavily correlated. Let’s remove any one variable and see whether the variance inflation factor (vif) improves.

model1_new <- lm(price ~ brand_new + age + petrol + lpg + cng + km_driven + manual + multiple_owner, data = train_set)
model1_new$coefficients

##    (Intercept)      brand_new            age         petrol            lpg 
##    0.225644856    0.442652257   -0.184525079   -0.041110309    0.015077654 
##            cng      km_driven         manual multiple_owner 
##   -0.012251173   -0.067168255   -0.061407508   -0.001575917

vif_values <- vif(model1_new)

vif_df <- data.frame(
  Variable = names(vif_values),
  VIF = as.numeric(vif_values)
)

vif_df |>
  ggplot() +
  geom_bar(mapping = aes(x = VIF, y = Variable), stat = "identity", fill = "steelblue") +
  geom_vline(xintercept = 5, linetype = "dashed", color = "red", linewidth = 1) +
  labs(title = "VIF Values after removing variable diesel", x = "VIF", y = "Variable") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0),  
        axis.text.y = element_text(angle = 0))

Since the vif value of all the variables is less than 5, there is no multicollinearity.

After removing the variable diesel, the equation of the new model is:

\[ \begin{align} Price &= 0.225 + 0.442 \times \text{brand} - 0.184 \times \text{age} - 0.041 \times \text{petrol} \\ &\quad + 0.015 \times \text{lpg} - 0.012 \times \text{cng} - 0.067 \times \text{km_driven} \\ &\quad - 0.061 \times \text{manual} - 0.001 \times \text{multiple_owner}\end{align} \]

Weights of coefficients

coef_df <- data.frame(Predictor = names(coef(model1_new)[-1]), coefficient = coef(model1_new)[-1])
ggplot(coef_df, aes(x = reorder(Predictor, coefficient), y = coefficient)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +  # Flip for horizontal bars (optional)
  labs(
    title = "Linear Regression Coefficients",
    x = "Predictors",
    y = "Coefficient Value"
  ) +
  theme_minimal()

R - Squared

summary(model1_new)$r.squared

## [1] 0.6572958

The R-squared value is 0.657. This implies that approximately 65% of the variance in the dependent variable (price of car) can be explained by the independent variables (e.g., age, km_driven, make(brand), etc.) in your regression model. This value makes sense, because we haven’t considered some variables like model, location and emi.

Evaluating the model

Residuals vs. Fitted Values

gg_resfitted(model1_new) +
  geom_smooth(se=FALSE) +
  theme_minimal()

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Residuals vs. X Values

gg_resX(model1_new, plot.all = FALSE)

## $brand_new

## 
## $age

## 
## $petrol

## 
## $lpg

## 
## $cng

## 
## $km_driven

## 
## $manual

## 
## $multiple_owner

Residual Histogram

gg_reshist(model1_new) +
  theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

QQ-Plots

gg_qqplot(model1_new) +
  theme_minimal()

RSME of the model

# mean squared error
mse <- mean(model1_new$residuals ^ 2)

# root mean squared error
rmse <- sqrt(mse)

print(rmse)

## [1] 0.05099379

What is rmse?

RMSE stands for Root Mean Squared Error, which is a commonly used metric for evaluating the performance of a regression model. It measures the average magnitude of the errors (i.e., the difference between the predicted and actual values) in the same units as the target variable. A lower RMSE indicates better model performance.

Interpretation:

If RMSE is 0, the model’s predictions are perfect.
A lower RMSE indicates better model performance, but it should be compared to other models or benchmarks for context.

This suggests that our model is better model, since our rmse value is close to zero.

Apply the Model to the Test Set

# Make predictions on the test set
test_set$predicted_price <- predict(model1_new, newdata = test_set)

test_set <- test_set |>
  mutate(
    price_ = price*(3495000 - 91000) + 91000,
    predicted_price_1 = predicted_price*(3495000 - 91000) + 91000
  )

test_set |>
  select(car_brand, price_, predicted_price_1) |>
  head(10)

## # A tibble: 10 × 3
##    car_brand price_ predicted_price_1
##    <chr>      <dbl>             <dbl>
##  1 Maruti    249699           285770.
##  2 Maruti    240599           240231.
##  3 Hyundai   401599           441285.
##  4 Maruti    241599           241317.
##  5 Maruti    356099           505645.
##  6 Hyundai   401699           402981.
##  7 Maruti    238099           284154.
##  8 Maruti    613499           551835.
##  9 Hyundai   222699           114447.
## 10 Hyundai   346399           258847.

Out of curiosity

Model with all variables

set.seed(42)  # For reproducibility
train_indices <- createDataPartition(cars24_scaled$price, p = 0.8, list = FALSE)

train_set_n <- cars24_scaled[train_indices, ]
test_set_n <- cars24_scaled[-train_indices, ]

model2 <- lm(price ~ brand_new + model_new  + age + petrol + lpg + cng + km_driven + manual + multiple_owner + monthly_payment, data = train_set_n)

summary(model2)

## 
## Call:
## lm(formula = price ~ brand_new + model_new + age + petrol + lpg + 
##     cng + km_driven + manual + multiple_owner + monthly_payment, 
##     data = train_set_n)
## 
## Residuals:
##           Min            1Q        Median            3Q           Max 
## -0.0000068354 -0.0000032991  0.0000000698  0.0000033353  0.0000067163 
## 
## Coefficients:
##                        Estimate      Std. Error    t value Pr(>|t|)    
## (Intercept)     -0.000003171042  0.000000383605     -8.266   <2e-16 ***
## brand_new        0.000000237386  0.000000825224      0.288    0.774    
## model_new       -0.000000761144  0.000002504408     -0.304    0.761    
## age              0.000000461986  0.000000383265      1.205    0.228    
## petrol           0.000000008533  0.000000146433      0.058    0.954    
## lpg             -0.000000400379  0.000001111758     -0.360    0.719    
## cng             -0.000000230332  0.000000379372     -0.607    0.544    
## km_driven       -0.000001416548  0.000001499095     -0.945    0.345    
## manual          -0.000000186398  0.000000208804     -0.893    0.372    
## multiple_owner   0.000000029390  0.000000135818      0.216    0.829    
## monthly_payment  0.999998961918  0.000002509395 398502.000   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.000003824 on 4513 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.348e+11 on 10 and 4513 DF,  p-value: < 2.2e-16

Linear regression equation

model2$coefficients

##        (Intercept)          brand_new          model_new                age 
## -0.000003171041555  0.000000237385647 -0.000000761144379  0.000000461986113 
##             petrol                lpg                cng          km_driven 
##  0.000000008533082 -0.000000400379181 -0.000000230331561 -0.000001416547952 
##             manual     multiple_owner    monthly_payment 
## -0.000000186398325  0.000000029389619  0.999998961918167

\[ \begin{align} Price &= 0.000003 + 0.0000002 \times \text{brand} - 0.0000007 \times \text{model}- 0.0000004 \times \text{age}\\ &\quad - 0.000000008 \times \text{petrol} - 0.0000004 \times \text{lpg} - 0.0000002 \times \text{cng} \\ &\quad - 0.000001 \times \text{km_driven} - 0.0000001 \times \text{manual} \\ &\quad + 0.00000002 \times \text{multiple_owner} + 0.999 \times \text{monthly_payment} \end{align} \]

coef_df <- data.frame(Predictor = names(coef(model2)[-1]), coefficient = coef(model2)[-1])
ggplot(coef_df, aes(x = reorder(Predictor, coefficient), y = coefficient)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +  # Flip for horizontal bars (optional)
  labs(
    title = "Linear Regression Coefficients",
    x = "Predictors",
    y = "Coefficient Value"
  ) +
  theme_minimal()

model2_new <- lm(price ~ brand_new + model_new  + age + petrol + lpg + cng + km_driven + manual + multiple_owner , data = train_set_n)

summary(model2)

## 
## Call:
## lm(formula = price ~ brand_new + model_new + age + petrol + lpg + 
##     cng + km_driven + manual + multiple_owner + monthly_payment, 
##     data = train_set_n)
## 
## Residuals:
##           Min            1Q        Median            3Q           Max 
## -0.0000068354 -0.0000032991  0.0000000698  0.0000033353  0.0000067163 
## 
## Coefficients:
##                        Estimate      Std. Error    t value Pr(>|t|)    
## (Intercept)     -0.000003171042  0.000000383605     -8.266   <2e-16 ***
## brand_new        0.000000237386  0.000000825224      0.288    0.774    
## model_new       -0.000000761144  0.000002504408     -0.304    0.761    
## age              0.000000461986  0.000000383265      1.205    0.228    
## petrol           0.000000008533  0.000000146433      0.058    0.954    
## lpg             -0.000000400379  0.000001111758     -0.360    0.719    
## cng             -0.000000230332  0.000000379372     -0.607    0.544    
## km_driven       -0.000001416548  0.000001499095     -0.945    0.345    
## manual          -0.000000186398  0.000000208804     -0.893    0.372    
## multiple_owner   0.000000029390  0.000000135818      0.216    0.829    
## monthly_payment  0.999998961918  0.000002509395 398502.000   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.000003824 on 4513 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.348e+11 on 10 and 4513 DF,  p-value: < 2.2e-16

model2_new$coefficients

##    (Intercept)      brand_new      model_new            age         petrol 
##    0.047271441    0.044209548    0.894126803   -0.044030342   -0.006238419 
##            lpg            cng      km_driven         manual multiple_owner 
##    0.011383853   -0.006645599   -0.078064515   -0.004021122   -0.002311798

\[ \begin{align} Price &= 0.047 + 0.044 \times \text{brand} + 0.89 \times \text{model} - 0.044 \times \text{age} \\ &\quad - 0.006 \times \text{petrol} + 0.011 \times \text{lpg} - 0.006 \times \text{cng} \\ &\quad - 0.078 \times \text{km_driven} - 0.004 \times \text{manual} - 0.002 \times \text{multiple_owner}\end{align} \]

Weights of new LR model2

coef_df <- data.frame(Predictor = names(coef(model2_new)[-1]), coefficient = coef(model2_new)[-1])
ggplot(coef_df, aes(x = reorder(Predictor, coefficient), y = coefficient)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +  # Flip for horizontal bars (optional)
  labs(
    title = "Linear Regression Coefficients",
    x = "Predictors",
    y = "Coefficient Value"
  ) +
  theme_minimal()

Multicollinearity check

vif_values <- vif(model2_new)

vif_df <- data.frame(
  Variable = names(vif_values),
  VIF = as.numeric(vif_values)
)

vif_df |>
  ggplot() +
  geom_bar(mapping = aes(x = VIF, y = Variable), stat = "identity", fill = "steelblue") +  # Horizontal bar
  geom_vline(xintercept = 5, linetype = "dashed", color = "red", size = 1) +  # Add vertical line at VIF = 5
  labs(title = "VIF Values", x = "VIF", y = "Variable") +  
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0),  
        axis.text.y = element_text(angle = 0))

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

R Squared

summary(model2_new)$r.squared

## [1] 0.9323682

RSME of new model

# mean squared error
mse <- mean(model2_new$residuals ^ 2)

# root mean squared error
rmse <- sqrt(mse)

print(rmse)

## [1] 0.02265338

Apply the new model to test set

# Make predictions on the test set
test_set_n$predicted_price <- predict(model2_new, newdata = test_set_n)

test_set_n <- test_set_n |>
  mutate(
    price_ = price*(3495000 - 91000) + 91000,
    predicted_price_2 = predicted_price*(3495000 - 91000) + 91000
  )

test_set_n |>
  select(car_brand, price_, predicted_price_2) |>
  head(10)

## # A tibble: 10 × 3
##    car_brand price_ predicted_price_2
##    <chr>      <dbl>             <dbl>
##  1 Maruti    249699           224233.
##  2 Maruti    240599           212745.
##  3 Hyundai   401599           395450.
##  4 Maruti    241599           214006.
##  5 Maruti    356099           338990.
##  6 Hyundai   401699           394006.
##  7 Maruti    238099           222355.
##  8 Maruti    613499           513521.
##  9 Hyundai   222699           155264.
## 10 Hyundai   346399           259326.

Why emi (monthly_payment) has a coefficient as 0.999?

If you think about it, we need to know the price of a car to determine the monthly payment to be made for that car. And with this info, we can see that the emi (monthly_payment) and price of the car forms a strict linear relationship. Hence we can’t use this as an explanatory variable for our model.

cars24 |>
  ggplot() +
  geom_point(mapping = aes(x = monthly_payment, y = price)) +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  labs(x = "Monthly payment",
       y = "Car price (INR)",
       title = "Monthly payment Vs Car Price") +
  theme_minimal()

Evaluating the new model

gg_resfitted(model2_new) +
  geom_smooth(se=FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

gg_reshist(model2_new)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

gg_qqplot(model2_new)

Final Project