Predicting Used Car Price with Linear Regression

Janet Felita

2021-05-05

Background

Background Story

Hi! This is a project submission for a task in a bootcamp course I’m currently attending. The goal of this project is to create a linear regression model, and I chose the Used Cars Dataset that I obtained on Kaggle here, which was scraped from CarDekho.com, one of India’s leading car search ventures.

A little background on the company: CarDekho.com, a platform where users can buy new and used cars best suited to them, carries rich automotive content such as expert reviews, detailed specs and prices, comparisons, as well as videos and pictures of all car brands and models available in India. The company has tie-ups with many auto manufacturers, more than 4,000 car dealers, and numerous financial institutions to facilitate the purchase of vehicles.

Aside from selling cars to users, the platform also lets users sell their car to CarDekho.com. I personally think that it is easier to value a new car than a used one, since used car prices depreciate over time and there are many other factors that affect the depreciation, such as the brand and model, total kilometers driven, the condition of the car, and so on.

Since there are many factors that could determine the resale value of a used car, I’m going to build a model to predict just that.

Objective

The objective of this project is to build a model to predict the resale value of used cars based on their specifications using linear regression.

The Data

Preparation

Importing libraries that will be used further on.

# import libraries
library(tidyverse)
library(dplyr)
library(inspectdf)
library(magicfor)
library(ggpubr)
library(GGally)
library(ggplot2)
library(viridis)
library(MLmetrics)
library(car)
library(caret)
library(vtreat)
library(lmtest)
library(pastecs)
library(fastDummies) # for dummy_cols(), used when creating dummy variables
library(knitr)       # for kable(), used in the frequency tables

Importing the Used Cars Dataset.

# import data
car <- read.csv("data/cardekho_cleaned.csv", 
                na.strings=c("null", "","NA"), 
                stringsAsFactors = T)

DT::datatable(car, rownames = F)

Description of each column:

  • id: index
  • car_name: name of car
  • brand: brand of car extracted from car_name
  • model: model of car extracted from car_name
  • new_price: min. to max. price range of the new car
  • min_cost_price: new min. price extracted from new_price
  • max_cost_price: new max. price extracted from new_price
  • vehicle_age: age of the vehicle since it was first bought
  • km_driven: total km driven by car
  • seller_type: type of seller
  • fuel_type: type of fuel used by car
  • transmission_type: manual/automatic
  • mileage: fuel efficiency in car (kmpl)
  • engine: car’s engine in cc
  • max_power: max. power of car in bhp
  • seats: number of seats in car
  • selling_price: selling price of used car

Right off the bat we can see that there are missing values within the dataset. We’re going to check how many NAs there are and whether there are any duplicated rows.

# check duplicates
table(duplicated(car))

FALSE 
19544 
# check NA
colSums(is.na(car))
               id          car_name             brand             model 
                0                 0                 0                 0 
        new_price    min_cost_price    max_cost_price       vehicle_age 
            10102             10102             10102                 0 
        km_driven       seller_type         fuel_type transmission_type 
                0                 0                 0                 0 
          mileage            engine         max_power             seats 
                0                 0                 0                 0 
    selling_price 
                0 

From the results above we can see that 51.69% of the rows (10,102 in total) have missing values, which means that about half of our data has missing values in 3 columns: new_price, min_cost_price, and max_cost_price. The missing values occur in the same rows, since min_cost_price and max_cost_price are derived from new_price.
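
Just to double-check that figure, here is a quick sketch (on the same data) of how the share of rows with a missing new_price can be computed:

# share of rows where new_price is missing (should match the 51.69% above)
round(mean(is.na(car$new_price)) * 100, 2)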

Before going on to the next step of imputing the missing values, I’d first like to create a new column average_new_price, which is the average of min_cost_price and max_cost_price, and remove those 2 columns afterwards.

# create average_new_price column, drop the min/max cost columns, and reorder
car <- car %>% 
  mutate(average_new_price = (min_cost_price + max_cost_price)/2) %>% 
  select(-c(min_cost_price, max_cost_price)) %>% 
  select(1:5,16, everything())

With this dataset I’m going to impute the missing values with the mean average_new_price of each car name. To ease this process, I’ll split the data into 2: rows with no missing values and rows with missing values.

# car with na
car_na <- car[is.na(car$average_new_price),] %>% 
  select(!c(new_price, average_new_price))

# car with no na
car_no_na <- car %>% drop_na() 

Next, I’ll impute the missing values based on car_name, but only where the same car_name is listed in both datasets.

# calculating the mean of each car name from car_no_na
car_no_na_mean <- car_no_na %>%  
  group_by(car_name) %>% 
  summarise(average_new_price = mean(average_new_price)) %>% 
  ungroup()

# imputing mean price to car_na
car_2 <- left_join(car_na, car_no_na_mean)

# check NA left
round(colSums(is.na(car_2))/nrow(car)*100, 2)
               id          car_name             brand             model 
             0.00              0.00              0.00              0.00 
      vehicle_age         km_driven       seller_type         fuel_type 
             0.00              0.00              0.00              0.00 
transmission_type           mileage            engine         max_power 
             0.00              0.00              0.00              0.00 
            seats     selling_price average_new_price 
             0.00              0.00             21.15 

After the imputation process above, the missing values are reduced, but 21.15% of the total data still remains where the car_name doesn’t exist in the data with no missing values. Next, I’ll split car_2 into 2 datasets with the same method as before the first imputation, and join the rows that have been imputed back to the original no-NA data.

# splitting the no NA data
car_no_na_2 <- car_2 %>% drop_na()
car_no_na_3 <- bind_rows(car_no_na, car_no_na_2)

# car na
car_na_3 <- car_2[is.na(car_2$average_new_price),] %>% 
  select(!c(average_new_price))

Ideally, I would say that the price of a new car really depends on the type and brand of the car, but since that information can’t be matched here, I’m going to impute based on the average_new_price of cars with similar specifications, in this case max_power and engine. Both of them are numeric variables, and it’s hard to match prices on exact numeric values (unless we use another machine learning method, which isn’t our objective here), so we’ll bin them into categorical ranges. I’d first like to see how these variables are distributed.

# engine summary from car_na_3
summary(car_na_3$engine)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    624    1197    1364    1450    1498    6752 
car_na_3 %>% filter(engine >= 1500 & engine <= 3000) %>% nrow()
[1] 1002
# engine summary from the car_no_na
summary(car_no_na_3$engine)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    793    1197    1248    1486    1582    6592 
# max_power summary from car_na_3
summary(car_na_3$max_power)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34.20   68.05   83.80   95.00  104.68  575.00 
car_na_3 %>% filter(max_power >= 105 & max_power <= 200) %>% nrow()
[1] 902
# max_power summary from the car_no_na
summary(car_no_na_3$max_power)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   38.4    74.0    88.5   100.6   117.3   626.0 

From the information above, we can see that 75% of the engine values fall roughly between 600 and 1,600 cc in both the missing-value and non-missing-value data, and there are still quite a lot of rows (1,002) in the 1,500 - 3,000 cc range. Based on that, we’ll bin engine into 100 cc intervals up to 3,000 cc, and everything above that goes into one category.

As for max_power, most values fall roughly between 30 and 120 bhp in both datasets, and there are still 902 rows in the 105 - 200 bhp range, hence we’ll bin it into 10 bhp intervals up to 200 bhp, with everything above that in one category.
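
For reference, here is a minimal sketch of how the same "x - y" interval labels could be built with plain base R instead of a loop; it assumes breaks at 0, 100, …, 3000 cc and 0, 10, …, 200 bhp, and the *_alt names are just illustrative. The actual implementation below uses the get_breaks() helper and a magicfor loop.

# engine labels: "0 - 100", "100 - 200", ..., "2900 - 3000"
engine_breaks_alt <- seq(0, 3000, by = 100)
engine_labels_alt <- paste(head(engine_breaks_alt, -1),
                           tail(engine_breaks_alt, -1), sep = " - ")

# max_power labels: "0 - 10", "10 - 20", ..., "190 - 200"
max_power_breaks_alt <- seq(0, 200, by = 10)
max_power_labels_alt <- paste(head(max_power_breaks_alt, -1),
                              tail(max_power_breaks_alt, -1), sep = " - ")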

# categorizing engine and max power to data range
magic_for(print, silent = TRUE)
## create new column
### create breaks for engine
engine_breaks <- get_breaks(by = 100)(x = 1:3000)
### create labels for engine
for (i in 1:n_distinct(engine_breaks)){
  print(paste0(engine_breaks[i], " - ", engine_breaks[i+1]))
}
engine_labels <- magic_result_as_vector()
### omitting the last element of engine_labels
engine_labels <- engine_labels[-length(engine_labels)]
### create breaks for max_power
max_power_breaks <- get_breaks(by = 10)(x = 1:200)
### create labels for max_power
for (i in 1:n_distinct(max_power_breaks)){
  print(paste0(max_power_breaks[i], " - ", max_power_breaks[i+1]))
}
max_power_labels <- magic_result_as_vector()
### omitting the last element of max_power_labels
max_power_labels <- max_power_labels[-length(max_power_labels)]
### create new column engine_range and max_power_range for no na data
car_no_na_4 <- car_no_na_3 %>% 
  mutate(engine_range = cut(engine,
                            breaks = engine_breaks,
                            labels = engine_labels),
         max_power_range = cut(max_power,
                               breaks = max_power_breaks,
                               labels = max_power_labels)) %>% 
  mutate_at(vars(engine_range, max_power_range), as.character)
### impute engine_range and max_power_range for no NA data
car_no_na_4$engine_range[is.na(car_no_na_4$engine_range)] <- "> 3000"
car_no_na_4$max_power_range[is.na(car_no_na_4$max_power_range)] <- "> 200"
### create new column engine_range and max_power_range for na data
car_na_4 <- car_na_3 %>% 
  mutate(engine_range = cut(engine,
                            breaks = engine_breaks,
                            labels = engine_labels),
         max_power_range = cut(max_power,
                               breaks = max_power_breaks,
                               labels = max_power_labels)) %>% 
  mutate_at(vars(engine_range, max_power_range), as.character)
### impute engine_range and max_power_range for NA data
car_na_4$engine_range[is.na(car_na_4$engine_range)] <- "> 3000"
car_na_4$max_power_range[is.na(car_na_4$max_power_range)] <- "> 200"

## calculate mean for no na data
car_no_na_mean_2 <- car_no_na_4 %>% 
  group_by(engine_range, max_power_range) %>% 
  summarise(average_new_price = mean(average_new_price)) %>% 
  ungroup()

## impute mean price to car_na_3
car_na_5 <- left_join(car_na_4, car_no_na_mean_2)

## check NA left
round(colSums(is.na(car_na_5))/nrow(car)*100, 2)
               id          car_name             brand             model 
             0.00              0.00              0.00              0.00 
      vehicle_age         km_driven       seller_type         fuel_type 
             0.00              0.00              0.00              0.00 
transmission_type           mileage            engine         max_power 
             0.00              0.00              0.00              0.00 
            seats     selling_price      engine_range   max_power_range 
             0.00              0.00              0.00              0.00 
average_new_price 
             3.09 

After the second imputation process, 3.09% of the total data still has missing values. I decided to omit those rows, since it’s < 5% of the data and imputing the remaining NAs with just one of engine or max_power would only get less precise.

# dropping all NAs
car_no_na_5 <- car_na_5 %>% drop_na()

# joining all data
car_3 <- bind_rows(car_no_na_4, car_no_na_5)

# check NA for new_price
colSums(is.na(car_3))
               id          car_name             brand             model 
                0                 0                 0                 0 
        new_price average_new_price       vehicle_age         km_driven 
             9498                 0                 0                 0 
      seller_type         fuel_type transmission_type           mileage 
                0                 0                 0                 0 
           engine         max_power             seats     selling_price 
                0                 0                 0                 0 
     engine_range   max_power_range 
                0                 0 
# check missing rows
round(nrow(car_3)/nrow(car)*100 + 3.09, 2)
[1] 100

Quick data summary.

glimpse(car_3)
Rows: 18,940
Columns: 18
$ id                <int> 1, 4, 5, 6, 7, 8, 12, 14, 19, 22, 23, 25, 27, 29, 30~
$ car_name          <fct> Hyundai Grand, Ford Ecosport, Maruti Wagon R, Hyunda~
$ brand             <fct> Hyundai, Ford, Maruti, Hyundai, Maruti, Hyundai, Mar~
$ model             <fct> Grand, Ecosport, Wagon R, i10, Wagon R, Venue, Swift~
$ new_price         <fct> New Car (On-Road Price) : Rs.7.11-7.48 Lakh*, New Ca~
$ average_new_price <dbl> 729500, 1196500, 605000, 658500, 613500, 1036000, 78~
$ vehicle_age       <int> 5, 6, 8, 8, 3, 2, 4, 8, 7, 6, 5, 8, 2, 6, 10, 14, 8,~
$ km_driven         <int> 20000, 30000, 35000, 40000, 17512, 20000, 28321, 652~
$ seller_type       <fct> Individual, Dealer, Individual, Dealer, Dealer, Indi~
$ fuel_type         <fct> Petrol, Diesel, Petrol, Petrol, Petrol, Petrol, Petr~
$ transmission_type <fct> Manual, Manual, Manual, Manual, Manual, Automatic, M~
$ mileage           <dbl> 18.90, 22.77, 18.90, 20.36, 20.51, 18.15, 16.60, 22.~
$ engine            <int> 1197, 1498, 998, 1197, 998, 998, 1197, 1582, 2143, 1~
$ max_power         <dbl> 82.00, 98.59, 67.10, 78.90, 67.04, 118.35, 85.00, 12~
$ seats             <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 7, 5, 7, 5~
$ selling_price     <int> 550000, 570000, 350000, 315000, 410000, 1050000, 511~
$ engine_range      <chr> "1100 - 1200", "1400 - 1500", "900 - 1000", "1100 - ~
$ max_power_range   <chr> "80 - 90", "90 - 100", "60 - 70", "70 - 80", "60 - 7~

After playing around with the data, I conclude that car_name and model won’t be of much use in further analysis because:

  • brand and model are actually derived from car_name. Here I find that brand serves the purpose better in predicting resale value, since each brand has its own branding and some brands have a higher resale value than others.
  • the model of the car doesn’t really matter here, since each brand has its own specific model names that differ from one another. Even if two brands share a model name, there’s no guarantee that the specifications of those cars are the same. The model here is just a matter of naming and doesn’t tell us much else. It would make a much bigger difference if, instead of just a model name, the data also provided the car type (hatchback, sedan, SUV, etc.), because similar cars of the same type can carry different model names under different brands while sitting in a similar price range.

Based on those 2 points, I’m going to remove the car_name and model columns along with id, new_price, engine_range, and max_power_range.
# remove columns
car_clean <- car_3 %>% 
  select(!c(id, new_price, car_name, model, engine_range, max_power_range))

Outliers

Now we’re moving on to the next step: checking whether there are any “anomalies” in our dataset. I’m first going to look closely at the descriptive statistics of each column.

car_clean %>% inspect_num()

From the brief descriptive statistics above, we can see some oddities in our data:

  • there’s no such thing as a 0-seater car.
  • the largest gaps between mean and median are in the prices: an INR 2,881,145 difference in average_new_price and INR 226,194 in selling_price (see the sketch below), which means there are extreme outliers on the higher side of the prices.
  • the mean-median gaps of the other variables aren’t as large as the prices’, but km_driven, engine, and max_power are worth checking later on.
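
For context, the mean-median gaps quoted above come from a comparison like this (a small sketch on the same data):

# gap between mean and median for the two price columns
car_clean %>% 
  summarise(gap_new_price     = mean(average_new_price) - median(average_new_price),
            gap_selling_price = mean(selling_price) - median(selling_price))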

Let’s first tackle the first problem: the cars listed with 0 seats.

# filter rows
no_seats <- car %>% filter(seats == 0)

unique(no_seats$car_name)
[1] Honda City   Nissan Kicks
262 Levels: Ambassador Avigo Ambassador Classic Audi A3 Audi A4 ... Volvo XC90

These 2 cars are listed with 0 seats, whereas the actual Honda City and Nissan Kicks are both 5-seater cars. Let’s fix the data.

# re-input correct value
car_clean <- car_clean %>% 
  mutate(seats = ifelse(seats == 0, 5, seats))

# re-check data
car_clean %>% filter(seats == 0)
 [1] brand             average_new_price vehicle_age       km_driven        
 [5] seller_type       fuel_type         transmission_type mileage          
 [9] engine            max_power         seats             selling_price    
<0 rows> (or 0-length row.names)

That’s the first problem done. The second problem is the outliers in both average_new_price and selling_price. Let’s take a look at those values plotted on the same scale.

plot1 <- ggplot(car_clean, aes(x = as.numeric(row.names(car_clean)), 
                               y = average_new_price)) +
  geom_point(aes(color = average_new_price)) +
  scale_y_continuous(breaks = seq(0, 500000000, by = 100000000)) +
  theme_minimal() +
  theme(legend.position = "none") +
  xlab("index") +
  scale_color_viridis(option = "plasma")

plot2 <- ggplot(car_clean, aes(x = as.numeric(row.names(car_clean)), 
                               y = selling_price)) +
  geom_point(aes(color = selling_price)) +
  scale_y_continuous(limits = c(0, 500000000), 
                     breaks = seq(0, 500000000, by = 100000000)) +
  theme_minimal() +
  theme(legend.position = "none") +
  xlab("index") +
  scale_color_viridis(option = "plasma")

egg::ggarrange(plot1, plot2, ncol = 2)

Looking at the side-by-side plot comparison on the same scale, we can conclude one thing: resale prices are much lower than new prices, which is why, a little off-topic, many people say it’s not worth taking out a car loan since the car’s price depreciates quickly while the installments alone can take 5 years or more. By the time the installments are over, the car isn’t worth as much.

Back to the topic: here we can see that there are indeed outliers in both average_new_price and selling_price. However, we can also see that average_new_price is roughly divided into 3 groups (including the outliers): cars priced INR 0 - 150,000,000, INR 150,000,000 - 200,000,000, and INR 350,000,000 - 500,000,000. This also suggests that average_new_price isn’t normally distributed.

It might feel a little drastic to remove all the outliers just by looking at the plot, since they look quite numerous, but remembering that this dataset contains almost 19,000 rows, I assume it is still safe to remove them.

# remove average_new_price outliers
car_clean_2 <- car_clean %>% filter(average_new_price < 200000000)

# check % of data removed
round((nrow(car_clean)/nrow(car) - 1)*100, 2)
[1] -3.09
round((nrow(car_clean_2)/nrow(car) - 1)*100, 2)
[1] -3.59

We just removed another 0.5% of the data, which keeps the total removed safely below 5%. We’re now going to plot average_new_price against selling_price to check for other outliers.

ggplot(car_clean_2, aes(x = average_new_price, y = selling_price)) +
  geom_point(aes(color = average_new_price)) +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_color_viridis(option = "plasma")

Now we can clearly see the outliers in selling_price. The extreme ones are just 3 data points in total and we’re going to remove them next. Going back to average_new_price, the INR 0 - 150,000,000 group mentioned before is actually roughly divided into several groups again when viewed at a smaller scale. For now we’ll let them be, since even a brief look shows the outliers aren’t that many compared to our total data.

# remove selling_price outlier
car_clean_2 <- car_clean_2 %>% filter(selling_price < 20000000)

We’re now going to look at the frequency table of each categorical variable.

brand_freq <- car_clean_2 %>% count(brand, name = "value")
d1 <- kable(car_clean_2 %>% count(seller_type, name = "value"))
d2 <- kable(car_clean_2 %>% count(fuel_type, name = "value"))
d3 <- kable(car_clean_2 %>% count(transmission_type, name = "value"))

ggplot(brand_freq, aes(x = brand, y = value)) +
  geom_col(fill = "#b6308b") +
  geom_text(aes(label = value), 
            position = position_dodge(width = 0.9), 
            vjust = -0.4,
            hjust = 0.5) +
  theme_classic() +
  ylab("frequency") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
        axis.title.x = element_blank())

knitr::kables(list(d1, d2, d3)) %>% kableExtra::kable_styling(position = "center")
seller_type        value
Dealer             11359
Individual          7295
Trustmark Dealer     186

fuel_type   value
CNG           308
Diesel       9366
Electric        4
LPG            51
Petrol       9111

transmission_type   value
Automatic            3722
Manual              15118

We can see from the bar plot above that Isuzu and ISUZU are in fact the same brand under two spellings, and that some brands occur only a handful of times. Since this dataset consists of tens of thousands of rows, I decided to remove brands with < 10 occurrences to ensure that when we split the data into train and test, both parts will at least contain the same brands. We’ll also remove the Electric fuel_type for the same reason.

# change ISUZU to Isuzu
car_clean_2 <- car_clean_2 %>% mutate(brand = as.character(brand),
                                      brand = case_when(brand == "ISUZU" ~ "Isuzu",
                                              T ~ brand),
                                      brand = as.factor(brand))

# keep only brands with >= 10 occurrences
brand_freq <- car_clean_2 %>% 
  count(brand, name = "value") %>% 
  filter(value >= 10)
brand_name <- unique(brand_freq$brand)

# filter rows in the original dataset: keep those brands and drop Electric fuel type
car_clean_2 <- car_clean_2 %>% 
  filter(brand %in% brand_name, fuel_type != "Electric")

# check % data removed
round(((nrow(car_clean_2)/nrow(car)) - 1)*100, 2)
[1] -3.73

Great! Still below 5%.

Analysis

We’re now going to take a look at the correlation between the variables.

ggcorr(car_clean_2, label = T, label_size = 2.5, hjust = 1, layout.exp = 3) + 
  scale_fill_viridis(option = "plasma")

We can see that most of the numeric variables have a fairly strong correlation with selling_price. We’re now going to take a look at the distribution of selling_price.

ggplot(car_clean_2, aes(selling_price)) +
  geom_histogram(fill = "#2d1a94", color = "#441d9e", alpha = 0.5, bins = 50) +
  theme_minimal()

We can see that the selling_price is right-skewed and so we’re going to transform the selling_price to its natural log value.

# transform to log
car_clean_3 <- car_clean_2 %>% 
  mutate(log_selling_price = log(selling_price))

Quick check of the transformed result.

ggplot(car_clean_3, aes(log_selling_price)) +
  geom_histogram(fill = "#2d1a94", color = "#441d9e", alpha = 0.5) +
  theme_minimal()

Now the distribution is much closer to symmetric. We can move on to the next step, which is building the regression model.

The next problem I found with this dataset is that it contains a lot of categorical variables. To avoid issues later, we’re going to create our own dummy variables for those categories using dummy_cols() from the fastDummies package. We’re also going to remove the selling_price column, since we’re only going to use log_selling_price.

# create dummy variables
car_dummy <- dummy_cols(car_clean_3, select_columns = c("brand", "seller_type", "fuel_type", "transmission_type"))

# create dummy variables and remove columns which created the variables
car_dummy_2 <- dummy_cols(car_clean_3, select_columns = c("brand", "seller_type", "fuel_type", "transmission_type"),
           remove_selected_columns = TRUE)

# remove selling_price column 
car_dummy_2 <- car_dummy_2 %>% select(-selling_price)

dim(car_dummy_2)
[1] 18815    55

We’re now ending up with 55 columns / 54 predictors for the model.

Modeling

Split Train and Test Data

Before moving on to model training, we’ll split the car_clean_3 data into train and test sets using an 80:20 ratio.

# split data
set.seed(902)
samplesize <- round(0.8 * nrow(car_clean_3), 0)
index <- sample(seq_len(nrow(car_clean_3)), size = samplesize)

train <- car_clean_3[index, ]
test <- car_clean_3[-index, ]
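
As a side note, since caret is already loaded, createDataPartition() could be used instead to draw a split stratified on the target, which helps keep the price distribution similar in train and test. A minimal sketch (the *_strat objects are just illustrative and aren’t used further):

# stratified 80:20 split on log_selling_price
set.seed(902)
idx_strat   <- createDataPartition(car_clean_3$log_selling_price, p = 0.8, list = FALSE)
train_strat <- car_clean_3[idx_strat, ]
test_strat  <- car_clean_3[-idx_strat, ]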

Building Model

First Model

We’re now going to build the first model with all predictors.

# model with all predictors
options(max.print=5.5E5)
model_1 <- lm(log_selling_price ~ ., train)
summary(model_1)

Call:
lm(formula = log_selling_price ~ ., data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.58978 -0.11847  0.01752  0.13783  2.03874 

Coefficients:
                                    Estimate       Std. Error  t value
(Intercept)                 13.4675363814059  0.0413582404486  325.631
brandBMW                    -0.1427563081059  0.0179799842878   -7.940
brandChevrolet              -0.6426923752261  0.0200096846287  -32.119
brandDatsun                 -0.6189139368520  0.0251882306473  -24.572
brandFiat                   -0.5333103382646  0.0380743912640  -14.007
brandFord                   -0.3772740994923  0.0174527866270  -21.617
brandHonda                  -0.2424841560955  0.0165283534883  -14.671
brandHyundai                -0.2689689036630  0.0161381585224  -16.667
brandIsuzu                  -0.4399136783779  0.0888543411160   -4.951
brandJaguar                 -0.0867280640120  0.0312085729006   -2.779
brandJeep                   -0.2322541519170  0.0389556828222   -5.962
brandKia                    -0.1504878698345  0.0436924166933   -3.444
brandLand Rover             -0.1092541780447  0.0368798112402   -2.962
brandLexus                  -0.0975011470408  0.0975736027436   -0.999
brandMahindra               -0.4227548153088  0.0182463260609  -23.169
brandMaruti                 -0.2842282435634  0.0164684158803  -17.259
brandMercedes-Benz          -0.1167211038634  0.0180886192776   -6.453
brandMG                     -0.2108230582550  0.0554500003970   -3.802
brandMini                    0.2459180709311  0.0568595663735    4.325
brandMitsubishi             -0.2740014942247  0.0446139850740   -6.142
brandNissan                 -0.3711808928878  0.0210370672101  -17.644
brandPorsche                -0.4360277234398  0.0504160364339   -8.649
brandRenault                -0.4185126972672  0.0187007010967  -22.380
brandSkoda                  -0.3212293510023  0.0190065590894  -16.901
brandTata                   -0.5817521142538  0.0177589658367  -32.758
brandToyota                 -0.1915892702768  0.0176036457647  -10.883
brandVolkswagen             -0.2953446235570  0.0175666167226  -16.813
brandVolvo                  -0.1285609700877  0.0432168515281   -2.975
average_new_price           -0.0000000040668  0.0000000004263   -9.540
vehicle_age                 -0.1048570420503  0.0007503842438 -139.738
km_driven                   -0.0000001601917  0.0000000371379   -4.313
seller_typeIndividual       -0.0717338594135  0.0038015470265  -18.870
seller_typeTrustmark Dealer -0.0157253563649  0.0175100665696   -0.898
fuel_typeDiesel              0.2128951240930  0.0142392325883   14.951
fuel_typeLPG                -0.1242911479406  0.0362719267545   -3.427
fuel_typePetrol             -0.0327446132881  0.0147853313943   -2.215
transmission_typeManual     -0.1139214901413  0.0058482647296  -19.480
mileage                     -0.0124746758002  0.0008629527243  -14.456
engine                       0.0001091462880  0.0000103784788   10.517
max_power                    0.0044454273126  0.0001163817993   38.197
seats                        0.0379574593765  0.0035349214526   10.738
selling_price                0.0000002662920  0.0000000044413   59.958
                                        Pr(>|t|)    
(Intercept)                 < 0.0000000000000002 ***
brandBMW                     0.00000000000000217 ***
brandChevrolet              < 0.0000000000000002 ***
brandDatsun                 < 0.0000000000000002 ***
brandFiat                   < 0.0000000000000002 ***
brandFord                   < 0.0000000000000002 ***
brandHonda                  < 0.0000000000000002 ***
brandHyundai                < 0.0000000000000002 ***
brandIsuzu                   0.00000074652780333 ***
brandJaguar                             0.005460 ** 
brandJeep                    0.00000000254742618 ***
brandKia                                0.000574 ***
brandLand Rover                         0.003057 ** 
brandLexus                              0.317686    
brandMahindra               < 0.0000000000000002 ***
brandMaruti                 < 0.0000000000000002 ***
brandMercedes-Benz           0.00000000011321516 ***
brandMG                                 0.000144 ***
brandMini                    0.00001535109712998 ***
brandMitsubishi              0.00000000083751067 ***
brandNissan                 < 0.0000000000000002 ***
brandPorsche                < 0.0000000000000002 ***
brandRenault                < 0.0000000000000002 ***
brandSkoda                  < 0.0000000000000002 ***
brandTata                   < 0.0000000000000002 ***
brandToyota                 < 0.0000000000000002 ***
brandVolkswagen             < 0.0000000000000002 ***
brandVolvo                              0.002937 ** 
average_new_price           < 0.0000000000000002 ***
vehicle_age                 < 0.0000000000000002 ***
km_driven                    0.00001617642324495 ***
seller_typeIndividual       < 0.0000000000000002 ***
seller_typeTrustmark Dealer             0.369160    
fuel_typeDiesel             < 0.0000000000000002 ***
fuel_typeLPG                            0.000613 ***
fuel_typePetrol                         0.026798 *  
transmission_typeManual     < 0.0000000000000002 ***
mileage                     < 0.0000000000000002 ***
engine                      < 0.0000000000000002 ***
max_power                   < 0.0000000000000002 ***
seats                       < 0.0000000000000002 ***
selling_price               < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2136 on 15010 degrees of freedom
Multiple R-squared:  0.912, Adjusted R-squared:  0.9118 
F-statistic:  3796 on 41 and 15010 DF,  p-value: < 0.00000000000000022

So far we can see that there were originally only 12 predictor variables, but after we feed them to the linear model the number of coefficients increases to 41 in total. That is because R automatically creates dummy variables for all the categorical variables, since a linear regression model can only take numeric predictors.
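
A minimal sketch of what R is doing under the hood: lm() runs the factors through model.matrix(), which expands each factor into 0/1 columns for every level except the reference level (the reference is absorbed into the intercept). The toy data here is hypothetical, just to illustrate.

# toy example of automatic dummy creation
toy <- data.frame(fuel_type = factor(c("Petrol", "Diesel", "Petrol", "CNG")))
model.matrix(~ fuel_type, data = toy)
# returns an intercept column plus one 0/1 column per non-reference level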

Our first model gives us a Multiple R-squared of 0.912 and an Adjusted R-squared of 0.9118, which is actually pretty good, since it means the model explains about 91.2% of the variance in log_selling_price. Most of the predictors are also significant for the target, as shown by Pr(>|t|) < 0.05. A few predictors aren’t significant, but they are all dummy levels belonging to a categorical column, so we’re not going to remove them.

Although the R-squared value is quite satisfactory, the number of variables used in this model isn’t. We’re going to tackle this in the second model.

Second Model

We’re now moving on to our second model. In this second model we’ll “semi-manually” create dummy variables for the categories. We’re going to follow the steps mentioned in Cory Lesmeister and Sunil Kumar Chinnamgari’s book Advanced Machine Learning with R for treating the training data.

We’re first going to create a varlist object to be used in the data treatment process, which is just a character vector of all the column names from car_clean_3 minus log_selling_price and selling_price.

# create varlist object for data treatment
varlist <- car_clean_3 %>% 
  select(-"log_selling_price", -"selling_price") %>% 
  colnames()

# saving deleted columns to an object
train_log_selling_price <- train$log_selling_price
test_log_selling_price <- test$log_selling_price

Next we’re going to design the treatment scheme. We’re going to use a minFraction of 0.1, which just means that we want the treatment process to create indicator columns only for categorical levels that occur in at least 10% of the rows.

# training data treatment
train_treatment <- designTreatmentsZ(dframe = train,
                                     varlist = varlist,
                                     minFraction = 0.1)
[1] "vtreat 1.6.2 inspecting inputs Wed May 05 01:49:40 2021"
[1] "designing treatments Wed May 05 01:49:40 2021"
[1] " have initial level statistics Wed May 05 01:49:40 2021"
[1] " scoring treatments Wed May 05 01:49:40 2021"
[1] "have treatment plan Wed May 05 01:49:40 2021"
# original n of unique variables for each category
n_distinct(train$brand)
[1] 28
n_distinct(train$seller_type)
[1] 3
n_distinct(train$fuel_type)
[1] 4
n_distinct(train$transmission_type)
[1] 2

As we can see from the result above, the designTreatmentsZ() function creates a treatment plan whose scoreFrame (a data frame) states whether each variable is categorical or not and records extraModelDegrees, which for a categorical column is n-1 of its total categories. This is the same concept as how dummy variables work in modeling, because one category of each column is absorbed into the intercept.
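
A quick way to see this for ourselves (a small sketch) is to look at the plan’s scoreFrame, which lists each generated variable, the original column it came from, and its code:

# inspect the treatment plan
train_treatment$scoreFrame %>% 
  select(varName, origName, code)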

The treatment plan also generates a code column in which:

  • clean: numeric (non-categorical) variables passed through as-is
  • catP: prevalence columns for the categorical variables
  • lev: 0/1 indicator columns for categorical levels that occur in at least 10% of the data

This is why I personally think this method is more efficient than letting R decide the dummy variables. However, I spot a slight problem in this data frame with transmission_type. transmission_type only consists of 2 categories: Automatic and Manual. Here we can see that both Automatic and Manual get their own lev column, which would cause perfect collinearity: every 0 in Automatic corresponds to a 1 in Manual.

We’re now going to apply the treatment plan to the train and test data and delete the Manual lev column.

# apply treatment to both train and test data
train_2 <- prepare(train_treatment, train) %>% select(-transmission_type_lev_x_Manual)
test_2 <- prepare(train_treatment, test) %>% select(-transmission_type_lev_x_Manual)

Here is what the treatment result looks like in train data.

[1] 15052 17

We now have 17 treated columns, far fewer than the 41 predictors of the first model. We can see that the treatment replaced each categorical column with an additional _catP column holding the % prevalence of that row’s category, while putting “0” or “1” in the lev columns. We’re now going to remove the _catP columns, since we already have the lev dummy variables, and put the log_selling_price column back in.

# remove columns with _catP
train_2 <- train_2 %>%
  select(-contains("_catP"))
test_2 <- test_2 %>%
  select(-contains("_catP"))

We’re now going to build the second model.

# joining log_selling_price back to the df
train_2$log_selling_price <- train_log_selling_price

# model 2
model_2 <- lm(log_selling_price ~., train_2)
summary(model_2)

Call:
lm(formula = log_selling_price ~ ., data = train_2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.39868 -0.15290  0.00844  0.16170  2.31614 

Coefficients:
                                          Estimate       Std. Error  t value
(Intercept)                       12.9602115026522  0.0497665271067  260.420
average_new_price                  0.0000000011810  0.0000000005021    2.352
vehicle_age                       -0.1216698142064  0.0008189994778 -148.559
km_driven                         -0.0000003155462  0.0000000459422   -6.868
mileage                           -0.0137188599443  0.0010123825949  -13.551
engine                             0.0002099476214  0.0000116435879   18.031
max_power                          0.0079565775879  0.0001200187877   66.294
seats                              0.0061712942598  0.0039328833656    1.569
brand_lev_x_Hyundai                0.0908332787399  0.0063840118732   14.228
brand_lev_x_Maruti                 0.1439009671707  0.0060514783916   23.779
seller_type_lev_x_Dealer           0.0419801949266  0.0218088564051    1.925
seller_type_lev_x_Individual      -0.0654597641009  0.0219084020288   -2.988
fuel_type_lev_x_Diesel             0.2046968924109  0.0164643374490   12.433
fuel_type_lev_x_Petrol            -0.0623752506728  0.0167558869679   -3.723
transmission_type_lev_x_Automatic  0.2118382204507  0.0068991910844   30.705
                                              Pr(>|t|)    
(Intercept)                       < 0.0000000000000002 ***
average_new_price                             0.018671 *  
vehicle_age                       < 0.0000000000000002 ***
km_driven                             0.00000000000675 ***
mileage                           < 0.0000000000000002 ***
engine                            < 0.0000000000000002 ***
max_power                         < 0.0000000000000002 ***
seats                                         0.116633    
brand_lev_x_Hyundai               < 0.0000000000000002 ***
brand_lev_x_Maruti                < 0.0000000000000002 ***
seller_type_lev_x_Dealer                      0.054259 .  
seller_type_lev_x_Individual                  0.002814 ** 
fuel_type_lev_x_Diesel            < 0.0000000000000002 ***
fuel_type_lev_x_Petrol                        0.000198 ***
transmission_type_lev_x_Automatic < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2663 on 15037 degrees of freedom
Multiple R-squared:  0.863, Adjusted R-squared:  0.8629 
F-statistic:  6766 on 14 and 15037 DF,  p-value: < 0.00000000000000022

We now have a linear model with an R-squared of 0.863, an Adjusted R-squared of 0.8629, and a total of 14 predictors. There are some predictors that aren’t as significant as the others, so we’re going to make a third model that excludes those variables.

Third Model

We’re now going to build the third model which basically uses the same predictors as the second model minus the insignificant variables.

# model 3
model_3 <- lm(log_selling_price ~. - seats -seller_type_lev_x_Dealer, train_2)
summary(model_3)

Call:
lm(formula = log_selling_price ~ . - seats - seller_type_lev_x_Dealer, 
    data = train_2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.41196 -0.15219  0.00873  0.16132  2.31721 

Coefficients:
                                          Estimate       Std. Error  t value
(Intercept)                       13.0442238011792  0.0355126154418  367.312
average_new_price                  0.0000000010988  0.0000000004992    2.201
vehicle_age                       -0.1219117975918  0.0007898040128 -154.357
km_driven                         -0.0000003136041  0.0000000459373   -6.827
mileage                           -0.0143053221698  0.0009440689731  -15.153
engine                             0.0002167981011  0.0000107214169   20.221
max_power                          0.0078903635430  0.0001118544298   70.541
brand_lev_x_Hyundai                0.0897082506095  0.0063618684610   14.101
brand_lev_x_Maruti                 0.1449987324360  0.0059926124934   24.196
seller_type_lev_x_Individual      -0.1064827246520  0.0046138929206  -23.079
fuel_type_lev_x_Diesel             0.2058109687934  0.0164375694726   12.521
fuel_type_lev_x_Petrol            -0.0642475767788  0.0166470309279   -3.859
transmission_type_lev_x_Automatic  0.2105874030700  0.0068318753204   30.824
                                              Pr(>|t|)    
(Intercept)                       < 0.0000000000000002 ***
average_new_price                             0.027743 *  
vehicle_age                       < 0.0000000000000002 ***
km_driven                             0.00000000000902 ***
mileage                           < 0.0000000000000002 ***
engine                            < 0.0000000000000002 ***
max_power                         < 0.0000000000000002 ***
brand_lev_x_Hyundai               < 0.0000000000000002 ***
brand_lev_x_Maruti                < 0.0000000000000002 ***
seller_type_lev_x_Individual      < 0.0000000000000002 ***
fuel_type_lev_x_Diesel            < 0.0000000000000002 ***
fuel_type_lev_x_Petrol                        0.000114 ***
transmission_type_lev_x_Automatic < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2663 on 15039 degrees of freedom
Multiple R-squared:  0.8629,    Adjusted R-squared:  0.8628 
F-statistic:  7891 on 12 and 15039 DF,  p-value: < 0.00000000000000022

There’s virtually no difference in fit between the second and third models (0.8629 vs 0.8628 Adjusted R-squared) while the third only needs 12 predictors, so between the two (model_2 and model_3) we’ll go with the third model and compare its prediction results to model_1.

Model Evaluation

Prediction Results

We’re now going to use the first and third model to predict the test results.

# predict with first model
predict_1 <- predict(model_1, test)

# predict with third model
predict_3 <- predict(model_3, test_2)

head(predict_1)
       5       21       22       26       31       39 
13.16683 13.26648 13.38866 13.26083 12.72313 14.29395 
head(predict_3)
       1        2        3        4        5        6 
13.20635 13.22513 13.44352 13.37557 12.93095 14.44644 

The results aren’t very different. We can see that the third model tends to predict slightly higher values than the first model.
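
Note that both sets of predictions are on the natural-log scale. To read them as actual prices we would exponentiate them back; a quick sketch (predicted_price_3 is just an illustrative name):

# back-transform log predictions to rupee prices
predicted_price_3 <- exp(predict_3)
head(predicted_price_3)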

We’re now going to compare the Adjusted R-squared, RMSE (for train and test), and MAPE for each model.
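
For reference, here is a sketch of what these metrics compute, written out manually; they should be equivalent to the MLmetrics RMSE() and MAPE() functions used below:

# RMSE: root mean squared error of predictions vs. actual values
rmse_manual <- function(pred, actual) sqrt(mean((pred - actual)^2))

# MAPE: mean absolute percentage error relative to the actual values
mape_manual <- function(pred, actual) mean(abs((actual - pred) / actual))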

# adj. r squared
summary(model_1)$adj.r.squared
summary(model_3)$adj.r.squared

# RMSE train
RMSE(model_1$fitted.values, train$log_selling_price)
RMSE(model_3$fitted.values, train_2$log_selling_price)

# RMSE test
RMSE(predict_1, test$log_selling_price)
RMSE(predict_3, test_log_selling_price)

# MAPE
MAPE(predict_1, test$log_selling_price)
MAPE(predict_3, test_log_selling_price)
Model     Predictors   Adj.R_squared   RMSE_train   RMSE_test   MAPE
model_1       41          91.18 %        0.2133       0.2134    0.0123
model_3       12          86.29 %        0.2662       0.2688    0.0153

Viewing the results side by side, we can see that:

  • model_3 has far fewer predictors than model_1 (a difference of 29),
  • model_1 performs slightly better on Adjusted R-squared, explaining 91.18% of the variance versus 86.29% for model_3,
  • model_1 has lower RMSE on both the train and test data than model_3,
  • both models have very similar train and test errors, which suggests the models fit just right (no under- or overfitting),
  • the MAPE of model_1 is better than that of model_3.

Based on those observations, we’ll opt for the third model, since its prediction results aren’t very different from the first model’s and it only needs 12 predictors.

Assumptions

In this part, we’re going to check whether the model we chose follows the regression assumptions.

Linearity

Linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers since linear regression is sensitive to outlier effects.

Linearity and additivity of the relationship between dependent and independent variables are described as:

  • The expected value of dependent variable is a straight-line function of each independent variable, holding the others fixed.
  • The slope of that line does not depend on the values of the other variables.
  • The effects of different independent variables on the expected value of the dependent variable are additive.

We’re going to check the linearity of our model by plotting it.

# residual vs fitted plot
ggplot(model_3, aes(x = .fitted, y = .resid)) + 
  geom_point(color = "#4c72b0", alpha = 0.3) +
  geom_hline(aes(yintercept = 0), color = "#345790") +
  geom_smooth(color = "#2d1a94") +
  theme_bw() +
  labs(title = "Residuals vs Fitted",
       x = "Fitted Values",
       y = "Residuals") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

Here we can see that our model isn’t linear towards the end: it predicts well in the mid-range, but for higher fitted values the errors grow larger on the negative side, since the model tends to under-predict those values.

Residuals Normality

Second, we’re going to check for residuals normality.

# histogram
ggplot(model_3, aes(x = .resid)) +
  geom_histogram(bins = 30, fill = "#4c72b0", color = "#345790", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Residuals Distribution",
       x = "Residuals",
       y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5))

Here we can see that the residuals are roughly normally distributed. However, there are still some outliers, which I assume are related to why our residuals-vs-fitted plot curves towards the end.
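
Another way to look at the same thing is a normal Q-Q plot of the residuals, where points falling off the straight line indicate the heavy tails; a sketch using the same fortified model object:

# normal Q-Q plot of model_3 residuals
ggplot(model_3, aes(sample = .resid)) +
  stat_qq(color = "#4c72b0", alpha = 0.3) +
  stat_qq_line(color = "#345790") +
  theme_minimal() +
  labs(title = "Normal Q-Q Plot of Residuals",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme(plot.title = element_text(hjust = 0.5))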

Homoscedasticity

We’re next going to check whether the residuals are homoscedastic (their variance is constant across the fitted values) with bptest() from lmtest.

# check for homoscedasticity
bptest(model_3)

    studentized Breusch-Pagan test

data:  model_3
BP = 1699.4, df = 12, p-value < 0.00000000000000022

Here we can see that the p-value for model_3 is far below 0.05, which means the residuals are heteroscedastic.
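
One common workaround when the heteroscedasticity can’t be removed is to keep the coefficient estimates but report heteroscedasticity-consistent (robust) standard errors. A sketch, assuming the sandwich package (an extra dependency, not loaded above) is installed:

# robust (HC1) standard errors for model_3
library(sandwich)
coeftest(model_3, vcov. = vcovHC(model_3, type = "HC1"))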

No-Multicollinearity

For the last assumption, we’re going to check whether our model fulfills the no-multicollinearity assumption with the vif() function from the car package.

# check for multicollinearity
vif(model_3)
                average_new_price                       vehicle_age 
                         1.507520                          1.285787 
                        km_driven                           mileage 
                         1.217953                          3.094771 
                           engine                         max_power 
                         6.096573                          4.686581 
              brand_lev_x_Hyundai                brand_lev_x_Maruti 
                         1.233595                          1.589526 
     seller_type_lev_x_Individual            fuel_type_lev_x_Diesel 
                         1.070334                         14.331834 
           fuel_type_lev_x_Petrol transmission_type_lev_x_Automatic 
                        14.690279                          1.552254 

Here we can see strong collinearity between the fuel_type Diesel and Petrol dummies. We can also see that engine has a fairly high VIF value.

Model Improvement

Build Model Improvement

Based on the assumption checks above, we’re now going to fix the model. We’re first going to remove outliers from the data.

# remove outliers
car_improv <- car_clean_3 %>% filter(log_selling_price < 15.8)

# check rows removed
(nrow(car_improv)/nrow(car) - 1)*100
[1] -3.868195

We’re now going to split the data into train and test sets.

# split data
set.seed(902)
samplesize <- round(0.8 * nrow(car_improv), 0)
index <- sample(seq_len(nrow(car_improv)), size = samplesize)

train_improv <- car_improv[index, ]
test_improv <- car_improv[-index, ]

We’re now going to repeat the same steps as when we built the second model, and additionally remove fuel_type Diesel and engine.

# create varlist object for data treatment
varlist_improv <- car_improv %>% 
  select(-"log_selling_price", -"selling_price") %>% 
  colnames()

# saving deleted columns to an object
train_log_selling_price_improv <- train_improv$log_selling_price
test_log_selling_price_improv <- test_improv$log_selling_price

# training data treatment
train_treatment_improv <- designTreatmentsZ(dframe = train_improv,
                                     varlist = varlist_improv,
                                     minFraction = 0.1)
[1] "vtreat 1.6.2 inspecting inputs Wed May 05 01:50:30 2021"
[1] "designing treatments Wed May 05 01:50:30 2021"
[1] " have initial level statistics Wed May 05 01:50:30 2021"
[1] " scoring treatments Wed May 05 01:50:30 2021"
[1] "have treatment plan Wed May 05 01:50:30 2021"
# apply treatment to both train and test data
train_improv_2 <- prepare(train_treatment_improv, train_improv) %>% 
  select(-transmission_type_lev_x_Manual)
test_improv_2 <- prepare(train_treatment_improv, test_improv) %>% 
  select(-transmission_type_lev_x_Manual)

# remove columns with _catP
train_improv_2 <- train_improv_2 %>%
  select(-contains("_catP"))
test_improv_2 <- test_improv_2 %>%
  select(-contains("_catP"))

# joining log_selling_price back to the df
train_improv_2$log_selling_price <- train_log_selling_price_improv

# model improvement
model_improv <- lm(log_selling_price ~. - seats -seller_type_lev_x_Dealer -
                     fuel_type_lev_x_Diesel - engine, train_improv_2)
summary(model_improv)

Call:
lm(formula = log_selling_price ~ . - seats - seller_type_lev_x_Dealer - 
    fuel_type_lev_x_Diesel - engine, data = train_improv_2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.07937 -0.15163  0.00774  0.16235  2.32850 

Coefficients:
                                         Estimate      Std. Error  t value
(Intercept)                       13.594387719971  0.024464329954  555.682
average_new_price                  0.000000001107  0.000000000510    2.171
vehicle_age                       -0.123324184128  0.000795727780 -154.983
km_driven                         -0.000000240168  0.000000046473   -5.168
mileage                           -0.022867366185  0.000790878569  -28.914
max_power                          0.009690736937  0.000092071178  105.253
brand_lev_x_Hyundai                0.063748209795  0.006363808593   10.017
brand_lev_x_Maruti                 0.125594044450  0.006036248894   20.807
seller_type_lev_x_Individual      -0.103448462838  0.004689091075  -22.062
fuel_type_lev_x_Petrol            -0.318182872110  0.005403144467  -58.888
transmission_type_lev_x_Automatic  0.182561834948  0.006910512579   26.418
                                              Pr(>|t|)    
(Intercept)                       < 0.0000000000000002 ***
average_new_price                               0.0299 *  
vehicle_age                       < 0.0000000000000002 ***
km_driven                                   0.00000024 ***
mileage                           < 0.0000000000000002 ***
max_power                         < 0.0000000000000002 ***
brand_lev_x_Hyundai               < 0.0000000000000002 ***
brand_lev_x_Maruti                < 0.0000000000000002 ***
seller_type_lev_x_Individual      < 0.0000000000000002 ***
fuel_type_lev_x_Petrol            < 0.0000000000000002 ***
transmission_type_lev_x_Automatic < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2709 on 15019 degrees of freedom
Multiple R-squared:  0.8575,    Adjusted R-squared:  0.8574 
F-statistic:  9037 on 10 and 15019 DF,  p-value: < 0.00000000000000022

The R-squared and Adjusted R-squared aren’t very different from the previous model, at 85.75%. We’re now going to predict on the test set and compare the results to the previous model.

# predict model
predict_improv <- predict(model_improv, test_improv_2)
Model               Predictors   Adj.R_squared   RMSE_train   RMSE_test   MAPE
model_3                 12          86.28 %        0.2662       0.2688    0.0153
model_improvement       11          85.74 %        0.2708       0.2726    0.0153

Still, the results aren’t very different from the previous model.

Model Improvement Evaluation

We’re now going to re-evaluate our model based on the 4 assumptions: linearity, residuals normality, homoscedasticity, and no-multicollinearity.

# residual vs fitted plot
ggplot(model_improv, aes(x = .fitted, y = .resid)) + 
  geom_point(color = "#4c72b0", alpha = 0.3) +
  geom_hline(aes(yintercept = 0), color = "#345790") +
  geom_smooth(color = "#2d1a94") +
  theme_bw() +
  labs(title = "Residuals vs Fitted",
       x = "Fitted Values",
       y = "Residuals") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

# histogram
ggplot(model_improv, aes(x = .resid)) +
  geom_histogram(bins = 30, fill = "#4c72b0", color = "#345790", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Residuals Distribution",
       x = "Residuals",
       y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5))

# homoscedasticity
bptest(model_improv)

    studentized Breusch-Pagan test

data:  model_improv
BP = 1518.8, df = 10, p-value < 0.00000000000000022
# no-multicollinearity
vif(model_improv)
                average_new_price                       vehicle_age 
                         1.435668                          1.271233 
                        km_driven                           mileage 
                         1.206055                          2.080817 
                        max_power               brand_lev_x_Hyundai 
                         2.889481                          1.202041 
               brand_lev_x_Maruti      seller_type_lev_x_Individual 
                         1.550935                          1.067726 
           fuel_type_lev_x_Petrol transmission_type_lev_x_Automatic 
                         1.492993                          1.543603 

After improving our model, the result still isn’t satisfactory. The new model only does better on multicollinearity, since there’s no longer any VIF > 5. It still shows the same curved tail in the linearity check, and the model is still heteroscedastic.

Conclusion

After several tries at improving the model, there are still assumptions that aren’t satisfied: linearity and homoscedasticity. From my own newbie perspective, this could happen because:

  • selling_price relies heavily on the brand of the car. However, different brands have different models, and those may influence the selling price differently. I feel like our model could do a lot better with more data, such as the type of the car. It would make more sense to value a car’s resale price based on both the brand and the type, because the same type of car from different brands can have a different resale value.
  • At the first pre-processing stage, we could see that selling_price is heavily right-skewed. Out of tens of thousands of rows, 75% are priced between INR 34,000 - 799,250. The gap between the median of 530,000 and the maximum of 39,500,000 is huge, and we can’t simply remove 25% of the data.
  • About 50% of the data had missing values. Perhaps a more principled way to impute those missing values would have helped. Additional information such as the car type mentioned above could make the imputation more precise.
  • Or maybe… the data is just the way it is and linear regression isn’t the most appropriate method for this data.

All in all, the 2 things I really hope for in the future regarding this dataset are an additional car type variable and modeling this data with another machine learning method. I also hope to learn a lot more and get better at machine learning so I can pinpoint exactly what’s wrong and what’s not. This has been a long journey, but thanks for reading until the end! :)

Resources

  1. https://towardsdatascience.com/assumptions-of-linear-regression-5d87c347140
  2. https://people.duke.edu/~rnau/testing.htm
  3. https://www.statisticssolutions.com/assumptions-of-linear-regression/
  4. Lesmeister, Cory and Chinnamgari, Sunil Kumar (2019). Advanced Machine Learning with R. Packt Publishing.