At present, the volume of used car transfers handled by the industry is increasing year by year, so these transactions play an important role in the development of the automobile market. In developed countries, the automobile business is often called “the industry of industries”. According to industry experts, the automotive industry is growing significantly. Beyond being one of the fastest-growing industries, it also has a global presence: in most countries, cars have become very popular among both local residents and expatriates working there.
Today, almost everyone wants to own their own car, but due to factors such as affordability or financial situation, many people opt for used cars. Accurately predicting used car prices requires expertise, as prices depend on many factors and characteristics. Used car prices in the market are not stable, and both buyers and sellers need an intelligent system that allows them to predict prices effectively. In this project, used car price prediction was conducted on used car data obtained from Kaggle: https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho?datasetId=33080&sortBy=voteCount&language=R&select=Car+details+v3.csv
In order to predict used car prices, several questions can be asked:
Which algorithms are used, and which algorithm performs best in predicting the used car selling price?
Which variables influence the used car selling price?
In this regard, we need to create machine learning algorithms for predicting the selling price of cars based on the available variables.
Buying a used car from a dealership can be a frustrating and unsatisfying experience, as some dealers have been known to use deceptive sales tactics to close the deal. So our project aims:
To build machine learning algorithms that can be implemented in a used car recommendation system to predict the used car selling price.
To identify which variables influence the used car selling price.
This is an interesting topic for the research community, and by building on previous work we hope to achieve meaningful results using more advanced methods.
Import necessary packages:
library(tidyverse)
library(stringr)
library(purrr)
library(Amelia)
library(GGally)
library(caret)
library(relaimpo)
library(randomForest)
library(gbm)
library(broom)
library(dplyr)
library(e1071)
library(magrittr)
library(ggplot2)
library(data.table)
library(tidyr)
library(tidyselect)
library(plotly)
library(reactable)
library(htmlwidgets)
library(IRdisplay)
library(scales)
In this project, we are using a vehicle dataset consisting of four sets of data containing information about used cars, named car data, Car details v3, Car details v4 and Car Details from Car Dekho. We use the Car details v3 dataset. To explore the dataset, we will use the R language to visualize our findings in the Exploratory Data Analysis part. First, we import the cleaned dataset into our R project and view it.
Import dataset from local storage:
car <- read.csv('../group_project/car_project.csv')
view(car)
Running the script shows that the dataset has 8128 rows and 13 columns. As for data types, there are nine (9) character columns and four (4) numeric columns. Next, let’s check the dataset dimensions.
Dimension of dataset:
dim(car)
## [1] 8128 13
The car data has 8128 observations and 13 variables, matching the number of rows and columns above. Next, we want to become familiar with the variables available in the dataset by looking at the column names.
Name of columns:
names(car)
## [1] "name" "year" "selling_price" "km_driven"
## [5] "fuel" "seller_type" "transmission" "owner"
## [9] "mileage" "engine" "max_power" "torque"
## [13] "seats"
The description of each column:
name - Name of the car model.
year - Year in which the car was bought.
selling_price - Price at which the car is being sold.
km_driven - Number of kilometers the car has been driven.
fuel - Fuel type of the car.
seller_type - Whether the seller is an individual or a dealer.
transmission - Gear transmission of the car.
owner - Number of previous owners of the car.
mileage - Fuel efficiency of the car (kmpl).
engine - Engine displacement (CC).
max_power - Maximum engine power (bhp).
torque - Engine torque.
seats - Number of seats.
Summary of dataset:
summary(car)
## name year selling_price km_driven
## Length:8128 Min. :1983 Min. : 29999 Min. : 1
## Class :character 1st Qu.:2011 1st Qu.: 254999 1st Qu.: 35000
## Mode :character Median :2015 Median : 450000 Median : 60000
## Mean :2014 Mean : 638272 Mean : 69820
## 3rd Qu.:2017 3rd Qu.: 675000 3rd Qu.: 98000
## Max. :2020 Max. :10000000 Max. :2360457
##
## fuel seller_type transmission owner
## Length:8128 Length:8128 Length:8128 Length:8128
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## mileage engine max_power torque
## Length:8128 Length:8128 Length:8128 Length:8128
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## seats
## Min. : 2.000
## 1st Qu.: 5.000
## Median : 5.000
## Mean : 5.417
## 3rd Qu.: 5.000
## Max. :14.000
## NA's :221
As for the output of summary(): if a column is numeric, the summary contains information like the minimum, maximum, median and mean; if a column is character, the summary contains its length, class and mode. Here we can see there are nine character columns: name, fuel, seller_type, transmission, owner, mileage, engine, max_power, torque, and four numeric columns: year, selling_price, km_driven, seats. Next, let’s look at the data structure.
str(car)
## 'data.frame': 8128 obs. of 13 variables:
## $ name : chr "Maruti Swift Dzire VDI" "Skoda Rapid 1.5 TDI Ambition" "Honda City 2017-2020 EXi" "Hyundai i20 Sportz Diesel" ...
## $ year : int 2014 2014 2006 2010 2007 2017 2007 2001 2011 2013 ...
## $ selling_price: int 450000 370000 158000 225000 130000 440000 96000 45000 350000 200000 ...
## $ km_driven : int 145500 120000 140000 127000 120000 45000 175000 5000 90000 169000 ...
## $ fuel : chr "Diesel" "Diesel" "Petrol" "Diesel" ...
## $ seller_type : chr "Individual" "Individual" "Individual" "Individual" ...
## $ transmission : chr "Manual" "Manual" "Manual" "Manual" ...
## $ owner : chr "First Owner" "Second Owner" "Third Owner" "First Owner" ...
## $ mileage : chr "23.4 kmpl" "21.14 kmpl" "17.7 kmpl" "23.0 kmpl" ...
## $ engine : chr "1248 CC" "1498 CC" "1497 CC" "1396 CC" ...
## $ max_power : chr "74 bhp" "103.52 bhp" "78 bhp" "90 bhp" ...
## $ torque : chr "190Nm@ 2000rpm" "250Nm@ 1500-2500rpm" "12.7@ 2,700(kgm@ rpm)" "22.4 kgm at 1750-2750rpm" ...
## $ seats : int 5 5 5 5 5 5 5 4 5 5 ...
Data structure gives an overview of the data so that it can be used effectively. Here it describes a data frame, the generic R object used to store tabular data. Since our main objective focuses on the selling price of the cars, we will look at the distribution of selling prices first.
Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, to spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations.
Helper function to set the plot width and height:
fig<-function(x,y){
options(repr.plot.width = x, repr.plot.height = y)
}
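For example, calling fig(10, 6) before drawing a plot requests a 10-by-6 plotting area (the values here are just illustrative):
fig(10, 6)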
expensive_cars <- car %>% group_by(name) %>% summarise(selling_price=max(selling_price))%>%top_n(10)
## Selecting by selling_price
head(expensive_cars)
## # A tibble: 6 × 2
## name selling_price
## <chr> <int>
## 1 Audi A6 35 TFSI Matrix 6523000
## 2 BMW 5 Series 520d Luxury Line 5200000
## 3 BMW 6 Series GT 630d Luxury Line 6000000
## 4 BMW X4 M Sport X xDrive20d 5800000
## 5 BMW X7 xDrive 30d DPE 7200000
## 6 Mercedes-Benz E-Class Exclusive E 200 BSIV 5200000
print(expensive_cars)
## # A tibble: 10 × 2
## name selling_price
## <chr> <int>
## 1 Audi A6 35 TFSI Matrix 6523000
## 2 BMW 5 Series 520d Luxury Line 5200000
## 3 BMW 6 Series GT 630d Luxury Line 6000000
## 4 BMW X4 M Sport X xDrive20d 5800000
## 5 BMW X7 xDrive 30d DPE 7200000
## 6 Mercedes-Benz E-Class Exclusive E 200 BSIV 5200000
## 7 Mercedes-Benz S-Class S 350 CDI 6000000
## 8 Volvo S90 D4 Inscription BSIV 5500000
## 9 Volvo XC60 Inscription D5 BSIV 5500000
## 10 Volvo XC90 T8 Excellence BSIV 10000000
options(scipen = 999)
ggplot(data = expensive_cars, aes(y=name, x=selling_price, fill=selling_price))+
geom_bar(stat="identity", width = 0.5, fill="#E14D2A",color = 'black')+
geom_text(aes(label=selling_price), vjust=1.7, color="black", size=3.0)+
scale_x_continuous(labels = comma)+
labs(x="Car Price",
y="Car Brand",
title="Top 10 most expensive cars")+
theme_bw()+
theme(plot.title = element_text(size=15),axis.text.x= element_text(size=10),
axis.text.y= element_text(size=10), axis.title=element_text(size=10))
The most expensive car model is the Volvo XC90 T8 Excellence BSIV, sold at 10,000,000. The second most expensive is the BMW X7 xDrive 30d DPE, sold at 7,200,000, and the third is the Audi A6 35 TFSI Matrix, sold at 6,523,000.
cheapest_cars <- car %>% group_by(name) %>% summarise(selling_price=max(selling_price)) %>% arrange(desc(selling_price))
ggplot(data = tail(cheapest_cars,10), aes(y=name, x=selling_price, fill=selling_price))+
geom_bar(stat="identity", width = 0.5, fill="#001253",color = 'white')+
geom_text(aes(label=selling_price), vjust=1.9, color="black", size=3.0)+
scale_x_continuous(labels = comma)+
labs(x="Car Price",
y="Car Brand",
title="Top 10 most cheapest car model")+
theme_bw()+
theme(plot.title = element_text(size=15),axis.text.x= element_text(size=10),
axis.text.y= element_text(size=10), axis.title=element_text(size=10))
The cheapest car model is the Hyundai Santro LE, sold at 35,000. The second cheapest is the Tata Indica DLX, sold at 40,000. The third and fourth cheapest share the same price of 45,000: the Tata Nano LX SE and the Maruti 800 DXBSII.
sold_per_year<- car %>% group_by(name, year)%>%summarise(Count=length(name))
## `summarise()` has grouped output by 'name'. You can override using the
## `.groups` argument.
sold_per_year
## # A tibble: 3,837 × 3
## # Groups: name [2,058]
## name year Count
## <chr> <int> <int>
## 1 Ambassador CLASSIC 1500 DSL AC 2000 1
## 2 Ambassador Classic 2000 DSZ AC PS 1994 1
## 3 Ambassador Grand 1500 DSZ BSIII 2008 1
## 4 Ambassador Grand 2000 DSZ PW CL 2008 1
## 5 Ashok Leyland Stile LE 2013 1
## 6 Audi A3 35 TDI Premium Plus 2017 1
## 7 Audi A3 35 TDI Premium Plus 2018 1
## 8 Audi A3 40 TFSI Premium 2017 1
## 9 Audi A4 1.8 TFSI 2010 1
## 10 Audi A4 2.0 TDI 2014 2
## # … with 3,827 more rows
ggplot(sold_per_year, aes(year))+
geom_line(stat="count", color = 'black')+
labs(x="Year",
y="Count",
title="Distribution of cars sold per year")+
theme_bw()+
theme(plot.title = element_text(size=15),axis.text.x= element_text(size=10,angle=90),
axis.text.y= element_text(size=10), axis.title=element_text(size=10))
From the distribution of used cars sold from 1996 to 2020, the number of units sold increases over time. In 2012 there was a slight dip in the count, after which it rose again until 2017, when it peaked at 434 units sold. The number of used cars sold then decreases until 2020.
fuel_type<- car %>% group_by(fuel) %>% summarise(Count=length(fuel)) %>% plot_ly(x=~fuel, y=~Count, color = ~fuel, colors = c("#764AF1","#F2F2F2"), type='bar')%>%layout(title="Distribution of fuel types on used car model", xaxis=list(title="Fuel type"))
fuel_type
From the distribution of fuel types across the used car models, the most common fuel type is diesel, followed by petrol. The least common fuel types are Compressed Natural Gas (CNG) and Liquefied Petroleum Gas (LPG).
seller<- car %>% group_by(seller_type) %>% summarise(Count=length(seller_type)) %>% plot_ly(x=~seller_type, y=~Count, color = ~seller_type, colors = c("#00FFFF","#F2F2F2"), type='bar')%>%layout(title="Distribution of fuel types on used car model", xaxis=list(title="Seller type"))
seller
From the distribution of seller types across the used car models, the most common seller type is Individual, followed by Dealer and Trustmark Dealer.
transmission <- car %>% group_by(transmission) %>% summarise(Count=length(transmission)) %>% plot_ly(x=~transmission, y=~Count, color = ~transmission, colors = c("#00FF00","#F2F2F2"), type='bar')%>%layout(title="Distribution of used car model's transmission type", xaxis=list(title="transmission type"))
transmission
From the distribution of transmission types across the used car models, Manual is the most common transmission type, with over 7,000 cars, while Automatic is the least common, at around 1,000 cars.
ggplot(data = car, aes(x=year, y=selling_price)) +
geom_point()+
labs(y="Selling Price",
x="Year",
title="Relationship between Selling Price and Car Manufactured")+
theme_bw()+
theme(plot.title = element_text(size=15)
,axis.text.x= element_text(size=10),
axis.text.y= element_text(size=10),
axis.title=element_text(size=10))
From the relationship between selling price and manufacturing year, the graph shows an increasing trend: as the year increases, selling prices tend to increase.
ggplot(data = car, aes(x=km_driven, y=selling_price)) +
geom_point()+
labs(y="Selling Price",
x="KM driven",
title="Relationship between Selling Price and km driven")+
theme_bw()+
theme(plot.title = element_text(size=15)
,axis.text.x= element_text(size=10),
axis.text.y= element_text(size=10),
axis.title=element_text(size=10))
From the relationship between selling price and kilometers driven, the graph shows a decreasing trend: as kilometers driven increase, selling prices tend to decrease.
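To put a number on this downward trend, the linear correlation between kilometers driven and selling price could be computed; this is a suggested check rather than part of the original analysis:
cor(car$km_driven, car$selling_price)  # a negative value would confirm the inverse relationship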
Check and visualize the missing data:
car[car == ''] <- NA
sapply(car, function(x) sum(is.na(x)))
## name year selling_price km_driven fuel
## 0 0 0 0 0
## seller_type transmission owner mileage engine
## 0 0 0 221 221
## max_power torque seats
## 215 222 221
missmap(car, legend = TRUE, col = c("red", "blue"))
There are 221 missing values in mileage, engine and seats, while there are 215 and 222 missing values in max_power and torque respectively.
Change the model name variable into the company name so the data is easier to understand, with a single word rather than multiple words:
car$name <- word(car$name,1)
Change the car mileage variable to a numerical data type and impute the missing values using the mean:
car$mileage <- parse_number(car$mileage)
car$mileage[is.na(car$mileage)]<-mean(car$mileage,na.rm=TRUE)
Change the car engine variable to a numerical data type and impute the missing values using the mean:
car$engine <- parse_number(car$engine)
car$engine[is.na(car$engine)]<-mean(car$engine,na.rm=TRUE)
Change the car max_power variable to a numeric data type and impute the missing values using the mean:
car$max_power <- parse_number(car$max_power)
car$max_power[is.na(car$max_power)]<-mean(car$max_power,na.rm=TRUE)
Impute the missing values in the car seats variable using the median:
car$seats[is.na(car$seats)]<-median(car$seats,na.rm=TRUE)
Change the car torque variable to numeric and impute the missing values using the mean:
car$torque <- parse_number(car$torque)
car$torque[is.na(car$torque)]<-mean(car$torque,na.rm=TRUE)
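The parse-and-impute steps for mileage, engine, max_power and torque all follow the same pattern, so they could be expressed once with a small helper. A sketch, where parse_impute_mean is a hypothetical name and parse_number comes from readr (loaded with tidyverse):
# Hypothetical helper: parse a character column to numbers, then mean-impute the NAs
parse_impute_mean <- function(x) {
  x <- parse_number(x)
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
# e.g. car$mileage <- parse_impute_mean(car$mileage), applied to the raw character column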
Check the data after data transformation and imputation.
head(car)
## name year selling_price km_driven fuel seller_type transmission
## 1 Maruti 2014 450000 145500 Diesel Individual Manual
## 2 Skoda 2014 370000 120000 Diesel Individual Manual
## 3 Honda 2006 158000 140000 Petrol Individual Manual
## 4 Hyundai 2010 225000 127000 Diesel Individual Manual
## 5 Maruti 2007 130000 120000 Petrol Individual Manual
## 6 Hyundai 2017 440000 45000 Petrol Individual Manual
## owner mileage engine max_power torque seats
## 1 First Owner 23.40 1248 74.00 190.00 5
## 2 Second Owner 21.14 1498 103.52 250.00 5
## 3 Third Owner 17.70 1497 78.00 12.70 5
## 4 First Owner 23.00 1396 90.00 22.40 5
## 5 First Owner 16.10 1298 88.20 11.50 5
## 6 First Owner 20.14 1197 81.86 113.75 5
The data has been transformed into a more suitable form.
Recheck and revisualize the missing values:
sapply(car, function(x) sum(is.na(x)))
## name year selling_price km_driven fuel
## 0 0 0 0 0
## seller_type transmission owner mileage engine
## 0 0 0 0 0
## max_power torque seats
## 0 0 0
missmap(car, legend = TRUE, col = c("red", "blue"))
There are no missing values remaining after cleaning.
Check data type for all variables:
lapply(car, class)
## $name
## [1] "character"
##
## $year
## [1] "integer"
##
## $selling_price
## [1] "integer"
##
## $km_driven
## [1] "integer"
##
## $fuel
## [1] "character"
##
## $seller_type
## [1] "character"
##
## $transmission
## [1] "character"
##
## $owner
## [1] "character"
##
## $mileage
## [1] "numeric"
##
## $engine
## [1] "numeric"
##
## $max_power
## [1] "numeric"
##
## $torque
## [1] "numeric"
##
## $seats
## [1] "integer"
Next, the categorical data will be encoded as part of data preparation before modelling.
Each car name will be encoded as a numerical value from 0 to 31, and the name column will be changed from character to numeric data type.
# Map each brand name to its numeric code (0-31), preserving the original encoding order
brand_levels <- c('Maruti', 'Skoda', 'Honda', 'Hyundai', 'Toyota', 'Ford',
                  'Renault', 'Mahindra', 'Tata', 'Chevrolet', 'Fiat', 'Datsun',
                  'Jeep', 'Mercedes-Benz', 'Mitsubishi', 'Audi', 'Volkswagen',
                  'BMW', 'Nissan', 'Lexus', 'Jaguar', 'Land', 'MG', 'Volvo',
                  'Daewoo', 'Kia', 'Force', 'Ambassador', 'Ashok', 'Isuzu',
                  'Opel', 'Peugeot')
car$name <- match(car$name, brand_levels) - 1
table(car$name)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2448 105 467 1415 488 397 228 772 734 230 47 65 31 54 14 40
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
## 186 120 81 34 71 6 3 67 3 4 6 4 1 5 1 1
The car transmission variable will be encoded to a numerical value, where “Manual” becomes 0 and “Automatic” becomes 1. The transmission variable will be changed from character to numeric data type:
car$transmission <- str_replace(car$transmission, 'Manual', "0")
car$transmission <- str_replace(car$transmission, 'Automatic', "1")
car$transmission <- as.numeric(car$transmission)
table(car$transmission)
##
## 0 1
## 7078 1050
The car owner variable will be encoded to a numerical value, where “First Owner” becomes “0”, “Second Owner” becomes “1”, “Third Owner” becomes “2”, “Fourth & Above Owner” becomes “3” and “Test Drive Car” becomes “4”:
car$owner <- str_replace(car$owner, 'First Owner', "0")
car$owner <- str_replace(car$owner, 'Second Owner', "1")
car$owner <- str_replace(car$owner, 'Third Owner', "2")
car$owner <- str_replace(car$owner, 'Fourth & Above Owner', "3")
car$owner <- str_replace(car$owner, 'Test Drive Car', "4")
car$owner <- as.numeric(car$owner)
table(car$owner)
##
## 0 1 2 3 4
## 5289 2105 555 174 5
The car seller_type variable will be encoded to a numerical value, where “Trustmark Dealer” becomes “0”, “Dealer” becomes “1” and “Individual” becomes “2”:
car$seller_type <- str_replace(car$seller_type, "Trustmark Dealer", "0")
car$seller_type <- str_replace(car$seller_type, "Dealer", "1")
car$seller_type <- str_replace(car$seller_type, "Individual", "2")
car$seller_type <- as.numeric(car$seller_type)
table(car$seller_type)
##
## 0 1 2
## 236 1126 6766
The car fuel variable will be encoded to a numerical value, where “Diesel” becomes “0”, “Petrol” becomes “1”, “CNG” becomes “2” and “LPG” becomes “3”:
car$fuel <- str_replace(car$fuel, 'Diesel', "0")
car$fuel <- str_replace(car$fuel, 'Petrol', "1")
car$fuel <- str_replace(car$fuel, 'CNG', "2")
car$fuel <- str_replace(car$fuel, 'LPG', "3")
car$fuel <- as.numeric(car$fuel)
table(car$fuel)
##
## 0 1 2 3
## 4402 3631 57 38
Check the dataset after encoding and transformation:
head(car)
## name year selling_price km_driven fuel seller_type transmission owner mileage
## 1 0 2014 450000 145500 0 2 0 0 23.40
## 2 1 2014 370000 120000 0 2 0 1 21.14
## 3 2 2006 158000 140000 1 2 0 2 17.70
## 4 3 2010 225000 127000 0 2 0 0 23.00
## 5 0 2007 130000 120000 1 2 0 0 16.10
## 6 3 2017 440000 45000 1 2 0 0 20.14
## engine max_power torque seats
## 1 1248 74.00 190.00 5
## 2 1498 103.52 250.00 5
## 3 1497 78.00 12.70 5
## 4 1396 90.00 22.40 5
## 5 1298 88.20 11.50 5
## 6 1197 81.86 113.75 5
All data is encoded properly.
Specify the non-target variables as categorical and numerical columns, and specify the target variable, in preparation for normalization:
cat_cols <- c('name', 'year', 'fuel', 'seller_type', 'transmission', 'owner', 'seats')
num_cols <- c('km_driven', 'mileage', 'engine', 'max_power', 'torque')
target_cols <- c('selling_price')
Normalize the data for the non-target variables; the target variable is not scaled, so the selling price remains easy to interpret.
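With method = "range", caret's preProcess applies min-max normalization, mapping each selected variable onto the [0, 1] interval:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$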
preproc <- preProcess(car[, c(cat_cols,num_cols)], method=c("range"))
car_scaled <- predict(preproc, car)
head(car_scaled)
## name year selling_price km_driven fuel seller_type
## 1 0.00000000 0.8378378 450000 0.06164021 0.0000000 1
## 2 0.03225806 0.8378378 370000 0.05083721 0.0000000 1
## 3 0.06451613 0.6216216 158000 0.05931015 0.3333333 1
## 4 0.09677419 0.7297297 225000 0.05380274 0.0000000 1
## 5 0.00000000 0.6486486 130000 0.05083721 0.3333333 1
## 6 0.09677419 0.9189189 440000 0.01906369 0.3333333 1
## transmission owner mileage engine max_power torque seats
## 1 0 0.00 0.5571429 0.2093960 0.18500 0.236164244 0.25
## 2 0 0.25 0.5033333 0.2932886 0.25880 0.312675338 0.25
## 3 0 0.50 0.4214286 0.2929530 0.19500 0.010073961 0.25
## 4 0 0.00 0.5476190 0.2590604 0.22500 0.022443254 0.25
## 5 0 0.00 0.3833333 0.2261745 0.22050 0.008543739 0.25
## 6 0 0.00 0.4795238 0.1922819 0.20465 0.138931395 0.25
The data has been successfully processed, as shown above.
Split the data into an 80% training set and a 20% testing set:
trainIndex <- createDataPartition(car_scaled$selling_price, p = 0.8,
list = FALSE,
times = 1)
train <- car_scaled[ trainIndex,]
test <- car_scaled[-trainIndex,]
dim(train)
## [1] 6504 13
dim(test)
## [1] 1624 13
Five regression algorithms will be used for modelling, and all of them will be evaluated and compared to find the best model. The RMSE score will be used as the evaluation metric. The comparison of the models is discussed in the evaluation & conclusion section.
Linear regression training:
set.seed(123)
lr <- lm(selling_price ~ ., data = train)
summary(lr)
##
## Call:
## lm(formula = selling_price ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2403261 -208940 -1909 160224 4188508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1664598 82982 -20.060 < 0.0000000000000002 ***
## name 744425 41694 17.854 < 0.0000000000000002 ***
## year 1093985 72699 15.048 < 0.0000000000000002 ***
## km_driven -3063252 322330 -9.503 < 0.0000000000000002 ***
## fuel 95788 48452 1.977 0.0481 *
## seller_type -216807 27028 -8.021 0.00000000000000123 ***
## transmission 425377 21767 19.543 < 0.0000000000000002 ***
## owner -19877 36090 -0.551 0.5818
## mileage 824411 93647 8.803 < 0.0000000000000002 ***
## engine 163451 75656 2.160 0.0308 *
## max_power 4519394 138815 32.557 < 0.0000000000000002 ***
## torque 402575 90972 4.425 0.00000978984490370 ***
## seats -160435 104570 -1.534 0.1250
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 450800 on 6491 degrees of freedom
## Multiple R-squared: 0.6877, Adjusted R-squared: 0.6871
## F-statistic: 1191 on 12 and 6491 DF, p-value: < 0.00000000000000022
The summary of the Linear Regression model above gives statistical information about the fitted model.
Linear regression testing:
pred_lr <- predict(lr, newdata = test)
error_lr <- test$selling_price - pred_lr
eval_lr <- cbind(test$selling_price, pred_lr)
colnames(eval_lr) <- c("True Price", "Predicted Price")
eval_lr <- as.data.frame(eval_lr)
head(eval_lr)
## True Price Predicted Price
## 6 440000 536920.9
## 10 200000 150437.9
## 11 500000 815158.6
## 18 500000 453932.6
## 24 600000 402256.6
## 25 500000 724944.7
From the model testing, a sample of true and predicted values is shown above to give a rough comparison. As shown, the differences between true and predicted values for the Linear Regression model are quite large, and some predicted values are negative even though all selling prices should be positive.
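As a quick sanity check, the number of non-positive predictions could be counted; this check is a suggestion and its result is not part of the original output:
sum(pred_lr <= 0)  # count of predicted selling prices at or below zero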
Scatter plot of actual values against predicted values for Linear Regression.
RMSE_lr <- round(sqrt(mean(error_lr^2)),2)
RMSE_lr
## [1] 451026.5
plot(test$selling_price,pred_lr, main="Actual Value vs Predicted Value for Linear Regression Model", col = c("blue","red"), xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
abline(a = 0, b =1)
The RMSE value is the Root Mean Squared Error of the predicted values against the true values. A higher RMSE indicates larger errors and therefore lower model performance, while a lower RMSE indicates smaller errors and higher performance. The RMSE for Linear Regression is 451026.51, which is quite high. In the scatter plot, the black diagonal line serves as a benchmark: points closer to the line indicate better model performance. For the Linear Regression model, the points are dispersed away from the diagonal, indicating low performance. In addition, some predicted values are zero or negative, which is impossible for a selling price. Therefore, the Linear Regression model performs the worst.
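For reference, the RMSE computed above corresponds to the formula

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ is the true selling price, $\hat{y}_i$ is the predicted selling price and $n$ is the number of test observations.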
Random Forest Regression training.
set.seed(123)
rf <- randomForest(selling_price~.,data = train)
rf
##
## Call:
## randomForest(formula = selling_price ~ ., data = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 21601607350
## % Var explained: 96.67
Random Forest Regression testing.
pred_rf <- predict(rf, test)
error_rf <- test$selling_price - pred_rf
eval_rf <- cbind(test$selling_price, pred_rf)
colnames(eval_rf) <- c("True Price", "Predicted Price")
eval_rf <- as.data.frame(eval_rf)
head(eval_rf)
## True Price Predicted Price
## 6 440000 502552.8
## 10 200000 256983.0
## 11 500000 574281.4
## 18 500000 336490.5
## 24 600000 344239.1
## 25 500000 497701.3
Roughly, in the Random Forest model, the differences between true and predicted values are relatively small, which suggests that Random Forest performs better than the Linear Regression model.
Scatter plot for actual values against predicted values for Random forest Regression.
RMSE_rf <- round(sqrt(mean(error_rf^2)),2)
RMSE_rf
## [1] 125516
plot(test$selling_price,pred_rf, main="Actual Value vs Predicted Value for Random Forest Model", col = c("blue","red"), xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
abline(a = 0, b =1)
The RMSE value for the Random Forest model is 125515.97, which is much lower than that of the Linear Regression model. In the scatter plot, most points are reasonably close to the diagonal line, with only a few far from it, indicating good model performance. Based on the RMSE value and the scatter plot, the Random Forest model performs well.
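As an aside, randomForest also exposes its own variable importance measure, which could be used to cross-check the feature ranking obtained later from the Gradient Boost model. A sketch, not run in the original analysis:
importance(rf)  # increase in node purity attributable to each variable
varImpPlot(rf)  # plot of the same importance values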
K-Nearest Neighbor training.
set.seed(123)
knn = knnreg(selling_price~.,data = train)
knn
## 5-nearest neighbor regression model
K-Nearest Neighbor testing.
pred_knn <- predict(knn, newdata = test)
error_knn <- test$selling_price - pred_knn
eval_knn <- cbind(test$selling_price, pred_knn)
colnames(eval_knn) <- c("True Price", "Predicted Price")
eval_knn <- as.data.frame(eval_knn)
head(eval_knn)
## True Price Predicted Price
## 1 440000 521999.8
## 2 200000 257999.8
## 3 500000 507000.0
## 4 500000 327200.0
## 5 600000 295800.0
## 6 500000 477000.0
Roughly, in the KNN model, the differences between true and predicted values are noticeable. This suggests that the KNN model will perform worse than the Random Forest model.
Scatter plot of actual values against predicted values for K-Nearest Neighbor.
RMSE_knn <- round(sqrt(mean(error_knn^2)),2)
RMSE_knn
## [1] 186713.2
plot(test$selling_price,pred_knn, main="Actual Value vs Predicted Value for K-Nearest Neighbor Model", col = c("blue","red"), xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
abline(a = 0, b =1)
The RMSE value for the KNN model is 186713.17, which is considerably higher than that of the Random Forest model. In the scatter plot, most points are reasonably close to the diagonal line, with some points far from it, indicating reasonable performance. However, based on the RMSE and the overall fit of the points, the Random Forest model still performs far better than the KNN model.
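knnreg defaults to k = 5 neighbors, and performance may vary with k. A sketch for exploring a few alternatives, where the candidate values of k are illustrative:
# Try several neighborhood sizes and report the test RMSE for each
for (k in c(3, 5, 7, 9)) {
  m <- knnreg(selling_price ~ ., data = train, k = k)
  p <- predict(m, newdata = test)
  cat("k =", k, "RMSE =", round(sqrt(mean((test$selling_price - p)^2)), 2), "\n")
}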
Support Vector Machine training.
set.seed(123)
svm = svm(selling_price~.,data = train)
svm
##
## Call:
## svm(formula = selling_price ~ ., data = train)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.08333333
## epsilon: 0.1
##
##
## Number of Support Vectors: 2167
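The model above uses the default hyperparameters (cost = 1, gamma ≈ 0.083, epsilon = 0.1). e1071 provides tune.svm for a simple cross-validated grid search; a sketch with illustrative grid values, not run here and potentially slow on 6504 training rows:
# Grid-search cost and gamma by cross-validation (illustrative grid)
svm_tuned <- tune.svm(selling_price ~ ., data = train,
                      cost = c(1, 10, 100), gamma = c(0.01, 0.08, 0.5))
svm_tuned$best.parameters  # combination with the lowest cross-validated error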
Support Vector Machine testing.
pred_svm <- predict(svm, newdata = test)
error_svm <- test$selling_price - pred_svm
eval_svm <- cbind(test$selling_price, pred_svm)
colnames(eval_svm) <- c("True Price", "Predicted Price")
eval_svm <- as.data.frame(eval_svm)
head(eval_svm)
## True Price Predicted Price
## 6 440000 520850.5
## 10 200000 312831.1
## 11 500000 549614.3
## 18 500000 376808.7
## 24 600000 422692.7
## 25 500000 475045.1
Roughly, in the SVM model, the differences between true and predicted values are relatively small, similar to the KNN model. This suggests that the SVM model will perform about the same as the KNN model.
Scatter plot for actual values against predicted values for Support Vector Machine.
RMSE_svm <- round(sqrt(mean(error_svm^2)),2)
RMSE_svm
## [1] 180960.2
plot(test$selling_price,pred_svm, main="Actual Value vs Predicted Value for SVM Model", col = c("blue","red"), xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
abline(a = 0, b =1)
The RMSE value for the SVM model is 180960.23, which is almost the same as the KNN model. In the scatter plot, most points are reasonably close to the diagonal line, with some points far from it, indicating reasonable performance. However, based on the RMSE and the overall fit of the points, the Random Forest model still performs far better than the SVM model.

### Gradient Boost (GB)
Gradient Boost training.
set.seed(123)
gbm <- gbm(
formula = selling_price ~ .,
distribution = "gaussian",
data = train,
n.trees = 6000,
interaction.depth = 3,
shrinkage = 0.1,
cv.folds = 5,
n.cores = NULL, # will use all cores by default
verbose = FALSE
)
gbm
## gbm(formula = selling_price ~ ., distribution = "gaussian", data = train,
## n.trees = 6000, interaction.depth = 3, shrinkage = 0.1, cv.folds = 5,
## verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 6000 iterations were performed.
## The best cross-validation iteration was 4881.
## There were 12 predictors of which 12 had non-zero influence.
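The best cross-validation iteration reported above can also be retrieved programmatically with gbm.perf, which in addition plots the training and cross-validated error curves:
best_iter <- gbm.perf(gbm, method = "cv")  # returns 4881 for this fit and draws the error curves
best_iter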
Gradient Boost testing.
pred_gbm <- predict(gbm, test)
## Using 4881 trees...
error_gbm <- test$selling_price - pred_gbm
eval_gbm <- cbind(test$selling_price, pred_gbm)
colnames(eval_gbm) <- c("True Price", "Predicted Price")
eval_gbm <- as.data.frame(eval_gbm)
head(eval_gbm)
## True Price Predicted Price
## 1 440000 508454.7
## 2 200000 251092.9
## 3 500000 553044.7
## 4 500000 338273.2
## 5 600000 438715.2
## 6 500000 502449.1
RMSE_gbm <- round(sqrt(mean(error_gbm^2)),2)
RMSE_gbm
## [1] 115704.1
Roughly, in the Gradient Boost model, the differences between true and predicted values are relatively small, similar to the Random Forest model. This suggests that the Gradient Boost model will perform about the same as the Random Forest model.
Scatter plot for actual values against predicted values for Gradient Boost.
plot(test$selling_price,pred_gbm, main="Actual Value vs Predicted Value for Gradient Boost Model", col = c("blue","red"), xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
abline(a = 0, b =1)
The RMSE value for the Gradient Boost model is 115704.11, which is relatively low and close to the Random Forest model. In the scatter plot, nearly all points are close to the diagonal line, with almost no points far from it, indicating outstanding model performance. Based on the RMSE and the overall fit of the points, the Gradient Boost model performs slightly better than the Random Forest model, with a small difference in RMSE score.
The RMSE score of each model is collected into a table for comparison. The RMSE scores for all models are shown below.
Model <- c('Linear Regression','Random Forest','Gradient Boosting', 'K-Nearest Neighbour', "Support Vector Machine")
RMSE <- c(RMSE_lr,RMSE_rf,RMSE_gbm,RMSE_knn,RMSE_svm)
res <- data.frame(Model,RMSE)
res %>% arrange(RMSE,descending =TRUE)
## Model RMSE
## 1 Gradient Boosting 115704.1
## 2 Random Forest 125516.0
## 3 Support Vector Machine 180960.2
## 4 K-Nearest Neighbour 186713.2
## 5 Linear Regression 451026.5
ggplot(data = res, aes(x= Model, y = RMSE, fill = Model)) + geom_bar(stat="identity") + theme(axis.text.x = element_blank(), axis.ticks = element_blank())
The Gradient Boost model has the lowest RMSE score, 115704.11, only slightly below the Random Forest model. Therefore, the Gradient Boost model is the best model, while the Linear Regression model has the highest RMSE score, 451026.51, making it the worst model.
Using the Gradient Boosting model, feature importance can be calculated to find which features or variables have the greatest influence on the price prediction.
summary(
gbm,
cBars = 10,
method = relative.influence, las = 2
)
## var rel.inf
## max_power max_power 63.2215544
## year year 15.0016012
## torque torque 7.8205921
## km_driven km_driven 5.6022714
## engine engine 2.8279422
## name name 2.5183787
## mileage mileage 1.7593982
## transmission transmission 0.5422712
## seller_type seller_type 0.2384438
## owner owner 0.2040022
## seats seats 0.1503491
## fuel fuel 0.1131957
From the feature importance, the max_power variable has the greatest influence, accounting for about 63% of the relative influence in predicting the selling price with this dataset and the Gradient Boost model. The features ranked from highest to lowest importance are max_power, year, torque, km_driven, engine, name, mileage, transmission, seller_type, owner, seats and fuel.
The Gradient Boost and Random Forest models show only a small difference in RMSE value, and either could be considered the best model. However, this project does not fully utilize machine learning capabilities such as hyperparameter tuning, which could not be run due to technical limitations.
Hyperparameter tuning script for Gradient Boost model:
train_control <- trainControl(method = "cv",
                              number = 10,
                              returnResamp = "all", # use "all" to return all cross-validated metrics
                              search = "grid")
tuneGrid <- expand.grid(
  n.trees = c(5000, 10000),
  interaction.depth = c(6, 13),
  shrinkage = c(0.01, 0.001),
  n.minobsinnode = c(5, 10)
)
gbm_hpt <- train(selling_price ~ .,
                 data = train,
                 method = "gbm",
                 tuneGrid = tuneGrid,
                 trControl = train_control,
                 verbose = FALSE)
pred_gbm_hpt <- predict(gbm_hpt, test)
error_gbm_hpt <- test$selling_price - pred_gbm_hpt
RMSE_gbm_hpt <- round(sqrt(mean(error_gbm_hpt^2)),2)
RMSE_gbm_hpt
plot(test$selling_price,pred_gbm_hpt, main="Actual Value vs Predicted Value for Tuned Gradient Boost Model", col = c("blue","red"), xlab = "Actual Selling Price", ylab = "Predicted Selling Price")