At present, the volume of used car transfers handled by the industry is increasing year by year, so these transactions play an important role in the development of the automobile market. In developed countries, the automobile business is often called “the industry of industries”. According to industry experts, the automotive industry is growing significantly. Beyond being one of the fastest-growing industries, it also has a global presence: in most countries, cars have become very popular among both local residents and expatriates working there.
Today, almost everyone wants to own their own car, but due to factors such as affordability or financial situation, many people opt for used cars. Accurately predicting used car prices requires expertise, as prices depend on many factors and characteristics. Used car prices in the market are not stable, and both buyers and sellers need an intelligent system that allows them to predict prices effectively. In this project, used car price prediction was conducted on used car data obtained from Kaggle: https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho?datasetId=33080&sortBy=voteCount&language=R&select=Car+details+v3.csv
In order to predict used car prices, several questions can be asked:
Which algorithms are used, and which algorithm performs best in predicting the used car selling price?
Which variables influence the used car selling price?
In this regard, we need to create machine learning algorithms for predicting the selling price of cars based on the available variables.
Buying a used car from a dealership can be a frustrating and unsatisfying experience, as some dealers have been known to use deceptive sales tactics to close the deal. So our project aims:
To build machine learning algorithms that can be implemented in a used car recommendation system to predict the used car selling price.
To identify which variables influence the used car selling price.
This is an interesting topic for the research community, and by building on previous work we hope to achieve meaningful results using more advanced methods.
Import necessary packages:
library(tidyverse)
library(stringr)
library(purrr)
library(Amelia)
library(GGally)
library(caret)
library(relaimpo)
library(randomForest)
library(gbm)
library(broom)
library(dplyr)
library(e1071)
library(magrittr)
library(ggplot2)
library(data.table)
library(tidyr)
library(tidyselect)
library(plotly)
library(reactable)
library(htmlwidgets)
library(IRdisplay)
library(scales)
In this project, we are using a vehicle dataset consisting of four sets of data containing information about used cars, named car data, Car details v3, Car details v4 and Car Details from Car Dekho. We use the Car details v3 dataset. To explore the dataset, we will use the R language to visualize our findings in the Exploratory Data Analysis part. First, we import the cleaned dataset into our R project and view it.
Import dataset from local storage:
car <- read.csv('../group_project/car_project.csv')
view(car)
Running the script shows that the dataset has 8128 rows and 13 columns. As for data types, there are nine (9) character columns and four (4) numeric columns. Next, let’s check the dataset dimensions.
Dimension of dataset:
dim(car)
## [1] 8128 13
The car data has 8128 observations and 13 variables, matching the number of rows and columns above. Next, we want to become familiar with the variables available in the dataset by looking at the column names.
Name of columns:
names(car)
## [1] "name" "year" "selling_price" "km_driven"
## [5] "fuel" "seller_type" "transmission" "owner"
## [9] "mileage" "engine" "max_power" "torque"
## [13] "seats"
The description of each column:
name - Name of the car model.
year - Year in which the car was bought.
selling_price - Price at which the car is being sold.
km_driven - Number of kilometers the car has been driven.
fuel - Fuel type of the car.
seller_type - Whether the seller is an individual or a dealer.
transmission - Gear transmission of the car.
owner - Number of previous owners of the car.
mileage - Fuel efficiency of the car (kmpl).
engine - Engine displacement (CC).
max_power - Maximum engine power (bhp).
torque - Engine torque.
seats - Number of seats.
Summary of dataset:
summary(car)
## name year selling_price km_driven
## Length:8128 Min. :1983 Min. : 29999 Min. : 1
## Class :character 1st Qu.:2011 1st Qu.: 254999 1st Qu.: 35000
## Mode :character Median :2015 Median : 450000 Median : 60000
## Mean :2014 Mean : 638272 Mean : 69820
## 3rd Qu.:2017 3rd Qu.: 675000 3rd Qu.: 98000
## Max. :2020 Max. :10000000 Max. :2360457
##
## fuel seller_type transmission owner
## Length:8128 Length:8128 Length:8128 Length:8128
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## mileage engine max_power torque
## Length:8128 Length:8128 Length:8128 Length:8128
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## seats
## Min. : 2.000
## 1st Qu.: 5.000
## Median : 5.000
## Mean : 5.417
## 3rd Qu.: 5.000
## Max. :14.000
## NA's :221
As for the output of summary(): if a column is numeric, the summary contains information like the minimum, maximum, median and mean; if a column is character, the summary contains its length, class and mode. Here we can see there are nine character columns: name, fuel, seller_type, transmission, owner, mileage, engine, max_power, torque, and four numeric columns: year, selling_price, km_driven, seats. Next, let’s look at the data structure.
str(car)
## 'data.frame': 8128 obs. of 13 variables:
## $ name : chr "Maruti Swift Dzire VDI" "Skoda Rapid 1.5 TDI Ambition" "Honda City 2017-2020 EXi" "Hyundai i20 Sportz Diesel" ...
## $ year : int 2014 2014 2006 2010 2007 2017 2007 2001 2011 2013 ...
## $ selling_price: int 450000 370000 158000 225000 130000 440000 96000 45000 350000 200000 ...
## $ km_driven : int 145500 120000 140000 127000 120000 45000 175000 5000 90000 169000 ...
## $ fuel : chr "Diesel" "Diesel" "Petrol" "Diesel" ...
## $ seller_type : chr "Individual" "Individual" "Individual" "Individual" ...
## $ transmission : chr "Manual" "Manual" "Manual" "Manual" ...
## $ owner : chr "First Owner" "Second Owner" "Third Owner" "First Owner" ...
## $ mileage : chr "23.4 kmpl" "21.14 kmpl" "17.7 kmpl" "23.0 kmpl" ...
## $ engine : chr "1248 CC" "1498 CC" "1497 CC" "1396 CC" ...
## $ max_power : chr "74 bhp" "103.52 bhp" "78 bhp" "90 bhp" ...
## $ torque : chr "190Nm@ 2000rpm" "250Nm@ 1500-2500rpm" "12.7@ 2,700(kgm@ rpm)" "22.4 kgm at 1750-2750rpm" ...
## $ seats : int 5 5 5 5 5 5 5 4 5 5 ...
Data structure gives an overview of the data so that it can be used effectively. Here it describes a data frame, the generic R object used to store tabular data. Since our main objective focuses on the selling price of the cars, we will look at the distribution of selling prices first.
Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, to spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations.
Helper function to set the plot width and height:
fig<-function(x,y){
options(repr.plot.width = x, repr.plot.height = y)
}
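For example, calling fig(10, 6) before drawing a plot requests a 10-by-6 plotting area (the values here are just illustrative):
fig(10, 6)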
expensive_cars <- car %>% group_by(name) %>% summarise(selling_price=max(selling_price))%>%top_n(10)
## Selecting by selling_price
head(expensive_cars)
## # A tibble: 6 × 2
## name selling_price
## <chr> <int>
## 1 Audi A6 35 TFSI Matrix 6523000
## 2 BMW 5 Series 520d Luxury Line 5200000
## 3 BMW 6 Series GT 630d Luxury Line 6000000
## 4 BMW X4 M Sport X xDrive20d 5800000
## 5 BMW X7 xDrive 30d DPE 7200000
## 6 Mercedes-Benz E-Class Exclusive E 200 BSIV 5200000
print(expensive_cars)
## # A tibble: 10 × 2
## name selling_price
## <chr> <int>
## 1 Audi A6 35 TFSI Matrix 6523000
## 2 BMW 5 Series 520d Luxury Line 5200000
## 3 BMW 6 Series GT 630d Luxury Line 6000000
## 4 BMW X4 M Sport X xDrive20d 5800000
## 5 BMW X7 xDrive 30d DPE 7200000
## 6 Mercedes-Benz E-Class Exclusive E 200 BSIV 5200000
## 7 Mercedes-Benz S-Class S 350 CDI 6000000
## 8 Volvo S90 D4 Inscription BSIV 5500000
## 9 Volvo XC60 Inscription D5 BSIV 5500000
## 10 Volvo XC90 T8 Excellence BSIV 10000000
options(scipen = 999)
ggplot(data = expensive_cars, aes(y=name, x=selling_price, fill=selling_price))+
geom_bar(stat="identity", width = 0.5, fill="#E14D2A",color = 'black')+
geom_text(aes(label=selling_price), vjust=1.7, color="black", size=3.0)+
scale_x_continuous(labels = comma)+
labs(x="Car Price",
y="Car Brand",
title="Top 10 most expensive cars")+
theme_bw()+
theme(plot.title = element_text(size=15),axis.text.x= element_text(size=10),
axis.text.y= element_text(size=10), axis.title=element_text(size=10))
The most expensive car model is the Volvo XC90 T8 Excellence BSIV, sold at 10,000,000. The second most expensive is the BMW X7 xDrive 30d DPE, sold at 7,200,000, and the third is the Audi A6 35 TFSI Matrix, sold at 6,523,000.
cheapest_cars <- car %>% group_by(name) %>% summarise(selling_price=max(selling_price)) %>% arrange(desc(selling_price))
ggplot(data = tail(cheapest_cars,10), aes(y=name, x=selling_price, fill=selling_price))+
geom_bar(stat="identity", width = 0.5, fill="#001253",color = 'white')+
geom_text(aes(label=selling_price), vjust=1.9, color="black", size=3.0)+
scale_x_continuous(labels = comma)+
labs(x="Car Price",
y="Car Brand",
title="Top 10 most cheapest car model")+
theme_bw()+
theme(plot.title = element_text(size=15),axis.text.x= element_text(size=10),
axis.text.y= element_text(size=10), axis.title=element_text(size=10))
The cheapest car model is the Hyundai Santro LE, sold at 35,000. The second cheapest is the Tata Indica DLX, sold at 40,000. The third and fourth cheapest share the same price of 45,000: the Tata Nano LX SE and the Maruti 800 DXBSII.
sold_per_year<- car %>% group_by(name, year)%>%summarise(Count=length(name))
## `summarise()` has grouped output by 'name'. You can override using the
## `.groups` argument.
sold_per_year
## # A tibble: 3,837 × 3
## # Groups: name [2,058]
## name year Count
## <chr> <int> <int>
## 1 Ambassador CLASSIC 1500 DSL AC 2000 1
## 2 Ambassador Classic 2000 DSZ AC PS 1994 1
## 3 Ambassador Grand 1500 DSZ BSIII 2008 1
## 4 Ambassador Grand 2000 DSZ PW CL 2008 1
## 5 Ashok Leyland Stile LE 2013 1
## 6 Audi A3 35 TDI Premium Plus 2017 1
## 7 Audi A3 35 TDI Premium Plus 2018 1
## 8 Audi A3 40 TFSI Premium 2017 1
## 9 Audi A4 1.8 TFSI 2010 1
## 10 Audi A4 2.0 TDI 2014 2
## # … with 3,827 more rows
ggplot(sold_per_year, aes(year))+
geom_line(stat="count", color = 'black')+
labs(x="Year",
y="Count",
title="Distribution of cars sold per year")+
theme_bw()+
theme(plot.title = element_text(size=15),axis.text.x= element_text(size=10,angle=90),
axis.text.y= element_text(size=10), axis.title=element_text(size=10))
From the distribution of used cars sold from 1996 to 2020, the number of units sold increases over time. In 2012 there was a slight dip in the count, after which it rose again until 2017, when it peaked at 434 units sold. The number of used cars sold then decreases until 2020.
fuel_type<- car %>% group_by(fuel) %>% summarise(Count=length(fuel)) %>% plot_ly(x=~fuel, y=~Count, color = ~fuel, colors = c("#764AF1","#F2F2F2"), type='bar')%>%layout(title="Distribution of fuel types on used car model", xaxis=list(title="Fuel type"))
fuel_type
From the distribution of fuel types across the used car models, the most common fuel type is diesel, followed by petrol. The least common fuel types are Compressed Natural Gas (CNG) and Liquefied Petroleum Gas (LPG).
seller<- car %>% group_by(seller_type) %>% summarise(Count=length(seller_type)) %>% plot_ly(x=~seller_type, y=~Count, color = ~seller_type, colors = c("#00FFFF","#F2F2F2"), type='bar')%>%layout(title="Distribution of fuel types on used car model", xaxis=list(title="Seller type"))
seller
From the distribution of seller types across the used car models, the most common seller type is Individual, followed by Dealer and Trustmark Dealer.
transmission <- car %>% group_by(transmission) %>% summarise(Count=length(transmission)) %>% plot_ly(x=~transmission, y=~Count, color = ~transmission, colors = c("#00FF00","#F2F2F2"), type='bar')%>%layout(title="Distribution of used car model's transmission type", xaxis=list(title="transmission type"))
transmission
From the distribution of transmission types across the used car models, Manual is the most common transmission type, with over 7,000 cars, while Automatic is the least common, at around 1,000 cars.
ggplot(data = car, aes(x=year, y=selling_price)) +
geom_point()+
labs(y="Selling Price",
x="Year",
title="Relationship between Selling Price and Car Manufactured")+
theme_bw()+
theme(plot.title = element_text(size=15)
,axis.text.x= element_text(size=10),
axis.text.y= element_text(size=10),
axis.title=element_text(size=10))
From the relationship between selling price and manufacturing year, the graph shows an increasing trend: as the year increases, selling prices tend to increase.
ggplot(data = car, aes(x=km_driven, y=selling_price)) +
geom_point()+
labs(y="Selling Price",
x="KM driven",
title="Relationship between Selling Price and km driven")+
theme_bw()+
theme(plot.title = element_text(size=15)
,axis.text.x= element_text(size=10),
axis.text.y= element_text(size=10),
axis.title=element_text(size=10))
From the relationship between selling price and kilometers driven, the graph shows a decreasing trend: as kilometers driven increase, selling prices tend to decrease.
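To put a number on this downward trend, the linear correlation between kilometers driven and selling price could be computed; this is a suggested check rather than part of the original analysis:
cor(car$km_driven, car$selling_price)  # a negative value would confirm the inverse relationship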
Check and visualize the missing data:
car[car == ''] <- NA
sapply(car, function(x) sum(is.na(x)))
## name year selling_price km_driven fuel
## 0 0 0 0 0
## seller_type transmission owner mileage engine
## 0 0 0 221 221
## max_power torque seats
## 215 222 221
missmap(car, legend = TRUE, col = c("red", "blue"))
There are 221 missing values in mileage, engine and seats, while there are 215 and 222 missing values in max_power and torque respectively.
Change the model name variable into the company name so the data is easier to understand, with a single word rather than multiple words:
car$name <- word(car$name,1)
Change the car mileage variable to a numerical data type and impute the missing values using the mean:
car$mileage <- parse_number(car$mileage)
car$mileage[is.na(car$mileage)]<-mean(car$mileage,na.rm=TRUE)
Change the car engine variable to a numerical data type and impute the missing values using the mean:
car$engine <- parse_number(car$engine)
car$engine[is.na(car$engine)]<-mean(car$engine,na.rm=TRUE)
Change the car max_power variable to a numeric data type and impute the missing values using the mean:
car$max_power <- parse_number(car$max_power)
car$max_power[is.na(car$max_power)]<-mean(car$max_power,na.rm=TRUE)
Impute the missing values in the car seats variable using the median:
car$seats[is.na(car$seats)]<-median(car$seats,na.rm=TRUE)
Change the car torque variable to numeric and impute the missing values using the mean:
car$torque <- parse_number(car$torque)
car$torque[is.na(car$torque)]<-mean(car$torque,na.rm=TRUE)
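The parse-and-impute steps for mileage, engine, max_power and torque all follow the same pattern, so they could be expressed once with a small helper. A sketch, where parse_impute_mean is a hypothetical name and parse_number comes from readr (loaded with tidyverse):
# Hypothetical helper: parse a character column to numbers, then mean-impute the NAs
parse_impute_mean <- function(x) {
  x <- parse_number(x)
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
# e.g. car$mileage <- parse_impute_mean(car$mileage), applied to the raw character column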
Check the data after data transformation and imputation.
head(car)
## name year selling_price km_driven fuel seller_type transmission
## 1 Maruti 2014 450000 145500 Diesel Individual Manual
## 2 Skoda 2014 370000 120000 Diesel Individual Manual
## 3 Honda 2006 158000 140000 Petrol Individual Manual
## 4 Hyundai 2010 225000 127000 Diesel Individual Manual
## 5 Maruti 2007 130000 120000 Petrol Individual Manual
## 6 Hyundai 2017 440000 45000 Petrol Individual Manual
## owner mileage engine max_power torque seats
## 1 First Owner 23.40 1248 74.00 190.00 5
## 2 Second Owner 21.14 1498 103.52 250.00 5
## 3 Third Owner 17.70 1497 78.00 12.70 5
## 4 First Owner 23.00 1396 90.00 22.40 5
## 5 First Owner 16.10 1298 88.20 11.50 5
## 6 First Owner 20.14 1197 81.86 113.75 5
The data has been transformed into a more suitable form.
Recheck and revisualize the missing values:
sapply(car, function(x) sum(is.na(x)))
## name year selling_price km_driven fuel
## 0 0 0 0 0
## seller_type transmission owner mileage engine
## 0 0 0 0 0
## max_power torque seats
## 0 0 0
missmap(car, legend = TRUE, col = c("red", "blue"))
There are no missing values remaining after cleaning.
Check data type for all variables:
lapply(car, class)
## $name
## [1] "character"
##
## $year
## [1] "integer"
##
## $selling_price
## [1] "integer"
##
## $km_driven
## [1] "integer"
##
## $fuel
## [1] "character"
##
## $seller_type
## [1] "character"
##
## $transmission
## [1] "character"
##
## $owner
## [1] "character"
##
## $mileage
## [1] "numeric"
##
## $engine
## [1] "numeric"
##
## $max_power
## [1] "numeric"
##
## $torque
## [1] "numeric"
##
## $seats
## [1] "integer"
Next, the categorical data will be encoded as part of data preparation before modelling.
Each car name will be encoded as a numerical value from 0 to 31, and the name column will be changed from character to numeric data type.
# Map each brand name to its numeric code (0-31), preserving the original encoding order
brand_levels <- c('Maruti', 'Skoda', 'Honda', 'Hyundai', 'Toyota', 'Ford',
                  'Renault', 'Mahindra', 'Tata', 'Chevrolet', 'Fiat', 'Datsun',
                  'Jeep', 'Mercedes-Benz', 'Mitsubishi', 'Audi', 'Volkswagen',
                  'BMW', 'Nissan', 'Lexus', 'Jaguar', 'Land', 'MG', 'Volvo',
                  'Daewoo', 'Kia', 'Force', 'Ambassador', 'Ashok', 'Isuzu',
                  'Opel', 'Peugeot')
car$name <- match(car$name, brand_levels) - 1
table(car$name)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2448 105 467 1415 488 397 228 772 734 230 47 65 31 54 14 40
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
## 186 120 81 34 71 6 3 67 3 4 6 4 1 5 1 1
The car transmission variable will be encoded to a numerical value, where “Manual” becomes 0 and “Automatic” becomes 1. The transmission variable will be changed from character to numeric data type:
car$transmission <- str_replace(car$transmission, 'Manual', "0")
car$transmission <- str_replace(car$transmission, 'Automatic', "1")
car$transmission <- as.numeric(car$transmission)
table(car$transmission)
##
## 0 1
## 7078 1050
The car owner variable will be encoded to a numerical value, where “First Owner” becomes “0”, “Second Owner” becomes “1”, “Third Owner” becomes “2”, “Fourth & Above Owner” becomes “3” and “Test Drive Car” becomes “4”:
car$owner <- str_replace(car$owner, 'First Owner', "0")
car$owner <- str_replace(car$owner, 'Second Owner', "1")
car$owner <- str_replace(car$owner, 'Third Owner', "2")
car$owner <- str_replace(car$owner, 'Fourth & Above Owner', "3")
car$owner <- str_replace(car$owner, 'Test Drive Car', "4")
car$owner <- as.numeric(car$owner)
table(car$owner)
##
## 0 1 2 3 4
## 5289 2105 555 174 5
The car seller_type variable will be encoded to a numerical value, where “Trustmark Dealer” becomes “0”, “Dealer” becomes “1” and “Individual” becomes “2”:
car$seller_type <- str_replace(car$seller_type, "Trustmark Dealer", "0")
car$seller_type <- str_replace(car$seller_type, "Dealer", "1")
car$seller_type <- str_replace(car$seller_type, "Individual", "2")
car$seller_type <- as.numeric(car$seller_type)
table(car$seller_type)
##
## 0 1 2
## 236 1126 6766
The car fuel variable will be encoded to a numerical value, where “Diesel” becomes “0”, “Petrol” becomes “1”, “CNG” becomes “2” and “LPG” becomes “3”:
car$fuel <- str_replace(car$fuel, 'Diesel', "0")
car$fuel <- str_replace(car$fuel, 'Petrol', "1")
car$fuel <- str_replace(car$fuel, 'CNG', "2")
car$fuel <- str_replace(car$fuel, 'LPG', "3")
car$fuel <- as.numeric(car$fuel)
table(car$fuel)
##
## 0 1 2 3
## 4402 3631 57 38
Check the dataset after encoding and transformation:
head(car)
## name year selling_price km_driven fuel seller_type transmission owner mileage
## 1 0 2014 450000 145500 0 2 0 0 23.40
## 2 1 2014 370000 120000 0 2 0 1 21.14
## 3 2 2006 158000 140000 1 2 0 2 17.70
## 4 3 2010 225000 127000 0 2 0 0 23.00
## 5 0 2007 130000 120000 1 2 0 0 16.10
## 6 3 2017 440000 45000 1 2 0 0 20.14
## engine max_power torque seats
## 1 1248 74.00 190.00 5
## 2 1498 103.52 250.00 5
## 3 1497 78.00 12.70 5
## 4 1396 90.00 22.40 5
## 5 1298 88.20 11.50 5
## 6 1197 81.86 113.75 5
All data is encoded properly.
Specify the non-target variables as categorical and numerical columns, and specify the target variable, in preparation for normalization:
cat_cols <- c('name', 'year', 'fuel', 'seller_type', 'transmission', 'owner', 'seats')
num_cols <- c('km_driven', 'mileage', 'engine', 'max_power', 'torque')
target_cols <- c('selling_price')
Normalize the data for the non-target variables; the target variable is not scaled, so the selling price remains easy to interpret.
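With method = "range", caret's preProcess applies min-max normalization, mapping each selected variable onto the [0, 1] interval:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$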
preproc <- preProcess(car[, c(cat_cols,num_cols)], method=c("range"))
car_scaled <- predict(preproc, car)
head(car_scaled)
## name year selling_price km_driven fuel seller_type
## 1 0.00000000 0.8378378 450000 0.06164021 0.0000000 1
## 2 0.03225806 0.8378378 370000 0.05083721 0.0000000 1
## 3 0.06451613 0.6216216 158000 0.05931015 0.3333333 1
## 4 0.09677419 0.7297297 225000 0.05380274 0.0000000 1
## 5 0.00000000 0.6486486 130000 0.05083721 0.3333333 1
## 6 0.09677419 0.9189189 440000 0.01906369 0.3333333 1
## transmission owner mileage engine max_power torque seats
## 1 0 0.00 0.5571429 0.2093960 0.18500 0.236164244 0.25
## 2 0 0.25 0.5033333 0.2932886 0.25880 0.312675338 0.25
## 3 0 0.50 0.4214286 0.2929530 0.19500 0.010073961 0.25
## 4 0 0.00 0.5476190 0.2590604 0.22500 0.022443254 0.25
## 5 0 0.00 0.3833333 0.2261745 0.22050 0.008543739 0.25
## 6 0 0.00 0.4795238 0.1922819 0.20465 0.138931395 0.25
The data has been successfully processed, as shown above.
Split the data into an 80% training set and a 20% testing set:
trainIndex <- createDataPartition(car_scaled$selling_price, p = 0.8,
list = FALSE,
times = 1)
train <- car_scaled[ trainIndex,]
test <- car_scaled[-trainIndex,]
dim(train)
## [1] 6504 13
dim(test)
## [1] 1624 13
Five regression algorithms will be used for modelling, and all of them will be evaluated and compared to find the best model. The RMSE score will be used as the evaluation metric. The comparison of the models is discussed in the evaluation & conclusion section.
Linear regression training:
set.seed(123)
lr <- lm(selling_price ~ ., data = train)
summary(lr)
##
## Call:
## lm(formula = selling_price ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2403261 -208940 -1909 160224 4188508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1664598 82982 -20.060 < 0.0000000000000002 ***
## name 744425 41694 17.854 < 0.0000000000000002 ***
## year 1093985 72699 15.048 < 0.0000000000000002 ***
## km_driven -3063252 322330 -9.503 < 0.0000000000000002 ***
## fuel 95788 48452 1.977 0.0481 *
## seller_type -216807 27028 -8.021 0.00000000000000123 ***
## transmission 425377 21767 19.543 < 0.0000000000000002 ***
## owner -19877 36090 -0.551 0.5818
## mileage 824411 93647 8.803 < 0.0000000000000002 ***
## engine 163451 75656 2.160 0.0308 *
## max_power 4519394 138815 32.557 < 0.0000000000000002 ***
## torque 402575 90972 4.425 0.00000978984490370 ***
## seats -160435 104570 -1.534 0.1250
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 450800 on 6491 degrees of freedom
## Multiple R-squared: 0.6877, Adjusted R-squared: 0.6871
## F-statistic: 1191 on 12 and 6491 DF, p-value: < 0.00000000000000022
The summary of the Linear Regression model above gives statistical information about the fitted model.
Linear regression testing:
pred_lr <- predict(lr, newdata = test)
error_lr <- test$selling_price - pred_lr
eval_lr <- cbind(test$selling_price, pred_lr)
colnames(eval_lr) <- c("True Price", "Predicted Price")
eval_lr <- as.data.frame(eval_lr)
head(eval_lr)
## True Price Predicted Price
## 6 440000 536920.9
## 10 200000 150437.9
## 11 500000 815158.6
## 18 500000 453932.6
## 24 600000 402256.6
## 25 500000 724944.7
From the model testing, a sample of true and predicted values is shown above to give a rough comparison. As shown, the differences between true and predicted values for the Linear Regression model are quite large, and some predicted values are negative even though all selling prices should be positive.
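As a quick sanity check, the number of non-positive predictions could be counted; this check is a suggestion and its result is not part of the original output:
sum(pred_lr <= 0)  # count of predicted selling prices at or below zero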
Scatter plot of actual values against predicted values for Linear Regression.
RMSE_lr <- round(sqrt(mean(error_lr^2)),2)
RMSE_lr
## [1] 451026.5
plot(test$selling_price,pred_lr, main="Actual Value vs Predicted Value for Linear Regression Model", col = c("blue","red"), xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
abline(a = 0, b =1)
The RMSE value is the Root Mean Squared Error of the predicted values against the true values. A higher RMSE indicates larger errors and therefore lower model performance, while a lower RMSE indicates smaller errors and higher performance. The RMSE for Linear Regression is 451026.51, which is quite high. In the scatter plot, the black diagonal line serves as a benchmark: points closer to the line indicate better model performance. For the Linear Regression model, the points are dispersed away from the diagonal, indicating low performance. In addition, some predicted values are zero or negative, which is impossible for a selling price. Therefore, the Linear Regression model performs the worst.
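For reference, the RMSE computed above corresponds to the formula

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ is the true selling price, $\hat{y}_i$ is the predicted selling price and $n$ is the number of test observations.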
Random Forest Regression training.
set.seed(123)
rf <- randomForest(selling_price~.,data = train)
rf
##
## Call:
## randomForest(formula = selling_price ~ ., data = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 21601607350
## % Var explained: 96.67
Random Forest Regression testing.
pred_rf <- predict(rf, test)
error_rf <- test$selling_price - pred_rf
eval_rf <- cbind(test$selling_price, pred_rf)
colnames(eval_rf) <- c("True Price", "Predicted Price")
eval_rf <- as.data.frame(eval_rf)
head(eval_rf)
## True Price Predicted Price
## 6 440000 502552.8
## 10 200000 256983.0
## 11 500000 574281.4
## 18 500000 336490.5
## 24 600000 344239.1
## 25 500000 497701.3
Roughly, in the Random Forest model, the differences between true and predicted values are relatively small, which suggests that Random Forest performs better than the Linear Regression model.
Scatter plot for actual values against predicted values for Random forest Regression.
RMSE_rf <- round(sqrt(mean(error_rf^2)),2)
RMSE_rf
## [1] 125516
plot(test$selling_price,pred_rf, main="Actual Value vs Predicted Value for Random Forest Model", col = c("blue","red"), xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
abline(a = 0, b =1)
The RMSE value for the Random Forest model is 125515.97, which is much lower than that of the Linear Regression model. In the scatter plot, most points are reasonably close to the diagonal line, with only a few far from it, indicating good model performance. Based on the RMSE value and the scatter plot, the Random Forest model performs well.
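As an aside, randomForest also exposes its own variable importance measure, which could be used to cross-check the feature ranking obtained later from the Gradient Boost model. A sketch, not run in the original analysis:
importance(rf)  # increase in node purity attributable to each variable
varImpPlot(rf)  # plot of the same importance values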
K-Nearest Neighbor training.
set.seed(123)
knn = knnreg(selling_price~.,data = train)
knn
## 5-nearest neighbor regression model
K-Nearest Neighbor testing.
pred_knn <- predict(knn, newdata = test)
error_knn <- test$selling_price - pred_knn
eval_knn <- cbind(test$selling_price, pred_knn)
colnames(eval_knn) <- c("True Price", "Predicted Price")
eval_knn <- as.data.frame(eval_knn)
head(eval_knn)
## True Price Predicted Price
## 1 440000 521999.8
## 2 200000 257999.8
## 3 500000 507000.0
## 4 500000 327200.0
## 5 600000 295800.0
## 6 500000 477000.0
Roughly, in the KNN model, the differences between true and predicted values are noticeable. This suggests that the KNN model will perform worse than the Random Forest model.
Scatter plot of actual values against predicted values for K-Nearest Neighbor.
RMSE_knn <- round(sqrt(mean(error_knn^2)),2)
RMSE_knn
## [1] 186713.2
plot(test$selling_price,pred_knn, main="Actual Value vs Predicted Value for K-Nearest Neighbor Model", col = c("blue","red"), xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
abline(a = 0, b =1)
The RMSE value for the KNN model is 186713.17, which is considerably higher than that of the Random Forest model. In the scatter plot, most points are reasonably close to the diagonal line, with some points far from it, indicating reasonable performance. However, based on the RMSE and the overall fit of the points, the Random Forest model still performs far better than the KNN model.
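knnreg defaults to k = 5 neighbors, and performance may vary with k. A sketch for exploring a few alternatives, where the candidate values of k are illustrative:
# Try several neighborhood sizes and report the test RMSE for each
for (k in c(3, 5, 7, 9)) {
  m <- knnreg(selling_price ~ ., data = train, k = k)
  p <- predict(m, newdata = test)
  cat("k =", k, "RMSE =", round(sqrt(mean((test$selling_price - p)^2)), 2), "\n")
}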
Support Vector Machine training.
set.seed(123)
svm = svm(selling_price~.,data = train)
svm
##
## Call:
## svm(formula = selling_price ~ ., data = train)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.08333333
## epsilon: 0.1
##
##
## Number of Support Vectors: 2167
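The model above uses the default hyperparameters (cost = 1, gamma ≈ 0.083, epsilon = 0.1). e1071 provides tune.svm for a simple cross-validated grid search; a sketch with illustrative grid values, not run here and potentially slow on 6504 training rows:
# Grid-search cost and gamma by cross-validation (illustrative grid)
svm_tuned <- tune.svm(selling_price ~ ., data = train,
                      cost = c(1, 10, 100), gamma = c(0.01, 0.08, 0.5))
svm_tuned$best.parameters  # combination with the lowest cross-validated error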
Support Vector Machine testing.
pred_svm <- predict(svm, newdata = test)
error_svm <- test$selling_price - pred_svm
eval_svm <- cbind(test$selling_price, pred_svm)
colnames(eval_svm) <- c("True Price", "Predicted Price")
eval_svm <- as.data.frame(eval_svm)
head(eval_svm)
## True Price Predicted Price
## 6 440000 520850.5
## 10 200000 312831.1
## 11 500000 549614.3
## 18 500000 376808.7
## 24 600000 422692.7
## 25 500000 475045.1
Roughly, in the SVM model, the differences between true and predicted values are relatively small, similar to the KNN model. This suggests that the SVM model will perform about the same as the KNN model.
Scatter plot for actual values against predicted values for Support Vector Machine.
RMSE_svm <- round(sqrt(mean(error_svm^2)),2)
RMSE_svm
## [1] 180960.2
plot(test$selling_price,pred_svm, main="Actual Value vs Predicted Value for SVM Model", col = c("blue","red"), xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
abline(a = 0, b =1)
The RMSE value for the SVM model is 180960.23, which is almost the same as the KNN model. In the scatter plot, most points are reasonably close to the diagonal line, with some points far from it, indicating reasonable performance. However, based on the RMSE and the overall fit of the points, the Random Forest model still performs far better than the SVM model.

### Gradient Boost (GB)
Gradient Boost training.
set.seed(123)
gbm <- gbm(
formula = selling_price ~ .,
distribution = "gaussian",
data = train,
n.trees = 6000,
interaction.depth = 3,
shrinkage = 0.1,
cv.folds = 5,
n.cores = NULL, # will use all cores by default
verbose = FALSE
)
gbm
## gbm(formula = selling_price ~ ., distribution = "gaussian", data = train,
## n.trees = 6000, interaction.depth = 3, shrinkage = 0.1, cv.folds = 5,
## verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 6000 iterations were performed.
## The best cross-validation iteration was 4881.
## There were 12 predictors of which 12 had non-zero influence.
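The best cross-validation iteration reported above can also be retrieved programmatically with gbm.perf, which in addition plots the training and cross-validated error curves:
best_iter <- gbm.perf(gbm, method = "cv")  # returns 4881 for this fit and draws the error curves
best_iter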
Gradient Boost testing.
pred_gbm <- predict(gbm, test)
## Using 4881 trees...
error_gbm <- test$selling_price - pred_gbm
eval_gbm <- cbind(test$selling_price, pred_gbm)
colnames(eval_gbm) <- c("True Price", "Predicted Price")
eval_gbm <- as.data.frame(eval_gbm)
head(eval_gbm)
## True Price Predicted Price
## 1 440000 508454.7
## 2 200000 251092.9
## 3 500000 553044.7
## 4 500000 338273.2
## 5 600000 438715.2
## 6 500000 502449.1
RMSE_gbm <- round(sqrt(mean(error_gbm^2)),2)
RMSE_gbm
## [1] 115704.1
Roughly, in the Gradient Boost model, the differences between true and predicted values are relatively small, similar to the Random Forest model. This suggests that the Gradient Boost model will perform about the same as the Random Forest model.
Scatter plot for actual values against predicted values for Gradient Boost.
plot(test$selling_price,pred_gbm, main="Actual Value vs Predicted Value for Gradient Boost Model", col = c("blue","red"), xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
abline(a = 0, b =1)
The RMSE value for the Gradient Boost model is 115704.11, which is relatively low and close to the Random Forest model. In the scatter plot, nearly all points are close to the diagonal line, with almost no points far from it, indicating outstanding model performance. Based on the RMSE and the overall fit of the points, the Gradient Boost model performs slightly better than the Random Forest model, with a small difference in RMSE score.
The RMSE score of each model is collected into a table for comparison. The RMSE scores for all models are shown below.
Model <- c('Linear Regression','Random Forest','Gradient Boosting', 'K-Nearest Neighbour', "Support Vector Machine")
RMSE <- c(RMSE_lr,RMSE_rf,RMSE_gbm,RMSE_knn,RMSE_svm)
res <- data.frame(Model,RMSE)
res %>% arrange(RMSE,descending =TRUE)
## Model RMSE
## 1 Gradient Boosting 115704.1
## 2 Random Forest 125516.0
## 3 Support Vector Machine 180960.2
## 4 K-Nearest Neighbour 186713.2
## 5 Linear Regression 451026.5
ggplot(data = res, aes(x= Model, y = RMSE, fill = Model)) + geom_bar(stat="identity") + theme(axis.text.x = element_blank(), axis.ticks = element_blank())
The Gradient Boost model has the lowest RMSE score, 115704.11, only slightly below the Random Forest model. Therefore, the Gradient Boost model is the best model, while the Linear Regression model has the highest RMSE score, 451026.51, making it the worst model.
Using the Gradient Boosting model, feature importance can be calculated to find which features or variables have the greatest influence on the price prediction.
summary(
gbm,
cBars = 10,
method = relative.influence, las = 2
)
## var rel.inf
## max_power max_power 63.2215544
## year year 15.0016012
## torque torque 7.8205921
## km_driven km_driven 5.6022714
## engine engine 2.8279422
## name name 2.5183787
## mileage mileage 1.7593982
## transmission transmission 0.5422712
## seller_type seller_type 0.2384438
## owner owner 0.2040022
## seats seats 0.1503491
## fuel fuel 0.1131957
From the feature importance, the max_power variable has the greatest influence, accounting for about 63% of the relative influence in predicting the selling price with this dataset and the Gradient Boost model. The features ranked from highest to lowest importance are max_power, year, torque, km_driven, engine, name, mileage, transmission, seller_type, owner, seats and fuel.
The Gradient Boost and Random Forest models show only a small difference in RMSE value, and either could be considered the best model. However, this project does not fully utilize machine learning capabilities such as hyperparameter tuning, which could not be run due to technical limitations.
Hyperparameter tuning script for Gradient Boost model:
train_control <- trainControl(method = "cv",
                              number = 10,
                              returnResamp = "all", # use "all" to return all cross-validated metrics
                              search = "grid")
tuneGrid <- expand.grid(
  n.trees = c(5000, 10000),
  interaction.depth = c(6, 13),
  shrinkage = c(0.01, 0.001),
  n.minobsinnode = c(5, 10)
)
gbm_hpt <- train(selling_price ~ .,
                 data = train,
                 method = "gbm",
                 tuneGrid = tuneGrid,
                 trControl = train_control,
                 verbose = FALSE)
pred_gbm_hpt <- predict(gbm_hpt, test)
error_gbm_hpt <- test$selling_price - pred_gbm_hpt
RMSE_gbm_hpt <- round(sqrt(mean(error_gbm_hpt^2)),2)
RMSE_gbm_hpt
plot(test$selling_price,pred_gbm_hpt, main="Actual Value vs Predicted Value for Tuned Gradient Boost Model", col = c("blue","red"), xlab = "Actual Selling Price", ylab = "Predicted Selling Price")