| Name | Matric Number |
|---|---|
| LEE GUANG SHEN | 22052269 |
| TIMOTHY CHEN XIAN YII | 22056170 |
| LIM KHAI FUNG | 22058576 |
| LEONG MIN QI | 22077164 |
| MANG YU JIE | 22064556 |
In modern society, cars are gradually becoming a necessity for every adult. However, not everyone can afford a brand new car, so demand for used cars has risen considerably as people look for a more economical option. The retail price of a used car is hard to estimate, as there are no concrete guidelines for setting it. More importantly, the price of a used car depends on supply and demand and is highly dependent on the car's characteristics. Aspects such as the car's design, condition and mileage greatly affect its retail price.
Thus, accurate prediction of used car retail prices benefits both sellers and buyers. The seller can use the model's output to attract potential buyers and maximize profit, while the buyer can easily obtain a price estimate for a car. Determining the right price for a used car is difficult because numerous factors must be considered; hence, this study is carried out.
Pricing a used car is a dilemma for most used car dealers and their buyers, as it is difficult to decide how much a pre-owned car is worth. Hence, this project aims to explore the features that help estimate the price of a used car and to use those features to build machine learning models that both predict the price of the car and classify its price category.
library(lubridate)
library(dplyr)
library(tidyr)
library(stringr)
library(ggcorrplot)
library(patchwork)
library(randomForest)
library(caret)
library(corrplot)
library(gbm)
library(e1071)
library(stats)
library(arules)
library(pROC)
The dataset used for this project is retrieved from the Car Price Prediction Challenge, made available by Deep Contractor on Kaggle.
df <- read.csv("https://raw.githubusercontent.com/mangyj922/car-price-prediction/main/car_price_prediction.csv")
str(df)
## 'data.frame': 19237 obs. of 18 variables:
## $ ID : int 45654403 44731507 45774419 45769185 45809263 45802912 45656768 45816158 45641395 45756839 ...
## $ Price : int 13328 16621 8467 3607 11726 39493 1803 549 1098 26657 ...
## $ Levy : chr "1399" "1018" "-" "862" ...
## $ Manufacturer : chr "LEXUS" "CHEVROLET" "HONDA" "FORD" ...
## $ Model : chr "RX 450" "Equinox" "FIT" "Escape" ...
## $ Prod..year : int 2010 2011 2006 2011 2014 2016 2010 2013 2014 2007 ...
## $ Category : chr "Jeep" "Jeep" "Hatchback" "Jeep" ...
## $ Leather.interior: chr "Yes" "No" "No" "Yes" ...
## $ Fuel.type : chr "Hybrid" "Petrol" "Petrol" "Hybrid" ...
## $ Engine.volume : chr "3.5" "3" "1.3" "2.5" ...
## $ Mileage : chr "186005 km" "192000 km" "200000 km" "168966 km" ...
## $ Cylinders : num 6 6 4 4 4 4 4 4 4 6 ...
## $ Gear.box.type : chr "Automatic" "Tiptronic" "Variator" "Automatic" ...
## $ Drive.wheels : chr "4x4" "4x4" "Front" "4x4" ...
## $ Doors : chr "04-May" "04-May" "04-May" "04-May" ...
## $ Wheel : chr "Left wheel" "Left wheel" "Right-hand drive" "Left wheel" ...
## $ Color : chr "Silver" "Black" "Black" "White" ...
## $ Airbags : int 12 8 2 0 4 4 12 12 12 12 ...
From the description above, we can see that the dataset comprises 19,237 rows (observations) and 18 columns (features).
The content and structure of the dataset are described in the table below:
| Column | Type | Description |
|---|---|---|
| ID | integer | ID of used car |
| Price | integer | The price of used car |
| Levy | character | The tax to be paid |
| Manufacturer | character | The brand of used car |
| Model | character | The model name of used car |
| Prod..year | integer | The production year of used car |
| Category | character | The category of used car |
| Leather.interior | character | Whether the car has a leather interior (Yes/No) |
| Fuel.type | character | The fuel type of used car |
| Engine.volume | character | The engine volume of used car |
| Mileage | character | The mileage of used car |
| Cylinders | numeric | The number of engine cylinders |
| Gear.box.type | character | The gear box type |
| Drive.wheels | character | The drive-wheel configuration |
| Doors | character | The number of doors |
| Wheel | character | The placement of the steering wheel |
| Color | character | The color of used car |
| Airbags | integer | The number of airbags |
The summary of the dataset can be found below:
summary(df)
## ID Price Levy Manufacturer
## Min. :20746880 Min. : 1 Length:19237 Length:19237
## 1st Qu.:45698374 1st Qu.: 5331 Class :character Class :character
## Median :45772308 Median : 13172 Mode :character Mode :character
## Mean :45576536 Mean : 18556
## 3rd Qu.:45802036 3rd Qu.: 22075
## Max. :45816654 Max. :26307500
## Model Prod..year Category Leather.interior
## Length:19237 Min. :1939 Length:19237 Length:19237
## Class :character 1st Qu.:2009 Class :character Class :character
## Mode :character Median :2012 Mode :character Mode :character
## Mean :2011
## 3rd Qu.:2015
## Max. :2020
## Fuel.type Engine.volume Mileage Cylinders
## Length:19237 Length:19237 Length:19237 Min. : 1.000
## Class :character Class :character Class :character 1st Qu.: 4.000
## Mode :character Mode :character Mode :character Median : 4.000
## Mean : 4.583
## 3rd Qu.: 4.000
## Max. :16.000
## Gear.box.type Drive.wheels Doors Wheel
## Length:19237 Length:19237 Length:19237 Length:19237
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Color Airbags
## Length:19237 Min. : 0.000
## Class :character 1st Qu.: 4.000
## Mode :character Median : 6.000
## Mean : 6.583
## 3rd Qu.:12.000
## Max. :16.000
The summary above gives a quick overview of the basic statistics of the dataset: there are 5 quantitative features and 13 qualitative features. Interestingly, the ranges between the minimum and maximum values of the numerical features are very large.
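As a quick check (our own addition, not part of the original analysis), the breakdown of column types can be tabulated directly:
# Tabulate the class of each column: integer/numeric columns are quantitative,
# character columns are qualitative
table(sapply(df, class))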
Data cleaning is essential in the data preprocessing phase to ensure the quality, consistency and integrity of the data. It involves identifying and handling errors, missing values, outliers and other issues that may affect the accuracy and reliability of the data.
First, we handle missing data by checking whether there are any null values.
any(is.na(df))
## [1] FALSE
There are no null values within the dataset.
Secondly, we will drop any duplicate records in the dataset.
df <- df %>% distinct()
dim(df)
## [1] 18924 18
After dropping duplicate rows, 18,924 observations remain, which means the dataset contained 313 duplicated observations.
The "ID" column, a unique identifier, and the "Color" column are dropped from the dataset, as neither is used for prediction.
df <- df[,!names(df) %in% c("ID", "Color")]
dim(df)
## [1] 18924 16
After dropping the “ID” and “Color” columns, 16 features are left.
The feature names are renamed to ease the process of data analysis.
colnames(df) <- c('price','levy','manufacturer','model','prod_year','category','leather_seat','fuel_type','engine_volume','mileage','cylinders','gearbox_type','drive_wheels','doors','wheel','airbags')
In the column "levy", some records show "-". These records will be imputed with 0 and the column will be converted to integer.
Besides, the column "doors" contains inconsistent data: some records appear in date format (e.g. "04-May" instead of 4). These records will be corrected and the whole column converted to integer.
Furthermore, the column "mileage" contains the distance unit "km". The unit adds no value to the dataset, so "km" is removed and the column is represented as an integer.
# Impute "-" in levy with 0 and convert to integer
df$levy <- sub('-','0',df$levy)
df$levy <- as.integer(df$levy)
# Replace the date-formatted door counts with the intended numbers and convert to integer
df$doors <- sub('04-May','4',df$doors)
df$doors <- sub('02-Mar','2',df$doors)
df$doors <- sub('>5','5',df$doors)
df$doors <- as.integer(df$doors)
# Strip the " km" unit from mileage and convert to integer
df$mileage <- sub(' km','',df$mileage)
df$mileage <- as.integer(df$mileage)
As the column "engine_volume" contains both turbo and non-turbo entries, the turbo information is separated out into its own column.
# Split engine_volume on the space: the first part is the numeric volume,
# the second is either empty or "Turbo"
df[c('engine_volume', 'turbo')] <- str_split_fixed(df$engine_volume, ' ', 2)
# Recode the turbo flag: empty entries become "No", entries containing "Turbo" become "Yes"
df$turbo <- sub('','No',df$turbo)
df$turbo <- gsub('.*Turbo*', "Yes", df$turbo)
df$engine_volume <- as.numeric(df$engine_volume)
In R, factors are generally easier to work with than raw character vectors, and many modelling functions expect them. As a result, the columns containing character data are converted to factors.
df$manufacturer <- as.factor(df$manufacturer)
df$model <- as.factor(df$model)
df$category <- as.factor(df$category)
df$leather_seat <- as.factor(df$leather_seat)
df$fuel_type <- as.factor(df$fuel_type)
df$gearbox_type <- as.factor(df$gearbox_type)
df$drive_wheels <- as.factor(df$drive_wheels)
df$wheel <- as.factor(df$wheel)
df$turbo <- as.factor(df$turbo)
The overview of the dataset after the data cleaning process is shown below.
summary(df)
## price levy manufacturer model
## Min. : 1 Min. : 0.0 HYUNDAI :3729 Prius : 1069
## 1st Qu.: 5331 1st Qu.: 0.0 TOYOTA :3606 Sonata : 1067
## Median : 13172 Median : 642.0 MERCEDES-BENZ:2043 Camry : 929
## Mean : 18587 Mean : 632.9 FORD :1088 Elantra : 910
## 3rd Qu.: 22063 3rd Qu.: 917.0 CHEVROLET :1047 E 350 : 534
## Max. :26307500 Max. :11714.0 BMW :1036 Santa FE: 527
## (Other) :6375 (Other) :13888
## prod_year category leather_seat fuel_type
## Min. :1939 Sedan :8600 No : 5193 CNG : 469
## 1st Qu.:2009 Jeep :5378 Yes:13731 Diesel :4001
## Median :2012 Hatchback:2799 Hybrid :3539
## Mean :2011 Minivan : 633 Hydrogen : 1
## 3rd Qu.:2015 Coupe : 528 LPG : 885
## Max. :2020 Universal: 361 Petrol :9944
## (Other) : 625 Plug-in Hybrid: 85
## engine_volume mileage cylinders gearbox_type
## Min. : 0.000 Min. :0.000e+00 Min. : 1.00 Automatic:13282
## 1st Qu.: 1.800 1st Qu.:7.020e+04 1st Qu.: 4.00 Manual : 1844
## Median : 2.000 Median :1.264e+05 Median : 4.00 Tiptronic: 3065
## Mean : 2.306 Mean :1.555e+06 Mean : 4.58 Variator : 733
## 3rd Qu.: 2.500 3rd Qu.:1.891e+05 3rd Qu.: 4.00
## Max. :20.000 Max. :2.147e+09 Max. :16.00
##
## drive_wheels doors wheel airbags
## 4x4 : 3969 Min. :2.000 Left wheel :17471 Min. : 0.000
## Front:12695 1st Qu.:4.000 Right-hand drive: 1453 1st Qu.: 4.000
## Rear : 2260 Median :4.000 Median : 6.000
## Mean :3.925 Mean : 6.568
## 3rd Qu.:4.000 3rd Qu.:12.000
## Max. :5.000 Max. :16.000
##
## turbo
## No :17032
## Yes: 1892
##
##
##
##
##
From the printed results above, the ranges between the minimum and maximum values of price, levy and mileage are quite large, so there is a high possibility that these features contain outliers.
Therefore, box plots are then used to identify the potential outliers visually.
# Note: par(mfrow = ...) has no effect on ggplot2 output, so patchwork is used to stack the plots
(ggplot(data = df, aes(x = price)) + geom_boxplot(fill="yellow")) /
  (ggplot(data = df, aes(x = levy)) + geom_boxplot(fill="yellow")) /
  (ggplot(data = df, aes(x = mileage)) + geom_boxplot(fill="yellow")) /
  (ggplot(data = df, aes(x = airbags)) + geom_boxplot(fill="yellow"))
Based on the box plots above, outliers are visible in price, levy and mileage, where points lie far beyond the upper whisker. Hence they need to be removed.
The interquartile range (IQR) method is used to identify and remove outliers from the dataset: values more than 1.5 times the IQR below the 25th percentile or above the 75th percentile are removed.
Q1 <- quantile(df$price, .25)
Q3 <- quantile(df$price, .75)
IQR <- IQR(df$price)
outlier_step <- IQR * 1.5
lower_bound <- Q1-outlier_step
upper_bound <- Q3+outlier_step
final_df <- df %>% filter(price >= lower_bound & price <= upper_bound)
Q1 <- quantile(final_df$levy, .25)
Q3 <- quantile(final_df$levy, .75)
IQR <- IQR(final_df$levy)
outlier_step <- IQR * 1.5
lower_bound <- Q1-outlier_step
upper_bound <- Q3+outlier_step
final_df <- final_df %>% filter(levy >= lower_bound & levy <= upper_bound)
Q1 <- quantile(final_df$mileage, .25)
Q3 <- quantile(final_df$mileage, .75)
IQR <- IQR(final_df$mileage)
outlier_step <- IQR * 1.5
lower_bound <- Q1-outlier_step
upper_bound <- Q3+outlier_step
final_df <- final_df %>% filter(mileage >= lower_bound & mileage <= upper_bound)
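The three filtering blocks above repeat the same rule. As a minimal sketch (the helper name remove_iqr_outliers is our own and not part of the original code), the logic could be factored into a reusable function:
# Reusable IQR filter: keep rows within [Q1 - k*IQR, Q3 + k*IQR] for a given column
remove_iqr_outliers <- function(data, col, k = 1.5) {
  q1 <- quantile(data[[col]], 0.25)
  q3 <- quantile(data[[col]], 0.75)
  step <- (q3 - q1) * k
  data %>% filter(.data[[col]] >= q1 - step & .data[[col]] <= q3 + step)
}
# Equivalent to the three blocks above:
# final_df <- df %>%
#   remove_iqr_outliers("price") %>%
#   remove_iqr_outliers("levy") %>%
#   remove_iqr_outliers("mileage")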
summary(final_df)
## price levy manufacturer model
## Min. : 1 Min. : 0.0 TOYOTA :3317 Prius : 990
## 1st Qu.: 5018 1st Qu.: 0.0 HYUNDAI :3162 Elantra: 901
## Median :12544 Median : 640.0 MERCEDES-BENZ:1767 Camry : 867
## Mean :14291 Mean : 597.2 CHEVROLET :1015 Sonata : 752
## 3rd Qu.:20385 3rd Qu.: 891.0 FORD :1002 E 350 : 494
## Max. :47120 Max. :2225.0 BMW : 926 FIT : 437
## (Other) :5879 (Other):12627
## prod_year category leather_seat fuel_type
## Min. :1939 Sedan :7842 No : 4981 CNG : 440
## 1st Qu.:2009 Jeep :4566 Yes:12087 Diesel :3548
## Median :2012 Hatchback:2711 Hybrid :3336
## Mean :2011 Minivan : 604 Hydrogen : 1
## 3rd Qu.:2014 Coupe : 462 LPG : 622
## Max. :2020 Universal: 328 Petrol :9038
## (Other) : 555 Plug-in Hybrid: 83
## engine_volume mileage cylinders gearbox_type
## Min. :0.000 Min. : 0 Min. : 1.000 Automatic:11973
## 1st Qu.:1.700 1st Qu.: 71904 1st Qu.: 4.000 Manual : 1710
## Median :2.000 Median :126292 Median : 4.000 Tiptronic: 2674
## Mean :2.256 Mean :132143 Mean : 4.527 Variator : 711
## 3rd Qu.:2.500 3rd Qu.:182625 3rd Qu.: 4.000
## Max. :6.300 Max. :364523 Max. :16.000
##
## drive_wheels doors wheel airbags
## 4x4 : 3417 Min. :2.000 Left wheel :15670 Min. : 0.000
## Front:11666 1st Qu.:4.000 Right-hand drive: 1398 1st Qu.: 4.000
## Rear : 1985 Median :4.000 Median : 6.000
## Mean :3.924 Mean : 6.585
## 3rd Qu.:4.000 3rd Qu.:12.000
## Max. :5.000 Max. :16.000
##
## turbo
## No :15459
## Yes: 1609
##
##
##
##
##
After cleaning the dataset, exploratory data analysis will then be performed.
EDA is the process to examine and understand the dataset thoroughly
to gain insights, identify patterns and uncover relationships between
variables.
The purpose of the EDA is to understand how each feature relates to the price of a used car.
Correlation analysis is used to measure the strength and direction of the relationship between variables.
Correlation coefficients range from -1 to 1:

- 0: no relationship between the variables at all
- 1: perfect positive correlation between the variables
- -1: perfect negative correlation between the variables

However, a limitation of correlation analysis is that it does not establish causation, even though it provides insights into the relationships between variables.
subset_df <- final_df %>% select_if(is.numeric) %>% select(-c("price"))
corr <- cor(subset_df)
ggcorrplot(corr, type="lower", lab=TRUE, ggtheme = ggplot2::theme_gray, colors = c("#636EFA", "white", "#EF553B"))
From the plot above, the highest positive correlation is between engine_volume and cylinders, at 0.78.
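As a quick numerical check (our own addition), the coefficient can also be computed directly:
# Pairwise correlation between engine volume and number of cylinders
round(cor(final_df$engine_volume, final_df$cylinders), 2)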
# relationship between prod_year and price
p1 <- final_df %>% ggplot(aes(y=prod_year, x=price, group=1)) + geom_boxplot(fill="#636EFA") + coord_flip()
p13 <- final_df %>% ggplot(aes(x=prod_year)) + geom_histogram(aes(y=after_stat(density)), fill="#636EFA", bins=30) + geom_density(alpha=.2, fill="#636EFA")
p2 <- final_df %>% ggplot(aes(y=prod_year, x=price)) + geom_violin(fill="#636EFA") + coord_flip()
# relationship between engine_volume and price
p3 <- final_df %>% ggplot(aes(y=engine_volume, x=price, group=1)) + geom_boxplot(fill="#EF553B") + coord_flip()
p14 <- final_df %>% ggplot(aes(x=engine_volume)) + geom_histogram(aes(y=after_stat(density)), fill="#EF553B", bins=30) + geom_density(alpha=.2, fill="#EF553B")
p4 <- final_df %>% ggplot(aes(y=engine_volume, x=price)) + geom_violin(fill="#EF553B") + coord_flip()
# relationship between mileage and price
p5 <- final_df %>% ggplot(aes(y=mileage, x=price, group=1)) + geom_boxplot(fill="#636EFA") + coord_flip()
p15 <- final_df %>% ggplot(aes(x=mileage)) + geom_histogram(aes(y=after_stat(density)), fill="#636EFA", bins=30) + geom_density(alpha=.2, fill="#636EFA")
p6 <- final_df %>% ggplot(aes(y=mileage, x=price)) + geom_violin(fill="#636EFA") + coord_flip()
# relationship between cylinders and price
p7 <- final_df %>% ggplot(aes(y=cylinders, x=price, group=1)) + geom_boxplot(fill="#EF553B") + coord_flip()
p16 <- final_df %>% ggplot(aes(x=cylinders)) + geom_histogram(aes(y=after_stat(density)), fill="#EF553B", bins=30) + geom_density(alpha=.2, fill="#EF553B")
p8 <- final_df %>% ggplot(aes(y=cylinders, x=price)) + geom_violin(fill="#EF553B") + coord_flip()
# relationship between doors and price
p9 <- final_df %>% ggplot(aes(y=doors, x=price, group=1)) + geom_boxplot(fill="#636EFA") + coord_flip()
p17 <- final_df %>% ggplot(aes(x=doors)) + geom_histogram(aes(y=after_stat(density)), fill="#636EFA", bins=30) + geom_density(alpha=.2, fill="#636EFA")
p10 <- final_df %>% ggplot(aes(y=doors, x=price)) + geom_violin(fill="#636EFA") + coord_flip()
# relationship between airbags and price
p11 <- final_df %>% ggplot(aes(y=airbags, x=price, group=1)) + geom_boxplot(fill="#EF553B") + coord_flip()
p18 <- final_df %>% ggplot(aes(x=airbags)) + geom_histogram(aes(y=after_stat(density)), fill="#EF553B", bins=30) + geom_density(alpha=.2, fill="#EF553B")
p12 <- final_df %>% ggplot(aes(y=airbags, x=price)) + geom_violin(fill="#EF553B") + coord_flip()
(p1 + p13 + p2)
From the plots above, the more recent the production year of the used car, the higher the price range. Most of the used cars in the data were produced fairly recently.
(p3 + p14 + p4)
Most of the used cars in the dataset have an engine volume of 2.0 litres. Moreover, 2.0-litre engines span a large price range, covering both very cheap and very expensive used cars, whereas other engine volumes have fairly consistent prices.
(p5 + p15 + p6)
As shown in the figures above, most used cars in the dataset have a mileage of less than 200,000 km. Cars with more than 200,000 km have a fairly consistent price range.
(p7 + p16 + p8)
Most cars in the dataset have 4
cylinders.
(p9 + p17 + p10)
The dataset mostly consists of used cars with 4
doors.
(p11 + p18 + p12)
Used cars with 4 or 12 airbags have a large price range, while others have fairly consistent price ranges.
# Relationship of manufacturer and price
pp1 <- final_df %>% ggplot(aes(x=price, y=reorder(manufacturer, price))) +
geom_boxplot(fill="#636EFA") +
labs(y="manufacturer", x="price", title="Relationship between manufacturer and price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pp1
In general, HUMMER and SSANGYONG have the highest median used car prices among all manufacturers, whereas ROLLS-ROYCE has the lowest resale value.
# Relationship of category and price
pp2 <- final_df %>% ggplot(aes(x=reorder(category, price), y=price)) +
geom_boxplot(fill="#EF553B") +
labs(y="price", x="category", title="Relationship between category and price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pp2
The Universal category has the highest variability in price.
# Relationship of leather_seat and price
pp3 <- final_df %>% ggplot(aes(x=price, y=reorder(leather_seat, price))) +
geom_boxplot(fill="#00CC96") +
labs(y="leather_seat", x="price", title="Relationship between leather_seat and price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pp3
From the figure above, used cars with a leather interior tend to fetch higher prices.
# Relationship of fuel_type and price
pp4 <- final_df %>% ggplot(aes(x=reorder(fuel_type, price), y=price)) +
geom_boxplot(fill="#E5FFCC") +
labs(y="price", x="fuel_type", title="Relationship between fuel_type and price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pp4
Looking at the relationship between fuel_type and price, plug-in hybrid cars have the highest median price, followed by hydrogen and then diesel.
# Relationship of gearbox_type and price
pp5 <- final_df %>% ggplot(aes(x=reorder(gearbox_type, price), y=price)) +
geom_boxplot(fill="#99FF33") +
labs(y="price", x="gearbox_type", title="Relationship between gearbox_type and price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pp5
Cars with a Tiptronic gearbox have the highest median price and manual cars the lowest. However, the automatic gearbox shows the largest price range among the gearbox types.
# Relationship of drive_wheels and price
pp6 <- final_df %>% ggplot(aes(x=reorder(drive_wheels, price), y=price)) +
geom_boxplot(fill="#FF9933") +
labs(y="price", x="drive_wheels", title="Relationship between drive_wheels and price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pp6
4x4 used cars fetch a lower price, whereas front-wheel-drive cars fetch the highest among the drive_wheels categories.
# Relationship of wheel and price
pp7 <- final_df %>% ggplot(aes(x=price, y=reorder(wheel, price))) +
geom_boxplot(fill="#00CC66") +
labs(y="wheel", x="price", title="Relationship between wheel and price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pp7
Left-hand-drive cars ("Left wheel") fetch higher prices. Hence, the target buyers are likely in countries that follow right-hand traffic.
# Relationship of turbo and price
pp8 <- final_df %>% ggplot(aes(x=price, y=reorder(turbo, price))) +
geom_boxplot(fill="#00CC66") +
labs(y="turbo", x="price", title="Relationship between turbo and price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pp8
Used cars with a turbo command higher prices than non-turbo cars.
Data splitting is important to ensure model effectiveness, prevent over-fitting and evaluate performance on unseen data.
A 70-30 split aims to strike a balance between providing enough data for effective model training and reserving a separate portion for unbiased model evaluation.
set.seed(7004)
# Split the data into X (features) and Y (target)
X <- final_df[, !(colnames(final_df) %in% c("price"))]
Y <- final_df[,"price"]
# Generate random indices for the training set
train_indices <- sample(nrow(final_df), nrow(final_df) * 0.7)
# Create the training set
train_data <- final_df[train_indices, ]
# Create the test set
test_data <- final_df[-train_indices, ]
train_data consists of 70% of the data, randomly sampled from the original dataset; it is used to train the machine learning models, which learn patterns and relationships in the data to make predictions or classifications.
test_data consists of the remaining 30% and is used to evaluate the performance of the trained models, serving as a benchmark for how well they generalize to new, unseen data.
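For reference, the same 70-30 split could also be produced with caret's createDataPartition(), which groups a numeric target into quantiles before sampling so that the price distribution is preserved in both sets. This is a sketch with our own object names (idx, train_alt, test_alt), not part of the original pipeline:
# Alternative 70-30 split using caret (already loaded)
set.seed(7004)
idx <- createDataPartition(final_df$price, p = 0.7, list = FALSE)
train_alt <- final_df[idx, ]
test_alt <- final_df[-idx, ]
dim(train_alt)
dim(test_alt)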
Using an encoding technique, categorical variables are transformed into numerical values so that they can be fed into machine learning models.
# Data Encoding
cols_to_encode <- c("manufacturer", "model", "category", "leather_seat", "fuel_type", "gearbox_type", "wheel", "turbo", "drive_wheels")
# Perform label encoding for each column
for (col in cols_to_encode) {
X[[col]] <- as.integer(factor(X[[col]]))
}
X_norm <- scale(X)
In machine learning, the curse of dimensionality means that redundant features tend to produce slow, overfitted and overly complex models. Therefore, dimensionality reduction is conducted to map the features from a high-dimensional space into a lower-dimensional one.
corr_matrix <- cor(X_norm)
ggcorrplot(corr_matrix)
Based on the correlation matrix above, cylinders and engine_volume have the highest correlation coefficient, indicating a fairly strong positive linear relationship between the two variables. We can also see that prod_year, leather_seat, engine_volume, cylinders, doors and airbags are positively correlated with levy.
Because of the multicollinearity between features in our dataset, Principal Component Analysis (PCA) is conducted to find the combinations of features that best represent the dataset.
# PCA
pca_result <- prcomp(corr_matrix)
scores <- pca_result$x
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 0.6710 0.4852 0.3874 0.32504 0.29331 0.28784 0.25405
## Proportion of Variance 0.3303 0.1727 0.1101 0.07752 0.06312 0.06079 0.04735
## Cumulative Proportion 0.3303 0.5030 0.6131 0.69064 0.75376 0.81454 0.86190
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.20211 0.19679 0.17457 0.15891 0.15044 0.13638 0.09684
## Proportion of Variance 0.02997 0.02841 0.02236 0.01853 0.01661 0.01365 0.00688
## Cumulative Proportion 0.89187 0.92029 0.94265 0.96117 0.97778 0.99142 0.99830
## PC15 PC16
## Standard deviation 0.04809 2.701e-17
## Proportion of Variance 0.00170 0.000e+00
## Cumulative Proportion 1.00000 1.000e+00
Among all the PCs, PC1 explains 33% of the variance in the data, and PC1 and PC2 together explain about half of it. A suitable choice might therefore be the first seven PCs, which explain about 86% of the variance even though more than half of the components are discarded.
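As a small sketch (our own addition), the number of components required to reach a chosen variance threshold can also be read off programmatically, here assuming an 85% threshold:
# Proportion of variance explained by each PC and the smallest number of PCs
# whose cumulative proportion reaches 85%
var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cum_var <- cumsum(var_explained)
n_pc <- which(cum_var >= 0.85)[1]
n_pc
round(cum_var[1:n_pc], 3)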
# For each of the first six PCs, print the four variables with the largest absolute loadings
for (i in 1:6){
PC <- pca_result$rotation[, i]
PC_scores <- abs(PC)
PC_scores_ordered <- sort(PC_scores, decreasing = TRUE)
top4_names <- names(PC_scores_ordered)[1:4]
print(top4_names)
}
## [1] "levy" "leather_seat" "engine_volume" "gearbox_type"
## [1] "cylinders" "engine_volume" "prod_year" "mileage"
## [1] "model" "category" "drive_wheels" "fuel_type"
## [1] "fuel_type" "drive_wheels" "wheel" "turbo"
## [1] "manufacturer" "airbags" "fuel_type" "mileage"
## [1] "turbo" "category" "doors" "airbags"
Based on the PCA results, levy, leather_seat, engine_volume and gearbox_type are the most important features for PC1, while cylinders, engine_volume, prod_year and mileage are the most important features for PC2.
# Drop the high-cardinality "model" column (randomForest cannot handle factors with more than 53 levels)
new_train_data <- subset(train_data, select = -c(model))
new_test_data <- subset(test_data, select = -c(model))
For regression, we will use price as our target variable.
Before regression modelling, categorical variables are converted into numerical through encoding.
# Encoding for Regression
cols_to_encode_reg <- c("leather_seat", "gearbox_type", "wheel", "drive_wheels", "category", "fuel_type", "turbo", "manufacturer")
# Perform label encoding for each column
for (col in cols_to_encode_reg) {
new_train_data[[col]] <- as.numeric(factor(new_train_data[[col]]))
new_test_data[[col]] <- as.numeric(factor(new_test_data[[col]]))
}
We use a gradient boosting regressor and a random forest regressor as our regression models.
1. Gradient Boosting Regressor
# Regression Prediction
set.seed(7004)
gradient_boosting_regressor <- gbm(new_train_data$price ~ ., data = new_train_data, distribution = "gaussian", n.trees = 100, interaction.depth = 5)
# Predict the target variable for the test set
y_pred_gbm <- predict(gradient_boosting_regressor, newdata = new_test_data)
## Using 100 trees...
# Store predicted values to data frame
GBM <- data.frame(y_test = new_test_data$price, y_pred = y_pred_gbm)
# Show first 50 actual and predicted data
subset_GBM <- GBM[1:50, ]
# Visualize prediction using plot
plot(subset_GBM$y_test, type = "l", col = "blue", lwd = 2, xlab = "Index", ylab = "Value")
lines(subset_GBM$y_pred, col = "red", lwd = 2)
legend("topright", legend = c("Actual", "Predicted"), col = c("blue", "red"), lwd = 2)
title(main="Actual vs Predicted for Gradient Boosting Regressor Model")
2. Random Forest Regressor
# Regression Prediction
set.seed(7004)
random_forest_regressor <- randomForest(new_train_data$price ~ ., data = new_train_data, ntree = 100, importance = TRUE)
# Predict the target variable for the test set
y_pred_rf <- predict(random_forest_regressor, newdata = new_test_data)
# Store predicted values to data frame
RF <- data.frame(y_test = new_test_data$price, y_pred = y_pred_rf)
# Show first 50 actual and predicted data
subset_RF <- RF[1:50, ]
# Visualize prediction using plot
plot(subset_RF$y_test, type = "l", col = "blue", lwd = 2, xlab = "Index", ylab = "Value")
lines(subset_RF$y_pred, col = "red", lwd = 2)
legend("topright", legend = c("Actual", "Predicted"), col = c("blue", "red"), lwd = 2)
title(main="Actual vs Predicted for Random Forest Regressor Model")
# Compute error metrics for evaluation
mae_rf <- round(mean(abs(new_test_data$price - y_pred_rf)),3)
mae_gbm <- round(mean(abs(new_test_data$price - y_pred_gbm)),3)
rmse_rf <- round(sqrt(mean((new_test_data$price - y_pred_rf)^2)),3)
rmse_gbm <- round(sqrt(mean((new_test_data$price - y_pred_gbm)^2)),3)
r2_value_rf <- round(cor(y_pred_rf, new_test_data$price)^2, 3)
r2_value_gbm <- round(cor(y_pred_gbm, new_test_data$price)^2, 3)
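For readability, the computed metrics can be gathered into a single comparison table (a minimal sketch built from the objects created above):
# Collect the evaluation metrics of both regressors into one data frame
data.frame(
  Metric = c("Mean Absolute Error", "Root Mean Squared Error", "R-squared Value"),
  Random_Forest = c(mae_rf, rmse_rf, r2_value_rf),
  Gradient_Boosting = c(mae_gbm, rmse_gbm, r2_value_gbm)
)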
| Metric | Random Forest Regressor | Gradient Boosting Regressor |
|---|---|---|
| Mean Absolute Error | 3951.545 | 5170.289 |
| Root Mean Squared Error | 5956.678 | 7148.545 |
| R-squared Value | 0.725 | 0.603 |
The random forest regressor has an R-squared value of 0.725, while the gradient boosting regressor has an R-squared value of 0.603. The random forest regressor also has a lower Mean Absolute Error and Root Mean Squared Error (RMSE) than the gradient boosting regressor. Thus, the results imply that the random forest regressor performs better at price prediction.
We discretize price into a price category with 3 levels: low, medium and high.
new_train_data_cat <- new_train_data %>%
mutate(price_category = discretize(new_train_data$price, breaks=3, labels = c("low","medium","high")))
new_test_data_cat <- new_test_data %>%
mutate(price_category = discretize(new_test_data$price, breaks=3, labels = c("low","medium","high")))
# Drop the original numeric price column, keeping price_category as the target
new_train_data_cat <- new_train_data_cat[,-1]
new_test_data_cat <- new_test_data_cat[,-1]
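discretize() from the arules package bins into equal-frequency intervals by default, so the three categories should be roughly balanced; this can be verified with a quick tabulation (our own addition):
# Check the class balance of the discretized target in the training set
table(new_train_data_cat$price_category)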
Random forest classification and support vector machine are chosen as the algorithms for our classification modelling.
1. Random Forest Classification
set.seed(7004)
rf_model <- randomForest(price_category ~ ., data = new_train_data_cat, ntree = 100)
# Predict the target variable for the test set
x_test_rf <- new_test_data_cat[,!(colnames(new_test_data_cat) %in% c("price_category"))]
y_test_rf <- new_test_data_cat[,'price_category']
y_pred_rf_2 <- predict(rf_model, x_test_rf)
# confusion matrix
confusion_matrix_rf <- confusionMatrix(new_test_data_cat$price_category, y_pred_rf_2)
print(confusion_matrix_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction low medium high
## low 1448 190 54
## medium 235 1215 213
## high 118 304 1344
##
## Overall Statistics
##
## Accuracy : 0.7825
## 95% CI : (0.7709, 0.7937)
## No Information Rate : 0.3517
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6738
##
## Mcnemar's Test P-Value : 1.127e-09
##
## Statistics by Class:
##
## Class: low Class: medium Class: high
## Sensitivity 0.8040 0.7109 0.8343
## Specificity 0.9265 0.8687 0.8798
## Pos Pred Value 0.8558 0.7306 0.7610
## Neg Pred Value 0.8971 0.8571 0.9204
## Prevalence 0.3517 0.3337 0.3146
## Detection Rate 0.2828 0.2373 0.2624
## Detection Prevalence 0.3304 0.3247 0.3449
## Balanced Accuracy 0.8653 0.7898 0.8570
The accuracy of the random forest classifier is 0.7825. Looking at the per-class statistics, the model performs better for the low and high price categories than for the medium category.
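As a supplementary check (our own addition, not part of the original analysis), the Gini-based variable importance recorded by randomForest shows which features drive the classification:
# Mean decrease in Gini impurity per feature for the price-category model
importance(rf_model)
varImpPlot(rf_model, main = "Variable importance for price_category")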
2. Support Vector Machine Classification
svm_model <- svm(price_category ~ ., data = new_train_data_cat)
# Predict the target variable for the test set
x_test_svm <- new_test_data_cat[,!(colnames(new_test_data_cat) %in% c("price_category"))]
y_test_svm <- new_test_data_cat[,'price_category']
y_pred_svm <- predict(svm_model, x_test_svm)
# Confusion matrix
confusion_matrix_svm <- confusionMatrix(new_test_data_cat$price_category, y_pred_svm)
print(confusion_matrix_svm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction low medium high
## low 1411 234 47
## medium 289 1137 237
## high 190 341 1235
##
## Overall Statistics
##
## Accuracy : 0.7387
## 95% CI : (0.7265, 0.7507)
## No Information Rate : 0.3691
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6084
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: low Class: medium Class: high
## Sensitivity 0.7466 0.6641 0.8130
## Specificity 0.9130 0.8457 0.8526
## Pos Pred Value 0.8339 0.6837 0.6993
## Neg Pred Value 0.8603 0.8337 0.9154
## Prevalence 0.3691 0.3343 0.2966
## Detection Rate 0.2755 0.2220 0.2412
## Detection Prevalence 0.3304 0.3247 0.3449
## Balanced Accuracy 0.8298 0.7549 0.8328
The accuracy of the support vector machine classifier is 0.7387. Its sensitivity is highest for the high price class, followed by the low class and then the medium class; like the random forest, it is better at predicting the low and high price categories. Comparing the accuracy of both classification algorithms, random forest classification performs better than the support vector machine.
To better evaluate the two algorithms, ROC curves and AUC values are computed from each classifier's predictions on the test data.
new_test_data_cat$price_category <- as.numeric(new_test_data_cat$price_category)
y_pred_rf_2 <- as.numeric(y_pred_rf_2)
y_pred_svm <- as.numeric(y_pred_svm)
roc_rf <- roc(new_test_data_cat$price_category, y_pred_rf_2)
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
roc_svm <- roc(new_test_data_cat$price_category, y_pred_svm)
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot(roc_rf, main = "ROC Curves", col = "blue")
lines(roc_svm, col = "red")
legend("bottomright", legend = c("Random Forest", "Support Vector Machine"), col = c("blue", "red"), lwd = 2)
auc_rf <- auc(roc_rf)
auc_svm <- auc(roc_svm)
| Classification | AUC |
|---|---|
| Random Forest | 0.8528 |
| Support Vector Machine | 0.8304 |
The AUC represents the probability that a randomly chosen positive instance is ranked above a randomly chosen negative one. A model with 100% correct predictions has an AUC of 1.0, and a model with 100% wrong predictions has an AUC of 0.0.
From the calculated AUC values, random forest classification is performing better than support vector machine.
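Note that the roc() calls above effectively treat the three-level target as a two-class problem (as the "Setting levels: control = 1, case = 2" messages indicate). As a hedged alternative (our own addition), pROC's multiclass.roc() computes a multi-class AUC across all three price categories:
# Multi-class AUC over all three price categories (average of pairwise comparisons)
multiclass.roc(new_test_data_cat$price_category, y_pred_rf_2)$auc
multiclass.roc(new_test_data_cat$price_category, y_pred_svm)$auc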
All in all, the objective of this project is to apply data science methodologies to process, analyse and predict used car retail prices. We first provided a thorough introduction to the dataset, discussing its extent and the nature of each column. We then performed data pre-processing, including cleaning and splitting, to produce a model-friendly dataset. To gain a deeper understanding of the dataset's characteristics, we conducted exploratory data analysis using correlation analysis and other techniques. Finally, we modelled the problem with both regression and classification models.
In conclusion, random forest performs well in both regression and classification for price prediction. Hence, random forest provides the best prediction of used car retail prices from the available attributes.