| Name | Matric Number |
|---|---|
| LEE GUANG SHEN | 22052269 |
| TIMOTHY CHEN XIAN YII | 22056170 |
| LIM KHAI FUNG | 22058576 |
| LEONG MIN QI | 22077164 |
| MANG YU JIE | 22064556 |
In modern society, cars are gradually becoming a necessity for every adult. However, not everyone can afford a brand new car, so demand for used cars has risen considerably as people look for a more economical option. The retail price of a used car is hard to estimate, as there are no concrete guidelines for setting it. More importantly, the price of a used car depends on supply and demand and is highly dependent on the car's characteristics. Aspects such as the car's design, condition and mileage greatly affect its retail price.
Thus, accurate prediction of used car retail prices benefits both sellers and buyers. The seller can use the model's output to attract potential buyers and maximize profit, while the buyer can easily obtain a price estimate for a car. Determining the right price for a used car is difficult because numerous factors must be considered; hence, this study is carried out.
Pricing a used car is a dilemma for most used car dealers and their buyers, as it is difficult to decide how much a pre-owned car is worth. Hence, this project aims to explore the features that help estimate the price of a used car and to use those features to build machine learning models that both predict the price of the car and classify its price category.
library(lubridate)
library(dplyr)
library(tidyr)
library(stringr)
library(ggcorrplot)
library(patchwork)
library(randomForest)
library(caret)
library(corrplot)
library(gbm)
library(e1071)
library(stats)
library(arules)
library(pROC)
The dataset used for this project is retrieved from the Car Price Prediction Challenge, made available by Deep Contractor on Kaggle.
df <- read.csv("https://raw.githubusercontent.com/mangyj922/car-price-prediction/main/car_price_prediction.csv")
str(df)
## 'data.frame': 19237 obs. of 18 variables:
## $ ID : int 45654403 44731507 45774419 45769185 45809263 45802912 45656768 45816158 45641395 45756839 ...
## $ Price : int 13328 16621 8467 3607 11726 39493 1803 549 1098 26657 ...
## $ Levy : chr "1399" "1018" "-" "862" ...
## $ Manufacturer : chr "LEXUS" "CHEVROLET" "HONDA" "FORD" ...
## $ Model : chr "RX 450" "Equinox" "FIT" "Escape" ...
## $ Prod..year : int 2010 2011 2006 2011 2014 2016 2010 2013 2014 2007 ...
## $ Category : chr "Jeep" "Jeep" "Hatchback" "Jeep" ...
## $ Leather.interior: chr "Yes" "No" "No" "Yes" ...
## $ Fuel.type : chr "Hybrid" "Petrol" "Petrol" "Hybrid" ...
## $ Engine.volume : chr "3.5" "3" "1.3" "2.5" ...
## $ Mileage : chr "186005 km" "192000 km" "200000 km" "168966 km" ...
## $ Cylinders : num 6 6 4 4 4 4 4 4 4 6 ...
## $ Gear.box.type : chr "Automatic" "Tiptronic" "Variator" "Automatic" ...
## $ Drive.wheels : chr "4x4" "4x4" "Front" "4x4" ...
## $ Doors : chr "04-May" "04-May" "04-May" "04-May" ...
## $ Wheel : chr "Left wheel" "Left wheel" "Right-hand drive" "Left wheel" ...
## $ Color : chr "Silver" "Black" "Black" "White" ...
## $ Airbags : int 12 8 2 0 4 4 12 12 12 12 ...
From the description above, we can see that the dataset comprises 19,237 rows (observations) and 18 columns (features).
The content and structure of the dataset are described in the table below:
| Column | Type | Description |
|---|---|---|
| ID | integer | ID of used car |
| Price | integer | The price of used car |
| Levy | character | The tax to be paid |
| Manufacturer | character | The brand of used car |
| Model | character | The model name of used car |
| Prod..year | integer | The production year of used car |
| Category | character | The category of used car |
| Leather.interior | character | Whether the car has a leather interior (Yes/No) |
| Fuel.type | character | The fuel type of used car |
| Engine.volume | character | The engine volume of used car |
| Mileage | character | The mileage of used car |
| Cylinders | numeric | The number of engine cylinders |
| Gear.box.type | character | The gear box type |
| Drive.wheels | character | The drive-wheel configuration |
| Doors | character | The number of doors |
| Wheel | character | The placement of the steering wheel |
| Color | character | The color of used car |
| Airbags | integer | The number of airbags |
The summary of the dataset can be found below:
summary(df)
## ID Price Levy Manufacturer
## Min. :20746880 Min. : 1 Length:19237 Length:19237
## 1st Qu.:45698374 1st Qu.: 5331 Class :character Class :character
## Median :45772308 Median : 13172 Mode :character Mode :character
## Mean :45576536 Mean : 18556
## 3rd Qu.:45802036 3rd Qu.: 22075
## Max. :45816654 Max. :26307500
## Model Prod..year Category Leather.interior
## Length:19237 Min. :1939 Length:19237 Length:19237
## Class :character 1st Qu.:2009 Class :character Class :character
## Mode :character Median :2012 Mode :character Mode :character
## Mean :2011
## 3rd Qu.:2015
## Max. :2020
## Fuel.type Engine.volume Mileage Cylinders
## Length:19237 Length:19237 Length:19237 Min. : 1.000
## Class :character Class :character Class :character 1st Qu.: 4.000
## Mode :character Mode :character Mode :character Median : 4.000
## Mean : 4.583
## 3rd Qu.: 4.000
## Max. :16.000
## Gear.box.type Drive.wheels Doors Wheel
## Length:19237 Length:19237 Length:19237 Length:19237
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Color Airbags
## Length:19237 Min. : 0.000
## Class :character 1st Qu.: 4.000
## Mode :character Median : 6.000
## Mean : 6.583
## 3rd Qu.:12.000
## Max. :16.000
The summary above gives a quick overview of the basic statistics of the dataset: there are 5 quantitative features and 13 qualitative features. Interestingly, the ranges between the minimum and maximum values of the numerical features are very large.
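As a quick check (our own addition, not part of the original analysis), the breakdown of column types can be tabulated directly:
# Tabulate the class of each column: integer/numeric columns are quantitative,
# character columns are qualitative
table(sapply(df, class))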
Data cleaning is essential in the data preprocessing phase to ensure the quality, consistency and integrity of the data. It involves identifying and handling errors, missing values, outliers and other issues that may affect the accuracy and reliability of the data.
First, we handle missing data by checking whether there are any null values.
any(is.na(df))
## [1] FALSE
There are no null values within the dataset.
Secondly, we will drop any duplicate records in the dataset.
df <- df %>% distinct()
dim(df)
## [1] 18924 18
After dropping duplicate rows, 18,924 observations remain, which means the dataset contained 313 duplicated observations.
The "ID" column, a unique identifier, and the "Color" column are dropped from the dataset, as neither is used for prediction.
df <- df[,!names(df) %in% c("ID", "Color")]
dim(df)
## [1] 18924 16
After dropping the “ID” and “Color” columns, 16 features are left.
The feature names are renamed to ease the process of data analysis.
colnames(df) <- c('price','levy','manufacturer','model','prod_year','category','leather_seat','fuel_type','engine_volume','mileage','cylinders','gearbox_type','drive_wheels','doors','wheel','airbags')
In the column "levy", some records show "-". These records will be imputed with 0 and the column will be converted to integer.
Besides, the column "doors" contains inconsistent data: some records appear in date format (e.g. "04-May" instead of 4). These records will be corrected and the whole column converted to integer.
Furthermore, the column "mileage" contains the distance unit "km". The unit adds no value to the dataset, so "km" is removed and the column is represented as an integer.
# Impute "-" in levy with 0 and convert to integer
df$levy <- sub('-','0',df$levy)
df$levy <- as.integer(df$levy)
# Replace the date-formatted door counts with the intended numbers and convert to integer
df$doors <- sub('04-May','4',df$doors)
df$doors <- sub('02-Mar','2',df$doors)
df$doors <- sub('>5','5',df$doors)
df$doors <- as.integer(df$doors)
# Strip the " km" unit from mileage and convert to integer
df$mileage <- sub(' km','',df$mileage)
df$mileage <- as.integer(df$mileage)
As the column "engine_volume" contains both turbo and non-turbo entries, the turbo information is separated out into its own column.
# Split engine_volume on the space: the first part is the numeric volume,
# the second is either empty or "Turbo"
df[c('engine_volume', 'turbo')] <- str_split_fixed(df$engine_volume, ' ', 2)
# Recode the turbo flag: empty entries become "No", entries containing "Turbo" become "Yes"
df$turbo <- sub('','No',df$turbo)
df$turbo <- gsub('.*Turbo*', "Yes", df$turbo)
df$engine_volume <- as.numeric(df$engine_volume)
In R, factors are generally easier to work with than raw character vectors, and many modelling functions expect them. As a result, the columns containing character data are converted to factors.
df$manufacturer <- as.factor(df$manufacturer)
df$model <- as.factor(df$model)
df$category <- as.factor(df$category)
df$leather_seat <- as.factor(df$leather_seat)
df$fuel_type <- as.factor(df$fuel_type)
df$gearbox_type <- as.factor(df$gearbox_type)
df$drive_wheels <- as.factor(df$drive_wheels)
df$wheel <- as.factor(df$wheel)
df$turbo <- as.factor(df$turbo)
The overview of the dataset after the data cleaning process is shown below.
summary(df)
## price levy manufacturer model
## Min. : 1 Min. : 0.0 HYUNDAI :3729 Prius : 1069
## 1st Qu.: 5331 1st Qu.: 0.0 TOYOTA :3606 Sonata : 1067
## Median : 13172 Median : 642.0 MERCEDES-BENZ:2043 Camry : 929
## Mean : 18587 Mean : 632.9 FORD :1088 Elantra : 910
## 3rd Qu.: 22063 3rd Qu.: 917.0 CHEVROLET :1047 E 350 : 534
## Max. :26307500 Max. :11714.0 BMW :1036 Santa FE: 527
## (Other) :6375 (Other) :13888
## prod_year category leather_seat fuel_type
## Min. :1939 Sedan :8600 No : 5193 CNG : 469
## 1st Qu.:2009 Jeep :5378 Yes:13731 Diesel :4001
## Median :2012 Hatchback:2799 Hybrid :3539
## Mean :2011 Minivan : 633 Hydrogen : 1
## 3rd Qu.:2015 Coupe : 528 LPG : 885
## Max. :2020 Universal: 361 Petrol :9944
## (Other) : 625 Plug-in Hybrid: 85
## engine_volume mileage cylinders gearbox_type
## Min. : 0.000 Min. :0.000e+00 Min. : 1.00 Automatic:13282
## 1st Qu.: 1.800 1st Qu.:7.020e+04 1st Qu.: 4.00 Manual : 1844
## Median : 2.000 Median :1.264e+05 Median : 4.00 Tiptronic: 3065
## Mean : 2.306 Mean :1.555e+06 Mean : 4.58 Variator : 733
## 3rd Qu.: 2.500 3rd Qu.:1.891e+05 3rd Qu.: 4.00
## Max. :20.000 Max. :2.147e+09 Max. :16.00
##
## drive_wheels doors wheel airbags
## 4x4 : 3969 Min. :2.000 Left wheel :17471 Min. : 0.000
## Front:12695 1st Qu.:4.000 Right-hand drive: 1453 1st Qu.: 4.000
## Rear : 2260 Median :4.000 Median : 6.000
## Mean :3.925 Mean : 6.568
## 3rd Qu.:4.000 3rd Qu.:12.000
## Max. :5.000 Max. :16.000
##
## turbo
## No :17032
## Yes: 1892
##
##
##
##
##
From the printed results above, the ranges between the minimum and maximum values of price, levy and mileage are quite large, so there is a high possibility that these features contain outliers.
Therefore, box plots are then used to identify the potential outliers visually.
# Note: par(mfrow = ...) has no effect on ggplot2 output, so patchwork is used to stack the plots
(ggplot(data = df, aes(x = price)) + geom_boxplot(fill="yellow")) /
  (ggplot(data = df, aes(x = levy)) + geom_boxplot(fill="yellow")) /
  (ggplot(data = df, aes(x = mileage)) + geom_boxplot(fill="yellow")) /
  (ggplot(data = df, aes(x = airbags)) + geom_boxplot(fill="yellow"))
Based on the box plots above, outliers are visible in price, levy and mileage, where points lie far beyond the upper whisker. Hence they need to be removed.
The interquartile range (IQR) method is used to identify and remove outliers from the dataset: values more than 1.5 times the IQR below the 25th percentile or above the 75th percentile are removed.
Q1 <- quantile(df$price, .25)
Q3 <- quantile(df$price, .75)
IQR <- IQR(df$price)
outlier_step <- IQR * 1.5
lower_bound <- Q1-outlier_step
upper_bound <- Q3+outlier_step
final_df <- df %>% filter(price >= lower_bound & price <= upper_bound)
Q1 <- quantile(final_df$levy, .25)
Q3 <- quantile(final_df$levy, .75)
IQR <- IQR(final_df$levy)
outlier_step <- IQR * 1.5
lower_bound <- Q1-outlier_step
upper_bound <- Q3+outlier_step
final_df <- final_df %>% filter(levy >= lower_bound & levy <= upper_bound)
Q1 <- quantile(final_df$mileage, .25)
Q3 <- quantile(final_df$mileage, .75)
IQR <- IQR(final_df$mileage)
outlier_step <- IQR * 1.5
lower_bound <- Q1-outlier_step
upper_bound <- Q3+outlier_step
final_df <- final_df %>% filter(mileage >= lower_bound & mileage <= upper_bound)
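The three filtering blocks above repeat the same rule. As a minimal sketch (the helper name remove_iqr_outliers is our own and not part of the original code), the logic could be factored into a reusable function:
# Reusable IQR filter: keep rows within [Q1 - k*IQR, Q3 + k*IQR] for a given column
remove_iqr_outliers <- function(data, col, k = 1.5) {
  q1 <- quantile(data[[col]], 0.25)
  q3 <- quantile(data[[col]], 0.75)
  step <- (q3 - q1) * k
  data %>% filter(.data[[col]] >= q1 - step & .data[[col]] <= q3 + step)
}
# Equivalent to the three blocks above:
# final_df <- df %>%
#   remove_iqr_outliers("price") %>%
#   remove_iqr_outliers("levy") %>%
#   remove_iqr_outliers("mileage")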
summary(final_df)
## price levy manufacturer model
## Min. : 1 Min. : 0.0 TOYOTA :3317 Prius : 990
## 1st Qu.: 5018 1st Qu.: 0.0 HYUNDAI :3162 Elantra: 901
## Median :12544 Median : 640.0 MERCEDES-BENZ:1767 Camry : 867
## Mean :14291 Mean : 597.2 CHEVROLET :1015 Sonata : 752
## 3rd Qu.:20385 3rd Qu.: 891.0 FORD :1002 E 350 : 494
## Max. :47120 Max. :2225.0 BMW : 926 FIT : 437
## (Other) :5879 (Other):12627
## prod_year category leather_seat fuel_type
## Min. :1939 Sedan :7842 No : 4981 CNG : 440
## 1st Qu.:2009 Jeep :4566 Yes:12087 Diesel :3548
## Median :2012 Hatchback:2711 Hybrid :3336
## Mean :2011 Minivan : 604 Hydrogen : 1
## 3rd Qu.:2014 Coupe : 462 LPG : 622
## Max. :2020 Universal: 328 Petrol :9038
## (Other) : 555 Plug-in Hybrid: 83
## engine_volume mileage cylinders gearbox_type
## Min. :0.000 Min. : 0 Min. : 1.000 Automatic:11973
## 1st Qu.:1.700 1st Qu.: 71904 1st Qu.: 4.000 Manual : 1710
## Median :2.000 Median :126292 Median : 4.000 Tiptronic: 2674
## Mean :2.256 Mean :132143 Mean : 4.527 Variator : 711
## 3rd Qu.:2.500 3rd Qu.:182625 3rd Qu.: 4.000
## Max. :6.300 Max. :364523 Max. :16.000
##
## drive_wheels doors wheel airbags
## 4x4 : 3417 Min. :2.000 Left wheel :15670 Min. : 0.000
## Front:11666 1st Qu.:4.000 Right-hand drive: 1398 1st Qu.: 4.000
## Rear : 1985 Median :4.000 Median : 6.000
## Mean :3.924 Mean : 6.585
## 3rd Qu.:4.000 3rd Qu.:12.000
## Max. :5.000 Max. :16.000
##
## turbo
## No :15459
## Yes: 1609
##
##
##
##
##
After cleaning the dataset, exploratory data analysis will then be performed.
EDA is the process to examine and understand the dataset thoroughly
to gain insights, identify patterns and uncover relationships between
variables.
The purpose of the EDA is to understand how each feature relates to the price of a used car.
Correlation analysis is used to measure the strength and direction of the relationship between variables.
Correlation coefficients range from -1 to 1:

- 0: no relationship between the variables at all
- 1: perfect positive correlation between the variables
- -1: perfect negative correlation between the variables

However, a limitation of correlation analysis is that it does not establish causation, even though it provides insights into the relationships between variables.
subset_df <- final_df %>% select_if(is.numeric) %>% select(-c("price"))
corr <- cor(subset_df)
ggcorrplot(corr, type="lower", lab=TRUE, ggtheme = ggplot2::theme_gray, colors = c("#636EFA", "white", "#EF553B"))
From the plot above, the highest positive correlation is between engine_volume and cylinders, at 0.78.
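As a quick numerical check (our own addition), the coefficient can also be computed directly:
# Pairwise correlation between engine volume and number of cylinders
round(cor(final_df$engine_volume, final_df$cylinders), 2)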
# relationship between prod_year and price
p1 <- final_df %>% ggplot(aes(y=prod_year, x=price, group=1)) + geom_boxplot(fill="#636EFA") + coord_flip()
p13 <- final_df %>% ggplot(aes(x=prod_year)) + geom_histogram(aes(y=after_stat(density)), fill="#636EFA", bins=30) + geom_density(alpha=.2, fill="#636EFA")
p2 <- final_df %>% ggplot(aes(y=prod_year, x=price)) + geom_violin(fill="#636EFA") + coord_flip()
# relationship between engine_volume and price
p3 <- final_df %>% ggplot(aes(y=engine_volume, x=price, group=1)) + geom_boxplot(fill="#EF553B") + coord_flip()
p14 <- final_df %>% ggplot(aes(x=engine_volume)) + geom_histogram(aes(y=after_stat(density)), fill="#EF553B", bins=30) + geom_density(alpha=.2, fill="#EF553B")
p4 <- final_df %>% ggplot(aes(y=engine_volume, x=price)) + geom_violin(fill="#EF553B") + coord_flip()
# relationship between mileage and price
p5 <- final_df %>% ggplot(aes(y=mileage, x=price, group=1)) + geom_boxplot(fill="#636EFA") + coord_flip()
p15 <- final_df %>% ggplot(aes(x=mileage)) + geom_histogram(aes(y=after_stat(density)), fill="#636EFA", bins=30) + geom_density(alpha=.2, fill="#636EFA")
p6 <- final_df %>% ggplot(aes(y=mileage, x=price)) + geom_violin(fill="#636EFA") + coord_flip()
# relationship between cylinders and price
p7 <- final_df %>% ggplot(aes(y=cylinders, x=price, group=1)) + geom_boxplot(fill="#EF553B") + coord_flip()
p16 <- final_df %>% ggplot(aes(x=cylinders)) + geom_histogram(aes(y=after_stat(density)), fill="#EF553B", bins=30) + geom_density(alpha=.2, fill="#EF553B")
p8 <- final_df %>% ggplot(aes(y=cylinders, x=price)) + geom_violin(fill="#EF553B") + coord_flip()
# relationship between doors and price
p9 <- final_df %>% ggplot(aes(y=doors, x=price, group=1)) + geom_boxplot(fill="#636EFA") + coord_flip()
p17 <- final_df %>% ggplot(aes(x=doors)) + geom_histogram(aes(y=after_stat(density)), fill="#636EFA", bins=30) + geom_density(alpha=.2, fill="#636EFA")
p10 <- final_df %>% ggplot(aes(y=doors, x=price)) + geom_violin(fill="#636EFA") + coord_flip()
# relationship between airbags and price
p11 <- final_df %>% ggplot(aes(y=airbags, x=price, group=1)) + geom_boxplot(fill="#EF553B") + coord_flip()
p18 <- final_df %>% ggplot(aes(x=airbags)) + geom_histogram(aes(y=after_stat(density)), fill="#EF553B", bins=30) + geom_density(alpha=.2, fill="#EF553B")
p12 <- final_df %>% ggplot(aes(y=airbags, x=price)) + geom_violin(fill="#EF553B") + coord_flip()
(p1 + p13 + p2)
From the plots above, the more recent the production year of the used car, the higher the price range. Most of the used cars in the data were produced fairly recently.
(p3 + p14 + p4)
Most of the used cars in the dataset have an engine volume of 2.0 litres. Moreover, 2.0-litre engines span a large price range, covering both very cheap and very expensive used cars, whereas other engine volumes have fairly consistent prices.
(p5 + p15 + p6)
As shown in the figures above, most used cars in the dataset have a mileage of less than 200,000 km. Cars with more than 200,000 km have a fairly consistent price range.
(p7 + p16 + p8)
Most cars in the dataset have 4
cylinders.
(p9 + p17 + p10)
The dataset mostly consists of used cars with 4
doors.
(p11 + p18 + p12)
Used cars with 4 or 12 airbags have a large price range, while others have fairly consistent price ranges.
# Relationship of manufacturer and price
pp1 <- final_df %>% ggplot(aes(x=price, y=reorder(manufacturer, price))) +
geom_boxplot(fill="#636EFA") +
labs(y="manufacturer", x="price", title="Relationship between manufacturer and price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pp1
In general, HUMMER and SSANGYONG have the highest median used car prices among all manufacturers, whereas ROLLS-ROYCE has the lowest resale value.
# Relationship of category and price
pp2 <- final_df %>% ggplot(aes(x=reorder(category, price), y=price)) +
geom_boxplot(fill="#EF553B") +
labs(y="price", x="category", title="Relationship between category and price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pp2
The Universal category has the highest variability in price.
# Relationship of leather_seat and price
pp3 <- final_df %>% ggplot(aes(x=price, y=reorder(leather_seat, price))) +
geom_boxplot(fill="#00CC96") +
labs(y="leather_seat", x="price", title="Relationship between leather_seat and price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pp3
From the figure above, used cars with a leather interior tend to fetch higher prices.
# Relationship of fuel_type and price
pp4 <- final_df %>% ggplot(aes(x=reorder(fuel_type, price), y=price)) +
geom_boxplot(fill="#E5FFCC") +
labs(y="price", x="fuel_type", title="Relationship between fuel_type and price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pp4
Looking at the relationship between fuel_type and price, plug-in hybrid cars have the highest median price, followed by hydrogen and then diesel.
# Relationship of gearbox_type and price
pp5 <- final_df %>% ggplot(aes(x=reorder(gearbox_type, price), y=price)) +
geom_boxplot(fill="#99FF33") +
labs(y="price", x="gearbox_type", title="Relationship between gearbox_type and price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pp5
Cars with a Tiptronic gearbox have the highest median price and manual cars the lowest. However, the automatic gearbox shows the largest price range among the gearbox types.
# Relationship of drive_wheels and price
pp6 <- final_df %>% ggplot(aes(x=reorder(drive_wheels, price), y=price)) +
geom_boxplot(fill="#FF9933") +
labs(y="price", x="drive_wheels", title="Relationship between drive_wheels and price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pp6
4x4 used cars fetch a lower price, whereas front-wheel-drive cars fetch the highest among the drive_wheels categories.
# Relationship of wheel and price
pp7 <- final_df %>% ggplot(aes(x=price, y=reorder(wheel, price))) +
geom_boxplot(fill="#00CC66") +
labs(y="wheel", x="price", title="Relationship between wheel and price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pp7
Left-hand-drive cars ("Left wheel") fetch higher prices. Hence, the target buyers are likely in countries that follow right-hand traffic.
# Relationship of turbo and price
pp8 <- final_df %>% ggplot(aes(x=price, y=reorder(turbo, price))) +
geom_boxplot(fill="#00CC66") +
labs(y="turbo", x="price", title="Relationship between turbo and price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pp8
Used cars with a turbo command higher prices than non-turbo cars.
Data splitting is important to ensure model effectiveness, prevent over-fitting and evaluate performance on unseen data.
A 70-30 split aims to strike a balance between providing enough data for effective model training and reserving a separate portion for unbiased model evaluation.
set.seed(7004)
# Split the data into X (features) and Y (target)
X <- final_df[, !(colnames(final_df) %in% c("price"))]
Y <- final_df[,"price"]
# Generate random indices for the training set
train_indices <- sample(nrow(final_df), nrow(final_df) * 0.7)
# Create the training set
train_data <- final_df[train_indices, ]
# Create the test set
test_data <- final_df[-train_indices, ]
train_data consists of 70% of the data, randomly sampled from the original dataset; it is used to train the machine learning models, which learn patterns and relationships in the data to make predictions or classifications.
test_data consists of the remaining 30% and is used to evaluate the performance of the trained models, serving as a benchmark for how well they generalize to new, unseen data.
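For reference, the same 70-30 split could also be produced with caret's createDataPartition(), which groups a numeric target into quantiles before sampling so that the price distribution is preserved in both sets. This is a sketch with our own object names (idx, train_alt, test_alt), not part of the original pipeline:
# Alternative 70-30 split using caret (already loaded)
set.seed(7004)
idx <- createDataPartition(final_df$price, p = 0.7, list = FALSE)
train_alt <- final_df[idx, ]
test_alt <- final_df[-idx, ]
dim(train_alt)
dim(test_alt)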
Using an encoding technique, categorical variables are transformed into numerical values so that they can be fed into machine learning models.
# Data Encoding
cols_to_encode <- c("manufacturer", "model", "category", "leather_seat", "fuel_type", "gearbox_type", "wheel", "turbo", "drive_wheels")
# Perform label encoding for each column
for (col in cols_to_encode) {
X[[col]] <- as.integer(factor(X[[col]]))
}
X_norm <- scale(X)
In machine learning, the curse of dimensionality means that redundant features tend to produce slow, overfitted and overly complex models. Therefore, dimensionality reduction is conducted to map the features from a high-dimensional space into a lower-dimensional one.
corr_matrix <- cor(X_norm)
ggcorrplot(corr_matrix)
Based on the correlation matrix above, cylinders and engine_volume have the highest correlation coefficient, indicating a fairly strong positive linear relationship between the two variables. We can also see that prod_year, leather_seat, engine_volume, cylinders, doors and airbags are positively correlated with levy.
Because of the multicollinearity between features in our dataset, Principal Component Analysis (PCA) is conducted to find the combinations of features that best represent the dataset.
# PCA
pca_result <- prcomp(corr_matrix)
scores <- pca_result$x
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 0.6710 0.4852 0.3874 0.32504 0.29331 0.28784 0.25405
## Proportion of Variance 0.3303 0.1727 0.1101 0.07752 0.06312 0.06079 0.04735
## Cumulative Proportion 0.3303 0.5030 0.6131 0.69064 0.75376 0.81454 0.86190
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.20211 0.19679 0.17457 0.15891 0.15044 0.13638 0.09684
## Proportion of Variance 0.02997 0.02841 0.02236 0.01853 0.01661 0.01365 0.00688
## Cumulative Proportion 0.89187 0.92029 0.94265 0.96117 0.97778 0.99142 0.99830
## PC15 PC16
## Standard deviation 0.04809 2.701e-17
## Proportion of Variance 0.00170 0.000e+00
## Cumulative Proportion 1.00000 1.000e+00
Among all the PCs, PC1 explains 33% of the variance in the data, and PC1 and PC2 together explain about half of it. A suitable choice might therefore be the first seven PCs, which explain about 86% of the variance even though more than half of the components are discarded.
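As a small sketch (our own addition), the number of components required to reach a chosen variance threshold can also be read off programmatically, here assuming an 85% threshold:
# Proportion of variance explained by each PC and the smallest number of PCs
# whose cumulative proportion reaches 85%
var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cum_var <- cumsum(var_explained)
n_pc <- which(cum_var >= 0.85)[1]
n_pc
round(cum_var[1:n_pc], 3)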
# For each of the first six PCs, print the four variables with the largest absolute loadings
for (i in 1:6){
PC <- pca_result$rotation[, i]
PC_scores <- abs(PC)
PC_scores_ordered <- sort(PC_scores, decreasing = TRUE)
top4_names <- names(PC_scores_ordered)[1:4]
print(top4_names)
}
## [1] "levy" "leather_seat" "engine_volume" "gearbox_type"
## [1] "cylinders" "engine_volume" "prod_year" "mileage"
## [1] "model" "category" "drive_wheels" "fuel_type"
## [1] "fuel_type" "drive_wheels" "wheel" "turbo"
## [1] "manufacturer" "airbags" "fuel_type" "mileage"
## [1] "turbo" "category" "doors" "airbags"
Based on the PCA results, levy, leather_seat, engine_volume and gearbox_type are the most important features for PC1, while cylinders, engine_volume, prod_year and mileage are the most important features for PC2.
# Drop the high-cardinality "model" column (randomForest cannot handle factors with more than 53 levels)
new_train_data <- subset(train_data, select = -c(model))
new_test_data <- subset(test_data, select = -c(model))
For regression, we will use price as our target variable.
Before regression modelling, categorical variables are converted into numerical through encoding.
# Encoding for Regression
cols_to_encode_reg <- c("leather_seat", "gearbox_type", "wheel", "drive_wheels", "category", "fuel_type", "turbo", "manufacturer")
# Perform label encoding for each column
for (col in cols_to_encode_reg) {
new_train_data[[col]] <- as.numeric(factor(new_train_data[[col]]))
new_test_data[[col]] <- as.numeric(factor(new_test_data[[col]]))
}
We use a gradient boosting regressor and a random forest regressor as our regression models.
1. Gradient Boosting Regressor
# Regression Prediction
set.seed(7004)
gradient_boosting_regressor <- gbm(new_train_data$price ~ ., data = new_train_data, distribution = "gaussian", n.trees = 100, interaction.depth = 5)
# Predict the target variable for the test set
y_pred_gbm <- predict(gradient_boosting_regressor, newdata = new_test_data)
## Using 100 trees...
# Store predicted values to data frame
GBM <- data.frame(y_test = new_test_data$price, y_pred = y_pred_gbm)
# Show first 50 actual and predicted data
subset_GBM <- GBM[1:50, ]
# Visualize prediction using plot
plot(subset_GBM$y_test, type = "l", col = "blue", lwd = 2, xlab = "Index", ylab = "Value")
lines(subset_GBM$y_pred, col = "red", lwd = 2)
legend("topright", legend = c("Actual", "Predicted"), col = c("blue", "red"), lwd = 2)
title(main="Actual vs Predicted for Gradient Boosting Regressor Model")
2. Random Forest Regressor
# Regression Prediction
set.seed(7004)
random_forest_regressor <- randomForest(new_train_data$price ~ ., data = new_train_data, ntree = 100, importance = TRUE)
# Predict the target variable for the test set
y_pred_rf <- predict(random_forest_regressor, newdata = new_test_data)
# Store predicted values to data frame
RF <- data.frame(y_test = new_test_data$price, y_pred = y_pred_rf)
# Show first 50 actual and predicted data
subset_RF <- RF[1:50, ]
# Visualize prediction using plot
plot(subset_RF$y_test, type = "l", col = "blue", lwd = 2, xlab = "Index", ylab = "Value")
lines(subset_RF$y_pred, col = "red", lwd = 2)
legend("topright", legend = c("Actual", "Predicted"), col = c("blue", "red"), lwd = 2)
title(main="Actual vs Predicted for Random Forest Regressor Model")
# Compute error metrics for evaluation
mae_rf <- round(mean(abs(new_test_data$price - y_pred_rf)),3)
mae_gbm <- round(mean(abs(new_test_data$price - y_pred_gbm)),3)
rmse_rf <- round(sqrt(mean((new_test_data$price - y_pred_rf)^2)),3)
rmse_gbm <- round(sqrt(mean((new_test_data$price - y_pred_gbm)^2)),3)
r2_value_rf <- round(cor(y_pred_rf, new_test_data$price)^2, 3)
r2_value_gbm <- round(cor(y_pred_gbm, new_test_data$price)^2, 3)
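For readability, the computed metrics can be gathered into a single comparison table (a minimal sketch built from the objects created above):
# Collect the evaluation metrics of both regressors into one data frame
data.frame(
  Metric = c("Mean Absolute Error", "Root Mean Squared Error", "R-squared Value"),
  Random_Forest = c(mae_rf, rmse_rf, r2_value_rf),
  Gradient_Boosting = c(mae_gbm, rmse_gbm, r2_value_gbm)
)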
| Metric | Random Forest Regressor | Gradient Boosting Regressor |
|---|---|---|
| Mean Absolute Error | 3951.545 | 5170.289 |
| Root Mean Squared Error | 5956.678 | 7148.545 |
| R-squared Value | 0.725 | 0.603 |
The random forest regressor has an R-squared value of 0.725, while the gradient boosting regressor has an R-squared value of 0.603. The random forest regressor also has a lower Mean Absolute Error and Root Mean Squared Error (RMSE) than the gradient boosting regressor. Thus, the results imply that the random forest regressor performs better at price prediction.
We discretize price into a price category with 3 levels: low, medium and high.
new_train_data_cat <- new_train_data %>%
mutate(price_category = discretize(new_train_data$price, breaks=3, labels = c("low","medium","high")))
new_test_data_cat <- new_test_data %>%
mutate(price_category = discretize(new_test_data$price, breaks=3, labels = c("low","medium","high")))
# Drop the original numeric price column, keeping price_category as the target
new_train_data_cat <- new_train_data_cat[,-1]
new_test_data_cat <- new_test_data_cat[,-1]
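discretize() from the arules package bins into equal-frequency intervals by default, so the three categories should be roughly balanced; this can be verified with a quick tabulation (our own addition):
# Check the class balance of the discretized target in the training set
table(new_train_data_cat$price_category)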
Random forest classification and support vector machine are chosen as the algorithms for our classification modelling.
1. Random Forest Classification
set.seed(7004)
rf_model <- randomForest(price_category ~ ., data = new_train_data_cat, ntree = 100)
# Predict the target variable for the test set
x_test_rf <- new_test_data_cat[,!(colnames(new_test_data_cat) %in% c("price_category"))]
y_test_rf <- new_test_data_cat[,'price_category']
y_pred_rf_2 <- predict(rf_model, x_test_rf)
# confusion matrix
confusion_matrix_rf <- confusionMatrix(new_test_data_cat$price_category, y_pred_rf_2)
print(confusion_matrix_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction low medium high
## low 1448 190 54
## medium 235 1215 213
## high 118 304 1344
##
## Overall Statistics
##
## Accuracy : 0.7825
## 95% CI : (0.7709, 0.7937)
## No Information Rate : 0.3517
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6738
##
## Mcnemar's Test P-Value : 1.127e-09
##
## Statistics by Class:
##
## Class: low Class: medium Class: high
## Sensitivity 0.8040 0.7109 0.8343
## Specificity 0.9265 0.8687 0.8798
## Pos Pred Value 0.8558 0.7306 0.7610
## Neg Pred Value 0.8971 0.8571 0.9204
## Prevalence 0.3517 0.3337 0.3146
## Detection Rate 0.2828 0.2373 0.2624
## Detection Prevalence 0.3304 0.3247 0.3449
## Balanced Accuracy 0.8653 0.7898 0.8570
The accuracy of the random forest classifier is 0.7825. Looking at the per-class statistics, the model performs better for the low and high price categories than for the medium category.
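As a supplementary check (our own addition, not part of the original analysis), the Gini-based variable importance recorded by randomForest shows which features drive the classification:
# Mean decrease in Gini impurity per feature for the price-category model
importance(rf_model)
varImpPlot(rf_model, main = "Variable importance for price_category")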
2. Support Vector Machine Classification
svm_model <- svm(price_category ~ ., data = new_train_data_cat)
# Predict the target variable for the test set
x_test_svm <- new_test_data_cat[,!(colnames(new_test_data_cat) %in% c("price_category"))]
y_test_svm <- new_test_data_cat[,'price_category']
y_pred_svm <- predict(svm_model, x_test_svm)
# Confusion matrix
confusion_matrix_svm <- confusionMatrix(new_test_data_cat$price_category, y_pred_svm)
print(confusion_matrix_svm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction low medium high
## low 1411 234 47
## medium 289 1137 237
## high 190 341 1235
##
## Overall Statistics
##
## Accuracy : 0.7387
## 95% CI : (0.7265, 0.7507)
## No Information Rate : 0.3691
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6084
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: low Class: medium Class: high
## Sensitivity 0.7466 0.6641 0.8130
## Specificity 0.9130 0.8457 0.8526
## Pos Pred Value 0.8339 0.6837 0.6993
## Neg Pred Value 0.8603 0.8337 0.9154
## Prevalence 0.3691 0.3343 0.2966
## Detection Rate 0.2755 0.2220 0.2412
## Detection Prevalence 0.3304 0.3247 0.3449
## Balanced Accuracy 0.8298 0.7549 0.8328
The accuracy of the support vector machine classifier is 0.7387. Its sensitivity is highest for the high price class, followed by the low class and then the medium class; like the random forest, it is better at predicting the low and high price categories. Comparing the accuracy of both classification algorithms, random forest classification performs better than the support vector machine.
To better evaluate the two algorithms, ROC curves and AUC values are computed from each classifier's predictions on the test data.
new_test_data_cat$price_category <- as.numeric(new_test_data_cat$price_category)
y_pred_rf_2 <- as.numeric(y_pred_rf_2)
y_pred_svm <- as.numeric(y_pred_svm)
roc_rf <- roc(new_test_data_cat$price_category, y_pred_rf_2)
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
roc_svm <- roc(new_test_data_cat$price_category, y_pred_svm)
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot(roc_rf, main = "ROC Curves", col = "blue")
lines(roc_svm, col = "red")
legend("bottomright", legend = c("Random Forest", "Support Vector Machine"), col = c("blue", "red"), lwd = 2)
auc_rf <- auc(roc_rf)
auc_svm <- auc(roc_svm)
| Classification | AUC |
|---|---|
| Random Forest | 0.8528 |
| Support Vector Machine | 0.8304 |
The AUC represents the probability that a randomly chosen positive instance is ranked above a randomly chosen negative one. A model with 100% correct predictions has an AUC of 1.0, and a model with 100% wrong predictions has an AUC of 0.0.
From the calculated AUC values, random forest classification is performing better than support vector machine.
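Note that the roc() calls above effectively treat the three-level target as a two-class problem (as the "Setting levels: control = 1, case = 2" messages indicate). As a hedged alternative (our own addition), pROC's multiclass.roc() computes a multi-class AUC across all three price categories:
# Multi-class AUC over all three price categories (average of pairwise comparisons)
multiclass.roc(new_test_data_cat$price_category, y_pred_rf_2)$auc
multiclass.roc(new_test_data_cat$price_category, y_pred_svm)$auc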
All in all, the objective of this project is to apply data science methodologies to process, analyse and predict used car retail prices. We first provided a thorough introduction to the dataset, discussing its extent and the nature of each column. We then performed data pre-processing, including cleaning and splitting, to produce a model-friendly dataset. To gain a deeper understanding of the dataset's characteristics, we conducted exploratory data analysis using correlation analysis and other techniques. Finally, we modelled the problem with both regression and classification models.
In conclusion, random forest performs well in both regression and classification for price prediction. Hence, random forest provides the best prediction of used car retail prices from the available attributes.