1. Introduction.

It is much easier to establish the price that a manufacturer will set for a new car because of the fixed costs involved in its production, the taxes levied by the government, and the market segment that the manufacturer is targeting with a particular car model. However, this is not the case with used or old cars, because these factors don’t necessarily come into play. With the prices of new cars constantly rising, many buyers have been forced to purchase used or old cars as alternatives, a trend which is on an upward trajectory. A consumer trying to purchase an old or used car faces the challenge of not knowing how the price of such a car can be determined or predicted. This problem would not exist if there was a system that could adequately predict the prices of such cars with some desired accuracy by utilizing known car features. Our project aims at building a Linear Regression model to estimate the prices of used cars from such known features.

The used cars’ data set utilized for this work was obtained from Kaggle [1]. It has 4340 records and it was explored for its suitability.

This work is divided into the following sections: Section II reviews work related to used car price prediction. Section III covers the methodology, providing the detailed steps we adopted in the extensive preprocessing of the data before analysis, the building of the linear regression model, and the two regression trees (one built with rpart and one by bagging) which we developed and used to validate our linear regression model.

In Section IV, we examine the performance of the regression model by validating it against the regression trees.

The conclusions drawn from our study are elaborated in Section V, and finally, in Section VI, we provide a deeper description of the data that we used for our study and the software that we utilized for the same.

Lastly, the final section of this work records all the references that we used for this study.

3. Methodology

In this section, we provide the detailed steps that we took in building our Linear Regression model. We cover the steps that we employed in the extensive pre-processing of our data before it was used for analysis. We also cover the two different regression tree algorithms which we used to validate our model.

This section is further divided into the following subsections: data set information, data cleaning, data transformation, data sampling, obtaining of training and testing sets, the building of the linear regression model with the lm algorithm, and the building of the regression tree model for validation.

3.1 Data Set Information.

The data for this study was obtained from Kaggle. It contains a total of 4340 records with eight attributes, namely:

  • name - The car’s make and model.
  • selling_price - The current selling price of the car.
  • year - The year that the car was first bought.
  • km_driven - The number of kilometers that the car has accumulated while being driven.
  • fuel - The type of fuel that the car consumes.
  • seller_type - Whether the car is being sold by its owner (individual) or by a dealer.
  • transmission - How the car’s power is transferred from the engine to the wheels (Automatic or Manual).
  • owner - The number of people that have owned the car since its manufacture.

The data set was obtained from the online Data Science and Machine Learning interactive site Kaggle.

We needed to load the following dependencies to carry out our analysis in R, after which we imported our used cars data, which was structured as a CSV file.

set.seed(12300)
suppressPackageStartupMessages({
library(dplyr)
library(tidyverse)
library(ggplot2)
library(lubridate)
library(car)
library(stringr)
library(psych)
library(rpart)
library(rpart.plot)
library(ipred)
library(caret)
library(GGally)})

usedcars_df <- read.csv(file="C:/Users/Hill85/Desktop/CSV Data/Used_cars_data.csv")
    
head(usedcars_df)

Printing the attributes of the data set:

colnames(usedcars_df)
## [1] "name"          "year"          "selling_price" "km_driven"    
## [5] "fuel"          "seller_type"   "transmission"  "owner"

3.2 Data Exploration and Cleaning.

This was the most crucial part of this work, and it entailed a number of steps. The first was changing the names of the variables to make it easier to work with the data set. We then split the Car Make attribute, which combined both the car’s make and model, to obtain two different columns.

After that, we checked for missing values in all the columns and obtained their counts. After dealing with the issue of missing values, we renamed some records in the data set which had been wrongly labeled. Next, we checked for the existence of duplicates in our data, and finally, we dealt with the categorical variables, namely Car Make, Transmission, Num_Owners, Seller Type, and Fuel Type. We changed these variables into numeric ones through ordinal encoding.

We renamed the attributes of the data set to make it easier to work with.

usedcars <- usedcars_df %>% rename(Car_Make=name,
         Year=year,
         Price=selling_price,
         Mileage=km_driven,
         Fuel_Type=fuel,
         Seller_Type=seller_type,
         Transmission=transmission,
         Num_Owners=owner)
head(usedcars,5)

We then split the Car_Make column, which comprised the car’s make and its model, into two different columns, namely Car_Make and Car_Model.

usedcars <-extract(usedcars,Car_Make,c("Car_Make","Car_Model"), "([^ ]+) (.*)")
head(usedcars,5)

We then checked for missing values in the data.

apply(usedcars, 2, anyNA)
##     Car_Make    Car_Model         Year        Price      Mileage    Fuel_Type 
##        FALSE        FALSE        FALSE        FALSE        FALSE        FALSE 
##  Seller_Type Transmission   Num_Owners 
##        FALSE        FALSE        FALSE
colSums(is.na(usedcars))
##     Car_Make    Car_Model         Year        Price      Mileage    Fuel_Type 
##            0            0            0            0            0            0 
##  Seller_Type Transmission   Num_Owners 
##            0            0            0

Some records in the Car_Make attribute were wrongly labeled. We changed Land, Daewoo, and OpelCorsa to Land Rover, Daewood, and Opel Corsa respectively.

usedcars[usedcars[, "Car_Make"] == "Land", "Car_Make"] <- "Land Rover"
usedcars[usedcars[, "Car_Make"] == "Daewoo", "Car_Make"] <- "Daewood "
usedcars[usedcars[, "Car_Make"] == "OpelCorsa", "Car_Make"] <- "Opel Corsa "
#head(usedcars)

We then checked for duplicates in our data set.
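A minimal sketch of such a duplicate check is shown below (the original code chunk was not included here; the distinct() line is left commented out because it would only be needed if duplicates were actually found):

sum(duplicated(usedcars))              # number of fully duplicated records
# usedcars <- usedcars %>% distinct()  # would keep only unique rows if required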

3.3 Data Description/Distribution.

In this sub-section, we examined the statistical distributions of the data’s attributes to see how they were related to each other and to Price, which is the target variable.

3.3.1 Distribution of years the used cars were first bought.

The aim was to establish the counts of cars according to the years that they were initially bought.

Year_COUNT <- usedcars %>%
  group_by(Year) %>%
  summarise(n = n()) %>%
  mutate(Freq = n/sum(n)*100) %>% as.data.frame(usedcars_tibble) %>% 
  arrange(desc(Freq))
head(Year_COUNT,5)
ggplot(data=Year_COUNT,aes(x=Year,y=Freq)) +
  geom_line(color="#69b3a2",size=1) + 
   geom_point(size=3,color="blue4") +
  labs(x="Year",y="Number of cars(Freq)",
       title="Plot of Frequency of Car count per Year")+
  theme(plot.title =element_text(color="black",size=12,
                                 face="bold",
                                 lineheight = 0.8),
        axis.text.x = element_text())
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## i Please use `linewidth` instead.

2017 had the largest count of used cars at 466, representing 10.74% of the total, while 1995 and 1992 each had one car, representing 0.02% of the total car count.

3.3.2 Distribution of Car Seller Types vis-a-vis the years the cars were first bought.

ggplot(usedcars, aes(x=Seller_Type, y=Year, fill=Seller_Type)) + 
    geom_boxplot()+
  labs(x="Seller_Type",y="Year",
                        title="Boxplot of Car Seller Type Distribution ")+
  labs(x="Seller Type",y="Year")+
  theme(plot.title =element_text(color="black",size=12,face="bold",
                                 lineheight = 0.8),
        axis.text.x = element_text())

All the cars sold by Trustmark Dealers were first bought after 2015, while the oldest vehicles are those being sold by individual owners.

3.3.3 Distribution of Car Makes.

Next, we obtained the count of the Car Makes to establish the car brands that had the highest and lowest counts after which we visualized their results in a plot.

Car_Make_Count <- usedcars %>%
  group_by(Car_Make) %>%
  summarise(n = n()) %>%
  mutate(Freq = n/sum(n)*100) %>% as.data.frame(usedcars_tibble) %>% 
  arrange(desc(Freq))
head(Car_Make_Count,5)
ggplot(data=Car_Make_Count, aes(x=Car_Make, y=n,fill=Car_Make))+
  geom_bar(stat="identity")+
  labs(x="Car Make",y="Number of Cars",
       title=" Bar Plot of Car Make Count.")+
  theme(plot.title =element_text(color="black",
  size=12,face="bold",lineheight = 0.8))+
  theme(axis.text.x = element_text(angle = 90, hjust = 0.5))

Maruti had the biggest count with 1280 records, accounting for 29.5% of all the vehicles, followed by Hyundai with 821 records. Kia, Isuzu, Force, and Daewood each had one record, accounting for only 0.02% of the records apiece.

3.3.4 Computing the number of cars that utilize each category of fuel type.

Fuel_Type_Count <- usedcars %>%
  group_by(Fuel_Type) %>%
  summarise(n = n()) %>%
  mutate(Freq = n/sum(n)*100) %>% as.data.frame(usedcars_tibble) %>% 
  arrange(desc(Freq))
head(Fuel_Type_Count,5)
ggplot(data=Fuel_Type_Count, aes(x=Fuel_Type, y=n))+
  geom_bar(stat="identity",fill="red")+
  theme_dark()+
  labs(x="Fuel Type",y="Number of Cars",
       title="Bar Plot of The Type of Fuels utilized by Used Cars")+
  theme(plot.title =element_text(color="black",
  size=12,face="bold",               
  lineheight = 0.8))

The used cars are almost evenly divided between two fuel categories, diesel and petrol, which together account for 98% of the cars in the data set. Only one car utilized electric power.

3.3.5 Plotting the distribution of the car prices according to their makes.

ggplot(usedcars, aes(x = Car_Make, y = Price)) +
  geom_boxplot(fill = "#0099f8")+
    labs(
    title = "Box plot of Car Make prices",
    x = "Car Make",
    y = "Price")+
  coord_flip()

BMW and Mercedes were the most expensive cars in the data set while Daewood and Opel Corsa were the cheapest.

3.3.6 Finding the mean car prices for each car make.

options(scipen=999)
mean_car_prices <- usedcars %>%
  group_by(Car_Make) %>%
  summarize(mean_car_price =mean(Price)) %>%
  mutate(Car_Make=fct_reorder(Car_Make, mean_car_price))%>% 
  arrange(desc(mean_car_price))
head(mean_car_prices,5)
ggplot(data=mean_car_prices, aes(x=Car_Make, y=mean_car_price)) +
  geom_col(fill="#56B4E9") +
  labs(x="Car_Make",y="Mean Car prices",title=" Bar Plot of Car Mean prices")+
  theme(plot.title =element_text(color="black",size=12,face="bold",
  lineheight = 0.8),axis.text.x = element_text())+
  theme(axis.text.x = element_text(angle = 90, hjust = 0.5))

Land Rover had the highest mean price of all the car makes, followed by BMW, whereas Daewood had the lowest mean price.

3.3.7 Exploring the distribution of car mileage with a histogram plot.

options(scipen=999)
ggplot(usedcars, aes(x=Mileage)) + 
  geom_histogram(color="red", fill="blue4", bins = 160) +
  labs(x='Car Mileage ',y='Number of Cars(Freq)',title = "Histogram Plot of Car Mileage Distribution")  +
  scale_x_continuous(trans='log10')

The plotted histogram showed that the car mileages were skewed to the right: most cars cluster at lower mileages, while a long tail of vehicles in the data set had accumulated well over 100,000 km on the road.

3.3.8 Distribution of the type of used car sellers.

Seller_Type_Count <- usedcars %>%
  group_by(Seller_Type) %>%
  summarise(n = n()) %>%
  mutate(Freq = n/sum(n)*100) %>% as.data.frame(usedcars_tibble) %>% 
  arrange(desc(Freq))
head(Seller_Type_Count,5)
ggplot(data=Seller_Type_Count, aes(x=Seller_Type, y=n))+
  geom_bar(stat="identity",fill="purple")+
  theme_dark()+
  labs(x="Type of Car Sellers",y="Seller Type(Freq)",
       title="Bar Plot of the Type of Car Seller")+
  theme(plot.title =element_text(color="black",size=12,face="bold",
                                lineheight = 0.8))

Individual sellers were the top car sellers at 74.7%, followed by Dealers at 22.9%, while Trustmark Dealers came last at 2.4%.

3.3.9 Distribution of Car Ownership.

Number_of_owners <- usedcars %>%
  group_by(Num_Owners) %>%
  summarise(n = n()) %>%
  mutate(Freq = n/sum(n)*100) %>% as.data.frame(usedcars_tibble) %>% 
  arrange(desc(Freq))
head(Number_of_owners,5)
ggplot(data=Number_of_owners, aes(x=Num_Owners, y=n))+
  geom_bar(stat="identity",fill="indianred4")+
  labs(x="Type of Car Ownership",y="Car Ownership(Freq)",
  title=" Bar Plot of the type of Car Ownership")+
  theme(plot.title =element_text(color="black",size=12,face="bold",
  lineheight = 0.8),axis.text.x = element_text())+
  theme(axis.text.x = element_text(angle = 10, hjust = 0.5))

The majority of the cars are first-owner cars, meaning that they had not previously been owned by anyone else.

3.3.10 Exploring the Distribution of Price.

Price was our target variable, and we looked at its distribution to establish whether it was normally distributed.

ggplot(usedcars, aes(x=Price)) + 
  labs(x='Price of Used cars') +
  labs(title = "Histogram Graph of the Prices of used Cars") +
  geom_histogram(aes(y=..density..), colour="violet", fill="maroon")+
  geom_density() +
  scale_x_continuous(trans='log2')
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## i Please use `after_stat(density)` instead.
## 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the histogram plot, we established that Price, viewed on a logarithmic scale, had a fairly good normal distribution.

3.3.11 Checking for Outliers Using Boxplots.

We plotted box plots of all the numeric variables in the data set, namely Price, Mileage, and Year, to check for outliers.

suppressMessages(library(reshape))
reshaped_usedcars <- melt(usedcars)
## Using Car_Make, Car_Model, Fuel_Type, Seller_Type, Transmission, Num_Owners as id variables
boxplots <- ggplot(reshaped_usedcars, aes(factor(variable), value))
boxplots + 
  geom_boxplot() + 
  facet_wrap(~variable, scale="free")

Price, Year, and Mileage had outliers, but these were insignificant.

3.4 Data Transformation.

Our data set had five categorical attributes: Car Make, Seller Type, Fuel Type, Transmission, and Num_Owners. We converted them to integers through ordinal (label) encoding so that they could be used directly in the regression models, which here operate on numeric variables.

3.4.1 Converting the Car_Make attribute from categorical values to ordinal values.

carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Daewood', '0')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'OpelCorsa', '1')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Ambassador', '2')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Chevrolet', '3')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Fiat', '4')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Tata', '5')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Datsun', '6')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Maruti', '7')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Force', '8')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Renault', '9')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Hyundai', '10')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Nissan', '11')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Volkswagen', '12')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Skoda', '13')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Honda', '14')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Ford', '15')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Mahindra', '16')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Mitsubishi', '17')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Toyota', '18')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Kia', '19')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Isuzu', '20')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Jeep', '21')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'MG', '22')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Audi', '23')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Jaguar', '24')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Volvo', '25')
carmake_ordinal <- usedcars$Car_Make <- 
  str_replace(usedcars$Car_Make, 'mercedes-Benz','26')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'BMW', '27')
carmake_ordinal <- usedcars$Car_Make <- str_replace(usedcars$Car_Make, 'Land Rover', '28')
carmake_ordinal <- usedcars$Car_Make <- as.numeric(usedcars$Car_Make)
## Warning: NAs introduced by coercion
#table(carmake_ordinal)
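For reference, the same mapping could be expressed more compactly with a single named lookup vector instead of repeated str_replace() calls. This is only an illustrative sketch, not the code used above, and it assumes the make labels match these names exactly:

make_codes <- c(Daewood = 0, OpelCorsa = 1, Ambassador = 2, Chevrolet = 3, Fiat = 4,
                Tata = 5, Datsun = 6, Maruti = 7, Force = 8, Renault = 9,
                Hyundai = 10, Nissan = 11, Volkswagen = 12, Skoda = 13, Honda = 14,
                Ford = 15, Mahindra = 16, Mitsubishi = 17, Toyota = 18, Kia = 19,
                Isuzu = 20, Jeep = 21, MG = 22, Audi = 23, Jaguar = 24, Volvo = 25,
                `Mercedes-Benz` = 26, BMW = 27, `Land Rover` = 28)
# Indexing the named vector by the make column returns the numeric code
# (non-matching labels become NA, as with the approach above).
usedcars$Car_Make <- unname(make_codes[usedcars$Car_Make])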

3.4.2 Converting the Seller_Type Column into ordinal.

seller <- usedcars$Seller_Type <- str_replace(usedcars$Seller_Type,
'Trustmark Dealer', "0")
seller <- usedcars$Seller_Type <- str_replace(usedcars$Seller_Type, 'Dealer', "1")
seller <- usedcars$Seller_Type <- str_replace(usedcars$Seller_Type, 'Individual', "2")
seller <- usedcars$Seller_Type<- as.numeric(usedcars$Seller_Type)
table(seller)
## seller
##    0    1    2 
##  102  994 3244

3.4.3 Turning the Fuel_Type Column into ordinal.

fuel_usage <- usedcars$Fuel_Type <- str_replace(usedcars$Fuel_Type, 'Diesel', "0")
fuel_usage <- usedcars$Fuel_Type <- str_replace(usedcars$Fuel_Type, 'Petrol', "1")
fuel_usage <- usedcars$Fuel_Type <- str_replace(usedcars$Fuel_Type,'CNG',"2")
fuel_usage <-usedcars_df$Fuel_Type <- str_replace(usedcars$Fuel_Type, 'LPG', "3")
fuel_usage <-usedcars$Fuel_Type <- as.numeric(usedcars$Fuel_Type)
## Warning: NAs introduced by coercion
table(fuel_usage)
## fuel_usage
##    0    1    2 
## 2153 2123   40

3.4.4 Turning the Num_Owners Column into ordinal.

car_ownership <- usedcars$Num_Owners <- 
  str_replace(usedcars$Num_Owners, 'First Owner', "0")
car_ownership <- usedcars$Num_Owners <- 
  str_replace(usedcars$Num_Owners, 'Second Owner', "1")
car_ownership <- usedcars$Num_Owners <- 
  str_replace(usedcars$Num_Owners, 'Third Owner', "2")
car_ownership <-usedcars$Num_Owners<-
  str_replace(usedcars$Num_Owners,'Fourth & Above Owner', "3")

car_ownership <-usedcars$Num_Owners <- 
  str_replace(usedcars$Num_Owners, 'Test Drive Car', "4")
car_ownership <-usedcars$Num_Owners <- 
  as.numeric(usedcars$Num_Owners)
table(car_ownership)
## car_ownership
##    0    1    2    3    4 
## 2832 1106  304   81   17

3.4.5 Making the Transmission column ordinal.

transmission_mode <- usedcars$Transmission <- 
  str_replace(usedcars$Transmission, 'Manual', "0")
transmission_mode <- usedcars$Transmission <- 
  str_replace(usedcars$Transmission, 'Automatic', "1")
transmission_mode <- usedcars$Transmission <- 
  as.numeric(usedcars$Transmission)
table(transmission_mode)
## transmission_mode
##    0    1 
## 3892  448

We ended up with the following transformed data set.

usedcars_transformed <- usedcars
head(usedcars_transformed,5)

3.4.6 Dropping the Car Model Attribute

The Car Model attribute had too many unique values to encode ordinally, so we dropped it from the data.

usedcars_transformed <- subset(usedcars, select = -c(Car_Model))
head(usedcars_transformed,5)

In the process of converting some of our categorical columns to ordinal values, we realized that some NA values had been introduced into our data by R's coercion of unmatched labels, and we had to remove them from the data set.

usedcars_transformed <-usedcars_transformed %>% drop_na()
#head(usedcars_transformed)
dim(usedcars_transformed)
## [1] 4279    8

3.5 Obtaining Summary Statistics of the transformed data set.

We obtained summary statistics for all the predictor variables as well as the price (target variable) in our transformed data.

psych::describe(usedcars_transformed)

3.6 Exploring the Correlation of the independent variables.

We obtained the correlation matrix from our data set to see how the predictor variables correlated with Price as well as with each other. We utilized a corrgram, which depicts the correlations of all variables: positive correlations are shown in blue and negative correlations in red, and the darker the color, the stronger the correlation between the cross-referenced variables.

suppressMessages(require(corrgram))
corrgram(usedcars_transformed, order=TRUE)

From the resulting corrgram plot, we can see that the target variable (Price) is strongly positively correlated with three predictor variables, namely Car_Make, Transmission, and Year. Mileage and Num_Owners, Mileage and Seller_Type, Num_Owners and Seller_Type, and Year and Transmission each had very little correlation. The Transmission and Car_Make attributes had a slight but insignificant positive correlation.

We also obtained the Pearson correlation between the variables. The Pearson correlation coefficient provides a numeric measure of how two variables change with each other [5]. A coefficient between 0.5 and 0.7 indicates attributes that are moderately correlated, while a value below 0.5 indicates a low correlation between variables.
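For two variables $x$ and $y$ with $n$ paired observations, the coefficient is defined as

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}},$$

which ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation).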

ggcorr(usedcars_transformed, label = T)

3.7 Data Sampling

Our transformed data had 4279 records. To get a reasonable size to work with, we drew a sample containing sixty percent of the transformed records; the resulting sample had 2568 records.

sample_percentage <- 0.6 

sample_size <- ceiling(sample_percentage * nrow(usedcars_transformed))

# draw a random sample of sample_size rows without replacement
usedcars_sample <- usedcars_transformed %>% slice_sample(n = sample_size, replace = FALSE)

usedcars_sample <-usedcars_sample %>% drop_na()
head(usedcars_sample,5)
dim(usedcars_sample)
## [1] 2568    8

3.8 Splitting our sampled data set into Training and Testing Sets

We split our sampled data set into training and testing sets of 2054 and 514 records respectively, representing 80% and 20% of the sampled data.

set.seed(1230)
usedcars_sample$id <- 1:nrow(usedcars_sample)

#split 80% of the dataset into training set and the remaining 20% as the test set 
training_set <- usedcars_sample %>% dplyr::sample_frac(0.8)
testing_set  <- dplyr::anti_join(usedcars_sample, training_set, by = 'id')
#drop the id columns from both sets
training_data <- subset(training_set, select = -c(id))
testing_data <- subset(testing_set, select = -c(id))
head(training_data)
head(testing_data)
dim(training_data)
## [1] 2054    8
dim(testing_data)
## [1] 514   8

3.9 Building the Linear Regression Model.

We trained a linear regression model with Price as a function of the predictor attributes Car_Make, Year, Mileage, Fuel_Type, Seller_Type, Transmission, and Num_Owners.

set.seed(12300)
Regression_model_1 <- lm(
  formula = Price ~ Year + Car_Make + Mileage + 
    Fuel_Type + Seller_Type + Transmission + 
    Num_Owners,data = training_data)

summary(Regression_model_1)
## 
## Call:
## lm(formula = Price ~ Year + Car_Make + Mileage + Fuel_Type + 
##     Seller_Type + Transmission + Num_Owners, data = training_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -870169 -150208  -13546  119870 2938142 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.295e+07  4.338e+06 -16.816  < 2e-16 ***
## Year          3.637e+04  2.151e+03  16.910  < 2e-16 ***
## Car_Make      3.905e+04  1.792e+03  21.798  < 2e-16 ***
## Mileage      -7.955e-01  1.838e-01  -4.329 1.57e-05 ***
## Fuel_Type    -1.548e+05  1.597e+04  -9.692  < 2e-16 ***
## Seller_Type  -6.342e+04  1.573e+04  -4.032 5.73e-05 ***
## Transmission  5.506e+05  2.790e+04  19.734  < 2e-16 ***
## Num_Owners   -1.796e+04  1.143e+04  -1.572    0.116    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 339000 on 2046 degrees of freedom
## Multiple R-squared:  0.565,  Adjusted R-squared:  0.5635 
## F-statistic: 379.6 on 7 and 2046 DF,  p-value: < 2.2e-16

3.9.1 Interpreting Linear Regression Coefficients from the model’s output.

The intercept represents the predicted price when all the predictor variables are held at zero, while each predictor’s coefficient gives the change in the predicted price of a used car for every one-unit change in that variable, holding the other variables constant.

In the model output, Pr(>|t|) indicates the p-value, which can be compared with the alpha value of 0.05 to test whether each coefficient is significant [6]. All p-values from our model are significant except that of the Num_Owners variable, whose p-value is greater than the alpha value.
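To make this interpretation concrete, the short sketch below (illustrative only, not part of the original analysis) reconstructs the model’s prediction for a single training row directly from the estimated coefficients:

coefs <- coef(Regression_model_1)
x <- unlist(training_data[1, c("Year", "Car_Make", "Mileage", "Fuel_Type",
                               "Seller_Type", "Transmission", "Num_Owners")])
# intercept plus the sum of each coefficient times its predictor value
manual_pred <- coefs["(Intercept)"] + sum(coefs[names(x)] * x)
# should agree with predict() on the same row
all.equal(unname(manual_pred), unname(predict(Regression_model_1, newdata = training_data[1, ])))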

3.9.2 Multiple R-squared: 0.565

The R-squared is the coefficient of determination and lies between 0 and 1. From this model, a value of 0.565 indicates that the intercept, Year, Car_Make, Mileage, Fuel_Type, Seller_Type, Transmission, and Num_Owners explain 56.5% of the variance in the predicted variable, Price.

3.9.3 Adjusted R-squared: 0.5635

The Adjusted R-squared value measures whether the addition of new predictor variables genuinely improves the model, since it penalizes the R-squared for every extra predictor added.
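For reference, with $n$ observations and $p$ predictors, the adjusted R-squared is computed from the ordinary R-squared as

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1},$$

so it only increases when a newly added predictor improves the fit by more than would be expected by chance.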

3.9.4 F-statistic: 379.6 on 7 and 2046 DF, p-value: < 2.2e-16

The F-statistic and its p-value provide the basis for testing the overall significance of the model. The null hypothesis is that the model is not significant, with the alternative hypothesis being the contrary. The p-value from the model is < 0.05, indicating that our model is significant.
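The overall F-test reported above can also be pulled out of the fitted model directly; the short illustration below (not part of the original analysis) recomputes its p-value:

fstat <- summary(Regression_model_1)$fstatistic
fstat
# p-value of the overall F-test
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)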

3.9.5 Root Mean Square Error (RMSE)

The RMSE helps us judge which model is better: a lower RMSE value indicates that the model generalizes better. It also enables us to compare the training and testing errors so that we can see how the model performs on data it has not previously seen.
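For reference, the RMSE can be computed directly from its definition; the small helper below is equivalent in spirit to the Metrics::rmse function used further down:

# square root of the mean squared difference between actual and predicted values
rmse_manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))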

names(Regression_model_1)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

The output vector above lists the components of the fitted model object. The fitted values are the model’s predictions on the training data; together with predictions on the testing data, they can be used to generate the RMSEs for both sets.

set.seed(12300)
suppressMessages(library(Metrics))
rmse(actual = training_data$Price, predicted =Regression_model_1$fitted.values)
## [1] 338294.7
# the testing error uses predict() on the testing data, since the fitted values only cover the training rows
rmse(actual = testing_data$Price, predicted = predict(Regression_model_1, newdata = testing_data))
## [1] 333238

The model’s training and testing errors were also calculated manually and found to match the values generated by the rmse function.

set.seed(12300)
Regression_prdct <- predict(Regression_model_1, newdata = training_data)

Regression_prdct_MSE <- mean((training_data$Price - Regression_prdct)^2)
Regression_prdct_MSE
## [1] 114443275697

We then computed the root mean squared errors (RMSE):

sqrt(Regression_prdct_MSE)
## [1] 338294.7

Next, we computed the testing error of the model.

set.seed(12300)
Regression_test_prdct <- predict(Regression_model_1, newdata = testing_data)

Regression_test_MSE <- mean((testing_data$Price - Regression_test_prdct)^2)
sqrt(Regression_test_MSE)
## [1] 333238

3.9.6 Predicting the price in the testing data.

Our model’s performance was tested on the testing data to see how stable it was, and we generated the R-squared value, which we found to be very close to the one generated from the training set (0.565 versus 0.5651328).

set.seed(12300)
testing_data$Predicted_Price <- predict(Regression_model_1, testing_data)
#Computing R-squared from the testing data
actual_price <- testing_data$Price
predicted_price <- testing_data$Predicted_Price
r_s <- sum((predicted_price - actual_price) ^ 2)
t_s <- sum((actual_price - mean(actual_price)) ^ 2)
r_squared <- 1 - r_s/t_s

r_squared
## [1] 0.5651328

3.9.7 Predicting the selling price on the test data with our model

pred_lreg <- predict(Regression_model_1,testing_data)
error_lreg <- testing_data$Price - pred_lreg
RMSE_lreg <- sqrt(mean(error_lreg^2))
RMSE_lreg
## [1] 333238

3.9.8 Plotting predicted price vs. actual price using the testing data

plot(testing_data$Price,pred_lreg, main="Plot of Actual vs Predicted Selling Price", 
     col = c("darkblue","red"), 
     xlab = "Actual Car Selling Price", 
     ylab = "Predicted Car Selling Price")

plot(Regression_model_1)

3.10 Validating our Linear Regression Model

3.10.1 Building a Basic Regression Tree as a validation model.

To establish whether our regression model could generalize better, we built a Regression Tree to predict Price as a function of the seven predictor variables namely Year, Car_Make, Mileage, Fuel_Type, Seller_Type, Transmission, and Num_Owners.

We fitted the Regression Tree using rpart, explicitly setting the method to "anova" for reproducibility, since rpart otherwise guesses the method to use from the type of the response variable.

set.seed(12300)
RTree_model <- rpart(
  formula = Price ~ Year + Car_Make + Mileage + Fuel_Type + 
  Seller_Type + Transmission + Num_Owners,
  data    = training_set,
  method  = "anova")

RTree_model
## n= 2054 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 2054 5.403244e+14  477488.1  
##    2) Car_Make< 20 1998 2.353498e+14  424595.9  
##      4) Year< 2013.5 991 2.917685e+13  250992.5  
##        8) Year< 2010.5 502 6.699610e+12  169667.7 *
##        9) Year>=2010.5 489 1.574880e+13  334479.3 *
##      5) Year>=2013.5 1007 1.469137e+14  595440.9  
##       10) Car_Make< 14.5 790 4.636612e+13  516383.5 *
##       11) Car_Make>=14.5 217 7.763459e+13  883253.4  
##         22) Transmission< 0.5 188 1.887512e+13  717393.6 *
##         23) Transmission>=0.5 29 2.006026e+13 1958483.0  
##           46) Year< 2016.5 9 2.668756e+12 1217778.0 *
##           47) Year>=2016.5 20 1.023170e+13 2291800.0 *
##    3) Car_Make>=20 56 9.995680e+13 2364607.0  
##      6) Year< 2017.5 39 2.004685e+13 1721410.0 *
##      7) Year>=2017.5 17 2.676139e+13 3840176.0 *
#summary(RTree_model)

3.10.1.1 Obtaining the Basic Regression Tree’s training error from the training data

pred <- predict(RTree_model, training_data)
RMSE(pred, training_data$Price)
## [1] 267883.6

3.10.1.2 Obtaining the Basic Regression Tree’s testing error from the testing set

pred <- predict(RTree_model,testing_data)
RMSE(pred = pred,testing_data$Price)
## [1] 316277.7

The fitted regression tree started with all 2054 observations at the root node, and the first variable entering a split was Car_Make, the variable giving the largest reduction in the SSE. The resulting second node of the tree contained 1998 observations, with a mean price of 424,595.9 and an SSE of 2.353498e+14.

The third node contained 56 observations, with a mean price of 2,364,607 and an SSE of 9.995680e+13. We therefore deduced that the most important variable in predicting the sale price of used cars from the basic Regression Tree is the Car Make, because it produced the largest reduction in SSE.

This model generated RMSEs of 267,883.6 and 316,277.7 on the training and testing sets respectively.

From the visualized plot of our basic regression tree, we found that the model had seven internal nodes and eight terminal nodes, which captured the percentages of the data that fell in those nodes as well as the corresponding mean prices in each of those branches.

3.10.1.3 Plotting the Regression Tree model

rpart.plot(RTree_model)

3.10.2 Improving the performance of our Regression Tree with Bootstrap Aggregating (Bagging).

Regression trees are associated with high variance, and one of the methods that can significantly reduce this variance is Bagging (Bootstrap aggregating). Bootstrap aggregating achieves this by averaging the predictions of multiple trees, which has the net effect of reducing over-fitting relative to a single tree [7].

Bagging follows three simple steps. First, generate m bootstrap samples from the training set; these samples produce different data sets while maintaining the same distribution that exists in the training set. Second, train a single unpruned regression tree on each bootstrap sample. Finally, average the individual predictions from the trees so that one overall predicted value is obtained, as sketched below.
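The sketch below illustrates these three steps manually with rpart; it is only an illustration of the procedure (the number of bootstrap samples is an assumed value), and the actual bagged model reported later is fitted with the caret/ipred tooling:

set.seed(12300)
m <- 25  # assumed number of bootstrap samples for this illustration
bagged_preds <- sapply(seq_len(m), function(i) {
  # 1. draw a bootstrap sample of the training data
  boot_idx  <- sample(nrow(training_data), replace = TRUE)
  boot_data <- training_data[boot_idx, ]
  # 2. grow a deep (lightly pruned) regression tree on the bootstrap sample
  tree <- rpart(Price ~ ., data = boot_data, method = "anova",
                control = rpart.control(cp = 0.001))
  # 3. predict on the testing set with this tree
  predict(tree, newdata = testing_data)
})
# averaging the m sets of predictions gives the bagged prediction
bagged_price <- rowMeans(bagged_preds)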

One important thing to note is that a bootstrap sample contains roughly two-thirds of the distinct training observations, leaving about one-third out. This left-out data is called the OOB (Out of Bag) sample [8], [9]; the short check below illustrates the proportion.
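A quick numerical check of this proportion (a sketch, not part of the original analysis): the expected fraction of distinct training rows appearing in one bootstrap sample is 1 - (1 - 1/n)^n, which approaches about 0.632 for large n.

n <- nrow(training_data)
# fraction of distinct rows captured by one bootstrap sample, typically about 0.63
mean(seq_len(n) %in% sample(n, replace = TRUE))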

The OOB sample then acts as a test set for gauging how well the model is able to generalize.

To see whether we could improve our regression tree, we trained a regression tree by bagging.

3.10.3 Building an improved Regression Tree by Bagging.

The ipred package provides improved prediction models through bagging and indirect classification, together with resampling-based estimators of the prediction error.

One advantage of this method is that it makes it very simple to carry out cross-validation. A much better error estimate is obtained even if one chooses to rely only on the OOB sample (the roughly one-third of the training set left out of each bootstrap sample).

Secondly, and most importantly, we can rank the importance of the predictor variables in the model.
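As an illustration of these two points, the bagged tree could also be fitted directly with ipred’s bagging() function and asked for its out-of-bag error (a sketch; the model we actually report below is trained through caret’s "treebag" method, which wraps the same ipred machinery):

set.seed(12300)
ipred_bag <- bagging(
  Price ~ Year + Car_Make + Mileage + Fuel_Type +
    Seller_Type + Transmission + Num_Owners,
  data = training_data,
  coob = TRUE)   # request the out-of-bag estimate of the error
ipred_bag        # printing the object shows the out-of-bag root mean squared error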

We first instructed the training procedure to carry out 10-fold cross-validation.

set.seed(12300)
validation_control <- trainControl(method = "cv", number=10) # 10 cross validations

crossValidated_baggingModel <- train(
  Price ~ Year + Car_Make + Mileage + Fuel_Type + 
  Seller_Type + Transmission +
  Num_Owners,data = training_data,
  method = "treebag",
  trControl = validation_control,
importance = TRUE) #Rank the predictor variable importance in the model

crossValidated_baggingModel
## Bagged CART 
## 
## 2054 samples
##    7 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1849, 1849, 1849, 1848, 1848, 1849, ... 
## Resampling results:
## 
##   RMSE    Rsquared   MAE     
##   261786  0.7384331  164501.9

From the bagging model, 10-fold cross-validation generated an RMSE of 261,786 and an R-squared of 73.8%.

3.10.4 Ranking the Importance of the predictor attributes in the model.

An important predictor in a regression tree is one whose splits account for the greatest reduction in the total SSE.

plot(varImp(crossValidated_baggingModel)) #predictor importance of the seven variables.

3.10.5 Interpreting the Regression Tree built by Bagging

We obtained an R-squared of 73.8% from the regression tree built by bagging, which is much better than the one we obtained from Linear Regression (56.5%).

The RMSE of the bagged Regression Tree (261,786) is also better than the one obtained from the basic Regression Tree (267,883).

The year the car was first bought, the car make, and the mileage were the most important predictor variables for predicting the selling price of a used car, while Seller_Type and Num_Owners were the least important predictors in the improved Regression Tree model.

4. Results and Discussion

In our work, the data was split into training and testing sets at a ratio of 80:20. When we trained our linear regression model on the training set, we obtained an RMSE of 338,294.7 and an R-squared of 56.5%.

The model’s p-value was less than 2.2e-16, indicating that it was statistically significant since it was below the alpha value of 0.05.

The Linear regression model was validated using two different trees, one built by the rpart function and the other through bagging.

The basic Regression Tree from rpart yielded an RMSE of 267,883, whereas the improved Regression Tree (from bagging) yielded an RMSE of 261,786 and an R-squared of 73.8%.

In the Linear Regression model, six of the seven predictor variables had p-values below the significance (alpha) level of 0.05. Only the Num_Owners variable yielded a higher p-value, indicating that it contributed very little to explaining the variance in the model. The same finding (together with the Seller_Type attribute) was replicated in the improved Regression Tree built by bagging.

The Regression Tree obtained through bagging indicated that the most important variables for predicting the prices of used cars are the year, the car make, and the mileage.

5. Conclusion

We found that a better prediction of used car prices can be achieved with regression trees built through bootstrap aggregating (bagging), since they are able to generalize better than Linear Regression and basic Regression Trees. The bagged Regression Tree model’s accuracy is therefore superior to that of the Linear Regression model.

We also established that both the Linear Regression and Regression Tree models pointed to the car make, the year, and the mileage as the most important predictors of used car prices.

Both models also established that the number of people who have owned the car since it was first bought and the seller type contribute very little to determining the selling price of used cars.

We therefore hold the view that a better and superior model for predicting the price of used cars could be obtained by removing some predictor attributes through feature selection, as established by K. Noor et al. [4] and discussed in Section II of this work.

6. Data and Software Availability

6.1 Data.

6.2 Software Availability.

6.2.1 Other software utilized for this work.

We utilized several R packages in our work, most of which played supporting roles. However, the main emphasis is on the rpart, lm, and ipred packages.

  • dplyr [dplyr]. We used dplyr to preprocess our data by checking for missing values and renaming attributes and wrongly labelled records. This library was also used to check for unique values so that duplicated records would not be used for analysis.

  • lm: Fitting Linear Models [lm documentation]. We utilized the lm function to build our linear regression model.

  • rpart: Recursive Partitioning and Regression Trees [rpart documentation]. This package was used to build a basic regression tree which, alongside an improved regression tree built by bagging, was used to validate the linear regression model.

  • ipred: Improved Predictors [ipred documentation]. The second, bagged regression tree was built with this package.

  • GGally: Extension to ggplot2 [GGally]. This package enabled us to examine the correlations between the variables in our data as well as with the target variable, Price.

7. References

  1. Kaggle datasets. https://www.kaggle.com/code/celestioushawk/cardekho-used-cars/data

  2. Samruddhi, K., and R. Ashok Kumar. “Used Car Price Prediction using K-Nearest Neighbor Based Model.” Int. J. Innov. Res. Appl. Sci. Eng.(IJIRASE) 4 (2020): 629-632.

  3. Gegic, E., Isakovic, B., Keco, D., Masetic, Z., & Kevric, J. (2019). “Car price prediction using machine learning techniques.” TEM Journal, 8(1), 113.

  4. Noor, Kanwal, and Sadaqat Jan. “Vehicle price prediction system using machine learning techniques.” International Journal of Computer Applications 167.9 (2017): 27-31.

  5. Sedgwick, Philip. “Pearson’s correlation coefficient.” Bmj 345 (2012).

  6. O’Brien, Sheila F., Lori Osmond, and Qi-Long Yi. “How do I interpret a p value?.” Transfusion 55.12 (2015): 2778-2782.

  7. Breiman, L. Bagging Predictors. Machine Learning 24, 123–140 (1996). https://doi.org/10.1023/A:1018054314350

  8. Breiman, L. “Out-of-bag estimation: Technical Report.” Department of Statistics, University of California (1996).

  9. Hothorn, Torsten, and Berthold Lausen. “Bundling classifiers by bagging trees.” Computational Statistics & Data Analysis 49.4 (2005): 1068-1078.

  10. Kaggle datasets. https://www.kaggle.com/dataset