Introduction:

The aim of this project is to build models that predict the selling price of used cars. The dataset is taken from Kaggle. Data Source: https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho?select=Car+details+v3.csv

I will use three models: linear regression, random forest and gradient boosting. I will predict selling prices with each and compare the results.

Attribute Information:

name - Name of the car

year - Year the car was bought

selling_price - Price at which the car is being sold (the target variable)

km_driven - Number of kilometers the car has been driven

fuel - Fuel type of the car (Petrol / Diesel / CNG / LPG / Electric)

seller_type - Whether the seller is an Individual, a Dealer or a Trustmark Dealer

transmission - Gear transmission of the car (Automatic / Manual)

owner - Number of previous owners of the car

mileage - Fuel efficiency of the car (kmpl or km/kg)

engine - Engine displacement (CC)

max_power - Maximum engine power (bhp)

torque - Torque of the engine

seats - Number of seats

#Loading the libraries used below
library(ggplot2)      #plots
library(stringr)      #str_replace for string cleaning
library(caret)        #createDataPartition for the train/test split
library(randomForest) #random forest model

#Reading the data
car <- read.csv("https://raw.githubusercontent.com/swetaswarupa/Car-Price-Prediction/main/Car%20details%20v3.csv", header = TRUE)
str(car)
## 'data.frame':    8128 obs. of  13 variables:
##  $ name         : chr  "Maruti Swift Dzire VDI" "Skoda Rapid 1.5 TDI Ambition" "Honda City 2017-2020 EXi" "Hyundai i20 Sportz Diesel" ...
##  $ year         : int  2014 2014 2006 2010 2007 2017 2007 2001 2011 2013 ...
##  $ selling_price: int  450000 370000 158000 225000 130000 440000 96000 45000 350000 200000 ...
##  $ km_driven    : int  145500 120000 140000 127000 120000 45000 175000 5000 90000 169000 ...
##  $ fuel         : chr  "Diesel" "Diesel" "Petrol" "Diesel" ...
##  $ seller_type  : chr  "Individual" "Individual" "Individual" "Individual" ...
##  $ transmission : chr  "Manual" "Manual" "Manual" "Manual" ...
##  $ owner        : chr  "First Owner" "Second Owner" "Third Owner" "First Owner" ...
##  $ mileage      : chr  "23.4 kmpl" "21.14 kmpl" "17.7 kmpl" "23.0 kmpl" ...
##  $ engine       : chr  "1248 CC" "1498 CC" "1497 CC" "1396 CC" ...
##  $ max_power    : chr  "74 bhp" "103.52 bhp" "78 bhp" "90 bhp" ...
##  $ torque       : chr  "190Nm@ 2000rpm" "250Nm@ 1500-2500rpm" "12.7@ 2,700(kgm@ rpm)" "22.4 kgm at 1750-2750rpm" ...
##  $ seats        : int  5 5 5 5 5 5 5 4 5 5 ...

A portion of the car data set is shown below:

Car Data

name                         | year | selling_price | km_driven | fuel   | seller_type | transmission | owner        | mileage    | engine
Maruti Swift Dzire VDI       | 2014 | 450000        | 145500    | Diesel | Individual  | Manual       | First Owner  | 23.4 kmpl  | 1248 CC
Skoda Rapid 1.5 TDI Ambition | 2014 | 370000        | 120000    | Diesel | Individual  | Manual       | Second Owner | 21.14 kmpl | 1498 CC
Honda City 2017-2020 EXi     | 2006 | 158000        | 140000    | Petrol | Individual  | Manual       | Third Owner  | 17.7 kmpl  | 1497 CC
Hyundai i20 Sportz Diesel    | 2010 | 225000        | 127000    | Diesel | Individual  | Manual       | First Owner  | 23.0 kmpl  | 1396 CC
Maruti Swift VXI BSIII       | 2007 | 130000        | 120000    | Petrol | Individual  | Manual       | First Owner  | 16.1 kmpl  | 1298 CC

There are 8128 rows and 13 variables. The target variable is selling_price; the remaining variables are used as predictors.

Data Exploration, Data Cleaning and Data Transformation

Car name variable


#Extracting brand name from car name

car$name <- sapply(strsplit(car$name, " "), `[`, 1)

#Plotting car name to check the distribution

ggplot(data = car, aes(x=name, fill = name)) +
  geom_bar() + labs(x='Car Brand') + labs(title = "Bar Graph of Car Brand") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

#Ordinal-encoding the car brand: each brand name is replaced with an integer code

car$name <- str_replace(car$name, 'Maruti', '0')
car$name <- str_replace(car$name, 'Skoda', '1')
car$name <- str_replace(car$name, 'Honda', '2')
car$name <- str_replace(car$name, 'Hyundai', '3')
car$name <- str_replace(car$name, 'Toyota', '4')
car$name <- str_replace(car$name, 'Ford', '5')
car$name <- str_replace(car$name, 'Renault', '6')
car$name <- str_replace(car$name, 'Mahindra', '7')
car$name <- str_replace(car$name, 'Tata', '8')
car$name <- str_replace(car$name, 'Chevrolet', '9')
car$name <- str_replace(car$name, 'Fiat', '10')
car$name <- str_replace(car$name, 'Datsun', '11')
car$name <- str_replace(car$name, 'Jeep', '12')
car$name <- str_replace(car$name, 'Mercedes-Benz', '13')
car$name <- str_replace(car$name, 'Mitsubishi', '14')
car$name <- str_replace(car$name, 'Audi', '15')
car$name <- str_replace(car$name, 'Volkswagen', '16')
car$name <- str_replace(car$name, 'BMW', '17')
car$name <- str_replace(car$name, 'Nissan', '18')
car$name <- str_replace(car$name, 'Lexus', '19')
car$name <- str_replace(car$name, 'Jaguar', '20')
car$name <- str_replace(car$name, 'Land', '21')
car$name <- str_replace(car$name, 'MG', '22')
car$name <- str_replace(car$name, 'Volvo', '23')
car$name <- str_replace(car$name, 'Daewoo', '24')
car$name <- str_replace(car$name, 'Kia', '25')
car$name <- str_replace(car$name, 'Force', '26')
car$name <- str_replace(car$name, 'Ambassador', '27')
car$name <- str_replace(car$name, 'Ashok', '28')
car$name <- str_replace(car$name, 'Isuzu', '29')
car$name <- str_replace(car$name, 'Opel', '30')
car$name <- str_replace(car$name, 'Peugeot', '31')

#Converting car name from categorical to numerical value

car$name <- as.numeric(car$name)
table(car$name)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2448  105  467 1415  488  397  228  772  734  230   47   65   31   54   14   40 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31 
##  186  120   81   34   71    6    3   67    3    4    6    4    1    5    1    1

The Maruti brand accounts for the most cars, followed by Hyundai, Mahindra and Tata.
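The chain of str_replace calls above works because no brand name here is a substring of another, but it is verbose and easy to get wrong. A more compact, equivalent sketch (not run here) uses a lookup vector and match():

#Hypothetical alternative to the str_replace chain above (same 0-31 codes)
brands <- c("Maruti", "Skoda", "Honda", "Hyundai", "Toyota", "Ford", "Renault",
            "Mahindra", "Tata", "Chevrolet", "Fiat", "Datsun", "Jeep",
            "Mercedes-Benz", "Mitsubishi", "Audi", "Volkswagen", "BMW", "Nissan",
            "Lexus", "Jaguar", "Land", "MG", "Volvo", "Daewoo", "Kia", "Force",
            "Ambassador", "Ashok", "Isuzu", "Opel", "Peugeot")
#match() returns the 1-based position in brands, so subtract 1 to start codes at 0
car$name <- match(car$name, brands) - 1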

Substituting blanks with NA in the mileage, engine and max_power columns

car$mileage[car$mileage == ""] <- NA
car$engine[car$engine == ""] <- NA
car$max_power[car$max_power == ""] <- NA

Checking for missing values

# Checking for missing values
sapply(car, function(x) sum(is.na(x)))
##          name          year selling_price     km_driven          fuel 
##             0             0             0             0             0 
##   seller_type  transmission         owner       mileage        engine 
##             0             0             0           221           221 
##     max_power         seats 
##           215           221

There are 221 missing values each for mileage, engine and seats, and 215 missing values for max_power.

Transforming mileage, engine, max_power and seats from character to numeric and imputing the missing values (mean for mileage, engine and max_power; median for seats)

#Removing unit from mileage, converting it to numeric value and replacing the missing values
car$mileage <- str_replace(car$mileage, 'kmpl', '')
car$mileage <- str_replace(car$mileage, 'km/kg', '')
car$mileage <- as.numeric(car$mileage)
car$mileage[is.na(car$mileage)]<-mean(car$mileage,na.rm=TRUE)
#Removing unit from engine, converting it to numeric value and replacing the missing values

car$engine <- str_replace(car$engine, 'CC', '')
car$engine <- as.numeric(car$engine)
car$engine[is.na(car$engine)]<-mean(car$engine,na.rm=TRUE)
#Removing unit from max_power, converting it to numeric value and replacing the missing values
car$max_power <- str_replace(car$max_power, 'bhp', '')
car$max_power <- as.numeric(car$max_power)
car$max_power[is.na(car$max_power)]<-mean(car$max_power,na.rm=TRUE)
#Converting seats to numeric value and replacing the missing values
car$seats <- as.numeric(car$seats)
car$seats[is.na(car$seats)]<-median(car$seats,na.rm=TRUE)
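
The four blocks above repeat the same strip-convert-impute pattern. A reusable helper (a hypothetical sketch, equivalent to the steps above) could condense them:

#Hypothetical helper condensing the strip/convert/impute pattern above
strip_and_impute <- function(x, unit_pattern, impute = mean) {
  x <- as.numeric(str_replace(x, unit_pattern, ""))  #drop the unit, coerce to numeric
  x[is.na(x)] <- impute(x, na.rm = TRUE)             #fill NAs with mean (or median)
  x
}
#Equivalent to the steps above:
#car$mileage   <- strip_and_impute(car$mileage, "kmpl|km/kg")
#car$engine    <- strip_and_impute(car$engine, "CC")
#car$max_power <- strip_and_impute(car$max_power, "bhp")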

Checking for missing values again after imputation

# Checking for missing values once again
sapply(car, function(x) sum(is.na(x)))
##          name          year selling_price     km_driven          fuel 
##             0             0             0             0             0 
##   seller_type  transmission         owner       mileage        engine 
##             0             0             0             0             0 
##     max_power         seats 
##             0             0

There are no missing values any more.

Plotting categorical Values and checking for distribution

# Bar graph of Fuel
ggplot(data = car, aes(x=reorder(fuel, fuel, function(x)-length(x)), fill = fuel)) +
  geom_bar() + labs(x='Fuel') + labs(title = "Bar Graph of Fuel") 

Most of the cars fall into Diesel category followed by Petrol. Very few cars fall into CNG and LPG category.

#Bar graph of Seller Type
ggplot(data = car, aes(x=reorder(seller_type, seller_type, function(x)-length(x)), fill = seller_type)) +
  geom_bar() + labs(x='Seller Type') + labs(title = "Bar Graph of Seller Type")

A large majority of the cars are sold by individuals, followed by dealers and Trustmark dealers.

# Bar graph of Owner
ggplot(data = car, aes(x=reorder(owner, owner, function(x)-length(x)), fill = owner)) +
  geom_bar() + labs(x='Owner') + labs(title = "Bar Graph of Owner") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Most of the cars are owned by first owners.

# Bar graph of seats
# Bar graph of seats (seats is numeric by now, so wrap it in factor() for discrete fill colors)
ggplot(data = car, aes(x=reorder(seats, seats, function(x)-length(x)), fill = factor(seats))) +
  geom_bar() + labs(x='Seats', fill='Seats') + labs(title = "Bar Graph of Seats") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Most of the cars are 5-seaters.

Ordinal-encoding transmission, owner, seller_type and fuel

#Converting transmission column into binary 0 if Manual and 1 if Automatic
car$transmission <- str_replace(car$transmission, 'Manual', "0")
car$transmission <- str_replace(car$transmission, 'Automatic', "1")
car$transmission <- as.numeric(car$transmission)
table(car$transmission)
## 
##    0    1 
## 7078 1050
#Converting owner into Ordinal Encoder
car$owner <- str_replace(car$owner, 'First Owner', "0")
car$owner <- str_replace(car$owner, 'Second Owner', "1")
car$owner <- str_replace(car$owner, 'Third Owner', "2")
car$owner <- str_replace(car$owner, 'Fourth & Above Owner', "3")
car$owner <- str_replace(car$owner, 'Test Drive Car', "4")
car$owner <- as.numeric(car$owner)
table(car$owner)
## 
##    0    1    2    3    4 
## 5289 2105  555  174    5
#Converting seller_type into Ordinal Encoder
car$seller_type <- str_replace(car$seller_type, "Trustmark Dealer", "0")
car$seller_type <- str_replace(car$seller_type, "Dealer", "1")
car$seller_type <- str_replace(car$seller_type, "Individual", "2")
car$seller_type <- as.numeric(car$seller_type)
table(car$seller_type)
## 
##    0    1    2 
##  236 1126 6766
#Converting fuel into Ordinal Encoder
car$fuel <- str_replace(car$fuel, 'Diesel', "0")
car$fuel <- str_replace(car$fuel, 'Petrol', "1")
car$fuel <- str_replace(car$fuel, 'CNG', "2")
car$fuel <- str_replace(car$fuel, 'LPG', "3")
car$fuel <- as.numeric(car$fuel)
table(car$fuel)
## 
##    0    1    2    3 
## 4402 3631   57   38

Plotting histogram of selling price, km driven to check the distribution

#Histogram of Selling Price
ggplot(car, aes(x=selling_price)) + 
  geom_histogram(aes(y = after_stat(density)), colour="black", fill="white")+
  geom_density(alpha=.2, fill="blue")+
  labs(x='Selling Price ') + labs(title = "Histogram Graph of Selling Price") +
  scale_x_continuous(trans='log10')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see that the selling price is heavily right-skewed, which is why the histogram is drawn on a log10 x-axis.

#Histogram of Km Driven
ggplot(car, aes(x=km_driven)) + 
  geom_histogram(color="black", fill="blue", bins = 200)+
  labs(x='Km Driven ') + labs(title = "Histogram Graph of Km Driven") +
  scale_x_continuous(trans='log10')

Checking correlation between variables

library(corrplot)
corrplot(cor(car), type="full", 
         method ="color", title = "Correlation Plot", 
         mar=c(0,0,1,0), tl.cex= 0.8, outline= T, tl.col="indianred4")

round(cor(car),2)
##                name  year selling_price km_driven  fuel seller_type
## name           1.00  0.12          0.50     -0.03 -0.29       -0.10
## year           0.12  1.00          0.41     -0.42 -0.06       -0.23
## selling_price  0.50  0.41          1.00     -0.23 -0.21       -0.32
## km_driven     -0.03 -0.42         -0.23      1.00 -0.24        0.19
## fuel          -0.29 -0.06         -0.21     -0.24  1.00       -0.03
## seller_type   -0.10 -0.23         -0.32      0.19 -0.03        1.00
## transmission   0.34  0.24          0.59     -0.20  0.01       -0.36
## owner         -0.06 -0.50         -0.22      0.28  0.00        0.20
## mileage       -0.28  0.31         -0.13     -0.17 -0.04        0.02
## engine         0.43  0.02          0.45      0.20 -0.48       -0.12
## max_power      0.51  0.21          0.74     -0.04 -0.30       -0.24
## seats          0.11  0.01          0.05      0.22 -0.34        0.07
##               transmission owner mileage engine max_power seats
## name                  0.34 -0.06   -0.28   0.43      0.51  0.11
## year                  0.24 -0.50    0.31   0.02      0.21  0.01
## selling_price         0.59 -0.22   -0.13   0.45      0.74  0.05
## km_driven            -0.20  0.28   -0.17   0.20     -0.04  0.22
## fuel                  0.01  0.00   -0.04  -0.48     -0.30 -0.34
## seller_type          -0.36  0.20    0.02  -0.12     -0.24  0.07
## transmission          1.00 -0.14   -0.18   0.28      0.54 -0.07
## owner                -0.14  1.00   -0.17   0.01     -0.10  0.02
## mileage              -0.18 -0.17    1.00  -0.58     -0.37 -0.45
## engine                0.28  0.01   -0.58   1.00      0.70  0.61
## max_power             0.54 -0.10   -0.37   0.70      1.00  0.19
## seats                -0.07  0.02   -0.45   0.61      0.19  1.00

Selling price correlates most strongly with max_power (0.74), followed by transmission (0.59) and name (0.50).
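
To read that ranking straight from the matrix, the target column can be sorted (a one-liner using objects already defined above):

#Ranking predictors by their correlation with the target
sort(cor(car)[, "selling_price"], decreasing = TRUE)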

Splitting the Data into training and test data sets

set.seed(5)
trainIndex <- createDataPartition(car$selling_price, p = .7,
                                  list = FALSE,
                                  times = 1)
Train <- car[ trainIndex,]
Test <- car[-trainIndex,]

Splitting the data into 70% training and 30% test. For a numeric outcome, createDataPartition samples within quantile groups of selling_price, so both sets get similar price distributions.
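
A quick sanity check on the split sizes (illustrative, using objects defined above):

#Confirming the split proportions
nrow(Train)             #about 70% of the 8128 rows
nrow(Test)              #the remaining rows
nrow(Train) / nrow(car)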

Model 1 - Linear Regression

Building Model

m1_lr <- lm(selling_price ~ name+year+km_driven+seller_type+mileage+transmission+max_power, data = Train)
summary(m1_lr)
## 
## Call:
## lm(formula = selling_price ~ name + year + km_driven + seller_type + 
##     mileage + transmission + max_power, data = Train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2439581  -212038    -5929   162978  3916432 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -5.961e+07  3.768e+06 -15.818  < 2e-16 ***
## name          2.471e+04  1.374e+03  17.981  < 2e-16 ***
## year          2.919e+04  1.877e+03  15.550  < 2e-16 ***
## km_driven    -1.496e+00  1.471e-01 -10.163  < 2e-16 ***
## seller_type  -1.031e+05  1.393e+04  -7.402 1.54e-13 ***
## mileage       2.033e+04  1.816e+03  11.192  < 2e-16 ***
## transmission  4.344e+05  2.259e+04  19.228  < 2e-16 ***
## max_power     1.303e+04  2.356e+02  55.287  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 449500 on 5683 degrees of freedom
## Multiple R-squared:  0.6954, Adjusted R-squared:  0.695 
## F-statistic:  1853 on 7 and 5683 DF,  p-value: < 2.2e-16
plot(m1_lr)

Using the model to predict selling price in the Test dataset

pred_lr <- predict(m1_lr, newdata = Test)
error_lr <- Test$selling_price - pred_lr
RMSE_lr <- sqrt(mean(error_lr^2))
RMSE_lr
## [1] 457916.9
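
The same three-line RMSE computation recurs for every model; a small helper (a sketch, equivalent to the inline version above, which the later sections repeat verbatim) would keep things shorter:

#Sketch of a reusable RMSE helper, equivalent to the inline computation above
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(Test$selling_price, pred_lr)  #matches RMSE_lr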

Plotting predicted vs. actual values

plot(Test$selling_price, pred_lr, main="Predicted vs. Actual", col = "blue", xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
abline(0, 1, col = "red")  #points on this line would be perfect predictions

I built linear regression models with several different sets of variables and kept the one with the best RMSE: 457916.9. Since selling_price is right-skewed, modeling its log is a natural refinement; a sketch follows.
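
A hedged sketch (not run here): fit the same predictors on log(selling_price), then back-transform the predictions before computing RMSE on the price scale. (The plain exp() back-transformation is slightly biased downward, but it is fine for a quick comparison.)

#Hypothetical refinement, not run here: model log(selling_price)
m1_lr_log <- lm(log(selling_price) ~ name + year + km_driven + seller_type +
                  mileage + transmission + max_power, data = Train)
pred_log <- exp(predict(m1_lr_log, newdata = Test))  #back-transform to price scale
sqrt(mean((Test$selling_price - pred_log)^2))        #RMSE in original units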

Model 2 - Random Forest

Building Model

m2_rf <- randomForest(selling_price~.,data = Train)
m2_rf
## 
## Call:
##  randomForest(formula = selling_price ~ ., data = Train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 26231797034
##                     % Var explained: 96.04
plot(m2_rf)

Feature Importance Plot

varImpPlot(m2_rf, main ='Feature Importance')

Using the model to predict selling price in the Test dataset

pred_rf <- predict(m2_rf, Test)
error_rf <- Test$selling_price - pred_rf
RMSE_rf <- sqrt(mean(error_rf^2))
RMSE_rf
## [1] 128704

Plotting predicted vs. actual values

plot(Test$selling_price, pred_rf, main="Predicted vs. Actual", col = "blue", xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
abline(0, 1, col = "red")  #reference line for perfect predictions

We got an RMSE of 128704, a large improvement over the linear model.

Model 3 - Gradient Boosting

Building Model

library(gbm)
## Loaded gbm 2.1.8
set.seed(123)
m3_gbm <- gbm(
  formula = selling_price ~ .,
  distribution = "gaussian",
  data = Train,
  n.trees = 6000,
  interaction.depth = 3,
  shrinkage = 0.1,
  cv.folds = 5,
  n.cores = NULL, # will use all cores by default
  verbose = FALSE
)  

m3_gbm
## gbm(formula = selling_price ~ ., distribution = "gaussian", data = Train, 
##     n.trees = 6000, interaction.depth = 3, shrinkage = 0.1, cv.folds = 5, 
##     verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 6000 iterations were performed.
## The best cross-validation iteration was 2983.
## There were 11 predictors of which 11 had non-zero influence.

Plotting the cross-validated loss as trees are added to the ensemble

gbm.perf(m3_gbm, method = "cv")

## [1] 2983
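
gbm.perf() also returns the best iteration, so it can be captured and passed to predict() explicitly (a small sketch; without n.trees, predict() picks an iteration itself and prints the "Using 2983 trees..." message seen below):

#Capturing the best CV iteration and using it explicitly at prediction time
best_iter <- gbm.perf(m3_gbm, method = "cv", plot.it = FALSE)
pred_gbm  <- predict(m3_gbm, Test, n.trees = best_iter)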

Variable importance

summary(
  m3_gbm, 
  cBars = 10,
  method = relative.influence, las = 2
)

Using the model to predict selling price in the Test dataset

pred_gbm <- predict(m3_gbm, Test)
## Using 2983 trees...
error_gbm <- Test$selling_price - pred_gbm
RMSE_gbm <- sqrt(mean(error_gbm^2))
RMSE_gbm
## [1] 135282.4

Plotting predicted vs. actual values

plot(Test$selling_price, pred_gbm, main="Predicted vs. Actual", col = "blue", xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
abline(0, 1, col = "red")  #reference line for perfect predictions

We got an RMSE of 135282.4, better than linear regression but slightly behind the random forest.

Conclusion and Model Comparison

We used linear regression, random forest and gradient boosting models to predict the selling price of cars; random forest gives the best RMSE of the three. The RMSE comparison is shown below.

Model             | RMSE
Linear Regression | 457916.9
Random Forest     | 128704.0
Gradient Boosting | 135282.4
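
The same table can be assembled directly from the objects computed above:

#Assembling the comparison table from the stored RMSE values
data.frame(Model = c("Linear Regression", "Random Forest", "Gradient Boosting"),
           RMSE  = c(RMSE_lr, RMSE_rf, RMSE_gbm))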

Random forest explains 96% of the variation, and the variables most useful in describing it are max_power, name, engine and year. The accuracy of the model in predicting car prices is measured with RMSE; on the test dataset it is 128704. The random forest used 500 trees with 3 variables tried at each split, and further tuning could improve the RMSE; one possible approach is sketched below.
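
As one example, randomForest::tuneRF searches over mtry (the number of variables tried at each split), stepping up and down from the default; a hedged sketch:

#Sketch: tuning mtry with tuneRF (searches up/down from the default by stepFactor)
set.seed(5)
tuned <- tuneRF(x = Train[, setdiff(names(Train), "selling_price")],
                y = Train$selling_price,
                ntreeTry = 500, stepFactor = 1.5, improve = 0.01)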