Abstract

When a new automobile enters the market, its price depends on many features and factors, chiefly brand, fuel type, body style, engine size and horsepower. Every automobile is also built to a budget, so a larger budget generally buys more and better features, and each of those features contributes to the final price. In this project we therefore select a model that can predict the price of an automobile accurately from its features.

Table of Contents

  1. Introduction
  2. About Data
  3. Data Preprocessing
  4. Exploring Data
  5. Model creation and Prediction
  6. Conclusion

Introduction

In this project our goal is to predict the price of an automobile. First we identify the predictors that most strongly affect price. Then we build machine learning regression models to predict price from those predictors. After validation and tuning, we apply the chosen model to the test data set to see how well it predicts and to estimate its out-of-sample error and accuracy.

About Data

The data are drawn from the 1985 Ward’s Automotive Yearbook.

Data set is available here

Names file for the data set is available here

For more info on data source click here

This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared to other cars. The second entity, the risk rating, corresponds to the degree to which the auto is more risky than its price indicates. Each car is initially assigned a risk factor symbol associated with its price. Then, if it is more risky (or less), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process “symboling”. A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe.

The third factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc…), and represents the average loss per car per year.

Data Preprocessing

#loading required libraries
library(ggplot2)
library(corrplot)
library(caret)
#reading the data
data<-read.table("imports-85.data",header = FALSE,sep = ",",
                 na.strings = "?",stringsAsFactors = TRUE)
dim(data)
## [1] 205  26
#adding names to the columns
names(data)<-c("symboling","normalized.losses","make","fuel.type","aspiration",
               "num.of.doors","body.style","drive.wheels","engine.location",
               "wheel.base","length","width","height","curb.weight","engine.type",
               "num.of.cylinders","engine.size","fuel.system","bore","stroke",
               "compression.ratio","horsepower","peak.rpm","city.mpg","highway.mpg","price")
str(data)
## 'data.frame':    205 obs. of  26 variables:
##  $ symboling        : int  3 3 1 2 2 2 1 1 1 0 ...
##  $ normalized.losses: int  NA NA NA 164 164 NA 158 NA 158 NA ...
##  $ make             : Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...
##  $ fuel.type        : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
##  $ aspiration       : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
##  $ num.of.doors     : Factor w/ 2 levels "four","two": 2 2 2 1 1 2 1 1 1 2 ...
##  $ body.style       : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
##  $ drive.wheels     : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
##  $ engine.location  : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wheel.base       : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ length           : num  169 169 171 177 177 ...
##  $ width            : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ height           : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curb.weight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ engine.type      : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
##  $ num.of.cylinders : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
##  $ engine.size      : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ fuel.system      : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ bore             : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ stroke           : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ compression.ratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower       : int  111 111 154 102 115 110 110 110 140 160 ...
##  $ peak.rpm         : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ city.mpg         : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ highway.mpg      : int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price            : int  13495 16500 16500 13950 17450 15250 17710 18920 23875 NA ...
#removing all rows having NA values in any column
data<-data[complete.cases(data),]

#summary of price
summary(data$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5118    7372    9233   11446   14720   35056
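
To make the row filter above concrete, here is a minimal sketch on a made-up three-row data frame. The name `toy` and all values in it are illustrative assumptions, not taken from the data set. `complete.cases()` returns one logical value per row, which is TRUE only when no column in that row is NA.

```r
# Toy data frame (hypothetical values) with NAs in different columns
toy <- data.frame(
  horsepower = c(111, NA, 154),
  price      = c(13495, 16500, NA),
  make       = factor(c("audi", "bmw", "audi"))
)

keep <- complete.cases(toy)   # TRUE only for rows with no NA in any column
toy_clean <- toy[keep, ]      # logical indexing; which() is not needed

# droplevels() removes factor levels that no longer occur after filtering
toy_clean <- droplevels(toy_clean)

print(keep)
print(nrow(toy_clean))
print(levels(toy_clean$make))
```

In this report the same filter takes the data from 205 rows to the 159 rows that are later split into 112 training and 47 test rows.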

Exploring Data

#plot maximum price according to the automobile maker i.e., brand
maxPrice<-aggregate(data$price,by=list(data$make),FUN="max")

names(maxPrice)<-c("Brand","Price")

maxPrice<-maxPrice[order(maxPrice$Price),]

maxPrice$Brand<-factor(maxPrice$Brand,levels = maxPrice$Brand)

ggplot(data = maxPrice,aes(x=Price,y=Brand))+
    geom_bar(stat="identity",fill="lightgreen")+
    geom_text(aes(label=paste0("$ ",Price)),hjust=-0.05)+
    coord_cartesian(xlim = c(0,38000))+
    labs(title = "Max price of automobile in each Brand")

#Plotting correlation plot to find correlation between all variables
#converting all columns to numeric first (factors become their integer level codes)
x<-data #preserving data
x[]<-lapply(x,as.numeric)
#any column that is constant gives NA correlations (with a warning)
cr<-cor(x)
corrplot(cr,method = "circle")

#keeping only the variables whose absolute correlation with price exceeds 0.6
#(price itself is kept because its correlation with itself is 1)
mm<-as.data.frame(cr)
i<-which(!is.na(mm$price) & abs(mm$price)>0.6)
data<-data[,i]
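
The filtering step above can be sketched on synthetic data. Everything in this example (`df`, `size`, `noise`, `fuel`, and the way `price` is generated) is an assumption made for illustration. It shows two things: `as.numeric()` on a factor returns integer level codes, and the 0.6 threshold keeps only columns strongly correlated with price.

```r
set.seed(1)  # synthetic data, for illustration only

n <- 100
size  <- rnorm(n, mean = 120, sd = 30)          # stand-in for a strong predictor
noise <- rnorm(n)                               # unrelated to price
fuel  <- factor(sample(c("diesel", "gas"), n, replace = TRUE))
price <- 100 * size + rnorm(n, sd = 500)        # price driven by size only

df <- data.frame(size, noise, fuel, price)

# as.numeric() on a factor returns the integer level codes (diesel = 1, gas = 2),
# so for a nominal factor with many levels the correlation depends on level order
num <- df
num[] <- lapply(num, as.numeric)

cr <- cor(num)
keep <- which(!is.na(cr[, "price"]) & abs(cr[, "price"]) > 0.6)

print(round(cr[, "price"], 2))
print(names(keep))
```

A linear correlation on level codes is only a rough screen for nominal factors such as make or body style, so a factor dropped by this filter may still carry information about price.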

Model creation and Prediction

Creating train and test dataset

set.seed(19122021) #to make this analysis reproducible

#partitioning data into train(70%) and test(30%) data set
inTrain<-createDataPartition(data$price,p=0.70,list = FALSE)
trainSet<-data[inTrain,]
dim(trainSet)
## [1] 112   9
testSet<-data[-inTrain,]
dim(testSet)
## [1] 47  9
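
For a numeric outcome, caret documents that `createDataPartition` splits the values into groups based on percentiles and samples within each group, so that train and test sets have similar price distributions. The sketch below imitates that idea in base R on synthetic prices. The helper `stratified_split` is hypothetical and is not caret's implementation.

```r
# Hypothetical helper: quantile-stratified split in base R (not caret's code)
stratified_split <- function(y, p = 0.7, groups = 5) {
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups + 1)))
  bins   <- cut(y, breaks = breaks, include.lowest = TRUE)
  idx    <- unlist(lapply(split(seq_along(y), bins), function(i) {
    # draw a fraction p of the rows inside each price bin
    i[sample.int(length(i), size = ceiling(p * length(i)))]
  }))
  sort(unname(idx))
}

set.seed(42)
y <- rexp(159, rate = 1 / 11000)   # synthetic, right-skewed like car prices
in_train <- stratified_split(y, p = 0.7)

train_y <- y[in_train]
test_y  <- y[-in_train]

print(length(train_y))
print(length(test_y))
print(c(train = mean(train_y), test = mean(test_y)))
```

Because the sampling happens inside each price bin, expensive cars cannot all land in one of the two sets by chance, which matters with only 159 rows.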

Creating prediction models

Decision Trees

dtMod<-train(price~.,data = trainSet, method="rpart",
           trControl=trainControl(method = "cv", number = 3, verboseIter = FALSE))

Random Forest

rfMod<-train(price~.,data = trainSet, method="rf",
           trControl=trainControl(method = "cv", number = 3, verboseIter = FALSE))

Gradient Boosting Machine

gbmMod<-train(price~.,data = trainSet, method="gbm",
             trControl=trainControl(method = "cv", number = 3, verboseIter = FALSE),
             verbose=FALSE)

Generalized Linear Model

glmMod<-train(price~.,data = trainSet, method="glm",
           trControl=trainControl(method = "cv", number = 3, verboseIter = FALSE))

Comparing Models by the percent of variance explained

compMod<-data.frame(Models=c("Decision Tree", "Random Forest", "GBM", "GLM"),
                    Variance=round(c(mean(dtMod$resample$Rsquared),mean(rfMod$resample$Rsquared),
                               mean(gbmMod$resample$Rsquared), mean(glmMod$resample$Rsquared))*100,2))
compMod

Random Forest explained the most variance of the four models, with a mean cross-validated R-squared of about 90.66% on the held-out folds of the training set. So we will move forward with the Random Forest model.
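
The `Rsquared` values averaged above come from the held-out fold of each cross-validation round. As a rough sketch of what that number means, the code below runs 3-fold cross-validation by hand in base R with a plain linear model on synthetic data. The data, the variable names and the use of `lm` are assumptions for illustration. The squared correlation between predictions and observations is used because that is, to my understanding, what caret's default summary reports.

```r
set.seed(7)  # synthetic data, for illustration only

n  <- 120
x1 <- rnorm(n)
x2 <- rnorm(n)
df <- data.frame(x1, x2, y = 3 * x1 - 2 * x2 + rnorm(n))

k    <- 3
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label per row

r2_per_fold <- sapply(seq_len(k), function(f) {
  fit  <- lm(y ~ ., data = df[fold != f, ])       # fit on the other k - 1 folds
  pred <- predict(fit, newdata = df[fold == f, ]) # predict the held-out fold
  cor(pred, df$y[fold == f])^2                    # squared correlation
})

print(round(r2_per_fold, 3))
print(round(mean(r2_per_fold) * 100, 2))  # same scale as the Variance column
```

With 3 folds of roughly 37 training rows each, the mean is a noisy estimate, so small differences between models should be read with some caution.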

#plotting random forest model
plot(rfMod)

#best mtry value for the model
rfMod$bestTune$mtry
## [1] 5

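No manual tuning is needed here, because caret has already compared several candidate mtry values by cross-validation and kept the best one. Note that mtry is the number of predictors randomly sampled as split candidates at each node of a tree, not a feature-selection setting.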
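
To make the role of mtry concrete, the sketch below draws the candidate predictors for three splits. The placeholder names `x1` to `x8` are assumptions: trainSet has 9 columns including price, so 8 predictors are assumed here, all treated as numeric. The last lines compute the regression default documented for the randomForest package, for comparison with the tuned value of 5.

```r
set.seed(3)
predictors <- paste0("x", 1:8)   # placeholder names for 8 predictors
mtry <- 5

# Each split considers a fresh random subset of mtry predictors
candidates <- replicate(3, sort(sample(predictors, mtry)), simplify = FALSE)
print(candidates)

# randomForest's documented regression default, for comparison
default_mtry <- max(floor(length(predictors) / 3), 1)
print(default_mtry)
```

A larger mtry makes individual trees stronger but more alike, while a smaller one decorrelates the trees. Cross-validating over a few values, as caret does, is a common way to balance the two.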

Let’s apply our final model to the test set

#predicting price for testSet
pred<-predict(rfMod,testSet)

#rounding off for simplification
pred<-round(pred)

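#comparing the model's out-of-bag (OOB) RMSE with the testSet RMSE
#finalModel$mse holds the OOB MSE after each tree, so the last value is the full forest's
RMSE<-data.frame(modelRMSE=sqrt(tail(rfMod$finalModel$mse,1)),
                 testSetRMSE=sqrt(mean((testSet$price-pred)^2)))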
RMSE
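
For reference, the root mean square error used above is the square root of the mean squared difference between observed and predicted prices, so it is expressed in the same units as price. A minimal sketch with a hypothetical helper `rmse` and made-up numbers:

```r
# Hypothetical helper; obs and pred are numeric vectors of equal length
rmse <- function(obs, pred) {
  stopifnot(length(obs) == length(pred))
  sqrt(mean((obs - pred)^2))
}

obs  <- c(10000, 20000, 30000)   # made-up prices
pred <- c(12000, 18000, 33000)   # made-up predictions

print(rmse(obs, pred))
print(rmse(obs, obs))   # a perfect prediction gives 0
```

Because the errors are squared before averaging, a few badly mispriced cars raise RMSE more than many small misses do.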

Conclusion

Choosing a good regression model takes some effort, but in the end we settled on Random Forest for predicting automobile price. It explained about 90% of the variance in cross-validation, more than any of the other models we built. That alone is not enough to trust a model, so we also predicted prices for testSet with the same model and compared the Root Mean Square Error on trainSet and testSet. On the basis of that comparison we confirmed Random Forest as the best model for our data set.