When a new automobile launches in the market, its price depends on many features and factors, chiefly brand, fuel type, body style, engine size, and horsepower. Every automobile is also built to a budget, and a larger budget generally buys better features, all of which contribute to the final price. Our aim, therefore, is to build a model that can accurately predict the price of any automobile from these features and factors.
In this project our goal is to predict the price of an automobile. First we will identify the predictors that play a significant role in determining price. We will then build a machine learning regression model to predict the price of each automobile. Finally, after validation and tuning, we will apply the model to the test data set to see how well it predicts: its out-of-sample error, accuracy, and so on.
The data were extracted from the 1985 Ward's Automotive Yearbook.
The data set is available here.
The names file for the data set is available here.
For more information on the data source, click here.
This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price; then, if an auto is more risky (or less), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". A value of +3 indicates that the auto is risky, -3 that it is probably quite safe.
The third factor is the relative average loss payment per insured vehicle year. This value is normalized across all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc.) and represents the average loss per car per year.
#loading required libraries
library(ggplot2)
library(corrplot)
library(caret)
#reading the data
data<-read.table("imports-85.data",header = FALSE,sep = ",",
na.strings = "?",stringsAsFactors = TRUE)
dim(data)
## [1] 205 26
#adding names to the columns
names(data)<-c("symboling","normalized.losses","make","fuel.type","aspiration",
"num.of.doors","body.style","drive.wheels","engine.location",
"wheel.base","length","width","height","curb.weight","engine.type",
"num.of.cylinders","engine.size","fuel.system","bore","stroke",
"compression.ratio","horsepower","peak.rpm","city.mpg","highway.mpg","price")
str(data)
## 'data.frame': 205 obs. of 26 variables:
## $ symboling : int 3 3 1 2 2 2 1 1 1 0 ...
## $ normalized.losses: int NA NA NA 164 164 NA 158 NA 158 NA ...
## $ make : Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...
## $ fuel.type : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
## $ aspiration : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
## $ num.of.doors : Factor w/ 2 levels "four","two": 2 2 2 1 1 2 1 1 1 2 ...
## $ body.style : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
## $ drive.wheels : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
## $ engine.location : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
## $ wheel.base : num 88.6 88.6 94.5 99.8 99.4 ...
## $ length : num 169 169 171 177 177 ...
## $ width : num 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
## $ height : num 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
## $ curb.weight : int 2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
## $ engine.type : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
## $ num.of.cylinders : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
## $ engine.size : int 130 130 152 109 136 136 136 136 131 131 ...
## $ fuel.system : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ bore : num 3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
## $ stroke : num 2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
## $ compression.ratio: num 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
## $ horsepower : int 111 111 154 102 115 110 110 110 140 160 ...
## $ peak.rpm : int 5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
## $ city.mpg : int 21 21 19 24 18 19 19 19 17 16 ...
## $ highway.mpg : int 27 27 26 30 22 25 25 25 20 22 ...
## $ price : int 13495 16500 16500 13950 17450 15250 17710 18920 23875 NA ...
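The str() output above already shows NA values in several columns (normalized.losses and price among them). Before dropping incomplete rows, it is worth a quick check of where the missing values sit and how many rows would survive; a small sketch:

#counting NA values per column
colSums(is.na(data))
#proportion of rows with no missing values at all
mean(complete.cases(data))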
#removing all rows having NA values in any column
data<-data[complete.cases(data),]
#summary of price
summary(data$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5118 7372 9233 11446 14720 35056
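The mean price ($11,446) sits well above the median ($9,233), which suggests a right-skewed distribution with a handful of expensive cars pulling the mean upward. A quick histogram makes this visible (the $2,000 bin width is an arbitrary choice):

#visualizing the skew in price
ggplot(data = data,aes(x=price))+
geom_histogram(binwidth = 2000,fill="lightgreen",color="black")+
labs(title = "Distribution of automobile prices")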
#plotting the maximum price for each automobile maker, i.e., brand
maxPrice<-aggregate(data$price,by=list(data$make),FUN="max")
names(maxPrice)<-c("Brand","Price")
maxPrice<-maxPrice[order(maxPrice$Price),]
maxPrice$Brand<-factor(maxPrice$Brand,levels = maxPrice$Brand)
ggplot(data = maxPrice,aes(x=Price,y=Brand))+
geom_bar(stat="identity",fill="lightgreen")+
geom_text(aes(label=paste0("$ ",Price)),hjust=-0.05)+
coord_cartesian(xlim = c(0,38000))+
labs(title = "Max price of automobile in each Brand")
#plotting a correlation matrix to examine the relationships between all variables
#converting the factors to numeric first (each factor becomes its integer level codes)
x<-data #preserving the original data
x[]<-lapply(x,as.numeric)
cr<-cor(x)
corrplot(cr,method = "circle")
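The circle plot gives a visual overview, but the numeric correlations with price are easier to rank directly. A small sketch, sorting the price column of the correlation matrix:

#ranking all variables by their correlation with price
round(sort(cr[,"price"],decreasing = TRUE),2)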
#keeping only the variables whose correlation with price exceeds 0.6 in absolute value
mm<-as.data.frame(cr)
i<-which(!is.na(mm$price) & abs(mm$price)>0.6)
data<-data[,i]
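To confirm which predictors cleared the ±0.6 cutoff, we can list the columns that remain (price itself is kept, since its correlation with itself is 1):

#columns retained after the correlation filter
names(data)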
set.seed(19122021) #to make this analysis reproducible
#partitioning data into train(70%) and test(30%) data set
inTrain<-createDataPartition(data$price,p=0.70,list = FALSE)
trainSet<-data[inTrain,]
dim(trainSet)
## [1] 112 9
testSet<-data[-inTrain,]
dim(testSet)
## [1] 47 9
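Because createDataPartition performs stratified sampling on the outcome, price should be similarly distributed in the two sets; a quick sanity check:

#price distributions should look similar in both sets
summary(trainSet$price)
summary(testSet$price)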
dtMod<-train(price~.,data = trainSet, method="rpart",
trControl=trainControl(method = "cv", number = 3, verboseIter = F))
rfMod<-train(price~.,data = trainSet, method="rf",
trControl=trainControl(method = "cv", number = 3, verboseIter = F))
gbmMod<-train(price~.,data = trainSet, method="gbm",
trControl=trainControl(method = "cv", number = 3, verboseIter = F),
verbose=FALSE)
glmMod<-train(price~.,data = trainSet, method="glm",
trControl=trainControl(method = "cv", number = 3, verboseIter = F))
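Besides averaging the R-squared values by hand as in the next chunk, caret's resamples() collects the cross-validation results of all four models into one object for side-by-side comparison. A sketch (strictly comparable only if the models share the same folds, e.g., by fixing the seed before each train call or passing shared indices to trainControl):

#collecting cross-validation results from all four models
results<-resamples(list(DT=dtMod,RF=rfMod,GBM=gbmMod,GLM=glmMod))
summary(results)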
compMod<-data.frame(Models=c("Decision Tree", "Random Forest", "GBM", "GLM"),
Variance=round(c(mean(dtMod$resample$Rsquared),mean(rfMod$resample$Rsquared),
mean(gbmMod$resample$Rsquared), mean(glmMod$resample$Rsquared))*100,2))
compMod
The Random Forest model explained the most variance on the training data, around 90.66%, so we will move forward with the Random Forest model.
#plotting random forest model
plot(rfMod)
#best mtry value for the model
rfMod$bestTune$mtry
## [1] 5
No manual tuning is required here, as caret's cross-validation has already selected the best value of mtry (the number of predictors sampled as split candidates at each tree node).
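If we did want to search a wider range ourselves, an explicit tuneGrid can be passed to train(); a sketch, where the candidate values 2 through 8 are an arbitrary choice covering all eight predictors:

#optional: searching mtry values 2 through 8 explicitly
rfTuned<-train(price~.,data = trainSet, method="rf",
tuneGrid=expand.grid(mtry=2:8),
trControl=trainControl(method = "cv", number = 3, verboseIter = F))
rfTuned$bestTune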
Let’s apply our final model to the test set.
#predicting price for testSet
pred<-predict(rfMod,testSet)
#rounding off for simplification
pred<-round(pred)
#comparing the model's RMSE with the test-set RMSE
#(rfMod$finalModel$mse holds the out-of-bag MSE after each tree)
RMSE<-data.frame(modelRMSE=sqrt(mean(rfMod$finalModel$mse)),
testSetRMSE=sqrt(mean((testSet$price-pred)^2)))
RMSE
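Beyond RMSE, a predicted-versus-actual plot gives a quick visual read on where the model over- or under-predicts; a sketch:

#predicted vs actual price on the test set;
#points near the dashed 45-degree line indicate accurate predictions
ggplot(data = data.frame(actual=testSet$price,predicted=pred),
aes(x=actual,y=predicted))+
geom_point(color="darkgreen")+
geom_abline(slope = 1,intercept = 0,linetype="dashed")+
labs(title = "Random Forest: predicted vs actual price (test set)")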
Choosing a good regression model is quite an involved process, but we finally settled on Random Forest for predicting automobile price. It explained over 90% of the variance on the training data, well ahead of the other models we built. That alone is not enough to trust a model, so we also predicted prices for the testSet with the same model; comparing the Root Mean Square Error on the training and test sets showed no significant gap, confirming Random Forest as the best model for our data set.