Linear Regression In Used Car Price Prediction
Intro
Linear Regression
Linear regression is an algorithm for modeling the relationship between variables. A regression task involves two kinds of variables: independent (predictor) variables and a dependent (target) variable. The independent variables stand on their own and are not influenced by the other variables; as an independent variable changes, the value of the dependent variable responds. The dependent variable is the quantity under study: it is what the regression model solves for and attempts to predict. In a linear regression task, every observation consists of values for both the dependent variable and the independent variables.
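Formally, with one or more predictors the model estimates the target as a weighted sum of the predictors plus an intercept:
\[ \hat y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p \]
where \(\beta_0\) is the intercept and each \(\beta_i\) is the coefficient the model learns for predictor \(x_i\).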
Dataset
This dataset contains information about used cars. Each feature is described below:
- name: name of the car
- year: year the car was bought
- selling_price: price at which the car is being sold
- km_driven: number of kilometres the car has been driven
- fuel: fuel type of the car (Petrol / Diesel / CNG / LPG / electric)
- seller_type: whether the car is sold by an individual or a dealer
- transmission: gear transmission of the car (Automatic / Manual)
- owner: number of previous owners
- mileage: mileage of the car (kmpl)
- engine: engine capacity of the car (CC)
- max_power: maximum power of the engine (bhp)
- seats: number of seats in the car
This dataset will be used to predict the selling price of used cars, so we set selling_price as the target variable.
Data Preparation
Import Required Package
library(tidyverse)
library(caret)
library(GGally)
library(car)
library(lmtest)
library(data.table)
library(MLmetrics)
library(performance)
Load Dataset
car <- read.csv('Used_Cars.csv')
rmarkdown::paged_table(car)
glimpse(car)
## Rows: 8,128
## Columns: 13
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1~
## $ name <chr> "Maruti Swift Dzire VDI", "Skoda Rapid 1.5 TDI Ambition"~
## $ year <int> 2014, 2014, 2006, 2010, 2007, 2017, 2007, 2001, 2011, 20~
## $ selling_price <int> 450000, 370000, 158000, 225000, 130000, 440000, 96000, 4~
## $ km_driven <int> 145500, 120000, 140000, 127000, 120000, 45000, 175000, 5~
## $ fuel <chr> "Diesel", "Diesel", "Petrol", "Diesel", "Petrol", "Petro~
## $ seller_type <chr> "Individual", "Individual", "Individual", "Individual", ~
## $ transmission <chr> "Manual", "Manual", "Manual", "Manual", "Manual", "Manua~
## $ owner <chr> "First Owner", "Second Owner", "Third Owner", "First Own~
## $ mileage <chr> "23.4 kmpl", "21.14 kmpl", "17.7 kmpl", "23.0 kmpl", "16~
## $ engine <chr> "1248 CC", "1498 CC", "1497 CC", "1396 CC", "1298 CC", "~
## $ max_power <chr> "74 bhp", "103.52 bhp", "78 bhp", "90 bhp", "88.2 bhp", ~
## $ seats <int> 5, 5, 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 5, NA, 5, 5, 7, 5, 5~
The data has 8,128 rows and 13 columns. X and name are unique identifiers for each car, so we can drop them; we don't need that information.
Before we go any further, we need to make sure that every column has the correct type. Some features need to be cleaned up and converted. The next steps are:
- change the fuel, seller_type, transmission, and owner data types to factors
- remove " kmpl"/" km/kg" from mileage and convert it to numeric
- remove " CC" from engine and convert it to numeric
- remove " bhp" from max_power and convert it to numeric
- drop the X and name columns
car <- car %>%
  mutate(mileage = as.numeric(str_replace(mileage, " kmpl| km/kg", "")), # strip units, coerce to numeric
         engine = as.numeric(str_replace(engine, " CC", "")),
         max_power = as.numeric(str_replace(max_power, " bhp", "")),
         fuel = as.factor(fuel),                                         # encode categoricals as factors
         seller_type = as.factor(seller_type),
         transmission = as.factor(transmission),
         owner = as.factor(owner)) %>%
  select(-c(X, name))                                                    # drop the identifier columns
str(car)
## 'data.frame': 8128 obs. of 11 variables:
## $ year : int 2014 2014 2006 2010 2007 2017 2007 2001 2011 2013 ...
## $ selling_price: int 450000 370000 158000 225000 130000 440000 96000 45000 350000 200000 ...
## $ km_driven : int 145500 120000 140000 127000 120000 45000 175000 5000 90000 169000 ...
## $ fuel : Factor w/ 4 levels "CNG","Diesel",..: 2 2 4 2 4 4 3 4 2 2 ...
## $ seller_type : Factor w/ 3 levels "Dealer","Individual",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ transmission : Factor w/ 2 levels "Automatic","Manual": 2 2 2 2 2 2 2 2 2 2 ...
## $ owner : Factor w/ 5 levels "First Owner",..: 1 3 5 1 1 1 1 3 1 1 ...
## $ mileage : num 23.4 21.1 17.7 23 16.1 ...
## $ engine : num 1248 1498 1497 1396 1298 ...
## $ max_power : num 74 103.5 78 90 88.2 ...
## $ seats : int 5 5 5 5 5 5 5 4 5 5 ...
Check Missing Values
colSums(is.na(car))
## year selling_price km_driven fuel seller_type
## 0 0 0 0 0
## transmission owner mileage engine max_power
## 0 0 221 221 216
## seats
## 221
There are missing values in the mileage, engine, max_power, and seats columns. Most of them occur in the same rows, so we drop those rows entirely.
car <- drop_na(car)
dim(car)
## [1] 7906 11
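Dropping is the simplest option here. A hedged alternative (not applied in this analysis) would be to impute the numeric gaps with column medians instead of losing the rows; a minimal sketch:
# Sketch: median imputation instead of drop_na() (car_imputed is an illustrative name)
car_imputed <- car %>%
  mutate(across(c(mileage, engine, max_power, seats),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))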
Exploratory Data Analysis
Exploratory data analysis is a phase where we explore the variables and look for patterns that may indicate correlation between them.
First we want to know the distribution of our target variable, selling_price.
hist(car$selling_price, col="darkblue")
The histogram shows that the distribution is right-skewed: most used-car selling prices are below 2,000,000.
Before we build the model, we need to split the data into a train dataset and a test dataset. We will use the train dataset to fit the linear regression model. The test dataset serves as a comparison, to check whether the model overfits and fails to predict new data it has not seen during training. We set aside 80% of the data for training and the rest for testing.
RNGkind(sample.kind = "Rounding")
set.seed(123)
intrain <- sample(x=nrow(car), size = nrow(car)*0.8)
car_train <- car[intrain,]
car_test <- car[-intrain,]
dim(car_train)
## [1] 6324 11
dim(car_test)
## [1] 1582 11
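Since caret is already loaded, an equivalent split could also be drawn with createDataPartition, which additionally stratifies on the target; a sketch (car_train2/car_test2 are illustrative names):
# Stratified 80/20 split on selling_price using caret
set.seed(123)
idx <- createDataPartition(car$selling_price, p = 0.8, list = FALSE)
car_train2 <- car[idx, ]
car_test2 <- car[-idx, ]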
Check the distribution of the target variable in the train dataset:
hist(car_train$selling_price, col= "darkblue")
Check the distribution of the target variable in the test dataset:
hist(car_test$selling_price, col="darkblue")
selling_price has almost the same distribution in both the training dataset and the test dataset.
Find the Pearson correlation between features.
ggcorr(car_train, label = T)
The plot shows that km_driven and mileage correlate negatively with selling_price, while max_power has a strong positive correlation with it (0.8).
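The exact coefficients behind the plot can also be pulled directly with cor() on the numeric columns:
# Pearson correlation of each numeric feature with the target
cor(select(car_train, where(is.numeric)))[, "selling_price"]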
Modeling
The first model uses all variables (except selling_price) as predictors; the formula selling_price ~ . takes every remaining column as a predictor. We also fit an intercept-only model, model0, to serve as the lower bound for the stepwise searches later.
model0 <- lm(selling_price~1, car_train)
model_all <- lm(selling_price~., car_train)
summary(model_all)
##
## Call:
## lm(formula = selling_price ~ ., data = car_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2707308 -197720 10512 153376 4458792
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.724e+07 4.289e+06 -15.676 < 2e-16 ***
## year 3.335e+04 2.143e+03 15.564 < 2e-16 ***
## km_driven -9.389e-01 1.163e-01 -8.073 8.15e-16 ***
## fuelDiesel -3.289e+04 6.878e+04 -0.478 0.632578
## fuelLPG 1.536e+05 1.134e+05 1.355 0.175540
## fuelPetrol -7.180e+04 6.914e+04 -1.039 0.299063
## seller_typeIndividual -2.481e+05 1.849e+04 -13.418 < 2e-16 ***
## seller_typeTrustmark Dealer -3.452e+05 3.855e+04 -8.956 < 2e-16 ***
## transmissionManual -4.487e+05 2.206e+04 -20.336 < 2e-16 ***
## ownerFourth & Above Owner 4.160e+04 4.300e+04 0.967 0.333392
## ownerSecond Owner -4.278e+04 1.499e+04 -2.853 0.004343 **
## ownerTest Drive Car 2.631e+06 2.307e+05 11.403 < 2e-16 ***
## ownerThird Owner -1.359e+04 2.576e+04 -0.528 0.597849
## mileage 1.483e+04 2.369e+03 6.261 4.07e-10 ***
## engine 1.070e+02 2.643e+01 4.050 5.18e-05 ***
## max_power 1.288e+04 2.890e+02 44.574 < 2e-16 ***
## seats -3.219e+04 8.812e+03 -3.653 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 459300 on 6307 degrees of freedom
## Multiple R-squared: 0.6897, Adjusted R-squared: 0.6889
## F-statistic: 876.3 on 16 and 6307 DF, p-value: < 2.2e-16
The summary of model_all shows a lot of information, but for now we focus on the Pr(>|t|) column. This column gives the significance level of each variable in the model: if the value is below 0.05, we can reasonably conclude that the variable has a significant effect on the target.
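For example, the predictors clearing the 0.05 threshold can be listed programmatically:
# Coefficient rows with Pr(>|t|) below 0.05
coefs <- summary(model_all)$coefficients
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]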
Feature Selection
Feature selection is the stage where we choose which variables to use, by evaluating and removing insignificant variables while watching the AIC. The AIC (Akaike Information Criterion) estimates the amount of information lost by the model: the smaller the AIC, the better the model (see the snippet after this list).
There are three stepwise feature selection strategies we can apply:
- backward elimination: starting from all predictors, the model is repeatedly evaluated while removing predictor variables until the model with the smallest AIC is obtained.
- forward selection: starting from a model with no predictors, the model is evaluated while adding predictor variables until the model with the smallest AIC is obtained.
- both: starting from a given model, predictor variables are both added and removed until the model with the smallest AIC is obtained.
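The criterion itself can be inspected directly for any fitted model; step() searches for the specification that minimizes it:
# AIC of the full model, the value the stepwise search tries to reduce
AIC(model_all)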
Backward Elimination
model_back <- step(model_all,direction = "backward", trace = 0)
summary(model_back)
##
## Call:
## lm(formula = selling_price ~ year + km_driven + fuel + seller_type +
## transmission + owner + mileage + engine + max_power + seats,
## data = car_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2707308 -197720 10512 153376 4458792
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.724e+07 4.289e+06 -15.676 < 2e-16 ***
## year 3.335e+04 2.143e+03 15.564 < 2e-16 ***
## km_driven -9.389e-01 1.163e-01 -8.073 8.15e-16 ***
## fuelDiesel -3.289e+04 6.878e+04 -0.478 0.632578
## fuelLPG 1.536e+05 1.134e+05 1.355 0.175540
## fuelPetrol -7.180e+04 6.914e+04 -1.039 0.299063
## seller_typeIndividual -2.481e+05 1.849e+04 -13.418 < 2e-16 ***
## seller_typeTrustmark Dealer -3.452e+05 3.855e+04 -8.956 < 2e-16 ***
## transmissionManual -4.487e+05 2.206e+04 -20.336 < 2e-16 ***
## ownerFourth & Above Owner 4.160e+04 4.300e+04 0.967 0.333392
## ownerSecond Owner -4.278e+04 1.499e+04 -2.853 0.004343 **
## ownerTest Drive Car 2.631e+06 2.307e+05 11.403 < 2e-16 ***
## ownerThird Owner -1.359e+04 2.576e+04 -0.528 0.597849
## mileage 1.483e+04 2.369e+03 6.261 4.07e-10 ***
## engine 1.070e+02 2.643e+01 4.050 5.18e-05 ***
## max_power 1.288e+04 2.890e+02 44.574 < 2e-16 ***
## seats -3.219e+04 8.812e+03 -3.653 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 459300 on 6307 degrees of freedom
## Multiple R-squared: 0.6897, Adjusted R-squared: 0.6889
## F-statistic: 876.3 on 16 and 6307 DF, p-value: < 2.2e-16
Forward Selection
model_forward <- step(model0, direction = "forward", scope = list(lower = model0,
upper = model_all),
trace = 0)
summary(model_forward)
##
## Call:
## lm(formula = selling_price ~ max_power + year + transmission +
## seller_type + owner + mileage + km_driven + engine + seats +
## fuel, data = car_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2707308 -197720 10512 153376 4458792
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.724e+07 4.289e+06 -15.676 < 2e-16 ***
## max_power 1.288e+04 2.890e+02 44.574 < 2e-16 ***
## year 3.335e+04 2.143e+03 15.564 < 2e-16 ***
## transmissionManual -4.487e+05 2.206e+04 -20.336 < 2e-16 ***
## seller_typeIndividual -2.481e+05 1.849e+04 -13.418 < 2e-16 ***
## seller_typeTrustmark Dealer -3.452e+05 3.855e+04 -8.956 < 2e-16 ***
## ownerFourth & Above Owner 4.160e+04 4.300e+04 0.967 0.333392
## ownerSecond Owner -4.278e+04 1.499e+04 -2.853 0.004343 **
## ownerTest Drive Car 2.631e+06 2.307e+05 11.403 < 2e-16 ***
## ownerThird Owner -1.359e+04 2.576e+04 -0.528 0.597849
## mileage 1.483e+04 2.369e+03 6.261 4.07e-10 ***
## km_driven -9.389e-01 1.163e-01 -8.073 8.15e-16 ***
## engine 1.070e+02 2.643e+01 4.050 5.18e-05 ***
## seats -3.219e+04 8.812e+03 -3.653 0.000261 ***
## fuelDiesel -3.289e+04 6.878e+04 -0.478 0.632578
## fuelLPG 1.536e+05 1.134e+05 1.355 0.175540
## fuelPetrol -7.180e+04 6.914e+04 -1.039 0.299063
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 459300 on 6307 degrees of freedom
## Multiple R-squared: 0.6897, Adjusted R-squared: 0.6889
## F-statistic: 876.3 on 16 and 6307 DF, p-value: < 2.2e-16
Both
model_both <- step(model0, direction = "both", scope = list(lower = model0,
upper = model_all),
trace = 0)
summary(model_both)
##
## Call:
## lm(formula = selling_price ~ max_power + year + transmission +
## seller_type + owner + mileage + km_driven + engine + seats +
## fuel, data = car_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2707308 -197720 10512 153376 4458792
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.724e+07 4.289e+06 -15.676 < 2e-16 ***
## max_power 1.288e+04 2.890e+02 44.574 < 2e-16 ***
## year 3.335e+04 2.143e+03 15.564 < 2e-16 ***
## transmissionManual -4.487e+05 2.206e+04 -20.336 < 2e-16 ***
## seller_typeIndividual -2.481e+05 1.849e+04 -13.418 < 2e-16 ***
## seller_typeTrustmark Dealer -3.452e+05 3.855e+04 -8.956 < 2e-16 ***
## ownerFourth & Above Owner 4.160e+04 4.300e+04 0.967 0.333392
## ownerSecond Owner -4.278e+04 1.499e+04 -2.853 0.004343 **
## ownerTest Drive Car 2.631e+06 2.307e+05 11.403 < 2e-16 ***
## ownerThird Owner -1.359e+04 2.576e+04 -0.528 0.597849
## mileage 1.483e+04 2.369e+03 6.261 4.07e-10 ***
## km_driven -9.389e-01 1.163e-01 -8.073 8.15e-16 ***
## engine 1.070e+02 2.643e+01 4.050 5.18e-05 ***
## seats -3.219e+04 8.812e+03 -3.653 0.000261 ***
## fuelDiesel -3.289e+04 6.878e+04 -0.478 0.632578
## fuelLPG 1.536e+05 1.134e+05 1.355 0.175540
## fuelPetrol -7.180e+04 6.914e+04 -1.039 0.299063
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 459300 on 6307 degrees of freedom
## Multiple R-squared: 0.6897, Adjusted R-squared: 0.6889
## F-statistic: 876.3 on 16 and 6307 DF, p-value: < 2.2e-16
Model Comparison
compare_performance(model_all,model_back,model_forward,model_both)
## # Comparison of Model Performance Indices
##
## Name | Model | AIC | AIC weights | BIC | BIC weights | R2 | R2 (adj.) | RMSE | Sigma
## ---------------------------------------------------------------------------------------------------------------------
## model_all | lm | 1.829e+05 | 0.250 | 1.830e+05 | 0.250 | 0.690 | 0.689 | 4.587e+05 | 4.593e+05
## model_back | lm | 1.829e+05 | 0.250 | 1.830e+05 | 0.250 | 0.690 | 0.689 | 4.587e+05 | 4.593e+05
## model_forward | lm | 1.829e+05 | 0.250 | 1.830e+05 | 0.250 | 0.690 | 0.689 | 4.587e+05 | 4.593e+05
## model_both | lm | 1.829e+05 | 0.250 | 1.830e+05 | 0.250 | 0.690 | 0.689 | 4.587e+05 | 4.593e+05
After performing feature selection in all three ways, the model does not change: the AIC, adjusted R2, and RMSE values are identical, so no predictor was dropped. We therefore keep model_all.
Linear Regression Assumption
As a statistical model, linear regression has several assumptions that need to be fulfilled so that its interpretation is not biased. These assumptions matter only when the purpose of the model is interpretation, i.e. seeing the effect of each predictor on the value of the target variable. If linear regression is used purely to make predictions, the assumptions are not required to be met.
Linearity
Linearity means that the target variable has a straight-line relationship with its predictors, and that the coefficients combine additively. If linearity is not met, all of the coefficient values we obtain are invalid, because the model assumes the underlying pattern is linear.
Normality of Residual (Residual Normal)
The normality assumption means that the residuals of the linear regression model should be normally distributed, centered around zero.
Homoscedasticity of Residual
Homoscedasticity means that the residuals (errors) have constant variance and do not form a pattern. If the errors do form a pattern, such as a linear or cone shape, we call it heteroscedasticity, and it biases the standard errors of the coefficient estimates (making them too narrow or too wide). Homoscedasticity can be checked visually by plotting the model's fitted values against the residuals and looking for a pattern.
No Multicollinearity
Multicollinearity occurs when the predictor variables in the model are strongly related to each other. A good model is expected to be free of multicollinearity. Its presence can be judged from the VIF (Variance Inflation Factor): a VIF above 10 indicates multicollinearity.
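Besides the combined diagnostic plot below, each assumption can also be tested numerically with the packages already loaded. A minimal sketch, assuming model_all from above (shapiro.test() accepts at most 5,000 values, so the residuals are subsampled):
bptest(model_all)                                # Breusch-Pagan test for heteroscedasticity (lmtest)
set.seed(123)
shapiro.test(sample(residuals(model_all), 5000)) # Shapiro-Wilk normality test on sampled residuals
vif(model_all)                                   # variance inflation factors (car); > 10 flags multicollinearity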
check_model(model_all)
Model Using Scaled Data
The diagnostics above show that the only assumption satisfied is the absence of multicollinearity; the others are violated. Next I will scale the predictor variables and the target variable to try to address the residual normality and heteroscedasticity problems.
# Scale the numeric columns (z-scores) and keep the factor columns unchanged
num_data <- car_train %>%
  select(where(is.numeric)) %>%
  sapply(scale)
fac_data <- car_train %>% select(where(is.factor))
car_scale <- data.frame(num_data, fac_data)
model_scale <- lm(selling_price~., car_scale)
summary(model_scale)
##
## Call:
## lm(formula = selling_price ~ ., data = car_scale)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2874 -0.2401 0.0128 0.1862 5.4142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.805766 0.087509 9.208 < 2e-16 ***
## year 0.155202 0.009972 15.564 < 2e-16 ***
## km_driven -0.067133 0.008316 -8.073 8.15e-16 ***
## mileage 0.073018 0.011662 6.261 4.07e-10 ***
## engine 0.065462 0.016163 4.050 5.18e-05 ***
## max_power 0.560830 0.012582 44.574 < 2e-16 ***
## seats -0.037662 0.010310 -3.653 0.000261 ***
## fuelDiesel -0.039934 0.083522 -0.478 0.632578
## fuelLPG 0.186478 0.137646 1.355 0.175540
## fuelPetrol -0.087191 0.083956 -1.039 0.299063
## seller_typeIndividual -0.301247 0.022450 -13.418 < 2e-16 ***
## seller_typeTrustmark Dealer -0.419179 0.046806 -8.956 < 2e-16 ***
## transmissionManual -0.544820 0.026790 -20.336 < 2e-16 ***
## ownerFourth & Above Owner 0.050515 0.052218 0.967 0.333392
## ownerSecond Owner -0.051941 0.018205 -2.853 0.004343 **
## ownerTest Drive Car 3.194600 0.280164 11.403 < 2e-16 ***
## ownerThird Owner -0.016501 0.031281 -0.528 0.597849
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5577 on 6307 degrees of freedom
## Multiple R-squared: 0.6897, Adjusted R-squared: 0.6889
## F-statistic: 876.3 on 16 and 6307 DF, p-value: < 2.2e-16
check_model(model_scale)
After creating a model with scaled data, the results are not much different from model_all: the adjusted R-squared is the same 0.6889, and again the only assumption satisfied is the absence of multicollinearity. For this reason, other regression models are recommended for this data, such as polynomial regression, random forest regression, etc.
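Before switching model families entirely, one more option worth sketching (not part of the original analysis) is a log transform of the right-skewed target, which often stabilizes the residual variance:
# Hedged sketch: model the log of selling_price instead of the raw price
model_log <- lm(log(selling_price) ~ ., car_train)
# back-transform predictions to the original price scale
pred_log <- exp(predict(model_log, newdata = car_test))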
Prediction
We generate predictions for the test set together with 95% prediction intervals:
pred <- predict(model_all, newdata = car_test, interval = "prediction", level = 0.95)
head(pred)
## fit lwr upr
## 2 693441.39 -207413.9 1594296.7
## 3 18287.87 -883617.2 920193.0
## 13 72350.87 -828574.5 973276.2
## 15 485480.94 -415205.7 1386167.6
## 20 98401.11 -802465.1 999267.3
## 22 518553.09 -382295.0 1419401.2
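Each row contains the point prediction (fit) plus the lower and upper bounds of the 95% prediction interval (lwr, upr). If only point predictions are needed, the interval argument can simply be dropped; a small sketch:
# Point predictions only (a plain numeric vector)
pred_point <- predict(model_all, newdata = car_test)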
Evaluation
After making predictions, we must check whether the model produces predictions with acceptably small error. Several metrics can be used to evaluate a regression model:
- R-Squared and Adjusted R-Squared: how well the model explains the variance of the target variable
- Error value: whether the predictions achieve a small error
The error metrics we will use are MAE (Mean Absolute Error) and MAPE (Mean Absolute Percentage Error). MAE is the average of the absolute errors, while MAPE expresses the average deviation as a fraction of the true value.
\[ MAE = \frac{1}{n} \sum_{i=1}^{n} |\hat y_i - y_i| \]
\[ MAPE = \frac{1}{n} \sum_{i=1}^{n} \frac{|\hat y_i - y_i|}{y_i} \]
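As a cross-check, both metrics can be computed by hand from the point predictions (the fit column of pred); a minimal base-R sketch:
# Base-R equivalents of the metrics above
pred_fit <- pred[, "fit"]             # point predictions
y_true <- car_test$selling_price
mean(abs(pred_fit - y_true))          # MAE
mean(abs(pred_fit - y_true) / y_true) # MAPE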
min <- min(car$selling_price)
max <- max(car$selling_price)
# evaluate on the point predictions only; the original run passed the whole
# fit/lwr/upr matrix to these functions, which inflates the values printed below
mape <- MAPE(y_pred = pred[, "fit"], y_true = car_test$selling_price)
mae <- MAE(y_pred = pred[, "fit"], y_true = car_test$selling_price)
value <- c(min, max, mape, mae)
eval <- as.data.frame(value)
row.names(eval) <- c("min", "max", "mape", "mae")
eval
## value
## min 2.999900e+04
## max 1.000000e+07
## mape 2.263373e+00
## mae 7.127226e+05
Computed on the point predictions, the MAPE is about 0.87: the model's predictions deviate from the actual prices by roughly 87% on average (the 2.26 printed above comes from averaging over the interval bounds as well). Either way the error is far too large, so linear regression is not suitable for predicting the selling price of used cars in this data.
Conclusion
The data is a history of used car sales sourced from kaggle.com. The purpose of this analysis was to create a model that predicts the selling price of a used car from its features. Linear regression models were fitted and tuned via feature selection, but every variant gave the same results. Only one assumption of linear regression could be met, namely the absence of multicollinearity; the assumptions of residual normality, linearity, and homoscedasticity were not satisfied.
From these results I conclude that linear regression is not suitable for predicting the selling price of used cars in this data; other regression models, such as polynomial regression or random forest regression, are recommended instead.