Linear Regression on Car Price Prediction
Intro
What We’ll Do
We will learn to use a linear regression model on the car price assignment dataset. We want to know the relationship among the variables, especially between the car price and the other variables. We also want to predict the price of a new car based on historical data. You can download the data here.
Business Goal
This section is adapted from this Kaggle kernel, whose author has done the same analysis using Python instead of R.
A Chinese automobile company Geely Auto aspires to enter the US market by setting up their manufacturing unit there and producing cars locally to give competition to their US and European counterparts.
They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:
- Which variables are significant in predicting the price of a car
- How well those variables describe the price of a car
Based on various market surveys, the consulting firm has gathered a large dataset of different types of cars across the American market.
We are required to model the price of cars with the available independent variables. The model will be used by the management to understand exactly how prices vary with the independent variables. They can accordingly manipulate the design of the cars, the business strategy, etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.
Data Preparation
Load the required packages.
library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(tidymodels)
library(car)
library(scales)
library(lmtest)
options(scipen = 100, max.print = 1e+06)
Load the dataset.
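The chunk below is a minimal sketch of how the data could be loaded; the file name is an assumed placeholder, since the original call is not shown. The data.table class in the structure output suggests the file was read with fread().
# read the data (the file name here is only an assumed placeholder)
car <- fread("CarPrice_Assignment.csv")
# inspect the structure of the data
str(car)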
Classes 'data.table' and 'data.frame': 205 obs. of 26 variables:
$ car_ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ symboling : int 3 3 1 2 2 2 1 1 1 0 ...
$ CarName : chr "alfa-romero giulia" "alfa-romero stelvio" "alfa-romero Quadrifoglio" "audi 100 ls" ...
$ fueltype : chr "gas" "gas" "gas" "gas" ...
$ aspiration : chr "std" "std" "std" "std" ...
$ doornumber : chr "two" "two" "two" "four" ...
$ carbody : chr "convertible" "convertible" "hatchback" "sedan" ...
$ drivewheel : chr "rwd" "rwd" "rwd" "fwd" ...
$ enginelocation : chr "front" "front" "front" "front" ...
$ wheelbase : num 88.6 88.6 94.5 99.8 99.4 ...
$ carlength : num 169 169 171 177 177 ...
$ carwidth : num 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
$ carheight : num 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
$ curbweight : int 2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
$ enginetype : chr "dohc" "dohc" "ohcv" "ohc" ...
$ cylindernumber : chr "four" "four" "six" "four" ...
$ enginesize : int 130 130 152 109 136 136 136 136 131 131 ...
$ fuelsystem : chr "mpfi" "mpfi" "mpfi" "mpfi" ...
$ boreratio : num 3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
$ stroke : num 2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
$ compressionratio: num 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
$ horsepower : int 111 111 154 102 115 110 110 110 140 160 ...
$ peakrpm : int 5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
$ citympg : int 21 21 19 24 18 19 19 19 17 16 ...
$ highwaympg : int 27 27 26 30 22 25 25 25 20 22 ...
$ price : num 13495 16500 16500 13950 17450 ...
- attr(*, ".internal.selfref")=<externalptr>
The data has 205 rows and 26 columns. car_ID is a unique identifier for each car, so we can ignore it. Our target variable is price, which signifies the price of the car. We will use every other variable except CarName.
Before we go further, we first need to make sure that our data is clean and will be useful. If you look closely, there are some problems with the categorical variables. For example, look at the cylindernumber variable below.
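A sketch of how the category counts can be inspected (the original count table is not reproduced here):
# count the number of cars for each cylinder number
car %>% count(cylindernumber)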
There are some categories with only 1 instance, such as cars with three or twelve cylinders. This information may not be too useful for us, and we can consider it noise. Furthermore, it will cause problems during the train-test split, where a rare category may appear in only one of the datasets. The same problem can be found in the fuelsystem and enginetype variables. So, we need to filter the data and keep only categories with more than 3 instances. Since we don't use the car_ID and CarName variables, we may remove them as well.
# remove unused variables
car_data <- car %>% select(-c(car_ID, CarName))
# keep cylinder numbers with more than 3 instances
cylinder <- car %>% count(cylindernumber) %>% filter(n > 3)
car_data <- car_data[car_data$cylindernumber %in% cylinder$cylindernumber, ]
car_data$cylindernumber <- factor(car_data$cylindernumber, unique(car_data$cylindernumber))
# keep fuel systems with more than 3 instances
fuel <- car_data %>% count(fuelsystem) %>% filter(n > 3)
car_data <- car_data[car_data$fuelsystem %in% fuel$fuelsystem, ]
car_data$fuelsystem <- factor(car_data$fuelsystem, unique(car_data$fuelsystem))
# keep engine types with more than 3 instances
engine <- car_data %>% count(enginetype) %>% filter(n > 3)
car_data <- car_data[car_data$enginetype %in% engine$enginetype, ]
car_data$enginetype <- factor(car_data$enginetype, unique(car_data$enginetype))
# transform character into factor
car_data <- car_data %>% mutate_if(~is.character(.), ~as.factor(.))
Exploratory Data Analysis
Exploratory data analysis is a phase where we explore the data variables and see if there are any patterns that may indicate some kind of relationship between them.
Find the Pearson correlation between features.
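One way to produce such a correlation plot is with ggcorr() from the GGally package; this is only a sketch, since the exact plotting call is not shown in the original text.
# Pearson correlation between the numeric features
ggcorr(car_data %>% select_if(is.numeric), label = TRUE, label_size = 3)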
The graphic shows that many variables have a strong correlation with the price variable.
Modeling
Holdout: Train-Test Split
Before we make the model, we need to split the data into a train dataset and a test dataset. We will use the train dataset to train the linear regression model. The test dataset will be used as a comparison, to see whether the model overfits and cannot predict new data that it hasn't seen during the training phase. We will use 70% of the data as the training data and the rest as the testing data.
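The split below is a sketch; the exact seed and sampling call are assumptions, but the resulting index object is reused later when building the second and third models.
# 70% of the rows go to the training set, the rest to the test set
set.seed(123)
index <- sample(nrow(car_data), size = round(0.7 * nrow(car_data)))
data_train <- car_data[index, ]
data_test <- car_data[-index, ]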
Linear Regression
Now we will try to fit a linear regression model using price as the target variable.
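The summary below was presumably produced by regressing price on all the remaining predictors:
# fit a linear regression of price on all other variables
car_lm <- lm(price ~ ., data = data_train)
summary(car_lm)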
Call:
lm(formula = price ~ ., data = data_train)
Residuals:
Min 1Q Median 3Q Max
-4300.0 -998.4 -7.8 867.1 7489.9
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -23895.5859 17597.7769 -1.358 0.177500
symboling -243.6330 292.7094 -0.832 0.407163
fueltypegas -9771.5224 8704.8820 -1.123 0.264271
aspirationturbo 2050.9818 1027.0956 1.997 0.048503 *
doornumbertwo 348.7621 662.2424 0.527 0.599588
carbodyhardtop -3943.9655 1496.9590 -2.635 0.009735 **
carbodyhatchback -4039.9905 1366.0964 -2.957 0.003856 **
carbodysedan -2937.7998 1453.2513 -2.022 0.045843 *
carbodywagon -3165.9489 1544.2484 -2.050 0.042914 *
drivewheelfwd 274.0162 1590.6282 0.172 0.863567
drivewheelrwd -330.3959 1709.1286 -0.193 0.847099
enginelocationrear 6174.1259 2882.7907 2.142 0.034597 *
wheelbase 18.7890 123.5771 0.152 0.879454
carlength -44.3402 58.1712 -0.762 0.447678
carwidth 564.4492 277.6284 2.033 0.044639 *
carheight -143.8636 155.5097 -0.925 0.357092
curbweight 4.0568 2.0437 1.985 0.049831 *
enginetypeohcv -7447.5435 1716.1990 -4.340 0.0000337 ***
enginetypeohc 4396.4721 1322.6065 3.324 0.001233 **
enginetypel 934.8513 2115.6359 0.442 0.659513
enginetypeohcf 368.0238 2197.4718 0.167 0.867327
cylindernumbersix 6043.9415 1794.1427 3.369 0.001067 **
cylindernumberfive 2312.4204 1390.5507 1.663 0.099392 .
cylindernumbereight 15504.1504 3860.5657 4.016 0.000113 ***
enginesize 126.8831 31.7339 3.998 0.000121 ***
fuelsystem2bbl -130.0844 682.3101 -0.191 0.849176
fuelsystem1bbl -498.4802 1101.2055 -0.453 0.651749
fuelsystemidi NA NA NA NA
fuelsystemspdi -2773.5140 1250.9062 -2.217 0.028829 *
boreratio -76.9598 2022.5042 -0.038 0.969721
stroke -5229.6855 1134.8094 -4.608 0.0000118 ***
compressionratio -659.8354 662.5603 -0.996 0.321662
horsepower 10.8915 26.8331 0.406 0.685668
peakrpm 2.7467 0.7146 3.844 0.000211 ***
citympg 89.5902 205.8098 0.435 0.664260
highwaympg 37.5817 175.5260 0.214 0.830890
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2070 on 102 degrees of freedom
Multiple R-squared: 0.9457, Adjusted R-squared: 0.9277
F-statistic: 52.29 on 34 and 102 DF, p-value: < 0.00000000000000022
We get a lot more coefficients than the variables that we have. This is because the categorical variables are transformed into dummy variables. For example, we can see how the cylindernumber variable is transformed into several dummy variables using contrasts.
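The dummy coding shown below can be reproduced with the contrasts() function (a sketch of the call):
# treatment (dummy) coding of the cylindernumber factor
contrasts(car_data$cylindernumber)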
six five two eight
four 0 0 0 0
six 1 0 0 0
five 0 1 0 0
two 0 0 1 0
eight 0 0 0 1
The matrix means that if the value of cylindernumbersix is 1 while cylindernumberfive, cylindernumbertwo, and cylindernumbereight are 0, the car has 6 cylinders.
The summary of the car_lm model shows a lot of information, but for now we focus on the Pr(>|t|) column. This column shows the significance level of each variable in the model. If the value is below 0.05, we can safely assume that the variable has a significant effect on the model (meaning that the estimated coefficient is significantly different from 0), and vice versa. Thus, we can build a simpler model by removing the variables with p-value > 0.05, since they don't have a significant effect on our model. The Estimate column shows the coefficient of each variable. To interpret a coefficient: for example, every 1-unit increase in highwaympg contributes a 37.5817 increase in the car price, holding the other variables constant.
car2 <- car_data %>% select(aspiration, carbody, enginelocation, carwidth, curbweight,
enginetype, cylindernumber, enginesize, stroke, peakrpm, price)
data_train2 <- car2[index, ]
data_test2 <- car2[-index, ]
set.seed(123)
car_lm2 <- lm(price ~ ., data = data_train2)
summary(car_lm2)
Call:
lm(formula = price ~ ., data = data_train2)
Residuals:
Min 1Q Median 3Q Max
-5192.3 -1096.6 99.6 950.7 8836.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41853.6466 11569.3089 -3.618 0.000439 ***
aspirationturbo 2425.8590 603.7624 4.018 0.000104 ***
carbodyhardtop -3946.7773 1380.5740 -2.859 0.005029 **
carbodyhatchback -3890.9733 1242.4144 -3.132 0.002191 **
carbodysedan -3016.0963 1225.3726 -2.461 0.015287 *
carbodywagon -3530.2860 1315.1982 -2.684 0.008316 **
enginelocationrear 6154.8597 2254.6829 2.730 0.007308 **
carwidth 546.0976 193.0814 2.828 0.005499 **
curbweight 2.5934 1.3629 1.903 0.059487 .
enginetypeohcv -7300.5773 1441.9784 -5.063 0.00000153833 ***
enginetypeohc 4004.0916 1198.0752 3.342 0.001114 **
enginetypel 1507.6124 1605.9621 0.939 0.349772
enginetypeohcf 771.4437 1642.9510 0.470 0.639545
cylindernumbersix 6806.2328 1184.9244 5.744 0.00000007340 ***
cylindernumberfive 3092.1669 946.8885 3.266 0.001430 **
cylindernumbereight 16694.5004 2640.3033 6.323 0.00000000475 ***
enginesize 117.1310 20.5899 5.689 0.00000009467 ***
stroke -4462.1866 867.4123 -5.144 0.00000108168 ***
peakrpm 2.0998 0.4438 4.732 0.00000623427 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2045 on 118 degrees of freedom
Multiple R-squared: 0.9387, Adjusted R-squared: 0.9294
F-statistic: 100.4 on 18 and 118 DF, p-value: < 0.00000000000000022
We have removed the non-significant variables. To see whether this affects our model, we can check the Adjusted R-squared value of the two previous models. The first model, with the complete set of variables, has an adjusted R-squared of 0.9277, meaning that the model can explain 92.77% of the variance in the target variable (car price). Meanwhile, our simpler model has an adjusted R-squared of 92.94%, not much different from the first model. This shows that it is safe to remove variables whose coefficients are not significant.
Evaluation
Model Performance
The performance of our model (how well our model predicts the target variable) can be calculated using the root mean squared error: \[ RMSE = \sqrt{\frac{\sum_{i=1}^{n}{(prediction_i - actual_i)^2}}{n}} \] RMSE is preferable to MAE (mean absolute error) because RMSE squares the difference between the actual and predicted values, so predictions with larger errors are penalized more heavily. This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the caret package. Below is the performance of the first model (with the complete set of variables).
lm_pred <- predict(car_lm, newdata = data_test %>% select(-price))
# RMSE of train dataset
RMSE(pred = car_lm$fitted.values, obs = data_train$price)
[1] 1786.054
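# RMSE of test dataset (a sketch of the call that likely produced the value below)
RMSE(pred = lm_pred, obs = data_test$price)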
[1] 2891.597
Below is the performance of the second model (with the non-significant variables removed).
lm_pred2 <- predict(car_lm2, newdata = data_test2 %>% select(-price))
# RMSE of train dataset
RMSE(pred = car_lm2$fitted.values, obs = data_train2$price)
[1] 1898.138
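# RMSE of test dataset (a sketch of the call that likely produced the value below)
RMSE(pred = lm_pred2, obs = data_test2$price)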
[1] 2819.664
The second model is better than the first at predicting the testing dataset, even if only by a small margin. In both models, the error on the training dataset is lower than on the test dataset, suggesting that our models may be overfitting.
Assumptions
Linear regression is a parametric model, meaning that in order for the model equation to be valid, the model must satisfy some classical assumptions. A linear regression that doesn't satisfy these assumptions may be misleading, or simply have biased estimators. In this section, we only check the second model (the model with the removed variables).
1. Linearity
The linear regression model assumes that there is a straight-line relationship between the predictors and the response. If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect. In addition, the prediction accuracy of the model can be significantly reduced.
Residual plots are a useful graphical tool for identifying non-linearity. If there is a pattern in the residual plot, it means that the model can be further improved or that it does not meet the linearity assumption. The plot shows the relationship between the residuals/errors and the predicted/fitted values.
Left: A strong pattern in the residuals indicates non-linearity in the data, Right: There is little pattern in the data (source: James et al., 2013)
resact <- data.frame(residual = car_lm2$residuals, fitted = car_lm2$fitted.values)
resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_smooth() + geom_hline(aes(yintercept = 0)) +
theme(panel.grid = element_blank(), panel.background = element_blank())
There is a pattern in the data: the residuals become more negative as the fitted values increase, before increasing again. This pattern indicates that our model may not be linear enough.
2. Normality Test
The second assumption in linear regression is that the residuals follow a normal distribution. We can easily check this using the Shapiro-Wilk normality test.
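The output below matches shapiro.test() applied to the model residuals (a sketch of the call):
# Shapiro-Wilk normality test on the residuals of the second model
shapiro.test(car_lm2$residuals)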
Shapiro-Wilk normality test
data: car_lm2$residuals
W = 0.96665, p-value = 0.001962
The null hypothesis is that the residuals follow a normal distribution. With a p-value < 0.05, we reject the null hypothesis and conclude that our residuals do not follow a normal distribution.
3. Autocorrelation
The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms (no autocorrelation). If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be. For example, a 95% confidence interval may in reality have a much lower probability than 0.95 of containing the true value of the parameter. In addition, p-values associated with the model will be lower than they should be; this could cause us to erroneously conclude that a parameter is statistically significant. In short, if the error terms are correlated, we may have an unwarranted sense of confidence in our model.
Autocorrelation can be detected using the Durbin-Watson test, with the null hypothesis that there is no autocorrelation.
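The output below matches durbinWatsonTest() from the car package (a sketch of the call):
# Durbin-Watson test for autocorrelation in the residuals
durbinWatsonTest(car_lm2)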
lag Autocorrelation D-W Statistic p-value
1 0.2579343 1.45034 0.002
Alternative hypothesis: rho != 0
The result shows that the null hypothesis is rejected, meaning that our residuals are autocorrelated.
4. Heteroscedasticity
Heteroscedasticity means that the variances of the error terms are non-constant. One can identify non-constant variances in the errors by the presence of a funnel shape in the residual plot, just as in the linearity check.
Left: Heteroscedasticity (Non-Constant Variances of Residuals), Right: Homoscedasticity (Constant Variances of Residuals) (source: James et al., 2013)
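The test output below matches bptest() from the lmtest package (a sketch of the call):
# Breusch-Pagan test for heteroscedasticity
bptest(car_lm2)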
studentized Breusch-Pagan test
data: car_lm2
BP = 45.297, df = 18, p-value = 0.0003754
From the residual plot, we can observe that at lower fitted values the residuals are concentrated around 0, while at higher fitted values the residuals also get bigger. A second way to detect heteroscedasticity is the Breusch-Pagan test, whose null hypothesis is that there is no heteroscedasticity. With a p-value < 0.05, we can conclude that heteroscedasticity is present in our model.
5. Multicollinearity
Multicollinearity means that there is correlation between the independent variables/predictors. To check for multicollinearity, we can measure the variance inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
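The table below matches vif() from the car package applied to the second model (a sketch of the call):
# generalized variance inflation factors for the second model
vif(car_lm2)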
GVIF Df GVIF^(1/(2*Df))
aspiration 1.889182 1 1.374475
carbody 3.166816 4 1.154989
enginelocation 2.395079 1 1.547604
carwidth 5.363939 1 2.316018
curbweight 14.985214 1 3.871074
enginetype 10.573281 4 1.342846
cylindernumber 13.910093 3 1.550797
enginesize 18.188189 1 4.264761
stroke 2.179364 1 1.476267
peakrpm 1.434420 1 1.197673
Model Improvement
Tuning
We have already seen that our model doesn't satisfy some of the assumptions: it shows non-linearity, heteroscedasticity, and autocorrelation. Now we will try to fix them. To make the model more linear, we can transform some of the variables, both the target and the predictors. One approach that can be adopted is to drop variables that have a correlation coefficient above 0.7 with other predictors.
Based on the correlation matrix, enginesize has a strong correlation with other variables. We will try to remove the variable and see if the autocorrelation is still present.
# remove the enginesize and carbody variables
car3 <- car2 %>% select(-c(enginesize, carbody))
car3 <- car3 %>% mutate_if(~is.numeric(.), ~log10(.))
head(car3)
Performance
We will measure the R-squared and the RMSE. Since our target is transformed with log10, we need to transform the predictions back with \(10^{x}\) to get the original price values and a meaningful comparison with our previous models.
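The summary below was presumably produced by splitting the transformed data with the same index and refitting the model; a sketch of those steps:
# reuse the earlier train-test index on the transformed data
data_train3 <- car3[index, ]
data_test3 <- car3[-index, ]
# refit the linear regression on the log-transformed data
car_lm3 <- lm(price ~ ., data = data_train3)
summary(car_lm3)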
Call:
lm(formula = price ~ ., data = data_train3)
Residuals:
Min 1Q Median 3Q Max
-0.204990 -0.036988 -0.001635 0.034233 0.181823
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.75056 1.44412 -6.059 0.000000015399 ***
aspirationturbo 0.03736 0.01693 2.207 0.029204 *
enginelocationrear 0.24878 0.06269 3.968 0.000122 ***
carwidth 3.00424 0.88911 3.379 0.000975 ***
curbweight 1.64020 0.14867 11.032 < 0.0000000000000002 ***
enginetypeohcv -0.16921 0.04432 -3.818 0.000212 ***
enginetypeohc 0.04616 0.03585 1.288 0.200308
enginetypel -0.03742 0.04807 -0.778 0.437809
enginetypeohcf -0.02427 0.04801 -0.506 0.614088
cylindernumbersix 0.20918 0.02982 7.015 0.000000000135 ***
cylindernumberfive 0.06677 0.02920 2.287 0.023922 *
cylindernumbereight 0.39591 0.06526 6.067 0.000000014866 ***
stroke -0.48706 0.16364 -2.976 0.003513 **
peakrpm 0.52712 0.15907 3.314 0.001210 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.06426 on 123 degrees of freedom
Multiple R-squared: 0.9161, Adjusted R-squared: 0.9072
F-statistic: 103.3 on 13 and 123 DF, p-value: < 0.00000000000000022
lm_pred3 <- predict(car_lm3, newdata = data_test3 %>% select(-price))
# RMSE of train dataset
RMSE(pred = 10^(car_lm3$fitted.values), obs = 10^(data_train3$price))
[1] 2222.705
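# RMSE of test dataset, back-transformed from log10 (a sketch of the call that likely produced the value below)
RMSE(pred = 10^(lm_pred3), obs = 10^(data_test3$price))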
[1] 2896.752
Our model has an adjusted R-squared of 90.72%. The value dropped a bit, but in return we get a model that satisfies the assumptions. Unfortunately, our model still seems to overfit, with a training RMSE of 2222.705 while the test dataset has an RMSE of 2896.752. You can try adjusting the proportion of the train-test split and see whether the model still overfits.
Assumptions
1. Linearity
resact <- data.frame(residual = car_lm3$residuals, fitted = car_lm3$fitted.values)
resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) +
geom_smooth() + theme(panel.grid = element_blank(), panel.background = element_blank())
There is little to no discernible pattern in the residual plot, so we can conclude that our model satisfies the linearity assumption.
2. Normality Test
Shapiro-Wilk normality test
data: car_lm3$residuals
W = 0.98576, p-value = 0.1674
With p-value > 0.05, we can conclude that our residuals are normally distributed.
3. Autocorrelation
lag Autocorrelation D-W Statistic p-value
1 0.05555662 1.873862 0.44
Alternative hypothesis: rho != 0
With p-value > 0.05, we can conclude that autocorrelation is not present.
4. Heteroscedasticity
studentized Breusch-Pagan test
data: car_lm3
BP = 21.561, df = 13, p-value = 0.06255
resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) +
theme(panel.grid = element_blank(), panel.background = element_blank())
With a p-value > 0.05, we can conclude that heteroscedasticity is not present.
5. Multicollinearity
GVIF Df GVIF^(1/(2*Df))
aspiration 1.505239 1 1.226882
enginelocation 1.875705 1 1.369564
carwidth 4.862059 1 2.205008
curbweight 4.907923 1 2.215383
enginetype 7.797281 4 1.292686
cylindernumber 6.872150 3 1.378845
stroke 1.560167 1 1.249066
peakrpm 1.385751 1 1.177179
Conclusion
Variables that are useful for describing the variance in car prices are aspiration, enginelocation, carwidth, curbweight, enginetype, cylindernumber, stroke, and peakrpm. Our final model satisfies the classical assumptions. The R-squared of the model is high: the predictors explain 90.72% of the variance in the car price. The accuracy of the model in predicting the car price is measured with RMSE; the training data has an RMSE of 2222.705 and the testing data an RMSE of 2896.752, suggesting that our model may overfit the training dataset.
We have now learned how to build a linear regression model and what needs to be considered when building one.
Reference
- Aora, Ridhi. Answer to the question "What are the ways to deal with auto-correlation problems in Multiple Regression Analysis?". Accessed September 23, 2019, via https://www.researchgate.net/post/What_are_the_ways_to_deal_with_auto-correlation_problems_in_Multiple_Regression_Analysis.
- James, Gareth, Witten, Daniela, Hastie, Trevor, and Tibshirani, Robert. 2013. An Introduction to Statistical Learning: with Applications in R. New York: Springer.
- Gujarati, Damodar and Dawn C. Porter. 2009. Basic Econometrics. New York: McGraw-Hill.