Linear Regression on Car Price Prediction

Intro

What We’ll Do

We will learn to use a linear regression model on the car price assignment dataset. We want to know the relationship among the variables, especially between the car price and the other variables. We also want to predict the price of a new car based on the historical data. You can download the data here.

Business Goal

This section is copied from this Kaggle kernel, whose author has done the same analysis using Python instead of R.

A Chinese automobile company Geely Auto aspires to enter the US market by setting up their manufacturing unit there and producing cars locally to give competition to their US and European counterparts.

They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:

  • Which variables are significant in predicting the price of a car
  • How well those variables describe the price of a car

Based on various market surveys, the consulting firm has gathered a large dataset of different types of cars across the American market.

We are required to model the price of cars with the available independent variables. The model will be used by the management to understand how exactly the prices vary with the independent variables. They can accordingly manipulate the design of the cars, the business strategy, etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.

Data Preparation

Load the required package.
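Since the original code chunks are not shown, here is a minimal sketch of the packages this walkthrough appears to rely on; the exact set is an assumption based on the functions and outputs used later.

library(data.table)  # fread() and fast data manipulation
library(GGally)      # ggcorr() for the correlation plot
library(caret)       # RMSE() for model evaluation
library(car)         # vif() and durbinWatsonTest()
library(lmtest)      # bptest() for the Breusch-Pagan test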

Load the dataset.
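A sketch of the loading step; the file name is an assumption based on the Kaggle dataset, so adjust the path to wherever you saved the file.

cars <- fread("CarPrice_Assignment.csv")  # assumed file name and location
str(cars)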

Classes 'data.table' and 'data.frame':  205 obs. of  26 variables:
 $ car_ID          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ symboling       : int  3 3 1 2 2 2 1 1 1 0 ...
 $ CarName         : chr  "alfa-romero giulia" "alfa-romero stelvio" "alfa-romero Quadrifoglio" "audi 100 ls" ...
 $ fueltype        : chr  "gas" "gas" "gas" "gas" ...
 $ aspiration      : chr  "std" "std" "std" "std" ...
 $ doornumber      : chr  "two" "two" "two" "four" ...
 $ carbody         : chr  "convertible" "convertible" "hatchback" "sedan" ...
 $ drivewheel      : chr  "rwd" "rwd" "rwd" "fwd" ...
 $ enginelocation  : chr  "front" "front" "front" "front" ...
 $ wheelbase       : num  88.6 88.6 94.5 99.8 99.4 ...
 $ carlength       : num  169 169 171 177 177 ...
 $ carwidth        : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
 $ carheight       : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
 $ curbweight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
 $ enginetype      : chr  "dohc" "dohc" "ohcv" "ohc" ...
 $ cylindernumber  : chr  "four" "four" "six" "four" ...
 $ enginesize      : int  130 130 152 109 136 136 136 136 131 131 ...
 $ fuelsystem      : chr  "mpfi" "mpfi" "mpfi" "mpfi" ...
 $ boreratio       : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
 $ stroke          : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
 $ compressionratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
 $ horsepower      : int  111 111 154 102 115 110 110 110 140 160 ...
 $ peakrpm         : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
 $ citympg         : int  21 21 19 24 18 19 19 19 17 16 ...
 $ highwaympg      : int  27 27 26 30 22 25 25 25 20 22 ...
 $ price           : num  13495 16500 16500 13950 17450 ...
 - attr(*, ".internal.selfref")=<externalptr> 

The data has 205 rows and 26 columns. car_ID is a unique identifier for each car, so we can ignore it. Our target variable is price, which signifies the price of the said car. We will use every other variable except CarName.

Before we go further, we first need to make sure that our data is clean and will be useful. If you look closely, there are some problems with the categorical variables. For example, look at the cylindernumber variable below.
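One quick way to see the category counts is a frequency table (a sketch; cars is the assumed name of the loaded data):

table(cars$cylindernumber)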

There are some categories with only 1 instance, such as cars with 3 cylinders or 12 cylinders. This information may not be too useful for us and we can consider it noise. Furthermore, it will cause a problem during the train-test split, where a category may appear in only one of the two datasets. The same problem can be found in the fuelsystem and enginetype variables. So, we need to filter and keep only categories with more than 3 instances. Since we don't use the car_ID and CarName variables, we may remove them as well, as sketched below.
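A sketch of this cleaning step, assuming the data is stored in a data.table named cars; the variable names and the threshold come from the paragraph above.

# drop the identifier and the car name
cars_clean <- cars[, !c("car_ID", "CarName"), with = FALSE]

# keep only categories with more than 3 instances in the problematic variables
for (col in c("cylindernumber", "fuelsystem", "enginetype")) {
  keep <- names(which(table(cars_clean[[col]]) > 3))
  cars_clean <- cars_clean[cars_clean[[col]] %in% keep, ]
}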

Exploratory Data Analysis

Exploratory data analysis is a phase where we explore the data variables and see if there are any patterns that may indicate some kind of correlation between variables.

Find the Pearson correlation between features.
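A sketch of how the correlation plot might be produced with GGally::ggcorr; the plotting package is an assumption, and ggcorr() silently drops the categorical columns.

ggcorr(cars_clean, label = TRUE, label_size = 3, hjust = 1)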

The graphic shows that a lot of variables have a strong correlation with the price variable.

Modeling

Holdout: Train-Test Split

Before we build the model, we need to split the data into a train dataset and a test dataset. We will use the train dataset to train the linear regression model. The test dataset will be used as a comparison, to see whether the model overfits and cannot predict new data that it hasn't seen during the training phase. We will use 70% of the data as the training data and the rest as the testing data.
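A minimal sketch of the split, assuming a simple random sample; the seed value is arbitrary, so the exact rows may differ from the outputs shown below.

set.seed(123)  # arbitrary seed, an assumption
train_idx <- sample(nrow(cars_clean), size = 0.7 * nrow(cars_clean))
data_train <- cars_clean[train_idx, ]
data_test  <- cars_clean[-train_idx, ]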

Linear Regression

Now we will fit the linear regression model using price as the target variable.
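The fitted call matches the one echoed in the summary below.

car_lm <- lm(price ~ ., data = data_train)
summary(car_lm)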


Call:
lm(formula = price ~ ., data = data_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-4300.0  -998.4    -7.8   867.1  7489.9 

Coefficients: (1 not defined because of singularities)
                       Estimate  Std. Error t value  Pr(>|t|)    
(Intercept)         -23895.5859  17597.7769  -1.358  0.177500    
symboling             -243.6330    292.7094  -0.832  0.407163    
fueltypegas          -9771.5224   8704.8820  -1.123  0.264271    
aspirationturbo       2050.9818   1027.0956   1.997  0.048503 *  
doornumbertwo          348.7621    662.2424   0.527  0.599588    
carbodyhardtop       -3943.9655   1496.9590  -2.635  0.009735 ** 
carbodyhatchback     -4039.9905   1366.0964  -2.957  0.003856 ** 
carbodysedan         -2937.7998   1453.2513  -2.022  0.045843 *  
carbodywagon         -3165.9489   1544.2484  -2.050  0.042914 *  
drivewheelfwd          274.0162   1590.6282   0.172  0.863567    
drivewheelrwd         -330.3959   1709.1286  -0.193  0.847099    
enginelocationrear    6174.1259   2882.7907   2.142  0.034597 *  
wheelbase               18.7890    123.5771   0.152  0.879454    
carlength              -44.3402     58.1712  -0.762  0.447678    
carwidth               564.4492    277.6284   2.033  0.044639 *  
carheight             -143.8636    155.5097  -0.925  0.357092    
curbweight               4.0568      2.0437   1.985  0.049831 *  
enginetypeohcv       -7447.5435   1716.1990  -4.340 0.0000337 ***
enginetypeohc         4396.4721   1322.6065   3.324  0.001233 ** 
enginetypel            934.8513   2115.6359   0.442  0.659513    
enginetypeohcf         368.0238   2197.4718   0.167  0.867327    
cylindernumbersix     6043.9415   1794.1427   3.369  0.001067 ** 
cylindernumberfive    2312.4204   1390.5507   1.663  0.099392 .  
cylindernumbereight  15504.1504   3860.5657   4.016  0.000113 ***
enginesize             126.8831     31.7339   3.998  0.000121 ***
fuelsystem2bbl        -130.0844    682.3101  -0.191  0.849176    
fuelsystem1bbl        -498.4802   1101.2055  -0.453  0.651749    
fuelsystemidi                NA          NA      NA        NA    
fuelsystemspdi       -2773.5140   1250.9062  -2.217  0.028829 *  
boreratio              -76.9598   2022.5042  -0.038  0.969721    
stroke               -5229.6855   1134.8094  -4.608 0.0000118 ***
compressionratio      -659.8354    662.5603  -0.996  0.321662    
horsepower              10.8915     26.8331   0.406  0.685668    
peakrpm                  2.7467      0.7146   3.844  0.000211 ***
citympg                 89.5902    205.8098   0.435  0.664260    
highwaympg              37.5817    175.5260   0.214  0.830890    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2070 on 102 degrees of freedom
Multiple R-squared:  0.9457,    Adjusted R-squared:  0.9277 
F-statistic: 52.29 on 34 and 102 DF,  p-value: < 0.00000000000000022

We get a lot more coefficients than variables. This is because each categorical variable is transformed into a set of dummy variables. For example, we can see how the cylindernumber variable is encoded into dummy variables using contrasts.
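A sketch of how to inspect that dummy coding; the level ordering below is chosen to reproduce the matrix shown and is an assumption.

cyl <- factor(data_train$cylindernumber,
              levels = c("four", "six", "five", "two", "eight"))  # assumed ordering
contrasts(cyl)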

      six five two eight
four    0    0   0     0
six     1    0   0     0
five    0    1   0     0
two     0    0   1     0
eight   0    0   0     1

The matrix means that if the value of cylindernumbersix is 1 while cylindernumberfive, cylindernumbertwo, and cylindernumbereight are 0, the car has 6 cylinders.

The summary of the car_lm model shows a lot of information, but for now we should focus on the Pr(>|t|) column. This column shows the significance level of each variable in the model. If the value is below 0.05, we can safely assume that the variable has a significant effect on the model (meaning that the estimated coefficient is significantly different from 0), and vice versa. Thus, we can build a simpler model by removing the variables with p-value > 0.05, since they don't have a significant effect on our model. The Estimate column shows the coefficient of each variable. To interpret a coefficient: for example, every 1-unit increase in highwaympg contributes a 37.5817 increase in the car price, holding the other variables constant.
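A sketch of how the reduced model might be fit; the list of retained predictors is inferred from the summary below.

keep_vars <- c("price", "aspiration", "carbody", "enginelocation", "carwidth",
               "curbweight", "enginetype", "cylindernumber", "enginesize",
               "stroke", "peakrpm")
data_train2 <- data_train[, keep_vars, with = FALSE]
car_lm2 <- lm(price ~ ., data = data_train2)
summary(car_lm2)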


Call:
lm(formula = price ~ ., data = data_train2)

Residuals:
    Min      1Q  Median      3Q     Max 
-5192.3 -1096.6    99.6   950.7  8836.4 

Coefficients:
                       Estimate  Std. Error t value      Pr(>|t|)    
(Intercept)         -41853.6466  11569.3089  -3.618      0.000439 ***
aspirationturbo       2425.8590    603.7624   4.018      0.000104 ***
carbodyhardtop       -3946.7773   1380.5740  -2.859      0.005029 ** 
carbodyhatchback     -3890.9733   1242.4144  -3.132      0.002191 ** 
carbodysedan         -3016.0963   1225.3726  -2.461      0.015287 *  
carbodywagon         -3530.2860   1315.1982  -2.684      0.008316 ** 
enginelocationrear    6154.8597   2254.6829   2.730      0.007308 ** 
carwidth               546.0976    193.0814   2.828      0.005499 ** 
curbweight               2.5934      1.3629   1.903      0.059487 .  
enginetypeohcv       -7300.5773   1441.9784  -5.063 0.00000153833 ***
enginetypeohc         4004.0916   1198.0752   3.342      0.001114 ** 
enginetypel           1507.6124   1605.9621   0.939      0.349772    
enginetypeohcf         771.4437   1642.9510   0.470      0.639545    
cylindernumbersix     6806.2328   1184.9244   5.744 0.00000007340 ***
cylindernumberfive    3092.1669    946.8885   3.266      0.001430 ** 
cylindernumbereight  16694.5004   2640.3033   6.323 0.00000000475 ***
enginesize             117.1310     20.5899   5.689 0.00000009467 ***
stroke               -4462.1866    867.4123  -5.144 0.00000108168 ***
peakrpm                  2.0998      0.4438   4.732 0.00000623427 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2045 on 118 degrees of freedom
Multiple R-squared:  0.9387,    Adjusted R-squared:  0.9294 
F-statistic: 100.4 on 18 and 118 DF,  p-value: < 0.00000000000000022

We have removed the non-significant variables. To see whether this affects our model, we can compare the adjusted R-squared values of the two models. The first model with the complete set of variables has an adjusted R-squared of 0.9277, meaning that the model can explain 92.77% of the variance in the target variable (car price). Meanwhile, our simpler model has an adjusted R-squared of 92.94%, not much different from the first model. This shows that it is safe to remove the variables whose coefficients are not significant.

Evaluation

Model Performance

The performance of our model (how well our model predicts the target variable) can be calculated using the root mean squared error: \[ RMSE = \sqrt{\frac{\sum_{i=1}^{n}{(prediction_i - actual_i)^2}}{n}} \] RMSE is preferable to MAE (mean absolute error) here because RMSE squares the difference between the actual and predicted values, so predictions with larger errors are penalized more heavily. This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the caret package. Below is the performance of the first model (with the complete set of variables).
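A sketch of the RMSE calculation on the training and test sets; repeating the same calls with car_lm2 gives the second model's numbers further below.

RMSE(pred = predict(car_lm, newdata = data_train), obs = data_train$price)
RMSE(pred = predict(car_lm, newdata = data_test), obs = data_test$price)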

[1] 1786.054
[1] 2891.597

Below is the performance of the second model (with the non-significant variables removed).

[1] 1898.138
[1] 2819.664

The second model is better than the first at predicting the testing dataset, though only by a small margin. In both models, the error on the training dataset is lower than on the test dataset, suggesting that our model may be overfitting.

Assumptions

Linear regression is a parametric model, meaning that for the model equation to be valid, the model has to follow several classical assumptions. A linear regression that doesn't meet the assumptions may be misleading or simply have biased estimators. For this section, we only check the second model (the model with the removed variables).

1. Linearity

The linear regression model assumes that there is a straight-line relationship between the predictors and the response. If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect. In addition, the prediction accuracy of the model can be significantly reduced.

Residual plots are a useful graphical tool for identifying non-linearity. If there is a pattern in the residual plot, it means that the model can be further improved or that it does not meet the linearity assumption. The plot shows the relationship between the residuals/errors and the predicted/fitted values.

Left: A strong pattern in the residuals indicates non-linearity in the data, Right: There is little pattern in the data (source: James et al., 2013)

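A sketch of how the residual plot for our own model might be drawn (the original figure may have been produced differently):

plot(car_lm2$fitted.values, car_lm2$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # reference line at zero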

There is a pattern in our residuals: they become more negative as the fitted values increase, before increasing again. This pattern indicates that our model may not be linear enough.

2. Normality Test

The second assumption in linear regression is that the residuals follow a normal distribution. We can easily check this using the Shapiro-Wilk normality test.
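The test can be run directly on the residuals:

shapiro.test(car_lm2$residuals)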


    Shapiro-Wilk normality test

data:  car_lm2$residuals
W = 0.96665, p-value = 0.001962

The null hypothesis is that the residuals follow a normal distribution. With p-value < 0.05, the null hypothesis is rejected, and we conclude that our residuals do not follow a normal distribution.

3. Autocorrelation

The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms (no autocorrelation). If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be. For example, a 95% confidence interval may in reality have a much lower probability than 0.95 of containing the true value of the parameter. In addition, p-values associated with the model will be lower than they should be; this could cause us to erroneously conclude that a parameter is statistically significant. In short, if the error terms are correlated, we may have an unwarranted sense of confidence in our model.

Autocorrelation can be detected using the Durbin-Watson test, with the null hypothesis that there is no autocorrelation.
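A sketch using durbinWatsonTest() from the car package, which appears to match the output format below:

durbinWatsonTest(car_lm2)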

 lag Autocorrelation D-W Statistic p-value
   1       0.2579343       1.45034   0.002
 Alternative hypothesis: rho != 0

The result shows that the null hypothesis is rejected, meaning that our residuals are autocorrelated.

4. Heteroscedasticity

Heteroscedasticity means that the variances of the error terms are non-constant. Non-constant variances in the errors can be identified from the presence of a funnel shape in the residual plot, the same plot used for the linearity check.

Left: Heteroscedasticity (Non-Constant Variances of Residuals), Right: Homoscedasticity (Constant Variances of Residuals) (source: James et al., 2013)

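A sketch of the Breusch-Pagan test using bptest() from the lmtest package (the result is discussed after the output below):

bptest(car_lm2)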


    studentized Breusch-Pagan test

data:  car_lm2
BP = 45.297, df = 18, p-value = 0.0003754

We can observe that at lower fitted values, the residuals are concentrated around 0. As the fitted values increase, the residuals also get larger. A second way to detect heteroscedasticity is the Breusch-Pagan test, with the null hypothesis that there is no heteroscedasticity. With p-value < 0.05, we can conclude that heteroscedasticity is present in our model.

5. Multicollinearity

Multicollinearity means that there is correlation between the independent variables/predictors. To check for multicollinearity, we can measure the variance inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
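A sketch using vif() from the car package, which reports the generalized VIF (GVIF) when a predictor is a factor encoded as several dummy variables:

vif(car_lm2)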

                    GVIF Df GVIF^(1/(2*Df))
aspiration      1.889182  1        1.374475
carbody         3.166816  4        1.154989
enginelocation  2.395079  1        1.547604
carwidth        5.363939  1        2.316018
curbweight     14.985214  1        3.871074
enginetype     10.573281  4        1.342846
cylindernumber 13.910093  3        1.550797
enginesize     18.188189  1        4.264761
stroke          2.179364  1        1.476267
peakrpm         1.434420  1        1.197673

Model Improvement

Tuning

We have already seen that our model doesn't satisfy some of the assumptions, including linearity, heteroscedasticity, and autocorrelation. Now we will try to fix them. To make the model more linear, we can transform some of the variables, both the target and the predictors. Another approach that can be adopted is to drop the variables that have a correlation coefficient above 0.7.

Based on the correlation matrix, enginesize has a strong correlation with other variables. We will try to remove this variable and see whether autocorrelation is still present, as sketched below.
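A sketch of one possible version of this step, assuming the target and the remaining numeric predictors are log10-transformed and that enginesize and carbody are dropped; the exact recipe is inferred from the coefficients reported below.

keep_cols <- c("price", "aspiration", "enginelocation", "carwidth", "curbweight",
               "enginetype", "cylindernumber", "stroke", "peakrpm")
num_cols  <- c("price", "carwidth", "curbweight", "stroke", "peakrpm")

data_train3 <- data_train[, keep_cols, with = FALSE]
data_test3  <- data_test[, keep_cols, with = FALSE]
data_train3[, (num_cols) := lapply(.SD, log10), .SDcols = num_cols]
data_test3[, (num_cols) := lapply(.SD, log10), .SDcols = num_cols]

car_lm3 <- lm(price ~ ., data = data_train3)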

Performance

We will measure the R-squared and the RMSE. Since our target is transformed with log10, we need to transform the predictions back with \(10^{x}\) to get the original price values and a meaningful comparison with our previous models.
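A sketch of the evaluation, back-transforming the log10 predictions before computing RMSE:

summary(car_lm3)

RMSE(pred = 10^predict(car_lm3, newdata = data_train3), obs = 10^data_train3$price)
RMSE(pred = 10^predict(car_lm3, newdata = data_test3),  obs = 10^data_test3$price)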


Call:
lm(formula = price ~ ., data = data_train3)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.204990 -0.036988 -0.001635  0.034233  0.181823 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)         -8.75056    1.44412  -6.059       0.000000015399 ***
aspirationturbo      0.03736    0.01693   2.207             0.029204 *  
enginelocationrear   0.24878    0.06269   3.968             0.000122 ***
carwidth             3.00424    0.88911   3.379             0.000975 ***
curbweight           1.64020    0.14867  11.032 < 0.0000000000000002 ***
enginetypeohcv      -0.16921    0.04432  -3.818             0.000212 ***
enginetypeohc        0.04616    0.03585   1.288             0.200308    
enginetypel         -0.03742    0.04807  -0.778             0.437809    
enginetypeohcf      -0.02427    0.04801  -0.506             0.614088    
cylindernumbersix    0.20918    0.02982   7.015       0.000000000135 ***
cylindernumberfive   0.06677    0.02920   2.287             0.023922 *  
cylindernumbereight  0.39591    0.06526   6.067       0.000000014866 ***
stroke              -0.48706    0.16364  -2.976             0.003513 ** 
peakrpm              0.52712    0.15907   3.314             0.001210 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06426 on 123 degrees of freedom
Multiple R-squared:  0.9161,    Adjusted R-squared:  0.9072 
F-statistic: 103.3 on 13 and 123 DF,  p-value: < 0.00000000000000022
[1] 2222.705
[1] 2896.752

Our model has an adjusted R-squared of 90.72%. The value drops a bit, but in return we get a model that satisfies the assumptions. Unfortunately, our model still seems to overfit, with an RMSE of 2222.705 on the training dataset and 2896.752 on the test dataset. You can try adjusting the proportion of the train-test split and see whether the model still overfits.

Assumptions

2. Normality Test


    Shapiro-Wilk normality test

data:  car_lm3$residuals
W = 0.98576, p-value = 0.1674

With p-value > 0.05, we fail to reject the null hypothesis and can consider our residuals normally distributed.

3. Autocorrelation

 lag Autocorrelation D-W Statistic p-value
   1      0.05555662      1.873862    0.44
 Alternative hypothesis: rho != 0

With p-value > 0.05, we can conclude that autocorrelation is not present.

4. Heteroscedasticity


    studentized Breusch-Pagan test

data:  car_lm3
BP = 21.561, df = 13, p-value = 0.06255

With p-value > 0.05, we can conclude that heteroscedasticity is not present.

5. Multicollinearity

                   GVIF Df GVIF^(1/(2*Df))
aspiration     1.505239  1        1.226882
enginelocation 1.875705  1        1.369564
carwidth       4.862059  1        2.205008
curbweight     4.907923  1        2.215383
enginetype     7.797281  4        1.292686
cylindernumber 6.872150  3        1.378845
stroke         1.560167  1        1.249066
peakrpm        1.385751  1        1.177179

Conclusion

Variables that are useful for describing the variance in car prices are aspiration, enginelocation, carwidth, curbweight, enginetype, cylindernumber, stroke, and peakrpm. Our final model satisfies the classical assumptions. The adjusted R-squared of the model is high: the predictors explain 90.72% of the variance in the car price. The accuracy of the model in predicting the car price is measured with RMSE; the training data has an RMSE of 2222.705 and the testing data has an RMSE of 2896.752, suggesting that our model may overfit the training dataset.

We have learned how to build a linear regression model and what needs to be considered when building one.

Reference

  1. Aora, Ridhi. Answer to the question "What are the ways to deal with auto-correlation problems in Multiple Regression Analysis?". Accessed September 23, 2019, via https://www.researchgate.net/post/What_are_the_ways_to_deal_with_auto-correlation_problems_in_Multiple_Regression_Analysis.
  2. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: with Applications in R. New York: Springer.
  3. Gujarati, Damodar and Dawn C. Porter. 2009. Basic Econometrics. New York: McGraw-Hill.

2019-10-02