This data set compares the technological advancement of hybrid electric vehicles (HEVs) across market segments. I obtained the data set from UFL. As technology has improved, electric vehicles have become more popular, but they remain expensive. The vehicles in the data set come from different countries and model years. The sample contains 154 HEVs, including 11 plug-in HEVs, from 1997 to 2013. The required information on each vehicle was collected from the EPA database.
The numeric variables are vehicle ID (carid), model year (year), manufacturer’s suggested retail price in 2013 dollars (msrp), acceleration rate in km/hour/second (accelrate), fuel economy in miles/gallon (mpg), the maximum of mpg and mpge (mpgmpge), and model class ID (carclass_id). The categorical variables are vehicle (vehicle) and model class (carclass): C = compact, M = midsize, TS = 2 seater, L = large, PT = pickup truck, MV = minivan, SUV = sport utility vehicle. Using these variables, is there a multiple linear regression model that can predict the price of a hybrid electric vehicle manufactured between 1997 and 2013?
I am removing some variables from the data set that are difficult to categorize or are redundant. I am removing vehicle ID (carid) because it is only an identifier and has no bearing on price, and mpgmpge because, as the maximum of mpg and mpge, it largely duplicates mpg. I am also removing the categorical variable carclass because it is already recoded as the numeric variable carclass_id, with 1 = compact, 2 = large, 3 = midsize, 4 = minivan, 5 = pickup truck, 6 = SUV, and 7 = 2 seater.
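The recoding of carclass into carclass_id can be sketched as a simple lookup. The snippet below is only an illustration in Python; the names `CARCLASS_TO_ID` and `encode_carclass` are hypothetical and do not come from the original analysis. It relies on the observation that the listed codes follow the alphabetical order of the class abbreviations.

```python
# Class abbreviations as listed in the data description.
CARCLASS_LABELS = ["C", "M", "TS", "L", "PT", "MV", "SUV"]

# The listed codes (1 = C, 2 = L, 3 = M, 4 = MV, 5 = PT, 6 = SUV, 7 = TS)
# match the alphabetical order of the abbreviations, so the lookup table
# can be built by sorting and numbering from 1.
CARCLASS_TO_ID = {label: i for i, label in enumerate(sorted(CARCLASS_LABELS), start=1)}


def encode_carclass(label: str) -> int:
    """Return the numeric class ID for a class abbreviation (hypothetical helper)."""
    return CARCLASS_TO_ID[label.strip().upper()]


if __name__ == "__main__":
    for label in CARCLASS_LABELS:
        print(label, "->", encode_carclass(label))
```

Because the IDs are labels rather than measurements, a model that uses carclass_id as a numeric predictor treats the classes as if they were evenly spaced on a scale.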
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 854011.2647 | 771125.6371 | 1.107487 | 0.2698803 |
| year | -419.6596 | 384.1168 | -1.092531 | 0.2763749 |
| accelrate | 4311.3907 | 499.3628 | 8.633784 | 0.0000000 |
| mpg | -546.1041 | 135.5074 | -4.030067 | 0.0000889 |
| carclass_id | -1066.7506 | 679.6868 | -1.569474 | 0.1186731 |
According to the summary, only two of the variables have statistically significant slopes: accelrate (t = 8.63, p < 0.001) and mpg (t = -4.03, p < 0.001). The variables carclass_id and year have p-values greater than \(\alpha\) = 0.05. This means that carclass_id and year are not contributing significantly to the linear model, so we remove them from the model.
From the residual plots, we can see that some model assumptions are violated. The variance of the residuals doesn’t appear to be constant. The Q-Q plot shows that the distribution of the residuals deviates slightly from the normal distribution. There appears to be weak curvature in the residuals, but there are no concerning outliers or high-leverage points.
Guided by a Box-Cox transformation, we refit the model with the square root of the response and log-transformed mpg.
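The t values in the full-model table are simply each estimate divided by its standard error, which can be checked directly. The sketch below copies the estimates and standard errors from that table; the cutoff of 1.98 is an assumption, an approximate two-sided 5% critical value for a t distribution with roughly 150 degrees of freedom, and the names are illustrative only.

```python
# (estimate, standard error) pairs copied from the full-model summary table.
FULL_MODEL = {
    "(Intercept)": (854011.2647, 771125.6371),
    "year": (-419.6596, 384.1168),
    "accelrate": (4311.3907, 499.3628),
    "mpg": (-546.1041, 135.5074),
    "carclass_id": (-1066.7506, 679.6868),
}

# Assumed approximate critical value for alpha = 0.05 (two-sided, ~150 df).
APPROX_CRITICAL_T = 1.98


def t_ratios(coefficients):
    """Compute t = estimate / standard error for every term."""
    return {name: est / se for name, (est, se) in coefficients.items()}


t_values = t_ratios(FULL_MODEL)
significant = [name for name, t in t_values.items() if abs(t) > APPROX_CRITICAL_T]

if __name__ == "__main__":
    for name, t in t_values.items():
        print(f"{name:12s} t = {t: .4f}")
    print("terms beyond the approximate cutoff:", significant)
```

Only accelrate and mpg exceed the cutoff, which agrees with the p-values reported in the table.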
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 254.338039 | 42.115772 | 6.039021 | 0.0e+00 |
| accelrate | 9.105136 | 1.047759 | 8.690108 | 0.0e+00 |
| log(mpg) | -48.877987 | 9.804840 | -4.985088 | 1.7e-06 |
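For readers unfamiliar with the Box-Cox family, the following sketch shows how the square-root and log transformations used in this case study arise as special cases. It is a generic illustration, not the code used for the analysis; the function name `box_cox` and the example price are hypothetical.

```python
import math


def box_cox(y: float, lam: float) -> float:
    """Box-Cox transform of a positive value y with power parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires a positive response")
    if lam == 0:
        # The limiting case of (y**lam - 1) / lam as lam -> 0 is log(y).
        return math.log(y)
    return (y ** lam - 1.0) / lam


if __name__ == "__main__":
    price = 40000.0  # hypothetical price, used only for illustration
    # lam = 0.5 gives 2 * (sqrt(y) - 1), a linear rescaling of sqrt(y).
    print(box_cox(price, 0.5), 2 * (math.sqrt(price) - 1))
    # lam = 0 gives the log transformation.
    print(box_cox(price, 0.0), math.log(price))
```

Since a linear rescaling of the response does not change the fit, a power parameter of 0.5 is equivalent to modeling the square root of the price, and a parameter of 0 is equivalent to modeling its log.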
There is some improvement in the Q-Q plot, but the assumption of constant variance still seems to be violated. We also fit a model with the log of the response and untransformed mpg.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 9.7276200 | 0.2023182 | 48.080789 | 0.0000000 |
| accelrate | 0.0928168 | 0.0108268 | 8.572907 | 0.0000000 |
| mpg | -0.0109725 | 0.0029012 | -3.782102 | 0.0002241 |
There is further improvement in the Q-Q plot, but the assumption of constant variance still seems to be violated.
Next we compare all three models and determine which one is best.
| | SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS |
|---|---|---|---|---|---|---|---|
| full.model | 3.232045e+10 | 0.5366073 | 0.5240832 | 5 | 2942.7848 | 2957.9370 | 3.477941e+10 |
| sqrt.price.log.mpg | 1.619397e+05 | 0.5616349 | 0.5557900 | 3 | 1071.5748 | 1080.6661 | 1.692585e+05 |
| log.price | 1.718713e+01 | 0.5194495 | 0.5130422 | 3 | -328.5004 | -319.4091 | 1.797001e+01 |
The table above shows that the second model has the highest \(R^2\) and adjusted \(R^2\) of the three models. SSE, AIC, SBC, and PRESS depend on the scale of the response, so they cannot be compared directly across models whose responses are transformed differently. Considering interpretability, goodness of fit, and simplicity, we choose the second model as the final model.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 254.338039 | 42.115772 | 6.039021 | 0.0e+00 |
| accelrate | 9.105136 | 1.047759 | 8.690108 | 0.0e+00 |
| log(mpg) | -48.877987 | 9.804840 | -4.985088 | 1.7e-06 |
The final model is \(\sqrt{\text{price}} = 254.34 + 9.11 \times \text{accelrate} - 48.88 \times \log(\text{mpg}).\)
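This is interpreted as follows: holding mpg fixed, each one-unit increase in the acceleration rate raises the square root of the price by about 9.11. Because the response is on the square-root scale, the corresponding change in price is not a constant percentage; it grows with the vehicle's current fitted price.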
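The back-transformation to the price scale can be sketched in a few lines. The coefficients below are copied from the final-model table; the function names and the example vehicles are hypothetical, and the log is assumed to be the natural log.

```python
import math

# Coefficients copied from the final-model table (square-root response).
B0 = 254.338039
B_ACCEL = 9.105136
B_LOGMPG = -48.877987


def predict_sqrt_price(accelrate: float, mpg: float) -> float:
    """Fitted value on the square-root scale (natural log assumed)."""
    return B0 + B_ACCEL * accelrate + B_LOGMPG * math.log(mpg)


def predict_price(accelrate: float, mpg: float) -> float:
    """Fitted price obtained by squaring the fitted square root."""
    return predict_sqrt_price(accelrate, mpg) ** 2


def price_change_per_unit_accel(accelrate: float, mpg: float) -> float:
    """Change in fitted price when accelrate rises by 1 with mpg fixed.

    If s is the fitted square root, (s + b)**2 - s**2 = 2*b*s + b**2.
    """
    s = predict_sqrt_price(accelrate, mpg)
    return 2 * B_ACCEL * s + B_ACCEL ** 2


if __name__ == "__main__":
    # Hypothetical vehicles, chosen only to illustrate the calculation.
    for accel, mpg in [(8.0, 45.0), (12.0, 30.0)]:
        base = predict_price(accel, mpg)
        delta = price_change_per_unit_accel(accel, mpg)
        print(f"accelrate={accel}, mpg={mpg}: price={base:,.0f}, "
              f"change per unit accelrate={delta:,.0f} ({100 * delta / base:.1f}%)")
```

Running the sketch for the two hypothetical vehicles shows that the percentage change differs between them, which is why a single percentage cannot summarize the effect of acceleration in this model.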
We use regression techniques such as the Box-Cox transformation of the response variable and transformations of the explanatory variables to search for the final model in this case study. The full model has four explanatory variables, but only two were significant, so we did have to perform variable selection by dropping the two non-significant variables.
The two transformed candidate models have the same number of variables. We use common global goodness-of-fit measures as model selection criteria.
The interpretation of the regression coefficients is not straightforward because the response variable was transformed to its square root and the variable mpg was log-transformed. We used some algebra to derive the practical interpretation of the regression coefficients associated with the variables on their original scales.
The violation of the normality assumption for the residuals was corrected, but the non-constant variance of the residuals remained uncorrected.
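As an illustration of how such measures are computed, the sketch below reproduces the AIC, SBC, and adjusted \(R^2\) values in the comparison table from the tabulated SSE and \(R^2\). The formulas are the usual SSE-based versions; the value n = 153 is an assumption, chosen because it reproduces the tabulated values, and is not stated in the text. Here p counts the fitted coefficients including the intercept, and the function names are illustrative.

```python
import math

N_OBS = 153  # assumed number of observations used in the fits (not stated in the text)


def aic_from_sse(sse: float, n: int, p: int) -> float:
    """SSE-based AIC: n * log(SSE / n) + 2 * p."""
    return n * math.log(sse / n) + 2 * p


def sbc_from_sse(sse: float, n: int, p: int) -> float:
    """SSE-based SBC (BIC): n * log(SSE / n) + p * log(n)."""
    return n * math.log(sse / n) + p * math.log(n)


def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 with p coefficients including the intercept."""
    return 1 - (1 - r2) * (n - 1) / (n - p)


# (SSE, R.sq, number of coefficients) copied from the comparison table.
MODELS = {
    "full.model": (3.232045e10, 0.5366073, 5),
    "sqrt.price.log.mpg": (1.619397e05, 0.5616349, 3),
    "log.price": (1.718713e01, 0.5194495, 3),
}

if __name__ == "__main__":
    for name, (sse, r2, p) in MODELS.items():
        print(f"{name:20s} AIC={aic_from_sse(sse, N_OBS, p):10.4f} "
              f"SBC={sbc_from_sse(sse, N_OBS, p):10.4f} "
              f"R.adj={adjusted_r2(r2, N_OBS, p):.7f}")
```

The AIC and SBC formulas contain the SSE, which is measured in the units of the response, so their values shift with the transformation applied to the price as well as with the quality of the fit.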