This data set compares the technological advancement of hybrid electric vehicles (HEVs) in different market segments. I obtained the data set from UFL. As technology has improved, electric vehicles have become more popular, but they remain expensive. The vehicles in the data set come from different countries and model years. The sample consists of 154 HEVs, including 11 plug-in HEVs, from 1997 to 2013. The EPA database was used to collect the required information on the vehicles.
The numeric variables are vehicle ID (carid), model year (year), manufacturer’s suggested retail price in 2013 dollars (msrp), acceleration rate in km/hour/second (accelrate), fuel economy in miles/gallon (mpg), the maximum of mpg and mpge (mpgmpge), and model class ID (carclass_id). The categorical variables are vehicle (vehicle) and model class (carclass): C = compact, M = midsize, TS = 2 seater, L = large, PT = pickup truck, MV = minivan, SUV = sport utility vehicle. Using these variables, is there a multiple linear regression model that can predict the price of a hybrid electric vehicle manufactured between 1997 and 2013?
I am removing some variables from the data set that are difficult to categorize or are redundant. I am removing vehicle ID (carid) because it is only an identifier and has no effect on price, and mpgmpge because, as the maximum of mpg and mpge, it largely duplicates mpg. I am also removing the categorical variable carclass because it is already recoded as the numeric variable carclass_id, with 1 = compact, 2 = large, 3 = midsize, 4 = minivan, 5 = pickup truck, 6 = SUV, and 7 = 2 seater.
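The recoding described above can be sketched as a simple lookup. The mapping is taken from the text; the function name `recode_carclass` and the sample labels are hypothetical and only illustrate the idea.

```python
# Mapping from the model class label to carclass_id, as listed in the text.
CARCLASS_TO_ID = {"C": 1, "L": 2, "M": 3, "MV": 4, "PT": 5, "SUV": 6, "TS": 7}


def recode_carclass(labels):
    """Return the numeric carclass_id for each model class label."""
    return [CARCLASS_TO_ID[label] for label in labels]


# Hypothetical labels, not rows from the actual data set.
sample_labels = ["C", "M", "SUV", "TS"]
print(recode_carclass(sample_labels))
```

Because the IDs are arbitrary codes, entering carclass_id as a numeric predictor treats the classes as if they were ordered and evenly spaced.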
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 854011.2647 | 771125.6371 | 1.107487 | 0.2698803 |
| year | -419.6596 | 384.1168 | -1.092531 | 0.2763749 |
| accelrate | 4311.3907 | 499.3628 | 8.633784 | 0.0000000 |
| mpg | -546.1041 | 135.5074 | -4.030067 | 0.0000889 |
| carclass_id | -1066.7506 | 679.6868 | -1.569474 | 0.1186731 |
According to the summary, only two of the variables have statistically significant slopes: accelrate (t = 8.63, p < 0.001) and mpg (t = -4.03, p < 0.001). The variables carclass_id and year have p-values greater than \(\alpha\) = 0.05, so they do not contribute significantly to the linear model, and we remove them from the model.
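Each t value in the summary is the estimate divided by its standard error, which can be checked directly from the table. The sketch below copies the estimates and standard errors from the full-model summary. The cutoff of 1.98 is an assumed approximate two-sided 5% critical value for roughly 150 residual degrees of freedom, not a number from the report.

```python
# (estimate, standard error) pairs copied from the full-model summary table.
coef_table = {
    "year": (-419.6596, 384.1168),
    "accelrate": (4311.3907, 499.3628),
    "mpg": (-546.1041, 135.5074),
    "carclass_id": (-1066.7506, 679.6868),
}

# Assumed approximate critical value (two-sided, alpha = 0.05, ~150 df).
T_CRIT = 1.98


def t_value(estimate, std_error):
    """t statistic for testing whether a slope equals zero."""
    return estimate / std_error


t_values = {name: t_value(est, se) for name, (est, se) in coef_table.items()}
for name, t in t_values.items():
    verdict = "significant" if abs(t) > T_CRIT else "not significant"
    print(f"{name:12s} t = {t:7.3f}  ({verdict})")
```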
The residual plots show some violations of the model assumptions. The variance of the residuals does not appear to be constant. The Q-Q plot shows that the distribution of the residuals deviates slightly from the normal distribution. There is weak curvature in the residuals, but there are no concerning outliers or high-leverage points.
We apply a Box-Cox-type transformation, taking the square root of the response (msrp), and we log-transform mpg.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 254.338039 | 42.115772 | 6.039021 | 0.0e+00 |
| accelrate | 9.105136 | 1.047759 | 8.690108 | 0.0e+00 |
| log(mpg) | -48.877987 | 9.804840 | -4.985088 | 1.7e-06 |
There is some improvement in the Q-Q plot, but the assumption of constant variance still seems to be violated. We therefore also fit a model with the log of the response and the untransformed mpg.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 9.7276200 | 0.2023182 | 48.080789 | 0.0000000 |
| accelrate | 0.0928168 | 0.0108268 | 8.572907 | 0.0000000 |
| mpg | -0.0109725 | 0.0029012 | -3.782102 | 0.0002241 |
There is further improvement in the Q-Q plot, but the assumption of constant variance still seems to be violated.
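The square-root and log responses used in the two transformed models are both members of the Box-Cox family, with parameter 0.5 and 0 respectively. The sketch below is a minimal illustration of that family. The function name `box_cox` and the example price are hypothetical, and the code does not reproduce how the report chose its transformation.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a positive value y with parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires a positive value")
    if lam == 0:
        return math.log(y)  # limiting case as lam approaches 0
    return (y ** lam - 1) / lam


price = 40000.0  # hypothetical MSRP, for illustration only

# lam = 0.5 is a shifted and scaled square root: 2 * (sqrt(y) - 1).
print(box_cox(price, 0.5), 2 * (math.sqrt(price) - 1))

# lam = 0 is the natural log.
print(box_cox(price, 0), math.log(price))
```

Since the lam = 0.5 transform differs from the plain square root only by a shift and a scale, regressing on the square root of msrp gives the same fit quality as regressing on the Box-Cox value.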
Next, we compare all three models to determine which one is best.
| | SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS |
|---|---|---|---|---|---|---|---|
| full.model | 3.232045e+10 | 0.5366073 | 0.5240832 | 5 | 2942.7848 | 2957.9370 | 3.477941e+10 |
| sqrt.price.log.mpg | 1.619397e+05 | 0.5616349 | 0.5557900 | 3 | 1071.5748 | 1080.6661 | 1.692585e+05 |
| log.price | 1.718713e+01 | 0.5194495 | 0.5130422 | 3 | -328.5004 | -319.4091 | 1.797001e+01 |
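The table above shows that the second model has the highest R.sq and R.adj of the three models. SSE, AIC, SBC, and PRESS are computed on the scale of each model's response (msrp, its square root, and its log), so those columns are not directly comparable across the three models. Considering interpretability, goodness of fit, and simplicity, we choose the second model as the final model.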
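As a rough check on the comparison table, the adjusted R-squared, AIC, and SBC columns can be recomputed from SSE and R.sq. The formulas below are the usual SSE-based versions and are an assumption about how the table was produced. The sample size n = 153 is also an assumption: it is the value that reproduces the R.adj column, whereas the data description reports 154 vehicles.

```python
import math

N = 153  # assumed number of rows used in the fits (see lead-in)


def adjusted_r2(r2, n, p):
    """Adjusted R-squared; p counts coefficients including the intercept."""
    return 1 - (1 - r2) * (n - 1) / (n - p)


def aic(sse, n, p):
    """SSE-based AIC: n * ln(SSE / n) + 2p."""
    return n * math.log(sse / n) + 2 * p


def sbc(sse, n, p):
    """SSE-based SBC (BIC): n * ln(SSE / n) + p * ln(n)."""
    return n * math.log(sse / n) + p * math.log(n)


# (SSE, R.sq, number of coefficients) copied from the comparison table.
models = {
    "full.model": (3.232045e10, 0.5366073, 5),
    "sqrt.price.log.mpg": (1.619397e05, 0.5616349, 3),
    "log.price": (1.718713e01, 0.5194495, 3),
}

for name, (sse, r2, p) in models.items():
    print(f"{name:20s} R.adj={adjusted_r2(r2, N, p):.7f} "
          f"AIC={aic(sse, N, p):.4f} SBC={sbc(sse, N, p):.4f}")
```

The recomputation makes the scale dependence visible: AIC and SBC change by thousands between models only because SSE is measured in dollars squared, dollars, or log-dollars squared.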
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 254.338039 | 42.115772 | 6.039021 | 0.0e+00 |
| accelrate | 9.105136 | 1.047759 | 8.690108 | 0.0e+00 |
| log(mpg) | -48.877987 | 9.804840 | -4.985088 | 1.7e-06 |
Because the assumption of constant variance is still violated, a case bootstrap is used to find confidence intervals for the coefficients of the final regression model.
| | Estimate | Std. Error | t value | Pr(>\|t\|) | btc.ci.95 |
|---|---|---|---|---|---|
| (Intercept) | 254.3380 | 42.1158 | 6.0390 | 0.0000 | [ 68.9915 , 322.0853 ] |
| accelrate | 9.1051 | 1.0478 | 8.6901 | 0.0000 | [ -3.1656 , 2.9992 ] |
| log(mpg) | -48.8780 | 9.8048 | -4.9851 | 0.0000 | [ -30.5112 , 28.3308 ] |
The histograms of the bootstrap estimates appear to be approximately normal. The case-bootstrap confidence intervals for the slopes of accelrate and log(mpg) include 0, which means that these slopes are not statistically significant predictors of the response variable (the price of the car).
The histogram of the residuals appears to be approximately normal with a slight right skew.
The histograms of the residual bootstrap estimates are not normal, and the normal density and LOESS curves are nowhere near each other, which means that inference based on the p-values and inference based on the residual bootstrap do not yield the same results.
The 95% residual bootstrap confidence intervals are given below.
| | Estimate | Std. Error | t value | Pr(>\|t\|) | btr.ci.95 |
|---|---|---|---|---|---|
| (Intercept) | 254.3380 | 42.1158 | 6.0390 | 0.0000 | [ 0 , 254.338 ] |
| accelrate | 9.1051 | 1.0478 | 8.6901 | 0.0000 | [ 0 , 9.1051 ] |
| log(mpg) | -48.8780 | 9.8048 | -4.9851 | 0.0000 | [ -48.878 , 0 ] |
The residual bootstrap confidence intervals include 0, which means that these slopes are not statistically significant in the model. This conclusion differs from the one given by the p-values.
| | Estimate | Std. Error | Pr(>\|t\|) | btc.ci.95 | btr.ci.95 |
|---|---|---|---|---|---|
| (Intercept) | 254.3380 | 42.1158 | 0.0000 | [ 68.9915 , 322.0853 ] | [ 0 , 254.338 ] |
| accelrate | 9.1051 | 1.0478 | 0.0000 | [ -3.1656 , 2.9992 ] | [ 0 , 9.1051 ] |
| log(mpg) | -48.8780 | 9.8048 | 0.0000 | [ -30.5112 , 28.3308 ] | [ -48.878 , 0 ] |
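The bootstrap confidence intervals do not agree with the p-values, which suggests that the final model has serious violations of the model assumptions and should not be used for prediction or estimation. One caveat: the case-bootstrap intervals for accelrate and log(mpg) do not contain their own point estimates, and every residual-bootstrap interval has 0 as one endpoint and the point estimate as the other, so the resampling code should be re-checked before relying on this conclusion.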
| | btc.wd | btr.wd |
|---|---|---|
| (Intercept) | 253.093827 | 254.338039 |
| accelrate | 6.164756 | 9.105136 |
| log(mpg) | 58.841922 | 48.877987 |
The widths of the residual bootstrap and case bootstrap confidence intervals are not similar to each other, which suggests that the model assumptions are still violated.
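The two resampling schemes compared above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the report. The data are synthetic stand-ins generated with a fixed seed, the helper names (`ols`, `percentile_ci`, `case_bootstrap`, `residual_bootstrap`) are hypothetical, and simple percentile intervals are assumed.

```python
import math
import random


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def percentile_ci(values, level=0.95):
    """Simple percentile interval from the sorted bootstrap draws."""
    s = sorted(values)
    last = len(s) - 1
    return s[int((1 - level) / 2 * last)], s[int((1 + level) / 2 * last)]


def case_bootstrap(X, y, B, rng):
    """Resample whole rows (x_i, y_i) with replacement and refit."""
    n = len(y)
    draws = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(ols([X[i] for i in idx], [y[i] for i in idx]))
    return draws


def residual_bootstrap(X, y, B, rng):
    """Keep X fixed, resample residuals, rebuild the response, and refit."""
    beta = ols(X, y)
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    draws = []
    for _ in range(B):
        y_star = [fi + rng.choice(resid) for fi in fitted]
        draws.append(ols(X, y_star))
    return draws


# Synthetic stand-in data (hypothetical values, not the HEV data set).
rng = random.Random(2013)
n = 153
accel = [rng.uniform(6, 20) for _ in range(n)]
mpg = [rng.uniform(17, 70) for _ in range(n)]
X = [[1.0, a, math.log(m)] for a, m in zip(accel, mpg)]
y = [250 + 9 * a - 48 * math.log(m) + rng.gauss(0, 30)
     for a, m in zip(accel, mpg)]

fit = ols(X, y)
names = ["(Intercept)", "accelrate", "log(mpg)"]
intervals = {}
for label, sampler in [("case", case_bootstrap), ("residual", residual_bootstrap)]:
    draws = sampler(X, y, 500, rng)
    intervals[label] = [percentile_ci([d[j] for d in draws]) for j in range(3)]
    for name, est, (lo, hi) in zip(names, fit, intervals[label]):
        print(f"{label:8s} {name:12s} est={est:9.3f} "
              f"ci=[{lo:9.3f}, {hi:9.3f}] width={hi - lo:8.3f}")
```

In this sketch each percentile interval is built from refitted coefficients, so it is expected to bracket the original estimate. On data that satisfy the model assumptions, as the synthetic sample does by construction, the two schemes usually give intervals of similar width.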
The main finding is that this model is not a good model for prediction and estimation. In addition, the inferential statistics did not agree with the bootstrap conclusions: the p-values were less than \(\alpha\), but the bootstrap confidence intervals for the slopes all included zero, meaning that the variables did not contribute significantly to the model.
If the final model had been adequate, its output would need to be back-transformed for interpretation because the response variable was transformed. For example, if the model yields a fitted value of 200, this quantity must be squared, giving a predicted price of $40,000, because the final model is fit to the square root of the response.
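The back-transformation can be sketched with the coefficients from the final-model table. The function name `predict_msrp` and the example vehicle are hypothetical, and a natural log is assumed for log(mpg). Given the concerns about the model, the number printed is only an illustration of the algebra, not a recommended prediction.

```python
import math

# Coefficients copied from the final-model table; natural log is assumed.
B0 = 254.338039
B_ACCEL = 9.105136
B_LOGMPG = -48.877987


def predict_msrp(accelrate, mpg):
    """Predicted MSRP in 2013 dollars: square the fitted sqrt(msrp)."""
    sqrt_price = B0 + B_ACCEL * accelrate + B_LOGMPG * math.log(mpg)
    return sqrt_price ** 2


# Hypothetical vehicle: accelrate = 10 and mpg = 40.
print(round(predict_msrp(10, 40)))
```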
Overall, the final model should not be used for interpretation because it failed the bootstrap check, which indicates that there were major violations of the model assumptions. This shows that the bootstrap is a useful check on a conclusion that would otherwise rest on p-values alone.