Description of the Data Set

This data set compares the technological advancement of hybrid electric vehicles (HEVs) across different market segments. I obtained the data set from UFL. As technology improves, electric vehicles have become more popular, but they remain expensive. The vehicles in the data set come from different countries and years. The sample consists of 154 HEVs, including 11 plug-in HEVs, from model years 1997 to 2013. The EPA database was used to collect the required information on the vehicles.

The numeric variables are vehicle ID (carid), model year (year), manufacturer's suggested retail price in 2013 dollars (msrp), acceleration rate in km/hour/second (accelrate), fuel economy in miles per gallon (mpg), the maximum of mpg and mpge (mpgmpge), and model class ID (carclass_id). The categorical variables are vehicle (vehicle) and model class (carclass): C = compact, M = midsize, TS = two-seater, L = large, PT = pickup truck, MV = minivan, SUV = sport utility vehicle. Using these variables, is there a multiple linear regression model that can predict the price of a hybrid electric vehicle manufactured between 1997 and 2013?

I am removing some of the variables from the data set that are difficult to categorize or are redundant. I am removing the variable vehicle ID (carid) because it has no effect on price, and the maximum of mpg and mpge (mpgmpge) because it largely duplicates mpg. I am also removing the categorical variable carclass because it is already recoded as the numeric variable carclass_id, with 1 = compact, 3 = midsize, 7 = two-seater, 2 = large, 5 = pickup truck, 4 = minivan, and 6 = SUV.
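A minimal sketch in R of this data-preparation step is shown below; the file name hybrid_reg.csv and the data frame name hybrid are assumptions about how the data are stored.

```r
# Sketch of the data preparation; the file name is an assumption
hybrid <- read.csv("hybrid_reg.csv")

# Drop carid (no effect on price), mpgmpge (largely duplicates mpg),
# and carclass (already recoded as the numeric carclass_id)
hybrid <- hybrid[, !(names(hybrid) %in% c("carid", "mpgmpge", "carclass"))]

str(hybrid)  # remaining variables: vehicle, year, msrp, accelrate, mpg, carclass_id
```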

Full Linear Model

Statistics of Regression Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 854011.2647 771125.6371 1.107487 0.2698803
year -419.6596 384.1168 -1.092531 0.2763749
accelrate 4311.3907 499.3628 8.633784 0.0000000
mpg -546.1041 135.5074 -4.030067 0.0000889
carclass_id -1066.7506 679.6868 -1.569474 0.1186731

According to the summary, only two of the predictors have statistically significant slopes. The variables accelrate (t = 8.63, p < 0.001) and mpg (t = -4.03, p < 0.001) are significant, while carclass_id and year have p-values greater than \(\alpha\) = 0.05. This means that carclass_id and year do not contribute significantly to the linear model, so we remove them from the model.
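For reference, a sketch of how the coefficient table above can be produced in R (the data frame name hybrid is an assumption carried over from the data-preparation sketch):

```r
# Full model: msrp regressed on all retained numeric predictors
full.model <- lm(msrp ~ year + accelrate + mpg + carclass_id, data = hybrid)
summary(full.model)$coefficients  # estimates, standard errors, t values, p-values
```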

From the residual plots, we can see that there are some violations. The variance of the residuals does not appear to be constant. The Q-Q plot shows that the distribution of the residuals departs slightly from the normal distribution. There appears to be weak curvature in the residuals, but there are no concerning outliers or high-leverage points.
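The residual diagnostics described above can be reproduced with the standard plot method for lm objects (a sketch):

```r
# Standard lm diagnostics: residuals vs fitted, Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(full.model)
par(mfrow = c(1, 1))
```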

Box-Cox Transformation

Square Root Transformation

Guided by a Box-Cox analysis, we fit a model with the square root of the response (msrp) and log-transformed mpg.
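A sketch of the Box-Cox step and the resulting square-root model is given below, assuming the reduced predictor set (accelrate and mpg) from above:

```r
library(MASS)

# Box-Cox profile of the response for the reduced model
boxcox(lm(msrp ~ accelrate + log(mpg), data = hybrid), lambda = seq(-1, 1, by = 0.05))

# Square-root price model with log-transformed mpg
sqrt.price.log.mpg <- lm(sqrt(msrp) ~ accelrate + log(mpg), data = hybrid)
summary(sqrt.price.log.mpg)$coefficients
```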

Square-root price model with log-transformed mpg
Estimate Std. Error t value Pr(>|t|)
(Intercept) 254.338039 42.115772 6.039021 0.0e+00
accelrate 9.105136 1.047759 8.690108 0.0e+00
log(mpg) -48.877987 9.804840 -4.985088 1.7e-06

There is some improvement in the Q-Q plot, but the assumption of constant variance still appears to be violated.

Log Transformation

Log-transformed price model
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.7276200 0.2023182 48.080789 0.0000000
accelrate 0.0928168 0.0108268 8.572907 0.0000000
mpg -0.0109725 0.0029012 -3.782102 0.0002241

There is even more improvement in the Q-Q plot, but the assumption of constant variance still appears to be violated.
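The log-price model can be fit analogously (sketch):

```r
# Log-transformed price model
log.price <- lm(log(msrp) ~ accelrate + mpg, data = hybrid)
summary(log.price)$coefficients
```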

Goodness of Fit Measures

Next, we compare all three candidate models using several goodness-of-fit measures and determine which one is best.
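A sketch of how most of the comparison measures can be computed for each candidate model is shown below. PRESS is computed from the leave-one-out residuals; Mallows' Cp is omitted for brevity, and R's AIC()/BIC() use the full log-likelihood, so their values may differ by a constant from those reported in the table.

```r
# Goodness-of-fit measures for a fitted lm object
gof <- function(fit) {
  c(SSE   = sum(resid(fit)^2),
    R.sq  = summary(fit)$r.squared,
    R.adj = summary(fit)$adj.r.squared,
    AIC   = AIC(fit),
    SBC   = BIC(fit),
    PRESS = sum((resid(fit) / (1 - hatvalues(fit)))^2))  # leave-one-out PRESS
}

rbind(full.model         = gof(full.model),
      sqrt.price.log.mpg = gof(sqrt.price.log.mpg),
      log.price          = gof(log.price))
```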

Goodness-of-fit Measures of Candidate Models
SSE R.sq R.adj Cp AIC SBC PRESS
full.model 3.232045e+10 0.5366073 0.5240832 5 2942.7848 2957.9370 3.477941e+10
sqrt.price.log.mpg 1.619397e+05 0.5616349 0.5557900 3 1071.5748 1080.6661 1.692585e+05
log.price 1.718713e+01 0.5194495 0.5130422 3 -328.5004 -319.4091 1.797001e+01

From the table above, the second model has the highest R.sq and R.adj values; note that SSE, AIC, SBC, and PRESS are not directly comparable across the three models because the responses are on different scales. Considering interpretability, goodness of fit, and simplicity, we choose the second model (square-root price with log-transformed mpg) as the final model.

Final Model

Inferential Statistics of Final Model
Estimate Std. Error t value Pr(>|t|)
(Intercept) 254.338039 42.115772 6.039021 0.0e+00
accelrate 9.105136 1.047759 8.690108 0.0e+00
log(mpg) -48.877987 9.804840 -4.985088 1.7e-06

Case Bootstrap of the Final Model

Because the assumption of constant variance is still violated, a case bootstrap is used to construct confidence intervals for the coefficients of the final regression model.
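A minimal sketch of the case (nonparametric) bootstrap, which resamples whole observations and refits the final model; the number of replicates B = 1000, the seed, and the use of percentile intervals are assumptions:

```r
set.seed(123)   # arbitrary seed for reproducibility
B <- 1000       # number of bootstrap replicates (assumption)
n <- nrow(hybrid)

# Case bootstrap: resample rows with replacement and refit the final model
boot.coef <- t(replicate(B, {
  idx <- sample(seq_len(n), n, replace = TRUE)
  coef(lm(sqrt(msrp) ~ accelrate + log(mpg), data = hybrid[idx, ]))
}))

# 95% percentile confidence intervals for each coefficient
apply(boot.coef, 2, quantile, probs = c(0.025, 0.975))
```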

Regression Coefficient Matrix with 95% Case Bootstrap CI
Estimate Std. Error t value Pr(>|t|) btc.ci.95
(Intercept) 254.3380 42.1158 6.0390 0.0000 [ 68.9915 , 322.0853 ]
accelrate 9.1051 1.0478 8.6901 0.0000 [ -3.1656 , 2.9992 ]
log(mpg) -48.8780 9.8048 -4.9851 0.0000 [ -30.5112 , 28.3308 ]

The bootstrap histograms appear approximately normal. However, the case-bootstrap confidence intervals for accelrate and log(mpg) include 0, which suggests that the slopes of these variables are not statistically significant for the response variable (the price of the car).

The histogram of the residuals appears approximately normal with a slight right-tailed skew.

Residual Bootstrap Regression

The residual-bootstrap histograms are not normal, and the fitted normal and LOESS curves are nowhere near each other, which means that inference about the significance of the variables based on p-values and inference based on the residual bootstrap do not yield the same results.
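A sketch of the residual bootstrap used here: the residuals of the final model are resampled, added back to the fitted values, and the model is refit (B = 1000 and the seed are assumptions):

```r
set.seed(123)
B <- 1000
fit.vals <- fitted(sqrt.price.log.mpg)
res      <- resid(sqrt.price.log.mpg)

# Residual bootstrap: resample residuals, add to fitted values, refit the model
boot.res.coef <- t(replicate(B, {
  y.star <- fit.vals + sample(res, length(res), replace = TRUE)
  coef(lm(y.star ~ accelrate + log(mpg), data = hybrid))
}))

# 95% percentile confidence intervals for each coefficient
apply(boot.res.coef, 2, quantile, probs = c(0.025, 0.975))
```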

The 95% residual bootstrap confidence intervals are given below.

Regression Coefficient Matrix with 95% Residual Bootstrap CI
Estimate Std. Error t value Pr(>|t|) btr.ci.95
(Intercept) 254.3380 42.1158 6.0390 0.0000 [ 0 , 254.338 ]
accelrate 9.1051 1.0478 8.6901 0.0000 [ 0 , 9.1051 ]
log(mpg) -48.8780 9.8048 -4.9851 0.0000 [ -48.878 , 0 ]

The residual-bootstrap confidence intervals also include 0, which suggests that these slopes are not statistically significant in the model; this does not agree with the results based on the p-values.

Combining All Inferential Statistics

Final Combined Inferential Statistics: p-values and Bootstrap CIs
Estimate Std. Error Pr(>|t|) btc.ci.95 btr.ci.95
(Intercept) 254.3380 42.1158 0.0000 [ 68.9915 , 322.0853 ] [ 0 , 254.338 ]
accelrate 9.1051 1.0478 0.0000 [ -3.1656 , 2.9992 ] [ 0 , 9.1051 ]
log(mpg) -48.8780 9.8048 0.0000 [ -30.5112 , 28.3308 ] [ -48.878 , 0 ]

The bootstrap confidence intervals do not agree with the p-value results, which means that this model should not be used for prediction or estimation and that the final model has serious violations of the model assumptions.

Widths of the Two Bootstrap Confidence Intervals
                btc.wd       btr.wd
(Intercept)  253.093827   254.338039
accelrate      6.164756     9.105136
log(mpg)      58.841922    48.877987

The widths of the residual-bootstrap and case-bootstrap confidence intervals are not similar to each other, which shows that the model assumptions are still violated.
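For reference, the widths in the table above are simply the differences between the interval endpoints (sketch, using the bootstrap draws from the earlier sketches):

```r
# Interval widths: upper endpoint minus lower endpoint
btc.ci <- apply(boot.coef,     2, quantile, probs = c(0.025, 0.975))
btr.ci <- apply(boot.res.coef, 2, quantile, probs = c(0.025, 0.975))

cbind(btc.wd = btc.ci[2, ] - btc.ci[1, ],
      btr.wd = btr.ci[2, ] - btr.ci[1, ])
```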

Discussion

The main finding was that this model is not a good model for prediction or estimation. In addition, the inferential statistics did not agree with the bootstrap conclusions: the p-values were less than \(\alpha\), but the bootstrap confidence intervals all included zero, meaning that the variables did not contribute statistically significantly to the model.

If the final model were statistically significant, the transformation of the response would have to be undone before interpretation. For example, if the model yielded a predicted value of 2,000 on the square-root scale, that quantity would have to be squared to recover the predicted price, because the final model uses the square root of the response.
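A hypothetical illustration of this back-transformation (the predictor values below are made up for illustration only):

```r
# Hypothetical vehicle; predictor values are assumptions, not from the data
new.car <- data.frame(accelrate = 10, mpg = 40)

# Prediction is on the square-root scale, so square it to get dollars
sqrt.pred <- predict(sqrt.price.log.mpg, newdata = new.car)
sqrt.pred^2   # predicted MSRP in 2013 dollars
```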

Overall, the final model should not be used for interpretation because it failed the bootstrap checks, meaning that there were major violations of the model assumptions. This shows that the bootstrap method is a great way to check your work rather than basing conclusions solely on p-values.