Description of the Data Set

This data set compares the technological advancement of hybrid electric vehicles (HEVs) across different market segments. I obtained the data set from UFL. As technology has improved, electric vehicles have become more popular, but they remain expensive. The vehicles in the data set come from different countries and model years. The sample consists of 154 HEVs, including 11 plug-in HEVs, from 1997 to 2013. The required information on the vehicles was collected from the EPA database.

The numeric variables are vehicle id (carid), model year (year), manufacturer’s suggested retail price in 2013 $ (msrp), acceleration rate in km/hour/second (accelrate), fuel economy in miles/gallon (mpg), max of mpg and mpge (mpgmpge), and model class ID (carclass_id). The categorical variables are vehicle (vehicle) and model class (carclass): C = compact, M = midsize, TS = 2 seater, L = large, PT = pickup truck, MV = minivan, SUV = sport utility vehicle. Using these variables, is there a multiple linear regression model that can predict the price of a hybrid electric vehicle manufactured between 1997 and 2013?

I am removing some variables from the data set that are difficult to categorize or are redundant. I am removing vehicle id (carid) because it has no effect on price, and mpgmpge because it largely duplicates mpg (it is the maximum of mpg and mpge). I am also removing the categorical variable carclass because it is already recoded as the numeric variable carclass_id, with 1 = compact, 2 = large, 3 = midsize, 4 = minivan, 5 = pickup truck, 6 = SUV, and 7 = 2 seater.
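The recoding of carclass into carclass_id can be sketched in Python as follows. This is only an illustration: the mapping is taken from the text, while the helper name `recode_carclass` and the sample labels are hypothetical and do not come from the data set.

```python
# Mapping from model class label to numeric ID, as listed in the text.
CARCLASS_TO_ID = {
    "C": 1,    # compact
    "L": 2,    # large
    "M": 3,    # midsize
    "MV": 4,   # minivan
    "PT": 5,   # pickup truck
    "SUV": 6,  # sport utility vehicle
    "TS": 7,   # 2 seater
}


def recode_carclass(labels):
    """Return the numeric carclass_id for each class label (hypothetical helper)."""
    return [CARCLASS_TO_ID[label] for label in labels]


# Hypothetical sample of class labels, not rows from the actual data set.
sample = ["C", "M", "SUV", "TS"]
print(recode_carclass(sample))  # [1, 3, 6, 7]
```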

Full Linear Model

Statistics of Regression Coefficients
Term            Estimate       Std. Error     t value      Pr(>|t|)
(Intercept)     854011.2647    771125.6371     1.107487    0.2698803
year              -419.6596       384.1168    -1.092531    0.2763749
accelrate         4311.3907       499.3628     8.633784    0.0000000
mpg               -546.1041       135.5074    -4.030067    0.0000889
carclass_id      -1066.7506       679.6868    -1.569474    0.1186731

According to the summary, only two of the four predictors have statistically significant slopes: accelrate (t = 8.63, p < 0.001) and mpg (t = -4.03, p < 0.001). The variables carclass_id and year have p-values greater than \(\alpha\) = 0.05, so they do not contribute significantly to the linear model, and we remove them from the model.
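As a quick check on the table, each t value is the estimate divided by its standard error, and a slope is kept when its reported p-value falls below 0.05. The sketch below reproduces that screening from the reported numbers only; the variable and function names are illustrative, and no model is refit.

```python
# (estimate, standard error, reported p-value) copied from the coefficient table.
coefficients = {
    "(Intercept)": (854011.2647, 771125.6371, 0.2698803),
    "year": (-419.6596, 384.1168, 0.2763749),
    "accelrate": (4311.3907, 499.3628, 0.0000000),
    "mpg": (-546.1041, 135.5074, 0.0000889),
    "carclass_id": (-1066.7506, 679.6868, 0.1186731),
}
ALPHA = 0.05  # significance level used in the text


def t_value(estimate, std_error):
    """t statistic for testing that a coefficient equals zero."""
    return estimate / std_error


significant = []
for name, (estimate, std_error, p_value) in coefficients.items():
    t = t_value(estimate, std_error)
    keep = p_value < ALPHA
    if keep and name != "(Intercept)":
        significant.append(name)
    print(f"{name:12s} t = {t:8.3f}  p = {p_value:.7f}  significant: {keep}")

print(significant)  # ['accelrate', 'mpg']
```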

The residual plots show some violations of the model assumptions. The variance of the residuals does not appear to be constant. The Q-Q plot shows that the distribution of the residuals deviates slightly from the normal distribution. There appears to be weak curvature in the residuals, but there are no concerning outliers or high-leverage points.

Box-Cox Transformation

Square Root Transformation

We perform a Box-Cox transformation with log-transformed mpg and the square root of the response.
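For readers unfamiliar with the Box-Cox procedure, the idea is to pick the power \(\lambda\) that maximizes a profile log-likelihood; \(\lambda\) = 0.5 corresponds to a square root and \(\lambda\) = 0 to a log. The sketch below is a simplified illustration, not the analysis used here: it assumes an intercept-only model instead of the regression model, and it uses synthetic data built as squares of normal draws, so the selected \(\lambda\) should land near 0.5. All names are hypothetical.

```python
import math
import random


def box_cox(y, lam):
    """Box-Cox transform of a positive value y for a given lambda."""
    if abs(lam) < 1e-12:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def profile_log_likelihood(ys, lam):
    """Profile log-likelihood of lambda for an intercept-only normal model."""
    n = len(ys)
    transformed = [box_cox(y, lam) for y in ys]
    mean = sum(transformed) / n
    variance = sum((t - mean) ** 2 for t in transformed) / n  # MLE variance
    log_jacobian = (lam - 1.0) * sum(math.log(y) for y in ys)
    return -0.5 * n * math.log(variance) + log_jacobian


# Synthetic data (an assumption for illustration): squares of normal draws,
# so the square root of each value is approximately normal.
random.seed(0)
draws = [random.gauss(200.0, 40.0) for _ in range(2000)]
synthetic_prices = [d * d for d in draws if d > 0]

# Grid search over lambda from -2 to 2 in steps of 0.1.
grid = [round(-2.0 + 0.1 * i, 1) for i in range(41)]
best_lambda = max(grid, key=lambda lam: profile_log_likelihood(synthetic_prices, lam))
print(best_lambda)
```

In the case study itself, the likelihood would be evaluated on the residuals of the regression on accelrate and mpg, not on an intercept-only model.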

square-root-transformed model (response: square root of price)
Term            Estimate      Std. Error    t value      Pr(>|t|)
(Intercept)     254.338039    42.115772      6.039021    0.0e+00
accelrate         9.105136     1.047759      8.690108    0.0e+00
log(mpg)        -48.877987     9.804840     -4.985088    1.7e-06

There is some improvement in the Q-Q plot but the assumption of constant variance still seems to be violated.

Log Transformation

log-transformed model
Term            Estimate      Std. Error    t value      Pr(>|t|)
(Intercept)      9.7276200    0.2023182     48.080789    0.0000000
accelrate        0.0928168    0.0108268      8.572907    0.0000000
mpg             -0.0109725    0.0029012     -3.782102    0.0002241

There is even more improvement in the Q-Q plot but the assumption of constant variance still seems to be violated.

Goodness of Fit Measures

Next we compare all three candidate models to determine which one is the best.

Goodness-of-fit Measures of Candidate Models
Model                SSE            R.sq        R.adj       Cp   AIC          SBC          PRESS
full.model           3.232045e+10   0.5366073   0.5240832   5    2942.7848    2957.9370    3.477941e+10
sqrt.price.log.mpg   1.619397e+05   0.5616349   0.5557900   3    1071.5748    1080.6661    1.692585e+05
log.price            1.718713e+01   0.5194495   0.5130422   3    -328.5004    -319.4091    1.797001e+01

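From the above table, the second model has the highest R.sq and R.adj of the three candidates. SSE, AIC, SBC, and PRESS are computed on different response scales (price, square root of price, and log of price), so they cannot be compared directly across the models. Considering interpretability, goodness of fit, and simplicity, we choose the second model as the final model.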
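Several columns of the table can be recomputed from SSE and R.sq alone, which makes the definitions concrete. The sketch below uses the SSE-based forms AIC = n log(SSE/n) + 2p and SBC = n log(SSE/n) + p log(n), where p counts the coefficients including the intercept. The value n = 153 is an assumption: it is the value that reproduces the reported figures, whereas the text cites 154 vehicles. Function and variable names are illustrative.

```python
import math

# SSE and R.sq copied from the goodness-of-fit table.
# p is the number of coefficients including the intercept (the Cp column).
models = {
    "full.model": {"sse": 3.232045e10, "r_sq": 0.5366073, "p": 5},
    "sqrt.price.log.mpg": {"sse": 1.619397e05, "r_sq": 0.5616349, "p": 3},
    "log.price": {"sse": 1.718713e01, "r_sq": 0.5194495, "p": 3},
}
N_OBS = 153  # assumption: inferred because it reproduces the reported values


def adjusted_r_squared(r_sq, n, p):
    """Adjusted R-squared with p coefficients (intercept included)."""
    return 1.0 - (1.0 - r_sq) * (n - 1) / (n - p)


def aic(sse, n, p):
    """SSE-based Akaike information criterion."""
    return n * math.log(sse / n) + 2 * p


def sbc(sse, n, p):
    """SSE-based Schwarz Bayesian criterion."""
    return n * math.log(sse / n) + p * math.log(n)


for name, m in models.items():
    r_adj = adjusted_r_squared(m["r_sq"], N_OBS, m["p"])
    print(
        f"{name:20s} R.adj = {r_adj:.7f}  "
        f"AIC = {aic(m['sse'], N_OBS, m['p']):.4f}  "
        f"SBC = {sbc(m['sse'], N_OBS, m['p']):.4f}"
    )
```

Because the SSE of each model is measured in the units of its own response, the AIC and SBC values printed here inherit those units and are only comparable between models that share the same response.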

Final Model

Inferential Statistics of Final Model
Term            Estimate      Std. Error    t value      Pr(>|t|)
(Intercept)     254.338039    42.115772      6.039021    0.0e+00
accelrate         9.105136     1.047759      8.690108    0.0e+00
log(mpg)        -48.877987     9.804840     -4.985088    1.7e-06

Summary of the Model

The final model is \(\sqrt{\text{price}} = 254.34 + 9.11 \times \text{accelrate} - 48.88 \times \log(\text{mpg})\).

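This is interpreted as follows: holding mpg fixed, a one-unit increase in the acceleration rate increases the square root of the price by about 9.11. Because the response is on the square-root scale, the corresponding change in the price itself depends on the starting price and is not a single fixed percentage.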
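To see how a one-unit change in accelrate translates back to the price scale, the fitted equation can be evaluated directly. This is a minimal sketch using the coefficients from the final model table; it assumes log is the natural logarithm, and the input values (accelrate = 10, mpg = 30) are hypothetical, not taken from the data set.

```python
import math

# Coefficients copied from the final model table.
B0 = 254.338039
B_ACCEL = 9.105136
B_LOG_MPG = -48.877987


def sqrt_price(accelrate, mpg):
    """Fitted square root of price (natural log of mpg assumed)."""
    return B0 + B_ACCEL * accelrate + B_LOG_MPG * math.log(mpg)


def price(accelrate, mpg):
    """Fitted price, obtained by squaring the fitted square root."""
    return sqrt_price(accelrate, mpg) ** 2


def price_change_per_unit_accel(accelrate, mpg):
    """Exact change in fitted price when accelrate rises by 1 with mpg fixed."""
    s = sqrt_price(accelrate, mpg)
    return 2.0 * B_ACCEL * s + B_ACCEL ** 2


# Hypothetical vehicle used only to illustrate the calculation.
accelrate, mpg = 10.0, 30.0
before = price(accelrate, mpg)
after = price(accelrate + 1.0, mpg)
percent_change = 100.0 * (after - before) / before
print(round(before), round(after), round(percent_change, 1))
```

With s denoting the fitted square root of price and b the accelrate coefficient, the price difference equals (s + b)^2 - s^2 = 2bs + b^2, so both the dollar change and the percentage change vary with the starting value s.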

Discussions

We use regression techniques such as the Box-Cox transformation of the response variable and transformations of the explanatory variables to search for the final model in this case study. The full model had four explanatory variables but only two were significant, so we did have to perform a variable selection procedure.

The two transformed candidate models have the same number of variables, while the full model has two more. We use commonly used global goodness-of-fit measures as model selection criteria.

The interpretation of the regression coefficients is not straightforward, since the response variable was transformed with a square root and the variable mpg was transformed with a log. We used some algebra to derive the practical interpretation of the regression coefficients associated with the variables on their original scales.

The violation of the normality assumption for the residuals was reduced, but the non-constant variance of the residuals remained uncorrected.