This data set compares the technological advancement of hybrid electric vehicles (HEVs) across different market segments. I obtained the data set from UFL. As technology has improved, electric vehicles have become more popular but remain expensive. The vehicles in the data set come from different countries and years. The sample consists of 154 HEVs, including 11 plug-in HEVs, from 1997 to 2013. The EPA database was used to collect the required information on the vehicles.
The numeric variables are vehicle ID (carid), model year (year), manufacturer's suggested retail price in 2013 dollars (msrp), acceleration rate in km/hour/second (accelrate), fuel economy in miles/gallon (mpg), the maximum of mpg and mpge (mpgmpge), and model class ID (carclass_id). The categorical variables are vehicle (vehicle) and model class (carclass): C = compact, M = midsize, TS = 2 seater, L = large, PT = pickup truck, MV = minivan, SUV = sport utility vehicle. Using these variables, is there a multiple linear regression model that can predict the price of an electric car manufactured between 1997 and 2013?
I am removing some variables from the data set that are difficult to categorize or are redundant. I am removing the vehicle ID (carid) because it has no effect on price, and mpgmpge because it measures essentially the same quantity as mpg. I am also removing the categorical variable carclass because it is already recoded as the numeric variable carclass_id, with 1 = compact, 2 = large, 3 = midsize, 4 = minivan, 5 = pickup truck, 6 = SUV, and 7 = 2 seater.
#options(digits = 15, scipen=999) #this sets all digits to 15 and gets rid of scientific notation
library(knitr) # needed for the kable() tables used throughout
hybrid = read.csv("https://raw.githubusercontent.com/anafrance/STA-321/main/www/hybrid_reg.csv", header = TRUE)
hybrid = hybrid[,-1] #taking out variable carid
hybrid = hybrid[,-6] #taking out variable mpgmpge
hybrid = hybrid[,-6] #taking out variable carclass
hybrid = hybrid[,-1] #taking out variable vehicle
price <- hybrid$msrp #pulling out the price variable
hybrid = hybrid[,-2] #taking out response variable
full.model = lm(price ~ ., data=hybrid) #building the model
kable(summary(full.model)$coef, caption ="Statistics of Regression Coefficients") #labeling the model
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 854011.2647 | 771125.6371 | 1.107487 | 0.2698803 |
| year | -419.6596 | 384.1168 | -1.092531 | 0.2763749 |
| accelrate | 4311.3907 | 499.3628 | 8.633784 | 0.0000000 |
| mpg | -546.1041 | 135.5074 | -4.030067 | 0.0000889 |
| carclass_id | -1066.7506 | 679.6868 | -1.569474 | 0.1186731 |
According to the summary, only two of the variables have statistically significant slopes: accelrate (t = 8.63, p < 0.001) and mpg (t = -4.03, p < 0.001). The variables carclass_id and year have p-values greater than \(\alpha\) = 0.05, which means they are not contributing significantly to the linear model, so we will remove them from the model.
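For reference, the reduced (untransformed) model with only the significant predictors could be fit as below. This is just a sketch; the object name reduced.model is illustrative and is not used in the diagnostics that follow, which are for the full model.

# Reduced model keeping only the statistically significant predictors (illustrative only)
reduced.model = lm(price ~ accelrate + mpg, data = hybrid)
summary(reduced.model)$coef # coefficient table for the reduced model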
par(mfrow=c(2,2))
plot(full.model)
From the residual plots, we can see some violations. The variance of the residuals does not appear to be constant. The Q-Q plot shows that the distribution of the residuals departs slightly from the normal distribution. There is weak curvature in the residuals, but there are no concerning outliers or high-leverage points.
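A formal test of the constant-variance assumption could supplement the residual plots. The sketch below assumes the lmtest package is installed; it is not used elsewhere in this analysis.

library(lmtest) # assumed available; provides the Breusch-Pagan test
bptest(full.model) # a small p-value indicates non-constant variance (heteroscedasticity)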
library(MASS)
par(pty = "s", mfrow = c(2, 2), oma=c(.1,.1,.1,.1), mar=c(4, 0, 2, 0))
##
#boxcox(price ~ log(year) + accelrate + mpg + carclass_id, data = hybrid, lambda = seq(-.5, .5, length = 10),
# xlab=expression(paste(lambda, ": log year")))
##
#boxcox(price ~ year + accelrate + mpg + log(carclass_id), data = hybrid, lambda = seq(-.5, .5, length = 10),
# xlab=expression(paste(lambda, ": log carclass_id")))
##
boxcox(price ~ log(accelrate) + mpg, data = hybrid, lambda = seq(-.5, .5, length = 10),
xlab=expression(paste(lambda, ": log accelrate")))
##
boxcox(price ~ accelrate + log(mpg), data = hybrid, lambda = seq(-.5, .5, length = 10),
xlab=expression(paste(lambda, ": log mpg")))
The Box-Cox plot for the model with log-transformed mpg suggests \(\lambda\) near 0.5, so we transform the response by taking its square root and keep mpg on the log scale.
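For reference, the \(\lambda\) that maximizes the profile log-likelihood can also be read off numerically from boxcox(); this is a small sketch using the same formula as above with a finer \(\lambda\) grid.

bc = boxcox(price ~ accelrate + log(mpg), data = hybrid,
lambda = seq(-.5, .5, length = 100), plotit = FALSE)
bc$x[which.max(bc$y)] # a value near 0.5 supports the square-root response transformation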
sqrt.price.log.mpg = lm((price)^0.5 ~ accelrate + log(mpg), data = hybrid)
kable(summary(sqrt.price.log.mpg)$coef, caption = "Square-root price model with log-transformed mpg")
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 254.338039 | 42.115772 | 6.039021 | 0.0e+00 |
| accelrate | 9.105136 | 1.047759 | 8.690108 | 0.0e+00 |
| log(mpg) | -48.877987 | 9.804840 | -4.985088 | 1.7e-06 |
par(mfrow = c(2,2))
plot(sqrt.price.log.mpg)
There is some improvement in the Q-Q plot but the assumption of constant variance still seems to be violated.
log.price = lm(log(price) ~ accelrate + mpg, data = hybrid)
kable(summary(log.price)$coef, caption = "log-transformed model")
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 9.7276200 | 0.2023182 | 48.080789 | 0.0000000 |
| accelrate | 0.0928168 | 0.0108268 | 8.572907 | 0.0000000 |
| mpg | -0.0109725 | 0.0029012 | -3.782102 | 0.0002241 |
par(mfrow = c(2,2))
plot(log.price)
There is even more improvement in the Q-Q plot but the assumption of constant variance still seems to be violated.
Next, we compare all three candidate models using several goodness-of-fit measures and determine which one is best.
select=function(m){ # m is an object: model
e = m$resid # residuals
n0 = length(e) # sample size
SSE=(m$df)*(summary(m)$sigma)^2 # sum of squared error
R.sq=summary(m)$r.squared
R.adj=summary(m)$adj.r.squared # adjusted R-squared
MSE=(summary(m)$sigma)^2 # mean squared error
Cp=(SSE/MSE)-(n0-2*(n0-m$df)) # Mallows' Cp
AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df) # Akaike information criterion
SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df) # Schwarz Bayesian Information criterion
X=model.matrix(m) # design matrix of the model
H=X%*%solve(t(X)%*%X)%*%t(X) # hat matrix
d=e/(1-diag(H)) # PRESS (leave-one-out) residuals
PRESS=t(d)%*%d # predicted residual error sum of squares (PRESS)- a cross-validation measure
tbl = as.data.frame(cbind(SSE=SSE, R.sq=R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
names(tbl)=c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC", "PRESS")
tbl
}
sqrt.price.log.mpg = lm((price)^0.5 ~ accelrate + log(mpg), data = hybrid)
output.sum = rbind(select(full.model), select(sqrt.price.log.mpg), select(log.price))
row.names(output.sum) = c("full.model", "sqrt.price.log.mpg", "log.price")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")
| | SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS |
|---|---|---|---|---|---|---|---|
| full.model | 3.232045e+10 | 0.5366073 | 0.5240832 | 5 | 2942.7848 | 2957.9370 | 3.477941e+10 |
| sqrt.price.log.mpg | 1.619397e+05 | 0.5616349 | 0.5557900 | 3 | 1071.5748 | 1080.6661 | 1.692585e+05 |
| log.price | 1.718713e+01 | 0.5194495 | 0.5130422 | 3 | -328.5004 | -319.4091 | 1.797001e+01 |
Because the three models use differently transformed responses, the SSE, AIC, SBC, and PRESS values are on different scales and are not directly comparable across models; the second model, however, has the highest R-squared and adjusted R-squared. Considering interpretability, goodness-of-fit, and simplicity, we choose the second model as the final model.
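One optional way (not part of the original model-selection table) to put all three candidates on a common footing is to back-transform their fitted values to the dollar scale and compare prediction errors there, ignoring retransformation bias for the square-root and log models. The sketch below uses only objects already defined above; rmse is a small helper introduced here for illustration.

# Root mean squared error on the original dollar scale, after undoing each transformation
rmse = function(obs, fit) sqrt(mean((obs - fit)^2))
c(full = rmse(price, fitted(full.model)),
sqrt.model = rmse(price, fitted(sqrt.price.log.mpg)^2), # square to undo the square-root transform
log.model = rmse(price, exp(fitted(log.price)))) # exponentiate to undo the log transform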
cmtrx = summary(sqrt.price.log.mpg)$coef
kable(summary(sqrt.price.log.mpg)$coef, caption = "Inferential Statistics of Final Model")
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 254.338039 | 42.115772 | 6.039021 | 0.0e+00 |
| accelrate | 9.105136 | 1.047759 | 8.690108 | 0.0e+00 |
| log(mpg) | -48.877987 | 9.804840 | -4.985088 | 1.7e-06 |
Because the assumption of constant variance is still violated, a bootstrap method is used to find confidence intervals for the coefficients of the final regression model.
log.price = lm((price)^0.5 ~ accelrate + log(mpg), data = hybrid) # refit the final square-root model (note: the object name log.price is reused here)
B = 1000 #number of bootstrap replicates
num.p = dim(model.frame(log.price))[2] # number of parameters in the model
smpl.n = dim(model.frame(log.price))[1] # sample size
# a matrix to store bootstrap coefficients
coef.mtrx = matrix(rep(0, B*num.p), ncol = num.p)
for (i in 1:B){ #for loop that goes from 1 to 1000
bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE) # draw a bootstrap sample of row indices (with replacement)
log.price.btc = lm((price)^0.5 ~ accelrate + log(mpg), data = hybrid[bootc.id,]) # refit the final model on the bootstrap sample
coef.mtrx[i,] = coef(log.price.btc) # extract coefficients from the bootstrap regression model
}
boot.hist = function(bt.coef.mtrx, var.id, var.nm){
## bt.coef.mtrx = matrix storing bootstrap estimates of the coefficients
## var.id = variable ID (1, 2, ..., k+1)
## var.nm = variable name for the histogram title, given as a string in double quotes
## Bootstrap sampling distribution of the estimated coefficients
x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
# height of the histogram
highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density)
ylimit <- max(c(y1.1,highestbar))
hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="",
col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
lines(x = x1.1, y = y1.1, col = "red3")
lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue")
}
par(mfrow=c(2,2)) # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=2, var.nm ="Accelrate" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=3, var.nm ="log(mpg)" )
num.p = dim(coef.mtrx)[2] # number of parameters
btc.ci = NULL
btc.wd = NULL
for (i in 1:num.p){
lci.025 = round(quantile(coef.mtrx[, i], 0.025, type = 2),8) #lower limit
uci.975 = round(quantile(coef.mtrx[, i],0.975, type = 2 ),8) #upper limit
btc.wd[i] = uci.975 - lci.025
btc.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}
#print out CI
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btc.ci.95=btc.ci)),
caption = "Regression Coefficient Matrix")
| | Estimate | Std. Error | t value | Pr(>|t|) | btc.ci.95 |
|---|---|---|---|---|---|
| (Intercept) | 254.3380 | 42.1158 | 6.0390 | 0.0000 | [ 72.4057 , 312.7024 ] |
| accelrate | 9.1051 | 1.0478 | 8.6901 | 0.0000 | [ -2.8153 , 3.1544 ] |
| log(mpg) | -48.8780 | 9.8048 | -4.9851 | 0.0000 | [ -28.7835 , 29.2044 ] |
The histograms appear to be approximately normal. However, the bootstrap confidence intervals for accelrate and mpg include 0, which suggests that these slopes are not statistically significant for the response variable (the price of the car).
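As a supplementary summary (not in the original analysis), the bootstrap standard errors can be computed directly from the stored coefficient matrix and compared with the model-based standard errors; boot.se is an illustrative name introduced here.

# Bootstrap vs. model-based standard errors for (Intercept), accelrate, and log(mpg)
boot.se = apply(coef.mtrx, 2, sd)
rbind(bootstrap.se = boot.se,
model.se = summary(sqrt.price.log.mpg)$coef[, "Std. Error"])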
hist(log.price$residuals, breaks = 40,
xlab="Residuals",
col = "lightblue",
border="navy",
main = "Histogram of Residuals")
The histogram of the residuals appears approximately normal with a slight right skew.
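A formal normality test of the residuals could back up the visual impression from the histogram; this is a small sketch, not part of the original write-up.

# Shapiro-Wilk test on the final (square-root) model residuals; a small p-value indicates non-normality
shapiro.test(sqrt.price.log.mpg$residuals)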
log.price <- lm((price)^0.5 ~ accelrate + log(mpg), data = hybrid)
model.resid = log.price$residuals
B=1000
num.p = dim(model.matrix(log.price))[2] # number of parameters
samp.n = dim(model.matrix(log.price))[1] # sample size
btr.mtrx = matrix(0, nrow = B, ncol = num.p) # zero matrix to store bootstrap coefficients (one row per replicate)
for (i in 1:B){
## Bootstrap response values
bt.lg.price = log.price$fitted.values +
sample(log.price$residuals, samp.n, replace = TRUE) # fitted values plus resampled residuals
# replace the response with the bootstrap (square-root scale) price
hybrid$bt.lg.price = bt.lg.price
btr.model = lm(bt.lg.price ~ accelrate + log(mpg), data = hybrid) # refit the model using the bootstrap response
btr.mtrx[i,]=btr.model$coefficients
}
boot.hist = function(bt.coef.mtrx, var.id, var.nm){
## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
## var.id = variable ID (1, 2, ..., k+1)
## var.nm = variable name on the hist title, must be the string in the double quotes
## Bootstrap sampling distribution of the estimated coefficients
x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
# height of the histogram
highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density)
ylimit <- max(c(y1.1,highestbar))
hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="",
col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
lines(x = x1.1, y = y1.1, col = "red3") # normal density curve
lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue") # kernel density estimate
}
par(mfrow=c(1,3)) # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=2, var.nm ="Accelrate" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=3, var.nm ="log(mpg)" )
The histograms of the residual-bootstrap coefficient estimates do not look normal, and the normal-density and kernel-density curves are nowhere near each other, which means that inference about the significance of the variables based on p-values and on the residual bootstrap does not yield the same results.
The 95% residual bootstrap confidence intervals are given below.
num.p = dim(coef.mtrx)[2] # number of parameters
btr.ci = NULL
btr.wd = NULL
for (i in 1:num.p){
lci.025 = round(quantile(btr.mtrx[, i], 0.025, type = 2),8) #lower limit
uci.975 = round(quantile(btr.mtrx[, i],0.975, type = 2 ),8) #upper limit
btr.wd[i] = uci.975 - lci.025
btr.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btr.ci.95=btr.ci)),
caption = "Regression Coefficient Matrix with 95% Residual Bootstrap CI")
| | Estimate | Std. Error | t value | Pr(>|t|) | btr.ci.95 |
|---|---|---|---|---|---|
| (Intercept) | 254.3380 | 42.1158 | 6.0390 | 0.0000 | [ 0 , 254.338 ] |
| accelrate | 9.1051 | 1.0478 | 8.6901 | 0.0000 | [ 0 , 9.1051 ] |
| log(mpg) | -48.8780 | 9.8048 | -4.9851 | 0.0000 | [ -48.878 , 0 ] |
The residual bootstrap confidence intervals include 0, which means that these slopes are not statistically significant in the model and that the bootstrap does not yield the same results as the p-values.
kable(as.data.frame(cbind(formatC(cmtrx[,-3],4,format="f"), btc.ci.95=btc.ci,btr.ci.95=btr.ci)),
caption="Final Combined Inferential Statistics: p-values and Bootstrap CIs")
| | Estimate | Std. Error | Pr(>|t|) | btc.ci.95 | btr.ci.95 |
|---|---|---|---|---|---|
| (Intercept) | 254.3380 | 42.1158 | 0.0000 | [ 72.4057 , 312.7024 ] | [ 0 , 254.338 ] |
| accelrate | 9.1051 | 1.0478 | 0.0000 | [ -2.8153 , 3.1544 ] | [ 0 , 9.1051 ] |
| log(mpg) | -48.8780 | 9.8048 | 0.0000 | [ -28.7835 , 29.2044 ] | [ -48.878 , 0 ] |
The bootstrap confidence intervals do not agree with the p-value results, which means that this model should not be used for prediction or estimation and that the final model has serious violations of the model assumptions.
kable(cbind(btc.wd, btr.wd), caption="width of the two bootstrap confidence intervals")
| | btc.wd | btr.wd |
|---|---|---|
| (Intercept) | 240.296708 | 254.338039 |
| accelrate | 5.969673 | 9.105136 |
| log(mpg) | 57.987909 | 48.877987 |
The widths of the residual-bootstrap and case-bootstrap confidence intervals are not similar to each other, showing that the model assumptions are still violated.
The main finding is that this model is not a good model for prediction and estimation. In addition, the inferential statistics did not agree with the bootstrap conclusions: the p-values were less than \(\alpha\), but the bootstrap confidence intervals all included zero, meaning that the variables did not contribute statistically to the model.
If the final model had been statistically significant, simple algebra would be needed for interpretation because the response variable was transformed. For example, a fitted value of 170 on the square-root scale corresponds to a predicted price of 170^2 = $28,900, because the final model models the square root of the response.
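To illustrate, a prediction from the final model would be made on the square-root scale and then squared; the accelrate and mpg values below are hypothetical, chosen only for illustration.

# Hypothetical new vehicle (illustrative values only): accelrate = 10 km/hr/s, mpg = 40
new.car = data.frame(accelrate = 10, mpg = 40)
sqrt.pred = predict(sqrt.price.log.mpg, newdata = new.car) # prediction on the square-root-dollar scale
sqrt.pred^2 # squared to return to the original dollar scale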
Overall, the final model should not be used for interpretation because it failed the bootstrap checks, meaning that there were major violations of the model assumptions. This shows that the bootstrap method is a valuable way to check your work rather than basing conclusions solely on p-values.