Description of the Data Set

This data set compares the technological advancement of hybrid electric vehicles (HEVs) across different market segments. I obtained the data set from the University of Florida (UFL). As technology improves, electric vehicles have become more popular, but they remain expensive. The vehicles in the data set come from different countries and model years. The sample consists of 154 HEVs, including 11 plug-in HEVs, from model years 1997 to 2013. The EPA database was used to collect the required information on the vehicles.

The numeric variables are vehicle ID (carid), model year (year), manufacturer's suggested retail price in 2013 dollars (msrp), acceleration rate in km/hour/second (accelrate), fuel economy in miles per gallon (mpg), the maximum of mpg and mpge (mpgmpge), and model class ID (carclass_id). The categorical variables are vehicle (vehicle) and model class (carclass): C = compact, M = midsize, TS = 2 seater, L = large, PT = pickup truck, MV = minivan, SUV = sport utility vehicle. Using these variables, is there a multiple linear model that can predict the price of a hybrid electric car manufactured between 1997 and 2013?

I am removing variables from the data set that are difficult to categorize or redundant. The variable vehicle ID (carid) is removed because it has no effect on price, and mpgmpge is removed because it largely duplicates mpg. I am also removing the categorical variable carclass because it is already recoded as the numeric variable carclass_id, with 1 = compact, 2 = large, 3 = midsize, 4 = minivan, 5 = pickup truck, 6 = SUV, and 7 = 2 seater.
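
For reference, a minimal sketch of how the numeric carclass_id codes correspond to the carclass labels (assuming carclass is stored as a character or factor column; the name class.codes is only for illustration):

class.codes = c(C = 1, L = 2, M = 3, MV = 4, PT = 5, SUV = 6, TS = 7)   # label-to-code mapping
# hybrid$carclass_id = class.codes[as.character(hybrid$carclass)]       # not needed here; carclass_id is already in the data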

Full Linear Model

#options(digits = 15, scipen=999) #this sets all digits to 15 and gets rid of scientific notation


library(knitr)         # for kable()

hybrid = read.csv("https://raw.githubusercontent.com/anafrance/STA-321/main/www/hybrid_reg.csv", header = TRUE)
hybrid = hybrid[,-1]   #taking out variable carid
hybrid = hybrid[,-6]   #taking out variable mpgmpge
hybrid = hybrid[,-6]   #taking out variable carclass
hybrid = hybrid[,-1]   #taking out variable vehicle
#hybrid[ ]

price <- hybrid$msrp   #pulling out the price variable

hybrid = hybrid[,-2]   #taking out response variable

full.model = lm(price ~ ., data=hybrid)    #building the model
kable(summary(full.model)$coef, caption ="Statistics of Regression Coefficients")  # coefficient table of the full model
Statistics of Regression Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 854011.2647 771125.6371 1.107487 0.2698803
year -419.6596 384.1168 -1.092531 0.2763749
accelrate 4311.3907 499.3628 8.633784 0.0000000
mpg -546.1041 135.5074 -4.030067 0.0000889
carclass_id -1066.7506 679.6868 -1.569474 0.1186731

According to the summary, only two of the variables have statistically significant slopes: accelrate (t = 8.63, p < 0.001) and mpg (t = -4.03, p < 0.001). The variables carclass_id and year have p-values greater than \(\alpha\) = 0.05, meaning they do not contribute significantly to the linear model, so we will remove them from the model (a sketch of the reduced fit follows the residual diagnostics below).

par(mfrow=c(2,2))
plot(full.model)

From the residual plots, we can see some violations. The variance of the residuals does not appear to be constant, and the Q-Q plot shows that the distribution of the residuals departs slightly from the normal distribution. There is weak curvature in the residuals, but there are no concerning outliers or high-leverage values.
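
Dropping year and carclass_id leaves accelrate and mpg as the candidate predictors; a minimal sketch of this reduced fit (the object name reduced.model is only for illustration):

reduced.model = lm(price ~ accelrate + mpg, data = hybrid)   # reduced model with the two significant predictors
summary(reduced.model)$coef                                  # coefficient table of the reduced model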

Box-Cox Transformation

library(MASS)
par(pty = "s", mfrow = c(2, 2), oma=c(.1,.1,.1,.1), mar=c(4, 0, 2, 0))
##
#boxcox(price ~ log(year) + accelrate + mpg + carclass_id, data = hybrid, lambda = seq(-.5, .5, length = 10), 
   #    xlab=expression(paste(lambda, ": log year")))
##
#boxcox(price ~ year + accelrate + mpg + log(carclass_id), data = hybrid, lambda = seq(-.5, .5, length = 10), 
 #      xlab=expression(paste(lambda, ": log carclass_id")))

##
boxcox(price ~ log(accelrate) + mpg, data = hybrid, lambda = seq(-.5, .5, length = 10), 
       xlab=expression(paste(lambda, ": log accelrate")))

##
boxcox(price ~ accelrate + log(mpg), data = hybrid, lambda = seq(-.5, .5, length = 10), 
       xlab=expression(paste(lambda, ": log mpg")))
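
The \(\lambda\) maximizing the profile log-likelihood can also be read off numerically, since boxcox() returns the \(\lambda\) grid and the corresponding log-likelihood values (a minimal sketch; the object name bc is only for illustration):

bc = boxcox(price ~ accelrate + log(mpg), data = hybrid,
            lambda = seq(-0.5, 0.5, length = 100), plotit = FALSE)
bc$x[which.max(bc$y)]   # lambda with the largest profile log-likelihood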

Square Root Transformation

The Box-Cox profile with log-transformed mpg points to \(\lambda \approx 0.5\), so we fit a model using the square root of price as the response and log(mpg) as a predictor.

sqrt.price.log.mpg = lm((price)^0.5 ~ accelrate + log(mpg), data = hybrid)
kable(summary(sqrt.price.log.mpg)$coef, caption = "Square-root price model with log(mpg)")
Square-root price model with log(mpg)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 254.338039 42.115772 6.039021 0.0e+00
accelrate 9.105136 1.047759 8.690108 0.0e+00
log(mpg) -48.877987 9.804840 -4.985088 1.7e-06
par(mfrow = c(2,2))
plot(sqrt.price.log.mpg)

There is some improvement in the Q-Q plot but the assumption of constant variance still seems to be violated.

Log Transformation

log.price = lm(log(price) ~ accelrate + mpg, data = hybrid)
kable(summary(log.price)$coef, caption = "log-transformed model")
log-transformed model
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.7276200 0.2023182 48.080789 0.0000000
accelrate 0.0928168 0.0108268 8.572907 0.0000000
mpg -0.0109725 0.0029012 -3.782102 0.0002241
par(mfrow = c(2,2))
plot(log.price)

There is even more improvement in the Q-Q plot but the assumption of constant variance still seems to be violated.
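
The visual impression of non-constant variance could also be checked with a formal test; a minimal sketch using the Breusch-Pagan test from the lmtest package (an assumption here: lmtest is installed, as it is not loaded elsewhere in this report):

library(lmtest)              # provides bptest()
bptest(sqrt.price.log.mpg)   # Breusch-Pagan test: a small p-value indicates non-constant variance
bptest(log.price)            # same test for the log-price model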

Goodness of Fit Measures

Next, we compare all three candidate models using several goodness-of-fit measures and determine which one fits best.

select=function(m){ # m is an object: model
 e = m$resid                           # residuals
 n0 = length(e)                        # sample size
 SSE=(m$df)*(summary(m)$sigma)^2       # sum of squared errors
 R.sq=summary(m)$r.squared             # R-squared
 R.adj=summary(m)$adj.r.squared        # adjusted R-squared
 MSE=(summary(m)$sigma)^2              # mean squared error
 Cp=(SSE/MSE)-(n0-2*(n0-m$df))         # Mallows' Cp
 AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df)          # Akaike information criterion
 SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df)  # Schwarz Bayesian information criterion
 X=model.matrix(m)                     # design matrix of the model
 H=X%*%solve(t(X)%*%X)%*%t(X)          # hat matrix
 d=e/(1-diag(H))                       # deleted (leave-one-out) residuals
 PRESS=t(d)%*%d   # predicted residual error sum of squares (PRESS) - a cross-validation measure
 tbl = as.data.frame(cbind(SSE=SSE, R.sq=R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
 names(tbl)=c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC", "PRESS")
 tbl
}
sqrt.price.log.mpg = lm((price)^0.5 ~ accelrate + log(mpg), data = hybrid)

output.sum = rbind(select(full.model), select(sqrt.price.log.mpg), select(log.price))
row.names(output.sum) = c("full.model", "sqrt.price.log.mpg", "log.price")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")
Goodness-of-fit Measures of Candidate Models
SSE R.sq R.adj Cp AIC SBC PRESS
full.model 3.232045e+10 0.5366073 0.5240832 5 2942.7848 2957.9370 3.477941e+10
sqrt.price.log.mpg 1.619397e+05 0.5616349 0.5557900 3 1071.5748 1080.6661 1.692585e+05
log.price 1.718713e+01 0.5194495 0.5130422 3 -328.5004 -319.4091 1.797001e+01

The three models have different response scales (price, square root of price, and log of price), so SSE, AIC, SBC, and PRESS are not directly comparable across them. Among the scale-free measures, the second model has the highest R-squared and adjusted R-squared. Considering interpretability, goodness of fit, and simplicity, we choose the second model as the final model.
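
Because the response scales differ, one optional cross-check is to back-transform each model's fitted values to the dollar scale and recompute R-squared there; a minimal sketch (the helper function r2.orig is only for illustration):

r2.orig = function(m, back){                  # R-squared on the original dollar scale
  yhat = back(fitted(m))                      # back-transform fitted values to dollars
  1 - sum((price - yhat)^2)/sum((price - mean(price))^2)
}
r2.orig(sqrt.price.log.mpg, function(z) z^2)  # square-root model: square the fitted values
r2.orig(log.price, exp)                       # log-price model: exponentiate (ignoring retransformation bias)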

Final Model

cmtrx = summary(sqrt.price.log.mpg)$coef
kable(summary(sqrt.price.log.mpg)$coef, caption = "Inferential Statistics of Final Model")
Inferential Statistics of Final Model
Estimate Std. Error t value Pr(>|t|)
(Intercept) 254.338039 42.115772 6.039021 0.0e+00
accelrate 9.105136 1.047759 8.690108 0.0e+00
log(mpg) -48.877987 9.804840 -4.985088 1.7e-06

Bootstrap Final Model

Because the assumption of constant variance is still violated, a bootstrap method is used to find confidence intervals for the coefficients of the final regression model.

log.price = lm((price)^0.5 ~ accelrate + log(mpg), data = hybrid)   # refit of the final (square-root) model; the name log.price is reused from earlier

B = 1000       #number of bootstrap replicates

num.p = dim(model.frame(log.price))[2]  # number of parameters in the model
smpl.n = dim(model.frame(log.price))[1] # sample size

# a matrix to store bootstrap coefficients 
coef.mtrx = matrix(rep(0, B*num.p), ncol = num.p)       
 
for (i in 1:B){    # case-bootstrap loop: B = 1000 replicates
  bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE) # sample row indices with replacement
  log.price.btc = lm((price)^0.5 ~ accelrate + log(mpg), data = hybrid[bootc.id,])   # refit the final model on the bootstrap sample
  coef.mtrx[i,] = coef(log.price.btc)    # extract coefficients from the bootstrap regression model
}

boot.hist = function(bt.coef.mtrx, var.id, var.nm){
  ## bt.coef.mtrx = matrix storing the bootstrap estimates of the coefficients
  ## var.id = variable ID (1, 2, ..., k+1)
  ## var.nm = variable name for the histogram title, given as a quoted string
  ## Plots the bootstrap sampling distribution of the estimated coefficient
  
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
  
  # height of the histogram
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")                        # normal density curve
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue")    # kernel density estimate
  
}

par(mfrow=c(2,2))  # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=2, var.nm ="Accelrate" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=3, var.nm ="MPG" )

num.p = dim(coef.mtrx)[2]  # number of parameters
btc.ci = NULL
btc.wd = NULL
for (i in 1:num.p){
  lci.025 = round(quantile(coef.mtrx[, i], 0.025, type = 2),8)   #lower limit
  uci.975 = round(quantile(coef.mtrx[, i],0.975, type = 2 ),8)   #upper limit
  btc.wd[i] =  uci.975 - lci.025
  btc.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
 }
#print out CI
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btc.ci.95=btc.ci)), 
      caption = "Regression Coefficient Matrix")
Regression Coefficient Matrix
Estimate Std. Error t value Pr(>|t|) btc.ci.95
(Intercept) 254.3380 42.1158 6.0390 0.0000 [ 72.4057 , 312.7024 ]
accelrate 9.1051 1.0478 8.6901 0.0000 [ -2.8153 , 3.1544 ]
log(mpg) -48.8780 9.8048 -4.9851 0.0000 [ -28.7835 , 29.2044 ]

The histograms appear to be approximately normal. The bootstrap confidence intervals for accelrate and mpg include 0, which means that the slopes of these variables are not statistically significant for the response variable (the price of the car).

hist(sort(log.price$residuals),n=40,
     xlab="Residuals",
     col = "lightblue",
     border="navy",
     main = "Histogram of Residuals")

The histogram of the residuals appears approximately normal with a slight right skew.

Residual Bootstrap Regression

log.price <- lm((price)^0.5 ~ accelrate + log(mpg), data = hybrid)   # final (square-root) model, refit for the residual bootstrap
model.resid = log.price$residuals

B=1000
num.p = dim(model.matrix(log.price))[2]   # number of parameters
samp.n = dim(model.matrix(log.price))[1]  # sample size
btr.mtrx = matrix(0, nrow = B, ncol = num.p) # zero matrix (B x num.p) to store boot coefs
for (i in 1:B){
  ## Bootstrap response values
  bt.lg.price = log.price$fitted.values + 
        sample(log.price$residuals, samp.n, replace = TRUE)  # bootstrap residuals
  
  # replace price with bootstrap log price
  hybrid$bt.lg.price =  bt.lg.price   
  btr.model = lm(bt.lg.price ~ accelrate + log(mpg), data = hybrid)   # refit the model using the bootstrap response
  btr.mtrx[i,]=btr.model$coefficients
}

boot.hist = function(bt.coef.mtrx, var.id, var.nm){
  ## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
  ## var.id = variable ID (1, 2, ..., k+1)
  ## var.nm = variable name on the hist title, must be the string in the double quotes
  ## Bootstrap sampling distribution of the estimated coefficients
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
  
  # height of the histogram
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")       # normal density curve         
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue")    # kernel density estimate
} 

par(mfrow=c(1,3))  # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=2, var.nm ="Accelrate" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=3, var.nm ="MPG" )

The histograms of the bootstrap coefficients do not look normal, and the normal and kernel density curves are nowhere near each other, which means that inference about the significance of the variables based on the p-values and based on the residual bootstrap does not yield the same results.

The 95% residual bootstrap confidence intervals are given below

num.p = dim(btr.mtrx)[2]  # number of parameters
btr.ci = NULL
btr.wd = NULL
for (i in 1:num.p){
  lci.025 = round(quantile(btr.mtrx[, i], 0.025, type = 2),8)   #lower limit
  uci.975 = round(quantile(btr.mtrx[, i],0.975, type = 2 ),8)   #upper limit
  btr.wd[i] = uci.975 - lci.025
  btr.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}

kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btr.ci.95=btr.ci)), 
      caption = "Regression Coefficient Matrix with 95% Residual Bootstrap CI")
Regression Coefficient Matrix with 95% Residual Bootstrap CI
Estimate Std. Error t value Pr(>|t|) btr.ci.95
(Intercept) 254.3380 42.1158 6.0390 0.0000 [ 0 , 254.338 ]
accelrate 9.1051 1.0478 8.6901 0.0000 [ 0 , 9.1051 ]
log(mpg) -48.8780 9.8048 -4.9851 0.0000 [ -48.878 , 0 ]

The residual bootstrap confidence intervals include 0, which means that these slopes are not statistically significant in the model; this does not agree with the results based on the p-values.

Combining All Inferential Statistics

kable(as.data.frame(cbind(formatC(cmtrx[,-3],4,format="f"), btc.ci.95=btc.ci,btr.ci.95=btr.ci)), 
      caption="Final Combined Inferential Statistics: p-values and Bootstrap CIs")
Final Combined Inferential Statistics: p-values and Bootstrap CIs
Estimate Std. Error Pr(>|t|) btc.ci.95 btr.ci.95
(Intercept) 254.3380 42.1158 0.0000 [ 72.4057 , 312.7024 ] [ 0 , 254.338 ]
accelrate 9.1051 1.0478 0.0000 [ -2.8153 , 3.1544 ] [ 0 , 9.1051 ]
log(mpg) -48.8780 9.8048 0.0000 [ -28.7835 , 29.2044 ] [ -48.878 , 0 ]

The bootstrap confidence intervals do not agree with the p-value results, which means that this model should not be used for prediction and estimation and that the final model has serious violations of the model assumptions.

kable(cbind(btc.wd, btr.wd), caption="width of the two bootstrap confidence intervals")
width of the two bootstrap confidence intervals
                 btc.wd       btr.wd
(Intercept)  240.296708   254.338039
accelrate      5.969673     9.105136
log(mpg)      57.987909    48.877987

The widths of the residual bootstrap and case bootstrap confidence intervals are not similar to each other, which indicates that the model assumptions are still violated.

Discussion

The main finding is that this model is not a good model for prediction or estimation. In addition, the inferential statistics did not agree with the bootstrap conclusions: the p-values were less than \(\alpha\), but the bootstrap confidence intervals all included zero, meaning that the variables did not contribute statistically significantly to the model.

If the final model were statistically significant, some simple algebra would be needed for interpretation because the response variable was transformed: any fitted or predicted value is on the square-root scale and must be squared to return to dollars. For example, a fitted value of 150 on the square-root scale corresponds to 150² = $22,500.
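
A minimal sketch of such a back-transformed prediction for a hypothetical new vehicle (the values 15 km/hour/second and 40 mpg are only for illustration):

new.car = data.frame(accelrate = 15, mpg = 40)               # hypothetical new vehicle
sqrt.pred = predict(sqrt.price.log.mpg, newdata = new.car)   # prediction on the square-root scale
sqrt.pred^2                                                  # square to return to 2013 dollars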

Overall, the final model should not be used for interpretation because it failed the bootstrap checks, which indicates major violations of the model assumptions. This shows that the bootstrap method is a great way to check your work rather than basing conclusions solely on p-values.