The data were obtained from the UFL website and compare the technological advancement of hybrid electric vehicles. The source does not say how the data were collected. The vehicle variable names the car that each row describes, and carid is an integer code for that car, so vehicle is a character variable and carid is an integer. The year variable is also an integer and gives the model year of the car. The carclass variable is categorical and gives the type of car: C is compact, M is midsize, TS is two-seater, L is large, PT is pickup truck, MV is minivan, and SUV is sport utility vehicle. The carclass_id variable codes carclass from 1 to 7. The msrp variable is the manufacturer’s suggested retail price in 2013. The accelrate variable is the car’s acceleration rate in km/hour/second. The mpg variable is the fuel economy in miles per gallon, and the mpgmpge variable is the maximum of the electric miles per gallon equivalent and the gasoline miles per gallon. I want to know how each variable affects the price of the car, and mainly how the mpg variable affects the price. The data set has enough information to address these questions.
car = read.csv("https://raw.githubusercontent.com/emmalaughin/sta321/main/data/hybrid_reg%20(1).csv")
head(car)
## carid vehicle year msrp accelrate mpg mpgmpge carclass
## 1 1 Prius (1st Gen) 1997 24509.74 7.46 41.26 41.26 C
## 2 2 Tino 2000 35354.97 8.20 54.10 54.10 C
## 3 3 Prius (2nd Gen) 2000 26832.25 7.97 45.23 45.23 C
## 4 4 Insight 2000 18936.41 9.52 53.00 53.00 TS
## 5 5 Civic (1st Gen) 2001 25833.38 7.04 47.04 47.04 C
## 6 6 Insight 2001 19036.71 9.52 53.00 53.00 TS
## carclass_id
## 1 1
## 2 1
## 3 1
## 4 7
## 5 1
## 6 7
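The variable descriptions can be checked against the six rows printed above. The following is only a minimal sketch: `car6` is a hypothetical name for a small data frame typed in by hand from that output, not the full data set, and it shows just two of the seven car classes.

```r
# car6 is a hypothetical hand-typed copy of the head(car) output, not the full data
car6 <- data.frame(
  carid       = 1:6,
  vehicle     = c("Prius (1st Gen)", "Tino", "Prius (2nd Gen)",
                  "Insight", "Civic (1st Gen)", "Insight"),
  year        = c(1997L, 2000L, 2000L, 2000L, 2001L, 2001L),
  msrp        = c(24509.74, 35354.97, 26832.25, 18936.41, 25833.38, 19036.71),
  accelrate   = c(7.46, 8.20, 7.97, 9.52, 7.04, 9.52),
  mpg         = c(41.26, 54.10, 45.23, 53.00, 47.04, 53.00),
  mpgmpge     = c(41.26, 54.10, 45.23, 53.00, 47.04, 53.00),
  carclass    = c("C", "C", "C", "TS", "C", "TS"),
  carclass_id = c(1L, 1L, 1L, 7L, 1L, 7L),
  stringsAsFactors = FALSE
)

car6$carclass <- factor(car6$carclass)     # treat car class as categorical
str(car6)                                  # storage type of each column
table(car6$carclass, car6$carclass_id)     # each class label pairs with one id
all(car6$mpgmpge >= car6$mpg)              # mpgmpge is never below mpg in these rows
```

The same calls can be run on `car` itself to check all seven classes.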
msrp <- car$msrp # defining variables to use
mpg <- car$mpg
plot(mpg, msrp, pch = 21, col = "navy",
main = "Relationship between MSRP and MPG") # creating scatterplot
The scatterplot shows a negative linear relationship between the two variables.
parametric.model <- lm(msrp ~ mpg) #creating the linear model
par(mfrow = c(2,2)) #arranging plots in a 2x2 layout
plot(parametric.model) #plotting the model
The model assumes that the explanatory variable and the response variable have a linear trend, and that the errors are normally distributed with mean 0 and constant variance $\sigma^2$. The residual plot and the standardized residual plot both show clusters. The normal Q-Q plot shows that the residuals are not normally distributed because the points do not follow a straight line, and the Cook’s distance plot shows that there are no extreme outliers. Although the model assumptions are violated, we will not apply a transformation because the sample size is large enough to use the bootstrap method, which is more reliable than the parametric linear model here because it is non-parametric.
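For reference, the model described above can be written out as follows. The symbols $\beta_0$, $\beta_1$, $\varepsilon_i$ and $n$ are notation assumed here for the intercept, the slope, the error term and the sample size; they do not appear in the original code.

```latex
\[
\text{msrp}_i = \beta_0 + \beta_1\,\text{mpg}_i + \varepsilon_i,
\qquad
\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2),
\quad i = 1, \dots, n
\]
```

The linearity assumption concerns the mean part $\beta_0 + \beta_1\,\text{mpg}_i$, while the normality and constant-variance assumptions concern $\varepsilon_i$. The diagnostic plots examine these through the residuals.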
library(knitr) # provides kable()
reg.table <- coef(summary(parametric.model))
kable(reg.table, caption = "Inferential statistics for the parametric linear regression model: MSRP and mpg") # inferential statistics for the model
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 75448.200 | 4907.4791 | 15.374126 | 0 |
| mpg | -1038.259 | 134.5413 | -7.717028 | 0 |
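Since the slope estimate is negative, the market price decreases as the mpg increases: each additional mile per gallon is associated with an estimated decrease of about 1,038 dollars in MSRP. This makes sense because it is a study of hybrid electric cars.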
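As a worked check of the regression table, the short sketch below plugs the reported estimates into the fitted line. The function name `fitted_msrp` and the mpg values 40 and 50 are illustrative choices, not part of the original analysis.

```r
# Estimates and standard error copied from the regression table
b0    <- 75448.200   # intercept
b1    <- -1038.259   # slope for mpg
se.b1 <- 134.5413    # standard error of the slope

# fitted_msrp is a hypothetical helper: the fitted line evaluated at a given mpg
fitted_msrp <- function(mpg) b0 + b1 * mpg

fitted_msrp(40)                      # 75448.200 - 1038.259 * 40 = 33917.84
fitted_msrp(50) - fitted_msrp(40)    # ten extra mpg changes the fitted MSRP by 10 * b1
b1 / se.b1                           # reproduces the t value in the table (about -7.717)
```

The difference between any two fitted values depends only on the slope, which is why the slope is the quantity of interest for the mpg question.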
B <- 1000 # amount of repeated bootstrap sampling
boot.beta0 <- numeric(B) # vector to store bootstrap intercepts
boot.beta1 <- numeric(B) # vector to store bootstrap slopes
## bootstrap regression models using for-loop
vec.id <- 1:length(msrp) # vector of observation ID
for(i in 1:B){ #creating the for loop to bootstrap cases
boot.id <- sample(vec.id, length(msrp), replace = TRUE) #sampling vector
boot.msrp <- msrp[boot.id] # bootstrap msrp
boot.mpg <- mpg[boot.id] # bootstrap mpg
boot.reg <- lm(boot.msrp ~ boot.mpg) # bootstrap regression
boot.beta0[i] <- coef(boot.reg)[1] # bootstrap intercept
boot.beta1[i] <- coef(boot.reg)[2] # bootstrap slope
}
boot.beta0.ci <- quantile(boot.beta0, c(0.025, 0.975), type = 2) #bootstrap CI for intercept
boot.beta1.ci <- quantile(boot.beta1, c(0.025, 0.975), type = 2) #bootstrap CI for slope
boot.coef <- data.frame(rbind(boot.beta0.ci, boot.beta1.ci)) #creating a data frame of bootstrap CI
names(boot.coef) <- c("2.5%", "97.5%")
kable(boot.coef, caption="Bootstrap confidence intervals of regression coefficients.") #creating table of bootstrap coefficients
| | 2.5% | 97.5% |
|---|---|---|
| boot.beta0.ci | 64118.134 | 87642.5669 |
| boot.beta1.ci | -1360.439 | -752.3881 |
The bootstrap confidence interval for the slope is (-1360.439, -752.3881), which does not include 0, so miles per gallon and market price are significantly associated. The bootstrap confidence interval should be preferred over the p-value from the linear model here because the assumptions of the linear model were violated, so a non-parametric method is more appropriate.
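Because the bootstrap resamples at random, its limits change slightly from run to run unless a seed is fixed. The sketch below wraps the case-resampling loop in a function and fixes a seed so that the interval is reproducible. It runs on simulated data, not the hybrid data: `sim.mpg`, `sim.msrp`, the seed value and the helper `boot_slope_ci` are all hypothetical names and values chosen for illustration.

```r
set.seed(321)  # hypothetical seed; any fixed value makes the run reproducible

# Simulated stand-in data (NOT the hybrid data set)
n        <- 150
sim.mpg  <- runif(n, 20, 60)
sim.msrp <- 75000 - 1000 * sim.mpg + rnorm(n, sd = 8000)

# Percentile bootstrap CI for the slope, resampling (x, y) pairs
boot_slope_ci <- function(x, y, B = 1000, level = 0.95) {
  n <- length(y)
  slopes <- vapply(seq_len(B), function(i) {
    id <- sample.int(n, n, replace = TRUE)   # resample row indices
    xb <- x[id]
    yb <- y[id]
    unname(coef(lm(yb ~ xb))[2])             # slope of the bootstrap fit
  }, numeric(1))
  alpha <- 1 - level
  quantile(slopes, c(alpha / 2, 1 - alpha / 2), type = 2)
}

slope.ci <- boot_slope_ci(sim.mpg, sim.msrp)
param.ci <- confint(lm(sim.msrp ~ sim.mpg))["sim.mpg", ]  # normal-theory interval

slope.ci
param.ci
```

Printing the two intervals side by side shows how far the bootstrap limits are from the normal-theory limits. On the real data, `mpg` and `msrp` would replace the simulated vectors.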