1 Data Analysis

The data set was obtained from the UFL website and compares the technological advancement of hybrid electric vehicles. The source does not say how the data were collected.

The vehicle variable is a character string naming the car that each row describes, and carid is an integer code for that vehicle. The year variable is also an integer and gives the model year of the car. The carclass variable is categorical and gives the type of car: C is compact, M is midsize, TS is two-seater, L is large, PT is pickup truck, MV is minivan, and SUV is sport utility vehicle. The carclass_id variable codes carclass from 1 to 7. The msrp variable is the manufacturer’s suggested retail price in 2013. The accelrate variable is the car’s acceleration rate in km/hour/second. The mpg variable is the fuel economy in miles per gallon, and mpgmpge is the maximum of the gas mileage and the electric mileage.

I want to know how each variable affects the price of the car and, mainly, how mpg affects the price. The data set has enough information to answer these questions.

car = read.csv("https://raw.githubusercontent.com/emmalaughin/sta321/main/data/hybrid_reg%20(1).csv")
head(car) 
##   carid         vehicle year     msrp accelrate   mpg mpgmpge carclass
## 1     1 Prius (1st Gen) 1997 24509.74      7.46 41.26   41.26        C
## 2     2            Tino 2000 35354.97      8.20 54.10   54.10        C
## 3     3 Prius (2nd Gen) 2000 26832.25      7.97 45.23   45.23        C
## 4     4         Insight 2000 18936.41      9.52 53.00   53.00       TS
## 5     5 Civic (1st Gen) 2001 25833.38      7.04 47.04   47.04        C
## 6     6         Insight 2001 19036.71      9.52 53.00   53.00       TS
##   carclass_id
## 1           1
## 2           1
## 3           1
## 4           7
## 5           1
## 6           7
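
A quick check of the variable types described above might look like the following sketch. To keep it runnable without downloading the file, it rebuilds only the six rows printed by head(car) by hand; the object names car_head and class_map are hypothetical and are not part of the original analysis.

```r
# Hypothetical object: the six rows printed by head(car), typed in by hand
car_head <- data.frame(
  carid       = 1:6,
  vehicle     = c("Prius (1st Gen)", "Tino", "Prius (2nd Gen)",
                  "Insight", "Civic (1st Gen)", "Insight"),
  year        = c(1997L, 2000L, 2000L, 2000L, 2001L, 2001L),
  msrp        = c(24509.74, 35354.97, 26832.25, 18936.41, 25833.38, 19036.71),
  accelrate   = c(7.46, 8.20, 7.97, 9.52, 7.04, 9.52),
  mpg         = c(41.26, 54.10, 45.23, 53.00, 47.04, 53.00),
  mpgmpge     = c(41.26, 54.10, 45.23, 53.00, 47.04, 53.00),
  carclass    = c("C", "C", "C", "TS", "C", "TS"),
  carclass_id = c(1L, 1L, 1L, 7L, 1L, 7L),
  stringsAsFactors = FALSE
)

str(car_head)   # storage type of each column

# Treat carclass as categorical and cross-tabulate it against carclass_id
car_head$carclass <- factor(car_head$carclass)
class_map <- table(car_head$carclass, car_head$carclass_id)
print(class_map)
```

In these six rows every C car has carclass_id 1 and every TS car has carclass_id 7, so each label pairs with a single code. Running the same table() call on the full car data frame would check this for all seven classes.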

2 Simple Linear Regression

msrp <- car$msrp # defining variables to use
mpg <- car$mpg
plot(mpg, msrp, pch = 21, col = "navy",
     main = "Relationship between MSRP and MPG") # creating scatterplot

The scatterplot shows a negative, roughly linear relationship between MSRP and mpg.

2.1 Linear Model

parametric.model <- lm(msrp ~ mpg) #creating the linear model
par(mfrow = c(2,2)) #arranging plots in a 2x2 layout
plot(parametric.model) #plotting the model

The model assumes that the explanatory variable and the response variable have a linear trend and that the errors are independent and normally distributed with mean 0 and constant variance $\sigma^2$. The residual plot and the standardized residual plot both show clusters. The normal Q-Q plot shows that the residuals are not normally distributed, because the points do not follow a straight line, and the Cook’s distance plot shows that there are no extreme outliers. Although the model assumptions are violated, we will not apply a transformation, because the sample size is large enough to use the bootstrap method, which is more reliable than the parametric inference here because it does not rely on the normality assumption.
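
The visual checks above can be backed up with numerical ones. The sketch below is only an illustration on simulated data with skewed errors (all names ending in _sim and the simulated values are assumptions, not the hybrid car data). It applies a Shapiro-Wilk test to the standardized residuals and counts the observations whose Cook’s distance exceeds 4/n, a common rule of thumb rather than a strict cutoff.

```r
# Simulated stand-in data (hypothetical values, NOT the hybrid car data)
set.seed(1)
n <- 150
mpg_sim  <- runif(n, min = 20, max = 60)
msrp_sim <- 75000 - 1000 * mpg_sim + rexp(n, rate = 1 / 8000)  # right-skewed errors

fit_sim <- lm(msrp_sim ~ mpg_sim)

# Normality check on the standardized residuals
res_std <- rstandard(fit_sim)
sw <- shapiro.test(res_std)   # a small p-value suggests non-normal residuals
print(sw$p.value)

# Influence check: Cook's distance for every observation
cd <- cooks.distance(fit_sim)
n_flagged <- sum(cd > 4 / n)  # 4/n is a rule-of-thumb threshold
print(n_flagged)
```

Replacing fit_sim with parametric.model would run the same checks on the model fitted above.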

library(knitr) # provides kable()
reg.table <- coef(summary(parametric.model))
kable(reg.table,
      caption = "Inferential statistics for the parametric linear regression model: MSRP and mpg") # inferential statistics for the model
Inferential statistics for the parametric linear regression model: MSRP and mpg

              Estimate    Std. Error   t value     Pr(>|t|)
(Intercept)   75448.200   4907.4791    15.374126   0
mpg           -1038.259   134.5413     -7.717028   0

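Since the slope estimate is negative, the MSRP decreases as the mpg increases: each additional mile per gallon is associated with an average decrease of about 1,038 dollars in MSRP. This makes sense because it is a study of hybrid electric cars.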
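
To make the slope concrete, the fitted line can be evaluated at a couple of mpg values. This is a small sketch that copies the two estimates from the regression table above; the helper predict_msrp is a hypothetical name introduced only for illustration.

```r
# Coefficient estimates copied from the regression table above
b0 <- 75448.200   # intercept
b1 <- -1038.259   # slope for mpg

# Hypothetical helper: fitted MSRP at a given mpg value
predict_msrp <- function(mpg_value) b0 + b1 * mpg_value

msrp_at_45 <- predict_msrp(45)                      # fitted MSRP for a 45 mpg car
diff_5_mpg <- predict_msrp(50) - predict_msrp(45)   # change for 5 more mpg

print(msrp_at_45)
print(diff_5_mpg)
```

Under this fit, a 45 mpg car has a fitted MSRP of roughly 28,727 dollars, and 5 additional mpg lowers the fitted MSRP by roughly 5,191 dollars. These are averages along the fitted line, not predictions for any single car.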

3 Bootstrapping Simple Linear Regression

B <- 1000                 # number of bootstrap samples
boot.beta0 <- numeric(B)  # storage for bootstrap intercepts
boot.beta1 <- numeric(B)  # storage for bootstrap slopes
## bootstrap regression models using for-loop
vec.id <- seq_along(msrp)   # vector of observation IDs
for(i in 1:B){              # bootstrap cases: resample rows with replacement
  boot.id <- sample(vec.id, length(msrp), replace = TRUE)   # resampled observation IDs
  boot.msrp <- msrp[boot.id]   # bootstrap msrp
  boot.mpg <- mpg[boot.id]     # bootstrap mpg

  boot.reg <- lm(boot.msrp ~ boot.mpg)   # bootstrap regression
  boot.beta0[i] <- coef(boot.reg)[1]   # bootstrap intercept
  boot.beta1[i] <- coef(boot.reg)[2]   # bootstrap slope
}

boot.beta0.ci <- quantile(boot.beta0, c(0.025, 0.975), type = 2) #bootstrap CI for intercept
boot.beta1.ci <- quantile(boot.beta1, c(0.025, 0.975), type = 2) #bootstrap CI for slope
boot.coef <- data.frame(rbind(boot.beta0.ci, boot.beta1.ci))  #creating a data frame of bootstrap CI
names(boot.coef) <- c("2.5%", "97.5%") 
kable(boot.coef, caption="Bootstrap confidence intervals of regression coefficients.") #creating table of bootstrap coefficients
Bootstrap confidence intervals of regression coefficients.

                2.5%        97.5%
boot.beta0.ci   64118.134   87642.5669
boot.beta1.ci   -1360.439   -752.3881

The bootstrap confidence interval for the slope is approximately (-1360.439, -752.3881), which does not include 0, so miles per gallon and market price are significantly associated. The endpoints change slightly from run to run because the resampling is random. The bootstrap confidence interval should be preferred over the p-value from the linear model here because the assumptions of the linear model were violated, so a nonparametric method is more appropriate.
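
One way to see how much the choice of method matters is to put the percentile bootstrap interval next to the parametric interval from confint(). The sketch below does this on simulated data (the variables x and y and their values are assumptions standing in for mpg and msrp, not the real data), and it fixes a seed so the bootstrap endpoints are reproducible.

```r
# Simulated stand-in data (hypothetical values, NOT the hybrid car data)
set.seed(2024)
n <- 150
x <- runif(n, min = 20, max = 60)
y <- 75000 - 1000 * x + rnorm(n, sd = 15000)

fit <- lm(y ~ x)

# Case-resampling bootstrap of the slope
B <- 1000
boot_slope <- replicate(B, {
  id <- sample(n, n, replace = TRUE)   # resample rows with replacement
  unname(coef(lm(y[id] ~ x[id]))[2])   # slope of the refitted model
})

ci_boot  <- quantile(boot_slope, c(0.025, 0.975), type = 2)  # percentile interval
ci_param <- confint(fit)["x", ]                              # t-based interval

print(rbind(bootstrap = ci_boot, parametric = ci_param))
```

When the model assumptions hold, as in this simulation, the two intervals should be close. A large gap between them on the real data would be another sign that the parametric inference is unreliable.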