HW-Q 1: A study was done on a diesel-powered light-duty pickup truck to see if humidity, air temperature, and barometric pressure influence emission of nitrous oxide (in ppm). Emission measurements were taken at different times, with varying experimental conditions. The data are given in NitrousOxide.xlsx

#install.packages('knitr')
library(knitr)
library(readxl)
NitrousOxide1 <- read_excel("/Users/saroo/Downloads/HW 4 MTH-420 FOLDER/NitrousOxide1.xlsx")

library(ISLR2)

y = NitrousOxide1$`Nitrous Oxide`
x1 = NitrousOxide1$Humidity
x2 = NitrousOxide1$Temp.
x3 = NitrousOxide1$Pressure

mult.lin.model1 = lm(y ~ x1 + x2 + x3)
  1. Find the estimates of the coefficients.
coef(mult.lin.model1)
##  (Intercept)           x1           x2           x3 
## -3.507778141 -0.002624991  0.000798941  0.154155030

The coefficient values are:

\[\beta_0 = -3.507778141\] \[\beta_1 = -0.002624991\] \[\beta_2 = 0.000798941\] \[\beta_3 = 0.154155030\]

  1. Fit this multiple linear regression model to the given data with the estimated coefficients.
#install.packages('equatiomatic')
library(equatiomatic)


extract_eq(mult.lin.model1, use_coefs = TRUE, coef_digits = 4)

\[ \operatorname{\widehat{y}} = -3.5078 - 0.0026(\operatorname{x1}) + 8e-04(\operatorname{x2}) + 0.1542(\operatorname{x3}) \]

#what beta goes with what variable

-0.0026 is the coefficient of Humidity (x1), 8e-04 is the coefficient of Temperature (x2), and 0.1542 is the coefficient of Pressure (x3).

  1. Estimate the amount of nitrous oxide emitted for the conditions where humidity is 50%, temperature is 76F, and barometric pressure is 29.30.
newdata1 <- data.frame(x1 = 50, x2 = 76, x3 = 29.30)
predict(mult.lin.model1, newdata1)
##         1 
## 0.9384342
#interpretation + unit

When humidity is 50%, temperature is 76 F, and barometric pressure is 29.30, the estimated nitrous oxide emission is about 0.938 ppm.
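As a check on the `predict()` result, the same estimate can be reproduced by hand from the coefficients printed by `coef()`. This is a minimal sketch; the names `b`, `x0`, and `y_hat` are introduced here for illustration and are not part of the original code.

```r
# Coefficients copied from the coef(mult.lin.model1) output above
b <- c(intercept = -3.507778141, humidity = -0.002624991,
       temp = 0.000798941, pressure = 0.154155030)

# Conditions from the question: humidity 50%, temperature 76 F, pressure 29.30
# (the leading 1 multiplies the intercept)
x0 <- c(1, 50, 76, 29.30)

# The fitted value is the sum of coefficient * predictor value
y_hat <- sum(b * x0)
print(round(y_hat, 4))
```

The result agrees with the `fit` value reported by `predict()` to the printed precision.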

  1. Compute the 95% Confidence interval for the estimated amount of nitrous oxide emitted based on the information given in the part (c) of this question.
predict(mult.lin.model1, newdata1, interval = 'confidence')
##         fit       lwr       upr
## 1 0.9384342 0.9080547 0.9688137
#unit lower bound and upper bound, we are 95% confident that is going to be between, confidence interval template

We are 95% confident that the mean amount of nitrous oxide emitted under the conditions in part (c) is between 0.908 ppm and 0.969 ppm.

  1. Compute the 95% Prediction interval for the estimated amount of nitrous oxide emitted based on the information given in the part (c) of this question.
predict(mult.lin.model1, newdata1, interval = 'prediction')
##         fit       lwr      upr
## 1 0.9384342 0.8155473 1.061321
#same template

We are 95% confident that a single new measurement of nitrous oxide emitted under the conditions in part (c) will fall between 0.816 ppm and 1.061 ppm.
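The prediction interval is wider than the confidence interval because it adds the variability of a single new observation to the uncertainty in the estimated mean. The sketch below compares the two half-widths using only the numbers printed above. It relies on the standard relation that the squared prediction half-width equals the squared confidence half-width plus (t × residual standard error) squared; the variable names are illustrative.

```r
# Values copied from the two predict() outputs above
fit      <- 0.9384342
conf_int <- c(lwr = 0.9080547, upr = 0.9688137)
pred_int <- c(lwr = 0.8155473, upr = 1.061321)

# Half-widths of each interval around the fitted value
ci_half <- unname(conf_int["upr"] - fit)
pi_half <- unname(pred_int["upr"] - fit)

# How many times wider the prediction interval is
ratio <- pi_half / ci_half

# Extra term contributed by a single new observation: t * residual std. error
t_times_s <- sqrt(pi_half^2 - ci_half^2)

print(c(ci_half = ci_half, pi_half = pi_half,
        ratio = ratio, t_times_s = t_times_s))
```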

  1. Test the following at the 0.05 level:
summary(mult.lin.model1)[[4]]
##                 Estimate   Std. Error    t value    Pr(>|t|)
## (Intercept) -3.507778141 3.0048640629 -1.1673667 0.260165545
## x1          -0.002624991 0.0006548626 -4.0084602 0.001013833
## x2           0.000798941 0.0020451430  0.3906529 0.701206155
## x3           0.154155030 0.1013674592  1.5207546 0.147836496

1). \(H_0 : \beta_1 = 0\) versus \(H_1 : \beta_1 \neq 0\)

Since the p-value = 0.001013833 is less than \(\alpha = 0.05\), we reject \(H_0\): the estimate \(\hat\beta_1 = -0.002624991\) is statistically significant in this model.

2). \(H_0 : \beta_2 = 0\) versus \(H_1 : \beta_2 \neq 0\)

Since the p-value = 0.701206155 is greater than \(\alpha = 0.05\), we fail to reject \(H_0\): the estimate \(\hat\beta_2 = 0.000798941\) is not statistically significant in this model. Failing to reject \(H_0\) does not show that the null hypothesis is true.

3). \(H_0 : \beta_3 = 0\) versus \(H_1 : \beta_3 \neq 0\)

Since the p-value = 0.147836496 is greater than \(\alpha = 0.05\), we fail to reject \(H_0\): the estimate \(\hat\beta_3 = 0.154155030\) is not statistically significant in this model.
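The three decisions above follow one rule: compute t = estimate / standard error and reject \(H_0\) when the p-value is below \(\alpha\). A minimal sketch of that rule, using the numbers copied from the coefficient table (the data frame `tab` and its column names are illustrative):

```r
# Values copied from the summary(mult.lin.model1) coefficient table above
tab <- data.frame(
  term     = c("x1 (Humidity)", "x2 (Temp.)", "x3 (Pressure)"),
  estimate = c(-0.002624991, 0.000798941, 0.154155030),
  std_err  = c(0.0006548626, 0.0020451430, 0.1013674592),
  p_value  = c(0.001013833, 0.701206155, 0.147836496)
)

alpha <- 0.05

# The t statistic for H0: beta_j = 0 is the estimate divided by its std. error
tab$t_value <- tab$estimate / tab$std_err

# Reject H0 only when the reported p-value is below alpha
tab$decision <- ifelse(tab$p_value < alpha, "reject H0", "fail to reject H0")

print(tab[, c("term", "t_value", "p_value", "decision")])
```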

  1. Make a 3d plot of your estimated model, show your plot on this assignment, and comment/discuss if there is anything unusual you can identify
library(plot3D)
library(rgl)
library(car)


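# A 3D scatterplot has room for the response and two predictors only,
# so pressure (x3) is left out of the plot.
datafor3Dplot <- data.frame(y, x1, x2)

# car::scatter3d() takes a formula with the response on the left and
# draws the least squares plane by default.
scatter3d(y ~ x1 + x2, data = datafor3Dplot)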
3d plot of the estimated model


For the NitrousOxide1 data, we fit a linear regression of nitrous oxide on humidity, temperature, and pressure. Some observations lie above the least squares regression plane and some lie below it. The pattern of the residuals suggests a pronounced non-linear relationship in the data. Among the positive residuals (the points visible above the surface), a few lie close to the plane, while most lie far from it. The negative residuals (some hidden below the surface) also tend to lie away from the plane.

HW-Q 2: A study was done to assess the cost effectiveness of driving a four-door sedan instead of a van or an SUV (sports utility vehicle). The continuous variables are odometer reading and octane of the gasoline used. The response variable is miles per gallon. The data are presented in the dataset CarMpg.xlsx.

  1. Find the estimates of the coefficients.
CarMpg <- read_excel("/Users/saroo/Downloads/HW 4 MTH-420 FOLDER/CarMpg.xlsx")


y_2 = CarMpg$MPG
x1_2 = CarMpg$Odometer
x2_2 = CarMpg$Octane

x3_2 = CarMpg$`Car Type`

mult.lin.model2 = lm(y_2 ~ x1_2 + x2_2  + x3_2)
summary(mult.lin.model2)
## 
## Call:
## lm(formula = y_2 ~ x1_2 + x2_2 + x3_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1936 -1.5244 -0.5187  1.4735  4.8702 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.097e+00  7.396e+00   0.419 0.679913    
## x1_2        -3.670e-05  1.670e-05  -2.198 0.039912 *  
## x2_2         3.704e-01  8.783e-02   4.218 0.000423 ***
## x3_2SUV     -1.298e+01  1.168e+00 -11.109 5.24e-10 ***
## x3_2van     -1.153e+01  1.124e+00 -10.257 2.06e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.282 on 20 degrees of freedom
## Multiple R-squared:  0.8903, Adjusted R-squared:  0.8683 
## F-statistic: 40.57 on 4 and 20 DF,  p-value: 2.506e-09
coef(mult.lin.model2)
##   (Intercept)          x1_2          x2_2       x3_2SUV       x3_2van 
##  3.096664e+00 -3.669540e-05  3.704375e-01 -1.297973e+01 -1.152520e+01

The coefficient values are:

\[\beta_0 = 3.096664e+00\] \[\beta_1 = -3.669540e-05\] \[\beta_2 = 3.704375e-01\] \[\beta_3 = -1.297973e+01\] \[\beta_4 = -1.152520e+01\]

  1. Fit this multiple linear regression model to the given data with the estimated coefficients.

extract_eq(mult.lin.model2, use_coefs = TRUE, coef_digits = 6)

\[ \operatorname{\widehat{y\_2}} = 3.096664 - 3.7e-05(\operatorname{x1\_2}) + 0.370438(\operatorname{x2\_2}) - 12.979725(\operatorname{x3\_2}_{\operatorname{SUV}}) - 11.525202(\operatorname{x3\_2}_{\operatorname{van}}) \]

-3.7e-05 is the coefficient of Odometer, 0.370438 is the coefficient of Octane, and -12.979725 and -11.525202 are the coefficients of the SUV and van indicators for Car Type. Sedan is the reference level, so it has no coefficient of its own.
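The terms `x3_2SUV` and `x3_2van` appear because `lm()` expands a categorical predictor into 0/1 indicator columns, with one level absorbed into the intercept. The sketch below shows that expansion on a hypothetical three-row example, not the CarMpg data. The levels are set explicitly so that sedan is the reference, matching the fitted model above.

```r
# Hypothetical example: one vehicle of each type, sedan as the reference level
car_type <- factor(c("sedan", "SUV", "van"),
                   levels = c("sedan", "SUV", "van"))

# model.matrix() shows the design columns lm() would build for this factor
X <- model.matrix(~ car_type)
print(X)
```

The sedan row has 0 in both indicator columns, so its prediction uses only the intercept and the continuous predictors.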

  1. Which type of vehicle appears to get the best gas mileage?
summary(mult.lin.model2)[4]
## $coefficients
##                  Estimate   Std. Error     t value     Pr(>|t|)
## (Intercept)  3.096664e+00 7.396288e+00   0.4186781 6.799133e-01
## x1_2        -3.669540e-05 1.669686e-05  -2.1977422 3.991175e-02
## x2_2         3.704375e-01 8.782700e-02   4.2178096 4.227447e-04
## x3_2SUV     -1.297973e+01 1.168346e+00 -11.1094863 5.243788e-10
## x3_2van     -1.152520e+01 1.123657e+00 -10.2568652 2.061827e-09

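For a sedan, both indicator variables x3_2SUV and x3_2van equal 0, so neither negative coefficient is subtracted from the predicted MPG. Holding odometer reading and octane fixed, an SUV is estimated to get about 12.98 fewer miles per gallon than a sedan, and a van about 11.53 fewer. The sedan therefore appears to get the best gas mileage, and both differences are statistically significant (p-values of 5.24e-10 and 2.06e-09).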
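To make the comparison concrete, the fitted equation can be evaluated for each vehicle type at the same odometer reading and octane. This is a sketch only: the odometer and octane values below are hypothetical inputs chosen for illustration and may not lie within the range of the CarMpg data.

```r
# Coefficients copied from the coef(mult.lin.model2) output above
b <- c(intercept = 3.096664, odometer = -3.669540e-05, octane = 0.3704375,
       suv = -12.97973, van = -11.52520)

# Hypothetical conditions, held equal across vehicle types
odometer <- 50000
octane   <- 89

# Sedan is the reference level: both indicators are 0
base <- b[["intercept"]] + b[["odometer"]] * odometer + b[["octane"]] * octane

# SUV and van shift the sedan prediction by their own coefficients
mpg_hat <- c(sedan = base, SUV = base + b[["suv"]], van = base + b[["van"]])
print(round(mpg_hat, 2))

# Vehicle type with the highest predicted MPG
best <- names(which.max(mpg_hat))
print(best)
```

Because the vehicle-type terms are additive, the ranking of the three types is the same for any odometer and octane values.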

  1. Discuss the difference between a van and an SUV in terms of gas mileage. While you are doing this, make sure you present your reasoning for your answers with evidence.

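The coefficients for SUV and van are both negative, which means that either vehicle type lowers the average MPG relative to a sedan: by about 12.98 mpg for an SUV and about 11.53 mpg for a van, holding odometer reading and octane fixed. Both coefficients have p-values of almost 0, so both differences from the sedan are statistically significant. The van is therefore estimated to get about 1.45 mpg more than the SUV, but the p-values above compare each type with the sedan only and do not by themselves test van against SUV.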
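The estimated gap between a van and an SUV is the difference between their two coefficients, since both are measured against the same sedan baseline. The sketch below computes that gap from the reported values. A formal test of the gap would also need the covariance between the two estimates (available from `vcov(mult.lin.model2)`), which is not printed in this report, so no test is attempted here.

```r
# Estimates and standard errors copied from the coefficient table above
b_suv  <- -12.979725
b_van  <- -11.525202
se_suv <- 1.168346
se_van <- 1.123657

# Both coefficients are offsets from the sedan baseline,
# so their difference is the estimated van-minus-SUV gap in MPG
van_minus_suv <- b_van - b_suv

print(c(van_minus_suv = van_minus_suv, se_suv = se_suv, se_van = se_van))
```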

HW-Q 3: Leg strength is a necessary characteristic of a successful punter in American football. One measure of the quality of a good punt is the "hang time." This is the time that the ball hangs in the air before being caught by the punt returner. To determine what leg strength factors influence hang time and to develop an empirical model for predicting this response, a study on The Relationship Between Selected Physical Performance Variables and Football Punting Ability was conducted by the Department of Health, Physical Education, and Recreation at Virginia Tech. Thirteen punters were chosen for the experiment, and each punted a football 10 times. The average hang times, along with the strength measures used in the analysis, were recorded in the data set Punter.

• Determine the most appropriate model for predicting hang time. (Note: while doing so, follow all the regression routines you know so far from the textbook, and present your evidence for each vital step of the process.)

Punter <- read_excel("/Users/saroo/Downloads/HW 4 MTH-420 FOLDER/Punter.xlsx")
## New names:
## * `` -> ...8
## * `` -> ...9
## * `` -> ...10
## * `` -> ...11
y_3 = Punter$`Hang Time`

# Average the leg strength and hamstring flexibility measurements across the left and right legs for each observation.

average_leg_strength = rowMeans(data.frame(Punter$RLS, Punter$LLS), na.rm = TRUE)

average_leg_strength
##  [1] 170 135 175 160 160 150 175 110 115 125 130 135 155
average_hamstring_flexibility = rowMeans(data.frame(Punter$RHF, Punter$LHF), na.rm = TRUE)

average_hamstring_flexibility
##  [1] 106  92  93 103 104 101 108  86  90  85  89  92  95
x1_3 = average_leg_strength
x2_3 = average_hamstring_flexibility
x3_3 = Punter$Power

multi.lin.model3 = lm(y_3 ~ x1_3 + x2_3 + x3_3)
summary(multi.lin.model3)
## 
## Call:
## lm(formula = y_3 ~ x1_3 + x2_3 + x3_3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31035 -0.06459 -0.01722  0.06253  0.32218 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 0.406207   0.898020   0.452   0.6617  
## x1_3        0.011528   0.004695   2.455   0.0364 *
## x2_3        0.012612   0.015094   0.836   0.4250  
## x3_3        0.003197   0.001746   1.831   0.1003  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.209 on 9 degrees of freedom
## Multiple R-squared:  0.8629, Adjusted R-squared:  0.8171 
## F-statistic: 18.87 on 3 and 9 DF,  p-value: 0.0003194
## Note: x2_3 and x3_3 (power) are not statistically significant in this fit (p > 0.05), although R-squared is fairly close to 1.

As the summary above shows, average_hamstring_flexibility (p = 0.4250) and Power (p = 0.1003) are not statistically significant at the 0.05 level. Although R-squared is an acceptable 0.8629, a model in which two of the three predictors are insignificant is not a good final choice, so we look for a simpler one.

We therefore refit the model without hamstring flexibility, the least significant predictor, keeping only average leg strength and power.

multi.lin.model4 = lm(y_3 ~ average_leg_strength + x3_3)
summary(multi.lin.model4)
## 
## Call:
## lm(formula = y_3 ~ average_leg_strength + x3_3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37445 -0.07033 -0.00606  0.05232  0.31113 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          1.077931   0.394084   2.735  0.02100 * 
## average_leg_strength 0.014295   0.003278   4.360  0.00142 **
## x3_3                 0.003869   0.001526   2.536  0.02959 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2059 on 10 degrees of freedom
## Multiple R-squared:  0.8522, Adjusted R-squared:  0.8227 
## F-statistic: 28.83 on 2 and 10 DF,  p-value: 7.05e-05

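This model, on the other hand, has an acceptable R-squared of 0.8522 and a slightly higher adjusted R-squared than the three-predictor model (0.8227 versus 0.8171), and the coefficients of both predictors are statistically significant at the 0.05 level.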
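Adjusted R-squared is the fairer yardstick when comparing models with different numbers of predictors, because it penalizes each added term. The sketch below recomputes it from the reported R-squared values using the standard formula, with n = 13 punters. The helper `adj_r2` is defined here for illustration and is not part of the original analysis.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 13  # thirteen punters

# Three-predictor model: leg strength, hamstring flexibility, power
adj_full <- adj_r2(0.8629, n, p = 3)

# Two-predictor model: leg strength and power
adj_reduced <- adj_r2(0.8522, n, p = 2)

print(round(c(full = adj_full, reduced = adj_reduced), 4))
```

Small differences from the values printed by `summary()` come from using R-squared rounded to four decimals.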

par(mfrow = c(2,2))
plot(multi.lin.model4, pch = 20)

datafor3Dplot2 <- data.frame(y_3, average_leg_strength, x3_3)

# Plot hang time against the two predictors kept in multi.lin.model4,
# using the data frame built on the line above.
scatter3d(y_3 ~ average_leg_strength + x3_3, data = datafor3Dplot2)
knitr::include_graphics("/Users/saroo/Downloads/HW 4 MTH-420 FOLDER/Screen Shot 2022-04-05 at 21.04.41.png")