#install.packages('knitr')
library(knitr)
library(readxl)
NitrousOxide1 <- read_excel("/Users/saroo/Downloads/HW 4 MTH-420 FOLDER/NitrousOxide1.xlsx")
library(ISLR2)
y = NitrousOxide1$`Nitrous Oxide`
x1 = NitrousOxide1$Humidity
x2 = NitrousOxide1$Temp.
x3 = NitrousOxide1$Pressure
mult.lin.model1 = lm(y ~ x1 + x2 + x3)
coef(mult.lin.model1)
## (Intercept) x1 x2 x3
## -3.507778141 -0.002624991 0.000798941 0.154155030
The coefficient values are:
\[\beta_0 = -3.507778141\] \[\beta_1 = -0.002624991\] \[\beta_2 = 0.000798941\] \[\beta_3 = 0.154155030\]
#install.packages('equatiomatic')
library(equatiomatic)
extract_eq(mult.lin.model1, use_coefs = TRUE, coef_digits = 4)
\[ \operatorname{\widehat{y}} = -3.5078 - 0.0026(\operatorname{x1}) + 8e-04(\operatorname{x2}) + 0.1542(\operatorname{x3}) \]
#what beta goes with what variable
-0.0026 is the coefficient of Humidity (x1), 8e-04 is the coefficient of Temperature (x2), and 0.1542 is the coefficient of Pressure (x3).
newdata1 <- data.frame(x1 = 50, x2 = 76, x3 = 29.30)
predict(mult.lin.model1, newdata1)
## 1
## 0.9384342
#interpretation + unit
When humidity is at 50%, temperature is at 76 F, and barometric pressure is at 29.30, Nitrous Oxide is estimated to be at 0.938 ppm.
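The point prediction can be checked by hand by plugging the new values into the fitted equation. This is a minimal sketch using the coefficients printed by `coef(mult.lin.model1)`; the names `b`, `x_new`, and `y_hat` are illustrative and do not appear in the original analysis.

```r
# Coefficients copied from the coef(mult.lin.model1) output above
b <- c(intercept = -3.507778141,
       humidity  = -0.002624991,
       temp      =  0.000798941,
       pressure  =  0.154155030)

# New observation: a leading 1 for the intercept, then humidity, temperature, pressure
x_new <- c(1, 50, 76, 29.30)

# Fitted value = sum of coefficient * predictor value
y_hat <- sum(b * x_new)
round(y_hat, 4)
```

The result agrees with the `predict()` output of 0.9384 up to rounding of the printed coefficients.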
predict(mult.lin.model1, newdata1, interval = 'confidence')
## fit lwr upr
## 1 0.9384342 0.9080547 0.9688137
#unit lower bound and upper bound, we are 95% confident that is going to be between, confidence interval template
We are 95% confident that the mean amount of nitrous oxide emitted under these conditions is between 0.908 ppm and 0.969 ppm.
predict(mult.lin.model1, newdata1, interval = 'predict')
## fit lwr upr
## 1 0.9384342 0.8155473 1.061321
#same template
We are 95% confident that the amount of nitrous oxide emitted for a single new observation under these conditions will be between 0.816 ppm and 1.061 ppm.
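The prediction interval is wider than the confidence interval because it adds the residual variance of a single observation to the uncertainty in the estimated mean. The sketch below illustrates this on simulated data, since the NitrousOxide1 file is not reproduced here; the data frame `sim`, its coefficients, and the noise level are assumptions made purely for illustration.

```r
set.seed(1)

# Simulated stand-in data (hypothetical, not the NitrousOxide1 values)
n <- 20
sim <- data.frame(x1 = runif(n, 30, 90), x2 = runif(n, 60, 90))
sim$y <- 1 - 0.003 * sim$x1 + 0.001 * sim$x2 + rnorm(n, sd = 0.05)

fit <- lm(y ~ x1 + x2, data = sim)
new_obs <- data.frame(x1 = 50, x2 = 76)

# Pieces needed to build both intervals by hand
p <- predict(fit, new_obs, se.fit = TRUE)
t_crit <- qt(0.975, df = p$df)

# Confidence interval: uncertainty in the mean response only
ci_manual <- p$fit + c(-1, 1) * t_crit * p$se.fit

# Prediction interval: mean uncertainty plus residual variance
pi_manual <- p$fit + c(-1, 1) * t_crit * sqrt(p$se.fit^2 + p$residual.scale^2)

# Intervals as returned by predict()
ci_r <- predict(fit, new_obs, interval = "confidence")
pi_r <- predict(fit, new_obs, interval = "prediction")

rbind(confidence = ci_manual, prediction = pi_manual)
```

The hand-built intervals match those returned by `predict()`, and the prediction interval always contains the confidence interval.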
summary(mult.lin.model1)[[4]]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.507778141 3.0048640629 -1.1673667 0.260165545
## x1 -0.002624991 0.0006548626 -4.0084602 0.001013833
## x2 0.000798941 0.0020451430 0.3906529 0.701206155
## x3 0.154155030 0.1013674592 1.5207546 0.147836496
1). \(H_0 : \beta_1 = 0\) versus \(H_1 : \beta_1 \neq 0\)
Since the p-value = 0.001013833 is less than \(\alpha = 0.05\), we reject \(H_0\): the estimated \(\beta_1 = -0.002624991\) is statistically significant in this model.
2). \(H_0 : \beta_2 = 0\) versus \(H_1 : \beta_2 \neq 0\)
Since the p-value = 0.701206155 is greater than \(\alpha = 0.05\), we fail to reject \(H_0\): the estimated \(\beta_2 = 0.000798941\) is not statistically significant in this model.
3). \(H_0 : \beta_3 = 0\) versus \(H_1 : \beta_3 \neq 0\)
Since the p-value = 0.147836496 is greater than \(\alpha = 0.05\), we fail to reject \(H_0\): the estimated \(\beta_3 = 0.154155030\) is not statistically significant in this model.
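Each test statistic in the coefficient table is the estimate divided by its standard error, and the decision follows from comparing the p-value with \(\alpha\). A minimal sketch using the numbers reported above; the vector names `est`, `se`, `p_value`, and `decision` are illustrative.

```r
# Values copied from the summary(mult.lin.model1)[[4]] table above
est     <- c(x1 = -0.002624991, x2 = 0.000798941, x3 = 0.154155030)
se      <- c(x1 = 0.0006548626, x2 = 0.0020451430, x3 = 0.1013674592)
p_value <- c(x1 = 0.001013833, x2 = 0.701206155, x3 = 0.147836496)

# t statistic = estimate / standard error
t_stat <- est / se

# Decision rule at the 5% significance level
alpha <- 0.05
decision <- ifelse(p_value < alpha, "reject H0", "fail to reject H0")

data.frame(t_stat, p_value, decision)
```

Only the humidity coefficient (x1) leads to rejecting \(H_0\); failing to reject for x2 and x3 does not show that their true coefficients are zero.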
library(plot3D)
library(rgl)
library(car)
datafor3Dplot <- data.frame(y,x1,x2)
# car::scatter3d() takes one response and two predictors; the formula form is response ~ predictor1 + predictor2
scatter3d(y ~ x1 + x2, data = datafor3Dplot,
point.col = "blue", surface.col = "red")
3d plot of the estimated model
For the NitrousOxide1 data, a linear regression was fit to nitrous oxide using humidity, temperature and pressure as predictors. Some observations lie above and some lie below the least squares regression plane. The pattern of the residuals suggests a non-linear relationship in the data: among the positive residuals (those visible above the surface), some lie close to the plane while most are far from it, and the negative residuals (some of them not visible) also tend to lie away from the plane.
CarMpg <- read_excel("/Users/saroo/Downloads/HW 4 MTH-420 FOLDER/CarMpg.xlsx")
y_2 = CarMpg$MPG
x1_2 = CarMpg$Odometer
x2_2 = CarMpg$Octane
x3_2 = CarMpg$`Car Type`
mult.lin.model2 = lm(y_2 ~ x1_2 + x2_2 + x3_2)
summary(mult.lin.model2)
##
## Call:
## lm(formula = y_2 ~ x1_2 + x2_2 + x3_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1936 -1.5244 -0.5187 1.4735 4.8702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.097e+00 7.396e+00 0.419 0.679913
## x1_2 -3.670e-05 1.670e-05 -2.198 0.039912 *
## x2_2 3.704e-01 8.783e-02 4.218 0.000423 ***
## x3_2SUV -1.298e+01 1.168e+00 -11.109 5.24e-10 ***
## x3_2van -1.153e+01 1.124e+00 -10.257 2.06e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.282 on 20 degrees of freedom
## Multiple R-squared: 0.8903, Adjusted R-squared: 0.8683
## F-statistic: 40.57 on 4 and 20 DF, p-value: 2.506e-09
coef(mult.lin.model2)
## (Intercept) x1_2 x2_2 x3_2SUV x3_2van
## 3.096664e+00 -3.669540e-05 3.704375e-01 -1.297973e+01 -1.152520e+01
The coefficient values are:
\[\beta_0 = 3.096664\] \[\beta_1 = -3.669540 \times 10^{-5}\] \[\beta_2 = 0.3704375\] \[\beta_3 = -12.97973\] \[\beta_4 = -11.52520\]
(b) Fit this multiple linear regression model to the given data with the estimated coefficients.
extract_eq(mult.lin.model2, use_coefs = TRUE, coef_digits = 6)
\[ \operatorname{\widehat{y\_2}} = 3.096664 - 3.7e-05(\operatorname{x1\_2}) + 0.370438(\operatorname{x2\_2}) - 12.979725(\operatorname{x3\_2}_{\operatorname{SUV}}) - 11.525202(\operatorname{x3\_2}_{\operatorname{van}}) \]
-3.7e-05 is the coefficient of Odometer, 0.370438 is the coefficient of Octane, -12.979725 is the coefficient of the SUV indicator, and -11.525202 is the coefficient of the van indicator for Car Type.
summary(mult.lin.model2)[4]
## $coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.096664e+00 7.396288e+00 0.4186781 6.799133e-01
## x1_2 -3.669540e-05 1.669686e-05 -2.1977422 3.991175e-02
## x2_2 3.704375e-01 8.782700e-02 4.2178096 4.227447e-04
## x3_2SUV -1.297973e+01 1.168346e+00 -11.1094863 5.243788e-10
## x3_2van -1.152520e+01 1.123657e+00 -10.2568652 2.061827e-09
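The p-values in this table come from a two-sided t test with the residual degrees of freedom reported in the model summary (20). A minimal sketch reproducing the odometer row; the names `est`, `se`, `t_stat`, and `p_val` are illustrative.

```r
# Odometer row copied from the coefficient table above
est <- -3.669540e-05
se  <-  1.669686e-05
df  <- 20   # residual degrees of freedom from summary(mult.lin.model2)

# t statistic and two-sided p-value
t_stat <- est / se
p_val  <- 2 * pt(-abs(t_stat), df)

c(t = t_stat, p = p_val)
```

Both values agree with the table (t = -2.1977, p = 0.0399) up to rounding of the printed inputs.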
Sedan is the baseline level of car type: if the vehicle is a sedan, both indicator variables x3_2SUV and x3_2van equal 0, so neither negative coefficient contributes to the predicted MPG.
The coefficients for SUV and van are both negative. Holding odometer and octane fixed, an SUV is estimated to average about 12.98 MPG less than a sedan, and a van about 11.53 MPG less. Both coefficients have p-values close to 0 (5.24e-10 and 2.06e-09), so these differences from the sedan are statistically significant.
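The role of the indicator variables can be sketched by evaluating the fitted equation for each car type at the same odometer and octane values. The helper `fitted_mpg` and the values 30000 and 89 are hypothetical choices for illustration; any car type other than "SUV" or "van" is treated as the baseline.

```r
# Coefficients copied from coef(mult.lin.model2) above
b <- c(intercept = 3.096664, odometer = -3.669540e-05, octane = 0.3704375,
       suv = -12.97973, van = -11.52520)

# Hypothetical helper: fitted MPG for one vehicle
fitted_mpg <- function(odometer, octane, type) {
  unname(b["intercept"] + b["odometer"] * odometer + b["octane"] * octane +
           b["suv"] * (type == "SUV") + b["van"] * (type == "van"))
}

# Same (hypothetical) odometer and octane for all three types
odo <- 30000
oct <- 89
sedan <- fitted_mpg(odo, oct, "sedan")
suv   <- fitted_mpg(odo, oct, "SUV")
van   <- fitted_mpg(odo, oct, "van")

# Differences from the baseline equal the indicator coefficients
c(suv_minus_sedan = suv - sedan, van_minus_sedan = van - sedan)
```

Whatever odometer and octane values are chosen, the gap between an SUV or van and a sedan stays equal to the corresponding coefficient, because the model contains no interaction terms.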
• Determine the most appropriate model for predicting hang time. (Note: while you are doing it, make sure you follow every regression routine you know so far from the textbook, and present your evidence for each vital step of the process.)
Punter <- read_excel("/Users/saroo/Downloads/HW 4 MTH-420 FOLDER/Punter.xlsx")
## New names:
## * `` -> ...8
## * `` -> ...9
## * `` -> ...10
## * `` -> ...11
y_3 = Punter$`Hang Time`
# Average the leg strength and hamstring flexibility measurements over the left and right legs for each observation.
average_leg_strength = rowMeans(data.frame(Punter$RLS, Punter$LLS), na.rm = TRUE)
average_leg_strength
## [1] 170 135 175 160 160 150 175 110 115 125 130 135 155
average_hamstring_flexibility = rowMeans(data.frame(Punter$RHF, Punter$LHF), na.rm = TRUE)
average_hamstring_flexibility
## [1] 106 92 93 103 104 101 108 86 90 85 89 92 95
x1_3 = average_leg_strength
x2_3 = average_hamstring_flexibility
x3_3 = Punter$Power
multi.lin.model3 = lm(y_3 ~ x1_3 + x2_3 + x3_3)
summary(multi.lin.model3)
##
## Call:
## lm(formula = y_3 ~ x1_3 + x2_3 + x3_3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31035 -0.06459 -0.01722 0.06253 0.32218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.406207 0.898020 0.452 0.6617
## x1_3 0.011528 0.004695 2.455 0.0364 *
## x2_3 0.012612 0.015094 0.836 0.4250
## x3_3 0.003197 0.001746 1.831 0.1003
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.209 on 9 degrees of freedom
## Multiple R-squared: 0.8629, Adjusted R-squared: 0.8171
## F-statistic: 18.87 on 3 and 9 DF, p-value: 0.0003194
# x2_3 and x3_3 have p > 0.05, so they are not statistically significant in this model; R-squared should also stay close to 1
As the summary above shows, average_hamstring_flexibility (p = 0.4250) and Punter$Power (p = 0.1003) are not statistically significant in this model, because their p-values exceed 0.05. Although R-squared is acceptable at 0.8629, a model in which two of the three predictors are insignificant is not the most appropriate one.
We therefore drop hamstring flexibility, the least significant predictor, and refit the model using only average leg strength and power as predictors.
multi.lin.model4 = lm(y_3 ~ average_leg_strength + x3_3)
summary(multi.lin.model4)
##
## Call:
## lm(formula = y_3 ~ average_leg_strength + x3_3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37445 -0.07033 -0.00606 0.05232 0.31113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.077931 0.394084 2.735 0.02100 *
## average_leg_strength 0.014295 0.003278 4.360 0.00142 **
## x3_3 0.003869 0.001526 2.536 0.02959 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2059 on 10 degrees of freedom
## Multiple R-squared: 0.8522, Adjusted R-squared: 0.8227
## F-statistic: 28.83 on 2 and 10 DF, p-value: 7.05e-05
This model, on the other hand, has an acceptable R-squared value (0.8522), a slightly higher adjusted R-squared (0.8227 versus 0.8171), and statistically significant coefficients for both predictor variables.
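Adjusted R-squared penalizes the number of predictors, which is why it is the fairer measure for comparing the three-predictor and two-predictor models. A minimal sketch using the reported R-squared values and the n = 13 punters in the data; the function name `adj_r2` is illustrative.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors k
adj_r2 <- function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

n <- 13  # number of observations in the Punter data

adj_three <- adj_r2(0.8629, n, k = 3)  # leg strength, hamstring flexibility, power
adj_two   <- adj_r2(0.8522, n, k = 2)  # leg strength, power

round(c(three_predictors = adj_three, two_predictors = adj_two), 3)
```

The values agree with the reported 0.8171 and 0.8227 up to rounding of the printed R-squared, and the two-predictor model comes out higher even though its plain R-squared is lower.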
par(mfrow = c(2,2))
plot(multi.lin.model4, pch = 20)
datafor3Dplot2 <- data.frame(y_3,average_leg_strength,x3_3)
# Use the data frame built on the previous line; formula form is response ~ predictor1 + predictor2
scatter3d(y_3 ~ average_leg_strength + x3_3, data = datafor3Dplot2,
point.col = "blue", surface.col = "red")
knitr::include_graphics("/Users/saroo/Downloads/HW 4 MTH-420 FOLDER/Screen Shot 2022-04-05 at 21.04.41.png")