Assignment 8 - Multiple Regression with “lm”

Introduction

The goal of this analysis is to investigate the factors that influence the pressure drop in a screen plate bubble column. Problem 3.13 from the textbook provides experimental data measuring the pressure drop response \(y\) along with four explanatory variables:

The dataset (data-table-B9.csv) contains 62 observations collected under various operating conditions. Using this dataset, the objective is to build and evaluate multiple linear regression models in R, beginning with a first-order model and extending to a full first-order interaction model. The analysis includes hypothesis testing, model comparison using partial F-tests, and model selection using stepwise regression based on AIC. All numerical results are rounded to two decimal places unless otherwise specified, and statistical conclusions are made at the significance level \(\alpha = 0.05\).

data_1<-read.csv(text="x1,x2,x3,x4,y
2.14,10,0.34,1,28.9
4.14,10,0.34,1,31
8.15,10,0.34,1,26.4
2.14,10,0.34,0.246,27.2
4.14,10,0.34,0.379,26.1
8.15,10,0.34,0.474,23.2
2.14,10,0.34,0.141,19.7
4.14,10,0.34,0.234,22.1
8.15,10,0.34,0.311,22.8
2.14,10,0.34,0.076,29.2
4.14,10,0.34,0.132,23.6
8.15,10,0.34,0.184,23.6
2.14,2.63,0.34,0.679,24.2
4.14,2.63,0.34,0.804,22.1
8.15,2.63,0.34,0.89,20.9
2.14,2.63,0.34,0.514,17.6
4.14,2.63,0.34,0.672,15.7
8.15,2.63,0.34,0.801,15.8
2.14,2.63,0.34,0.346,14
4.14,2.63,0.34,0.506,17.1
8.15,2.63,0.34,0.669,18.3
2.14,2.63,0.34,1,33.8
4.14,2.63,0.34,1,31.7
8.15,2.63,0.34,1,28.1
5.6,1.25,0.34,0.848,18.1
5.6,1.25,0.34,0.737,16.5
5.6,1.25,0.34,0.651,15.4
5.6,1.25,0.34,0.554,15
4.3,2.63,0.34,0.748,19.1
4.3,2.63,0.34,0.682,16.2
4.3,2.63,0.34,0.524,16.3
4.3,2.63,0.34,0.472,15.8
4.3,2.63,0.34,0.398,15.4
5.6,10.1,0.25,0.789,19.2
5.6,10.1,0.25,0.677,8.4
5.6,10.1,0.25,0.59,15
5.6,10.1,0.25,0.523,12
5.6,10.1,0.34,0.789,21.9
5.6,10.1,0.34,0.677,21.3
5.6,10.1,0.34,0.59,21.6
5.6,10.1,0.34,0.523,19.8
4.3,10.1,0.34,0.741,21.6
4.3,10.1,0.34,0.617,17.3
4.3,10.1,0.34,0.524,20
4.3,10.1,0.34,0.457,18.6
2.4,10.1,0.34,0.615,22.1
2.4,10.1,0.34,0.473,14.7
2.4,10.1,0.34,0.381,15.8
2.4,10.1,0.34,0.32,13.2
5.6,10.1,0.55,0.789,30.8
5.6,10.1,0.55,0.677,27.5
5.6,10.1,0.55,0.59,25.2
5.6,10.1,0.55,0.523,22.8
2.14,112,0.34,0.68,41.7
4.14,112,0.34,0.803,33.7
8.15,112,0.34,0.889,29.7
2.14,112,0.34,0.514,41.8
4.14,112,0.34,0.672,37.1
8.15,112,0.34,0.801,40.1
2.14,112,0.34,0.306,42.7
4.14,112,0.34,0.506,48.6
8.15,112,0.34,0.668,42.4")

Solve Problem

Problem 1

When fitting the first-order model with all two-factor interactions using lm(y ~ (x1 + x2 + x3 + x4)^2, data = dat), R reports that two interaction terms have coefficients listed as NA. Which two interactions are dropped due to singularities?

Answer: x1:x3 and x2:x3

Problem 2

For the full first-order interaction model (Q1), what is the R² value? (Round to 4 decimal places)

#create the first-order model
model<-lm(y~(x1+x2+x3+x4)^2, data = data_1)
result<-summary(model)
#pull R^2 out 
r_sqr<-round(result$r.squared, 4)
r_sqr
## [1] 0.7496
Problem 3

For the full first-order interaction model (Q1), what is the F-statistic for testing the significance of the overall regression? (Round to 2 decimal places)

#find the F-statistic
f_stats<-round(result$fstatistic[1], 2)
f_stats
## value 
## 19.83
Problem 4

Based on the overall F-test for the full interaction model (Q1 ), what do you conclude about the full regression model?

Answer: Reject H₀; at least one regression coefficient is significantly different from zero.

Problem 5

For the partial F-test comparing the reduced model (main effects only: x1 + x2 + x3 + x4) to the full interaction model, what is the partial F-statistic? (Round to 2 decimal places)

#comparing reduced vs full model
model_reduce<-lm(y ~ x1 + x2 + x3 + x4, data = data_1)
partial_f<-anova(model_reduce, model)
#pull the result out
f_test<-round(partial_f$F[2], 2)
f_test
## [1] 3.08
Problem 6

What is the p-value for the partial F-test of all two-factor interactions? (Round to 4 decimal places)

#pull p-value out 
p_value<-round(partial_f$`Pr(>F)`[2], 4)
p_value
## [1] 0.0235
Problem 7

Based on the partial F-test, what do you conclude about the two-factor interactions as a group at α = 0.05?

#based on a=0.05 make the conclusion 
if(p_value<0.05){
  conclusion<-print("Reject H0")
} else {
  conclusion<-print("Fail to reject")
}
## [1] "Reject H0"
Problem 8

Using stepwise regression (backward elimination starting with the full model using AIC), which terms are included in the best-fitting model justified by significance of regression parameters?

#backward stepwise selection based on AIC
backward<-step(model, direction="backward", trace= TRUE)
## Start:  AIC=199.73
## y ~ (x1 + x2 + x3 + x4)^2
## 
## 
## Step:  AIC=199.73
## y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x1:x4 + x2:x4 + x3:x4
## 
## 
## Step:  AIC=199.73
## y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x4 + x2:x4 + x3:x4
## 
##         Df Sum of Sq    RSS    AIC
## - x3:x4  1    10.942 1173.4 198.31
## - x1:x4  1    20.682 1183.1 198.82
## <none>               1162.4 199.73
## - x1:x2  1    38.737 1201.2 199.76
## - x2:x4  1   227.751 1390.2 208.82
## 
## Step:  AIC=198.31
## y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x4 + x2:x4
## 
##         Df Sum of Sq    RSS    AIC
## - x1:x4  1    19.837 1193.2 197.35
## <none>               1173.4 198.31
## - x1:x2  1    38.709 1212.1 198.32
## - x2:x4  1   228.394 1401.8 207.34
## - x3     1   249.320 1422.7 208.26
## 
## Step:  AIC=197.35
## y ~ x1 + x2 + x3 + x4 + x1:x2 + x2:x4
## 
##         Df Sum of Sq    RSS    AIC
## - x1:x2  1    32.307 1225.5 197.01
## <none>               1193.2 197.35
## - x2:x4  1   220.026 1413.2 205.84
## - x3     1   252.209 1445.4 207.24
## 
## Step:  AIC=197.01
## y ~ x1 + x2 + x3 + x4 + x2:x4
## 
##         Df Sum of Sq    RSS    AIC
## - x1     1    11.262 1236.8 195.57
## <none>               1225.5 197.01
## - x2:x4  1   207.286 1432.8 204.69
## - x3     1   248.430 1473.9 206.45
## 
## Step:  AIC=195.57
## y ~ x2 + x3 + x4 + x2:x4
## 
##         Df Sum of Sq    RSS    AIC
## <none>               1236.8 195.57
## - x3     1    243.60 1480.4 204.72
## - x2:x4  1    245.68 1482.4 204.81
#pull the regression parameters out
step_model<-names(coef(backward))
step_model
## [1] "(Intercept)" "x2"          "x3"          "x4"          "x2:x4"
Problem 9

For the best model (y ~ x2 + x3 + x4 + x2:x4), what is the adjusted R²? (Round to 4 decimal places) best

#fit the best model 
best_model<-lm(y~x2+x3+x4+x2:x4, data = data_1)
#calculate r.sqr from best model
r.sqr_2<-round(summary(best_model)$adj.r.squared, 4)
r.sqr_2
## [1] 0.7149
Problem 10

For the best model, what is the estimated coefficient for the x2:x4 interaction term? (Round to 4 decimal places)

#calculate estimated coefficient for the x2:x4 from best model
coef_2.4<-round(coef(best_model)["x2:x4"], 4)
coef_2.4
##   x2:x4 
## -0.3047
Problem 11

In the best model (y ~ x2 + x3 + x4 + x2:x4), the variable x1 is not included. When calculating confidence and prediction intervals at the given points of interest, what should you do with the x1 values?

Answer: Omit x1 since it is not in the model

Problem 12

Using the best model, what is the predicted (fitted) value of y at Point 1 (x2 = 10, x3 = 0.5, x4 = 0.75)? (Round to 2 decimal places)

#calculate point 1
p1<-data.frame(x2=10, x3=0.5, x4=0.75)
#find the predicted fitted value at the point 1
yh_point1<-round(predict(best_model, newdata = p1), 2) 
yh_point1
##     1 
## 27.44
Problem 13

What is the lower bound of the 95% confidence interval on the mean response at Point 1 (x2 = 10, x3 = 0.5, x4 = 0.75)? (Round to 2 decimal places)

#95% confidence interval for the mean response
CI_p1<-predict(best_model, newdata = p1, interval = "confidence")
lower_b<-round(CI_p1[, "lwr"], 2)
lower_b
## [1] 24.01
Problem 14

What is the upper bound of the 95% confidence interval on the mean response at Point 1? (Round to 2 decimal places)

#find upper bound 
upr_b<-round(CI_p1[, "upr"], 2)
upr_b
## [1] 30.88
Problem 15

What is the predicted (fitted) value of y at Point 2 (x2 = 3, x3 = 0.25, x4 = 0.85)? (Round to 2 decimal places)

#calculate point 2
p2<-data.frame(x2=3, x3=0.25, x4=0.85)
#find the predicted fitted value at the point 1
yh_point2<-round(predict(best_model, newdata = p2), 2) 
yh_point2
##     1 
## 18.61
Problem 16

What is the upper bound of the 95% prediction interval at Point 2 (x2 = 3, x3 = 0.25, x4 = 0.85)? (Round to 2 decimal places)

#95% prediction interval for the mean response
PI_p2<-predict(best_model, newdata = p2, interval = "prediction")
PI_upr<-round(PI_p2[, "upr"], 2)
PI_upr
## [1] 28.36
Problem 17

The prediction interval is always wider than the confidence interval at the same point because the prediction interval accounts for both the uncertainty in estimating the mean response and the variability of individual observations.

Answer: True

Problem 18

Why was x1 (superficial fluid velocity of the gas) ultimately excluded from the best model?

Answer: x1 was not statistically significant as a predictor of pressure drop and its removal improved the model based on AIC