Introduction

This analysis solves Problem 3.13 in the book using the “lm” function in R.

Specifically, the problem studies the effect of four factors (x1 = superficial fluid velocity of the gas (cm/s), x2 = kinematic viscosity, x3 = mesh opening (cm), and x4 = dimensionless number relating the superficial velocity of the gas to that of the liquid) on the pressure drop in a screen-plate bubble column (y). Round numeric answers to 2 decimal places unless otherwise stated. Use α = 0.05. The corresponding data file data-table-B9.csv is read into the analysis and named “dat”.

dat <- read.csv("data-table-B9.csv")

Data

The data contains 62 rows and 5 columns. We can view the first few rows:

head(dat)
##     x1 x2   x3    x4    y
## 1 2.14 10 0.34 1.000 28.9
## 2 4.14 10 0.34 1.000 31.0
## 3 8.15 10 0.34 1.000 26.4
## 4 2.14 10 0.34 0.246 27.2
## 5 4.14 10 0.34 0.379 26.1
## 6 8.15 10 0.34 0.474 23.2

The data has four predictors (x1,x2,x3,x4) and one response (y).
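
The stated dimensions and column types can be confirmed directly (a quick check; output omitted):

dim(dat)  # should return 62 5
str(dat)  # column types of the four predictors and the response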

First-order model

Fit the first-order model (Q1) in R with all two-factor interactions using the form: lm(y~(x1+x2+x3+x4)^2,data=dat)

Q1 <- lm(y~(x1+x2+x3+x4)^2,data=dat)
summary(Q1)
## 
## Call:
## lm(formula = y ~ (x1 + x2 + x3 + x4)^2, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4804 -3.0766 -0.6635  2.9625 12.2221 
## 
## Coefficients: (2 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.88376   23.17863   0.685  0.49616    
## x1            0.18696    0.78447   0.238  0.81255    
## x2            0.37921    0.06332   5.989 1.89e-07 ***
## x3          -11.99940   67.31148  -0.178  0.85919    
## x4           -8.86442   35.62553  -0.249  0.80446    
## x1:x2         0.01155    0.00869   1.329  0.18955    
## x1:x3              NA         NA      NA       NA    
## x1:x4        -1.11525    1.14847  -0.971  0.33592    
## x2:x3              NA         NA      NA       NA    
## x2:x4        -0.38547    0.11962  -3.222  0.00218 ** 
## x3:x4        72.85976  103.15353   0.706  0.48308    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.683 on 53 degrees of freedom
## Multiple R-squared:  0.7496, Adjusted R-squared:  0.7118 
## F-statistic: 19.83 on 8 and 53 DF,  p-value: 1.947e-13

From the summary table we notice that the coefficients of the interactions x1:x3 and x2:x3 are listed as NA, meaning they were dropped due to singularities (they are linearly dependent on other terms in the model). The summary also reports an \(R^{2}\) of 0.7496 and an F-statistic of 19.83. With a \(p\)-value far below 0.05, the model is statistically significant, so we reject the null hypothesis \(H_{0}\) that all regression coefficients are zero.
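
To see exactly which terms are aliased (linearly dependent on the rest of the design), R’s alias() function reports the dependency structure of the fitted model; a quick diagnostic (output omitted here):

alias(Q1)  # shows how the dropped terms depend on the other columns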

Second model

Fit a second model (FIT2) containing only the main effects (x1+x2+x3+x4), in the form: lm(y~x1+x2+x3+x4,data=dat)

FIT2 <- lm(y~x1+x2+x3+x4,data=dat)
summary(FIT2)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9958 -3.3092 -0.2419  3.3924 10.5668 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.89453    4.32508   1.363  0.17828    
## x1          -0.47790    0.34002  -1.406  0.16530    
## x2           0.18271    0.01718  10.633 3.78e-15 ***
## x3          35.40284   11.09960   3.190  0.00232 ** 
## x4           5.84391    2.90978   2.008  0.04935 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.014 on 57 degrees of freedom
## Multiple R-squared:  0.6914, Adjusted R-squared:  0.6697 
## F-statistic: 31.92 on 4 and 57 DF,  p-value: 5.818e-14

The summary table indicates that the model has an \(R^{2}\) of 0.6914 and an F-statistic of 31.92. With a \(p\)-value far below 0.05, the model is significant.
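
For a quick side-by-side comparison of the two fits, the adjusted \(R^{2}\) values can be pulled directly from the summary objects:

c(Q1 = summary(Q1)$adj.r.squared, FIT2 = summary(FIT2)$adj.r.squared)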

Analysis of Variance (ANOVA)

Using the two models, we then test the significance of the interaction terms by comparing FIT2 with Q1 using the anova() function in R (a partial F-test). This provides a \(p\)-value that compares the models. A \(p\)-value < 0.05 indicates that the extra interaction terms are jointly significant, so the null hypothesis \(H_{0}\) (the reduced model FIT2 is adequate) is rejected; if the \(p\)-value > 0.05, we fail to reject \(H_{0}\).
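
Concretely, the partial F-statistic is \(F = \frac{(RSS_{reduced}-RSS_{full})/\Delta df}{RSS_{full}/df_{full}}\). A minimal by-hand check using the two fitted models, which should reproduce the anova() results below:

rss_full <- sum(resid(Q1)^2)    # RSS of the full model, 53 residual df
rss_red  <- sum(resid(FIT2)^2)  # RSS of the main-effects model, 57 residual df
Fstat <- ((rss_red - rss_full) / 4) / (rss_full / 53)  # 4 = difference in df
pf(Fstat, 4, 53, lower.tail = FALSE)                   # p-value of the partial F-test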

anova(FIT2,Q1)
## Analysis of Variance Table
## 
## Model 1: y ~ x1 + x2 + x3 + x4
## Model 2: y ~ (x1 + x2 + x3 + x4)^2
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     57 1432.8                              
## 2     53 1162.4  4    270.37 3.0819 0.02352 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA gives a \(p\)-value of 0.02352, which is less than 0.05, so we reject the null hypothesis: at least one interaction term contributes significantly. We next search for the best-fitting model using step-wise regression.

Step-wise regression

In order to determine the best-fitting model, we use step-wise regression in the backward-elimination direction, starting from the full model Q1 and using AIC as the selection criterion.
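
step() scores candidate models with extractAIC(), which for a linear model equals \(n\log(RSS/n) + 2k\) up to an additive constant, where k is the number of estimated parameters. As a sanity check:

extractAIC(Q1)  # second element should match the starting AIC of 199.73 below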

backward <- step(Q1,direction = "backward", trace = TRUE)
## Start:  AIC=199.73
## y ~ (x1 + x2 + x3 + x4)^2
## 
## 
## Step:  AIC=199.73
## y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x1:x4 + x2:x4 + x3:x4
## 
## 
## Step:  AIC=199.73
## y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x4 + x2:x4 + x3:x4
## 
##         Df Sum of Sq    RSS    AIC
## - x3:x4  1    10.942 1173.4 198.31
## - x1:x4  1    20.682 1183.1 198.82
## <none>               1162.4 199.73
## - x1:x2  1    38.737 1201.2 199.76
## - x2:x4  1   227.751 1390.2 208.82
## 
## Step:  AIC=198.31
## y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x4 + x2:x4
## 
##         Df Sum of Sq    RSS    AIC
## - x1:x4  1    19.837 1193.2 197.35
## <none>               1173.4 198.31
## - x1:x2  1    38.709 1212.1 198.32
## - x2:x4  1   228.394 1401.8 207.34
## - x3     1   249.320 1422.7 208.26
## 
## Step:  AIC=197.35
## y ~ x1 + x2 + x3 + x4 + x1:x2 + x2:x4
## 
##         Df Sum of Sq    RSS    AIC
## - x1:x2  1    32.307 1225.5 197.01
## <none>               1193.2 197.35
## - x2:x4  1   220.026 1413.2 205.84
## - x3     1   252.209 1445.4 207.24
## 
## Step:  AIC=197.01
## y ~ x1 + x2 + x3 + x4 + x2:x4
## 
##         Df Sum of Sq    RSS    AIC
## - x1     1    11.262 1236.8 195.57
## <none>               1225.5 197.01
## - x2:x4  1   207.286 1432.8 204.69
## - x3     1   248.430 1473.9 206.45
## 
## Step:  AIC=195.57
## y ~ x2 + x3 + x4 + x2:x4
## 
##         Df Sum of Sq    RSS    AIC
## <none>               1236.8 195.57
## - x3     1    243.60 1480.4 204.72
## - x2:x4  1    245.68 1482.4 204.81

We then extract the terms retained in the best model:

BEST <- names(coef(backward))
BEST
## [1] "(Intercept)" "x2"          "x3"          "x4"          "x2:x4"

Best Model

We refit the best model and summarize it (equivalently, summary(backward) gives the same fit):

bestmodel <- lm(y~x2+x3+x4+x2:x4,data=dat)
summary(bestmodel) 
## 
## Call:
## lm(formula = y ~ x2 + x3 + x4 + x2:x4, data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.959 -3.358 -1.131  3.040 11.646 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.52261    4.03964   0.377  0.70763    
## x2           0.38056    0.06084   6.255 5.47e-08 ***
## x3          34.51062   10.29961   3.351  0.00144 ** 
## x4           9.52471    2.96093   3.217  0.00214 ** 
## x2:x4       -0.30472    0.09056  -3.365  0.00137 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.658 on 57 degrees of freedom
## Multiple R-squared:  0.7336, Adjusted R-squared:  0.7149 
## F-statistic: 39.24 on 4 and 57 DF,  p-value: 9.297e-16

Plots for best model

We can then plot the best model and check the residual plots to verify that the residuals are approximately normally distributed and have constant variance.

plot(bestmodel, which = 1)  # residuals vs fitted values: check for constant variance

plot(bestmodel, which = 2)  # normal Q-Q plot: check for normality of the residuals
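
If a formal check is wanted in addition to the plots, a Shapiro-Wilk test on the residuals provides a numeric test of normality (an optional extra; output omitted):

shapiro.test(resid(bestmodel))  # a large p-value would support approximate normality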

Confidence and Prediction Intervals

Using the best model, we then determine the confidence and prediction intervals at the points shown in the table below.

Points of interest

            x2    x3    x4
Point 1     10    0.50  0.75
Point 2      3    0.25  0.85

For Point 1: 95% Confidence Interval

point1 <- data.frame(x2=10,x3=0.5,x4=0.75)
predict(bestmodel,point1)
##        1 
## 27.44168
predict(bestmodel,point1,interval="confidence")
##        fit      lwr      upr
## 1 27.44168 24.00618 30.87717

For Point 2: 95% Prediction Interval

point2 <- data.frame(x2=3,x3=0.25,x4=0.85)
predict(bestmodel,point2)
##        1 
## 18.61092
predict(bestmodel,point2,interval="prediction")
##        fit      lwr      upr
## 1 18.61092 8.860183 28.36165
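
Note that the prediction interval is wider than the confidence interval because it accounts for the variability of an individual new observation, not just the uncertainty in the estimated mean response. Both kinds of intervals for both points can also be computed in a single call; a small convenience sketch using the same points:

pts <- data.frame(x2 = c(10, 3), x3 = c(0.50, 0.25), x4 = c(0.75, 0.85))
predict(bestmodel, pts, interval = "confidence")  # 95% confidence intervals
predict(bestmodel, pts, interval = "prediction")  # 95% prediction intervals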