Multiple Regression with “lm”

Problem Statement

An Engineer collects data regarding the study of the effects of four variables on a dimensionless factor used to describe pressure drops in a screen-plate bubble column. The results are provided in a .csv data set.

ds <- read.csv("data-table-B9.csv")
print(ds)
##      x1     x2   x3    x4    y
## 1  2.14  10.00 0.34 1.000 28.9
## 2  4.14  10.00 0.34 1.000 31.0
## 3  8.15  10.00 0.34 1.000 26.4
## 4  2.14  10.00 0.34 0.246 27.2
## 5  4.14  10.00 0.34 0.379 26.1
## 6  8.15  10.00 0.34 0.474 23.2
## 7  2.14  10.00 0.34 0.141 19.7
## 8  4.14  10.00 0.34 0.234 22.1
## 9  8.15  10.00 0.34 0.311 22.8
## 10 2.14  10.00 0.34 0.076 29.2
## 11 4.14  10.00 0.34 0.132 23.6
## 12 8.15  10.00 0.34 0.184 23.6
## 13 2.14   2.63 0.34 0.679 24.2
## 14 4.14   2.63 0.34 0.804 22.1
## 15 8.15   2.63 0.34 0.890 20.9
## 16 2.14   2.63 0.34 0.514 17.6
## 17 4.14   2.63 0.34 0.672 15.7
## 18 8.15   2.63 0.34 0.801 15.8
## 19 2.14   2.63 0.34 0.346 14.0
## 20 4.14   2.63 0.34 0.506 17.1
## 21 8.15   2.63 0.34 0.669 18.3
## 22 2.14   2.63 0.34 1.000 33.8
## 23 4.14   2.63 0.34 1.000 31.7
## 24 8.15   2.63 0.34 1.000 28.1
## 25 5.60   1.25 0.34 0.848 18.1
## 26 5.60   1.25 0.34 0.737 16.5
## 27 5.60   1.25 0.34 0.651 15.4
## 28 5.60   1.25 0.34 0.554 15.0
## 29 4.30   2.63 0.34 0.748 19.1
## 30 4.30   2.63 0.34 0.682 16.2
## 31 4.30   2.63 0.34 0.524 16.3
## 32 4.30   2.63 0.34 0.472 15.8
## 33 4.30   2.63 0.34 0.398 15.4
## 34 5.60  10.10 0.25 0.789 19.2
## 35 5.60  10.10 0.25 0.677  8.4
## 36 5.60  10.10 0.25 0.590 15.0
## 37 5.60  10.10 0.25 0.523 12.0
## 38 5.60  10.10 0.34 0.789 21.9
## 39 5.60  10.10 0.34 0.677 21.3
## 40 5.60  10.10 0.34 0.590 21.6
## 41 5.60  10.10 0.34 0.523 19.8
## 42 4.30  10.10 0.34 0.741 21.6
## 43 4.30  10.10 0.34 0.617 17.3
## 44 4.30  10.10 0.34 0.524 20.0
## 45 4.30  10.10 0.34 0.457 18.6
## 46 2.40  10.10 0.34 0.615 22.1
## 47 2.40  10.10 0.34 0.473 14.7
## 48 2.40  10.10 0.34 0.381 15.8
## 49 2.40  10.10 0.34 0.320 13.2
## 50 5.60  10.10 0.55 0.789 30.8
## 51 5.60  10.10 0.55 0.677 27.5
## 52 5.60  10.10 0.55 0.590 25.2
## 53 5.60  10.10 0.55 0.523 22.8
## 54 2.14 112.00 0.34 0.680 41.7
## 55 4.14 112.00 0.34 0.803 33.7
## 56 8.15 112.00 0.34 0.889 29.7
## 57 2.14 112.00 0.34 0.514 41.8
## 58 4.14 112.00 0.34 0.672 37.1
## 59 8.15 112.00 0.34 0.801 40.1
## 60 2.14 112.00 0.34 0.306 42.7
## 61 4.14 112.00 0.34 0.506 48.6
## 62 8.15 112.00 0.34 0.668 42.4

x1 = superficial fluid velocity of the gas (cs/s)

x2 = kinematic velocity

x3 = mesh opening (cm)

x4 = dimensionless number relating the superficial velocity of the gas to the fluid

a = 0.05 (Given for statistical data analysis)

Round all answers to two decimal places.

Run first order model with two factor interaction. Identify interactions that are dropped due to singularities.

fit1 <- lm(y~(x1+x2+x3+x4)^2, data=ds)
summary(fit1)
## 
## Call:
## lm(formula = y ~ (x1 + x2 + x3 + x4)^2, data = ds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4804 -3.0766 -0.6635  2.9625 12.2221 
## 
## Coefficients: (2 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.88376   23.17863   0.685  0.49616    
## x1            0.18696    0.78447   0.238  0.81255    
## x2            0.37921    0.06332   5.989 1.89e-07 ***
## x3          -11.99940   67.31148  -0.178  0.85919    
## x4           -8.86442   35.62553  -0.249  0.80446    
## x1:x2         0.01155    0.00869   1.329  0.18955    
## x1:x3              NA         NA      NA       NA    
## x1:x4        -1.11525    1.14847  -0.971  0.33592    
## x2:x3              NA         NA      NA       NA    
## x2:x4        -0.38547    0.11962  -3.222  0.00218 ** 
## x3:x4        72.85976  103.15353   0.706  0.48308    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.683 on 53 degrees of freedom
## Multiple R-squared:  0.7496, Adjusted R-squared:  0.7118 
## F-statistic: 19.83 on 8 and 53 DF,  p-value: 1.947e-13

The model results suggests that x1:x3 and x2:x3 interactions coefficients are NA due to singularities. This results shows collinarity, therefore the regression models of these interactions are approximately the same.

Summary statistics on the first order model.

In the summary output above, the the r-squared value is 0.7496.

The F-statistic is 19.83. No rounding is necessary on this output.

Since the P-Value<<0.05, we reject the null hypothesis. As least one of the regression coefficients is significantly different from zero.

Compare to a reduced model.

fit1_red <- lm(y~(x1+x2+x3+x4),data=ds)
anova(fit1_red,fit1)
## Analysis of Variance Table
## 
## Model 1: y ~ (x1 + x2 + x3 + x4)
## Model 2: y ~ (x1 + x2 + x3 + x4)^2
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     57 1432.8                              
## 2     53 1162.4  4    270.37 3.0819 0.02352 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The partial F-Statistic is 3.08.

The P-Value is 0.0235.

Since the P-Value<0.05, we reject the null hypothesis. At least one interaction term is significant and should be retained.

Model justification using step-wise regression.

fit1_step <- step(fit1,dorectopm="backward",trace=0)
summary(fit1_step)
## 
## Call:
## lm(formula = y ~ x2 + x3 + x4 + x2:x4, data = ds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.959 -3.358 -1.131  3.040 11.646 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.52261    4.03964   0.377  0.70763    
## x2           0.38056    0.06084   6.255 5.47e-08 ***
## x3          34.51062   10.29961   3.351  0.00144 ** 
## x4           9.52471    2.96093   3.217  0.00214 ** 
## x2:x4       -0.30472    0.09056  -3.365  0.00137 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.658 on 57 degrees of freedom
## Multiple R-squared:  0.7336, Adjusted R-squared:  0.7149 
## F-statistic: 39.24 on 4 and 57 DF,  p-value: 9.297e-16

The following terms are best for the best-fitting model: x2, x3, x4, x2:x4.

Best model summary statistics.

fit1_best <- lm(y~(x2+x3+x4+x2:x4),data=ds)
summary(fit1_best)
## 
## Call:
## lm(formula = y ~ (x2 + x3 + x4 + x2:x4), data = ds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.959 -3.358 -1.131  3.040 11.646 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.52261    4.03964   0.377  0.70763    
## x2           0.38056    0.06084   6.255 5.47e-08 ***
## x3          34.51062   10.29961   3.351  0.00144 ** 
## x4           9.52471    2.96093   3.217  0.00214 ** 
## x2:x4       -0.30472    0.09056  -3.365  0.00137 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.658 on 57 degrees of freedom
## Multiple R-squared:  0.7336, Adjusted R-squared:  0.7149 
## F-statistic: 39.24 on 4 and 57 DF,  p-value: 9.297e-16

The adjusted square value for the best model is 0.7149.

The coefficient for x2:x4 is -0.3047.

When calculating the confidence and prediction intervals for the model, we omit x1 because it is not included in the model.

Prediction based on best model and confidence/prediction interval bounds (95% confidence).

point1 <- data.frame(x2=10,x3=0.5,x4=0.75)
predict(fit1_best,newdata=point1)
##        1 
## 27.44168

Predicting y at the above points yields y=27.44.

predict(fit1_best,newdata=point1,interval="confidence",level = 0.95)
##        fit      lwr      upr
## 1 27.44168 24.00618 30.87717

The lower bound at the predicted response from point 1 is 24.01.

The upper bound at the predicted response from point 1 is 30.88.

point2 <- data.frame(x2=3,x3=0.25,x4=0.85)
predict(fit1_best,newdata=point2)
##        1 
## 18.61092

Predicting y at the above points yields y=18.61.

predict(fit1_best,newdata=point2,interval="prediction",level = 0.95)
##        fit      lwr      upr
## 1 18.61092 8.860183 28.36165

The upper bound at the predicted response from point 2 is 28.36.

The prediction interval is always wider than the confidence interval at the same point because the prediction interval accounts for both the uncertainty in estimating the mean response and the variability of individual observations. True.

x1 was excluded from the best model because it was not statistically significant as a predictor of pressure drop and its removal improved the model based on AIC.