7.2 A solid-fuel rocket propellant loses weight after it is produced. The data in Table 1 are available: a. Fit a second-order polynomial that expresses weight loss as a function of the number of months since production.
library(readxl)  # read_excel() for importing the data
Table1 <- read_excel("Table1.xlsx")
library(ggplot2) # plotting
model1 <- lm(y ~ x + I(x^2), data = Table1) # alternative: model1 <- lm(y ~ poly(x, 2), data = Table1)
ggplot(Table1, aes(x, y)) + geom_point(alpha = 0.5) + stat_smooth(method = "lm", formula = y ~ x + I(x^2), size = 1) + labs(title = "Second-order polynomial fit of weight loss since production", x = "Months since production", y = "Weight loss") + theme_bw()
Result1 <- summary(model1)
Result1$coefficients[,1]
(Intercept)           x      I(x^2)
   1.633000   -1.232182    1.494545
The fitted model is ŷ = 1.633 − 1.232x + 1.495x².
library(broom)
glance(model1)[c(4, 5)] # overall F statistic and p-value
anova(model1)[2, c(4, 5)] # sequential F test for the quadratic term
F value Pr(>F)
I(x^2) 361974 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
No, not at all: the model is very well fitted (R² ≈ 1 and p-value ≈ 0). Still, it is recommended to use the lower-order model for extrapolated prediction, provided it fits well, to avoid the multicollinearity problem between x and x².
library(car) # vif() comes from the car package
vif(model1)
x I(x^2)
19.90625 19.90625
Variance inflation factors (VIF): both terms have VIF ≈ 19.9, well above the common cutoff of 10, confirming strong collinearity between the linear and quadratic terms.
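Following the recommendation above, a minimal sketch (my addition, not part of the original solution) that checks whether the first-order model suffices and removes the x/x² collinearity with orthogonal polynomials:
model1_lin <- lm(y ~ x, data = Table1)            # reduced first-order model
anova(model1_lin, model1)                         # partial F test for the quadratic term
model1_orth <- lm(y ~ poly(x, 2), data = Table1)  # orthogonal polynomial parameterization
round(cor(poly(Table1$x, 2)), 10)                 # the two columns are uncorrelated by construction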
7.3 Refer to Problem 7.2. Compute the residuals for the second-order model. Analyze the residuals and comment on the adequacy of the model.
library(ggResidpanel) # resid_panel() for panels of residual diagnostic plots
resid_panel(model1, plots = "all")
7.4 Consider the data shown in Table 2: a. Fit a second-order polynomial model to these data.
Table2 <- read_excel("Table2.xlsx")
model2 <- lm(y ~ x+I(x^2), data=Table2)
ggplot(Table2, aes(x, y)) + geom_point(alpha = 0.5) + stat_smooth(method = "lm", formula = y ~ x + I(x^2), size = 1) + labs(title = "Second-order polynomial fit", x = "Variable", y = "Response") + theme_bw()
Result2 <- summary(model2)
Result2$coefficients[,1]
(Intercept)           x      I(x^2)
  -4.459494    1.383712    1.467047
The fitted model is ŷ = −4.459 + 1.384x + 1.467x².
glance(model2)[c(4, 5)] # overall F statistic and p-value
fullmodel <- lm(y ~ x + I(x^2), data = Table2) # full model (same as model2)
reducedmodel <- lm(y ~ x, data = Table2)       # reduced first-order model
anova(reducedmodel, fullmodel)                 # partial F test for the quadratic term
Analysis of Variance Table
Model 1: y ~ x
Model 2: y ~ x + I(x^2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 10 48.982
2 9 24.720 1 24.262 8.8333 0.01565 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A partial F test assesses the contribution of the quadratic term. The F statistic is 8.83 with a corresponding p-value of 0.016. Since this p-value is less than .05, we reject the null hypothesis and conclude that the full (quadratic) model offers a statistically significantly better fit than the reduced (linear) model.
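As a check, the F statistic can be reproduced by hand from the RSS values in the table above:
((48.982 - 24.720) / 1) / (24.720 / 9) # = 24.262 / 2.747 = 8.833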
anova(model2)[2,c(4,5)]
F value Pr(>F)
I(x^2) 8.8333 0.01565 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
7.5 Refer to Problem 7.4. Compute the residuals from the second-order model. Analyze the residuals and draw conclusions about the adequacy of the model.
resid_panel(model2, plots = "all")
7.6 The carbonation level of a soft drink beverage is affected by the temperature of the product and the filler operating pressure. Twelve observations were obtained and the resulting data are shown in Table 3. a. Fit a second-order polynomial.
library(car)    # avPlots(), vif()
library(rgl)    # 3D graphics support; install.packages("rgl") if needed
library(plot3D) # scatter3D()
Table3 <- read_excel("Table3.xlsx")
x <- Table3$x1 # temperature of the product
y <- Table3$x2 # filler operating pressure
z <- Table3$y  # carbonation level
fit <- lm(z ~ I(x^2) + I(y^2) + x*y) # second-order model; x*y expands to x + y + x:y
model3 <- fit
# create a grid from the x and y values (min to max) and predict values for every point
# this will become the regression plane
grid.lines = 40
x.pred <- seq(min(x), max(x), length.out = grid.lines)
y.pred <- seq(min(y), max(y), length.out = grid.lines)
xy <- expand.grid( x = x.pred, y = y.pred)
z.pred <- matrix(predict(fit, newdata = xy),
nrow = grid.lines, ncol = grid.lines)
# create the fitted points for droplines to the surface
fitpoints <- predict(fit)
# scatter plot with regression plane
scatter3D(x, y, z, pch = 19, cex = 1, colvar = NULL, col = "red", theta = 50, phi = 20, # orientation chosen to better view the bottom
bty = "b", xlab = "Temperature of the product (x1)", ylab = "Filler operating pressure (x2)", zlab = "Carbonation level (y)", surf = list(x = x.pred, y = y.pred, z = z.pred, facets = TRUE, fit = fitpoints, col = ramp.col(col = c("dodgerblue3", "seagreen2"), n = 300, alpha = 0.9), border = "black"), main = "3D scatter plot with polynomial regression surface")
avPlots(model3)#produce added variable plots
Result3 <- summary(model3)
Result3$coefficients[,1]
(Intercept) I(x^2) I(y^2) x y x:y
3025.318635 3.625875 1.154250 -194.272894 -6.050668 -1.331710
glance(model3)[c(4,5)]
fullmodel <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + x1*x2, data = Table3) # full second-order model (same as model3)
reducedmodel <- lm(y ~ x1 + x2, data = Table3)                          # reduced first-order model
anova(reducedmodel, fullmodel) # extra-sum-of-squares (partial F) test for the second-order terms
Analysis of Variance Table
Model 1: y ~ x1 + x2
Model 2: y ~ x1 + x2 + I(x1^2) + I(x2^2) + x1 * x2
Res.Df RSS Df Sum of Sq F Pr(>F)
1 9 8.1242
2 6 2.3022 3 5.822 5.0578 0.04415 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(model3)[5, c(4, 5)] # sequential F test for the interaction term
F value Pr(>F)
x:y 2.2081 0.1878
The temperature-pressure interaction is not significant (p = 0.188).
anova(model3)[c(3, 4), c(4, 5)] # sequential F tests for the linear terms
F value Pr(>F)
x 0.5210 0.49757
y 9.6306 0.02103 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The linear pressure term (y, i.e., x2) is significant (p = 0.021), while the linear temperature term is not (p = 0.498).
7.7 Refer to Problem 7.6. Compute the residuals from the second-order model. Analyze the residuals and comment on the adequacy of the model.
resid_panel(model3, plots = "all")
7.18 An article in the Journal of Pharmaceutical Sciences (80, 971-977, 1991) presents data on the observed mole fraction solubility of a solute at a constant temperature, along with x1 = dispersion partial solubility, x2 = dipolar partial solubility, and x3 = hydrogen bonding Hansen partial solubility, as shown in Table 4. The response y is the negative logarithm of the mole fraction solubility. a. Fit a complete quadratic model to the data.
Table4 <- read_excel("Table4.xlsx")
model4 <- lm(y ~ I(x1^2) + I(x2^2) + I(x3^2) + x1*x2*x3, data = Table4) # squared terms plus x1*x2*x3, which expands to the main effects, all two-way interactions, and the three-way interaction x1:x2:x3
avPlots(model4) #produce added variable plots
Result4 <- summary(model4)
Result4$coefficients[,1]
(Intercept) I(x1^2) I(x2^2) I(x3^2) x1 x2 x3 x1:x2
-1.6094487123 -0.0166935497 -0.0083074959 0.0006522905 0.3798276877 0.2629408525 -0.0952478284 -0.0246470466
x1:x3 x2:x3 x1:x2:x3
0.0046098065 -0.0163606359 0.0025975064
glance(model4)[c(4,5)]
summary(model4)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.6094487123 1.380968118 -1.1654496 0.2620359
I(x1^2) -0.0166935497 0.018461073 -0.9042567 0.3801533
I(x2^2) -0.0083074959 0.012560752 -0.6613852 0.5183963
I(x3^2) 0.0006522905 0.001541127 0.4232555 0.6781149
x1 0.3798276877 0.319013975 1.1906302 0.2523006
x2 0.2629408525 0.168039080 1.5647601 0.1384881
x3 -0.0952478284 0.108988619 -0.8739245 0.3959358
x1:x2 -0.0246470466 0.017160974 -1.4362266 0.1714655
x1:x3 0.0046098065 0.013769870 0.3347749 0.7424304
x2:x3 -0.0163606359 0.047781024 -0.3424087 0.7367941
x1:x2:x3 0.0025975064 0.006478482 0.4009437 0.6941148
The Pr(>|t|) column gives the p-value associated with the t value column, where t value = Estimate / Std. Error and p-value = 2 * pt(abs(t value), residual df, lower.tail = FALSE). If the p-value is below a chosen significance level (e.g., α = .05), the predictor variable is said to have a statistically significant relationship with the response variable in the model.
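As a minimal check of the x1 row in the table above (using the residual degrees of freedom stored in the fitted model):
est <- coef(summary(model4))["x1", "Estimate"]   # 0.3798
se  <- coef(summary(model4))["x1", "Std. Error"] # 0.3190
tval <- est / se                                 # 1.1906, matching the t value column
2 * pt(abs(tval), df.residual(model4), lower.tail = FALSE) # 0.2523, matching Pr(>|t|)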
Here, however, no individual term is significant at the .05 level: the smallest p-values are .138 for x2 and .171 for x1:x2, and the p-value for x1 is .252. Individually insignificant t statistics throughout a large model are a classic symptom of the multicollinearity examined in Problem 7.19.
resid_panel(model4, plots = "all")
fullmodel4 <- lm(y ~ x1 + x2 + x3 + I(x1^2) + I(x2^2) + I(x3^2) + x1*x2*x3, data = Table4) # full model (same terms as model4)
reducedmodel4 <- lm(y ~ x1 + x2 + x3 + x1*x2*x3, data = Table4)                            # reduced model without the squared terms
anova(reducedmodel4, fullmodel4) # extra-sum-of-squares (partial F) test for the squared terms
The extra sum of squares is ESS = SSE(reduced) − SSE(full). Suppose a model contains one predictor, X1, and a second predictor, X2, is added: the extra sum of squares SSR(X2 | X1) measures the additional variation explained by X2 beyond what X1 already explains. Equivalently, in terms of error sums of squares, SSR(X2 | X1) = SSE(X1) − SSE(X1, X2).
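A minimal sketch of this computation (the model names red2 and full2 are mine), reusing the Table 2 data from Problem 7.4; the values match the ANOVA table shown there:
red2  <- lm(y ~ x, data = Table2)          # SSE(X1): first-order model, RSS = 48.982
full2 <- lm(y ~ x + I(x^2), data = Table2) # SSE(X1, X2): with the quadratic term, RSS = 24.720
sum(resid(red2)^2) - sum(resid(full2)^2)   # SSR(X2 | X1) = 24.262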
7.19 Consider the quadratic regression model from Problem 7.18. Find the variance inflation factors and comment on multicollinearity in this model.
vif(model4)
I(x1^2) I(x2^2) I(x3^2) x1 x2 x3 x1:x2 x1:x3 x2:x3 x1:x2:x3
574.0342 178.8020 108.0153 580.5829 628.5960 1569.3943 393.6975 1410.5121 15278.7388 15981.5670
VIF_j = 1/(1 − R²_j), where R²_j comes from regressing the j-th predictor on all the others. A VIF exceeding 5 indicates potentially severe correlation between that predictor variable and the other predictor variables in the model. Here every VIF is far above that cutoff, with some exceeding 15,000, so the coefficient estimates and p-values in the regression output are likely unreliable; this is consistent with the individually insignificant t statistics in Problem 7.18.
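A common remedy, sketched below as my own suggestion rather than part of the original solution, is to mean-center the predictors before forming the squared and interaction terms, which typically lowers these structural VIFs:
Table4c <- transform(Table4, x1 = x1 - mean(x1), x2 = x2 - mean(x2), x3 = x3 - mean(x3)) # centered copies
model4c <- lm(y ~ I(x1^2) + I(x2^2) + I(x3^2) + x1*x2*x3, data = Table4c)
vif(model4c) # the VIFs should drop substantially relative to the raw-scale fit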
8.18 Consider the fuel consumption data in Table 5 (B.18 in the textbook). Regressor x1 is an indicator variable. Perform a thorough analysis of these data. What conclusions do you draw from this analysis?
We fit the full first-order model in all eight regressors, then successively reduced models that keep the indicator x1 and its interactions.
Table5 <- read_excel("Table5.xlsx")
model5 <- lm(y ~ x1+x2+x3+x4+x5+x6+x7+x8, data=Table5)
summary(model5)$coefficients[,1]
(Intercept) x1 x2 x3 x4 x5 x6 x7 x8
-312.8950782 -4.2500000 0.1587115 1.0271892 -8.6450025 -0.4319767 -0.1368195 -0.3183932 -0.5241563
anova(model5)[,c(4,5)]
F value Pr(>F)
x1 0.7736 0.40828
x2 0.0690 0.80034
x3 0.3047 0.59815
x4 1.5985 0.24659
x5 4.3389 0.07575 .
x6 0.0005 0.98321
x7 0.0129 0.91261
x8 0.0545 0.82207
Residuals
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
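The reduced models below were chosen by inspecting the ANOVA table above; an automated alternative (my addition, not part of the original analysis) is backward selection by AIC:
model5_step <- step(model5, direction = "backward", trace = FALSE) # backward elimination by AIC
formula(model5_step) # the retained regressors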
model6 <- lm(y ~ x1+x1*x4+x1*x5, data=Table5)
anova(model6)[,c(4,5)]
F value Pr(>F)
x1 5.5223 0.0406433 *
x4 0.1941 0.6688918
x5 38.2521 0.0001035 ***
x1:x4 0.0060 0.9397868
x1:x5 47.0514 4.407e-05 ***
Residuals
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
summary(model6)$coefficients[,1]
(Intercept) x1 x4 x5 x1:x4 x1:x5
459.7923515 -105.8938268 11.4138242 -0.6193277 -13.0107922 0.6513557
model7 <- lm(y ~ x1*x5, data=Table5)
anova(model7)[,c(4,5)]
F value Pr(>F)
x1 3.5412 0.0843337 .
x5 22.6268 0.0004667 ***
x1:x5 26.6152 0.0002374 ***
Residuals
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
summary(model7)$coefficients[,1]
(Intercept) x1 x5 x1:x5
468.8094790 -116.1725879 -0.5498248 0.5721282
model8 <- lm(y ~ x5, data=Table5)
anova(model8)[,c(4,5)]
F value Pr(>F)
x5 7.5143 0.01592 *
Residuals
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
summary(model8)$coefficients[,1]
(Intercept) x5
410.7231850 -0.2637607
Interpretation of the added-variable plots: the x-axis displays a single predictor variable and the y-axis displays the response variable. The blue line shows the association between the predictor and the response while holding all other predictor variables constant. The labelled points in each plot are the two observations with the largest residuals and the two with the largest partial leverage. Note that the slope of the line in each plot matches the sign of the corresponding coefficient in the estimated regression equation.
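The call producing these plots is not shown above; presumably it was something along the lines of:
avPlots(model5) # added-variable plots for the full model (assumed call)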
Test for significance of regression, and construct t statistics for each model parameter. Interpret these results.
glance(model5)[c(4, 5)] # overall F statistic and p-value for each model
glance(model6)[c(4, 5)]
glance(model7)[c(4, 5)]
glance(model8)[c(4, 5)]
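As a complementary comparison (my addition, not in the original), information criteria for the four candidate models:
AIC(model5, model6, model7, model8) # lower AIC indicates a better fit-complexity trade-off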
As in Problem 7.18, each Pr(>|t|) is the two-sided p-value for t value = Estimate / Std. Error, compared against α = .05. Judging from the tables above, x5 and the x1:x5 interaction in model7 are significant at the .05 level while the x1 main effect is marginal (p ≈ .084); in the full model5, no regressor reaches significance, again suggesting redundancy among the eight regressors.
Compute the residuals and comment on model adequacy.
resid_panel(model7, plots = "all")
resid_panel(model8, plots = "all")
Use the extra-sum-of-squares method to test the contribution of the additional terms to the model. Note that model5 and model7 are not nested (model7 contains the x1:x5 interaction, which model5 lacks), so R reports no F statistic for the first comparison below; the second comparison, model7 versus model8, is a valid nested test.
anova(model5, model7) # not nested: no F statistic is reported
Analysis of Variance Table
Model 1: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
Model 2: y ~ x1 * x5
Res.Df RSS Df Sum of Sq F Pr(>F)
1 7 653.75
2 12 244.83 -5 408.92
anova(model7, model8) # nested comparison: does dropping x1 and x1:x5 worsen the fit?
Analysis of Variance Table
Model 1: y ~ x1 * x5
Model 2: y ~ x5
Res.Df RSS Df Sum of Sq F Pr(>F)
1 12 244.83
2 14 860.10 -2 -615.27 15.078 0.000532 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The significant F statistic (p ≈ .0005) indicates that dropping x1 and x1:x5 significantly worsens the fit, so model7 is preferred over model8.
Compute the variance inflation factors and comment on multicollinearity in this model.
vif(model5)
x1 x2 x3 x4 x5 x6 x7 x8
1.000000 1.900541 168.467420 43.103776 60.791320 275.472571 185.707184 44.363364
x1 and x2 are essentially unrelated to the other regressors, but x3 through x8 all have VIFs far above 10 (up to about 275), indicating severe multicollinearity among those regressors and further supporting the reduced model7 over the full model5.