dat<-read.csv("C:\\Users\\abdal\\OneDrive\\Desktop\\TTU\\Spring 2023\\IE 5344\\FA\\FA7\\data-table-B9(1).csv")

Problem A

Consider a first-order multiple regression model with two-factor interactions. Check for model adequacy and take any corrective action deemed necessary. Test for the significance of the full regression model. What do you conclude?

head(dat)
##     x1 x2   x3    x4    y
## 1 2.14 10 0.34 1.000 28.9
## 2 4.14 10 0.34 1.000 31.0
## 3 8.15 10 0.34 1.000 26.4
## 4 2.14 10 0.34 0.246 27.2
## 5 4.14 10 0.34 0.379 26.1
## 6 8.15 10 0.34 0.474 23.2
model1 <- lm(y~x1+x2+x3+x4,data=dat)#reduced model 
model2 <- lm(y~x1+x2+x3+x4+x1:x2+x1:x3+x1:x4+x2:x3+x2:x4+x3:x4,data=dat) #full model
summary(model2)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x1:x4 + 
##     x2:x3 + x2:x4 + x3:x4, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4804 -3.0766 -0.6635  2.9625 12.2221 
## 
## Coefficients: (2 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.88376   23.17863   0.685  0.49616    
## x1            0.18696    0.78447   0.238  0.81255    
## x2            0.37921    0.06332   5.989 1.89e-07 ***
## x3          -11.99940   67.31148  -0.178  0.85919    
## x4           -8.86442   35.62553  -0.249  0.80446    
## x1:x2         0.01155    0.00869   1.329  0.18955    
## x1:x3              NA         NA      NA       NA    
## x1:x4        -1.11525    1.14847  -0.971  0.33592    
## x2:x3              NA         NA      NA       NA    
## x2:x4        -0.38547    0.11962  -3.222  0.00218 ** 
## x3:x4        72.85976  103.15353   0.706  0.48308    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.683 on 53 degrees of freedom
## Multiple R-squared:  0.7496, Adjusted R-squared:  0.7118 
## F-statistic: 19.83 on 8 and 53 DF,  p-value: 1.947e-13

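We remove the x1:x3 and x2:x3 interaction terms because R reports their coefficients as NA ("not defined because of singularities"). These terms are exact linear combinations of other columns in the model, so their coefficients cannot be estimated. Dropping them leaves the fit unchanged.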
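As a rough illustration of how aliased terms can be listed programmatically, the sketch below uses a hypothetical toy data frame (the names `toy`, `a`, `b`, `c` are illustrative and not part of the assignment data) in which one column is an exact linear combination of two others.

```r
# Toy data (hypothetical): column c is an exact linear combination of a and b.
set.seed(1)
toy <- data.frame(a = rnorm(20), b = rnorm(20))
toy$c <- toy$a + toy$b
toy$y <- 1 + 2 * toy$a - toy$b + rnorm(20)

fit <- lm(y ~ a + b + c, data = toy)

# Coefficients reported as NA are the aliased (non-estimable) terms.
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)

# alias() shows how each aliased term depends on the estimable ones.
print(alias(fit)$Complete)
```

Applied to the assignment, the same lines with `model2` in place of `fit` should list `x1:x3` and `x2:x3`, matching the NA rows in the summary.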

model2_updated <- lm(y~x1+x2+x3+x4+x1:x2+x1:x4+x2:x4+x3:x4,data=dat)
summary(model2_updated)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x4 + x2:x4 + 
##     x3:x4, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4804 -3.0766 -0.6635  2.9625 12.2221 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.88376   23.17863   0.685  0.49616    
## x1            0.18696    0.78447   0.238  0.81255    
## x2            0.37921    0.06332   5.989 1.89e-07 ***
## x3          -11.99940   67.31148  -0.178  0.85919    
## x4           -8.86442   35.62553  -0.249  0.80446    
## x1:x2         0.01155    0.00869   1.329  0.18955    
## x1:x4        -1.11525    1.14847  -0.971  0.33592    
## x2:x4        -0.38547    0.11962  -3.222  0.00218 ** 
## x3:x4        72.85976  103.15353   0.706  0.48308    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.683 on 53 degrees of freedom
## Multiple R-squared:  0.7496, Adjusted R-squared:  0.7118 
## F-statistic: 19.83 on 8 and 53 DF,  p-value: 1.947e-13

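The overall F-test for the full model gives F = 19.83 on 8 and 53 degrees of freedom with a p-value of 1.947e-13, which is less than 0.05. We reject the null hypothesis that all slope coefficients are zero, so the full model is significant.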
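The overall F statistic can be recovered from the reported R-squared. The sketch below copies the values from the summary output; because the inputs are rounded, the result should only be close to the printed 19.83.

```r
# Values copied from the summary(model2_updated) output.
r2     <- 0.7496  # multiple R-squared
k      <- 8       # number of estimated slope terms
df_res <- 53      # residual degrees of freedom

# Overall F: explained variation per term over unexplained variation per residual df.
f_overall <- (r2 / k) / ((1 - r2) / df_res)

# Upper-tail probability of the F distribution gives the p-value.
p_overall <- pf(f_overall, k, df_res, lower.tail = FALSE)

print(f_overall)
print(p_overall)
```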

This test only shows that at least one term in the model is useful. It does not show which individual predictors are significant, so we follow up with partial F-tests or individual t-tests.
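For the adequacy check that the problem asks for, residual diagnostics are the usual tool. The following is a minimal sketch only: `check_adequacy` is a hypothetical helper, the data are simulated stand-ins rather than the assignment data, and the cut-off of 3 for studentized residuals is a common rule of thumb, not a requirement.

```r
# Hypothetical helper: summarises a few common residual checks for an lm fit.
check_adequacy <- function(fit) {
  r <- rstudent(fit)  # externally studentized residuals
  list(
    shapiro_p   = shapiro.test(r)$p.value,  # normality of the residuals
    n_outliers  = sum(abs(r) > 3),          # unusually large residuals
    # Correlation between |residual| and fitted value: a rough funnel-shape check.
    spread_corr = cor(abs(r), fitted(fit))
  )
}

# Simulated stand-in data (not the assignment data).
set.seed(42)
sim <- data.frame(x1 = runif(62, 2, 8), x2 = runif(62, 10, 90))
sim$y <- 5 + 0.5 * sim$x1 + 0.3 * sim$x2 + rnorm(62, sd = 4)

res <- check_adequacy(lm(y ~ x1 + x2, data = sim))
print(res)
```

With the assignment data, one would pass `model2_updated` to the helper, and `plot(model2_updated, which = 1:2)` would draw the residuals-versus-fitted and normal Q-Q plots for a visual check.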

Problem B

Test for the significance of all two-factor interactions using a partial F-test. What are your findings?

anova(model1,model2_updated)
## Analysis of Variance Table
## 
## Model 1: y ~ x1 + x2 + x3 + x4
## Model 2: y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x4 + x2:x4 + x3:x4
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     57 1432.8                              
## 2     53 1162.4  4    270.37 3.0819 0.02352 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#model1 is the reduced model (main effects only)
#model2_updated is the full model with the interaction variables

Pr(>F) = 0.02352 is less than 0.05, so we reject H0 that all four interaction coefficients are zero. At least one two-factor interaction contributes significantly beyond the main effects in model1.
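The partial F statistic in the ANOVA table can be reproduced by hand from the two residual sums of squares. This sketch copies the rounded values from the table, so the result should agree with the printed 3.0819 only to about three significant digits.

```r
# Values copied from the anova(model1, model2_updated) table.
rss_reduced <- 1432.8; df_reduced <- 57  # main effects only
rss_full    <- 1162.4; df_full    <- 53  # main effects plus four interactions

# Number of coefficients being tested (the four interaction terms).
q <- df_reduced - df_full

# Partial F: drop in RSS per tested term over the full model's mean squared error.
f_partial <- ((rss_reduced - rss_full) / q) / (rss_full / df_full)
p_partial <- pf(f_partial, q, df_full, lower.tail = FALSE)

print(f_partial)
print(p_partial)
```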

Problem C

Determine the best fitting model using partial F and/or t-tests.  What is the final model? 

Partial F-test (equivalent to the t-test) on the x1:x2 interaction variable:

model2_updated <- lm(y~x1+x2+x3+x4+x1:x2+x1:x4+x2:x4+x3:x4,data=dat) #this is the full model

model21 <- lm(y~x1+x2+x3+x4+x1:x4+x2:x4+x3:x4,data=dat) #this is the reduced model (x1:x2 dropped)
anova(model21,model2_updated)
## Analysis of Variance Table
## 
## Model 1: y ~ x1 + x2 + x3 + x4 + x1:x4 + x2:x4 + x3:x4
## Model 2: y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x4 + x2:x4 + x3:x4
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     54 1201.2                           
## 2     53 1162.4  1    38.737 1.7662 0.1895

The p-value (0.1895) is greater than 0.05, so we fail to reject the null hypothesis: the x1:x2 interaction variable is not significant.
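When a single coefficient is dropped, the partial F statistic equals the square of that coefficient's t statistic, and the two p-values coincide. The sketch below checks this using the t values copied from the coefficient table of `model2_updated`; since those values are rounded, small differences in the last digits are expected.

```r
# t values copied from the summary(model2_updated) coefficient table.
t_vals <- c("x1:x2" = 1.329, "x1:x4" = -0.971, "x2:x4" = -3.222, "x3:x4" = 0.706)

# Squaring each t value gives the single-term partial F statistic.
f_from_t <- t_vals^2

# Two-sided t-test p-values on the 53 residual degrees of freedom.
p_from_t <- 2 * pt(abs(t_vals), df = 53, lower.tail = FALSE)

print(round(f_from_t, 3))
print(round(p_from_t, 4))
```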

Partial F-test (equivalent to the t-test) on the x1:x4 interaction variable:

model2_updated <- lm(y~x1+x2+x3+x4+x1:x2+x1:x4+x2:x4+x3:x4,data=dat) #this is the full model
model22 <- lm(y~x1+x2+x3+x4+x1:x2+x2:x4+x3:x4,data=dat) #this is the reduced model
anova(model22,model2_updated)
## Analysis of Variance Table
## 
## Model 1: y ~ x1 + x2 + x3 + x4 + x1:x2 + x2:x4 + x3:x4
## Model 2: y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x4 + x2:x4 + x3:x4
##   Res.Df    RSS Df Sum of Sq     F Pr(>F)
## 1     54 1183.1                          
## 2     53 1162.4  1    20.682 0.943 0.3359

The p-value (0.3359) is greater than 0.05, so we fail to reject the null hypothesis: the x1:x4 interaction variable is not significant.

Partial F-test (equivalent to the t-test) on the x2:x4 interaction variable:

model2_updated <- lm(y~x1+x2+x3+x4+x1:x2+x1:x4+x2:x4+x3:x4,data=dat) #this is the full model
model23 <- lm(y~x1+x2+x3+x4+x1:x2+x1:x4+x3:x4,data=dat) #this is the reduced model (x2:x4 dropped)
anova(model23,model2_updated)
## Analysis of Variance Table
## 
## Model 1: y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x4 + x3:x4
## Model 2: y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x4 + x2:x4 + x3:x4
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     54 1390.2                                
## 2     53 1162.4  1    227.75 10.384 0.002176 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value (0.002176) is less than 0.05, so we reject the null hypothesis: the x2:x4 interaction variable is significant.

Partial F-test (equivalent to the t-test) on the x3:x4 interaction variable:

model2_updated <- lm(y~x1+x2+x3+x4+x1:x2+x1:x4+x2:x4+x3:x4,data=dat) #this is the full model
model24 <- lm(y~x1+x2+x3+x4+x1:x2+x1:x4+x2:x4,data=dat) #this is the reduced model (x3:x4 dropped)
anova(model24,model2_updated)
## Analysis of Variance Table
## 
## Model 1: y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x4 + x2:x4
## Model 2: y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x4 + x2:x4 + x3:x4
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     54 1173.4                           
## 2     53 1162.4  1    10.942 0.4989 0.4831

The p-value (0.4831) is greater than 0.05, so we fail to reject the null hypothesis: the x3:x4 interaction variable is not significant.
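The four single-term comparisons above can also be obtained in one call with `drop1()`, which refits the model with each droppable term removed. The sketch below runs it on simulated stand-in data (the data frame `sim` and its coefficients are assumptions made for illustration, not the assignment data), using the same model formula.

```r
# Simulated stand-in data with the same variable names (not the assignment data).
set.seed(7)
n <- 62
sim <- data.frame(x1 = runif(n, 2, 8), x2 = runif(n, 10, 90),
                  x3 = runif(n, 0.2, 0.6), x4 = runif(n, 0.2, 1))
sim$y <- 5 + 0.4 * sim$x2 + 10 * sim$x4 - 0.3 * sim$x2 * sim$x4 + rnorm(n, sd = 4)

fit <- lm(y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x4 + x2:x4 + x3:x4, data = sim)

# One partial F-test per droppable term. Main effects that take part in an
# interaction are not offered for deletion, so only the interactions appear.
tab <- drop1(fit, test = "F")
print(tab)
```

On the assignment data, `drop1(model2_updated, test = "F")` should reproduce the four F values and p-values reported above.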

The partial F-test in Problem B shows that the interaction terms are jointly significant, but it does not show which ones matter. Testing each interaction variable individually, we conclude that x2:x4 is the only significant interaction variable.

The best model would be:

model_best <- lm(y~x1+x2+x3+x4+x2:x4,data=dat) #this is the final model
summary(model_best)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x2:x4, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7718 -3.5211 -0.7941  3.5334 11.3012 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.30440    4.20077   0.549  0.58548    
## x1          -0.23456    0.32696  -0.717  0.47612    
## x2           0.36987    0.06289   5.881 2.37e-07 ***
## x3          34.89878   10.35789   3.369  0.00137 ** 
## x4           9.88611    3.01599   3.278  0.00180 ** 
## x2:x4       -0.28846    0.09373  -3.078  0.00323 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.678 on 56 degrees of freedom
## Multiple R-squared:  0.736,  Adjusted R-squared:  0.7125 
## F-statistic: 31.23 on 5 and 56 DF,  p-value: 4.871e-15
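Dropping three interaction terms at once is itself a hypothesis that can be tested jointly. The sketch below approximates that partial F-test from the printed output: the final model's RSS is reconstructed from its rounded residual standard error, so the numbers are approximate. With the data loaded, `anova(model_best, model2_updated)` would give the exact values.

```r
# Full (updated) model: RSS and residual df from the earlier ANOVA tables.
rss_full <- 1162.4; df_full <- 53

# Final model: residual standard error and df from summary(model_best).
rse_final <- 4.678; df_final <- 56
rss_final <- rse_final^2 * df_final  # RSS = sigma_hat^2 * residual df

# Joint test of the three dropped interactions (x1:x2, x1:x4, x3:x4).
q <- df_final - df_full
f_drop <- ((rss_final - rss_full) / q) / (rss_full / df_full)
p_drop <- pf(f_drop, q, df_full, lower.tail = FALSE)

print(f_drop)
print(p_drop)
```

A p-value well above 0.05 here would indicate that the three dropped terms are not jointly significant, which supports the reduced final model.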

Problem D and E

point <- data.frame(c(5.0,10.0),c(10.0,3.0),c(0.5,0.25),c(0.75,0.85))
xx1<- c(5.0,10.0)
xx2<- c(10.0,3.0)
xx3<- c(0.5,0.25)
xx4<- c(0.75,0.85)
predict(model_best,data.frame(x1=xx1,x2=xx2,x3=xx3,x4=xx4))
##        1        2 
## 27.53084 17.46075
predict(model_best,data.frame(x1=xx1,x2=xx2,x3=xx3,x4=xx4),interval="confidence")
##        fit      lwr      upr
## 1 27.53084 24.07032 30.99136
## 2 17.46075 13.16386 21.75764
predict(model_best,data.frame(x1=xx1,x2=xx2,x3=xx3,x4=xx4),interval="prediction")
##        fit       lwr      upr
## 1 27.53084 17.541098 37.52059
## 2 17.46075  7.151384 27.77012
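As a sanity check, the fitted values can be reproduced by hand from the coefficients of `model_best`. The sketch below copies the rounded estimates from the summary output, so agreement is expected only to about three decimal places; the names `b` and `new_pts` are illustrative.

```r
# Coefficient estimates copied from summary(model_best).
b <- c(intercept = 2.30440, x1 = -0.23456, x2 = 0.36987,
       x3 = 34.89878, x4 = 9.88611, x2x4 = -0.28846)

# The two new points used in the predict() calls.
new_pts <- data.frame(x1 = c(5, 10), x2 = c(10, 3),
                      x3 = c(0.5, 0.25), x4 = c(0.75, 0.85))

# Fitted value: intercept + main effects + x2:x4 interaction.
y_hat <- with(new_pts,
              b[["intercept"]] + b[["x1"]] * x1 + b[["x2"]] * x2 +
                b[["x3"]] * x3 + b[["x4"]] * x4 + b[["x2x4"]] * x2 * x4)

print(y_hat)
```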

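The second output gives the 95% confidence intervals for the mean response: (24.07, 30.99) at the first point and (13.16, 21.76) at the second.

The third output gives the 95% prediction intervals for a new observation: (17.54, 37.52) at the first point and (7.15, 27.77) at the second. The prediction intervals are wider because they account for the variability of an individual observation in addition to the uncertainty in the estimated mean.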
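The two kinds of interval are linked: the squared half-width of a prediction interval equals the squared half-width of the confidence interval plus (t × sigma_hat)². The sketch below checks this with the values printed above, assuming the default 95% level of `predict()` and the rounded residual standard error of `model_best`, so agreement is approximate.

```r
# Residual standard error and residual df from summary(model_best).
sigma_hat <- 4.678
df_res    <- 56
t_crit    <- qt(0.975, df_res)  # two-sided 95% critical value

# Interval limits copied from the predict() output (rows = the two new points).
ci_limits <- rbind(c(24.07032, 30.99136), c(13.16386, 21.75764))
pi_limits <- rbind(c(17.541098, 37.52059), c(7.151384, 27.77012))

ci_half          <- (ci_limits[, 2] - ci_limits[, 1]) / 2
pi_half_reported <- (pi_limits[, 2] - pi_limits[, 1]) / 2

# Prediction half-width rebuilt from the confidence half-width.
pi_half_rebuilt <- sqrt(ci_half^2 + (t_crit * sigma_hat)^2)

print(cbind(pi_half_reported, pi_half_rebuilt))
```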