Based on the scatterplot matrix and the correlation matrix, there may be multicollinearity between X3 and X4 (their correlation is about 0.78), and both of these predictors are also strongly correlated with the response (a VIF sketch follows the correlation matrix below).
ch9 <- read.table("CH09PR10.txt")
names(ch9) <- c("Y", "X1", "X2", "X3", "X4")
# Scatterplot matrix of the response and the four predictors
pairs(Y ~ ., data = ch9,
      main = "Simple Scatterplot Matrix")
corr_matrix <- cor(ch9)
corr_matrix
## Y X1 X2 X3 X4
## Y 1.0000000 0.5144107 0.4970057 0.8970645 0.8693865
## X1 0.5144107 1.0000000 0.1022689 0.1807692 0.3266632
## X2 0.4970057 0.1022689 1.0000000 0.5190448 0.3967101
## X3 0.8970645 0.1807692 0.5190448 1.0000000 0.7820385
## X4 0.8693865 0.3266632 0.3967101 0.7820385 1.0000000
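As a quick follow-up on the multicollinearity question, variance inflation factors can be computed directly from the predictor correlation matrix (each VIF is a diagonal element of its inverse). A minimal sketch; values above roughly 10 are usually taken to signal serious multicollinearity:
# Sketch: VIFs from the inverse of the predictor correlation matrix
round(diag(solve(cor(ch9[, c("X1", "X2", "X3", "X4")]))), 2)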
\(\hat{Y} = -124.3818 + 0.29573X_1 + 0.04829X_2 + 1.30601X_3 + 0.51982X_4\)
Yes, it appears that all predictor variables could be retained in the model, since the overall F-test is highly significant. However, the t-test for X2 has a nonsignificant p-value (0.404), so further tests should be conducted before deciding whether to keep it (a partial F-test sketch follows the summary output below).
n <-nrow(ch9);n
## [1] 25
p <- 5  # number of regression parameters in the full model (intercept + 4 predictors)
fitJP <- lm(Y ~ ., data = ch9)
fitJP
##
## Call:
## lm(formula = Y ~ ., data = ch9)
##
## Coefficients:
## (Intercept) X1 X2 X3 X4
## -124.38182 0.29573 0.04829 1.30601 0.51982
attach(ch9)
summary(fitJP)
##
## Call:
## lm(formula = Y ~ ., data = ch9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9779 -3.4506 0.0941 2.4749 5.9959
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -124.38182 9.94106 -12.512 6.48e-11 ***
## X1 0.29573 0.04397 6.725 1.52e-06 ***
## X2 0.04829 0.05662 0.853 0.40383
## X3 1.30601 0.16409 7.959 1.26e-07 ***
## X4 0.51982 0.13194 3.940 0.00081 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.099 on 20 degrees of freedom
## Multiple R-squared: 0.9629, Adjusted R-squared: 0.9555
## F-statistic: 129.7 on 4 and 20 DF, p-value: 5.262e-14
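One such further test is the extra-sum-of-squares (partial F) test comparing the model without X2 against the full model. A minimal sketch using anova() on the nested fits (fit_noX2 is a name introduced here); the resulting F statistic should equal the square of the t statistic for X2 shown above:
# Sketch: partial F-test for dropping X2 from the full model
fit_noX2 <- lm(Y ~ X1 + X3 + X4, data = ch9)
anova(fit_noX2, fitJP)  # a nonsignificant F supports dropping X2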
Based on adjusted R^2, Mallows' Cp, AIC, BIC, and PRESS, the best subset is the model with X1, X3, and X4, although the adjusted R^2 of the full model is nearly identical.
library(MASS)
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
##
## cement
## The following object is masked from 'package:datasets':
##
## rivers
library(leaps)
library(DAAG)
## Loading required package: lattice
##
## Attaching package: 'DAAG'
## The following object is masked from 'package:MASS':
##
## hills
# fitJP is the full model
ols_step_all_possible(fitJP)  # R-squared, adjusted R-squared, and Mallows' Cp for every subset
## # A tibble: 15 x 6
## Index N Predictors `R-Square` `Adj. R-Square` `Mallow's Cp`
## * <int> <int> <chr> <dbl> <dbl> <dbl>
## 1 1 1 X3 0.805 0.796 84.2
## 2 2 1 X4 0.756 0.745 111.
## 3 3 1 X1 0.265 0.233 375.
## 4 4 1 X2 0.247 0.214 385.
## 5 5 2 X1 X3 0.933 0.927 17.1
## 6 6 2 X3 X4 0.877 0.866 47.2
## 7 7 2 X1 X4 0.815 0.798 80.6
## 8 8 2 X2 X3 0.806 0.788 85.5
## 9 9 2 X2 X4 0.783 0.764 97.8
## 10 10 2 X1 X2 0.464 0.415 270.
## 11 11 3 X1 X3 X4 0.962 0.956 3.73
## 12 12 3 X1 X2 X3 0.934 0.925 18.5
## 13 13 3 X2 X3 X4 0.879 0.862 48.2
## 14 14 3 X1 X2 X4 0.845 0.823 66.3
## 15 15 4 X1 X2 X3 X4 0.963 0.955 5
k <- ols_step_all_possible(fitJP)
plot(k)
k <- as.data.frame(k)
sse1 <- (1 - k$rsquare) * var(Y) * (n - 1)    # SSE_p = (1 - R^2_p) * SSTO
aic1 <- k$aic - n * log(2 * pi) - n - 2       # convert R's AIC to the textbook AIC_p = n*ln(SSE_p) - n*ln(n) + 2p
bic1 <- aic1 + (log(n) - 2) * (k[, 2] + 1)    # textbook SBC_p = n*ln(SSE_p) - n*ln(n) + ln(n)*p
press1 <- (1 - k$predrsq) * var(Y) * (n - 1)  # PRESS_p from the predicted R-squared
k1 <- cbind(k, aic1, bic1, press1, sse1)
# Reproduces Table 9.2 on page 353 of the textbook
table2 <- k1[c(4, 3, 1, 2, 10, 7, 9, 5, 8, 6, 11, 14, 13, 12, 15), c(3, 2, 18, 4, 5, 7, 15, 16, 17)]
table2
## predictors n sse1 rsquare adjr cp aic1
## 2 X2 1 6817.5291 0.2470147 0.2142762 384.832454 144.20941
## 1 X1 1 6658.1453 0.2646184 0.2326452 375.344689 143.61801
## 3 X3 1 1768.0228 0.8047247 0.7962344 84.246496 110.46853
## 4 X4 1 2210.6887 0.7558329 0.7452170 110.597414 116.05459
## 5 X1 X2 2 4851.1799 0.4641948 0.4154853 269.780029 137.70254
## 7 X1 X4 2 1672.5853 0.8152656 0.7984716 80.565307 111.08125
## 9 X2 X4 2 1962.0716 0.7832923 0.7635916 97.797790 115.07201
## 6 X1 X3 2 606.6574 0.9329956 0.9269043 17.112978 85.72721
## 8 X2 X3 2 1755.8127 0.8060733 0.7884436 85.519650 112.29528
## 10 X3 X4 2 1111.3126 0.8772573 0.8660988 47.153985 100.86053
## 13 X1 X3 X4 3 348.1970 0.9615422 0.9560482 3.727399 73.84732
## 12 X1 X2 X4 3 1400.1275 0.8453581 0.8232664 66.346500 108.63607
## 14 X2 X3 X4 3 1095.8078 0.8789698 0.8616797 48.231020 102.50928
## 11 X1 X2 X3 3 596.7207 0.9340931 0.9246779 18.521465 87.31433
## 15 X1 X2 X3 X4 4 335.9775 0.9628918 0.9554702 5.000000 74.95421
## bic1 press1
## 2 146.64717 7991.0964
## 1 146.05576 7791.5994
## 3 112.90629 2064.5976
## 4 118.49234 2548.6349
## 5 141.35916 6444.0411
## 7 114.73788 2109.8967
## 9 118.72864 2491.7979
## 6 89.38384 760.9744
## 8 115.95191 2206.6460
## 10 104.51716 1449.6001
## 13 78.72282 471.4520
## 12 113.51157 1885.8454
## 14 107.38479 1570.5610
## 11 92.18984 831.1521
## 15 81.04859 518.9885
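As a cross-check on the aic1 conversion above (which shifts R's AIC to the textbook form AIC_p = n ln(SSE_p) - n ln(n) + 2p), the criterion can be computed directly from a single candidate fit. A sketch for the X1, X3, X4 subset (fit134 and p134 are names introduced here); the result should agree with the aic1 entry for that row of table2:
# Sketch: textbook AIC_p computed directly for the X1, X3, X4 subset
fit134 <- lm(Y ~ X1 + X3 + X4, data = ch9)
p134 <- length(coef(fit134))  # number of regression parameters, p = 4
n * log(sum(resid(fit134)^2)) - n * log(n) + 2 * p134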
# Another method (leaps::regsubsets) that reports adjusted R-squared, Cp, and BIC but not AIC or PRESS
best <- function(model, ...)
{
subsets <- regsubsets(formula(model), model.frame(model), ...)
subsets <- with(summary(subsets),
cbind(p = as.numeric(rownames(which)), which, adjr2, cp, bic))
return(subsets)
}
round(best(fitJP, nbest = 6), 4)
## p (Intercept) X1 X2 X3 X4 adjr2 cp bic
## 1 1 1 0 0 1 0 0.7962 84.2465 -34.3959
## 1 1 1 0 0 0 1 0.7452 110.5974 -28.8098
## 1 1 1 1 0 0 0 0.2326 375.3447 -1.2464
## 1 1 1 0 1 0 0 0.2143 384.8325 -0.6550
## 2 2 1 1 0 1 0 0.9269 17.1130 -57.9183
## 2 2 1 0 0 1 1 0.8661 47.1540 -42.7850
## 2 2 1 1 0 0 1 0.7985 80.5653 -32.5643
## 2 2 1 0 1 1 0 0.7884 85.5196 -31.3502
## 2 2 1 0 1 0 1 0.7636 97.7978 -28.5735
## 2 2 1 1 1 0 0 0.4155 269.7800 -5.9430
## 3 3 1 1 0 1 1 0.9560 3.7274 -68.5793
## 3 3 1 1 1 1 0 0.9247 18.5215 -55.1123
## 3 3 1 0 1 1 1 0.8617 48.2310 -39.9174
## 3 3 1 1 1 0 1 0.8233 66.3465 -33.7906
## 4 4 1 1 1 1 1 0.9555 5.0000 -66.2536
Forward selection also points to the model with X1, X3, and X4: the p-value-based procedure (ols_step_forward_p) enters X3, X1, and X4 and stops, and regsubsets with method = "forward" adds the predictors in that same order before X2, matching the best-subsets result above. Note that stepAIC below is given the full model as its starting point, so with direction = "forward" it has nothing to add and simply returns the full model (see the sketch after its summary output).
step.model <- stepAIC(fitJP, direction = "forward",
trace = FALSE)
summary(step.model)
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + X4, data = ch9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9779 -3.4506 0.0941 2.4749 5.9959
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -124.38182 9.94106 -12.512 6.48e-11 ***
## X1 0.29573 0.04397 6.725 1.52e-06 ***
## X2 0.04829 0.05662 0.853 0.40383
## X3 1.30601 0.16409 7.959 1.26e-07 ***
## X4 0.51982 0.13194 3.940 0.00081 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.099 on 20 degrees of freedom
## Multiple R-squared: 0.9629, Adjusted R-squared: 0.9555
## F-statistic: 129.7 on 4 and 20 DF, p-value: 5.262e-14
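Because stepAIC() above was started from the full model, direction = "forward" has nothing to add, which is why the summary still shows all four predictors. Forward selection by AIC should start from the intercept-only model with the full model as the upper scope. A minimal sketch (fit_null and step_fwd are names introduced here); it is expected to end at the X1, X3, X4 model:
# Sketch: forward selection by AIC, starting from the intercept-only model
fit_null <- lm(Y ~ 1, data = ch9)
step_fwd <- stepAIC(fit_null, scope = list(lower = ~ 1, upper = ~ X1 + X2 + X3 + X4),
                    direction = "forward", trace = FALSE)
formula(step_fwd)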
models <- regsubsets(Y~., data = ch9, nvmax = 5,
method = "forward")
summary(models)
## Subset selection object
## Call: regsubsets.formula(Y ~ ., data = ch9, nvmax = 5, method = "forward")
## 4 Variables (and intercept)
## Forced in Forced out
## X1 FALSE FALSE
## X2 FALSE FALSE
## X3 FALSE FALSE
## X4 FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: forward
## X1 X2 X3 X4
## 1 ( 1 ) " " " " "*" " "
## 2 ( 1 ) "*" " " "*" " "
## 3 ( 1 ) "*" " " "*" "*"
## 4 ( 1 ) "*" "*" "*" "*"
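The criteria for the best model of each size along this forward path can be pulled from the regsubsets summary. A small sketch (ms is a name introduced here):
# Sketch: adjusted R-squared, Cp, and BIC for each model size on the forward path
ms <- summary(models)
data.frame(size = 1:4, adjr2 = round(ms$adjr2, 4), cp = round(ms$cp, 2), bic = round(ms$bic, 2))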
ols_step_forward_p(fitJP, penter = 0.1)  # penter sets the p-value threshold for a variable to enter
## Forward Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. X1
## 2. X2
## 3. X3
## 4. X4
##
## We are selecting variables based on p value...
##
## Variables Entered:
##
## - X3
## - X1
## - X4
##
## No more variables to be added.
##
## Final Model Output
## ------------------
##
## Model Summary
## --------------------------------------------------------------
## R 0.981 RMSE 4.072
## R-Squared 0.962 Coef. Var 4.416
## Adj. R-Squared 0.956 MSE 16.581
## Pred R-Squared 0.948 MAE 3.096
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ---------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ---------------------------------------------------------------------
## Regression 8705.803 3 2901.934 175.018 0.0000
## Residual 348.197 21 16.581
## Total 9054.000 24
## ---------------------------------------------------------------------
##
## Parameter Estimates
## -----------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -----------------------------------------------------------------------------------------------
## (Intercept) -124.200 9.874 -12.578 0.000 -144.734 -103.666
## X3 1.357 0.152 0.619 8.937 0.000 1.041 1.673
## X1 0.296 0.044 0.310 6.784 0.000 0.205 0.387
## X4 0.517 0.131 0.284 3.948 0.001 0.245 0.790
## -----------------------------------------------------------------------------------------------
##
## Selection Summary
## -------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## -------------------------------------------------------------------------
## 1 X3 0.8047 0.7962 84.2465 183.4155 8.7676
## 2 X1 0.9330 0.9269 17.1130 158.6741 5.2512
## 3 X4 0.9615 0.9560 3.7274 146.7942 4.0720
## -------------------------------------------------------------------------
Backward stepwise AIC drops X2 in the first step (the AIC falls from 74.95 to 73.85) and then stops, so it also selects the model with X1, X3, and X4; the full four-variable model is the runner-up, with a slightly larger AIC. I would choose the model with fewer predictors, because adding X2 does not provide enough of a gain to be considered justifiable.
stepAIC(fitJP, direction="backward")
## Start: AIC=74.95
## Y ~ X1 + X2 + X3 + X4
##
## Df Sum of Sq RSS AIC
## - X2 1 12.22 348.20 73.847
## <none> 335.98 74.954
## - X4 1 260.74 596.72 87.314
## - X1 1 759.83 1095.81 102.509
## - X3 1 1064.15 1400.13 108.636
##
## Step: AIC=73.85
## Y ~ X1 + X3 + X4
##
## Df Sum of Sq RSS AIC
## <none> 348.20 73.847
## - X4 1 258.46 606.66 85.727
## - X1 1 763.12 1111.31 100.861
## - X3 1 1324.39 1672.59 111.081
##
## Call:
## lm(formula = Y ~ X1 + X3 + X4, data = ch9)
##
## Coefficients:
## (Intercept) X1 X3 X4
## -124.2000 0.2963 1.3570 0.5174
The SSE for our new (reduced) model is 348.197, versus 335.9775 for the original full model. Taken at face value the higher SSE would suggest the new model is worse, but SSE necessarily decreases whenever a predictor is added, so this comparison favors the larger model by construction.
The PRESS for our new model is 471.452, versus 518.9885 for the original model, so PRESS favors the reduced model.
This comparison suggests that SSE (and hence MSE) is not always a valid indicator of the predictive ability of a fitted model.
fitJP_new <- lm(Y ~ X1 + X3 + X4, data = ch9)
# SSE = MSE * (n - p): p = 4 for the reduced model, p = 5 for the full model
SSE_new <- sigma(fitJP_new)^2 * (n - 4)
SSE_new  # new (reduced) model SSE
## [1] 348.197
sigma(fitJP)^2 * (n - 5)  # original (full) model SSE
## [1] 335.9775
press_new <- press(fitJP_new)
press_new #New Model PRESS
## [1] 471.452
press(fitJP) #Original Model PRESS
## [1] 518.9885
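The press() values can also be cross-checked with the closed-form identity for linear models, \(PRESS = \sum_i (e_i/(1-h_{ii}))^2\), where the \(h_{ii}\) are the leverages. A sketch; each line should match the corresponding press() result above:
# Sketch: PRESS from ordinary residuals and leverages
sum((resid(fitJP_new) / (1 - hatvalues(fitJP_new)))^2)  # new (reduced) model
sum((resid(fitJP) / (1 - hatvalues(fitJP)))^2)          # original (full) model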
Yes. Based on the correlation matrices, the validation data set appears reasonably similar to the model-building data set: the pairwise correlations among the predictors, and between each predictor and the response, are close in the two sets.
ch922 <- read.table("CH09PR22.txt")
names(ch922) <- c("Y", "X1", "X2", "X3", "X4")
attach(ch922)
cor(ch922)
## Y X1 X2 X3 X4
## Y 1.0000000 0.53707787 0.34477442 0.8880519 0.8879388
## X1 0.5370779 1.00000000 0.01057088 0.1772891 0.3196395
## X2 0.3447744 0.01057088 1.00000000 0.3437441 0.2207638
## X3 0.8880519 0.17728907 0.34374413 1.0000000 0.8714466
## X4 0.8879388 0.31963945 0.22076377 0.8714466 1.0000000
cor(ch9)
## Y X1 X2 X3 X4
## Y 1.0000000 0.5144107 0.4970057 0.8970645 0.8693865
## X1 0.5144107 1.0000000 0.1022689 0.1807692 0.3266632
## X2 0.4970057 0.1022689 1.0000000 0.5190448 0.3967101
## X3 0.8970645 0.1807692 0.5190448 1.0000000 0.7820385
## X4 0.8693865 0.3266632 0.3967101 0.7820385 1.0000000
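A compact way to compare the two correlation structures is the elementwise difference of the matrices; values near zero indicate similar structure. A minimal sketch:
# Sketch: model-building minus validation correlations
round(cor(ch9) - cor(ch922), 3)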
The model fitted to the model-building data set is \(\hat{Y} = -124.20002 + 0.29633X_1 + 1.35697X_3 + 0.51742X_4\), with
s{b0} = 9.87406, s{b1} = 0.04368, s{b3} = 0.15183, s{b4} = 0.13105.
The mean squared residual (SSE/n) is 13.92788 and the adjusted R-squared is 0.956.
The same model fitted to the validation data set is \(\hat{Y} = -122.76705 + 0.31238X_1 + 1.40676X_3 + 0.42838X_4\), with
s{b0} = 11.84783, s{b1} = 0.04729, s{b3} = 0.23262, s{b4} = 0.19749.
The mean squared residual is 15.41814 and the adjusted R-squared is 0.9416.
The estimates from the validation data set appear reasonably similar to those from the model-building data set (a side-by-side comparison sketch follows the two summaries below).
summary(fitJP_new)
##
## Call:
## lm(formula = Y ~ X1 + X3 + X4, data = ch9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4579 -3.1563 -0.2057 1.8070 6.6083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -124.20002 9.87406 -12.578 3.04e-11 ***
## X1 0.29633 0.04368 6.784 1.04e-06 ***
## X3 1.35697 0.15183 8.937 1.33e-08 ***
## X4 0.51742 0.13105 3.948 0.000735 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.072 on 21 degrees of freedom
## Multiple R-squared: 0.9615, Adjusted R-squared: 0.956
## F-statistic: 175 on 3 and 21 DF, p-value: 5.16e-15
mean(summary(fitJP_new)$residuals^2)  # mean squared residual, SSE/n
## [1] 13.92788
fit_valid <- lm(Y~X1+X3+X4, data=ch922)
summary(fit_valid)
##
## Call:
## lm(formula = Y ~ X1 + X3 + X4, data = ch922)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4619 -2.3836 0.6834 2.1123 7.2394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -122.76705 11.84783 -10.362 1.04e-09 ***
## X1 0.31238 0.04729 6.605 1.54e-06 ***
## X3 1.40676 0.23262 6.048 5.31e-06 ***
## X4 0.42838 0.19749 2.169 0.0417 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.284 on 21 degrees of freedom
## Multiple R-squared: 0.9489, Adjusted R-squared: 0.9416
## F-statistic: 130 on 3 and 21 DF, p-value: 1.017e-13
mean(summary(fit_valid)$residuals^2)  # mean squared residual, SSE/n
## [1] 15.41814
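To make the comparison of the two fits easier to see, the coefficients and their standard errors can be placed side by side. A minimal sketch using the fits above:
# Sketch: side-by-side coefficients and standard errors, model-building vs. validation
round(cbind(build = coef(fitJP_new), valid = coef(fit_valid)), 4)
round(cbind(build_se = sqrt(diag(vcov(fitJP_new))),
            valid_se = sqrt(diag(vcov(fit_valid)))), 4)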
MSPR = 15.70972 for the validation data set, compared with the mean squared residual SSE/n = 13.92788 from the model-building data set (the conventional MSE = SSE/(n - p) is 16.58). As in 9.21, in-sample error measures can be optimistic, so MSE by itself may not be the best indicator of how well the fitted model will predict new observations.
fitJP_pred <- predict(fitJP_new, newdata = ch922)  # predictions for the validation cases
MSPR <-mean((ch922$Y-fitJP_pred)^2)
MSPR #MSPR
## [1] 15.70972
Combined model:
\(\hat{Y} = -123.44104 + 0.30364X_1 + 1.36906X_3 + 0.48735X_4\)
s{b0} = 7.16508, s{b1} = 0.03072, s{b3} = 0.12280, s{b4} = 0.10475
The estimated standard deviations of the regression coefficients are appreciably smaller than those obtained from the model-building data set alone (a side-by-side sketch follows the combined fit below).
total <- merge(ch9, ch922, by = c("Y", "X1", "X2", "X3", "X4"), all = TRUE)  # combine the model-building and validation sets (50 cases)
total
## Y X1 X2 X3 X4
## 1 57 111 102 83 72
## 2 58 65 109 88 84
## 3 58 120 77 80 74
## 4 61 81 129 88 76
## 5 64 87 81 90 88
## 6 66 63 101 93 84
## 7 66 67 98 98 84
## 8 67 74 121 91 85
## 9 67 105 102 88 83
## 10 71 84 113 98 78
## 11 71 93 73 91 82
## 12 73 78 85 95 84
## 13 74 110 114 91 78
## 14 75 91 111 96 84
## 15 76 101 117 93 95
## 16 77 95 57 95 85
## 17 78 104 73 93 80
## 18 80 62 97 99 100
## 19 80 100 101 95 88
## 20 81 129 70 94 95
## 21 82 120 94 95 90
## 22 83 91 129 97 83
## 23 88 86 110 100 87
## 24 92 85 90 104 98
## 25 92 99 110 96 97
## 26 92 102 139 101 92
## 27 94 140 121 96 89
## 28 95 111 113 101 91
## 29 95 115 119 102 94
## 30 96 110 107 103 103
## 31 97 93 95 106 98
## 32 98 128 99 98 89
## 33 99 98 125 108 95
## 34 99 99 113 104 95
## 35 99 120 89 105 97
## 36 100 78 125 115 102
## 37 100 104 83 100 102
## 38 100 116 103 103 103
## 39 104 109 120 104 106
## 40 104 112 119 106 105
## 41 109 96 114 114 103
## 42 109 109 129 102 108
## 43 109 136 104 106 104
## 44 111 99 132 109 105
## 45 111 106 102 109 109
## 46 115 94 121 115 104
## 47 116 105 122 116 102
## 48 117 128 134 108 98
## 49 126 133 120 113 108
## 50 127 150 118 107 110
fit_full <- lm(Y~X1+X3+X4, data=total)
summary(fit_full)
##
## Call:
## lm(formula = Y ~ X1 + X3 + X4, data = total)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7192 -2.7369 0.1278 2.0971 7.0657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -123.44104 7.16508 -17.228 < 2e-16 ***
## X1 0.30364 0.03072 9.886 5.86e-13 ***
## X3 1.36906 0.12280 11.148 1.15e-14 ***
## X4 0.48735 0.10475 4.652 2.79e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.006 on 46 degrees of freedom
## Multiple R-squared: 0.9567, Adjusted R-squared: 0.9539
## F-statistic: 338.9 on 3 and 46 DF, p-value: < 2.2e-16
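The reduction in the estimated standard errors noted above can be seen directly by lining them up against the model-building fit. A minimal sketch:
# Sketch: standard errors from the model-building fit vs. the combined fit
round(cbind(build_se = sqrt(diag(vcov(fitJP_new))),
            combined_se = sqrt(diag(vcov(fit_full)))), 4)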