Based on the scatterplot matrix and the correlation matrix, there may be multicollinearity between X3 and X4 (their correlation is about 0.78), and both of these predictors are also strongly correlated with the response (a VIF sketch follows the correlation matrix below).
ch9 <- read.table("CH09PR10.txt")
names(ch9) <- c("Y", "X1", "X2", "X3", "X4")
# Scatterplot matrix of the response and the four predictors
pairs(Y ~ ., data = ch9,
      main = "Simple Scatterplot Matrix")
corr_matrix <- cor(ch9)
corr_matrix
## Y X1 X2 X3 X4
## Y 1.0000000 0.5144107 0.4970057 0.8970645 0.8693865
## X1 0.5144107 1.0000000 0.1022689 0.1807692 0.3266632
## X2 0.4970057 0.1022689 1.0000000 0.5190448 0.3967101
## X3 0.8970645 0.1807692 0.5190448 1.0000000 0.7820385
## X4 0.8693865 0.3266632 0.3967101 0.7820385 1.0000000
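As a quick follow-up on the multicollinearity question, variance inflation factors can be computed directly from the predictor correlation matrix (each VIF is a diagonal element of its inverse). A minimal sketch; values above roughly 10 are usually taken to signal serious multicollinearity:
# Sketch: VIFs from the inverse of the predictor correlation matrix
round(diag(solve(cor(ch9[, c("X1", "X2", "X3", "X4")]))), 2)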
\(\hat{Y} = -124.3818 + 0.29573X_1 + 0.04829X_2 + 1.30601X_3 + 0.51982X_4\)
Yes, it appears that all predictor variables could be retained in the model, since the overall F-test is highly significant. However, the t-test for X2 has a nonsignificant p-value (0.404), so further tests should be conducted before deciding whether to keep it (a partial F-test sketch follows the summary output below).
n <-nrow(ch9);n
## [1] 25
p <- 5  # number of regression parameters in the full model (intercept + 4 predictors)
fitJP <- lm(Y ~ ., data = ch9)
fitJP
##
## Call:
## lm(formula = Y ~ ., data = ch9)
##
## Coefficients:
## (Intercept) X1 X2 X3 X4
## -124.38182 0.29573 0.04829 1.30601 0.51982
attach(ch9)
summary(fitJP)
##
## Call:
## lm(formula = Y ~ ., data = ch9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9779 -3.4506 0.0941 2.4749 5.9959
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -124.38182 9.94106 -12.512 6.48e-11 ***
## X1 0.29573 0.04397 6.725 1.52e-06 ***
## X2 0.04829 0.05662 0.853 0.40383
## X3 1.30601 0.16409 7.959 1.26e-07 ***
## X4 0.51982 0.13194 3.940 0.00081 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.099 on 20 degrees of freedom
## Multiple R-squared: 0.9629, Adjusted R-squared: 0.9555
## F-statistic: 129.7 on 4 and 20 DF, p-value: 5.262e-14
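One such further test is the extra-sum-of-squares (partial F) test comparing the model without X2 against the full model. A minimal sketch using anova() on the nested fits (fit_noX2 is a name introduced here); the resulting F statistic should equal the square of the t statistic for X2 shown above:
# Sketch: partial F-test for dropping X2 from the full model
fit_noX2 <- lm(Y ~ X1 + X3 + X4, data = ch9)
anova(fit_noX2, fitJP)  # a nonsignificant F supports dropping X2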
Based on adjusted R^2, Mallows' Cp, AIC, BIC, and PRESS, the best subset is the model with X1, X3, and X4, although the adjusted R^2 of the full model is nearly identical.
library(MASS)
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
##
## cement
## The following object is masked from 'package:datasets':
##
## rivers
library(leaps)
library(DAAG)
## Loading required package: lattice
##
## Attaching package: 'DAAG'
## The following object is masked from 'package:MASS':
##
## hills
# fitJP is the full model
ols_step_all_possible(fitJP)  # R-squared, adjusted R-squared, and Mallows' Cp for every subset
## # A tibble: 15 x 6
## Index N Predictors `R-Square` `Adj. R-Square` `Mallow's Cp`
## * <int> <int> <chr> <dbl> <dbl> <dbl>
## 1 1 1 X3 0.805 0.796 84.2
## 2 2 1 X4 0.756 0.745 111.
## 3 3 1 X1 0.265 0.233 375.
## 4 4 1 X2 0.247 0.214 385.
## 5 5 2 X1 X3 0.933 0.927 17.1
## 6 6 2 X3 X4 0.877 0.866 47.2
## 7 7 2 X1 X4 0.815 0.798 80.6
## 8 8 2 X2 X3 0.806 0.788 85.5
## 9 9 2 X2 X4 0.783 0.764 97.8
## 10 10 2 X1 X2 0.464 0.415 270.
## 11 11 3 X1 X3 X4 0.962 0.956 3.73
## 12 12 3 X1 X2 X3 0.934 0.925 18.5
## 13 13 3 X2 X3 X4 0.879 0.862 48.2
## 14 14 3 X1 X2 X4 0.845 0.823 66.3
## 15 15 4 X1 X2 X3 X4 0.963 0.955 5
k <- ols_step_all_possible(fitJP)
plot(k)
k <- as.data.frame(k)
sse1 <- (1 - k$rsquare) * var(Y) * (n - 1)    # SSE_p = (1 - R^2_p) * SSTO
aic1 <- k$aic - n * log(2 * pi) - n - 2       # convert R's AIC to the textbook AIC_p = n*ln(SSE_p) - n*ln(n) + 2p
bic1 <- aic1 + (log(n) - 2) * (k[, 2] + 1)    # textbook SBC_p = n*ln(SSE_p) - n*ln(n) + ln(n)*p
press1 <- (1 - k$predrsq) * var(Y) * (n - 1)  # PRESS_p from the predicted R-squared
k1 <- cbind(k, aic1, bic1, press1, sse1)
# Reproduces Table 9.2 on page 353 of the textbook
table2 <- k1[c(4, 3, 1, 2, 10, 7, 9, 5, 8, 6, 11, 14, 13, 12, 15), c(3, 2, 18, 4, 5, 7, 15, 16, 17)]
table2
## predictors n sse1 rsquare adjr cp aic1
## 2 X2 1 6817.5291 0.2470147 0.2142762 384.832454 144.20941
## 1 X1 1 6658.1453 0.2646184 0.2326452 375.344689 143.61801
## 3 X3 1 1768.0228 0.8047247 0.7962344 84.246496 110.46853
## 4 X4 1 2210.6887 0.7558329 0.7452170 110.597414 116.05459
## 5 X1 X2 2 4851.1799 0.4641948 0.4154853 269.780029 137.70254
## 7 X1 X4 2 1672.5853 0.8152656 0.7984716 80.565307 111.08125
## 9 X2 X4 2 1962.0716 0.7832923 0.7635916 97.797790 115.07201
## 6 X1 X3 2 606.6574 0.9329956 0.9269043 17.112978 85.72721
## 8 X2 X3 2 1755.8127 0.8060733 0.7884436 85.519650 112.29528
## 10 X3 X4 2 1111.3126 0.8772573 0.8660988 47.153985 100.86053
## 13 X1 X3 X4 3 348.1970 0.9615422 0.9560482 3.727399 73.84732
## 12 X1 X2 X4 3 1400.1275 0.8453581 0.8232664 66.346500 108.63607
## 14 X2 X3 X4 3 1095.8078 0.8789698 0.8616797 48.231020 102.50928
## 11 X1 X2 X3 3 596.7207 0.9340931 0.9246779 18.521465 87.31433
## 15 X1 X2 X3 X4 4 335.9775 0.9628918 0.9554702 5.000000 74.95421
## bic1 press1
## 2 146.64717 7991.0964
## 1 146.05576 7791.5994
## 3 112.90629 2064.5976
## 4 118.49234 2548.6349
## 5 141.35916 6444.0411
## 7 114.73788 2109.8967
## 9 118.72864 2491.7979
## 6 89.38384 760.9744
## 8 115.95191 2206.6460
## 10 104.51716 1449.6001
## 13 78.72282 471.4520
## 12 113.51157 1885.8454
## 14 107.38479 1570.5610
## 11 92.18984 831.1521
## 15 81.04859 518.9885
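As a cross-check on the aic1 conversion above (which shifts R's AIC to the textbook form AIC_p = n ln(SSE_p) - n ln(n) + 2p), the criterion can be computed directly from a single candidate fit. A sketch for the X1, X3, X4 subset (fit134 and p134 are names introduced here); the result should agree with the aic1 entry for that row of table2:
# Sketch: textbook AIC_p computed directly for the X1, X3, X4 subset
fit134 <- lm(Y ~ X1 + X3 + X4, data = ch9)
p134 <- length(coef(fit134))  # number of regression parameters, p = 4
n * log(sum(resid(fit134)^2)) - n * log(n) + 2 * p134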
# Another method (leaps::regsubsets) that reports adjusted R-squared, Cp, and BIC but not AIC or PRESS
best <- function(model, ...)
{
subsets <- regsubsets(formula(model), model.frame(model), ...)
subsets <- with(summary(subsets),
cbind(p = as.numeric(rownames(which)), which, adjr2, cp, bic))
return(subsets)
}
round(best(fitJP, nbest = 6), 4)
## p (Intercept) X1 X2 X3 X4 adjr2 cp bic
## 1 1 1 0 0 1 0 0.7962 84.2465 -34.3959
## 1 1 1 0 0 0 1 0.7452 110.5974 -28.8098
## 1 1 1 1 0 0 0 0.2326 375.3447 -1.2464
## 1 1 1 0 1 0 0 0.2143 384.8325 -0.6550
## 2 2 1 1 0 1 0 0.9269 17.1130 -57.9183
## 2 2 1 0 0 1 1 0.8661 47.1540 -42.7850
## 2 2 1 1 0 0 1 0.7985 80.5653 -32.5643
## 2 2 1 0 1 1 0 0.7884 85.5196 -31.3502
## 2 2 1 0 1 0 1 0.7636 97.7978 -28.5735
## 2 2 1 1 1 0 0 0.4155 269.7800 -5.9430
## 3 3 1 1 0 1 1 0.9560 3.7274 -68.5793
## 3 3 1 1 1 1 0 0.9247 18.5215 -55.1123
## 3 3 1 0 1 1 1 0.8617 48.2310 -39.9174
## 3 3 1 1 1 0 1 0.8233 66.3465 -33.7906
## 4 4 1 1 1 1 1 0.9555 5.0000 -66.2536
Forward selection also points to the model with X1, X3, and X4: the p-value-based procedure (ols_step_forward_p) enters X3, X1, and X4 and stops, and regsubsets with method = "forward" adds the predictors in that same order before X2, matching the best-subsets result above. Note that stepAIC below is given the full model as its starting point, so with direction = "forward" it has nothing to add and simply returns the full model (see the sketch after its summary output).
step.model <- stepAIC(fitJP, direction = "forward",
trace = FALSE)
summary(step.model)
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + X4, data = ch9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9779 -3.4506 0.0941 2.4749 5.9959
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -124.38182 9.94106 -12.512 6.48e-11 ***
## X1 0.29573 0.04397 6.725 1.52e-06 ***
## X2 0.04829 0.05662 0.853 0.40383
## X3 1.30601 0.16409 7.959 1.26e-07 ***
## X4 0.51982 0.13194 3.940 0.00081 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.099 on 20 degrees of freedom
## Multiple R-squared: 0.9629, Adjusted R-squared: 0.9555
## F-statistic: 129.7 on 4 and 20 DF, p-value: 5.262e-14
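Because stepAIC() above was started from the full model, direction = "forward" has nothing to add, which is why the summary still shows all four predictors. Forward selection by AIC should start from the intercept-only model with the full model as the upper scope. A minimal sketch (fit_null and step_fwd are names introduced here); it is expected to end at the X1, X3, X4 model:
# Sketch: forward selection by AIC, starting from the intercept-only model
fit_null <- lm(Y ~ 1, data = ch9)
step_fwd <- stepAIC(fit_null, scope = list(lower = ~ 1, upper = ~ X1 + X2 + X3 + X4),
                    direction = "forward", trace = FALSE)
formula(step_fwd)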
models <- regsubsets(Y~., data = ch9, nvmax = 5,
method = "forward")
summary(models)
## Subset selection object
## Call: regsubsets.formula(Y ~ ., data = ch9, nvmax = 5, method = "forward")
## 4 Variables (and intercept)
## Forced in Forced out
## X1 FALSE FALSE
## X2 FALSE FALSE
## X3 FALSE FALSE
## X4 FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: forward
## X1 X2 X3 X4
## 1 ( 1 ) " " " " "*" " "
## 2 ( 1 ) "*" " " "*" " "
## 3 ( 1 ) "*" " " "*" "*"
## 4 ( 1 ) "*" "*" "*" "*"
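The criteria for the best model of each size along this forward path can be pulled from the regsubsets summary. A small sketch (ms is a name introduced here):
# Sketch: adjusted R-squared, Cp, and BIC for each model size on the forward path
ms <- summary(models)
data.frame(size = 1:4, adjr2 = round(ms$adjr2, 4), cp = round(ms$cp, 2), bic = round(ms$bic, 2))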
ols_step_forward_p(fitJP, penter = 0.1)  # penter sets the p-value threshold for a variable to enter
## Forward Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. X1
## 2. X2
## 3. X3
## 4. X4
##
## We are selecting variables based on p value...
##
## Variables Entered:
##
## - X3
## - X1
## - X4
##
## No more variables to be added.
##
## Final Model Output
## ------------------
##
## Model Summary
## --------------------------------------------------------------
## R 0.981 RMSE 4.072
## R-Squared 0.962 Coef. Var 4.416
## Adj. R-Squared 0.956 MSE 16.581
## Pred R-Squared 0.948 MAE 3.096
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ---------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ---------------------------------------------------------------------
## Regression 8705.803 3 2901.934 175.018 0.0000
## Residual 348.197 21 16.581
## Total 9054.000 24
## ---------------------------------------------------------------------
##
## Parameter Estimates
## -----------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -----------------------------------------------------------------------------------------------
## (Intercept) -124.200 9.874 -12.578 0.000 -144.734 -103.666
## X3 1.357 0.152 0.619 8.937 0.000 1.041 1.673
## X1 0.296 0.044 0.310 6.784 0.000 0.205 0.387
## X4 0.517 0.131 0.284 3.948 0.001 0.245 0.790
## -----------------------------------------------------------------------------------------------
##
## Selection Summary
## -------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## -------------------------------------------------------------------------
## 1 X3 0.8047 0.7962 84.2465 183.4155 8.7676
## 2 X1 0.9330 0.9269 17.1130 158.6741 5.2512
## 3 X4 0.9615 0.9560 3.7274 146.7942 4.0720
## -------------------------------------------------------------------------
Backward stepwise AIC drops X2 in the first step (the AIC falls from 74.95 to 73.85) and then stops, so it also selects the model with X1, X3, and X4; the full four-variable model is the runner-up, with a slightly larger AIC. I would choose the model with fewer predictors, because adding X2 does not provide enough of a gain to be considered justifiable.
stepAIC(fitJP, direction="backward")
## Start: AIC=74.95
## Y ~ X1 + X2 + X3 + X4
##
## Df Sum of Sq RSS AIC
## - X2 1 12.22 348.20 73.847
## <none> 335.98 74.954
## - X4 1 260.74 596.72 87.314
## - X1 1 759.83 1095.81 102.509
## - X3 1 1064.15 1400.13 108.636
##
## Step: AIC=73.85
## Y ~ X1 + X3 + X4
##
## Df Sum of Sq RSS AIC
## <none> 348.20 73.847
## - X4 1 258.46 606.66 85.727
## - X1 1 763.12 1111.31 100.861
## - X3 1 1324.39 1672.59 111.081
##
## Call:
## lm(formula = Y ~ X1 + X3 + X4, data = ch9)
##
## Coefficients:
## (Intercept) X1 X3 X4
## -124.2000 0.2963 1.3570 0.5174
The SSE for our new (reduced) model is 348.197, versus 335.9775 for the original full model. Taken at face value the higher SSE would suggest the new model is worse, but SSE necessarily decreases whenever a predictor is added, so this comparison favors the larger model by construction.
The PRESS for our new model is 471.452, versus 518.9885 for the original model, so PRESS favors the reduced model.
This comparison suggests that SSE (and hence MSE) is not always a valid indicator of the predictive ability of a fitted model.
fitJP_new <- lm(Y ~ X1 + X3 + X4, data = ch9)
# SSE = MSE * (n - p): p = 4 for the reduced model, p = 5 for the full model
SSE_new <- sigma(fitJP_new)^2 * (n - 4)
SSE_new  # new (reduced) model SSE
## [1] 348.197
sigma(fitJP)^2 * (n - 5)  # original (full) model SSE
## [1] 335.9775
press_new <- press(fitJP_new)
press_new #New Model PRESS
## [1] 471.452
press(fitJP) #Original Model PRESS
## [1] 518.9885
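The press() values can also be cross-checked with the closed-form identity for linear models, \(PRESS = \sum_i (e_i/(1-h_{ii}))^2\), where the \(h_{ii}\) are the leverages. A sketch; each line should match the corresponding press() result above:
# Sketch: PRESS from ordinary residuals and leverages
sum((resid(fitJP_new) / (1 - hatvalues(fitJP_new)))^2)  # new (reduced) model
sum((resid(fitJP) / (1 - hatvalues(fitJP)))^2)          # original (full) model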
Yes. Based on the correlation matrices, the validation data set appears reasonably similar to the model-building data set: the pairwise correlations among the predictors, and between each predictor and the response, are close in the two sets.
ch922 <- read.table("CH09PR22.txt")
names(ch922) <- c("Y", "X1", "X2", "X3", "X4")
attach(ch922)
cor(ch922)
## Y X1 X2 X3 X4
## Y 1.0000000 0.53707787 0.34477442 0.8880519 0.8879388
## X1 0.5370779 1.00000000 0.01057088 0.1772891 0.3196395
## X2 0.3447744 0.01057088 1.00000000 0.3437441 0.2207638
## X3 0.8880519 0.17728907 0.34374413 1.0000000 0.8714466
## X4 0.8879388 0.31963945 0.22076377 0.8714466 1.0000000
cor(ch9)
## Y X1 X2 X3 X4
## Y 1.0000000 0.5144107 0.4970057 0.8970645 0.8693865
## X1 0.5144107 1.0000000 0.1022689 0.1807692 0.3266632
## X2 0.4970057 0.1022689 1.0000000 0.5190448 0.3967101
## X3 0.8970645 0.1807692 0.5190448 1.0000000 0.7820385
## X4 0.8693865 0.3266632 0.3967101 0.7820385 1.0000000
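A compact way to compare the two correlation structures is the elementwise difference of the matrices; values near zero indicate similar structure. A minimal sketch:
# Sketch: model-building minus validation correlations
round(cor(ch9) - cor(ch922), 3)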
The model fitted to the model-building data set is \(\hat{Y} = -124.20002 + 0.29633X_1 + 1.35697X_3 + 0.51742X_4\), with
s{b0} = 9.87406, s{b1} = 0.04368, s{b3} = 0.15183, s{b4} = 0.13105.
The mean squared residual (SSE/n) is 13.92788 and the adjusted R-squared is 0.956.
The same model fitted to the validation data set is \(\hat{Y} = -122.76705 + 0.31238X_1 + 1.40676X_3 + 0.42838X_4\), with
s{b0} = 11.84783, s{b1} = 0.04729, s{b3} = 0.23262, s{b4} = 0.19749.
The mean squared residual is 15.41814 and the adjusted R-squared is 0.9416.
The estimates from the validation data set appear reasonably similar to those from the model-building data set (a side-by-side comparison sketch follows the two summaries below).
summary(fitJP_new)
##
## Call:
## lm(formula = Y ~ X1 + X3 + X4, data = ch9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4579 -3.1563 -0.2057 1.8070 6.6083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -124.20002 9.87406 -12.578 3.04e-11 ***
## X1 0.29633 0.04368 6.784 1.04e-06 ***
## X3 1.35697 0.15183 8.937 1.33e-08 ***
## X4 0.51742 0.13105 3.948 0.000735 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.072 on 21 degrees of freedom
## Multiple R-squared: 0.9615, Adjusted R-squared: 0.956
## F-statistic: 175 on 3 and 21 DF, p-value: 5.16e-15
mean(summary(fitJP_new)$residuals^2)  # mean squared residual, SSE/n
## [1] 13.92788
fit_valid <- lm(Y~X1+X3+X4, data=ch922)
summary(fit_valid)
##
## Call:
## lm(formula = Y ~ X1 + X3 + X4, data = ch922)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4619 -2.3836 0.6834 2.1123 7.2394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -122.76705 11.84783 -10.362 1.04e-09 ***
## X1 0.31238 0.04729 6.605 1.54e-06 ***
## X3 1.40676 0.23262 6.048 5.31e-06 ***
## X4 0.42838 0.19749 2.169 0.0417 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.284 on 21 degrees of freedom
## Multiple R-squared: 0.9489, Adjusted R-squared: 0.9416
## F-statistic: 130 on 3 and 21 DF, p-value: 1.017e-13
mean(summary(fit_valid)$residuals^2)  # mean squared residual, SSE/n
## [1] 15.41814
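To make the comparison of the two fits easier to see, the coefficients and their standard errors can be placed side by side. A minimal sketch using the fits above:
# Sketch: side-by-side coefficients and standard errors, model-building vs. validation
round(cbind(build = coef(fitJP_new), valid = coef(fit_valid)), 4)
round(cbind(build_se = sqrt(diag(vcov(fitJP_new))),
            valid_se = sqrt(diag(vcov(fit_valid)))), 4)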
MSPR = 15.70972 for the validation data set, compared with the mean squared residual SSE/n = 13.92788 from the model-building data set (the conventional MSE = SSE/(n - p) is 16.58). As in 9.21, in-sample error measures can be optimistic, so MSE by itself may not be the best indicator of how well the fitted model will predict new observations.
fitJP_pred <- predict(fitJP_new, newdata = ch922)  # predictions for the validation cases
MSPR <-mean((ch922$Y-fitJP_pred)^2)
MSPR #MSPR
## [1] 15.70972
Combined model:
\(\hat{Y} = -123.44104 + 0.30364X_1 + 1.36906X_3 + 0.48735X_4\)
s{b0} = 7.16508, s{b1} = 0.03072, s{b3} = 0.12280, s{b4} = 0.10475
The estimated standard deviations of the regression coefficients are appreciably smaller than those obtained from the model-building data set alone (a side-by-side sketch follows the combined fit below).
total <- merge(ch9, ch922, by = c("Y", "X1", "X2", "X3", "X4"), all = TRUE)  # combine the model-building and validation sets (50 cases)
total
## Y X1 X2 X3 X4
## 1 57 111 102 83 72
## 2 58 65 109 88 84
## 3 58 120 77 80 74
## 4 61 81 129 88 76
## 5 64 87 81 90 88
## 6 66 63 101 93 84
## 7 66 67 98 98 84
## 8 67 74 121 91 85
## 9 67 105 102 88 83
## 10 71 84 113 98 78
## 11 71 93 73 91 82
## 12 73 78 85 95 84
## 13 74 110 114 91 78
## 14 75 91 111 96 84
## 15 76 101 117 93 95
## 16 77 95 57 95 85
## 17 78 104 73 93 80
## 18 80 62 97 99 100
## 19 80 100 101 95 88
## 20 81 129 70 94 95
## 21 82 120 94 95 90
## 22 83 91 129 97 83
## 23 88 86 110 100 87
## 24 92 85 90 104 98
## 25 92 99 110 96 97
## 26 92 102 139 101 92
## 27 94 140 121 96 89
## 28 95 111 113 101 91
## 29 95 115 119 102 94
## 30 96 110 107 103 103
## 31 97 93 95 106 98
## 32 98 128 99 98 89
## 33 99 98 125 108 95
## 34 99 99 113 104 95
## 35 99 120 89 105 97
## 36 100 78 125 115 102
## 37 100 104 83 100 102
## 38 100 116 103 103 103
## 39 104 109 120 104 106
## 40 104 112 119 106 105
## 41 109 96 114 114 103
## 42 109 109 129 102 108
## 43 109 136 104 106 104
## 44 111 99 132 109 105
## 45 111 106 102 109 109
## 46 115 94 121 115 104
## 47 116 105 122 116 102
## 48 117 128 134 108 98
## 49 126 133 120 113 108
## 50 127 150 118 107 110
fit_full <- lm(Y~X1+X3+X4, data=total)
summary(fit_full)
##
## Call:
## lm(formula = Y ~ X1 + X3 + X4, data = total)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7192 -2.7369 0.1278 2.0971 7.0657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -123.44104 7.16508 -17.228 < 2e-16 ***
## X1 0.30364 0.03072 9.886 5.86e-13 ***
## X3 1.36906 0.12280 11.148 1.15e-14 ***
## X4 0.48735 0.10475 4.652 2.79e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.006 on 46 degrees of freedom
## Multiple R-squared: 0.9567, Adjusted R-squared: 0.9539
## F-statistic: 338.9 on 3 and 46 DF, p-value: < 2.2e-16
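The reduction in the estimated standard errors noted above can be seen directly by lining them up against the model-building fit. A minimal sketch:
# Sketch: standard errors from the model-building fit vs. the combined fit
round(cbind(build_se = sqrt(diag(vcov(fitJP_new))),
            combined_se = sqrt(diag(vcov(fit_full)))), 4)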