This learning log discusses the partial F-test for a multiple linear regression model. We will use the water dataset to show some examples.
Recall from 3.8 that an F-test is very similar to a t-test, in that you are testing the significance of a predictor variable in the model with the following hypotheses:
\[ H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0 \]
Again, for SLR, if \(\beta_1 = 0\), that would imply that no matter what predictor value we chose, the response variable would not change, making the predictor insignificant. If we extend this thinking to MLR, we can think of the partial F-test as testing whether a set of predictor variables is significant in predicting our response variable.
For example, say our model has coefficients \(\beta_0, \beta_1, \beta_2, \dots, \beta_g, \beta_{g+1}, \beta_{g+2}, \dots, \beta_k\). The partial F-test gives the following hypotheses:
\[ H_0: \beta_{g+1} = \beta_{g+2} = \dots = \beta_k = 0 \]
This says that the whole set of predictors is insignificant.
\[ H_a: \text{at least one of } \beta_{g+1}, \beta_{g+2}, \dots, \beta_k \neq 0 \]
This says that at least one predictor in the set is significant.
These hypotheses correspond to two different linear models: a complete model with all k predictors and a reduced model with only the first g predictors (both keep the intercept, for k+1 and g+1 coefficients respectively).
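Written out, the two models are:
\[ \text{Complete: } Y = \beta_0 + \beta_1 x_1 + \dots + \beta_g x_g + \beta_{g+1} x_{g+1} + \dots + \beta_k x_k + \epsilon \]
\[ \text{Reduced: } Y = \beta_0 + \beta_1 x_1 + \dots + \beta_g x_g + \epsilon \]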
To test the significance of the group of k-g predictors, we can apply a formula much like the one from SLR. Recall that we defined explained variation as the variation in Y that could be predicted by the model:
\[ F = \frac{\text{Explained Variation}}{\text{Unexplained Variation}/(n-2)} \]
Instead of calculating explained variation in the numerator again, the partial F-test calculates the drop in unexplained variation from removing the k-g predictors, via \(SSE_r - SSE_c\). The reduced model can never explain more variation than the complete one, so \(SSE_r\) is always at least \(SSE_c\); a large drop means the k-g predictors carried a good deal of explained variation and thus decreased the unexplained variation in the complete model, \(SSE_c\). This gives the new F statistic, adjusted for degrees of freedom:
\[ F = \frac{(SSE_r - SSE_c)/(k-g)}{SSE_c/(n-k-1)} \]
When \(SSE_r - SSE_c\) is large, the k-g predictors hold a large proportion of the explained variation and at least one of them is significant. Conversely, when the difference is small, the k-g predictors add little explained variation and we have no evidence that any of them are significant.
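As a minimal sketch, this formula translates directly into R (the function name and arguments here are illustrative, not from any package):
partial_f <- function(sse_r, sse_c, n, k, g) {
  # Numerator: drop in unexplained variation per removed predictor
  # Denominator: the complete model's mean squared error
  f_stat <- ((sse_r - sse_c) / (k - g)) / (sse_c / (n - k - 1))
  # P-value from the F distribution with (k-g) and (n-k-1) degrees of freedom
  p_val <- pf(f_stat, df1 = k - g, df2 = n - k - 1, lower.tail = FALSE)
  c(F = f_stat, p.value = p_val)
}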
Now let’s show an example with the Water data.
library("alr3")
## Loading required package: car
data(water)
head(water)
## Year APMAM APSAB APSLAKE OPBPC OPRC OPSLAKE BSAAM
## 1 1948 9.13 3.58 3.91 4.10 7.43 6.47 54235
## 2 1949 5.28 4.82 5.20 7.55 11.11 10.26 67567
## 3 1950 4.20 3.77 3.67 9.52 12.20 11.35 66161
## 4 1951 4.60 4.46 3.93 11.14 15.15 11.13 68094
## 5 1952 7.15 4.99 4.88 16.34 20.05 22.81 107080
## 6 1953 9.70 5.65 4.91 8.88 8.15 7.41 67594
summary(water)
## Year APMAM APSAB APSLAKE
## Min. :1948 Min. : 2.700 Min. : 1.450 Min. : 1.77
## 1st Qu.:1958 1st Qu.: 4.975 1st Qu.: 3.390 1st Qu.: 3.36
## Median :1969 Median : 7.080 Median : 4.460 Median : 4.62
## Mean :1969 Mean : 7.323 Mean : 4.652 Mean : 4.93
## 3rd Qu.:1980 3rd Qu.: 9.115 3rd Qu.: 5.685 3rd Qu.: 5.83
## Max. :1990 Max. :18.080 Max. :11.960 Max. :13.02
## OPBPC OPRC OPSLAKE BSAAM
## Min. : 4.050 Min. : 4.350 Min. : 4.600 Min. : 41785
## 1st Qu.: 7.975 1st Qu.: 7.875 1st Qu.: 8.705 1st Qu.: 59857
## Median : 9.550 Median :11.110 Median :12.140 Median : 69177
## Mean :12.836 Mean :12.002 Mean :13.522 Mean : 77756
## 3rd Qu.:16.545 3rd Qu.:14.975 3rd Qu.:16.920 3rd Qu.: 92206
## Max. :43.370 Max. :24.850 Max. :33.070 Max. :146345
This data shows the snowfall at six sites (APMAM through OPSLAKE) and the corresponding stream runoff volume at a site called BSAAM. We'll start with our complete model using all six predictors.
attach(water)  # put the water columns on the search path so we can refer to them by name
Water.Modc <- lm(BSAAM~APMAM+APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE)
summary(Water.Modc)
##
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC +
## OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12690 -4936 -1424 4173 18542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15944.67 4099.80 3.889 0.000416 ***
## APMAM -12.77 708.89 -0.018 0.985725
## APSAB -664.41 1522.89 -0.436 0.665237
## APSLAKE 2270.68 1341.29 1.693 0.099112 .
## OPBPC 69.70 461.69 0.151 0.880839
## OPRC 1916.45 641.36 2.988 0.005031 **
## OPSLAKE 2211.58 752.69 2.938 0.005729 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9123
## F-statistic: 73.82 on 6 and 36 DF, p-value: < 2.2e-16
Now we see that some predictors are relatively insignificant, with very large p-values. Let's create our reduced model with only the intercept, APSLAKE, OPRC, and OPSLAKE.
Water.Modr <- lm(BSAAM~APSLAKE+OPRC+OPSLAKE)
summary(Water.Modr)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12964 -5140 -1252 4446 18649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15424.6 3638.4 4.239 0.000133 ***
## APSLAKE 1712.5 500.5 3.421 0.001475 **
## OPRC 1797.5 567.8 3.166 0.002998 **
## OPSLAKE 2389.8 447.1 5.346 4.19e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9185
## F-statistic: 158.9 on 3 and 39 DF, p-value: < 2.2e-16
Nice, all of those predictors look significant. Now we'll run our partial F-test using the anova command to see whether APMAM, APSAB, and OPBPC are significant as a group. The order of the two models in the call doesn't matter: reversing it only flips the signs of the Df and Sum of Sq columns, not the F statistic or p-value.
anova(Water.Modc,Water.Modr)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## Model 2: BSAAM ~ APSLAKE + OPRC + OPSLAKE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 36 2055830733
## 2 39 2068947585 -3 -13116852 0.0766 0.9722
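As a sanity check, here is a minimal by-hand version of the same test, assuming nothing beyond the two models we already fit; deviance() returns a fitted lm model's residual sum of squares, i.e. the RSS column above.
SSEc <- deviance(Water.Modc)                       # 2055830733
SSEr <- deviance(Water.Modr)                       # 2068947585
Fstat <- ((SSEr - SSEc) / 3) / (SSEc / 36)         # (k-g) = 3, (n-k-1) = 36
Fstat                                              # 0.0766
pf(Fstat, df1 = 3, df2 = 36, lower.tail = FALSE)   # 0.9722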
So, looking at our test output, we get an F statistic of 0.0766 and a p-value of 0.9722. Even comparing the RSS values (the SSE values), we see that they are very close. We fail to reject \(H_0\): once APSLAKE, OPRC, and OPSLAKE are in the model, APMAM, APSAB, and OPBPC add essentially no explanatory power. The person who collected that data should feel bad.