In class today, we covered section 4.10 of the book, which is on partial F-tests. I have always been confused about what an F-test is. Through this class session and working on my R Guide, I now feel like I have a better understanding of it. The basic idea of a partial F-test is to check whether a group of predictor variables can be removed from a multiple linear regression model all at once without significantly hurting the fit. Before I knew of the partial F-test, I would have just manually removed the predictor variables with high t-test p-values and been done with it, but now I have a cleaner, more mechanical way of checking that decision. First, I run a regression using the full cast of predictors.

library(alr3)
## Warning: package 'alr3' was built under R version 3.3.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.3.3
attach(lakes)
summary(lakes)
##     Species         MaxDepth        MeanDepth           Cond        
##  Min.   : 1.00   Min.   :  0.10   Min.   :  0.04   Min.   :   8.00  
##  1st Qu.: 6.00   1st Qu.:  1.50   1st Qu.:  1.00   1st Qu.:  39.75  
##  Median : 9.00   Median :  7.50   Median :  3.40   Median : 167.50  
##  Mean   :10.55   Mean   : 45.45   Mean   : 18.39   Mean   : 258.09  
##  3rd Qu.:14.00   3rd Qu.: 20.40   3rd Qu.: 10.00   3rd Qu.: 314.50  
##  Max.   :32.00   Max.   :613.00   Max.   :325.00   Max.   :1600.00  
##                                                    NA's   :19       
##       Elev             Lat             Long             Dist       
##  Min.   :  -1.0   Min.   :28.00   Min.   : -5.70   Min.   : 0.120  
##  1st Qu.: 187.0   1st Qu.:40.00   1st Qu.: 83.60   1st Qu.: 0.250  
##  Median : 295.0   Median :43.00   Median : 89.60   Median : 0.750  
##  Mean   : 642.2   Mean   :45.17   Mean   : 96.15   Mean   : 1.341  
##  3rd Qu.: 506.0   3rd Qu.:46.20   3rd Qu.:109.60   3rd Qu.: 1.750  
##  Max.   :3433.0   Max.   :74.70   Max.   :156.70   Max.   :14.000  
##                                                                    
##      NLakes           Photo             Area        
##  Min.   :   2.0   Min.   :   1.0   Min.   :      0  
##  1st Qu.:  19.0   1st Qu.:  22.5   1st Qu.:      0  
##  Median :  44.0   Median : 263.0   Median :      8  
##  Mean   : 287.6   Mean   : 370.7   Mean   : 318925  
##  3rd Qu.: 169.0   3rd Qu.: 617.5   3rd Qu.:    149  
##  Max.   :8805.0   Max.   :1500.0   Max.   :8240000  
##                   NA's   :22

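This data set records the number of plankton species in each lake, along with several characteristics of the lake such as depth, elevation, location, and area. My complete model uses eight of these as predictors, leaving out Cond and Photo, the two columns with missing values: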

Complete <- lm(Species ~ MaxDepth + MeanDepth + Elev + Lat + Long + Dist + NLakes + Area)
summary(Complete)
## 
## Call:
## lm(formula = Species ~ MaxDepth + MeanDepth + Elev + Lat + Long + 
##     Dist + NLakes + Area)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9343 -1.9385 -0.0553  2.2754 11.3651 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.561e+01  3.159e+00   4.941 6.57e-06 ***
## MaxDepth     3.754e-02  2.693e-02   1.394 0.168524    
## MeanDepth   -3.331e-02  4.153e-02  -0.802 0.425700    
## Elev        -1.727e-03  6.245e-04  -2.765 0.007555 ** 
## Lat         -3.086e-02  6.998e-02  -0.441 0.660816    
## Long        -2.111e-02  2.254e-02  -0.937 0.352616    
## Dist        -1.284e+00  3.547e-01  -3.620 0.000607 ***
## NLakes      -2.169e-03  1.533e-03  -1.414 0.162472    
## Area         2.281e-06  6.226e-07   3.664 0.000527 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.361 on 60 degrees of freedom
## Multiple R-squared:  0.6157, Adjusted R-squared:  0.5644 
## F-statistic: 12.01 on 8 and 60 DF,  p-value: 4.709e-10

As you can see, some predictors are not so great, based on their high p-values. Let's run a model with only the predictors that seem to be the best. I removed the two predictors with p-values greater than 0.4, Lat and MeanDepth. I then compare the two models with anova(), which carries out the partial F-test.

Reduced <- lm(Species ~ MaxDepth + Elev + Long + Dist + NLakes + Area)
summary(Reduced)
## 
## Call:
## lm(formula = Species ~ MaxDepth + Elev + Long + Dist + NLakes + 
##     Area)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.8134 -1.9495 -0.3427  2.2582 11.5189 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.486e+01  1.985e+00   7.486 3.09e-10 ***
## MaxDepth     1.673e-02  7.537e-03   2.220 0.030079 *  
## Elev        -1.660e-03  5.893e-04  -2.816 0.006510 ** 
## Long        -3.035e-02  1.935e-02  -1.568 0.121964    
## Dist        -1.166e+00  3.289e-01  -3.544 0.000756 ***
## NLakes      -1.155e-03  6.987e-04  -1.653 0.103430    
## Area         2.545e-06  5.379e-07   4.732 1.33e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.322 on 62 degrees of freedom
## Multiple R-squared:  0.6099, Adjusted R-squared:  0.5722 
## F-statistic: 16.16 on 6 and 62 DF,  p-value: 4.326e-11
anova(Complete, Reduced)
## Analysis of Variance Table
## 
## Model 1: Species ~ MaxDepth + MeanDepth + Elev + Lat + Long + Dist + NLakes + 
##     Area
## Model 2: Species ~ MaxDepth + Elev + Long + Dist + NLakes + Area
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     60 1141.1                           
## 2     62 1158.2 -2   -17.013 0.4473 0.6415
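
The F value in the table above can be reproduced by hand from the two residual sums of squares, which helped me see what the test is actually measuring. This is only a sketch of the arithmetic: the subscripts "complete" and "reduced" are my own labels for the two models, and the numbers are the rounded values printed in the table.

```latex
F = \frac{(RSS_{\text{reduced}} - RSS_{\text{complete}}) / (df_{\text{reduced}} - df_{\text{complete}})}
         {RSS_{\text{complete}} / df_{\text{complete}}}
  = \frac{17.013 / 2}{1141.1 / 60}
  \approx \frac{8.507}{19.018}
  \approx 0.447
```

The numerator is the extra unexplained variation per dropped predictor, and the denominator is the complete model's estimate of the error variance. A ratio well below 1, like this one, says the dropped predictors were explaining less than we would expect from noise alone. The p-value comes from comparing this ratio to an F distribution with 2 and 60 degrees of freedom.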

We received a large p-value (0.6415), which means we fail to reject the hypothesis that the coefficients of Lat and MeanDepth are both zero, so we are able to drop them. This is reaffirmed by the R-squared values for the two regressions. The complete model had an R-squared of 0.6157 and the reduced model had an R-squared of 0.6099. We lost less than one percentage point of explained variability when we dropped two whole variables. Notice how the adjusted R-squared actually rose, from 0.5644 to 0.5722, because we are no longer as heavily penalized for the dead-weight predictors.
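
To see where the adjusted R-squared penalty comes from, here is a rough check using the usual formula. I am assuming n = 69 lakes, which I inferred from the 60 residual degrees of freedom plus the 9 estimated coefficients in the complete model, and p is the number of predictors. Because I am plugging in the rounded R-squared values from the summaries, the results only match the printed output to about three decimal places.

```latex
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}

\text{Complete } (p = 8): \quad 1 - (1 - 0.6157)\,\frac{68}{60} \approx 0.564

\text{Reduced } (p = 6): \quad 1 - (1 - 0.6099)\,\frac{68}{62} \approx 0.572
```

The multiplier shrinks from 68/60 to 68/62 when two predictors are dropped, and that more than makes up for the small loss in R-squared.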