In class today, we covered section 4.10 of the book, which is on partial F-tests. I have always been confused about what an F-test is, but after this class session and working on my R Guide, I feel like I understand it better. The basic idea of a partial F-test is to check whether a group of predictor variables can be dropped from a multiple linear regression all at once: you fit the model with and without them and test whether the extra predictors explain a significant amount of additional variation. Before I knew of the partial F-test, I would have just manually removed the predictor variables with high t-test p-values and been done with it, but now I have a cleaner, more mechanical way of doing this. First I run a regression using the full cast of predictors.
library(alr3)
## Warning: package 'alr3' was built under R version 3.3.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.3.3
attach(lakes)
summary(lakes)
##     Species         MaxDepth        MeanDepth           Cond
##  Min.   : 1.00   Min.   :  0.10   Min.   :  0.04   Min.   :   8.00
##  1st Qu.: 6.00   1st Qu.:  1.50   1st Qu.:  1.00   1st Qu.:  39.75
##  Median : 9.00   Median :  7.50   Median :  3.40   Median : 167.50
##  Mean   :10.55   Mean   : 45.45   Mean   : 18.39   Mean   : 258.09
##  3rd Qu.:14.00   3rd Qu.: 20.40   3rd Qu.: 10.00   3rd Qu.: 314.50
##  Max.   :32.00   Max.   :613.00   Max.   :325.00   Max.   :1600.00
##                                                    NA's   :19
##       Elev             Lat             Long             Dist
##  Min.   :  -1.0   Min.   :28.00   Min.   : -5.70   Min.   : 0.120
##  1st Qu.: 187.0   1st Qu.:40.00   1st Qu.: 83.60   1st Qu.: 0.250
##  Median : 295.0   Median :43.00   Median : 89.60   Median : 0.750
##  Mean   : 642.2   Mean   :45.17   Mean   : 96.15   Mean   : 1.341
##  3rd Qu.: 506.0   3rd Qu.:46.20   3rd Qu.:109.60   3rd Qu.: 1.750
##  Max.   :3433.0   Max.   :74.70   Max.   :156.70   Max.   :14.000
##
##      NLakes           Photo             Area
##  Min.   :   2.0   Min.   :   1.0   Min.   :      0
##  1st Qu.:  19.0   1st Qu.:  22.5   1st Qu.:      0
##  Median :  44.0   Median : 263.0   Median :      8
##  Mean   : 287.6   Mean   : 370.7   Mean   : 318925
##  3rd Qu.: 169.0   3rd Qu.: 617.5   3rd Qu.:    149
##  Max.   :8805.0   Max.   :1500.0   Max.   :8240000
##                   NA's   :22
This data set records the number of plankton species found in each lake, along with a set of physical and geographic measurements for that lake. My complete model, which uses every predictor except Cond and Photo, is the following:
Complete <- lm(Species ~ MaxDepth + MeanDepth + Elev + Lat + Long + Dist + NLakes + Area)
summary(Complete)
##
## Call:
## lm(formula = Species ~ MaxDepth + MeanDepth + Elev + Lat + Long +
## Dist + NLakes + Area)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.9343 -1.9385 -0.0553  2.2754 11.3651
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.561e+01  3.159e+00   4.941 6.57e-06 ***
## MaxDepth     3.754e-02  2.693e-02   1.394 0.168524
## MeanDepth   -3.331e-02  4.153e-02  -0.802 0.425700
## Elev        -1.727e-03  6.245e-04  -2.765 0.007555 **
## Lat         -3.086e-02  6.998e-02  -0.441 0.660816
## Long        -2.111e-02  2.254e-02  -0.937 0.352616
## Dist        -1.284e+00  3.547e-01  -3.620 0.000607 ***
## NLakes      -2.169e-03  1.533e-03  -1.414 0.162472
## Area         2.281e-06  6.226e-07   3.664 0.000527 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.361 on 60 degrees of freedom
## Multiple R-squared: 0.6157, Adjusted R-squared: 0.5644
## F-statistic: 12.01 on 8 and 60 DF, p-value: 4.709e-10
As you can see, several predictors have high t-test p-values, which suggests they add little once the other predictors are in the model. Let’s fit a reduced model without the two weakest ones, Lat and MeanDepth, the only predictors with p-values greater than 0.4. We will then compare the two models with anova(), which carries out the partial F-test.
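The 0.4 cutoff can also be applied mechanically rather than by eye. This is only a small sketch: the p-values are typed in from the coefficient table of the complete model, and the names `p_values` and `drop_candidates` are my own, not something from the book.

```r
# t-test p-values copied from summary(Complete); with the fitted model in
# memory, coef(summary(Complete))[, "Pr(>|t|)"] would give the same numbers
# (plus the intercept) without retyping them.
p_values <- c(MaxDepth = 0.168524, MeanDepth = 0.425700, Elev = 0.007555,
              Lat = 0.660816, Long = 0.352616, Dist = 0.000607,
              NLakes = 0.162472, Area = 0.000527)

# Predictors whose individual p-value exceeds the (arbitrary) 0.4 cutoff
drop_candidates <- names(p_values)[p_values > 0.4]
drop_candidates
```

Each of these t-tests looks at one predictor at a time, so this only picks the candidates. Whether the candidates can be dropped together is what the partial F-test on the reduced model decides.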
Reduced <- lm(Species ~ MaxDepth + Elev + Long + Dist + NLakes + Area)
summary(Reduced)
##
## Call:
## lm(formula = Species ~ MaxDepth + Elev + Long + Dist + NLakes +
## Area)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.8134 -1.9495 -0.3427  2.2582 11.5189
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.486e+01  1.985e+00   7.486 3.09e-10 ***
## MaxDepth     1.673e-02  7.537e-03   2.220 0.030079 *
## Elev        -1.660e-03  5.893e-04  -2.816 0.006510 **
## Long        -3.035e-02  1.935e-02  -1.568 0.121964
## Dist        -1.166e+00  3.289e-01  -3.544 0.000756 ***
## NLakes      -1.155e-03  6.987e-04  -1.653 0.103430
## Area         2.545e-06  5.379e-07   4.732 1.33e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.322 on 62 degrees of freedom
## Multiple R-squared: 0.6099, Adjusted R-squared: 0.5722
## F-statistic: 16.16 on 6 and 62 DF, p-value: 4.326e-11
anova(Complete, Reduced)
## Analysis of Variance Table
##
## Model 1: Species ~ MaxDepth + MeanDepth + Elev + Lat + Long + Dist + NLakes +
##     Area
## Model 2: Species ~ MaxDepth + Elev + Long + Dist + NLakes + Area
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     60 1141.1
## 2     62 1158.2 -2   -17.013 0.4473 0.6415
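To see where the F value in this table comes from, it can be rebuilt by hand. The statistic is the extra residual sum of squares per dropped predictor, divided by the residual mean square of the complete model: F = (Sum of Sq / number dropped) / (RSS of complete model / its residual df). The sketch below just plugs in the rounded numbers from the table. The variable names are mine, and the signs in the table are negative only because the complete model was listed first.

```r
# Numbers read off the anova() table (rounded as printed)
rss_complete <- 1141.1   # residual sum of squares, complete model
df_complete  <- 60       # residual degrees of freedom, complete model
extra_ss     <- 17.013   # increase in RSS after dropping Lat and MeanDepth
n_dropped    <- 2        # number of predictors removed

# Partial F statistic: extra SS per dropped term over the complete model's MSE
f_stat <- (extra_ss / n_dropped) / (rss_complete / df_complete)

# Upper-tail probability from an F distribution with (2, 60) degrees of freedom
p_value <- pf(f_stat, df1 = n_dropped, df2 = df_complete, lower.tail = FALSE)

round(c(F = f_stat, p = p_value), 4)
```

Up to rounding, this reproduces the F and Pr(>F) columns of the table, so anova() is not doing anything more mysterious than this ratio.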
We received a large p-value (0.6415), so we fail to reject the hypothesis that the coefficients of Lat and MeanDepth are both zero, which means we are able to drop them. This is reaffirmed by the R-squared values for the two regressions. The complete model had an R-squared of 0.6157 and the reduced model had an R-squared of 0.6099. We lost less than one percentage point of explained variability when we dropped two whole variables. Notice how the adjusted R-squared actually rose, from 0.5644 to 0.5722, because we are no longer penalized for the dead-weight predictors.
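The change in adjusted R-squared can be checked by hand as well. This is a minimal sketch, assuming the usual formula 1 - (1 - R-squared) * (n - 1) / (residual df) and n = 69 lakes, which I am inferring from the 60 residual degrees of freedom plus the 9 estimated coefficients in the complete model. The helper `adj_r2` is my own name.

```r
# Sample size inferred from the complete model: residual df + coefficients
n <- 60 + 9

# Adjusted R-squared from plain R-squared and the residual degrees of freedom
adj_r2 <- function(r2, df_resid, n) {
  1 - (1 - r2) * (n - 1) / df_resid
}

adj_complete <- adj_r2(0.6157, df_resid = 60, n = n)  # 8 predictors
adj_reduced  <- adj_r2(0.6099, df_resid = 62, n = n)  # 6 predictors

round(c(complete = adj_complete, reduced = adj_reduced), 3)
```

The results agree with the summary() output to about three decimals. The small differences come from starting with R-squared values that were already rounded. The reduced model comes out ahead because the gain of two residual degrees of freedom outweighs the tiny loss in R-squared.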