Lecture 19: (More) Hypothesis Testing II

Two ways of testing hypotheses

In this example, we are going to explore two ways of performing hypothesis tests on whole models. The first is a permutation test, where the connection between the explanatory variables and the response is severed. This is a more ‘modern’ approach which takes advantage of modern computing power to generate a large number of permutations to simulate our null hypothesis. The second approach is based on ANOVA, where the sampling distribution can be estimated without extensive calculations (simulations) from ‘first principals’. This type of test is used most often, and will likely be something you refer to often in your final projects. With that in mind, we will start by exploring the permutation test, which tends to provide a more intuitive perspective on hypothesis testing. Additionally, the permutation test has the benefit of requiring fewer assumptions, which makes it a more suitable test in some cases.

The permutation test

The idea of the permutation test is to enforce the null hypothesis that there is no connection between the response variable and the explanatory variables. This is a very common ‘generic’ null hypothesis, and one that proves useful in many cases. You can imagine many situations where the thing you are interested in testing is whether Y is a function of X, so a natural null hypothesis is that Y is not a function of X (i.e., there is no relationship between Y and X).

An effective way to do this is to randomize the response variable in a way that is consistent with sampling variability. When construction confidence intervals, the resample() function was used. Re-sampling will typically repeat some cases and omit others. Here, the shuffle() function will be used instead, to scramble the order of one or more variables, while leaving the others in their original state. This also means that no cases are repeated or omitted, which is useful when testing our null hypothesis.

To illustrate, consider a model for exploring whether sex and mother’s height are related to the height of a child:

data("Galton")
mod = lm(height ~ sex + mother, data=Galton)
coef(mod)

## (Intercept)        sexM      mother 
##  41.4495235   5.1766949   0.3531371

The coefficients indicate that typical males are taller than typical females by 5.177 inches, and that for each inch taller the mother is, a child will typically be taller by 0.353 inches. A reasonable test statistic to summarize the whole model is \(R^2\). The summary report show that the \(R^2\) value is 0.5618019:

summary(mod)

## 
## Call:
## lm(formula = height ~ sex + mother, data = Galton)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4036 -1.6024  0.1528  1.5890  9.4199 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 41.44952    2.20949   18.76   <2e-16 ***
## sexM         5.17669    0.15867   32.62   <2e-16 ***
## mother       0.35314    0.03439   10.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.374 on 895 degrees of freedom
## Multiple R-squared:  0.5618, Adjusted R-squared:  0.5608 
## F-statistic: 573.7 on 2 and 895 DF,  p-value: < 2.2e-16

For confidence intervals, re-sampling was applied to the entire data.frame. This select random cases, but each select case is an authentic one that matches exactly the original values for that case (i.e., the rows are randomized, but the entire row is used). The whole point of re-sampling is to get an idea of the variability introduced by random sampling of authentic cases.

resamp = do(5) * lm(height ~ sex + mother, data=resample(Galton))
resamp

##   Intercept     sexM    mother    sigma r.squared        F numdf dendf
## 1  36.13204 5.012872 0.4374663 2.313298 0.5725514 599.4096     2   895
## 2  38.27235 5.345589 0.4011061 2.380750 0.5789896 615.4191     2   895
## 3  41.95792 4.925837 0.3466026 2.422901 0.5331324 511.0159     2   895
## 4  40.01667 5.080601 0.3773315 2.335610 0.5663339 584.3998     2   895
## 5  40.46442 5.172233 0.3681553 2.426540 0.5516976 550.7101     2   895

The sexM coefficients are tightly grouped near 5.1, the mother coefficients are around 0.3 to 0.4.

In order to carry out a permutation test, we do not want to randomize the whole data frame; instead, we just want to shuffle the response variable, to artificially reduce the relationship between the response and the explanatory variables:

shuff = do(5) * lm(shuffle(height) ~ sex + mother, data=Galton)
shuff

##   Intercept         sexM      mother    sigma    r.squared         F numdf
## 1  63.73278  0.107433105  0.04638063 3.584974 0.0010847202 0.4859394     2
## 2  61.63798 -0.338846464  0.08267482 3.577494 0.0052488067 2.3612347     2
## 3  68.25893  0.003640861 -0.02340854 3.586511 0.0002279886 0.1020481     2
## 4  68.38039 -0.378831271 -0.02221342 3.581634 0.0029451273 1.3218374     2
## 5  60.93850 -0.130830492  0.09190901 3.579896 0.0039124961 1.7577191     2
##   dendf
## 1   895
## 2   895
## 3   895
## 4   895
## 5   895

Now the sexM and mother coefficients are close to 0, as would be expected when there is no relationship between the response variable and the explanatory variables.

In constructing the sampling distribution under the null hypothesis, you should really do hundreds of trials of fitting the model to the scrambled data, calculating the test statistic (\(R^2\) here) for each trial. Note that each trial itself involves all of the cases in our sample, but those cases have been changed so that the shuffled variable almost certainly takes on a different value from the original data in every case.

nulltrials = do(500) * lm(shuffle(height) ~ sex + mother, data=Galton)
head(nulltrials)

##   Intercept       sexM       mother    sigma    r.squared         F numdf
## 1  65.98977 -0.3027791  0.014476277 3.583510 0.0018999869 0.8518627     2
## 2  64.22053 -0.2929027  0.042004450 3.582468 0.0024807520 1.1128973     2
## 3  64.19063  0.4679327  0.036323287 3.578477 0.0047017176 2.1139579     2
## 4  66.70581 -0.1950424  0.002432311 3.585581 0.0007462129 0.3341797     2
## 5  68.27336 -0.1923761 -0.022049840 3.585314 0.0008950381 0.4008883     2
## 6  70.24563  0.1148042 -0.055308083 3.584111 0.0015655332 0.7016746     2
##   dendf
## 1   895
## 2   895
## 3   895
## 4   895
## 5   895
## 6   895

Notice that do() calculates \(R^2\) from the model, and the output from the above is a data.frame. Naturally, all of the \(R^2\) values for the trials are close to NA; after all, there is no relation between the response (after randomization with shuffle()) and the explanatory variables, just as our null hypothesis would suggest.

The p-value can be calculated directly from the trials, by comparison to the observed test statistic for the actual data: \(R^2\) was 0.5618019.

counts = tally(~ r.squared > summary(mod)$r.squared, data=nulltrials)
counts

## 
##  TRUE FALSE 
##     0   500

As we can see, 0 out of 500 trials were greater than the values of the test statistic, 0.5618019. It wouldn’t be fair to claim that \(p = 0\), since we only did 500 trials, but it is reasonable to say that the permutation test shows the p-value is \(p <\) 1/500.

First-principals test

On modern computers, the permutation test is entirely practical. But a few decades ago, it was not! Great creativity was applied to finding test statistics where the sampling distribution would be estimated without extensive calculations. One of these is the F statistic. This is still very useful today and is a standard part of the regression report in many (most?) statistical packages.

Here is the regression report from the height ~ sex + mother model:

mod = lm(height ~ sex + mother, data=Galton)
s = summary(mod)
s

## 
## Call:
## lm(formula = height ~ sex + mother, data = Galton)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4036 -1.6024  0.1528  1.5890  9.4199 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 41.44952    2.20949   18.76   <2e-16 ***
## sexM         5.17669    0.15867   32.62   <2e-16 ***
## mother       0.35314    0.03439   10.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.374 on 895 degrees of freedom
## Multiple R-squared:  0.5618, Adjusted R-squared:  0.5608 
## F-statistic: 573.7 on 2 and 895 DF,  p-value: < 2.2e-16

The last line of the report shows an F statistic of 573.7275799, 2, 895 based on an \(R^2\) of 0.5618019 and translates this into a p-value that is practically zero: \(p\) < 2.22e-16

By way of showing that the regression report is rooted in the same approach we have covered in class, we can confirm the calculations. There are \(m=\) 3 coefficients and \(n=\) 898 cases, producing \(n-m=\) 895 degrees of freedom in the denominator and \(m-1=\) 2 degrees of freedom in the numerator. The calculation of the F statistic from \(R^2\) and the degrees of freedom follows the formula given in class:

\[F = \left.\frac{R^2}{m-1} \middle/ \frac{1-R^2}{n-m}\right.\]

Plugging the values into the formula produces:

(0.562/2) / ((1-0.562)/895)

## [1] 574.1895

F is the test statistic. To convert it to a p-value, we need to calculate how extreme the value of F= 573.7275799 is with reference to the F distribution with 895 and 2 degrees of freedom:

1-pf(574.2, 2, 895)

## [1] 0

The calculation of p-values from F allows follows this form. In the context of the F distribution, “extreme” always means “bigger than”. So, calculate the area under the F distribution to the right of the observed value (i.e., subtract from 1). We can also compute this directly using the following command:

pf(574.2, 2, 895, lower.tail = FALSE)

## [1] 3.611326e-161

or for a cleaner output:

format.pval(pf(574.2, 2, 895, lower.tail = FALSE))

## [1] "< 2.22e-16"

Regression reports and analysis of variance

How does the F statistic (like \(R^2\)) compare explained versus unexplained variation? By taking apart the variation and partitioning it into a part associated with the model (explained) plus a part associated with the residuals (unexplained). With linear models, this is called Analysis of variance, and the ANOVA report is now we present this ‘analysis’.

Here is a very simple model with one quantitative explanatory variable:

data("KidsFeet")
mod = lm(width ~ length, data=KidsFeet)
summary(mod)

## 
## Call:
## lm(formula = width ~ length, data = KidsFeet)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83864 -0.31056 -0.00892  0.27622  0.76300 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.8623     1.2081   2.369   0.0232 *  
## length        0.2480     0.0488   5.081  1.1e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3963 on 37 degrees of freedom
## Multiple R-squared:  0.411,  Adjusted R-squared:  0.3951 
## F-statistic: 25.82 on 1 and 37 DF,  p-value: 1.097e-05

anova(mod)

## Analysis of Variance Table
## 
## Response: width
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## length     1 4.0557  4.0557  25.819 1.097e-05 ***
## Residuals 37 5.8120  0.1571                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note the similarities in the summary and ANOVA reports. Let’s work through the ANOVA calculation by comparing the simple calculation involving \(R^2\), m and n to get F from above, with the the \(R^2\) at the end of the regression report, and the the p-value in the ANOVA report.

First, recall that F is equal to \(t^2\). Second, recall that the parameter (for the ‘t’ distribution) is the number of degrees of freedom in the residual, which is simply n (number of cases) minus m (the number of coefficients,or, equivalently, the number of model vectors, including the intercept). From here, the translation from t-values to p-values is based on the t-distribution.

Turns out that for large df of residuals, the t distribution is effectively the same as the normal distribution:

foo = rt(10000, df=100)
histogram(~foo, breaks=50, type="density", fit="normal")

For small df, it has fatter tails than the normal

foo = rt(10000, df=4)
histogram(~foo, breaks=50, type="density", fit="normal") # not a match

The normal doesn’t put enough density at the center and puts way too little at the tails.

The reason for the long tails is that, when there are few degrees of freedom in the residual, we don’t have a very good idea of the length of the residual vector.

Analysis of Variance (ANOVA)

We sometimes need to deal with complicated models. We had one strategy for interpreting such models — partial derivatives and partial change. But we also need a strategy for deciding when evidence for including a variable or a term warrants doing so. Looking at coefficients is very difficult. Looking at \(R^2\) hides how the different terms contribute. Looking at the change in \(R^2\) involves constructing lots and lots of models, which isn’t itself a problem but is a hassle (later, we’ll see that it does hide something, what we’ll come to call “eating the variance”).

To illustrate, here’s a complicated model of wages. Do we really need all these terms:

data("CPS85")
mod = lm( wage ~ sector*sex + exper*sex*educ + married*union*sector, data=CPS85)

The ANOVA report gives us a p-value for each model term, making it easy to spot what’s helping and what’s not.

anova(mod)

## Analysis of Variance Table
## 
## Response: wage
##                       Df Sum Sq Mean Sq F value    Pr(>F)    
## sector                 7 2571.6  367.37 20.3402 < 2.2e-16 ***
## sex                    1  436.0  435.98 24.1388 1.222e-06 ***
## exper                  1  264.5  264.48 14.6434 0.0001466 ***
## educ                   1  982.1  982.06 54.3741 7.080e-13 ***
## married                1   18.2   18.19  1.0071 0.3160994    
## union                  1  171.6  171.61  9.5016 0.0021681 ** 
## sector:sex             6  115.5   19.24  1.0654 0.3822797    
## sex:exper              1   77.1   77.09  4.2682 0.0393548 *  
## exper:educ             1   43.8   43.84  2.4275 0.1198621    
## sex:educ               1    1.2    1.23  0.0679 0.7945023    
## married:union          1    4.1    4.11  0.2277 0.6334094    
## sector:married         7  168.1   24.01  1.3296 0.2340399    
## sector:union           7  173.8   24.83  1.3749 0.2136679    
## sex:exper:educ         1    0.2    0.15  0.0083 0.9273512    
## sector:married:union   4  162.9   40.72  2.2548 0.0622147 .  
## Residuals            492 8886.1   18.06                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Only one of the interaction terms seems to be contributing, and that’s pretty marginal.

Let’s try for housing prices:

houses = read.csv("http://www.mosaic-web.org/go/datasets/SaratogaHouses.csv")
hmod = lm(Price ~ Living.Area*Baths*Bedrooms*Fireplace*Age, data=houses)
anova(hmod)

## Analysis of Variance Table
## 
## Response: Price
##                                            Df     Sum Sq    Mean Sq
## Living.Area                                 1 4.5060e+12 4.5060e+12
## Baths                                       1 1.5644e+11 1.5644e+11
## Bedrooms                                    1 3.3020e+10 3.3020e+10
## Fireplace                                   1 5.1717e+09 5.1717e+09
## Age                                         1 3.9644e+09 3.9644e+09
## Living.Area:Baths                           1 3.3073e+11 3.3073e+11
## Living.Area:Bedrooms                        1 1.3294e+09 1.3294e+09
## Baths:Bedrooms                              1 5.1315e+09 5.1315e+09
## Living.Area:Fireplace                       1 8.7434e+09 8.7434e+09
## Baths:Fireplace                             1 5.6659e+09 5.6659e+09
## Bedrooms:Fireplace                          1 3.2093e+09 3.2093e+09
## Living.Area:Age                             1 1.0048e+10 1.0048e+10
## Baths:Age                                   1 3.8552e+10 3.8552e+10
## Bedrooms:Age                                1 7.1130e+08 7.1130e+08
## Fireplace:Age                               1 2.1777e+09 2.1777e+09
## Living.Area:Baths:Bedrooms                  1 2.6660e+10 2.6660e+10
## Living.Area:Baths:Fireplace                 1 2.6662e+09 2.6662e+09
## Living.Area:Bedrooms:Fireplace              1 1.0438e+10 1.0438e+10
## Baths:Bedrooms:Fireplace                    1 3.2456e+09 3.2456e+09
## Living.Area:Baths:Age                       1 2.5256e+09 2.5256e+09
## Living.Area:Bedrooms:Age                    1 9.3272e+09 9.3272e+09
## Baths:Bedrooms:Age                          1 1.0598e+09 1.0598e+09
## Living.Area:Fireplace:Age                   1 1.8997e+08 1.8997e+08
## Baths:Fireplace:Age                         1 9.0038e+09 9.0038e+09
## Bedrooms:Fireplace:Age                      1 8.6749e+09 8.6749e+09
## Living.Area:Baths:Bedrooms:Fireplace        1 6.3867e+09 6.3867e+09
## Living.Area:Baths:Bedrooms:Age              1 1.3267e+09 1.3267e+09
## Living.Area:Baths:Fireplace:Age             1 3.4116e+09 3.4116e+09
## Living.Area:Bedrooms:Fireplace:Age          1 2.6334e+07 2.6334e+07
## Baths:Bedrooms:Fireplace:Age                1 1.0440e+10 1.0440e+10
## Living.Area:Baths:Bedrooms:Fireplace:Age    1 2.9448e+09 2.9448e+09
## Residuals                                1031 2.3695e+12 2.2983e+09
##                                            F value    Pr(>F)    
## Living.Area                              1960.6253 < 2.2e-16 ***
## Baths                                      68.0701 4.779e-16 ***
## Bedrooms                                   14.3672 0.0001591 ***
## Fireplace                                   2.2503 0.1338984    
## Age                                         1.7249 0.1893496    
## Living.Area:Baths                         143.9052 < 2.2e-16 ***
## Living.Area:Bedrooms                        0.5784 0.4470997    
## Baths:Bedrooms                              2.2328 0.1354168    
## Living.Area:Fireplace                       3.8043 0.0513904 .  
## Baths:Fireplace                             2.4653 0.1166911    
## Bedrooms:Fireplace                          1.3964 0.2375987    
## Living.Area:Age                             4.3718 0.0367817 *  
## Baths:Age                                  16.7747 4.539e-05 ***
## Bedrooms:Age                                0.3095 0.5781107    
## Fireplace:Age                               0.9475 0.3305731    
## Living.Area:Baths:Bedrooms                 11.6002 0.0006849 ***
## Living.Area:Baths:Fireplace                 1.1601 0.2816953    
## Living.Area:Bedrooms:Fireplace              4.5419 0.0333118 *  
## Baths:Bedrooms:Fireplace                    1.4122 0.2349657    
## Living.Area:Baths:Age                       1.0989 0.2947509    
## Living.Area:Bedrooms:Age                    4.0584 0.0442118 *  
## Baths:Bedrooms:Age                          0.4611 0.4972513    
## Living.Area:Fireplace:Age                   0.0827 0.7737848    
## Baths:Fireplace:Age                         3.9177 0.0480469 *  
## Bedrooms:Fireplace:Age                      3.7746 0.0523093 .  
## Living.Area:Baths:Bedrooms:Fireplace        2.7789 0.0958167 .  
## Living.Area:Baths:Bedrooms:Age              0.5773 0.4475597    
## Living.Area:Baths:Fireplace:Age             1.4844 0.2233599    
## Living.Area:Bedrooms:Fireplace:Age          0.0115 0.9147751    
## Baths:Bedrooms:Fireplace:Age                4.5427 0.0332961 *  
## Living.Area:Baths:Bedrooms:Fireplace:Age    1.2813 0.2579179    
## Residuals                                                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our job is going to be to start developing a clear understanding of ANOVA and how it works, so that we can interpret the results of ANOVA in useful ways.

Reference

This demo is based directly on material from ‘Statistical Modeling: A Fresh Approach (2nd Edition)’ by Daniel Kaplan.