Hypothesis Testing: The Absolute Basics

Bailey, Chapter 4, spends a lot of time talking about the theory and statistics behind hypothesis testing. This handout in no way replaces that chapter! Instead, it does one thing: it teaches you how to interpret OLS regression output in R so you can perform hypothesis tests.

We’ll begin with the swiss dataset:

data(swiss)

To find the variables available to us, we use

names(swiss)
## [1] "Fertility"        "Agriculture"      "Examination"     
## [4] "Education"        "Catholic"         "Infant.Mortality"

Let’s choose two variables. For our X variable, I suggest using Education; for our Y variable, let’s use Fertility. The basic hypothesis here would be that Education levels affect Fertility; in general, Education lowers Fertility. (That could be a modern notion, though: many of the proposed mechanisms behind that relationship are premised on the idea that women who are more educated choose to have fewer children, and our Education variable here measures male education levels.)

Let’s plot the data:

plot(swiss$Fertility~swiss$Education,
     main="Fertility Rates and Education Levels in Switzerland (1888)",
     xlab="Education Levels",
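     ylab="Fertility (Ig)",
     pch=16)

(Note that Fertility here is not a raw birth rate or a log value: R's help page for swiss describes it as Ig, a “common standardized fertility measure”, and every variable in the dataset is scaled to run from 0 to 100.)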

Nothing too surprising here: It looks like there’s a generally negative relationship!

Let’s estimate that relationship using OLS. We’ve run OLS regressions before:

swiss.test <- lm(Fertility~Education,data=swiss)
summary(swiss.test)
## 
## Call:
## lm(formula = Fertility ~ Education, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.036  -6.711  -1.011   9.526  19.689 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  79.6101     2.1041  37.836  < 2e-16 ***
## Education    -0.8624     0.1448  -5.954 3.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.446 on 45 degrees of freedom
## Multiple R-squared:  0.4406, Adjusted R-squared:  0.4282 
## F-statistic: 35.45 on 1 and 45 DF,  p-value: 3.659e-07
swiss.test ## If you're wondering why I always use the summary() command on the output of lm() functions.
## 
## Call:
## lm(formula = Fertility ~ Education, data = swiss)
## 
## Coefficients:
## (Intercept)    Education  
##     79.6101      -0.8624

What’s important in the summary(swiss.test) output? First, it gives us our estimate for the intercept: 79.6. That’s nice, and the computer will need that if we want to plot the relationship, but we don’t care right now.

What we ARE interested in is the row labeled “Education”. This gives us the Estimate of \(\beta_1\) (that is, \(\hat{\beta_1}\)), which is -0.8624. It also reports the Std. Error of that estimate (which Bailey denotes as \(se(\hat{\beta_1})\)), which is 0.1448.

If we compute \(\frac{\hat{\beta_1}}{se(\hat{\beta_1})}\), we get \(\frac{-0.8624}{0.1448}\), which my calculator says is -5.9558. That’s awfully close to the t value that R reports, -5.954 (the difference is driven by rounding). This t-value is the t-statistic derived by William Sealy Gosset of the Guinness brewery, thereby becoming the Guinness corporation’s second-greatest contribution to mankind.
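If you'd rather have R do that division with the unrounded numbers, here is a minimal sketch. It refits the model so it runs on its own; the names coef.table, b1.hat, se.b1, and t.by.hand are just labels I made up for this example.

```r
# Refit the model so this chunk runs on its own
data(swiss)
swiss.test <- lm(Fertility ~ Education, data = swiss)

# coef(summary(...)) returns the coefficient table as a numeric matrix
coef.table <- coef(summary(swiss.test))

# Pull out the estimate and its standard error for Education
b1.hat <- coef.table["Education", "Estimate"]
se.b1  <- coef.table["Education", "Std. Error"]

# The t-statistic is the estimate divided by its standard error
t.by.hand <- b1.hat / se.b1
t.by.hand

# R's own t value, for comparison
coef.table["Education", "t value"]
```

Because nothing is rounded along the way, the two numbers printed at the end should agree with each other (and with the -5.954 in the summary table).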

A t-value of about -6 is really big, given that the critical value for the two-sided 0.05 significance level we usually want is about 1.96 in large samples (or “2”); with our 45 degrees of freedom it is about 2.01. (Note: a negative t-value can still be really big because we care about the absolute value of t.)
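You don't have to take the critical value on faith. The sketch below asks R for it using qt(), assuming a two-sided test at the 0.05 level and the 45 residual degrees of freedom reported in the summary output. The names alpha, resid.df, t.crit, and t.obs are my own labels, not anything built into R.

```r
# Refit the model so this chunk runs on its own
data(swiss)
swiss.test <- lm(Fertility ~ Education, data = swiss)

alpha <- 0.05                          # assumed significance level (two-sided)
resid.df <- df.residual(swiss.test)    # residual degrees of freedom (45 here)

# Two-sided critical value: put alpha/2 in each tail of the t distribution
t.crit <- qt(1 - alpha / 2, df = resid.df)
t.crit

# Compare the absolute value of the observed t-statistic to the critical value
t.obs <- coef(summary(swiss.test))["Education", "t value"]
abs(t.obs) > t.crit   # TRUE means we reject the null at the 0.05 level
```

With only 45 degrees of freedom the critical value comes out a bit above 1.96, but a t-statistic near 6 in absolute value clears it easily either way.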

Consequently, the estimated p-value (Pr(>|t|)) is really small. It is so small, in fact, that R reports it in scientific notation: 3.66e-07 is \(3.66 \times 10^{-7}\), or 0.000000366, which is to say “essentially zero”. (NOTE: This would not be zero in a physics class, but it’s only social science.)
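To see where that p-value comes from, here is a small sketch that rebuilds it from the t-statistic with pt(), assuming the usual two-sided test. As before, t.obs, resid.df, p.by.hand, and p.from.R are names I picked for illustration.

```r
# Refit the model so this chunk runs on its own
data(swiss)
swiss.test <- lm(Fertility ~ Education, data = swiss)

coef.table <- coef(summary(swiss.test))
t.obs <- coef.table["Education", "t value"]
resid.df <- df.residual(swiss.test)

# Two-sided p-value: probability of a t at least this far from zero,
# in either direction, if the true coefficient were zero
p.by.hand <- 2 * pt(-abs(t.obs), df = resid.df)

# The p-value R prints in the Pr(>|t|) column
p.from.R <- coef.table["Education", "Pr(>|t|)"]

p.by.hand
p.from.R

# Same number without scientific notation
format(p.by.hand, scientific = FALSE)
```

The first two printed values should match, and the last line shows the same number written out with all its leading zeros.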

Bailey’s chapter should tell you how to interpret a p-value that small.

A quick bonus:

plot(swiss$Fertility~swiss$Education,
     main="Fertility Rates and Education Levels in Switzerland (1888)",
     xlab="Education Levels",
     ylab="Fertility (Ig)",
     pch=16)
abline(swiss.test,
       col="red")