Handout 008: In-Class Beta bias

Hypothesis Testing and Multivariate OLS

Last week, we worked with the swiss dataset:

names(swiss)

## [1] "Fertility"        "Agriculture"      "Examination"     
## [4] "Education"        "Catholic"         "Infant.Mortality"

Let’s try that again. We want to know if farmers have more children than non-farmers. We start with a straightforward bivariate regression:

swiss.1 <- lm(Fertility~Agriculture,data=swiss)
summary(swiss.1)

## 
## Call:
## lm(formula = Fertility ~ Agriculture, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.5374  -7.8685  -0.6362   9.0464  24.4858 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 60.30438    4.25126  14.185   <2e-16 ***
## Agriculture  0.19420    0.07671   2.532   0.0149 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.82 on 45 degrees of freedom
## Multiple R-squared:  0.1247, Adjusted R-squared:  0.1052 
## F-statistic: 6.409 on 1 and 45 DF,  p-value: 0.01492
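The t value and p-value reported for Agriculture can be reproduced by hand, which makes the hypothesis test behind the stars explicit. The following is a minimal sketch, assuming the null hypothesis \(\beta_{Agriculture} = 0\) and a two-sided test; the names coefs, b, se, t.stat, and p.val are illustrative and not part of the handout.

```r
# Refit the bivariate model so this block runs on its own
swiss.1 <- lm(Fertility ~ Agriculture, data = swiss)

# Pull the estimate and standard error for Agriculture from the coefficient table
coefs <- coef(summary(swiss.1))
b  <- coefs["Agriculture", "Estimate"]
se <- coefs["Agriculture", "Std. Error"]

# t statistic for H0: beta_Agriculture = 0
t.stat <- b / se

# Two-sided p-value from the t distribution with the model's residual degrees of freedom
p.val <- 2 * pt(-abs(t.stat), df = df.residual(swiss.1))

c(t = t.stat, p = p.val)
```

The two numbers should match the "t value" and "Pr(>|t|)" columns in the Agriculture row of the summary output.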

Then we plot it:

plot(swiss$Fertility~swiss$Agriculture,
     main="Fertility Rates and Agricultural Employment in Switzerland (1888)",
     xlab="Pct Males in Agriculture",
     ylab="Fertility (Ig, standardized measure)",
     pch=16)
abline(swiss.1,
       col="red")

This looks like a pretty straightforward diagnosis, right? Having more farmers (or at least districts high in agricultural employment) leads to higher fertility! Get all the Swiss men off the farm (if we want to lower fertility) or put them to work in the fields (if we want to raise it)!

But wait.

What if there’s something else lurking in that error term? What if there’s … endogeneity?

Quick Exercise 1

Write down some ways that endogeneity could bias our estimates of \(\beta_{Agriculture}\). Indicate the direction in which these mechanisms would bias our estimates.

Bailey, Chapter 5, lays out the ways that we will address these problems. Let’s work through this running example.

We might hypothesize that Agriculture affects Fertility, but what are some other factors that affect Fertility without affecting agricultural employment propensity?

For instance, do people rationally choose to have more children if they expect fewer of them to survive? (That might seem cruel, but it’s definitely a fair hypothesis.) We can test that by estimating the equation \(Fertility = \beta_{0} + \beta_{Agriculture} Agriculture + \beta_{InfantMortality} InfantMortality + \epsilon\), which lets us estimate the relationship between Agriculture and Fertility while holding Infant Mortality constant:

swiss.2 <- lm(Fertility~Agriculture + Infant.Mortality,data=swiss)
summary(swiss.2)

## 
## Call:
## lm(formula = Fertility ~ Agriculture + Infant.Mortality, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.334  -7.603  -1.920   7.070  24.162 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      21.95462   11.52852   1.904  0.06341 . 
## Agriculture       0.20892    0.06864   3.044  0.00394 **
## Infant.Mortality  1.88563    0.53522   3.523  0.00101 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.55 on 44 degrees of freedom
## Multiple R-squared:  0.3173, Adjusted R-squared:  0.2862 
## F-statistic: 10.22 on 2 and 44 DF,  p-value: 0.0002257

(Yes, running multivariate OLS is exactly the same as bivariate OLS — except that we add “+ covariatename” to add a new X term.)
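One way to see why the Agriculture coefficient moves when Infant.Mortality enters the model is the omitted variable bias identity: the bivariate slope equals the multivariate slope plus the Infant.Mortality coefficient times the slope from regressing Infant.Mortality on Agriculture. The sketch below checks that identity on the swiss data; the object names short, long, aux, b.short, and implied are illustrative assumptions, not names used elsewhere in the handout.

```r
# Short (bivariate) and long (multivariate) regressions
short <- lm(Fertility ~ Agriculture, data = swiss)
long  <- lm(Fertility ~ Agriculture + Infant.Mortality, data = swiss)

# Auxiliary regression: how the omitted variable moves with Agriculture
aux <- lm(Infant.Mortality ~ Agriculture, data = swiss)

# Slope on Agriculture when Infant.Mortality is left out
b.short <- unname(coef(short)["Agriculture"])

# Long-regression slope plus (effect of omitted variable) x (its slope on Agriculture)
implied <- unname(coef(long)["Agriculture"] +
                  coef(long)["Infant.Mortality"] * coef(aux)["Agriculture"])

# The two quantities agree up to floating-point error
all.equal(b.short, implied)
```

The sign of the gap between the short and long slopes is therefore the sign of the product of the last two terms, which is a useful way to reason about the direction of bias.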

Quick Exercise 2

Write out the regression equation we have estimated. Interpret Agriculture and Infant Mortality’s relationships with Fertility.

We might also think that religion might influence Fertility through channels that don’t directly affect (male) occupational choice (especially in Switzerland, where religious affiliation reflects centuries-old political settlements):

swiss.3 <- lm(Fertility~Agriculture + Catholic,data=swiss)
summary(swiss.3)

## 
## Call:
## lm(formula = Fertility ~ Agriculture + Catholic, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.863  -6.350   1.136   7.476  18.056 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 59.86392    3.98754  15.013   <2e-16 ***
## Agriculture  0.10953    0.07848   1.396   0.1698    
## Catholic     0.11496    0.04274   2.690   0.0101 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.07 on 44 degrees of freedom
## Multiple R-squared:  0.2483, Adjusted R-squared:  0.2141 
## F-statistic: 7.266 on 2 and 44 DF,  p-value: 0.001876
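What "holding Catholic constant" means can be made concrete by partialling out: strip from Agriculture the part that Catholic predicts, then regress Fertility on what is left. This is a small sketch of that idea (often called the Frisch-Waugh-Lovell result); agri.resid, fwl, and full are hypothetical names introduced only for this illustration.

```r
# Part of Agriculture that is not linearly explained by Catholic
agri.resid <- resid(lm(Agriculture ~ Catholic, data = swiss))

# Regress Fertility on that leftover variation only
fwl <- lm(swiss$Fertility ~ agri.resid)

# The multivariate model from the handout, refit so the block is self-contained
full <- lm(Fertility ~ Agriculture + Catholic, data = swiss)

# Both slopes on Agriculture are the same number
c(partialled = unname(coef(fwl)["agri.resid"]),
  full       = unname(coef(full)["Agriculture"]))
```

The slopes coincide, but the standard errors from the two fits differ, so inference should still come from the full model.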

Quick Exercise 3

Write out the regression equation we have estimated. Interpret the relationships you see.

Let’s also estimate:

swiss.4 <- lm(Fertility~Agriculture + Education,data=swiss)
summary(swiss.4)

Quick Exercise 4

Write out the regression equation we have estimated. Interpret the relationships you see.

Finally, let’s throw everything together:

swiss.5 <- lm(Fertility~Agriculture + Infant.Mortality + Catholic + Education,data=swiss)
summary(swiss.5)

Now let’s see the results of all five models in tabular format:


                     Dependent variable:
                     Swiss Fertility Rates
                     (1)           (2)           (3)           (4)           (5)

Agriculture          0.194**       0.209***      0.110         -0.066        -0.155**
                     (0.077)       (0.069)       (0.078)       (0.080)       (0.068)

Infant.Mortality                   1.886***                                  1.078***
                                   (0.535)                                   (0.382)

Catholic                                         0.115**                     0.125***
                                                 (0.043)                     (0.029)

Education                                                      -0.963***     -0.980***
                                                               (0.189)       (0.148)

Constant             60.304***     21.955*       59.864***     84.080***     62.101***
                     (4.251)       (11.529)      (3.988)       (5.782)       (9.605)

Observations         47            47            47            47            47
R²                   0.125         0.317         0.248         0.449         0.699
Adjusted R²          0.105         0.286         0.214         0.424         0.671
Residual Std. Error  11.816        10.554        11.074        9.479         7.168
                     (df = 45)     (df = 44)     (df = 44)     (df = 44)     (df = 42)
F Statistic          6.409**       10.223***     7.266***      17.945***     24.424***
                     (df = 1; 45)  (df = 2; 44)  (df = 2; 44)  (df = 2; 44)  (df = 4; 42)

Note:                *p<0.1; **p<0.05; ***p<0.01
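The handout does not show the code that produced the table above. As a rough sketch, and not necessarily how the table was generated, the coefficient estimates and R² values can be lined up side by side in base R. The names models, rows, coef.tab, and r2 are assumptions made for this example.

```r
# Refit all five models so the block runs on its own
models <- list(
  lm(Fertility ~ Agriculture, data = swiss),
  lm(Fertility ~ Agriculture + Infant.Mortality, data = swiss),
  lm(Fertility ~ Agriculture + Catholic, data = swiss),
  lm(Fertility ~ Agriculture + Education, data = swiss),
  lm(Fertility ~ Agriculture + Infant.Mortality + Catholic + Education, data = swiss)
)

# Row order used in the table; terms missing from a model come back as NA
rows <- c("Agriculture", "Infant.Mortality", "Catholic", "Education", "(Intercept)")
coef.tab <- sapply(models, function(m) unname(coef(m)[rows]))
dimnames(coef.tab) <- list(rows, paste0("(", 1:5, ")"))

# R-squared for each model, appended as a final row
r2 <- sapply(models, function(m) summary(m)$r.squared)
round(rbind(coef.tab, R2 = r2), 3)
```

This leaves out standard errors and significance stars, which would come from coef(summary(m)) for each model.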

Quick Exercise 6

How does the table present the information from the regression equations we wrote out above? Why present the information as a table? How does the relationship we’re interested in change as we add more variables? Why? Which model should we prefer, and why?