Last week, we worked with the swiss dataset:
names(swiss)
## [1] "Fertility" "Agriculture" "Examination"
## [4] "Education" "Catholic" "Infant.Mortality"
Let’s try that again. We want to know if farmers have more children than non-farmers. We start with a straightforward bivariate regression:
swiss.1 <- lm(Fertility ~ Agriculture, data = swiss)
summary(swiss.1)
##
## Call:
## lm(formula = Fertility ~ Agriculture, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.5374 -7.8685 -0.6362 9.0464 24.4858
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.30438 4.25126 14.185 <2e-16 ***
## Agriculture 0.19420 0.07671 2.532 0.0149 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.82 on 45 degrees of freedom
## Multiple R-squared: 0.1247, Adjusted R-squared: 0.1052
## F-statistic: 6.409 on 1 and 45 DF, p-value: 0.01492
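One way to read the slope in substantive terms is to compare predicted values for two districts that differ only in agricultural employment. The sketch below is a minimal illustration, not part of the original analysis: the two Agriculture values (40 and 50) and the names `new_districts` and `predicted` are made up for the example.

```r
# Refit the bivariate model so the block runs on its own.
swiss.1 <- lm(Fertility ~ Agriculture, data = swiss)

# Two hypothetical districts: 40% and 50% of males in agriculture.
new_districts <- data.frame(Agriculture = c(40, 50))
predicted <- predict(swiss.1, newdata = new_districts)

# The gap between the two predictions is 10 times the Agriculture slope.
predicted
diff(predicted)
```

Because the model is linear, the same gap applies to any two districts that are 10 percentage points apart on Agriculture.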
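And plot it:
# Note: Fertility in swiss is the standardized Ig measure, not a logged rate.
plot(Fertility ~ Agriculture, data = swiss,
     main = "Fertility Rates and Agricultural Employment in Switzerland (1888)",
     xlab = "Pct Males in Agriculture",
     ylab = "Fertility (Ig, standardized)",
     pch = 16)
# Overlay the fitted line from the bivariate regression.
abline(swiss.1, col = "red")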
This looks like a pretty straightforward diagnosis, right? Having more farmers (or at least districts high in agricultural employment) leads to higher fertility! Get all the Swiss men off the farm (if we want to lower fertility) or put them to work in the fields (if we want to raise it)!
But wait.
What if there’s something else lurking in that error term? What if there’s … endogeneity?
Write down some ways that endogeneity could bias our estimates of \(\beta_{Agriculture}\). Indicate the direction in which these mechanisms would bias our estimates.
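To check intuitions about the direction of bias, it can help to simulate a case where the answer is known. The sketch below uses made-up data (not the swiss dataset): `z` is a hypothetical omitted factor that is positively correlated with `x` and also raises `y`. All names and parameter values are assumptions chosen for illustration.

```r
set.seed(42)
n <- 10000

z <- rnorm(n)                       # omitted factor, sitting in the error term
x <- 0.5 * z + rnorm(n)             # x is positively correlated with z
y <- 1 + 2 * x + 3 * z + rnorm(n)   # true effect of x on y is 2

# "Short" regression leaves z out; "long" regression includes it.
short <- unname(coef(lm(y ~ x))["x"])
long  <- unname(coef(lm(y ~ x + z))["x"])

# In expectation the short slope is 2 + 3 * cov(x, z) / var(x)
# = 2 + 3 * 0.5 / 1.25 = 3.2, so it overstates the true effect.
c(short = short, long = long)
```

Flipping the sign of either the `0.5` or the `3` flips the direction of the bias, which is the pattern to look for when reasoning about Agriculture and Fertility.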
Bailey, Chapter 5, lays out the ways that we will address these problems. Let’s work through this running example.
We might hypothesize that Agriculture affects Fertility, but what are some other factors that affect Fertility without affecting agricultural employment propensity?
For instance, do people rationally choose to have more children if they expect fewer of them to survive? (That might seem cruel, but it’s definitely a fair hypothesis.) We can test that by estimating the equation \(Fertility = \beta_{0} + \beta_{Agriculture} Agriculture + \beta_{InfantMortality} InfantMortality + \epsilon\), which lets us estimate the relationship between Agriculture and Fertility while holding Infant Mortality constant:
swiss.2 <- lm(Fertility ~ Agriculture + Infant.Mortality, data = swiss)
summary(swiss.2)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.334 -7.603 -1.920 7.070 24.162
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.95462 11.52852 1.904 0.06341 .
## Agriculture 0.20892 0.06864 3.044 0.00394 **
## Infant.Mortality 1.88563 0.53522 3.523 0.00101 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.55 on 44 degrees of freedom
## Multiple R-squared: 0.3173, Adjusted R-squared: 0.2862
## F-statistic: 10.22 on 2 and 44 DF, p-value: 0.0002257
(Yes, running multivariate OLS is exactly the same as running bivariate OLS, except that we add “+ covariatename” to the formula for each new X term.)
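What “holding Infant Mortality constant” means can be made concrete by partialling out. The sketch below (the names `fert_resid`, `agri_resid`, and `partial` are mine, not from the original handout) strips the part of Fertility and the part of Agriculture that Infant Mortality predicts, then regresses one set of residuals on the other. The resulting slope should match the Agriculture coefficient from the two-variable model.

```r
# The two-variable model from above, refit so the block runs on its own.
swiss.2 <- lm(Fertility ~ Agriculture + Infant.Mortality, data = swiss)

# Remove the part of each variable that Infant.Mortality accounts for.
fert_resid <- resid(lm(Fertility ~ Infant.Mortality, data = swiss))
agri_resid <- resid(lm(Agriculture ~ Infant.Mortality, data = swiss))

# Regress what is left of Fertility on what is left of Agriculture.
partial <- lm(fert_resid ~ agri_resid)

# The two slopes agree up to floating point error.
coef(partial)["agri_resid"]
coef(swiss.2)["Agriculture"]
```

The standard errors of the residual regression differ slightly from those in `summary(swiss.2)` because the degrees of freedom are counted differently, so use this only to build intuition about the coefficient.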
Write out the regression equation we have estimated. Interpret Agriculture and Infant Mortality’s relationships with Fertility.
We might also think that religion influences Fertility through channels that don’t directly affect (male) occupational choice (especially in Switzerland, where religious affiliation reflects centuries-old political settlements):
swiss.3 <- lm(Fertility ~ Agriculture + Catholic, data = swiss)
summary(swiss.3)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Catholic, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.863 -6.350 1.136 7.476 18.056
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.86392 3.98754 15.013 <2e-16 ***
## Agriculture 0.10953 0.07848 1.396 0.1698
## Catholic 0.11496 0.04274 2.690 0.0101 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.07 on 44 degrees of freedom
## Multiple R-squared: 0.2483, Adjusted R-squared: 0.2141
## F-statistic: 7.266 on 2 and 44 DF, p-value: 0.001876
Write out the regression equation we have estimated. Interpret the relationships you see.
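One way to see why the Agriculture coefficient moves when Catholic enters the model is the omitted variable bias identity: the bivariate slope equals the multivariate slope plus the Catholic coefficient times the slope from regressing Catholic on Agriculture. The sketch below checks that identity on the swiss data. The object names (`short`, `long`, `aux`, and so on) are illustrative, not from the original handout.

```r
# Short model (Catholic omitted) and long model (Catholic included).
short <- lm(Fertility ~ Agriculture, data = swiss)
long  <- lm(Fertility ~ Agriculture + Catholic, data = swiss)

# Auxiliary regression: how the Catholic share moves with Agriculture.
aux <- lm(Catholic ~ Agriculture, data = swiss)

b_short <- unname(coef(short)["Agriculture"])
b_long  <- unname(coef(long)["Agriculture"])
b_cath  <- unname(coef(long)["Catholic"])
delta   <- unname(coef(aux)["Agriculture"])

# Identity: b_short = b_long + b_cath * delta
c(short = b_short, rebuilt = b_long + b_cath * delta)
```

The identity is purely algebraic. It says nothing about whether the long model is itself free of endogeneity.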
Let’s also estimate:
swiss.4 <- lm(Fertility ~ Agriculture + Education, data = swiss)
summary(swiss.4)
Write out the regression equation we have estimated. Interpret the relationships you see.
Finally, let’s throw everything together:
swiss.5 <- lm(Fertility ~ Agriculture + Infant.Mortality + Catholic + Education, data = swiss)
summary(swiss.5)
And let’s see the results of this model (and all the earlier ones) in tabular format:
Dependent variable: Swiss Fertility Rates

| | (1) | (2) | (3) | (4) | (5) |
|---|---|---|---|---|---|
| Agriculture | 0.194** | 0.209*** | 0.110 | -0.066 | -0.155** |
| | (0.077) | (0.069) | (0.078) | (0.080) | (0.068) |
| Infant.Mortality | | 1.886*** | | | 1.078*** |
| | | (0.535) | | | (0.382) |
| Catholic | | | 0.115** | | 0.125*** |
| | | | (0.043) | | (0.029) |
| Education | | | | -0.963*** | -0.980*** |
| | | | | (0.189) | (0.148) |
| Constant | 60.304*** | 21.955* | 59.864*** | 84.080*** | 62.101*** |
| | (4.251) | (11.529) | (3.988) | (5.782) | (9.605) |
| Observations | 47 | 47 | 47 | 47 | 47 |
| R2 | 0.125 | 0.317 | 0.248 | 0.449 | 0.699 |
| Adjusted R2 | 0.105 | 0.286 | 0.214 | 0.424 | 0.671 |
| Residual Std. Error | 11.816 (df = 45) | 10.554 (df = 44) | 11.074 (df = 44) | 9.479 (df = 44) | 7.168 (df = 42) |
| F Statistic | 6.409** (df = 1; 45) | 10.223*** (df = 2; 44) | 7.266*** (df = 2; 44) | 17.945*** (df = 2; 44) | 24.424*** (df = 4; 42) |

Note: *p<0.1; **p<0.05; ***p<0.01. Standard errors in parentheses.
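The coefficient rows of a table like the one above can be assembled in base R by lining up the estimates from each model. This is a minimal sketch, not the code that produced the original table: the names `models`, `terms_all`, and `coef_table` are mine, and the result omits standard errors and significance stars.

```r
# Fit the five models and label them by column number.
models <- list(
  "(1)" = lm(Fertility ~ Agriculture, data = swiss),
  "(2)" = lm(Fertility ~ Agriculture + Infant.Mortality, data = swiss),
  "(3)" = lm(Fertility ~ Agriculture + Catholic, data = swiss),
  "(4)" = lm(Fertility ~ Agriculture + Education, data = swiss),
  "(5)" = lm(Fertility ~ Agriculture + Infant.Mortality + Catholic + Education,
             data = swiss)
)

# Every term that appears in at least one model, in order of first appearance.
terms_all <- unique(unlist(lapply(models, function(m) names(coef(m)))))

# One column per model; terms a model does not include show up as NA.
coef_table <- sapply(models, function(m) coef(m)[terms_all])
rownames(coef_table) <- terms_all

# Add R-squared as a final row for comparison across models.
r_squared <- sapply(models, function(m) summary(m)$r.squared)
round(rbind(coef_table, R2 = r_squared), 3)
```

Reading across the Agriculture row of the output shows how the estimate changes as covariates are added.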
How does the table present the information from the regression equations we wrote out above? Why present the information as a table? How does the relationship we’re interested in change as we add more variables? Why? Which model should we prefer, and why?