In this handout, I’m going to work through the very basics of computing a bivariate OLS regression and showing its output in tables and graphically. Let’s begin by working with the swiss dataset pre-loaded in R:
data(swiss)
head(swiss) # head() returns the first five records in a dataset
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
Using ?swiss gives you more background on this dataset—it’s a record of various demographic indicators from Switzerland in 1888. Who cares? Well, nobody (maybe the Swiss) but it’s a useful dataset with lots of different variable types.
Let’s choose two variables that seem likely to be related to each other: Agriculture, the percentage of males involved in agriculture by province, and Education, the percentage of draftees receiving education beyond primary school. A simple scatterplot shows these variables’ relationship:
plot(swiss$Education,swiss$Agriculture,
main="Agricultural Occupational Percentage by Educational Attainment",
xlab="Percentage of Draftees with More than Primary Education (Province-Level)",
ylab="Percentage of Males Who Farm (Province-Level)")
These variables look nicely correlated, right? So, let’s fit the world’s simplest OLS regression to see if there is a relationship:
swiss.lm <- lm(Agriculture~Education, data=swiss)
summary(swiss.lm)
##
## Call:
## lm(formula = Agriculture ~ Education, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.927 -9.633 2.188 12.762 26.720
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.2432 3.9321 17.10 < 2e-16 ***
## Education -1.5105 0.2707 -5.58 1.3e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.65 on 45 degrees of freedom
## Multiple R-squared: 0.409, Adjusted R-squared: 0.3959
## F-statistic: 31.14 on 1 and 45 DF, p-value: 1.305e-06
Write out the estimated regression line relating Agriculture and Education.
So, wonderful! We have a regression! Is this causal? Well, by now you should be thinking about all the reasons it’s not—but on the other hand, there’s definitely lots of reasons to think that provinces with more-educated populaces have fewer farmers as a share of their labor force. But how can we show this relationship? It turns out to be almost trivially easy:
plot(swiss$Education,swiss$Agriculture,
main="Agricultural Occupational Percentage by Educational Attainment",
xlab="Percentage of Draftees with More than Primary Education (Province-Level)",
ylab="Percentage of Males Who Farm (Province-Level)",
pch=16 ) # pch is "plot character" and frankly I hate the default R choice; google R pch table for more options.
abline(swiss.lm,
col="red")
Wow! That’s a lot easier to work with!