Using R to Compute, Analyze, and Present Data

In this handout, I’m going to work through the very basics of computing a bivariate OLS regression and showing its output in tables and graphically. Let’s begin by working with the swiss dataset pre-loaded in R:

data(swiss)
head(swiss) # head() returns the first five records in a dataset
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

Using ?swiss gives you more background on this dataset—it’s a record of various demographic indicators from Switzerland in 1888. Who cares? Well, nobody (maybe the Swiss) but it’s a useful dataset with lots of different variable types.

Let’s choose two variables that seem likely to be related to each other: Agriculture, the percentage of males involved in agriculture by province, and Education, the percentage of draftees receiving education beyond primary school. A simple scatterplot shows these variables’ relationship:

plot(swiss$Education,swiss$Agriculture,
     main="Agricultural Occupational Percentage by Educational Attainment",
     xlab="Percentage of Draftees with More than Primary Education (Province-Level)",
     ylab="Percentage of Males Who Farm (Province-Level)")

These variables look nicely correlated, right? So, let’s fit the world’s simplest OLS regression to see if there is a relationship:

swiss.lm <- lm(Agriculture~Education, data=swiss)
summary(swiss.lm)
## 
## Call:
## lm(formula = Agriculture ~ Education, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.927  -9.633   2.188  12.762  26.720 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  67.2432     3.9321   17.10  < 2e-16 ***
## Education    -1.5105     0.2707   -5.58  1.3e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.65 on 45 degrees of freedom
## Multiple R-squared:  0.409,  Adjusted R-squared:  0.3959 
## F-statistic: 31.14 on 1 and 45 DF,  p-value: 1.305e-06

Quick Exercise 1

Write out the estimated regression line relating Agriculture and Education.


So, wonderful! We have a regression! Is this causal? Well, by now you should be thinking about all the reasons it’s not—but on the other hand, there’s definitely lots of reasons to think that provinces with more-educated populaces have fewer farmers as a share of their labor force. But how can we show this relationship? It turns out to be almost trivially easy:

plot(swiss$Education,swiss$Agriculture,
     main="Agricultural Occupational Percentage by Educational Attainment",
     xlab="Percentage of Draftees with More than Primary Education (Province-Level)",
     ylab="Percentage of Males Who Farm (Province-Level)",
     pch=16 ) # pch is "plot character" and frankly I hate the default R choice; google R pch table for more options.
abline(swiss.lm,
       col="red")

Wow! That’s a lot easier to work with!