Linear Models
Linear regressions are used to model the linear relationship between one dependent variable and one or more independent variables. Remember that a regression coefficient represents the mean change in the dependent variable for one unit of change in the independent variable.
In this laboratory, we will: (1) load a dataset from the
datasets package, (2) graph the relationship between two
and three variables, (3) perform simple and multiple linear regression
analyses using lm().
Relevant functions: lm(),
summary(), ggplot().
1. Loading Data
We will use the swiss dataset from the
datasets package. It encompasses a standardized fertility
measure and socio-economic indicators for each of 47 French-speaking
provinces of Switzerland from the year 1888. First, install the
datasets package using install.packages().
Then, load it into your library using library(). You can
then refer to the swiss data frame directly using any
function.
library(datasets) # Loading the car package using library()
dim(swiss) # Checking the dimensions of the swiss dataset using dim()
## [1] 47 6
colnames(swiss) # Viewing the colmn names usging head()
## [1] "Fertility" "Agriculture" "Examination" "Education"
## [5] "Catholic" "Infant.Mortality"
head(swiss,10) # Viewing the first 10 observations of the dataset using head()
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Broye 83.8 70.2 16 7 92.85
## Glane 92.4 67.8 14 8 97.16
## Gruyere 82.4 53.3 12 7 97.67
## Sarine 82.9 45.2 16 13 91.38
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
## Broye 23.6
## Glane 24.9
## Gruyere 21.0
## Sarine 24.4
Here is a description of the different variables within this dataset:
| Variable name | Description |
|---|---|
| Fertility | common standardized fertility measure |
| Agriculture | % of males involved in agriculture as occupation |
| Examination | % draftees receiving highest mark on army examination |
| Education | % education beyond primary school for draftees |
| Catholic | % catholic |
| Infant.Mortality | % live births who live less than 1 year |
Here, the unit of observation is not individuals, but rather provinces of Switzerland—i.e. we are dealing with aggregate data. For each observation, this information is encompassed within the row name.
rownames(swiss) # Checking the row names using rownames()
## [1] "Courtelary" "Delemont" "Franches-Mnt" "Moutier" "Neuveville"
## [6] "Porrentruy" "Broye" "Glane" "Gruyere" "Sarine"
## [11] "Veveyse" "Aigle" "Aubonne" "Avenches" "Cossonay"
## [16] "Echallens" "Grandson" "Lausanne" "La Vallee" "Lavaux"
## [21] "Morges" "Moudon" "Nyone" "Orbe" "Oron"
## [26] "Payerne" "Paysd'enhaut" "Rolle" "Vevey" "Yverdon"
## [31] "Conthey" "Entremont" "Herens" "Martigwy" "Monthey"
## [36] "St Maurice" "Sierre" "Sion" "Boudry" "La Chauxdfnd"
## [41] "Le Locle" "Neuchatel" "Val de Ruz" "ValdeTravers" "V. De Geneve"
## [46] "Rive Droite" "Rive Gauche"
2. Bivariate Relationship
Let’s start using a simple linear regression model, looking at the relationship between one independent variable (i.e. infant mortality) and one dependent variable (i.e. fertility rates). Note that defining the direction of this relationship might lead to endogeneity concerns.
2.1 Graphing our Data
Our first step is to visually assess whether there seems to be a linear relationship between infant mortality and fertility rates by looking at a scatterplot.
library(ggplot2)
ggplot(data=swiss, aes(Infant.Mortality,Fertility)) +
geom_point() + xlab("Infant Mortality (%)") + ylab("Fertility Index") +
theme_minimal(base_size = 20) +
xlim(0,30)
At first glance, does there seem to be a correlation between both variables? If so, does the relationship seem positive or negative?
2.2 Running an OLS model
2.1.1 Using the lm() function
Using lm(), we specify a simple linear regression
formula: dependent variable ~ independent variable.
Here, we are regressing one dependent variable on one independent
variable. The lm() function will generate an
intercept and a regression
coefficient. The intercept represents the value
of the dependent variable when the independent variable is equal to
zero. The regression
coefficient represents the mean change in the dependent
variable for one unit of change in the independent variable (in other
words, the slope).
| Information | Description |
|---|---|
| Formula | Our linear model (dependent variable ~ independent variable) |
| Residuals | Difference between the observed and the estimated value of the dependent variable |
| Intercept | Average value of the dependent variable when the independent variable is equal to zero |
| Coefficient | Mean change in the dependent variable for one unit of change in the independent variable |
Note that lm() automatically performs listwise
deletion of rows with a missing value within the vectors you are using.
In other words, it only uses pairwise complete observations.
# Creating our model using lm()
model1 <- lm(Fertility ~ Infant.Mortality, data=swiss)
# Viewing our regression results using summary()
summary(model1)
##
## Call:
## lm(formula = Fertility ~ Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.672 -5.687 -0.381 7.239 28.565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.5155 11.7113 2.947 0.00507 **
## Infant.Mortality 1.7865 0.5812 3.074 0.00359 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.48 on 45 degrees of freedom
## Multiple R-squared: 0.1735, Adjusted R-squared: 0.1552
## F-statistic: 9.448 on 1 and 45 DF, p-value: 0.003585
How do you interpret the coefficient and the intercept?
Let’s now visualize the results of our regression model model1 in a graph.
# Plotting the relationship between the IV and the DV using a scatterplot
ggplot(data=swiss, aes(Infant.Mortality,Fertility)) +
geom_point() + xlab("Infant Mortality (%)") + ylab("Fertility Index") +
theme_minimal(base_size = 20) +
geom_smooth(method='lm', formula= y~x) + # This is the code that adds the lm line
xlim(0,30)
2.1.2 Polynomial models (optional)
ÂExtra Material on Polynomial Models
Sometimes, the relationship between x and y is best described by a nth degree polynomial function.
Therefore, instead of:
\[Y_i = \beta_0 + \beta_1x + \epsilon \]
you might have…
\[Y_i = \beta_0 + \beta_1x^2 + \epsilon \]
Where x has a polynomial form (\(x^2\) would be a second degree polynomial, \(x^3\) a third-degree polynomial, and so on…).
A LOESS (locally estimated scatterplot smoothing) visualizes a multitude of linear regression lines across your data. It emphasizes trends within your Y ~ X relationship. I find it useful to represent the nuanced patterns within the data I’m analyzing.
You might use polynomial models other than LOESS when your model doesn’t respect the linearity assumption. You’ll talk about this more in Prof. Turgeon’s class! For now, you just need to know that there are alternative to classical linear models—although polynomial models are still technically considered linear models, but let’s skip over this for now…
### Loess ###
# Plotting the relationship between the IV and the DV using a scatterplot
ggplot(data=swiss, aes(Infant.Mortality,Fertility)) +
geom_point() + xlab("Infant Mortality (%)") + ylab("Fertility Index") +
theme_minimal(base_size = 20) +
geom_smooth(method='loess') + # This is the code that adds the loess line
xlim(0,30)
## `geom_smooth()` using formula = 'y ~ x'
### Polynomial Model ###
# Plotting the relationship between the IV and the DV using a scatterplot
ggplot(data=swiss, aes(Infant.Mortality,Fertility)) +
geom_point() + xlab("Infant Mortality (%)") + ylab("Fertility Index") +
theme_minimal(base_size = 20) +
geom_smooth(method='lm', formula=y~poly(x,2)) + # This is the code that adds the polynomial line
xlim(0,30)
Let’s test drive all of this!
Optional Exercise
Using ggplot(), visualize an OLS model predicting
% agriculture as a determinant of
fertility, and then a loess model with the same
relationship.
# Answer
ggplot(data=swiss, aes(Agriculture,Fertility)) +
geom_point() + xlab("Agriculture (%)") + ylab("Fertility Index") +
theme_minimal(base_size = 20) +
geom_smooth(method='lm', formula= y~x) # This is the code that adds the lm line
ggplot(data=swiss, aes(Agriculture,Fertility)) +
geom_point() + xlab("Agriculture (%)") + ylab("Fertility Index") +
theme_minimal(base_size = 20) +
geom_smooth(method='loess') # This is the code that adds the loess line
## `geom_smooth()` using formula = 'y ~ x'
3. Multivariate Relationships
As you move forward with data analysis, you might start using multiple linear regression models: dependent variable ~ independent variable #1 + independent variable #2. The main difference from simple linear models is that you will have more than one predictors (or independent variables).
# Creating our model using lm()
model2 <- lm(Fertility ~ Infant.Mortality + Education, data=swiss)
# Viewing our regression results using summary()
summary(model2)
##
## Call:
## lm(formula = Fertility ~ Infant.Mortality + Education, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.3906 -6.0088 -0.9624 5.8808 21.0736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.8213 8.8904 5.491 1.88e-06 ***
## Infant.Mortality 1.5187 0.4287 3.543 0.000951 ***
## Education -0.8167 0.1298 -6.289 1.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.426 on 44 degrees of freedom
## Multiple R-squared: 0.5648, Adjusted R-squared: 0.545
## F-statistic: 28.55 on 2 and 44 DF, p-value: 1.126e-08
How do you interpret the results from this model?
Visually, your regression analysis would look like a plane rather than a line.
Exercise
Using lm(), run an OLS model predicting %
agriculture and % education as a determinant
of fertility and interpret your regression
coefficients.
# Answers available on OWL