POLISCI 3325G - Data Science for Political Science

Evelyne Brie

Fall 2022

Linear Models

Linear regressions are used to model the linear relationship between one dependent variable and one or more independent variables. Remember that a regression coefficient represents the mean change in the dependent variable for one unit of change in the independent variable.

In this laboratory, we will: (1) load a dataset from the datasets package, (2) graph the relationship between two and three variables, (3) perform simple and multiple linear regression analyses using lm().

Relevant functions: lm(), summary(), ggplot().

1. Loading Data

We will use the swiss dataset from the datasets package. It encompasses a standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland from the year 1888. First, install the datasets package using install.packages(). Then, load it into your library using library(). You can then refer to the swiss data frame directly using any function.

library(datasets) # Loading the car package using library()

dim(swiss) # Checking the dimensions of the swiss dataset using dim()
## [1] 47  6
colnames(swiss) # Viewing the colmn names usging head()
## [1] "Fertility"        "Agriculture"      "Examination"      "Education"       
## [5] "Catholic"         "Infant.Mortality"
head(swiss,10) # Viewing the first 10 observations of the dataset using head()
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
## Broye             83.8        70.2          16         7    92.85
## Glane             92.4        67.8          14         8    97.16
## Gruyere           82.4        53.3          12         7    97.67
## Sarine            82.9        45.2          16        13    91.38
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6
## Broye                    23.6
## Glane                    24.9
## Gruyere                  21.0
## Sarine                   24.4

Here is a description of the different variables within this dataset:

Variable name Description
Fertility common standardized fertility measure
Agriculture % of males involved in agriculture as occupation
Examination % draftees receiving highest mark on army examination
Education % education beyond primary school for draftees
Catholic % catholic
Infant.Mortality % live births who live less than 1 year

Here, the unit of observation is not individuals, but rather provinces of Switzerland—i.e. we are dealing with aggregate data. For each observation, this information is encompassed within the row name.

rownames(swiss) # Checking the row names using rownames()
##  [1] "Courtelary"   "Delemont"     "Franches-Mnt" "Moutier"      "Neuveville"  
##  [6] "Porrentruy"   "Broye"        "Glane"        "Gruyere"      "Sarine"      
## [11] "Veveyse"      "Aigle"        "Aubonne"      "Avenches"     "Cossonay"    
## [16] "Echallens"    "Grandson"     "Lausanne"     "La Vallee"    "Lavaux"      
## [21] "Morges"       "Moudon"       "Nyone"        "Orbe"         "Oron"        
## [26] "Payerne"      "Paysd'enhaut" "Rolle"        "Vevey"        "Yverdon"     
## [31] "Conthey"      "Entremont"    "Herens"       "Martigwy"     "Monthey"     
## [36] "St Maurice"   "Sierre"       "Sion"         "Boudry"       "La Chauxdfnd"
## [41] "Le Locle"     "Neuchatel"    "Val de Ruz"   "ValdeTravers" "V. De Geneve"
## [46] "Rive Droite"  "Rive Gauche"

2. Bivariate Relationship

Let’s start using a simple linear regression model, looking at the relationship between one independent variable (i.e. infant mortality) and one dependent variable (i.e. fertility rates). Note that defining the direction of this relationship might lead to endogeneity concerns.

2.1 Graphing our Data

Our first step is to visually assess whether there seems to be a linear relationship between infant mortality and fertility rates by looking at a scatterplot.

library(ggplot2)

ggplot(data=swiss, aes(Infant.Mortality,Fertility)) +
  geom_point() + xlab("Infant Mortality (%)") + ylab("Fertility Index") +
  theme_minimal(base_size = 20) +
  xlim(0,30)

At first glance, does there seem to be a correlation between both variables? If so, does the relationship seem positive or negative?

2.2 Running an OLS model

2.1.1 Using the lm() function

Using lm(), we specify a simple linear regression formula: dependent variable ~ independent variable. Here, we are regressing one dependent variable on one independent variable. The lm() function will generate an intercept and a regression coefficient. The intercept represents the value of the dependent variable when the independent variable is equal to zero. The regression coefficient represents the mean change in the dependent variable for one unit of change in the independent variable (in other words, the slope).

Information Description
Formula Our linear model (dependent variable ~ independent variable)
Residuals Difference between the observed and the estimated value of the dependent variable
Intercept Average value of the dependent variable when the independent variable is equal to zero
Coefficient Mean change in the dependent variable for one unit of change in the independent variable

Note that lm() automatically performs listwise deletion of rows with a missing value within the vectors you are using. In other words, it only uses pairwise complete observations.

 # Creating our model using lm()
model1 <- lm(Fertility ~ Infant.Mortality, data=swiss)

# Viewing our regression results using summary()
summary(model1)
## 
## Call:
## lm(formula = Fertility ~ Infant.Mortality, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.672  -5.687  -0.381   7.239  28.565 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       34.5155    11.7113   2.947  0.00507 **
## Infant.Mortality   1.7865     0.5812   3.074  0.00359 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.48 on 45 degrees of freedom
## Multiple R-squared:  0.1735, Adjusted R-squared:  0.1552 
## F-statistic: 9.448 on 1 and 45 DF,  p-value: 0.003585

How do you interpret the coefficient and the intercept?

Let’s now visualize the results of our regression model model1 in a graph.

# Plotting the relationship between the IV and the DV using a scatterplot
ggplot(data=swiss, aes(Infant.Mortality,Fertility)) +
  geom_point() + xlab("Infant Mortality (%)") + ylab("Fertility Index") +
  theme_minimal(base_size = 20) + 
  geom_smooth(method='lm', formula= y~x) + # This is the code that adds the lm line
  xlim(0,30)

2.1.2 Polynomial models (optional)

 

Extra Material on Polynomial Models

Sometimes, the relationship between x and y is best described by a nth degree polynomial function.

Therefore, instead of:

\[Y_i = \beta_0 + \beta_1x + \epsilon \]

you might have…

\[Y_i = \beta_0 + \beta_1x^2 + \epsilon \]

Where x has a polynomial form (\(x^2\) would be a second degree polynomial, \(x^3\) a third-degree polynomial, and so on…).

A LOESS (locally estimated scatterplot smoothing) visualizes a multitude of linear regression lines across your data. It emphasizes trends within your Y ~ X relationship. I find it useful to represent the nuanced patterns within the data I’m analyzing.

You might use polynomial models other than LOESS when your model doesn’t respect the linearity assumption. You’ll talk about this more in Prof. Turgeon’s class! For now, you just need to know that there are alternative to classical linear models—although polynomial models are still technically considered linear models, but let’s skip over this for now…

### Loess ###

# Plotting the relationship between the IV and the DV using a scatterplot
ggplot(data=swiss, aes(Infant.Mortality,Fertility)) +
  geom_point() + xlab("Infant Mortality (%)") + ylab("Fertility Index") +
  theme_minimal(base_size = 20) + 
  geom_smooth(method='loess') +  # This is the code that adds the loess line
  xlim(0,30)
## `geom_smooth()` using formula = 'y ~ x'

### Polynomial Model ###

# Plotting the relationship between the IV and the DV using a scatterplot
ggplot(data=swiss, aes(Infant.Mortality,Fertility)) +
  geom_point() + xlab("Infant Mortality (%)") + ylab("Fertility Index") +
  theme_minimal(base_size = 20) + 
  geom_smooth(method='lm', formula=y~poly(x,2)) +  # This is the code that adds the polynomial line
  xlim(0,30)

Let’s test drive all of this!

Optional Exercise

Using ggplot(), visualize an OLS model predicting % agriculture as a determinant of fertility, and then a loess model with the same relationship.

# Answer
ggplot(data=swiss, aes(Agriculture,Fertility)) +
  geom_point() + xlab("Agriculture (%)") + ylab("Fertility Index") +
  theme_minimal(base_size = 20) + 
  geom_smooth(method='lm', formula= y~x)  # This is the code that adds the lm line

ggplot(data=swiss, aes(Agriculture,Fertility)) +
  geom_point() + xlab("Agriculture (%)") + ylab("Fertility Index") +
  theme_minimal(base_size = 20) + 
  geom_smooth(method='loess')  # This is the code that adds the loess line
## `geom_smooth()` using formula = 'y ~ x'

3. Multivariate Relationships

As you move forward with data analysis, you might start using multiple linear regression models: dependent variable ~ independent variable #1 + independent variable #2. The main difference from simple linear models is that you will have more than one predictors (or independent variables).

 # Creating our model using lm()
model2 <- lm(Fertility ~ Infant.Mortality + Education, data=swiss)

# Viewing our regression results using summary()
summary(model2)
## 
## Call:
## lm(formula = Fertility ~ Infant.Mortality + Education, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3906  -6.0088  -0.9624   5.8808  21.0736 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       48.8213     8.8904   5.491 1.88e-06 ***
## Infant.Mortality   1.5187     0.4287   3.543 0.000951 ***
## Education         -0.8167     0.1298  -6.289 1.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.426 on 44 degrees of freedom
## Multiple R-squared:  0.5648, Adjusted R-squared:  0.545 
## F-statistic: 28.55 on 2 and 44 DF,  p-value: 1.126e-08

How do you interpret the results from this model?

Visually, your regression analysis would look like a plane rather than a line.

Exercise

Using lm(), run an OLS model predicting % agriculture and % education as a determinant of fertility and interpret your regression coefficients.

# Answers available on OWL