PSCI 9590A - Introduction to Quantitative Methods

Evelyne Brie

Fall 2022

Linear Models II

Linear regressions are used to model the linear relationship between one dependent variable and one or more independent variables. Remember that a regression coefficient represents the mean change in the dependent variable for one unit of change in the independent variable.

In this laboratory, we will: (1) load a dataset from the datasets package, (2) graph the relationship between two and three variables, (3) perform simple and multiple linear regression analyses using lm(), (4) introduce the concept of polynomial models and (5) run different OLS diagnostic tests.

Relevant functions: lm(), summary(), ggplot(), density().

1. Loading Data

We will use the swiss dataset from the datasets package. It contains a standardized fertility measure and several socio-economic indicators for each of 47 French-speaking provinces of Switzerland in 1888. The datasets package ships with base R, so you do not need to run install.packages(); simply load it into your library using library(). You can then refer to the swiss data frame directly in any function.

library(datasets) # Loading the datasets package using library()

dim(swiss) # Checking the dimensions of the swiss dataset using dim()
## [1] 47  6
colnames(swiss) # Viewing the column names using colnames()
## [1] "Fertility"        "Agriculture"      "Examination"      "Education"       
## [5] "Catholic"         "Infant.Mortality"
head(swiss,10) # Viewing the first 10 observations of the dataset using head()
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
## Broye             83.8        70.2          16         7    92.85
## Glane             92.4        67.8          14         8    97.16
## Gruyere           82.4        53.3          12         7    97.67
## Sarine            82.9        45.2          16        13    91.38
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6
## Broye                    23.6
## Glane                    24.9
## Gruyere                  21.0
## Sarine                   24.4

Here is a description of the different variables within this dataset:

Variable name      Description
Fertility          Common standardized fertility measure
Agriculture        % of males involved in agriculture as an occupation
Examination        % of draftees receiving the highest mark on the army examination
Education          % of draftees with education beyond primary school
Catholic           % Catholic
Infant.Mortality   % of live births who live less than 1 year

Here, the unit of observation is not the individual but the province of Switzerland; in other words, we are dealing with aggregate data. For each observation, the province name is stored in the row name.

rownames(swiss) # Checking the row names using rownames()
##  [1] "Courtelary"   "Delemont"     "Franches-Mnt" "Moutier"      "Neuveville"  
##  [6] "Porrentruy"   "Broye"        "Glane"        "Gruyere"      "Sarine"      
## [11] "Veveyse"      "Aigle"        "Aubonne"      "Avenches"     "Cossonay"    
## [16] "Echallens"    "Grandson"     "Lausanne"     "La Vallee"    "Lavaux"      
## [21] "Morges"       "Moudon"       "Nyone"        "Orbe"         "Oron"        
## [26] "Payerne"      "Paysd'enhaut" "Rolle"        "Vevey"        "Yverdon"     
## [31] "Conthey"      "Entremont"    "Herens"       "Martigwy"     "Monthey"     
## [36] "St Maurice"   "Sierre"       "Sion"         "Boudry"       "La Chauxdfnd"
## [41] "Le Locle"     "Neuchatel"    "Val de Ruz"   "ValdeTravers" "V. De Geneve"
## [46] "Rive Droite"  "Rive Gauche"

2. Bivariate Relationship

Let’s start with a simple linear regression model, looking at the relationship between one independent variable (i.e. infant mortality) and one dependent variable (i.e. fertility rates). Note that the direction of this relationship is debatable (does infant mortality drive fertility, or the reverse?), which raises endogeneity concerns.

2.1 Graphing our Data

Our first step is to visually assess whether there seems to be a linear relationship between infant mortality and fertility rates by looking at a scatterplot.

library(ggplot2) # Loading the ggplot2 package, which provides ggplot()

ggplot(data=swiss, aes(Infant.Mortality,Fertility)) +
  geom_point() + xlab("Infant Mortality (%)") + ylab("Fertility Index") +
  theme_minimal(base_size = 20) +
  xlim(0,30)

At first glance, does there seem to be a correlation between both variables? If so, does the relationship seem positive or negative?

2.2 Running an OLS model

2.2.1 Using the lm() function

Using lm(), we specify a simple linear regression formula: dependent variable ~ independent variable. Here, we are regressing one dependent variable on one independent variable. The lm() function will generate an intercept and a regression coefficient. The intercept represents the value of the dependent variable when the independent variable is equal to zero. The regression coefficient represents the mean change in the dependent variable for one unit of change in the independent variable (in other words, the slope).

Information         Description
Formula             Our linear model (dependent variable ~ independent variable)
Residuals           Difference between the observed and the estimated (fitted) value of the dependent variable
Intercept           Value of the dependent variable when the independent variable is equal to zero
Coefficient         Mean change in the dependent variable for one unit of change in the independent variable
p-value             Probability of obtaining a coefficient as extreme as yours (or more extreme) given that the null hypothesis (no relationship between the DV and the IV) is true
Degrees of freedom  Number of observations minus number of predictors minus 1 (for the intercept)
R-squared           Proportion of explained variance in the dependent variable

Note that lm() automatically performs listwise deletion: any row with a missing value on one of the variables in your formula is dropped from the estimation. In other words, it only uses complete cases.
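As a quick illustration of this behaviour, here is a minimal sketch using a hypothetical copy of the data in which one value is set to NA (swiss_na and model_na are just illustrative names):

# Hypothetical example: copying swiss and setting one value to missing
swiss_na <- swiss
swiss_na$Infant.Mortality[1] <- NA

# lm() drops the incomplete row (listwise deletion)
model_na <- lm(Fertility ~ Infant.Mortality, data=swiss_na)
nobs(model_na) # 46 observations are used instead of 47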

 # Creating our model using lm()
model1 <- lm(Fertility ~ Infant.Mortality, data=swiss)

# Viewing our regression results using summary()
summary(model1)
## 
## Call:
## lm(formula = Fertility ~ Infant.Mortality, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.672  -5.687  -0.381   7.239  28.565 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       34.5155    11.7113   2.947  0.00507 **
## Infant.Mortality   1.7865     0.5812   3.074  0.00359 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.48 on 45 degrees of freedom
## Multiple R-squared:  0.1735, Adjusted R-squared:  0.1552 
## F-statistic: 9.448 on 1 and 45 DF,  p-value: 0.003585

How do you interpret the coefficients, the p-values and the r-squared here? Note that the adjusted r-squared is adjusted for the number of predictors in the model: it only increases when adding a new variable raises the percentage of explained variance by more than would be expected by chance alone.
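If you are curious about where the adjusted r-squared comes from, here is a minimal sketch that recomputes it by hand from the multiple r-squared stored in the summary() object (r2, adj_r2, n and k are just illustrative names):

# Extracting both statistics from the summary object
r2 <- summary(model1)$r.squared         # multiple r-squared (0.1735)
adj_r2 <- summary(model1)$adj.r.squared # adjusted r-squared (0.1552)

# Recomputing the adjusted r-squared by hand
n <- nobs(model1) # number of observations (47)
k <- 1            # number of predictors
1 - (1 - r2) * (n - 1) / (n - k - 1) # matches adj_r2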

Let’s now visualize the results of our regression model model1 in a graph.

# Plotting the relationship between the IV and the DV using a scatterplot
ggplot(data=swiss, aes(Infant.Mortality,Fertility)) +
  geom_point() + xlab("Infant Mortality (%)") + ylab("Fertility Index") +
  theme_minimal(base_size = 20) + 
  geom_smooth(method='lm', formula= y~x) + # This is the code that adds the lm line
  xlim(0,30)

2.2.2 Polynomial models


Note on Polynomial Models

Sometimes, the relationship between x and y is best described by an nth-degree polynomial function.

Therefore, instead of:

\[Y_i = \beta_0 + \beta_1x + \epsilon \]

you might have…

\[Y_i = \beta_0 + \beta_1x + \beta_2x^2 + \epsilon \]

where x enters the model in polynomial form (a model whose highest power is \(x^2\) is a second-degree polynomial, \(x^3\) a third-degree polynomial, and so on…).

A LOESS (locally estimated scatterplot smoothing) curve stitches together many local regression fits computed on overlapping subsets of your data. It emphasizes trends within your Y ~ X relationship. I find it useful to represent the nuanced patterns within the data I’m analyzing.

You might turn to a polynomial model (rather than a LOESS) when your model doesn’t respect the linearity assumption. You’ll talk about this more in Prof. Turgeon’s class! For now, you just need to know that there are alternatives to the classical linear specification (polynomial models are still technically linear models, since they are linear in their coefficients, but let’s skip over this for now…).

### Loess ###

# Plotting the relationship between the IV and the DV using a scatterplot
ggplot(data=swiss, aes(Infant.Mortality,Fertility)) +
  geom_point() + xlab("Infant Mortality (%)") + ylab("Fertility Index") +
  theme_minimal(base_size = 20) + 
  geom_smooth(method='loess') +  # This is the code that adds the loess line
  xlim(0,30)
## `geom_smooth()` using formula 'y ~ x'

### Polynomial Model ###

# Plotting the relationship between the IV and the DV using a scatterplot
ggplot(data=swiss, aes(Infant.Mortality,Fertility)) +
  geom_point() + xlab("Infant Mortality (%)") + ylab("Fertility Index") +
  theme_minimal(base_size = 20) + 
  geom_smooth(method='lm', formula=y~poly(x,2)) +  # This is the code that adds the polynomial line
  xlim(0,30)
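The plots above only draw the fitted curves. If you also want the estimated polynomial coefficients, you can fit the quadratic specification with lm() and poly(), and compare it to model1 with an F-test using anova(). This is a sketch; model1_poly is just an illustrative name.

# Fitting a second-degree polynomial using poly() inside lm()
model1_poly <- lm(Fertility ~ poly(Infant.Mortality, 2), data=swiss)
summary(model1_poly)

# Comparing the linear and quadratic fits (F-test on the added term)
anova(model1, model1_poly)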

Let’s test drive all of this!

Exercise 1

Using ggplot(), visualize an OLS model of fertility as a function of % agriculture, and then a LOESS model of the same relationship.

3. Residuals

A residual is the observed value of the dependent variable minus the value predicted by the model.
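A quick way to convince yourself of this definition is to recompute the residuals by hand using fitted() and compare them to what residuals() returns (a minimal sketch):

# Residuals = observed values minus fitted (predicted) values
obs <- swiss$Fertility  # observed values of the DV
pred <- fitted(model1)  # predicted values from model1
all.equal(unname(obs - pred), unname(residuals(model1))) # should be TRUE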

We can visualize the residuals of our model using the following code. Each observation (i.e. each point in the scatterplot) is associated with a residual (the difference between itself and the prediction from our OLS model).

# Adding the regression model results to our original df
# We create a new data frame called swiss_withLmInfo using augment()
library(broom) # Loading the broom package, which provides augment()
swiss_withLmInfo <- augment(model1)
# Here is more info on the data we are adding (we won't discuss all of these variables today):
# https://search.r-project.org/CRAN/refmans/broom/html/augment.lm.html


# Printing out the first 5 observations of that data frame
head(swiss_withLmInfo,5)
## # A tibble: 5 × 9
##   .rownames    Fertility Infant.M…¹ .fitted .resid   .hat .sigma .cooksd .std.…²
##   <chr>            <dbl>      <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
## 1 Courtelary        80.2       22.2    74.2   6.02 0.0343   11.6 0.00507   0.534
## 2 Delemont          83.1       22.2    74.2   8.92 0.0343   11.5 0.0111    0.791
## 3 Franches-Mnt      92.5       20.2    70.6  21.9  0.0214   11.1 0.0407    1.93 
## 4 Moutier           85.8       20.3    70.8  15.0  0.0216   11.4 0.0193    1.32 
## 5 Neuveville        76.9       20.6    71.3   5.58 0.0224   11.6 0.00277   0.492
## # … with abbreviated variable names ¹​Infant.Mortality, ²​.std.resid
# Let's visualize our residuals
ggplot(swiss_withLmInfo, aes(Infant.Mortality,Fertility)) +
  geom_point() + xlab("Infant Mortality (%)") + ylab("Fertility Index") +
  stat_smooth(method = lm, se = FALSE) +
  geom_segment(aes(xend =Infant.Mortality, yend = .fitted), color = "red", size = 0.3) +
  theme_minimal(base_size = 20) + 
  xlim(0,30)
## `geom_smooth()` using formula 'y ~ x'

4. Diagnostic Tests

4.1 Linearity of the residuals

We want to make sure that our bivariate relationship is linear, which implies that the residuals show no systematic pattern against the fitted values. The red line indicates the relationship between the residuals and the fitted values (i.e. your predicted values). Ideally, the line should be horizontal at zero. Why? Because the residuals should average out to zero across the whole range of predictions if the errors are randomly distributed (just like 100 people who sing badly might have a perfect pitch on average when singing together).

# The plot() function will give you up to 6 diagnostics tests on your model
plot(model1,1) # By indicating "1" as an argument, we select this specific test

Here, the line is indeed horizontal at zero, so we can confirm the linearity of the relationship between infant mortality and fertility rates. If the relationship had not been linear, we could have used a non-linear model or a polynomial function—but we won’t cover this in this class.
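As a quick sanity check, note that the raw residuals of an OLS model with an intercept always average out to (numerically) zero overall; what the plot adds is whether they stay centered on zero across the whole range of fitted values.

# The residuals average to zero by construction (up to floating-point error)
mean(residuals(model1))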

4.2 Homoscedasticity

We now want to make sure that the variance of our residuals is similar across the range of predicted values. This would mean that our model always misses the prediction by roughly the same range, across our observations.

The red line on the graph below indicates the relationship between the square root of the absolute standardized residuals, i.e. \[ \sqrt{\left| \frac{e_i}{\hat{\sigma}\sqrt{1 - h_i}} \right|} \] (where \(e_i\) is the residual of observation i, \(\hat{\sigma}\) the residual standard error and \(h_i\) the leverage of that observation), and your fitted values (i.e. your prediction). Why do we take the square root of the standardized residuals? Because high values get compressed and low values become more spread out, which makes the distribution less skewed. This is not critical; in case you were curious about it, just know that we take the square root because we want the plot to be less skewed.

Note that the standardized residuals have a mean of approximately 0 and a standard deviation of approximately 1 (like a standard normal distribution, which is a specific type of normal distribution with mean = 0 and sd = 1). Ideally, the red line should be horizontal with equally (randomly) spread points, since this would mean that the spread of the standardized residuals is the same regardless of the fitted value.

# The plot() function will give you up to 6 diagnostics tests on your model
plot(model1,3) # By indicating "3" as an argument, we select this specific test

Here, our line is roughly horizontal, but the variance of the residuals seems to be slightly higher for provinces with fitted fertility values between 63 and 70 (although the residuals still look random, just spread over a broader range, hence the roughly straight red line). What do you think?
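If you want a formal complement to this visual check, one common option is the Breusch-Pagan test from the lmtest package, assuming it is installed. A small p-value would suggest heteroscedasticity.

# Breusch-Pagan test for heteroscedasticity (requires the lmtest package)
# install.packages("lmtest") # if not already installed
library(lmtest)
bptest(model1) # H0: the residual variance is constant (homoscedasticity)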

4.3 Normality of residuals

We want to make sure that our residuals are normally distributed, which is one of the assumptions of OLS. The y axis is once again the standardized residuals (not square-rooted this time). The x axis shows the theoretical quantiles of the standard normal distribution. The one thing that is critical to understand here is this: the diagonal dotted line represents what the standardized residuals should look like if they were truly normally distributed.

# The plot() function will give you up to 6 diagnostics tests on your model
plot(model1,2) # By indicating "2" as an argument, we select this specific test

# We can also visualize this like we did last time... 
# See how the small jump on the line in the previous graph fits
# with the patterns in the graph below? 
plot(density(model1$residuals))

Here, our observations are roughly normally distributed. If they weren’t, we might have wanted to include omitted covariates that explain why our predictor misses certain observations (e.g. if we had disproportionately high residuals for all provinces with high fertility rates, and all of these high-fertility provinces were located at high altitudes, we might have wanted to control for altitude to remove this bias, in case that is what is causing the distortion).
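A formal counterpart to these plots is the Shapiro-Wilk normality test, which is available in base R (an optional extra, not part of the plot() diagnostics). A large p-value means we cannot reject the hypothesis that the residuals are normally distributed.

# Shapiro-Wilk test on the residuals of model1
shapiro.test(residuals(model1)) # H0: the residuals are normally distributed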

4.4 Outliers and high leverage points (extra material)

Some of your data points might be outliers (i.e. extreme values on the dependent variable) or have high leverage (i.e. extreme values on the independent variable). In the test below, we want to make sure that no observation lies more than 3 standard deviations away from the mean standardized residual (which, by definition, is 0). We also want to check that no observation has a leverage greater than \(2(p + 1)/n\) (where p is the number of predictors, in this case 1, and n is the number of observations, in this case 47, so \(2(p + 1)/n\) = \(2(1 + 1)/47\) = \(4/47\) = \(0.08510638\)).

# The plot() function will give you up to 6 diagnostics tests on your model
plot(model1,5) # By indicating "5" as an argument, we select this specific test

Here, none of the observations are more distant than 3 standard deviations from the mean, so we don’t need to remove outliers. For a few of our observations, however, leverage is above 0.08.

# Printing out the rows with a leverage above 0.08
swiss_withLmInfo[swiss_withLmInfo$.hat>0.08,]
## # A tibble: 4 × 9
##   .rownames  Fertility Infant.Mor…¹ .fitted .resid   .hat .sigma .cooksd .std.…²
##   <chr>          <dbl>        <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
## 1 Porrentruy      76.1         26.6    82.0 -5.94  0.135    11.6 2.41e-2 -0.556 
## 2 Glane           92.4         24.9    79.0 13.4   0.0843   11.4 6.84e-2  1.22  
## 3 La Vallee       54.3         10.8    53.8  0.490 0.235    11.6 3.68e-4  0.0489
## 4 Conthey         75.5         15.1    61.5 14.0   0.0814   11.4 7.18e-2  1.27  
## # … with abbreviated variable names ¹​Infant.Mortality, ²​.std.resid
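You can obtain the same information without broom, using the base R functions rstandard() and hatvalues(). Here is a minimal sketch of the two checks described above (p, n and cutoff are just illustrative names):

# Standardized residuals more than 3 standard deviations away from 0
which(abs(rstandard(model1)) > 3)

# Observations with leverage above the exact 2(p + 1)/n cutoff
p <- 1                    # number of predictors
n <- nrow(swiss)          # number of observations (47)
cutoff <- 2 * (p + 1) / n # approximately 0.085
which(hatvalues(model1) > cutoff)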

4.5 Influential values (extra material)

This graph gives us the Cook’s distance for each observation, which indicates the influence that each data point has in the model. How is it calculated? Essentially by removing one given data point from the model and recalculating the regression. It then tells you how much the values in the model have changed when this one particular data point was excluded. If the model changes a lot from removing simply one observation, this one data point had a lot of influence!

There isn’t a clear cutoff here. Some people say that a Cook’s distance of 0.5 or even 1 is too much; others say that you should simply remove observations with a Cook’s distance above 4/n (where n is the total number of observations, in this case 47, so 4/47 ≈ 0.085), or simply those that are very distant from the rest.

# The plot() function will give you up to 6 diagnostics tests on your model
plot(model1,4) # By indicating "4" as an argument, we select this specific test

# Checking the province for observation #8
rownames(swiss)[8]
## [1] "Glane"

Here, the provinces with the largest Cook’s distances (Glane, Conthey, Sierre and Ville de Genève) stand out on the plot; we might want to remove (or at least inspect) those whose distance exceeds the 4/n ≈ 0.085 threshold.
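You can also extract the Cook’s distances directly with the base R function cooks.distance() and apply the 4/n rule of thumb yourself (a sketch; cooksd is just an illustrative name):

# Cook's distance for every observation
cooksd <- cooks.distance(model1)

# Provinces above the 4/n rule of thumb
names(cooksd)[cooksd > 4 / nrow(swiss)]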

5. Multivariate Relationships

As you move forward with data analysis, you might start using multiple linear regression models: dependent variable ~ independent variable #1 + independent variable #2. The main difference from simple linear models is that you will have more than one predictor (or independent variable).

 # Creating our model using lm()
model2 <- lm(Fertility ~ Infant.Mortality + Education, data=swiss)

# Viewing our regression results using summary()
summary(model2)
## 
## Call:
## lm(formula = Fertility ~ Infant.Mortality + Education, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3906  -6.0088  -0.9624   5.8808  21.0736 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       48.8213     8.8904   5.491 1.88e-06 ***
## Infant.Mortality   1.5187     0.4287   3.543 0.000951 ***
## Education         -0.8167     0.1298  -6.289 1.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.426 on 44 degrees of freedom
## Multiple R-squared:  0.5648, Adjusted R-squared:  0.545 
## F-statistic: 28.55 on 2 and 44 DF,  p-value: 1.126e-08

How do you interpret the results from this model?

Visually, with two independent variables, your regression model corresponds to a plane rather than a line.
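To see what this plane implies in practice, you can ask model2 for a prediction at chosen values of the two predictors using predict(). The values below are made up purely for illustration.

# Predicted fertility for a hypothetical province (illustrative values)
new_province <- data.frame(Infant.Mortality = 20, Education = 10)
predict(model2, newdata = new_province)

# Adding a 95% confidence interval around this prediction
predict(model2, newdata = new_province, interval = "confidence")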

Exercise 2

Using lm(), run an OLS model with % agriculture and % education as determinants of fertility, and interpret your regression coefficients (+ associated p-values).