Correlation and Regression in R

Course Description

Ultimately, data analysis is about understanding relationships among variables. Exploring data with multiple variables requires new, more complex tools, but enables a richer set of comparisons. In this course, you will learn how to describe relationships between two numerical quantities. You will characterize these relationships graphically, in the form of summary statistics, and through simple linear regression models.

1. Visualizing Two Variables

Scatterplots

Scatterplots are the most common and effective tools for visualizing the relationship between two numeric variables.

The ncbirths dataset is a random sample of 1,000 cases taken from a larger dataset collected in 2004. Each case describes the birth of a single child born in North Carolina, along with various characteristics of the child (e.g. birth weight, length of gestation, etc.), the child’s mother (e.g. age, weight gained during pregnancy, smoking habits, etc.) and the child’s father (e.g. age). You can view the help file for these data by running ?ncbirths in the console.

Using the ncbirths dataset, make a scatterplot using ggplot() to illustrate how the birth weight of these babies varies according to the number of weeks of gestation.

# Getting things set up
if(!require("openintro")){
  install.packages("openintro")
  library(openintro)
  }

if(!require("HistData")){
  install.packages("HistData")
  library(HistData)
  }

data(ncbirths)

library(ggplot2)
library(tidyverse)

# Scatterplot of weight vs. weeks
ggplot(data = ncbirths, aes(x = weeks, y = weight)) + 
  geom_point()

Boxplots as discretized/conditioned scatterplots

If it is helpful, you can think of boxplots as scatterplots for which the variable on the x-axis has been discretized.

Discretization is the process through which we can transform continuous variables, models or functions into a discrete form. We do this by creating a set of contiguous intervals (or bins) that go across the range of our desired variable/model/function.

Continuous data is Measured, while Discrete data is Counted.

Often, it is easier to understand continuous data (such as weight) when divided and stored into meaningful categories or groups. For example, we can divide a continuous variable, weight, and store it in the following groups:

Under 100 lbs (light)
Between 140–160 lbs (mid)
Over 200 lbs (heavy)

The cut() function takes two arguments: the continuous variable you want to discretize and the number of breaks that you want to make in that continuous variable in order to discretize it.

Using the ncbirths dataset again, make a boxplot illustrating how the birth weight of these babies varies according to the number of weeks of gestation. This time, use the cut() function to discretize the x-variable into six intervals (i.e. five breaks).

# Boxplot of weight vs. weeks
ggplot(data = ncbirths, 
       aes(x = cut(weeks, breaks = 5), y = weight)) + 
  geom_boxplot()

Creating Scatterplots

Creating scatterplots is simple and they are so useful that it is worthwhile to expose yourself to many examples. Over time, you will gain familiarity with the types of patterns that you see. You will begin to recognize how scatterplots can reveal the nature of the relationship between two variables.

Scatterplots with a linear pattern have points that seem to generally fall along a line while nonlinear patterns seem to follow along some curve.

In this exercise, and throughout this chapter, we will be using several datasets listed below. These data are available through the openintro package. Briefly:

The mammals dataset contains information about 39 different species of mammals, including their body weight, brain weight, gestation time, and a few other variables.
The mlbBat10 dataset contains batting statistics for 1,199 Major League Baseball players during the 2010 season.

To see more thorough documentation, use the ? or help() functions.

Using the mammals dataset, create a scatterplot illustrating how the brain weight of a mammal varies as a function of its body weight.

ls(mammals)

##  [1] "body_wt"      "brain_wt"     "danger"       "dreaming"     "exposure"    
##  [6] "gestation"    "life_span"    "non_dreaming" "predation"    "species"     
## [11] "total_sleep"

# Mammals scatterplot
ggplot(data = mammals, aes(x = body_wt, y = brain_wt)) +
  geom_point()

Using the mlbBat10 dataset, create a scatterplot illustrating how the slugging percentage (SLG) of a player varies as a function of his on-base percentage (OBP).

ls(mlbbat10)

##  [1] "at_bat"          "bat_avg"         "caught_stealing" "double"         
##  [5] "game"            "hit"             "home_run"        "name"           
##  [9] "obp"             "position"        "rbi"             "run"            
## [13] "slg"             "stolen_base"     "strike_out"      "team"           
## [17] "total_base"      "triple"          "walk"

# Baseball player scatterplot
ggplot(data = mlbbat10, aes(x = obp, y = slg)) +
  geom_point()

Using the bdims dataset, create a scatterplot illustrating how a person’s weight varies as a function of their height. Use color to separate by sex, which you’ll need to coerce to a factor with factor().

# Body dimensions scatterplot
ggplot(data = bdims, aes(x = hgt, y = wgt, color = factor(sex))) +
  geom_point()

Using the smoking dataset, create a scatterplot illustrating how the amount that a person smokes on weekdays varies as a function of their age.

ls(smoking)

##  [1] "age"                   "amt_weekdays"          "amt_weekends"         
##  [4] "ethnicity"             "gender"                "gross_income"         
##  [7] "highest_qualification" "marital_status"        "nationality"          
## [10] "region"                "smoke"                 "type"

# Smoking scatterplot
ggplot(data = smoking, aes(x = age, y = amt_weekdays)) +
  geom_point()

Transformations

The relationship between two variables may not be linear. In these cases we can sometimes see strange and even inscrutable patterns in a scatterplot of the data. Sometimes there really is no meaningful relationship between the two variables. Other times, a careful transformation of one or both of the variables can reveal a clear relationship.

Recall the bizarre pattern that you saw in the scatterplot between brain weight and body weight among mammals in a previous exercise. Can we use transformations to clarify this relationship?

ggplot2 provides several different mechanisms for viewing transformed relationships. The coord_trans() function transforms the coordinates of the plot. Alternatively, the scale_x_log10() and scale_y_log10() functions perform a base-10 log transformation of each axis. Note the differences in the appearance of the axes.

The mammals dataset is available in your workspace.

Use coord_trans() to create a scatterplot showing how a mammal’s brain weight varies as a function of its body weight, where both the x and y axes are on a “log10” scale.

# Scatterplot with coord_trans()
ggplot(data = mammals, aes(x = body_wt, y = brain_wt)) +
  geom_point() + 
  coord_trans(x = "log10", y = "log10")

Use scale_x_log10() and scale_y_log10() to achieve the same effect but with different axis labels and grid lines.

# Scatterplot with scale_x_log10() and scale_y_log10()
ggplot(data = mammals, aes(x = body_wt, y = brain_wt)) +
  geom_point() +
  scale_x_log10() + 
  scale_y_log10()

Identifying Outliers

In Chapter 5, we will discuss how outliers can affect the results of a linear regression model and how we can deal with them. For now, it is enough to simply identify them and note how the relationship between two variables may change as a result of removing outliers.

Recall that in the baseball example earlier in the chapter, most of the points were clustered in the lower left corner of the plot, making it difficult to see the general pattern of the majority of the data. This difficulty was caused by a few outlying players whose on-base percentages (OBPs) were exceptionally high. These values are present in our dataset only because these players had very few batting opportunities.

Both OBP and SLG are known as rate statistics, since they measure the frequency of certain events (as opposed to their count). In order to compare these rates sensibly, it makes sense to include only players with a reasonable number of opportunities, so that these observed rates have the chance to approach their long-run frequencies.

In Major League Baseball, batters qualify for the batting title only if they have 3.1 plate appearances per game. This translates into roughly 502 plate appearances in a 162-game season. The mlbBat10 dataset does not include plate appearances as a variable, but we can use at-bats (AB) – which constitute a subset of plate appearances – as a proxy.

Use filter() to keep only players who had at least 200 at-bats, assigning to ab_gt_200.
Using ab_gt_200, create a scatterplot for SLG as a function of OBP.
Find the row of ab_gt_200 corresponding to the one player (with at least 200 at-bats) whose OBP was below 0.200.

ls(mlbbat10)

##  [1] "at_bat"          "bat_avg"         "caught_stealing" "double"         
##  [5] "game"            "hit"             "home_run"        "name"           
##  [9] "obp"             "position"        "rbi"             "run"            
## [13] "slg"             "stolen_base"     "strike_out"      "team"           
## [17] "total_base"      "triple"          "walk"

# Filter for AB greater than or equal to 200
ab_gt_200 <- mlbbat10 %>% 
  filter(at_bat >= 200)

# Scatterplot of SLG vs. OBP
ggplot(ab_gt_200, aes(x = obp, y = slg)) +
  geom_point()

# Identify the outlying player
ab_gt_200 %>%
  filter(obp < 0.2)

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?

## # A tibble: 1 x 19
##   name  team  position  game at_bat   run   hit double triple home_run   rbi
##   <fct> <fct> <fct>    <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>    <dbl> <dbl>
## 1 B Wo~ LAA   3B          81    226    20    33      2      0        4    14
## # ... with 8 more variables: total_base <dbl>, walk <dbl>, strike_out <dbl>,
## #   stolen_base <dbl>, caught_stealing <dbl>, obp <dbl>, slg <dbl>,
## #   bat_avg <dbl>

2. Correlation

Computing Correlation

The cor(x, y) function will compute the Pearson product-moment correlation between variables, x and y. Since this quantity is symmetric with respect to x and y, it doesn’t matter in which order you put the variables.

The word correlation is used in everyday life to denote some form of association. We might say that we have noticed a correlation between foggy days and attacks of wheeziness. However, in statistical terms we use correlation to denote association between two quantitative variables. We also assume that the association is linear, that one variable increases or decreases a fixed amount for a unit increase or decrease in the other. The other technique that is often used in these circumstances is regression, which involves estimating the best straight line to summarise the association.

The correlation coefficient is measured on a scale that varies from + 1 through 0 to - 1. Complete correlation between two variables is expressed by either + 1 or -1. When one variable increases as the other increases the correlation is positive; when one decreases as the other increases it is negative. Complete absence of correlation is represented by 0.

At the same time, the cor() function is very conservative when it encounters missing data (e.g. NAs). The use argument allows you to override the default behavior of returning NA whenever any of the values encountered is NA. Setting the use argument to “pairwise.complete.obs” allows cor() to compute the correlation coefficient for those observations where the values of x and y are both not missing.

Use cor() to compute the correlation between the birthweight of babies in the ncbirths dataset and their mother’s age. There is no missing data in either variable.
Compute the correlation between the birthweight and the number of weeks of gestation for all non-missing pairs.

# Compute correlation
ncbirths %>%
  summarize(N = n(), r = cor(weight, mage))

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?

## # A tibble: 1 x 2
##       N      r
##   <int>  <dbl>
## 1  1000 0.0551

# Compute correlation for all non-missing pairs
ncbirths %>%
  summarize(N = n(), r = cor(weight, weeks, use = "pairwise.complete.obs"))

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?

## # A tibble: 1 x 2
##       N     r
##   <int> <dbl>
## 1  1000 0.670

Exploring Anscombe

In 1973, Francis Anscombe famously created four datasets with remarkably similar numerical properties, but obviously different graphic relationships. The Anscombe dataset contains the x and y coordinates for these four datasets, along with a grouping variable, set, that distinguishes the quartet.

Anscombe’s quartet comprises four data sets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers and other influential observations on statistical properties. He described the article as being intended to counter the impression among statisticians that “numerical calculations are exact, but graphs are rough.”

It may be helpful to remind yourself of the graphic relationship by viewing the four scatterplots:

ggplot(data = Anscombe, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ set)

For each of the four sets of data points in the Anscombe dataset, compute the following in the order specified. Names are provided in your call to summarize().

Number of observations, N
Mean of x
Standard deviation of x
Mean of y
Standard deviation of y
Correlation coefficient between x and y

# Compute properties of Anscombe*
Anscombe %>%
  group_by(set) %>%
  summarize(
    N = n(), 
    mean_of_x = mean(x), 
    std_dev_of_x = sd(x), 
    mean_of_y = mean(y), 
    std_dev_of_y = sd(y), 
    correlation_between_x_and_y = cor(x, y)
  )

Perception of correlation

Estimating the value of the correlation coefficient between two quantities from their scatterplot can be tricky. Statisticians have shown that people’s perception of the strength of these relationships can be influenced by design choices like the x and y scales.

Nevertheless, with some practice your perception of correlation will improve. Toggle through the four scatterplots in the plotting window, each of which you’ve seen in a previous exercise. Jot down your best estimate of the value of the correlation coefficient between each pair of variables. Then, compare these values to the actual values you compute in this exercise.

If you’re having trouble recalling variable names, it may help to preview a dataset in the console with str() or glimpse().

Draw the plot then calculate the correlation between OBP and SLG for all players in the mlbBat10 dataset.

# Run this and look at the plot
ggplot(data = mlbbat10, aes(x = obp, y = slg)) +
  geom_point()

# Correlation for all baseball players
mlbbat10 %>%
  summarize(N = n(), r = cor(obp, slg))

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?

## # A tibble: 1 x 2
##       N     r
##   <int> <dbl>
## 1  1199 0.815

Draw the plot then calculate the correlation between OBP and SLG for all players in the mlbbat10 dataset with at least 200 at-bats.

# Run this and look at the plot
mlbbat10 %>% 
    filter(at_bat > 200) %>%
    ggplot(aes(x = obp, y = slg)) + 
    geom_point()

# Correlation for all players with at least 200 ABs
mlbbat10 %>%
  filter(at_bat >= 200) %>%
  summarize(N = n(), r = cor(obp, slg))

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?

## # A tibble: 1 x 2
##       N     r
##   <int> <dbl>
## 1   329 0.686

Draw the plot then calculate the correlation between height and weight for each sex in the bdims dataset.

# Run this and look at the plot
ggplot(data = bdims, aes(x = hgt, y = wgt, color = factor(sex))) +
  geom_point()

# Correlation of body dimensions
bdims %>%
  group_by(sex) %>%
  summarize(N = n(), r = cor(hgt, wgt))

## `summarise()` ungrouping output (override with `.groups` argument)

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?

## # A tibble: 2 x 3
##     sex     N     r
##   <int> <int> <dbl>
## 1     0   260 0.431
## 2     1   247 0.535

Draw the plot then calculate the correlation between body weight and brain weight for all species of mammals. Alongside this computation, compute the correlation between the same two quantities after taking their natural logarithms.

# Run this and look at the plot
ggplot(data = mammals, aes(x = body_wt, y = brain_wt)) +
  geom_point() + scale_x_log10() + scale_y_log10()

# Correlation among mammals, with and without log
mammals %>%
  summarize(N = n(), 
            r = cor(body_wt, brain_wt), 
            r_log = cor(log(body_wt), log(brain_wt)))

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?

## # A tibble: 1 x 3
##       N     r r_log
##   <int> <dbl> <dbl>
## 1    62 0.934 0.960

Spurious correlation in random data

In statistics, a spurious correlation, or spuriousness, refers to a connection between two variables that appears causal but is not. Spurious relationships often have the appearance of one variable affecting another. This spurious correlation is often caused by a third factor that is not apparent at the time of examination, sometimes called a confounding factor.

Examples of Spurious Correlations

Drownings rise when ice cream sales rise. It may seem that increased ice cream sales cause more drowning, but in reality, rising heat may cause more people to swim, as well as buy more ice cream.
The U.S. murder rate from 2006-2011 dropped at the same rate as Microsoft Internet Explorer usage.
Executives who say please and thank you more often enjoy better share performance.

Statisticians must always be skeptical of potentially spurious correlations. Human beings are very good at seeing patterns in data, sometimes when the patterns themselves are actually just random noise. To illustrate how easy it can be to fall into this trap, we will look for patterns in truly random data.

The noise dataset contains 20 sets of x and y variables drawn at random from a standard normal distribution. Each set, denoted as z, has 50 observations of x, y pairs. Do you see any pairs of variables that might be meaningfully correlated? Are all of the correlation coefficients close to zero?

Create a faceted scatterplot that shows the relationship between each of the 20 sets of pairs of random variables x and y. You will need the facet_wrap() function for this.
Compute the actual correlation between each of the 20 sets of pairs of x and y.
Identify the datasets that show non-trivial correlation of greater than 0.2 in absolute value.

# Create faceted scatterplot
ggplot(data = noise, aes(x = x, y = y)) +
  geom_point() + 
  facet_wrap(~ z)

# Compute correlations for each dataset
noise_summary <- noise %>%
  group_by(z) %>%
  summarize(N = n(), spurious_cor = cor(x, y))

# Isolate sets with correlations above 0.2 in absolute strength
noise_summary %>%
  filter(abs(spurious_cor) > 0.2)

3. Simple linear regression

The “best fit” line

The simple linear regression model for a numeric response as a function of a numeric explanatory variable can be visualized on the corresponding scatterplot by a straight line. This is a “best fit” line that cuts through the data in a way that minimizes the distance between the line and the data points.

A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of y when x = 0).

Least-Squares Regression

The most common method for fitting a regression line is the method of least-squares. This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line (if a point lies on the fitted line exactly, then its vertical deviation is 0). Because the deviations are first squared, then summed, there are no cancellations between positive and negative values.

We might consider linear regression to be a specific example of a larger class of smooth models. The geom_smooth() function allows you to draw such models over a scatterplot of the data itself. This technique is known as visualizing the model in the data space. The method argument to geom_smooth() allows you to specify what class of smooth model you want to see. Since we are exploring linear models, we’ll set this argument to the value “lm”.

Note that geom_smooth() also takes an se argument that controls the standard error, which we will ignore for now.

Create a scatterplot of body weight as a function of height for all individuals in the bdims dataset with a simple linear model plotted over the data.

# Scatterplot with regression line
ggplot(data = bdims, aes(x = hgt, y = wgt)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

Uniqueness of least squares regression line

The least squares criterion implies that the slope of the regression line is unique. In practice, the slope is computed by R. In this exercise, you will experiment with trying to find the optimal value for the regression slope for weight as a function of height in the bdims dataset via trial-and-error.

To help, we’ve built a custom function for you called add_line(), which takes a single argument: the proposed slope coefficient.

The bdims dataset is available in your workspace. Experiment with different values (to the nearest integer) of the my_slope parameter until you find one that you think fits best.

add_line <- function (my_slope) {

  bdims_summary <- bdims %>%
    summarize(N = n(), r = cor(hgt, wgt),
              mean_hgt = mean(hgt), mean_wgt = mean(wgt),
              sd_hgt = sd(hgt), sd_wgt = sd(wgt)) %>%
    mutate(true_slope = r * sd_wgt / sd_hgt, 
           true_intercept = mean_wgt - true_slope * mean_hgt)
  p <- ggplot(data = bdims, aes(x = hgt, y = wgt)) + 
    geom_point() + 
    geom_point(data = bdims_summary, 
               aes(x = mean_hgt, y = mean_wgt), 
               color = "red", size = 3)
  
  my_data <- bdims_summary %>%
    mutate(my_slope = my_slope, 
           my_intercept = mean_wgt - my_slope * mean_hgt)
  p + geom_abline(data = my_data, 
                  aes(intercept = my_intercept, slope = my_slope), color = "dodgerblue")
}

# Estimate optimal value of my_slope
add_line(my_slope = 1)

What is a Residual in Regression?

When you perform simple linear regression (or any other type of regression analysis), you get a line of best fit. The data points usually don’t fall exactly on this regression equation line; they are scattered around. A residual is the vertical distance between a data point and the regression line. Each data point has one residual. They are positive if they are above the regression line and negative if they are below the regression line. If the regression line actually passes through the point, the residual at that point is zero.

As residuals are the difference between any data point and the regression line, they are sometimes called “errors.” Error in this context doesn’t mean that there’s something wrong with the analysis; it just means that there is some unexplained difference. In other words, the residual is the error that isn’t explained by the regression line. The sum of the residuals always equals zero (assuming that your line is actually the line of “best fit.” The mean of residuals is also equal to zero, as the mean = the sum of the residuals / the number of items.

Fitting a linear model “by hand”

Recall the simple linear regression model:

\(Y = b_0 + b_1 \cdot X\)

Two facts enable you to compute the slope \(b_1\) and intercept \(b_0\) of a simple linear regression model from some basic summary statistics.

First, the slope can be defined as:

\(b_1 = r_{X,Y} \cdot \frac{s_Y}{s_X}\)

where \(r_{X,Y}\) represents the correlation (cor()) of \(X\) and \(Y\) and \(s_X\) and \(s_Y\) represent the standard deviation (sd()) of \(X\) and \(Y\), respectively.

Second, the point \((\bar{x}, \bar{y})\) is always on the least squares regression line, where \(\bar{x}\) and \(\bar{y}\) denote the average of \(x\) and \(y\), respectively.

The bdims_summary data frame contains all of the information you need to compute the slope and intercept of the least squares regression line for body weight \((Y)\) as a function of height \((X)\). You might need to do some algebra to solve for \(b_0\)!

Print the bdims_summary data frame.
Use mutate() to add the slope and intercept to the bdims_summary data frame.

# Print bdims_summary
bdims_summary

# Add slope and intercept
bdims_summary %>%
  mutate(slope = r * sd_wgt / sd_hgt, 
         intercept = mean_wgt - slope * mean_hgt)

Regression to the mean

Regression to the mean (RTM) is a concept attributed to Sir Francis Galton. The basic idea is that extreme random observations will tend to be less extreme upon a second trial. This is simply due to chance alone. While “regression to the mean” and “linear regression” are not the same thing, we will examine them together in this exercise.

RTM is a particular concern in two situations:

Pre-post intervention study designs that target “high risk” groups (e.g., individuals with high blood pressure)
Two-phase sampling designs where a subset of the first sample based on initial value is chosen for further study (e.g., when selecting the highest or lowest risk group for sub-analyses or follow-up).

Luckily, there are corrections you can make at the design phase and/or analysis phase to minimize the risk of RTM. The best solution, say most authors, is an appropriately designated control group.

One way to see the effects of regression to the mean is to compare the heights of parents to their children’s heights. While it is true that tall mothers and fathers tend to have tall children, those children tend to be less tall than their parents, relative to average. That is, fathers who are 3 inches taller than the average father tend to have children who may be taller than average, but by less than 3 inches.

The Galton_men and Galton_women datasets contain data originally collected by Galton himself in the 1880s on the heights of men and women, respectively, along with their parents’ heights.

Compare the slope of the regression line to the slope of the diagonal line. What does this tell you?

Create a scatterplot of the height of men as a function of their father’s height. Add the simple linear regression line and a diagonal line (with slope equal to 1 and intercept equal to 0) to the plot.
Create a scatterplot of the height of women as a function of their mother’s height. Add the simple linear regression line and a diagonal line to the plot.

# Height of children vs. height of father
ggplot(data = Galton_men, aes(x = father, y = height)) +
  geom_point() + 
  geom_abline(slope = 1, intercept = 0) + 
  geom_smooth(method = "lm", se = FALSE)

# Height of children vs. height of mother
ggplot(data = Galton_women, aes(x = mother, y = height)) +
  geom_point() + 
  geom_abline(slope = 1, intercept = 0) + 
  geom_smooth(method = "lm", se = FALSE)

4. Interpreting regression models

Fitting simple linear models

While the geom_smooth(method = “lm”) function is useful for drawing linear models on a scatterplot, it doesn’t actually return the characteristics of the model. As suggested by that syntax, however, the function that creates linear models is lm(). This function generally takes two arguments:

A formula that specifies the model A data argument for the data frame that contains the data you want to use to fit the model The lm() function return a model object having class “lm”. This object contains lots of information about your regression model, including the data used to fit the model, the specification of the model, the fitted values and residuals, etc.

Using the bdims dataset, create a linear model for the weight of people as a function of their height.
Using the mlbBat10 dataset, create a linear model for SLG as a function of OBP.
Using the mammals dataset, create a linear model for the body weight of mammals as a function of their brain weight, after taking the natural log of both variables.

# Linear model for weight as a function of height
lm(wgt ~ hgt, data = bdims)

## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Coefficients:
## (Intercept)          hgt  
##    -105.011        1.018

# Linear model for SLG as a function of OBP
lm(slg ~ obp, data = mlbbat10)

## 
## Call:
## lm(formula = slg ~ obp, data = mlbbat10)
## 
## Coefficients:
## (Intercept)          obp  
##    0.009407     1.110323

# Log-linear model for body weight as a function of brain weight
lm(log(body_wt) ~ log(brain_wt), data = mammals)

## 
## Call:
## lm(formula = log(body_wt) ~ log(brain_wt), data = mammals)
## 
## Coefficients:
##   (Intercept)  log(brain_wt)  
##        -2.509          1.225

The lm summary output

An “lm” object contains a host of information about the regression model that you fit. There are various ways of extracting different pieces of information.

The coef() function displays only the values of the coefficients. Conversely, the summary() function displays not only that information, but a bunch of other information, including the associated standard error and p-value for each coefficient, the R², adjusted R², and the residual standard error. The summary of an “lm” object in R is very similar to the output you would see in other statistical computing environments (e.g. Stata, SPSS, etc.)

We have already created the mod object, a linear model for the weight of individuals as a function of their height, using the bdims dataset and the code

mod <- lm(wgt ~ hgt, data = bdims)

Now, you will:

Use coef() to display the coefficients of mod.
Use summary() to display the full regression output of mod.

# Show the coefficients
coef(mod)

## (Intercept)         hgt 
## -105.011254    1.017617

# Show the full output
summary(mod)

## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

Fitted values and residuals

The least squares fitting procedure guarantees that the mean of the residuals is zero (n.b., numerical instability may result in the computed values not being exactly zero). At the same time, the mean of the fitted values must equal the mean of the response variable.

In this exercise, we will confirm these two mathematical facts by accessing the fitted values and residuals with the fitted.values() and residuals() functions, respectively, for the following model:

mod <- lm(wgt ~ hgt, data = bdims)

mod (defined above) is available in your workspace.

Confirm that the mean of the body weights equals the mean of the fitted values of mod.
Compute the mean of the residuals of mod.

# Mean of weights equal to mean of fitted values?
mean(bdims$wgt) == mean(fitted.values(mod))

## [1] TRUE

# Mean of the residuals
mean(residuals(mod))

## [1] -1.266971e-15

Tidying your linear model

As you fit a regression model, there are some quantities (e.g. R²) that apply to the model as a whole, while others apply to each observation. If there are several of these per-observation quantities, it is sometimes convenient to attach them to the original data as new variables.

The augment() function from the broom package does exactly this. It takes a model object as an argument and returns a data frame that contains the data on which the model was fit, along with several quantities specific to the regression model, including the fitted values, residuals, leverage scores, and standardized residuals.

The same linear model from the last exercise, mod, is available in your workspace.

Load the broom package.
Create a new data frame called bdims_tidy that is the augmentation of the mod linear model.
View the bdims_tidy data frame using glimpse().

# Load broom
library(broom)

## Warning: package 'broom' was built under R version 4.0.2

# Create bdims_tidy
bdims_tidy <- augment(mod)

# Glimpse the resulting data frame
glimpse(bdims_tidy)

## Rows: 507
## Columns: 8
## $ wgt        <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62.0, 81...
## $ hgt        <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 184.5, ...
## $ .fitted    <dbl> 72.05406, 73.37697, 91.89759, 84.77427, 85.48661, 79.686...
## $ .resid     <dbl> -6.4540648, -1.5769666, -11.1975919, -12.1742745, -6.686...
## $ .std.resid <dbl> -0.69413418, -0.16961994, -1.21098084, -1.31269063, -0.7...
## $ .hat       <dbl> 0.002154570, 0.002358152, 0.013133942, 0.007238576, 0.00...
## $ .sigma     <dbl> 9.312824, 9.317005, 9.303732, 9.301360, 9.312471, 9.3147...
## $ .cooksd    <dbl> 5.201807e-04, 3.400330e-05, 9.758463e-03, 6.282074e-03, ...

Making predictions

The fitted.values() function or the augment()-ed data frame provides us with the fitted values for the observations that were in the original data. However, once we have fit the model, we may want to compute expected values for observations that were not present in the data on which the model was fit. These types of predictions are called out-of-sample.

The ben data frame contains a height and weight observation for one person. The mod object contains the fitted model for weight as a function of height for the observations in the bdims dataset. We can use the predict() function to generate expected values for the weight of new individuals. We must pass the data frame of new observations through the newdata argument.

The same linear model, mod, is defined in your workspace.

Print ben to the console.
Use predict() with the newdata argument to compute the expected weight of the individual in the ben data frame.

# Print ben
ben

# Predict the weight of ben
predict(mod, newdata = ben)

Adding a regression line to a plot manually

The geom_smooth() function makes it easy to add a simple linear regression line to a scatterplot of the corresponding variables. And in fact, there are more complicated regression models that can be visualized in the data space with geom_smooth(). However, there may still be times when we will want to add regression lines to our scatterplot manually. To do this, we will use the geom_abline() function, which takes slope and intercept arguments. Naturally, we have to compute those values ahead of time, but we already saw how to do this (e.g. using coef()).

The coefs data frame contains the model estimates retrieved from coef(). Passing this to geom_abline() as the data argument will enable you to draw a straight line on your scatterplot.

Use geom_abline() to add a line defined in the coefs data frame to a scatterplot of weight vs. height for individuals in the bdims dataset.

# Add the line to the scatterplot*
ggplot(data = bdims, aes(x = hgt, y = wgt)) + 
  geom_point() + 
  geom_abline(data = coefs, 
              aes(intercept = `(Intercept)`, slope = hgt),  
              color = "dodgerblue")

5. Model Fit

Standard error of residuals

One way to assess strength of fit is to consider how far off the model is for a typical case. That is, for some observations, the fitted value will be very close to the actual value, while for others it will not. The magnitude of a typical residual can give us a sense of generally how close our estimates are.

However, recall that some of the residuals are positive, while others are negative. In fact, it is guaranteed by the least squares fitting procedure that the mean of the residuals is zero. Thus, it makes more sense to compute the square root of the mean squared residual, or root mean squared error (RMSE). R calls this quantity the residual standard error.

Question: The residual standard error reported for the regression model for poverty rate of U.S. counties in terms of high school graduation rate is 4.67. What does this mean?

Answer: The typical difference between the observed poverty rate and the poverty rate predicted by the model is about 4.67 percentage points.

To make this estimate unbiased, you have to divide the sum of the squared residuals by the degrees of freedom in the model.

You can recover the residuals from mod with residuals(), and the degrees of freedom with df.residual().

View a summary() of mod.
Compute the mean of the residuals() and verify that it is approximately zero.
Use residuals() and df.residual() to compute the root mean squared error (RMSE), a.k.a. residual standard error.

# View summary of model
summary(mod)

## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

# Compute the mean of the residuals
mean(residuals(mod))

## [1] -1.266971e-15

# Compute RMSE
sqrt(sum(residuals(mod)^2) / df.residual(mod))

## [1] 9.30804

Assessing simple linear model fit

A model fits the data well if the differences between the observed values and the model’s predicted values are small and unbiased. Before you look at the statistical measures for goodness-of-fit, you should check the residual plots. Residual plots can reveal unwanted residual patterns that indicate biased results more effectively than numbers. When your residual plots pass muster, you can trust your numerical results and check the goodness-of-fit statistics.

What Is R-squared?

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.

The definition of R-squared is fairly straight-forward; it is the percentage of the response variable variation that is explained by a linear model. Or:

R-squared = Explained variation / Total variation

R-squared is always between 0 and 100%:

0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.

In general, the higher the R-squared, the better the model fits your data. However, R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!

Recall that the coefficient of determination \((R^2)\), can be computed as

\(R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{Var(e)}{Var(y)} \,,\)

where \(e\) is the vector of residuals and \(y\) is the response variable. This gives us the interpretation of \(R^2\) as the percentage of the variability in the response that is explained by the model, since the residuals are the part of that variability that remains unexplained by the model.

The bdims_tidy data frame is the result of augment()-ing the bdims data frame with the mod for wgt as a function of hgt.

Use the summary() function to view the full results of mod.
Use the bdims_tidy data frame to compute the R² of mod manually using the formula above, by computing the ratio of the variance of the residuals to the variance of the response variable.

# View model summary
summary(mod)

## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

# Compute R-squared
bdims_tidy %>%
  summarize(var_y = var(wgt), var_e = var(.resid)) %>%
  mutate(R_squared = 1 - var_e / var_y)

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?

## # A tibble: 1 x 3
##   var_y var_e R_squared
##   <dbl> <dbl>     <dbl>
## 1  178.  86.5     0.515

The result means that 51.4% of the variability in weight is explained by height.

Linear vs. Average

The \(R^2\) gives us a numerical measurement of the strength of fit relative to a null model based on the average of the response variable:

\(\hat{y}_{null} = \bar{y}\)

This model has an \(R^2\) of zero because \(SSE = SST\). That is, since the fitted values \(\hat{y}_{null}\) are all equal to the average \(\bar{y}\), the residual for each observation is the distance between that observation and the mean of the response. Since we can always fit the null model, it serves as a baseline against which all other models will be compared.

In the graphic, we visualize the residuals for the null model (mod_null at left) vs. the simple linear regression model (mod_hgt at right) with height as a single explanatory variable. Try to convince yourself that, if you squared the lengths of the grey arrows on the left and summed them up, you would get a larger value than if you performed the same operation on the grey arrows on the right.

It may be useful to preview these augment()-ed data frames with glimpse():

glimpse(mod_null)
glimpse(mod_hgt)

Compute the sum of the squared residuals (SSE) for the null model mod_null.
Compute the sum of the squared residuals (SSE) for the regression model mod_hgt.

# Compute SSE for null model
mod_null %>%
  summarize(SSE = sum(.resid^2))

# Compute SSE for regression model
mod_hgt %>%
  summarize(SSE = sum(.resid^2))

Leverage

The leverage of an observation in a regression model is defined entirely in terms of the distance of that observation from the mean of the explanatory variable. That is, observations close to the mean of the explanatory variable have low leverage, while observations far from the mean of the explanatory variable have high leverage. Points of high leverage may or may not be influential.

The augment() function from the broom package will add the leverage scores (.hat) to a model data frame.

Use augment() to list the top 6 observations by their leverage scores, in descending order.

# Rank points of high leverage
mod %>%
  augment() %>%
  arrange(desc(.hat)) %>%
  head()

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?

## # A tibble: 6 x 8
##     wgt   hgt .fitted .resid .std.resid   .hat .sigma  .cooksd
##   <dbl> <dbl>   <dbl>  <dbl>      <dbl>  <dbl>  <dbl>    <dbl>
## 1  85.5  198.    96.6 -11.1      -1.20  0.0182   9.30 0.0134  
## 2  90.9  197.    95.6  -4.66     -0.505 0.0170   9.31 0.00221 
## 3  49.8  147.    44.8   5.02      0.543 0.0148   9.31 0.00221 
## 4  80.7  194.    91.9 -11.2      -1.21  0.0131   9.30 0.00976 
## 5  95.9  193     91.4   4.51      0.488 0.0126   9.32 0.00152 
## 6  44.8  150.    47.1  -2.32     -0.251 0.0124   9.32 0.000397

Influence

Cook’s Distance

Cook’s distance is the scaled change in fitted values, which is useful for identifying outliers in the X values (observations for predictor variables). Cook’s distance shows the influence of each observation on the fitted response values. An observation with Cook’s distance larger than three times the mean Cook’s distance might be an outlier. The measurement is a combination of each observation’s leverage and residual values; the higher the leverage and residuals, the higher the Cook’s distance.

As noted previously, observations of high leverage may or may not be influential. The influence of an observation depends not only on its leverage, but also on the magnitude of its residual. Recall that while leverage only takes into account the explanatory variable \((x)\), the residual depends on the response variable \((y)\) and the fitted value \((\hat{y})\).

Influential points are likely to have high leverage and deviate from the general relationship between the two variables. We measure influence using Cook’s distance, which incorporates both the leverage and residual of each observation.

Use augment() to list the top 6 observations by their Cook’s distance (.cooksd), in descending order.

# Rank influential points
mod %>%
  augment() %>%
  arrange(desc(.cooksd)) %>%
  head()

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?

## # A tibble: 6 x 8
##     wgt   hgt .fitted .resid .std.resid    .hat .sigma .cooksd
##   <dbl> <dbl>   <dbl>  <dbl>      <dbl>   <dbl>  <dbl>   <dbl>
## 1  73.2  151.    48.8   24.4       2.64 0.0109    9.25  0.0386
## 2 116.   178.    75.9   40.5       4.36 0.00296   9.14  0.0282
## 3 104.   165.    63.0   41.1       4.42 0.00279   9.14  0.0273
## 4 109.   190.    88.8   19.8       2.13 0.0103    9.28  0.0238
## 5  67.3  152.    50.1   17.2       1.86 0.00982   9.29  0.0171
## 6  76.8  158.    55.3   21.5       2.32 0.00613   9.27  0.0166

Removing Outliers

Observations can be outliers for a number of different reasons. Statisticians must always be careful—and more importantly, transparent—when dealing with outliers. Sometimes, a better model fit can be achieved by simply removing outliers and re-fitting the model. However, one must have strong justification for doing this. A desire to have a higher R² is not a good enough reason!

In the mlbBat10 data, the outlier with an OBP of 0.550 is Bobby Scales, an infielder who had four hits in 13 at-bats for the Chicago Cubs. Scales also walked seven times, resulting in his unusually high OBP. The justification for removing Scales here is weak. While his performance was unusual, there is nothing to suggest that it is not a valid data point, nor is there a good reason to think that somehow we will learn more about Major League Baseball players by excluding him.

Nevertheless, we can demonstrate how removing him will affect our model.

Use filter() to create a subset of mlbBat10 called nontrivial_players consisting of only those players with at least 10 at-bats and OBP of below 0.500.
Fit the linear model for SLG as a function of OBP for the nontrivial_players. Save the result as mod_cleaner.
View the summary() of the new model and compare the slope and R² to those of mod, the original model fit to the data on all players.
Visualize the new model with ggplot() and the appropriate geom_*() functions.

# Create nontrivial_players
nontrivial_players <- mlbbat10 %>%
  filter(at_bat >= 10, obp < 0.5)

# Fit model to new data
mod_cleaner <- lm(slg ~ obp, data = nontrivial_players)

# View model summary
summary(mod_cleaner)

## 
## Call:
## lm(formula = slg ~ obp, data = nontrivial_players)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31383 -0.04165 -0.00261  0.03992  0.35819 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.043326   0.009823  -4.411 1.18e-05 ***
## obp          1.345816   0.033012  40.768  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07011 on 734 degrees of freedom
## Multiple R-squared:  0.6937, Adjusted R-squared:  0.6932 
## F-statistic:  1662 on 1 and 734 DF,  p-value: < 2.2e-16

# Visualize new model
ggplot(data = nontrivial_players, aes(x = obp, y = slg)) +
  geom_point() + 
  geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

High leverage points

Not all points of high leverage are influential. While the high leverage observation corresponding to Bobby Scales in the previous exercise is influential, the three observations for players with OBP and SLG values of 0 are not influential.

This is because they happen to lie right near the regression anyway. Thus, while their extremely low OBP gives them the power to exert influence over the slope of the regression line, their low SLG prevents them from using it.

The linear model, mod, is available in your workspace. Use a combination of augment(), arrange() with two arguments, and head() to find the top 6 observations with the highest leverage but the lowest Cook’s distance.

# Rank high leverage points
mod %>%
  augment() %>%
  arrange(desc(.hat), .cooksd) %>%
  head()

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?

## # A tibble: 6 x 8
##     wgt   hgt .fitted .resid .std.resid   .hat .sigma  .cooksd
##   <dbl> <dbl>   <dbl>  <dbl>      <dbl>  <dbl>  <dbl>    <dbl>
## 1  85.5  198.    96.6 -11.1      -1.20  0.0182   9.30 0.0134  
## 2  90.9  197.    95.6  -4.66     -0.505 0.0170   9.31 0.00221 
## 3  49.8  147.    44.8   5.02      0.543 0.0148   9.31 0.00221 
## 4  80.7  194.    91.9 -11.2      -1.21  0.0131   9.30 0.00976 
## 5  95.9  193     91.4   4.51      0.488 0.0126   9.32 0.00152 
## 6  44.8  150.    47.1  -2.32     -0.251 0.0124   9.32 0.000397