Summary

Equipped with your understanding of the general modeling framework, in this chapter we’ll cover basic linear regression. You’ll keep things simple and model the outcome variable \(y\) as a function of a single explanatory/predictor variable \(x\), using both categorical and numerical \(x\) variables. The outcome variable of interest in this chapter will be the teaching evaluation scores of instructors at the University of Texas at Austin.

Explaining Teaching Score with Age

Earlier, you explored the relationship between teaching score and age via a scatter plot.

# Load packages
library(tidyverse)
library(moderndive)
library(knitr)   # for kable(), used to format tables below
# Create a scatter plot of teaching score over age
ggplot(evals, aes(x = age, y = score)) +
  geom_point() +
  labs(x = "age", y = "score",
       title = "Teaching score over age")

You found that the relationship showed a correlation coefficient of -0.107, indicating a weakly negative relationship:

evals %>% 
  summarise(correlation = cor(age, score))
## # A tibble: 1 x 1
##   correlation
##         <dbl>
## 1      -0.107

You also saw that the scatter plot suffers from overplotting. Let’s keep this overplotting in mind as we move forward. Now, can you visually summarise the above relationship with a best-fitting line? That is, a line that cuts through the cloud of points, separating the signal from the noise? We can do this using a regression line.

We can add a regression line to the ggplot2 code by adding a geom_smooth() layer with method = "lm" for a linear model and se = FALSE to omit the standard error bands (a concept for a more advanced course).

# Create a scatter plot of teaching score over age
ggplot(evals, aes(x = age, y = score)) +
  geom_point() +
  labs(x = "age", y = "score",
       title = "Teaching score over age") +
  # Add a "best-fitting" line
  geom_smooth(method = "lm", se = FALSE)

Observe that the overall relationship is negative: as age increases, scores decrease. This is consistent with our computed correlation coefficient of -0.107. Now, does that mean aging directly causes decreases in score? Not necessarily, as there may be other factors we are not accounting for. Correlation does not equal causation.

This best-fitting line is the linear regression line and it is a fitted linear model \(\hat{f}()\). Let’s draw connections with our earlier modeling theory.

Refresher: Modeling in General

  • Truth: Assumed model is \(y=f(\vec{x})+\epsilon\)
  • Goal: Given \(y\) and \(\vec{x}\), fit a model \(\hat{f}(\vec{x})\) that approximates \(f(\vec{x})\), where \(\hat{y}=\hat{f}(\vec{x})\) is the fitted/predicted value for the observed value \(y\).

Modeling with Basic Linear Regression

  • Truth:
    • Assume \(f()\) is a linear function (i.e. a line), necessitating an intercept (\(\beta_0\)) and a slope for \(x\) (\(\beta_1\)): \(f(x)=\beta_0+\beta_1\cdot{x}\)
    • Observed value \(y=f(x)+\epsilon=\beta_0+\beta_1\cdot{x}+\epsilon\)
  • Fitted:
    • Assume \(\hat{f}(x)=\hat{\beta}_0+\hat{\beta}_1\cdot{x}\). These values are computed using our observed data.
    • Fitted/predicted value \(\hat{y}=\hat{f}(x)=\hat{\beta}_0+\hat{\beta}_1\cdot{x}\). Note that there is no \(\epsilon\) term here as our fitted model \(\hat{f}(x)\) should only capture signal and not noise.

The best-fitting line is thus: \(\hat{y}=\hat{f}(x)=\hat{\beta}_0+\hat{\beta}_1\cdot{x}\)

But what are the numerical values of the fitted intercept and slope? We’ll let R compute these for us.

Computing the Slope and Intercept of Regression Line

You first fit a linear model with lm(), whose arguments are the data and a model formula of the form y ~ x, where y is the outcome variable and x is the explanatory variable.

# Fit regression model using formula of form: y ~ x
model_score_1 <- lm(score ~ age, data = evals)
# Output contents
model_score_1
## 
## Call:
## lm(formula = score ~ age, data = evals)
## 
## Coefficients:
## (Intercept)          age  
##    4.461932    -0.005938

While the intercept of 4.46 has a mathematical interpretation, being the value of \(y\) when \(x=0\), here it doesn’t have a practical interpretation (it would be the teaching score when age is 0). The slope of -0.005938 quantifies the relationship between score and age. Its interpretation is rise over run: for every increase of one year in age, there is an associated decrease of, on average, 0.0059 units in score. The negative slope reflects the negative relationship.
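
As a quick sanity check on this slope interpretation, here’s a minimal sketch (using base R’s coef() on the model we just fit) showing that predictions for two instructors whose ages differ by one year differ by exactly the fitted slope:

# Extract the fitted coefficients from the model
b <- coef(model_score_1)

# Difference in predicted score between ages 41 and 40 equals the slope
(b["(Intercept)"] + b["age"] * 41) - (b["(Intercept)"] + b["age"] * 40)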

However, the latter output is a bit sparse and not in dataframe format. Let’s improve this by applying the get_regression_table() function from the moderndive package to model_score_1.

# Output regression table using wrapper function:
kable(get_regression_table(model_score_1))
term       estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept     4.462      0.127     35.195    0.000     4.213     4.711
age          -0.006      0.003     -2.311    0.021    -0.011    -0.001

This produces what is known as a regression table. get_regression_table() is an example of a wrapper function: it takes other existing functions and hides their internal workings so that all you need to worry about are the input and output formats.

The fitted intercept and slope are now in the second column, estimate. The additional columns, such as std_error and p_value, all speak to the statistical significance of our results. These concepts are covered in more advanced courses on statistical inference.
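
If you’re curious, the same estimates and standard errors also appear in base R’s summary() output; get_regression_table() essentially presents this information in a tidy, rounded dataframe (a rough comparison, not its exact internals):

# Compare with base R's summary output: same estimates and standard errors
summary(model_score_1)$coefficients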

Plotting a Best-fitting Regression Line

Previously you visualised the relationship between teaching score and “beauty score” via a scatterplot. Now let’s add a “best-fitting” regression line to provide a sense of any overall trend. Even though you know this plot suffers from overplotting, you’ll stick to the non-jittered version.

# Plot
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_point() +
  labs(x = "beauty score", y = "score") +
  geom_smooth(method = "lm", se = FALSE)

The overall trend seems to be positive. As instructors have higher “beauty” scores, so also do they tend to have higher teaching scores.

Fitting a Regression with a Numerical \(x\)

Let’s now explicitly quantify the linear relationship between score and bty_avg using linear regression. You will do this by first “fitting” the model. Then you will get the regression table, a standard output in many statistical software packages. Finally, based on the output of get_regression_table(), which interpretation of the slope coefficient is correct?

# Fit model
model_score_2 <- lm(score ~ bty_avg, data = evals)

# Output content
model_score_2
## 
## Call:
## lm(formula = score ~ bty_avg, data = evals)
## 
## Coefficients:
## (Intercept)      bty_avg  
##     3.88034      0.06664

Given the sparsity of the output, let’s get the regression table using the get_regression_table() wrapper function.

kable(get_regression_table(model_score_2))
term       estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept     3.880      0.076     50.961        0     3.731     4.030
bty_avg       0.067      0.016      4.090        0     0.035     0.099

From this we can conclude that for every increase of one unit in beauty score, there is an associated increase of, on average, 0.067 units in teaching score.

Predicting Teaching Score Using Age

Let’s take our basic linear regression model of score as a function of age and now use it for prediction. For example, say we have demographic information about a professor at UT Austin. Can you make a good guess of their teaching score? Or, more generally, based on a single predictor variable \(x\), can you make good predictions \(\hat{y}\)?

Recall our best-fitting regression line from the last section: \(\hat{\text{score}} = 4.462 - 0.006\cdot\text{age}\).

Now, say all you know about an instructor is that they are aged 40. What is a good guess of their score? You could read the fitted value off the regression line at age = 40 in the earlier scatter plot; this is approximately 4.25. To compute it precisely, you need to use the fitted intercept and fitted slope for age from the regression table.

Refresher: Regression Table

Previously you learned how to fit a linear regression model with one numerical explanatory variable and applied the get_regression_table() function from the moderndive package to obtain the fitted intercept and fitted slope values:

term       estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept     4.462      0.127     35.195    0.000     4.213     4.711
age          -0.006      0.003     -2.311    0.021    -0.011    -0.001

Recall that these values are in the estimate column and are 4.46 and -0.006.

Predicted Value

More generally, you can use a fitted regression model \(\hat{f}()\) for predictive as well as explanatory purposes:

  • Predictive regression models in general:
    \(\hat{y}=\hat{f}(x)=\hat{\beta}_0+\hat{\beta}_1\cdot{x}\)
  • Our predictive model: \(\hat{\text{score}}=4.46-0.006\cdot\text{age}\)
  • Our prediction: \(4.46-0.006\cdot40=4.22\)

This is very close to our earlier visual prediction of 4.25.
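
Here’s a minimal sketch of computing that prediction in R, using the (unrounded) fitted coefficients rather than the rounded table values:

# Predict the teaching score of a 40-year-old instructor
coefs <- coef(model_score_1)
coefs["(Intercept)"] + coefs["age"] * 40

# Equivalently, using predict() with a new data frame
predict(model_score_1, newdata = data.frame(age = 40))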

Prediction Error

Now say we found out that the instructor actually got a score of 3.5. Our prediction of 4.22 over-predicted the actual score by approximately 0.72 units. What this demonstrates is the modeling concept of a residual.

Residuals as Model Errors

A residual is the observed value \(y\) minus the fitted or predicted value \(\hat{y}\). This discrepancy between the two values corresponds to the \(\epsilon\) from the general modeling framework. Here the negative residual of -0.72 corresponds to our over-prediction. With linear regression, sometimes you’ll obtain positive residuals and sometimes negative. In linear regression these residuals always average out to 0.

  • Residual = \(y - \hat{y}\).
  • Corresponds to the \(\epsilon\) from the general modeling framework \(y=f(\vec{x})+\epsilon\).
  • For our example instructor: \(y-\hat{y}=3.5-4.22=-0.72\).
  • In linear regression, they are on average 0.
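
As a quick check, here’s a sketch using base R’s residuals() accessor (which returns the same residuals that get_regression_points() will report below):

# Residual for our example instructor
3.5 - 4.22

# Residuals from a linear regression average out to (essentially) zero
mean(residuals(model_score_1))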

Now, say you want the predicted values \(\hat{y}\) and the residuals for all 463 instructors. You could repeat the procedure we just followed 463 times, but this would be tedious. So let’s automate this procedure using another wrapper function from the moderndive package.

Computing All Predicted Values

Recall our earlier fitted linear model, saved in the variable model_score_1. Instead of using get_regression_table(), let’s now use get_regression_points() to get information on all 463 points in our dataset.

# Fit regression model using formula of form: y ~ x
model_score_1 <- lm(score ~ age, data = evals)
# Get information on each point
model_score_1 %>% 
  get_regression_points() %>% 
  head(10) %>% 
  kable()
ID  score  age  score_hat  residual
 1    4.7   36      4.248     0.452
 2    4.1   36      4.248    -0.148
 3    3.9   36      4.248    -0.348
 4    4.8   36      4.248     0.552
 5    4.6   59      4.112     0.488
 6    4.3   59      4.112     0.188
 7    2.8   59      4.112    -1.312
 8    4.1   51      4.159    -0.059
 9    3.4   51      4.159    -0.759
10    4.5   40      4.224     0.276

The first column, ID, identifies the rows. The second and third columns are the original outcome and explanatory variables, score and age. The fourth column, score_hat, is the predicted value \(\hat{y}\) as computed using the equation of the regression line. The fifth column is the residual, score - score_hat.

Let’s go back to our original scatter plot and illustrate a few more residuals. Here we’ll plot six arbitrarily chosen residuals.

Image 1: Six residuals

Our earlier statement that the regression line is best-fitting means that of all possible lines, the blue regression line minimizes the residuals. What do I mean by minimize? Imagine you drew all 463 residuals, then squared their lengths so that positive and negative residuals were treated equally, then summed them. The regression line is the line that minimizes this quantity. You’ll see later that this quantity is called the sum of squared residuals and it measures the lack of fit of a model to a set of points. The blue regression line is best in that it minimizes this lack of fit.
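
Here’s a sketch of computing this sum of squared residuals from the get_regression_points() output (the resulting number isn’t quoted above, so treat the code as illustrative):

# Sum of squared residuals: the quantity the best-fitting line minimizes
model_score_1 %>% 
  get_regression_points() %>% 
  summarise(sum_sq_residuals = sum(residual^2))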

Making Predictions Using Beauty Score

Say there is an instructor at UT Austin and you know nothing about them except that their beauty score is 5. What is your prediction \(\hat{y}\) of their teaching score \(y\)?

get_regression_table(model_score_2)
## # A tibble: 2 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    3.88      0.076     51.0        0    3.73     4.03 
## 2 bty_avg      0.067     0.016      4.09       0    0.035    0.099

Using the values of the intercept and slope from above, predict this instructor’s score.

# Use fitted intercept and slope to get a prediction
y_hat <- 3.88 + 0.067 * 5
y_hat
## [1] 4.215

Say it’s revealed that the instructor’s score is 4.7. Compute the residual for this prediction, i.e., the residual \(y-\hat{y}\).

# Use fitted intercept and slope to get a prediction
y_hat <- 3.88 + 5 * 0.0670
y_hat
## [1] 4.215
# Compute residual y - y_hat
4.7 - 4.215
## [1] 0.485

Computing Fitted/Predicted Values & Residuals

Now say you want to repeat this for all 463 instructors in evals. Doing this manually as you just did would be long and tedious, so as seen in the video, let’s automate this using the get_regression_points() function.

Furthermore, let’s unpack its output.

  • Let’s once again get the regression table for model_score_2.
  • Apply get_regression_points() from the moderndive package to automate making predictions and computing residuals.
# Fit regression model
model_score_2 <- lm(score ~ bty_avg, data = evals)

# Get regression table
get_regression_table(model_score_2)
## # A tibble: 2 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    3.88      0.076     51.0        0    3.73     4.03 
## 2 bty_avg      0.067     0.016      4.09       0    0.035    0.099
# Get all fitted/predicted values and residuals
get_regression_points(model_score_2)
## # A tibble: 463 x 5
##       ID score bty_avg score_hat residual
##    <int> <dbl>   <dbl>     <dbl>    <dbl>
##  1     1   4.7    5         4.21    0.486
##  2     2   4.1    5         4.21   -0.114
##  3     3   3.9    5         4.21   -0.314
##  4     4   4.8    5         4.21    0.586
##  5     5   4.6    3         4.08    0.52 
##  6     6   4.3    3         4.08    0.22 
##  7     7   2.8    3         4.08   -1.28 
##  8     8   4.1    3.33      4.10   -0.002
##  9     9   3.4    3.33      4.10   -0.702
## 10    10   4.5    3.17      4.09    0.409
## # … with 453 more rows
  • Let’s unpack the contents of the score_hat column. First, run the code that fits the model and outputs the regression table.
  • Add a new column score_hat_2 which replicates how score_hat is computed using the table’s values.
# Fit regression model
model_score_2 <- lm(score ~ bty_avg, data = evals)

# Get regression table
get_regression_table(model_score_2)
## # A tibble: 2 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    3.88      0.076     51.0        0    3.73     4.03 
## 2 bty_avg      0.067     0.016      4.09       0    0.035    0.099
# Get all fitted/predicted values and residuals
get_regression_points(model_score_2) %>% 
  mutate(score_hat_2 = 3.88 + 0.067 * bty_avg)
## # A tibble: 463 x 6
##       ID score bty_avg score_hat residual score_hat_2
##    <int> <dbl>   <dbl>     <dbl>    <dbl>       <dbl>
##  1     1   4.7    5         4.21    0.486        4.22
##  2     2   4.1    5         4.21   -0.114        4.22
##  3     3   3.9    5         4.21   -0.314        4.22
##  4     4   4.8    5         4.21    0.586        4.22
##  5     5   4.6    3         4.08    0.52         4.08
##  6     6   4.3    3         4.08    0.22         4.08
##  7     7   2.8    3         4.08   -1.28         4.08
##  8     8   4.1    3.33      4.10   -0.002        4.10
##  9     9   3.4    3.33      4.10   -0.702        4.10
## 10    10   4.5    3.17      4.09    0.409        4.09
## # … with 453 more rows
  • Now let’s unpack the contents of the residual column. First, run the code that fits the model and outputs the regression table.
  • Add a new column residual_2 which replicates how residual is computed using the table’s values.
# Fit regression model
model_score_2 <- lm(score ~ bty_avg, data = evals)

# Get regression table
get_regression_table(model_score_2)
## # A tibble: 2 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    3.88      0.076     51.0        0    3.73     4.03 
## 2 bty_avg      0.067     0.016      4.09       0    0.035    0.099
# Get all fitted/predicted values and residuals
get_regression_points(model_score_2) %>% 
  mutate(residual_2 = score - score_hat)
## # A tibble: 463 x 6
##       ID score bty_avg score_hat residual residual_2
##    <int> <dbl>   <dbl>     <dbl>    <dbl>      <dbl>
##  1     1   4.7    5         4.21    0.486      0.486
##  2     2   4.1    5         4.21   -0.114     -0.114
##  3     3   3.9    5         4.21   -0.314     -0.314
##  4     4   4.8    5         4.21    0.586      0.586
##  5     5   4.6    3         4.08    0.52       0.520
##  6     6   4.3    3         4.08    0.22       0.220
##  7     7   2.8    3         4.08   -1.28      -1.28 
##  8     8   4.1    3.33      4.10   -0.002     -0.002
##  9     9   3.4    3.33      4.10   -0.702     -0.702
## 10    10   4.5    3.17      4.09    0.409      0.409
## # … with 453 more rows

Explaining Teaching Score with Gender

Let’s now expand your modeling toolbox with basic regression models where the explanatory/predictor variable is not numerical, but rather categorical. Much of the world’s data is categorical in nature, and it’s important to be equipped to handle it. You’ll continue constructing explanatory and predictive models of teaching score, now using the variable gender, which at the time of this study was recorded as a binary categorical variable.

Exploratory Data Visualization

As we similarly did in Chapter 1 for house price and house condition, let’s construct an exploratory boxplot of the relationship between score and gender.

ggplot(evals, aes(x = gender, y = score)) +
  geom_boxplot() +
  labs(x = "gender", y = "score")

You can easily compare the distributions of scores for men and women using single horizontal lines. For example, it seems male instructors tended to get higher scores, as evidenced by the higher median. Remember, the solid line in a boxplot is the median, not the mean. So before performing any formal regression modeling, I expect men will tend to be rated higher than women by students. Let’s make a mental note: treating the median for women of about 4.1 as a “baseline for comparison”, I observe a difference of about +0.2 for men. These aren’t exact values, just a rough eyeballing.
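
You can check these eyeballed values by computing the group medians directly (a small sketch; the exact medians aren’t reported above):

# Compute the group medians eyeballed from the boxplot
evals %>% 
  group_by(gender) %>% 
  summarise(median_score = median(score))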

Facetted Histogram

An alternative exploratory visualization is a facetted histogram. You use geom_histogram(), where x maps to the numerical variable score, and where we now have facets split by gender.

ggplot(evals, aes(x = score)) +
  geom_histogram(binwidth = 0.25) +
  facet_wrap(~ gender) +
  labs(x = "score", y = "count")

Unlike the boxplots, you now get a sense for the shape of the distributions. They both exhibit a slight left-skew. Nothing drastic like the right-skew of Seattle house prices, but still a slight skew nonetheless. However, it’s now harder to say which distribution is centered at a higher point. This is because the median isn’t clearly marked like in boxplots. Furthermore, comparisons between groups can’t be made using single lines. So which plot is better? Boxplots or facetted histograms? There is no universal right answer; it all depends on what you are trying to emphasize to the consumers of these visualizations.
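
If you do want to compare centers in the facetted histograms, one option (an optional sketch, not part of the original plot) is to overlay each facet with its group mean using geom_vline():

# Compute group means to overlay on each facet
gender_means <- evals %>% 
  group_by(gender) %>% 
  summarise(mean_score = mean(score))

# Facetted histogram with a dashed line at each group's mean score
ggplot(evals, aes(x = score)) +
  geom_histogram(binwidth = 0.25) +
  facet_wrap(~ gender) +
  geom_vline(data = gender_means, aes(xintercept = mean_score),
             linetype = "dashed") +
  labs(x = "score", y = "count")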

Fitting a regression model

You fit the regression as before where the model formula y ~ x in this case has x set to gender.

# Fit regression model
model_score_3 <- lm(score ~ gender, data = evals)

# Get regression table
model_score_3 %>% 
  get_regression_table() %>% 
  kable()
term          estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept        4.093      0.039    105.852    0.000     4.017     4.169
gender: male     0.142      0.051      2.784    0.006     0.042     0.241

Using get_regression_table(), you see that the regression table again yields a fitted intercept and a fitted slope, but what do these mean when the explanatory variable is categorical? The intercept 4.09 is the average score for the women, the “baseline for comparison” group. Why in this case is the baseline group female and not male? For no other reason than “female” comes alphabetically before “male”. The term gender: male is what’s known as a dummy or indicator variable, and its coefficient of 0.142 is the difference in average score for men relative to women. It is not that men had an average score of 0.142; rather, they differed on average from the women by +0.142, so their average score is 4.09 + 0.142 = 4.23.

Computing the group means

Let’s convince ourselves of this by computing group means using the group_by() and summarize() verbs.

# Compute group means based on gender
evals %>% 
  group_by(gender) %>% 
  summarise(avg_score = mean(score)) %>% 
  kable()
gender  avg_score
female   4.092821
male     4.234328

So, the latter table shows the means for men and women separately, whereas the regression table shows the average teaching score for the baseline group of women, and the relative difference to this baseline for the men.

A Different Categorical Explanatory Variable: rank

Let’s now consider a different categorical variable: rank. Let’s group the evals data by rank and obtain counts using the n() function in the summarize call.

# Get a count of records by rank
evals %>% 
  group_by(rank) %>% 
  summarise(n = n()) %>% 
  kable()
rank            n
teaching      102
tenure track  108
tenured       253

Observe the three levels. First, teaching instructors’ responsibilities lie primarily in teaching courses. Second, tenure track faculty also have research expectations and, generally speaking, are on the path to being promoted to the third and highest rank, tenured.

EDA of Relationship of Score and Rank

Let’s perform an EDA of the relationship between an instructor’s score and their rank in the evals dataset. You’ll both visualize this relationship and compute summary statistics for each level of rank: teaching, tenure track, and tenured.

Write the code to create a boxplot of the relationship between teaching score and rank.

ggplot(evals, aes(x = rank, y = score)) +
  geom_boxplot() +
  labs(x = "rank", y = "score")

For each unique value in rank:

  • Count the number of observations in each group
  • Find the mean and standard deviation of score
evals %>% 
  group_by(rank) %>% 
  summarise(n = n(),
            mean_score = mean(score, na.rm=TRUE),
            sd_score = sd(score, na.rm=TRUE))
## # A tibble: 3 x 4
##   rank             n mean_score sd_score
##   <fct>        <int>      <dbl>    <dbl>
## 1 teaching       102       4.28    0.498
## 2 tenure track   108       4.15    0.561
## 3 tenured        253       4.14    0.550

Fitting a Regression With a Categorical x

You’ll now fit a regression model with the categorical variable rank as the explanatory variable and interpret the values in the resulting regression table. Note here the rank “teaching” is treated as the baseline for comparison group for the “tenure track” and “tenured” groups.

Fit a linear regression model between score and rank, and then apply the wrapper function to model_score_4 that returns the regression table.

# fit regression model
model_score_4 <- lm(score ~ rank,
                    data = evals)

# Get regressions table
kable(get_regression_table(model_score_4))
term                estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept              4.284      0.054     79.853    0.000     4.179     4.390
rank: tenure track    -0.130      0.075     -1.733    0.084    -0.277     0.017
rank: tenured         -0.145      0.064     -2.284    0.023    -0.270    -0.020

Based on the regression table, compute the 3 possible fitted values \(\hat{y}\), which are the group means. Since “teaching” is the baseline for comparison group, the intercept is the mean score for the “teaching” group, and the rank: tenure track and rank: tenured terms are relative offsets from this baseline for the “tenure track” and “tenured” groups. A quick check against the earlier group means follows the code below.

# teaching mean
teaching_mean <- 4.28

# tenure track mean
tenure_track_mean <- 4.28 - 0.13

# tenured mean
tenured_mean <- 4.28 - 0.145
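
These reconstructed group means should agree, up to rounding, with the means you computed in the earlier EDA:

# Fitted group means reconstructed from the regression table
c(teaching = teaching_mean,
  tenure_track = tenure_track_mean,
  tenured = tenured_mean)

# Group means computed directly from the data
evals %>% 
  group_by(rank) %>% 
  summarise(mean_score = mean(score))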

Predicting Teaching Score Using Gender

You’ll finish your exploration of basic regression with modeling for prediction using one categorical predictor variable. The idea remains the same as predicting with one numerical variable: based on information contained in the predictor variables, in this case gender, can you accurately guess teaching scores?

You previously calculated gender specific means using group_by() and summarise(). This time however, let’s also compute the standard deviation, which is a measure of spread.

# Compute group means based on gender
evals %>% 
  group_by(gender) %>% 
  summarise(mean_score = mean(score),
            sd_score = sd(score)) %>% 
  kable()
gender  mean_score   sd_score
female    4.092821  0.5638141
male      4.234328  0.5218958

On average, the male instructors got a score of 4.23 and the female instructors got a score of 4.09. Furthermore, there is variation around the mean scores as evidenced by the standard deviations. The women had slightly more variation with a standard deviation of 0.564.

So say you had an instructor at UT Austin and you knew nothing about them other than that they were male. A reasonable prediction of their teaching score would be the group mean for the men of 4.23.

However, surely there are more factors associated with teaching scores than just gender? How good can this prediction be? There must be a fair amount of error involved.

What was our method for quantifying error? It was the residuals. Let’s get the predicted values and residuals for all 463 instructors.

Computing all Predicted Values and Residuals

We use the get_regression_points() function that we introduced earlier.

# Fit regression model
model_score_3 <- lm(score ~ gender, data = evals)

# Get information on each point
model_score_3 %>% 
  get_regression_points() %>%
  head(10) %>% 
  kable()
ID  score  gender  score_hat  residual
 1    4.7  female      4.093     0.607
 2    4.1  female      4.093     0.007
 3    3.9  female      4.093    -0.193
 4    4.8  female      4.093     0.707
 5    4.6  male        4.234     0.366
 6    4.3  male        4.234     0.066
 7    2.8  male        4.234    -1.434
 8    4.1  male        4.234    -0.134
 9    3.4  male        4.234    -0.834
10    4.5  female      4.093     0.407

Recall that whereas get_regression_table() returns the regression table, get_regression_points() returns information on each of the 463 rows in the evals dataframe, each representing one instructor.

In the second column is the observed \(y\) outcome score. In the fourth column score_hat, observe that there are only two possible predicted values: either 4.09 or 4.23, corresponding to the group means for the women and men respectively. The final column contains the residuals, which are the observed values \(y\) minus the predicted values \(\hat{y}\), or here score - score_hat.
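
You can confirm that there are only two distinct predicted values, one per gender group, with a quick count (a small sketch):

# Count how many instructors share each predicted value
model_score_3 %>% 
  get_regression_points() %>% 
  count(score_hat)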

In the first row, since the prediction 4.09 was lower than the observed value of 4.7, there is a positive residual, whereas in the third row, since the prediction of 4.09 was greater than the observed value of 3.9, there is a negative residual.

Say you predicted a value that was exactly equal to the observed value. Then in this case the residual would be 0.

Histogram of Residuals

Now let’s take the output of the get_regression_points() function and save it into a new dataframe called model_score_3_points. This way you can use this dataframe in a ggplot() call to plot a histogram of the residuals.

# Save the regression points to a variable
model_score_3_points <-
  get_regression_points(model_score_3)

# Plot residuals
ggplot(model_score_3_points, aes(x = residual)) +
  geom_histogram() +
  labs(x = "Residuals",
       title = "Residuals from score ~ gender model")

This histogram appears to be roughly centred at 0, indicating that on average the residuals (or errors) were 0. Sometimes we make errors as large as +1 or -1 out of a 5-unit scale. Those are fairly large. Also, note that we tend to make larger negative errors than positive errors. But these shortcomings are not necessarily bad. Remember, this is a very simplistic model with only one predictor: gender. The analysis of the residuals above suggests that we probably need more predictors than just gender to make good predictions of teaching scores. Wouldn’t it be great if you could fit regression models where you could use more than one explanatory or predictor variable? You’ll see in the upcoming Chapter 3 on multiple regression that you can.
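
To put rough numbers on these observations (a sketch; the exact values aren’t reported above), you can summarise the residuals directly:

# Summarise the residuals: their centre and their extremes
model_score_3_points %>% 
  summarise(mean_residual = mean(residual),
            min_residual = min(residual),
            max_residual = max(residual))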

Making Predictions Using Rank

Run get_regression_table(model_score_4) in the console to regenerate the regression table where you modeled score as a function of rank. Now say using this table you want to predict the teaching score of an instructor about whom you know nothing except that they are a tenured professor. Which of these statements is true?

# Get the regression table
get_regression_table(model_score_4) %>% 
  kable()
term                estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept              4.284      0.054     79.853    0.000     4.179     4.390
rank: tenure track    -0.130      0.075     -1.733    0.084    -0.277     0.017
rank: tenured         -0.145      0.064     -2.284    0.023    -0.270    -0.020

A good prediction of their score would be 4.28 - 0.145 = 4.135.

Visualising the Distribution of Residuals

Let’s now compute both the predicted score \(\hat{y}\) and the residual \(y-\hat{y}\) for all \(n=463\) instructors in the evals dataset. Furthermore, you’ll plot a histogram of the residuals and see if there are any patterns to the residuals, i.e. your predictive errors.

Apply the function that automates making predictions and computing residuals, and save these values to the dataframe model_score_4_points.

# Get the regression points
model_score_4_points <-
  get_regression_points(model_score_4)

# Print the first 10 rows
model_score_4_points %>% 
  head(10) %>% 
  kable()
ID  score  rank          score_hat  residual
 1    4.7  tenure track      4.155     0.545
 2    4.1  tenure track      4.155    -0.055
 3    3.9  tenure track      4.155    -0.255
 4    4.8  tenure track      4.155     0.645
 5    4.6  tenured           4.139     0.461
 6    4.3  tenured           4.139     0.161
 7    2.8  tenured           4.139    -1.339
 8    4.1  tenured           4.139    -0.039
 9    3.4  tenured           4.139    -0.739
10    4.5  tenured           4.139     0.361

Now take the model_score_4_points dataframe to plot a histogram of the residual column so you can see the distribution of the residuals, i.e., the prediction errors.

# Plot residuals
ggplot(model_score_4_points, aes(x = residual)) +
  geom_histogram() +
  labs(x = "residuals", titlex = "Residuals from score ~ rank model")