Please indicate who you collaborated with on this problem set:
Recall from PS05 you used the evals dataset of teaching evaluation scores for 463 courses.
In particular you did an “exploratory data analysis (EDA)” to get a qualitative sense of the relationship between teaching score and age. You did this using both a visualization and summary statistics. You then “fit” a regression model and saved it in score_model_age:
# 1. Fit regression model:
score_model_age <- lm(score ~ age, data = evals)

You then output the regression table, which is an explicit numerical quantification of the relationship between score and age. It contains, in particular, the values of the fitted intercept and slope in the estimate column:
# 2. Output regression table:
get_regression_table(score_model_age)

| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 4.462 | 0.127 | 35.195 | 0.000 | 4.213 | 4.711 |
| age | -0.006 | 0.003 | -2.311 | 0.021 | -0.011 | -0.001 |
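From the estimate column, the fitted regression line is score_hat = 4.462 - 0.006 * age. As a minimal sketch (the object name b below is just for illustration), the same intercept and slope can be pulled directly from the fitted model object and used to compute a prediction by hand:

# Extract the fitted intercept and slope from the model object
b <- coef(score_model_age)
b

# Hand-computed prediction for, e.g., a 36-year-old instructor:
# score_hat = intercept + slope * age
b[1] + b[2] * 36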
Now you will do a residual analysis. Recall the following: the fitted value is \(\hat{y}\) and the residual is \(y - \hat{y}\). In our case, the outcome variable \(y\) is the teaching score and the explanatory variable \(x\) is age.

First, let’s compute all 463 fitted values \(\hat{y} = \widehat{score}\) and residuals. In the following code block:

- Apply the get_regression_points() function to score_model_age to obtain a table that contains the predicted scores (score_hat) and the residuals, and save this table as regression_points.
- Apply the head() function to regression_points to print the first 6 rows.

Solution:
regression_points <- get_regression_points(score_model_age)
head(regression_points)

| ID | score | age | score_hat | residual |
|---|---|---|---|---|
| 1 | 4.7 | 36 | 4.248 | 0.452 |
| 2 | 4.1 | 36 | 4.248 | -0.148 |
| 3 | 3.9 | 36 | 4.248 | -0.348 |
| 4 | 4.8 | 36 | 4.248 | 0.552 |
| 5 | 4.6 | 59 | 4.112 | 0.488 |
| 6 | 4.3 | 59 | 4.112 | 0.188 |
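For reference, these two columns can be recomputed by hand from the fitted coefficients. The following is only a sketch (it assumes the dplyr package is loaded, and the *_check column names are purely illustrative), not part of the required solution:

# Recompute the fitted values and residuals by hand and compare them
# to the columns produced by get_regression_points()
library(dplyr)
b <- coef(score_model_age)   # fitted intercept and slope
regression_points %>%
  mutate(
    score_hat_check = b[1] + b[2] * age,       # intercept + slope * age
    residual_check  = score - score_hat_check  # observed minus fitted
  ) %>%
  head()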
For the first row of values, how are score_hat and residual computed? That is, what is the mathematical operation? (Write out a general equation for calculating each.) Note there might be some small differences due to rounding error.
Solution:
score_hat = score predicted by the regression model = intercept + slope * age = 4.462 - 0.006 * age

residual = difference between actual and predicted score = score - score_hat

The first method for residual analysis is creating a scatterplot with the residuals on the y axis and the explanatory variable age on the x axis. Use the regression_points data frame we created earlier to create a scatterplot:
# Scatterplot:
ggplot(regression_points, aes(x = age, y = residual)) +
geom_point() +
labs(x = "Age", y = "Residual") +
geom_hline(yintercept = 0, col = "blue", size = 1)

Qualitatively describe any salient features of the scatterplot above. For example, are the residuals evenly spread? Are they spread out generally the same at low and high values of age? Are there any extreme values?
Solution:
There appear to be more underestimates than overestimates among the score predictions (a positive residual means the actual score is higher than the predicted score, i.e., an underestimate). However, the overestimates are more extreme: the negative residuals reach down to about -2, whereas the positive residuals are at most about +1.
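These impressions can be checked numerically. Here is a quick sketch (assuming dplyr is loaded; the summary column names are just illustrative) that counts the under- and overestimates and finds the most extreme residual in each direction:

# Count underestimates (residual > 0) and overestimates (residual < 0),
# and find the most extreme residual in each direction
regression_points %>%
  summarize(
    n_underestimates = sum(residual > 0),
    n_overestimates  = sum(residual < 0),
    most_negative    = min(residual),
    most_positive    = max(residual)
  )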
The second method for residual analysis is creating a histogram of the residuals to study their distribution. Use the regression_points data frame we created earlier to create a histogram. Use a binwidth of ~ 0.25:
ggplot(regression_points, aes(x = residual)) +
geom_histogram(binwidth = 0.25, color = "white") +
labs(x = "Residual")

Qualitatively describe any salient features of the distribution in the histogram above. For example, what are the center, shape, spread, and symmetry?
Solution:
The histogram indicates that we have more positive residuals than negative ones, i.e., more underestimates than overestimates. The long left tail, however, suggests that the overestimates are more spread out than the underestimates. This is consistent with the scatterplot we saw earlier.
The distribution nevertheless appears centered around 0 and roughly bell-shaped, with a negative (left) skew because of the long tail on the left.
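As a rough numerical check of these descriptions (again only a sketch, assuming dplyr is loaded), the center and spread of the residuals can be summarized directly; for a left-skewed distribution the mean is typically pulled slightly below the median:

# Center and spread of the residual distribution
regression_points %>%
  summarize(
    mean_residual   = mean(residual),
    median_residual = median(residual),
    sd_residual     = sd(residual)
  )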