Please indicate who you collaborated with on this problem set:
Recall from PS05 you used the evals dataset of teaching evaluation scores for 463 courses.
In particular you did an “exploratory data analysis (EDA)” to get a qualitative sense of the relationship between teaching score and age. You did this using both a visualization and summary statistics. You then “fit” a regression model and saved it in score_model_age:
# 1. Fit regression model:
score_model_age <- lm(score ~ age, data = evals)

You then output the regression table, which is an explicit numerical quantification of the relationship between score and age. It contains, in particular, the values of the fitted intercept and slope in the estimate column:
# 2. Output regression table:
get_regression_table(score_model_age)

| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 4.462 | 0.127 | 35.195 | 0.000 | 4.213 | 4.711 |
| age | -0.006 | 0.003 | -2.311 | 0.021 | -0.011 | -0.001 |
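From the estimate column, the fitted regression line is score_hat = 4.462 - 0.006 * age. As a minimal sketch (the object name b below is just for illustration), the same intercept and slope can be pulled directly from the fitted model object and used to compute a prediction by hand:

# Extract the fitted intercept and slope from the model object
b <- coef(score_model_age)
b

# Hand-computed prediction for, e.g., a 36-year-old instructor:
# score_hat = intercept + slope * age
b[1] + b[2] * 36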
Now you will do a residual analysis. Recall the following: the fitted value is \(\hat{y}\) and the residual is \(y - \hat{y}\). In our case, the outcome variable \(y\) is the teaching score and the explanatory variable \(x\) is age.

First, let’s compute all 463 fitted values \(\hat{y} = \widehat{score}\) and residuals. In the following code block:

- Apply the get_regression_points() function to score_model_age to obtain a table that contains the predicted scores (score_hat) and the residuals, and save this table as regression_points.
- Apply the head() function to regression_points to print the first 6 rows.

Solution:
regression_points <- get_regression_points(score_model_age)
head(regression_points)

| ID | score | age | score_hat | residual |
|---|---|---|---|---|
| 1 | 4.7 | 36 | 4.248 | 0.452 |
| 2 | 4.1 | 36 | 4.248 | -0.148 |
| 3 | 3.9 | 36 | 4.248 | -0.348 |
| 4 | 4.8 | 36 | 4.248 | 0.552 |
| 5 | 4.6 | 59 | 4.112 | 0.488 |
| 6 | 4.3 | 59 | 4.112 | 0.188 |
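For reference, these two columns can be recomputed by hand from the fitted coefficients. The following is only a sketch (it assumes the dplyr package is loaded, and the *_check column names are purely illustrative), not part of the required solution:

# Recompute the fitted values and residuals by hand and compare them
# to the columns produced by get_regression_points()
library(dplyr)
b <- coef(score_model_age)   # fitted intercept and slope
regression_points %>%
  mutate(
    score_hat_check = b[1] + b[2] * age,       # intercept + slope * age
    residual_check  = score - score_hat_check  # observed minus fitted
  ) %>%
  head()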
For the first row of values, how are score_hat and residual computed? That is, what is the mathematical operation? (Write out a general equation for calculating each.) Note there might be some small differences due to rounding error.
Solution:
score_hat = score predicted by the regression model = intercept + slope * age = 4.462 - 0.006 * age

residual = difference between actual and predicted score = score - score_hat

The first method for residual analysis is creating a scatterplot with the residuals on the y axis and the explanatory variable age on the x axis. Use the regression_points data frame we created earlier to create a scatterplot:
# Scatterplot:
ggplot(regression_points, aes(x = age, y = residual)) +
geom_point() +
labs(x = "Age", y = "Residual") +
geom_hline(yintercept = 0, col = "blue", size = 1)

Qualitatively describe any salient features of the scatterplot above. For example, are the residuals evenly spread? Are they spread out generally the same at low and high values of age? Are there any extreme values?
Solution:
There appear to be more underestimates than overestimates among the score predictions (a positive residual means the actual score is higher than the predicted score, i.e., an underestimate). However, the overestimates are more extreme: the negative residuals reach down to about -2, whereas the positive residuals are at most about +1.
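These impressions can be checked numerically. Here is a quick sketch (assuming dplyr is loaded; the summary column names are just illustrative) that counts the under- and overestimates and finds the most extreme residual in each direction:

# Count underestimates (residual > 0) and overestimates (residual < 0),
# and find the most extreme residual in each direction
regression_points %>%
  summarize(
    n_underestimates = sum(residual > 0),
    n_overestimates  = sum(residual < 0),
    most_negative    = min(residual),
    most_positive    = max(residual)
  )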
The second method for residual analysis is creating a histogram of the residuals to study their distribution. Use the regression_points data frame we created earlier to create a histogram. Use a binwidth of ~ 0.25:
ggplot(regression_points, aes(x = residual)) +
geom_histogram(binwidth = 0.25, color = "white") +
labs(x = "Residual")

Qualitatively describe any salient features of the distribution in the histogram above. For example, what are the center, shape, spread, and symmetry?
Solution:
The histogram indicates that we have more positive residuals than negative ones, i.e., more underestimates than overestimates. The long left tail, however, suggests that the overestimates are more spread out than the underestimates. This is consistent with the scatterplot we saw earlier.
The distribution nevertheless appears centered around 0 and roughly bell-shaped, with a negative (left) skew because of the long tail on the left.
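As a rough numerical check of these descriptions (again only a sketch, assuming dplyr is loaded), the center and spread of the residuals can be summarized directly; for a left-skewed distribution the mean is typically pulled slightly below the median:

# Center and spread of the residual distribution
regression_points %>%
  summarize(
    mean_residual   = mean(residual),
    median_residual = median(residual),
    sd_residual     = sd(residual)
  )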