Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
This discussion uses the hsb2 dataset (linked below) from the openintro package, which is sourced from the UCLA Institute for Digital Research & Education - Statistical Consulting. The linear regression model below uses the reading score (read) as the independent variable and the math score (math) as the dependent variable, on the reasoning that reading comprehension skills are important for word problems in math.
https://www.openintro.org/data/index.php?data=hsb2
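The data preview and model summary that follow could be reproduced with something like the sketch below. It assumes the openintro package is installed. The object name `my_lm` is taken from the plotting code later in this post, and the formula is written the way it appears in the `Call:` line of the summary.

```r
# Assumes the openintro package is installed: install.packages("openintro")
library(openintro)  # provides the hsb2 data frame

head(hsb2, 10)      # preview the first 10 of the 200 rows

# Regress math score on reading score, written as in the "Call:" line
my_lm <- lm(hsb2$math ~ hsb2$read)
summary(my_lm)
```

Writing the model as `lm(math ~ read, data = hsb2)` gives the same fit and is usually the more convenient form, since it also works with `predict()` on new data.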
## # A tibble: 200 x 11
##       id gender race        ses   schtyp prog     read write  math science socst
##    <int> <chr>  <chr>       <fct> <fct>  <fct>   <int> <int> <int>   <int> <int>
##  1    70 male   white       low   public general    57    52    41      47    57
##  2   121 female white       midd~ public vocati~    68    59    53      63    61
##  3    86 male   white       high  public general    44    33    54      58    31
##  4   141 male   white       high  public vocati~    63    44    47      53    56
##  5   172 male   white       midd~ public academ~    47    52    57      53    61
##  6   113 male   white       midd~ public academ~    44    52    51      63    61
##  7    50 male   african am~ midd~ public general    50    59    42      53    61
##  8    11 male   hispanic    midd~ public academ~    34    46    45      39    36
##  9    84 male   white       midd~ public general    63    57    54      58    51
## 10    48 male   african am~ midd~ public academ~    57    55    52      50    51
## # ... with 190 more rows
##
## Call:
## lm(formula = hsb2$math ~ hsb2$read)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -24.1624  -5.1624  -0.4135   4.7775  16.4684
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.03816    2.58945   8.125 4.76e-14 ***
## hsb2$read    0.60515    0.04865  12.438  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.037 on 198 degrees of freedom
## Multiple R-squared: 0.4386, Adjusted R-squared: 0.4358
## F-statistic: 154.7 on 1 and 198 DF, p-value: < 2.2e-16
The residual of an observation is the observed value minus the value predicted by the model.
Residuals tell us how well a model fits the data, i.e., whether the error is large or small.
The plot below shows the size of each residual using color and point size: red, larger points signify a larger absolute residual, while green, smaller points signify a smaller one.
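As a quick check on the definition, the residuals can be recomputed by hand and compared with what R returns. This is a minimal sketch that assumes the openintro package is installed; `manual_resid` and `rss` are illustrative names, and the model is refit with the `data =` form, which is equivalent to the call shown in the summary.

```r
# Assumes the openintro package is installed
library(openintro)
my_lm <- lm(math ~ read, data = hsb2)

# Residual = observed value - fitted (predicted) value
manual_resid <- hsb2$math - fitted(my_lm)
all.equal(unname(manual_resid), unname(residuals(my_lm)))

# The "Residual standard error" in summary() is sqrt(RSS / residual df)
rss <- sum(residuals(my_lm)^2)
sqrt(rss / df.residual(my_lm))
```

The last line should match the residual standard error reported in the summary above, which gives a single-number answer to "is the error large or small?" in the units of the math score.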
library(ggplot2)
# my_lm is the fitted model summarized above: lm(hsb2$math ~ hsb2$read)
hsb2$predicted <- predict(my_lm) # Save the predicted values
hsb2$residuals <- residuals(my_lm) # Save the residual values
ggplot(hsb2, aes(x = read, y = math)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") + # regression line
  geom_segment(aes(xend = read, yend = predicted), alpha = .2) + # segment from each point to the line
  geom_point(aes(color = abs(residuals), size = abs(residuals))) + # color and size by absolute residual
  scale_color_continuous(low = "green", high = "red") + # green = smaller residual, red = larger
  guides(color = "none", size = "none") + # remove the color and size legends
  geom_point(aes(y = predicted), shape = 1) + # fitted values as open circles
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'
In the Residuals vs. Fitted plot below, the residuals look mostly homoscedastic, meaning they are spread roughly evenly around zero across the fitted values, with the exception of the left half of the graph. The red line in the Residuals vs. Fitted plot is also fairly flat, which suggests the linearity assumption is met.
The Scale-Location plot shows some increasing variance at the left, but the trend flattens out after about 55 on the fitted-value axis.
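The diagnostic plots discussed here are the ones base R draws for a fitted `lm` object. A minimal sketch of how they could be produced, assuming the openintro package is installed and refitting the model with the equivalent `data =` form (`old_par` is an illustrative name):

```r
# Assumes the openintro package is installed
library(openintro)
my_lm <- lm(math ~ read, data = hsb2)

# Draw the four default lm diagnostic plots in a 2 x 2 grid:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
old_par <- par(mfrow = c(2, 2))
plot(my_lm)
par(old_par)  # restore the previous plotting layout
```

A single panel can be drawn with the `which` argument, for example `plot(my_lm, which = 3)` for the Scale-Location plot.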
To determine whether the linear model is appropriate, we check the four assumptions associated with a linear regression model, listed and explained below:
Linearity: The relationship between X and the mean of Y is linear. In the Residuals vs. Fitted plot, the red line is approximately horizontal at zero, suggesting linearity. The first plot above also shows a roughly linear relationship between the two variables.
Homoscedasticity: The variance of the residuals is the same for any value of X. The Scale-Location plot shows whether the residuals are spread equally along the range of the predictor. The line is approximately horizontal and most points are evenly spread, with the exception of the far left of the graph.
Independence: Observations are independent of each other. In the Residuals vs. Fitted plot, the residuals show no visible pattern or trend, so there is no sign of a relationship among them.
Normality: For any fixed value of X, Y is normally distributed. The QQ plot of the residuals can be used to check the normality assumption visually. The points deviate little from the reference line, meaning the residuals are approximately normally distributed.
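The visual normality check can be supplemented with a numeric one. The sketch below is an optional addition, not part of the original analysis: it redraws the QQ plot of the residuals and runs a Shapiro-Wilk test from base R. It assumes the openintro package is installed; `res` and `sw` are illustrative names.

```r
# Assumes the openintro package is installed
library(openintro)
my_lm <- lm(math ~ read, data = hsb2)
res <- residuals(my_lm)

# QQ plot of the residuals with a reference line
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a small p-value would be evidence against normality
sw <- shapiro.test(res)
sw$p.value
```

With 200 observations a formal test can flag departures that are too small to matter in practice, so the QQ plot remains the main evidence here.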
The model appears to meet the above criteria reasonably well, so a linear model can be considered appropriate for these data.