library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
FORUM DESCRIPTION

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
This discussion uses the hsb2 dataset from the openintro package, which is sourced from the UCLA Institute for Digital Research & Education - Statistical Consulting. The dataset is documented at https://www.openintro.org/data/index.php?data=hsb2. The linear regression model below relates reading and math scores, with reading score as the independent variable and math score as the dependent variable (reading comprehension skills are important for word problems in math).
hsb2
## # A tibble: 200 × 11
## id gender race ses schtyp prog read write math science socst
## <int> <chr> <chr> <fct> <fct> <fct> <int> <int> <int> <int> <int>
## 1 70 male white low public gene… 57 52 41 47 57
## 2 121 female white midd… public voca… 68 59 53 63 61
## 3 86 male white high public gene… 44 33 54 58 31
## 4 141 male white high public voca… 63 44 47 53 56
## 5 172 male white midd… public acad… 47 52 57 53 61
## 6 113 male white midd… public acad… 44 52 51 63 61
## 7 50 male african amer… midd… public gene… 50 59 42 53 61
## 8 11 male hispanic midd… public acad… 34 46 45 39 36
## 9 84 male white midd… public gene… 63 57 54 58 51
## 10 48 male african amer… midd… public acad… 57 55 52 50 51
## # … with 190 more rows
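Before fitting the model, it is worth checking the strength of the linear association between the two scores. This quick check is an addition to the original analysis; for a simple linear regression, the squared correlation should match the Multiple R-squared reported by summary() further down (0.4386, so r is roughly 0.66).

cor(hsb2$read, hsb2$math) # Pearson correlation between reading and math scores
cor(hsb2$read, hsb2$math)^2 # squared correlation; should equal the model's Multiple R-squared (~0.4386)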
plot(hsb2$read, hsb2$math, xlab = "Reading score", ylab = "Math score") # scatterplot of math against reading
my_lm <- lm(math ~ read, data = hsb2) # regress math score on reading score
summary(my_lm)
##
## Call:
## lm(formula = math ~ read, data = hsb2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.1624 -5.1624 -0.4135 4.7775 16.4684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.03816 2.58945 8.125 4.76e-14 ***
## read 0.60515 0.04865 12.438 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.037 on 198 degrees of freedom
## Multiple R-squared: 0.4386, Adjusted R-squared: 0.4358
## F-statistic: 154.7 on 1 and 198 DF, p-value: < 2.2e-16
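From the coefficients above, the fitted line is math ≈ 21.04 + 0.61 × read. As an illustration (the reading score of 50 here is an arbitrary choice, not part of the original post), the model's prediction can be computed directly:

# predicted math score for a reading score of 50:
# 21.03816 + 0.60515 * 50 ≈ 51.3
predict(my_lm, newdata = data.frame(read = 50))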
Residual Analysis

The residual of an observation is the difference between the observed outcome value and the value the model predicts for it. The residuals show how well a model fits the data, i.e., how large or small the errors are. The color coding in the plot below illustrates the size of each residual (red denotes a larger residual, while green denotes a smaller one).
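As a quick sanity check of that definition (an addition to the original analysis), the residuals returned by residuals() should equal the observed math scores minus the fitted values:

all.equal(residuals(my_lm), hsb2$math - fitted(my_lm)) # residual = observed - fitted; prints TRUE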
hsb2$predicted <- predict(my_lm)   # save the predicted (fitted) values
hsb2$residuals <- residuals(my_lm) # save the residual values
ggplot(hsb2, aes(x = read, y = math)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "lightgrey") + # regression line
  geom_segment(aes(xend = read, yend = predicted), alpha = .2) + # segment from each observation to the line
  geom_point(aes(color = abs(residuals), size = abs(residuals))) + # point size mapped to residual size
  scale_color_continuous(low = "green", high = "red") + # point color mapped to residual size - green smaller, red larger
  guides(color = "none", size = "none") + # remove the color and size legends
  geom_point(aes(y = predicted), shape = 1) + # open circles mark the fitted values
  theme_bw()
The Residuals vs. Fitted plot is shown below. The residuals are, on the whole, evenly scattered around the horizontal zero line, suggesting homoscedasticity, although the spread is somewhat uneven in the left half of the plot. The red line in the Residuals vs. Fitted plot is also relatively flat, indicating that the linearity assumption is satisfied. The Scale-Location plot levels out beyond fitted values of about 55 but still displays some increasing variance on the left.
par(mfrow = c(2, 2)) # arrange the four diagnostic plots in a two-by-two grid
plot(my_lm) # Residuals vs. Fitted, Normal Q-Q, Scale-Location, and Residuals vs. Leverage
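The homoscedasticity read off the plots above can also be tested formally. The sketch below is an optional addition and assumes the lmtest package is installed (it is not loaded elsewhere in this post):

# studentized Breusch-Pagan test: H0 is constant residual variance;
# a p-value above 0.05 means we fail to reject homoscedasticity
lmtest::bptest(my_lm)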
Is the linear model the best choice?

Four assumptions must be taken into account when determining whether the linear model is acceptable. A linear regression model is predicated on the following four premises:
Linearity: X and the mean of Y have a straight-line relationship. The red line's approximately horizontal position at zero in the Residuals vs. Fitted plot suggests linearity. The first scatterplot above also demonstrates a largely linear relationship between the two variables.

Homoscedasticity: the variance of the residuals is the same for every value of X. The Scale-Location plot shows whether the residuals are spread similarly across the range of the predictor. With the exception of the extreme left of the plot, most of the points are evenly spaced around an approximately horizontal line.

Independence: observations are independent of one another. The Residuals vs. Fitted plot shows no apparent pattern in the residuals, which is consistent with independence (a formal check is sketched after this list).

Normality: Y is normally distributed for any fixed value of X. The normality assumption can be verified visually with the Q-Q plot of the residuals. In that plot the points deviate from the line only slightly, indicating that the residuals are approximately normally distributed.

The model appears to meet the requirements above and can be regarded as a suitable linear model.
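As an optional complement to the visual checks, the normality and independence assumptions can be tested formally as well. The Shapiro-Wilk test is in base R; the Durbin-Watson test again assumes the lmtest package is installed:

# Shapiro-Wilk test: H0 is that the residuals are normally distributed;
# a p-value above 0.05 means we fail to reject normality
shapiro.test(residuals(my_lm))

# Durbin-Watson test: H0 is no first-order autocorrelation in the residuals;
# a statistic near 2 with a large p-value is consistent with independence
lmtest::dwtest(my_lm)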