library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2
## Warning: package 'readr' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Warning: package 'airports' was built under R version 4.2.2
## Loading required package: cherryblossom
## Warning: package 'cherryblossom' was built under R version 4.2.2
## Loading required package: usdata
## Warning: package 'usdata' was built under R version 4.2.2
library(airports)
library(cherryblossom)
library(usdata)

FORUM DESCRIPTION

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

This discussion uses the hsb2 dataset from the openintro library, which is sourced from the UCLA Institute for Digital Research & Education - Statistical Consulting; the dataset page is linked below. The linear regression model below uses two of its variables: reading score (read) as the independent variable and math score (math) as the dependent variable. The reasoning is that reading comprehension skills are important for word problems in math.

https://www.openintro.org/data/index.php?data=hsb2

hsb2
## # A tibble: 200 × 11
##       id gender race          ses   schtyp prog   read write  math science socst
##    <int> <chr>  <chr>         <fct> <fct>  <fct> <int> <int> <int>   <int> <int>
##  1    70 male   white         low   public gene…    57    52    41      47    57
##  2   121 female white         midd… public voca…    68    59    53      63    61
##  3    86 male   white         high  public gene…    44    33    54      58    31
##  4   141 male   white         high  public voca…    63    44    47      53    56
##  5   172 male   white         midd… public acad…    47    52    57      53    61
##  6   113 male   white         midd… public acad…    44    52    51      63    61
##  7    50 male   african amer… midd… public gene…    50    59    42      53    61
##  8    11 male   hispanic      midd… public acad…    34    46    45      39    36
##  9    84 male   white         midd… public gene…    63    57    54      58    51
## 10    48 male   african amer… midd… public acad…    57    55    52      50    51
## # … with 190 more rows
plot(hsb2$read, hsb2$math)

my_lm <- lm(hsb2$math ~ hsb2$read)
summary(my_lm)
## 
## Call:
## lm(formula = hsb2$math ~ hsb2$read)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.1624  -5.1624  -0.4135   4.7775  16.4684 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 21.03816    2.58945   8.125 4.76e-14 ***
## hsb2$read    0.60515    0.04865  12.438  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.037 on 198 degrees of freedom
## Multiple R-squared:  0.4386, Adjusted R-squared:  0.4358 
## F-statistic: 154.7 on 1 and 198 DF,  p-value: < 2.2e-16
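
To make the fitted coefficients concrete, the short sketch below plugs the intercept and slope from the summary output into the regression equation math = 21.04 + 0.605 * read. The helper name predict_math is hypothetical and used only for illustration; the coefficient values are copied from the output above.

```r
# Coefficients copied from the summary() output above
b0 <- 21.03816  # intercept
b1 <- 0.60515   # slope for read

# Hypothetical helper: predicted math score for a given reading score
predict_math <- function(read) {
  b0 + b1 * read
}

# Predicted math scores for reading scores of 40, 50 and 60
predict_math(c(40, 50, 60))
```

Each additional point in reading score is associated with an increase of about 0.6 points in the predicted math score. With the model object available, predict(my_lm) returns the same kind of fitted values for the observed data.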

Residual Analysis

The residual of an observation is the difference between its observed value and the value predicted by the model (observed minus predicted). The residuals show how well the model fits the data, i.e., how large or small the errors are. In the plot below, color and point size both encode the absolute size of the residual (red, larger points denote bigger residuals, while green, smaller points denote smaller ones).
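
As a small worked example of this definition, the sketch below computes residuals by hand for the first three students printed in the tibble above, using the coefficients from the summary output. The data frame name first_rows is hypothetical; the values are copied from the output, not recomputed from the full dataset.

```r
# Coefficients copied from the summary() output above
b0 <- 21.03816  # intercept
b1 <- 0.60515   # slope for read

# First three rows of hsb2 as printed above (read and math columns only)
first_rows <- data.frame(read = c(57, 68, 44), math = c(41, 53, 54))

# Residual = observed math score minus predicted math score
first_rows$predicted <- b0 + b1 * first_rows$read
first_rows$residual  <- first_rows$math - first_rows$predicted
first_rows
```

The first student (read = 57, math = 41) has a residual of about -14.5, meaning the model over-predicts that student's math score; a positive residual means the model under-predicts.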

hsb2$predicted <- predict(my_lm)   # Save the predicted values
hsb2$residuals <- residuals(my_lm) # Save the residual values
ggplot(hsb2, aes(x = read, y = math)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +     # regression line
  geom_segment(aes(xend = read, yend = predicted), alpha = .2) +    # line from each point to its fitted value
  geom_point(aes(color = abs(residuals), size = abs(residuals))) +  # observed points; color and size mapped to |residual|
  scale_color_continuous(low = "green", high = "red") +             # green = smaller residual, red = larger
  guides(color = "none", size = "none") +                           # remove the color and size legends
  geom_point(aes(y = predicted), shape = 1) +                       # fitted values as open circles
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'

The diagnostic plots, starting with Residuals vs. Fitted, are shown below. In the Residuals vs. Fitted plot, the residuals are spread roughly evenly around zero, apart from the left half of the graph, so the model looks generally homoscedastic. The red line in that plot is also comparatively flat, which indicates that the linearity assumption is met. In the Scale-Location plot, the trend appears to level out after a fitted value of about 55, but it still displays some growing variance at the left.

# par(mfrow = c(2, 2))  # uncomment to arrange the four plots in two rows and two columns
plot(my_lm)
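
To show what the Scale-Location panel actually plots, the sketch below rebuilds an approximation of it by hand: the square root of the absolute standardized residuals against the fitted values, with a lowess smoother. It refits the same model under the hypothetical name fit so that the block runs on its own, and it assumes the openintro package is installed; the exact smoother settings used by plot(my_lm) may differ slightly.

```r
library(openintro)  # provides the hsb2 data

# Same model as my_lm, refit here so the block is self-contained
fit <- lm(math ~ read, data = hsb2)

fitted_vals <- fitted(fit)
scale_loc   <- sqrt(abs(rstandard(fit)))  # sqrt(|standardized residuals|)

plot(fitted_vals, scale_loc,
     xlab = "Fitted values",
     ylab = "sqrt(|standardized residuals|)",
     main = "Scale-Location (manual sketch)")
lines(lowess(fitted_vals, scale_loc), col = "red")  # trend in the spread
```

A roughly flat red line suggests constant variance, while a clear upward or downward trend would point to heteroscedasticity.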

Is the linear model the best choice?

A linear regression model rests on four assumptions. Each is outlined below and checked against the diagnostic plots to decide whether the linear model is acceptable.

Linearity: X and the mean of Y have a straight-line relationship. In the Residuals vs. Fitted plot, the red line stays approximately horizontal near zero, which suggests linearity. The first scatterplot above also shows a largely linear relationship between the two variables.

Homoscedasticity: For every value of X, the variance of the residuals is the same. The Scale-Location plot shows whether the residuals are spread evenly across the range of fitted values. With the exception of the extreme left of the graph, most of the points are evenly spread around an approximately horizontal line.

Independence: Observations are independent of one another. This depends mainly on how the data were collected and cannot be fully verified from a plot, but the Residuals vs. Fitted plot shows no apparent pattern or trend in the residuals, which is consistent with independence.

Normality: For any fixed value of X, Y is normally distributed. This assumption can be checked visually with the QQ plot of the residuals. The points deviate only slightly from the reference line, indicating that the residuals are roughly normally distributed.

The model appears to meet these four assumptions, so a linear model can be regarded as appropriate for these data.
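
The visual checks above can be complemented with formal tests. The sketch below is one possible approach rather than part of the original analysis: it refits the model under the hypothetical name fit, runs a Shapiro-Wilk test on the residuals for normality, and, only if the optional lmtest package happens to be installed, a Breusch-Pagan test for constant variance.

```r
library(openintro)  # provides the hsb2 data

# Same model as my_lm, refit here so the block is self-contained
fit <- lm(math ~ read, data = hsb2)

# Normality of residuals: a small p-value suggests a departure from normality
sw <- shapiro.test(residuals(fit))
print(sw)

# Constant variance: Breusch-Pagan test, run only if lmtest is available
if (requireNamespace("lmtest", quietly = TRUE)) {
  print(lmtest::bptest(fit))
}
```

With 200 observations, such tests can flag small departures that matter little in practice, so they are best read alongside the plots rather than in place of them.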