Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
library(tidyverse)
data("USArrests")
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
# There is a weak positive correlation between urban population and assault arrests in the USArrests data.
# Plot UrbanPop against Assault so the axes match the model fitted below.
USArrests %>%
  ggplot(aes(Assault, UrbanPop)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
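The strength of the association seen in the scatterplot can also be quantified directly. The following is a minimal sketch using base R's `cor.test()`; the object name `ct` is only illustrative.

```r
# Pearson correlation between the two variables used in the regression.
data("USArrests")
ct <- cor.test(USArrests$Assault, USArrests$UrbanPop)

ct$estimate    # sample correlation coefficient r
ct$p.value     # test of H0: true correlation is zero
ct$estimate^2  # in simple linear regression this equals Multiple R-squared
```

With a single predictor, the squared correlation and the p-value should agree with the R-squared and slope p-value reported by `summary()` for the linear model.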
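# In the summary below, the min/max and 1Q/3Q residual magnitudes are roughly symmetric, but the median (2.023) sits above zero, which hints at mild skew.
# The fit itself is weak: R-squared is about 0.067 and the Assault slope is not significant at the 5% level (p = 0.0695).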
USArrests_lm <- lm(UrbanPop ~ Assault, data = USArrests)
summary(USArrests_lm)
##
## Call:
## lm(formula = UrbanPop ~ Assault, data = USArrests)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -28.020  -9.637   2.023  10.567  23.989
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.86213    4.59228  12.600   <2e-16 ***
## Assault      0.04496    0.02422   1.857   0.0695 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.13 on 48 degrees of freedom
## Multiple R-squared: 0.06701, Adjusted R-squared: 0.04758
## F-statistic: 3.448 on 1 and 48 DF, p-value: 0.06948
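One way to read the slope estimate alongside its p-value is through a confidence interval. This is a short sketch using base R's `confint()`; the model is refit here only so the snippet runs on its own, and `ci` is an illustrative name.

```r
# Refit the same model so this snippet is self-contained.
data("USArrests")
USArrests_lm <- lm(UrbanPop ~ Assault, data = USArrests)

# 95% confidence intervals for the intercept and the Assault slope.
ci <- confint(USArrests_lm, level = 0.95)
ci

# Because the slope p-value (0.0695) is above 0.05, the 95% interval
# for Assault includes zero.
ci["Assault", ]
```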
# The spread of the residuals does not seem to be uniform across the fitted values (possible heteroscedasticity).
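# Build a data frame of fitted values and residuals instead of piping the lm object.
# A zero reference line replaces the lm smoother, which is always flat for OLS residuals.
tibble(fitted = fitted(USArrests_lm), resid = resid(USArrests_lm)) %>%
  ggplot(aes(fitted, resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residual Analysis",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal()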
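The visual impression of non-uniform variance can be backed up with a formal check. The sketch below hand-computes a studentized Breusch-Pagan style statistic in base R, which is an added technique and not part of the original analysis. It regresses the squared residuals on the predictor and compares n times the auxiliary R-squared to a chi-squared distribution with 1 degree of freedom. The names `aux`, `bp_stat`, and `bp_p` are illustrative.

```r
# Refit the same model so this snippet is self-contained.
data("USArrests")
USArrests_lm <- lm(UrbanPop ~ Assault, data = USArrests)

# Auxiliary regression: squared residuals on the predictor.
aux_data <- data.frame(sq_resid = resid(USArrests_lm)^2,
                       Assault  = USArrests$Assault)
aux <- lm(sq_resid ~ Assault, data = aux_data)

# Test statistic: n * R-squared of the auxiliary regression.
bp_stat <- nrow(aux_data) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p.value = bp_p)
```

A small p-value would support the reading that the residual variance changes with Assault; a large one would suggest the plot alone is not strong evidence of heteroscedasticity.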
# The residuals drift away from the reference line in the right tail, and there seem to be some outliers.
# Put the residuals in a data frame so ggplot receives tabular data.
tibble(resid = resid(USArrests_lm)) %>%
  ggplot(aes(sample = resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot") +
  theme_minimal()
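The Q-Q plot can be complemented with a numeric normality check. This is a minimal sketch using base R's `shapiro.test()` on the model residuals; it is an added check, and `sw` is an illustrative name.

```r
# Refit the same model so this snippet is self-contained.
data("USArrests")
USArrests_lm <- lm(UrbanPop ~ Assault, data = USArrests)

# Shapiro-Wilk test: H0 is that the residuals come from a normal distribution.
sw <- shapiro.test(resid(USArrests_lm))
sw$statistic  # W, values near 1 are consistent with normality
sw$p.value    # a small p-value suggests departure from normality
```

With only 50 observations the test has limited power, so it is best read together with the Q-Q plot and not as a replacement for it.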
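Overall, I would say the linear model does not adequately explain the data: Assault accounts for only about 6.7% of the variance in UrbanPop, the residual spread does not look uniform, and the Q-Q plot departs from the reference line in the right tail.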