Discussion 11

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

library(tidyverse)
data("USArrests")
head(USArrests)

##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

# There appears to be a weak positive association between urban population and assault arrests across US states.
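# Assault goes on the x-axis so the fitted line matches the model below (UrbanPop ~ Assault).
USArrests %>% 
  ggplot(aes(Assault, UrbanPop)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)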
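
The strength of the association seen in the scatter plot can also be put into numbers. The following is a minimal sketch using base R's `cor.test()`; the variable names `ct` and `r` are only illustrative. For a regression with a single predictor, the squared correlation should agree with the Multiple R-squared reported by `summary()`, and the p-value should agree with the slope's p-value.

```r
# Pearson correlation between the two variables used in the model.
data("USArrests")

ct <- cor.test(USArrests$Assault, USArrests$UrbanPop)

r <- unname(ct$estimate)  # sample correlation coefficient
r                         # sign and strength of the linear association
r^2                       # equals Multiple R-squared in a one-predictor model
ct$p.value                # equals the p-value of the slope's t-test
```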

# The min/max and 1Q/3Q residual magnitudes are roughly balanced, but the median (2.023) sits away from zero, which hints at a skewed residual distribution. That is not a sign of a good model.
USArrests_lm <- lm(UrbanPop ~ Assault, data = USArrests)
summary(USArrests_lm)

## 
## Call:
## lm(formula = UrbanPop ~ Assault, data = USArrests)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.020  -9.637   2.023  10.567  23.989 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 57.86213    4.59228  12.600   <2e-16 ***
## Assault      0.04496    0.02422   1.857   0.0695 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.13 on 48 degrees of freedom
## Multiple R-squared:  0.06701,    Adjusted R-squared:  0.04758 
## F-statistic: 3.448 on 1 and 48 DF,  p-value: 0.06948
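
One way to read the slope estimate above is through its confidence interval. This is a small sketch, assuming the same model formula as in the summary; the names `fit` and `ci` are illustrative and not part of the original analysis. Because the slope's p-value (0.0695) is above 0.05, the 95% interval should include zero.

```r
# Refit the same model so this snippet runs on its own.
data("USArrests")
fit <- lm(UrbanPop ~ Assault, data = USArrests)

# 95% confidence intervals for the intercept and slope.
ci <- confint(fit, level = 0.95)
ci["Assault", ]  # lower and upper bound for the Assault slope
```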

# The variance of the residuals does not appear to be constant across the fitted values.
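# Put fitted values and residuals in a data frame before plotting.
# A least-squares line through OLS residuals is always flat at zero,
# so a dashed reference line at zero is drawn instead.
tibble(fitted_value = fitted(USArrests_lm), residual = resid(USArrests_lm)) %>% 
  ggplot(aes(fitted_value, residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residual Analysis",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal()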
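
The visual impression of non-constant variance could be backed up with a rough numeric check. The sketch below hand-rolls a Breusch-Pagan style test (Koenker's studentized form) in base R by regressing the squared residuals on the fitted values. This test is not part of the original analysis, and the names `fit`, `aux`, `bp_stat`, and `bp_p` are illustrative. A small p-value would point toward heteroscedasticity.

```r
# Refit the same model so this snippet runs on its own.
data("USArrests")
fit <- lm(UrbanPop ~ Assault, data = USArrests)

# Auxiliary regression: squared residuals on fitted values.
sq_res <- resid(fit)^2
fv <- fitted(fit)
aux <- lm(sq_res ~ fv)

# Test statistic n * R^2, compared against a chi-squared
# distribution with 1 degree of freedom (one regressor).
bp_stat <- nobs(fit) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p_value = bp_p)
```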

# The residuals drift away from the reference line in the right tail, and there appear to be some outliers.
tibble(residual = resid(USArrests_lm)) %>% 
  ggplot(aes(sample = residual)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot") +
  theme_minimal()
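
The Q-Q plot reading can be complemented with two quick base R checks. This is a sketch rather than part of the original analysis: `shapiro.test()` tests the residuals for normality, and `rstandard()` gives standardized residuals. The cutoff of 2 used to flag possible outliers is a common rule of thumb and an assumption here, as are the names `fit`, `sw`, and `std_res`.

```r
# Refit the same model so this snippet runs on its own.
data("USArrests")
fit <- lm(UrbanPop ~ Assault, data = USArrests)

# Shapiro-Wilk test: a small p-value suggests non-normal residuals.
sw <- shapiro.test(resid(fit))
sw$p.value

# Standardized residuals; list states beyond the +/- 2 rule of thumb.
std_res <- rstandard(fit)
std_res[abs(std_res) > 2]
```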

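Overall, I would say the linear model does not adequately explain the data: the slope for Assault is not significant at the 5% level (p = 0.0695), the model explains under 7% of the variance in UrbanPop (R-squared = 0.067), and the residual plots suggest non-constant variance and a departure from normality in the right tail.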

Saayed Alam

April 14, 2019