DATA607_Discussion

Discussion 11

Using R, build a regression model for data that interests you. Conduct residual analysis.

data("USArrests")

# Fit the linear regression model with Murder as the response variable
model <- lm(Murder ~ Assault + UrbanPop + Rape, data=USArrests)

summary(model)

## 
## Call:
## lm(formula = Murder ~ Assault + UrbanPop + Rape, data = USArrests)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3990 -1.9127 -0.3444  1.2557  7.4279 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.276639   1.737997   1.885   0.0657 .  
## Assault      0.039777   0.005912   6.729 2.33e-08 ***
## UrbanPop    -0.054694   0.027880  -1.962   0.0559 .  
## Rape         0.061399   0.055740   1.102   0.2764    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.574 on 46 degrees of freedom
## Multiple R-squared:  0.6721, Adjusted R-squared:  0.6507 
## F-statistic: 31.42 on 3 and 46 DF,  p-value: 3.322e-11
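
The fit statistics in the summary can be reproduced by hand, which shows what they measure. This is a minimal sketch that refits the same model; the names `rss`, `tss`, `rse`, `r2`, and `adj_r2` are illustrative and not part of the original post.

```r
data("USArrests")
model <- lm(Murder ~ Assault + UrbanPop + Rape, data = USArrests)

res <- residuals(model)
rss <- sum(res^2)                                           # residual sum of squares
tss <- sum((USArrests$Murder - mean(USArrests$Murder))^2)   # total sum of squares

n <- nrow(USArrests)   # 50 states
p <- 3                 # number of predictors

rse    <- sqrt(rss / df.residual(model))            # residual standard error
r2     <- 1 - rss / tss                             # multiple R-squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)      # adjusted R-squared

round(c(rse = rse, r2 = r2, adj_r2 = adj_r2), 4)
```

The three values should agree with the residual standard error, multiple R-squared, and adjusted R-squared printed by `summary(model)`.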

par(mfrow=c(2, 2))
plot(model)

Was the linear model appropriate? Why or why not?

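Residuals vs Fitted: The smoothed line shows a slight curve instead of staying flat around zero, which points to possible non-linearity.
Normal Q-Q: The points follow the reference line through the middle but drift away at both ends, indicating heavier tails than a normal distribution or a few outlying states.
Scale-Location: The spread of the standardized residuals is larger for mid-range fitted values, suggesting non-constant variance (heteroscedasticity).
Residuals vs Leverage: A few points fall outside the Cook’s distance lines and may be influential observations.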
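
The visual readings above can be backed up with numeric checks. The following is a rough sketch using base R only. The 4/n cutoff for Cook's distance is a common rule of thumb, not a strict threshold, and the heteroscedasticity check is a hand-rolled Breusch-Pagan style test (studentized form), not a call to a dedicated package. The names `sw`, `cd`, `influential`, `aux`, `bp_stat`, and `bp_p` are illustrative.

```r
data("USArrests")
model <- lm(Murder ~ Assault + UrbanPop + Rape, data = USArrests)

# Normality: Shapiro-Wilk test on the residuals
# (a small p-value suggests the residuals are not normal)
sw <- shapiro.test(residuals(model))

# Influence: Cook's distance, flagged with the 4/n rule of thumb (an assumption)
cd <- cooks.distance(model)
influential <- names(cd)[cd > 4 / nrow(USArrests)]

# Heteroscedasticity: regress squared residuals on the predictors;
# n * R^2 of that auxiliary fit is compared to a chi-squared with 3 df
aux_data <- transform(USArrests, sq_resid = residuals(model)^2)
aux <- lm(sq_resid ~ Assault + UrbanPop + Rape, data = aux_data)
bp_stat <- nrow(aux_data) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 3, lower.tail = FALSE)

print(sw)
print(influential)
print(c(statistic = bp_stat, p.value = bp_p))
```

A small `bp_p` would support the heteroscedasticity reading of the Scale-Location plot, and the states listed in `influential` are the ones worth inspecting individually.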

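In conclusion, the model explains about 67% of the variance in Murder (adjusted R-squared 0.65), with Assault the only predictor significant at the 5% level, but the diagnostic plots show signs of non-linearity, potential outliers, and heteroscedasticity. A plain linear model may therefore not be the best choice for these data.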
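
One possible way to probe the non-linearity concern is to add a curvature term and compare the nested models. This is only a sketch: choosing a quadratic term in `Assault` is an assumption made for illustration, not something the analysis above establishes, and `linear_fit` and `quad_fit` are illustrative names.

```r
data("USArrests")

# Original specification
linear_fit <- lm(Murder ~ Assault + UrbanPop + Rape, data = USArrests)

# Same model plus a squared Assault term (hypothetical choice of curvature term)
quad_fit <- update(linear_fit, . ~ . + I(Assault^2))

# F-test for whether the extra term improves the fit
anova(linear_fit, quad_fit)

# Re-check the diagnostic plots for the extended model
par(mfrow = c(2, 2))
plot(quad_fit)
```

If the F-test is not significant and the Residuals vs Fitted curve persists, a transformation of the response or a different set of predictors may be worth trying instead.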

DATA607_Discussion_11

Haig Bedros

2024-04-03
