library(ggplot2)
library(tidyverse)
data(USArrests)
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
To visualise the distribution of the 4 variables, a normal Q-Q plot was drawn for each.
par(mfrow = c(2, 2))
qqnorm(USArrests$UrbanPop,datax=TRUE,bty="l",pch=19,col="RED",
main="Urban Pop")
qqline(USArrests$UrbanPop,datax=TRUE,lty=2)
qqnorm(USArrests$Assault,datax=TRUE,bty="l",pch=19,col="RED",
main="Assault")
qqline(USArrests$Assault,datax=TRUE,lty=2)
qqnorm(USArrests$Murder,datax=TRUE,bty="l",pch=19,col="RED",
main="Murder")
qqline(USArrests$Murder,datax=TRUE,lty=2)
qqnorm(USArrests$Rape,datax=TRUE,bty="l",pch=19,col="RED",
main="Rape")
qqline(USArrests$Rape,datax=TRUE,lty=2)
par(mfrow = c(1, 1))
Based on the Q-Q plots above, all 4 variables are approximately normally distributed. Therefore, a Pearson's correlation test, cor.test(), is appropriate for this study.
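As a minimal sketch of the Pearson test mentioned above, the call below correlates UrbanPop with Assault. The choice of this pair is an assumption, made because it is the pair modelled in the regression that follows; the object name `pearson` is illustrative only.

```r
# Pearson correlation test between urban population (%) and assault arrests.
# The variable pair is an assumption, chosen to match the regression below.
data(USArrests)
pearson <- cor.test(USArrests$UrbanPop, USArrests$Assault, method = "pearson")

pearson$estimate  # sample correlation coefficient r
pearson$p.value   # two-sided p-value for H0: true correlation is 0
pearson$conf.int  # 95% confidence interval for the correlation
```

For a simple linear regression with one predictor, this test and the t-test on the slope give the same p-value, so the two analyses should agree.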
A linear regression was used to predict the assault rate based on the urban population.
linearmodel <- lm(Assault ~ UrbanPop, data = USArrests)
summary(linearmodel)
##
## Call:
## lm(formula = Assault ~ UrbanPop, data = USArrests)
##
## Residuals:
## Min 1Q Median 3Q Max
## -150.78 -61.85 -18.68 58.05 196.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.0766 53.8508 1.357 0.1811
## UrbanPop 1.4904 0.8027 1.857 0.0695 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81.33 on 48 degrees of freedom
## Multiple R-squared: 0.06701, Adjusted R-squared: 0.04758
## F-statistic: 3.448 on 1 and 48 DF, p-value: 0.06948
The results show a low multiple R-squared value (0.067) and adjusted R-squared value (0.048), meaning urban population explains only about 7% of the variation in assault rates across the states. The p-value for the UrbanPop coefficient is 0.069, so the relationship is not statistically significant at the 5% level.
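One way to cross-check this reading is to look at the confidence interval for the slope. The sketch below refits the same model so that it runs on its own; `slope_ci` is an illustrative name, and the 95% level is assumed to match the 5% significance threshold.

```r
# Refit the model from above so this snippet is self-contained.
data(USArrests)
linearmodel <- lm(Assault ~ UrbanPop, data = USArrests)

# 95% confidence intervals for the intercept and slope.
slope_ci <- confint(linearmodel, level = 0.95)
slope_ci

# Proportion of variance in Assault explained by UrbanPop.
summary(linearmodel)$r.squared
```

If the interval for UrbanPop contains zero, that is consistent with a slope p-value above 0.05.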
# Plot the linear model
ggplot(USArrests, aes(x = UrbanPop, y = Assault)) +
geom_point(color = "black") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(title = "Linear Model: Assault ~ UrbanPop",
x = "Urban Population (%)",
y = "Assault rate")
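The points are widely scattered around the fitted line. The shaded band is the 95% confidence interval for the fitted mean, not a prediction interval, so individual points are expected to fall outside it. The wide scatter, consistent with the low R-squared value, indicates that urban population is not a strong predictor.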
pred_pop <- predict(linearmodel, newdata = data.frame(UrbanPop = 70))
The predicted assault rate for a state with a 70% urban population is 177.41 arrests per 100,000 residents.
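A point prediction on its own does not show how uncertain it is. As a hedged sketch, the code below asks predict() for interval estimates at the same value; `new_state`, `ci_mean` and `pi_single` are illustrative names, and the 95% level is an assumption.

```r
# Refit the model from above so this snippet is self-contained.
data(USArrests)
linearmodel <- lm(Assault ~ UrbanPop, data = USArrests)
new_state <- data.frame(UrbanPop = 70)

# Confidence interval: uncertainty in the mean assault rate at UrbanPop = 70.
ci_mean <- predict(linearmodel, newdata = new_state,
                   interval = "confidence", level = 0.95)

# Prediction interval: uncertainty for a single state at UrbanPop = 70.
pi_single <- predict(linearmodel, newdata = new_state,
                     interval = "prediction", level = 0.95)

ci_mean
pi_single
```

The prediction interval is always wider than the confidence interval because it also accounts for the residual scatter of individual states around the line.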