Group 4 US Arrests Dataset

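library(ggplot2)    # plotting (also attached by tidyverse)
library(tidyverse)
data(USArrests)     # built-in dataset: arrests per 100,000 residents by US state, plus % urban population
head(USArrests)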
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

Question 1: Which test is appropriate for checking correlation?

To check whether the four variables are approximately normally distributed, a normal Q-Q plot was drawn for each one.

# Draw one normal Q-Q plot per variable in a 2 x 2 grid
par(mfrow = c(2, 2))
vars <- c("Urban Pop" = "UrbanPop", "Assault" = "Assault",
          "Murder" = "Murder", "Rape" = "Rape")
for (title in names(vars)) {
  x <- USArrests[[vars[[title]]]]
  qqnorm(x, datax = TRUE, bty = "l", pch = 19, col = "red", main = title)
  qqline(x, datax = TRUE, lty = 2)  # dashed reference line
}

# Restore the single-panel layout
par(mfrow = c(1, 1))

Based on the Q-Q plots above, all four variables are approximately normally distributed. Pearson’s correlation test, run with cor.test(), is therefore appropriate for this study.
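The visual impression from the Q-Q plots can be complemented with a numerical check. The sketch below is an addition, not part of the original analysis: it assumes the Shapiro-Wilk test (shapiro.test() in base R) is an acceptable normality check for this data, and the name normality_p is introduced only for illustration.

```r
# Assumption: Shapiro-Wilk is used here as a supplementary normality check.
# Null hypothesis of the test: the variable is normally distributed.
data(USArrests)

# One p-value per column (Murder, Assault, UrbanPop, Rape)
normality_p <- sapply(USArrests, function(x) shapiro.test(x)$p.value)
round(normality_p, 3)
```

A p-value below 0.05 would count as evidence against normality for that variable. In that case a rank-based alternative such as cor.test(..., method = "spearman") could be considered instead of Pearson's test.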

Question 2: Is murder rate correlated with assault rate?

cor.test(USArrests$Murder, USArrests$Assault)
## 
##  Pearson's product-moment correlation
## 
## data:  USArrests$Murder and USArrests$Assault
## t = 9.2981, df = 48, p-value = 2.596e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6739512 0.8831110
## sample estimates:
##       cor 
## 0.8018733

The correlation between murder rate and assault rate is r = 0.802 (95% confidence interval 0.674 to 0.883), which indicates a strong positive correlation. The p-value (2.596e-12) is far below 0.05, so the correlation is statistically significant. Overall, states with high murder rates also tend to have high assault rates.
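To make the link between the correlation coefficient and the reported t statistic explicit, the following sketch recomputes the test by hand using the standard formula t = r * sqrt(n - 2) / sqrt(1 - r^2). The variable names (r, n, t_stat, p_val) are illustrative and do not appear in the original analysis.

```r
data(USArrests)

r <- cor(USArrests$Murder, USArrests$Assault)   # Pearson's r
n <- nrow(USArrests)                            # number of states

# t statistic for H0: true correlation = 0, on n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_val <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

c(r = r, t = t_stat, df = n - 2, p = p_val)
```

These values should agree with the t, df and p-value lines in the cor.test() output above.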

Question 3: How can assault rate be predicted from urban population?

A simple linear regression was fitted with assault rate as the response and urban population as the predictor.

linearmodel <- lm(Assault ~ UrbanPop, data = USArrests) 
summary(linearmodel)  
## 
## Call:
## lm(formula = Assault ~ UrbanPop, data = USArrests)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -150.78  -61.85  -18.68   58.05  196.85 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  73.0766    53.8508   1.357   0.1811  
## UrbanPop      1.4904     0.8027   1.857   0.0695 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81.33 on 48 degrees of freedom
## Multiple R-squared:  0.06701,    Adjusted R-squared:  0.04758 
## F-statistic: 3.448 on 1 and 48 DF,  p-value: 0.06948

The model has a low multiple R-squared (0.067) and adjusted R-squared (0.048), which means urban population explains only about 7% of the variation in assault rates across the states. The slope for UrbanPop (1.49) has a p-value of 0.069, so the relationship is not statistically significant at the 0.05 level.
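In a regression with a single predictor, the multiple R-squared equals the squared Pearson correlation between predictor and response. The sketch below checks this for the model above; r and r_squared are illustrative names, and the model is refitted so the block runs on its own.

```r
data(USArrests)
linearmodel <- lm(Assault ~ UrbanPop, data = USArrests)

r <- cor(USArrests$UrbanPop, USArrests$Assault)   # Pearson's r
r_squared <- summary(linearmodel)$r.squared       # multiple R-squared from lm

c(r = r, r_squared_from_cor = r^2, r_squared_from_lm = r_squared)
all.equal(r^2, r_squared)   # TRUE when the two agree up to numerical tolerance
```

A weak correlation between urban population and assault rate therefore translates directly into the low R-squared reported by summary().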

# Plot the linear model 
ggplot(USArrests, aes(x = UrbanPop, y = Assault)) +
  geom_point(color = "black") +  
  geom_smooth(method = "lm", se = TRUE, color = "red") + 
  labs(title = "Linear Model: Assault ~ UrbanPop",
       x = "Urban Population (%)",
       y = "Assault rate")

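The points are widely scattered around the fitted line, and several lie far from it. Many points also fall outside the shaded band, although that band is a confidence band for the fitted mean rather than a range for individual states, so this alone is expected. Together with the low R-squared, the wide scatter shows that urban population is not a strong predictor of assault rate.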
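The difference between a confidence band (uncertainty in the fitted mean) and a prediction band (range for individual observations) can be checked numerically. The following is a sketch added for illustration, not part of the original analysis; conf_band, pred_band and the helper outside() are hypothetical names.

```r
data(USArrests)
linearmodel <- lm(Assault ~ UrbanPop, data = USArrests)

# 95% bands evaluated at the observed UrbanPop values
conf_band <- predict(linearmodel, newdata = USArrests,
                     interval = "confidence", level = 0.95)
pred_band <- predict(linearmodel, newdata = USArrests,
                     interval = "prediction", level = 0.95)

# Hypothetical helper: TRUE where the observed assault rate lies outside a band
outside <- function(band) {
  USArrests$Assault < band[, "lwr"] | USArrests$Assault > band[, "upr"]
}

# Number of states outside each band
c(outside_confidence = sum(outside(conf_band)),
  outside_prediction = sum(outside(pred_band)))
```

The prediction band is always wider than the confidence band, so fewer points fall outside it. The width of the prediction band is the more direct measure of how imprecise the model is for an individual state.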

Example

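pred_pop <- predict(linearmodel, newdata = data.frame(UrbanPop = 70))
pred_pop  # print the predicted assault rate

At an urban population of 70%, the model predicts an assault rate of 177.41 arrests per 100,000 residents. Given the low R-squared, this prediction carries considerable uncertainty.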
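The prediction is simply the fitted line evaluated at UrbanPop = 70, that is, intercept + slope * 70. The sketch below reproduces it from the model coefficients and compares it with predict(); b, pred_manual and pred_predict are illustrative names, and the model is refitted so the block is self-contained.

```r
data(USArrests)
linearmodel <- lm(Assault ~ UrbanPop, data = USArrests)

b <- coef(linearmodel)   # named vector: "(Intercept)" and "UrbanPop"

# Evaluate the fitted line by hand at UrbanPop = 70
pred_manual <- unname(b["(Intercept)"] + b["UrbanPop"] * 70)

# Same prediction via predict()
pred_predict <- unname(predict(linearmodel,
                               newdata = data.frame(UrbanPop = 70)))

round(c(manual = pred_manual, predict = pred_predict), 2)
```

Both values should match each other and the figure reported above.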