I conducted a statewide statistical analysis of socioeconomic factors associated with county-level crime rates in Florida. I used regression modeling to examine how well income, high school graduation rate, and urban population percentage predict crime rates across Florida's 67 counties.
The best model I found for predicting crime rates was a simple linear regression with urban population percentage as the sole predictor, since urban population percentage was the predictor most strongly related to crime rate. The model explains about 46% of the variance in crime rates across Florida counties. This suggests that the police department should focus more of its resources and attention on urban centers.
A limitation of this analysis is that many other variables that may relate to crime rates are not captured in my dataset. For example, population density, unemployment rate, and ethnic composition may contribute to crime rates.
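# One set of packages that provides the functions used below:
# readxl (read_excel), tidyverse (dplyr, stringr, ggplot2, tibble),
# mosaic (favstats), Hmisc (rcorr), and ggcorrplot.
library(readxl)
library(tidyverse)
library(mosaic)
library(Hmisc)
library(ggcorrplot)
Florida_County_Crime_Rates <- read_excel("Florida County Crime Rates.xlsx")
View(Florida_County_Crime_Rates)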
florida_data <- Florida_County_Crime_Rates %>%
  rename(
    Crime = C,
    Income = I,
    HighSchoolGrad = HS,
    UrbanPop = U
  )
florida_data <- florida_data %>%
  mutate(County = str_to_title(County))
# Descriptives
favstats(florida_data$Income)
## min Q1 median Q3 max mean sd n missing
## 15.4 21.05 24.6 28.15 35.6 24.51045 4.682758 67 0
Income_Range = max(florida_data$Income) - min(florida_data$Income)
favstats(florida_data$Crime)
## min Q1 median Q3 max mean sd n missing
## 0 35.5 52 69 128 52.40299 28.19363 67 0
Crime_Range = max(florida_data$Crime) - min(florida_data$Crime)
favstats(florida_data$HighSchoolGrad)
## min Q1 median Q3 max mean sd n missing
## 54.5 62.45 69 76.9 84.9 69.48955 8.858776 67 0
Grad_Range = max(florida_data$HighSchoolGrad) - min(florida_data$HighSchoolGrad)
favstats(florida_data$UrbanPop)
## min Q1 median Q3 max mean sd n missing
## 0 21.6 44.6 83.55 99.6 49.55821 33.96901 67 0
UrbanPop_Range = max(florida_data$UrbanPop) - min(florida_data$UrbanPop)
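The range variables above are assigned but never printed. As a quick check, the ranges can be recovered from the minimum and maximum values in the favstats() output. This is only a sketch: the data frame name `descriptives` is introduced here for illustration and is not part of the original analysis.

```r
# Minimum and maximum of each variable, copied from the favstats() output above.
descriptives <- data.frame(
  variable = c("Income", "Crime", "HighSchoolGrad", "UrbanPop"),
  min = c(15.4, 0, 54.5, 0),
  max = c(35.6, 128, 84.9, 99.6)
)

# Range = max - min, matching the *_Range variables computed above.
descriptives$range <- descriptives$max - descriptives$min
print(descriptives)
```

This gives ranges of 20.2 for Income, 128 for Crime, 30.4 for HighSchoolGrad, and 99.6 for UrbanPop.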
ggplot(florida_data, aes(x = HighSchoolGrad, y = Crime)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  labs(
    title = "Crime Rate by High School Graduation Rate",
    x = "High School Graduation Rate (%)",
    y = "Crime Rate (per 1,000 residents)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Counterintuitively, high school graduation rate appears to be positively correlated with crime rate: counties with higher graduation rates tend to have higher crime rates.
ggplot(florida_data, aes(x = Income, y = Crime)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  labs(
    title = "Crime Rate by Income",
    x = "Median Income (in thousands)",
    y = "Crime Rate (per 1,000 residents)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Similarly, income appears to be positively correlated with crime rate: counties with higher median incomes tend to have higher crime rates, which is also counterintuitive. Let’s look at the correlations among these variables to see if there’s more to the picture.
florida_variables <- florida_data %>%
  select(Crime, Income, HighSchoolGrad, UrbanPop)
cor_matrix <- rcorr(as.matrix(florida_variables))
cor_matrix
##                Crime Income HighSchoolGrad UrbanPop
## Crime           1.00   0.43           0.47     0.68
## Income          0.43   1.00           0.79     0.73
## HighSchoolGrad  0.47   0.79           1.00     0.79
## UrbanPop        0.68   0.73           0.79     1.00
##
## n= 67
##
##
## P
##                Crime Income HighSchoolGrad UrbanPop
## Crime                2e-04  0e+00          0e+00
## Income         2e-04        0e+00          0e+00
## HighSchoolGrad 0e+00 0e+00                 0e+00
## UrbanPop       0e+00 0e+00  0e+00
ggcorrplot(
  cor_matrix$r,
  type = "lower",
  lab = TRUE
)
The variable with the strongest relationship to Crime is UrbanPop (r = 0.68). All relationships in this matrix are positive and statistically significant. The weakest relationships, Crime-Income (r = 0.43) and Crime-HighSchoolGrad (r = 0.47), are still moderate. The remaining correlations are strong: Crime-UrbanPop (0.68), Income-HighSchoolGrad (0.79), Income-UrbanPop (0.73), and HighSchoolGrad-UrbanPop (0.79).
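One way to probe the counterintuitive positive correlations is a first-order partial correlation, which asks how Crime relates to Income or HighSchoolGrad once UrbanPop is held fixed. The sketch below applies the standard formula to the rounded values in the correlation matrix above, so the results are approximate. `partial_cor` is a helper name introduced here for illustration and is not part of the original analysis.

```r
# First-order partial correlation of x and y, controlling for z:
# r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
partial_cor <- function(r_xy, r_xz, r_yz) {
  (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
}

# Rounded correlations taken from the matrix above.
r_crime_income <- 0.43
r_crime_grad   <- 0.47
r_crime_urban  <- 0.68
r_income_urban <- 0.73
r_grad_urban   <- 0.79

# Crime vs. Income and Crime vs. HighSchoolGrad, each controlling for UrbanPop.
pc_income <- partial_cor(r_crime_income, r_crime_urban, r_income_urban)
pc_grad   <- partial_cor(r_crime_grad, r_crime_urban, r_grad_urban)

round(c(Income = pc_income, HighSchoolGrad = pc_grad), 2)
```

Both partial correlations come out slightly negative (roughly -0.13 and -0.15). That is consistent with the idea that the positive raw correlations with Crime largely reflect how urban a county is, since urban counties tend to have both higher incomes and higher graduation rates.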
model1 <- lm(Crime ~ UrbanPop, data = florida_data)
model2 <- lm(Crime ~ UrbanPop + HighSchoolGrad, data = florida_data)
model3 <- lm(Crime ~ UrbanPop + Income, data = florida_data)
model4 <- lm(Crime ~ UrbanPop + Income + HighSchoolGrad, data = florida_data)
model5 <- lm(Crime ~ Income + HighSchoolGrad, data = florida_data)
summary(model1)
##
## Call:
## lm(formula = Crime ~ UrbanPop, data = florida_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.766 -16.541 -4.741 16.521 49.632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.54125 4.53930 5.406 9.85e-07 ***
## UrbanPop 0.56220 0.07573 7.424 3.08e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.9 on 65 degrees of freedom
## Multiple R-squared: 0.4588, Adjusted R-squared: 0.4505
## F-statistic: 55.11 on 1 and 65 DF, p-value: 3.084e-10
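To make the fitted line concrete, the coefficients above can be turned into point predictions by hand. The sketch below hard-codes the estimates from the model1 summary. `predict_crime` is a hypothetical helper introduced here; with the fitted object available, `predict(model1, newdata = data.frame(UrbanPop = 50))` would do the same job.

```r
# Coefficients copied from the model1 summary above.
intercept <- 24.54125
slope     <- 0.56220

# Predicted crime rate for a given urban population percentage (0 to 100).
predict_crime <- function(urban_pop) {
  intercept + slope * urban_pop
}

# Predictions for hypothetical counties that are 0%, 50%, and 100% urban.
predict_crime(c(0, 50, 100))
```

The predictions are about 24.5, 52.7, and 80.8, so each additional percentage point of urban population corresponds to a predicted increase of about 0.56 in the crime rate. With a residual standard error of 20.9, individual counties can still fall well above or below the line.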
summary(model2)
##
## Call:
## lm(formula = Crime ~ UrbanPop + HighSchoolGrad, data = florida_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.693 -15.742 -6.226 15.812 50.678
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.1181 28.3653 2.084 0.0411 *
## UrbanPop 0.6825 0.1232 5.539 6.11e-07 ***
## HighSchoolGrad -0.5834 0.4725 -1.235 0.2214
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.82 on 64 degrees of freedom
## Multiple R-squared: 0.4714, Adjusted R-squared: 0.4549
## F-statistic: 28.54 on 2 and 64 DF, p-value: 1.379e-09
summary(model3)
##
## Call:
## lm(formula = Crime ~ UrbanPop + Income, data = florida_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.130 -15.590 -6.484 16.595 48.921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.9723 16.3536 2.444 0.0173 *
## UrbanPop 0.6418 0.1110 5.784 2.36e-07 ***
## Income -0.7906 0.8049 -0.982 0.3297
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.91 on 64 degrees of freedom
## Multiple R-squared: 0.4669, Adjusted R-squared: 0.4502
## F-statistic: 28.02 on 2 and 64 DF, p-value: 1.815e-09
summary(model4)
##
## Call:
## lm(formula = Crime ~ UrbanPop + Income + HighSchoolGrad, data = florida_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.407 -15.080 -6.588 16.178 50.125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.7147 28.5895 2.089 0.0408 *
## UrbanPop 0.6972 0.1291 5.399 1.08e-06 ***
## Income -0.3831 0.9405 -0.407 0.6852
## HighSchoolGrad -0.4673 0.5544 -0.843 0.4025
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.95 on 63 degrees of freedom
## Multiple R-squared: 0.4728, Adjusted R-squared: 0.4477
## F-statistic: 18.83 on 3 and 63 DF, p-value: 7.823e-09
summary(model5)
##
## Call:
## lm(formula = Crime ~ Income + HighSchoolGrad, data = florida_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.75 -19.61 -4.57 18.52 77.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -46.1094 24.9723 -1.846 0.0695 .
## Income 1.0311 1.0839 0.951 0.3450
## HighSchoolGrad 1.0540 0.5729 1.840 0.0705 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.14 on 64 degrees of freedom
## Multiple R-squared: 0.2289, Adjusted R-squared: 0.2048
## F-statistic: 9.5 on 2 and 64 DF, p-value: 0.000244
model_comparison <- tibble(
  Model = c("Model 1: UrbanPop",
            "Model 2: UrbanPop + HighSchoolGrad",
            "Model 3: UrbanPop + Income",
            "Model 4: UrbanPop + Income + HighSchoolGrad",
            "Model 5: Income + HighSchoolGrad"),
  R2 = c(summary(model1)$r.squared,
         summary(model2)$r.squared,
         summary(model3)$r.squared,
         summary(model4)$r.squared,
         summary(model5)$r.squared),
  Adj_R2 = c(summary(model1)$adj.r.squared,
             summary(model2)$adj.r.squared,
             summary(model3)$adj.r.squared,
             summary(model4)$adj.r.squared,
             summary(model5)$adj.r.squared),
  AIC = c(AIC(model1), AIC(model2), AIC(model3), AIC(model4), AIC(model5))
)
model_comparison
## # A tibble: 5 × 4
##   Model                                          R2 Adj_R2   AIC
##   <chr>                                       <dbl>  <dbl> <dbl>
## 1 Model 1: UrbanPop                           0.459  0.451  601.
## 2 Model 2: UrbanPop + HighSchoolGrad          0.471  0.455  602.
## 3 Model 3: UrbanPop + Income                  0.467  0.450  602.
## 4 Model 4: UrbanPop + Income + HighSchoolGrad 0.473  0.448  604.
## 5 Model 5: Income + HighSchoolGrad            0.229  0.205  627.
Given that UrbanPop has the strongest relationship with Crime, it makes sense that the models that include UrbanPop are the best at predicting crime. Model5, the only model without UrbanPop, has by far the lowest adjusted R^2 (.205). The model with the highest adjusted R^2 is model2 (.455), which uses UrbanPop and HighSchoolGrad, but it is hardly better than model1 (.451), which includes only UrbanPop. Model1 also has the lowest AIC of the five models (601), and HighSchoolGrad is not a significant predictor in model2 (p = 0.22). Therefore, with both simplicity and accuracy in mind, model1 is likely the best.
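Adjusted R^2 penalizes each added predictor, which is why it can fall even as R^2 rises. As a rough check, the sketch below recomputes it from the R^2 values printed in the model summaries using 1 - (1 - R^2)(n - 1)/(n - p - 1), with n = 67 counties and p predictors. `adj_r2` and `recomputed` are names introduced here for illustration, and the inputs are rounded, so small differences in the last digit are possible.

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 67  # number of Florida counties in the data

# Multiple R-squared values copied from the model summaries.
r2 <- c(model1 = 0.4588, model2 = 0.4714, model3 = 0.4669,
        model4 = 0.4728, model5 = 0.2289)

# Number of predictors in each model.
p <- c(model1 = 1, model2 = 2, model3 = 2, model4 = 3, model5 = 2)

recomputed <- adj_r2(r2, n, p)
round(recomputed, 4)
```

The recomputed values agree with the adjusted R^2 column reported by `summary()`. Adding Income and HighSchoolGrad to model1 raises R^2 only from 0.4588 to 0.4728, which is not enough to offset the penalty for two extra predictors, so model4's adjusted R^2 ends up below model1's.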