I conducted a statewide statistical analysis of socioeconomic factors associated with county-level crime rates in Florida. Using regression modeling, I examined how well income, high school graduation rate, and urban population percentage predict crime rates across Florida's 67 counties.

The best model I found for predicting crime rates was a simple linear regression with urban population percentage as the sole predictor, since it emerged as the most influential factor. The model explains about 46% of the variance in crime rates across Florida counties. This suggests that the police department should direct more of its resources and attention to urban centers.
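
To make the fitted relationship concrete, here is a minimal sketch that plugs a few urbanization levels into the regression equation. The intercept and slope are copied from the `summary(model1)` output reported in this analysis; the helper `predict_crime()` and the chosen urbanization levels are illustrative assumptions, not part of the original code.

```r
# Coefficients copied from the summary(model1) output in this report.
intercept <- 24.54125
slope     <- 0.56220   # change in crime rate per 1 percentage point of UrbanPop

# Hypothetical helper (not part of the original analysis):
# predicted crime rate for a given urban population percentage.
predict_crime <- function(urban_pop) {
  intercept + slope * urban_pop
}

# Illustrative inputs: the observed minimum, median, and maximum of UrbanPop.
urban_levels <- c(0, 44.6, 99.6)
predicted <- predict_crime(urban_levels)

print(data.frame(UrbanPop = urban_levels, PredictedCrime = round(predicted, 1)))
```

Each additional percentage point of urban population corresponds to a predicted increase of about 0.56 in the crime rate, so the fitted line rises by roughly 56 between the least and most urban counties.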

A limitation of this analysis is that many other variables that may relate to crime rates are not captured in my dataset. For example, population density, unemployment rate, and ethnic composition may contribute to crime rates.

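# Packages inferred from the functions used in this report
library(tidyverse)   # dplyr, stringr, ggplot2, tibble
library(readxl)      # read_excel()
library(mosaic)      # favstats()
library(Hmisc)       # rcorr()
library(ggcorrplot)  # ggcorrplot()

Florida_County_Crime_Rates <- read_excel("Florida County Crime Rates.xlsx")
View(Florida_County_Crime_Rates)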

florida_data <- Florida_County_Crime_Rates %>%
  rename(
    Crime = C,
    Income = I,
    HighSchoolGrad = HS, 
    UrbanPop = U
  )

florida_data <- florida_data %>%
  mutate(County = str_to_title(County))
# Descriptives
favstats(florida_data$Income)
##   min    Q1 median    Q3  max     mean       sd  n missing
##  15.4 21.05   24.6 28.15 35.6 24.51045 4.682758 67       0
Income_Range <- max(florida_data$Income) - min(florida_data$Income)

favstats(florida_data$Crime)
##  min   Q1 median Q3 max     mean       sd  n missing
##    0 35.5     52 69 128 52.40299 28.19363 67       0
Crime_Range <- max(florida_data$Crime) - min(florida_data$Crime)

favstats(florida_data$HighSchoolGrad)
##   min    Q1 median   Q3  max     mean       sd  n missing
##  54.5 62.45     69 76.9 84.9 69.48955 8.858776 67       0
Grad_Range <- max(florida_data$HighSchoolGrad) - min(florida_data$HighSchoolGrad)

favstats(florida_data$UrbanPop)
##  min   Q1 median    Q3  max     mean       sd  n missing
##    0 21.6   44.6 83.55 99.6 49.55821 33.96901 67       0
UrbanPop_Range <- max(florida_data$UrbanPop) - min(florida_data$UrbanPop)

ggplot(florida_data, aes(x = HighSchoolGrad, y = Crime)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype="dashed") +
  labs(
    title = "Crime Rate by High School Graduation Rate",
    x = "High School Graduation Rate (%)",
    y = "Crime Rate (per 1,000 residents)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Surprisingly, high school graduation rate is positively correlated with crime rate: counties with higher graduation rates tend to have higher crime rates.

ggplot(florida_data, aes(x = Income, y = Crime)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype="dashed") +
  labs(
    title = "Crime Rate by Income",
    x = "Median Income (in thousands)",
    y = "Crime Rate (per 1,000 residents)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Similarly, income is positively correlated with crime rate: counties with higher incomes tend to have higher crime rates. This is somewhat counterintuitive, so let’s look at the correlations among all of these variables to see if there’s more to the picture.

florida_variables <- florida_data %>%
  select(Crime, Income, HighSchoolGrad, UrbanPop)

cor_matrix <- rcorr(as.matrix(florida_variables))
cor_matrix
##                Crime Income HighSchoolGrad UrbanPop
## Crime           1.00   0.43           0.47     0.68
## Income          0.43   1.00           0.79     0.73
## HighSchoolGrad  0.47   0.79           1.00     0.79
## UrbanPop        0.68   0.73           0.79     1.00
## 
## n= 67 
## 
## 
## P
##                Crime Income HighSchoolGrad UrbanPop
## Crime                2e-04  0e+00          0e+00   
## Income         2e-04        0e+00          0e+00   
## HighSchoolGrad 0e+00 0e+00                 0e+00   
## UrbanPop       0e+00 0e+00  0e+00
ggcorrplot(
  cor_matrix$r,
  type="lower",
  lab = TRUE
)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggcorrplot package.
##   Please report the issue at <https://github.com/kassambara/ggcorrplot/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

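Urban population percentage shows the strongest relationship with Crime (r = 0.68). All relationships in this matrix are positive and statistically significant. Even the weakest, Crime-Income (r = 0.43) and Crime-HighSchoolGrad (r = 0.47), are moderate. The remaining correlations are strong: Crime-UrbanPop (0.68), Income-HighSchoolGrad (0.79), Income-UrbanPop (0.73), and HighSchoolGrad-UrbanPop (0.79). Because the three predictors are so strongly correlated with one another, the positive associations of income and graduation rate with crime may largely reflect the fact that wealthier, better-educated counties also tend to be more urban.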
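
One way to probe whether income and graduation rate relate to crime beyond their overlap with urbanization is a first-order partial correlation. The sketch below is only illustrative: it uses the two-decimal correlations printed in the matrix above, so the results are approximate, and the helper `partial_cor()` is a hypothetical name introduced here rather than part of the original analysis.

```r
# Pairwise correlations copied (rounded to two decimals) from the rcorr() output.
r_crime_income <- 0.43
r_crime_hs     <- 0.47
r_crime_urban  <- 0.68
r_income_urban <- 0.73
r_hs_urban     <- 0.79

# Hypothetical helper: partial correlation of x and y, controlling for z,
# computed from the three pairwise correlations.
partial_cor <- function(r_xy, r_xz, r_yz) {
  (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
}

# Crime vs. Income and Crime vs. HighSchoolGrad, each controlling for UrbanPop.
p_income <- partial_cor(r_crime_income, r_crime_urban, r_income_urban)
p_hs     <- partial_cor(r_crime_hs,     r_crime_urban, r_hs_urban)

print(round(c(Income = p_income, HighSchoolGrad = p_hs), 2))
```

With these rounded inputs, both partial correlations come out small and slightly negative (roughly -0.13 and -0.15), which is consistent with the idea that the positive raw correlations are driven mostly by urbanization.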

model1 <- lm(Crime ~ UrbanPop, data = florida_data)
model2 <- lm(Crime ~ UrbanPop + HighSchoolGrad, data = florida_data)
model3 <- lm(Crime ~ UrbanPop + Income, data = florida_data)
model4 <- lm(Crime ~ UrbanPop + Income + HighSchoolGrad, data = florida_data)
model5 <- lm(Crime ~ Income + HighSchoolGrad, data = florida_data)
summary(model1)
## 
## Call:
## lm(formula = Crime ~ UrbanPop, data = florida_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.766 -16.541  -4.741  16.521  49.632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.54125    4.53930   5.406 9.85e-07 ***
## UrbanPop     0.56220    0.07573   7.424 3.08e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.9 on 65 degrees of freedom
## Multiple R-squared:  0.4588, Adjusted R-squared:  0.4505 
## F-statistic: 55.11 on 1 and 65 DF,  p-value: 3.084e-10
summary(model2)
## 
## Call:
## lm(formula = Crime ~ UrbanPop + HighSchoolGrad, data = florida_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.693 -15.742  -6.226  15.812  50.678 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     59.1181    28.3653   2.084   0.0411 *  
## UrbanPop         0.6825     0.1232   5.539 6.11e-07 ***
## HighSchoolGrad  -0.5834     0.4725  -1.235   0.2214    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.82 on 64 degrees of freedom
## Multiple R-squared:  0.4714, Adjusted R-squared:  0.4549 
## F-statistic: 28.54 on 2 and 64 DF,  p-value: 1.379e-09
summary(model3)
## 
## Call:
## lm(formula = Crime ~ UrbanPop + Income, data = florida_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.130 -15.590  -6.484  16.595  48.921 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.9723    16.3536   2.444   0.0173 *  
## UrbanPop      0.6418     0.1110   5.784 2.36e-07 ***
## Income       -0.7906     0.8049  -0.982   0.3297    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.91 on 64 degrees of freedom
## Multiple R-squared:  0.4669, Adjusted R-squared:  0.4502 
## F-statistic: 28.02 on 2 and 64 DF,  p-value: 1.815e-09
summary(model4)
## 
## Call:
## lm(formula = Crime ~ UrbanPop + Income + HighSchoolGrad, data = florida_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.407 -15.080  -6.588  16.178  50.125 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     59.7147    28.5895   2.089   0.0408 *  
## UrbanPop         0.6972     0.1291   5.399 1.08e-06 ***
## Income          -0.3831     0.9405  -0.407   0.6852    
## HighSchoolGrad  -0.4673     0.5544  -0.843   0.4025    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.95 on 63 degrees of freedom
## Multiple R-squared:  0.4728, Adjusted R-squared:  0.4477 
## F-statistic: 18.83 on 3 and 63 DF,  p-value: 7.823e-09
summary(model5)
## 
## Call:
## lm(formula = Crime ~ Income + HighSchoolGrad, data = florida_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42.75 -19.61  -4.57  18.52  77.86 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    -46.1094    24.9723  -1.846   0.0695 .
## Income           1.0311     1.0839   0.951   0.3450  
## HighSchoolGrad   1.0540     0.5729   1.840   0.0705 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.14 on 64 degrees of freedom
## Multiple R-squared:  0.2289, Adjusted R-squared:  0.2048 
## F-statistic:   9.5 on 2 and 64 DF,  p-value: 0.000244
model_comparison <- tibble(
  Model = c("Model 1: UrbanPop",
            "Model 2: UrbanPop + HighSchoolGrad",
            "Model 3: UrbanPop + Income",
            "Model 4: UrbanPop + Income + HighSchoolGrad",
            "Model 5: Income + HighSchoolGrad"),
  R2 = c(summary(model1)$r.squared,
         summary(model2)$r.squared,
         summary(model3)$r.squared,
         summary(model4)$r.squared,
         summary(model5)$r.squared),
  Adj_R2 = c(summary(model1)$adj.r.squared,
             summary(model2)$adj.r.squared,
             summary(model3)$adj.r.squared,
             summary(model4)$adj.r.squared,
             summary(model5)$adj.r.squared),
  AIC = c(AIC(model1), AIC(model2), AIC(model3), AIC(model4), AIC(model5))
)

model_comparison
## # A tibble: 5 × 4
##   Model                                          R2 Adj_R2   AIC
##   <chr>                                       <dbl>  <dbl> <dbl>
## 1 Model 1: UrbanPop                           0.459  0.451  601.
## 2 Model 2: UrbanPop + HighSchoolGrad          0.471  0.455  602.
## 3 Model 3: UrbanPop + Income                  0.467  0.450  602.
## 4 Model 4: UrbanPop + Income + HighSchoolGrad 0.473  0.448  604.
## 5 Model 5: Income + HighSchoolGrad            0.229  0.205  627.

Given that UrbanPop has the strongest relationship with Crime, it makes sense that the models that include UrbanPop predict crime best. Model 5 is the only model without UrbanPop, and it has the lowest adjusted R^2 (0.205) and the highest AIC (627). The model with the highest adjusted R^2 is Model 2 (0.455), which uses UrbanPop and HighSchoolGrad, but it is hardly better than Model 1 (0.451), which includes only UrbanPop and has the lowest AIC (601). In Models 2 through 4, neither Income nor HighSchoolGrad is a significant predictor once UrbanPop is in the model. Therefore, with both simplicity and accuracy in mind, Model 1 is likely the best.
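
The claim that adding HighSchoolGrad barely improves on the UrbanPop-only model can be checked with a nested-model F-test. The following is a rough sketch that works only from the R^2 values and sample size printed in the summaries above (so it inherits their rounding); the variable names are introduced here for illustration. With the fitted objects available, `anova(model1, model2)` would run the same comparison directly.

```r
# Values copied from the summary() output for model1 and model2 in this report.
r2_reduced <- 0.4588   # model1: Crime ~ UrbanPop
r2_full    <- 0.4714   # model2: Crime ~ UrbanPop + HighSchoolGrad
n          <- 67       # number of counties
k_full     <- 2        # predictors in the full model
q          <- 1        # predictors added relative to the reduced model

# Residual degrees of freedom of the full model.
df_resid <- n - k_full - 1

# F statistic for the increase in R^2 when moving from the reduced to the full model.
f_stat  <- ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_resid)
p_value <- pf(f_stat, q, df_resid, lower.tail = FALSE)

print(round(c(F = f_stat, p = p_value), 3))
```

Because only one predictor is added, this F-test is equivalent to the t-test on the HighSchoolGrad coefficient in the Model 2 summary, and the resulting p-value is well above 0.05, which supports preferring the simpler model.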