As a budding data analyst, I have been asked by my company to investigate which socioeconomic factors are most strongly associated with rising crime rates across Florida counties. I’m super excited!

Step 1 - Loading and Preparing the Data

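First, I load the packages I need and read in the data.

library(tidyverse)  # dplyr, stringr, and ggplot2 are used throughout
library(readxl)     # provides read_excel()
florida_crime <- read_excel("Florida County Crime Rates.xlsx")
view(florida_crime)
head(florida_crime)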
## # A tibble: 6 × 5
##   County       C     I    HS     U
##   <chr>    <dbl> <dbl> <dbl> <dbl>
## 1 ALACHUA    104  22.1  82.7  73.2
## 2 BAKER       20  25.8  64.1  21.5
## 3 BAY         64  24.7  74.7  85  
## 4 BRADFORD    50  24.6  65    23.2
## 5 BREVARD     64  30.5  82.3  91.9
## 6 BROWARD     94  30.6  76.8  98.9

Renaming columns and formatting county names

florida_crime <- florida_crime %>%
  rename(
    Crime = C,
    Income = I,
    HighSchoolGrad = HS,
    UrbanPop = U
  )
florida_crime
## # A tibble: 67 × 5
##    County    Crime Income HighSchoolGrad UrbanPop
##    <chr>     <dbl>  <dbl>          <dbl>    <dbl>
##  1 ALACHUA     104   22.1           82.7     73.2
##  2 BAKER        20   25.8           64.1     21.5
##  3 BAY          64   24.7           74.7     85  
##  4 BRADFORD     50   24.6           65       23.2
##  5 BREVARD      64   30.5           82.3     91.9
##  6 BROWARD      94   30.6           76.8     98.9
##  7 CALHOUN       8   18.6           55.9      0  
##  8 CHARLOTTE    35   25.7           75.7     80.2
##  9 CITRUS       27   21.3           68.6     31  
## 10 CLAY         41   34.9           81.2     65.8
## # ℹ 57 more rows
florida_crime <- florida_crime %>%
  mutate(County = str_to_title(County))
florida_crime
## # A tibble: 67 × 5
##    County    Crime Income HighSchoolGrad UrbanPop
##    <chr>     <dbl>  <dbl>          <dbl>    <dbl>
##  1 Alachua     104   22.1           82.7     73.2
##  2 Baker        20   25.8           64.1     21.5
##  3 Bay          64   24.7           74.7     85  
##  4 Bradford     50   24.6           65       23.2
##  5 Brevard      64   30.5           82.3     91.9
##  6 Broward      94   30.6           76.8     98.9
##  7 Calhoun       8   18.6           55.9      0  
##  8 Charlotte    35   25.7           75.7     80.2
##  9 Citrus       27   21.3           68.6     31  
## 10 Clay         41   34.9           81.2     65.8
## # ℹ 57 more rows
skimr::skim(florida_crime)
Data summary
Name                florida_crime
Number of rows      67
Number of columns   5
Column type frequency:
  character         1
  numeric           4
Group variables     None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
County                 0              1    3    9      0        67           0

Variable type: numeric

skim_variable   n_missing  complete_rate   mean     sd    p0    p25   p50    p75   p100  hist
Crime                   0              1  52.40  28.19   0.0  35.50  52.0  69.00  128.0  ▃▇▇▃▂
Income                  0              1  24.51   4.68  15.4  21.05  24.6  28.15   35.6  ▂▇▅▅▂
HighSchoolGrad          0              1  69.49   8.86  54.5  62.45  69.0  76.90   84.9  ▇▇▆▇▆
UrbanPop                0              1  49.56  33.97   0.0  21.60  44.6  83.55   99.6  ▅▆▂▃▇
summary(florida_crime)
##     County              Crime           Income      HighSchoolGrad 
##  Length:67          Min.   :  0.0   Min.   :15.40   Min.   :54.50  
##  Class :character   1st Qu.: 35.5   1st Qu.:21.05   1st Qu.:62.45  
##  Mode  :character   Median : 52.0   Median :24.60   Median :69.00  
##                     Mean   : 52.4   Mean   :24.51   Mean   :69.49  
##                     3rd Qu.: 69.0   3rd Qu.:28.15   3rd Qu.:76.90  
##                     Max.   :128.0   Max.   :35.60   Max.   :84.90  
##     UrbanPop    
##  Min.   : 0.00  
##  1st Qu.:21.60  
##  Median :44.60  
##  Mean   :49.56  
##  3rd Qu.:83.55  
##  Max.   :99.60
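
Before moving on, a few sanity checks on the cleaned table can be worth running. The sketch below is only an illustration: `crime_subset` holds the first 10 counties copied from the printed output above (not the full 67), `check_crime_data()` is a hypothetical helper of my own, and treating HighSchoolGrad and UrbanPop as percentages between 0 and 100 is an assumption based on their observed ranges.

```r
# First 10 rows of the cleaned table, copied from the printed output above.
# The full dataset has 67 counties; this subset is for illustration only.
crime_subset <- data.frame(
  County = c("Alachua", "Baker", "Bay", "Bradford", "Brevard",
             "Broward", "Calhoun", "Charlotte", "Citrus", "Clay"),
  Crime = c(104, 20, 64, 50, 64, 94, 8, 35, 27, 41),
  Income = c(22.1, 25.8, 24.7, 24.6, 30.5, 30.6, 18.6, 25.7, 21.3, 34.9),
  HighSchoolGrad = c(82.7, 64.1, 74.7, 65, 82.3, 76.8, 55.9, 75.7, 68.6, 81.2),
  UrbanPop = c(73.2, 21.5, 85, 23.2, 91.9, 98.9, 0, 80.2, 31, 65.8)
)

# Hypothetical helper (not from any package): returns one TRUE/FALSE per check.
check_crime_data <- function(df) {
  pct_cols <- c("HighSchoolGrad", "UrbanPop")  # assumed to be percentages
  c(
    no_missing      = !anyNA(df),
    unique_counties = !anyDuplicated(df$County),
    pct_in_range    = all(df[pct_cols] >= 0 & df[pct_cols] <= 100),
    crime_nonneg    = all(df$Crime >= 0)
  )
}

checks <- check_crime_data(crime_subset)
print(checks)
```

On the real data the same helper could be called as `check_crime_data(florida_crime)`.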

Step 2 - Exploratory Data Analysis

Descriptive statistics for florida_crime

summary(florida_crime)
##     County              Crime           Income      HighSchoolGrad 
##  Length:67          Min.   :  0.0   Min.   :15.40   Min.   :54.50  
##  Class :character   1st Qu.: 35.5   1st Qu.:21.05   1st Qu.:62.45  
##  Mode  :character   Median : 52.0   Median :24.60   Median :69.00  
##                     Mean   : 52.4   Mean   :24.51   Mean   :69.49  
##                     3rd Qu.: 69.0   3rd Qu.:28.15   3rd Qu.:76.90  
##                     Max.   :128.0   Max.   :35.60   Max.   :84.90  
##     UrbanPop    
##  Min.   : 0.00  
##  1st Qu.:21.60  
##  Median :44.60  
##  Mean   :49.56  
##  3rd Qu.:83.55  
##  Max.   :99.60

The summary() output gives the mean, median, and range of each variable in the dataset:

Variable         Mean   Median  Range
Crime            52.40  52.00   0.00-128.00
Income           24.51  24.60   15.40-35.60
HighSchoolGrad   69.49  69.00   54.50-84.90
UrbanPop         49.56  44.60   0.00-99.60
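
The same three statistics can also be computed directly instead of being read off the summary() printout. This is a minimal sketch in base R; `describe_column()` is a hypothetical helper, and `crime_subset` contains only the first 10 counties shown earlier, so its numbers will not match the full-data values above.

```r
# Numeric columns for the first 10 counties, copied from the printed output.
# Illustration only: the full dataset has 67 rows.
crime_subset <- data.frame(
  Crime = c(104, 20, 64, 50, 64, 94, 8, 35, 27, 41),
  Income = c(22.1, 25.8, 24.7, 24.6, 30.5, 30.6, 18.6, 25.7, 21.3, 34.9),
  HighSchoolGrad = c(82.7, 64.1, 74.7, 65, 82.3, 76.8, 55.9, 75.7, 68.6, 81.2),
  UrbanPop = c(73.2, 21.5, 85, 23.2, 91.9, 98.9, 0, 80.2, 31, 65.8)
)

# Hypothetical helper: mean, median, and range endpoints of one column.
describe_column <- function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x))
}

# sapply() returns statistics in rows, so transpose to one row per variable.
stats <- t(sapply(crime_subset, describe_column))
print(round(stats, 2))
```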

Visualizing the data in all its glory

viz_1 <- ggplot(florida_crime, aes(x=Income, y=Crime))+
  geom_point(size=2.5)+
  geom_smooth(method = "lm", se=FALSE)+
  labs(
    title = "Income vs Crime",
    x = "Income",
    y = "Crime"
  )
viz_1
## `geom_smooth()` using formula = 'y ~ x'

The fitted regression line slopes upward: counties with higher Income tend to have higher Crime.

viz_2 <- ggplot(florida_crime, aes(x=HighSchoolGrad, y=Crime))+
  geom_point(size=2.5)+
  geom_smooth(method = "lm", se=FALSE)+
  labs(
    title = "School vs Crime",
    x = "School",
    y = "Crime"
  )
viz_2
## `geom_smooth()` using formula = 'y ~ x'

Here we also see the same trend as in the previous graph: as the high school graduation rate increases, so does Crime.

viz_3 <- ggplot(florida_crime, aes(x=UrbanPop, y=Crime))+
  geom_point(size=2.5)+
  geom_smooth(method = "lm", se=FALSE)+
  labs(
    title = "Urban Lifestyle vs Crime",
    x = "UrbanPop",
    y = "Crime"
  )
viz_3
## `geom_smooth()` using formula = 'y ~ x'

Again, the pattern is similar to the two previous graphs: as the percentage of the urban population increases, so does Crime.

Let’s use patchwork for better readability

library(patchwork)  # lets ggplot objects be combined with +
viz_1 + viz_2 + viz_3
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
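
An alternative to combining three separate plots is to reshape the data to long format and let ggplot2 facet it, which keeps a shared Crime axis across panels. The sketch below assumes the tidyr package is installed and, to stay self-contained, uses only the first 10 counties printed earlier; `crime_subset`, `long_crime`, and `viz_facets` are names I made up for this example.

```r
library(ggplot2)
library(tidyr)

# First 10 counties from the printed output; illustration only.
crime_subset <- data.frame(
  Crime = c(104, 20, 64, 50, 64, 94, 8, 35, 27, 41),
  Income = c(22.1, 25.8, 24.7, 24.6, 30.5, 30.6, 18.6, 25.7, 21.3, 34.9),
  HighSchoolGrad = c(82.7, 64.1, 74.7, 65, 82.3, 76.8, 55.9, 75.7, 68.6, 81.2),
  UrbanPop = c(73.2, 21.5, 85, 23.2, 91.9, 98.9, 0, 80.2, 31, 65.8)
)

# One row per county-predictor pair, so each predictor gets its own panel.
long_crime <- pivot_longer(
  crime_subset,
  cols = c(Income, HighSchoolGrad, UrbanPop),
  names_to = "Predictor",
  values_to = "Value"
)

viz_facets <- ggplot(long_crime, aes(x = Value, y = Crime)) +
  geom_point(size = 2.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ Predictor, scales = "free_x") +  # x scales differ by predictor
  labs(title = "Crime vs each predictor (first 10 counties only)",
       x = NULL, y = "Crime")
viz_facets
```

With the full data, `florida_crime` would take the place of `crime_subset` in the `pivot_longer()` call.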

Step 3 - Correlation Analysis

Numeric_florida_crime <- florida_crime %>% 
  select(Crime, Income, HighSchoolGrad, UrbanPop)
head(Numeric_florida_crime)
## # A tibble: 6 × 4
##   Crime Income HighSchoolGrad UrbanPop
##   <dbl>  <dbl>          <dbl>    <dbl>
## 1   104   22.1           82.7     73.2
## 2    20   25.8           64.1     21.5
## 3    64   24.7           74.7     85  
## 4    50   24.6           65       23.2
## 5    64   30.5           82.3     91.9
## 6    94   30.6           76.8     98.9
Cor_Max <- Hmisc::rcorr(as.matrix(Numeric_florida_crime))
Cor_Max
##                Crime Income HighSchoolGrad UrbanPop
## Crime           1.00   0.43           0.47     0.68
## Income          0.43   1.00           0.79     0.73
## HighSchoolGrad  0.47   0.79           1.00     0.79
## UrbanPop        0.68   0.73           0.79     1.00
## 
## n= 67 
## 
## 
## P
##                Crime Income HighSchoolGrad UrbanPop
## Crime                2e-04  0e+00          0e+00   
## Income         2e-04        0e+00          0e+00   
## HighSchoolGrad 0e+00 0e+00                 0e+00   
## UrbanPop       0e+00 0e+00  0e+00

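All three predictors are positively correlated with Crime, and every correlation is statistically significant (p < 0.001):

Income and Crime: r = 0.43
HighSchoolGrad and Crime: r = 0.47
UrbanPop and Crime: r = 0.68 (the strongest correlation with Crime)

The predictors are also strongly correlated with one another (r = 0.73 to 0.79), which matters when they are combined in the same regression model.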
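
To see where the P values in the correlation output come from, the standard t-test for a Pearson correlation can be reproduced by hand: t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. The sketch below uses the rounded r values printed above, so its results are approximate.

```r
n <- 67  # number of counties, from the rcorr output

# Correlations with Crime, rounded to two decimals as printed above.
r <- c(Income = 0.43, HighSchoolGrad = 0.47, UrbanPop = 0.68)

# t statistic for H0: true correlation is zero.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(data.frame(r = r, t = round(t_stat, 2), p = signif(p_value, 2)))
```

The Income row should land near the 2e-04 shown in the P matrix, while the other two are small enough to print as 0e+00 there.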

Step 4 - Building Regression Models

m1 <- lm(Crime ~ Income, data = Numeric_florida_crime)
summary(m1)
## 
## Call:
## lm(formula = Crime ~ Income, data = Numeric_florida_crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.452 -21.347  -3.102  17.580  69.357 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11.6059    16.7863  -0.691 0.491782    
## Income        2.6115     0.6729   3.881 0.000246 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.6 on 65 degrees of freedom
## Multiple R-squared:  0.1881, Adjusted R-squared:  0.1756 
## F-statistic: 15.06 on 1 and 65 DF,  p-value: 0.0002456
AIC(m1)
## [1] 628.6045
m2 <- lm(Crime ~ HighSchoolGrad, data = Numeric_florida_crime)
summary(m2)
## 
## Call:
## lm(formula = Crime ~ HighSchoolGrad, data = Numeric_florida_crime)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -43.74 -21.36  -4.82  17.42  82.27 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -50.8569    24.4507  -2.080   0.0415 *  
## HighSchoolGrad   1.4860     0.3491   4.257 6.81e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.12 on 65 degrees of freedom
## Multiple R-squared:  0.218,  Adjusted R-squared:  0.206 
## F-statistic: 18.12 on 1 and 65 DF,  p-value: 6.806e-05
AIC(m2)
## [1] 626.0932
m3 <- lm(Crime ~ UrbanPop, data = Numeric_florida_crime)
summary(m3)
## 
## Call:
## lm(formula = Crime ~ UrbanPop, data = Numeric_florida_crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.766 -16.541  -4.741  16.521  49.632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.54125    4.53930   5.406 9.85e-07 ***
## UrbanPop     0.56220    0.07573   7.424 3.08e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.9 on 65 degrees of freedom
## Multiple R-squared:  0.4588, Adjusted R-squared:  0.4505 
## F-statistic: 55.11 on 1 and 65 DF,  p-value: 3.084e-10
AIC(m3)
## [1] 601.43
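
To make the m3 coefficients concrete, here is a small sketch that turns them into predictions: Crime = 24.54125 + 0.56220 * UrbanPop. The function `predict_crime_m3()` is a hypothetical helper that hard-codes the estimates printed above; with the fitted model object available, `predict(m3, newdata = ...)` would be the usual route.

```r
# Hypothetical helper: hard-codes the m3 estimates from the summary above.
predict_crime_m3 <- function(urban_pop) {
  intercept <- 24.54125
  slope     <- 0.56220   # change in Crime per 1-point increase in UrbanPop
  intercept + slope * urban_pop
}

# UrbanPop ranges from 0 to 99.6 in the data, so stay inside that range.
urban_values <- c(0, 25, 50, 75)
predicted <- predict_crime_m3(urban_values)
print(data.frame(UrbanPop = urban_values, PredictedCrime = predicted))
```

Each additional 10 points of UrbanPop adds about 5.6 to the predicted Crime value, which is the slope read in practical terms.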

Multiple regression models

m4 <- lm(Crime ~ Income + UrbanPop, data = Numeric_florida_crime)
summary(m4)
## 
## Call:
## lm(formula = Crime ~ Income + UrbanPop, data = Numeric_florida_crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.130 -15.590  -6.484  16.595  48.921 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.9723    16.3536   2.444   0.0173 *  
## Income       -0.7906     0.8049  -0.982   0.3297    
## UrbanPop      0.6418     0.1110   5.784 2.36e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.91 on 64 degrees of freedom
## Multiple R-squared:  0.4669, Adjusted R-squared:  0.4502 
## F-statistic: 28.02 on 2 and 64 DF,  p-value: 1.815e-09
AIC(m4)
## [1] 602.4276
m5 <- lm(Crime ~ Income + HighSchoolGrad, data = Numeric_florida_crime)
summary(m5)
## 
## Call:
## lm(formula = Crime ~ Income + HighSchoolGrad, data = Numeric_florida_crime)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42.75 -19.61  -4.57  18.52  77.86 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    -46.1094    24.9723  -1.846   0.0695 .
## Income           1.0311     1.0839   0.951   0.3450  
## HighSchoolGrad   1.0540     0.5729   1.840   0.0705 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.14 on 64 degrees of freedom
## Multiple R-squared:  0.2289, Adjusted R-squared:  0.2048 
## F-statistic:   9.5 on 2 and 64 DF,  p-value: 0.000244
AIC(m5)
## [1] 627.1524
m6 <- lm(Crime ~ Income + UrbanPop + HighSchoolGrad, data = Numeric_florida_crime)
summary(m6)
## 
## Call:
## lm(formula = Crime ~ Income + UrbanPop + HighSchoolGrad, data = Numeric_florida_crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.407 -15.080  -6.588  16.178  50.125 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     59.7147    28.5895   2.089   0.0408 *  
## Income          -0.3831     0.9405  -0.407   0.6852    
## UrbanPop         0.6972     0.1291   5.399 1.08e-06 ***
## HighSchoolGrad  -0.4673     0.5544  -0.843   0.4025    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.95 on 63 degrees of freedom
## Multiple R-squared:  0.4728, Adjusted R-squared:  0.4477 
## F-statistic: 18.83 on 3 and 63 DF,  p-value: 7.823e-09
AIC(m6)
## [1] 603.6764
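
In m4 and m6 above, Income and HighSchoolGrad lose their significance (and change sign) once UrbanPop is in the model. One plausible reason is that the three predictors overlap heavily. As a rough check, variance inflation factors (VIFs) can be computed from the predictor correlations in Step 3, since the VIFs are the diagonal of the inverse correlation matrix. This is a sketch based on the rounded correlations, so the values are approximate.

```r
vars <- c("Income", "HighSchoolGrad", "UrbanPop")

# Correlations among the predictors, as printed in the rcorr output (rounded).
pred_cor <- matrix(
  c(1.00, 0.79, 0.73,
    0.79, 1.00, 0.79,
    0.73, 0.79, 1.00),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal element
# of the inverse of the predictor correlation matrix.
vif <- diag(solve(pred_cor))
print(round(vif, 2))
```

A VIF well above 1 means the coefficient's variance is inflated by overlap with the other predictors, which fits the larger standard error for UrbanPop in m6 (0.1291) compared with m3 (0.07573).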

Comparing models

Model  Predictors                           R-squared  Adj. R-squared  AIC       Variance explained
m1     Income                               0.188      0.176           628.6045  18%
m2     HighSchoolGrad                       0.218      0.206           626.0932  21%
m3     UrbanPop                             0.459      0.451           601.43    45%
m4     Income + UrbanPop                    0.467      0.450           602.4276  45%
m5     Income + HighSchoolGrad              0.229      0.205           627.1524  20%
m6     Income + UrbanPop + HighSchoolGrad   0.473      0.447           603.6764  45%

The "Variance explained" column is rounded from the adjusted R-squared.
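
The adjusted R-squared values can be reproduced from R-squared, the sample size n, and the number of predictors p with the usual formula 1 - (1 - R2) * (n - 1) / (n - p - 1). The sketch below plugs in the figures reported by summary(); the data frame `models` is just my own container for them.

```r
n <- 67  # counties

# R-squared, number of predictors, and reported adjusted R-squared,
# all copied from the summary() outputs above.
models <- data.frame(
  model        = c("m1", "m2", "m3", "m4", "m5", "m6"),
  r2           = c(0.1881, 0.2180, 0.4588, 0.4669, 0.2289, 0.4728),
  p            = c(1, 1, 1, 2, 2, 3),
  adj_reported = c(0.1756, 0.2060, 0.4505, 0.4502, 0.2048, 0.4477)
)

# Adjusted R-squared shrinks R-squared according to the predictors used.
models$adj_computed <- 1 - (1 - models$r2) * (n - 1) / (n - models$p - 1)

print(round(models[, c("r2", "adj_reported", "adj_computed")], 4))
```

Note how m4 and m6 have a higher R-squared than m3 but a lower adjusted R-squared: the extra predictors do not add enough fit to offset the penalty.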

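m3, m4, and m6 clearly outperform m1, m2, and m5: each explains roughly 45% of the variance in Crime, against about 20% for the others, and all three include UrbanPop. m3 has the lowest AIC (601.43), followed by m4 (602.43) and m6 (603.68). AIC rewards goodness of fit but penalizes every extra parameter, so the lowest value marks the best trade-off between fit and complexity. Adding Income or HighSchoolGrad to UrbanPop raises R-squared only slightly (from 0.459 to 0.467 and 0.473), lowers the adjusted R-squared, and none of the added coefficients is statistically significant (p = 0.33 or higher). m3 is therefore the best model on both fit and simplicity.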
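
To show how AIC trades fit against complexity, the values can be rebuilt approximately from each model's residual standard error. For a Gaussian linear model, AIC = n * (log(2 * pi) + log(RSS / n) + 1) + 2k, where RSS is the residual sum of squares and k counts the coefficients plus the error variance. In this sketch `aic_from_rse()` is a hypothetical helper, and because the residual standard errors are rounded in the printout the results only match AIC() to within a few hundredths.

```r
# Hypothetical helper: approximate AIC of an lm fit from its printed
# residual standard error (rse) and residual degrees of freedom.
aic_from_rse <- function(rse, df_resid, n) {
  rss <- rse^2 * df_resid        # residual sum of squares
  k   <- (n - df_resid) + 1      # coefficients + error variance
  n * (log(2 * pi) + log(rss / n) + 1) + 2 * k
}

# Values copied from the summary() and AIC() outputs above.
fits <- data.frame(
  model        = c("m1", "m2", "m3", "m4", "m5", "m6"),
  rse          = c(25.60, 25.12, 20.90, 20.91, 25.14, 20.95),
  df_resid     = c(65, 65, 65, 64, 64, 63),
  aic_reported = c(628.6045, 626.0932, 601.43, 602.4276, 627.1524, 603.6764)
)

fits$aic_approx <- aic_from_rse(fits$rse, fits$df_resid, n = 67)
print(fits[, c("model", "aic_reported", "aic_approx")])
```

The 2k term is why m4 and m6 end up with a higher AIC than m3 even though their fit is marginally better: each extra coefficient costs 2 points, and the gain in fit is smaller than that.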

Conclusion

After cleaning the data and comparing six regression models, I am confident in recommending model 3 (Crime predicted by UrbanPop): the share of a county's population living in urban areas is the strongest predictor of crime among the variables examined. This model explains about 45% of the variance in Crime. The PD should therefore focus on the most urbanized counties, where a visible presence can give residents a sense of security and may help reduce the opportunities for crime.

Two limitations apply. First, correlation is not causation: these results show an association between urbanization and crime, not that one causes the other. Second, the dataset contains no information on age, so we cannot examine which age groups are more likely to be involved in crime or design interventions aimed at them.