The Florida Police Department has hired me as their new data analyst. The mission is to uncover which socioeconomic factors are most strongly associated with higher crime rates across Florida counties. The department is particularly interested in whether income, education, or urbanization plays the largest role in explaining differences in crime rates.
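library(tidyverse)  # dplyr, stringr, and ggplot2 are used below
library(readxl)     # provides read_xlsx()
Florida_Data <- read_xlsx("Florida County Crime Rates.xlsx")
View(Florida_Data)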
We will rename the columns to Crime, Income, HighSchoolGrad, and UrbanPop, and format every county name so that only the first letter of each word is capitalized.
Florida_Data <- Florida_Data %>%
  rename(
    Crime = C,
    Income = I,
    HighSchoolGrad = HS,
    UrbanPop = U
  )
Florida_Data <- Florida_Data %>%
  mutate(County = str_to_title(County))
Florida_Data
## # A tibble: 67 × 5
## County Crime Income HighSchoolGrad UrbanPop
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Alachua 104 22.1 82.7 73.2
## 2 Baker 20 25.8 64.1 21.5
## 3 Bay 64 24.7 74.7 85
## 4 Bradford 50 24.6 65 23.2
## 5 Brevard 64 30.5 82.3 91.9
## 6 Broward 94 30.6 76.8 98.9
## 7 Calhoun 8 18.6 55.9 0
## 8 Charlotte 35 25.7 75.7 80.2
## 9 Citrus 27 21.3 68.6 31
## 10 Clay 41 34.9 81.2 65.8
## # ℹ 57 more rows
Next, we will inspect and summarize the dataset.
str(Florida_Data)
## tibble [67 × 5] (S3: tbl_df/tbl/data.frame)
## $ County : chr [1:67] "Alachua" "Baker" "Bay" "Bradford" ...
## $ Crime : num [1:67] 104 20 64 50 64 94 8 35 27 41 ...
## $ Income : num [1:67] 22.1 25.8 24.7 24.6 30.5 30.6 18.6 25.7 21.3 34.9 ...
## $ HighSchoolGrad: num [1:67] 82.7 64.1 74.7 65 82.3 76.8 55.9 75.7 68.6 81.2 ...
## $ UrbanPop : num [1:67] 73.2 21.5 85 23.2 91.9 98.9 0 80.2 31 65.8 ...
library(skimr)  # provides skim()
skim(Florida_Data)
| Name | Florida_Data |
| Number of rows | 67 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| County | 0 | 1 | 3 | 9 | 0 | 67 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Crime | 0 | 1 | 52.40 | 28.19 | 0.0 | 35.50 | 52.0 | 69.00 | 128.0 | ▃▇▇▃▂ |
| Income | 0 | 1 | 24.51 | 4.68 | 15.4 | 21.05 | 24.6 | 28.15 | 35.6 | ▂▇▅▅▂ |
| HighSchoolGrad | 0 | 1 | 69.49 | 8.86 | 54.5 | 62.45 | 69.0 | 76.90 | 84.9 | ▇▇▆▇▆ |
| UrbanPop | 0 | 1 | 49.56 | 33.97 | 0.0 | 21.60 | 44.6 | 83.55 | 99.6 | ▅▆▂▃▇ |
summary(Florida_Data)
## County Crime Income HighSchoolGrad
## Length:67 Min. : 0.0 Min. :15.40 Min. :54.50
## Class :character 1st Qu.: 35.5 1st Qu.:21.05 1st Qu.:62.45
## Mode :character Median : 52.0 Median :24.60 Median :69.00
## Mean : 52.4 Mean :24.51 Mean :69.49
## 3rd Qu.: 69.0 3rd Qu.:28.15 3rd Qu.:76.90
## Max. :128.0 Max. :35.60 Max. :84.90
## UrbanPop
## Min. : 0.00
## 1st Qu.:21.60
## Median :44.60
## Mean :49.56
## 3rd Qu.:83.55
## Max. :99.60
We will now compute basic descriptive statistics.
summary(Florida_Data)
## County Crime Income HighSchoolGrad
## Length:67 Min. : 0.0 Min. :15.40 Min. :54.50
## Class :character 1st Qu.: 35.5 1st Qu.:21.05 1st Qu.:62.45
## Mode :character Median : 52.0 Median :24.60 Median :69.00
## Mean : 52.4 Mean :24.51 Mean :69.49
## 3rd Qu.: 69.0 3rd Qu.:28.15 3rd Qu.:76.90
## Max. :128.0 Max. :35.60 Max. :84.90
## UrbanPop
## Min. : 0.00
## 1st Qu.:21.60
## Median :44.60
## Mean :49.56
## 3rd Qu.:83.55
## Max. :99.60
This simple call gives the minimum, quartiles, median, mean, and maximum of each numeric column.
For Crime: Mean = 52.4, Median = 52, Range = 0-128
For Income: Mean = 24.51, Median = 24.60, Range = 15.40-35.60
For HighSchoolGrad: Mean = 69.49, Median = 69, Range = 54.50-84.90
For UrbanPop: Mean = 49.56, Median = 44.60, Range = 0-99.60
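The same kind of figures can be collected into one table instead of being read off the `summary()` printout. The following is a minimal sketch in base R. It assumes only the ten counties printed earlier, so `sample_rows` is a hypothetical stand-in for the full `Florida_Data`, and its values differ from the 67-county results above.

```r
# Hypothetical stand-in: the first ten counties printed above, not the full data
sample_rows <- data.frame(
  County = c("Alachua", "Baker", "Bay", "Bradford", "Brevard",
             "Broward", "Calhoun", "Charlotte", "Citrus", "Clay"),
  Crime = c(104, 20, 64, 50, 64, 94, 8, 35, 27, 41),
  Income = c(22.1, 25.8, 24.7, 24.6, 30.5, 30.6, 18.6, 25.7, 21.3, 34.9),
  HighSchoolGrad = c(82.7, 64.1, 74.7, 65, 82.3, 76.8, 55.9, 75.7, 68.6, 81.2),
  UrbanPop = c(73.2, 21.5, 85, 23.2, 91.9, 98.9, 0, 80.2, 31, 65.8)
)

# Keep only the numeric columns
numeric_cols <- sample_rows[, c("Crime", "Income", "HighSchoolGrad", "UrbanPop")]

# One row per variable: mean, median, minimum, and maximum
descriptives <- data.frame(
  Mean = sapply(numeric_cols, mean),
  Median = sapply(numeric_cols, median),
  Min = sapply(numeric_cols, min),
  Max = sapply(numeric_cols, max)
)

round(descriptives, 2)
```

Replacing `sample_rows` with `Florida_Data` should reproduce the means, medians, and ranges listed above.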
Next we will create three scatterplots.
### Visual 1: Income and Crime
Visual_1 <- ggplot(Florida_Data, aes(x = Income, y = Crime)) +
  geom_point(size = 2.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Income and Crime",
    x = "Income",
    y = "Crime Rate"
  )
Visual_1
## `geom_smooth()` using formula = 'y ~ x'
As income increases, crime rate increases.
Visual_2 <- ggplot(Florida_Data, aes(x = HighSchoolGrad, y = Crime)) +
  geom_point(size = 2.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Education and Crime",
    x = "Education",
    y = "Crime Rate"
  )
Visual_2
## `geom_smooth()` using formula = 'y ~ x'
As education increases, crime rate increases.
Visual_3 <- ggplot(Florida_Data, aes(x = UrbanPop, y = Crime)) +
  geom_point(size = 2.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Urbanization and Crime",
    x = "Urbanization",
    y = "Crime Rate"
  )
Visual_3
## `geom_smooth()` using formula = 'y ~ x'
As urbanization increases, crime rate increases.
library(patchwork)  # lets ggplot objects be combined with +
Visual_1 + Visual_2 + Visual_3
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
All three scatterplots show a positive relationship with crime rate.
Next, we will investigate which factors are most strongly correlated with crime.
Numeric_Florida_Data <- Florida_Data %>%
  select(Crime, Income, HighSchoolGrad, UrbanPop)
view(Numeric_Florida_Data)
library(Hmisc)  # provides rcorr()
Correlation_Matrix <- rcorr(as.matrix(Numeric_Florida_Data))
Correlation_Matrix
## Crime Income HighSchoolGrad UrbanPop
## Crime 1.00 0.43 0.47 0.68
## Income 0.43 1.00 0.79 0.73
## HighSchoolGrad 0.47 0.79 1.00 0.79
## UrbanPop 0.68 0.73 0.79 1.00
##
## n= 67
##
##
## P
## Crime Income HighSchoolGrad UrbanPop
## Crime 2e-04 0e+00 0e+00
## Income 2e-04 0e+00 0e+00
## HighSchoolGrad 0e+00 0e+00 0e+00
## UrbanPop 0e+00 0e+00 0e+00
Interpreting each relationship:
Income x Crime: 0.43. As income increases, crime increases. (Positive, weak to moderate)
Education x Crime: 0.47. As education increases, crime increases. (Positive, weak to moderate)
Urbanization x Crime: 0.68. As urbanization increases, crime increases. (Positive, moderately strong)
All three correlations are statistically significant (p < 0.001). The variable that shows the strongest relationship with Crime is UrbanPop (urbanization).
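One way to make the gap between these correlations concrete is to square them: for a single predictor, r squared is the share of variance in Crime that a straight-line fit on that predictor accounts for. The sketch below is only illustrative arithmetic on the rounded correlations reported above. The vector name `r_with_crime` is an assumption introduced here, and because the inputs are rounded the results are approximate.

```r
# Correlations with Crime as reported by rcorr() above (rounded to two decimals)
r_with_crime <- c(Income = 0.43, HighSchoolGrad = 0.47, UrbanPop = 0.68)

# Squaring r gives the share of variance in Crime that each predictor
# accounts for on its own in a one-predictor linear fit
shared_variance <- round(r_with_crime^2, 2)

# Rank predictors from strongest to weakest association
sort(shared_variance, decreasing = TRUE)
```

On these rounded inputs, urbanization accounts for roughly 0.46 of the variance, compared with about 0.22 for education and 0.18 for income.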
m1<-lm(Crime ~ Income, data=Numeric_Florida_Data)
summary(m1)
##
## Call:
## lm(formula = Crime ~ Income, data = Numeric_Florida_Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.452 -21.347 -3.102 17.580 69.357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.6059 16.7863 -0.691 0.491782
## Income 2.6115 0.6729 3.881 0.000246 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.6 on 65 degrees of freedom
## Multiple R-squared: 0.1881, Adjusted R-squared: 0.1756
## F-statistic: 15.06 on 1 and 65 DF, p-value: 0.0002456
AIC(m1)
## [1] 628.6045
m2<-lm(Crime ~ HighSchoolGrad, data=Numeric_Florida_Data)
summary(m2)
##
## Call:
## lm(formula = Crime ~ HighSchoolGrad, data = Numeric_Florida_Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.74 -21.36 -4.82 17.42 82.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50.8569 24.4507 -2.080 0.0415 *
## HighSchoolGrad 1.4860 0.3491 4.257 6.81e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.12 on 65 degrees of freedom
## Multiple R-squared: 0.218, Adjusted R-squared: 0.206
## F-statistic: 18.12 on 1 and 65 DF, p-value: 6.806e-05
AIC(m2)
## [1] 626.0932
m3<-lm(Crime ~ UrbanPop, data=Numeric_Florida_Data)
summary(m3)
##
## Call:
## lm(formula = Crime ~ UrbanPop, data = Numeric_Florida_Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.766 -16.541 -4.741 16.521 49.632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.54125 4.53930 5.406 9.85e-07 ***
## UrbanPop 0.56220 0.07573 7.424 3.08e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.9 on 65 degrees of freedom
## Multiple R-squared: 0.4588, Adjusted R-squared: 0.4505
## F-statistic: 55.11 on 1 and 65 DF, p-value: 3.084e-10
AIC(m3)
## [1] 601.43
m4<-lm(Crime ~ Income + HighSchoolGrad, data=Numeric_Florida_Data)
summary(m4)
##
## Call:
## lm(formula = Crime ~ Income + HighSchoolGrad, data = Numeric_Florida_Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.75 -19.61 -4.57 18.52 77.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -46.1094 24.9723 -1.846 0.0695 .
## Income 1.0311 1.0839 0.951 0.3450
## HighSchoolGrad 1.0540 0.5729 1.840 0.0705 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.14 on 64 degrees of freedom
## Multiple R-squared: 0.2289, Adjusted R-squared: 0.2048
## F-statistic: 9.5 on 2 and 64 DF, p-value: 0.000244
AIC(m4)
## [1] 627.1524
m5<-lm(Crime ~ Income + UrbanPop, data=Numeric_Florida_Data)
summary(m5)
##
## Call:
## lm(formula = Crime ~ Income + UrbanPop, data = Numeric_Florida_Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.130 -15.590 -6.484 16.595 48.921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.9723 16.3536 2.444 0.0173 *
## Income -0.7906 0.8049 -0.982 0.3297
## UrbanPop 0.6418 0.1110 5.784 2.36e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.91 on 64 degrees of freedom
## Multiple R-squared: 0.4669, Adjusted R-squared: 0.4502
## F-statistic: 28.02 on 2 and 64 DF, p-value: 1.815e-09
AIC(m5)
## [1] 602.4276
m6<-lm(Crime ~ Income + HighSchoolGrad + UrbanPop, data=Numeric_Florida_Data)
summary(m6)
##
## Call:
## lm(formula = Crime ~ Income + HighSchoolGrad + UrbanPop, data = Numeric_Florida_Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.407 -15.080 -6.588 16.178 50.125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.7147 28.5895 2.089 0.0408 *
## Income -0.3831 0.9405 -0.407 0.6852
## HighSchoolGrad -0.4673 0.5544 -0.843 0.4025
## UrbanPop 0.6972 0.1291 5.399 1.08e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.95 on 63 degrees of freedom
## Multiple R-squared: 0.4728, Adjusted R-squared: 0.4477
## F-statistic: 18.83 on 3 and 63 DF, p-value: 7.823e-09
AIC(m6)
## [1] 603.6764
m1 (Income): Model 1 has an R-squared of 0.19, an adjusted R-squared of 0.18, and an AIC of 628.60. After adjusting for the number of predictors, the model explains about 18% of the variance in crime.
m2 (HighSchoolGrad): Model 2 has an R-squared of 0.22, an adjusted R-squared of 0.21, and an AIC of 626.09. After adjustment, the model explains about 21% of the variance.
m3 (UrbanPop): Model 3 has an R-squared of 0.46, an adjusted R-squared of 0.45, and an AIC of 601.43. After adjustment, the model explains about 45% of the variance.
m4 (Income + HighSchoolGrad): Model 4 has an R-squared of 0.23, an adjusted R-squared of 0.20, and an AIC of 627.15. After adjustment, the model explains about 20% of the variance.
m5 (Income + UrbanPop): Model 5 has an R-squared of 0.47, an adjusted R-squared of 0.45, and an AIC of 602.43. After adjustment, the model explains about 45% of the variance.
m6 (Income + HighSchoolGrad + UrbanPop): Model 6 has an R-squared of 0.47, an adjusted R-squared of 0.45, and an AIC of 603.68. After adjustment, the model explains about 45% of the variance.
Models 3, 5, and 6 all have an adjusted R-squared of about 0.45 but different AIC values: 601.43 for Model 3, 602.43 for Model 5, and 603.68 for Model 6. Model 3 has the lowest AIC, and adding Income or HighSchoolGrad to it does not raise the adjusted R-squared, so Model 3 (Crime ~ UrbanPop) is the model that best balances accuracy and simplicity.
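The comparison is easier to scan when the fit statistics sit in one table, ranked by AIC. The sketch below re-enters the adjusted R-squared and AIC values printed above by hand rather than refitting the models. The data frame `model_fit` and its column names are assumptions introduced for illustration.

```r
# Fit statistics copied from the summary() and AIC() output above
model_fit <- data.frame(
  Model = c("m1", "m2", "m3", "m4", "m5", "m6"),
  Predictors = c("Income", "HighSchoolGrad", "UrbanPop",
                 "Income + HighSchoolGrad", "Income + UrbanPop",
                 "Income + HighSchoolGrad + UrbanPop"),
  AdjR2 = c(0.1756, 0.2060, 0.4505, 0.2048, 0.4502, 0.4477),
  AIC = c(628.6045, 626.0932, 601.4300, 627.1524, 602.4276, 603.6764)
)

# Difference from the lowest AIC; 0 marks the preferred model
model_fit$DeltaAIC <- model_fit$AIC - min(model_fit$AIC)

# Sort so the model with the lowest AIC comes first
model_fit[order(model_fit$AIC), ]
```

Model 3 comes first with a difference of 0, followed by Model 5 (about 1.0) and Model 6 (about 2.2). The three models without UrbanPop trail by roughly 25 to 27 AIC units, which supports the conclusion that urbanization carries most of the explanatory power.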
Dear Chief of the Florida Police Department,
The best model for predicting crime rates is Model 3 (Crime ~ UrbanPop), and the most influential predictor is urbanization. This model explains about 45% of the variance in crime rates across counties. Because more urbanized counties tend to experience higher crime rates, I recommend focusing crime-reduction efforts on more urbanized areas. One limitation of this analysis is that correlation does not equal causation: a fairly strong correlation between crime and urbanization does not mean that one causes the other. I recommend taking this memo into consideration alongside other factors that may contribute to high crime rates.