Now I’m going to load and clean the data
florida_crime <- read_excel("Florida County Crime Rates.xlsx")
view(florida_crime)
head(florida_crime)
## # A tibble: 6 × 5
## County C I HS U
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ALACHUA 104 22.1 82.7 73.2
## 2 BAKER 20 25.8 64.1 21.5
## 3 BAY 64 24.7 74.7 85
## 4 BRADFORD 50 24.6 65 23.2
## 5 BREVARD 64 30.5 82.3 91.9
## 6 BROWARD 94 30.6 76.8 98.9
florida_crime <- florida_crime %>%
rename (
Crime = C,
Income = I,
HighSchoolGrad = HS,
UrbanPop = U
)
florida_crime
## # A tibble: 67 × 5
## County Crime Income HighSchoolGrad UrbanPop
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ALACHUA 104 22.1 82.7 73.2
## 2 BAKER 20 25.8 64.1 21.5
## 3 BAY 64 24.7 74.7 85
## 4 BRADFORD 50 24.6 65 23.2
## 5 BREVARD 64 30.5 82.3 91.9
## 6 BROWARD 94 30.6 76.8 98.9
## 7 CALHOUN 8 18.6 55.9 0
## 8 CHARLOTTE 35 25.7 75.7 80.2
## 9 CITRUS 27 21.3 68.6 31
## 10 CLAY 41 34.9 81.2 65.8
## # ℹ 57 more rows
florida_crime <- florida_crime %>%
mutate (County = str_to_title(County))
florida_crime
## # A tibble: 67 × 5
## County Crime Income HighSchoolGrad UrbanPop
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Alachua 104 22.1 82.7 73.2
## 2 Baker 20 25.8 64.1 21.5
## 3 Bay 64 24.7 74.7 85
## 4 Bradford 50 24.6 65 23.2
## 5 Brevard 64 30.5 82.3 91.9
## 6 Broward 94 30.6 76.8 98.9
## 7 Calhoun 8 18.6 55.9 0
## 8 Charlotte 35 25.7 75.7 80.2
## 9 Citrus 27 21.3 68.6 31
## 10 Clay 41 34.9 81.2 65.8
## # ℹ 57 more rows
skim(florida_crime)
| Name | florida_crime |
| Number of rows | 67 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| County | 0 | 1 | 3 | 9 | 0 | 67 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Crime | 0 | 1 | 52.40 | 28.19 | 0.0 | 35.50 | 52.0 | 69.00 | 128.0 | ▃▇▇▃▂ |
| Income | 0 | 1 | 24.51 | 4.68 | 15.4 | 21.05 | 24.6 | 28.15 | 35.6 | ▂▇▅▅▂ |
| HighSchoolGrad | 0 | 1 | 69.49 | 8.86 | 54.5 | 62.45 | 69.0 | 76.90 | 84.9 | ▇▇▆▇▆ |
| UrbanPop | 0 | 1 | 49.56 | 33.97 | 0.0 | 21.60 | 44.6 | 83.55 | 99.6 | ▅▆▂▃▇ |
summary(florida_crime)
## County Crime Income HighSchoolGrad
## Length:67 Min. : 0.0 Min. :15.40 Min. :54.50
## Class :character 1st Qu.: 35.5 1st Qu.:21.05 1st Qu.:62.45
## Mode :character Median : 52.0 Median :24.60 Median :69.00
## Mean : 52.4 Mean :24.51 Mean :69.49
## 3rd Qu.: 69.0 3rd Qu.:28.15 3rd Qu.:76.90
## Max. :128.0 Max. :35.60 Max. :84.90
## UrbanPop
## Min. : 0.00
## 1st Qu.:21.60
## Median :44.60
## Mean :49.56
## 3rd Qu.:83.55
## Max. :99.60
Descriptive statistics for florida_crime
summary(florida_crime)
## County Crime Income HighSchoolGrad
## Length:67 Min. : 0.0 Min. :15.40 Min. :54.50
## Class :character 1st Qu.: 35.5 1st Qu.:21.05 1st Qu.:62.45
## Mode :character Median : 52.0 Median :24.60 Median :69.00
## Mean : 52.4 Mean :24.51 Mean :69.49
## 3rd Qu.: 69.0 3rd Qu.:28.15 3rd Qu.:76.90
## Max. :128.0 Max. :35.60 Max. :84.90
## UrbanPop
## Min. : 0.00
## 1st Qu.:21.60
## Median :44.60
## Mean :49.56
## 3rd Qu.:83.55
## Max. :99.60
By using the summary function (twice) we can take a look at the mean, median, and range od the dataset
Mean Crime = 52.4, Mean Income = 24.15, Mean HighSchoolGrad = 69.49, Mean UrbanPop = 49.56 Median Crime = 52, Median Income = 24.60, Median HighSchoolGrad = 69, Median UrbanPop = 44.60 Range Crime = 0-128, Range Income = 15.40-35.60, Range HighSchoolGrad = 54.50-84.90, Range UrbanPop = 0-99.60
Visualizing the data in all its glory
viz_1 <- ggplot(florida_crime, aes(x=Income, y=Crime))+
geom_point(size=2.5)+
geom_smooth(method = "lm", se=FALSE)+
labs(
title = "Income vs Crime",
x = "Income",
y = "Crime"
)
viz_1
## `geom_smooth()` using formula = 'y ~ x'
Looking at the slope line, the graph shows that as Income increases
Crime also increases.
viz_2 <- ggplot(florida_crime, aes(x=HighSchoolGrad, y=Crime))+
geom_point(size=2.5)+
geom_smooth(method = "lm", se=FALSE)+
labs(
title = "School vs Crime",
x = "School",
y = "Crime"
)
viz_2
## `geom_smooth()` using formula = 'y ~ x'
Here we aslo see a similar trend from the previous graph. As school
graduation increases so does Crime.
viz_3 <- ggplot(florida_crime, aes(x=UrbanPop, y=Crime))+
geom_point(size=2.5)+
geom_smooth(method = "lm", se=FALSE)+
labs(
title = "Urban Lifestyle vs Crime",
x = "UrbanPop",
y = "Crime"
)
viz_3
## `geom_smooth()` using formula = 'y ~ x'
Again, there is a similar pattern as the two previous graphs. As the
percentage of the urban population increases so does Crime.
Let’s use patchwork for better readability
viz_1+viz_2+viz_3
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Numeric_florida_crime <- florida_crime %>%
select(Crime, Income, HighSchoolGrad, UrbanPop)
head(Numeric_florida_crime)
## # A tibble: 6 × 4
## Crime Income HighSchoolGrad UrbanPop
## <dbl> <dbl> <dbl> <dbl>
## 1 104 22.1 82.7 73.2
## 2 20 25.8 64.1 21.5
## 3 64 24.7 74.7 85
## 4 50 24.6 65 23.2
## 5 64 30.5 82.3 91.9
## 6 94 30.6 76.8 98.9
Cor_Max <- rcorr(as.matrix(Numeric_florida_crime))
Cor_Max
## Crime Income HighSchoolGrad UrbanPop
## Crime 1.00 0.43 0.47 0.68
## Income 0.43 1.00 0.79 0.73
## HighSchoolGrad 0.47 0.79 1.00 0.79
## UrbanPop 0.68 0.73 0.79 1.00
##
## n= 67
##
##
## P
## Crime Income HighSchoolGrad UrbanPop
## Crime 2e-04 0e+00 0e+00
## Income 2e-04 0e+00 0e+00
## HighSchoolGrad 0e+00 0e+00 0e+00
## UrbanPop 0e+00 0e+00 0e+00
There is a positive correlation between Income and Crime r = 0.43 There is a positive correlation between HighSchoolGrad and Crime r= 0.47. There is a positive correlation between UrbanPop and Crime r = 0.68 (the strongest correlation among the variables).
m1 <- lm(Crime ~ Income, data = Numeric_florida_crime)
summary(m1)
##
## Call:
## lm(formula = Crime ~ Income, data = Numeric_florida_crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.452 -21.347 -3.102 17.580 69.357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.6059 16.7863 -0.691 0.491782
## Income 2.6115 0.6729 3.881 0.000246 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.6 on 65 degrees of freedom
## Multiple R-squared: 0.1881, Adjusted R-squared: 0.1756
## F-statistic: 15.06 on 1 and 65 DF, p-value: 0.0002456
AIC(m1)
## [1] 628.6045
m2 <- lm(Crime ~ HighSchoolGrad, data = Numeric_florida_crime)
summary(m2)
##
## Call:
## lm(formula = Crime ~ HighSchoolGrad, data = Numeric_florida_crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.74 -21.36 -4.82 17.42 82.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50.8569 24.4507 -2.080 0.0415 *
## HighSchoolGrad 1.4860 0.3491 4.257 6.81e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.12 on 65 degrees of freedom
## Multiple R-squared: 0.218, Adjusted R-squared: 0.206
## F-statistic: 18.12 on 1 and 65 DF, p-value: 6.806e-05
AIC(m2)
## [1] 626.0932
m3 <- lm(Crime ~ UrbanPop, data = Numeric_florida_crime)
summary(m3)
##
## Call:
## lm(formula = Crime ~ UrbanPop, data = Numeric_florida_crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.766 -16.541 -4.741 16.521 49.632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.54125 4.53930 5.406 9.85e-07 ***
## UrbanPop 0.56220 0.07573 7.424 3.08e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.9 on 65 degrees of freedom
## Multiple R-squared: 0.4588, Adjusted R-squared: 0.4505
## F-statistic: 55.11 on 1 and 65 DF, p-value: 3.084e-10
AIC(m3)
## [1] 601.43
Multiple regression models
m4 <- lm(Crime ~ Income + UrbanPop, data = Numeric_florida_crime)
summary(m4)
##
## Call:
## lm(formula = Crime ~ Income + UrbanPop, data = Numeric_florida_crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.130 -15.590 -6.484 16.595 48.921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.9723 16.3536 2.444 0.0173 *
## Income -0.7906 0.8049 -0.982 0.3297
## UrbanPop 0.6418 0.1110 5.784 2.36e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.91 on 64 degrees of freedom
## Multiple R-squared: 0.4669, Adjusted R-squared: 0.4502
## F-statistic: 28.02 on 2 and 64 DF, p-value: 1.815e-09
AIC(m4)
## [1] 602.4276
m5 <- lm(Crime ~ Income + HighSchoolGrad, data = Numeric_florida_crime)
summary(m5)
##
## Call:
## lm(formula = Crime ~ Income + HighSchoolGrad, data = Numeric_florida_crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.75 -19.61 -4.57 18.52 77.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -46.1094 24.9723 -1.846 0.0695 .
## Income 1.0311 1.0839 0.951 0.3450
## HighSchoolGrad 1.0540 0.5729 1.840 0.0705 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.14 on 64 degrees of freedom
## Multiple R-squared: 0.2289, Adjusted R-squared: 0.2048
## F-statistic: 9.5 on 2 and 64 DF, p-value: 0.000244
AIC(m5)
## [1] 627.1524
m6 <- lm(Crime ~ Income + UrbanPop + HighSchoolGrad, data = Numeric_florida_crime)
summary(m6)
##
## Call:
## lm(formula = Crime ~ Income + UrbanPop + HighSchoolGrad, data = Numeric_florida_crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.407 -15.080 -6.588 16.178 50.125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.7147 28.5895 2.089 0.0408 *
## Income -0.3831 0.9405 -0.407 0.6852
## UrbanPop 0.6972 0.1291 5.399 1.08e-06 ***
## HighSchoolGrad -0.4673 0.5544 -0.843 0.4025
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.95 on 63 degrees of freedom
## Multiple R-squared: 0.4728, Adjusted R-squared: 0.4477
## F-statistic: 18.83 on 3 and 63 DF, p-value: 7.823e-09
AIC(m6)
## [1] 603.6764
m1, R-squared of 0.188, Adjusted R-squared of 0.176, and AIC [628.6045] (variance explained is 18%).
m2, R-squared of 0.218, Adjusted R-squared of 0.206, and AIC [626.0932] (variance explained is 21%).
m3, R-squared of 0.459, Adjusted R-squared of 0.451, and AIC [601.43] (variance explained is 45%).
m4, R-squared of 0.467, Adjusted R-squared of 0.451, and AIC [602.4276] (variance explained is 45%).
m5, R-squared of 0.229, Adjusted R-squared of 0.205, and AIC [627.1524] (variance explained is 20%).
m6, R-squared of 0.473, Adjusted R-squared of 0.447, and AIC [603.6764] (variance explained is 45%).
I can see that m3, m4, and m6 have the highest R-squared and Adjusted R-squared compared to m1,m2, and m5. Also, m3 has the lowest AIC which makes this model less complex when explaining the variance between the variables. I can also see that m4 and m6 have the lowest AIC following m3. m3 is more balance when it comes to R-squared and Adjusted R-squared. It’s safe to say that m3 is the best model for accuracy and simplicity.
After cleaning and analyzing the data by comparing different models for better crime predictions, I am confident about recommending model 3 (Crime and Urban Population) as it the most influential predictor when it comes to crime. This model explains 45% of the variance between the dependent variable (Crime) and the independent variable (UrbanPop). The PD should focus on highly populated areas to provide a sense of security to the population, which can help reduce the possibilities of committing crimes. One limitation is that correlation is not causation, and besides we don’t really have much data to look into age as a potential variable to understand what age groups are more likely to engage in such activities and devise better plans for interventions.