Introduction

The Florida Police Department has hired me as their new data analyst. The mission is to uncover which socioeconomic factors are most strongly associated with higher crime rates across Florida counties. The department is particularly interested in whether income, education, or urbanization plays the largest role in explaining differences in crime rates.

Step 1: Loading and Preparing the data

Loading the data

library(readxl)     # read_xlsx()
library(tidyverse)  # dplyr, stringr, ggplot2
Florida_Data<-read_xlsx("Florida County Crime Rates.xlsx")
View(Florida_Data)

Cleaning the data

We will rename the columns to Crime, Income, HighSchoolGrad, and UrbanPop, and make sure all county names are formatted so that only the first letter of each word is capitalized.

Florida_Data<- Florida_Data %>%
  rename(
    Crime= C,
    Income= I,
    HighSchoolGrad= HS,
    UrbanPop=U
  )

Florida_Data<-Florida_Data %>%
  mutate(County=str_to_title(County))

Florida_Data
## # A tibble: 67 × 5
##    County    Crime Income HighSchoolGrad UrbanPop
##    <chr>     <dbl>  <dbl>          <dbl>    <dbl>
##  1 Alachua     104   22.1           82.7     73.2
##  2 Baker        20   25.8           64.1     21.5
##  3 Bay          64   24.7           74.7     85  
##  4 Bradford     50   24.6           65       23.2
##  5 Brevard      64   30.5           82.3     91.9
##  6 Broward      94   30.6           76.8     98.9
##  7 Calhoun       8   18.6           55.9      0  
##  8 Charlotte    35   25.7           75.7     80.2
##  9 Citrus       27   21.3           68.6     31  
## 10 Clay         41   34.9           81.2     65.8
## # ℹ 57 more rows

Inspect and summarize data set

Next we will inspect and summarize the dataset

str(Florida_Data)
## tibble [67 × 5] (S3: tbl_df/tbl/data.frame)
##  $ County        : chr [1:67] "Alachua" "Baker" "Bay" "Bradford" ...
##  $ Crime         : num [1:67] 104 20 64 50 64 94 8 35 27 41 ...
##  $ Income        : num [1:67] 22.1 25.8 24.7 24.6 30.5 30.6 18.6 25.7 21.3 34.9 ...
##  $ HighSchoolGrad: num [1:67] 82.7 64.1 74.7 65 82.3 76.8 55.9 75.7 68.6 81.2 ...
##  $ UrbanPop      : num [1:67] 73.2 21.5 85 23.2 91.9 98.9 0 80.2 31 65.8 ...
library(skimr)  # provides skim()
skim(Florida_Data)
Data summary
Name Florida_Data
Number of rows 67
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable  n_missing complete_rate min max empty n_unique whitespace
County                 0             1   3   9     0       67          0

Variable type: numeric

skim_variable  n_missing complete_rate  mean    sd   p0   p25  p50   p75  p100 hist
Crime                  0             1 52.40 28.19  0.0 35.50 52.0 69.00 128.0 ▃▇▇▃▂
Income                 0             1 24.51  4.68 15.4 21.05 24.6 28.15  35.6 ▂▇▅▅▂
HighSchoolGrad         0             1 69.49  8.86 54.5 62.45 69.0 76.90  84.9 ▇▇▆▇▆
UrbanPop               0             1 49.56 33.97  0.0 21.60 44.6 83.55  99.6 ▅▆▂▃▇
summary(Florida_Data)
##     County              Crime           Income      HighSchoolGrad 
##  Length:67          Min.   :  0.0   Min.   :15.40   Min.   :54.50  
##  Class :character   1st Qu.: 35.5   1st Qu.:21.05   1st Qu.:62.45  
##  Mode  :character   Median : 52.0   Median :24.60   Median :69.00  
##                     Mean   : 52.4   Mean   :24.51   Mean   :69.49  
##                     3rd Qu.: 69.0   3rd Qu.:28.15   3rd Qu.:76.90  
##                     Max.   :128.0   Max.   :35.60   Max.   :84.90  
##     UrbanPop    
##  Min.   : 0.00  
##  1st Qu.:21.60  
##  Median :44.60  
##  Mean   :49.56  
##  3rd Qu.:83.55  
##  Max.   :99.60

Step 2: Exploratory Data Analysis

We will now compute basic descriptive statistics.

summary(Florida_Data)
##     County              Crime           Income      HighSchoolGrad 
##  Length:67          Min.   :  0.0   Min.   :15.40   Min.   :54.50  
##  Class :character   1st Qu.: 35.5   1st Qu.:21.05   1st Qu.:62.45  
##  Mode  :character   Median : 52.0   Median :24.60   Median :69.00  
##                     Mean   : 52.4   Mean   :24.51   Mean   :69.49  
##                     3rd Qu.: 69.0   3rd Qu.:28.15   3rd Qu.:76.90  
##                     Max.   :128.0   Max.   :35.60   Max.   :84.90  
##     UrbanPop    
##  Min.   : 0.00  
##  1st Qu.:21.60  
##  Median :44.60  
##  Mean   :49.56  
##  3rd Qu.:83.55  
##  Max.   :99.60

Using this simple code, we get the minimum, median, mean, and maximum of each column.

  • For Crime: Mean= 52.4, Median= 52, Range= 0-128

  • For Income: Mean= 24.51, Median= 24.60, Range= 15.40-35.60

  • For HighSchoolGrad: Mean= 69.49, Median= 69, Range= 54.50-84.90

  • For UrbanPop: Mean= 49.56, Median= 44.60, Range= 0-99.60
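The same figures can be gathered into one table instead of being read off the `summary()` printout. The following is a minimal sketch in base R, assuming only the first ten counties printed earlier in this report; `sample_counties` and `descriptive_table` are illustrative names, and the values will differ from the full 67-county results.

```r
# Assumption: only the first ten counties shown in the printed tibble are used,
# so these statistics will not match the full 67-county summary.
sample_counties <- data.frame(
  County         = c("Alachua", "Baker", "Bay", "Bradford", "Brevard",
                     "Broward", "Calhoun", "Charlotte", "Citrus", "Clay"),
  Crime          = c(104, 20, 64, 50, 64, 94, 8, 35, 27, 41),
  Income         = c(22.1, 25.8, 24.7, 24.6, 30.5, 30.6, 18.6, 25.7, 21.3, 34.9),
  HighSchoolGrad = c(82.7, 64.1, 74.7, 65, 82.3, 76.8, 55.9, 75.7, 68.6, 81.2),
  UrbanPop       = c(73.2, 21.5, 85, 23.2, 91.9, 98.9, 0, 80.2, 31, 65.8)
)

# Keep only the numeric columns
numeric_cols <- sample_counties[, c("Crime", "Income", "HighSchoolGrad", "UrbanPop")]

# One row per variable: mean, median, minimum, and maximum
descriptive_table <- t(sapply(numeric_cols, function(x) {
  c(Mean = mean(x), Median = median(x), Min = min(x), Max = max(x))
}))
round(descriptive_table, 2)
```

With the full data, the same `sapply()` call could be pointed at the numeric columns of Florida_Data.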

Next we will create three scatterplots.

Visual 1: Income and Crime

Visual_1<- ggplot(Florida_Data, aes(x=Income, y=Crime))+
  geom_point(size=2.5)+ 
  geom_smooth(method = "lm", se=FALSE) +
  labs(
    title = "Income and Crime",
    x="Income",
    y="Crime Rate"
  )
Visual_1
## `geom_smooth()` using formula = 'y ~ x'

As income increases, crime rate increases.

Visual 2: Education and Crime

Visual_2<- ggplot(Florida_Data, aes(x=HighSchoolGrad, y=Crime))+
  geom_point(size=2.5)+ 
  geom_smooth(method = "lm", se=FALSE) +
  labs(
    title = "Education and Crime",
    x="Education",
    y="Crime Rate"
  )
Visual_2
## `geom_smooth()` using formula = 'y ~ x'

As education increases, crime rate increases.

Visual 3: Urbanization and Crime

Visual_3<- ggplot(Florida_Data, aes(x=UrbanPop, y=Crime))+
  geom_point(size=2.5)+ 
  geom_smooth(method = "lm", se=FALSE) +
  labs(
    title = "Urbanization and Crime",
    x="Urbanization",
    y="Crime Rate"
  )
Visual_3
## `geom_smooth()` using formula = 'y ~ x'

As urbanization increases, crime rate increases.

Let's look at all of our visuals side by side.

library(patchwork)  # assumed: patchwork lets `+` place ggplot objects side by side
Visual_1+Visual_2+Visual_3
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

We can see that in our three scatterplots, we have positive relationships.

Step 3: Correlation Analysis

We will be investigating which factors are most strongly correlated with crime.

Computing Correlation Matrix

Numeric_Florida_Data<- Florida_Data %>%
  select(Crime, Income, HighSchoolGrad, UrbanPop)
view(Numeric_Florida_Data)

library(Hmisc)  # provides rcorr()
Correlation_Matrix<-rcorr(as.matrix(Numeric_Florida_Data))
Correlation_Matrix
##                Crime Income HighSchoolGrad UrbanPop
## Crime           1.00   0.43           0.47     0.68
## Income          0.43   1.00           0.79     0.73
## HighSchoolGrad  0.47   0.79           1.00     0.79
## UrbanPop        0.68   0.73           0.79     1.00
## 
## n= 67 
## 
## 
## P
##                Crime Income HighSchoolGrad UrbanPop
## Crime                2e-04  0e+00          0e+00   
## Income         2e-04        0e+00          0e+00   
## HighSchoolGrad 0e+00 0e+00                 0e+00   
## UrbanPop       0e+00 0e+00  0e+00

Interpreting each relationship:

  • Income x Crime: 0.43. As income increases, crime tends to increase. (Positive, weak)

  • Education x Crime: 0.47. As education increases, crime tends to increase. (Positive, weak)

  • Urbanization x Crime: 0.68. As urbanization increases, crime tends to increase. (Positive, moderately strong)

All three correlations are statistically significant (p < 0.001 in the P block above). The variable that shows the strongest relationship with Crime is UrbanPop (urbanization).
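The ranking of predictors by their correlation with Crime can also be produced directly. This is a minimal sketch using base R's `cor()` and `cor.test()` rather than `rcorr()`, and it assumes only the first ten counties printed earlier in this report, so the coefficients will not match the full-data matrix. The names `sample_counties`, `crime_correlations`, and `ranked` are illustrative.

```r
# Assumption: illustrative subset of the first ten counties from the printed tibble.
sample_counties <- data.frame(
  Crime          = c(104, 20, 64, 50, 64, 94, 8, 35, 27, 41),
  Income         = c(22.1, 25.8, 24.7, 24.6, 30.5, 30.6, 18.6, 25.7, 21.3, 34.9),
  HighSchoolGrad = c(82.7, 64.1, 74.7, 65, 82.3, 76.8, 55.9, 75.7, 68.6, 81.2),
  UrbanPop       = c(73.2, 21.5, 85, 23.2, 91.9, 98.9, 0, 80.2, 31, 65.8)
)

# Pearson correlation matrix, then keep the Crime row without Crime itself
correlation_matrix <- cor(sample_counties)
crime_correlations <- correlation_matrix["Crime", c("Income", "HighSchoolGrad", "UrbanPop")]

# Rank predictors by absolute correlation with Crime, strongest first
ranked <- sort(abs(crime_correlations), decreasing = TRUE)
round(ranked, 2)

# p-value for a single pair, comparable to one cell of the P block from rcorr()
urban_p_value <- cor.test(sample_counties$UrbanPop, sample_counties$Crime)$p.value
urban_p_value
```

Sorting on the absolute value keeps a strong negative correlation from being ranked below a weak positive one.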

Step 4: Building Regression Models

Building simple regression models

m1<-lm(Crime ~ Income, data=Numeric_Florida_Data)
summary(m1)
## 
## Call:
## lm(formula = Crime ~ Income, data = Numeric_Florida_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.452 -21.347  -3.102  17.580  69.357 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11.6059    16.7863  -0.691 0.491782    
## Income        2.6115     0.6729   3.881 0.000246 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.6 on 65 degrees of freedom
## Multiple R-squared:  0.1881, Adjusted R-squared:  0.1756 
## F-statistic: 15.06 on 1 and 65 DF,  p-value: 0.0002456
AIC(m1)
## [1] 628.6045
m2<-lm(Crime ~ HighSchoolGrad, data=Numeric_Florida_Data)
summary(m2)
## 
## Call:
## lm(formula = Crime ~ HighSchoolGrad, data = Numeric_Florida_Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -43.74 -21.36  -4.82  17.42  82.27 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -50.8569    24.4507  -2.080   0.0415 *  
## HighSchoolGrad   1.4860     0.3491   4.257 6.81e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.12 on 65 degrees of freedom
## Multiple R-squared:  0.218,  Adjusted R-squared:  0.206 
## F-statistic: 18.12 on 1 and 65 DF,  p-value: 6.806e-05
AIC(m2)
## [1] 626.0932
m3<-lm(Crime ~ UrbanPop, data=Numeric_Florida_Data)
summary(m3)
## 
## Call:
## lm(formula = Crime ~ UrbanPop, data = Numeric_Florida_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.766 -16.541  -4.741  16.521  49.632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.54125    4.53930   5.406 9.85e-07 ***
## UrbanPop     0.56220    0.07573   7.424 3.08e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.9 on 65 degrees of freedom
## Multiple R-squared:  0.4588, Adjusted R-squared:  0.4505 
## F-statistic: 55.11 on 1 and 65 DF,  p-value: 3.084e-10
AIC(m3)
## [1] 601.43

Building Multiple Regression Models

m4<-lm(Crime ~ Income + HighSchoolGrad, data=Numeric_Florida_Data)
summary(m4)
## 
## Call:
## lm(formula = Crime ~ Income + HighSchoolGrad, data = Numeric_Florida_Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42.75 -19.61  -4.57  18.52  77.86 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    -46.1094    24.9723  -1.846   0.0695 .
## Income           1.0311     1.0839   0.951   0.3450  
## HighSchoolGrad   1.0540     0.5729   1.840   0.0705 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.14 on 64 degrees of freedom
## Multiple R-squared:  0.2289, Adjusted R-squared:  0.2048 
## F-statistic:   9.5 on 2 and 64 DF,  p-value: 0.000244
AIC(m4)
## [1] 627.1524
m5<-lm(Crime ~ Income + UrbanPop, data=Numeric_Florida_Data)
summary(m5)
## 
## Call:
## lm(formula = Crime ~ Income + UrbanPop, data = Numeric_Florida_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.130 -15.590  -6.484  16.595  48.921 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.9723    16.3536   2.444   0.0173 *  
## Income       -0.7906     0.8049  -0.982   0.3297    
## UrbanPop      0.6418     0.1110   5.784 2.36e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.91 on 64 degrees of freedom
## Multiple R-squared:  0.4669, Adjusted R-squared:  0.4502 
## F-statistic: 28.02 on 2 and 64 DF,  p-value: 1.815e-09
AIC(m5)
## [1] 602.4276
m6<-lm(Crime ~ Income + HighSchoolGrad + UrbanPop, data=Numeric_Florida_Data)
summary(m6)
## 
## Call:
## lm(formula = Crime ~ Income + HighSchoolGrad + UrbanPop, data = Numeric_Florida_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.407 -15.080  -6.588  16.178  50.125 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     59.7147    28.5895   2.089   0.0408 *  
## Income          -0.3831     0.9405  -0.407   0.6852    
## HighSchoolGrad  -0.4673     0.5544  -0.843   0.4025    
## UrbanPop         0.6972     0.1291   5.399 1.08e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.95 on 63 degrees of freedom
## Multiple R-squared:  0.4728, Adjusted R-squared:  0.4477 
## F-statistic: 18.83 on 3 and 63 DF,  p-value: 7.823e-09
AIC(m6)
## [1] 603.6764

Comparing all models based on R squared, adjusted R squared, and AIC.

  • m1: Model 1 has an R-squared of 0.19, an adjusted R-squared of 0.18, and an AIC of 628.60. After adjustment, the model explains about 18% of the variance in crime rate.

  • m2: Model 2 has an R-squared of 0.22, an adjusted R-squared of 0.21, and an AIC of 626.09. After adjustment, the model explains about 21% of the variance in crime rate.

  • m3: Model 3 has an R-squared of 0.46, an adjusted R-squared of 0.45, and an AIC of 601.43. After adjustment, the model explains about 45% of the variance in crime rate.

  • m4: Model 4 has an R-squared of 0.23, an adjusted R-squared of 0.20, and an AIC of 627.15. After adjustment, the model explains about 20% of the variance in crime rate.

  • m5: Model 5 has an R-squared of 0.47, an adjusted R-squared of 0.45, and an AIC of 602.43. After adjustment, the model explains about 45% of the variance in crime rate.

  • m6: Model 6 has an R-squared of 0.47, an adjusted R-squared of 0.45, and an AIC of 603.68. After adjustment, the model explains about 45% of the variance in crime rate.

  • Models 3, 5, and 6 all have an adjusted R-squared of about 0.45 but different AIC values: 601.43 for Model 3, 602.43 for Model 5, and 603.68 for Model 6. In Models 5 and 6, the added Income and HighSchoolGrad terms are not statistically significant (p > 0.3). Model 3 has the lowest AIC, therefore Model 3 (Crime ~ UrbanPop) is the model that best balances accuracy and simplicity.
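The comparison above can be assembled into a single table rather than read from six separate `summary()` printouts. The following is a rough sketch in base R, assuming only the first ten counties printed earlier in this report, so the numbers will differ from the full-data results. The names `sample_counties`, `formulas`, `models`, and `model_comparison` are illustrative.

```r
# Assumption: illustrative subset of the first ten counties from the printed tibble.
sample_counties <- data.frame(
  Crime          = c(104, 20, 64, 50, 64, 94, 8, 35, 27, 41),
  Income         = c(22.1, 25.8, 24.7, 24.6, 30.5, 30.6, 18.6, 25.7, 21.3, 34.9),
  HighSchoolGrad = c(82.7, 64.1, 74.7, 65, 82.3, 76.8, 55.9, 75.7, 68.6, 81.2),
  UrbanPop       = c(73.2, 21.5, 85, 23.2, 91.9, 98.9, 0, 80.2, 31, 65.8)
)

# The same six model formulas used in this step
formulas <- list(
  m1 = Crime ~ Income,
  m2 = Crime ~ HighSchoolGrad,
  m3 = Crime ~ UrbanPop,
  m4 = Crime ~ Income + HighSchoolGrad,
  m5 = Crime ~ Income + UrbanPop,
  m6 = Crime ~ Income + HighSchoolGrad + UrbanPop
)

# Fit every model on the same data
models <- lapply(formulas, function(f) lm(f, data = sample_counties))

# Collect R-squared, adjusted R-squared, and AIC for each model
model_comparison <- data.frame(
  Model = names(models),
  R2    = sapply(models, function(m) summary(m)$r.squared),
  AdjR2 = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC   = sapply(models, AIC),
  row.names = NULL
)

# Lowest AIC first
model_comparison[order(model_comparison$AIC), ]
```

With the full data, `sample_counties` would be swapped for Numeric_Florida_Data. Sorting by AIC puts the model with the best trade-off between fit and number of predictors in the first row.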

Step 5: Findings

Dear Chief of the Florida Police Department,

The best model for predicting crime rates is Model 3 (Crime ~ UrbanPop), and the most influential predictor is urbanization. The model explains about 45% of the variance in county crime rates. I recommend focusing on more urbanized areas to reduce crime rates, because more urbanized counties experience higher rates of crime. One limitation of this analysis is that correlation does not equal causation: a moderately strong correlation between crime and urbanization does not mean that one causes the other. I recommend taking this memo into consideration alongside other factors that may contribute to high crime rates.