1 A Socio-Economic Investigation into Crime

Presented by:

1.1 Introduction

This project provided us with the opportunity of showcasing many of the skills we have learned throughout this course and of applying them to an investigation into datasets of our choosing. We narrowed our scope to a few datasets containing information on social economic information, namely unemployment and crime data in NYC. We hoped that this investigation would reveal valuable information that could be used to formulate policy proposals. This project provided us with the opportunity of showcasing many of the skills we have learned throughout this course and of applying them to an investigation into datasets of our choosing. We narrowed our scope to a few datasets containing information on social economic information, namely unemployment crime data in NYC. We hoped that this investigation would reveal valuable information that could be used to formulate policy proposals.

We used the following workflow for each dataset:

  1. Data Import (API and .csv)
  2. Data Transformation (handling missing data, tidying)
  3. Data Exploration & Analysis (commentary, visualizations)

We then merged the datasets to explore further and try to draw some final conslusions.

1.2 Work flow-chart

workflowchart

1.4 Environment Setup

source("environment_setup.R", echo = T, prompt.echo = "", spaced = F)
## if (!require("dplyr")) install.packages("dplyr")
## if (!require("RSocrata")) install.packages("RSocrata")
## if (!require("tidyverse")) install.packages("tidyverse")
## if (!require("ggplot2")) install.packages("ggplot2")
## if (!require("readxl")) install.packages("readxl")
## if (!require("plyr")) install.packages("plyr")
## if (!require("treemap")) install.packages("treemap")
## if (!require("leaflet")) install.packages("leaflet")
## if (!require("forcats")) install.packages("forcats")
## if (!require("ggExtra")) install.packages("ggExtra")
## if (!require("GGally")) install.packages("GGally")

2 NYPD Arrests

2.1 NYPD Arrests - Data Import via API

We will start with the NYPD Arrests Data (Historic) data from NYC Open Data found below and conduct some exploratory data analysis to find out how arrests are distributed in general. We will explore trends like for example investigating seasonality trends or trends in particular kinds of arrest or by boroughs.

There are 4.8M rows, there are 18 columns and each row is an arrest.

variable description
arrest_date Exact date of arrest for the reported event.
ofns_desc Description of internal classification corresponding with KY code (more general category than PD description).
arrest_boro Borough of arrest. B(Bronx), S(Staten Island), K(Brooklyn), M(Manhattan), Q(Queens)
age_group Perpetrator’s age within a category.
perp_sex Perpetrator’s sex description.
perp_race Perpetrator’s race description.
x_coord_cd Midblock X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104).
y_coord_cd Midblock Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104)
latitude Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)
longitude Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)

Load the data into R using the RSocrata API.

source("arrests_dataset.R", echo = F, prompt.echo = "", spaced = F)
head(arrests_df, 10)

2.2 NYPD Arrests - Data Transformation

Rename the borough letters to proper names.

arrests_df$arrest_boro <- revalue(arrests_df$arrest_boro, c("Q"="Queens", "K"="Brooklyn", "M"="Manhatttan", "S"="Staten Island", "B" = "Bronx")) 

Remove missing values where no offense description is recorded.

arrests_df <- arrests_df %>% filter(ofns_desc != "")

We generate a series of data frames aggregating the data in different manners for analysis and plotting. For example, we look at arrests by race, borough, offense.

murder_counts <- arrests_df %>%
  group_by(arrest_boro, year, perp_race) %>%
  dplyr::summarise(murder_counts = n()) %>%
  arrange(desc(year))
murder_counts
# get the count of arrests per year, by borough
grouped_boro <- arrests_df %>% 
  group_by(year, arrest_boro) %>% 
  dplyr::summarize(count = n()) %>% 
  arrange(desc(count))
# get the count of offenses per year, by borough
grouped_offenses <- arrests_df %>% 
  group_by(year, arrest_boro, ofns_desc) %>% 
  dplyr::summarize(count = n()) %>%
  arrange(desc(count))
# get the top five offense per borough
t5 <- grouped_offenses %>% top_n(5)
# get the counts of offenses overall
crime_counts <- arrests_df %>% 
  group_by(ofns_desc) %>% 
  dplyr::summarize(count = n()) %>% 
  arrange(desc(count))
# get the count of arrests related to dangerous drugs by year, by borough
drugs <- arrests_df %>% 
  filter(ofns_desc == 'DANGEROUS DRUGS') %>% 
  group_by(year, arrest_boro) %>% 
  dplyr::summarize(count = n())

2.3 NYPD Arrest - Data Exploration & Analysis

2.3.1 Evolution of Crime 2014-2018

Let’s study the evolution of crime over the period of interest (2014-2018).

What the plot below reveals is that overall crime is decreasing for all boroughs of NYC. The data year over year is very similar, appearing to simply scale down over time.

What we can note as suprising is the fact that total crime between Manhattan and Brooklyn is at fairly similar levels. Total crime is aggregated without accounting for different types of crime so we will further our investigation by looking at top crimes overall, and then dissecting crime per borough.

Here is a plot of the top 10 most common crimes for the 2014-2018 period across all boroughs. We learn that dangerous drugs related offenses are the most prevailent followed by 3rd degree assaults.

A peek at the bottom 10 crimes for the same period reveals somewhat unexpected crimes like disruption of a religious service. It is interesting to note that while dangerous drugs offenses are the most common crime, only 1 person was arrested for being under the influence of drugs.

Following from the exploration above, we take a deeper look at the most common crimes by borough. On the plot below, we once again see that how drug related offenses are the most common crimes and that this is consistent across boroughs. We notice that while Brooklyn and Manhattan had the most crimes, the Bronx captures the most drug arrests.

The plot below explores that relationship over time for each borough. We observe that similarly to crime in general, drug related arrests are going down.

We continue investigating the demographics and take a look at the distribution of crime by gender. Male adults between the ages of 25-44 remain the most common perpetrators.

This interactive map will let you explore the distribution of crime geographically.

3 Labor Bureau

To explore the socio-economic aspect of the question, we dig into data from the Labor Bureau. In particular, we will be looking at unemployment data.

3.1 Labor Bureau - Data Import via .csv

We wanted to invistigate if there is a correlation between crimes and unemployment rate. So we tried to investigate another dataset from the labor bureau. The dataset has five tables for the five boroghs for a period of four years from 2014 - 2018.

source("unemployed_dataset.R", echo = F, prompt.echo = "", spaced = F)
head(bronx)
head(queens)
head(brooklyn)
head(manhattan)
head(staten)

3.2 Labor Bureau - Data Transformation

clean_table <- function (table) {
  table_content <- table %>%
    na.omit()
  colnames(table_content) = c("arrest_boro","year","month","labor_force","employed","unemployed","unemployment_rate")
  final_table <- table_content %>%
    select(arrest_boro, year, labor_force, employed, unemployed, unemployment_rate) 
  return(final_table)
}
bronx_income <- clean_table(bronx)
queens_income <- clean_table(queens)
brooklyn_income <- clean_table(brooklyn)
manhattan_income <- clean_table(manhattan)
staten_income <- clean_table(staten)
income_table <- Reduce(function(...) merge(..., all=T), list(bronx_income, queens_income, brooklyn_income, manhattan_income, staten_income))
income_table 

3.3 Labor Bureau - Data Exploration & Analysis

The following boxplot shows the unemployment rate distribution per borough per year. We can see that there is a decreasing trend in the unemployment rate over the years. However, the Bronx county seems to have the highest unemployment rate over the years of a range of 5 - 6%.

by_income <- income_table %>%
  group_by(arrest_boro, year) %>%
  dplyr::summarise(avg_unemployment_rate = max(unemployment_rate)) %>%
  arrange(desc(year))
by_income

4 Combined Datasets

After understanding our datasets individually, we merged them into a single data frame based on the variable that were shared between the two sources, namely year and arrest_boro. This allowed us to extend our analysis to the variables in the combined dataset and to study how they behave in relation to each other.

merged <- Reduce(function(...) merge(..., all=T), list(murder_counts, by_income)) %>%
  na.omit() %>%
  arrange(desc(year))
merged

We tried to plot a diverging plot to investigate which borough is above or below average. So we normalized the average unemployment rate and the number of crimes. Both Brooklyn and Bronx seem to have the most significant above average for crimes committed and the unemployment rate.

We can see that the 25-44 age category is the largest age category density for committing crimes across all boroughs. As also demonstrated, Bronx seems to have the highest density of crimes for all age categories amongst other boroughs.

The treemap was used to study which race is common for committing a crime. As illustrated, BLACK is the typical race. However, Brooklyn appears to have crimes committed by BLACK than Bronx.

5 Statistical Analysis & Modeling

The final task in this project was to use our combined data in a linear regression modeling exercise to understand the relationship between the variables. In this process we were able to identify which variables are the greatest predictors of the number of crimes commited.

source("complain_dataset.R", echo = F, prompt.echo = "", spaced = F)
dat
res <- dat %>%
  group_by(year, boro_nm) %>%
  filter(boro_nm != "") %>%
  dplyr::summarise(crimes = n()) 
res
by_crime <- res %>% mutate_if(is.character, str_to_lower) 
by_crime
dat2 <- income_table
dat2$unemployed <- as.numeric(gsub(",", "", dat2$unemployed))
names(dat2)[1] <- "boro_nm"
subet <- dat2[c(1:2,5)]
by_unemployment <- subet %>%
  group_by(year, boro_nm) %>%
  dplyr::summarise(unemployed = round(mean(unemployed), 2)) %>%
  mutate_if(is.character, str_to_lower)
by_unemployment
joined <- Reduce(function(...) merge(..., all=T), list(by_crime, by_unemployment)) %>%
  na.omit()
joined

To start building the predictive model, we need to see if there is a correlation between our predictor and response. In this case, crimes and unemployed. We will begin by doing some exploratory data visualization. The function ggpairs() from the GGally package was used to create a plot matrix demonstrating how the variables relate to one another.

There is essential regard that correlation doesn’t imply causation, so constructing a regression model is imperative to comprehend whether we can use this variable as a predictor.

We started by scrutinizing the relationship between the outcome and covariant. In our case, it is the number of crimes committed and the unemployment rate. It seems that their relationship is linear.

m_unemployed <- lm(crimes ~ unemployed, data = joined)
summary(m_unemployed)
## 
## Call:
## lm(formula = crimes ~ unemployed, data = joined)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -39272 -15713    742  16157  32022 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.690e+04  9.987e+03   1.692    0.109    
## unemployed  1.833e+00  2.083e-01   8.798 9.77e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20450 on 17 degrees of freedom
## Multiple R-squared:  0.8199, Adjusted R-squared:  0.8093 
## F-statistic:  77.4 on 1 and 17 DF,  p-value: 9.772e-08

As pointed by the simple linear regression, unemployment has an R-squared ~0.8. This makes it a better predictor of the number of crimes. However, to make sure that our predictor model is reliable, we need to see if the residuals are so close to the actual value. This would give us a quality indicator of how is our prediction is fitting.

For our model, the Q-Q plot shows pretty good alignment to the the line with a few points at the top slightly offset.

The residuals are not reasonably well spread above and below a pretty non horizontal line. This may raise a concern for us, this means that the relationship between the two variables are not linear. So we tried to get investigate more mon building a mutlivariable regression model taking other variables into account.

by_cat <- dat %>%
  group_by(year, boro_nm, susp_age_group, susp_sex, susp_race) %>%
  filter(boro_nm != "" & susp_age_group !="") %>%
  dplyr::summarise(crimes = n()) 
by_cat

Categorize age-group preparing for dummy variables

dummy_df <- by_cat
dummy_df$susp_age_group <- revalue(by_cat$susp_age_group, c("<18"="child", "18-24"="youth", "25-44"="adult", "45-64"="senior", "65+" = "senior")) 
dummy_df

Converting the independent variables into factors

#function to categorize and indicate the cofounder variable
#
# function to categorise - dummy variables
filtered <- dummy_df %>%
  filter(susp_age_group %in% c("child", "youth", "adult", "senior")) 
## change to factor level
filtered$boro_nm <- as.factor(filtered$boro_nm)
filtered$susp_age_group <- as.factor(filtered$susp_age_group)
filtered$susp_sex <- as.factor(filtered$susp_sex)
filtered$susp_race <- as.factor(filtered$susp_race)
# contrasts(filtered$susp_race)
# reference variable is AMERICAN INDIAN/ALASKAN NATIVE, we can rereference by using relevel(var, ref = "new_ref")

We will start with some exploratory analysis on the filtered dataset.

Having some information shown on plots would give a good idea as to which categorical variables are good predictive features and can be used to build a machine learning model. Best plots for factor to factor variable comparison would be any of a jitter plot or heat map. I would use a jitter plot in this case for all of our factor-factor plots.

# contrasts(filtered$susp_sex) -> female is the reference
# filtered %>%
#   mutate(susp_race = relevel(susp_race, ref = "BLACK")) -> filtered
# contrasts(filtered$susp_race)  #- > now BLACK is the ref
m_race <- lm(crimes ~ susp_race, data = filtered)
summary(m_race) # no significance
## 
## Call:
## lm(formula = crimes ~ susp_race, data = filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1222.0  -321.8  -102.8    -0.7 13004.0 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          15.01      68.18   0.220 0.825737    
## susp_raceASIAN / PACIFIC ISLANDER   124.78      89.45   1.395 0.163135    
## susp_raceBLACK                     1208.02      87.54  13.800  < 2e-16 ***
## susp_raceBLACK HISPANIC             199.50      89.50   2.229 0.025912 *  
## susp_raceUNKNOWN                     81.64      87.73   0.931 0.352126    
## susp_raceWHITE                      338.78      87.73   3.862 0.000116 ***
## susp_raceWHITE HISPANIC             618.94      87.87   7.044 2.45e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1052 on 2343 degrees of freedom
## Multiple R-squared:  0.1253, Adjusted R-squared:  0.123 
## F-statistic: 55.93 on 6 and 2343 DF,  p-value: < 2.2e-16
m_all <- lm(crimes ~ susp_age_group + susp_sex + susp_race + boro_nm, data = filtered)
anova(m_all)

Taking other variables (susp_age_group, susp_sex, and boro_nm) into account.

summary(m_all)
## 
## Call:
## lm(formula = crimes ~ susp_age_group + susp_sex + susp_race + 
##     boro_nm, data = filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1552.2  -450.7   -81.4   277.9 11664.1 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         406.16      81.54   4.981 6.78e-07 ***
## susp_age_groupchild                -926.82      58.55 -15.829  < 2e-16 ***
## susp_age_groupsenior               -832.74      50.48 -16.496  < 2e-16 ***
## susp_age_groupyouth                -600.08      57.57 -10.423  < 2e-16 ***
## susp_sexM                           604.39      43.96  13.749  < 2e-16 ***
## susp_sexU                          -367.70      47.88  -7.679 2.33e-14 ***
## susp_raceASIAN / PACIFIC ISLANDER   304.60      77.46   3.932 8.65e-05 ***
## susp_raceBLACK                     1454.98      76.20  19.095  < 2e-16 ***
## susp_raceBLACK HISPANIC             374.76      77.55   4.833 1.43e-06 ***
## susp_raceUNKNOWN                    319.68      76.32   4.189 2.91e-05 ***
## susp_raceWHITE                      587.14      76.34   7.691 2.13e-14 ***
## susp_raceWHITE HISPANIC             853.51      76.41  11.170  < 2e-16 ***
## boro_nmBROOKLYN                      97.37      58.94   1.652   0.0987 .  
## boro_nmMANHATTAN                    -41.08      58.91  -0.697   0.4857    
## boro_nmQUEENS                       -86.35      58.82  -1.468   0.1422    
## boro_nmSTATEN ISLAND               -455.92      60.83  -7.495 9.37e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 906.9 on 2334 degrees of freedom
## Multiple R-squared:  0.3522, Adjusted R-squared:  0.3481 
## F-statistic:  84.6 on 15 and 2334 DF,  p-value: < 2.2e-16

It can be seen that race Black is significantly associated with an average increase of 1454 in crime compared to other races. Now we will study the residuals.

Same as when we used unemployed variables, we have concerns about residuals distribution resulted from the Q-Q plot.

6 Conclusion

6.1 What did we learn?

  1. Comment on decreasing trends in crime with references to demographics.
  2. Comment on unemployment trends
  3. Comment on combined analysis
  4. Comment on modeling and best predictor variables
  5. Build a geographic geomap for the commited crimes per borough

6.2 Why is this important

  1. Important in evaluating successes of policies over time
  2. Important in shaping future police inforcement policy
  3. Connection between unemployment and saftey.

6.3 Challenges

  1. Working with very large data sets and how we filtered it down
  2. Sourcing external scripts into Rmarkdown
  3. Merging dataset with different observation values - lower & upper case.
  4. Study which variable has a better fit for regression model.

6.4 Future Work

6.5 References