1 A Socio-Economic Investigation into Crime

Presented by:

1.1 Introduction

This project gave us the opportunity to showcase many of the skills we learned throughout this course and to apply them to an investigation into datasets of our choosing. We narrowed our scope to a few datasets containing socio-economic information, namely unemployment and crime data for New York City. Specifically, we wanted to explore the impact of unemployment on crime within New York City. The guiding question for our statistical analysis is the following hypothesis:

\[H_0: \text{Unemployment does not impact crime } (R^2 = 0)\] \[H_A: \text{Unemployment does impact crime } (R^2 > 0)\]

We hoped that this investigation, alongside exploratory data analysis, would provide useful information for formulating policy proposals. We used the following process for each dataset:

  1. Data Import (API and .csv)
  2. Data Transformation (handling missing data, tidying)
  3. Data Exploration & Analysis (commentary, visualizations)

We then merged the datasets to explore further and draw some final conclusions based on the resulting model.

1.2 Workflow Chart

[Workflow chart figure]

1.3 Environment Setup

The packages are loaded from a separate R file.

source("environment_setup.R", echo = T, prompt.echo = "", spaced = F)
## if (!require("dplyr")) install.packages("dplyr")
## if (!require("RSocrata")) install.packages("RSocrata")
## if (!require("tidyverse")) install.packages("tidyverse")
## if (!require("ggplot2")) install.packages("ggplot2")
## if (!require("readxl")) install.packages("readxl")
## if (!require("plyr")) install.packages("plyr")
## if (!require("treemap")) install.packages("treemap")
## if (!require("leaflet")) install.packages("leaflet")
## if (!require("forcats")) install.packages("forcats")
## if (!require("ggExtra")) install.packages("ggExtra")
## if (!require("GGally")) install.packages("GGally")

2 NYPD Arrests

We will start with the NYPD Arrests Data (Historic) dataset from NYC Open Data and conduct some exploratory data analysis to find out how arrests are distributed in general. We will explore demographic trends as well as trends in particular kinds of arrest and by borough.

This dataset provides observations of confirmed crimes (individuals NOT acquitted of all charges) as recorded by the New York City Police Department (NYPD).

2.1 NYPD Arrests - Data Import via API

There are 4.8M rows and 18 columns, and each row represents an arrest.

variable    description
arrest_date    Exact date of arrest for the reported event.
ofns_desc    Description of internal classification corresponding with the KY code (more general category than the PD description).
arrest_boro    Borough of arrest: B (Bronx), S (Staten Island), K (Brooklyn), M (Manhattan), Q (Queens).
age_group    Perpetrator’s age within a category.
perp_sex    Perpetrator’s sex description.
perp_race    Perpetrator’s race description.
x_coord_cd    Midblock X-coordinate, New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104).
y_coord_cd    Midblock Y-coordinate, New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104).
latitude    Latitude coordinate, Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326).
longitude    Longitude coordinate, Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326).

The R script sourced below reads the data into R using the RSocrata API.

source("arrests_dataset.R", echo = F, prompt.echo = "", spaced = F)
head(arrests_df, 10)
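
For reference, here is a minimal sketch of what arrests_dataset.R might contain; the Socrata endpoint URL, the derived year column, and the 2014-2018 filter are assumptions inferred from the analysis that follows, not the original script.

library(RSocrata)
library(dplyr)

# Hypothetical endpoint for NYPD Arrests Data (Historic)
arrests_url <- "https://data.cityofnewyork.us/resource/8h9b-rp9u.json"
arrests_df <- read.socrata(arrests_url) %>%
  mutate(year = as.numeric(format(as.Date(arrest_date), "%Y"))) %>%  # derive the year column used downstream
  filter(year >= 2014, year <= 2018)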

2.2 NYPD Arrests - Data Transformation

Rename the borough letters to proper names.

arrests_df$arrest_boro <- revalue(arrests_df$arrest_boro, c("Q"="Queens", "K"="Brooklyn", "M"="Manhattan", "S"="Staten Island", "B" = "Bronx")) 

Remove missing values where no offense description is recorded.

arrests_df <- arrests_df %>% filter(ofns_desc != "")

We generate a series of data frames aggregating the data in different ways for analysis and plotting. For example, we look at arrests by race, borough, and offense.

# count arrests per year, by borough and perpetrator race
# (despite its name, murder_counts covers all offense types)
murder_counts <- arrests_df %>%
  group_by(arrest_boro, year, perp_race) %>%
  dplyr::summarise(murder_counts = n()) %>%
  arrange(desc(year))
murder_counts
# get the count of arrests per year, by borough
grouped_boro <- arrests_df %>% 
  group_by(year, arrest_boro) %>% 
  dplyr::summarize(count = n()) %>% 
  arrange(desc(count))
# get the count of offenses per year, by borough
grouped_offenses <- arrests_df %>% 
  group_by(year, arrest_boro, ofns_desc) %>% 
  dplyr::summarize(count = n()) %>%
  arrange(desc(count))
# get the top five offenses per year, by borough (ranked by the count column)
t5 <- grouped_offenses %>% top_n(5, count)
# get the counts of offenses overall
crime_counts <- arrests_df %>% 
  group_by(ofns_desc) %>% 
  dplyr::summarize(count = n()) %>% 
  arrange(desc(count))
# get the count of arrests related to dangerous drugs by year, by borough
drugs <- arrests_df %>% 
  filter(ofns_desc == 'DANGEROUS DRUGS') %>% 
  group_by(year, arrest_boro) %>% 
  dplyr::summarize(count = n())

2.3 NYPD Arrests - Data Exploration & Analysis

Let’s study the evolution of crime over the period of interest (2014-2018).

The plot below reveals that overall crime is decreasing across all boroughs of NYC. The year-over-year pattern is very similar, appearing simply to scale down over time.
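
A sketch of how this trend plot can be produced from the grouped_boro data frame built in section 2.2 (the exact aesthetics of the original figure are assumptions):

ggplot(grouped_boro, aes(x = year, y = count, color = arrest_boro, group = arrest_boro)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Arrests", color = "Borough",
       title = "Arrests per year by borough")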

Surprisingly, total crime in Manhattan and Brooklyn sits at fairly similar levels. Total crime is aggregated without accounting for different types of crime, so we will further our investigation by looking at top crimes overall and then dissecting crime per borough.

Here is a plot of the top 10 most common crimes for the 2014-2018 period across all boroughs. We learn that dangerous drugs offenses are the most prevalent, followed by 3rd degree assaults.
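
A sketch of the top-10 bar chart, based on the crime_counts data frame from section 2.2; fct_reorder comes from forcats, loaded in the environment setup.

crime_counts %>%
  head(10) %>%  # crime_counts is already sorted by descending count
  ggplot(aes(x = fct_reorder(ofns_desc, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Arrests", title = "Top 10 offenses, 2014-2018")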

A peek at the bottom 10 crimes for the same period reveals somewhat unexpected offenses, such as disruption of a religious service. Interestingly, while dangerous drugs offenses are the most common crime, only one person was arrested for being under the influence of drugs.

Following from the exploration above, we take a deeper look at the most common crimes by borough. On the plot below, we once again see that drug-related offenses are the most common crimes, consistently across boroughs. We notice that while Brooklyn and Manhattan had the most crimes overall, the Bronx records the most drug arrests.

The plot below explores that relationship over time for each borough. We observe that, like crime in general, drug-related arrests are declining.
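
A sketch of the drug-arrest trend plot, using the drugs data frame from section 2.2:

ggplot(drugs, aes(x = year, y = count, color = arrest_boro, group = arrest_boro)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Dangerous drugs arrests", color = "Borough")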

We continue investigating the demographics and take a look at the distribution of crime by gender. Male adults between the ages of 25 and 44 remain the most common perpetrators. This age group is also the most common for female perpetrators.

By exploring age group further by borough, we can see that the 25-44 age category also has the largest density of crimes across all boroughs. As also shown below, the Bronx seems to have the highest density of crimes for all age categories.

The treemap below studies which perpetrator race is the most common in each borough. As illustrated, perpetrators recorded as BLACK make up the greatest proportion. However, Brooklyn appears to have more crimes attributed to BLACK perpetrators than the Bronx.
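
A sketch of the treemap call, using the murder_counts data frame (which, as noted in 2.2, covers all offense types); the exact grouping and title are assumptions.

treemap(as.data.frame(murder_counts),
        index = c("arrest_boro", "perp_race"),  # nest race tiles within borough tiles
        vSize = "murder_counts",                # tile area proportional to arrest count
        title = "Arrests by borough and perpetrator race")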

To take the investigation even further, the interactive map below lets you explore the distribution of crime geographically.
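
A sketch of how such a map can be built with leaflet; the 10,000-row sample and the marker clustering are assumptions made to keep rendering responsive, not necessarily what the original map used.

map_sample <- arrests_df %>%
  mutate(latitude = as.numeric(latitude), longitude = as.numeric(longitude)) %>%
  filter(!is.na(latitude), !is.na(longitude)) %>%
  sample_n(10000)  # plotting all 4.8M arrests would overwhelm the browser

leaflet(map_sample) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~longitude, lat = ~latitude,
                   radius = 2, stroke = FALSE, fillOpacity = 0.4,
                   popup = ~ofns_desc,
                   clusterOptions = markerClusterOptions())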

3 Labor Bureau

We wanted to investigate whether there is a correlation between crime and the unemployment rate, so we extended our analysis with another dataset from the Labor Bureau. The dataset has five tables, one per borough, covering the 2014-2018 period.

This dataset provides socio-economic observations like unemployed population and the unemployment rate of the active labor force.

3.1 Labor Bureau - Data Import via .csv

We source the script below to import the data into R. Here is a snapshot of the data for Manhattan.

source("unemployed_dataset.R", echo = F, prompt.echo = "", spaced = F)
head(manhattan)
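
A minimal sketch of what unemployed_dataset.R might look like; the file names and paths are assumptions, not the original script.

# one .csv per borough, as published by the Labor Bureau (paths assumed)
bronx     <- read.csv("data/bronx.csv", stringsAsFactors = FALSE)
brooklyn  <- read.csv("data/brooklyn.csv", stringsAsFactors = FALSE)
manhattan <- read.csv("data/manhattan.csv", stringsAsFactors = FALSE)
queens    <- read.csv("data/queens.csv", stringsAsFactors = FALSE)
staten    <- read.csv("data/staten_island.csv", stringsAsFactors = FALSE)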

3.2 Labor Bureau - Data Transformation

We performed typical data transformation operations to clean the tables, removing missing values and renaming columns.

clean_table <- function (table) {
  table_content <- table %>%
    na.omit()
  colnames(table_content) = c("arrest_boro","year","month","labor_force","employed","unemployed","unemployment_rate")
  final_table <- table_content %>%
    select(arrest_boro, year, labor_force, employed, unemployed, unemployment_rate) 
  return(final_table)
}
bronx_income <- clean_table(bronx)
queens_income <- clean_table(queens)
brooklyn_income <- clean_table(brooklyn)
manhattan_income <- clean_table(manhattan)
staten_income <- clean_table(staten)
income_table <- Reduce(function(...) merge(..., all=T), list(bronx_income, queens_income, brooklyn_income, manhattan_income, staten_income))
income_table 
# average unemployment rate per borough per year
by_avg_unemployment <- income_table %>%
  group_by(arrest_boro, year) %>%
  dplyr::summarise(avg_unemployment_rate = mean(unemployment_rate)) %>%  # mean, matching the variable name
  arrange(desc(year))
# fix the trailing space in the Queens borough name
by_avg_unemployment$arrest_boro <- revalue(by_avg_unemployment$arrest_boro, c("Queens "="Queens"))
by_avg_unemployment

3.3 Labor Bureau - Data Exploration & Analysis

The following boxplot shows the unemployment rate distribution per borough per year. We can see a decreasing trend in the unemployment rate over the years. However, the Bronx has the highest unemployment rate throughout, and it remains the highest in 2018 with a range of 5-6%.
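
A sketch of the boxplot; coercing unemployment_rate to numeric is an assumption about how the raw .csv columns were read.

income_table %>%
  mutate(unemployment_rate = as.numeric(unemployment_rate)) %>%
  ggplot(aes(x = factor(year), y = unemployment_rate, fill = arrest_boro)) +
  geom_boxplot() +
  labs(x = "Year", y = "Unemployment rate (%)", fill = "Borough")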

4 Combined Datasets

After understanding our datasets individually, we merged them into a single data frame based on the variables shared between the two sources, namely year and arrest_boro. This allowed us to extend our analysis to the variables in the combined dataset and to study how they behave in relation to each other.

merged <- Reduce(function(...) merge(..., all=T), list(murder_counts, by_avg_unemployment)) %>%
  na.omit() %>%
  arrange(desc(year))
merged

4.1 Combined - Data Transformation

We transform our datasets to prepare them to be merged.

by_crime <- arrests_df %>% select(year, arrest_boro) %>% group_by(year, arrest_boro) %>% dplyr::summarise(crimes = n())
colnames(by_crime) <- c("year", "boro_nm", "crimes" )
by_crime
dat2 <- income_table
dat2$unemployed <- as.numeric(gsub(",", "", dat2$unemployed))
names(dat2)[1] <- "boro_nm"
# keep only the borough, year, and unemployed columns
dat2_subset <- dat2[c(1:2, 5)]
by_unemployment <- dat2_subset %>%
  group_by(year, boro_nm) %>%
  dplyr::summarise(unemployed = round(mean(unemployed), 2))
by_unemployment$boro_nm <- revalue(by_unemployment$boro_nm, c("Queens "="Queens"))
by_unemployment
joined <- left_join(by_crime, by_unemployment, by = c("year"="year", "boro_nm"="boro_nm"))
joined
by_cat <- arrests_df %>% select(year, arrest_boro, age_group, perp_sex, perp_race) %>% group_by(year, arrest_boro, age_group, perp_sex, perp_race) %>% dplyr::summarise(crimes=n())
colnames(by_cat) <- c("year", "boro_nm", "susp_age_group", "susp_sex", "susp_race", "crimes")
by_cat

Categorize the age groups in preparation for dummy variables.

dummy_df <- by_cat
dummy_df$susp_age_group <- revalue(by_cat$susp_age_group, c("<18"="child", "18-24"="youth", "25-44"="adult", "45-64"="senior", "65+" = "senior")) 
dummy_df

Converting the independent variables into factors

# keep only observations with a recognized age category
filtered <- dummy_df %>%
  filter(susp_age_group %in% c("child", "youth", "adult", "senior")) 
# convert the categorical predictors to factors (dummy variables)
filtered$boro_nm <- as.factor(filtered$boro_nm)
filtered$susp_age_group <- as.factor(filtered$susp_age_group)
filtered$susp_sex <- as.factor(filtered$susp_sex)
filtered$susp_race <- as.factor(filtered$susp_race)
# contrasts(filtered$susp_race)
# the reference level is AMERICAN INDIAN/ALASKAN NATIVE; it can be changed with relevel(var, ref = "new_ref")

4.2 Combined - Data Exploration & Analysis

We plotted a diverging chart to investigate which boroughs are above or below average, normalizing the average unemployment rate and the number of crimes. Both Brooklyn and the Bronx stand furthest above average for both crimes committed and unemployment rate.
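
A sketch of the normalization step behind the diverging plot; the z-score approach via scale() and the per-borough aggregation are assumptions about how the figure was built.

diverging <- joined %>%
  group_by(boro_nm) %>%
  dplyr::summarise(crimes = sum(crimes),
                   unemployed = mean(unemployed, na.rm = TRUE)) %>%
  mutate(crimes_z = as.numeric(scale(crimes)),          # standardize: mean 0, sd 1
         unemployed_z = as.numeric(scale(unemployed)))

ggplot(diverging, aes(x = reorder(boro_nm, crimes_z), y = crimes_z,
                      fill = crimes_z > 0)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Crimes (z-score)", fill = "Above average")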

5 Statistical Analysis & Modeling

To answer our initial hypothesis, the combined dataset is used in a linear regression modeling exercise to understand the relationship between the variables. In this process, we were able to identify which variables are the strongest predictors of the number of crimes committed.

We will also verify the conditions for inference for linear modeling.

5.1 Linear Regression

To start building the predictive model, we need to see whether there is a correlation between our predictor and response; in this case, unemployed and crimes. We begin with some exploratory data visualization to verify the conditions for inference. The ggpairs() function from the GGally package creates a plot matrix showing how the variables relate to one another.
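
The plot matrix referenced above can be produced along these lines, applying ggpairs() to the numeric columns of the joined data frame:

joined %>%
  ungroup() %>%
  select(crimes, unemployed) %>%
  na.omit() %>%
  ggpairs()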

It is essential to remember that correlation does not imply causation, so constructing a regression model is necessary to understand whether we can use this variable as a predictor.

We started by examining the relationship between the outcome and the covariate: in our case, the number of crimes committed and the unemployed population. Their relationship appears to be linear.

m_unemployed <- lm(crimes ~ unemployed, data = joined)
summary(m_unemployed)
## 
## Call:
## lm(formula = crimes ~ unemployed, data = joined)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -22185 -10676  -3412   7705  30927 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7906.4279  7363.1438   1.074    0.295    
## unemployed     1.2209     0.1479   8.252 3.51e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15430 on 22 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7558, Adjusted R-squared:  0.7447 
## F-statistic:  68.1 on 1 and 22 DF,  p-value: 3.513e-08

As shown by the simple linear regression, the unemployed variable yields an R-squared of ~0.75, making it a strong predictor of the number of crimes.

With an R-squared value of 0.75 and a significant p-value for the unemployed variable, we have sufficient evidence to reject the null hypothesis in favor of the alternative: unemployment is a good predictor of crime.

However, to make sure our model is reliable, we need to check how close the predictions are to the actual values, that is, to examine the residuals. This gives us a quality indicator of how well our model fits.

For our model, the Q-Q plot testing the normality of the residuals shows fairly good alignment to the line, with a few points at the top slightly offset, indicating some skew.

The residuals are not evenly spread above and below a horizontal line, which raises a concern: the relationship between the two variables may not be truly linear.
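
The diagnostic plots discussed above come from base R's plot() method for lm objects, which includes the residuals-vs-fitted and normal Q-Q panels:

par(mfrow = c(2, 2))  # show all four diagnostic panels at once
plot(m_unemployed)
par(mfrow = c(1, 1))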

Since the model above may not be convincing, we looked at other factors within the arrests dataset. The model below, based on race alone, explains only ~20% of the variance in the data, but it indicates that the BLACK, WHITE, and WHITE HISPANIC levels are significant predictors.

m_race <- lm(crimes ~ susp_race, data = filtered)
summary(m_race)  # several race levels are significant, though the overall fit is weak
## 
## Call:
## lm(formula = crimes ~ susp_race, data = filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3010.6  -586.6  -164.5    17.2 20830.4 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          18.82     139.80   0.135  0.89291    
## susp_raceASIAN / PACIFIC ISLANDER   284.89     188.74   1.509  0.13138    
## susp_raceBLACK                     2994.76     188.57  15.881  < 2e-16 ***
## susp_raceBLACK HISPANIC             528.21     189.26   2.791  0.00532 ** 
## susp_raceUNKNOWN                     47.99     192.05   0.250  0.80272    
## susp_raceWHITE                      752.00     188.57   3.988 6.95e-05 ***
## susp_raceWHITE HISPANIC            1579.65     188.40   8.385  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1997 on 1668 degrees of freedom
## Multiple R-squared:  0.201,  Adjusted R-squared:  0.1981 
## F-statistic: 69.93 on 6 and 1668 DF,  p-value: < 2.2e-16

To further our analysis, we added more variables to build a multivariable regression model.

5.2 Multiple Regression

We will start with some exploratory analysis on the filtered dataset.

Visualizing the variables gives a good idea of which categorical features have predictive value for a model. For factor-to-factor comparisons, the best options are jitter plots or heat maps; we use jitter plots for all of our factor-factor plots, as sketched below.
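
A sketch of one such factor-to-factor jitter plot, using the filtered data frame from section 4.1; sizing points by the crimes count is an assumption about how the original plots were drawn.

ggplot(filtered, aes(x = boro_nm, y = susp_age_group)) +
  geom_jitter(aes(size = crimes), alpha = 0.3, width = 0.25, height = 0.25) +
  labs(x = "Borough", y = "Suspect age group", size = "Crimes")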

We then took other variables (susp_age_group, susp_sex, and boro_nm) into account.

m_all <- lm(crimes ~ susp_age_group + susp_sex + susp_race + boro_nm, data = filtered)
summary(m_all)
## 
## Call:
## lm(formula = crimes ~ susp_age_group + susp_sex + susp_race + 
##     boro_nm, data = filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3241.5  -928.8  -206.1   580.2 18500.0 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         598.51     171.24   3.495 0.000486 ***
## susp_age_groupchild               -1979.83     130.56 -15.164  < 2e-16 ***
## susp_age_groupsenior              -1805.18     113.20 -15.947  < 2e-16 ***
## susp_age_groupyouth               -1061.32     128.93  -8.232 3.68e-16 ***
## susp_sexM                          1262.42      83.30  15.156  < 2e-16 ***
## susp_raceASIAN / PACIFIC ISLANDER   502.01     161.14   3.115 0.001869 ** 
## susp_raceBLACK                     3219.82     161.02  19.996  < 2e-16 ***
## susp_raceBLACK HISPANIC             729.82     161.55   4.518 6.70e-06 ***
## susp_raceUNKNOWN                    155.10     163.75   0.947 0.343702    
## susp_raceWHITE                      980.95     161.03   6.092 1.39e-09 ***
## susp_raceWHITE HISPANIC            1812.59     160.90  11.265  < 2e-16 ***
## boro_nmBrooklyn                     263.25     130.73   2.014 0.044198 *  
## boro_nmManhattan                    236.97     130.45   1.817 0.069465 .  
## boro_nmQueens                       -99.08     130.55  -0.759 0.447979    
## boro_nmStaten Island               -968.81     134.04  -7.228 7.45e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1702 on 1660 degrees of freedom
## Multiple R-squared:  0.4224, Adjusted R-squared:  0.4175 
## F-statistic: 86.71 on 14 and 1660 DF,  p-value: < 2.2e-16

It can be seen that, other factors remaining the same, a suspect race of BLACK is associated with an average increase of 3,219.82 arrests relative to the reference level (AMERICAN INDIAN/ALASKAN NATIVE).

We study the residuals to verify the validity of our model, and, as with the unemployed-only model, the Q-Q plot raises concerns about the distribution of the residuals.
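
The Q-Q panel for the multivariable model can be drawn directly from the standard lm diagnostics (which = 2 selects the normal Q-Q plot):

plot(m_all, which = 2)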

5.3 ANOVA

Given that our model had an adjusted R-squared of 0.4175, it explains roughly 42% of the variability in the data.

We know that multiple regression models are written in the form below, where \(\beta_i\) is the coefficient of predictor \(x_i\) and \(\varepsilon\) is the residual error.

\[y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n + \varepsilon\]

To test the significance of our findings, we postulate a secondary hypothesis:

\[H_0: \beta_1 = \dots = \beta_n = 0\] \[H_A: \beta_j \neq 0 \text{ for at least one } j \in \{1, \dots, n\}\]

To test whether these predictors are significant, we analyze the variances with ANOVA. Since the p-values are near zero, we have sufficient evidence to reject the null hypothesis: at least one of the variables used in the model is predictive of crime.

anova(m_all)

6 Conclusion

6.1 Summary of findings

Our investigation allows us to make a number of findings:

  1. While crime overall is decreasing in NYC, black individuals remain disproportionately represented in arrests. The largest coefficient in the regression model is for suspect race BLACK, at 3219.82.

  2. The Bronx remains the borough with the highest unemployment rate, but Brooklyn and Manhattan have the highest crime rates of the 5 boroughs.

  3. The 25-44 age group is the most likely to commit crimes for both genders.

  4. At a macro level, each one-person increase in the unemployed population is associated with roughly 1.22 additional crimes. At a more micro level, being a black or white hispanic male is a strong predictor of arrest.

  5. We built a geographic map of the distribution of committed crimes per borough.

  6. With an R-squared of ~75%, we found that unemployment is indeed a good predictor of crime, but the model assumptions may not hold for linear regression. The question could be investigated further with non-linear regression.

  7. With an R-squared of ~42%, we found that our multivariable model explains some of the variability in the data, but the significant skew in the residuals makes us cautious.

6.2 Why is this important?

  1. This kind of study is important for evaluating the success of policies over time by examining the impact of, and relationships between, crime and particular demographics.

  2. It can also be used to shape future police enforcement policy by allowing the department to better allocate resources to tackle the most common crimes where they are most likely to be concentrated.

  3. By merging crime observations with socio-economic factors, we can identify connections to other variables like unemployment and shape policy by arguing that an investment in labor resources could have a significant beneficial impact on crime.

6.3 Future Work

Future work would involve collecting more variables on demographics and socio-economic conditions, such as health and education, and building a more comprehensive model to predict crime rates.

6.4 Challenges

Throughout this assignment, we encountered a series of challenges like the following:

  1. Working with very large datasets and filtering them down

  2. Sourcing external scripts into R Markdown to speed up knitting

  3. Merging datasets with no distinct identifiers across observations

  4. Merging datasets with inconsistent observation values

  5. Determining which set of variables is the better fit for regression modeling

6.5 References