607 final project: NYC crime with other factors

Jie Zou

2021-05-08

Introduction

the project is going to see the crime rate in New York City. Specifically, the time during covid-19 because of many events happen in this time, and I want to see if those events affect the crime rate.

Event I: unemployeement benefit government sent out unemployeement benefits to people who lost their job in this rough situation. Did the crime rate increase because of the benefit?

Event II: hate crime As we check the news everyday, we know that pandemic has increased the hatred of many people towards Asians. Is it just Asian? How many increased crime cases due to hatred? what is the spread of suspects and victims?

To answer these questions, I found the data sets listed below, and I download and save them locally.

  1. overall NYC complaints: NYDP complaint

  2. NYC hate crime: NYDP hate crime

  3. NYC unemployment claims: unemployment benefit

load data

# NYDP complaint data
complaint <- read.csv("~/Desktop/readings/NYPD_Complaint_Data_Current__Year_To_Date_.csv")

# NYDP hate crime data
hate <- read.csv("~/Desktop/readings/NYPD_Hate_Crimes.csv")

# unemployment claims
xml.parse <- xmlRoot(xmlParse(getURL("https://raw.githubusercontent.com/Sugarcane-svg/R/main/R607/Projects/final_project/unemployeement.xml")))
unemp <- xmlToDataFrame(xml.parse)

Tidy data

To check the dimension of each data set, we can see that complaint data has over 40K observations and 36 variables, hate data has 832 observations and 15 variables, and unemployment data has 1790 observations and 7 variables.

dim(complaint)
## [1] 413412     36
dim(hate)
## [1] 832  15
dim(unemp)
## [1] 1790    7

we are not going to see complaint records day by day, instead, we are going to see the observation monthly. therefore, I split the date format into month and year. and tidy a bit of complaint data set because it seems “dirty” so far. Meanwhile, we are focusing on the observations that are only after 2018.

# split date into month and year
complaint_date <- complaint %>%
  select(CMPLNT_FR_DT) %>%
  mutate(month = str_extract(CMPLNT_FR_DT, "^\\d{2}"),
         year = str_extract(CMPLNT_FR_DT, "\\d{4}$"))

# subset of complaint data whose observations are after 2018
complaint1 <- complaint %>%
  select(-c(CMPLNT_NUM,ADDR_PCT_CD,CMPLNT_FR_DT,CMPLNT_TO_DT,CMPLNT_TO_TM,HADEVELOPT,HOUSING_PSA,KY_CD,PARKS_NM, PATROL_BORO,RPT_DT, STATION_NAME,JURISDICTION_CODE, TRANSIT_DISTRICT,X_COORD_CD, Y_COORD_CD, Latitude, Longitude, Lat_Lon, New.Georeferenced.Column)) %>% # take out the variables not used for analysis
  mutate(month = complaint_date$month,  # split date into month and year
         year = complaint_date$year) %>%
  filter(year >= 2019) # get observation after 2018

unemp1 <- unemp %>% 
  mutate(
    month = str_extract(weekEnded, "^\\d{2}"),
    year = str_extract(weekEnded, "\\d{4}$"),
    InitialClaims = str_remove(InitialClaims, "[,]")
         ) %>% 
  select(month, year, InitialClaims, InsuredUnemploymentRate) %>%
  filter(year >= 2019) %>%
  group_by(month, year) %>%
  mutate(monthly_initial_claims = sum(as.numeric(InitialClaims))) # compute monthly claims

Explore data

complaint data set

which year has the most criminal activities happen?

since complaint data has only updated until Jan. 2021, there are no records for 2021 yet. We can see that there were 98.4% of cases reported from 2020, and 1.6% from 2019

ggplot(complaint1, aes(x = year)) +
  geom_bar() +
  labs(y = "number of cases",
       title = "2019 vs 2020: total crime cases")

complaint1 %>% 
  count(year) %>%
  summarize(ratio = n/sum(n))
##        ratio
## 1 0.01596028
## 2 0.98403972

which county has the most criminal activities?

as we can see from the plot, Brooklyn had the most crime activicies(received more complaints). If we remove the undefined observartions, we see that Brooklyn had 28.9% of crime rate, and Staten Island had the lowest crime rate.

ggplot(complaint1, aes(x = BORO_NM)) + 
  geom_bar() +
  labs(x = "county",
       y = "number of crime cases",
       title = "crime cases distributed in each county")

complaint1 %>% 
  filter(BORO_NM !="") %>% 
  count(BORO_NM) %>%
  mutate(ratio = n/sum(n))
##         BORO_NM      n      ratio
## 1         BRONX  90089 0.21920158
## 2      BROOKLYN 118610 0.28859794
## 3     MANHATTAN  96948 0.23589067
## 4        QUEENS  88440 0.21518929
## 5 STATEN ISLAND  16900 0.04112052

among these complaints, what is the dominant offense category?

According to the plot, we see that misdemeanor is the dominant category, it took over 50% of overall crime cases.

ggplot(complaint1, aes(x=LAW_CAT_CD)) +
  geom_bar() +
  labs(x = "level of offense",
       y = "cases")

complaint1 %>% 
  count(LAW_CAT_CD) %>%
  mutate(ratio = n/sum(n))
##    LAW_CAT_CD      n     ratio
## 1      FELONY 133816 0.3252232
## 2 MISDEMEANOR 210515 0.5116306
## 3   VIOLATION  67128 0.1631463

complaint vs unemployment claim?

did crime cases increase when there were more people make initial claim in unemployment benefit?

first of all, we need to take a look at the distribution of monthly initial claims, we found that the initial claims in 2019 were pretty stable. While, in 2020, the initial claims increased from March and became roughly stable after June. It was reasonable because large amount of people lost their jobs and the city had lockdown during that time. the updated records only show the first quarter of 2021, however, we can still see that the initial claim is decreasing.

# the spread of initial claims
ggplot(unemp1, aes(x = as.integer(month), y=monthly_initial_claims, color = year)) +
  geom_point()+
  geom_line() +
  labs(x = "month",
       y = "monthly initial claims",
       title = "monthly initial claim status of listed years")

secondly, we are going to take a look at the crime cases after 2018. The plot is showing the cases distribution, we see that in 2020, the lowest reported cases in the month that had highest initial claim. It surprises me because I thought the crime cases were going to increase, at least cyber crime.

# 2020 crime cases
ggplot(complaint1, aes(x = month, fill = year)) + 
  geom_bar(position = "dodge") +
  labs(y = "cases",
       title = "2019 vs 2020: crime cases by month")

now, we are going to focus on 2020 to see if there is any relationship between unemployment claims and overall complaints. firstly, we subset the monthly complaint data, then we subset the monthly claim data, and join them together by month. Finally using linear model to see of there is relationship. The linear mode has negative slope, which mean every new initial claim will reduce the complaint by 0.0088. the r-sq is 60% which mean the model explains 60% of variance. p-value is smaller than significant level(0.05), which means that the relationship is significant.

the linear equation is \(\widehat{cases}= -0.0088 * claims + 37210.50\)

# subset of monthly crimes case
monthly_complaints <- complaint1 %>% 
  filter(year == 2020) %>%
  select(month, BORO_NM) %>%
  count(month)

# modify column names
colnames(monthly_complaints) <- c("month", "monthly_cases")

# subset of monthly initial claims
monthly_claims <- unemp1 %>%
  filter(year == "2020") %>%
  select(month, InitialClaims) %>%
  group_by(month) %>%
  summarize(monthly_claims = sum(as.numeric(InitialClaims)))
## Adding missing grouping variables: `year`
# data set with monthly complaints and monthlt initial claims
monthly_claims_complaints <- monthly_complaints %>%
  left_join(monthly_claims, by = "month")

# relationship between monthly claims and monthly complaints
lm_claim_complaint <- linear_reg(mode = "regression") %>%
  set_engine("lm") %>%
  fit(monthly_cases~monthly_claims, data = monthly_claims_complaints)

ggplot(monthly_claims_complaints, aes(x = monthly_claims, y = monthly_cases)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y~x, se = FALSE)

# check the assumption of lm
plot(lm_claim_complaint$fit) 

tidy(lm_claim_complaint) # lm info
## # A tibble: 2 x 5
##   term              estimate  std.error statistic  p.value
##   <chr>                <dbl>      <dbl>     <dbl>    <dbl>
## 1 (Intercept)    37210.      1134.          32.8  1.63e-11
## 2 monthly_claims    -0.00884    0.00227     -3.89 2.99e- 3
glance(lm_claim_complaint) # r-sq and p-value
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.603         0.563 2430.      15.2 0.00299     1  -109.  225.  226.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

hate crime

how is the hate crime distribution annually?

we see from the plot that 2019 and 2020 did not show much increases or decreases or crime case, they were roughly stable in the year. however, 2021, we can clearly see the increase of crime cases.

hate <- hate %>% 
  mutate(Month.Number = str_extract(Record.Create.Date, "^\\d{2}"))

# hate crime distribution monthly
supp_line <- hate %>% 
  count(Month.Number, Complaint.Year.Number)

ggplot(hate, aes(x = Month.Number, fill = as.character(Complaint.Year.Number))) +
  geom_bar(position = "dodge") +
  guides(fill = guide_legend(title = "year")) +
  labs(x = "month")

the distribution of offense type

like complaint data set, the main two categories of offense are felony and misdemeanor which take up about 98%.

# offense level
ggplot(hate, aes(x = Law.Code.Category.Description)) +
  geom_bar() +
  labs(x = "offense type",
       y = "number of cases",
       title = "the number of cases in hate crime of offense category")

hate %>% 
  count(Law.Code.Category.Description) %>% 
  mutate(ratio = n/sum(n))
##   Law.Code.Category.Description   n       ratio
## 1                        FELONY 406 0.487980769
## 2                 INVESTIGATION   1 0.001201923
## 3                   MISDEMEANOR 413 0.496394231
## 4                     VIOLATION  12 0.014423077

hate crime cases distributed in different county

from the plot, both Brooklyn and Manhattan have more crime cases than other county and similarity, Brooklyn is in the dominant position like the observation in complaint data.

# hate crime spread in different county
hate_county <- hate %>%
  count(County) %>% 
  arrange(desc(County)) %>% 
  mutate(prop = n/sum(n) * 100) %>% 
  mutate(ypos = cumsum(prop) - 0.5*prop)
 
ggplot(hate_county, aes(x = "", y = prop, fill = County))+
  geom_bar(stat = "identity") +
  coord_polar("y", start = 0) +
  geom_text(aes(y = ypos, label = paste(round(prop, 0), "%")), size = 3, color = "white") +
  theme_void() +
  labs(title = "hate crime cases in each county")

how many of these crime actually got caught?

Although, the plot shows 2019 has the most hate crime being caught. notice that the record for 2021 only showing the first quarter, and it already takes up 14%. it indicates that the number of arrested crime may exceed previous years.

# hate crime actually got caught in these years
hate <- hate %>% 
  mutate(arrested = ifelse(Arrest.Date=="", "no", "yes"))

hate_arrest <- hate %>% 
  filter(arrested == "yes") %>% 
  count(Complaint.Year.Number) %>% 
  arrange(desc(Complaint.Year.Number)) %>% 
  mutate(prop = n/sum(n) * 100) %>% 
  mutate(ypos = cumsum(prop) - 0.5*prop)

ggplot(hate_arrest, aes(x = "", y = prop, fill = as.character(Complaint.Year.Number))) +
  geom_bar(stat = "identity") +
  coord_polar("y", start = 0) + 
  theme_void() +
  geom_text(aes(y = ypos, label = paste(round(prop, 0), "%")), size = 3, color = "white") +
  labs(title = "ratio of arrested hate crimes in these years") +
  guides(fill = guide_legend("year"))

how much has hate crime increased?

based on the question, we know that the hate crime is increasing in the first quarter of 2021. So, I made a subset of data showing the hate crime cases on only Jan, Feb, March in these year for the convenience of comparison.

hate_increase <- hate %>% 
  filter(Month.Number == "01" | Month.Number =="02"|Month.Number =="03") %>%  # get the month
  select(Month.Number, Complaint.Year.Number) %>%
  count(Month.Number, Complaint.Year.Number) %>% 
  group_by(Complaint.Year.Number) %>% 
  mutate(quarter_sum = sum(n)) # three month total of these year

DT::datatable(hate_increase)
# comparison in Jan, Feb and March in 2021
paste("compare to Jan. 2021, Feb. 2021 has increased", round((23-17)/17,2) * 100, "%")
## [1] "compare to Jan. 2021, Feb. 2021 has increased 35 %"
paste("compare to Feb. 2021, March. 2021 has increased", round((61-23)/23,2) * 100, "%")
## [1] "compare to Feb. 2021, March. 2021 has increased 165 %"
# compare 2019 with 2021
paste("compare to March. 2019, March. 2021 has increased", round((61-54)/54,2) * 100, "%")
## [1] "compare to March. 2019, March. 2021 has increased 13 %"
paste("compare to March. 2020, March. 2021 has increased", round((61-34)/34,2) * 100, "%")
## [1] "compare to March. 2020, March. 2021 has increased 79 %"

who are the main victims in hate crime? what about the overall complaint cases?

from the statistic below(estimation), the most cases of hate crime type is anti-jewish, then anti-black. Anti-asian is on the forth position. From the whole complaint data, except the undefined and unknown race, the rate of considering Asians as suspects or victims are both below 10%. However, considering Black race will range around 26% to 28%. However, the whole complaint data does not contain the records in 2021, if the data set will be up to date, the ratio should be more accurate.

# hate crime type included 2021
hate %>% 
  count(Bias.Motive.Description, sort = TRUE)
##              Bias.Motive.Description   n
## 1                        ANTI-JEWISH 406
## 2                         ANTI-BLACK  85
## 3         ANTI-MALE HOMOSEXUAL (GAY)  84
## 4                         ANTI-ASIAN  81
## 5                         ANTI-WHITE  38
## 6                        ANTI-MUSLIM  21
## 7                   ANTI-TRANSGENDER  19
## 8                      ANTI-CATHOLIC  15
## 9                        ANTI-FEMALE  14
## 10              ANTI-OTHER ETHNICITY  13
## 11  ANTI-FEMALE HOMOSEXUAL (LESBIAN)  12
## 12                     ANTI-HISPANIC  11
## 13           ANTI-LGBT (MIXED GROUP)  11
## 14                         ANTI-ARAB   6
## 15          ANTI-MULTI-RACIAL GROUPS   3
## 16                     ANTI-BUDDHIST   2
## 17        ANTI-GENDER NON-CONFORMING   2
## 18                        ANTI-HINDU   2
## 19               ANTI-OTHER RELIGION   2
## 20 ANTI-RELIGIOUS PRACTICE GENERALLY   2
## 21                  60 YRS AND OLDER   1
## 22             ANTI-JEHOVAHS WITNESS   1
## 23          ANTI-PHYSICAL DISABILITY   1
# overall victim type not included 2021
victim <- complaint1 %>% 
  count(VIC_RACE)
colnames(victim) <- c("race", "v_n")
# overall suspect type not included 2021
suspect <- complaint1 %>% 
  count(SUSP_RACE)
colnames(suspect) <- c("race", "s_n")
victim %>% 
  left_join(suspect, by = "race") %>%
  mutate(v_ratio = round(v_n / sum(v_n), 3),
         s_ratio = round(s_n / sum(s_n), 3))
##                             race    v_n    s_n v_ratio s_ratio
## 1                                     1  94395   0.000   0.229
## 2 AMERICAN INDIAN/ALASKAN NATIVE   1321    618   0.003   0.002
## 3       ASIAN / PACIFIC ISLANDER  32148  10788   0.078   0.026
## 4                          BLACK 109197 114924   0.265   0.279
## 5                 BLACK HISPANIC  17871  15904   0.043   0.039
## 6                        UNKNOWN 109481  95506   0.266   0.232
## 7                          WHITE  65697  28995   0.160   0.070
## 8                 WHITE HISPANIC  75743  50329   0.184   0.122

Conclusion

the unemployment benefit actually lower the crime rate which is out of my expectation. For the hate crime, even though there were lots of news showing the hatred towards to Asians, according to the statistic, Asians are not the main group hated by others. African Americans are the main victims.