# Install required packages if not already available
if (!require("readr"))
install.packages("readr")
library(readr)
if (!require("tidyverse"))
install.packages("tidyverse")
library(tidyverse)
if (!require("infer"))
install.packages("infer")
library(infer)
if (!require("ggplot2"))
install.packages("ggplot2")
library(ggplot2)
if (!require("RSocrata"))
install.packages("RSocrata")
library(RSocrata)
if (!require("httr"))
install.packages("httr")
library(httr)
NYPD Arrest Rate vs Weather Variables
Introduction
The NYPD arrest datasets provide a detailed record of arrests in New York City from 2006 onward. They serve as a valuable resource for understanding crime patterns and trends within the city. The data encompasses a wide range of information, including the demographic details of individuals arrested, the types of crimes committed, and the locations where arrests occurred.
Investigating the correlation between arrest rates and weather variables can yield significant insights that inform critical decisions across various domains. This analysis has the potential to influence public policy, policing strategies, and even resource allocation through tax optimization.
Data
NYPD Arrests Data (Historic) - A breakdown of every arrest effected in NYC by the NYPD from 2006 through the end of the previous calendar year. This data is manually extracted every quarter and reviewed by the Office of Management Analysis and Planning before being posted on the NYPD website. Each record represents an arrest effected in NYC by the NYPD and includes information about the type of crime and the location and time of enforcement.
NYPD Arrest Data (Year to Date) - This is a breakdown of every arrest effected in NYC by the NYPD during the current year. This data is manually extracted every quarter and reviewed by the Office of Management Analysis and Planning. Each record represents an arrest effected in NYC by the NYPD and includes information about the type of crime, the location and time of enforcement.
Weather Data - Visual Crossing Weather provides historical and forecast weather data through a Weather API designed to integrate into any app or code. This analysis uses its daily historical records for New York City.
Import Data
First, I load the data into data frames: the two NYPD datasets (historic and year to date) and the weather data. The first rows of each are shown below.
arrest_key arrest_date pd_cd law_code law_cat_cd arrest_boro
1 236791704 2021-11-22 581 PL 2225001 M M
2 237354740 2021-12-04 153 PL 1302502 F B
3 236081433 2021-11-09 681 PL 2601001 M Q
4 32311380 2007-06-18 511 PL 2200300 M Q
5 192799737 2019-01-26 177 PL 1306503 F M
6 193260691 2019-02-06 <NA> PL 2203400 F M
arrest_precinct jurisdiction_code age_group perp_sex perp_race
1 28 0 45-64 M BLACK
2 41 0 25-44 M WHITE HISPANIC
3 113 0 25-44 M BLACK
4 27 1 18-24 M BLACK
5 25 0 45-64 M BLACK
6 14 0 25-44 M UNKNOWN
x_coord_cd y_coord_cd latitude longitude lon_lat.type
1 997427 230378 40.799008797000056 -73.95240854099995 Point
2 1013232 236725 40.816391847000034 -73.89529641399997 Point
3 1046367 186986 40.67970040800003 -73.77604736799998 Point
4 <NA> <NA> <NA> <NA> <NA>
5 1000555 230994 40.800694331000045 -73.941109285999971 Point
6 986685 215375 40.757839003000072 -73.991212110999982 Point
lon_lat.coordinates pd_desc ky_cd ofns_desc
1 -73.95241, 40.79901 <NA> <NA> <NA>
2 -73.89530, 40.81639 RAPE 3 104 RAPE
3 -73.77605, 40.67970 CHILD, ENDANGERING WELFARE 233 SEX CRIMES
4 NULL CONTROLLED SUBSTANCE, POSSESSION 7 235 DANGEROUS DRUGS
5 -73.94111, 40.80069 SEXUAL ABUSE 116 SEX CRIMES
6 -73.99121, 40.75784 <NA> <NA> <NA>
arrest_key arrest_date pd_cd pd_desc ky_cd
1 279764521 2024-01-01 109 ASSAULT 2,1,UNCLASSIFIED 106
2 279766985 2024-01-01 905 INTOXICATED DRIVING,ALCOHOL 347
3 279766989 2024-01-01 922 TRAFFIC,UNCLASSIFIED MISDEMEAN 348
4 279767308 2024-01-01 918 RECKLESS DRIVING 348
5 279767573 2024-01-01 109 ASSAULT 2,1,UNCLASSIFIED 106
6 279768912 2024-01-01 268 CRIMINAL MIS 2 & 3 121
ofns_desc law_code law_cat_cd arrest_boro
1 FELONY ASSAULT PL 1200501 F Q
2 INTOXICATED & IMPAIRED DRIVING VTL11920U2 M M
3 VEHICLE AND TRAFFIC LAWS VTL0511001 M S
4 VEHICLE AND TRAFFIC LAWS VTL1212000 M K
5 FELONY ASSAULT PL 1200502 F B
6 CRIMINAL MISCHIEF & RELATED OF PL 1450502 F Q
arrest_precinct jurisdiction_code age_group perp_sex perp_race
1 114 0 45-64 M WHITE
2 6 0 25-44 M BLACK
3 120 0 18-24 M BLACK
4 76 0 25-44 M BLACK
5 41 0 <18 M BLACK HISPANIC
6 111 0 18-24 M WHITE
x_coord_cd y_coord_cd latitude longitude
1 1002279 222018 40.776047 -73.934905
2 981400 206384 40.733152570631795 -74.01028350206072
3 964114 166065 40.62246362389224 -74.07253516606062
4 984493 190398 40.68927525699562 -73.99912377311799
5 1015773 240835 40.82765564227593 -73.88609567466663
6 1047645 217681 40.763933 -73.771149
geocoded_column.type geocoded_column.coordinates
1 Point -73.93491, 40.77605
2 Point -74.01028, 40.73315
3 Point -74.07254, 40.62246
4 Point -73.99912, 40.68928
5 Point -73.88610, 40.82766
6 Point -73.77115, 40.76393
# Replace "YOUR_API_KEY" with your actual Visual Crossing API key (avoid publishing a real key)
api_key <- "YOUR_API_KEY"
# Base URL for the weather data API
base_url <- "https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/timeline/"
# Define location and date range
location <- "New%20York%20City%2CUSA"
start_date <- "2006-01-01"
end_date <- "2024-05-04"
# Build the API request URL
url <- paste0(
base_url,
location,
"/",
start_date,
"/",
end_date,
"?unitGroup=metric&include=days&key=",
api_key,
"&contentType=csv"
)
# Send GET request and handle errors
response <- GET(url)
if (status_code(response) != 200) {
  stop(paste0("Unexpected status code: ", status_code(response)))
}
# Read CSV data from response content
weather_data <- read_csv(response$content)
head(weather_data)
# A tibble: 6 × 33
name datetime tempmax tempmin temp feelslikemax feelslikemin feelslike
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 New York… 2006-01-01 5.8 -0.1 2.9 5.6 -2.5 2
2 New York… 2006-01-02 7.6 3.1 5 6.9 0.5 3.4
3 New York… 2006-01-03 4.5 1.9 2.8 0.3 -3.5 -2.3
4 New York… 2006-01-04 2.9 -1.7 1.1 0.9 -6.8 -2.4
5 New York… 2006-01-05 9.9 2.7 5.8 7.2 0.8 3.8
6 New York… 2006-01-06 5.4 -1 3.2 5.4 -5.8 -0.1
# ℹ 25 more variables: dew <dbl>, humidity <dbl>, precip <dbl>,
# precipprob <dbl>, precipcover <dbl>, preciptype <chr>, snow <dbl>,
# snowdepth <dbl>, windgust <dbl>, windspeed <dbl>, winddir <dbl>,
# sealevelpressure <dbl>, cloudcover <dbl>, visibility <dbl>,
# solarradiation <dbl>, solarenergy <dbl>, uvindex <dbl>, severerisk <dbl>,
# sunrise <dttm>, sunset <dttm>, moonphase <dbl>, conditions <chr>,
# description <chr>, icon <chr>, stations <chr>
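The location string in the request above is percent-encoded by hand. As a minimal sketch, base R's `URLencode()` can produce the same kind of encoding from a plain place name. The helper name `encode_location` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not in the original analysis): percent-encode a place
# name so it can be used as a path segment in the request URL.
encode_location <- function(place) {
  # reserved = TRUE also encodes reserved characters such as "," and "/"
  utils::URLencode(place, reserved = TRUE)
}

location <- encode_location("New York City,USA")
print(location)
```

This should reproduce the hand-written location string, which makes it easier to change the place name without re-encoding it manually.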
Tidy and Manipulate Data
With the data successfully loaded, we can commence the data cleaning and manipulation phase. We will prioritize the NYPD data-sets, followed by the weather dataset, to ensure a consistent workflow.
NYPD Datasets
- Combined historic and year to date arrest data
- Made all column names lowercase
- Removed every column except arrest_date, arrest_boro, age_group, perp_sex, and perp_race
- Changed name of arrest_date column to date
- Adjusted the arrest_boro column to display the full names
- Set date column as date object
#Combine historic and year to date arrest data
arrest_data <- bind_rows(arrest_data_ytd, historic_arrest_data)
# Make all column names lowercase
colnames(arrest_data) <- tolower(colnames(arrest_data))
# Remove every column except arrest_date, arrest_boro, age_group, perp_sex, and perp_race
arrest_data <-
arrest_data |> select(arrest_date, arrest_boro, age_group, perp_sex, perp_race)
# Change name of arrest_date column to date
colnames(arrest_data)[colnames(arrest_data) == "arrest_date"] <-
"date"
# Adjust the arrest_boro column to display the full names
arrest_data$arrest_boro <- case_when(
arrest_data$arrest_boro == "M" ~ "Manhattan",
arrest_data$arrest_boro == "B" ~ "Bronx",
arrest_data$arrest_boro == "K" ~ "Brooklyn",
arrest_data$arrest_boro == "Q" ~ "Queens",
arrest_data$arrest_boro == "S" ~ "Staten Island",
TRUE ~ arrest_data$arrest_boro
)
# Set date column as date object
arrest_data$date <- as.Date(arrest_data$date, format = "%m/%d/%Y")
head(arrest_data)
date arrest_boro age_group perp_sex perp_race
1 2024-01-01 Queens 45-64 M WHITE
2 2024-01-01 Manhattan 25-44 M BLACK
3 2024-01-01 Staten Island 18-24 M BLACK
4 2024-01-01 Brooklyn 25-44 M BLACK
5 2024-01-01 Bronx <18 M BLACK HISPANIC
6 2024-01-01 Queens 18-24 M WHITE
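The `format` argument passed to `as.Date()` above only matters when the input is text. The following sketch, which uses made-up example strings rather than the real arrest data, illustrates how `as.Date()` behaves when the format does or does not match. It assumes nothing about how the arrest dates were actually stored.

```r
# Example strings only; these are not taken from the arrest data.
iso_text <- c("2021-11-22", "2024-01-01")
us_text <- c("11/22/2021", "01/01/2024")

# ISO-formatted text does not match "%m/%d/%Y", so parsing yields NA
mismatched <- as.Date(iso_text, format = "%m/%d/%Y")

# US-style text matches the format and parses correctly
parsed_us <- as.Date(us_text, format = "%m/%d/%Y")

# Values that are already Date objects pass through unchanged;
# the format argument is ignored in that case
already_date <- as.Date(iso_text)
unchanged <- as.Date(already_date, format = "%m/%d/%Y")

print(mismatched)
print(parsed_us)
print(unchanged)
```

A quick `anyNA()` check on the converted column is a cheap way to catch a silent format mismatch.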
Weather data
- Removed unneeded columns
- Converted temperatures to Fahrenheit
- Converted precip column to inches
- Adjusted variable names
- Created new variable called “isRain”
- Removed the conditions column
- Set date column as date object
# Remove unneeded columns
weather_data <-
weather_data |> select(datetime, temp, precip, humidity, conditions)
head(weather_data)
# A tibble: 6 × 5
datetime temp precip humidity conditions
<date> <dbl> <dbl> <dbl> <chr>
1 2006-01-01 2.9 1.76 76.1 Snow, Rain, Overcast
2 2006-01-02 5 11.1 75 Rain, Partially cloudy
3 2006-01-03 2.8 31.3 88 Rain, Overcast
4 2006-01-04 1.1 0 69.7 Partially cloudy
5 2006-01-05 5.8 1.28 75.9 Rain, Partially cloudy
6 2006-01-06 3.2 0 60.3 Overcast
# Convert temperatures to Fahrenheit
weather_data$temp <- (weather_data$temp * 9 / 5) + 32
# Convert precip variable to inches
weather_data$precip <- round(weather_data$precip * 0.0393701, 3)
# Adjust variable names
colnames(weather_data)[colnames(weather_data) == "datetime"] <-
"date"
# Create new variable called "isRain"
weather_data$isRain <-
grepl("rain", weather_data$conditions, ignore.case = TRUE)
# Remove conditions column
weather_data <- weather_data |> select(-conditions)
# Make sure the date column is a Date object
# (read_csv already parsed it as <date>, so no format string is needed)
weather_data$date <- as.Date(weather_data$date)
head(weather_data)
# A tibble: 6 × 5
date temp precip humidity isRain
<date> <dbl> <dbl> <dbl> <lgl>
1 2006-01-01 37.2 0.069 76.1 TRUE
2 2006-01-02 41 0.435 75 TRUE
3 2006-01-03 37.0 1.23 88 TRUE
4 2006-01-04 34.0 0 69.7 FALSE
5 2006-01-05 42.4 0.051 75.9 TRUE
6 2006-01-06 37.8 0 60.3 FALSE
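To make the unit conversions and the rain flag concrete, here is a small worked example on the six rows displayed above. The values are copied from the printed output, which is already rounded for display, so later rows may differ slightly from the full-precision results. The data frame name `sample_weather` is only for illustration.

```r
# First six daily rows as displayed in the weather output (metric units)
sample_weather <- data.frame(
  temp_c = c(2.9, 5, 2.8, 1.1, 5.8, 3.2),
  precip_mm = c(1.76, 11.1, 31.3, 0, 1.28, 0),
  conditions = c(
    "Snow, Rain, Overcast", "Rain, Partially cloudy", "Rain, Overcast",
    "Partially cloudy", "Rain, Partially cloudy", "Overcast"
  ),
  stringsAsFactors = FALSE
)

# Celsius to Fahrenheit
sample_weather$temp_f <- sample_weather$temp_c * 9 / 5 + 32

# Millimetres to inches, rounded to 3 decimals as in the analysis
sample_weather$precip_in <- round(sample_weather$precip_mm * 0.0393701, 3)

# Flag a day as rainy when the conditions text mentions rain
sample_weather$is_rain <- grepl("rain", sample_weather$conditions,
                                ignore.case = TRUE)

print(sample_weather[, c("temp_f", "precip_in", "is_rain")])
```

For the first row, 2.9°C becomes 37.22°F and 1.76 mm becomes 0.069 in, which matches the tidy output above.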
The final step involves integrating the two datasets using the date attribute as the key. This consolidated dataset will serve as the foundation for our subsequent analysis, enabling us to explore the potential relationships between arrest rates and weather conditions.
# Left join
combined_data <- left_join(arrest_data, weather_data, by = "date")
head(combined_data[order(combined_data$date), ], 20)
date arrest_boro age_group perp_sex perp_race temp precip
998703 2006-01-01 Brooklyn 25-44 M BLACK 37.22 0.069
998716 2006-01-01 Manhattan 18-24 M WHITE HISPANIC 37.22 0.069
998720 2006-01-01 Queens 18-24 F WHITE HISPANIC 37.22 0.069
998847 2006-01-01 Brooklyn <18 M BLACK 37.22 0.069
998874 2006-01-01 Manhattan 25-44 M BLACK 37.22 0.069
998878 2006-01-01 Manhattan <18 M WHITE 37.22 0.069
998885 2006-01-01 Brooklyn 18-24 M WHITE HISPANIC 37.22 0.069
998905 2006-01-01 Brooklyn 18-24 M BLACK 37.22 0.069
998907 2006-01-01 Queens 18-24 M BLACK 37.22 0.069
998931 2006-01-01 Bronx 25-44 F BLACK 37.22 0.069
998951 2006-01-01 Manhattan 25-44 M BLACK 37.22 0.069
999012 2006-01-01 Brooklyn 25-44 M BLACK 37.22 0.069
999014 2006-01-01 Brooklyn 45-64 M BLACK 37.22 0.069
999118 2006-01-01 Staten Island <18 F WHITE 37.22 0.069
999137 2006-01-01 Bronx 25-44 M WHITE HISPANIC 37.22 0.069
999171 2006-01-01 Manhattan 45-64 M WHITE 37.22 0.069
999181 2006-01-01 Bronx 18-24 M WHITE HISPANIC 37.22 0.069
999199 2006-01-01 Queens 25-44 M WHITE HISPANIC 37.22 0.069
999223 2006-01-01 Brooklyn 18-24 M BLACK 37.22 0.069
999272 2006-01-01 Manhattan 18-24 M WHITE 37.22 0.069
humidity isRain
998703 76.1 TRUE
998716 76.1 TRUE
998720 76.1 TRUE
998847 76.1 TRUE
998874 76.1 TRUE
998878 76.1 TRUE
998885 76.1 TRUE
998905 76.1 TRUE
998907 76.1 TRUE
998931 76.1 TRUE
998951 76.1 TRUE
999012 76.1 TRUE
999014 76.1 TRUE
999118 76.1 TRUE
999137 76.1 TRUE
999171 76.1 TRUE
999181 76.1 TRUE
999199 76.1 TRUE
999223 76.1 TRUE
999272 76.1 TRUE
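A left join keeps every arrest row even when no weather record exists for that date, in which case the weather columns are `NA`. The sketch below uses toy data and base R's `merge()` with `all.x = TRUE` as a stand-in for `left_join()` to show one way to check for unmatched dates. Note that `merge()` sorts rows by the key, whereas `left_join()` keeps the original row order.

```r
# Toy data for illustration only
arrest_rows <- data.frame(
  date = as.Date(c("2006-01-01", "2006-01-01", "2006-01-02", "2006-01-03")),
  arrest_boro = c("Brooklyn", "Queens", "Bronx", "Manhattan")
)
weather_rows <- data.frame(
  date = as.Date(c("2006-01-01", "2006-01-02")),
  temp = c(37.22, 41.00)
)

# Left join: keep all arrest rows, attach weather where the date matches
joined <- merge(arrest_rows, weather_rows, by = "date", all.x = TRUE)

# Dates that found no weather record end up with NA in the weather columns
unmatched_dates <- unique(joined$date[is.na(joined$temp)])

print(joined)
print(unmatched_dates)
```

Running the same kind of check on the real joined data would show whether any arrest dates fall outside the weather date range.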
To facilitate our exploratory analysis, we will construct a new data-frame that aggregates daily arrest counts from the combined dataset.
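As a minimal illustration of what "daily arrest counts" means here, the following base R sketch counts rows per date on toy data. It is an alternative way to express the idea, not the code used in this analysis, and the object names are hypothetical.

```r
# Toy data: one row per arrest
arrest_rows <- data.frame(
  date = as.Date(c("2006-01-01", "2006-01-01", "2006-01-01",
                   "2006-01-02", "2006-01-02"))
)

# Count the number of rows (arrests) for each date
daily_counts <- as.data.frame(
  table(date = arrest_rows$date),
  responseName = "arrests",
  stringsAsFactors = FALSE
)

# table() stores the dates as text, so convert them back to Date
daily_counts$date <- as.Date(daily_counts$date)

print(daily_counts)
```

Counting first and joining the weather afterwards would produce one row per day directly, without carrying per-arrest columns such as borough into the daily table.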
# Aggregate the data by date
total_arrest_data <- combined_data |>
group_by(date) |>
mutate(arrests = n())
total_arrest_data <- total_arrest_data |>
distinct(date, .keep_all = TRUE)
Exploratory Data Analysis
Precipitation vs Average Daily Arrests
Summary of Average Daily Arrests
As the summary shows, the average number of daily arrests is about 869, with a minimum of 91 and a maximum of 1,773.
# Summary Arrests
summary(total_arrest_data$arrests)
Min. 1st Qu. Median Mean 3rd Qu. Max.
91.0 598.0 812.0 868.6 1171.0 1773.0
Visualization of the summary of daily arrests
# Visualize the summary of daily arrests
total_arrest_data |>
ggplot(aes(x = arrests)) +
geom_boxplot(fill = 'blue', color = 'black') +
  labs(title = "Distribution of Daily Arrests",
       x = "Arrests") +
  theme_minimal()
Daily arrests tend to be higher on days without rain than on days with rain.
# Plotting side by side box plot for isRain and avg daily arrests
total_arrest_data |>
ggplot(aes(x = isRain, y = arrests, fill = isRain)) +
geom_boxplot() +
labs(title = "Total Arrests by Rain",
x = "Rain",
y = "Total Arrests") +
  theme_minimal()
We can also calculate the average and total daily arrests for rainy and non-rainy days.
# Calculate the daily avg arrests when it rains
total_arrest_data |>
group_by(isRain) |>
  summarise(avg_arrests = mean(arrests, na.rm = TRUE))
# A tibble: 2 × 2
isRain avg_arrests
<lgl> <dbl>
1 FALSE 909.
2 TRUE 822.
# Calculate total arrests when it rains
total_arrest_data |>
group_by(isRain) |>
  summarise(total_arrests = sum(arrests, na.rm = TRUE))
# A tibble: 2 × 2
isRain total_arrests
<lgl> <int>
1 FALSE 3260994
2 TRUE 2528149
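As a quick consistency check, the two totals above should add up to the overall daily mean reported in the summary. The sketch below assumes 6,665 daily observations, which I infer from the regression output later in this report (6,663 residual degrees of freedom plus 2 coefficients); that count is an assumption, not a value printed directly.

```r
# Totals copied from the output above
total_no_rain <- 3260994
total_rain <- 2528149

# Assumption: number of days = residual df (6663) + 2 estimated coefficients
n_days <- 6663 + 2

# Overall mean daily arrests implied by the totals
overall_mean <- (total_no_rain + total_rain) / n_days
print(round(overall_mean, 1))
```

The result rounds to 868.6, in line with the mean shown in the summary of daily arrests.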
Null Hypothesis (H₀): There is no difference in the average daily arrests between when it rains and when it does not rain.
Alternative Hypothesis (H₁): There is a difference in the average daily arrests between when it rains and when it does not rain.
# Conduct hypothesis test to determine if there is a significant difference in the average daily arrests when it rains and when it does not rain
obs_diff <- total_arrest_data |>
drop_na(isRain) |>
specify(arrests ~ isRain) |>
calculate(stat = "diff in means", order = c("TRUE", "FALSE"))
# Simulate
null_dist <- total_arrest_data |>
drop_na(isRain) |>
specify(arrests ~ isRain) |>
hypothesize(null = "independence") |>
generate(reps = 1000, type = "permute") |>
calculate(stat = "diff in means", order = c("TRUE", "FALSE"))
# Visualize
ggplot(data = null_dist, aes(x = stat)) +
geom_histogram(bins = 20, color = "#006BB6", fill = "#F58426", linewidth =1.5) +
labs(title = "Distribution of Test Statistic (Null Hypothesis)",
x = "Test Statistic",
y = "Frequency") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.margin = margin(unit = "points", b = 15, l = 15, r = 15, t = 10)
  )
# Get p-value
null_dist |>
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
# A tibble: 1 × 1
p_value
<dbl>
1 0
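The reported p-value of 0 means that none of the 1,000 permuted differences was as extreme as the observed difference, so the true p-value is very small (below about 0.001) rather than exactly zero. This is strong evidence against the null hypothesis and suggests that the observed difference in daily arrests between rainy and non-rainy days is unlikely to be due to random chance. In other words, the data provides strong evidence that rain is associated with arrest counts.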
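To show why a permutation test can report 0 and how to avoid it, here is a base R sketch of the same idea on synthetic data. It is not the `infer` implementation: the function name `permutation_p_value` is hypothetical, the two-sided rule based on absolute differences is a simplification, and the "+1" adjustment is a common convention that keeps the estimate strictly above zero.

```r
# Hypothetical helper: permutation test for a difference in means
permutation_p_value <- function(values, is_rain, reps = 1000) {
  # Observed difference: mean on rainy days minus mean on dry days
  observed <- mean(values[is_rain]) - mean(values[!is_rain])

  # Shuffle the rain labels and recompute the difference each time
  permuted <- replicate(reps, {
    shuffled <- sample(is_rain)
    mean(values[shuffled]) - mean(values[!shuffled])
  })

  # Count permuted differences at least as extreme as the observed one
  extreme <- sum(abs(permuted) >= abs(observed))

  # Adding 1 to numerator and denominator avoids reporting exactly 0
  list(observed = observed, p_value = (extreme + 1) / (reps + 1))
}

set.seed(42)
# Synthetic data with a very large gap between the two groups
values <- c(1:50, 1001:1050)
is_rain <- rep(c(TRUE, FALSE), each = 50)

result <- permutation_p_value(values, is_rain, reps = 1000)
print(result$observed)
print(result$p_value)
```

With 1,000 permutations the smallest value this estimate can take is 1/1001, which is a more honest statement than a p-value of exactly 0.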
Correlation between temperature and arrests
# Visualize the distribution of arrests
hist(total_arrest_data$arrests, main = "Distribution of Arrests", xlab = "Arrests")
abline(v = mean(total_arrest_data$arrests), col = "#F58426", lwd = 2)
abline(v = median(total_arrest_data$arrests), col = "#006BB6", lwd = 2)
legend("topleft", legend = c("Mean", "Median"), col = c("#F58426", "#006BB6"), lwd = 2)
Looking at the scatterplot, we see that most observations fall within what one may call the “temperate” range of 40°F to 80°F. This partly reflects how common those temperatures are, so it only suggests that there may be a relationship between temperature and the number of arrests. Any positive trend appears slight and does not continue into the extreme temperatures.
# Visualize the relationship between temperature and arrests
total_arrest_data |>
ggplot(aes(x = temp, y = arrests)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Temperature vs. Arrests",
x = "Temperature (°F)",
y = "Arrests") +
  theme_minimal()
A linear model does not show a correlation between temperature and arrests. The p-value for the temperature coefficient is 0.799, which is not significant. This indicates that temperature does not have a statistically significant linear effect on the number of daily arrests.
# Fit a linear model to the data
lm_model <- lm(arrests ~ temp, data = total_arrest_data)
summary(lm_model)
Call:
lm(formula = arrests ~ temp, data = total_arrest_data)
Residuals:
Min 1Q Median 3Q Max
-775.51 -270.84 -56.25 301.97 905.63
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 864.91282 15.05695 57.443 <2e-16 ***
temp 0.06532 0.25638 0.255 0.799
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 351.5 on 6663 degrees of freedom
Multiple R-squared: 9.741e-06, Adjusted R-squared: -0.0001403
F-statistic: 0.0649 on 1 and 6663 DF, p-value: 0.7989
The correlation between temperature and arrests is positive but close to zero (about 0.003), and the correlation test below gives a p-value of 0.799. This indicates that there is no meaningful linear relationship between the two variables.
# Correlation between temperature and arrests
cor(total_arrest_data$temp, total_arrest_data$arrests)
[1] 0.003121016
# Run correlation test
cor.test(total_arrest_data$temp, total_arrest_data$arrests)
Pearson's product-moment correlation
data: total_arrest_data$temp and total_arrest_data$arrests
t = 0.25476, df = 6663, p-value = 0.7989
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.02088890 0.02712733
sample estimates:
cor
0.003121016
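The correlation test and the simple linear model report the same t statistic and p-value because, with a single predictor, both test the same thing. As a sketch, the values can be recovered from the correlation coefficient alone using the standard relationship t = r * sqrt(df) / sqrt(1 - r^2), with r and df copied from the output above.

```r
# Values copied from the correlation test output
r <- 0.003121016
df <- 6663

# t statistic implied by the correlation coefficient
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

# In simple linear regression, R-squared equals the squared correlation
r_squared <- r^2

print(t_stat)
print(p_value)
print(r_squared)
```

These should agree with the t value of about 0.255, the p-value of about 0.799, and the multiple R-squared of about 9.741e-06 reported by `lm()`.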
Non-Linearity between temperature and arrests
# Fit a quadratic model to the data
quad_model <- lm(arrests ~ poly(temp, 2), data = total_arrest_data)
summary(quad_model)
Call:
lm(formula = arrests ~ poly(temp, 2), data = total_arrest_data)
Residuals:
Min 1Q Median 3Q Max
-761.55 -270.18 -57.11 303.06 905.93
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 868.589 4.305 201.770 <2e-16 ***
poly(temp, 2)1 89.542 351.445 0.255 0.799
poly(temp, 2)2 -504.248 351.445 -1.435 0.151
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 351.4 on 6662 degrees of freedom
Multiple R-squared: 0.0003187, Adjusted R-squared: 1.854e-05
F-statistic: 1.062 on 2 and 6662 DF, p-value: 0.3459
Let's also try smoothing splines and a polynomial regression model.
# Smoothing spline with 2 degrees of freedom
# (columns are referenced directly instead of using attach(), which masks other objects)
plot(total_arrest_data$temp, total_arrest_data$arrests, main = "Temperature vs. Arrests", xlab = "Temperature (°F)", ylab = "Arrests")
spline2 <- smooth.spline(total_arrest_data$temp, total_arrest_data$arrests, df = 2)
lines(spline2, col = "red")
# Smoothing spline with 3 degrees of freedom
plot(total_arrest_data$temp, total_arrest_data$arrests, main = "Temperature vs. Arrests", xlab = "Temperature (°F)", ylab = "Arrests")
spline3 <- smooth.spline(total_arrest_data$temp, total_arrest_data$arrests, df = 3)
lines(spline3, col = "red")
# Build polynomial model
poly.model <- lm(arrests ~ temp+I(temp^2), data = total_arrest_data)
summary(poly.model)
Call:
lm(formula = arrests ~ temp + I(temp^2), data = total_arrest_data)
Residuals:
Min 1Q Median 3Q Max
-761.55 -270.18 -57.11 303.06 905.93
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 805.46543 44.08360 18.271 <2e-16 ***
temp 2.46006 1.68863 1.457 0.145
I(temp^2) -0.02184 0.01522 -1.435 0.151
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 351.4 on 6662 degrees of freedom
Multiple R-squared: 0.0003187, Adjusted R-squared: 1.854e-05
F-statistic: 1.062 on 2 and 6662 DF, p-value: 0.3459
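The two quadratic fits above, one with `poly(temp, 2)` and one with `temp + I(temp^2)`, report the same residuals, R-squared, and F-statistic. That is expected: orthogonal and raw polynomials describe the same curve and differ only in how the coefficients are parameterized. The sketch below illustrates this on synthetic data; the numbers are made up and are not the arrest data.

```r
set.seed(1)
# Synthetic temperatures and counts for illustration only
temp <- runif(200, min = 10, max = 95)
arrests <- 800 + 2 * temp - 0.02 * temp^2 + rnorm(200, sd = 50)

# Orthogonal polynomial basis (the default for poly())
orthogonal_fit <- lm(arrests ~ poly(temp, 2))

# Raw polynomial terms
raw_fit <- lm(arrests ~ temp + I(temp^2))

# Both parameterizations give the same fitted values and R-squared
same_fitted <- isTRUE(all.equal(unname(fitted(orthogonal_fit)),
                                unname(fitted(raw_fit))))
same_r2 <- isTRUE(all.equal(summary(orthogonal_fit)$r.squared,
                            summary(raw_fit)$r.squared))

# The test of the quadratic term is also the same in both fits
p_quad_orthogonal <- coef(summary(orthogonal_fit))[3, "Pr(>|t|)"]
p_quad_raw <- coef(summary(raw_fit))[3, "Pr(>|t|)"]

print(same_fitted)
print(same_r2)
print(c(p_quad_orthogonal, p_quad_raw))
```

Only the lower-order coefficients and their tests differ between the two forms, which is why the linear term shows a p-value of 0.799 in one summary and 0.145 in the other.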
# Visualize
total_arrest_data |>
ggplot(aes(x = temp, y = arrests)) +
geom_point() +
stat_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
labs(title = "Temperature vs. Arrests",
x = "Temperature (°F)",
y = "Arrests") +
  theme_minimal()
Conclusion:
I have attempted to investigate the relationship between weather variables and the number of arrests in New York City. I compared daily arrests on rainy and non-rainy days with a permutation test, and I used a linear model and a polynomial model to analyze the relationship between temperature and the number of arrests.
The correlation between temperature and daily arrests is too weak to be statistically significant. The data does show that the number of arrests is higher, on average, when it does not rain than when it does (about 909 versus 822 per day), and the permutation test indicates that this difference is unlikely to be due to chance. This could be because people are more likely to stay indoors when it rains. Most observations fall between 40°F and 80°F, which may reflect that people are more likely to be outside when the temperature is moderate, although neither the linear nor the quadratic model found a significant temperature effect. Overall, the data shows no linear correlation between temperature and the number of arrests in New York City, while rain does appear to be associated with fewer arrests. A different model might better capture any remaining relationship.