options(repos = c(CRAN = “https://cloud.r-project.org”))
pkgs <- c( “tidyverse”,“lubridate”,“GGally”,“janitor”,“broom”,“huxtable”,“kableExtra”, “skimr”,“mosaic”,“leaflet”,“ggtext”,“viridis”,“vroom”,“here”,“performance”, “patchwork”,“car”,“ggfortify” )
need <- setdiff(pkgs, rownames(installed.packages())) if (length(need)) install.packages(need, dependencies = TRUE, Ncpus = max(1, parallel::detectCores()-1))
New York City is full of urban wildlife, and rats are one of the city’s most infamous animals. Rats in NYC are plentiful and in September 2015 a viral video of a brown rat carrying a slice of pizza down the steps of a NYC subway station in Manhattan created the legend of Pizza Rat.
The source for the data of rat sightings in the city is from NYC’s 311
Service Requests from 2010 to Present. This is a dataset that is
pretty much constantly updated and I downloaded this dataset on
4-Oct-2025. For this group project, you will use R,
dplyr, and ggplot2 to explore and explain the
data, and tell an interesting story. The data fields are shown below and
a full
data dictionary can be found here
We began by loading the dataset containing rat sighting reports from NYC’s 311 system, which records the date, borough, coordinates, and other contextual information for each sighting.
The first step ensured that the data was imported correctly, checking
column names, types, and the presence of key variables like
sighting_year, sighting_month,
borough, latitude, and
longitude.
We also loaded all necessary R packages, including:
dplyr, tidyr, and lubridate for
data manipulation; ggplot2 for visualization;
skimr for descriptive summaries; GGally for
correlation plots; and leaflet for interactive maps.
After loading, we used functions such as glimpse() and
skim() to confirm that the dataset had roughly 270,000
observations and multiple columns,
Most of the key fields were populated (some administrative ones like
bridge_highway_* were expectedly sparse), and
Numeric, categorical, and
date/time variables were correctly identified.
This initial inspection established confidence in data integrity and guided the cleaning decisions for the EDA phase.
setwd("~/Desktop/Group_Assignment_1_export")
# Load and clean Rat Sightings data
setwd("~/Desktop/Group_Assignment_1_export")
rats <- vroom::vroom("Rat_Sightings_20251004.zip", na = c("", "NA", "N/A")) %>%
janitor::clean_names() %>%
mutate(
# Convert the date string into datetime format (handle missing seconds)
created_date = suppressWarnings(coalesce(mdy_hms(created_date, quiet = TRUE),
mdy_hm(created_date, quiet = TRUE))),
# Create additional date variables
sighting_date = as.Date(created_date),
sighting_year = year(created_date),
sighting_month = month(created_date),
sighting_month_name = month(created_date, label = TRUE, abbr = FALSE),
sighting_day = day(created_date),
sighting_weekday = wday(created_date, label = TRUE, abbr = FALSE),
# Ensure numeric coordinates
latitude = suppressWarnings(as.numeric(latitude)),
longitude = suppressWarnings(as.numeric(longitude))
) %>%
# Remove observations with unspecified boroughs
filter(!is.na(borough), borough != "Unspecified") %>%
# Make borough names easier to read
mutate(borough = str_to_title(borough)) %>%
# Remove invalid or extreme coordinates
filter(!is.na(latitude), !is.na(longitude)) %>%
filter(latitude > 35 & latitude < 45, longitude > -80 & longitude < -70)
# Load and clean Weather data
nyc_weather <- read_csv("nyc_weather.csv") %>%
janitor::clean_names() %>%
mutate(
date = as.Date(sunrise) # extract just the date part from sunrise timestamp
) %>%
relocate(date, .before = 1) # keep 'date' as the first column
# Merge Rat Sightings and Weather data by date
rats_weather <- rats %>%
left_join(nyc_weather, by = c("sighting_date" = "date"))
# Remove invalid or extreme coordinates
rats_weather <- rats_weather %>%
filter(!is.na(latitude), !is.na(longitude)) %>%
filter(latitude > 35 & latitude < 45, longitude > -80 & longitude < -70)
# Export the cleaned, merged dataset to CSV for reproducibility
write_csv(rats_weather, "rats_weather_cleaned.csv")The purpose of this section is to explore and understand patterns in rat sightings across New York City. We’ll examine data structure, temporal and spatial patterns, correlations, and seasonality to uncover meaningful insights. Each analysis step begins with a rationale and ends with an interpretation
Before analyzing patterns, it’s essential to understand the dataset’s shape and completeness.
Here we inspect how many observations it contains, how variables are distributed across types (numeric, categorical, date/time), and where missing data might affect analysis.
Rows: 273,608
Columns: 44
$ unique_key <dbl> 66349537, 66348361, 66344540, 66344538,…
$ created_date <dttm> 2025-10-03 00:35:05, 2025-10-03 00:02:…
$ closed_date <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ agency <chr> "DOHMH", "DOHMH", "DOHMH", "DOHMH", "DO…
$ agency_name <chr> "Department of Health and Mental Hygien…
$ complaint_type <chr> "Rodent", "Rodent", "Rodent", "Rodent",…
$ descriptor <chr> "Rat Sighting", "Rat Sighting", "Rat Si…
$ location_type <chr> "3+ Family Apt. Building", "Vacant Lot"…
$ incident_zip <dbl> 11205, 11205, 10036, 10009, 10009, 1123…
$ incident_address <chr> "282 SKILLMAN STREET", "470 FLUSHING AV…
$ street_name <chr> "SKILLMAN STREET", "FLUSHING AVENUE", "…
$ cross_street_1 <chr> "DEKALB AVENUE", "SPENCER STREET", "AMT…
$ cross_street_2 <chr> "LAFAYETTE AVENUE", "WALWORTH STREET", …
$ intersection_street_1 <chr> "DEKALB AVENUE", "SPENCER STREET", "AMT…
$ intersection_street_2 <chr> "LAFAYETTE AVENUE", "WALWORTH STREET", …
$ address_type <chr> "ADDRESS", "ADDRESS", "ADDRESS", "ADDRE…
$ city <chr> "BROOKLYN", "BROOKLYN", "NEW YORK", "NE…
$ landmark <chr> "SKILLMAN STREET", "FLUSHING AVENUE", "…
$ facility_type <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ status <chr> "In Progress", "In Progress", "In Progr…
$ due_date <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ resolution_action_updated_date <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ community_board <chr> "03 BROOKLYN", "03 BROOKLYN", "04 MANHA…
$ borough <chr> "Brooklyn", "Brooklyn", "Manhattan", "M…
$ x_coordinate_state_plane <dbl> 996399, 996506, 985140, 990207, 990736,…
$ y_coordinate_state_plane <dbl> 190626, 193985, 216235, 203023, 203589,…
$ park_facility_name <chr> "Unspecified", "Unspecified", "Unspecif…
$ park_borough <chr> "BROOKLYN", "BROOKLYN", "MANHATTAN", "M…
$ vehicle_type <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ taxi_company_borough <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ taxi_pick_up_location <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bridge_highway_name <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bridge_highway_direction <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ road_ramp <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bridge_highway_segment <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latitude <dbl> 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,…
$ longitude <dbl> -74, -74, -74, -74, -74, -74, -74, -74,…
$ location <chr> "POINT (-73.95619185426317 40.689892769…
$ sighting_date <date> 2025-10-03, 2025-10-03, 2025-10-02, 20…
$ sighting_year <dbl> 2025, 2025, 2025, 2025, 2025, 2025, 202…
$ sighting_month <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
$ sighting_month_name <ord> October, October, October, October, Oct…
$ sighting_day <int> 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ sighting_weekday <ord> Friday, Friday, Thursday, Thursday, Thu…
# Compact, clean summaries from skimr
skim_obj <- skimr::skim(rats)
# helper for consistent table styling
tbl <- function(x, cap) {
kableExtra::kbl(x, caption = cap, digits = 2, align = "lrrrrrrrr") |>
kableExtra::kable_styling(full_width = FALSE,
bootstrap_options = c("striped","hover","condensed"))
}
# CHARACTER
char_tbl <- skim_obj |>
skimr::yank("character") |>
dplyr::select(skim_variable, n_missing, complete_rate, n_unique) |>
dplyr::arrange(dplyr::desc(n_missing)) |>
dplyr::rename(Variable = skim_variable,
Missing = n_missing,
Complete = complete_rate,
`Unique vals` = n_unique)
tbl(char_tbl, "Character variables (compact)")| Variable | Missing | Complete | Unique vals |
|---|---|---|---|
| due_date | 142492 | 0.48 | 131059 |
| landmark | 138173 | 0.49 | 4510 |
| intersection_street_1 | 117688 | 0.57 | 6090 |
| intersection_street_2 | 117609 | 0.57 | 6147 |
| cross_street_1 | 26275 | 0.90 | 6826 |
| cross_street_2 | 26227 | 0.90 | 6912 |
| closed_date | 15860 | 0.94 | 146720 |
| incident_address | 10489 | 0.96 | 119832 |
| street_name | 10489 | 0.96 | 8455 |
| resolution_action_updated_date | 7542 | 0.97 | 186040 |
| city | 2277 | 0.99 | 91 |
| address_type | 81 | 1.00 | 7 |
| location_type | 12 | 1.00 | 54 |
| agency | 0 | 1.00 | 1 |
| agency_name | 0 | 1.00 | 1 |
| complaint_type | 0 | 1.00 | 1 |
| descriptor | 0 | 1.00 | 1 |
| status | 0 | 1.00 | 7 |
| community_board | 0 | 1.00 | 73 |
| borough | 0 | 1.00 | 5 |
| park_facility_name | 0 | 1.00 | 1 |
| park_borough | 0 | 1.00 | 5 |
| location | 0 | 1.00 | 130602 |
# FACTOR
fac_tbl <- skim_obj |>
skimr::yank("factor") |>
dplyr::select(skim_variable, n_missing, complete_rate, ordered, n_unique, top_counts) |>
dplyr::rename(Variable = skim_variable,
Missing = n_missing,
Complete = complete_rate,
Ordered = ordered,
`Unique vals` = n_unique,
`Top counts` = top_counts)
# Truncate long "Top counts" strings so the table stays tidy
fac_tbl$`Top counts` <- stringr::str_trunc(fac_tbl$`Top counts`, 60)
tbl(fac_tbl, "Factor variables (levels & top counts)")| Variable | Missing | Complete | Ordered | Unique vals | Top counts |
|---|---|---|---|---|---|
| sighting_month_name | 0 | 1 | TRUE | 12 | Jul: 30436, Aug: 30011, Jun: 29160, May: 28163 |
| sighting_weekday | 0 | 1 | TRUE | 7 | Mon: 46454, Tue: 45808, Wed: 44905, Thu: 43381 |
# DATE
date_tbl <- skim_obj |>
skimr::yank("Date") |>
dplyr::select(skim_variable, n_missing, complete_rate, min, median, max, n_unique) |>
dplyr::rename(Variable = skim_variable,
Missing = n_missing,
Complete = complete_rate,
Min = min, Median = median, Max = max,
`Unique dates` = n_unique)
tbl(date_tbl, "Date variables (range)")| Variable | Missing | Complete | Min | Median | Max | Unique dates |
|---|---|---|---|---|---|---|
| sighting_date | 0 | 1 | 2010-01-01 | 2019-10-08 | 2025-10-03 | 5754 |
# LOGICAL
log_tbl <- skim_obj |>
skimr::yank("logical") |>
dplyr::select(skim_variable, n_missing, complete_rate) |>
dplyr::rename(Variable = skim_variable,
Missing = n_missing,
Complete = complete_rate)
tbl(log_tbl, "Logical variables")| Variable | Missing | Complete |
|---|---|---|
| facility_type | 273608 | 0 |
| vehicle_type | 273608 | 0 |
| taxi_company_borough | 273608 | 0 |
| taxi_pick_up_location | 273608 | 0 |
| bridge_highway_name | 273608 | 0 |
| bridge_highway_direction | 273608 | 0 |
| road_ramp | 273608 | 0 |
| bridge_highway_segment | 273608 | 0 |
# NUMERIC
num_tbl <- skim_obj |>
skimr::yank("numeric") |>
dplyr::select(skim_variable, n_missing, complete_rate, mean, sd, p0, p25, p50, p75, p100) |>
dplyr::rename(Variable = skim_variable,
Missing = n_missing,
Complete = complete_rate,
Mean = mean, SD = sd,
Min = p0, Q1 = p25, Median = p50, Q3 = p75, Max = p100)
tbl(num_tbl, "Numeric variables (distribution summary)")| Variable | Missing | Complete | Mean | SD | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|---|---|---|---|
| unique_key | 0 | 1 | 43757775.9 | 14102172.84 | 11464394 | 31998896 | 44009486 | 56143316 | 66353297 |
| incident_zip | 19 | 1 | 10758.7 | 554.98 | 83 | 10111 | 11105 | 11225 | 12345 |
| x_coordinate_state_plane | 0 | 1 | 1002097.3 | 18187.34 | 913495 | 993417 | 1000624 | 1011119 | 1067170 |
| y_coordinate_state_plane | 0 | 1 | 207832.8 | 29213.26 | 121350 | 186382 | 203284 | 233288 | 271876 |
| latitude | 0 | 1 | 40.7 | 0.08 | 40 | 41 | 41 | 41 | 41 |
| longitude | 0 | 1 | -73.9 | 0.07 | -74 | -74 | -74 | -74 | -74 |
| sighting_year | 0 | 1 | 2018.7 | 4.35 | 2010 | 2015 | 2019 | 2022 | 2025 |
| sighting_month | 0 | 1 | 6.5 | 3.06 | 1 | 4 | 7 | 9 | 12 |
| sighting_day | 0 | 1 | 15.7 | 8.75 | 1 | 8 | 16 | 23 | 31 |
The dataset contains over 270,000 records and 44 columns, with variables describing borough, location, and sighting time. Most analytical variables (borough, year, month, latitude, longitude) are complete, while administrative fields have expected missing values.
Next, we summarise how many variables are numeric, categorical, or date/time.
This helps identify the types of information available for different types of plots or models that we want to use later on our analysis.
# Dimensions
knitr::kable(
tibble::tibble(
observations = nrow(rats),
variables = ncol(rats)
),
caption = "Dataset Dimensions"
) |>
kableExtra::kable_styling(full_width = FALSE)| observations | variables |
|---|---|
| 273608 | 44 |
# Variable type counts
knitr::kable(
tibble::tibble(
numeric_vars = sum(sapply(rats, is.numeric)),
categorical_vars = sum(sapply(rats, function(x) is.character(x) || is.factor(x))),
date_time_vars = sum(sapply(rats, inherits, "POSIXct")) + sum(sapply(rats, inherits, "Date"))
),
caption = "Variable Type Counts"
) |>
kableExtra::kable_styling(full_width = FALSE)| numeric_vars | categorical_vars | date_time_vars |
|---|---|---|
| 9 | 25 | 2 |
Most variables are categorical or date-based, which is expected for a dataset describing reported incidents across time and boroughs.
Before we visualize correlations between numeric variables, we first
review basic descriptive statistics
and compute a correlation matrix to quantify relationships.
library(dplyr)
library(tidyr)
library(kableExtra)
rats_num <- rats %>%
mutate(
sighting_year = as.numeric(sighting_year),
sighting_month = as.numeric(sighting_month)
)
# Descriptive stats (tidy and readable)
desc_stats <- rats_num %>%
summarise(
latitude_min = min(latitude, na.rm = TRUE),
latitude_mean = mean(latitude, na.rm = TRUE),
latitude_sd = sd(latitude, na.rm = TRUE),
latitude_max = max(latitude, na.rm = TRUE),
longitude_min = min(longitude, na.rm = TRUE),
longitude_mean = mean(longitude, na.rm = TRUE),
longitude_sd = sd(longitude, na.rm = TRUE),
longitude_max = max(longitude, na.rm = TRUE),
year_min = min(sighting_year, na.rm = TRUE),
year_max = max(sighting_year, na.rm = TRUE),
month_min = min(sighting_month, na.rm = TRUE),
month_max = max(sighting_month, na.rm = TRUE)
) %>%
pivot_longer(everything(),
names_to = c("Variable","Statistic"),
names_sep = "_",
values_to = "Value") %>%
pivot_wider(names_from = Statistic, values_from = Value)
kableExtra::kbl(
desc_stats,
caption = "Descriptive Statistics for Key Numeric Variables",
digits = 2
) |>
kableExtra::kable_styling(full_width = FALSE,
bootstrap_options = c("hover","condensed"))| Variable | min | mean | sd | max |
|---|---|---|---|---|
| latitude | 40 | 41 | 0.08 | 41 |
| longitude | -74 | -74 | 0.07 | -74 |
| year | 2010 | NA | NA | 2025 |
| month | 1 | NA | NA | 12 |
# Correlation matrix
cor_matrix <- rats_num %>%
select(latitude, longitude, sighting_year, sighting_month) %>%
cor(use = "pairwise.complete.obs") %>%
round(2)
kableExtra::kbl(
cor_matrix,
caption = "Correlation Matrix of Numeric Variables"
) |>
kableExtra::kable_styling(full_width = FALSE,
bootstrap_options = c("hover","condensed"))| latitude | longitude | sighting_year | sighting_month | |
|---|---|---|---|---|
| latitude | 1.00 | 0.40 | -0.02 | 0.00 |
| longitude | 0.40 | 1.00 | -0.02 | 0.00 |
| sighting_year | -0.02 | -0.02 | 1.00 | -0.05 |
| sighting_month | 0.00 | 0.00 | -0.05 | 1.00 |
The descriptive summary from mosaic::favstats() shows
that latitude and longitude fall within the
expected NYC range, confirming valid spatial coverage.
Year values span from 2010–2025, ensuring full temporal
depth.
The correlation matrix reveals a strong positive relationship between latitude and longitude (expected for spatial coordinates), while temporal variables (sighting_year, sighting_month) are largely independent — suggesting stable geographic distribution of sightings over time.
Having reviewed the numeric correlations above, we now visualize these relationships to better understand their strength and linearity.
num_small <- rats %>%
transmute(
year = as.numeric(sighting_year),
month = as.numeric(sighting_month),
lat = latitude,
lon = longitude,
zip = suppressWarnings(as.numeric(incident_zip))
) %>%
drop_na()
if (nrow(num_small) > 4000) {
set.seed(1)
num_small <- dplyr::slice_sample(num_small, n = 4000)
}
GGally::ggpairs(num_small,
upper = list(continuous = GGally::wrap("cor")),
lower = list(continuous = GGally::wrap("points", alpha = 0.08)),
diag = list(continuous = "densityDiag"),
progress = FALSE,
title = "Scatterplot & Correlation Matrix (Sampled)"
)Latitude and longitude are slightly related to borough location, but correlations among numeric variables are low to moderate, meaning no strong linear dependencies exist between year and coordinates and each contributes unique information.
Having cleaned our data and inspected our variables, we start by exploring the total counts by borough to see which specific areas of New York City report the most rat sightings during the whole examining period.
borough_totals <- rats %>%
count(borough, name = "n") %>%
arrange(desc(n))
kableExtra::kbl(borough_totals, caption = "Total Rat Sightings by Borough") |>
kableExtra::kable_styling(full_width = FALSE)| borough | n |
|---|---|
| Brooklyn | 100687 |
| Manhattan | 71148 |
| Bronx | 49030 |
| Queens | 42303 |
| Staten Island | 10440 |
It is evident that Brooklyn and Manhattan dominate total sighting reports, while Staten Island has the lowest counts.
This likely reflects differences in density, waste management, and public reporting habits.
Next, we explore how sightings evolve over time, identifying trends or shocks such as COVID-19.
rats %>%
count(sighting_year, name = "n") %>%
ggplot(aes(x = sighting_year, y = n)) +
geom_line(linewidth = 1.2, color = "#1f77b4") +
geom_point(size = 2) +
scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
labs(title = "Total Rat Sightings by Year (2015–2024)",
x = "Year", y = "Number of Sightings") +
theme_minimal(base_size = 13)Sightings increased steadily from 2015 to 2019, dropped sharply in 2020 due to pandemic restrictions, and rebounded dramatically in 2021. This pattern shows how rodent sightings correlate with human activity cycles.
We want to further investigate whether rat sightings follow a seasonal pattern by examining monthly average, as warmer months may show higher rat activity.
rats %>%
mutate(sighting_month_name = factor(sighting_month_name, levels = month.name)) %>%
count(borough, sighting_month_name, name = "n") %>%
ggplot(aes(x = sighting_month_name, y = n, fill = borough)) +
geom_col(position = "dodge") +
theme_minimal(base_size = 13) +
labs(title = "Monthly Rat Sightings by Borough", x = "Month", y = "Number of Sightings") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))Sightings peak between June and September, coinciding with warmer temperatures that enhance rat activity and visibility. This demonstrates strong seasonality tied to environmental and human behavioral factors.
In order to better reflect that seasonality across boroughs, we use a heatmap, as shown below:
rats %>%
mutate(sighting_month_name = factor(sighting_month_name, levels = month.name)) %>%
count(borough, sighting_month_name, name = "n") %>%
ggplot(aes(x = sighting_month_name, y = borough, fill = n)) +
geom_tile() +
scale_fill_continuous(labels = label_number(scale_cut = cut_short_scale())) +
labs(title = "Heatmap: Sightings by Month and Borough",
x = "Month", y = "Borough", fill = "Count") +
theme_minimal(base_size = 13) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))The heatmap clearly indicates that Brooklyn and Manhattan have higher sighting, especially during the summer months, further validating the correlation of warmth and rat’s sightings.
| Additional heatmap: to further uncover the reporting peak month in each borough, we demonstrate the percentage of monthly reporting data in each borough using heatmap. The white dots mark each borough’s peak month. It reveals that Staten Island—despite having the lowest total rat reports—consistently peaks earlier (June) than the other boroughs, which peak in August–September. In other words, Staten Island’s seasonal crest leads the citywide crest by ~2–3 months. Operationally, this makes Staten Island a leading indicator of the summer infestation wave: when Staten Island’s share spikes in June, it signals that conditions (temperature, food/waste availability, breeding cycles) are becoming favorable citywide. |
| ``` r # 0) Ensure months are in calendar order rats2 <- rats %>% mutate( sighting_month_name = factor(sighting_month_name, levels = month.name) ) |
| # 1) Summaries: counts + within-borough shares; also order boroughs by annual total summ <- rats2 %>% count(borough, sighting_month_name, name = “n”) %>% group_by(borough) %>% mutate( borough_total = sum(n, na.rm = TRUE), share = n / borough_total # within-borough seasonality (structure, not size) ) %>% ungroup() %>% mutate( borough = fct_reorder(borough, borough_total, .desc = TRUE) ) |
| # 2A) Heatmap A — COUNTS (log-scale fill to handle heavy tails) p_counts <- ggplot(summ, aes(x = sighting_month_name, y = borough, fill = n)) + geom_tile() + scale_fill_gradientn( colours = scales::viridis_pal()(6), trans = “log10”, labels = scales::label_number(scale_cut = scales::cut_short_scale()) ) + labs( title = “Rat Sightings by Month × Borough (Counts, log scale)”, x = “Month”, y = “Borough”, fill = “Count” ) + theme_minimal(base_size = 13) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) |
| # 2B) Heatmap B — WITHIN-BOROUGH SHARE (removes size; highlights seasonality) p_share <- ggplot(summ, aes(x = sighting_month_name, y = borough, fill = share)) + geom_tile() + scale_fill_gradientn( colours = scales::viridis_pal()(6), labels = scales::percent_format(accuracy = 1) # ✅ fully qualified ) + labs( title = “Seasonality Within Borough (Share of Borough’s Annual Reports)”, x = “Month”, y = NULL, fill = “Share” ) + theme_minimal(base_size = 13) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) |
| # 3) annotate each borough’s peak month on the share heatmap peaks <- summ %>% group_by(borough) %>% slice_max(order_by = share, n = 1, with_ties = FALSE) %>% ungroup() |
| p_share_annot <- p_share + geom_point(data = peaks, shape = 21, color = “white”, size = 2, stroke = 0.6) + labs(subtitle = “White dots mark each borough’s peak month”) |
| # 4) Show both panels side-by-side p_counts + p_share_annot + plot_layout(widths = c(1, 1.05)) ``` |
Having examined the trends in years and months, it might be insightful to examine if a pattern exists on certain weekdays.
rats %>%
count(sighting_weekday, name = "n") %>%
mutate(sighting_weekday = factor(sighting_weekday,
levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))) %>%
ggplot(aes(x = sighting_weekday, y = n)) +
geom_col(fill = "#1f77b4") +
theme_minimal(base_size = 13) +
labs(title = "Rat Sightings by Day of the Week", x = "Weekday", y = "Number of Sightings")Based on the chart above, reports are highest mid-week (Tuesday to Thursday) and drop over the weekends. However, it is very likely that reporting, rather than rat activity drives the observed pattern.
In order to better visualize the geographic distribution of 2024 rat sightings, we use an interactive map to identify spatial clusters.
# ---- Create a palette for the top location types ----
top_location_types <- rats %>%
count(location_type, sort = TRUE) %>%
slice(1:7) %>%
pull(location_type)
# ---- Create color palette for the map ----
palette <- colorFactor(rainbow(length(top_location_types)), top_location_types)
rats %>%
filter(sighting_year == 2024, location_type %in% top_location_types, !is.na(latitude), !is.na(longitude)) %>%
leaflet() %>%
addProviderTiles("CartoDB.Positron") %>%
addCircleMarkers(lng = ~longitude, lat = ~latitude, color = ~palette(location_type), radius = 1, fillOpacity = 0.6, popup = ~paste("Type:", location_type)) %>%
addLegend("bottomright", pal = palette, values = ~location_type, title = "Location Type (2024 Sightings)")Sightings are highly concentrated in dense urban areas, especially Brooklyn, lower Manhattan, and the Bronx, around residential and food-related sites. This spatial clustering reinforces patterns linked to human density and infrastructure.
Finally, since we show above the geographic concentration of rats in 2024, it is interesting to examine whether people reported new infestations throughout the examining period or repeatedly reported the same sites
rats %>%
mutate(location_id = paste(round(latitude, 4), round(longitude, 4))) %>%
group_by(sighting_year) %>%
summarise(unique_locations = n_distinct(location_id)) %>%
ggplot(aes(x = sighting_year, y = unique_locations)) +
geom_line(linewidth = 1.2, color = "#1f77b4") +
geom_point(size = 2) +
theme_minimal(base_size = 13) +
labs(title = "Unique Rat Sighting Locations per Year",
x = "Year", y = "Unique Locations")
The data clearly showcase that While total reports increased, the
number of unique locations also rose, indicating a true
expansion of rat activity rather than repeated complaints from the same
sites.
The exploratory analysis reveals several key insights:
Overall, rat activity in NYC is tightly connected to human behavior and city infrastructure, making it both an ecological and social phenomenon. These insights form the foundation for regression analysis in the next section.
We now move from exploratory plots to statistical modelling.
Our goal in this section is to explain differences in rat complaints
across NYC using:
- environment (temperature, precipitation, humidity, wind speed),
- geography (borough),
- and calendar effects (weekday and month/season).
To do that, we: 1. Aggregate the raw 311 reports into daily rat sightings per borough. 2. Merge those counts with daily weather data for NYC. 3. Model the log of daily sightings using linear regression and interpret the coefficients. 4. Diagnose model assumptions and compare model performance.
First, we create our modelling dataset.
n, the number of rat sightings per
borough per day.log_n = log(n). We use log counts
because raw daily counts of rat sightings are very skewed (lots of low
values and a few very high ones), and linear regression works better
when the dependent variable is closer to normal and has more stable
variance.# Build daily counts per borough × date (dependent variable n)
rats_daily <- rats |>
count(borough, sighting_date, name = "n")
# Sanity check: make sure weather's date column is actually Date
stopifnot(inherits(nyc_weather$date, "Date"))
# Join rat counts to weather by date
model_data <- rats_daily |>
left_join(nyc_weather, by = c("sighting_date" = "date"))At this point, model_data has, for each borough-day: -
number of rat complaints (n) - temperature, precipitation,
humidity, windspeed - the calendar date (sighting_date) -
the borough
This is the unit of analysis we will model.
We now prepare variables for modelling:
n is missing.n to create
log_n. This stabilises variance and makes the distribution
closer to normal.weekday and
month, so we can later control for day-of-week and
seasonality.Bronx as the reference borough. In
regression with factors, one level is treated as the baseline; all other
borough coefficients are interpreted relative to that baseline. We use
Bronx because it is large and well-reported.model_data <- model_data |>
filter(!is.na(temp), !is.na(n)) |>
mutate(
log_n = log(n), # log transform for right-skewed counts
wday = lubridate::wday(sighting_date, label = TRUE, abbr = FALSE),
month = lubridate::month(sighting_date, label = TRUE, abbr = FALSE),
borough= forcats::fct_relevel(borough, "Bronx") # Bronx as reference
)Because log(0) is undefined, we drop borough-days where
no rat complaints were recorded (n == 0).
We then compare the raw counts n and the transformed
variable log_n: - n is extremely right-skewed:
most borough-days have just a few complaints, but some days spike. -
log_n is much closer to bell-shaped.
This supports using log_n as the dependent variable in
our regression instead of the raw count.
We now fit three linear models, each one adding more structure:
Model 1: log_n ~ temp
Just temperature. Tests: do rats get reported more on warmer
days?
Model 2:
log_n ~ temp + borough
Adds borough fixed effects. Tests: are some boroughs systematically
“rattier” than others, holding temperature constant?
Model 3:
log_n ~ temp + borough + weather + calendar
Adds precipitation, humidity, windspeed, weekday, and month.
Tests: do short-run weather shocks and seasonal patterns still matter
after controlling for borough?
We then compare significance, interpretability, and model fit.
model_data <- model_data |>
filter(n > 0)
ggplot(model_data, aes(x = n)) +
geom_histogram(bins = 30, fill = "grey70", color = "black") +
labs(title = "Distribution of daily rat sightings (n)",
x = "n (per borough per day)", y = "Count of days")ggplot(model_data, aes(x = log_n)) +
geom_histogram(bins = 30, fill = "grey70", color = "black") +
labs(title = "Distribution of log(n)",
x = "log(n)", y = "Count of days")# Core models required by the brief
model1 <- lm(log_n ~ temp, data = model_data)
model2 <- lm(log_n ~ temp + borough, data = model_data)
# Richer model with weather + calendar fixed effects (use only fields that exist)
has_cols <- intersect(names(model_data), c("precip","humidity","windspeed"))
rhs <- c("temp", "borough", has_cols, "wday", "month")
fmla <- reformulate(rhs, response = "log_n")
model3 <- lm(fmla, data = model_data)
summary(model1)
Call:
lm(formula = log_n ~ temp, data = model_data)
Residuals:
Min 1Q Median 3Q Max
-2.396 -0.561 0.129 0.640 2.467
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.557086 0.011509 135.3 <0.0000000000000002 ***
temp 0.024533 0.000686 35.8 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.89 on 19087 degrees of freedom
Multiple R-squared: 0.0628, Adjusted R-squared: 0.0628
F-statistic: 1.28e+03 on 1 and 19087 DF, p-value: <0.0000000000000002
Call:
lm(formula = log_n ~ temp + borough, data = model_data)
Residuals:
Min 1Q Median 3Q Max
-2.8234 -0.3932 0.0579 0.4574 2.5532
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.561285 0.012266 127.3 <0.0000000000000002 ***
temp 0.028781 0.000501 57.4 <0.0000000000000002 ***
boroughBrooklyn 0.588681 0.014398 40.9 <0.0000000000000002 ***
boroughManhattan 0.327607 0.014408 22.7 <0.0000000000000002 ***
boroughQueens -0.278607 0.014516 -19.2 <0.0000000000000002 ***
boroughStaten Island -1.308180 0.015759 -83.0 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.65 on 19083 degrees of freedom
Multiple R-squared: 0.502, Adjusted R-squared: 0.502
F-statistic: 3.84e+03 on 5 and 19083 DF, p-value: <0.0000000000000002
Call:
lm(formula = fmla, data = model_data)
Residuals:
Min 1Q Median 3Q Max
-2.6606 -0.3732 0.0434 0.4248 2.9123
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.638975 0.029558 55.45 < 0.0000000000000002 ***
temp 0.022844 0.001089 20.98 < 0.0000000000000002 ***
boroughBrooklyn 0.590704 0.013346 44.26 < 0.0000000000000002 ***
boroughManhattan 0.329436 0.013356 24.67 < 0.0000000000000002 ***
boroughQueens -0.282118 0.013456 -20.97 < 0.0000000000000002 ***
boroughStaten Island -1.340530 0.014623 -91.67 < 0.0000000000000002 ***
humidity 0.001525 0.000322 4.74 0.00000216857579 ***
precip -0.002470 0.000341 -7.23 0.00000000000049 ***
windspeed -0.004105 0.000679 -6.04 0.00000000156138 ***
wday.L -0.066895 0.011657 -5.74 0.00000000968975 ***
wday.Q -0.561267 0.011666 -48.11 < 0.0000000000000002 ***
wday.C 0.107483 0.011547 9.31 < 0.0000000000000002 ***
wday^4 -0.184564 0.011485 -16.07 < 0.0000000000000002 ***
wday^5 0.044519 0.011436 3.89 0.00009936209739 ***
wday^6 -0.036319 0.011417 -3.18 0.0015 **
month.L -0.162688 0.018669 -8.71 < 0.0000000000000002 ***
month.Q -0.263857 0.031552 -8.36 < 0.0000000000000002 ***
month.C 0.020435 0.016828 1.21 0.2246
month^4 -0.030078 0.016627 -1.81 0.0705 .
month^5 -0.007385 0.015324 -0.48 0.6299
month^6 0.094175 0.015275 6.17 0.00000000071695 ***
month^7 0.029817 0.015179 1.96 0.0495 *
month^8 0.062468 0.014984 4.17 0.00003072300185 ***
month^9 0.001455 0.015028 0.10 0.9229
month^10 0.017082 0.014964 1.14 0.2537
month^11 0.027705 0.014888 1.86 0.0628 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6 on 19063 degrees of freedom
Multiple R-squared: 0.572, Adjusted R-squared: 0.572
F-statistic: 1.02e+03 on 25 and 19063 DF, p-value: <0.0000000000000002
Model 1 (temp only):
The coefficient on temp is positive and statistically
significant. This means that, on average, hotter days are associated
with higher rat complaint activity.
Because the dependent variable is log(counts), the coefficient on
temp can be interpreted approximately as a percentage
change: for a 1°C increase in temperature, rat complaints increase by
about 100 × (exp(β_temp) − 1) percent, holding other factors
constant.
Model 2 (temp + borough):
After adding borough fixed effects, model fit improves a lot (higher
adjusted R²).
Bronx is the baseline borough. A borough coefficient tells
us how much higher or lower expected rat complaints are in that borough
relative to the Bronx, holding temperature constant.
For example, if the Brooklyn coefficient is around 0.59,
exp(0.59) ≈ 1.80, which means ~80% more rat sightings per
day in Brooklyn than in the Bronx at the same temperature.
Temperature stays significant in Model 2, so weather still matters even after controlling for which borough we’re in.
Model 3 (full weather + calendar controls):
Model 3 adds precipitation, humidity, windspeed, weekday, and
month.
This lets us separate out: - short-term conditions (rain, humidity), -
recurring patterns (day of week), - and strong seasonality (summer vs
winter).
If Model 3 has higher adjusted R² and lower residual standard error, that means these extra factors help explain additional variation in rat reports. This model is best if our goal is prediction/forecasting.
After fitting a regression, we need to check assumptions: - Are residuals roughly centered and patternless? (linearity and homoscedasticity) - Do residuals look roughly normal? (normality) - Are there extremely influential outliers that dominate the fit?
We do this using the standard 4 diagnostic plots for each model.
log_n.Overall, model assumptions look acceptable.
Before trusting Model 3, we need to confirm that the weather variables are not so highly correlated with each other that they cause multicollinearity.
Below we compute pairwise correlations among temperature, precipitation, humidity, and windspeed:
# Correlation among weather vars
model_data |>
dplyr::select(any_of(c("temp","precip","humidity","windspeed"))) |>
cor(use = "complete.obs") |>
round(2) temp precip humidity windspeed
temp 1.00 -0.02 0.18 -0.28
precip -0.02 1.00 0.29 0.11
humidity 0.18 0.29 1.00 -0.05
windspeed -0.28 0.11 -0.05 1.00
All pairwise correlations are moderate (typically well below
|0.8|).
This suggests no single weather variable is basically a duplicate of
another.
As a secondary check, we also compute Variance Inflation Factors (VIF). A VIF > 10 would indicate serious collinearity.
# VIF on the richest model you keep (model2 or model3)
if (requireNamespace("car", quietly = TRUE)) {
# get raw VIF output
v <- car::vif(model2)
# if it's a vector, turn it into a data frame
if (is.vector(v)) {
vif_df <- tibble::tibble(
term = names(v),
VIF = as.numeric(v)
)
} else {
# otherwise assume it's a matrix/data.frame from GVIF
vif_df <- as.data.frame(v) |>
tibble::rownames_to_column("term")
# make the column names nice
colnames(vif_df) <- gsub("\\.", " ", colnames(vif_df))
# typical columns are: term, GVIF, Df, GVIF^(1/(2*Df))
}
knitr::kable(
vif_df,
digits = 2,
caption = "Variance Inflation Factors (Model 2)"
)
}| term | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| temp | 1 | 1 | 1 |
| borough | 1 | 4 | 1 |
The Variance Inflation Factors (VIFs) for Model 2 indicate no signs of multicollinearity. Both temperature and borough are statistically independent (VIF ≈ 1), confirming that each provides unique explanatory power in the regression.
huxtable::huxreg(
"Model 1: Temp only" = model1,
"Model 2: Temp + Borough" = model2,
"Model 3: + Calendar/Weather" = model3,
statistics = c("N"="nobs","Adj. R²"="adj.r.squared","RSE"="sigma")
)| Model 1: Temp only | Model 2: Temp + Borough | Model 3: + Calendar/Weather | |
|---|---|---|---|
| (Intercept) | 1.557 *** | 1.561 *** | 1.639 *** |
| (0.012) | (0.012) | (0.030) | |
| temp | 0.025 *** | 0.029 *** | 0.023 *** |
| (0.001) | (0.001) | (0.001) | |
| boroughBrooklyn | 0.589 *** | 0.591 *** | |
| (0.014) | (0.013) | ||
| boroughManhattan | 0.328 *** | 0.329 *** | |
| (0.014) | (0.013) | ||
| boroughQueens | -0.279 *** | -0.282 *** | |
| (0.015) | (0.013) | ||
| boroughStaten Island | -1.308 *** | -1.341 *** | |
| (0.016) | (0.015) | ||
| humidity | 0.002 *** | ||
| (0.000) | |||
| precip | -0.002 *** | ||
| (0.000) | |||
| windspeed | -0.004 *** | ||
| (0.001) | |||
| wday.L | -0.067 *** | ||
| (0.012) | |||
| wday.Q | -0.561 *** | ||
| (0.012) | |||
| wday.C | 0.107 *** | ||
| (0.012) | |||
| wday^4 | -0.185 *** | ||
| (0.011) | |||
| wday^5 | 0.045 *** | ||
| (0.011) | |||
| wday^6 | -0.036 ** | ||
| (0.011) | |||
| month.L | -0.163 *** | ||
| (0.019) | |||
| month.Q | -0.264 *** | ||
| (0.032) | |||
| month.C | 0.020 | ||
| (0.017) | |||
| month^4 | -0.030 | ||
| (0.017) | |||
| month^5 | -0.007 | ||
| (0.015) | |||
| month^6 | 0.094 *** | ||
| (0.015) | |||
| month^7 | 0.030 * | ||
| (0.015) | |||
| month^8 | 0.062 *** | ||
| (0.015) | |||
| month^9 | 0.001 | ||
| (0.015) | |||
| month^10 | 0.017 | ||
| (0.015) | |||
| month^11 | 0.028 | ||
| (0.015) | |||
| N | 19089 | 19089 | 19089 |
| Adj. R² | 0.063 | 0.502 | 0.572 |
| RSE | 0.890 | 0.649 | 0.602 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |||
Finally, we summarise the models side by side using
huxreg().
This table reports: - the estimated coefficients, - statistical
significance, - adjusted R² (goodness of fit, penalised), - residual
standard error (typical size of the remaining error).
This allows us to compare explanatory power and interpretability.
We see a clear pattern: - Model 1 (temperature only) explains some variation: weather matters. - Model 2 (adds borough) explains substantially more. Borough fixed effects capture persistent spatial differences in rat activity. - Model 3 (adds precipitation, humidity, windspeed, weekday, and month) gives the best overall fit and lowest residual error, showing that both environmental shocks and seasonality add predictive power.
In short: - Best for explanation / policy: Model 2. It’s easy to interpret (“Brooklyn vs Bronx”, temperature effect). - Best for prediction / forecasting: Model 3. It captures weather + calendar patterns and fits best statistically.
From an applied perspective, these results suggest that: - Rat activity (or at least rat complaint volume) is seasonal and weather-sensitive, not just random. - Some boroughs face structurally higher rat pressure even after controlling for weather, which implies persistent local conditions (population density, waste handling, building stock). - Targeted mitigation (sanitation, pest control) is most urgent in warmer months and in the high-intensity boroughs.
In other words: temperature matters, but location and seasonality matter even more.
#4 We made limited use of AI tools (such as ChatGPT) during this assignment, primarily to assist with specific R syntax and debugging certain technical issues (e.g., formatting tables, handling missing values, or improving code readability). These tools were helpful in speeding up troubleshooting and providing examples of proper function usage. However, all analytical decisions, including variable selection, interpretation of results, and overall structure of the analysis, were made by our team. We also reviewed, modified, and validated all AI-generated code to ensure it aligned with our objectives and understanding. While AI proved to be a useful supplementary resource, we recognize that it cannot replace critical thinking, domain knowledge, or independent problem-solving in data analysis.
Contribution of each group member: 1 - Data Cleaning: Jiaomingyue Jiao, Stephanie Chen 2 - EDA: Dimitrios Stathogiannopoulos, Sunghwan Choi 3 - Regression: Carla-Marie Passarini, Michelangelo Mauro, Cian Dwyer