1 CRAN mirror (robust default)

options(repos = c(CRAN = “https://cloud.r-project.org”))

2 List every package your Rmd uses

pkgs <- c( “tidyverse”,“lubridate”,“GGally”,“janitor”,“broom”,“huxtable”,“kableExtra”, “skimr”,“mosaic”,“leaflet”,“ggtext”,“viridis”,“vroom”,“here”,“performance”, “patchwork”,“car”,“ggfortify” )

3 Install only the ones you don’t have yet

need <- setdiff(pkgs, rownames(installed.packages())) if (length(need)) install.packages(need, dependencies = TRUE, Ncpus = max(1, parallel::detectCores()-1))

New York City is full of urban wildlife, and rats are one of the city’s most infamous animals. Rats in NYC are plentiful and in September 2015 a viral video of a brown rat carrying a slice of pizza down the steps of a NYC subway station in Manhattan created the legend of Pizza Rat.

The source for the data of rat sightings in the city is from NYC’s 311 Service Requests from 2010 to Present. This is a dataset that is pretty much constantly updated and I downloaded this dataset on 4-Oct-2025. For this group project, you will use R, dplyr, and ggplot2 to explore and explain the data, and tell an interesting story. The data fields are shown below and a full data dictionary can be found here

4 Load and clean data

We began by loading the dataset containing rat sighting reports from NYC’s 311 system, which records the date, borough, coordinates, and other contextual information for each sighting.

The first step ensured that the data was imported correctly, checking column names, types, and the presence of key variables like sighting_year, sighting_month, borough, latitude, and longitude.

We also loaded all necessary R packages, including: dplyr, tidyr, and lubridate for data manipulation; ggplot2 for visualization; skimr for descriptive summaries; GGally for correlation plots; and leaflet for interactive maps.

After loading, we used functions such as glimpse() and skim() to confirm that the dataset had roughly 270,000 observations and multiple columns,

Most of the key fields were populated (some administrative ones like bridge_highway_* were expectedly sparse), and Numeric, categorical, and date/time variables were correctly identified.

This initial inspection established confidence in data integrity and guided the cleaning decisions for the EDA phase.

 setwd("~/Desktop/Group_Assignment_1_export")
    # Load and clean Rat Sightings data
setwd("~/Desktop/Group_Assignment_1_export")
    rats <- vroom::vroom("Rat_Sightings_20251004.zip", na = c("", "NA", "N/A")) %>% 
    janitor::clean_names() %>%
    mutate(
    # Convert the date string into datetime format (handle missing seconds)
    created_date = suppressWarnings(coalesce(mdy_hms(created_date, quiet = TRUE),
                                             mdy_hm(created_date, quiet = TRUE))),
    
    # Create additional date variables
    sighting_date = as.Date(created_date),
    sighting_year = year(created_date),
    sighting_month = month(created_date),
    sighting_month_name = month(created_date, label = TRUE, abbr = FALSE),
    sighting_day = day(created_date),
    sighting_weekday = wday(created_date, label = TRUE, abbr = FALSE),
    
    # Ensure numeric coordinates
    latitude  = suppressWarnings(as.numeric(latitude)),
    longitude = suppressWarnings(as.numeric(longitude))
    ) %>%
    # Remove observations with unspecified boroughs
    filter(!is.na(borough), borough != "Unspecified") %>%
    # Make borough names easier to read 
    mutate(borough = str_to_title(borough)) %>%
    # Remove invalid or extreme coordinates
    filter(!is.na(latitude), !is.na(longitude)) %>%
    filter(latitude > 35 & latitude < 45, longitude > -80 & longitude < -70)

    # Load and clean Weather data
   nyc_weather <- read_csv("nyc_weather.csv") %>%
  janitor::clean_names() %>%
  mutate(
    date = as.Date(sunrise)    # extract just the date part from sunrise timestamp
  ) %>%
  relocate(date, .before = 1)    # keep 'date' as the first column 
    
    # Merge Rat Sightings and Weather data by date
    rats_weather <- rats %>%
    left_join(nyc_weather, by = c("sighting_date" = "date"))

   # Remove invalid or extreme coordinates
    rats_weather <- rats_weather %>%
    filter(!is.na(latitude), !is.na(longitude)) %>%
    filter(latitude > 35 & latitude < 45, longitude > -80 & longitude < -70)

   # Export the cleaned, merged dataset to CSV for reproducibility
    write_csv(rats_weather, "rats_weather_cleaned.csv")

5 Exploratory Data Analysis (EDA)

The purpose of this section is to explore and understand patterns in rat sightings across New York City. We’ll examine data structure, temporal and spatial patterns, correlations, and seasonality to uncover meaningful insights. Each analysis step begins with a rationale and ends with an interpretation

5.1 Understanding the Data Structure

Before analyzing patterns, it’s essential to understand the dataset’s shape and completeness.

Here we inspect how many observations it contains, how variables are distributed across types (numeric, categorical, date/time), and where missing data might affect analysis.

glimpse(rats)

Rows: 273,608
Columns: 44
$ unique_key                     <dbl> 66349537, 66348361, 66344540, 66344538,…
$ created_date                   <dttm> 2025-10-03 00:35:05, 2025-10-03 00:02:…
$ closed_date                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ agency                         <chr> "DOHMH", "DOHMH", "DOHMH", "DOHMH", "DO…
$ agency_name                    <chr> "Department of Health and Mental Hygien…
$ complaint_type                 <chr> "Rodent", "Rodent", "Rodent", "Rodent",…
$ descriptor                     <chr> "Rat Sighting", "Rat Sighting", "Rat Si…
$ location_type                  <chr> "3+ Family Apt. Building", "Vacant Lot"…
$ incident_zip                   <dbl> 11205, 11205, 10036, 10009, 10009, 1123…
$ incident_address               <chr> "282 SKILLMAN STREET", "470 FLUSHING AV…
$ street_name                    <chr> "SKILLMAN STREET", "FLUSHING AVENUE", "…
$ cross_street_1                 <chr> "DEKALB AVENUE", "SPENCER STREET", "AMT…
$ cross_street_2                 <chr> "LAFAYETTE AVENUE", "WALWORTH STREET", …
$ intersection_street_1          <chr> "DEKALB AVENUE", "SPENCER STREET", "AMT…
$ intersection_street_2          <chr> "LAFAYETTE AVENUE", "WALWORTH STREET", …
$ address_type                   <chr> "ADDRESS", "ADDRESS", "ADDRESS", "ADDRE…
$ city                           <chr> "BROOKLYN", "BROOKLYN", "NEW YORK", "NE…
$ landmark                       <chr> "SKILLMAN STREET", "FLUSHING AVENUE", "…
$ facility_type                  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ status                         <chr> "In Progress", "In Progress", "In Progr…
$ due_date                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ resolution_action_updated_date <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ community_board                <chr> "03 BROOKLYN", "03 BROOKLYN", "04 MANHA…
$ borough                        <chr> "Brooklyn", "Brooklyn", "Manhattan", "M…
$ x_coordinate_state_plane       <dbl> 996399, 996506, 985140, 990207, 990736,…
$ y_coordinate_state_plane       <dbl> 190626, 193985, 216235, 203023, 203589,…
$ park_facility_name             <chr> "Unspecified", "Unspecified", "Unspecif…
$ park_borough                   <chr> "BROOKLYN", "BROOKLYN", "MANHATTAN", "M…
$ vehicle_type                   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ taxi_company_borough           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ taxi_pick_up_location          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bridge_highway_name            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bridge_highway_direction       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ road_ramp                      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bridge_highway_segment         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latitude                       <dbl> 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,…
$ longitude                      <dbl> -74, -74, -74, -74, -74, -74, -74, -74,…
$ location                       <chr> "POINT (-73.95619185426317 40.689892769…
$ sighting_date                  <date> 2025-10-03, 2025-10-03, 2025-10-02, 20…
$ sighting_year                  <dbl> 2025, 2025, 2025, 2025, 2025, 2025, 202…
$ sighting_month                 <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
$ sighting_month_name            <ord> October, October, October, October, Oct…
$ sighting_day                   <int> 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ sighting_weekday               <ord> Friday, Friday, Thursday, Thursday, Thu…

# Compact, clean summaries from skimr
skim_obj <- skimr::skim(rats)

# helper for consistent table styling
tbl <- function(x, cap) {
  kableExtra::kbl(x, caption = cap, digits = 2, align = "lrrrrrrrr") |>
    kableExtra::kable_styling(full_width = FALSE,
                              bootstrap_options = c("striped","hover","condensed"))
}

# CHARACTER
char_tbl <- skim_obj |>
  skimr::yank("character") |>
  dplyr::select(skim_variable, n_missing, complete_rate, n_unique) |>
  dplyr::arrange(dplyr::desc(n_missing)) |>
  dplyr::rename(Variable = skim_variable,
                Missing = n_missing,
                Complete = complete_rate,
                `Unique vals` = n_unique)

tbl(char_tbl, "Character variables (compact)")

Character variables (compact)
Variable	Missing	Complete	Unique vals
due_date	142492	0.48	131059
landmark	138173	0.49	4510
intersection_street_1	117688	0.57	6090
intersection_street_2	117609	0.57	6147
cross_street_1	26275	0.90	6826
cross_street_2	26227	0.90	6912
closed_date	15860	0.94	146720
incident_address	10489	0.96	119832
street_name	10489	0.96	8455
resolution_action_updated_date	7542	0.97	186040
city	2277	0.99	91
address_type	81	1.00	7
location_type	12	1.00	54
agency	0	1.00	1
agency_name	0	1.00	1
complaint_type	0	1.00	1
descriptor	0	1.00	1
status	0	1.00	7
community_board	0	1.00	73
borough	0	1.00	5
park_facility_name	0	1.00	1
park_borough	0	1.00	5
location	0	1.00	130602

# FACTOR
fac_tbl <- skim_obj |>
  skimr::yank("factor") |>
  dplyr::select(skim_variable, n_missing, complete_rate, ordered, n_unique, top_counts) |>
  dplyr::rename(Variable = skim_variable,
                Missing = n_missing,
                Complete = complete_rate,
                Ordered = ordered,
                `Unique vals` = n_unique,
                `Top counts` = top_counts)

# Truncate long "Top counts" strings so the table stays tidy
fac_tbl$`Top counts` <- stringr::str_trunc(fac_tbl$`Top counts`, 60)

tbl(fac_tbl, "Factor variables (levels & top counts)")

Factor variables (levels & top counts)
Variable	Missing	Complete	Ordered	Unique vals	Top counts
sighting_month_name	0	1	TRUE	12	Jul: 30436, Aug: 30011, Jun: 29160, May: 28163
sighting_weekday	0	1	TRUE	7	Mon: 46454, Tue: 45808, Wed: 44905, Thu: 43381

# DATE
date_tbl <- skim_obj |>
  skimr::yank("Date") |>
  dplyr::select(skim_variable, n_missing, complete_rate, min, median, max, n_unique) |>
  dplyr::rename(Variable = skim_variable,
                Missing = n_missing,
                Complete = complete_rate,
                Min = min, Median = median, Max = max,
                `Unique dates` = n_unique)

tbl(date_tbl, "Date variables (range)")

Date variables (range)
Variable	Missing	Complete	Min	Median	Max	Unique dates
sighting_date	0	1	2010-01-01	2019-10-08	2025-10-03	5754

# LOGICAL
log_tbl <- skim_obj |>
  skimr::yank("logical") |>
  dplyr::select(skim_variable, n_missing, complete_rate) |>
  dplyr::rename(Variable = skim_variable,
                Missing = n_missing,
                Complete = complete_rate)

tbl(log_tbl, "Logical variables")

Logical variables
Variable	Missing	Complete
facility_type	273608	0
vehicle_type	273608	0
taxi_company_borough	273608	0
taxi_pick_up_location	273608	0
bridge_highway_name	273608	0
bridge_highway_direction	273608	0
road_ramp	273608	0
bridge_highway_segment	273608	0

# NUMERIC
num_tbl <- skim_obj |>
  skimr::yank("numeric") |>
  dplyr::select(skim_variable, n_missing, complete_rate, mean, sd, p0, p25, p50, p75, p100) |>
  dplyr::rename(Variable = skim_variable,
                Missing = n_missing,
                Complete = complete_rate,
                Mean = mean, SD = sd,
                Min = p0, Q1 = p25, Median = p50, Q3 = p75, Max = p100)

tbl(num_tbl, "Numeric variables (distribution summary)")

Numeric variables (distribution summary)
Variable	Missing	Complete	Mean	SD	Min	Q1	Median	Q3	Max
unique_key	0	1	43757775.9	14102172.84	11464394	31998896	44009486	56143316	66353297
incident_zip	19	1	10758.7	554.98	83	10111	11105	11225	12345
x_coordinate_state_plane	0	1	1002097.3	18187.34	913495	993417	1000624	1011119	1067170
y_coordinate_state_plane	0	1	207832.8	29213.26	121350	186382	203284	233288	271876
latitude	0	1	40.7	0.08	40	41	41	41	41
longitude	0	1	-73.9	0.07	-74	-74	-74	-74	-74
sighting_year	0	1	2018.7	4.35	2010	2015	2019	2022	2025
sighting_month	0	1	6.5	3.06	1	4	7	9	12
sighting_day	0	1	15.7	8.75	1	8	16	23	31

The dataset contains over 270,000 records and 44 columns, with variables describing borough, location, and sighting time. Most analytical variables (borough, year, month, latitude, longitude) are complete, while administrative fields have expected missing values.

5.2 Summary Statistics

5.2.1 Variables

Next, we summarise how many variables are numeric, categorical, or date/time.

This helps identify the types of information available for different types of plots or models that we want to use later on our analysis.

# Dimensions
knitr::kable(
  tibble::tibble(
    observations = nrow(rats),
    variables    = ncol(rats)
  ),
  caption = "Dataset Dimensions"
) |>
  kableExtra::kable_styling(full_width = FALSE)

Dataset Dimensions
observations	variables
273608	44

# Variable type counts
knitr::kable(
  tibble::tibble(
    numeric_vars     = sum(sapply(rats, is.numeric)),
    categorical_vars = sum(sapply(rats, function(x) is.character(x) || is.factor(x))),
    date_time_vars   = sum(sapply(rats, inherits, "POSIXct")) + sum(sapply(rats, inherits, "Date"))
  ),
  caption = "Variable Type Counts"
) |>
  kableExtra::kable_styling(full_width = FALSE)

Variable Type Counts
numeric_vars	categorical_vars	date_time_vars
9	25	2

Most variables are categorical or date-based, which is expected for a dataset describing reported incidents across time and boroughs.

5.2.2 Correlation Summary

Before we visualize correlations between numeric variables, we first review basic descriptive statistics
and compute a correlation matrix to quantify relationships.

library(dplyr)
library(tidyr)
library(kableExtra)

rats_num <- rats %>%
  mutate(
    sighting_year  = as.numeric(sighting_year),
    sighting_month = as.numeric(sighting_month)
  )

# Descriptive stats (tidy and readable)
desc_stats <- rats_num %>%
  summarise(
    latitude_min   = min(latitude, na.rm = TRUE),
    latitude_mean  = mean(latitude, na.rm = TRUE),
    latitude_sd    = sd(latitude, na.rm = TRUE),
    latitude_max   = max(latitude, na.rm = TRUE),
    longitude_min  = min(longitude, na.rm = TRUE),
    longitude_mean = mean(longitude, na.rm = TRUE),
    longitude_sd   = sd(longitude, na.rm = TRUE),
    longitude_max  = max(longitude, na.rm = TRUE),
    year_min       = min(sighting_year, na.rm = TRUE),
    year_max       = max(sighting_year, na.rm = TRUE),
    month_min      = min(sighting_month, na.rm = TRUE),
    month_max      = max(sighting_month, na.rm = TRUE)
  ) %>%
  pivot_longer(everything(),
               names_to = c("Variable","Statistic"),
               names_sep = "_",
               values_to = "Value") %>%
  pivot_wider(names_from = Statistic, values_from = Value)

kableExtra::kbl(
  desc_stats,
  caption = "Descriptive Statistics for Key Numeric Variables",
  digits = 2
) |>
  kableExtra::kable_styling(full_width = FALSE,
                            bootstrap_options = c("hover","condensed"))

Descriptive Statistics for Key Numeric Variables
Variable	min	mean	sd	max
latitude	40	41	0.08	41
longitude	-74	-74	0.07	-74
year	2010	NA	NA	2025
month	1	NA	NA	12

# Correlation matrix
cor_matrix <- rats_num %>%
  select(latitude, longitude, sighting_year, sighting_month) %>%
  cor(use = "pairwise.complete.obs") %>%
  round(2)

kableExtra::kbl(
  cor_matrix,
  caption = "Correlation Matrix of Numeric Variables"
) |>
  kableExtra::kable_styling(full_width = FALSE,
                            bootstrap_options = c("hover","condensed"))

Correlation Matrix of Numeric Variables
	latitude	longitude	sighting_year	sighting_month
latitude	1.00	0.40	-0.02	0.00
longitude	0.40	1.00	-0.02	0.00
sighting_year	-0.02	-0.02	1.00	-0.05
sighting_month	0.00	0.00	-0.05	1.00

The descriptive summary from mosaic::favstats() shows that latitude and longitude fall within the expected NYC range, confirming valid spatial coverage. Year values span from 2010–2025, ensuring full temporal depth.

The correlation matrix reveals a strong positive relationship between latitude and longitude (expected for spatial coordinates), while temporal variables (sighting_year, sighting_month) are largely independent — suggesting stable geographic distribution of sightings over time.

5.2.3 Correlations Between Numeric Variables

Having reviewed the numeric correlations above, we now visualize these relationships to better understand their strength and linearity.

num_small <- rats %>%
transmute(
year = as.numeric(sighting_year),
month = as.numeric(sighting_month),
lat = latitude,
lon = longitude,
zip = suppressWarnings(as.numeric(incident_zip))
) %>%
drop_na()

if (nrow(num_small) > 4000) {
set.seed(1)
num_small <- dplyr::slice_sample(num_small, n = 4000)
}

GGally::ggpairs(num_small,
upper = list(continuous = GGally::wrap("cor")),
lower = list(continuous = GGally::wrap("points", alpha = 0.08)),
diag = list(continuous = "densityDiag"),
progress = FALSE,
title = "Scatterplot & Correlation Matrix (Sampled)"
)

Latitude and longitude are slightly related to borough location, but correlations among numeric variables are low to moderate, meaning no strong linear dependencies exist between year and coordinates and each contributes unique information.

5.3 Visualizations

5.3.1 Borough-Level Summaries

Having cleaned our data and inspected our variables, we start by exploring the total counts by borough to see which specific areas of New York City report the most rat sightings during the whole examining period.

borough_totals <- rats %>%
count(borough, name = "n") %>%
arrange(desc(n))

kableExtra::kbl(borough_totals, caption = "Total Rat Sightings by Borough") |>
  kableExtra::kable_styling(full_width = FALSE)

Total Rat Sightings by Borough
borough	n
Brooklyn	100687
Manhattan	71148
Bronx	49030
Queens	42303
Staten Island	10440

It is evident that Brooklyn and Manhattan dominate total sighting reports, while Staten Island has the lowest counts.

This likely reflects differences in density, waste management, and public reporting habits.

5.3.2 Yearly Trends

Next, we explore how sightings evolve over time, identifying trends or shocks such as COVID-19.

rats %>%
count(sighting_year, name = "n") %>%
ggplot(aes(x = sighting_year, y = n)) +
geom_line(linewidth = 1.2, color = "#1f77b4") +
geom_point(size = 2) +
scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
labs(title = "Total Rat Sightings by Year (2015–2024)",
x = "Year", y = "Number of Sightings") +
theme_minimal(base_size = 13)

Sightings increased steadily from 2015 to 2019, dropped sharply in 2020 due to pandemic restrictions, and rebounded dramatically in 2021. This pattern shows how rodent sightings correlate with human activity cycles.

5.3.3 Monthly Seasonality

We want to further investigate whether rat sightings follow a seasonal pattern by examining monthly average, as warmer months may show higher rat activity.

rats %>%
  mutate(sighting_month_name = factor(sighting_month_name, levels = month.name)) %>%
  count(borough, sighting_month_name, name = "n") %>%
  ggplot(aes(x = sighting_month_name, y = n, fill = borough)) +
  geom_col(position = "dodge") +
  theme_minimal(base_size = 13) +
  labs(title = "Monthly Rat Sightings by Borough", x = "Month", y = "Number of Sightings") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Sightings peak between June and September, coinciding with warmer temperatures that enhance rat activity and visibility. This demonstrates strong seasonality tied to environmental and human behavioral factors.

5.3.4 Heatmap of Seasonality

In order to better reflect that seasonality across boroughs, we use a heatmap, as shown below:

rats %>%
mutate(sighting_month_name = factor(sighting_month_name, levels = month.name)) %>%
count(borough, sighting_month_name, name = "n") %>%
ggplot(aes(x = sighting_month_name, y = borough, fill = n)) +
geom_tile() +
scale_fill_continuous(labels = label_number(scale_cut = cut_short_scale())) +
labs(title = "Heatmap: Sightings by Month and Borough",
x = "Month", y = "Borough", fill = "Count") +
theme_minimal(base_size = 13) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

The heatmap clearly indicates that Brooklyn and Manhattan have higher sighting, especially during the summer months, further validating the correlation of warmth and rat’s sightings.

Additional heatmap: to further uncover the reporting peak month in each borough, we demonstrate the percentage of monthly reporting data in each borough using heatmap. The white dots mark each borough’s peak month. It reveals that Staten Island—despite having the lowest total rat reports—consistently peaks earlier (June) than the other boroughs, which peak in August–September. In other words, Staten Island’s seasonal crest leads the citywide crest by ~2–3 months. Operationally, this makes Staten Island a leading indicator of the summer infestation wave: when Staten Island’s share spikes in June, it signals that conditions (temperature, food/waste availability, breeding cycles) are becoming favorable citywide.

``` r # 0) Ensure months are in calendar order rats2 <- rats %>% mutate( sighting_month_name = factor(sighting_month_name, levels = month.name) )

# 1) Summaries: counts + within-borough shares; also order boroughs by annual total summ <- rats2 %>% count(borough, sighting_month_name, name = “n”) %>% group_by(borough) %>% mutate( borough_total = sum(n, na.rm = TRUE), share = n / borough_total # within-borough seasonality (structure, not size) ) %>% ungroup() %>% mutate( borough = fct_reorder(borough, borough_total, .desc = TRUE) )

# 2A) Heatmap A — COUNTS (log-scale fill to handle heavy tails) p_counts <- ggplot(summ, aes(x = sighting_month_name, y = borough, fill = n)) + geom_tile() + scale_fill_gradientn( colours = scales::viridis_pal()(6), trans = “log10”, labels = scales::label_number(scale_cut = scales::cut_short_scale()) ) + labs( title = “Rat Sightings by Month × Borough (Counts, log scale)”, x = “Month”, y = “Borough”, fill = “Count” ) + theme_minimal(base_size = 13) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

# 2B) Heatmap B — WITHIN-BOROUGH SHARE (removes size; highlights seasonality) p_share <- ggplot(summ, aes(x = sighting_month_name, y = borough, fill = share)) + geom_tile() + scale_fill_gradientn( colours = scales::viridis_pal()(6), labels = scales::percent_format(accuracy = 1) # ✅ fully qualified ) + labs( title = “Seasonality Within Borough (Share of Borough’s Annual Reports)”, x = “Month”, y = NULL, fill = “Share” ) + theme_minimal(base_size = 13) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

# 3) annotate each borough’s peak month on the share heatmap peaks <- summ %>% group_by(borough) %>% slice_max(order_by = share, n = 1, with_ties = FALSE) %>% ungroup()

p_share_annot <- p_share + geom_point(data = peaks, shape = 21, color = “white”, size = 2, stroke = 0.6) + labs(subtitle = “White dots mark each borough’s peak month”)

# 4) Show both panels side-by-side p_counts + p_share_annot + plot_layout(widths = c(1, 1.05)) ```

5.3.5 Day-of-week Pattern

Having examined the trends in years and months, it might be insightful to examine if a pattern exists on certain weekdays.

rats %>%
  count(sighting_weekday, name = "n") %>%
  mutate(sighting_weekday = factor(sighting_weekday,
                                   levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))) %>%
  ggplot(aes(x = sighting_weekday, y = n)) +
  geom_col(fill = "#1f77b4") +
  theme_minimal(base_size = 13) +
  labs(title = "Rat Sightings by Day of the Week", x = "Weekday", y = "Number of Sightings")

Based on the chart above, reports are highest mid-week (Tuesday to Thursday) and drop over the weekends. However, it is very likely that reporting, rather than rat activity drives the observed pattern.

5.3.6 Spatial Distribution (2024 Map)

In order to better visualize the geographic distribution of 2024 rat sightings, we use an interactive map to identify spatial clusters.

# ---- Create a palette for the top location types ----
top_location_types <- rats %>%
  count(location_type, sort = TRUE) %>%
  slice(1:7) %>%
  pull(location_type)

# ---- Create color palette for the map ----
palette <- colorFactor(rainbow(length(top_location_types)), top_location_types)

rats %>%
  filter(sighting_year == 2024, location_type %in% top_location_types, !is.na(latitude), !is.na(longitude)) %>%
  leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(lng = ~longitude, lat = ~latitude, color = ~palette(location_type), radius = 1, fillOpacity = 0.6, popup = ~paste("Type:", location_type)) %>%
  addLegend("bottomright", pal = palette, values = ~location_type, title = "Location Type (2024 Sightings)")

Sightings are highly concentrated in dense urban areas, especially Brooklyn, lower Manhattan, and the Bronx, around residential and food-related sites. This spatial clustering reinforces patterns linked to human density and infrastructure.

5.3.7 Unique Locations Over Time

Finally, since we show above the geographic concentration of rats in 2024, it is interesting to examine whether people reported new infestations throughout the examining period or repeatedly reported the same sites

rats %>%
  mutate(location_id = paste(round(latitude, 4), round(longitude, 4))) %>%
  group_by(sighting_year) %>%
  summarise(unique_locations = n_distinct(location_id)) %>%
  ggplot(aes(x = sighting_year, y = unique_locations)) +
  geom_line(linewidth = 1.2, color = "#1f77b4") +
  geom_point(size = 2) +
  theme_minimal(base_size = 13) +
  labs(title = "Unique Rat Sighting Locations per Year",
       x = "Year", y = "Unique Locations")

The data clearly showcase that While total reports increased, the number of unique locations also rose, indicating a true expansion of rat activity rather than repeated complaints from the same sites.

5.3.8 Summary of Findings

The exploratory analysis reveals several key insights:

Borough effects: Brooklyn and Manhattan consistently report the most sightings.
Trends: Long-term growth with a sharp pandemic-related dip.
Seasonality: Summer peaks align with warm weather and increased outdoor activity.
Correlations: Low, supporting multivariable modeling.
Spatial patterns: Clusters around population and waste hubs.

Overall, rat activity in NYC is tightly connected to human behavior and city infrastructure, making it both an ecological and social phenomenon. These insights form the foundation for regression analysis in the next section.

6 Regression

We now move from exploratory plots to statistical modelling.

Our goal in this section is to explain differences in rat complaints across NYC using:
- environment (temperature, precipitation, humidity, wind speed),
- geography (borough),
- and calendar effects (weekday and month/season).

To do that, we: 1. Aggregate the raw 311 reports into daily rat sightings per borough. 2. Merge those counts with daily weather data for NYC. 3. Model the log of daily sightings using linear regression and interpret the coefficients. 4. Diagnose model assumptions and compare model performance.

6.1 Regression Prep

First, we create our modelling dataset.

We count how many rat complaints are recorded in each borough on each date. This gives us n, the number of rat sightings per borough per day.
We then merge that with NYC weather for the same dates.
We keep only usable rows (e.g. days that have both sightings and weather).
Finally, we create log_n = log(n). We use log counts because raw daily counts of rat sightings are very skewed (lots of low values and a few very high ones), and linear regression works better when the dependent variable is closer to normal and has more stable variance.

# Build daily counts per borough × date (dependent variable n)
rats_daily <- rats |>
  count(borough, sighting_date, name = "n")

# Sanity check: make sure weather's date column is actually Date
stopifnot(inherits(nyc_weather$date, "Date"))

# Join rat counts to weather by date
model_data <- rats_daily |>
  left_join(nyc_weather, by = c("sighting_date" = "date"))

At this point, model_data has, for each borough-day: - number of rat complaints (n) - temperature, precipitation, humidity, windspeed - the calendar date (sighting_date) - the borough

This is the unit of analysis we will model.

We now prepare variables for modelling:

We drop any rows where weather is missing or rat count n is missing.
We take the natural log of n to create log_n. This stabilises variance and makes the distribution closer to normal.
We create categorical factors for weekday and month, so we can later control for day-of-week and seasonality.
We explicitly set Bronx as the reference borough. In regression with factors, one level is treated as the baseline; all other borough coefficients are interpreted relative to that baseline. We use Bronx because it is large and well-reported.

model_data <- model_data |>
  filter(!is.na(temp), !is.na(n)) |>
  mutate(
    log_n  = log(n),  # log transform for right-skewed counts
    wday   = lubridate::wday(sighting_date, label = TRUE, abbr = FALSE),
    month  = lubridate::month(sighting_date, label = TRUE, abbr = FALSE),
    borough= forcats::fct_relevel(borough, "Bronx")  # Bronx as reference
  )

Because log(0) is undefined, we drop borough-days where no rat complaints were recorded (n == 0).

We then compare the raw counts n and the transformed variable log_n: - n is extremely right-skewed: most borough-days have just a few complaints, but some days spike. - log_n is much closer to bell-shaped.

This supports using log_n as the dependent variable in our regression instead of the raw count.

We now fit three linear models, each one adding more structure:

Model 1: log_n ~ temp
Just temperature. Tests: do rats get reported more on warmer days?
Model 2: log_n ~ temp + borough
Adds borough fixed effects. Tests: are some boroughs systematically “rattier” than others, holding temperature constant?
Model 3: log_n ~ temp + borough + weather + calendar
Adds precipitation, humidity, windspeed, weekday, and month.
Tests: do short-run weather shocks and seasonal patterns still matter after controlling for borough?

We then compare significance, interpretability, and model fit.

model_data <- model_data |>
filter(n > 0)

ggplot(model_data, aes(x = n)) +
geom_histogram(bins = 30, fill = "grey70", color = "black") +
labs(title = "Distribution of daily rat sightings (n)",
x = "n (per borough per day)", y = "Count of days")

ggplot(model_data, aes(x = log_n)) +
geom_histogram(bins = 30, fill = "grey70", color = "black") +
labs(title = "Distribution of log(n)",
x = "log(n)", y = "Count of days")

# Core models required by the brief

model1 <- lm(log_n ~ temp, data = model_data)
model2 <- lm(log_n ~ temp + borough, data = model_data)

# Richer model with weather + calendar fixed effects (use only fields that exist)

has_cols <- intersect(names(model_data), c("precip","humidity","windspeed"))
rhs      <- c("temp", "borough", has_cols, "wday", "month")
fmla     <- reformulate(rhs, response = "log_n")
model3   <- lm(fmla, data = model_data)

summary(model1)


Call:
lm(formula = log_n ~ temp, data = model_data)

Residuals:
   Min     1Q Median     3Q    Max 
-2.396 -0.561  0.129  0.640  2.467 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 1.557086   0.011509   135.3 <0.0000000000000002 ***
temp        0.024533   0.000686    35.8 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.89 on 19087 degrees of freedom
Multiple R-squared:  0.0628,    Adjusted R-squared:  0.0628 
F-statistic: 1.28e+03 on 1 and 19087 DF,  p-value: <0.0000000000000002

summary(model2)


Call:
lm(formula = log_n ~ temp + borough, data = model_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8234 -0.3932  0.0579  0.4574  2.5532 

Coefficients:
                      Estimate Std. Error t value            Pr(>|t|)    
(Intercept)           1.561285   0.012266   127.3 <0.0000000000000002 ***
temp                  0.028781   0.000501    57.4 <0.0000000000000002 ***
boroughBrooklyn       0.588681   0.014398    40.9 <0.0000000000000002 ***
boroughManhattan      0.327607   0.014408    22.7 <0.0000000000000002 ***
boroughQueens        -0.278607   0.014516   -19.2 <0.0000000000000002 ***
boroughStaten Island -1.308180   0.015759   -83.0 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.65 on 19083 degrees of freedom
Multiple R-squared:  0.502, Adjusted R-squared:  0.502 
F-statistic: 3.84e+03 on 5 and 19083 DF,  p-value: <0.0000000000000002

summary(model3)


Call:
lm(formula = fmla, data = model_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6606 -0.3732  0.0434  0.4248  2.9123 

Coefficients:
                      Estimate Std. Error t value             Pr(>|t|)    
(Intercept)           1.638975   0.029558   55.45 < 0.0000000000000002 ***
temp                  0.022844   0.001089   20.98 < 0.0000000000000002 ***
boroughBrooklyn       0.590704   0.013346   44.26 < 0.0000000000000002 ***
boroughManhattan      0.329436   0.013356   24.67 < 0.0000000000000002 ***
boroughQueens        -0.282118   0.013456  -20.97 < 0.0000000000000002 ***
boroughStaten Island -1.340530   0.014623  -91.67 < 0.0000000000000002 ***
humidity              0.001525   0.000322    4.74     0.00000216857579 ***
precip               -0.002470   0.000341   -7.23     0.00000000000049 ***
windspeed            -0.004105   0.000679   -6.04     0.00000000156138 ***
wday.L               -0.066895   0.011657   -5.74     0.00000000968975 ***
wday.Q               -0.561267   0.011666  -48.11 < 0.0000000000000002 ***
wday.C                0.107483   0.011547    9.31 < 0.0000000000000002 ***
wday^4               -0.184564   0.011485  -16.07 < 0.0000000000000002 ***
wday^5                0.044519   0.011436    3.89     0.00009936209739 ***
wday^6               -0.036319   0.011417   -3.18               0.0015 ** 
month.L              -0.162688   0.018669   -8.71 < 0.0000000000000002 ***
month.Q              -0.263857   0.031552   -8.36 < 0.0000000000000002 ***
month.C               0.020435   0.016828    1.21               0.2246    
month^4              -0.030078   0.016627   -1.81               0.0705 .  
month^5              -0.007385   0.015324   -0.48               0.6299    
month^6               0.094175   0.015275    6.17     0.00000000071695 ***
month^7               0.029817   0.015179    1.96               0.0495 *  
month^8               0.062468   0.014984    4.17     0.00003072300185 ***
month^9               0.001455   0.015028    0.10               0.9229    
month^10              0.017082   0.014964    1.14               0.2537    
month^11              0.027705   0.014888    1.86               0.0628 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6 on 19063 degrees of freedom
Multiple R-squared:  0.572, Adjusted R-squared:  0.572 
F-statistic: 1.02e+03 on 25 and 19063 DF,  p-value: <0.0000000000000002

par(mfrow = c(2, 2))
plot(model1)

par(mfrow = c(2, 2))
plot(model2)

par(mfrow = c(2, 2))
plot(model3)

par(mfrow = c(1, 1))

Model 1 (temp only):
The coefficient on temp is positive and statistically significant. This means that, on average, hotter days are associated with higher rat complaint activity.
Because the dependent variable is log(counts), the coefficient on temp can be interpreted approximately as a percentage change: for a 1°C increase in temperature, rat complaints increase by about 100 × (exp(β_temp) − 1) percent, holding other factors constant.

Model 2 (temp + borough):
After adding borough fixed effects, model fit improves a lot (higher adjusted R²).
Bronx is the baseline borough. A borough coefficient tells us how much higher or lower expected rat complaints are in that borough relative to the Bronx, holding temperature constant.
For example, if the Brooklyn coefficient is around 0.59, exp(0.59) ≈ 1.80, which means ~80% more rat sightings per day in Brooklyn than in the Bronx at the same temperature.

Temperature stays significant in Model 2, so weather still matters even after controlling for which borough we’re in.

Model 3 (full weather + calendar controls):
Model 3 adds precipitation, humidity, windspeed, weekday, and month.
This lets us separate out: - short-term conditions (rain, humidity), - recurring patterns (day of week), - and strong seasonality (summer vs winter).

If Model 3 has higher adjusted R² and lower residual standard error, that means these extra factors help explain additional variation in rat reports. This model is best if our goal is prediction/forecasting.

After fitting a regression, we need to check assumptions: - Are residuals roughly centered and patternless? (linearity and homoscedasticity) - Do residuals look roughly normal? (normality) - Are there extremely influential outliers that dominate the fit?

We do this using the standard 4 diagnostic plots for each model.

The Residuals vs Fitted plots show no strong curvature, which supports using a linear functional form on log_n.
The Normal Q–Q plots are roughly straight, meaning residuals are approximately normal after the log transform.
The Scale–Location plots do not suggest severe heteroskedasticity (variance looks fairly constant across fitted values).
The Residuals vs Leverage plots do not reveal extreme high-leverage points, so no single borough-day is driving the results.

Overall, model assumptions look acceptable.

Before trusting Model 3, we need to confirm that the weather variables are not so highly correlated with each other that they cause multicollinearity.

Below we compute pairwise correlations among temperature, precipitation, humidity, and windspeed:

# Correlation among weather vars
model_data |>
dplyr::select(any_of(c("temp","precip","humidity","windspeed"))) |>
cor(use = "complete.obs") |>
round(2)

           temp precip humidity windspeed
temp       1.00  -0.02     0.18     -0.28
precip    -0.02   1.00     0.29      0.11
humidity   0.18   0.29     1.00     -0.05
windspeed -0.28   0.11    -0.05      1.00

All pairwise correlations are moderate (typically well below |0.8|).
This suggests no single weather variable is basically a duplicate of another.

As a secondary check, we also compute Variance Inflation Factors (VIF). A VIF > 10 would indicate serious collinearity.

# VIF on the richest model you keep (model2 or model3)
if (requireNamespace("car", quietly = TRUE)) {

  # get raw VIF output
  v <- car::vif(model2)

  # if it's a vector, turn it into a data frame
  if (is.vector(v)) {
    vif_df <- tibble::tibble(
      term = names(v),
      VIF  = as.numeric(v)
    )

  } else {
    # otherwise assume it's a matrix/data.frame from GVIF
    vif_df <- as.data.frame(v) |>
      tibble::rownames_to_column("term")

    # make the column names nice
    colnames(vif_df) <- gsub("\\.", " ", colnames(vif_df))
    # typical columns are: term, GVIF, Df, GVIF^(1/(2*Df))
  }

  knitr::kable(
    vif_df,
    digits = 2,
    caption = "Variance Inflation Factors (Model 2)"
  )
}

Variance Inflation Factors (Model 2)
term	GVIF	Df	GVIF^(1/(2*Df))
temp	1	1	1
borough	1	4	1

The Variance Inflation Factors (VIFs) for Model 2 indicate no signs of multicollinearity. Both temperature and borough are statistically independent (VIF ≈ 1), confirming that each provides unique explanatory power in the regression.

huxtable::huxreg(
"Model 1: Temp only"      = model1,
"Model 2: Temp + Borough" = model2,
"Model 3: + Calendar/Weather" = model3,
statistics = c("N"="nobs","Adj. R²"="adj.r.squared","RSE"="sigma")
)

	Model 1: Temp only	Model 2: Temp + Borough	Model 3: + Calendar/Weather
(Intercept)	1.557 ***	1.561 ***	1.639 ***
	(0.012)	(0.012)	(0.030)
temp	0.025 ***	0.029 ***	0.023 ***
	(0.001)	(0.001)	(0.001)
boroughBrooklyn		0.589 ***	0.591 ***
		(0.014)	(0.013)
boroughManhattan		0.328 ***	0.329 ***
		(0.014)	(0.013)
boroughQueens		-0.279 ***	-0.282 ***
		(0.015)	(0.013)
boroughStaten Island		-1.308 ***	-1.341 ***
		(0.016)	(0.015)
humidity			0.002 ***
			(0.000)
precip			-0.002 ***
			(0.000)
windspeed			-0.004 ***
			(0.001)
wday.L			-0.067 ***
			(0.012)
wday.Q			-0.561 ***
			(0.012)
wday.C			0.107 ***
			(0.012)
wday^4			-0.185 ***
			(0.011)
wday^5			0.045 ***
			(0.011)
wday^6			-0.036 **
			(0.011)
month.L			-0.163 ***
			(0.019)
month.Q			-0.264 ***
			(0.032)
month.C			0.020
			(0.017)
month^4			-0.030
			(0.017)
month^5			-0.007
			(0.015)
month^6			0.094 ***
			(0.015)
month^7			0.030 *
			(0.015)
month^8			0.062 ***
			(0.015)
month^9			0.001
			(0.015)
month^10			0.017
			(0.015)
month^11			0.028
			(0.015)
N	19089	19089	19089
Adj. R²	0.063	0.502	0.572
RSE	0.890	0.649	0.602
* p < 0.001; p < 0.01; * p < 0.05.

Finally, we summarise the models side by side using huxreg().
This table reports: - the estimated coefficients, - statistical significance, - adjusted R² (goodness of fit, penalised), - residual standard error (typical size of the remaining error).

This allows us to compare explanatory power and interpretability.

We see a clear pattern: - Model 1 (temperature only) explains some variation: weather matters. - Model 2 (adds borough) explains substantially more. Borough fixed effects capture persistent spatial differences in rat activity. - Model 3 (adds precipitation, humidity, windspeed, weekday, and month) gives the best overall fit and lowest residual error, showing that both environmental shocks and seasonality add predictive power.

In short: - Best for explanation / policy: Model 2. It’s easy to interpret (“Brooklyn vs Bronx”, temperature effect). - Best for prediction / forecasting: Model 3. It captures weather + calendar patterns and fits best statistically.

From an applied perspective, these results suggest that: - Rat activity (or at least rat complaint volume) is seasonal and weather-sensitive, not just random. - Some boroughs face structurally higher rat pressure even after controlling for weather, which implies persistent local conditions (population density, waste handling, building stock). - Targeted mitigation (sanitation, pest control) is most urgent in warmer months and in the high-intensity boroughs.

In other words: temperature matters, but location and seasonality matter even more.

#4 We made limited use of AI tools (such as ChatGPT) during this assignment, primarily to assist with specific R syntax and debugging certain technical issues (e.g., formatting tables, handling missing values, or improving code readability). These tools were helpful in speeding up troubleshooting and providing examples of proper function usage. However, all analytical decisions, including variable selection, interpretation of results, and overall structure of the analysis, were made by our team. We also reviewed, modified, and validated all AI-generated code to ensure it aligned with our objectives and understanding. While AI proved to be a useful supplementary resource, we recognize that it cannot replace critical thinking, domain knowledge, or independent problem-solving in data analysis.

Contribution of each group member: 1 - Data Cleaning: Jiaomingyue Jiao, Stephanie Chen 2 - EDA: Dimitrios Stathogiannopoulos, Sunghwan Choi 3 - Regression: Carla-Marie Passarini, Michelangelo Mauro, Cian Dwyer