The dataset I am exploring analyzes fire incidents in NYC collected by the Fire Department of New York City (FDNY). This dataset shows 194251 different incidents with 29 variables. Many variables are included such as the borough it happened (INCIDENT_BOROUGH), how fast the fire department unit responded in seconds (INCIDENT_RESPONSE_SECONDS_QY), the time between when the unit was assigned and when the unit got to the scene of the incident (INCIDENT_TRAVEL_TM_SECONDS_QY), two variables with the type of fire it was (INCIDENT_CLASSIFICATION) & (INCIDENT_CLASSIFICATION_GROUP), the time between the incident and when the first unit got assigned (DISPATCH_RESPONSE_SECONDS_QY), the day, month, year and time the incidents occurred (INCIDENT_DATETIME), and NYC congressional districts (CONGRESSIONALDISTRICT). I’m curious about the relationship between “INCIDENT_RESPONSE_SECONDS_QY” and “INCIDENT_TRAVEL_TM_SECONDS_QY”, “INCIDENT_RESPONSE_SECONDS_QY” and “DISPATCH_RESPONSE_SECONDS_QY”, and the relationships between “INCIDENT_BOROUGH” and every variable I listed above. I want to know which borough has the highest response, travel, and dispatch response time. I chose this dataset because when I was a kid I wanted to be a firefighter, my dreams have changed since then but I’m still very interested in Fire Departments and the work firefighters do. My source for this dataset is from the Fire Department of New York City (FDNY). The original dataset had 10 million observations so I filtered it out by structural fires and incidents reported by phones. I think that the data was collected by reports that fire units that responded would submit. It makes sense to me for a government agency to be very thorough with their records.
Load Packages and Data
I loaded the packages ‘tidyverse’, ‘gganimate’, ‘gifski’, and ‘highcharter’. I then set my working directory with setwd() and loaded my dataset using read_csv and named it “fires”. I then used the head() function to show the first 6 observations in my dataset.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(highcharter)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Rows: 194251 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): STARFIRE_INCIDENT_ID, INCIDENT_DATETIME, ALARM_BOX_BOROUGH, ALARM_...
dbl (13): ALARM_BOX_NUMBER, ZIPCODE, POLICEPRECINCT, CITYCOUNCILDISTRICT, CO...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I used the function anyNA() to check if there are any NAs in the variables I’m going to use. If there are NAs, when I run the chunk TRUE will show up.
anyNA(fires$INCIDENT_DATETIME)
[1] FALSE
anyNA(fires$INCIDENT_BOROUGH)
[1] FALSE
anyNA(fires$DISPATCH_RESPONSE_SECONDS_QY)
[1] TRUE
anyNA(fires$INCIDENT_RESPONSE_SECONDS_QY)
[1] TRUE
anyNA(fires$INCIDENT_CLASSIFICATION)
[1] FALSE
anyNA(fires$INCIDENT_TRAVEL_TM_SECONDS_QY)
[1] TRUE
anyNA(fires$CONGRESSIONALDISTRICT)
[1] TRUE
Removing NAs
I used filter(!is.na()) to remove the NAs from every variable I checked above that showed they had NAs. I made a new dataset called “fires1” with the adjusted variables. I filtered for dispatch response seconds less than 1,400 and more than 0 and incident response seconds less than 10,000 and more than 0.
I use the function ‘mutate()’ to separate the date, time, and year into separate columns. I did this so I can use the ‘year’ of the incidents in a future plot.
I will explore the following variables, INCIDENT_BOROUGH, INCIDENT_RESPONSE_SECONDS_QY, INCIDENT_TRAVEL_TM_SECONDS_QY, INCIDENT_CLASSIFICATION, DISPATCH_RESPONSE_SECONDS_QY and CONGRESSIONALDISTRICT, with the functions ‘unique()’ for categorical and ‘summary()’ for numerical. ‘unique()’ will give me all of the duplicated observations in the variable I choose. ‘summary()’ will give me the Min., 1st Qu., Median, Mean 3rd Qu., and Max. of the variable I choose.
unique(fires1$INCIDENT_CLASSIFICATION)
[1] "Multiple Dwelling 'A' - Compactor fire"
[2] "Multiple Dwelling 'A' - Other fire"
[3] "Multiple Dwelling 'A' - Food on the stove fire"
[4] "Private Dwelling Fire"
[5] "Hospital Fire"
[6] "Other Public Building Fire"
[7] "Other Commercial Building Fire"
[8] "Multiple Dwelling 'B' Fire"
[9] "Store Fire"
[10] "Factory Fire"
[11] "Construction or Demolition Building Fire"
[12] "Church Fire"
[13] "Transit System - Structural"
[14] "Untenanted Building Fire"
[15] "School Fire"
[16] "Theater or TV Studio Fire"
[17] "Under Contruction / Vacant Fire"
Min. 1st Qu. Median Mean 3rd Qu. Max.
21.0 202.0 238.0 245.9 280.0 7986.0
summary(fires1$INCIDENT_TRAVEL_TM_SECONDS_QY)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 155 188 195 226 7947
summary(fires1$DISPATCH_RESPONSE_SECONDS_QY)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.00 34.00 46.00 50.77 62.00 1325.00
summary(fires1$CONGRESSIONALDISTRICT)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.00 8.00 11.00 10.67 13.00 16.00
Linear Regression
I used the function cor() to find the correlation coefficient of the linear regression. Then I used lm(y ~ x) to find the slope, intercept, and p-value.
fires_lm <-lm(INCIDENT_RESPONSE_SECONDS_QY ~ DISPATCH_RESPONSE_SECONDS_QY, data = fires2)summary(fires_lm)
Call:
lm(formula = INCIDENT_RESPONSE_SECONDS_QY ~ DISPATCH_RESPONSE_SECONDS_QY,
data = fires2)
Residuals:
Min 1Q Median 3Q Max
-337.9 -39.7 -7.1 31.5 7753.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.889e+02 4.042e-01 467.4 <2e-16 ***
DISPATCH_RESPONSE_SECONDS_QY 1.123e+00 7.167e-03 156.7 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 75.94 on 186376 degrees of freedom
Multiple R-squared: 0.1164, Adjusted R-squared: 0.1164
F-statistic: 2.456e+04 on 1 and 186376 DF, p-value: < 2.2e-16
The correlation coefficient is 0.34 which shows a weak correlation. The equation for this linear regression is: INCIDENT_RESPONSE_SECONDS_QY = 2e-16(DISPATCH_RESPONSE_SECONDS_QY) + 1.889e+02. The p-value has three asterisks which suggests that it is meaningful but we have to take into account the adjusted R-squared which shows that only 11.6% of observations may be explained by the model.
Model
plot_lm <- fires2 |>ggplot(aes(DISPATCH_RESPONSE_SECONDS_QY, INCIDENT_RESPONSE_SECONDS_QY)) +geom_point(color ="orange3", fill ="orange", shape =21, size =3, alpha = .6) +geom_smooth(method ='lm', formula = y~x, linewidth =1, color ="black") +theme_test()plot_lm
The regression line is somewhat concentrated and although the adjusted R-square is 11% so I would say that there is not a strong relationship between INCIDENT_RESPONSE_SECONDS_QY and DISPATCH_RESPONSE_SECONDS_QY that wouldn’t make much sense to explore.
Exploring 1
I start by filtering for a second time because for some reason it crashed when I tried to edit the first one. I filtered for incident travel seconds less than 6,000 and incident response seconds less than 6,000.
plot1 <- fires2 |>mutate(HIGHEST_ALARM_LEVEL =fct_relevel(HIGHEST_ALARM_LEVEL,"Simultaneous Call", "First Alarm","Second Alarm", "Third Alarm", "Fourth Alarm", "Fifth Alarm", "Fifth Alarm or Higher", "Sixth Alarm", "Seventh Alarm", "Eighth Alarm", "Ninth Alarm or Higher", "All Hands Working")) |>group_by(INCIDENT_BOROUGH)ggplot(plot1, aes(HIGHEST_ALARM_LEVEL)) +geom_bar(fill ="tomato", color ="tomato3") +theme_test() +#scale_fill_manual(values = c("#e76f51", "#f4a261", "#2a9d8f", "#264653" )) +facet_wrap(~INCIDENT_BOROUGH) +labs(title ="Number of Incidents and Highest Alarm Level for the NYC Boroughs",x ="Highest Alarm Level", y ="Count") +coord_flip()
You can see from the plot that the most incidents highest alarm was the first alarm. I don’t plan to use the ‘HIGHEST_ALARM_LEVEL’ variable since there’s not much variety.
Exploring 2
I first use ‘mutate()’ to change CONGRESSIONALDISTRICT from numerical to categorical.
The plot above shows the most incidents happen in district 15 and Brooklyn but the other districts still have a significant mount of incidents with district 3 having the least amount.
Final Plot 1
I first use ‘mutate()’ to change ‘year’ from numerical to categorical. I will group by year so it has to be categorical.
hc_plot <-hcboxplot(x = fires4$INCIDENT_RESPONSE_SECONDS_QY,var = fires4$year,name ="Response in Seconds", color ="#dc1600",fillColor ="#db5749",outliers = F ) |>hc_title(text ="Incident Response Time (in seconds) for Each Year") |>hc_yAxis(title =list(text ="Incident Response Time (in seconds)")) |>hc_xAxis(title =list(text ="Year")) |>hc_chart(backgroundColor ="#ebfefe") |>hc_caption(text ="Source: FDNY")
Warning in hcboxplot(x = fires4$INCIDENT_RESPONSE_SECONDS_QY, var = fires4$year, : 'hcboxplot' is deprecated.
Use 'data_to_boxplot' instead.
See help("Deprecated")
hc_plot
The plot shows that the response times has stayed very similar from 2005 to 2023. The median of each box being around the same, with 2011, 2012, and 2015 having the highest mins and maxs. It has slightly lowered in recent years.
Based on my treemap, you can see that “Multiple Dwelling ‘A’ - Food on the stove fire” is the type of fire with the highest travel time combined. This is could be due to Class A or “Multiple Dwelling ‘A’ - Food on the stove fire” fires being the most common type of fire.
I learned that here: (https://www.firetrace.com/fire-protection-blog/5-classes-of-fire#:~:text=Class%20A%20fires%20are%20the,the%20fire%20will%20burn%20out.)
plot5 <-ggplot(fires3, aes(INCIDENT_RESPONSE_SECONDS_QY, INCIDENT_TRAVEL_TM_SECONDS_QY, fill = INCIDENT_BOROUGH)) +geom_point(shape =21, size =5, alpha = .5) +transition_time(year) +theme_classic() +scale_fill_manual(values =c("#371e16", "#760f13", "#c81e0a", "#f87614", "#fab520")) +labs(title ="Incident Response Time by Incident Travel Time",x ="Incident Response Time (in seconds)", y ="Incident Travel Time (in seconds)", fill ="Incident Borough", caption ="Year: {frame_time}\nSource: FDNY")animate(plot5, renderer =gifski_renderer(), fps =1, duration =n_distinct(fires2$year) +2, end_pause =1, start_pause =1)
The plot shows that Incident Response Time and Incident Travel Time have a strong correlation. You cna also see that in recent years both Incident Response Time and Incident Travel Time have gotten lower and lower. You can tell from the graph that from 2013, minus 2014 with the outlier and 2022 which had a burst of incidents, to 2023 the dots are crowded near the bottom.
Conclusion
I learned from my plots that Brooklyn has the highest number of incidents and Staten Island having the least. I expected the incident response time and incident travel time were related, I thought there might be more deviations. I thought there would also be more deviation between incident response time between the years. I wanted to add interactivity in the third plot but I couldn’t figure out how.