Project 3

Author

Charlie Roth

Introduction

The dataset I am exploring analyzes fire incidents in NYC collected by the Fire Department of New York City (FDNY). This dataset shows 194251 different incidents with 29 variables. Many variables are included such as the borough it happened (INCIDENT_BOROUGH), how fast the fire department unit responded in seconds (INCIDENT_RESPONSE_SECONDS_QY), the time between when the unit was assigned and when the unit got to the scene of the incident (INCIDENT_TRAVEL_TM_SECONDS_QY), two variables with the type of fire it was (INCIDENT_CLASSIFICATION) & (INCIDENT_CLASSIFICATION_GROUP), the time between the incident and when the first unit got assigned (DISPATCH_RESPONSE_SECONDS_QY), the day, month, year and time the incidents occurred (INCIDENT_DATETIME), and NYC congressional districts (CONGRESSIONALDISTRICT). I’m curious about the relationship between “INCIDENT_RESPONSE_SECONDS_QY” and “INCIDENT_TRAVEL_TM_SECONDS_QY”, “INCIDENT_RESPONSE_SECONDS_QY” and “DISPATCH_RESPONSE_SECONDS_QY”, and the relationships between “INCIDENT_BOROUGH” and every variable I listed above. I want to know which borough has the highest response, travel, and dispatch response time. I chose this dataset because when I was a kid I wanted to be a firefighter, my dreams have changed since then but I’m still very interested in Fire Departments and the work firefighters do. My source for this dataset is from the Fire Department of New York City (FDNY). The original dataset had 10 million observations so I filtered it out by structural fires and incidents reported by phones. I think that the data was collected by reports that fire units that responded would submit. It makes sense to me for a government agency to be very thorough with their records.

Load Packages and Data

I loaded the packages ‘tidyverse’, ‘gganimate’, ‘gifski’, and ‘highcharter’. I then set my working directory with setwd() and loaded my dataset using read_csv and named it “fires”. I then used the head() function to show the first 6 observations in my dataset.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(highcharter)

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

library(gganimate)
library(gifski)
setwd("/Users/charlieroth/Desktop/Spring 2024/DATA 110/Week 14 - Project 3")
fires <- read_csv("2Fire_Incident_Dispatch_Data_20240430.csv")

Rows: 194251 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): STARFIRE_INCIDENT_ID, INCIDENT_DATETIME, ALARM_BOX_BOROUGH, ALARM_...
dbl (13): ALARM_BOX_NUMBER, ZIPCODE, POLICEPRECINCT, CITYCOUNCILDISTRICT, CO...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(fires)

# A tibble: 6 × 29
  STARFIRE_INCIDENT_ID INCIDENT_DATETIME      ALARM_BOX_BOROUGH ALARM_BOX_NUMBER
  <chr>                <chr>                  <chr>                        <dbl>
1 2100426590120013     01/04/2021 12:20:00 AM BRONX                         2659
2 2100423200140043     01/04/2021 01:03:00 AM BROOKLYN                      2320
3 2100411940110156     01/04/2021 04:34:00 AM MANHATTAN                     1194
4 2100443340130018     01/04/2021 05:18:00 AM RICHMOND / STATE…             4334
5 2100433980120127     01/04/2021 05:52:00 AM BRONX                         3398
6 2100440750140183     01/04/2021 07:59:00 AM BROOKLYN                      4075
# ℹ 25 more variables: ALARM_BOX_LOCATION <chr>, INCIDENT_BOROUGH <chr>,
#   ZIPCODE <dbl>, POLICEPRECINCT <dbl>, CITYCOUNCILDISTRICT <dbl>,
#   COMMUNITYDISTRICT <dbl>, COMMUNITYSCHOOLDISTRICT <dbl>,
#   CONGRESSIONALDISTRICT <dbl>, ALARM_SOURCE_DESCRIPTION_TX <chr>,
#   ALARM_LEVEL_INDEX_DESCRIPTION <chr>, HIGHEST_ALARM_LEVEL <chr>,
#   INCIDENT_CLASSIFICATION <chr>, INCIDENT_CLASSIFICATION_GROUP <chr>,
#   DISPATCH_RESPONSE_SECONDS_QY <dbl>, FIRST_ASSIGNMENT_DATETIME <chr>, …

Checking for NAs

I used the function anyNA() to check if there are any NAs in the variables I’m going to use. If there are NAs, when I run the chunk TRUE will show up.

anyNA(fires$INCIDENT_DATETIME)

[1] FALSE

anyNA(fires$INCIDENT_BOROUGH)

[1] FALSE

anyNA(fires$DISPATCH_RESPONSE_SECONDS_QY)

[1] TRUE

anyNA(fires$INCIDENT_RESPONSE_SECONDS_QY)

[1] TRUE

anyNA(fires$INCIDENT_CLASSIFICATION)

[1] FALSE

anyNA(fires$INCIDENT_TRAVEL_TM_SECONDS_QY)

[1] TRUE

anyNA(fires$CONGRESSIONALDISTRICT)

[1] TRUE

Removing NAs

I used filter(!is.na()) to remove the NAs from every variable I checked above that showed they had NAs. I made a new dataset called “fires1” with the adjusted variables. I filtered for dispatch response seconds less than 1,400 and more than 0 and incident response seconds less than 10,000 and more than 0.

fires1 <- fires |>
  filter(!is.na(DISPATCH_RESPONSE_SECONDS_QY), !is.na(INCIDENT_RESPONSE_SECONDS_QY), !is.na(INCIDENT_CLASSIFICATION), !is.na(INCIDENT_TRAVEL_TM_SECONDS_QY), !is.na(CONGRESSIONALDISTRICT), DISPATCH_RESPONSE_SECONDS_QY < 1400, INCIDENT_RESPONSE_SECONDS_QY < 10000, DISPATCH_RESPONSE_SECONDS_QY > 0, INCIDENT_RESPONSE_SECONDS_QY > 0)

Mutating

I use the function ‘mutate()’ to separate the date, time, and year into separate columns. I did this so I can use the ‘year’ of the incidents in a future plot.

fires2 <- fires1 |>
  mutate(date = as.Date(mdy_hms(INCIDENT_DATETIME)),
         time = format(mdy_hms(INCIDENT_DATETIME), "%I:%M:%S %p"),
         year = year(date))
head(fires2)

# A tibble: 6 × 32
  STARFIRE_INCIDENT_ID INCIDENT_DATETIME      ALARM_BOX_BOROUGH ALARM_BOX_NUMBER
  <chr>                <chr>                  <chr>                        <dbl>
1 2100426590120013     01/04/2021 12:20:00 AM BRONX                         2659
2 2100423200140043     01/04/2021 01:03:00 AM BROOKLYN                      2320
3 2100411940110156     01/04/2021 04:34:00 AM MANHATTAN                     1194
4 2100443340130018     01/04/2021 05:18:00 AM RICHMOND / STATE…             4334
5 2100433980120127     01/04/2021 05:52:00 AM BRONX                         3398
6 2100440750140183     01/04/2021 07:59:00 AM BROOKLYN                      4075
# ℹ 28 more variables: ALARM_BOX_LOCATION <chr>, INCIDENT_BOROUGH <chr>,
#   ZIPCODE <dbl>, POLICEPRECINCT <dbl>, CITYCOUNCILDISTRICT <dbl>,
#   COMMUNITYDISTRICT <dbl>, COMMUNITYSCHOOLDISTRICT <dbl>,
#   CONGRESSIONALDISTRICT <dbl>, ALARM_SOURCE_DESCRIPTION_TX <chr>,
#   ALARM_LEVEL_INDEX_DESCRIPTION <chr>, HIGHEST_ALARM_LEVEL <chr>,
#   INCIDENT_CLASSIFICATION <chr>, INCIDENT_CLASSIFICATION_GROUP <chr>,
#   DISPATCH_RESPONSE_SECONDS_QY <dbl>, FIRST_ASSIGNMENT_DATETIME <chr>, …

Exploring

I will explore the following variables, INCIDENT_BOROUGH, INCIDENT_RESPONSE_SECONDS_QY, INCIDENT_TRAVEL_TM_SECONDS_QY, INCIDENT_CLASSIFICATION, DISPATCH_RESPONSE_SECONDS_QY and CONGRESSIONALDISTRICT, with the functions ‘unique()’ for categorical and ‘summary()’ for numerical. ‘unique()’ will give me all of the duplicated observations in the variable I choose. ‘summary()’ will give me the Min., 1st Qu., Median, Mean 3rd Qu., and Max. of the variable I choose.

unique(fires1$INCIDENT_CLASSIFICATION)

 [1] "Multiple Dwelling 'A' - Compactor fire"        
 [2] "Multiple Dwelling 'A' - Other fire"            
 [3] "Multiple Dwelling 'A' - Food on the stove fire"
 [4] "Private Dwelling Fire"                         
 [5] "Hospital Fire"                                 
 [6] "Other Public Building Fire"                    
 [7] "Other Commercial Building Fire"                
 [8] "Multiple Dwelling 'B' Fire"                    
 [9] "Store Fire"                                    
[10] "Factory Fire"                                  
[11] "Construction or Demolition Building Fire"      
[12] "Church Fire"                                   
[13] "Transit System - Structural"                   
[14] "Untenanted Building Fire"                      
[15] "School Fire"                                   
[16] "Theater or TV Studio Fire"                     
[17] "Under Contruction / Vacant Fire"

unique(fires1$INCIDENT_BOROUGH)

[1] "BRONX"                    "BROOKLYN"                
[3] "MANHATTAN"                "RICHMOND / STATEN ISLAND"
[5] "QUEENS"

unique(fires1$HIGHEST_ALARM_LEVEL)

 [1] "First Alarm"           "All Hands Working"     "Second Alarm"         
 [4] "Fourth Alarm"          "Seventh Alarm"         "Fifth Alarm or Higher"
 [7] "Fifth Alarm"           "Third Alarm"           "Eighth Alarm"         
[10] "Simultaneous Call"     "Sixth Alarm"           "Ninth Alarm or Higher"

summary(fires1$INCIDENT_RESPONSE_SECONDS_QY)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   21.0   202.0   238.0   245.9   280.0  7986.0

summary(fires1$INCIDENT_TRAVEL_TM_SECONDS_QY)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0     155     188     195     226    7947

summary(fires1$DISPATCH_RESPONSE_SECONDS_QY)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5.00   34.00   46.00   50.77   62.00 1325.00

summary(fires1$CONGRESSIONALDISTRICT)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.00    8.00   11.00   10.67   13.00   16.00

Linear Regression

I used the function cor() to find the correlation coefficient of the linear regression. Then I used lm(y ~ x) to find the slope, intercept, and p-value.

cor(fires2$INCIDENT_RESPONSE_SECONDS_QY, fires2$DISPATCH_RESPONSE_SECONDS_QY)

[1] 0.3411989

fires_lm <- lm(INCIDENT_RESPONSE_SECONDS_QY ~ DISPATCH_RESPONSE_SECONDS_QY, data = fires2)
summary(fires_lm)


Call:
lm(formula = INCIDENT_RESPONSE_SECONDS_QY ~ DISPATCH_RESPONSE_SECONDS_QY, 
    data = fires2)

Residuals:
   Min     1Q Median     3Q    Max 
-337.9  -39.7   -7.1   31.5 7753.3 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  1.889e+02  4.042e-01   467.4   <2e-16 ***
DISPATCH_RESPONSE_SECONDS_QY 1.123e+00  7.167e-03   156.7   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 75.94 on 186376 degrees of freedom
Multiple R-squared:  0.1164,    Adjusted R-squared:  0.1164 
F-statistic: 2.456e+04 on 1 and 186376 DF,  p-value: < 2.2e-16

The correlation coefficient is 0.34 which shows a weak correlation. The equation for this linear regression is: INCIDENT_RESPONSE_SECONDS_QY = 2e-16(DISPATCH_RESPONSE_SECONDS_QY) + 1.889e+02. The p-value has three asterisks which suggests that it is meaningful but we have to take into account the adjusted R-squared which shows that only 11.6% of observations may be explained by the model.

Model

plot_lm <- fires2 |>
  ggplot(aes(DISPATCH_RESPONSE_SECONDS_QY, INCIDENT_RESPONSE_SECONDS_QY)) +
  geom_point(color = "orange3", fill = "orange", shape = 21, size = 3, alpha = .6) +
  geom_smooth(method = 'lm', formula = y~x, linewidth = 1, color = "black") +
  theme_test()
plot_lm

The regression line is somewhat concentrated and although the adjusted R-square is 11% so I would say that there is not a strong relationship between INCIDENT_RESPONSE_SECONDS_QY and DISPATCH_RESPONSE_SECONDS_QY that wouldn’t make much sense to explore.

Exploring 1

I start by filtering for a second time because for some reason it crashed when I tried to edit the first one. I filtered for incident travel seconds less than 6,000 and incident response seconds less than 6,000.

fires3 <- fires2 |>
  filter(INCIDENT_TRAVEL_TM_SECONDS_QY < 6000, INCIDENT_RESPONSE_SECONDS_QY < 6000)

Plot 1

plot1 <- fires2 |>
  mutate(HIGHEST_ALARM_LEVEL = fct_relevel(HIGHEST_ALARM_LEVEL,
                     "Simultaneous Call", 
                     "First Alarm",
                     "Second Alarm", 
                     "Third Alarm", 
                     "Fourth Alarm",  
                     "Fifth Alarm", 
                     "Fifth Alarm or Higher", 
                     "Sixth Alarm", 
                     "Seventh Alarm", 
                     "Eighth Alarm", 
                     "Ninth Alarm or Higher", 
                     "All Hands Working")) |>
  group_by(INCIDENT_BOROUGH)
ggplot(plot1, aes(HIGHEST_ALARM_LEVEL)) +
  geom_bar(fill = "tomato", color = "tomato3") +
  theme_test() +
  #scale_fill_manual(values = c("#e76f51", "#f4a261", "#2a9d8f", "#264653" )) +
  facet_wrap(~INCIDENT_BOROUGH) +
  labs(title = "Number of Incidents and Highest Alarm Level for the NYC Boroughs",
       x = "Highest Alarm Level", 
       y = "Count") +
  coord_flip()

You can see from the plot that the most incidents highest alarm was the first alarm. I don’t plan to use the ‘HIGHEST_ALARM_LEVEL’ variable since there’s not much variety.

Exploring 2

I first use ‘mutate()’ to change CONGRESSIONALDISTRICT from numerical to categorical.

fires2 <- fires2 |> 
  mutate(CONGRESSIONALDISTRICT = as.character(CONGRESSIONALDISTRICT))

unique(fires2$CONGRESSIONALDISTRICT)

 [1] "15" "9"  "10" "11" "13" "8"  "3"  "14" "6"  "12" "7"  "5"  "16"

I use ‘fct_relevel()’ to order the congressional districts 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16. There was not 1, 2, or 4.

plot2 <- fires2 |>
  mutate(CONGRESSIONALDISTRICT = fct_relevel(CONGRESSIONALDISTRICT,
                     "3", 
                     "5", 
                     "6", 
                     "7",  
                     "8", 
                     "9", 
                     "10", 
                     "11", 
                     "12", 
                     "13", 
                     "14",
                     "15",
                     "16")) |>
  group_by(CONGRESSIONALDISTRICT)
ggplot(plot2, aes(CONGRESSIONALDISTRICT, fill = INCIDENT_BOROUGH)) +
  geom_bar() +
  scale_fill_manual(values = c("khaki1", "yellow", "gold1", "goldenrod2", "darkgoldenrod3")) +
  theme_classic() +
  labs(title = "Count of Incidents in Each NYC Congressional District",
       x = "NYC Congressional District",
       y = "Count",
       fill = "Borough")

The plot above shows the most incidents happen in district 15 and Brooklyn but the other districts still have a significant mount of incidents with district 3 having the least amount.

Final Plot 1

I first use ‘mutate()’ to change ‘year’ from numerical to categorical. I will group by year so it has to be categorical.

fires4 <- fires3 |>
  mutate(year = as.character(year))

hc_plot <- hcboxplot(
  x = fires4$INCIDENT_RESPONSE_SECONDS_QY,
  var = fires4$year,
  name = "Response in Seconds", 
  color = "#dc1600",
  fillColor = "#db5749",
  outliers = F
  ) |>
  hc_title(text = "Incident Response Time (in seconds) for Each Year") |>
  hc_yAxis(title = list(text = "Incident Response Time (in seconds)")) |>
  hc_xAxis(title = list(text = "Year")) |>
  hc_chart(backgroundColor = "#ebfefe") |>
  hc_caption(text = "Source: FDNY")

Warning in hcboxplot(x = fires4$INCIDENT_RESPONSE_SECONDS_QY, var = fires4$year, : 'hcboxplot' is deprecated.
Use 'data_to_boxplot' instead.
See help("Deprecated")

hc_plot

The plot shows that the response times has stayed very similar from 2005 to 2023. The median of each box being around the same, with 2011, 2012, and 2015 having the highest mins and maxs. It has slightly lowered in recent years.

Final Plot 2

Link to my Tableau treemap:

https://public.tableau.com/views/NYCFireIncidentsTreemap/NYCFireIncidentsTreemap?:language=en-US&:sid=&:display_count=n&:origin=viz_share_link

Based on my treemap, you can see that “Multiple Dwelling ‘A’ - Food on the stove fire” is the type of fire with the highest travel time combined. This is could be due to Class A or “Multiple Dwelling ‘A’ - Food on the stove fire” fires being the most common type of fire.
I learned that here: (https://www.firetrace.com/fire-protection-blog/5-classes-of-fire#:~:text=Class%20A%20fires%20are%20the,the%20fire%20will%20burn%20out.)

Final Plot 3

https://stackoverflow.com/questions/55291157/how-to-make-animated-ggplot-for-different-years-using-gganimate

https://stackoverflow.com/questions/60259080/fail-to-render-an-animation

plot5 <- ggplot(fires3, aes(INCIDENT_RESPONSE_SECONDS_QY, INCIDENT_TRAVEL_TM_SECONDS_QY, fill = INCIDENT_BOROUGH)) +
  geom_point(shape = 21, size = 5, alpha = .5) +
  transition_time(year) +
  theme_classic() +
  scale_fill_manual(values = c("#371e16", "#760f13", "#c81e0a", "#f87614", "#fab520")) +
  labs(title = "Incident Response Time by Incident Travel Time",
       x = "Incident Response Time (in seconds)", 
       y = "Incident Travel Time (in seconds)", 
       fill = "Incident Borough", 
       caption = "Year: {frame_time}\nSource: FDNY")
animate(plot5, renderer = gifski_renderer(), fps = 1, duration = n_distinct(fires2$year) + 2, end_pause = 1, start_pause = 1)

The plot shows that Incident Response Time and Incident Travel Time have a strong correlation. You cna also see that in recent years both Incident Response Time and Incident Travel Time have gotten lower and lower. You can tell from the graph that from 2013, minus 2014 with the outlier and 2022 which had a burst of incidents, to 2023 the dots are crowded near the bottom.

Conclusion

I learned from my plots that Brooklyn has the highest number of incidents and Staten Island having the least. I expected the incident response time and incident travel time were related, I thought there might be more deviations. I thought there would also be more deviation between incident response time between the years. I wanted to add interactivity in the third plot but I couldn’t figure out how.