This report analyzes a CalFire CSV file containing historic fire perimeter records. Through data wrangling and data visualization I show that fires in California have become more frequent and have grown in intensity over time.
Before loading and viewing the data set, I load the packages my R script needs.
#load necessary packages -----
library(tidyverse)
library(here)
library(lubridate)
library(skimr)
library(janitor)
library(RColorBrewer)
#establish a theme for our graphs
theme_set(theme_classic())
The initial data set is a CSV file from CalFire containing fire perimeter records from 1878 through January 2025. It holds more than 20,000 records, along with several fields that this analysis does not need. I begin by loading the data set and taking a glimpse at it to understand its structure.
#read the CalFire CSV master data file ----
calFires <- read_csv(here("Data", "Fire_Perimeters.csv"))
#glimpse into what the data looks like ----
glimpse(calFires)
## Rows: 22,810
## Columns: 21
## $ OBJECTID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
## $ Year <dbl> 2025, 2025, 2025, 2025, 2025, 2025, 202…
## $ State <chr> "CA", "CA", "CA", "CA", "CA", "CA", "CA…
## $ Agency <chr> "CDF", "CDF", "CDF", "CCO", "CDF", "CDF…
## $ `Unit ID` <chr> "LDF", "LAC", "ANF", "VNC", "LDF", "LAC…
## $ `Fire Name` <chr> "PALISADES", "EATON", "HUGHES", "KENNET…
## $ `Local Incident Number` <chr> "738", "9087", "250270", "3155", "3294"…
## $ `Alarm Date` <chr> "1/7/2025 8:00", "1/8/2025 8:00", "1/22…
## $ `Containment Date` <chr> "1/31/2025 8:00", "1/31/2025 8:00", "1/…
## $ Cause <dbl> 14, 14, 14, 14, 14, 14, 7, 14, 14, 14, …
## $ `Collection Method` <dbl> 7, 7, 7, 2, 7, 7, 7, 3, 7, 3, 3, 3, 3, …
## $ `Management Objective` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ `GIS Calculated Acres` <dbl> 23448.8800, 14056.2600, 10396.8000, 998…
## $ Comments <chr> NA, NA, NA, "from OES Intel 24", NA, NA…
## $ `Complex Name` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `IRWIN ID` <chr> "{A7EA5D21-F882-44B8-BF64-44AB11059DC1}…
## $ `Fire Number (historical use)` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `Complex ID` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ DECADES <chr> "2020-January 2025", "2020-January 2025…
## $ Shape__Area <dbl> 138651835, 83363929, 62160639, 5919678,…
## $ Shape__Length <dbl> 140231.61, 104933.21, 96698.60, 15602.0…
In the code below I create a new object to store the prepared CalFire data. The code block pipes together several operations to produce a data set that is smaller and easier to analyze.
#prepare the data to be analyzed and graphed ----
analysisFires <- calFires %>%
  clean_names() %>%
  mutate(
    alarm_date = mdy_hm(alarm_date),
    containment_date = mdy_hm(containment_date),
    firelength = as.numeric(difftime(containment_date, alarm_date, units = "days")),
    decade = (year %/% 10) * 10
  ) %>%
  filter(decade >= 1900) %>%
  rename(size_in_acres = gis_calculated_acres) %>%
  select(irwin_id, agency, unit_id, fire_name, decade, year, alarm_date,
         containment_date, firelength, size_in_acres)
clean_names() converts all of the column names to a uniform style: lowercase, with spaces replaced by "_".
mutate() allows me to alter or create columns: it parses alarm_date and containment_date into date-times with mdy_hm(), computes firelength as the number of days between them, and derives decade from year by integer division.
filter() allows me to keep a subset of rows. I chose to keep fires from the 1900s decade onward because the data from that point on is more consistent and reliable.
rename() allows me to rename an existing field; in this case I changed gis_calculated_acres to size_in_acres.
select() allows me to keep a subset of columns from the data set, which I save to the analysisFires object.
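As a quick sanity check on the mutate() step, the sketch below recomputes the derived columns by hand for the first two rows shown in the glimpse() output above. The object names alarm, contained, and yr are hypothetical and exist only for this illustration; the parsing and arithmetic mirror the calls in the pipeline.

```r
library(lubridate)

# Alarm and containment strings for the first two rows, copied from the
# glimpse() output; these vector names are illustrative only.
alarm     <- mdy_hm(c("1/7/2025 8:00", "1/8/2025 8:00"))
contained <- mdy_hm(c("1/31/2025 8:00", "1/31/2025 8:00"))
yr        <- c(2025, 2025)

# Duration in days, computed the same way as the firelength column
firelength <- as.numeric(difftime(contained, alarm, units = "days"))

# Integer division drops the final digit of the year; multiplying by 10
# restores the starting year of the decade
decade <- (yr %/% 10) * 10

firelength  # 24 23
decade      # 2020 2020
```

These values agree with the firelength and decade columns in the glimpse of analysisFires.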
#list the new column names for the prepared data set ----
names(analysisFires)
## [1] "irwin_id" "agency" "unit_id" "fire_name"
## [5] "decade" "year" "alarm_date" "containment_date"
## [9] "firelength" "size_in_acres"
glimpse(analysisFires)
## Rows: 22,725
## Columns: 10
## $ irwin_id <chr> "{A7EA5D21-F882-44B8-BF64-44AB11059DC1}", "{72660ADC-…
## $ agency <chr> "CDF", "CDF", "CDF", "CCO", "CDF", "CDF", "CDF", "USF…
## $ unit_id <chr> "LDF", "LAC", "ANF", "VNC", "LDF", "LAC", "BTU", "SQF…
## $ fire_name <chr> "PALISADES", "EATON", "HUGHES", "KENNETH", "HURST", "…
## $ decade <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,…
## $ year <dbl> 2025, 2025, 2025, 2025, 2025, 2025, 2024, 2024, 2024,…
## $ alarm_date <dttm> 2025-01-07 08:00:00, 2025-01-08 08:00:00, 2025-01-22…
## $ containment_date <dttm> 2025-01-31 08:00:00, 2025-01-31 08:00:00, 2025-01-28…
## $ firelength <dbl> 24.0000, 23.0000, 6.0000, 26.0000, 2.0000, 3.0000, 45…
## $ size_in_acres <dbl> 23448.8800, 14056.2600, 10396.8000, 998.7378, 831.385…
To perform this analysis I created more specific data subsets from analysisFires. I found that this approach often kept the code easier to read and debug.
#retrieve the number of fires per decade
firesPerDecade <- analysisFires %>%
  group_by(decade) %>%
  summarize(num_fires = n()) %>%
  na.omit()
glimpse(firesPerDecade)
## Rows: 13
## Columns: 2
## $ decade <dbl> 1900, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, …
## $ num_fires <int> 70, 1324, 1479, 1157, 1155, 1819, 1276, 1879, 2208, 1992, 29…
This code block creates a new object from the analysisFires data set. It groups the data by decade and counts the number of fires that occurred in each decade, storing the count in a new column called num_fires. The last line of code drops any rows that contain NA values.
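The group_by() and summarize(n()) pair can also be written with dplyr's count(), which groups, tallies, and ungroups in one call. The sketch below uses a small made-up tibble (toy_fires is hypothetical, not the CalFire data) to show that the two forms give the same counts.

```r
library(dplyr)

# Made-up stand-in for analysisFires; only the decade column matters here
toy_fires <- tibble(decade = c(1990, 1990, 2000, 2000, 2000, 2010))

# Form used in the report
by_summarize <- toy_fires %>%
  group_by(decade) %>%
  summarize(num_fires = n())

# Equivalent shorthand
by_count <- toy_fires %>%
  count(decade, name = "num_fires")

# Both approaches produce the same per-decade counts
identical(by_summarize$num_fires, by_count$num_fires)
```

Either form works; the explicit group_by() version makes each step visible, which suits a walkthrough like this one.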
avgSizePerYear <- analysisFires %>%
  group_by(year, decade) %>%
  summarize(avg_fire_size = mean(size_in_acres, na.rm = TRUE))
glimpse(avgSizePerYear)
## Rows: 124
## Columns: 3
## Groups: year [124]
## $ year <dbl> 1900, 1902, 1903, 1905, 1906, 1907, 1908, 1909, 1910, 19…
## $ decade <dbl> 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1910, 19…
## $ avg_fire_size <dbl> 2044.0418, 731.4816, 1824.6718, 214.2885, 991.2227, 132.…
This code block creates another new object from the analysisFires data set. It groups the data by year and decade so I can analyze each year within its decade. It then summarizes the data in a new field called avg_fire_size, which holds the average fire size in acres for each year.
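One detail worth knowing: when summarize() follows a group_by() on two variables, dplyr by default drops only the last grouping level, which is why the glimpse() output above still reports `Groups: year`. If an ungrouped result is preferred, the `.groups` argument (available in dplyr 1.0.0 and later) can be set explicitly. The sketch below uses a hypothetical toy tibble, toy_sizes, rather than the real data.

```r
library(dplyr)

# Made-up fire sizes; toy_sizes is illustrative only
toy_sizes <- tibble(
  year          = c(1990, 1990, 1990, 2000),
  decade        = c(1990, 1990, 1990, 2000),
  size_in_acres = c(10, 30, NA, 100)
)

avg_ungrouped <- toy_sizes %>%
  group_by(year, decade) %>%
  summarize(
    avg_fire_size = mean(size_in_acres, na.rm = TRUE),  # NA sizes are ignored
    .groups = "drop"                                    # return an ungrouped tibble
  )

is_grouped_df(avg_ungrouped)  # FALSE
```

Leaving the result grouped by year does not change the plots in this report; dropping the groups mainly avoids surprises if more dplyr verbs are chained on later.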
For this first graph I used geom_col from the ggplot2 package. The x-axis shows the decades and the y-axis shows the number of fires. geom_smooth adds a linear regression line, specified by method = "lm". Lastly, the labs() call sets the title, axis labels, and caption.
firesPerDecade %>%
  ggplot(aes(x = decade, y = num_fires)) +
  geom_col(fill = "red") +
  geom_smooth(method = "lm") +
  labs(title = "Number of Fires per Decade",
       x = "Decade",
       y = "Number of Fires",
       caption = "Graph 1 shows the number of fires per decade, depicted with geom_col in ggplot2.
The trend line shows a clear overall increase in the number of fires that have occurred in California
across the decades."
  ) +
  theme(
    plot.caption = element_text(
      size = 10,
      face = "italic",
      hjust = .5))
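The trend line drawn by geom_smooth(method = "lm") in Graph 1 is an ordinary least-squares fit of num_fires on decade, so the same slope can be read from lm() directly. The sketch below is an illustration only: it uses the ten decade counts that are fully visible in the glimpse() output of firesPerDecade (1900 through 1990; the later values are truncated there), so its slope will differ from the one fitted on the full table. The names partial_counts, fit, and slope_per_decade are hypothetical.

```r
# Counts copied from the visible part of the glimpse() output;
# the decades after 1990 are truncated there and are omitted here.
partial_counts <- data.frame(
  decade    = seq(1900, 1990, by = 10),
  num_fires = c(70, 1324, 1479, 1157, 1155, 1819, 1276, 1879, 2208, 1992)
)

# Same model form that geom_smooth(method = "lm") fits: num_fires ~ decade
fit <- lm(num_fires ~ decade, data = partial_counts)

# The coefficient is in fires per calendar year; multiply by 10
# to express it as the change in fires from one decade to the next
slope_per_decade <- unname(coef(fit)["decade"]) * 10
slope_per_decade
```

A positive slope here corresponds to the upward-tilting line in the graph; summary(fit) would additionally report how well a straight line describes the counts.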
For this second graph I used geom_area from the ggplot2 package. The area graph depicts the undulations in fire size across the years of a decade more naturally. To create one panel per decade I used facet_wrap(), and I again set the labels using labs(). Lastly, I removed the x-axis text labels using theme() because printing every year in each decade was hard to read in the output.
avgSizePerYear %>%
  ggplot(aes(x = year, y = avg_fire_size)) +
  geom_area() +
  facet_wrap(~decade, scales = "free_x") +
  labs(
    title = "Average Fire Size per Year in California",
    x = "Year",
    y = "Average Fire Size (Acres)",
    caption = "Graph 2 shows the average fire size per year grouped into decades.
There is a lot of natural fluctuation but overall we see an increase in fire size across the
recorded decades."
  ) +
  theme(
    axis.text.x = element_blank(),
    plot.caption = element_text(
      size = 10,
      face = "italic",
      hjust = .5))
For the final graph I used geom_point from the ggplot2 package. I plot two variables against each other: the x-axis is the number of days each fire burned and the y-axis is the size of each fire. The points are colored by decade. I also apply scale_x_log10() and scale_y_log10() to put both axes on a log10 scale for a better visual distribution. I add another linear regression line with geom_smooth and use scale_color_distiller() to apply the "Spectral" ColorBrewer palette.
analysisFires %>%
  ggplot(aes(x = firelength, y = size_in_acres, color = decade)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  geom_smooth(method = "lm", color = "black") +
  labs(
    title = "Relationship Between Fire Length and Fire Size (Log-Scale)",
    x = "Fire Length (days, log10)",
    y = "Fire Size (acres, log10)",
    caption = "Graph 3 shows an increase in fire intensity over time based on fire duration and
fire size."
  ) +
  scale_color_distiller(palette = "Spectral") +
  theme(
    plot.caption = element_text(
      size = 10,
      face = "italic",
      hjust = .5))
ggsave(here("Data", "FavoriteGraph.jpg")) #saves my favorite graph :)
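Because both axes of Graph 3 are log-scaled, any record with a zero or negative firelength (for example, a containment date equal to or earlier than the alarm date) or a zero size cannot be placed on the plot: log10() of such values is not finite, and ggplot2 drops those rows with a warning. Whether the CalFire data contains such records is not checked here. A minimal sketch of filtering them out beforehand, using a hypothetical toy data frame rather than the real data:

```r
# Made-up rows; toy_fires is illustrative only
toy_fires <- data.frame(
  firelength    = c(24, 0, -1, 3, NA),
  size_in_acres = c(23448.88, 12.5, 40, 0, 5)
)

# Keep only rows where both values are positive, so log10() is finite.
# Comparisons involving NA yield NA, and which() skips those rows.
keep <- which(toy_fires$firelength > 0 & toy_fires$size_in_acres > 0)
plottable <- toy_fires[keep, ]

nrow(plottable)  # 1
```

Filtering explicitly makes it clear how many records are left out of the scatter plot, rather than relying on the warning message.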
This CalFire data story shows that, over time, fires in California have occurred more frequently and have grown in intensity. Both trends rise sharply from the 1900s onward. While other variables need to be taken into account, this upward trend in fire frequency and intensity coincides with higher concentrations of greenhouse gases in our atmosphere from human causes. This could suggest that fires are becoming worse because of human-induced climate change.