Image from: New York Times https://www.nytimes.com/2025/01/08/us/california-fires-landmarks-pasadena-palisades-malibu.html
Intro:
I used a dataset from Kaggle that was compiled from www.fire.ca.gov, a government website that tracks wildfires in California. The dataset covers fires recorded on that site from 2013 to 2020. It was fairly clean, but certain variables have a lot of NAs. Some variables also contain values that are not logical: for instance, one fire is dated well before 2013, when the dataset is supposed to start, so I filter it out. There were also repeated entries, which I removed.
The variables I will use include AcresBurned, the size of the fire, which is the main metric I will use to determine a fire's severity; Started, the date the fire started, which I will reformat; ArchiveYear, the year the fire happened; and Longitude and Latitude, which are self-explanatory. Other variables I used include Injuries, Fatalities, StructuresThreatened, and PersonnelInvolved, all of which are self-explanatory.
I chose this topic because of the large number of wildfires that occur in California. I personally don't know anyone who lives in California, but I used to live in Oregon before moving to Maryland. Oregon (especially Southern Oregon) has a very similar problem to California; in fact, fires that start in California often affect Oregon.
Setup and Basic Cleaning:
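library(tidyverse)
library(lubridate) # ymd_hms(); attached by tidyverse 2.0.0+, loaded explicitly for older versions
library(leaflet)
setwd("~/Data 110/Final")
fires <- read_csv("California_Fire_Incidents.csv")
# Parse the Started timestamp and keep only the calendar date
fires$date <- as_date(ymd_hms(fires$Started))
# Drop repeated entries, keeping the first row for each fire name
fires2 <- fires[!duplicated(fires$Name), ]
# Drop fires dated before 2010 (compare against a Date, not a bare number)
fires3 <- fires2 |>
  filter(date > as.Date("2010-01-01"))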
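One thing worth checking when filtering on a Date column: R stores a Date as the number of days since 1970-01-01, so comparing a Date against a bare number such as 2010 compares day counts rather than years. The sketch below uses made-up dates (not rows from the dataset) to show the difference; the `cutoff` name is only for illustration.

```r
# Toy dates (hypothetical, not taken from the dataset)
d <- as.Date(c("1969-12-31", "2013-08-17", "2019-10-23"))

# A Date is stored as the number of days since 1970-01-01
as.numeric(d)

# A bare 2010 is therefore read as "day 2010", which is a date in mid-1975
as.Date(2010, origin = "1970-01-01")

# Comparing against an explicit Date makes the intended cutoff visible
cutoff <- as.Date("2010-01-01")
keep <- d > cutoff
d[keep]
```

With these toy dates both comparisons drop the 1969 entry, but only the explicit Date cutoff would also drop a fire dated between 1975 and 2010.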
Look at the Largest Fires:
fires3 |>
select(AcresBurned)|>
slice_max(order_by = AcresBurned, n =10)
## # A tibble: 10 × 1
## AcresBurned
## <dbl>
## 1 410203
## 2 281893
## 3 257314
## 4 229651
## 5 151623
## 6 132127
## 7 97717
## 8 96949
## 9 96901
## 10 90288
Notably, the largest fire (410,203 acres) is roughly 1.5 times the size of the second largest and more than four times the size of the tenth largest.
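To put numbers on how the largest fire compares with the rest, a quick sketch can divide the maximum by each of the ten values. The acreages are copied from the output above; the variable names are only illustrative.

```r
# Top-ten fire sizes in acres, copied from the output above
top10 <- c(410203, 281893, 257314, 229651, 151623,
           132127, 97717, 96949, 96901, 90288)

# How many times larger the biggest fire is than each fire in the list
ratio_to_largest <- max(top10) / top10
round(ratio_to_largest, 2)
```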
Basic Visualization of Size of Fires Over Time
ggplot(fires3, aes(date, AcresBurned)) +
  geom_point() +
  labs(title = "Size of Fires Over Time", x = "Time (Years)", y = "Size (Acres)", caption = "Data From: www.fire.ca.gov") +
  theme_bw()
Interestingly, there is a gap roughly every year where there are no forest fires. I believe this is due to the winter/wet season. From experience, most fires occur during the spring and summer. Ultimately, this graph shows no real correlation between the size of a fire and the year in which it took place. The vast majority of fires don't get that big, which is very good.
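One way to check the seasonal gap directly, rather than reading it off the scatterplot, would be to count fires by start month. The sketch below uses hypothetical dates so that it runs on its own; with the real data the `dates` vector would be `fires3$date`.

```r
# Hypothetical start dates; with the real data this would be fires3$date
dates <- as.Date(c("2017-01-15", "2017-06-02", "2017-07-19", "2017-07-30",
                   "2018-08-04", "2018-08-21", "2018-11-08", "2019-10-23"))

# Extract the month number and tabulate, keeping empty months as zero counts
month_num <- as.integer(format(dates, "%m"))
fires_per_month <- table(factor(month_num, levels = 1:12, labels = month.abb))
fires_per_month
```

Months with a count of zero (or close to it) across all years would correspond to the gaps visible in the plot.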
Multiple Linear Regression to Predict Size of Fires
model <- lm(AcresBurned ~ date + StructuresThreatened + Injuries + PersonnelInvolved, data = fires3)
summary(model)
##
## Call:
## lm(formula = AcresBurned ~ date + StructuresThreatened + Injuries +
## PersonnelInvolved, data = fires3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28632 -7139 -709 4684 50660
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.014e+05 1.631e+05 -1.235 0.263
## date 1.196e+01 9.743e+00 1.227 0.266
## StructuresThreatened 4.142e+00 2.723e+01 0.152 0.884
## Injuries 2.687e+03 4.874e+03 0.551 0.601
## PersonnelInvolved 4.428e+00 1.914e+01 0.231 0.825
##
## Residual standard error: 26130 on 6 degrees of freedom
## (1145 observations deleted due to missingness)
## Multiple R-squared: 0.3058, Adjusted R-squared: -0.1571
## F-statistic: 0.6606 on 4 and 6 DF, p-value: 0.6415
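The model is a really bad predictor. None of the p-values is small enough to be significant, and the p-value for the overall F-test is 0.6415. Because 1,145 observations were dropped for missing values, the model was fit on only 11 fires. The adjusted R-squared, which is the one to use for multiple linear regression, is actually negative (-0.1571), meaning the predictors explain no more of the variation than would be expected by chance.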
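The negative adjusted R-squared can be reproduced by hand from the numbers in the summary, which also shows how few rows the model actually used. This is a small worked sketch using the standard adjustment formula; the variable names are only illustrative.

```r
# Values read from the summary output above
r_squared   <- 0.3058
residual_df <- 6      # "on 6 degrees of freedom"
n_coef      <- 5      # intercept + 4 predictors

n <- residual_df + n_coef   # rows actually used in the fit
p <- n_coef - 1             # number of predictors

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
n
round(adj_r_squared, 4)
```

With only 11 usable rows and 4 predictors, the penalty factor (n - 1) / (n - p - 1) is 10/6, which is enough to push a modest R-squared below zero.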
Graph to Visualize the Randomness Shown in the Model
ggplot(fires3, aes(y= AcresBurned, x = Injuries))+
geom_point(alpha=0.3)+
geom_smooth(method='lm',formula=y~x, se = FALSE, color = "Black", linetype= 2)+
labs(title = "Size Over Number of Injuries", y = "Size (Acres)", caption= 'Data From: www.fire.ca.gov')+
theme_bw()
Injuries and most of the other predictors have a lot of NAs, which is why so few points appear. Darker points represent overlapping observations. A negative slope for this line doesn't make logical sense, since more injuries and a bigger fire should be positively correlated. The most likely issues with my model are, first, the number of NAs in some of my predictors and, second, that size is not really a good indicator of the severity of a fire.
Check for NAs in the variables used for the interactive graph
sum(is.na(fires3$AcresBurned))
sum(is.na(fires3$ArchiveYear))
sum(is.na(fires3$Latitude))
sum(is.na(fires3$Longitude))
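The four checks above could also be done in one call. The sketch below uses a small made-up data frame with the same column names (`toy` is a hypothetical stand-in for `fires3`) so that it runs on its own.

```r
# Hypothetical stand-in for fires3 with the same column names
toy <- data.frame(
  AcresBurned = c(120, NA, 5000, 42),
  ArchiveYear = c(2013, 2014, 2014, 2015),
  Latitude    = c(38.5, 34.1, NA, 36.7),
  Longitude   = c(-121.4, -118.2, NA, -119.4)
)

# One NA count per column
na_counts <- colSums(is.na(toy))
na_counts

# Rows where every mapped field is present
complete_rows <- toy[complete.cases(toy), ]
nrow(complete_rows)
```

On the real data, the same idea would be `colSums(is.na(fires3[, c("AcresBurned", "ArchiveYear", "Latitude", "Longitude")]))`.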
Create Interactive Graph
popup <- paste0(
"<b>Year: </b>", fires3$ArchiveYear, "<br>",
"<b>Injuries (NA is no data for this fire): </b>", fires3$Injuries, "<br>",
"<b>Deaths (NA is no data for this fire): </b>", fires3$Fatalities, "<br>",
"<b>size (Acres): </b>", fires3$AcresBurned, "<br>"
)
leaflet() |>
  setView(lng = -119.417931, lat = 36.778259, zoom = 5) |>
  addProviderTiles("CartoDB.Positron") |>
  addCircles(
    data = fires3,
    lng = ~Longitude, # name the coordinate columns instead of relying on auto-detection
    lat = ~Latitude,
    radius = fires3$AcresBurned / 2,
    color = "#c02a23",
    fillOpacity = 0.15,
    popup = popup
  )
This graph shows individual fires, with the radius of each circle proportional to the acres burned. The graph is most interesting when you line it up with a map of California that shows where the forests are. Notably, eastern California has far fewer fires than northern and western California. This is because eastern California is desert-like, owing to the effect of the coastal range on rain patterns.
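Because `addCircles()` takes its radius in meters, dividing the acreage by 2 makes the radius, not the area, scale with fire size, so large fires look disproportionately big. An alternative, sketched below as an assumption rather than what the map above does, is to give each circle the same area as the fire. The helper `area_radius_m()` is hypothetical and not part of leaflet.

```r
# Acres to square meters (1 acre = 4046.8564224 m^2 by definition)
sq_m_per_acre <- 4046.8564224

# Radius in meters of a circle whose area equals the burned area
area_radius_m <- function(acres) {
  sqrt(acres * sq_m_per_acre / pi)
}

acres <- c(90288, 410203)   # tenth-largest and largest fires from the table above
r <- area_radius_m(acres)
r

# Sanity check: converting the circle area back recovers the acreage
pi * r^2 / sq_m_per_acre
```

In the map call this would be used as `radius = area_radius_m(fires3$AcresBurned)`. The circles come out much smaller than with `AcresBurned / 2`, so small fires may need a minimum radius to stay visible at the statewide zoom level.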