Forest fires pose a significant threat to the environment, resulting in substantial ecological and economic harm. This project aims to analyze the relationship between weather conditions and fire occurrence and identify spatial patterns in fire events.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
forest_fires <- read_csv("forestfires.csv")
## Rows: 517 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): month, day
## dbl (11): X, Y, FFMC, DMC, DC, ISI, temp, RH, wind, rain, area
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# What columns are in the dataset?
colnames(forest_fires)
## [1] "X" "Y" "month" "day" "FFMC" "DMC" "DC" "ISI" "temp"
## [10] "RH" "wind" "rain" "area"
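We know that the columns correspond to the following information: X and Y are spatial coordinates of the fire's location; month and day give the month and day of the week; FFMC (Fine Fuel Moisture Code), DMC (Duff Moisture Code), DC (Drought Code), and ISI (Initial Spread Index) are indices from the Fire Weather Index system; temp, RH, wind, and rain are weather measurements; and area is the burned area in hectares.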
A single row corresponds to the location of a fire and some
characteristics of the fire itself. Higher water presence is
typically associated with less fire spread, so we might expect the
water-related variables (DMC and rain) to be
related to area.
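One way to make this expectation checkable is to compute a rank correlation between each water-related variable and area. The sketch below is an illustration rather than part of the original analysis: `rank_cor_with_area()` is a hypothetical helper name, and Spearman's method is an assumption, chosen here because burned area is heavily skewed.

```r
# Hypothetical helper: Spearman rank correlation between `area` and each
# of the named columns of a data frame. Returns a named numeric vector.
rank_cor_with_area <- function(data, cols) {
  sapply(cols, function(col) {
    cor(data[[col]], data[["area"]], method = "spearman")
  })
}
```

With the data loaded as above, `rank_cor_with_area(forest_fires, c("DMC", "rain"))` would return one coefficient per variable. Coefficients close to zero would suggest that the expected relationship is weak in this dataset.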
An overview of the distribution of numerical variables:
summary(forest_fires)
## X Y month day
## Min. :1.000 Min. :2.0 Length:517 Length:517
## 1st Qu.:3.000 1st Qu.:4.0 Class :character Class :character
## Median :4.000 Median :4.0 Mode :character Mode :character
## Mean :4.669 Mean :4.3
## 3rd Qu.:7.000 3rd Qu.:5.0
## Max. :9.000 Max. :9.0
## FFMC DMC DC ISI
## Min. :18.70 Min. : 1.1 Min. : 7.9 Min. : 0.000
## 1st Qu.:90.20 1st Qu.: 68.6 1st Qu.:437.7 1st Qu.: 6.500
## Median :91.60 Median :108.3 Median :664.2 Median : 8.400
## Mean :90.64 Mean :110.9 Mean :547.9 Mean : 9.022
## 3rd Qu.:92.90 3rd Qu.:142.4 3rd Qu.:713.9 3rd Qu.:10.800
## Max. :96.20 Max. :291.3 Max. :860.6 Max. :56.100
## temp RH wind rain
## Min. : 2.20 Min. : 15.00 Min. :0.400 Min. :0.00000
## 1st Qu.:15.50 1st Qu.: 33.00 1st Qu.:2.700 1st Qu.:0.00000
## Median :19.30 Median : 42.00 Median :4.000 Median :0.00000
## Mean :18.89 Mean : 44.29 Mean :4.018 Mean :0.02166
## 3rd Qu.:22.80 3rd Qu.: 53.00 3rd Qu.:4.900 3rd Qu.:0.00000
## Max. :33.30 Max. :100.00 Max. :9.400 Max. :6.40000
## area
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 0.52
## Mean : 12.85
## 3rd Qu.: 6.57
## Max. :1090.84
Check for missing values in each column:
colSums(is.na(forest_fires))
## X Y month day FFMC DMC DC ISI temp RH wind rain area
## 0 0 0 0 0 0 0 0 0 0 0 0 0
There is no missing data.
month and day are character variables.
We’ll convert these variables into factors so that they’ll be sorted
into the correct order when we plot them.
forest_fires %>% pull(month) %>% unique
## [1] "mar" "oct" "aug" "sep" "apr" "jun" "jul" "feb" "jan" "dec" "may" "nov"
forest_fires %>% pull(day) %>% unique
## [1] "fri" "tue" "sat" "sun" "mon" "wed" "thu"
Let’s assume that Monday is the first day of the week.
month_order <- c("jan", "feb", "mar",
"apr", "may", "jun",
"jul", "aug", "sep",
"oct", "nov", "dec")
dow_order <- c("mon", "tue", "wed", "thu", "fri", "sat", "sun")
forest_fires <- forest_fires %>%
mutate(month = factor(month, levels = month_order),
day = factor(day, levels = dow_order))
When Do Most Forest Fires Occur?

By Month:
fires_by_month <- forest_fires %>%
group_by(month) %>%
summarize(total_fires = n())
fires_by_month %>%
ggplot(aes(x = month, y = total_fires)) +
geom_col(fill ="lightcoral") +
labs(
title = "Number of forest fires by month",
y = "Number of fires",
x = "Month"
)
fires_by_dow <- forest_fires %>%
group_by(day) %>%
summarize(total_fires = n())
fires_by_dow %>%
ggplot(aes(x = day, y = total_fires)) +
geom_col(fill = "coral") +
labs(
title = "Number of forest fires by day of the week",
y = "Number of fires",
x = "Day of the week"
)
We see a massive spike in fires in August and September, as well as a smaller spike in March. Fires seem to be more frequent on the weekend.
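The weekend observation can be put on a per-day footing, since the weekend covers two days and the working week covers five. The following is a minimal sketch, not part of the original analysis: `mean_fires_by_day_type()` is a hypothetical helper, and it assumes every day of the week occurs about equally often over the period the data covers.

```r
# Hypothetical helper: average number of fires per day of the week,
# computed separately for weekdays (mon-fri) and the weekend (sat, sun).
mean_fires_by_day_type <- function(days) {
  day_levels <- c("mon", "tue", "wed", "thu", "fri", "sat", "sun")
  weekend <- c("sat", "sun")
  # Count fires per day; days with no fires are kept with a count of zero
  counts <- table(factor(days, levels = day_levels))
  c(
    weekday = mean(as.numeric(counts[setdiff(day_levels, weekend)])),
    weekend = mean(as.numeric(counts[weekend]))
  )
}
```

Calling `mean_fires_by_day_type(forest_fires$day)` would give the two averages side by side. A weekend average clearly above the weekday average would support the reading of the bar chart.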
correlation_matrix <- cor(forest_fires[, sapply(forest_fires, is.numeric)])
corrplot(correlation_matrix, method = "color")
forest_fires_long <- forest_fires %>%
pivot_longer(
cols = c("FFMC", "DMC", "DC",
"ISI", "temp", "RH",
"wind", "rain"),
names_to = "data_col",
values_to = "value"
)
forest_fires_long %>%
ggplot(aes(x = month, y = value)) +
geom_boxplot() +
facet_wrap(vars(data_col), scales = "free_y") +
labs(
title = "Variable distributions by month",
x = "Month",
y = "Variable value"
)
We want to see how each of the variables in the dataset relates
to area. We can reuse the long-format version of the
data we created, together with facet_wrap().
forest_fires_long %>%
ggplot(aes(x = value, y = area)) +
geom_point() +
facet_wrap(vars(data_col), scales = "free_x") +
labs(
title = "Relationships between other variables and area burned",
x = "Value of column",
y = "Area burned (hectare)"
)
It seems that there are two rows where area is so large that they
distort the scale of the visualization. Let’s make a similar visualization
that excludes these observations so that we can better see how each
variable relates to area.
forest_fires_long %>%
filter(area < 300) %>%
ggplot(aes(x = value, y = area)) +
geom_point() +
facet_wrap(vars(data_col), scales = "free_x") +
labs(
title = "Relationships between other variables and area burned (area < 300)",
x = "Value of column",
y = "Area burned (hectare)"
)
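Filtering is one way to handle the extreme values. Another is to keep every row and compress the scale instead. The sketch below swaps in a log(1 + area) transform, which is an assumption on our part and not something the analysis above uses. `plot_log_area()` is a hypothetical helper name.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of one column against log(1 + area).
# log1p() is used because many fires have an area of exactly 0.
plot_log_area <- function(data, x_col) {
  ggplot(data, aes(x = .data[[x_col]], y = log1p(area))) +
    geom_point(alpha = 0.5) +
    labs(x = x_col, y = "log(1 + area burned)")
}
```

For example, `plot_log_area(forest_fires, "temp")` would plot temperature against the transformed area without dropping the two largest fires.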
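create_boxplots <- function(x, y) {
  # aes_string() is deprecated; index the .data pronoun with column-name strings instead
  ggplot(data = forest_fires) +
    aes(x = .data[[x]], y = .data[[y]]) +
    geom_boxplot() +
    labs(x = x, y = y) +
    theme(panel.background = element_rect(fill = "gray"))
}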
# Assign x and y names
x_var_month <- names(forest_fires)[3] # month
x_var_day <- names(forest_fires)[4] # day
y_var <- names(forest_fires)[5:12]
# use map2() to apply the function to each variable of interest
month_box <- map2(x_var_month, y_var, create_boxplots) ## visualize variables by month
month_box
## [Figure output: eight boxplots, one each for FFMC, DMC, DC, ISI, temp, RH, wind, and rain, grouped by month]
# use map2() to apply the function to each variable of interest
day_box <- map2(x_var_day, y_var, create_boxplots) ## visualize variables by day
day_box
## [Figure output: eight boxplots, one each for FFMC, DMC, DC, ISI, temp, RH, wind, and rain, grouped by day of the week]
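create_scatter <- function(x, y) {
  # aes_string() is deprecated; index the .data pronoun with column-name strings instead
  ggplot(data = forest_fires) +
    aes(x = .data[[x]], y = .data[[y]]) +
    geom_point() +
    labs(x = x, y = y) +
    theme(panel.background = element_rect(fill = "white"))
}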
# Assign x and y names
x_var <- names(forest_fires)[13] # area burned
y_var <- names(forest_fires)[5:12]
# use map2() to apply the function to each variable of interest
scatter_plot <- map2(x_var, y_var, create_scatter)
scatter_plot
## [Figure output: eight scatter plots with area on the x-axis and FFMC, DMC, DC, ISI, temp, RH, wind, and rain in turn on the y-axis]