I. Introduction

Forest fires pose a significant threat to the environment, resulting in substantial ecological and economic harm. This project aims to analyze the relationship between weather conditions and fire occurrence and identify spatial patterns in fire events.

Data Source:

https://archive.ics.uci.edu/dataset/162/forest+fires

II. Import Dataset

1. Load necessary libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(corrplot)

## corrplot 0.95 loaded

2. Load the dataset

forest_fires <- read_csv("forestfires.csv")

## Rows: 517 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): month, day
## dbl (11): X, Y, FFMC, DMC, DC, ISI, temp, RH, wind, rain, area
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
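
The message above suggests declaring the column types explicitly. A minimal sketch of what that could look like follows; the name `fire_col_types` and the two-row inline sample are assumptions made for illustration (the sample values are made up, not taken from the file).

```r
library(readr)

# Declare the two text columns explicitly; every other column is parsed as double.
fire_col_types <- cols(
  month = col_character(),
  day = col_character(),
  .default = col_double()
)

# Inline sample with the same header as forestfires.csv (illustrative values only).
sample_csv <- I(
  "X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
1,2,jan,mon,90.0,100.0,500.0,8.0,20.0,40,4.0,0.0,0.0
3,4,aug,sun,92.5,120.5,650.0,9.5,25.5,35,3.1,0.2,1.5
"
)

sample_fires <- read_csv(sample_csv, col_types = fire_col_types)

# For the real file, the same specification would be passed as:
# forest_fires <- read_csv("forestfires.csv", col_types = fire_col_types)
```

Passing `col_types` also silences the column specification message, and it makes the import fail loudly if a column stops parsing as expected.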

3. Dataset Attributes

# What columns are in the dataset?
colnames(forest_fires)

##  [1] "X"     "Y"     "month" "day"   "FFMC"  "DMC"   "DC"    "ISI"   "temp" 
## [10] "RH"    "wind"  "rain"  "area"

We know that the columns correspond to the following information:

X: X-axis spatial coordinate within the Montesinho park map: 1 to 9
Y: Y-axis spatial coordinate within the Montesinho park map: 2 to 9
month: Month of the year: ‘jan’ to ‘dec’
day: Day of the week: ‘mon’ to ‘sun’
FFMC: Fine Fuel Moisture Code index from the FWI system: 18.7 to 96.20
DMC: Duff Moisture Code index from the FWI system: 1.1 to 291.3
DC: Drought Code index from the FWI system: 7.9 to 860.6
ISI: Initial Spread Index from the FWI system: 0.0 to 56.10
temp: Temperature in Celsius degrees: 2.2 to 33.30
RH: Relative humidity in percentage: 15.0 to 100
wind: Wind speed in km/h: 0.40 to 9.40
rain: Outside rain in mm/m2: 0.0 to 6.4
area: The burned area of the forest (in ha): 0.00 to 1090.84

Each row describes a single fire: its location in the park and the conditions recorded for it. Higher water presence is typically associated with less fire spread, so we might expect the water-related variables (DMC and rain) to be related to area.

III. Exploratory Data Analysis (EDA)

1. Data Overview

An overview of the distribution of numerical variables:

summary(forest_fires)

##        X               Y          month               day           
##  Min.   :1.000   Min.   :2.0   Length:517         Length:517        
##  1st Qu.:3.000   1st Qu.:4.0   Class :character   Class :character  
##  Median :4.000   Median :4.0   Mode  :character   Mode  :character  
##  Mean   :4.669   Mean   :4.3                                        
##  3rd Qu.:7.000   3rd Qu.:5.0                                        
##  Max.   :9.000   Max.   :9.0                                        
##       FFMC            DMC              DC             ISI        
##  Min.   :18.70   Min.   :  1.1   Min.   :  7.9   Min.   : 0.000  
##  1st Qu.:90.20   1st Qu.: 68.6   1st Qu.:437.7   1st Qu.: 6.500  
##  Median :91.60   Median :108.3   Median :664.2   Median : 8.400  
##  Mean   :90.64   Mean   :110.9   Mean   :547.9   Mean   : 9.022  
##  3rd Qu.:92.90   3rd Qu.:142.4   3rd Qu.:713.9   3rd Qu.:10.800  
##  Max.   :96.20   Max.   :291.3   Max.   :860.6   Max.   :56.100  
##       temp             RH              wind            rain        
##  Min.   : 2.20   Min.   : 15.00   Min.   :0.400   Min.   :0.00000  
##  1st Qu.:15.50   1st Qu.: 33.00   1st Qu.:2.700   1st Qu.:0.00000  
##  Median :19.30   Median : 42.00   Median :4.000   Median :0.00000  
##  Mean   :18.89   Mean   : 44.29   Mean   :4.018   Mean   :0.02166  
##  3rd Qu.:22.80   3rd Qu.: 53.00   3rd Qu.:4.900   3rd Qu.:0.00000  
##  Max.   :33.30   Max.   :100.00   Max.   :9.400   Max.   :6.40000  
##       area        
##  Min.   :   0.00  
##  1st Qu.:   0.00  
##  Median :   0.52  
##  Mean   :  12.85  
##  3rd Qu.:   6.57  
##  Max.   :1090.84

Check for missing values in each column

colSums(is.na(forest_fires))

##     X     Y month   day  FFMC   DMC    DC   ISI  temp    RH  wind  rain  area 
##     0     0     0     0     0     0     0     0     0     0     0     0     0

There is no missing data.
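
The same check can be written in tidyverse style. This is only a sketch on a toy tibble with one deliberately missing value; `toy`, `na_counts`, and `has_missing` are illustrative names that do not appear elsewhere in this report.

```r
library(tidyverse)

# Toy data with one deliberately missing value (illustrative only).
toy <- tibble(temp = c(18.0, NA, 22.5), wind = c(4.0, 2.7, 4.9))

# Count missing values per column, keeping the result as a one-row tibble.
na_counts <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

# A single TRUE/FALSE answer for the whole data frame.
has_missing <- anyNA(toy)
```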

2. Data Processing

month and day are character variables. We’ll convert these variables into factors so that they’ll be sorted into the correct order when we plot them.

forest_fires %>% pull(month) %>% unique()

##  [1] "mar" "oct" "aug" "sep" "apr" "jun" "jul" "feb" "jan" "dec" "may" "nov"

forest_fires %>% pull(day) %>% unique()

## [1] "fri" "tue" "sat" "sun" "mon" "wed" "thu"

Let’s assume that Monday is the first day of the week.

month_order <- c("jan", "feb", "mar",
                 "apr", "may", "jun",
                 "jul", "aug", "sep",
                 "oct", "nov", "dec")

dow_order <- c("mon", "tue", "wed", "thu", "fri", "sat", "sun")

forest_fires <- forest_fires %>% 
  mutate(month = factor(month, levels = month_order),
         day = factor(day, levels = dow_order))
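
To see why the conversion matters, the short sketch below compares how a character vector and a factor are sorted. The vectors are toy values and the names `months_chr`, `month_levels`, and `months_fct` are assumptions made for this example.

```r
# Character vectors sort alphabetically; factors sort by their level order.
months_chr <- c("mar", "oct", "aug", "jan")
month_levels <- c("jan", "feb", "mar", "apr", "may", "jun",
                  "jul", "aug", "sep", "oct", "nov", "dec")
months_fct <- factor(months_chr, levels = month_levels)

sorted_chr <- sort(months_chr)                 # alphabetical order
sorted_fct <- as.character(sort(months_fct))   # calendar order

# A value that is not among the levels silently becomes NA,
# so it is worth checking for NA after the conversion.
has_unmatched <- anyNA(factor(c("jan", "sept"), levels = month_levels))
```

ggplot2 uses the level order for a discrete axis, which is why the bar charts below run from January to December rather than alphabetically.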

3. Analyzing Data

a. Forest Fires Frequency

When Do Most Forest Fires Occur?

By Month:

fires_by_month <- forest_fires %>%
  group_by(month) %>%
  summarize(total_fires = n())

fires_by_month %>% 
  ggplot(aes(x = month, y = total_fires)) +
  geom_col(fill ="lightcoral") +
  labs(
    title = "Number of forest fires by month",
    y = "Number of fires",
    x = "Month"
  )

By Day of the Week:

fires_by_dow <- forest_fires %>%
  group_by(day) %>%
  summarize(total_fires = n())

fires_by_dow %>% 
  ggplot(aes(x = day, y = total_fires)) +
  geom_col(fill = "coral") +
  labs(
    title = "Number of forest fires by day of the week",
    y = "Number of fires",
    x = "Day of the week"
  )

We see a massive spike in fires in August and September, as well as a smaller spike in March. Fires seem to be more frequent on the weekend.
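
The weekend impression can be turned into a number. The sketch below uses a toy tibble (illustrative values only; `toy_fires`, `dow_levels`, and `weekend_share` are hypothetical names) and also shows `count()`, which is shorthand for the `group_by()` plus `summarize(n())` pattern used above.

```r
library(tidyverse)

# Toy data (illustrative values only), with day stored as a factor in week order.
dow_levels <- c("mon", "tue", "wed", "thu", "fri", "sat", "sun")
toy_fires <- tibble(
  day = factor(c("sat", "sun", "mon", "sun", "fri"), levels = dow_levels)
)

# count() is shorthand for group_by() + summarize(n()).
toy_by_dow <- toy_fires %>%
  count(day, name = "total_fires")

# Share of fires that fall on Saturday or Sunday.
weekend_share <- mean(toy_fires$day %in% c("sat", "sun"))
```

Applied to `forest_fires`, a share well above 2/7 would support the claim that fires are over-represented on weekends.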

b. Correlation between numerical variables

correlation_matrix <- cor(forest_fires[, sapply(forest_fires, is.numeric)])
corrplot(correlation_matrix, method = "color")
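
A colour grid makes it hard to read off exact values, so it can help to pull out the column of correlations with area and sort it. The helper `cor_with()` below is hypothetical (it is not part of corrplot or the tidyverse) and is shown on toy data with made-up values.

```r
library(tidyverse)

# Hypothetical helper: correlation of every numeric column with `target`,
# sorted from strongest to weakest by absolute value.
cor_with <- function(data, target) {
  numeric_data <- data[, sapply(data, is.numeric)]
  cors <- cor(numeric_data)[, target]
  cors <- cors[names(cors) != target]   # drop the target's correlation with itself
  cors[order(abs(cors), decreasing = TRUE)]
}

# Toy data (illustrative values only); the character column is ignored.
toy <- tibble(
  temp = c(10, 15, 20, 25, 30),
  RH = c(80, 60, 55, 40, 30),
  area = c(0, 1, 1, 4, 9),
  month = c("jan", "mar", "jun", "aug", "sep")
)

toy_cors <- cor_with(toy, "area")
```

In this report the call would be `cor_with(forest_fires, "area")`. Because area is heavily skewed, a rank-based option such as `cor(..., method = "spearman")` may be worth comparing.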

c. Plotting Other Variables Against Time

forest_fires_long <- forest_fires %>% 
  pivot_longer(
    cols = c("FFMC", "DMC", "DC", 
             "ISI", "temp", "RH", 
             "wind", "rain"),
    names_to = "data_col",
    values_to = "value"
  )

forest_fires_long %>% 
  ggplot(aes(x = month, y = value)) +
  geom_boxplot() +
  facet_wrap(vars(data_col), scales = "free_y") +
  labs(
    title = "Variable changes over month",
    x = "Month",
    y = "Variable value"
  )
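
The reshaping step above is easier to follow on a small example. This sketch applies the same `pivot_longer()` call to a toy tibble with two measurement columns; the values are made up and `toy_wide` / `toy_long` are illustrative names.

```r
library(tidyverse)

# Toy wide data: one row per fire, one column per variable (illustrative values).
toy_wide <- tibble(
  month = c("aug", "sep"),
  temp = c(22.0, 19.5),
  wind = c(4.0, 2.7),
  area = c(0.5, 6.4)
)

# Stack temp and wind into name/value pairs, as done for forest_fires above.
toy_long <- toy_wide %>%
  pivot_longer(
    cols = c("temp", "wind"),
    names_to = "data_col",
    values_to = "value"
  )

# toy_long has 4 rows (2 fires x 2 variables). month and area are repeated on
# every row, which lets facet_wrap(vars(data_col)) draw one panel per variable.
```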

d. Examining Forest Fire Severity

We want to see how each variable in the dataset relates to area. We can reuse the long-format version of the data created above together with facet_wrap().

forest_fires_long %>% 
  ggplot(aes(x = value, y = area)) +
  geom_point() +
  facet_wrap(vars(data_col), scales = "free_x") +
  labs(
    title = "Relationships between other variables and area burned",
    x = "Value of column",
    y = "Area burned (hectare)"
  )

e. Outlier Problems

Two rows have a burned area so large that they stretch the y-axis and make the remaining points hard to read. Let’s make a similar visualization that excludes these observations so that we can better see how each variable relates to area.

forest_fires_long %>% 
  filter(area < 300) %>% 
  ggplot(aes(x = value, y = area)) +
  geom_point() +
  facet_wrap(vars(data_col), scales = "free_x") +
  labs(
    title = "Relationships between other variables and area burned (area < 300)",
    x = "Value of column",
    y = "Area burned (hectare)"
  )
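
Dropping rows is one way to deal with the skew in area. Another common option, shown here only as a sketch on toy data, is to plot `log1p(area)` (that is, log(1 + area)) so that no fire has to be excluded. The column name `log_area` and the toy values are assumptions made for this example.

```r
library(tidyverse)

# Toy data (illustrative values only) with one very large fire.
toy_fires <- tibble(
  temp = c(12, 18, 21, 25, 30),
  area = c(0, 0.5, 6.6, 40, 1000)
)

# log1p(area) = log(1 + area); it is defined when area is 0.
toy_fires <- toy_fires %>%
  mutate(log_area = log1p(area))

area_plot <- toy_fires %>%
  ggplot(aes(x = temp, y = log_area)) +
  geom_point() +
  labs(x = "Temperature (Celsius)", y = "log(1 + area burned)")
```

The trade-off is interpretability: the y-axis no longer reads directly in hectares, but large and small fires can share one panel.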

Appendix

Variable changes over time

Create box plots to visualize the distribution of the following variables by month

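create_boxplots <- function(x, y) {
  # x and y are column names passed as strings; .data[[ ]] looks them up
  # (aes_string() is deprecated in current ggplot2)
  ggplot(data = forest_fires) + 
    aes(x = .data[[x]], y = .data[[y]]) +
    geom_boxplot() +
    labs(x = x, y = y) +
    theme(panel.background = element_rect(fill = "gray"))
}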
# Assign x and y names
x_var_month <- names(forest_fires)[3] # month
x_var_day <- names(forest_fires)[4] # day
y_var <- names(forest_fires)[5:12]

## use map2() to apply the function to each variable of interest (the single x name is recycled)
month_box <- map2(x_var_month, y_var, create_boxplots) ## visualize variables by month
month_box

[Output: a list of eight box plots by month, one each for FFMC, DMC, DC, ISI, temp, RH, wind, and rain]

Create box plots to visualize the distribution of the following variables by day of the week

# create_boxplots(), x_var_day, and y_var are reused from the previous section

## use map2() to apply the function to each variable of interest (the single x name is recycled)
day_box <- map2(x_var_day, y_var, create_boxplots) ## visualize variables by day
day_box

[Output: a list of eight box plots by day of the week, one each for FFMC, DMC, DC, ISI, temp, RH, wind, and rain]

Relationships between other variables and area burned

Create scatter plots to visualize the relationships between other variables and area burned

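create_scatter <- function(x, y) {
  # x and y are column names passed as strings; .data[[ ]] looks them up
  # (aes_string() is deprecated in current ggplot2)
  ggplot(data = forest_fires) + 
    aes(x = .data[[x]], y = .data[[y]]) +
    geom_point() +
    labs(x = x, y = y) +
    theme(panel.background = element_rect(fill = "white"))
}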
# Assign x and y names
x_var <- names(forest_fires)[13] # area burned
y_var <- names(forest_fires)[5:12]

## use map2() to apply the function to each variable of interest (the single x name is recycled)
scatter_plot <- map2(x_var, y_var, create_scatter) 
scatter_plot

[Output: a list of eight scatter plots with area on the x-axis, one each for FFMC, DMC, DC, ISI, temp, RH, wind, and rain on the y-axis]

Exploratory Visualization of Forest Fire Data

Hoang Nguyen