Introduction

Forest fires are a critical environmental issue that causes significant ecological and economic damage. This project aims to predict the occurrence of forest fires from fire weather indices using machine learning techniques.


Dataset

Dataset Description

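The dataset contains 244 observations from two Canadian regions, “Cordillera” and “Hudson Bay.” It includes the following features, with value ranges as given in the dataset description (the summary statistics computed later in this report differ slightly for FFMC, DMC, DC, and ISI):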

  1. Date: Day, month (‘June’ to ‘September’), year (2012).
  2. Temperature: Maximum noon temperature in degrees Celsius (22 to 42).
  3. RH: Relative Humidity in % (21 to 90).
  4. Ws: Wind speed in km/h (6 to 29).
  5. Rain: Total rain in mm for the day (0 to 16.8).
  6. FWI Components:
    • Fine Fuel Moisture Code (FFMC): 28.6 to 92.5.
    • Duff Moisture Code (DMC): 1.1 to 65.9.
    • Drought Code (DC): 7 to 220.4.
    • Initial Spread Index (ISI): 0 to 18.5.
    • Build-up Index (BUI): 1.1 to 68.
    • Fire Weather Index (FWI): 0 to 31.1.
  7. Classes: Two classes, “fire” and “not fire.”

Data Reading and Preprocessing

This section reads and preprocesses the forest fire dataset. The read.csv() function loads the data from a CSV file into R. Setting header = FALSE means the first row is not treated as column names, so the names can be set manually: as.character(unlist(forest_fires[2, ])) extracts the second row, which is then assigned as the column names. The unnecessary rows (1, 2, 125, 126, and 127) are then removed.

The na.omit() function removes any rows that contain missing values (NA). Because it runs while every column is still of type character, it cannot catch entries that only become NA during the later numeric conversion; the missing-value check in the EDA section shows one such NA in each of DC and FWI. rownames(forest_fires_cleaned) <- NULL resets the row indices after the rows have been removed. The trimws() function then removes leading and trailing whitespace from the column names, and make.names() ensures that the column names are unique and syntactically valid.

Lastly, mutate() from the dplyr package converts the measurement columns (Temperature, RH, Ws, Rain, and the FWI components) from character to numeric and converts Classes to a factor. This conversion matters for the subsequent analysis and modeling steps: summary statistics and correlations require numeric columns, and classification functions in R generally expect the target variable as a factor. The day, month, and year columns are left as character.

# Load necessary libraries
library(readr)    # CSV utilities (note: read.csv() used below is base R)
library(ggplot2)  # For data visualization
library(dplyr)    # For data manipulation

# Load the dataset
forest_fires <- read.csv("forest_fires_dataset.csv", header = FALSE, stringsAsFactors = FALSE)

# Use the second row as column names
colnames(forest_fires) <- as.character(unlist(forest_fires[2, ]))

# Remove unnecessary rows (1, 2, 125, 126, 127)
forest_fires_cleaned <- forest_fires[-c(1, 2, 125, 126, 127), ]
# Drop rows containing NA (all columns are still character at this point)
forest_fires_cleaned <- na.omit(forest_fires_cleaned)

# Reset row indices
rownames(forest_fires_cleaned) <- NULL
colnames(forest_fires_cleaned)
##  [1] "day"         "month"       "year"        "Temperature" " RH"        
##  [6] " Ws"         "Rain "       "FFMC"        "DMC"         "DC"         
## [11] "ISI"         "BUI"         "FWI"         "Classes  "
# Remove leading/trailing whitespace from column names
colnames(forest_fires_cleaned) <- trimws(colnames(forest_fires_cleaned))

# Ensure unique and valid column names (in case there are duplicates after trimming)
colnames(forest_fires_cleaned) <- make.names(colnames(forest_fires_cleaned), unique = TRUE)
# Convert character columns to numeric where appropriate
forest_fires_cleaned <- forest_fires_cleaned %>%
  mutate(
    Temperature = as.numeric(Temperature),
    RH = as.numeric(RH), 
    Ws = as.numeric(Ws),  
    Rain = as.numeric(Rain),
    FFMC = as.numeric(FFMC),
    DMC = as.numeric(DMC),
    DC = as.numeric(DC),
    ISI = as.numeric(ISI),
    BUI = as.numeric(BUI),
    FWI = as.numeric(FWI),
    Classes = factor(Classes)  # Convert Classes to factor (target variable)
  )
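
The effect of the column-name cleanup can be checked in isolation. The sketch below applies trimws() and make.names() to the raw header strings printed by colnames() above. The vector raw_names is a hand-copied stand-in for colnames(forest_fires_cleaned) and is not part of the original script.

```r
# Raw header strings as printed by colnames() above (copied by hand for illustration)
raw_names <- c("day", "month", "year", "Temperature", " RH", " Ws", "Rain ",
               "FFMC", "DMC", "DC", "ISI", "BUI", "FWI", "Classes  ")

# Strip leading/trailing whitespace, then make the names syntactically valid and unique
clean_names <- make.names(trimws(raw_names), unique = TRUE)
print(clean_names)

# Without trimming first, make.names() replaces the stray spaces with dots
untrimmed_names <- make.names(raw_names, unique = TRUE)
print(untrimmed_names[14])
```

Trimming before make.names() is what keeps names such as RH and Classes free of dots, so they can be referenced directly in the mutate() call.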

Exploratory Data Analysis (EDA)

Data Overview

We start by examining the structure and first few rows of the dataset to get an overview of its contents:

str(forest_fires_cleaned)
## 'data.frame':    244 obs. of  14 variables:
##  $ day        : chr  "1" "2" "3" "4" ...
##  $ month      : chr  "6" "6" "6" "6" ...
##  $ year       : chr  "2012" "2012" "2012" "2012" ...
##  $ Temperature: num  29 29 26 25 27 31 33 30 25 28 ...
##  $ RH         : num  57 61 82 89 77 67 54 73 88 79 ...
##  $ Ws         : num  18 13 22 13 16 14 13 15 13 12 ...
##  $ Rain       : num  0 1.3 13.1 2.5 0 0 0 0 0.2 0 ...
##  $ FFMC       : num  65.7 64.4 47.1 28.6 64.8 82.6 88.2 86.6 52.9 73.2 ...
##  $ DMC        : num  3.4 4.1 2.5 1.3 3 5.8 9.9 12.1 7.9 9.5 ...
##  $ DC         : num  7.6 7.6 7.1 6.9 14.2 22.2 30.5 38.3 38.8 46.3 ...
##  $ ISI        : num  1.3 1 0.3 0 1.2 3.1 6.4 5.6 0.4 1.3 ...
##  $ BUI        : num  3.4 3.9 2.7 1.7 3.9 7 10.9 13.5 10.5 12.6 ...
##  $ FWI        : num  0.5 0.4 0.1 0 0.5 2.5 7.2 7.1 0.3 0.9 ...
##  $ Classes    : Factor w/ 9 levels "","fire","fire ",..: 7 7 7 7 7 4 4 4 7 7 ...
head(forest_fires_cleaned)
##   day month year Temperature RH Ws Rain FFMC DMC   DC ISI BUI FWI     Classes
## 1   1     6 2012          29 57 18  0.0 65.7 3.4  7.6 1.3 3.4 0.5 not fire   
## 2   2     6 2012          29 61 13  1.3 64.4 4.1  7.6 1.0 3.9 0.4 not fire   
## 3   3     6 2012          26 82 22 13.1 47.1 2.5  7.1 0.3 2.7 0.1 not fire   
## 4   4     6 2012          25 89 13  2.5 28.6 1.3  6.9 0.0 1.7 0.0 not fire   
## 5   5     6 2012          27 77 16  0.0 64.8 3.0 14.2 1.2 3.9 0.5 not fire   
## 6   6     6 2012          31 67 14  0.0 82.6 5.8 22.2 3.1 7.0 2.5     fire
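
The str() output reports nine factor levels for Classes even though the dataset has only two classes, because labels such as "fire" and "fire " differ only in surrounding whitespace. The following is a minimal sketch of one way to normalize them, assuming the extra levels are whitespace variants plus an empty label. The toy vector raw_classes is illustrative and not taken from the data file.

```r
# Toy labels imitating the whitespace variants seen in the factor levels (illustrative)
raw_classes <- c("not fire   ", "fire", "fire ", " not fire", "", "fire   ")

# Trim whitespace and treat empty labels as missing
clean_classes <- trimws(raw_classes)
clean_classes[clean_classes == ""] <- NA

# Rebuild the factor with exactly the two expected levels
classes <- factor(clean_classes, levels = c("not fire", "fire"))

print(levels(classes))
print(table(classes, useNA = "ifany"))
```

Applied to the real data, the same trimws() call could be placed inside the mutate() step, before factor() is applied to Classes.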

Next, we compute summary statistics for the dataset, which give an overview of the distribution of each numerical variable:

summary(forest_fires_cleaned)
##      day               month               year            Temperature   
##  Length:244         Length:244         Length:244         Min.   :22.00  
##  Class :character   Class :character   Class :character   1st Qu.:30.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :32.00  
##                                                           Mean   :32.17  
##                                                           3rd Qu.:35.00  
##                                                           Max.   :42.00  
##                                                                          
##        RH              Ws            Rain              FFMC      
##  Min.   :21.00   Min.   : 6.0   Min.   : 0.0000   Min.   :28.60  
##  1st Qu.:52.00   1st Qu.:14.0   1st Qu.: 0.0000   1st Qu.:72.08  
##  Median :63.00   Median :15.0   Median : 0.0000   Median :83.50  
##  Mean   :61.94   Mean   :15.5   Mean   : 0.7607   Mean   :77.89  
##  3rd Qu.:73.25   3rd Qu.:17.0   3rd Qu.: 0.5000   3rd Qu.:88.30  
##  Max.   :90.00   Max.   :29.0   Max.   :16.8000   Max.   :96.00  
##                                                                  
##       DMC              DC              ISI              BUI       
##  Min.   : 0.70   Min.   :  6.90   Min.   : 0.000   Min.   : 1.10  
##  1st Qu.: 5.80   1st Qu.: 12.35   1st Qu.: 1.400   1st Qu.: 6.00  
##  Median :11.30   Median : 33.10   Median : 3.500   Median :12.25  
##  Mean   :14.67   Mean   : 49.43   Mean   : 4.774   Mean   :16.66  
##  3rd Qu.:20.75   3rd Qu.: 69.10   3rd Qu.: 7.300   3rd Qu.:22.52  
##  Max.   :65.90   Max.   :220.40   Max.   :19.000   Max.   :68.00  
##                  NA's   :1                                        
##       FWI                Classes   
##  Min.   : 0.000   fire       :131  
##  1st Qu.: 0.700   not fire   :101  
##  Median : 4.200   fire       :  4  
##  Mean   : 7.035   fire       :  2  
##  3rd Qu.:11.450   not fire   :  2  
##  Max.   :31.100              :  1  
##  NA's   :1        (Other)    :  3
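
For day, month, and year, summary() reports only length and class because these columns are still character. If a date is needed later, one option is to combine them into a single Date column. The sketch below uses a toy data frame named toy with the same three columns; the new column name date is an assumption, not something the original script defines.

```r
# Toy rows with the same character columns as the cleaned dataset (illustrative values)
toy <- data.frame(day = c("1", "2", "30"), month = c("6", "6", "9"),
                  year = c("2012", "2012", "2012"), stringsAsFactors = FALSE)

# Build zero-padded ISO date strings, then parse them into Date objects
iso_strings <- sprintf("%04d-%02d-%02d",
                       as.integer(toy$year), as.integer(toy$month), as.integer(toy$day))
toy$date <- as.Date(iso_strings, format = "%Y-%m-%d")

print(toy$date)
```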

We count the missing values in each column so that they can be handled appropriately. DC and FWI each contain one missing value:

colSums(is.na(forest_fires_cleaned))
##         day       month        year Temperature          RH          Ws 
##           0           0           0           0           0           0 
##        Rain        FFMC         DMC          DC         ISI         BUI 
##           0           0           0           1           0           0 
##         FWI     Classes 
##           1           0
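
These missing values appear only after the numeric conversion, which is why the earlier na.omit() call did not remove them: as.numeric() returns NA for any string it cannot parse as a number. The sketch below reproduces that behavior and drops incomplete rows after conversion. The toy data frame and its values are made up; the actual malformed entries in the file are not shown in this report.

```r
# Toy character columns; "14.6 9" and "" stand in for entries that are not valid numbers
toy <- data.frame(DC  = c("7.6", "14.6 9", "22.2"),
                  FWI = c("0.5", "", "2.5"),
                  stringsAsFactors = FALSE)

# na.omit() finds nothing to remove while the columns are still character
rows_before <- nrow(na.omit(toy))

# as.numeric() returns NA for strings it cannot parse
toy$DC  <- suppressWarnings(as.numeric(toy$DC))
toy$FWI <- suppressWarnings(as.numeric(toy$FWI))
print(colSums(is.na(toy)))

# Keep only rows that are complete after the conversion
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

Running na.omit() or complete.cases() after the type conversion, rather than before it, is one way to make sure the modeling data contains no NA values.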

Data Visualization

A histogram is used to visualize the distribution of temperature values:

ggplot(forest_fires_cleaned, aes(x = Temperature)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  ggtitle("Temperature Distribution")

We calculate and visualize the correlation between numerical variables to identify any strong relationships:

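library(corrplot)
# Select the numeric columns; use complete rows only, since DC and FWI each contain an NA
numeric_columns <- forest_fires_cleaned[, sapply(forest_fires_cleaned, is.numeric)]
correlation_matrix <- cor(numeric_columns, use = "complete.obs")
corrplot(correlation_matrix, method = "circle")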
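
Because DC and FWI each contain a missing value, cor() with its default setting (use = "everything") returns NA for every pair involving those columns. The sketch below shows the difference on two toy vectors, x and y, with made-up values.

```r
# Toy vectors; y has one missing value, like DC and FWI in the cleaned data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, NA, 8.2, 9.8)

# Default behavior: any NA propagates into the result
default_cor <- cor(x, y)

# Restrict the computation to rows where both values are present
complete_cor <- cor(x, y, use = "complete.obs")

print(default_cor)
print(complete_cor)
```

For a full correlation matrix, use = "pairwise.complete.obs" is an alternative that keeps as many observations as possible for each pair of variables.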