Forest fires are a critical environmental issue that causes significant ecological and economic damage. This project aims to predict the occurrence of forest fires from fire weather indices using machine learning techniques.
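The dataset contains 244 observations from two Canadian regions, “Cordillera” and “Hudson Bay.” It includes the following features: day, month, year, Temperature, RH, Ws, Rain, FFMC, DMC, DC, ISI, BUI, FWI, and Classes (the target variable).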
This section reads and preprocesses the forest fire dataset. The
read.csv() function loads the data from a CSV file into R. With
header = FALSE, the first row is not treated as column names, so we
can set the column names manually in the following steps. We use
as.character(unlist(forest_fires[2, ])) to extract the second row
and assign it as the column names.
The na.omit() function removes any rows that contain missing values
at that point. Because it runs before the type conversion, it cannot
catch values that only become NA when as.numeric() fails to parse
them. rownames(forest_fires_cleaned) <- NULL resets the row indices,
which are no longer consecutive after rows are dropped. The trimws()
function then removes leading and trailing whitespace from the column
names, while make.names() ensures that column names are unique and
valid.
Lastly, the mutate() function from the dplyr package converts the character columns (e.g., Temperature, RH, Rain) to numeric and the Classes column to a factor. This conversion is needed for the subsequent analysis and modeling steps: summary statistics, correlations, and most models require numeric predictors, and classification functions typically expect a factor target.
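The name-cleaning step described above can be tried in isolation. The sketch below uses the raw column names as they are printed later in this report; the second call uses made-up names (an assumption, not taken from the dataset) to show what make.names() does with duplicates and invalid characters.

```r
# Column names as they appear in the raw file (note the stray spaces)
raw_names <- c("day", "month", "year", "Temperature", " RH", " Ws", "Rain ",
               "FFMC", "DMC", "DC", "ISI", "BUI", "FWI", "Classes ")

# Trim whitespace first, then make the names syntactically valid and unique
clean_names <- make.names(trimws(raw_names), unique = TRUE)
print(clean_names)

# Hypothetical names: a duplicate and a name containing a space
repaired <- make.names(c("RH", "RH", "wind speed"), unique = TRUE)
print(repaired)
```

Trimming before make.names() matters: applied to the untrimmed names, make.names() would replace the spaces with dots instead of removing them.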
# Load necessary libraries
library(readr) # Provides read_csv(); this report uses base R's read.csv() below
library(ggplot2) # For data visualization
library(dplyr) # For data manipulation
# Load the dataset
forest_fires <- read.csv("forest_fires_dataset.csv", header = FALSE, stringsAsFactors = FALSE)
# Use the second row as column names
colnames(forest_fires) <- as.character(unlist(forest_fires[2, ]))
# Remove unnecessary rows (1, 2, 125, 126, 127)
forest_fires_cleaned <- forest_fires[-c(1, 2, 125, 126, 127), ]
forest_fires_cleaned <- na.omit(forest_fires_cleaned)
# Reset row indices
rownames(forest_fires_cleaned) <- NULL
colnames(forest_fires_cleaned)
## [1] "day" "month" "year" "Temperature" " RH"
## [6] " Ws" "Rain " "FFMC" "DMC" "DC"
## [11] "ISI" "BUI" "FWI" "Classes "
# Remove leading/trailing whitespace from column names
colnames(forest_fires_cleaned) <- trimws(colnames(forest_fires_cleaned))
# Ensure unique and valid column names (in case there are duplicates after trimming)
colnames(forest_fires_cleaned) <- make.names(colnames(forest_fires_cleaned), unique = TRUE)
# Convert character columns to numeric where appropriate
forest_fires_cleaned <- forest_fires_cleaned %>%
  mutate(
    Temperature = as.numeric(Temperature),
    RH = as.numeric(RH),
    Ws = as.numeric(Ws),
    Rain = as.numeric(Rain),
    FFMC = as.numeric(FFMC),
    DMC = as.numeric(DMC),
    DC = as.numeric(DC),
    ISI = as.numeric(ISI),
    BUI = as.numeric(BUI),
    FWI = as.numeric(FWI),
    Classes = factor(Classes) # Convert Classes to factor (target variable)
  )
We start by examining the structure and first few rows of the dataset to get an overview of its contents:
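The code that produced this overview is not shown in the report; it was presumably a pair of calls such as str() and head(). A minimal sketch, using a small stand-in data frame (fires_demo, a hypothetical name) built from the first three rows printed in this report:

```r
# Tiny stand-in for forest_fires_cleaned, built from rows shown in this report
fires_demo <- data.frame(
  day = c("1", "2", "3"),
  month = c("6", "6", "6"),
  year = c("2012", "2012", "2012"),
  Temperature = c(29, 29, 26),
  RH = c(57, 61, 82),
  Classes = factor(c("not fire", "not fire", "not fire")),
  stringsAsFactors = FALSE
)

str(fires_demo)          # column types and a preview of the values
head(fires_demo, n = 2)  # first rows of the data frame

# day, month, and year stay character unless converted explicitly;
# as.integer() is one option if they are needed as numbers
fires_demo$day <- as.integer(fires_demo$day)
fires_demo$month <- as.integer(fires_demo$month)
fires_demo$year <- as.integer(fires_demo$year)
```

The date columns remain character in the full dataset because the mutate() call does not touch them.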
## 'data.frame': 244 obs. of 14 variables:
## $ day : chr "1" "2" "3" "4" ...
## $ month : chr "6" "6" "6" "6" ...
## $ year : chr "2012" "2012" "2012" "2012" ...
## $ Temperature: num 29 29 26 25 27 31 33 30 25 28 ...
## $ RH : num 57 61 82 89 77 67 54 73 88 79 ...
## $ Ws : num 18 13 22 13 16 14 13 15 13 12 ...
## $ Rain : num 0 1.3 13.1 2.5 0 0 0 0 0.2 0 ...
## $ FFMC : num 65.7 64.4 47.1 28.6 64.8 82.6 88.2 86.6 52.9 73.2 ...
## $ DMC : num 3.4 4.1 2.5 1.3 3 5.8 9.9 12.1 7.9 9.5 ...
## $ DC : num 7.6 7.6 7.1 6.9 14.2 22.2 30.5 38.3 38.8 46.3 ...
## $ ISI : num 1.3 1 0.3 0 1.2 3.1 6.4 5.6 0.4 1.3 ...
## $ BUI : num 3.4 3.9 2.7 1.7 3.9 7 10.9 13.5 10.5 12.6 ...
## $ FWI : num 0.5 0.4 0.1 0 0.5 2.5 7.2 7.1 0.3 0.9 ...
## $ Classes : Factor w/ 9 levels "","fire","fire ",..: 7 7 7 7 7 4 4 4 7 7 ...
## day month year Temperature RH Ws Rain FFMC DMC DC ISI BUI FWI Classes
## 1 1 6 2012 29 57 18 0.0 65.7 3.4 7.6 1.3 3.4 0.5 not fire
## 2 2 6 2012 29 61 13 1.3 64.4 4.1 7.6 1.0 3.9 0.4 not fire
## 3 3 6 2012 26 82 22 13.1 47.1 2.5 7.1 0.3 2.7 0.1 not fire
## 4 4 6 2012 25 89 13 2.5 28.6 1.3 6.9 0.0 1.7 0.0 not fire
## 5 5 6 2012 27 77 16 0.0 64.8 3.0 14.2 1.2 3.9 0.5 not fire
## 6 6 6 2012 31 67 14 0.0 82.6 5.8 22.2 3.1 7.0 2.5 fire
Next, we calculate summary statistics for the dataset, which provide an overview of the distribution of the numerical variables:
## day month year Temperature
## Length:244 Length:244 Length:244 Min. :22.00
## Class :character Class :character Class :character 1st Qu.:30.00
## Mode :character Mode :character Mode :character Median :32.00
## Mean :32.17
## 3rd Qu.:35.00
## Max. :42.00
##
## RH Ws Rain FFMC
## Min. :21.00 Min. : 6.0 Min. : 0.0000 Min. :28.60
## 1st Qu.:52.00 1st Qu.:14.0 1st Qu.: 0.0000 1st Qu.:72.08
## Median :63.00 Median :15.0 Median : 0.0000 Median :83.50
## Mean :61.94 Mean :15.5 Mean : 0.7607 Mean :77.89
## 3rd Qu.:73.25 3rd Qu.:17.0 3rd Qu.: 0.5000 3rd Qu.:88.30
## Max. :90.00 Max. :29.0 Max. :16.8000 Max. :96.00
##
## DMC DC ISI BUI
## Min. : 0.70 Min. : 6.90 Min. : 0.000 Min. : 1.10
## 1st Qu.: 5.80 1st Qu.: 12.35 1st Qu.: 1.400 1st Qu.: 6.00
## Median :11.30 Median : 33.10 Median : 3.500 Median :12.25
## Mean :14.67 Mean : 49.43 Mean : 4.774 Mean :16.66
## 3rd Qu.:20.75 3rd Qu.: 69.10 3rd Qu.: 7.300 3rd Qu.:22.52
## Max. :65.90 Max. :220.40 Max. :19.000 Max. :68.00
## NA's :1
## FWI Classes
## Min. : 0.000 fire :131
## 1st Qu.: 0.700 not fire :101
## Median : 4.200 fire : 4
## Mean : 7.035 fire : 2
## 3rd Qu.:11.450 not fire : 2
## Max. :31.100 : 1
## NA's :1 (Other) : 3
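The summary lists "fire" and "not fire" several times, and the structure printout reports 9 factor levels, which suggests that the labels differ only in surrounding whitespace (plus one empty label). A minimal sketch of normalizing such labels before building the factor; the input vector is hypothetical and only mimics the variants seen above:

```r
# Hypothetical labels mimicking the whitespace variants reported above
classes_raw <- c("fire", "fire ", "not fire", "not fire   ", "fire   ", "")

# Strip surrounding whitespace so equivalent labels collapse to one value
classes_trimmed <- trimws(classes_raw)

# Treat empty labels as missing rather than as a level of their own
classes_trimmed[classes_trimmed == ""] <- NA

# Build the factor with the two intended levels
classes_clean <- factor(classes_trimmed, levels = c("not fire", "fire"))

print(levels(classes_clean))
print(table(classes_clean, useNA = "ifany"))
```

Applied to the real data, the same idea would go inside the mutate() call, for example Classes = factor(trimws(Classes)), with any empty or unexpected labels inspected before modeling.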
We check for missing values in each column to ensure that we handle them appropriately:
## day month year Temperature RH Ws
## 0 0 0 0 0 0
## Rain FFMC DMC DC ISI BUI
## 0 0 0 1 0 0
## FWI Classes
## 1 0
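The counts above were presumably produced with something like colSums(is.na(...)); the original call is not shown. DC and FWI each have one missing value even though na.omit() was called earlier, because the missing values only appear once as.numeric() fails to parse a string. The sketch below reproduces that ordering effect on hypothetical values:

```r
# Hypothetical raw data: every cell is a string, one DC value is not a number
demo <- data.frame(
  DC  = c("7.6", "bad value", "22.2"),
  FWI = c("0.5", "", "2.5"),
  stringsAsFactors = FALSE
)

# na.omit() before conversion drops nothing: no cell is NA yet
demo_complete <- na.omit(demo)

# Conversion turns unparseable strings into NA (warnings silenced for the demo)
demo_num <- data.frame(
  DC  = suppressWarnings(as.numeric(demo$DC)),
  FWI = suppressWarnings(as.numeric(demo$FWI))
)

# Count missing values per column
na_counts <- colSums(is.na(demo_num))
print(na_counts)

# Dropping incomplete rows after conversion removes the affected row
demo_final <- na.omit(demo_num)
print(nrow(demo_final))
```

Running na.omit() after the type conversion, or inspecting the offending rows by hand, is one way to handle these cases; which is appropriate depends on why the values failed to parse.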
A histogram is used to visualize the distribution of temperature values:
ggplot(forest_fires_cleaned, aes(x = Temperature)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  ggtitle("Temperature Distribution")
We calculate and visualize the correlation between numerical variables
to identify any strong relationships:
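library(corrplot) # For correlation plots
# DC and FWI each contain an NA, so restrict the correlation to complete rows
numeric_columns <- sapply(forest_fires_cleaned, is.numeric)
correlation_matrix <- cor(forest_fires_cleaned[, numeric_columns], use = "complete.obs")
corrplot(correlation_matrix, method = "circle")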
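Besides reading the plot, strong relationships can be listed directly from the matrix. The sketch below is one way to do that, not the report's own method: it uses a small stand-in built from the first six rows shown in this report and an arbitrary cutoff of 0.8 on the absolute correlation (both the cutoff and the names demo, cm, and strong_pairs are assumptions).

```r
# Small numeric stand-in built from the first six rows shown in this report
demo <- data.frame(
  Temperature = c(29, 29, 26, 25, 27, 31),
  RH          = c(57, 61, 82, 89, 77, 67),
  FFMC        = c(65.7, 64.4, 47.1, 28.6, 64.8, 82.6),
  FWI         = c(0.5, 0.4, 0.1, 0.0, 0.5, 2.5)
)

# Pearson correlations, using only rows without missing values
cm <- cor(demo, use = "complete.obs")

# Keep each pair once (upper triangle) and flag |r| above the assumed cutoff
threshold <- 0.8
idx <- which(upper.tri(cm) & abs(cm) > threshold, arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(strong_pairs)
```

With only six rows the coefficients are not meaningful in themselves; on the full dataset the same steps would be applied to correlation_matrix.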