Introduction

Forest fires pose a significant global challenge, resulting in the loss of millions of hectares each year. The detection and forecasting of forest fires emerge as crucial considerations in mitigating the impact of these disasters.

Brief Overview of The Project

The dataset includes 244 instances that regroup a data of two regions of Algeria,namely the Bejaia region located in the northeast of Algeria and the Sidi Bel-abbes region located in the northwest of Algeria.

The features of the dataset are as follows:
Temperature: temperature noon (temperature max) in Celsius degrees
RH: Relative Humidity in %  Ws: Wind speed in km/h
Rain: total day in mm
FFMC: Fine Fuel Moisture Code (FFMC) index from the FWI system
DMC: Duff Moisture Code (DMC) index from the FWI system
DC: Drought Code (DC) index from the FWI system
ISI: Initial Spread Index (ISI) index from the FWI system
BUI: Buildup Index (BUI) index from the FWI system
FWI: Fire Weather Index (FWI)
122 instances for each region. The period from June 2012 to September 2012.
The dataset comes from:https://archive.ics.uci.edu/dataset/547/algerian+forest+fires+dataset

Objectives of The Predictive Analysis

Objective 1: predict whether a fire will occur (using classification methods)
Objective 2: predict potential fire danger situations (using regression methods)

Analysis

Objective 1

Objective 1: predict whether a fire will occur (using classification methods)

Data Cleaning and Preprocessing

# Load the data
data <- read.csv("C:/Users/Zee/Desktop/FF/Data.csv")
# Examine the structure of the dataset
str(data)
## 'data.frame':    247 obs. of  14 variables:
##  $ day        : chr  "1" "2" "3" "4" ...
##  $ month      : chr  "6" "6" "6" "6" ...
##  $ year       : chr  "2012" "2012" "2012" "2012" ...
##  $ Temperature: chr  "29" "29" "26" "25" ...
##  $ RH         : chr  "57" "61" "82" "89" ...
##  $ Ws         : chr  "18" "13" "22" "13" ...
##  $ Rain       : chr  "0" "1.3" "13.1" "2.5" ...
##  $ FFMC       : chr  "65.7" "64.4" "47.1" "28.6" ...
##  $ DMC        : chr  "3.4" "4.1" "2.5" "1.3" ...
##  $ DC         : chr  "7.6" "7.6" "7.1" "6.9" ...
##  $ ISI        : chr  "1.3" "1" "0.3" "0" ...
##  $ BUI        : chr  "3.4" "3.9" "2.7" "1.7" ...
##  $ FWI        : chr  "0.5" "0.4" "0.1" "0" ...
##  $ Classes    : chr  "not fire   " "not fire   " "not fire   " "not fire   " ...
# Identify the row where Location B is labeled
location_B_row <- which(data$day == "Sidi-Bel Abbes Region Dataset")
# Print rows around the identified label for inspection
print(data[(location_B_row - 2):(location_B_row + 2), ])
##                               day month year Temperature  RH  Ws  Rain FFMC DMC
## 122                            30     9 2012          25  78  14   1.4   45 1.9
## 123                                                                            
## 124 Sidi-Bel Abbes Region Dataset                                              
## 125                           day month year Temperature  RH  Ws Rain  FFMC DMC
## 126                             1     6 2012          32  71  12   0.7 57.1 2.5
##      DC ISI BUI FWI     Classes
## 122 7.5 0.2 2.4 0.1 not fire   
## 123                            
## 124                            
## 125  DC ISI BUI FWI   Classes  
## 126 8.2 0.6 2.8 0.2 not fire
# Check consistency of columns
print(colnames(data))
##  [1] "day"         "month"       "year"        "Temperature" "RH"         
##  [6] "Ws"          "Rain"        "FFMC"        "DMC"         "DC"         
## [11] "ISI"         "BUI"         "FWI"         "Classes"
# Check for missing values
print(sapply(data, function(x) sum(is.na(x))))
##         day       month        year Temperature          RH          Ws 
##           0           0           0           0           0           0 
##        Rain        FFMC         DMC          DC         ISI         BUI 
##           0           0           0           0           0           0 
##         FWI     Classes 
##           0           0
# Find the row where B location is mentioned
location_B_row <- which(data$day == "Sidi-Bel Abbes Region Dataset")
# Combine data from A and B
combined_data <- rbind(data[1:(location_B_row - 2), ], data[(location_B_row + 1):nrow(data), ])
# Check the structure of the combined data
str(combined_data)
## 'data.frame':    245 obs. of  14 variables:
##  $ day        : chr  "1" "2" "3" "4" ...
##  $ month      : chr  "6" "6" "6" "6" ...
##  $ year       : chr  "2012" "2012" "2012" "2012" ...
##  $ Temperature: chr  "29" "29" "26" "25" ...
##  $ RH         : chr  "57" "61" "82" "89" ...
##  $ Ws         : chr  "18" "13" "22" "13" ...
##  $ Rain       : chr  "0" "1.3" "13.1" "2.5" ...
##  $ FFMC       : chr  "65.7" "64.4" "47.1" "28.6" ...
##  $ DMC        : chr  "3.4" "4.1" "2.5" "1.3" ...
##  $ DC         : chr  "7.6" "7.6" "7.1" "6.9" ...
##  $ ISI        : chr  "1.3" "1" "0.3" "0" ...
##  $ BUI        : chr  "3.4" "3.9" "2.7" "1.7" ...
##  $ FWI        : chr  "0.5" "0.4" "0.1" "0" ...
##  $ Classes    : chr  "not fire   " "not fire   " "not fire   " "not fire   " ...
# Export the combined data to a CSV file
write.csv(combined_data, "combined_data.csv", row.names = FALSE)
# Load the data
combined_data <- read.csv("combined_data.csv")
str(combined_data)
## 'data.frame':    245 obs. of  14 variables:
##  $ day        : chr  "1" "2" "3" "4" ...
##  $ month      : chr  "6" "6" "6" "6" ...
##  $ year       : chr  "2012" "2012" "2012" "2012" ...
##  $ Temperature: chr  "29" "29" "26" "25" ...
##  $ RH         : chr  "57" "61" "82" "89" ...
##  $ Ws         : chr  "18" "13" "22" "13" ...
##  $ Rain       : chr  "0" "1.3" "13.1" "2.5" ...
##  $ FFMC       : chr  "65.7" "64.4" "47.1" "28.6" ...
##  $ DMC        : chr  "3.4" "4.1" "2.5" "1.3" ...
##  $ DC         : chr  "7.6" "7.6" "7.1" "6.9" ...
##  $ ISI        : chr  "1.3" "1" "0.3" "0" ...
##  $ BUI        : chr  "3.4" "3.9" "2.7" "1.7" ...
##  $ FWI        : chr  "0.5" "0.4" "0.1" "0" ...
##  $ Classes    : chr  "not fire   " "not fire   " "not fire   " "not fire   " ...
summary(combined_data)
##      day               month               year           Temperature       
##  Length:245         Length:245         Length:245         Length:245        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##       RH                 Ws                Rain               FFMC          
##  Length:245         Length:245         Length:245         Length:245        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      DMC                 DC                ISI                BUI           
##  Length:245         Length:245         Length:245         Length:245        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      FWI              Classes         
##  Length:245         Length:245        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character
# Check for missing values
missing_values <- sapply(combined_data, function(x) sum(is.na(x)))
print(missing_values)
##         day       month        year Temperature          RH          Ws 
##           0           0           0           0           0           0 
##        Rain        FFMC         DMC          DC         ISI         BUI 
##           0           0           0           0           0           0 
##         FWI     Classes 
##           0           0
# Convert numeric variables to appropriate types
numeric_vars <- c("Temperature", "RH", "Ws", "Rain", "FFMC", "DMC", "DC", "ISI", "BUI", "FWI")
combined_data[, numeric_vars] <- lapply(combined_data[, numeric_vars], as.numeric)
## Warning in lapply(combined_data[, numeric_vars], as.numeric): NAs introduced by
## coercion

## Warning in lapply(combined_data[, numeric_vars], as.numeric): NAs introduced by
## coercion

## Warning in lapply(combined_data[, numeric_vars], as.numeric): NAs introduced by
## coercion

## Warning in lapply(combined_data[, numeric_vars], as.numeric): NAs introduced by
## coercion

## Warning in lapply(combined_data[, numeric_vars], as.numeric): NAs introduced by
## coercion

## Warning in lapply(combined_data[, numeric_vars], as.numeric): NAs introduced by
## coercion

## Warning in lapply(combined_data[, numeric_vars], as.numeric): NAs introduced by
## coercion

## Warning in lapply(combined_data[, numeric_vars], as.numeric): NAs introduced by
## coercion

## Warning in lapply(combined_data[, numeric_vars], as.numeric): NAs introduced by
## coercion

## Warning in lapply(combined_data[, numeric_vars], as.numeric): NAs introduced by
## coercion
# Clean up classes column
combined_data$Classes <- trimws(combined_data$Classes)  
# Remove leading and trailing whitespaces
# Check for outliers or inconsistencies
# Boxplot for numeric variables
numeric_vars <- c("Temperature", "RH", "Ws", "Rain", "FFMC", "DMC", "DC", "ISI", "BUI", "FWI")
par(mfrow = c(2, 5))  
for (var in numeric_vars) {
  boxplot(combined_data[[var]], main = var, col = "lightblue", border = "black")
}

# Histogram for numeric variables
par(mfrow = c(2, 5))  # Reset the plot layout
for (var in numeric_vars) {
  hist(combined_data[[var]], main = var, col = "lightgreen", border = "black")
}

# Save the cleaned data
write.csv(combined_data, "cleaned_combined_data.csv", row.names = FALSE)
# Check for missing values
missing_values <- sapply(combined_data, function(x) sum(is.na(x)))
print(missing_values)
##         day       month        year Temperature          RH          Ws 
##           0           0           0           1           1           1 
##        Rain        FFMC         DMC          DC         ISI         BUI 
##           1           1           1           2           1           1 
##         FWI     Classes 
##           2           0
# Handle missing values if present
# Impute missing values for numeric variables with mean
numeric_vars <- c("Temperature", "RH", "Ws", "Rain", "FFMC", "DMC", "DC", "ISI", "BUI", "FWI")
for (var in numeric_vars) {
  combined_data[, var][is.na(combined_data[, var])] <- mean(combined_data[, var], na.rm = TRUE)
}
# Impute missing values for categorical variables with mode
categorical_vars <- c("day", "month", "year", "Classes")
for (var in categorical_vars) {
  mode_val <- names(sort(table(combined_data[, var]), decreasing = TRUE))[1]
  combined_data[, var][is.na(combined_data[, var])] <- mode_val
}
# Check for missing values again
missing_values <- sapply(combined_data, function(x) sum(is.na(x)))
print(missing_values)
##         day       month        year Temperature          RH          Ws 
##           0           0           0           0           0           0 
##        Rain        FFMC         DMC          DC         ISI         BUI 
##           0           0           0           0           0           0 
##         FWI     Classes 
##           0           0
# Exclude "day," "month," and "year"
combined_data <- combined_data[, !names(combined_data) %in% c("day", "month", "year")]
# Confirm the updated structure
str(combined_data)
## 'data.frame':    245 obs. of  11 variables:
##  $ Temperature: num  29 29 26 25 27 31 33 30 25 28 ...
##  $ RH         : num  57 61 82 89 77 67 54 73 88 79 ...
##  $ Ws         : num  18 13 22 13 16 14 13 15 13 12 ...
##  $ Rain       : num  0 1.3 13.1 2.5 0 0 0 0 0.2 0 ...
##  $ FFMC       : num  65.7 64.4 47.1 28.6 64.8 82.6 88.2 86.6 52.9 73.2 ...
##  $ DMC        : num  3.4 4.1 2.5 1.3 3 5.8 9.9 12.1 7.9 9.5 ...
##  $ DC         : num  7.6 7.6 7.1 6.9 14.2 22.2 30.5 38.3 38.8 46.3 ...
##  $ ISI        : num  1.3 1 0.3 0 1.2 3.1 6.4 5.6 0.4 1.3 ...
##  $ BUI        : num  3.4 3.9 2.7 1.7 3.9 7 10.9 13.5 10.5 12.6 ...
##  $ FWI        : num  0.5 0.4 0.1 0 0.5 2.5 7.2 7.1 0.3 0.9 ...
##  $ Classes    : chr  "not fire" "not fire" "not fire" "not fire" ...

Exploratory Data Analysis

# Load necessary libraries for plotting
library(ggplot2)
# EDA for numeric variables
numeric_vars <- c("Temperature", "RH", "Ws", "Rain", "FFMC", "DMC", "DC", "ISI", "BUI", "FWI")
numeric_data <- combined_data[, numeric_vars]
# Pairwise scatterplots
pairs(numeric_data)

# Boxplots for each numeric variable by Classes
for (var in numeric_vars) {
  ggplot(combined_data, aes(x = Classes, y = get(var))) +
    geom_boxplot(fill = "lightblue", color = "black") +
    labs(title = var, x = "Classes", y = var)
}
# Bar plot for the target variable
ggplot(combined_data, aes(x = Classes)) +
  geom_bar(fill = "lightcoral") +
  labs(title = "Distribution of Classes", x = "Classes", y = "Count")

Modeling and Evaluation

# Install and load the ROSE package
# install.packages("ROSE", repos = "https://cloud.r-project.org")
library(ROSE)
## Loaded ROSE 0.0-4
# Clean up classes column (remove leading and trailing whitespaces)
combined_data$Classes <- trimws(combined_data$Classes)
# Remove empty and "Classes" levels
combined_data$Classes <- factor(combined_data$Classes, levels = c("fire", "not fire"))
# Check the levels of the target variable
levels(combined_data$Classes)
## [1] "fire"     "not fire"
# Perform oversampling
oversampled_data <- ovun.sample(Classes ~ ., data = combined_data, method = "over", 
                                N = 2 * max(table(combined_data$Classes)), seed = 42)$data
# Check the distribution of the target variable after oversampling
table(oversampled_data$Classes)
## 
##     fire not fire 
##      137      137
# Load the caret package
# install.packages("caret", repos = "https://cloud.r-project.org")
library(caret)
## Loading required package: lattice
# Set the seed for reproducibility
set.seed(42)
# Create an index for splitting the data (70% training, 30% testing)
index <- createDataPartition(oversampled_data$Classes, p = 0.7, list = FALSE)
# Split the data
train_data <- oversampled_data[index, ]
test_data <- oversampled_data[-index, ]
# Check the dimensions of the training and testing sets
dim(train_data)
## [1] 192  11
dim(test_data)
## [1] 82 11
# Load the randomForest package
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Specify the formula for the model
formula <- Classes ~ .
# Train the Random Forest model
rf_model <- randomForest(formula, data = train_data, ntree = 100, importance = TRUE)
# Print the model details
print(rf_model)
## 
## Call:
##  randomForest(formula = formula, data = train_data, ntree = 100,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 3.12%
## Confusion matrix:
##          fire not fire class.error
## fire       93        3     0.03125
## not fire    3       93     0.03125
# Make predictions on the test set
predictions <- predict(rf_model, newdata = test_data)
# Evaluate the model performance
confusion_matrix <- confusionMatrix(predictions, test_data$Classes)
print(confusion_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fire not fire
##   fire       41        1
##   not fire    0       40
##                                           
##                Accuracy : 0.9878          
##                  95% CI : (0.9339, 0.9997)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9756          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9756          
##          Pos Pred Value : 0.9762          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.5000          
##          Detection Rate : 0.5000          
##    Detection Prevalence : 0.5122          
##       Balanced Accuracy : 0.9878          
##                                           
##        'Positive' Class : fire            
## 

Objective 2

Objective 2: predict potential fire danger situations (using regression methods)

Data Cleaning and Preprocessing

# Load the data
df <- read.csv("C:/Users/Zee/Documents/FF_o2/cleaned_combined_data.csv")
str(df)
## 'data.frame':    244 obs. of  14 variables:
##  $ day        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ month      : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ year       : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ Temperature: int  29 29 26 25 27 31 33 30 25 28 ...
##  $ RH         : int  57 61 82 89 77 67 54 73 88 79 ...
##  $ Ws         : int  18 13 22 13 16 14 13 15 13 12 ...
##  $ Rain       : num  0 1.3 13.1 2.5 0 0 0 0 0.2 0 ...
##  $ FFMC       : num  65.7 64.4 47.1 28.6 64.8 82.6 88.2 86.6 52.9 73.2 ...
##  $ DMC        : num  3.4 4.1 2.5 1.3 3 5.8 9.9 12.1 7.9 9.5 ...
##  $ DC         : num  7.6 7.6 7.1 6.9 14.2 22.2 30.5 38.3 38.8 46.3 ...
##  $ ISI        : num  1.3 1 0.3 0 1.2 3.1 6.4 5.6 0.4 1.3 ...
##  $ BUI        : num  3.4 3.9 2.7 1.7 3.9 7 10.9 13.5 10.5 12.6 ...
##  $ FWI        : num  0.5 0.4 0.1 0 0.5 2.5 7.2 7.1 0.3 0.9 ...
##  $ Classes    : chr  "not fire" "not fire" "not fire" "not fire" ...
summary(df)
##       day            month          year       Temperature          RH       
##  Min.   : 1.00   Min.   :6.0   Min.   :2012   Min.   :22.00   Min.   :21.00  
##  1st Qu.: 8.00   1st Qu.:7.0   1st Qu.:2012   1st Qu.:30.00   1st Qu.:52.00  
##  Median :16.00   Median :7.5   Median :2012   Median :32.00   Median :63.00  
##  Mean   :15.75   Mean   :7.5   Mean   :2012   Mean   :32.17   Mean   :61.94  
##  3rd Qu.:23.00   3rd Qu.:8.0   3rd Qu.:2012   3rd Qu.:35.00   3rd Qu.:73.25  
##  Max.   :31.00   Max.   :9.0   Max.   :2012   Max.   :42.00   Max.   :90.00  
##        Ws            Rain              FFMC            DMC       
##  Min.   : 6.0   Min.   : 0.0000   Min.   :28.60   Min.   : 0.70  
##  1st Qu.:14.0   1st Qu.: 0.0000   1st Qu.:72.08   1st Qu.: 5.80  
##  Median :15.0   Median : 0.0000   Median :83.50   Median :11.30  
##  Mean   :15.5   Mean   : 0.7607   Mean   :77.89   Mean   :14.67  
##  3rd Qu.:17.0   3rd Qu.: 0.5000   3rd Qu.:88.30   3rd Qu.:20.75  
##  Max.   :29.0   Max.   :16.8000   Max.   :96.00   Max.   :65.90  
##        DC              ISI              BUI             FWI       
##  Min.   :  6.90   Min.   : 0.000   Min.   : 1.10   Min.   : 0.00  
##  1st Qu.: 11.55   1st Qu.: 1.400   1st Qu.: 6.00   1st Qu.: 0.70  
##  Median : 33.10   Median : 3.500   Median :12.25   Median : 4.45  
##  Mean   : 49.28   Mean   : 4.774   Mean   :16.66   Mean   : 7.03  
##  3rd Qu.: 68.15   3rd Qu.: 7.300   3rd Qu.:22.52   3rd Qu.:11.38  
##  Max.   :220.40   Max.   :19.000   Max.   :68.00   Max.   :31.10  
##    Classes         
##  Length:244        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
library(stringr)
df$Classes <- ifelse(str_detect(df$Classes, "not fire"), 0, 1)
col_names <- colnames(df)
drop_cols <- match(c('month','day','year'), col_names)
df <- df[,-drop_cols]

Exploratory Data Analysis

# Data manipulation
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Indepent and dependent features
X <- df %>% dplyr::select(-FWI)
y <- df$FWI
# Model selection and graphics drawing packages
library(caret)
library(ggplot2)
# Divide the training set and test set
set.seed(42)
split <- createDataPartition(y, p = 0.25, list = FALSE)
X_train <- X[split, ]
X_test <- X[-split, ]
y_train <- y[split]
y_test <- y[-split]
dim(X_train)
## [1] 63 10
dim(X_test)
## [1] 181  10
# install.packages(c("ggplot2", "reshape2"))
library(ggplot2)
library(reshape2)
corr <- cor(X_train)
# Heat maps for plotting correlation coefficient matrices
ggplot(melt(corr), aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1, 1), space = "Lab", 
                       name = "Pearson\nCorrelation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                   size = 12, hjust = 1),
        axis.text.y = element_text(size = 12)) +
  coord_fixed()

# Feature scaling or standardization
X_train_scaled <- scale(X_train)
train_mean <- attr(X_train_scaled, "scaled:center")
train_sd <- attr(X_train_scaled, "scaled:scale")
X_test_scaled <- scale(X_test, center = train_mean, scale = train_sd)
#str(X_test_scaled)
#summary(X_test_scaled)
library(ggplot2)
# Standard scaling
X_train_scaled <- scale(X_train)
X_test_scaled <- scale(X_test)
# Box plots
par(mfrow = c(1, 2), mar = c(5, 4, 2, 1))
boxplot(X_train, main = "X_train before scaling", col = "lightblue")
boxplot(X_train_scaled, main = "X_train after scaling", col = "lightgreen")

par(mfrow = c(1, 1), mar = c(5, 4, 4, 2) + 0.1)

Modeling and Evaluation

# install.packages("lattice")
# Load the lattice package
library(lattice)
# Linear regression
library(caret)
train_data <- cbind.data.frame(y_train, X_train_scaled)
linreg <- lm(y_train ~ ., data = train_data)
test_data <- cbind.data.frame(y_test, X_test_scaled)
y_pred <- predict(linreg, newdata = test_data)
mae <- mean(abs(y_test - y_pred))
score <- cor(y_test, y_pred)^2
cat("Linear Regression :Mean absolute error:", mae, "\n")
## Linear Regression :Mean absolute error: 1.157709
cat("Linear Regression :R-squared score:", score, "\n")
## Linear Regression :R-squared score: 0.9561771
plot(y_test, y_pred, main = "Linear Regression : Scatter plot of y_test vs. y_pred",
     xlab = "y_test", ylab = "y_pred", pch = 16)
abline(0, 1, col = "red", lty = 2)

# Lasso regression
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
lasso <- cv.glmnet(as.matrix(X_train_scaled), as.vector(y_train), alpha = 1)
y_pred <- predict(lasso, newx = as.matrix(X_test_scaled), s = "lambda.min")
mae <- mean(abs(y_test - y_pred))
score <- cor(y_test, y_pred)^2
cat("Lasso Regression : Mean absolute error:", mae, "\n")
## Lasso Regression : Mean absolute error: 1.130073
cat("Lasso Regression : R-squared score:", score, "\n")
## Lasso Regression : R-squared score: 0.9592434
plot(y_test, y_pred, main = "Lasso Regression : Scatter plot of y_test vs. y_pred",
     xlab = "y_test", ylab = "y_pred", pch = 16)
abline(0, 1, col = "red", lty = 2)

# Lasso regression model with cross-validated alpha selection
# install.packages("Metrics")
library(Metrics)
## 
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
## 
##     precision, recall
lassocv <- cv.glmnet(as.matrix(X_train_scaled), as.vector(y_train), alpha = 1)
mse_path <- as.matrix(lassocv$cvm)
matplot(log(lassocv$lambda), mse_path, type = "l", xlab = "log(lambda)", ylab = "Mean Squared Error",main = "Lasso Cross-Validated Mean Squared Error Path")

# Identify the optimal lambda (minimize cross-validated MSE)
#best_lambda <- lassocv$lambda.min
#cat("Optimal Lambda:", best_lambda, "\n")
y_pred <- predict(lassocv, newx = as.matrix(X_test_scaled), s = "lambda.min", type = "response")
mae <- mae(y_test, y_pred)
score <- cor(y_test, y_pred)^2
cat("Lasso Regression(Alpha) : Mean absolute error:", mae, "\n")
## Lasso Regression(Alpha) : Mean absolute error: 1.131542
cat("Lasso Regression(Alpha) : R-squared score:", score, "\n")
## Lasso Regression(Alpha) : R-squared score: 0.9591337
plot(y_pred, y_test, main = "Lasso Regression(Alpha) : Scatter plot of y_pred vs. y_test",
     xlab = "y_pred", ylab = "y_test", pch = 16)
abline(0, 1, col = "red", lty = 2)  

#Ridge regression
library(glmnet)
ridge <- cv.glmnet(as.matrix(X_train_scaled), as.vector(y_train), alpha = 0)
y_pred <- predict(ridge, newx = as.matrix(X_test_scaled), s = "lambda.min")
mae <- mean(abs(y_test - y_pred))
score <- cor(y_test, y_pred)^2
cat("Ridge Regression : Mean absolute error:", mae, "\n")
## Ridge Regression : Mean absolute error: 1.306729
cat("Ridge Regression : R-squared score:", score, "\n")
## Ridge Regression : R-squared score: 0.9500863
plot(y_test, y_pred, main = "Ridge Regression : Scatter plot of y_test vs. y_pred",
     xlab = "y_test", ylab = "y_pred", pch = 16)
abline(0, 1, col = "red", lty = 2)

#Elastic Net regression
library(glmnet)
elastic <- cv.glmnet(as.matrix(X_train_scaled), as.vector(y_train), alpha = 0.5)
y_pred <- predict(elastic, newx = as.matrix(X_test_scaled), s = "lambda.min")
mae <- mean(abs(y_test - y_pred))
score <- cor(y_test, y_pred)^2
cat("Elastic Net Regression : Mean absolute error:", mae, "\n")
## Elastic Net Regression : Mean absolute error: 1.136838
cat("Elastic Net Regression : R-squared score:", score, "\n")
## Elastic Net Regression : R-squared score: 0.9588857
plot(y_test, y_pred, main = "Elastic Net Regression : Scatter plot of y_test vs. y_pred",
     xlab = "y_test", ylab = "y_pred", pch = 16)
abline(0, 1, col = "red", lty = 2) 

Conclusion

By combining the results of classification and regression models, forest management authorities can gain a comprehensive understanding of fire risk. By predicting whether and how fires will occur, we can inform proactive measures to help prevent fires and manage forest areas more effectively, thereby contributing to overall forest safety and conservation.

The results of our project can be further applied to: risk assessment for forest management, planning controlled burns and fire resource allocation, and providing early warning to communities about potential fire hazards.

Of course, there is also room for subsequent project improvements: incorporating contextualised data on human activities and land-use change, exploring the use of advanced spatial analysis methods for localised risk assessment, and regularly updating models with new data to adapt to changing environmental conditions.