Analysis of Forest Fires Dataset

Introduction

The Forest Fires dataset contains records of forest fires in Portugal's Montesinho park from 2000 to 2003, including meteorological and other factors that may influence how fires spread. Using multiple linear regression, I analyze how these factors relate to the size of the burned area in hectares, using “area” as the target variable.

Link: https://archive.ics.uci.edu/ml/datasets/Forest+Fires

Data Exploration

Variable Description

Below is a list of variables in the Dataset with their descriptions:

Variable	Description
X	x-axis spatial coordinate within the Montesinho park map: 1 to 9
Y	y-axis spatial coordinate within the Montesinho park map: 2 to 9
month	month of the year: ‘jan’ to ‘dec’
day	day of the week: ‘mon’ to ‘sun’
FFMC	Fine Fuel Moisture Code (FFMC) rating from the FWI system: 18.7 to 96.20
DMC	Duff Moisture Code (DMC) rating from the FWI system: 1.1 to 291.3
DC	Drought Code (DC) rating from the FWI system: 7.9 to 860.6
ISI	Initial Spread Index (ISI) rating from the FWI system: 0.0 to 56.10
temp	temperature in Celsius degrees: 2.2 to 33.30
RH	relative humidity in %: 15.0 to 100
wind	wind speed in km/h: 0.40 to 9.40
rain	outside rain in mm/m2: 0.0 to 6.4
area	the burned area of the forest (in ha): 0.00 to 1090.84

Step by step Code Analysis

Loaded all the libraries required for the multiple linear regression model
Loaded the Forest Fires Dataset and explored all the variables

First 10 rows of the dataset

##    X Y month day FFMC   DMC    DC  ISI temp RH wind rain area
## 1  7 5   mar fri 86.2  26.2  94.3  5.1  8.2 51  6.7  0.0    0
## 2  7 4   oct tue 90.6  35.4 669.1  6.7 18.0 33  0.9  0.0    0
## 3  7 4   oct sat 90.6  43.7 686.9  6.7 14.6 33  1.3  0.0    0
## 4  8 6   mar fri 91.7  33.3  77.5  9.0  8.3 97  4.0  0.2    0
## 5  8 6   mar sun 89.3  51.3 102.2  9.6 11.4 99  1.8  0.0    0
## 6  8 6   aug sun 92.3  85.3 488.0 14.7 22.2 29  5.4  0.0    0
## 7  8 6   aug mon 92.3  88.9 495.6  8.5 24.1 27  3.1  0.0    0
## 8  8 6   aug mon 91.5 145.4 608.2 10.7  8.0 86  2.2  0.0    0
## 9  8 6   sep tue 91.0 129.5 692.6  7.0 13.1 63  5.4  0.0    0
## 10 7 5   sep sat 92.5  88.0 698.6  7.1 22.8 40  4.0  0.0    0

Column Names

##  [1] "X"     "Y"     "month" "day"   "FFMC"  "DMC"   "DC"    "ISI"   "temp" 
## [10] "RH"    "wind"  "rain"  "area"

In Summary

Loaded several R packages using the library() function. Then, loaded the “forestfires” dataset by reading in a CSV file using the read.csv() function. Lastly, displayed the first few rows of the dataset using the head() function.
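
The loading step can be sketched in base R as follows. This is a minimal illustration, not the exact code behind the report: it writes three rows from the listing above to a temporary CSV so that it runs without the real file, and the names csv_path and forestfires are assumptions.

```r
# Illustrative stand-in for the downloaded CSV: three rows copied from the
# listing above, written to a temporary file (the real file path is not
# given in the report).
csv_lines <- c(
  "X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area",
  "7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0",
  "7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0",
  "7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0"
)
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, csv_path)

# Read the file; keep month and day as character columns
forestfires <- read.csv(csv_path, stringsAsFactors = FALSE)

print(head(forestfires, 10))  # first rows (at most 10)
print(names(forestfires))     # column names
```

With the real dataset, csv_path would simply point at the downloaded file.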

Data Pre Processing

Pre Processing Data

Separated the non-numeric columns from the numeric columns for use in the regression model

##    X Y FFMC   DMC    DC  ISI temp RH wind rain area
## 1  7 5 86.2  26.2  94.3  5.1  8.2 51  6.7  0.0    0
## 2  7 4 90.6  35.4 669.1  6.7 18.0 33  0.9  0.0    0
## 3  7 4 90.6  43.7 686.9  6.7 14.6 33  1.3  0.0    0
## 4  8 6 91.7  33.3  77.5  9.0  8.3 97  4.0  0.2    0
## 5  8 6 89.3  51.3 102.2  9.6 11.4 99  1.8  0.0    0
## 6  8 6 92.3  85.3 488.0 14.7 22.2 29  5.4  0.0    0
## 7  8 6 92.3  88.9 495.6  8.5 24.1 27  3.1  0.0    0
## 8  8 6 91.5 145.4 608.2 10.7  8.0 86  2.2  0.0    0
## 9  8 6 91.0 129.5 692.6  7.0 13.1 63  5.4  0.0    0
## 10 7 5 92.5  88.0 698.6  7.1 22.8 40  4.0  0.0    0

## [1] "month" "day"

The month and day columns have been removed from this numeric subset
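
One way to separate the columns is sketched below, assuming a small stand-in data frame with the same column types as the dataset. The names is_num, numeric_data and non_numeric_cols are hypothetical.

```r
# Small stand-in for the dataset (values taken from the rows shown earlier)
forestfires <- data.frame(
  X = c(7, 7, 8), Y = c(5, 4, 6),
  month = c("mar", "oct", "aug"), day = c("fri", "tue", "sun"),
  temp = c(8.2, 18.0, 22.2), area = c(0, 0, 0),
  stringsAsFactors = FALSE
)

# TRUE for every numeric column, FALSE otherwise
is_num <- sapply(forestfires, is.numeric)

numeric_data <- forestfires[, is_num, drop = FALSE]  # numeric columns only
non_numeric_cols <- names(forestfires)[!is_num]      # names of the rest

print(head(numeric_data, 10))
print(non_numeric_cols)
```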

Checked for missing values as part of the preprocessing, using a summary of every column

##        X               Y          month               day           
##  Min.   :1.000   Min.   :2.0   Length:517         Length:517        
##  1st Qu.:3.000   1st Qu.:4.0   Class :character   Class :character  
##  Median :4.000   Median :4.0   Mode  :character   Mode  :character  
##  Mean   :4.669   Mean   :4.3                                        
##  3rd Qu.:7.000   3rd Qu.:5.0                                        
##  Max.   :9.000   Max.   :9.0                                        
##       FFMC            DMC              DC             ISI        
##  Min.   :18.70   Min.   :  1.1   Min.   :  7.9   Min.   : 0.000  
##  1st Qu.:90.20   1st Qu.: 68.6   1st Qu.:437.7   1st Qu.: 6.500  
##  Median :91.60   Median :108.3   Median :664.2   Median : 8.400  
##  Mean   :90.64   Mean   :110.9   Mean   :547.9   Mean   : 9.022  
##  3rd Qu.:92.90   3rd Qu.:142.4   3rd Qu.:713.9   3rd Qu.:10.800  
##  Max.   :96.20   Max.   :291.3   Max.   :860.6   Max.   :56.100  
##       temp             RH              wind            rain        
##  Min.   : 2.20   Min.   : 15.00   Min.   :0.400   Min.   :0.00000  
##  1st Qu.:15.50   1st Qu.: 33.00   1st Qu.:2.700   1st Qu.:0.00000  
##  Median :19.30   Median : 42.00   Median :4.000   Median :0.00000  
##  Mean   :18.89   Mean   : 44.29   Mean   :4.018   Mean   :0.02166  
##  3rd Qu.:22.80   3rd Qu.: 53.00   3rd Qu.:4.900   3rd Qu.:0.00000  
##  Max.   :33.30   Max.   :100.00   Max.   :9.400   Max.   :6.40000  
##       area        
##  Min.   :   0.00  
##  1st Qu.:   0.00  
##  Median :   0.52  
##  Mean   :  12.85  
##  3rd Qu.:   6.57  
##  Max.   :1090.84

No missing (NA) values were found
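
Besides reading the summary, missing values can be counted directly. The sketch below uses a toy data frame that deliberately contains one NA, to show what a hit would look like; the real dataset produced none.

```r
# Toy data with one missing temperature, for illustration only
toy <- data.frame(temp = c(8.2, 18.0, NA), RH = c(51, 33, 33))

# Number of missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Single TRUE/FALSE answer for the whole data frame
print(anyNA(toy))
```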

Checked for outliers in the dataset and plotted the data before and after removing them. Each variable is plotted against the area variable.

Before removal of outliers

After removal of outliers

In the plots, the x-axis represents one of the variables (e.g., X, Y, FFMC, etc.), and the y-axis represents the area variable. Each dot represents one observation in the dataset.

Before removing the outliers, the plots show that there are several points that are far from the rest of the observations, which are likely to be outliers. After removing the outliers, the range of the y-axis has been reduced, and the remaining observations have a more consistent relationship with the variables shown on the x-axis.

Removing the outliers is important because, when I ran the model before removing them, they significantly affected the results of my analysis. Outliers can bias the regression coefficients or cause the model to fit the data poorly. By removing them, I can obtain more stable estimates of the regression coefficients and a better fit of the model to the remaining data.
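
The report does not state which rule was used to flag outliers. As an assumption, the sketch below uses the common 1.5 × IQR rule on the area variable; the helper is_iqr_outlier and the toy data are hypothetical.

```r
# Hypothetical helper: TRUE where a value lies outside the k * IQR fences
# (the 1.5 * IQR rule is an assumption, not the report's confirmed method)
is_iqr_outlier <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

# Toy data: six small fires and one very large one
toy <- data.frame(
  temp = c(18, 20, 22, 19, 21, 23, 20),
  area = c(0.5, 1.2, 0.0, 2.1, 1.0, 0.8, 1090.84)
)

keep <- !is_iqr_outlier(toy$area)
cleaned <- toy[keep, ]

print(nrow(toy))      # rows before removal
print(nrow(cleaned))  # rows after removal
```

Dropping rows this way changes the population the model describes, so the cleaned model only speaks for fires within the retained range.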

Split the data into training and test set. The code for this operation sets a seed for reproducibility and then randomly splits the dataset into training and testing sets using the sample() function. My training set contains 70% of the data and the testing set contains the remaining 30%.
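
A 70/30 split with sample() might look like the sketch below. The seed value and the toy data frame are assumptions; any fixed seed gives a reproducible split.

```r
set.seed(123)  # seed value is an assumption; any fixed integer works

# Toy data frame standing in for the cleaned dataset
toy <- data.frame(id = 1:100, area = (0:99) / 10)

n <- nrow(toy)
# Randomly pick 70% of the row indices for training
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train <- toy[train_idx, ]   # 70% of the rows
test  <- toy[-train_idx, ]  # remaining 30%

print(dim(train))
print(dim(test))
```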

Created a correlation matrix, which shows the degree to which pairs of variables vary together. Each correlation coefficient ranges from -1 to +1, with -1 indicating a perfect negative correlation (i.e., when one variable increases, the other decreases), +1 indicating a perfect positive correlation (i.e., when one variable increases, the other also increases), and 0 indicating no linear correlation between the variables.

The correlation matrix is useful for identifying which variables are positively or negatively correlated with each other.

Based on the matrix above, you can see that there are several squares with different colors, representing the correlation between pairs of variables. The brighter the color, the stronger the correlation between the two variables.

Based on the output, I tried omitting FFMC, DMC, and DC because of their correlation with temperature; however, the accuracy of the model remained the same. Note that correlation does not imply causation.
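
The correlation matrix itself can be computed with base R's cor(). The sketch below uses the temp, RH and wind values from the first six rows shown earlier; the plotting package used for the colored matrix is not named in the report, so only the numbers are printed here.

```r
# temp, RH and wind from the first six rows of the dataset
toy <- data.frame(
  temp = c(8.2, 18.0, 14.6, 8.3, 11.4, 22.2),
  RH   = c(51, 33, 33, 97, 99, 29),
  wind = c(6.7, 0.9, 1.3, 4.0, 1.8, 5.4)
)

# Pairwise Pearson correlations (the default method of cor())
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))
```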

Fitted the multiple linear regression model using the lm() function, where the response variable is “area” and all other variables in the dataset, including month and day, are predictors. The summary() function prints a summary of the model to the console, including the coefficient estimates and the statistical significance of each predictor. Finally, model$coefficients displays the coefficient estimates.

## # A tibble: 28 × 5
##    term     estimate std.error statistic p.value
##    <chr>       <dbl>     <dbl>     <dbl>   <dbl>
##  1 monthdec    6.73      2.40      2.81  0.00525
##  2 monthmay   -3.47      4.08     -0.851 0.395  
##  3 monthaug   -3.03      2.37     -1.28  0.203  
##  4 monthjul   -2.78      2.08     -1.34  0.182  
##  5 monthsep   -2.67      2.67     -1.00  0.317  
##  6 monthjan   -2.67      3.33     -0.801 0.423  
##  7 monthnov   -2.61      4.04     -0.648 0.518  
##  8 monthoct   -2.42      2.83     -0.854 0.393  
##  9 monthjun   -2.16      1.88     -1.15  0.250  
## 10 daytue      0.994     0.687     1.45  0.149  
## # … with 18 more rows

The output lists the coefficient estimates, which appear sorted by absolute size in descending order, together with the significance of each. A positive coefficient means the predictor is associated with a larger burned area when the other predictors are held fixed, while a negative coefficient means the opposite. The p-value indicates the statistical significance of each coefficient; among the rows shown, only monthdec (p = 0.00525) falls below the usual 0.05 threshold.

The output also shows the overall performance of the model, including the residuals (difference between predicted and actual values) and the multiple R-squared, which is the proportion of variance in the dependent variable (area) that is explained by the independent variables. In this case, the R-squared value is low (0.07995), indicating that the model does not explain much of the variance in the data.
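
The fitting step described above can be sketched as follows. The data here are synthetic and the column set is reduced, so the numbers will not match the report; the point is only the shape of the lm() call and how a sorted coefficient table can be built in base R.

```r
set.seed(1)  # arbitrary seed for the synthetic data
n <- 60
train <- data.frame(
  temp  = runif(n, 2, 33),
  wind  = runif(n, 0.4, 9.4),
  month = rep(c("aug", "sep", "mar"), length.out = n)
)
# Synthetic response: a weak linear signal plus noise
train$area <- 0.05 * train$temp + 0.2 * train$wind + rnorm(n)

# area is the response; every other column is a predictor.
# Character columns such as month are expanded into dummy variables.
model <- lm(area ~ ., data = train)
print(summary(model))

# Coefficient table sorted by absolute estimate, largest first
coefs <- as.data.frame(summary(model)$coefficients)
coefs <- coefs[order(abs(coefs$Estimate), decreasing = TRUE), ]
print(coefs)
```

A categorical predictor with k levels contributes k - 1 coefficients, which is why the month and day columns account for most of the 28 terms in the report's output.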

Ran an evaluation of the performance of the model

## [1] 0.07995134

The R-squared value is 0.07995134, meaning that only about 8.0% of the variance in the dependent variable is explained by the independent variables included in the multiple linear regression model.
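
For reference, the R-squared reported by summary() equals one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this on synthetic data; the variable names are assumptions.

```r
set.seed(2)  # arbitrary seed for the synthetic data
x <- runif(50)
y <- 2 * x + rnorm(50)
fit <- lm(y ~ x)

# R-squared as reported by summary()
r2_summary <- summary(fit)$r.squared

# The same quantity from its definition: 1 - RSS / TSS
rss <- sum(residuals(fit)^2)     # residual sum of squares
tss <- sum((y - mean(y))^2)      # total sum of squares
r2_manual <- 1 - rss / tss

print(c(summary = r2_summary, manual = r2_manual))
```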

Predicted the area of the forest fires on the test data from the multiple predictor variables and evaluated the performance

##        1        3        6        8        9       12 
## 1.080448 1.907757 1.756031 1.209989 2.817131 2.595459

## [1] 2.299128

## [1] 2.951788

Evaluated the performance of my model by calculating RMSE, MAE, and R-squared using the actual area values in the test dataset and the predicted values obtained from the model.

RMSE and MAE both measure the typical size of the difference between the actual and predicted area values, with lower values indicating better performance; RMSE penalizes large errors more heavily than MAE. The R-squared value indicates the proportion of variance in area that is explained by the independent variables, with higher values indicating a better fit of the model.
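
The three metrics can be computed from a vector of actual values and a vector of predictions, as sketched below. The numbers are made up for illustration, and the R-squared formula used here (one minus the ratio of squared errors to total variation) is an assumption about how the test-set value was obtained.

```r
# Made-up actual and predicted areas, for illustration only
actual    <- c(0.0, 1.5, 3.2, 0.4, 6.0)
predicted <- c(1.1, 1.9, 1.8, 1.2, 2.8)

errors <- actual - predicted

rmse <- sqrt(mean(errors^2))  # root mean squared error
mae  <- mean(abs(errors))     # mean absolute error
# R-squared on held-out data; this form can be negative for a poor model
r2 <- 1 - sum(errors^2) / sum((actual - mean(actual))^2)

print(c(RMSE = rmse, MAE = mae, R2 = r2))
```

With a fitted model, predicted would come from predict(model, newdata = test) and actual from test$area.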

John Paul Chirwa

23/04/2023