Introduction

The Forest Fires dataset contains records of forest fires in the Montesinho park in Portugal from 2000 to 2003, together with meteorological and other factors that may influence how fires spread. Using multiple linear regression, I analyze how these factors relate to the size of the burned area in hectares, with “area” as the target variable.

Link: https://archive.ics.uci.edu/ml/datasets/Forest+Fires

Data Exploration

Variable Description

Below is a list of variables in the Dataset with their descriptions:

Variable  Description
X         x-axis spatial coordinate within the Montesinho park map: 1 to 9
Y         y-axis spatial coordinate within the Montesinho park map: 2 to 9
month     month of the year: ‘jan’ to ‘dec’
day       day of the week: ‘mon’ to ‘sun’
FFMC      Fine Fuel Moisture Code (FFMC) rating from the FWI system: 18.7 to 96.20
DMC       Duff Moisture Code (DMC) rating from the FWI system: 1.1 to 291.3
DC        Drought Code (DC) rating from the FWI system: 7.9 to 860.6
ISI       Initial Spread Index (ISI) rating from the FWI system: 0.0 to 56.10
temp      temperature in degrees Celsius: 2.2 to 33.30
RH        relative humidity in %: 15.0 to 100
wind      wind speed in km/h: 0.40 to 9.40
rain      outside rain in mm/m2: 0.0 to 6.4
area      the burned area of the forest (in ha): 0.00 to 1090.84

Step by step Code Analysis

  1. Loaded all the required libraries for the multiple linear regression model

  2. Loaded the Forest Fires Dataset and explored all the variables

First 10 rows of the dataset

##    X Y month day FFMC   DMC    DC  ISI temp RH wind rain area
## 1  7 5   mar fri 86.2  26.2  94.3  5.1  8.2 51  6.7  0.0    0
## 2  7 4   oct tue 90.6  35.4 669.1  6.7 18.0 33  0.9  0.0    0
## 3  7 4   oct sat 90.6  43.7 686.9  6.7 14.6 33  1.3  0.0    0
## 4  8 6   mar fri 91.7  33.3  77.5  9.0  8.3 97  4.0  0.2    0
## 5  8 6   mar sun 89.3  51.3 102.2  9.6 11.4 99  1.8  0.0    0
## 6  8 6   aug sun 92.3  85.3 488.0 14.7 22.2 29  5.4  0.0    0
## 7  8 6   aug mon 92.3  88.9 495.6  8.5 24.1 27  3.1  0.0    0
## 8  8 6   aug mon 91.5 145.4 608.2 10.7  8.0 86  2.2  0.0    0
## 9  8 6   sep tue 91.0 129.5 692.6  7.0 13.1 63  5.4  0.0    0
## 10 7 5   sep sat 92.5  88.0 698.6  7.1 22.8 40  4.0  0.0    0
Column Names
##  [1] "X"     "Y"     "month" "day"   "FFMC"  "DMC"   "DC"    "ISI"   "temp" 
## [10] "RH"    "wind"  "rain"  "area"

In Summary

Loaded several R packages using the library() function. Then, loaded the “forestfires” dataset by reading in a CSV file using the read.csv() function. Lastly, displayed the first few rows of the dataset using the head() function.
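The loading step can be sketched roughly as follows. This is an illustration rather than the original script: the file name forestfires.csv in the comment is an assumption, and the inline text reuses the first three rows shown above so that the snippet runs without the file.

```r
# Sketch only: with the CSV on disk, the call would be something like
#   forestfires <- read.csv("forestfires.csv")   # file name is an assumption
# Here the first three rows shown above are read from inline text instead.
csv_text <- "X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0
7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0
7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0"

forestfires <- read.csv(text = csv_text)

head(forestfires, 10)   # first rows (at most 10)
colnames(forestfires)   # column names
```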

  3. Data Preprocessing

Pre Processing Data

Separated the non-numeric columns from the numeric columns, since the regression model uses only the numeric ones

##    X Y FFMC   DMC    DC  ISI temp RH wind rain area
## 1  7 5 86.2  26.2  94.3  5.1  8.2 51  6.7  0.0    0
## 2  7 4 90.6  35.4 669.1  6.7 18.0 33  0.9  0.0    0
## 3  7 4 90.6  43.7 686.9  6.7 14.6 33  1.3  0.0    0
## 4  8 6 91.7  33.3  77.5  9.0  8.3 97  4.0  0.2    0
## 5  8 6 89.3  51.3 102.2  9.6 11.4 99  1.8  0.0    0
## 6  8 6 92.3  85.3 488.0 14.7 22.2 29  5.4  0.0    0
## 7  8 6 92.3  88.9 495.6  8.5 24.1 27  3.1  0.0    0
## 8  8 6 91.5 145.4 608.2 10.7  8.0 86  2.2  0.0    0
## 9  8 6 91.0 129.5 692.6  7.0 13.1 63  5.4  0.0    0
## 10 7 5 92.5  88.0 698.6  7.1 22.8 40  4.0  0.0    0
## [1] "month" "day"

The month and day columns have now been removed

Checked for missing values as part of the preprocessing
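One way to separate the column types is sketched below. The toy data frame and the variable names (df, is_num, dropped, df_numeric) are assumptions made for illustration; the original code may have selected the columns differently.

```r
# Toy data frame with the same mix of column types as the dataset
df <- data.frame(
  X     = c(7, 7, 8),
  month = c("mar", "oct", "mar"),
  day   = c("fri", "tue", "fri"),
  temp  = c(8.2, 18.0, 8.3),
  area  = c(0, 0, 0),
  stringsAsFactors = FALSE
)

is_num     <- sapply(df, is.numeric)        # TRUE for numeric columns
dropped    <- names(df)[!is_num]            # names of the non-numeric columns
df_numeric <- df[, is_num, drop = FALSE]    # keep only numeric columns

dropped
names(df_numeric)
```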

##        X               Y          month               day           
##  Min.   :1.000   Min.   :2.0   Length:517         Length:517        
##  1st Qu.:3.000   1st Qu.:4.0   Class :character   Class :character  
##  Median :4.000   Median :4.0   Mode  :character   Mode  :character  
##  Mean   :4.669   Mean   :4.3                                        
##  3rd Qu.:7.000   3rd Qu.:5.0                                        
##  Max.   :9.000   Max.   :9.0                                        
##       FFMC            DMC              DC             ISI        
##  Min.   :18.70   Min.   :  1.1   Min.   :  7.9   Min.   : 0.000  
##  1st Qu.:90.20   1st Qu.: 68.6   1st Qu.:437.7   1st Qu.: 6.500  
##  Median :91.60   Median :108.3   Median :664.2   Median : 8.400  
##  Mean   :90.64   Mean   :110.9   Mean   :547.9   Mean   : 9.022  
##  3rd Qu.:92.90   3rd Qu.:142.4   3rd Qu.:713.9   3rd Qu.:10.800  
##  Max.   :96.20   Max.   :291.3   Max.   :860.6   Max.   :56.100  
##       temp             RH              wind            rain        
##  Min.   : 2.20   Min.   : 15.00   Min.   :0.400   Min.   :0.00000  
##  1st Qu.:15.50   1st Qu.: 33.00   1st Qu.:2.700   1st Qu.:0.00000  
##  Median :19.30   Median : 42.00   Median :4.000   Median :0.00000  
##  Mean   :18.89   Mean   : 44.29   Mean   :4.018   Mean   :0.02166  
##  3rd Qu.:22.80   3rd Qu.: 53.00   3rd Qu.:4.900   3rd Qu.:0.00000  
##  Max.   :33.30   Max.   :100.00   Max.   :9.400   Max.   :6.40000  
##       area        
##  Min.   :   0.00  
##  1st Qu.:   0.00  
##  Median :   0.52  
##  Mean   :  12.85  
##  3rd Qu.:   6.57  
##  Max.   :1090.84

No missing (NA) values were found
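Besides reading the summary() output (which lists an NA's entry for any numeric column that has missing values), missing values can be counted directly. A minimal sketch on a toy data frame, with names chosen only for illustration:

```r
# Toy data frame standing in for the dataset (values are from the rows above)
df <- data.frame(temp = c(8.2, 18.0, 14.6),
                 RH   = c(51, 33, 33),
                 area = c(0, 0, 0))

na_per_column <- colSums(is.na(df))   # number of NA values in each column
na_per_column
anyNA(df)                             # TRUE if any value is missing
```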

Checked for outliers in the dataset and plotted the data before and after removing them. Each variable is plotted against the area variable

Before removal of outliers

After removal of outliers

In the plots, the x-axis represents one of the variables (e.g., X, Y, FFMC, etc.), and the y-axis represents the area variable. Each dot represents one observation in the dataset.

Before removing the outliers, the plots show that there are several points that are far from the rest of the observations, which are likely to be outliers. After removing the outliers, the range of the y-axis has been reduced, and the remaining observations have a more consistent relationship with the variables shown on the x-axis.

Removing the outliers matters because, when I ran the model before removing them, they significantly affected the results of my analysis. Outliers can bias the regression coefficients or lead to a poor fit of the model to the data. With the outliers removed, the coefficient estimates are less distorted by a few extreme observations and the model fits the remaining data better.
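The report does not state which rule was used to flag outliers. As one possibility, the sketch below applies the common 1.5 × IQR rule to a single column; the helper remove_outliers_iqr and the toy data are hypothetical.

```r
# Hypothetical helper: keep rows whose value in `column` lies within
# [Q1 - k * IQR, Q3 + k * IQR]. The 1.5 * IQR rule is an assumption;
# the report does not say which rule was applied.
remove_outliers_iqr <- function(data, column, k = 1.5) {
  q      <- quantile(data[[column]], probs = c(0.25, 0.75), na.rm = TRUE)
  spread <- q[2] - q[1]
  keep   <- data[[column]] >= q[1] - k * spread &
            data[[column]] <= q[2] + k * spread
  data[keep, , drop = FALSE]
}

# Toy data with one extreme burned area
toy <- data.frame(temp = c(10, 12, 14, 16, 18, 20),
                  area = c(0.5, 1.0, 1.5, 2.0, 2.5, 200))

cleaned <- remove_outliers_iqr(toy, "area")
nrow(toy)       # rows before removal
nrow(cleaned)   # rows after removal
```

Plotting each predictor against area for toy and for cleaned would give the kind of before/after comparison described above.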

Ran an evaluation of the performance of the model

## [1] 0.07995134

The R-squared value is 0.07995134, meaning that only about 8.0% of the variance in the dependent variable (area) is explained by the independent variables included in the multiple linear regression model.
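For reference, R-squared can be read from the model summary or computed from its definition, 1 - SS_res / SS_tot. The sketch below uses simulated data (not the forest fires data), so the value it prints is unrelated to the 0.0799 reported above.

```r
set.seed(1)

# Simulated stand-in data: weak signal plus noise (illustrative only)
n   <- 100
sim <- data.frame(temp = runif(n, 2, 33), wind = runif(n, 0.4, 9.4))
sim$area <- 0.05 * sim$temp + 0.1 * sim$wind + rnorm(n, sd = 2)

model <- lm(area ~ ., data = sim)     # regress area on all other columns
r2    <- summary(model)$r.squared
r2

# The same quantity computed from its definition
ss_res <- sum(residuals(model)^2)
ss_tot <- sum((sim$area - mean(sim$area))^2)
1 - ss_res / ss_tot
```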

  4. Predicted the area of the forest fires on the test data from multiple variables and evaluated the performance
##        1        3        6        8        9       12 
## 1.080448 1.907757 1.756031 1.209989 2.817131 2.595459
## [1] 2.299128
## [1] 2.951788
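The prediction step can be sketched as follows. The 70/30 split, the seed, and the simulated data are assumptions for illustration; the report does not state how the training and test sets were formed.

```r
set.seed(42)

# Simulated stand-in data (illustrative only)
n   <- 200
sim <- data.frame(temp = runif(n, 2, 33), RH = runif(n, 15, 100))
sim$area <- 0.05 * sim$temp - 0.01 * sim$RH + rnorm(n)

# 70/30 train/test split (the proportion is an assumption)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

model <- lm(area ~ ., data = train)        # fit on the training rows only
pred  <- predict(model, newdata = test)    # predict the held-out rows
head(pred)                                 # names are the test row indices
```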

Evaluated the performance of my model by calculating RMSE, MAE, and R-squared using the actual area values in the test dataset and the predicted values obtained from the model.

RMSE and MAE both measure the typical size of the difference between the actual and predicted area values, in the units of the target, with lower values indicating better performance. RMSE squares the errors before averaging, so it penalizes large errors more heavily than MAE. The R-squared value indicates the proportion of variance in area that is explained by the independent variables, with higher values indicating a better fit of the model.
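A small worked example of the two error metrics, using made-up numbers rather than the model's predictions. The function names rmse and mae are defined here for illustration and do not come from a package.

```r
# Root mean squared error and mean absolute error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

actual    <- c(1, 2, 3, 4)
predicted <- c(1, 2, 3, 8)   # one prediction is off by 4

mae(actual, predicted)    # 1: (0 + 0 + 0 + 4) / 4
rmse(actual, predicted)   # 2: sqrt((0 + 0 + 0 + 16) / 4)
```

The single large error raises RMSE above MAE, which is why the two metrics are often reported together.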

Conclusion

Multiple Linear Regression on Forest Fire dataset

The output shows the coefficients and significance of each variable in the multiple linear regression model for the forest fire dataset. A positive coefficient indicates that the burned area tends to increase with that variable when the other variables are held constant, while a negative coefficient indicates the opposite. The R-squared value of 0.07995 indicates that the model explains only a small portion of the variance in the data.
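The coefficient table referred to here can be extracted from a fitted model as sketched below. The data are simulated for illustration, so the estimates and p-values are not those of the forest fire model.

```r
set.seed(7)

# Simulated stand-in data (illustrative only)
n   <- 150
sim <- data.frame(temp = runif(n, 2, 33), RH = runif(n, 15, 100))
sim$area <- 0.08 * sim$temp - 0.02 * sim$RH + rnorm(n)

model      <- lm(area ~ temp + RH, data = sim)
coef_table <- summary(model)$coefficients      # matrix with one row per term

coef_table[, c("Estimate", "Pr(>|t|)")]        # estimates and p-values
sign(coef(model))                              # +1 or -1: direction of each effect
```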

The low R-squared value suggests that the model may not be an effective predictor of the area burned. Additional independent variables may be needed to improve the model’s performance. The evaluation score confirms the weak performance of the model in predicting the area burned.

Overall, the multiple linear regression model has limited effectiveness in predicting the area burned in the forest fire dataset.