The Forest Fires dataset contains information on forest fires in Portugal from 2000 to 2003, including meteorological and other factors that may influence the spread of fires. Using multiple linear regression, I will analyze how these factors relate to the size of the burned area in hectares, with “area” as the target variable.
Link: https://archive.ics.uci.edu/ml/datasets/Forest+Fires
Below is a list of the variables in the dataset with their descriptions:
| Variable | Description |
|---|---|
| X | x-axis spatial coordinate within the Montesinho park map: 1 to 9 |
| Y | y-axis spatial coordinate within the Montesinho park map: 2 to 9 |
| month | month of the year: ‘jan’ to ‘dec’ |
| day | day of the week: ‘mon’ to ‘sun’ |
| FFMC | Fine Fuel Moisture Code (FFMC) rating from the FWI system: 18.7 to 96.20 |
| DMC | Duff Moisture Code (DMC) rating from the FWI system: 1.1 to 291.3 |
| DC | Drought Code (DC) rating from the FWI system: 7.9 to 860.6 |
| ISI | Initial Spread Index (ISI) rating from the FWI system: 0.0 to 56.10 |
| temp | temperature in Celsius degrees: 2.2 to 33.30 |
| RH | relative humidity in %: 15.0 to 100 |
| wind | wind speed in km/h: 0.40 to 9.40 |
| rain | outside rain in mm/m2: 0.0 to 6.4 |
| area | the burned area of the forest (in ha): 0.00 to 1090.84 |
Loaded all the required libraries for the multiple linear regression model.
Loaded the Forest Fires dataset and explored all the variables.
The first 10 rows and the column names:
## X Y month day FFMC DMC DC ISI temp RH wind rain area
## 1 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0
## 2 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 0
## 3 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 0
## 4 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 0
## 5 8 6 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 0
## 6 8 6 aug sun 92.3 85.3 488.0 14.7 22.2 29 5.4 0.0 0
## 7 8 6 aug mon 92.3 88.9 495.6 8.5 24.1 27 3.1 0.0 0
## 8 8 6 aug mon 91.5 145.4 608.2 10.7 8.0 86 2.2 0.0 0
## 9 8 6 sep tue 91.0 129.5 692.6 7.0 13.1 63 5.4 0.0 0
## 10 7 5 sep sat 92.5 88.0 698.6 7.1 22.8 40 4.0 0.0 0
## [1] "X" "Y" "month" "day" "FFMC" "DMC" "DC" "ISI" "temp"
## [10] "RH" "wind" "rain" "area"
Loaded several R packages using the library() function. Then, loaded the “forestfires” dataset by reading in a CSV file using the read.csv() function. Lastly, displayed the first few rows of the dataset using the head() function.
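The loading step can be sketched as follows. This is a minimal sketch, not the exact code behind the report: the file name `forestfires.csv` is an assumption (the report does not give a path), and so that the block runs on its own it falls back to the first three rows printed above when that file is not present.

```r
# ASSUMPTION: hypothetical local file name; the report does not state the path.
csv_path <- "forestfires.csv"

# Fallback data: the first three rows shown in the output above.
sample_rows <- "X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0
7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0
7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0"

# Read the real file if it exists, otherwise the three sample rows.
forestfires <- if (file.exists(csv_path)) {
  read.csv(csv_path)
} else {
  read.csv(text = sample_rows)
}

head(forestfires, 10)  # first rows (at most 10)
names(forestfires)     # column names
```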
Separated the non-numeric columns from the numeric columns for the sake of the regression model.
## X Y FFMC DMC DC ISI temp RH wind rain area
## 1 7 5 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0
## 2 7 4 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 0
## 3 7 4 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 0
## 4 8 6 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 0
## 5 8 6 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 0
## 6 8 6 92.3 85.3 488.0 14.7 22.2 29 5.4 0.0 0
## 7 8 6 92.3 88.9 495.6 8.5 24.1 27 3.1 0.0 0
## 8 8 6 91.5 145.4 608.2 10.7 8.0 86 2.2 0.0 0
## 9 8 6 91.0 129.5 692.6 7.0 13.1 63 5.4 0.0 0
## 10 7 5 92.5 88.0 698.6 7.1 22.8 40 4.0 0.0 0
## [1] "month" "day"
The month and day columns have now been removed from the numeric subset.
Checked for missing values in advance as part of the preprocessing.
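One way to separate the column types is sketched below, using `sapply()` with `is.numeric()`. The three rows are copied from the output above and only stand in for the full dataset; the variable names `is_num` and `numeric_data` are illustrative, not taken from the report.

```r
# Three rows copied from the output above, standing in for the full dataset.
forestfires <- read.csv(text = "X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0
7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0
7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0")

# TRUE for numeric (integer or double) columns, FALSE otherwise.
is_num <- sapply(forestfires, is.numeric)

# Keep only the numeric columns.
numeric_data <- forestfires[, is_num]

head(numeric_data, 10)
names(forestfires)[!is_num]  # "month" "day"
```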
## X Y month day
## Min. :1.000 Min. :2.0 Length:517 Length:517
## 1st Qu.:3.000 1st Qu.:4.0 Class :character Class :character
## Median :4.000 Median :4.0 Mode :character Mode :character
## Mean :4.669 Mean :4.3
## 3rd Qu.:7.000 3rd Qu.:5.0
## Max. :9.000 Max. :9.0
## FFMC DMC DC ISI
## Min. :18.70 Min. : 1.1 Min. : 7.9 Min. : 0.000
## 1st Qu.:90.20 1st Qu.: 68.6 1st Qu.:437.7 1st Qu.: 6.500
## Median :91.60 Median :108.3 Median :664.2 Median : 8.400
## Mean :90.64 Mean :110.9 Mean :547.9 Mean : 9.022
## 3rd Qu.:92.90 3rd Qu.:142.4 3rd Qu.:713.9 3rd Qu.:10.800
## Max. :96.20 Max. :291.3 Max. :860.6 Max. :56.100
## temp RH wind rain
## Min. : 2.20 Min. : 15.00 Min. :0.400 Min. :0.00000
## 1st Qu.:15.50 1st Qu.: 33.00 1st Qu.:2.700 1st Qu.:0.00000
## Median :19.30 Median : 42.00 Median :4.000 Median :0.00000
## Mean :18.89 Mean : 44.29 Mean :4.018 Mean :0.02166
## 3rd Qu.:22.80 3rd Qu.: 53.00 3rd Qu.:4.900 3rd Qu.:0.00000
## Max. :33.30 Max. :100.00 Max. :9.400 Max. :6.40000
## area
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 0.52
## Mean : 12.85
## 3rd Qu.: 6.57
## Max. :1090.84
No missing (NA) values were found.
Checked for outliers in the dataset and plotted the data before and after removing them. Each variable is plotted against the area variable.
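A check of this kind can be sketched with `summary()` and a per-column NA count. The rows below are copied from the output earlier in the report and only stand in for the full 517-row dataset; `na_counts` is an illustrative name.

```r
# Three rows copied from the report's output, standing in for the full dataset.
forestfires <- read.csv(text = "X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0
7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0
7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0")

# Per-column summary statistics; NA counts appear here when present.
summary(forestfires)

# Explicit count of missing values in each column.
na_counts <- colSums(is.na(forestfires))
na_counts

# Single TRUE/FALSE answer for the whole data frame.
anyNA(forestfires)
```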
Before removal of outliers
After removal of outliers
In the plots, the x-axis represents one of the variables (e.g., X, Y, FFMC, etc.), and the y-axis represents the area variable. Each dot represents one observation in the dataset.
Before removing the outliers, the plots show that there are several points that are far from the rest of the observations, which are likely to be outliers. After removing the outliers, the range of the y-axis has been reduced, and the remaining observations have a more consistent relationship with the variables shown on the x-axis.
Removing the outliers is important because, when I ran the analysis without removing them, they significantly affected the results. Outliers can bias the regression coefficients or lead to a poor fit of the model to the data. By removing them, I can obtain more stable estimates of the regression coefficients and a better fit of the model to the remaining data.
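The report does not state which rule was used to flag outliers. As an illustration only, the sketch below assumes the common 1.5 × IQR rule applied to a single column; the helper `remove_outliers_iqr()` and the toy data are hypothetical and are not the report's code or data.

```r
# ASSUMPTION: 1.5 * IQR rule; the report does not say which rule it used.
remove_outliers_iqr <- function(data, column, k = 1.5) {
  q <- quantile(data[[column]], c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])  # k times the interquartile range
  keep <- data[[column]] >= q[1] - fence & data[[column]] <= q[2] + fence
  data[keep, , drop = FALSE]
}

# Toy stand-in data (not the real dataset): one extreme burned area.
toy <- data.frame(
  temp = c(8.2, 18.0, 14.6, 8.3, 11.4, 22.2),
  area = c(0, 0.5, 1.2, 2.0, 3.1, 1090.84)
)

cleaned <- remove_outliers_iqr(toy, "area")

nrow(toy)      # rows before removal
nrow(cleaned)  # rows after removal

# Before/after scatter plots of one predictor against area.
plot(toy$temp, toy$area, main = "Before removal", xlab = "temp", ylab = "area")
plot(cleaned$temp, cleaned$area, main = "After removal", xlab = "temp", ylab = "area")
```

Filtering on the response alone is one design choice; the same helper could be applied column by column to the predictors as well.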
Split the data into training and test sets. The code for this operation sets a seed for reproducibility and then randomly splits the dataset into training and test sets using the sample() function. The training set contains 70% of the data and the test set contains the remaining 30%.
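A 70/30 split with `set.seed()` and `sample()` can be sketched as below. The seed value and the synthetic data frame `toy` are assumptions for illustration; the report does not give the seed it used.

```r
# ASSUMPTION: any fixed seed works; the report's actual seed is not given.
set.seed(123)

# Synthetic stand-in data with 100 rows (not the real dataset).
toy <- data.frame(temp = runif(100, 2, 33), area = rexp(100))

n <- nrow(toy)

# Randomly pick 70% of the row indices for training.
train_idx <- sample(n, size = round(0.7 * n))

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]  # the remaining 30%

nrow(train)
nrow(test)
```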
Created a correlation matrix, which shows the degree to which pairs of variables vary together. Each correlation coefficient ranges from -1 to +1, with -1 indicating a perfect negative correlation (when one variable increases, the other decreases), +1 indicating a perfect positive correlation (when one variable increases, the other increases as well), and 0 indicating no linear correlation between the variables.
The correlation matrix is useful for identifying which variables are positively or negatively correlated with each other.
In the matrix above, each colored square represents the correlation between a pair of variables. The brighter the color, the stronger the correlation between the two variables.
Based on the output, I tried omitting FFMC, DMC, and DC because of their correlation with temperature; however, the accuracy of the model remained the same. Note that correlation does not imply causation.
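The numbers behind such a matrix can be computed with base R's `cor()`. The sketch below uses synthetic data in which `DC` is deliberately built to correlate with `temp`; the 0.5 threshold and all variable values are assumptions for illustration, and the colored plot itself is not reproduced here.

```r
set.seed(1)

# Synthetic stand-in data (not the real dataset).
temp <- runif(50, 2, 33)
toy <- data.frame(
  temp = temp,
  DC   = 20 * temp + rnorm(50, sd = 50),  # constructed to correlate with temp
  wind = runif(50, 0.4, 9.4)              # independent of the other two
)

# Pearson correlation matrix of all numeric columns.
corr <- cor(toy)
round(corr, 2)

# List each pair once (upper triangle) whose |r| exceeds an assumed threshold.
high <- which(abs(corr) > 0.5 & upper.tri(corr), arr.ind = TRUE)
data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = corr[high]
)
```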
Fitted the multiple linear regression model using the lm() function, where the response variable is “area” and all other variables in the dataset are predictors. The summary() function prints a summary of the model to the console, including the coefficient estimates and the statistical significance of each predictor. Finally, model$coefficients displays the coefficient estimates.
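The fitting step can be sketched as follows on synthetic data. The data frame `toy` and its coefficients are made up for illustration, and the sorting uses base R rather than whatever produced the tibble in the report, so the printed layout will differ from the report's own output.

```r
set.seed(42)
n <- 200

# Synthetic stand-in data (not the real dataset).
toy <- data.frame(
  temp  = runif(n, 2, 33),
  wind  = runif(n, 0.4, 9.4),
  month = sample(c("aug", "sep", "mar"), n, replace = TRUE)
)
toy$area <- 0.05 * toy$temp + 0.1 * toy$wind + rnorm(n)

# area as the response, every other column as a predictor.
# Character columns such as month are expanded into dummy terms.
model <- lm(area ~ ., data = toy)

summary(model)       # estimates, standard errors, p-values, R-squared
model$coefficients   # coefficient estimates only

# Coefficient table sorted by absolute size of the estimate.
coefs <- as.data.frame(coef(summary(model)))
coefs[order(-abs(coefs$Estimate)), ]
```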
## # A tibble: 28 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 monthdec 6.73 2.40 2.81 0.00525
## 2 monthmay -3.47 4.08 -0.851 0.395
## 3 monthaug -3.03 2.37 -1.28 0.203
## 4 monthjul -2.78 2.08 -1.34 0.182
## 5 monthsep -2.67 2.67 -1.00 0.317
## 6 monthjan -2.67 3.33 -0.801 0.423
## 7 monthnov -2.61 4.04 -0.648 0.518
## 8 monthoct -2.42 2.83 -0.854 0.393
## 9 monthjun -2.16 1.88 -1.15 0.250
## 10 daytue 0.994 0.687 1.45 0.149
## # … with 18 more rows
The output shows the coefficient estimates for each term, sorted by absolute size, along with the significance of each coefficient. A positive coefficient indicates that the term is associated with a larger burned area when the other predictors are held fixed, while a negative coefficient indicates the opposite. The p-value indicates the statistical significance of each coefficient; of the ten terms shown, only monthdec (p = 0.00525) falls below the usual 0.05 threshold.
The output also shows the overall performance of the model, including the residuals (the differences between actual and predicted values) and the multiple R-squared, which is the proportion of variance in the dependent variable (area) that is explained by the independent variables. In this case, the R-squared value is low (0.07995), indicating that the model does not explain much of the variance in the data.
## [1] 0.07995134
Got an R-squared value of 0.07995134, meaning that only about 8% of the variance in the dependent variable is explained by the independent variables included in the multiple linear regression model.
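To make the meaning of this number concrete, the sketch below computes R-squared by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and checks it against the value reported by `summary()`. The data are synthetic and the variable names are illustrative.

```r
set.seed(7)

# Synthetic data (not the real dataset).
x <- runif(100)
y <- 2 * x + rnorm(100)

fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)   # variation left unexplained by the model
ss_tot <- sum((y - mean(y))^2)    # total variation around the mean

r2_manual <- 1 - ss_res / ss_tot

r2_manual
summary(fit)$r.squared

# The two values agree up to floating-point tolerance.
all.equal(r2_manual, summary(fit)$r.squared)
```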
## 1 3 6 8 9 12
## 1.080448 1.907757 1.756031 1.209989 2.817131 2.595459
## [1] 2.299128
## [1] 2.951788
Evaluated the performance of my model by calculating RMSE, MAE, and R-squared using the actual area values in the test dataset and the predicted values obtained from the model.
The RMSE and MAE measure the typical size of the difference between the actual and predicted area values (RMSE penalizes large errors more heavily than MAE), with lower values indicating better performance. The R-squared value indicates the proportion of variance in area that is explained by the independent variables, with higher values indicating a better fit of the model.
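These metrics can be sketched as small helper functions. The helpers `rmse()`, `mae()`, and `r_squared()` are illustrative names, the R-squared formula is one common definition (the report does not say how its test-set value was computed), and all data below are synthetic.

```r
# Illustrative helper functions (not from the report).
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Small hand-checkable example.
actual    <- c(1, 2, 3, 4)
predicted <- c(1.5, 2, 2.5, 5)
rmse(actual, predicted)
mae(actual, predicted)
r_squared(actual, predicted)

# Same metrics on a synthetic train/test split.
set.seed(99)
toy <- data.frame(temp = runif(100, 2, 33))
toy$area <- 0.1 * toy$temp + rnorm(100)
train_idx <- sample(nrow(toy), 70)
model <- lm(area ~ temp, data = toy[train_idx, ])
test  <- toy[-train_idx, ]
pred  <- predict(model, newdata = test)  # predictions for unseen rows

head(pred)
c(RMSE = rmse(test$area, pred),
  MAE  = mae(test$area, pred),
  R2   = r_squared(test$area, pred))
```

RMSE is never smaller than MAE on the same predictions, which can serve as a quick sanity check when reading the two values side by side.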
The output shows the coefficients and significance of each variable in the multiple linear regression model for the forest fire dataset. A positive coefficient indicates a positive effect on the area burned, while a negative coefficient indicates the opposite. The R-squared value of 0.07995 indicates that the model explains only a small portion of the variance in the data.
The low R-squared value suggests that the model may not be an effective predictor of the area burned. Additional independent variables may be needed to improve the model’s performance. The evaluation scores on the test set confirm the weak performance of the model in predicting the area burned.
Overall, the multiple linear regression model has limited effectiveness in predicting the area burned in the forest fire dataset.