Introduction

With the recent increase in California wildfires, I decided to use machine learning to try to predict the month in which a forest fire occurs. The algorithm I use to make these predictions is “rpart,” an implementation of Classification and Regression Trees (CART). For a numeric target, rpart grows a regression tree by recursive partitioning: at each node it selects the split that most reduces the sum of squared errors of the target around the node means.

After fitting the “rpart” model, I will attempt to improve its performance by tuning “cp,” the complexity parameter used to prune the decision tree.
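As a rough illustration of how a regression tree scores a candidate split, the sketch below compares the sum of squared errors (SSE) of the target before and after splitting on a threshold. The toy vectors `dc` and `month` and the helper `sse_reduction` are invented for this example; rpart's real implementation handles many more details (surrogate splits, minimum node sizes, and so on).

```r
# Toy data (made up for illustration): a drought-code-like predictor and a month
dc    <- c(20, 35, 50, 600, 650, 700)
month <- c(2, 3, 3, 8, 8, 9)

# Sum of squared errors around the mean
sse <- function(y) sum((y - mean(y))^2)

# SSE reduction achieved by splitting on x < threshold (hypothetical helper)
sse_reduction <- function(x, y, threshold) {
  left  <- y[x <  threshold]
  right <- y[x >= threshold]
  sse(y) - (sse(left) + sse(right))
}

# Candidate thresholds: midpoints between consecutive sorted x values
sorted_x   <- sort(unique(dc))
thresholds <- (head(sorted_x, -1) + tail(sorted_x, -1)) / 2

# Score every candidate and keep the one with the largest reduction
reductions     <- sapply(thresholds, function(t) sse_reduction(dc, month, t))
best_threshold <- thresholds[which.max(reductions)]
best_threshold
```

In this toy data the best threshold falls between the low and high `dc` values, because that split separates the early-year months from the late-summer ones.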

Step 1: Collecting the Data

To begin, I imported the “forestfires” dataset into the working environment. The dataset comes from UC Irvine’s Machine Learning Repository and was donated by Cortez and Morais of the University of Minho, Portugal.

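# Load the data; the file is assumed to be in the working directory
fires <- read.csv("forestfires.csv")
# Drop the day-of-week column, which is not used as a predictor
fires$day <- NULL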

The dataset abbreviates several feature names, which are explained below.

FFMC - The FFMC index from the FWI system. These values are known as the "Fine Fuel Moisture Code," and range from 18.7 to 96.2. According to the Canadian Forest Fire Weather Index System, "The FFMC is a numerical rating of the moisture content of litter and other cured fine fuels (needles, mosses, twigs less than 1 cm in diameter)."
DMC - The DMC index from the FWI system. These values are known as the "Duff Moisture Code," and range from 1.1 to 291.3. According to the Canadian Forest Fire Weather Index System, "The DMC indicates the moisture content of loosely-compacted organic layers of moderate depth."
DC - The DC index from the FWI system. These values are known as the "Drought Code," and range from 7.9 to 860.6. According to the Canadian Forest Fire Weather Index System, "The third moisture code is the DC, and it is an indicator of moisture content in deep, compact organic layers."
ISI - The ISI index from the FWI system. These values are known as the "Initial Spread Index," and range from 0 to 56.1. According to the Canadian Forest Fire Weather Index System, "The ISI combines the FFMC and wind speed to indicate the expected rate of fire spread."
RH - The relative humidity, which ranges from 15.0 to 100.

Step 2: Exploring and Preparing Data

To begin, I examined the structure of the “forestfires” dataset by creating a histogram of my target variable, month, to see how it is distributed.

hist(fires$month)

As the histogram shows, the data are somewhat skewed to the left, with more fires occurring in August than in any other month.
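Because month is a whole number from 1 to 12, a table of counts or a bar plot with one bar per month can be easier to read than a histogram, whose bins may not line up with the month codes. The following is a minimal sketch on a made-up `month` vector, not the actual fires data.

```r
# Hypothetical month values (1 = January, ..., 12 = December), not the real data
month <- c(8, 8, 9, 3, 8, 9, 7, 8, 9, 2, 8, 12)

# Count fires per month, keeping months with zero fires
month_counts <- table(factor(month, levels = 1:12, labels = month.abb))
month_counts

# One bar per month avoids the binning artifacts of hist() on integer codes
barplot(month_counts, xlab = "Month", ylab = "Number of fires")
```

With the real data, `fires$month` would take the place of the toy vector.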

To run the rpart algorithm, I then created training and testing datasets. I used the “sample” function to select 416 of the 517 rows (about 80%) at random for training, leaving the remaining 101 rows for testing.

set.seed(123)
train_sample <- sample(517, 416)

fires_train <- fires[train_sample, ]
fires_test  <- fires[-train_sample, ]

## Below is my original attempt at the holdout method, which took the rows in file order rather than at random.
# fires_train <- fires[1:416, ]
# fires_test <- fires[417:517, ]
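The same random holdout split can be written in terms of a proportion instead of hard-coded row counts, so that it still works if the number of rows changes. This is a sketch on a stand-in data frame `toy`; the 80% proportion and all variable names are assumptions made for illustration.

```r
# Stand-in data frame with the same number of rows as the fires data (values are made up)
set.seed(123)
toy <- data.frame(id = 1:517, x = rnorm(517))

# Sample about 80% of the row indices without replacement
n_train   <- round(0.8 * nrow(toy))
train_idx <- sample(nrow(toy), n_train)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The two sets are disjoint and together cover every row
c(train = nrow(toy_train), test = nrow(toy_test))
length(intersect(toy_train$id, toy_test$id))
```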

Step 3: Training the Model on the Data

With the training and testing datasets successfully created, I built the regression tree using the “rpart” package.

library(rpart)

m.rpart <- rpart(month ~., data = fires_train)

With the regression tree successfully created, I then went on to create a visual of the tree using the “rpart.plot” function.

library(rpart.plot)
rpart.plot(m.rpart, digits = 3)

rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)

As the tree shows, the top-level splits are on DC. In fact, most of the splits are on DC, with only two based on DMC. This observation informs my attempt to improve the model’s performance in Step 5.
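Which variables a tree relies on can also be read from the fitted object rather than from the plot. The sketch below uses the built-in `mtcars` data as a stand-in, since the fires data is not loaded here; the `variable.importance` and `frame` components are part of the fitted rpart object, while the names `toy_tree`, `importance`, and `split_vars` are made up for this example.

```r
library(rpart)

# Stand-in example on the built-in mtcars data, not the fires data
toy_tree <- rpart(mpg ~ ., data = mtcars)

# Importance scores, rescaled to sum to 1 and sorted from largest to smallest
importance <- toy_tree$variable.importance
round(sort(importance / sum(importance), decreasing = TRUE), 3)

# Variables actually used in splits ("<leaf>" marks terminal nodes)
split_vars <- setdiff(unique(as.character(toy_tree$frame$var)), "<leaf>")
split_vars
```

Applied to `m.rpart`, the same two inspections would show whether DC and DMC dominate the splits.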

Step 4: Evaluating Model Performance

With the regression tree created, I generated predictions for the testing dataset, using the “predict” function.

p.rpart <- predict(m.rpart, fires_test)

I then compared the distribution of the predictions with that of the actual test values, using the summary function.

summary(p.rpart)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.973   7.896   8.133   7.719   9.048  10.571
summary(fires_test$month)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   8.000   8.000   7.832   9.000  12.000

As the numerical summaries show, the distribution of the rpart predictions is similar to that of the testing data, although the predictions top out near 10.6 while the actual months reach 12. Matching summaries only show that the two distributions are alike, not that individual predictions are correct, so to get a better sense of the model’s accuracy I computed the correlation between the predictions and the actual test values.

cor(p.rpart, fires_test$month)
## [1] 0.9095257

The correlation coefficient of about 0.91 shows a strong, positive linear relationship between the predictions and the actual months in the testing data. This suggests that the model tracks the month in which a forest fire occurred closely, although correlation measures association rather than the size of the errors.
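A small made-up example shows why correlation alone does not establish accuracy: predictions that are consistently three months too late still correlate perfectly with the actual values. All vectors below are hypothetical.

```r
# Made-up actual months and two sets of hypothetical predictions
actual  <- c(2, 3, 7, 8, 8, 9, 9, 12)
close   <- actual + c(0.2, -0.1, 0.3, -0.2, 0.1, 0.0, -0.3, 0.2)
shifted <- actual + 3   # every prediction is off by exactly three months

# Both correlate almost perfectly with the actual values ...
cor(close, actual)
cor(shifted, actual)

# ... but their average absolute errors are very different
mean(abs(close - actual))
mean(abs(shifted - actual))
```

This is why an error measure is worth reporting alongside the correlation.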

Next, I created a custom function to calculate the mean absolute error (MAE) of the predictions.

MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))  
}

Using this function, I calculated the Mean Absolute Error between the predictions and the testing data.

MAE(p.rpart, fires_test$month)
## [1] 0.3094427

The output shows a Mean Absolute Error of 0.309, meaning that on average the predictions are off by roughly a third of a month. I consider this a fairly low error.
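To make the error measure concrete, here is a small worked example on made-up values. It also defines a hypothetical `RMSE` helper for comparison, since root mean squared error is the metric used in the tuning step later on; neither the values nor `RMSE` come from the analysis itself.

```r
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Root mean squared error penalizes large misses more heavily than MAE
RMSE <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up values: the errors are 0.5, 0, and 1 month
actual    <- c(8, 9, 3)
predicted <- c(7.5, 9, 4)

MAE(actual, predicted)   # (0.5 + 0 + 1) / 3 = 0.5
RMSE(actual, predicted)  # sqrt((0.25 + 0 + 1) / 3), about 0.645
```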

Next, as a baseline, I calculated the Mean Absolute Error that results from predicting the mean month value for every row of the testing dataset.

mean(fires_test$month)
## [1] 7.831683
MAE(7.831683, fires_test$month)
## [1] 1.388884

The baseline’s Mean Absolute Error is about 1.39 months, more than four times the regression tree’s 0.309, so the tree is a clear improvement over always predicting the average month.
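A common variant of this baseline takes the mean from the training data rather than from the testing data, so that no information from the test set is used. The sketch below shows that variant on made-up month values; the vectors and the name `baseline_mae` are assumptions for illustration.

```r
MAE <- function(actual, predicted) mean(abs(actual - predicted))

# Made-up training and testing targets (month numbers)
train_month <- c(8, 9, 8, 3, 7, 9, 8, 2)
test_month  <- c(8, 9, 3, 8)

# The baseline predicts the training mean for every test row
baseline     <- mean(train_month)
baseline_mae <- MAE(test_month, rep(baseline, length(test_month)))
baseline_mae
```

With the real data this would be `MAE(fires_test$month, mean(fires_train$month))`, relying on R recycling the single mean value across all test rows.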

Step 5: Improving Model Performance

In an attempt to improve the accuracy, I fit a model tree with the M5P algorithm, which replaces the leaves of a regression tree with their own separate regression models. Because the rpart tree split only on DC and DMC, the new model uses just those two attributes as predictors.

library(RWeka)
m.m5p <- M5P(month~DC+DMC, data = fires_train)

summary(m.m5p)
## 
## === Summary ===
## 
## Correlation coefficient                  0.8943
## Mean absolute error                      0.842 
## Root mean squared error                  1.2506
## Relative absolute error                 47.8065 %
## Root relative squared error             54.4956 %
## Total Number of Instances              416

Based on the summary of the model tree, the Mean Absolute Error is 0.842 and the correlation coefficient is 0.8943, both worse than the rpart tree’s 0.309 and 0.9095. Note that this summary is computed on the 416 training rows, so it is not directly comparable to the rpart figures, which were computed on the testing data.

Next, I used the “predict” function to generate predictions from the new model on the testing data.

p.m5p <- predict(m.m5p, fires_test)
summary(p.m5p)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.451   7.134   7.412   6.946   8.460   9.055
summary(fires_test$month)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   8.000   8.000   7.832   9.000  12.000

As the summaries show, the new model’s predictions are not as close to the actual testing data as those of the former model. To further compare the accuracy, I used the correlation function.

cor(p.m5p, fires_test$month)
## [1] 0.802855

Unfortunately, the new model’s correlation with the testing data, 0.803, is lower than the original regression tree’s 0.910. The Mean Absolute Error tells the same story:

MAE(fires_test$month, p.m5p)
## [1] 0.9871519

As the output shows, the Mean Absolute Error on the testing data has increased from 0.309 to 0.987. Rather than improving the rpart model’s performance, the M5P model increased the error and decreased the accuracy.

Next, I attempted to improve the rpart model by tuning “cp,” the complexity parameter. The “caret” package in R can fit rpart models over several candidate cp values and automatically select the best one as the tuning parameter for the final model.
In addition, I used RMSE, the root mean squared error, as the metric for selecting the model.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'caret'
## The following object is masked _by_ '.GlobalEnv':
## 
##     MAE
set.seed(300)
m_rpart <- train(month ~ ., data = fires, method = "rpart",
                metric = "RMSE")
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
m_rpart
## CART 
## 
## 517 samples
##  11 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 517, 517, 517, 517, 517, 517, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE      Rsquared   MAE      
##   0.02071644  1.026758  0.7929236  0.5540386
##   0.04644137  1.103480  0.7630930  0.6859543
##   0.78276245  1.643340  0.7239524  1.1515641
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.02071644.
## For further information on this method, see:  https://datascience.stackexchange.com/questions/31346/caret-and-rpart-does-caret-automatically-prune-rpart-trees
fires_pred <- predict(m_rpart, fires)
cor(fires_pred, fires$month)
## [1] 0.9357592

As the output shows, the tuned model’s correlation between predictions and actual values is slightly higher: .9358, compared with the original .9095. This comparison should be read with caution, however, because the tuned model was trained and evaluated on the full dataset, whereas the original correlation was computed on held-out testing data.
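The complexity parameter can also be tuned with rpart alone, using the cross-validated error that rpart records for each candidate cp. The sketch below does this on the built-in `mtcars` data as a stand-in; it selects cp by rpart's internal cross-validation rather than by caret's bootstrap RMSE, so it illustrates the same idea, not the exact procedure used above. The object names are made up for this example.

```r
library(rpart)

# Stand-in example on the built-in mtcars data; cp = 0 grows a deliberately large tree
set.seed(300)
big_tree <- rpart(mpg ~ ., data = mtcars,
                  control = rpart.control(cp = 0, minsplit = 5))

# cptable lists each candidate cp with its cross-validated error (xerror)
cp_table <- big_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune back to the subtree with the lowest cross-validated error
pruned_tree <- prune(big_tree, cp = best_cp)

# Compare the number of terminal nodes before and after pruning
c(leaves_before = sum(big_tree$frame$var == "<leaf>"),
  leaves_after  = sum(pruned_tree$frame$var == "<leaf>"))
```

A larger cp prunes more aggressively, so the pruned tree never has more leaves than the full one.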

Overall, I have found this model to be fairly useful for predicting the month in which a forest fire occurred, based on certain conditions. While one attempt to improve the model’s performance failed, the “caret” package’s tuning of the complexity parameter appears to have produced better results than the original rpart model, resulting in an accurate model for estimating the month in which a fire occurs.

Appendix

The attribute “month” was converted from a factor to its numeric equivalent, e.g., Jan = 1, Feb = 2, etc.
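One way to perform this conversion is sketched below. It assumes the raw column stores lowercase three-letter abbreviations such as "aug"; that format and the variable names are assumptions, so the lookup would need adjusting if the file uses a different encoding.

```r
# Assumes the raw column holds lowercase abbreviations such as "aug" and "sep"
raw_month <- c("aug", "sep", "mar", "aug", "dec")

# match() returns the position of each abbreviation in the built-in month.abb
month_num <- match(raw_month, tolower(month.abb))
month_num
```

Any value that is not a recognized abbreviation becomes `NA`, which makes unexpected entries easy to spot.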

Step 2:

To further explore the dataset, I used the summary function to generate numerical summaries of the data attributes.

summary(fires)
##        X               Y           month             FFMC      
##  Min.   :1.000   Min.   :2.0   Min.   : 1.000   Min.   :18.70  
##  1st Qu.:3.000   1st Qu.:4.0   1st Qu.: 7.000   1st Qu.:90.20  
##  Median :4.000   Median :4.0   Median : 8.000   Median :91.60  
##  Mean   :4.669   Mean   :4.3   Mean   : 7.476   Mean   :90.64  
##  3rd Qu.:7.000   3rd Qu.:5.0   3rd Qu.: 9.000   3rd Qu.:92.90  
##  Max.   :9.000   Max.   :9.0   Max.   :12.000   Max.   :96.20  
##       DMC              DC             ISI              temp      
##  Min.   :  1.1   Min.   :  7.9   Min.   : 0.000   Min.   : 2.20  
##  1st Qu.: 68.6   1st Qu.:437.7   1st Qu.: 6.500   1st Qu.:15.50  
##  Median :108.3   Median :664.2   Median : 8.400   Median :19.30  
##  Mean   :110.9   Mean   :547.9   Mean   : 9.022   Mean   :18.89  
##  3rd Qu.:142.4   3rd Qu.:713.9   3rd Qu.:10.800   3rd Qu.:22.80  
##  Max.   :291.3   Max.   :860.6   Max.   :56.100   Max.   :33.30  
##        RH              wind            rain              area        
##  Min.   : 15.00   Min.   :0.400   Min.   :0.00000   Min.   :   0.00  
##  1st Qu.: 33.00   1st Qu.:2.700   1st Qu.:0.00000   1st Qu.:   0.00  
##  Median : 42.00   Median :4.000   Median :0.00000   Median :   0.52  
##  Mean   : 44.29   Mean   :4.018   Mean   :0.02166   Mean   :  12.85  
##  3rd Qu.: 53.00   3rd Qu.:4.900   3rd Qu.:0.00000   3rd Qu.:   6.57  
##  Max.   :100.00   Max.   :9.400   Max.   :6.40000   Max.   :1090.84

Next, I created numerical summaries of the most important features in the dataset (determined in Step 3).

summary(fires$DMC)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.1    68.6   108.3   110.9   142.4   291.3
summary(fires$DC)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     7.9   437.7   664.2   547.9   713.9   860.6