With the recent increase in California wildfires, I decided to use machine learning to predict when, specifically in which month, a forest fire is likely to occur. The algorithm I will use to make these predictions is “rpart,” an implementation of the Classification and Regression Tree (CART) approach. It builds the tree by recursive partitioning, choosing at each node the split that most reduces the variability of the target within the resulting groups.
Once I have fit the “rpart” model, I will attempt to improve its performance by using “cp,” the complexity parameter, to prune the decision tree.
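To make the split selection concrete, here is a minimal sketch of how a regression tree can choose a single split on one predictor by minimizing the within-group sum of squared errors. The vectors `x` and `y` and the helper `sse` are hypothetical and only illustrate the idea; rpart's actual implementation handles many predictors, stopping rules, and more.

```r
# Toy data: one predictor x and a numeric target y (hypothetical values).
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 2, 3, 8, 9, 9)

# Sum of squared deviations from the group mean.
sse <- function(v) sum((v - mean(v))^2)

# Candidate thresholds: midpoints between consecutive sorted x values.
xs <- sort(unique(x))
thresholds <- (head(xs, -1) + tail(xs, -1)) / 2

# Total SSE of the two groups produced by each candidate threshold.
split_sse <- sapply(thresholds, function(t) sse(y[x < t]) + sse(y[x >= t]))

# The chosen split is the one with the smallest remaining SSE.
best_threshold <- thresholds[which.min(split_sse)]
reduction <- sse(y) - min(split_sse)
best_threshold
```

In this toy case the split falls between the low and high values of `y`, which is where the two groups become most homogeneous.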
To begin, I will import the “forestfires” dataset into the working environment. This dataset was taken from UC Irvine’s Machine Learning Repository and was donated by Cortez and Morais of the University of Minho, Portugal.
fires <- read.csv("forestfires.csv")
fires$day <- NULL
To clarify, several of the feature names in the dataset are abbreviations.
FFMC - The FFMC index from the FWI system. These values are known as "Fine Fuel Moisture Codes," and range from 18.7 to 96.2. According to the Canadian Forest Fire Weather Index System, "The FFMC is a numerical rating of the moisture content of litter and other cured fine fuels (needles, mosses, twigs less than 1 cm in diameter)."
DMC - The DMC index from the FWI system. These values are known as the "Duff Moisture Codes," and range from 1.1 to 291.3. According to the Canadian Forest Fire Weather Index System, "The DMC indicates the moisture content of loosely-compacted organic layers of moderate depth."
DC - The DC index from the FWI system. These values are known as the "Drought Code," and range from 7.9 to 860.6. According to the Canadian Forest Fire Weather Index System, "The third moisture code is the DC, and it is an indicator of moisture content in deep, compact organic layers."
ISI - The ISI index from the FWI system. These values are known as the "Initial Spread Index," and range from 0 to 56.10. According to the Canadian Forest Fire Weather Index System, "The ISI combines the FFMC and wind speed to indicate the expected rate of fire spread (Fig. 5)."
RH - The relative humidity, which ranges from 15.0 to 100.
Next, I will examine the structure of the “forestfires” dataset. To do so, I created a histogram of my target variable, month, to get a sense of its distribution.
hist(fires$month)
As evident from the histogram, the data is skewed left, with the most fires occurring in August.
Now, in order to run the rpart algorithm, I created the training and testing datasets. I used the “sample” function to draw 416 of the 517 rows at random for training, leaving the remaining 101 rows for testing.
set.seed(123)
train_sample <- sample(517, 416)
fires_train <- fires[train_sample, ]
fires_test <- fires[-train_sample, ]
## Attached below is my original attempt at the holdout method.
# fires_train <- fires[1:416, ]
# fires_test <- fires[417:517, ]
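As a quick sanity check on this kind of random holdout split, the sketch below repeats the same steps on a hypothetical stand-in data frame (`toy`, with an invented `id` column) and confirms that the two sets have the expected sizes and share no rows.

```r
set.seed(123)

# Hypothetical stand-in for the fires data frame: 517 rows with an id column.
toy <- data.frame(id = 1:517, value = rnorm(517))

# Draw 416 row indices at random for training; the rest form the test set.
train_idx <- sample(nrow(toy), 416)
toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

# Check the sizes and that no row appears in both sets.
n_train <- nrow(toy_train)
n_test <- nrow(toy_test)
n_overlap <- length(intersect(toy_train$id, toy_test$id))
c(n_train, n_test, n_overlap)
```

Sampling at random, rather than taking the first 416 rows, avoids inheriting whatever ordering the file happens to have.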
With the training and testing datasets successfully created, I began the process of creating the regression tree, using the “rpart” package.
library(rpart)
m.rpart <- rpart(month ~., data = fires_train)
With the regression tree successfully created, I then went on to create a visual of the tree using the “rpart.plot” function.
library(rpart.plot)
rpart.plot(m.rpart, digits = 3)
rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)
As evident from the tree, the primary splits were based on the DC levels. In fact, most of the splits were based on DC, with only two splits based on DMC. This information will come into play in my attempt to improve the algorithm’s performance in step 5.
With the regression tree created, I generated predictions for the testing dataset using the “predict” function.
p.rpart <- predict(m.rpart, fires_test)
Using the generated predictions, I then compared the distribution of the predictions with that of the actual test values, using the summary function.
summary(p.rpart)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.973 7.896 8.133 7.719 9.048 10.571
summary(fires_test$month)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 8.000 8.000 7.832 9.000 12.000
As evident from the numerical summaries, the distribution of the rpart predictions is very similar to that of the testing dataset, although the predictions do not reach the extremes of 2 and 12. This suggests that the algorithm is fairly accurate at predicting the month in which a forest fire occurred. To develop a better sense of the algorithm’s accuracy, I will use the correlation function on the predictions and the testing dataset.
cor(p.rpart, fires_test$month)
## [1] 0.9095257
As evident from the correlation coefficient, the algorithm’s predictions and the testing data are strongly, positively, and linearly correlated. This implies that the algorithm is fairly accurate at predicting the month in which a forest fire occurred.
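Correlation measures how closely predictions track the actual values, not how far off they are, which is why an error measure is a useful companion. The sketch below uses hypothetical month values to show predictions that correlate perfectly with the truth while being consistently three months late.

```r
# Hypothetical actual months and predictions that are always 3 months too high.
actual <- c(6, 7, 8, 9, 10)
shifted <- actual + 3

# The correlation is perfect because the relationship is exactly linear.
r <- cor(actual, shifted)

# The average absolute error is still 3 months.
avg_abs_error <- mean(abs(actual - shifted))
c(r, avg_abs_error)
```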
Next, I created a custom function to calculate the mean absolute error (MAE) of the predictions.
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}
Using this custom function, I will calculate the Mean Absolute Error between the predictions and the testing dataset.
MAE(p.rpart, fires_test$month)
## [1] 0.3094427
As evident from the function’s output, the model has a Mean Absolute Error of 0.309, meaning its predictions are off by roughly a third of a month on average. I believe this is a fairly low rate of error.
Next, to put this error in context, I calculated the mean absolute error of a baseline that always predicts the mean month of the testing dataset.
mean(fires_test$month)
## [1] 7.831683
MAE(7.831683, fires_test$month)
## [1] 1.388884
This baseline, which always predicts the average testing month, has a Mean Absolute Error of 1.389. The regression tree’s error of 0.309 is roughly 78% lower, so the tree predicts considerably better than the simple average.
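The comparison against a mean-only baseline can be sketched as follows. The vectors are hypothetical, and the lowercase helper `mae` is defined here only so the example is self-contained.

```r
# Self-contained helper for this sketch (same formula as the MAE function above).
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Hypothetical actual months and model predictions.
actual <- c(8, 9, 8, 3, 9, 7)
predicted <- c(8.2, 8.8, 7.9, 3.5, 9.1, 7.4)

# Baseline: always predict the mean of the actual values.
baseline <- rep(mean(actual), length(actual))

model_mae <- mae(actual, predicted)
baseline_mae <- mae(actual, baseline)

# Fraction of the baseline error that the model removes.
improvement <- 1 - model_mae / baseline_mae
c(model_mae, baseline_mae, improvement)
```

A model is only worth keeping if its error is clearly below that of this kind of trivial baseline.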
In order to improve the accuracy of this algorithm, I replaced the leaves of the regression tree with their own separate regression models, using the M5P algorithm. For this new model tree, the regression models are built only on the DC and DMC attributes.
library(RWeka)
m.m5p <- M5P(month~DC+DMC, data = fires_train)
summary(m.m5p)
##
## === Summary ===
##
## Correlation coefficient 0.8943
## Mean absolute error 0.842
## Root mean squared error 1.2506
## Relative absolute error 47.8065 %
## Root relative squared error 54.4956 %
## Total Number of Instances 416
Based on the summary of the new model, which is computed on the training data, the Mean Absolute Error is 0.842, unfortunately higher than the 0.309 that the original tree achieved on the testing data. In addition, the correlation coefficient is 0.8943, lower than the original 0.9095.
Next, I used the “predict” function to generate predictions from the new model.
p.m5p <- predict(m.m5p, fires_test)
summary(p.m5p)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.451 7.134 7.412 6.946 8.460 9.055
summary(fires_test$month)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 8.000 8.000 7.832 9.000 12.000
As evident from the prediction summaries, the new model’s predictions are not as close to the testing dataset as those of the former model. To further compare the accuracy, I will use the correlation function.
cor(p.m5p, fires_test$month)
## [1] 0.802855
Unfortunately, the new model’s correlation with the testing data (0.803) is lower than that of the original regression tree (0.910).
MAE(fires_test$month, p.m5p)
## [1] 0.9871519
As evident from the code output, the Mean Absolute Error between the testing data and the new model’s predictions has increased from 0.309 to 0.987. Although I attempted to boost the rpart model’s performance, the result was a model with higher error and lower accuracy.
Next, I attempted to tune the rpart algorithm using a parameter known as “cp,” the complexity parameter. Using the “caret” package in R, I can run the rpart algorithm on the data, and caret will automatically select the best complexity parameter for the model.
In addition, I used the RMSE, or Root Mean Square Error, as the metric for the error of my model.
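For reference, RMSE can be written as a small function in the same style as the MAE function. The name `RMSE_sketch` and the vectors below are hypothetical; the example shows that a single large miss raises RMSE more than it raises MAE.

```r
# Root mean square error: square the errors, average them, take the square root.
RMSE_sketch <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical values: three exact predictions and one that is off by 4 months.
actual <- c(8, 9, 8, 3)
predicted <- c(8, 9, 8, 7)

rmse_value <- RMSE_sketch(actual, predicted)
mae_value <- mean(abs(actual - predicted))
c(rmse_value, mae_value)
```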
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'caret'
## The following object is masked _by_ '.GlobalEnv':
##
## MAE
set.seed(300)
m_rpart <- train(month ~ ., data = fires, method = "rpart",
metric = "RMSE")
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
m_rpart
## CART
##
## 517 samples
## 11 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 517, 517, 517, 517, 517, 517, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.02071644 1.026758 0.7929236 0.5540386
## 0.04644137 1.103480 0.7630930 0.6859543
## 0.78276245 1.643340 0.7239524 1.1515641
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.02071644.
## View for further information of this method: https://datascience.stackexchange.com/questions/31346/caret-and-rpart-does-caret-automatically-prune-rpart-trees
fires_pred <- predict(m_rpart, fires)
cor(fires_pred, fires$month)
## [1] 0.9357592
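As evident from the code output, the accuracy of the rpart algorithm has increased slightly, based on the correlation between the algorithm’s predictions and the actual values. While the original model had a correlation of 0.9095, the correlation for the tuned model has increased to 0.9358. Note, however, that the tuned model was trained and evaluated on the full dataset rather than on the held-out testing data, so the two values are not directly comparable; the resampled R-squared of 0.793 reported by caret is a more conservative estimate.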
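The idea of choosing cp and pruning can also be sketched with rpart alone. This is not what caret does internally (caret compares candidate cp values using bootstrap resampling); it is a simplified illustration that uses rpart's built-in cross-validation table, with the `mtcars` data that ships with R standing in for the fires data.

```r
library(rpart)
set.seed(300)

# Grow a deliberately large tree (cp = 0) on a stand-in dataset.
full_tree <- rpart(mpg ~ ., data = mtcars, method = "anova",
                   control = rpart.control(cp = 0, minsplit = 5, xval = 10))

# The cp table lists, for each candidate cp, the cross-validated error (xerror).
cp_table <- full_tree$cptable

# Pick the cp with the smallest cross-validated error and prune to it.
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)

# The pruned tree has no more nodes than the full tree.
c(nrow(full_tree$frame), nrow(pruned_tree$frame))
```

A larger cp removes more branches, trading a slightly worse fit on the training data for a simpler tree that is less likely to overfit.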
Overall, I have found this model to be fairly useful for predicting the month in which a forest fire occurred, based on certain conditions. While one attempt to boost the model’s performance failed, it appears that the “caret” package’s tuning of the complexity parameter produced better results than the original rpart model, resulting in a highly accurate model for predicting the month in which a fire will occur.
The attribute “month” was converted from a factor to a numeric encoding of the months, e.g., Jan = 1, Feb = 2, etc.
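The conversion step is not shown in the code above, so the following is only one possible way to do it, assuming the months are stored as lowercase three-letter abbreviations. The vector `month_abbr` is a hypothetical sample; `month.abb` is R's built-in vector of month abbreviations.

```r
# Hypothetical sample of month labels, assumed to be lowercase abbreviations.
month_abbr <- c("aug", "sep", "mar", "dec")

# Look up each label's position in the built-in month.abb vector (Jan = 1, ...).
month_num <- match(month_abbr, tolower(month.abb))
month_num
```

Any label that does not match a month abbreviation comes back as `NA`, which makes unexpected values easy to spot.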
To further explore the dataset, I will use the summary function in order to generate numerical summaries of the data attributes.
summary(fires)
## X Y month FFMC
## Min. :1.000 Min. :2.0 Min. : 1.000 Min. :18.70
## 1st Qu.:3.000 1st Qu.:4.0 1st Qu.: 7.000 1st Qu.:90.20
## Median :4.000 Median :4.0 Median : 8.000 Median :91.60
## Mean :4.669 Mean :4.3 Mean : 7.476 Mean :90.64
## 3rd Qu.:7.000 3rd Qu.:5.0 3rd Qu.: 9.000 3rd Qu.:92.90
## Max. :9.000 Max. :9.0 Max. :12.000 Max. :96.20
## DMC DC ISI temp
## Min. : 1.1 Min. : 7.9 Min. : 0.000 Min. : 2.20
## 1st Qu.: 68.6 1st Qu.:437.7 1st Qu.: 6.500 1st Qu.:15.50
## Median :108.3 Median :664.2 Median : 8.400 Median :19.30
## Mean :110.9 Mean :547.9 Mean : 9.022 Mean :18.89
## 3rd Qu.:142.4 3rd Qu.:713.9 3rd Qu.:10.800 3rd Qu.:22.80
## Max. :291.3 Max. :860.6 Max. :56.100 Max. :33.30
## RH wind rain area
## Min. : 15.00 Min. :0.400 Min. :0.00000 Min. : 0.00
## 1st Qu.: 33.00 1st Qu.:2.700 1st Qu.:0.00000 1st Qu.: 0.00
## Median : 42.00 Median :4.000 Median :0.00000 Median : 0.52
## Mean : 44.29 Mean :4.018 Mean :0.02166 Mean : 12.85
## 3rd Qu.: 53.00 3rd Qu.:4.900 3rd Qu.:0.00000 3rd Qu.: 6.57
## Max. :100.00 Max. :9.400 Max. :6.40000 Max. :1090.84
Next, I will create numerical summaries of the most important features in the dataset. (Determined in Step 3.)
summary(fires$DMC)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.1 68.6 108.3 110.9 142.4 291.3
summary(fires$DC)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.9 437.7 664.2 547.9 713.9 860.6