Summary

This program is designed to use geographical (Fire location), temporal (Month and Day), Fire Weather varibles () and weather variables (RH, Temp, Rain, Wind) to predict the area burned by forest fires. Data were obtained from the UCI Machine Learning database (http://archive.ics.uci.edu/ml/datasets/Forest+Fires) and contain details for 517 fires found in the Montesinho Natural Park in Portugal.

Here we used a linear model to predict area burned where, after multiple models, the final model contained Month, DMC, DC, ISI, wind and temp predictors with addioonal interactive terms. Despite this however, the final model only achieved a 9% accuracy meaning there is likely to be a large error rate for any predictions made. This was largely due to the limited number of data cases and releativly low level correlations between the predictor variables and area outcome measure.

Details on how to just run the program and use the predictive function are found directly below. More information on the exact models used and building process can be found after that.

Prerequisite

Install the following packages: caret,grDevices,leaps,corrplot,car,DAAG and relaimpo.
Ensure you have set your working directory to the location you want to to perform the analysis.

Load in relevant packages

library(caret)
library(grDevices)
library(leaps)
library(relaimpo)
library(corrplot)
library(car)
library(DAAG)
# Alters the way numbers are outputted
options(scipen = 999,digits = 3)

Making Predictions

Here we can input current spatial, temporal, weather and fire index conditions in order to predict the area burned.

Source the forestfirecode.R into a session of R
Run the predictfire() function
The console will then ask you to input values for each of the predictor values.
Input these 1 at a time (with day and month information as numeric values).
The model will then predict the area burned based on these values.

All associated graphs to used to visualize the data and model (presented below) will also be saved as pdf files to the same project folder file for later inspection.

Building the Model

Load Data into R

Here we’ll firstly check whether a independent file has been created for the analysis and whether the datafile has been downloaded. If neither the folder nor the file exist, the program will create a folder and download the data.

It will then read in the data to a forestdata variable

# Location you want the project to work in
projectfolder <- "Forest_Fire"

# Creates a project folder is none exist and changes directory into it
mainDir <- getwd()
dir.create(file.path(mainDir,projectfolder), showWarnings = FALSE)
setwd(paste(mainDir,projectfolder, sep="/"))

# Check if the data exists and download if not.
destfile <- paste(getwd(),"forestfires.csv", sep="/")
if(!file.exists("forestfires.csv")){
  download.file(url="https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv",destfile)}
# Read in the datafile
forestdata <- read.csv(destfile, header = TRUE, sep = ",",stringsAsFactors=FALSE)
origdata <- forestdata

Assessing and Visualizing the data

Here we can see we have the full 517 data cases with 13 variables which match up to the data documentation for the dataset (12 predictors plus an area outcome). We can also see there were 0 cases of missing cases in the dataset. Nevertheless, although we have a complete dataset, there are only a very limited number of cases which might pose problems for the models accuracy. This is particualy concerning considering we have 247 or 47.776% of cases with 0 hectors burned.

# Number of data cases and variables
dim(forestdata)

## [1] 517  13

# Check names of variables - match database infomation.
colnames(forestdata)

##  [1] "X"     "Y"     "month" "day"   "FFMC"  "DMC"   "DC"    "ISI"  
##  [9] "temp"  "RH"    "wind"  "rain"  "area"

# Check to see if any data missing - we're OK here so can proceed.
sum(is.na(forestdata))

## [1] 0

# Check to see how many cases have an area of 0
length(which(forestdata$area==0))

## [1] 247

Looking at the data, we may also want to perform some initial transformations on both the predictor and outcome variables before we start modelling.

Predictors

Both the month and day variables are currently strings so we want to convert them into numeric values to enable easier modelling and visualization.

# Convert month and day string variables into numeric values
forestdata$month <- as.numeric(as.factor(forestdata$month))
forestdata$day <- as.numeric(as.factor(forestdata$day))

We can now also start to visualize the data. Here we can see density plots for each of the 12 predictor variables. This plot will be saved into the analysis directory when running the program. Here we can see Rain has a heavy right-skew and FFMC has a large left-skew.

Looking at the rain variable specifically, there is very little variance with only 1.56% of the data containing a non-0 rain level. Even if this variable were transformed, it is unlikley to facilitate model prediction so will be removed.

Finally, the FFMC variable will be transformed using a cube in order to reduce its left skew. The figure below show this this transformation was indeed successful.

# Density plot - Predictors
# Shows us rain ais right skewed and FFMC is left skewed. 
# We can see normal looking distributions for temp, wind, RH, X, DMC and day.
#pdf(file="predictordesnisty.pdf")
par(mfrow=c(2,6),mar=c(3.90, 4.25, 2.5, 0.5))
for (variables in 1:(dim(forestdata)[2]-1)){
  thisvar = forestdata[,variables]
  d <- density(thisvar)
  plot(d, main = names(forestdata[variables]),xlab="")
  polygon(d, col="cyan", border="blue")
  title("Density plots for all 11 Model Variables", line = -19, outer = TRUE)}

# Rain variable has a heavy 0 distribution with only 1.56% of the data
# This will therefore be removed from the model as there is not enough variance
print(paste("Percentage non-zero rain: ",round(length(which(forestdata$rain>0)) /dim(forestdata)[1]*100,2)))
forestdata <- forestdata[,-which(colnames(forestdata)== "rain")]

## [1] "Percentage non-zero rain:  1.55"

# Since the FFMC is left-skew, we'll cube it to normalize it
par(mfrow=c(1,2),mar=c(5, 4.25, 5.5, 2))
d <- density(forestdata$FFMC)
plot(d,main="FFMC Density (original)",xlab="FFMC index", col='tomato', lwd=3)
forestdata$FFMC<- (forestdata$FFMC^3)
d <- density(forestdata$FFMC)
plot(d,main="FFMC Density (x^3)",xlab="FFMC index", col='tomato', lwd=3)

Output

We can also visualize the density of the outcome variable of Area burned. Here we can again see a heavy right-skew in the data set with many 0 values as previously mentioned. This varible will therefore be transformed using a log transformation where 1 will first be added to the area (to account for the 0 values). The figure below shows the success of this transformation in reducing the skew.

# Density plot - Outcome
# Shows us a extensive right skew in the data
par(mfrow=c(1,2),mar=c(5, 4.25, 5.5, 2))
  d <- density(forestdata$area)
  plot(d,main="Area Burned Density (original)",xlab="Area Burned (Hec)", col='tomato', lwd=3)
  d <- density(log(forestdata$area+1))
  plot(d,main="Area Burned Density (log(x+1))",xlab="Area Burned (Hec)", col='tomato', lwd=3)

# Heavy skew indicates log transformation
# Since there are also many 0 counts for area, we'll first add 1 before transforming
forestdata$area <- log(forestdata$area+1)

Visualizing Variable relationship

Here we now want to examine the correlations between all of our 13 variables where we can see a number of important correlations which we may want to account for in our model.
1. Positive correlations between ISI, temp, DCM and DC
2. Positive correlations between X and Y
3. Negative correlaitons between RH and Temp

We can also see that the Area outcome variable isn’t strongly correlatied with any of the 12 predictors.

# Examine correlations between all 12 predictors and the area outcome
par(mfrow=c(1,1))
M <- cor(forestdata)
corrplot(M, method="color", outline = TRUE,type="lower",order = "hclust",
         tl.col="black", tl.srt=45, diag=FALSE,tl.cex = 1,mar=c(0,0,3,0),
         title="Correlation Matrix between Predictor and Outcome variables")

Data modelling

Assumptions

Here we’ll be plotting linear regressions to try and predict area burned from our 12 predictors. This means we’ll first need to check the data meet the assumpmptions of this test

Homoscedasticity Here we can see that the data meet the assumption of Homoscadacity where the variances in the residuals of the model are randomly distributed. Using the Breush-Pagan test, we can see a non-signifincant results suggesting that the data can be considered homoscedastic.

assumptionsmodel <- lm(area ~ ., data=forestdata)
lmtest::bptest(assumptionsmodel)
par(mfrow=c(2,2))
plot(assumptionsmodel)


    studentized Breusch-Pagan test

data:  assumptionsmodel
BP = 20, df = 10, p-value = 0.1

Normality

Here, due to the large sample size, even small deviations from normality will be picked up as significant so normality will be assessed with plots. There also seems to be a heavy skew to the residuals which are not normally distributed. This appears to be due to the large number of 0’s in the database. When these 0’s are removed, we can see the residuals become more normally distributed. Although this means we lose a large chunk of the data cases, this is needed in order to correctly use the lm model.

assumptionsmodel_all <- lm(area ~ ., data=forestdata)
assumptionsmodel_0 <- lm(area ~ .,data=forestdata[which(forestdata$area>0),])
# Remove all cases with an area burned of 0
forestdata <- forestdata[which(forestdata$area>0),]
# Plots both with and without 0 residuals
par(mfrow=c(1,2))
hist(assumptionsmodel_all$residuals, main = "Data with 0 area burned", xlab = 'Residuals')
abline(v=mean(assumptionsmodel_all$residuals), col='red', lwd=2)
hist(assumptionsmodel_0$residuals,main = "Data without 0 area burned", xlab = 'Residuals')
abline(v=mean(assumptionsmodel_0$residuals), col='red', lwd=2)

Basic Model

Our basic model just uses a linear model with all 12 predictor variables to account for area burned.

model1 <- lm(area ~ ., data=forestdata)
outcome1 <- summary(model1)
(round(outcome1$coefficients[ , c(2,4)],3))
# Find which variables are significant
sig.pred <- row.names(outcome1$coefficients)[which(outcome1$coefficients[ ,4] <= 0.05)]
rsquare <- round(summary(model1)$r.squared,3)
print(paste("The R2 value is ",rsquare))

            Std. Error Pr(>|t|)
(Intercept)      1.260    0.077
X                0.038    0.442
Y                0.077    0.499
month            0.022    0.039
day              0.041    0.618
FFMC             0.000    0.982
DMC              0.002    0.021
DC               0.001    0.088
ISI              0.031    0.317
temp             0.021    0.994
RH               0.007    0.214
wind             0.046    0.182
[1] "The R2 value is  0.05"

We can see that there is a very low level of variance explained by the model with a r squared value of 0.05. However we can see that month, DMC are a significant predictor.

Second Model: Add interactive

We saw in the correlation matrix that there was a large correlation between the forest fire index variables (FFMC,DMC,DC and ISI). It may therefore be beneficial to look at interactive terms between these variables. The same may also be said for the weather variable which also show some correlations.

# Create interactive terms for Fire index
forestdata$FFMC.DMC <- forestdata$FFMC*forestdata$DMC
forestdata$FFMC.DC <-forestdata$FFMC*forestdata$DC
forestdata$FFMC.ISI <-forestdata$FFMC*forestdata$ISI
forestdata$DMC.DC<-forestdata$DMC*forestdata$DC
forestdata$DMC.ISI<-forestdata$DMC*forestdata$ISI
forestdata$DC.ISI<-forestdata$DC*forestdata$ISI

# Create interactive terms for Weather
forestdata$wind.temp<-(forestdata$wind)*(forestdata$temp)
forestdata$temp.RH<-(forestdata$temp)*(forestdata$RH)
forestdata$wind.RH<-(forestdata$wind)*(forestdata$RH)

# Model with original and interactive terms
model2 <- lm(area~.,data=forestdata)
outcome2 <- summary(model2)
# Find which variables are significant
sig.pred <- row.names(outcome2$coefficients)[which(outcome2$coefficients[ ,4] <= 0.05)]
rsquare <- round(summary(model2)$r.squared,3)
print(paste("The R2 value is ",rsquare))

## [1] "The R2 value is  0.095"

Again, we can see our R2 value is 0.095 which has improved by 0.044 compared to the previous model. We can also see the following predictors are now significant: month, DMC, FFMC.DMC

Third Model: Remove Outliers

We now want to explore whether there are any data points that are significanty influence the model were trying to plot. These points can have a significant impact on the accuracy of the models predictions. Here we can see some data cases that are both outliers and influencial. These points should therefore be removed from the analysis.

# Visualise the outliers and influence points in the model
influencePlot(model2,id.n=5)

##     StudRes    Hat    CookD
## 200  -0.820 0.5851 0.045224
## 239   3.750 0.0241 0.015727
## 407   0.192 0.3326 0.000876
## 470   1.746 0.2517 0.048435
## 480   3.097 0.0618 0.029097

# Remove the 2 outliers from the data
forestdata2 <- forestdata[-which(row.names(forestdata) %in% c(200,470)),]

# Run full model again with outliers removed
model3 <- lm(area~.,data=forestdata2)
outcome3 <- summary(model3)
# Find which variables are significant
sig.pred <- row.names(outcome3$coefficients)[which(outcome3$coefficients[ ,4] <= 0.05)]
rsquare <- round(summary(model3)$r.squared,3)
print(paste("The R2 value is ",rsquare))

## [1] "The R2 value is  0.1"

Again, our model now has an R2 value of 0.1 showing a 0.006 improvement now the outliers have been removed. We can also see the following predictors are now significant: month, FFMC.DMC

Cross Validation and Variable selection

Now that we’ve finalised the variables, we need to select which variables to use in the model as there are currently too many predictors. Although reducing the number of variables will likley reduce the r2 value, this is needed to avoid overfitting.

We first want to run a cross-validation to assess the accuray of the model with all the variables in using k-means techniques. This will split the dataset into train and test chunks in 89 different ways and evaluate the model error for each.

#Perform initial cross validaiton
par(mfrow=c(1,1))
cv_all<-cv.lm(data=forestdata2, model3, m=3)

RMSD_origin<-sqrt(mean((cv_all$cvpred-cv_all$area)^2))
print(paste("All variable model RMSD: ",RMSD_origin))

## [1] "All variable model RMSD:  1.34849411888118"

As we can see above, the model has a Root MSD of 1.348. We now want to find which subset of variables gives us the best prediction accuracy. The pot below shows how models with different variable subsets compare in terms of adjusted R2 where we can see the variables for the best subset at the top.

# Loop though all subsets to calucate a model which each as the predictors
subset.out<-regsubsets(area~ ., data=forestdata2,nbest=1,nvmax=NULL,method="exhaustive")
summary.out<-summary(subset.out)
# Returns the variables used in the best model
bestmodelvariables<- summary.out$which[which.max(summary.out$adjr2),]  
par(mfrow=c(1,1))
# PLot all the subsets against the adjusted r2 value
plot(subset.out,scale="adjr2",main="All Subset Regression")

variables <- names(bestmodelvariables[which(bestmodelvariables==TRUE)])[-1]

Here we can see the best model uses the following variables: month, DMC, DC, ISI, wind, FFMC.DMC, FFMC.DC, DMC.DC, temp.RH.
We can finally run the same procedure again with these specific varaibles and perform cross-validation on the final model.

# Extract the best variables and inset into model
formula <- noquote(paste("area~",paste(variables, collapse = '+'),sep=""))
modelfinal<-lm(formula,data=forestdata2)  
# Perform cross-validation on the new model
cv_final<-cv.lm(data=forestdata2, modelfinal, m=3)

RMSD_final<-sqrt(mean((cv_final$cvpred-cv_final$area)^2))
print(paste("Subset variable model RMSD: ",round(RMSD_final,2)))

## [1] "Subset variable model RMSD:  1.26"

Here we can see using a subset of the variables has improved our root MSD by 0.09 and gives a model that can explain 9.055% of the data.

New Predictions

The model can now be used to predict forest fires based on new preidctor values. Below we ask the user to input these values to the console command line before returning the predicted value. However, as the r2 value of the model is very low, this likely means that predictiors will have a large error and may not be reliable.

  modelinput <- data.frame(matrix(nrow=1, ncol=length(origdata)-1))
  colnames(modelinput) <- colnames(origdata)[-length(origdata)]
  # Loop through each variable and ask ser to input the value
  for (i in 1:(length(origdata)-1)){
    thisvar <- colnames(origdata)[i]
    varinput <- readline(paste("Input the ",thisvar," value:"))
    varinput <- as.numeric(unlist(strsplit(varinput, ",")))
    modelinput[1,i] <- varinput
  }
  # Create interactive terms for the Fire index
  modelinput$FFMC.DMC <- modelinput$FFMC*modelinput$DMC
  modelinput$FFMC.DC <-modelinput$FFMC*modelinput$DC
  modelinput$FFMC.ISI <-modelinput$FFMC*modelinput$ISI
  modelinput$DMC.DC<-modelinput$DMC*modelinput$DC
  modelinput$DMC.ISI<-modelinput$DMC*modelinput$ISI
  modelinput$DC.ISI<-modelinput$DC*modelinput$ISI
  
  # Create interactive terms for Weather
  modelinput$wind.temp<-(modelinput$wind)*(modelinput$temp)
  modelinput$temp.RH<-(modelinput$temp)*(modelinput$RH)
  modelinput$wind.RH<-(modelinput$wind)*(modelinput$RH)
# Use the model to predict new value
  modelinput <- modelinput
  newpred <- predict(modelfinal, modelinput)
  # Backwards transformation to original hec. value.
  newpred <- exp(newpred)-1
  # Print out new value
  print(paste("We expect fires in these conditions to burn ",round(newpred,2)," hec"))

Predicting Forest Fires: Linear Modelling

Rebecca Elizabeth Kitching

12/09/2018