Introduction
Data was collected to observe which factors have the greatest impact
on the area of land burned in a forest fire.
In the previous assignment, several regression models were created in
order to determine which one provides the best quality of prediction and
estimation for the area affected by a forest fire. For this project, we
will look at the method of bootstrapping the multiple regression model.
We will use the bootstrap method to find bootstrap confidence intervals
of the variables in the regression model.
Data Description
I found this data set on kaggle.com on the following webpage: https://www.kaggle.com/datasets/elikplim/forest-fires-data-set.
The data set is named as “forestfires.csv”.
This data set uses meteorological data to predict the approximated
area which will be burned in forest fires based upon these various
factors. This allows us to investigate the factors which have an impact
on the area of land which is burned by a forest fire. Such data is
incredibly useful in order to help know which factors have the greatest
significance on the area affected by a forest fire, because this could
help provide neccessary data to help minimize the area damaged by forest
fires.
The data in this data set was collected around the Montesinho Natural
Park which is located in Portugal. The data was collected from 517
observations of forest fires which occurred within this national park.
To further enhance the data collection, researchers divided up the
national park into a set of spatial coordinates to pinpoint the precise
location of the forest fires within Montesinho Natural Park. The
national park was divided into x-axis and y-axis spatial coordinates,
with the x-axis coordinates ranging from 1 to 9, and the y-axis
coordinates ranging from 2 to 9. Each observation was pinpointed to the
coordinate location of where the forest fire began.
Variables
There are 13 variables in the forestfires data set.
X: The x-axis spatial coordinate of the forest fire’s location
within Montesinho Natural Park. A numeric value ranging from 1 to
9.
Y: The y-axis spatial coordinate of the forest fire’s location
within Montesinho Natural Park. A numeric value ranging from 2 to
9.
month: The month of the year in which the forest fire occurred. A
categorical, character variable with the lowercase abbreviation of the
first three letters of the month. (For example: feb, oct, dec)
day: The day of the week on which the forest fire occurred. A
categorical, character variable with the lowercase abbreviation of the
first three letters of the month. (For example: mon, tue, sat)
FFMC: FFMC index from the FWI System. A numeric, quantitative
variable. The FWI System stands for the Fire Weather Index System and it
is a system that provides a numeric rating on a scale for the intensity
of a fire. FFMC stands for Fine Fuel Moisture Code and it is a measure
of the moisture content of litter and other fuels for a fire that are
present.
DMC: DMC index from the FWI System. A numeric, quantitative
variable. DMC stands for Deep Moist Convection and it is a measure of
decomposed organic material underneath the litter which fuels the
fire.
DC: DC index from the FWI System. A numeric, quantitative
variable. DC stands for Drought Code and is a measure of the moisture
content of compact organic layers in the area affected by the
fire.
ISI: ISI index from the FWI System. A numeric, quantitative
variable. ISI stands for Initial Spread Index and is a measure of the
expected rate of fire spread.
temp: The temperature given in degrees Celsius. A numeric,
quantitative variable.
RH: The relative humidity given as a percentage. A numeric,
quantitative variable.
wind: The speed of the wind given in kilometers per hour (km/h).
A numeric, quantitative variable.
rain: The rain occurring outside given in a measurement of
mm/m^2. A numeric, quantitative variable.
area: The area of the forest which was burned by the fire.
Measured in units of hectares (ha). A numeric, quantitative variable.
The response variable of this experimental study.
Research
Questions
For this project, our main goal is to create a model which uses the
various factors involved to provide significance for the prediction and
estimation of the area burned by a forest fire.
Some further research questions which I came up with over the past
few weeks of analyzing this data include:
Which factors have the greatest significance in predicting the
area burned by a forest fire?
Which factors have the highest correlation with each other? And,
what are some possible explanations for these factors have the greatest
correlation?
Are there any major outliers in the data set which would prompt
further examination to see what is going on in that occurrence?
Is this data set from Montesinho National Park applicable to the
overall scope of forest fires around the world? Is our data set
representative of forest fires of all locations?
How useful is the model we will create in predicting the area
affected by a forest fire? And, how can this data help with predicting
patterns of forest fires in order to minimize the damage they
cause?
We will use these research questions as a starting point for our
analysis of this experiment. We will come back to these questions in the
conclusion to see which research questions we have concluded an answer
to.
Data Preparation and
Exploratory Analysis
To begin, let’s read in the data set from Github.
'data.frame': 517 obs. of 13 variables:
$ X : int 7 7 7 8 8 8 8 8 8 7 ...
$ Y : int 5 4 4 6 6 6 6 6 6 5 ...
$ month: chr "mar" "oct" "oct" "mar" ...
$ day : chr "fri" "tue" "sat" "fri" ...
$ FFMC : num 86.2 90.6 90.6 91.7 89.3 92.3 92.3 91.5 91 92.5 ...
$ DMC : num 26.2 35.4 43.7 33.3 51.3 ...
$ DC : num 94.3 669.1 686.9 77.5 102.2 ...
$ ISI : num 5.1 6.7 6.7 9 9.6 14.7 8.5 10.7 7 7.1 ...
$ temp : num 8.2 18 14.6 8.3 11.4 22.2 24.1 8 13.1 22.8 ...
$ RH : int 51 33 33 97 99 29 27 86 63 40 ...
$ wind : num 6.7 0.9 1.3 4 1.8 5.4 3.1 2.2 5.4 4 ...
$ rain : num 0 0 0 0.2 0 0 0 0 0 0 ...
$ area : num 0 0 0 0 0 0 0 0 0 0 ...
The data was collected by pinpointing the forest fire’s location of
origin from a system of spatial coordinates around Montesinho Natural
Park. So, we will create a graph to help visualize the locations of the
occurrences.

As we can see, the forest fire occurrences are spread out throughout
the park. It appears that more forest fires were observed in the lower
half of the Y-coordinates, around 2-5, however the data still appears to
be spread out enough to continue with our investigation of the
model.
Refining the Data
set
While looking at the data set, something I noticed is that many
observations have an area of 0, meaning that none of the forest was
burned by a forest fire for those observations. This could provide for
some comparisons between what conditions caused a fire and what did not,
or lead to questions of why a fire occurred on a certain day but not
another. However, the purpose of this experiment is to see what factors
have a significant impact on the area burned in a forest fire. If we
were to include these observations with an area of 0 affected by the
fires, then this may lead to our observations being skewed or separated
into two groups: area of zero and area of nonzero, aka no fire damage
and fire damage. This could have a negative affect on our final
regression model, by clustering our data into two groups, zero and
nonzero area, which would cause for notable inaccuracy. After some
consideration, I believe it would be in the best interest of this
experiment to create a new data frame, one which only includes the
observations where there was in fact a forest fire which caused
damage.
Let’s create a new data set called “forestfires1” that includes only
the observations where the area affected is greater than 0.
X Y month day FFMC DMC DC ISI temp RH wind rain area
1 9 9 jul tue 85.8 48.3 313.4 3.9 18.0 42 2.7 0 0.36
2 1 4 sep tue 91.0 129.5 692.6 7.0 21.7 38 2.2 0 0.43
3 2 5 sep mon 90.9 126.5 686.5 7.0 21.9 39 1.8 0 0.47
4 1 2 aug wed 95.5 99.9 513.3 13.2 23.3 31 4.5 0 0.55
5 8 6 aug fri 90.1 108.0 529.8 12.5 21.2 51 8.9 0 0.61
6 1 2 jul sat 90.0 51.3 296.3 8.7 16.6 53 5.4 0 0.71
'data.frame': 270 obs. of 13 variables:
$ X : int 9 1 2 1 8 1 2 6 5 8 ...
$ Y : int 9 4 5 2 6 2 5 5 4 3 ...
$ month: chr "jul" "sep" "sep" "aug" ...
$ day : chr "tue" "tue" "mon" "wed" ...
$ FFMC : num 85.8 91 90.9 95.5 90.1 90 95.5 95.2 90.1 84.4 ...
$ DMC : num 48.3 129.5 126.5 99.9 108 ...
$ DC : num 313 693 686 513 530 ...
$ ISI : num 3.9 7 7 13.2 12.5 8.7 13.2 10.4 6.2 3.2 ...
$ temp : num 18 21.7 21.9 23.3 21.2 16.6 23.8 27.4 13.2 24.2 ...
$ RH : int 42 38 39 31 51 53 32 22 40 28 ...
$ wind : num 2.7 2.2 1.8 4.5 8.9 5.4 5.4 4 5.4 3.6 ...
$ rain : num 0 0 0 0 0 0 0 0 0 0 ...
$ area : num 0.36 0.43 0.47 0.55 0.61 0.71 0.77 0.9 0.95 0.96 ...
Let’s create another map for our refined data set to ensure
everything still looks alright.

The map appears very similar to the one created for the forestfires
data set, with the points spread out across the coordinates, so this is
good. Once again, it appears that more forest fires were observed in the
lower half of the Y-coordinates, around coordinates 2-5, but apart from
this, the points are spread out across the coordinate system.
Having a refined data set which focuses on the observations in which
forest fires did affect and cause damage to an area of the forest will
provide better insight and power for our purpose of this experiment,
which is to see which factors have significance in the amount of area
damaged by forest fires. By focusing only on the observations which were
affected by forest fires, this will be more useful for our goal and
allowing us to create a model for prediction and estimation without the
inclusion of redundant observations which do not help our model.
Let’s check for any imbalances in the categorical variable of
“month”.
apr aug dec feb jul jun mar may oct sep
4 99 9 10 18 8 19 1 5 97
It turns out that there is a very significant imbalance in the
variable month. The number of observations in August and September, 99
and 97 respectively drastically outweigh all of the other months, with
no other month having greater than 19 observations. This is a major
cause for concern as this categorical variable is very severely
imbalanced.
We will keep the month variable for now, as it seems like what month
it is does have significance in predicting area affected by forest
fires, but we will keep a close eye on it for the final model to
determine if it should be removed due to its imbalance. We will address
this variable later on to see if its removal will help improve our
model. This month variable is one we will keep a close watch on due to
this concern of imbalance.
Let’s check the other categorical variable “day” for potential
imbalances.
fri mon sat sun thu tue wed
43 39 42 47 31 36 32
The “day” variable does not show major imbalance, as all the days
have a significant number of observations with the difference between
the highest and lowest days being a total of 15 observations, Sunday
with 47 and Wednesday with 32. However, the day of the week does not
seem like it would have as major significance in predicting forest fire
damage then some of the other factors in question. This can be seen as
by how the days of the week are not very imbalanced in regards to their
total observations. This is unlike the month variable where it was shown
that the vast majority of the observations fall into either August or
September, evidencing that the majority of forest fires occur within
these warmer, end of summer months.
To avoid having too many or too redundant variables within our full
model, the categorical variable for “day” will not be included.
9 |
9 |
jul |
85.8 |
48.3 |
313.4 |
3.9 |
18.0 |
42 |
2.7 |
0 |
0.36 |
1 |
4 |
sep |
91.0 |
129.5 |
692.6 |
7.0 |
21.7 |
38 |
2.2 |
0 |
0.43 |
2 |
5 |
sep |
90.9 |
126.5 |
686.5 |
7.0 |
21.9 |
39 |
1.8 |
0 |
0.47 |
1 |
2 |
aug |
95.5 |
99.9 |
513.3 |
13.2 |
23.3 |
31 |
4.5 |
0 |
0.55 |
8 |
6 |
aug |
90.1 |
108.0 |
529.8 |
12.5 |
21.2 |
51 |
8.9 |
0 |
0.61 |
1 |
2 |
jul |
90.0 |
51.3 |
296.3 |
8.7 |
16.6 |
53 |
5.4 |
0 |
0.71 |
Model Building
For this project, we want to create a multiple linear regression
model that allows us to use the several factors in order to create a
model for the prediction and estimation of the area affected by forest
fires.
Full Model
We will begin with a full multiple linear regression model. We will
observe how this full model turns out to help decide whether we need to
apply any transformations to any of the variables.
Statistics of Regression Coefficients
(Intercept) |
-74.4004408 |
224.5630059 |
-0.3313121 |
0.7406863 |
X |
4.7819085 |
2.7221618 |
1.7566584 |
0.0802003 |
Y |
-2.2318912 |
5.4754358 |
-0.4076189 |
0.6839023 |
monthaug |
119.6584526 |
78.0476359 |
1.5331464 |
0.1265043 |
monthdec |
93.2179223 |
66.0743422 |
1.4108036 |
0.1595456 |
monthfeb |
2.9241777 |
51.9041602 |
0.0563380 |
0.9551175 |
monthjul |
66.9401095 |
66.5964018 |
1.0051611 |
0.3157912 |
monthjun |
21.2886426 |
63.0498198 |
0.3376480 |
0.7359118 |
monthmar |
-12.1109924 |
48.6547436 |
-0.2489170 |
0.8036294 |
monthmay |
7.9712044 |
98.5515997 |
0.0808836 |
0.9355993 |
monthoct |
168.8733920 |
94.7483285 |
1.7823364 |
0.0759079 |
monthsep |
174.9584210 |
87.6286192 |
1.9965900 |
0.0469541 |
FFMC |
0.7443619 |
2.4705591 |
0.3012929 |
0.7634416 |
DMC |
0.3982145 |
0.1627861 |
2.4462435 |
0.0151253 |
DC |
-0.3152171 |
0.1190645 |
-2.6474489 |
0.0086256 |
ISI |
-1.9149661 |
2.0946308 |
-0.9142261 |
0.3614790 |
temp |
2.5843307 |
1.9331153 |
1.3368736 |
0.1824792 |
RH |
-0.2528721 |
0.5528720 |
-0.4573792 |
0.6477957 |
wind |
3.3528858 |
3.3304720 |
1.0067299 |
0.3150380 |
rain |
-4.9594839 |
13.7861611 |
-0.3597436 |
0.7193425 |
Check for
Violations
Now, we will check the residual plots to look for potential
violations of our full model.

It appears that we do have some violations within our model. For
instance, variance is not constant as shown by how there are a few very
notable outliers to the right of the residual plot. Additionally, the
Q-Q Plot shows some violations as well as the right side of the data
appears to have some sort of curvature with these points not following a
normal distribution. Furthermore, the Cooke’s Distance plot shows that
we indeed do have some high leverage points in our model.
Let’s check the VIF for any potential multicollinearity issues in our
model.
GVIF Df GVIF^(1/(2*Df))
X 1.508460 1 1.228194
Y 1.472642 1 1.213525
month 117.881365 9 1.303408
FFMC 3.009177 1 1.734698
DMC 3.625445 1 1.904060
DC 26.871270 1 5.183751
ISI 2.704552 1 1.644552
temp 5.113912 1 2.261396
RH 2.491118 1 1.578327
wind 1.411811 1 1.188196
rain 1.081053 1 1.039737

It appears that we do have a few high VIF values present in our model
which leads to the likelihood of a multicollinearity issue.
Let’s use a correlation matrix to check which variables are causing
issues. We can only check the correlation of numeric variables so we
will create a temporary data set “cor.data” which does not include the
categorical variable of month.
X Y FFMC DMC DC ISI temp RH wind rain area
X 1.00 0.50 -0.07 -0.10 -0.16 -0.05 -0.08 0.06 0.04 0.06 0.07
Y 0.50 1.00 -0.02 0.04 -0.03 -0.07 0.03 -0.04 -0.04 0.03 0.05
FFMC -0.07 -0.02 1.00 0.48 0.41 0.70 0.56 -0.29 -0.16 0.08 0.05
DMC -0.10 0.04 0.48 1.00 0.67 0.33 0.50 0.03 -0.14 0.08 0.09
DC -0.16 -0.03 0.41 0.67 1.00 0.26 0.50 -0.08 -0.24 0.04 0.05
ISI -0.05 -0.07 0.70 0.33 0.26 1.00 0.47 -0.15 0.07 0.07 0.00
temp -0.08 0.03 0.56 0.50 0.50 0.47 1.00 -0.50 -0.32 0.08 0.11
RH 0.06 -0.04 -0.29 0.03 -0.08 -0.15 -0.50 1.00 0.14 0.10 -0.10
wind 0.04 -0.04 -0.16 -0.14 -0.24 0.07 -0.32 0.14 1.00 0.05 0.00
rain 0.06 0.03 0.08 0.08 0.04 0.07 0.08 0.10 0.05 1.00 -0.01
area 0.07 0.05 0.05 0.09 0.05 0.00 0.11 -0.10 0.00 -0.01 1.00
It appears that FFMC and ISI are highly correlated with a correlation
of 0.70. We will remove one of these variables to help avoid this
multicollinearity issue. We will keep FFMC, as it is the first variables
alphabetically of these two. Additionally, it appears that DMC and DC
are significantly correlated, with a correlation of 0.67. We will keep
DC as it is the first variable alphabetically of these two.
Now, let’s revise our final.data model with the variables adjusted to
address the multicollinearity issue that we encountered.
9 |
9 |
jul |
85.8 |
313.4 |
18.0 |
42 |
2.7 |
0 |
0.36 |
1 |
4 |
sep |
91.0 |
692.6 |
21.7 |
38 |
2.2 |
0 |
0.43 |
2 |
5 |
sep |
90.9 |
686.5 |
21.9 |
39 |
1.8 |
0 |
0.47 |
1 |
2 |
aug |
95.5 |
513.3 |
23.3 |
31 |
4.5 |
0 |
0.55 |
8 |
6 |
aug |
90.1 |
529.8 |
21.2 |
51 |
8.9 |
0 |
0.61 |
1 |
2 |
jul |
90.0 |
296.3 |
16.6 |
53 |
5.4 |
0 |
0.71 |
The adjusted final model is given as follows.
Statistics of Regression Coefficients
(Intercept) |
-75.1321207 |
182.9397173 |
-0.4106933 |
0.6816464 |
X |
3.8460358 |
2.7296652 |
1.4089772 |
0.1600746 |
Y |
0.3730475 |
5.4471380 |
0.0684851 |
0.9454538 |
monthaug |
43.5843177 |
73.6968537 |
0.5914000 |
0.5547826 |
monthdec |
42.3450183 |
63.7861600 |
0.6638590 |
0.5073874 |
monthfeb |
3.8319210 |
52.4571871 |
0.0730485 |
0.9418254 |
monthjul |
17.3621272 |
64.7226016 |
0.2682545 |
0.7887229 |
monthjun |
-10.9919974 |
62.5858539 |
-0.1756307 |
0.8607251 |
monthmar |
-12.1725199 |
49.1748282 |
-0.2475356 |
0.8046953 |
monthmay |
8.2755520 |
99.4698215 |
0.0831966 |
0.9337613 |
monthoct |
50.2980046 |
84.5434955 |
0.5949364 |
0.5524198 |
monthsep |
74.6973998 |
80.2226257 |
0.9311263 |
0.3526794 |
FFMC |
0.4424613 |
1.9435330 |
0.2276582 |
0.8200965 |
DC |
-0.1180168 |
0.0944542 |
-1.2494614 |
0.2126556 |
temp |
2.7700666 |
1.9310653 |
1.4344759 |
0.1526763 |
RH |
-0.0717121 |
0.5499241 |
-0.1304036 |
0.8963512 |
wind |
2.7234690 |
3.2044010 |
0.8499152 |
0.3961790 |
rain |
-5.5042055 |
13.9136960 |
-0.3955962 |
0.6927376 |
It is important to note that our full multiple linear regression
model has a p-value of p = 0.6816, which is greater than our alpha value
of 0.05. This indicates that this full model will likely not be our best
choice for prediction and estimation as this model was not statistically
significant in predicting the area burned by a forest fire.
Box Cox Models
Due to the violations in the residual plots of the model, we will use
the Box Cox transformations to adjust our data.
Comparison of Q-Q
Plots
Something which was made note of when performing the various
transformations was that there appeared to be some violations when
checking our residual conditions. One particular violation which
appeared to be very severe in certain cases was the failure of the Q-Q
Plot to indicate that the data appeared to follow a normal distribution.
Both the original linear model and the square root transformation model
especially had severe violations regarding their Q-Q Plots. However, the
logarithmic transformation model appeared to have a Q-Q Plot which was
much closer to indicting a normal distribution than the other two. While
there was still some slight deviation on the right hand side of the Q-Q
Plot for the logarithmic transformation, it was way less than either of
the other two models. Additionally, the logarithmic transformation was
the only Q-Q Plot which did not show blatant curvature in the right hand
side of it.

By comparing the Q-Q Plots of these three potential models, we can
see that the logarithmic model has the linear model and the square root
transformation model have very severe violations of normality in their
Q-Q Plot. While the logarthimic transformation model still does have
some deviation from normality, it is by far the least severe violation
of the three. This gives a strong point towards the logarithmic
transformation model being the best one to use for the prediction and
estimation of area.
Goodness of
Fit
We will now look at other goodness of fit methods for the final three
models of consideration. These methods for goodness of fit will include
useful tools for statistical analysis such as R-squared value, AIC, and
Mallow’s Cp.
We will create an output table to display the goodness of fit values
for each of the three models.
Goodness-of-fit Measures for the Three Candidate
Models
Linear Full Model |
1932593.5973 |
0.0398488 |
-0.0249233 |
18 |
2432.5069 |
Sqrt Area Log Temp |
3373.9333 |
0.0167906 |
-0.0094783 |
8 |
697.8614 |
Log Area |
620.3287 |
0.0106382 |
-0.0157952 |
8 |
240.5934 |
Overall, it appears that unfortunately all three models have very low
R squared values indicating low explained variance within their
respective models. The Mallow’s Cp value for the last two models, the
square root and logarithmic transformation models, is better than that
of the full linear model. The logarithmic transformation model has the
lowest AIC out of the three, which indicates that it is a better fit
model for the data.
All in all, the best model of the three appears to be the logarithmic
transformation. This has been shown by the goodness of fit measures that
were performed as well as by that it was the only model with a
statistically significant p-value, p = 0.0014, while the other two
models both had p-values that were greater than our alpha value of 0.05
indicating that the other two models were not statistically significant
in predicting area while the logarithmic model was.
So, we will select the logarithmic transformation model as our final
model to use.
Final Model
After our analysis of the three potential choices for a final model,
we decided that the best fit model to use is the logarithmic
transformation model. This model had the best goodness of fit measures
as well as it was the only one with a statistically significant
p-value.
Inferential Statistics of Final Model- Log Area
(Intercept) |
2.4834516 |
0.7666130 |
3.2395115 |
0.0013520 |
X |
0.0211869 |
0.0462344 |
0.4582490 |
0.6471534 |
Y |
-0.0367580 |
0.0928944 |
-0.3956964 |
0.6926510 |
DC |
0.0000890 |
0.0004891 |
0.1820437 |
0.8556892 |
temp |
-0.0181803 |
0.0213391 |
-0.8519727 |
0.3950073 |
RH |
-0.0098113 |
0.0074753 |
-1.3125061 |
0.1904980 |
wind |
0.0356783 |
0.0530221 |
0.6728946 |
0.5016074 |
rain |
0.0910809 |
0.2407878 |
0.3782623 |
0.7055421 |
Our overall model appears to be statistically significant for the
prediction and estimation of log(area). It appears that choosing the
logarithmic transformation as our final model was the best choice after
all.
The combined logarithmic model of X, Y, DC, temp, RH, wind, and rain,
statistically significantly predicted the log of area, p = 0.0014.
The regression equation for our final choice of the model is given as
follows:
log(area) = 2.4835 + 0.0212 * X - 0.0368 * Y + 0.0001 * DC - 0.0182 *
temp - 0.0098 * RH + 0.0357 * wind + 0.0911 * rain.
Included below are the inferential statistics of the final model.
Inferential Statistics of Final Model
(Intercept) |
2.4834516 |
0.7666130 |
3.2395115 |
0.0013520 |
X |
0.0211869 |
0.0462344 |
0.4582490 |
0.6471534 |
Y |
-0.0367580 |
0.0928944 |
-0.3956964 |
0.6926510 |
DC |
0.0000890 |
0.0004891 |
0.1820437 |
0.8556892 |
temp |
-0.0181803 |
0.0213391 |
-0.8519727 |
0.3950073 |
RH |
-0.0098113 |
0.0074753 |
-1.3125061 |
0.1904980 |
wind |
0.0356783 |
0.0530221 |
0.6728946 |
0.5016074 |
rain |
0.0910809 |
0.2407878 |
0.3782623 |
0.7055421 |
Final Model
Variable Interpretation
We will analyze the relationship of the independent variables in our
final regression model with their affect on area. As states previously
in the variable description section of this project, the measurements
for area are given in hectares. Additionally, since our final model
selected is a logarithmic transformation, these interpretations will be
looking at the effects of these independent variables on the log of
area.
X: For every additional 1 coordinate movement to the right, the
logarithmic transformation of area burned by a forest fire increases by
0.0212 hectares.
Y: For every additional 1 coordinate movement upwards,
logarithmic transformation of area burned by a forest fire decreases by
0.0368 hectare.
DC: For every additional 1 unit increase of DC, the measure of
the moisture content of compact organic layers in the area affected by
the fire, the logarithmic transformation of area burned by a forest fire
increases by 0.0001 hectare.
temp: For every 1 degree Celsius increase in temperature, the
logarithmic transformation of area burned by a forest fire decreases by
0.0182 hectare.
RH: For every 1 percentage increase in the relative humidity, the
logarithmic transformation of area burned by a forest fire decreases by
0.0098 hectare.
wind: For every 1 kilometer per hour increase in the speed of the
wind, the logarithmic transformation of area burned by a forest fire
increases by 0.0357.
rain: For every 1 mm/m^2 increase in the rain occurring, the
logarithmic transformation of area burned by a forest fire increases by
0.0911.
We can see that the variables that are positively associated with the
logarithmic transformation of area burned by a forest fire are X, DC,
wind, and rain. This indicates that as these variables increase, so does
the area burned by a forest fire. On the other hand, the variables that
are negatively associated with the logarithmic transformation of area
burned by a forest fire are Y, temp, and RH. This indicates that as
these variables increase, the area burned by a forest fire
decreases.
Bootstrapping the Final
Model
We chose the logarithmic transformation as our final model in the
previous section due to it being the model which both shows the greatest
statistical significance in the prediction of area, and has the fewest
violations of a multiple regression model of the models we
considered.
The equation of our final model is: log(area) = 2.4835 + 0.0212 * X -
0.0368 * Y + 0.0001 * DC - 0.0182 * temp - 0.0098 * RH + 0.0357 * wind +
0.0911 * rain.
Now, we will look at the process of bootstrapping this final model in
order to find the bootstrap confidence intervals of the coefficients in
our final model.
Bootstrap Cases
We will look at the bootstrap cases for our final model and use these
to help build the bootstrap confidence intervals of the regression
coefficients.
We will now use these bootstrap cases to create histograms of the
bootstrap regression coefficients.
The histograms of the bootstrap coefficients for the variables in the
model were created.
There are histograms for each of the independent variables in the
model, along with the intercept, except for the variable “rain”. The
variable “rain” contains several points with the value of zero. Since
the final model we chose to use is a logarithmic model, this results in
an error because log(0) is undefined. So, this variable was not included
in the histograms of the bootstrap coefficients due to it resulting in
this error. All of the other variables can be seen along with their
histograms.
Within the histograms, there are two different normal density curves
placed on them. The blue curve is a non-parametric density estimate of
the bootstrap sampling distribution. The bootstrap confidence intervals
we will calculate shortly are based upon this non-parametric bootstrap
sampling distribution. The purple curve is a measure of the estimate of
the regression coefficients and the corresponding value of their
standard error. The p-values given in the regression output are taken
from this curve.


Looking at the histograms of the regression coefficients, both the
blue and purple curves appear to be close and consistent with each
other. Additionally, the histograms of all of the regression
coefficients appear to follow distributions that are approximately
normal without any severe skew or outliers immediately noticeable.
Bootstrap
Confidence Intervals Using Bootstrapping Cases
Next, we will create the 95% Bootstrap Confidence Intervals for the
regression coefficients of the selected model.
Something to note is that as mentioned in the previous section, the
variable “rain” includes several entries with a value of zero. Since our
final model is a logarithmic transformation, we can not have a log of
zero because log(0) is an undefined value, so this leads to the
bootstrap method giving the variable rain a value of NA. However, the
other independent variables do not have this issue of having several
zero values in their entries, so we can observe their bootstrap
confidence intervals below.
Regression Coefficient Matrix
(Intercept) |
2.4835 |
0.7666 |
3.2395 |
0.0014 |
[ 0.8029 , 4.0016 ] |
X |
0.0212 |
0.0462 |
0.4582 |
0.6472 |
[ -0.0619 , 0.1082 ] |
Y |
-0.0368 |
0.0929 |
-0.3957 |
0.6927 |
[ -0.2263 , 0.1693 ] |
DC |
0.0001 |
0.0005 |
0.1820 |
0.8557 |
[ -7e-04 , 9e-04 ] |
temp |
-0.0182 |
0.0213 |
-0.8520 |
0.3950 |
[ -0.0577 , 0.0195 ] |
RH |
-0.0098 |
0.0075 |
-1.3125 |
0.1905 |
[ -0.0219 , 0.003 ] |
wind |
0.0357 |
0.0530 |
0.6729 |
0.5016 |
[ -0.0672 , 0.1367 ] |
rain |
0.0911 |
0.2408 |
0.3783 |
0.7055 |
[ -0.9978 , 0.2001 ] |
We can see the bootstrap confidence intervals of the variables along
with the intercept of the final logarithmic model of the data. These
confidence intervals use bootstrapping cases in order to find the
confidence intervals for the regression coefficients. In the above
table, we can see the test statistics and the p-values for each of the
regression coefficients. We can also see the 95% bootstrap confidence
intervals for each of the variables along with the intercept from the
use of bootstrapping cases.
Something interesting to note is that the variables X, Y, DC, temp,
RH, wind, and rain all contain zero in their 95% bootstrap confidence
intervals, meaning that on their own they are not statistically
significant. However, the 95% bootstrap confidence interval for the
intercept does contain zero, meaning that the combined logarithmic model
of all these variables in statistically significant which matches up
with what was shown earlier when constructing this final model. This
statistical significance of the overall model can be seen by the p-value
of p = 0.0014.
Bootstrap
Residuals
In this section, we will use the bootstrap residual method to
construct the bootstrap confidence intervals. We will first take a look
at the histogram of the residuals that are obtained from the bootstrap
method, and then we will construct the 95% bootstrap confidence
intervals for the data.
Residual
Histogram
Now let’s create a histogram to visualize the residuals obtained from
the bootstrap method.

The above histogram shows that the residual histogram appears like it
is close to being normally distributed, however it appears like there is
an ever-so-slightly skew to the left, with more of the residuals seeming
to occur on the left-side of the histogram. Additionally, there are some
observations which are slightly off to the end, specifically the two
observations to the right of x = 4 on the right-hand side and the one
observation to the very left of x = -4 on the left-hand side which could
be very slight potential outliers. However, these observations do not
look too severe so it is not a major cause for concern. Overall, the
residuals seems to fall close to that of a normal distribution,
In the following section, we will use these residuals to conduct the
process of taking residual bootstrap samples. We will use these samples
to estimate the bootstrap confidence intervals for the regression
coefficients in our final model.
Residual Bootstrap
Regression and Finding Bootstrap Confidence Intervals
Now, let’s create the 95% bootstrap confidence intervals of the
bootstrap regression coefficients.
We will create histograms for each of our regression coefficients to
help us visualize our bootstrap residuals for our model.
We will now create the residual bootstrap histograms for the
regression coefficients of our model.


We can see that the histograms mostly follow a normal distribution
for the regression coefficients of our model. Just like before when
creating histograms for the regression coefficients in the bootstrap
cases, there are two curves on these histograms, a blue and a purple
curve. The blue curve is a non-parametric density estimate of the
bootstrap sampling distribution. The bootstrap confidence intervals we
will calculate shortly are based upon this non-parametric bootstrap
sampling distribution. The purple curve is a measure of the estimate of
the regression coefficients and the corresponding value of their
standard error. The p-values given in the regression output are taken
from this curve.
Bootstrap
Confidence Intervals Using Bootstrap Residuals
Next, we will calculate the 95% bootstrap confidence intervals by the
use of the bootstrap residuals.
Regression Coefficient Matrix of 95% Residual Bootstrap
Confidence Intervals
(Intercept) |
2.4835 |
0.7666 |
3.2395 |
0.0014 |
[ 1.0459 , 4.0219 ] |
X |
0.0212 |
0.0462 |
0.4582 |
0.6472 |
[ -0.0702 , 0.1077 ] |
Y |
-0.0368 |
0.0929 |
-0.3957 |
0.6927 |
[ -0.2172 , 0.1382 ] |
DC |
0.0001 |
0.0005 |
0.1820 |
0.8557 |
[ -8e-04 , 0.001 ] |
temp |
-0.0182 |
0.0213 |
-0.8520 |
0.3950 |
[ -0.0628 , 0.0224 ] |
RH |
-0.0098 |
0.0075 |
-1.3125 |
0.1905 |
[ -0.0246 , 0.004 ] |
wind |
0.0357 |
0.0530 |
0.6729 |
0.5016 |
[ -0.0687 , 0.1355 ] |
rain |
0.0911 |
0.2408 |
0.3783 |
0.7055 |
[ -0.339 , 0.5911 ] |
Above we can see the output table for the 95% bootstrap confidence
intervals by the use of the bootstrap residuals method. This table
includes the 95% bootstrap confidence intervals for each of the
regression coefficients in our final model. Once again we can see that
all of the confidence intervals for the regression coefficients contain
zero except for that of the intercept. This matches up with our previous
analysis of the model that although the individual variables do not show
statistical significant, their combination in the overall logarithmic
model is statistically significant, which can be seen by the p-value of
p = 0.0014.
Combining the Results
of the Regular Model and the Bootstrap Results
Now that we have created both bootstrap confidence intervals using
bootstrap cases and using bootstrap residuals, let’s combine these
results into one table to show the combined inferential statistics.
Combined Inferential Statistics- Bootstrap Confidence Intervals
(Bootstrap Cases and Residuals) and p-values
(Intercept) |
2.4835 |
0.7666 |
0.0014 |
[ 0.8029 , 4.0016 ] |
[ 1.0459 , 4.0219 ] |
X |
0.0212 |
0.0462 |
0.6472 |
[ -0.0619 , 0.1082 ] |
[ -0.0702 , 0.1077 ] |
Y |
-0.0368 |
0.0929 |
0.6927 |
[ -0.2263 , 0.1693 ] |
[ -0.2172 , 0.1382 ] |
DC |
0.0001 |
0.0005 |
0.8557 |
[ -7e-04 , 9e-04 ] |
[ -8e-04 , 0.001 ] |
temp |
-0.0182 |
0.0213 |
0.3950 |
[ -0.0577 , 0.0195 ] |
[ -0.0628 , 0.0224 ] |
RH |
-0.0098 |
0.0075 |
0.1905 |
[ -0.0219 , 0.003 ] |
[ -0.0246 , 0.004 ] |
wind |
0.0357 |
0.0530 |
0.5016 |
[ -0.0672 , 0.1367 ] |
[ -0.0687 , 0.1355 ] |
rain |
0.0911 |
0.2408 |
0.7055 |
[ -0.9978 , 0.2001 ] |
[ -0.339 , 0.5911 ] |
The table above combines the results for the inferential statistics
of the bootstrap method. The table includes the p-values of the
regression coefficients and their bootstrap confidence intervals. This
includes both the bootstrap confidence intervals using bootstrap cases
(bootc.ci.95) and the bootstrap confidence intervals using bootstrap
residuals (bootr.ci.95).
We can see in this output table that both methods of bootstrap
confidence intervals yield the same results in regards to the
statistical significance of the overall model. As was stated previously
in the analysis of the log(area) model, although it seems that the
individual variables do not show notable statistical significance, as
all of their p-values are greater than our alpha of 0.05, their combined
effect in the overall model does show statistical significance with a
p-value of p = 0.0014. The bootstrap confidence intervals further
strengthen this point as all of the regression coefficients for the
variables include zero in their confidence intervals indicating that
they are not statistically significant on their own. However, the
confidence interval for the intercept does not include zero indicating
that it is statistically significant which can be seen by its p-value of
p = 0.0014. This validates our analysis of the final model and its
statistical significance.
We can create a table to show the widths of the 95% bootstrap
confidence intervals for both the bootstrap cases method and the
bootstrap residuals method.
95% Bootstrap Confidence Intervals Width- Bootstrap Cases and
Bootstrap Residuals
3.1987 |
2.9760 |
0.1701 |
0.1779 |
0.3955 |
0.3555 |
0.0016 |
0.0019 |
0.0772 |
0.0852 |
0.0249 |
0.0286 |
0.2039 |
0.2043 |
1.1979 |
0.9301 |
The table above shows the width for both of the bootstrap confidence
interval methods for the regression coefficients of the final model. For
both of these two methods for bootstrap confidence intervals, we can see
that the widths are similar for the two methods showing that they are
consistent with one another.
Final Model
Report
Now that we have looked at various methods of analyzing our final
regression model, let’s create a final table of the inferential
statistics.
Inferential Statistics of the Final Regression Model
(Intercept) |
2.4834516 |
0.7666130 |
3.2395115 |
0.0013520 |
X |
0.0211869 |
0.0462344 |
0.4582490 |
0.6471534 |
Y |
-0.0367580 |
0.0928944 |
-0.3956964 |
0.6926510 |
DC |
0.0000890 |
0.0004891 |
0.1820437 |
0.8556892 |
temp |
-0.0181803 |
0.0213391 |
-0.8519727 |
0.3950073 |
RH |
-0.0098113 |
0.0074753 |
-1.3125061 |
0.1904980 |
wind |
0.0356783 |
0.0530221 |
0.6728946 |
0.5016074 |
rain |
0.0910809 |
0.2407878 |
0.3782623 |
0.7055421 |
The table above shows the inferential statistics of our final model
which allows us to verify the statistical significance of it one more
time. Our model for the logarithmic transformation of area is
statistically significant.
The equation of our final model is: log(area) = 2.4835 + 0.0212 * X -
0.0368 * Y + 0.0001 * DC - 0.0182 * temp - 0.0098 * RH + 0.0357 * wind +
0.0911 * rain.
The combination of X, Y, DC, temp, temp, RH, wind, and rain
statistically significantly predicted the logarithmic transformation of
area, p = 0.0014. This model has good utility for predicition and
estimation.
Interpretations of
the Regression Coefficients
Using the final inferential statistics, we can analyze the
relationship of the independent variables in the final model with their
affect on area. As was stated previously in the variable description
section of this project, the measurements for area are given in
hectares. Additionally, since our final model selected is a logarithmic
transformation, these interpretations will be looking at the effects of
these independent variables on the log of area.
X: For every additional 1 coordinate movement to the right, the
logarithmic transformation of area burned by a forest fire increases by
0.0212 hectares.
Y: For every additional 1 coordinate movement upwards,
logarithmic transformation of area burned by a forest fire decreases by
0.0368 hectare.
DC: For every additional 1 unit increase of DC, the measure of
the moisture content of compact organic layers in the area affected by
the fire, the logarithmic transformation of area burned by a forest fire
increases by 0.0001 hectare.
temp: For every 1 degree Celsius increase in temperature, the
logarithmic transformation of area burned by a forest fire decreases by
0.0182 hectare.
RH: For every 1 percentage increase in the relative humidity, the
logarithmic transformation of area burned by a forest fire decreases by
0.0098 hectare.
wind: For every 1 kilometer per hour increase in the speed of the
wind, the logarithmic transformation of area burned by a forest fire
increases by 0.0357.
rain: For every 1 mm/m^2 increase in the rain occurring, the
logarithmic transformation of area burned by a forest fire increases by
0.0911.
We can see that the variables that are positively associated with the
logarithmic transformation of area burned by a forest fire are X, DC,
wind, and rain. This indicates that as these variables increase, so does
the area burned by a forest fire. On the other hand, the variables that
are negatively associated with the logarithmic transformation of area
burned by a forest fire are Y, temp, and RH. This indicates that as
these variables increase, the area burned by a forest fire
decreases.
Summary and
Discussion
Overall, through the creation of our final model, and through the use
of bootstrap confidence intervals, we were able to conclude the
statistical significance of our overall final model. The log(area) model
that was chosen showed itself to be statistically significant with a
p-value of p = 0.0014. This was verified through the creation of
bootstrap confidence intervals using both the methods of bootstrapping
cases and bootstrap residuals.
The equation of our final model is given by: log(area) = 2.4835 +
0.0212 * X - 0.0368 * Y + 0.0001 * DC - 0.0182 * temp - 0.0098 * RH +
0.0357 * wind + 0.0911 * rain.
This model showed itself to be statistically significant for the
prediction and estimation of the logarithmic transformation of the area
affected by forest fires with a statistically significant p-valye of p =
0.0014.
Answering the
Research Questions
Earlier in this report, some research questions were discussed and
these were considered while analyzing the data and its results. Now that
our model has been thoroughly analyzed, let’s go back and see if we can
now answer these research questions for this project.
- Which factors have the greatest significance in predicting the area
burned by a forest fire?
As was found through the analysis of our final model, the regression
coefficients themselves did not show statistical significance, but their
combined effect did. Out of the independent variables, RH was the most
significant with a p-value of 0.1905. However, this value is still above
our alpha value of 0.05. So, although the individual variables did not
show statistical significance in our final model, their combined effect
did show statistical significance with a p-value of p = 0.0014.
- Which factors have the highest correlation with each other? And,
what are some possible explanations for these factors have the greatest
correlation?
We did have some factors with notably high correlation with one
another. This led to us having to address and fix multicollinearity
issues within the original, full model. The variables FFMC and ISI were
highly correlated with each other and so were the variables DMC and DC.
These multicollinearity issues were addressed when creating our final
model.
- Are there any major outliers in the data set which would prompt
further examination to see what is going on in that occurrence?
Yes, this data does have some outliers within it. Additionally,
looking at the residual plots for our final model showed the presence of
some high leverage points as well. This is something which would be
worth looking into in further experimentation to see why there are these
extreme observations within the data set.
- Is this data set from Montesinho National Park applicable to the
overall scope of forest fires around the world? Is our data set
representative of forest fires of all locations?
This is a question which we do not have enough information to answer
with perfect accuracy because this project only looked at the data for
Montesinho National Park, not any other places affected by forest fires.
However, the data set was sufficiently large suggesting that it serves
significance in representing its population. We can not say with perfect
accuracy whether this data set does or does not represent every single
location affected by forest fires, but this data set provided great use
in creating a model which considers various factors which can affect the
area burned by a forest fire.
- How useful is the model we will create in predicting the area
affected by a forest fire? And, how can this data help with predicting
patterns of forest fires in order to minimize the damage they
cause?
Through the analysis and creation of the final model, we did find
that our model showed statistical significance in the prediction and
estimation of the area affected by forest fires with a p-value of p =
0.0014 in our final model, the logarithmic transformation of area.
Further
Discussion
Some of the major findings of this project was that the final model
that was created was in fact statistically significant. This indicates
that this final model can be useful in predicting the area that will be
affected by forest fires based upon the combined effect of the various
variables in this model. This could be helpful for calculating in
advance how much damage a forest fire could potentially cause based upon
the conditions of the environment in which it will occur. This could be
used to help minimize the damage by having knowledge in advance about
the damages which could occur based upon the known conditions.
Conclusions and
Recommendations
Overall, the final model of this project showed itself to be
statistically significant in its prediction and estimation of the area
of land affected by forest fires based upon a combination of factors.
This could provide use in the prevention of the damages caused by forest
fires if the conditions were known in advance. By having the knowledge
of the area of land which could be impacted by a forest fire based upon
the conditions at hand, this could provide rescuers with some knowledge
in advance to help fight the forest fires and hopefully minimize the
damage caused by having some knowledge in advance regarding the damage
potential of the forest fire.
Some recommendations I would suggest for further experimentation
are:
Since the data in this experiment was all collected within
Montesinho Natural Park, consider expanding the data collection to
include other places and others natural parks to determine whether the
conclusions drawn from this data truly are applicable to all forest
fires regardless of location.
Consider other factors which may have an impact on the area
affected by forest fires. Some ideas which come to mind include time it
takes to reach the nearest fire station, other weather conditions like
snow or fog, or volume, density, and size of the trees or forest in the
given area that is affected.
Due to the unpredictable nature of forest fires and other natural
disasters, it can be hard to have perfect accuracy in the data
collected, specifically the precise and exact area of forest that was
affected, so further experiments could consider expanding the sample
size to ensure the accuracy of the results and recordings in this
experiment.
Altogether, the final model for the logarithmic transformation of
area provides statistical significance in the prediction of the area
affected by forest fires which could provide use by having knowledge in
advance of which factors at hand could lead to more damage from the
forest fires.
