Data was collected to observe which factors have the greatest impact on the area of land burned in a forest fire.
I found this data set on kaggle.com at the following webpage: https://www.kaggle.com/datasets/elikplim/forest-fires-data-set. The data set is named “forestfires.csv”.
This data set uses meteorological data to predict the approximate area that will be burned in a forest fire. This allows us to investigate which factors have an impact on the area of land burned. Such data is useful for identifying the factors with the greatest significance for the area affected by a forest fire, which could in turn provide the necessary information to help minimize the area damaged by forest fires.
The data in this data set was collected around the Montesinho Natural Park which is located in Portugal. The data was collected from 517 observations of forest fires which occurred within this national park. To further enhance the data collection, researchers divided up the national park into a set of spatial coordinates to pinpoint the precise location of the forest fires within Montesinho Natural Park. The national park was divided into x-axis and y-axis spatial coordinates, with the x-axis coordinates ranging from 1 to 9, and the y-axis coordinates ranging from 2 to 9. Each observation was pinpointed to the coordinate location of where the forest fire began.
There are 13 variables in the forestfires data set.
X: The x-axis spatial coordinate of the forest fire’s location within Montesinho Natural Park. A numeric value ranging from 1 to 9.
Y: The y-axis spatial coordinate of the forest fire’s location within Montesinho Natural Park. A numeric value ranging from 2 to 9.
month: The month of the year in which the forest fire occurred. A categorical, character variable with the lowercase abbreviation of the first three letters of the month. (For example: feb, oct, dec)
day: The day of the week on which the forest fire occurred. A categorical, character variable with the lowercase abbreviation of the first three letters of the day. (For example: mon, tue, sat)
FFMC: FFMC index from the FWI System. A numeric, quantitative variable. The FWI System stands for the Fire Weather Index System and it is a system that provides a numeric rating on a scale for the intensity of a fire. FFMC stands for Fine Fuel Moisture Code and it is a measure of the moisture content of litter and other fuels for a fire that are present.
DMC: DMC index from the FWI System. A numeric, quantitative variable. DMC stands for Duff Moisture Code and it is a measure of the moisture content of the loosely compacted, decomposing organic material underneath the litter which fuels the fire.
DC: DC index from the FWI System. A numeric, quantitative variable. DC stands for Drought Code and is a measure of the moisture content of compact organic layers in the area affected by the fire.
ISI: ISI index from the FWI System. A numeric, quantitative variable. ISI stands for Initial Spread Index and is a measure of the expected rate of fire spread.
temp: The temperature given in degrees Celsius. A numeric, quantitative variable.
RH: The relative humidity given as a percentage. A numeric, quantitative variable.
wind: The speed of the wind given in kilometers per hour (km/h). A numeric, quantitative variable.
rain: The rain occurring outside given in a measurement of mm/m^2. A numeric, quantitative variable.
area: The area of the forest which was burned by the fire. Measured in units of hectares (ha). A numeric, quantitative variable. The response variable of this experimental study.
The main question of this project is to investigate which factors have the greatest correlation and significance in predicting forest fires. Additionally, we would like to see if we can create a multiple linear regression model that significantly predicts the area burned within a forest fire based on the several factors that will be included within this model.
Some further, more specific questions which I thought of in regards to this data are given as follows:
What factors have the greatest significance in predicting the area burned by a forest fire?
Which factors have the highest correlation with each other? And, what are some possible explanations for why these factors have the greatest correlation?
Are there any major outliers in the data set which would prompt further examination to see what is going on in that occurrence?
Is this data set from Montesinho Natural Park applicable to the overall scope of forest fires around the world? Is our data set representative of forest fires in all locations?
How useful is the model we will create in predicting the area affected by a forest fire? And, how can this data help with predicting patterns of forest fires in order to minimize the damage they cause?
To begin, let’s read in the data set from GitHub.
'data.frame': 517 obs. of 13 variables:
$ X : int 7 7 7 8 8 8 8 8 8 7 ...
$ Y : int 5 4 4 6 6 6 6 6 6 5 ...
$ month: chr "mar" "oct" "oct" "mar" ...
$ day : chr "fri" "tue" "sat" "fri" ...
$ FFMC : num 86.2 90.6 90.6 91.7 89.3 92.3 92.3 91.5 91 92.5 ...
$ DMC : num 26.2 35.4 43.7 33.3 51.3 ...
$ DC : num 94.3 669.1 686.9 77.5 102.2 ...
$ ISI : num 5.1 6.7 6.7 9 9.6 14.7 8.5 10.7 7 7.1 ...
$ temp : num 8.2 18 14.6 8.3 11.4 22.2 24.1 8 13.1 22.8 ...
$ RH : int 51 33 33 97 99 29 27 86 63 40 ...
$ wind : num 6.7 0.9 1.3 4 1.8 5.4 3.1 2.2 5.4 4 ...
$ rain : num 0 0 0 0.2 0 0 0 0 0 0 ...
$ area : num 0 0 0 0 0 0 0 0 0 0 ...
The data was collected by pinpointing the forest fire’s location of origin from a system of spatial coordinates around Montesinho Natural Park. So, we will create a graph to help visualize the locations of the occurrences.
As we can see, the forest fire occurrences are spread out throughout the park. It appears that more forest fires were observed in the lower half of the Y-coordinates, around 2-5. However, the data still appears to be spread out enough to continue with our investigation of the model.
While looking at the data set, I noticed that many observations have an area of 0, meaning that no burned area was recorded for those observations. These could support comparisons between the conditions under which a fire caused damage and those under which it did not. However, the purpose of this experiment is to see what factors have a significant impact on the area burned in a forest fire. If we were to include the observations with an area of 0, our data would be split into two groups, zero area and nonzero area (that is, no fire damage and fire damage). This could have a negative effect on our final regression model by clustering the data into these two groups, which would cause notable inaccuracy. After some consideration, I believe it is in the best interest of this experiment to create a new data frame which only includes the observations where a forest fire did cause damage.
Let’s create a new data set called “forestfires1” that includes only the observations where the area affected is greater than 0.
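The filtering step can be illustrated with a minimal Python sketch. It is not the code used in this report: the four sample rows are copied from the outputs shown here, and the helper name `burned_only` is a hypothetical name chosen for illustration. A real run would read the full forestfires.csv instead of the inline sample.

```python
import csv
import io

# A few rows copied from the outputs in this report (two with area 0,
# two with area > 0). This inline sample stands in for forestfires.csv.
SAMPLE_CSV = """X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0,0
7,4,oct,tue,90.6,35.4,669.1,6.7,18,33,0.9,0,0
9,9,jul,tue,85.8,48.3,313.4,3.9,18.0,42,2.7,0,0.36
1,4,sep,tue,91.0,129.5,692.6,7.0,21.7,38,2.2,0,0.43
"""


def burned_only(rows):
    """Keep only the observations whose burned area is greater than 0."""
    return [row for row in rows if float(row["area"]) > 0]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
forestfires1 = burned_only(rows)
print(len(rows), "rows read,", len(forestfires1), "rows kept")
```

On the full data set the same filter would reduce the 517 observations to the 270 shown in the output that follows.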
X Y month day FFMC DMC DC ISI temp RH wind rain area
1 9 9 jul tue 85.8 48.3 313.4 3.9 18.0 42 2.7 0 0.36
2 1 4 sep tue 91.0 129.5 692.6 7.0 21.7 38 2.2 0 0.43
3 2 5 sep mon 90.9 126.5 686.5 7.0 21.9 39 1.8 0 0.47
4 1 2 aug wed 95.5 99.9 513.3 13.2 23.3 31 4.5 0 0.55
5 8 6 aug fri 90.1 108.0 529.8 12.5 21.2 51 8.9 0 0.61
6 1 2 jul sat 90.0 51.3 296.3 8.7 16.6 53 5.4 0 0.71
'data.frame': 270 obs. of 13 variables:
$ X : int 9 1 2 1 8 1 2 6 5 8 ...
$ Y : int 9 4 5 2 6 2 5 5 4 3 ...
$ month: chr "jul" "sep" "sep" "aug" ...
$ day : chr "tue" "tue" "mon" "wed" ...
$ FFMC : num 85.8 91 90.9 95.5 90.1 90 95.5 95.2 90.1 84.4 ...
$ DMC : num 48.3 129.5 126.5 99.9 108 ...
$ DC : num 313 693 686 513 530 ...
$ ISI : num 3.9 7 7 13.2 12.5 8.7 13.2 10.4 6.2 3.2 ...
$ temp : num 18 21.7 21.9 23.3 21.2 16.6 23.8 27.4 13.2 24.2 ...
$ RH : int 42 38 39 31 51 53 32 22 40 28 ...
$ wind : num 2.7 2.2 1.8 4.5 8.9 5.4 5.4 4 5.4 3.6 ...
$ rain : num 0 0 0 0 0 0 0 0 0 0 ...
$ area : num 0.36 0.43 0.47 0.55 0.61 0.71 0.77 0.9 0.95 0.96 ...
Let’s create another map for our refined data set to ensure everything still looks alright.
The map appears very similar to the one created for the forestfires data set with the zero area observations left in, so this is good.
Having a refined data set which focuses on the observations in which forest fires did cause damage to an area of the forest will provide better insight for the purpose of this experiment, which is to see which factors have significance for the amount of area damaged by forest fires. Focusing only on these observations allows us to create a model for prediction and estimation without including observations which do not help our model.
Let’s check for any imbalances in the categorical variable of “month”.
apr aug dec feb jul jun mar may oct sep
4 99 9 10 18 8 19 1 5 97
It turns out that there is a very significant imbalance in the variable month. The numbers of observations in August and September, 99 and 97 respectively, drastically outweigh those of all the other months, with no other month having more than 19 observations. This is a major cause for concern, as this categorical variable is very severely imbalanced.
We will keep the month variable for now, as the month does seem to have significance in predicting the area affected by forest fires, but we will keep a close eye on it for the final model and revisit later whether removing it improves our model.
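To put a number on this imbalance, here is a small Python sketch that uses the month counts from the table above. The cut-off of fewer than 10 observations for a "sparse" month is an arbitrary assumption made for illustration, not a rule taken from this report.

```python
# Month counts copied from the frequency table above.
month_counts = {"apr": 4, "aug": 99, "dec": 9, "feb": 10, "jul": 18,
                "jun": 8, "mar": 19, "may": 1, "oct": 5, "sep": 97}

total = sum(month_counts.values())
aug_sep_share = (month_counts["aug"] + month_counts["sep"]) / total

# Hypothetical cut-off: call a month sparse if it has fewer than 10 fires.
sparse_months = sorted(m for m, n in month_counts.items() if n < 10)

print(f"total observations: {total}")
print(f"share of fires in aug + sep: {aug_sep_share:.1%}")
print("months with fewer than 10 observations:", sparse_months)
```

Months with only a handful of observations give their dummy coefficients very large standard errors, which is one reason the month variable deserves the close watch described above.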
Let’s check the other categorical variable “day” for potential imbalances.
fri mon sat sun thu tue wed
43 39 42 47 31 36 32
The “day” variable does not show major imbalance: every day has a substantial number of observations, and the difference between the highest and lowest days is 16 observations (Sunday with 47 and Thursday with 31). However, the day of the week does not seem like it would be as important in predicting forest fire damage as some of the other factors in question, since the fires are spread fairly evenly across the week. This is unlike the month variable, where the vast majority of the observations fall into either August or September, evidence that most forest fires occur within these warmer, end-of-summer months.
To avoid having too many or too redundant variables within our full model, the categorical variable for “day” will not be included.
X | Y | month | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
---|---|---|---|---|---|---|---|---|---|---|---|
9 | 9 | jul | 85.8 | 48.3 | 313.4 | 3.9 | 18.0 | 42 | 2.7 | 0 | 0.36 |
1 | 4 | sep | 91.0 | 129.5 | 692.6 | 7.0 | 21.7 | 38 | 2.2 | 0 | 0.43 |
2 | 5 | sep | 90.9 | 126.5 | 686.5 | 7.0 | 21.9 | 39 | 1.8 | 0 | 0.47 |
1 | 2 | aug | 95.5 | 99.9 | 513.3 | 13.2 | 23.3 | 31 | 4.5 | 0 | 0.55 |
8 | 6 | aug | 90.1 | 108.0 | 529.8 | 12.5 | 21.2 | 51 | 8.9 | 0 | 0.61 |
1 | 2 | jul | 90.0 | 51.3 | 296.3 | 8.7 | 16.6 | 53 | 5.4 | 0 | 0.71 |
The goal of this experiment is to create a multiple linear regression model that allows us to use the several factors in order to create a model for the prediction and estimation of the area affected by forest fires.
We will begin with a full multiple linear regression model. We will observe how this full model turns out to help decide whether we need to apply any transformations to any of the variables.
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | -74.4004408 | 224.5630059 | -0.3313121 | 0.7406863 |
X | 4.7819085 | 2.7221618 | 1.7566584 | 0.0802003 |
Y | -2.2318912 | 5.4754358 | -0.4076189 | 0.6839023 |
monthaug | 119.6584526 | 78.0476359 | 1.5331464 | 0.1265043 |
monthdec | 93.2179223 | 66.0743422 | 1.4108036 | 0.1595456 |
monthfeb | 2.9241777 | 51.9041602 | 0.0563380 | 0.9551175 |
monthjul | 66.9401095 | 66.5964018 | 1.0051611 | 0.3157912 |
monthjun | 21.2886426 | 63.0498198 | 0.3376480 | 0.7359118 |
monthmar | -12.1109924 | 48.6547436 | -0.2489170 | 0.8036294 |
monthmay | 7.9712044 | 98.5515997 | 0.0808836 | 0.9355993 |
monthoct | 168.8733920 | 94.7483285 | 1.7823364 | 0.0759079 |
monthsep | 174.9584210 | 87.6286192 | 1.9965900 | 0.0469541 |
FFMC | 0.7443619 | 2.4705591 | 0.3012929 | 0.7634416 |
DMC | 0.3982145 | 0.1627861 | 2.4462435 | 0.0151253 |
DC | -0.3152171 | 0.1190645 | -2.6474489 | 0.0086256 |
ISI | -1.9149661 | 2.0946308 | -0.9142261 | 0.3614790 |
temp | 2.5843307 | 1.9331153 | 1.3368736 | 0.1824792 |
RH | -0.2528721 | 0.5528720 | -0.4573792 | 0.6477957 |
wind | 3.3528858 | 3.3304720 | 1.0067299 | 0.3150380 |
rain | -4.9594839 | 13.7861611 | -0.3597436 | 0.7193425 |
Now, we will check the residual plots to look for potential violations of our model.
It appears that we do have some violations within our model. For instance, the variance is not constant, as shown by a few very notable outliers to the right of the residual plot. Additionally, the Q-Q Plot shows a violation: the right side of the plot curves away from the line, so these points do not follow a normal distribution. Furthermore, the Cook’s Distance plot shows that we do have some high leverage points in our model.
Let’s check the VIF for any potential multicollinearity issues in our model.
GVIF Df GVIF^(1/(2*Df))
X 1.508460 1 1.228194
Y 1.472642 1 1.213525
month 117.881365 9 1.303408
FFMC 3.009177 1 1.734698
DMC 3.625445 1 1.904060
DC 26.871270 1 5.183751
ISI 2.704552 1 1.644552
temp 5.113912 1 2.261396
RH 2.491118 1 1.578327
wind 1.411811 1 1.188196
rain 1.081053 1 1.039737
It appears that we do have a few high VIF values present in our model which leads to the likelihood of a multicollinearity issue.
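The last column of the output above can be reproduced from the first two. The Python sketch below recomputes GVIF^(1/(2*Df)) and squares it so that multi-degree-of-freedom terms such as month can be compared on the usual VIF scale. The threshold of 5 is a common rule of thumb and is an assumption here, not a value taken from this report.

```python
# (GVIF, Df) pairs copied from the VIF output above.
gvif_table = {
    "X": (1.508460, 1), "Y": (1.472642, 1), "month": (117.881365, 9),
    "FFMC": (3.009177, 1), "DMC": (3.625445, 1), "DC": (26.871270, 1),
    "ISI": (2.704552, 1), "temp": (5.113912, 1), "RH": (2.491118, 1),
    "wind": (1.411811, 1), "rain": (1.081053, 1),
}


def adjusted_gvif(gvif, df):
    """GVIF^(1/(2*Df)): comparable across terms with different Df."""
    return gvif ** (1.0 / (2 * df))


VIF_THRESHOLD = 5.0  # assumed rule of thumb; 10 is another common choice

flagged = []
for term, (gvif, df) in gvif_table.items():
    adj = adjusted_gvif(gvif, df)
    # Squaring the adjusted value puts it back on the ordinary VIF scale.
    if adj ** 2 > VIF_THRESHOLD:
        flagged.append(term)
    print(f"{term:6s} adjusted = {adj:.6f}  squared = {adj ** 2:.3f}")

print("terms above the threshold:", flagged)
```

Under this assumed threshold, month is not flagged despite its large raw GVIF, because that value is spread over 9 degrees of freedom.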
Let’s use a correlation matrix to check which variables are causing issues. We can only check the correlation of numeric variables so we will create a temporary data set “cor.data” which does not include the categorical variable of month.
X Y FFMC DMC DC ISI temp RH wind rain area
X 1.00 0.50 -0.07 -0.10 -0.16 -0.05 -0.08 0.06 0.04 0.06 0.07
Y 0.50 1.00 -0.02 0.04 -0.03 -0.07 0.03 -0.04 -0.04 0.03 0.05
FFMC -0.07 -0.02 1.00 0.48 0.41 0.70 0.56 -0.29 -0.16 0.08 0.05
DMC -0.10 0.04 0.48 1.00 0.67 0.33 0.50 0.03 -0.14 0.08 0.09
DC -0.16 -0.03 0.41 0.67 1.00 0.26 0.50 -0.08 -0.24 0.04 0.05
ISI -0.05 -0.07 0.70 0.33 0.26 1.00 0.47 -0.15 0.07 0.07 0.00
temp -0.08 0.03 0.56 0.50 0.50 0.47 1.00 -0.50 -0.32 0.08 0.11
RH 0.06 -0.04 -0.29 0.03 -0.08 -0.15 -0.50 1.00 0.14 0.10 -0.10
wind 0.04 -0.04 -0.16 -0.14 -0.24 0.07 -0.32 0.14 1.00 0.05 0.00
rain 0.06 0.03 0.08 0.08 0.04 0.07 0.08 0.10 0.05 1.00 -0.01
area 0.07 0.05 0.05 0.09 0.05 0.00 0.11 -0.10 0.00 -0.01 1.00
It appears that FFMC and ISI are highly correlated, with a correlation of 0.70. We will remove one of these variables to help avoid this multicollinearity issue. We will keep FFMC, as it is the first variable alphabetically of these two. Additionally, it appears that DMC and DC are strongly correlated, with a correlation of 0.67. We will keep DC, as it is the first variable alphabetically of these two.
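The search for highly correlated pairs can be automated. The Python sketch below scans the upper triangle of the correlation matrix printed above. The cut-off of 0.6 is an assumption chosen so that it matches the two pairs discussed here, and `strong_pairs` is a hypothetical helper name.

```python
names = ["X", "Y", "FFMC", "DMC", "DC", "ISI", "temp", "RH", "wind", "rain", "area"]

# Correlation matrix copied row by row from the output above.
corr = [
    [1.00, 0.50, -0.07, -0.10, -0.16, -0.05, -0.08, 0.06, 0.04, 0.06, 0.07],
    [0.50, 1.00, -0.02, 0.04, -0.03, -0.07, 0.03, -0.04, -0.04, 0.03, 0.05],
    [-0.07, -0.02, 1.00, 0.48, 0.41, 0.70, 0.56, -0.29, -0.16, 0.08, 0.05],
    [-0.10, 0.04, 0.48, 1.00, 0.67, 0.33, 0.50, 0.03, -0.14, 0.08, 0.09],
    [-0.16, -0.03, 0.41, 0.67, 1.00, 0.26, 0.50, -0.08, -0.24, 0.04, 0.05],
    [-0.05, -0.07, 0.70, 0.33, 0.26, 1.00, 0.47, -0.15, 0.07, 0.07, 0.00],
    [-0.08, 0.03, 0.56, 0.50, 0.50, 0.47, 1.00, -0.50, -0.32, 0.08, 0.11],
    [0.06, -0.04, -0.29, 0.03, -0.08, -0.15, -0.50, 1.00, 0.14, 0.10, -0.10],
    [0.04, -0.04, -0.16, -0.14, -0.24, 0.07, -0.32, 0.14, 1.00, 0.05, 0.00],
    [0.06, 0.03, 0.08, 0.08, 0.04, 0.07, 0.08, 0.10, 0.05, 1.00, -0.01],
    [0.07, 0.05, 0.05, 0.09, 0.05, 0.00, 0.11, -0.10, 0.00, -0.01, 1.00],
]


def strong_pairs(names, corr, threshold):
    """Return (var1, var2, r) for every pair with |r| >= threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle only
            if abs(corr[i][j]) >= threshold:
                pairs.append((names[i], names[j], corr[i][j]))
    # Strongest correlations first.
    return sorted(pairs, key=lambda p: -abs(p[2]))


high = strong_pairs(names, corr, threshold=0.6)
for a, b, r in high:
    print(f"{a} and {b}: r = {r:.2f}")
```

Lowering the threshold to 0.5 would also bring in pairs such as FFMC and temp (0.56), so the choice of cut-off affects which variables are candidates for removal.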
Now, let’s revise our final.data model with the variables adjusted to address the multicollinearity issue that we encountered.
X | Y | month | FFMC | DC | temp | RH | wind | rain | area |
---|---|---|---|---|---|---|---|---|---|
9 | 9 | jul | 85.8 | 313.4 | 18.0 | 42 | 2.7 | 0 | 0.36 |
1 | 4 | sep | 91.0 | 692.6 | 21.7 | 38 | 2.2 | 0 | 0.43 |
2 | 5 | sep | 90.9 | 686.5 | 21.9 | 39 | 1.8 | 0 | 0.47 |
1 | 2 | aug | 95.5 | 513.3 | 23.3 | 31 | 4.5 | 0 | 0.55 |
8 | 6 | aug | 90.1 | 529.8 | 21.2 | 51 | 8.9 | 0 | 0.61 |
1 | 2 | jul | 90.0 | 296.3 | 16.6 | 53 | 5.4 | 0 | 0.71 |
The adjusted final model is given as follows.
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | -75.1321207 | 182.9397173 | -0.4106933 | 0.6816464 |
X | 3.8460358 | 2.7296652 | 1.4089772 | 0.1600746 |
Y | 0.3730475 | 5.4471380 | 0.0684851 | 0.9454538 |
monthaug | 43.5843177 | 73.6968537 | 0.5914000 | 0.5547826 |
monthdec | 42.3450183 | 63.7861600 | 0.6638590 | 0.5073874 |
monthfeb | 3.8319210 | 52.4571871 | 0.0730485 | 0.9418254 |
monthjul | 17.3621272 | 64.7226016 | 0.2682545 | 0.7887229 |
monthjun | -10.9919974 | 62.5858539 | -0.1756307 | 0.8607251 |
monthmar | -12.1725199 | 49.1748282 | -0.2475356 | 0.8046953 |
monthmay | 8.2755520 | 99.4698215 | 0.0831966 | 0.9337613 |
monthoct | 50.2980046 | 84.5434955 | 0.5949364 | 0.5524198 |
monthsep | 74.6973998 | 80.2226257 | 0.9311263 | 0.3526794 |
FFMC | 0.4424613 | 1.9435330 | 0.2276582 | 0.8200965 |
DC | -0.1180168 | 0.0944542 | -1.2494614 | 0.2126556 |
temp | 2.7700666 | 1.9310653 | 1.4344759 | 0.1526763 |
RH | -0.0717121 | 0.5499241 | -0.1304036 | 0.8963512 |
wind | 2.7234690 | 3.2044010 | 0.8499152 | 0.3961790 |
rain | -5.5042055 | 13.9136960 | -0.3955962 | 0.6927376 |
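It is important to note that none of the predictors in this adjusted model is statistically significant at our alpha value of 0.05; the smallest p-value is 0.1527, for temp. The value p = 0.6816 in the first row is the p-value of the intercept, not of the model as a whole; the overall significance of the model is given by the F-test at the bottom of the regression summary, which is not reproduced here. This indicates that this full model will likely not be our best choice for prediction and estimation of the area burned by a forest fire.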
Due to the violations in the residual plots of the model, we will use the Box Cox transformations to adjust our data.
We will try out several transformations in order to choose the best one for our model.
The transformations which I tried used specific variables. The Box Cox transformations I performed are listed as follows:
Box Cox- Logarithmic transformation of wind.
Box Cox- wind.
Box Cox- Logarithmic transformation of temperature (temp).
Box Cox- Logarithmic transformations of wind and temperature (temp).
We can see the plots for the Box Cox transformations that were performed, and these transformation plots indicate the optimal value of lambda for the different transformed predictor variables. Specifically, we looked at transformations of the predictor variables wind and temp, as these two quantitative variables can help us see if we can improve our model for predicting the area based upon the predictors we have.
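For reference, here is a minimal Python sketch of the standard Box-Cox transformation family, which is defined for positive values. It shows why a lambda near 0 points to a log transformation and a lambda near 0.5 to a square root. It covers only the transformation itself, not the likelihood profile that the plots display, and it is not the code used in this report.

```python
import math


def box_cox(y, lam):
    """Standard Box-Cox transform of a single positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)          # limiting case as lambda -> 0
    return (y ** lam - 1.0) / lam   # power transform, shifted and scaled


# First few burned areas (in hectares) from the refined data set.
areas = [0.36, 0.43, 0.47, 0.55, 0.61, 0.71]

for lam in (1.0, 0.5, 0.0):
    transformed = [round(box_cox(a, lam), 4) for a in areas]
    print(f"lambda = {lam}: {transformed}")
```

With lambda = 0.5 the result is a linear rescaling of the square root, 2 * (sqrt(y) - 1), so a regression on it is equivalent in fit to one on sqrt(y).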
We will create a square-root transformation of our model with the log transformed temp variable.
To help improve the significance of the model, I decided to focus on the variables which provided the most statistical significance to our model. Including variables such as FFMC and month greatly increased the p-values in the model, so I removed these variables from this square root transformation. However, while doing so did decrease the p-values, it still did not lower them enough to achieve statistical significance for this square root transformation model.
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 2.9136841 | 2.5350136 | 1.1493761 | 0.2514491 |
X | 0.1193862 | 0.1080509 | 1.1049065 | 0.2702137 |
Y | -0.0681424 | 0.2169081 | -0.3141534 | 0.7536545 |
DC | 0.0003843 | 0.0011597 | 0.3314038 | 0.7406045 |
log(temp) | 0.2345492 | 0.6913639 | 0.3392558 | 0.7346888 |
RH | -0.0225197 | 0.0165198 | -1.3631935 | 0.1739918 |
wind | 0.0848477 | 0.1249625 | 0.6789852 | 0.4977466 |
rain | -0.0815678 | 0.5583234 | -0.1460942 | 0.8839594 |
The p-value in the first row, p = 0.2514, is that of the intercept rather than of the model as a whole, and none of the predictors is significant at our alpha value of 0.05 (the smallest p-value is 0.1740, for RH). This indicates that this square root transformation may not be our best option for prediction and estimation of the area burned by a forest fire.
Let’s take a look at the residual plots for this transformation.
The residual plot appears to be mostly random except for a few outliers which appear to be significant. Just like in the linear model, the Q-Q Plot shows a violation: on the right hand side of the plot the points diverge from a normal distribution. Furthermore, the Cook’s Distance plot shows that there is a very high leverage point in the model along with several other outliers, indicating further violations still present within this square-root transformed model.
We will create a logarithmic multiple regression model using the log transformation of the response variable area.
To help improve the significance of the model, I decided to focus on the variables which provided the most statistical significance to our model. Including variables such as FFMC and month greatly increased the p-values in the model, so I removed these variables from this logarithmic transformation. Doing so lowered the intercept's p-value considerably, to 0.0014. Note that this is a test of the intercept only; the individual predictors and the overall F-test still have to be checked before concluding that the model is statistically significant for prediction and estimation of area.
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 2.4834516 | 0.7666130 | 3.2395115 | 0.0013520 |
X | 0.0211869 | 0.0462344 | 0.4582490 | 0.6471534 |
Y | -0.0367580 | 0.0928944 | -0.3956964 | 0.6926510 |
DC | 0.0000890 | 0.0004891 | 0.1820437 | 0.8556892 |
temp | -0.0181803 | 0.0213391 | -0.8519727 | 0.3950073 |
RH | -0.0098113 | 0.0074753 | -1.3125061 | 0.1904980 |
wind | 0.0356783 | 0.0530221 | 0.6728946 | 0.5016074 |
rain | 0.0910809 | 0.2407878 | 0.3782623 | 0.7055421 |
Let’s take a look at the residual plots for this transformation.
With the logarithmic model, the residuals appear much better than they did for the linear model. The residual plot appears much more random. While the Q-Q Plot does drift slightly off the line for a normal distribution on the right hand side, the departure is much less severe than in the Q-Q Plot of the linear model. Additionally, we still have some high leverage points, as seen in the Cook’s Distance plot, which would definitely be worth investigating in additional experiments to determine why we have these high leverage points and outliers. However, overall the logarithmic model has fewer violations and much less severe problems than the linear model had regarding its residuals.
Overall, this logarithmic transformed model appears to have the fewest violations. Its intercept is statistically significant (p = 0.0014), although none of its individual predictors is significant at alpha = 0.05.
Something which we noted when performing the various transformations was that there appeared to be some violations when checking our residual conditions. One particular violation which appeared to be very severe in certain cases was the failure of the Q-Q Plot to indicate that the data followed a normal distribution. Both the original linear model and the square root transformation model had especially severe violations in their Q-Q Plots. However, the logarithmic transformation model had a Q-Q Plot which was much closer to indicating a normal distribution than the other two. While there was still some slight deviation on the right hand side of the Q-Q Plot for the logarithmic transformation, it was far smaller than in either of the other two models. Additionally, the logarithmic transformation was the only one whose Q-Q Plot did not show blatant curvature on the right hand side.
By comparing the Q-Q Plots of these three potential models, we can see that the linear model and the square root transformation model have very severe violations of normality. While the logarithmic transformation model still has some deviation from normality, it is by far the least severe violation of the three. This is a strong point in favor of the logarithmic transformation model being the best one to use for the prediction and estimation of area.
We will now look at other goodness of fit methods for the final three models under consideration. These will include useful tools for statistical analysis such as the R-squared value, AIC, and Mallows' Cp.
We will create an output table to display the goodness of fit values for each of the three models.
Model | SSE | R.sq | R.adj | Cp | AIC |
---|---|---|---|---|---|
linear full model | 1932593.5973 | 0.0398488 | -0.0249233 | 18 | 2432.5069 |
sqrt.area.log.temp | 3373.9333 | 0.0167906 | -0.0094783 | 8 | 697.8614 |
log.area | 620.3287 | 0.0106382 | -0.0157952 | 8 | 240.5934 |
Overall, it appears that unfortunately all three models have very low R-squared values (and negative adjusted R-squared values), indicating that very little of the variance is explained by their respective models. The Mallows' Cp value for the last two models, the square root and logarithmic transformation models, is lower than that of the full linear model, although each Cp value here simply equals that model's own number of parameters. The logarithmic transformation model has the lowest AIC of the three, which suggests that it is the better fitting model, with the caveat that AIC values are only directly comparable between models fitted to the same response scale.
All in all, the best model of the three appears to be the logarithmic transformation. This is supported by its AIC and by its residual diagnostics. Note, however, that the p-value of 0.0014 belongs to its intercept; in all three models no individual predictor is significant at our alpha value of 0.05, so none of them should be regarded as a strong predictor of area.
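The columns of the goodness of fit table are tied together by simple formulas, which the Python sketch below checks against the reported values. Two things are assumptions: the parameter counts (taken from the Cp column, which matches the number of coefficients in each model) and the AIC convention n * ln(SSE / n) + 2p, which is one common form that happens to reproduce the AIC column. The overall F-statistic is derived from R-squared and is not printed in the outputs of this report.

```python
import math

N_OBS = 270  # observations in the refined data set

# SSE and R-squared copied from the table above; n_params assumed from Cp.
models = {
    "linear full model": {"sse": 1932593.5973, "r2": 0.0398488, "n_params": 18},
    "sqrt.area.log.temp": {"sse": 3373.9333, "r2": 0.0167906, "n_params": 8},
    "log.area": {"sse": 620.3287, "r2": 0.0106382, "n_params": 8},
}


def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p parameters (incl. intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)


def overall_f(r2, n, p):
    """Overall F-statistic on (p - 1, n - p) degrees of freedom."""
    return (r2 / (p - 1)) / ((1.0 - r2) / (n - p))


def aic_from_sse(sse, n, p):
    """AIC up to an additive constant: n * ln(SSE / n) + 2p."""
    return n * math.log(sse / n) + 2 * p


results = {}
for name, m in models.items():
    results[name] = {
        "adj_r2": adjusted_r2(m["r2"], N_OBS, m["n_params"]),
        "f_stat": overall_f(m["r2"], N_OBS, m["n_params"]),
        "aic": aic_from_sse(m["sse"], N_OBS, m["n_params"]),
    }
    r = results[name]
    print(f"{name}: adj R2 = {r['adj_r2']:.7f}, "
          f"F = {r['f_stat']:.3f}, AIC = {r['aic']:.4f}")
```

An F-statistic below 1 is not significant at the usual 0.05 level, so these derived values are a useful cross-check on any claim about the significance of a model as a whole.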
So, we will select the logarithmic transformation model as our final model to use.
After our analysis of the three potential choices for a final model, we decided that the best model to use is the logarithmic transformation model. This model had the lowest AIC and the mildest residual violations, and its intercept was the only one with a statistically significant p-value.
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 2.4834516 | 0.7666130 | 3.2395115 | 0.0013520 |
X | 0.0211869 | 0.0462344 | 0.4582490 | 0.6471534 |
Y | -0.0367580 | 0.0928944 | -0.3956964 | 0.6926510 |
DC | 0.0000890 | 0.0004891 | 0.1820437 | 0.8556892 |
temp | -0.0181803 | 0.0213391 | -0.8519727 | 0.3950073 |
RH | -0.0098113 | 0.0074753 | -1.3125061 | 0.1904980 |
wind | 0.0356783 | 0.0530221 | 0.6728946 | 0.5016074 |
rain | 0.0910809 | 0.2407878 | 0.3782623 | 0.7055421 |
The intercept of our final model is statistically significant (p = 0.0014), but none of the individual predictors is significant at alpha = 0.05, so the model should be used cautiously for the prediction and estimation of log(area). Among the three candidates, the logarithmic transformation still appears to have been the best choice.
The combined logarithmic model uses X, Y, DC, temp, RH, wind, and rain to predict the log of area; the reported p = 0.0014 is the p-value of its intercept.
The regression equation for our final choice of a regression model is given as follows:
log(area) = 2.4835 + 0.0212 * X - 0.0368 * Y + 0.0001 * DC - 0.0182 * temp - 0.0098 * RH + 0.0357 * wind + 0.0911 * rain.
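As a worked example of this equation, the Python sketch below plugs the first observation of the refined data set into the fitted coefficients. It assumes that log is the natural logarithm (the default of R's log function), and it back-transforms with a plain exponential, which ignores any retransformation bias. The function names are hypothetical.

```python
import math

INTERCEPT = 2.4834516
COEFFICIENTS = {
    "X": 0.0211869, "Y": -0.0367580, "DC": 0.0000890, "temp": -0.0181803,
    "RH": -0.0098113, "wind": 0.0356783, "rain": 0.0910809,
}


def predict_log_area(obs):
    """Linear predictor: intercept plus coefficient * value for each term."""
    return INTERCEPT + sum(COEFFICIENTS[k] * obs[k] for k in COEFFICIENTS)


def predict_area(obs):
    """Naive back-transformation to hectares, assuming a natural log."""
    return math.exp(predict_log_area(obs))


# First row of the refined data set (observed area: 0.36 ha).
first_fire = {"X": 9, "Y": 9, "DC": 313.4, "temp": 18.0,
              "RH": 42, "wind": 2.7, "rain": 0.0}

print(f"predicted log(area): {predict_log_area(first_fire):.4f}")
print(f"predicted area (ha): {predict_area(first_fire):.2f}")
```

Comparing the prediction with the observed 0.36 ha for this fire gives a feel for how much variation the model leaves unexplained.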
Let’s interpret the coefficients of some of the variables in our final model. Also, as a reminder of the variable section of this report, the response variable area is given in units of hectares.
DC: For every 1 unit increase in DC, the measure of the moisture content of compact organic layers in the area affected by the fire, the predicted log of the area burned by a forest fire increases by 0.0001, holding the other variables constant. Because the response is log-transformed, this change is on the log scale rather than in hectares; the same applies to the coefficients below.
temp: For every 1 degree Celsius increase in temperature, the predicted log of the area burned by a forest fire decreases by 0.0182.
RH: For every 1 percentage point increase in the relative humidity, the predicted log of the area burned by a forest fire decreases by 0.0098.
wind: For every 1 kilometer per hour increase in the speed of the wind, the predicted log of the area burned by a forest fire increases by 0.0357.
rain: For every 1 mm/m^2 increase in rain, the predicted log of the area burned by a forest fire increases by 0.0911.
We can see that the variables that are positively associated with the logarithmic transformation of area burned by a forest fire are X, DC, wind, and rain. This indicates that as these variables increase, so does the area burned by a forest fire.
On the other hand, the variables that are negatively associated with the logarithmic transformation of area burned by a forest fire are Y, temp, and RH. This indicates that as these variables increase, the area burned by a forest fire decreases.
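Coefficients of a log-response model are often easier to read as percent changes in the response. The Python sketch below converts each coefficient with (exp(b) - 1) * 100, again assuming a natural log; it is an illustration of the arithmetic, not part of the original analysis.

```python
import math

# Coefficients of the final logarithmic model, copied from the table above.
COEFFICIENTS = {
    "X": 0.0211869, "Y": -0.0367580, "DC": 0.0000890, "temp": -0.0181803,
    "RH": -0.0098113, "wind": 0.0356783, "rain": 0.0910809,
}


def percent_change_per_unit(coef):
    """Percent change in area for a 1 unit increase in the predictor."""
    return (math.exp(coef) - 1.0) * 100.0


percent_changes = {name: percent_change_per_unit(b)
                   for name, b in COEFFICIENTS.items()}

for name, pct in percent_changes.items():
    direction = "increase" if pct > 0 else "decrease"
    print(f"{name:5s}: {abs(pct):.3f}% {direction} in predicted area per unit")
```

For coefficients this small, the percent change is close to 100 times the coefficient itself, which is why the two readings agree in sign and rough size.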
Overall, it was the logarithmic transformation model which turned out to be the most suitable of the candidates for the prediction and estimation of the area burned by a forest fire. This was the final model that was chosen after the analysis of each of the potential final models.
This logarithmic model stood out for having the lowest AIC and for being the only model of the final three whose intercept had a statistically significant p-value. Additionally, while this model still had some violations, such as a slight deviation from normality in its Q-Q Plot and some high leverage points along with some potential outliers, it had far fewer violations than either of the other two models. These observations made it clear that this model was the best choice amongst the options that we had.
Out of all of the variables in our final model, RH has the highest level of statistical significance. However, even this variable had a p-value of 0.1905, which is greater than our value of alpha, 0.05. This indicates that while the intercept of our model is statistically significant (p = 0.0014), none of the individual variables reaches significance on its own.
Overall, the logarithmic model was the strongest of the three candidates for predicting the area burned by a forest fire, and this is the model which was chosen for prediction and estimation.
This data set was found on the website kaggle.com under the collection of data sets. Included below is the citation of the webpage on which the data set can be found as well as the webpage which I looked at to help me find a data set which caught my interest.
Darlington, A. (2017, September 4). Forest fires data set. Kaggle. https://www.kaggle.com/datasets/elikplim/forest-fires-data-set
Kumar, A. (2023, November 10). Linear Regression Datasets: CSV, Excel. https://vitalflux.com/linear-regression-datasets-csv-excel/#google_vignette