Australian bushfires: statistical analysis & exploratory forecasting

Author

Tom Maccor

Published

July 12, 2024

This paper will aim at answering the question:

Is there a statistically significant increase in the incidence (Nr. fires/year) and size of fires, in New South Wales (Australia), since bushfire data started being recorded?

1. Data Collection

An internet search for data sources was performed. The datasets that showed most promise were the ones kept by the NSW Department of Climate Change, Energy, the Environment and Water (click for the URL).

The datasets are in Shapefile format (.shp) -which is a simple, nontopological format for storing the geometric location and attribute information, of geographic features. Geographic features in a shapefile can be represented by points, lines, or polygons (areas) -which is our case for the size of the area burnt each year.

Now, the dataset obtained had to be converted into a format that can be read by the R programming language. This was done via QGIS, an open source Geospatial software (click to go to the QGIS website)

Once a .csv file was obtained, it was read into R:

2. Data analysis & data wrangling

#download.file("https://raw.githubusercontent.com/TMaccor/OZ_bushfires/main/NSW_fire_history_2024.csv",  destfile = "/tmp/NSWfirehistory_2024.csv", method = "curl")

NSW_fire_history <- read_csv("https://raw.githubusercontent.com/TMaccor/OZ_bushfires/main/NSW_fire_history_2024.csv")


glimpse(NSW_fire_history)

Rows: 35,776
Columns: 8
$ FireName   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ FireNo     <chr> NA, "70/161", NA, "0", "0", "0", "0", "0", "0", "0", NA, NA…
$ Label      <chr> "1970-71 Prescribed Burn", "1970-71 Prescribed Burn", "1970…
$ StartDate  <date> NA, NA, 1970-07-01, 1971-07-01, 1971-07-01, 1971-07-01, 19…
$ EndDate    <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ AreaHa     <dbl> 299.98546, 309.02564, 13.60000, 58.99437, 383.15515, 27.241…
$ PerimeterM <dbl> 8615.637, 20657.634, 1999.488, 5312.210, 20564.603, 3844.24…
$ VerDate    <chr> "06/06/2024", "06/06/2024", "06/06/2024", "06/06/2024", "06…

We can see that there is data on bushfires for the period 1899-2024:

datetest <- NSW_fire_history$StartDate %>% na.omit ()
range(datetest)

[1] "1899-12-30" "2024-05-30"

but…there are a lot of missing records:

NSW_fire_history %>% summarise(across(everything(), ~ sum(is.na(.x))))

# A tibble: 1 × 8
  FireName FireNo Label StartDate EndDate AreaHa PerimeterM VerDate
     <int>  <int> <int>     <int>   <int>  <int>      <int>   <int>
1    14633  17523     0     15850   20660      0          0       0

15,850 missing Start Dates out of a total of 35,776 records. That is too much.

A histogram shows that, not surpringingly, there is more are more data recorded in recent years versus the 1st half of the 20th century:

And so:

in order to use as much data as possible, we are going to obtain the fire dates from the ‘LABEL’ variable. This will also allow us to remove all of the ‘prescribed’ burns, since those are not to be taken into account to answer our question.
it makes sense only to use data from 1945 onwards -since pretty much there is no data available prior to this year
BUT….by using the fire dates from the ‘LABEL’ variable, we won’t be able to factor in the yearly seasonality of the bushfire data -as this variable only specifies the YEAR of the fire. Then again, if we had wanted to use the original Start Date data, we wouldn’t have been able to factor in yearly seasonality anyway (precisely due to missing data in this variable).
```
NSW_fire_history2 <- NSW_fire_history %>% filter(grepl('[Wd]ild', Label)) 

NSW_fire_history2 <- NSW_fire_history2 %>% 
                     mutate(derived_year = (as.numeric(sub(".*(\\d{4}).*", "\\1", NSW_fire_history2$Label)) +1) )  %>%
                     filter (derived_year > 1944) 
```

We then proceed to (#1) summarise the total area burnt each year and (#2) count the # of wildfires that ocurred on each year:

Total_Area_burned_year <- NSW_fire_history2%>%
                     group_by(derived_year) %>%
                     summarize(TotalAreaHa = sum(AreaHa, na.rm = TRUE))


Nr_fires_year <- NSW_fire_history2%>%
                 group_by(derived_year) %>%
                 count(derived_year)

We plot those 2 new variables that we have created (‘Total_Area_burned_year’ and ’Nr_fires_year”) to find out more about them (over over the lineplot, to visualize the actual exact values):

And so we can appreciate that the ‘Total Area burnt’ timeseries data has no trend. The seasonal component cannot be determined (as we are only using yearly summarised data) -but we can visualize what could be cycles occuring every 5-10 years (a cyclic pattern for fires in Australia has been extensively documented)^{1, 2, 3, 4, 5} and we can also see the cycles’ frequency increasing, approximately from 1990 onwards.

The ‘Nr. fires per year’ data shows a slight upward trend (again, seasonality cannot be determined).

3. Which are the statistical tests, that can be used to determine if the increase in the data trend observed visually, is statistically significant?

We will calculate this on the ‘Nr. of fires’ timeseries.

Augmented Dickey-Fuller (ADF) Test
The ADF test checks for the presence of a unit root in the time series, which can indicate if the series is non-stationary and has a trend.

Mann-Kendall Trend Test
The Mann-Kendall test is a non-parametric test used to identify trends in a time series without specifying whether the trend is linear or non-linear. It can be particularly useful for environmental and climate data.

Nr_fires_year_ts <- ts(Nr_fires_year$n, start = min(Nr_fires_year$derived_year), frequency = 1)

# ADF Test
adf_test <- adf.test(Nr_fires_year_ts)
adf_test


    Augmented Dickey-Fuller Test

data:  Nr_fires_year_ts
Dickey-Fuller = -1.612, Lag order = 4, p-value = 0.7345
alternative hypothesis: stationary

p > 0.05 , hence the ‘Nr. of fires’ timeseries data is non-stationary (there is a trend)

# Mann-Kendall Trend Test
mk_test <- Kendall(Nr_fires_year$derived_year, Nr_fires_year$n)
mk_test

tau = 0.49, 2-sided pvalue =< 2.22e-16

p < 0.05 -meaning we reject the null, so there is sufficient evidence to say there is a trend.

If we model a linear regression, we obtain a statistically significant equation as well:

lm_model <- lm(n ~ derived_year, data = Nr_fires_year)
summary(lm_model)


Call:
lm(formula = n ~ derived_year, data = Nr_fires_year)

Residuals:
    Min      1Q  Median      3Q     Max 
-460.18 -103.18  -27.14   95.84  918.50 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -12000.810   2012.611  -5.963 6.82e-08 ***
derived_year      6.193      1.014   6.107 3.71e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 209.5 on 78 degrees of freedom
Multiple R-squared:  0.3235,    Adjusted R-squared:  0.3148 
F-statistic:  37.3 on 1 and 78 DF,  p-value: 3.71e-08

# Plot the trend
ggplot(Nr_fires_year, aes(x = derived_year, y = n)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Number of Fires per Year with Linear Trend", x = "Year", y = "Number of Fires") +
  theme(
    panel.background = element_rect(fill = "lightsteelblue1",
                                    colour = "lightsteelblue1",
                                    size = 0.5, linetype = "solid"),
                           )

We have to acknowledge though, that this is a very simple linear model, and that it only explains roughly a quarter of the data. But still, we have yet another method that is telling us that there is statistical significance in the trend.

Having established statistical significance in the trend, we can go further and attempt to answer a 2nd question: determine if the number and size of fires in the period 1945-1957, are statistically significantly different from the period 1980-2023

I could have chosen to compare 2 periods of the same length (same # of years), but I want to use as much data as possible. And by using data from the 1980 and 1990’s , if statistical significance is found it will only further strength the claim (as supposedly, global warming is increasing over time).

To answer this question, I will use the following approach.

Data Preparation: split the data into two subsets: 1950-1955 and 1980-2023.
For each period, calculate the number of fires per year and the total area affected by fires each year.
Calculate the mean, median, standard deviation, and variance for the number of fires per year and the total area affected by fires per year for both periods.
Check if the data for each period follows a normal distribution using the Shapiro-Wilk test or the Kolmogorov-Smirnov test.

Variance Test: Compare the variances of the two periods using Levene’s Test or F-test to check for homogeneity of variances.

Statistical Tests for Mean Comparison: If the data is normally distributed and variances are equal, use the Student’s t-test (independent t-test) to compare the means. If the data is normally distributed but variances are not equal, use Welch’s t-test. If the data is not normally distributed, use the Wilcoxon Rank Sum test (a non-parametric test).

We run the R code for this now.

Code

# Filter data for the two periods
period1 <- NSW_fire_history2 %>% filter(derived_year >= 1945 & derived_year <= 1957)
period2 <- NSW_fire_history2 %>% filter(derived_year > 1980 & derived_year <= 2023)

# Calculate number of fires and total area affected per year for both periods
fires_per_year_p1 <- period1 %>% group_by(derived_year) %>% summarize(count = n(), total_area = sum(AreaHa, na.rm = TRUE))
fires_per_year_p2 <- period2 %>% group_by(derived_year) %>% summarize(count = n(), total_area = sum(AreaHa, na.rm = TRUE))


# Descriptive statistics
desc_p1_fires <- summary(fires_per_year_p1$count)
desc_p2_fires <- summary(fires_per_year_p2$count)
desc_p1_area <- summary(fires_per_year_p1$total_area)
desc_p2_area <- summary(fires_per_year_p2$total_area)

print("Period 1: 1945 to 1957")

[1] "Period 1: 1945 to 1957"

Code

desc_p1_fires

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8.00   13.00   25.00   39.08   59.00   97.00

Code

print("Period 2: 1980 to 2023")

[1] "Period 2: 1980 to 2023"

Code

desc_p2_fires

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   62.0   290.0   389.0   434.7   621.5  1323.0

Above are displayed the summary stats for the ‘Nr. of fires’ of the 2 periods that we selected for our question.

Now, let’s see if the data in the 2 selected periods is normally distributed or not:

Code

# Normality test
normality_p1_fires <- shapiro.test(fires_per_year_p1$count)
normality_p2_fires <- shapiro.test(fires_per_year_p2$count)
normality_p1_area <- shapiro.test(fires_per_year_p1$total_area)
normality_p2_area <- shapiro.test(fires_per_year_p2$total_area)


normality_p1_area


    Shapiro-Wilk normality test

data:  fires_per_year_p1$total_area
W = 0.34612, p-value = 8.447e-07

Code

normality_p1_fires


    Shapiro-Wilk normality test

data:  fires_per_year_p1$count
W = 0.86119, p-value = 0.03994

Code

print("Normality test for Period 1 above")

[1] "Normality test for Period 1 above"

Code

normality_p2_area


    Shapiro-Wilk normality test

data:  fires_per_year_p2$total_area
W = 0.4441, p-value = 1.477e-11

Code

normality_p2_fires


    Shapiro-Wilk normality test

data:  fires_per_year_p2$count
W = 0.93046, p-value = 0.01206

Code

print("Normality test for Period 2 above")

[1] "Normality test for Period 2 above"

All 4 Shapiro-Wilk normality tests have a p value <0.05. Hence, the data in these 2 periods is not normally distributed and therefore, we run a Wilcoxon rank sum test on the data from the 2 periods:

stest_Nr_fires <- wilcox.test(fires_per_year_p1$count, fires_per_year_p2$count)

stest_area <- wilcox.test(fires_per_year_p1$total_area, fires_per_year_p2$total_area)


stest_Nr_fires


    Wilcoxon rank sum test with continuity correction

data:  fires_per_year_p1$count and fires_per_year_p2$count
W = 6, p-value = 1.17e-07
alternative hypothesis: true location shift is not equal to 0

stest_area


    Wilcoxon rank sum exact test

data:  fires_per_year_p1$total_area and fires_per_year_p2$total_area
W = 70, p-value = 1.232e-05
alternative hypothesis: true location shift is not equal to 0

In the Wilcoxon rank sum test, p-values are not strictly defined as a measure of the effect size. If the p-value is lower the significance level (usually 0.05) then we can say that we have statistically significant evidences to reject the null hypothesis -and thus to accept that the data are from different populations (and so, there is a statistically significant difference between the 1945-1957 data and the 1980-2023 data!).

4. FORECASTING

There are numerous methods used for forecasting. And lately, the advent of machine learning techniques has widened the forecasting tools and possibilities exponentially. A bibliographic research was performed, and after consideration, the ARIMA/SARIMA method was chosen. Machine learning methods were discarded in this instance, as ARIMA/SARIMA has recently been found to perform better in environmental data⁶ .

The basis for forecasting via ARIMA/SARIMA using R was taken from the excellent book “Forecasting: principles and practice” (3rd edition) by Rob Hyndman and George Athanosopoulos (click for the URL)

ARIMA/SARIMA models

An autoregressive integrated moving average (ARIMA) model is a form of regression analysis that gauges the strength of one dependent variable relative to other changing variables. The model’s goal is to predict future values by examining the differences between values in the series instead of through actual values.

An ARIMA model can be understood by outlining each of its components as follows:

Autoregression (AR): refers to a model that shows a changing variable that regresses on its own lagged, or prior, values. In other words, it predicts future values based on past values.
Integrated (I): represents the differencing of raw observations to allow the time series to become stationary (i.e., data values are replaced by the difference between the data values and the previous values).
Moving average (MA): incorporates the dependency between an observation and a residual error from a moving average model applied to lagged observations.

Since we’ve had to dispense seasonal variation data in our 2 timeseries (remember we could not use the ‘Start Date’ variable information due to the high number of missing datapoints, hence we could only use aggregated yearly data), the method we will use is ARIMA.

For this, we use the ‘fabletools’ and ‘feasts’ R packages.

And in this case, we will let R choose the best ARIMA model, using the most stringent method (which instructs R to iterate but work harder, using a larger model space).

Code

Nr_fires_year_disag <- Nr_fires_year %>% ungroup()
Nr_fires_year_ts <- as_tsibble(Nr_fires_year_disag, index = derived_year)

fit <- Nr_fires_year_ts %>%
        model(auto = ARIMA(n, stepwise = FALSE, approx = FALSE)   )

fit |> pivot_longer(everything(), names_to = "Model name",
                    values_to = "Orders")

# A mable: 1 x 2
# Key:     Model name [1]
  `Model name`         Orders
  <chr>               <model>
1 auto         <ARIMA(3,1,3)>

The model chosen is ARIMA 3,1,3 (autoregression on the 3rd lag, 1 order of differencing and a moving average window of order 3)

We plot the residuals to check if they are random (equal to “white noise”):

Code

fit |>
  select(auto) |>
  gg_tsresiduals()

They are -so we can then proceed to forecast 5 years ahead:

fitplot <- fit |>
  forecast(h=5) |>
  mutate(n = distributional::dist_truncated(n, 0))  %>%
  filter(.model=='auto') |>
  autoplot(Nr_fires_year_ts) + labs(title="Number of fires per year - NSW", x="Year")


fitplot <- fitplot + theme(
    panel.background = element_rect(fill = "navajowhite2",
                                    colour = "navajowhite2",
                                    size = 0.5, linetype = "solid"),
                           )


fitplot

The forecast obtained is decent (for lack of another descriptor). The confidence intervals are far wider than we would like them to be, but as a first attempt, the forecast is acceptable.

The interactive plot below will let you hover over the actual precise predicted values (though the R command for it currently does not allow the confidence intervals to be displayed).

Code

ggplotly(fitplot)

Forecasting the “Total area burnt/year” doesn’t provide any meaningful result.

And so, here ends this paper.

We could further advance and explore more complex forecasting methods that could be more precise, but I am going to stop here.

The entire code ran in this paper and the datasets are completely reproducible and are available upon request at tmaccor2021@gmail.com

REFERENCES

CSIRO Submission 09/355. Bushfires in Australia. July 2009.
Doerr SH, Santın C. 2016. Global trends in wildfire and its impacts: perceptions versus realities in a changing world. Phil. Trans. R. Soc. B 371: 20150345. http://dx.doi.org/10.1098/rstb.2015.0345
Payne, J.S. Burning Bush: A Fire History of Australia. Heny Holt & Co., Inc. New York, USA, 1991.
Cheney, N.P. Bushfire Disasters in Australia 1945-1975. Aust. For. 1976, Vol. 39, 245:268.
Canadell G. et. al. Multi-decadal increase of forest burnt area in Australia is linked to Climate Change. Nature Communications 2021. https://doi.org/10.1038/s41467-021-27225-4
Dutta R, Das A, Aryal J. 2016 Big data integration shows Australian bush-fire frequency is increasing significantly. R. Soc. Open Sci. 3: 150241. http://dx.doi.org/10.1098/rsos.150241