This project aims to use data modelling to explore the immediate impacts of cyclones on reef condition, specifically coral cover. In turn, we aim to examine which indicators best predict the impacts of cyclone conditions on immediate physical reef damage. Additionally, we use our model to develop an interactive platform through which to communicate scientific modelling results to general audiences, thereby increasing the accessibility of our results. Our research questions were:
a)How do wave conditions during tropical cyclones impact coral cover in Capricorn Bunker, Great Barrier Reef; and
b)How can these changes be communicated as an accessible educational tool?
1.2 Background
2 Methods
ff
2.1 Data Collection
To model and examine the impacts of cyclones on the CBG, we extract variables from multiple datasets, using wave climates and cyclone windows as predictors and coral cover as response variables.
2.2Study Site
The Capricorn-Bunker Group (CBG) is a well-studied region of the GBR that contains a total of 17 islands, approximately 10 km west of the edge of the continental shelf in the southernmost sector of the GBR (Jell and Flood 1978). The region experiences relatively fewer, but more temporally clustered cyclones relative to the central GBR (Wolff et al. 2018). Notably, Southern reefs have been less impacted by marine heatwaves and bleaching events compared to other regions of the GBR (Hughes et al., 2017), and the CBG has had no severe outbreaks of crown-of-thorns starfish since monitoring started (AIMS 2025). Thus, the CBG provides a robust case study for isolating cyclone impacts from other confounding environmental stressors.
The study consisted of 10 reefs in the Capricorn-Bunker Group (CBG) (see Figure 1). Coral cover has been monitored annually-biannually at these reefs since 2006 (for 4 reefs, monitoring started in 1992) under the Australian Institute of Marine Science Long-Term Monitoring Program (AIMS 2025). The LTMP uses manta tow surveys, where a diver towed at a constant speed behind a boat records percentage estimations of coral cover around the entire slope perimeter. As cyclone damage is generally concentrated in the orientation of incoming waves (Vila-Concejo and Kench 2017), manta tow data provides a robust indicator of reef-wide cyclone impacts that may be heterogeneously distributed.
Figure 1: Study site: 10 reefs in the Capricorn-Bunker Group, Southern Great Barrier Reef. a) Study context, the white circle represents the 1000km cyclone radius used to constrain cyclone tracks. b) Higher resolution view of the 10 reefs in this study: North Reef, Broomfield reef, Wreck Island Reef, One Tree Island Reef, Erskine Reef, Mast Head Reef, Boult Reef, Hoskyn Islands Reef, Fairfax Islands Reef, Lady Musgrave Reef. Orange markers indicate the island measured under the Australian Institute of Marine Science Long-Term Monitoring Program. Blue markers indicate the gridded locations used to collect wave hindcast data from the Centre for Australian Weather and Climate Wave Hindcast Model. Images from: AIMS (2025), Google Earth (2026).
2.3 Datasets
2.3.1 Wave Data: CAWCR Wave Hindcast Model
The Centre for Australian Weather and Climate Research (CAWCR) Wave Hindcast Model is a global model based on forcing conditions derived from surface winds and sea ice concentrations in the Climate Forecast System Reanalysis. It has a 0.4 x 0.4 ° global resolution (including a series of smaller, nested grids in Australia), and produces hourly wave hindcast outputs. Wave data was collected from unique grid points adjacent to the 10 reefs monitored in the AIMS LTMP in NetCDF format (see figure 1). Hourly data from 1992 to present was extracted for all 10 reefs, and merged into a single dataset. Each grid point was linked to its corresponding reef name, to generate a combined dataset accounting for hourly wave conditions at all 10 study sites since 1992.
Variables retained for analysis were significant wave height, Hs (in meters), peak wave frequency, Fp ( Hz, discretised from 0.038 to 0.5 Hz), and direction, Dir (° from North, discretised into 24 intervals of 15°). From these variables, wave energy, \(\textbf{E}\) (J/m2) and wave power, \(\textbf{P}\) (kW/m) were calculated to account for the cumulative impacts of height and speed. Deep-water Linear Wave Theory equations were used, as depths at each wave data point ranged from 8-60 m (Google Earth 2026). The following calculations were made using Microsoft Excel:
\(\rho\) (seawater density) was 1025 kg/m3 in line with mean seawater density (Karnauskas 2020);
\(g\) (gravitational acceleration) was 9.81 m/s2 ; and
the coefficient was reduced to \(\frac{1}{16}\) (Salles, personal communication), as significant wave height values are derived from a spectrum of many irregular sea waves, rather than a single idealised wave height.
Wave Power,\(\textbf{P}\)
\[
\textbf{P} = \textbf{E}\,C\,n
\]
Where:
\(n=\frac{1}{2}\) to reflect deep water equation parameters;
celerity: \(C=1.56\,T_p\) ; and
peak period \(T_p\) is the inverse of peak wave frequency \(F_p\) by: \(T_p =\frac{1}{F_p}\)
2.3.2 Cyclone Tracks
Cyclone tracks were used to identify where cyclones entered a 1000km radius of each study reef. While the spatial extent of TC gales is 300-500 km (Grossmann-Matheson et al. 2024), propagated waves can travel up to 20,000 km from their source (Hoeke et al. 2013). Recently, waves of 4.4 m height propagated during TC Justin were recorded 630 km away (Smith, Vila-Concejo, and Salles 2023). Thus, we chose the radius of 1000 km as a conservative estimate, to reduce confounding effects of regular sea-states conditions whilst still accounting for potential spatial and temporal lags in wave propagation.
Cyclone track data was accessed from International Best Track Archive for Climate Stewardship (IBTrACS, v04r01). For each TC, the dataset provides the storm location (latitude and longitude coordinates), and characteristics such as storm category and wind speed, recorded at 3 hour intervals. Cyclone data was merged with wave data by matching the same date, hour, and minute between datasets, filtering observations from 1992 onwards. The spatial distance between gridded reef wave points and cyclone locations was calculated using the Haversine formula (Robusto 1957), and only cyclone observations within 1000km of the reefs were kept. Thus, the combined wave and cyclone data contained reef-specific, three-hourly wave climates for the duration of time cyclones were within 1000km of the study sites.
2.3.3 Coral Cover
To quantify the impacts of wave climates on coral cover, the combined wave and cyclone dataset was merged with AIMS LTMP coral cover observations according to individual cyclone conditions. Datasets were merged using reef name as the common key, matching the surveys in the manta tow dataset for every cyclone reef row in the combined wave cyclone dataset. New columns were created to calculate descriptive wave climate characteristics across the duration of each observed cyclone for each specific reef. The following variables were retained in the final dataset:
2.3.4 Final Merged Dataset
Finally, live and dead coral cover data were extracted according to the nearest monitoring dates identified before and after each observed cyclone. Thus, the merged dataset is cyclone-frequency dependent, and includes both reef-specific wave conditions, and temporally related coral cover observations. The final merged dataset includes 152 observations across 10 reefs and 17 cyclones, and a combination of numeric, categorial and temporal variables related to wave conditions, cyclone exposure, and coral cover. (For detailed methods see Appendix Section 7.1).
2.4 Modelling
Data cleaning and preprocessing were performed during the dataset merging process by removing any observations with missing longitude or latitude values, filtering those from 1992 onwards, and keeping only cyclones within 1000km of reef locations. After completing all merging and filtering steps, the final dataset contained 152 observations across 10 reefs and 17 cyclones, with a combination of numeric, categorical, and temporal variables related to wave conditions, cyclone exposure, and coral cover. The final dataset contained no missing values. The primary response variable of interest was the change in live coral cover, calculated using the difference between pre- and post-cyclone coral measurements.
Code (Reading in Data)
data <-read_csv("merging datasets/merged_cyclone_coral_v4.csv", show_col_types =FALSE)all_vars <-c("Mean_hs","Max_hs","Intervals_hs_gt4","Intervals_hs_gt3","Intervals_hs_gt25","Mean_fp","Max_fp","Mean_dir","Std_dir","Min_distance_km","Duration_hours","Mean_E","Max_E","Mean_P","Max_P")all_data <- data %>% dplyr::select(all_of(all_vars)) %>%mutate(across(everything(), as.numeric))selected_vars <-c("Intervals_hs_gt3","Mean_hs","Mean_P","Mean_fp","Mean_dir","Min_distance_km","Duration_hours")cyclone_exposure <- data %>% dplyr::select(all_of(selected_vars))
Multicollinearity between predictor variables was also checked using correlation analysis, where several wave-related variables showed strong correlations, such as mean and maximum significant wave height as seen in Figure 2. Final variable selection, while considering multicollinearity, included mean significant wave height, hours with significant wave height exceeding 3m, mean power, mean peak wave frequency, mean wave direction, minimum distance from cyclone to reef, and cyclone duration hours. In particular, mean significant wave height and mean wave power were highly correlated since wave power is partially derived from wave height. However, both variables were retained due to their environmental importance and potential contribution to predicting coral cover change (Ferrario et al. 2014).
Figure 2: Pearson correlation heatmap of cyclone exposure variables used in the modelling process. Red colours indicate positive correlations, while blue colours indicate negative correlations. Darker red cells indicate stronger positive correlations and highlight potential multicollinearity among several wave height and energy-related predictors.
2.5 Model Assumptions
Several modelling assumptions were considered throughout the analysis. One key assumption for all models was data independence. However, this assumption is inherently violated to some extent because geographically proximate reefs may share similar environmental conditions and cyclone exposure. Despite this spatial dependence being unavoidable in ecological datasets of this kind, each reef–cyclone observation was treated as independent for modelling purposes.
Also, the training and testing datasets were assumed to come from similar distributions, with the training data being representative of broader reef and cyclone exposure conditions. In particular, for linear regression, additional assumptions included linearity between predictor variables and coral cover change, homoscedasticity (constant variance), and normality of residuals, along with multicollinearity between variables checked before modelling.
Repeated 5-fold cross-validation (CV) with 50 repeats was applied to evaluate model performance more reliably, with performance metrics including Root Mean Square Error (RMSE) and R2 values calculated for each fold before averaging the results for comparison, summarised in Table 1 and Figure 3. Based on the overall model performance, Random Forest emerged as the strongest candidate model, achieving the lowest mean RMSE of 9.517 in Figure 3 (a) and the highest mean R2 value of 0.371 in Figure 3 (b). This indicates that the model was able to explain approximately 37% of the variance in coral cover change while also producing the lowest prediction error among all evaluated models.
Extra Trees also demonstrated relatively strong performance with a mean RMSE of 9.686 and an R2 value of 0.332, while XGBoost and KNN Regression showed moderate predictive ability. In contrast, Linear Regression produced the highest RMSE and the lowest R2 value, suggesting that simpler linear relationships were less capable of capturing the complexity of environmental influences on coral cover change.
The repeated cross-validation boxplots further showed that Random Forest maintained a comparatively stable spread in both RMSE and R2 values across folds compared with several other models, particularly Linear Regression and KNN Regression. This suggests that the Random Forest model produced more consistent predictive performance and generalised more reliably across different validation subsets. Therefore, Random Forest was selected as the final model for this project.
(b) Coefficient of determination R-squared. Higher R-squared values indicate better model fit and explanatory performance.
Figure 3: Boxplot comparisons of our chosen candidate classification models using repeated 5-fold cross-validation.
To further interpret the Random Forest model, a SHAP (Shapley) beeswarm plot was generated to explore the contribution of each predictor variable to predicting coral cover change (Figure 4). The sign of SHAP values refers to the direction of model contribution, indicating whether the predictor variable increases or decreases predicted coral cover change relative to the baseline prediction, while the SHAP magnitude indicates the strength of contribution (Lundberg et al. 2020). The colour gradient represents the feature values of the predictor variables. From Figure 4, higher values of Mean_hs generally contributed towards more positive predicted coral cover changes, while higher values of Duration_hours contributed towards greater coral loss. The SHAP values also illustrated that the influence of these variables varied across observations, suggesting complex non-linear relationships captured by the model.
Figure 4: SHAP (Shapley) beeswarm plot for the Random Forest model showing the contribution and direction of predictor variables on coral cover change predictions. Features with larger absolute SHAP values have greater influence on model output. Positive SHAP values indicate an increase in predicted coral cover change, while negative SHAP values indicate a decrease.
3.2 Product Deployment
(a) Explore Past Cyclones tab.,Design Your Own Cyclone tab.
(b) Explore Past Cyclones tab.,Design Your Own Cyclone tab.
Figure 5: Shiny
4 Discussion
ffff
5 Conclusion
conclusions
6 References
AIMS. 2025. “AIMS Long-Term Monitoring Program: Crown-of-Thorns Starfish and Benthos Manta Tow Data (Great Barrier Reef).”https://doi.org/10.25845/5c09b0abf315a.
Ferrario, Filippo, Michael W. Beck, Curt D. Storlazzi, Fiorenza Micheli, Christine C. Shepard, and Laura Airoldi. 2014. “The Effectiveness of Coral Reefs for Coastal Hazard Risk Reduction and Adaptation.”Nature Communications 5 (1). https://doi.org/10.1038/ncomms4794.
Google Earth. 2026. “Gladstone and the Capricorn Bunker Reef Group, Great Barrier Reef: 23.5 s, 152.07 e.” Satellite map. https://earth.google.com/web/.
Grossmann-Matheson, Georg, Ian R. Young, Andrea Meucci, and Jose-Henrique Alves. 2024. “Global Tropical Cyclone Extreme Wave Height Climatology.”Scientific Reports 14 (1): 4167. https://doi.org/10.1038/s41598-024-54691-9.
Hoeke, Ron K., Kathleen L. McInnes, Jonathan C. Kruger, Robin J. McNaught, John R. Hunter, and Scott G. Smithers. 2013. “Widespread Inundation of Pacific Islands Triggered by Distant-Source Wind-Waves.”Global and Planetary Change 108: 128–38. https://doi.org/10.1016/j.gloplacha.2013.06.006.
Jell, J. S., and P. G. Flood. 1978. “Guide to the Geology of Reefs of the Capricorn and Bunker Groups, Great Barrier Reef Province, with Special Reference to Heron Reef.”Australasian Sedimentologists Group 8 (3): 1–85.
Karnauskas, Kristopher B. 2020. “Physical Diagnosis of the 2016 Great Barrier Reef Bleaching Event.”Geophysical Research Letters 47 (11). https://doi.org/10.1029/2019gl086177.
Lundberg, Scott M., Gabriel Erion, H. Chen, A. DeGrave, Jordan M. Prutkin, B. Nair, R. Katz, Jonathan Himmelfarb, N. Bansal, and Su-In Lee. 2020. “From Local Explanations to Global Understanding with Explainable AI for Trees.”Nature Machine Intelligence 2: 56–67. https://doi.org/10.1038/s42256-019-0138-9.
Robusto, Carl C. 1957. “The Cosine-Haversine Formula.”The American Mathematical Monthly 64 (1): 38–40. https://doi.org/10.2307/2309088.
Vila-Concejo, Alexandra, and Paul Kench. 2017. “Storms in Coral Reefs: Processes and Impacts.” In Coastal Storms: Processes and Impacts, 127–49. https://doi.org/10.1002/9781118937099.ch7.
Wolff, Nicholas H., Peter J. Mumby, Marji Devlin, and Ken R. N. Anthony. 2018. “Vulnerability of the Great Barrier Reef to Climate Change and Local Pressures.”Global Change Biology 24 (5): 1978–91. https://doi.org/10.1111/gcb.14043.
7 Appendix
7.1 Merging Datasets
All netCDF wave files were combined into a single dataset, with each wave point linked to its corresponding reef. This produced the final wave dataset, “combined_wave_data.csv”, which included significant wave height, peak wave frequency, wave direction, and related variables.
The wave dataset (“combined_wave_data.csv”) was then merged with the IBTrACS cyclone dataset (“ibtracs.since1980.list.v04r01.csv”) by matching observations on date, hour, and minute. The spatial distance between reef wave points and cyclone locations was calculated using the Haversine formula and stored as a new variable. Only cyclone observations within 1000 km of reef locations were retained, resulting in the final combined dataset, “combined_wave_cyclone.csv”.
The dataset (“combined_wave_cyclone.csv”) was then merged with the AIMS manta tow coral cover dataset (“capricornbunkermantatowdata.csv”) for use in the final modelling analysis, producing “merged_coral_cover.csv”. The datasets were joined using Reef_ID as the common key, matching manta tow survey records to each cyclone–reef observation in the wave–cyclone dataset.
Additional variables were then derived for analysis, including mean and maximum wave height, the number of hours during which wave height exceeded thresholds of 2.5 m, 3 m, and 4 m, wave direction, wave period, and duration of cyclone impact. Further calculations included minimum distance between cyclone centre and reef, as well as mean and maximum wave energy and wave power.
7.2 Final Variables and their Descriptions
Final ccewdfeio
Variable Name
Description
Mean_hs
Mean significant wave height (m)
Intervals_hs_gt3
Hours of waves > 3 metres significant wave height (hours)
Mean_fp
Mean peak frequency (Hz)
Mean_dir
Mean wave direction from North (°)
Min_distance_km
Minimum distance between cyclone and reef (km)
Duration_hours
Duration of time cyclone was in 1000 km radius of reef (hours)
Mean_P
Mean wave power (kW/m)
Source Code
---title: "Modelling the Impacts of Cyclone-Induced Waves on Coral Cover in the Capricorn Bunker, Great Barrier Reef"date: "`r Sys.Date()`"author: "Reef 1"format: html: embed-resources: true code-fold: true code-tools: true fig_caption: true theme: unitedexecute: cache: truetable-of-contents: truenumber-sections: truebibliography: references.bibreference-location: section---```{r}#| message: false#| warning: false#| code-summary: "Code (Loading Packages)"library(tidyverse)library(lubridate)library(geosphere)library(caret)library(MASS)library(randomForest)library(FNN)library(broom)library(knitr)library(tidymodels)library(xgboost)library(ranger)library(fastshap)library(shapviz)library(patchwork)```# Introduction## Project AimsThis project aims to use data modelling to explore the immediate impacts of cyclones on reef condition, specifically coral cover. In turn, we aim to examine which indicators best predict the impacts of cyclone conditions on immediate physical reef damage. Additionally, we use our model to develop an interactive platform through which to communicate scientific modelling results to general audiences, thereby increasing the accessibility of our results. Our research questions were:**a)** *How do wave conditions during tropical cyclones impact coral cover in Capricorn Bunker, Great Barrier Reef;* and**b)** *How can these changes be communicated as an accessible educational tool?*## Background# Methodsff## Data CollectionTo model and examine the impacts of cyclones on the CBG, we extract variables from multiple datasets, using wave climates and cyclone windows as predictors and coral cover as response variables.## **Study Site**The Capricorn-Bunker Group (CBG) is a well-studied region of the GBR that contains a total of 17 islands, approximately 10 km west of the edge of the continental shelf in the southernmost sector of the GBR [@Jell1978]. The region experiences relatively fewer, but more temporally clustered cyclones relative to the central GBR [@Wolff2018]. Notably, Southern reefs have been less impacted by marine heatwaves and bleaching events compared to other regions of the GBR (Hughes et al., 2017), and the CBG has had no severe outbreaks of crown-of-thorns starfish since monitoring started [@AIMS2025]. Thus, the CBG provides a robust case study for isolating cyclone impacts from other confounding environmental stressors.The study consisted of 10 reefs in the Capricorn-Bunker Group (CBG) (see Figure 1). Coral cover has been monitored annually-biannually at these reefs since 2006 (for 4 reefs, monitoring started in 1992) under the Australian Institute of Marine Science Long-Term Monitoring Program [@AIMS2025]. The LTMP uses manta tow surveys, where a diver towed at a constant speed behind a boat records percentage estimations of coral cover around the entire slope perimeter. As cyclone damage is generally concentrated in the orientation of incoming waves [@VilaConcejo2017], manta tow data provides a robust indicator of reef-wide cyclone impacts that may be heterogeneously distributed. ```{r}#| label: fig-study-site#| echo: false#| message: false#| warning: false#| fig-align: center#| fig-cap: "Study site: 10 reefs in the Capricorn-Bunker Group, Southern Great Barrier Reef. a) Study context, the white circle represents the 1000km cyclone radius used to constrain cyclone tracks. b) Higher resolution view of the 10 reefs in this study: North Reef, Broomfield reef, Wreck Island Reef, One Tree Island Reef, Erskine Reef, Mast Head Reef, Boult Reef, Hoskyn Islands Reef, Fairfax Islands Reef, Lady Musgrave Reef. Orange markers indicate the island measured under the Australian Institute of Marine Science Long-Term Monitoring Program. Blue markers indicate the gridded locations used to collect wave hindcast data from the Centre for Australian Weather and Climate Wave Hindcast Model. Images from: AIMS (2025), Google Earth (2026)."#| out-width: 100%knitr::include_graphics("www/study-site.png")```## Datasets### Wave Data: CAWCR Wave Hindcast ModelThe Centre for Australian Weather and Climate Research (CAWCR) Wave Hindcast Model is a global model based on forcing conditions derived from surface winds and sea ice concentrations in the Climate Forecast System Reanalysis. It has a 0.4 x 0.4 ° global resolution (including a series of smaller, nested grids in Australia), and produces hourly wave hindcast outputs. Wave data was collected from unique grid points adjacent to the 10 reefs monitored in the AIMS LTMP in NetCDF format (see figure 1). Hourly data from 1992 to present was extracted for all 10 reefs, and merged into a single dataset. Each grid point was linked to its corresponding reef name, to generate a combined dataset accounting for hourly wave conditions at all 10 study sites since 1992. Variables retained for analysis were significant wave height, **H~s~** (in meters), peak wave frequency, **Fp** ( Hz, discretised from 0.038 to 0.5 Hz), and direction, **Dir** (° from North, discretised into 24 intervals of 15°). From these variables, wave energy, $\textbf{E}$ (J/m^2^) and wave power, $\textbf{P}$ (kW/m) were calculated to account for the cumulative impacts of height and speed. Deep-water Linear Wave Theory equations were used, as depths at each wave data point ranged from 8-60 m [@GoogleEarth2026]. The following calculations were made using Microsoft Excel:- **Wave Energy,** $\textbf{E}$\ $$ \mathbf{E}= \frac{1}{16} \, \rho \, g \, \mathbf{H}_{\mathrm{s}}^{2}$$ Where: - $\rho$ (seawater density) was 1025 kg/m^3^ in line with mean seawater density [@Karnauskas2020]; - $g$ (gravitational acceleration) was 9.81 m/s^2^ ; and - the coefficient was reduced to $\frac{1}{16}$ (Salles, personal communication), as significant wave height values are derived from a spectrum of many irregular sea waves, rather than a single idealised wave height.- **Wave Power,** $\textbf{P}$ $$ \textbf{P} = \textbf{E}\,C\,n $$ Where: - $n=\frac{1}{2}$ to reflect deep water equation parameters; - celerity: $C=1.56\,T_p$ ; and - peak period $T_p$ is the inverse of peak wave frequency $F_p$ by: $T_p =\frac{1}{F_p}$### Cyclone TracksCyclone tracks were used to identify where cyclones entered a 1000km radius of each study reef. While the spatial extent of TC gales is 300-500 km [@GrossmannMatheson2024], propagated waves can travel up to 20,000 km from their source [@Hoeke2013]. Recently, waves of 4.4 m height propagated during TC Justin were recorded 630 km away [@Smith2023]. Thus, we chose the radius of 1000 km as a conservative estimate, to reduce confounding effects of regular sea-states conditions whilst still accounting for potential spatial and temporal lags in wave propagation. Cyclone track data was accessed from International Best Track Archive for Climate Stewardship (IBTrACS, v04r01). For each TC, the dataset provides the storm location (latitude and longitude coordinates), and characteristics such as storm category and wind speed, recorded at 3 hour intervals. Cyclone data was merged with wave data by matching the same date, hour, and minute between datasets, filtering observations from 1992 onwards. The spatial distance between gridded reef wave points and cyclone locations was calculated using the Haversine formula [@Robusto1957], and only cyclone observations within 1000km of the reefs were kept. Thus, the combined wave and cyclone data contained reef-specific, three-hourly wave climates for the duration of time cyclones were within 1000km of the study sites.### Coral CoverTo quantify the impacts of wave climates on coral cover, the combined wave and cyclone dataset was merged with AIMS LTMP coral cover observations according to individual cyclone conditions. Datasets were merged using reef name as the common key, matching the surveys in the manta tow dataset for every cyclone reef row in the combined wave cyclone dataset. New columns were created to calculate descriptive wave climate characteristics across the duration of each observed cyclone for each specific reef. The following variables were retained in the final dataset:```{r EDA_cleaning}#| message: false#| warning: false```### Final Merged DatasetFinally, live and dead coral cover data were extracted according to the nearest monitoring dates identified before and after each observed cyclone. Thus, the merged dataset is cyclone-frequency dependent, and includes both reef-specific wave conditions, and temporally related coral cover observations. The final merged dataset includes 152 observations across 10 reefs and 17 cyclones, and a combination of numeric, categorial and temporal variables related to wave conditions, cyclone exposure, and coral cover. (For detailed methods see Appendix @sec-merging-datasets).## ModellingData cleaning and preprocessing were performed during the dataset merging process by removing any observations with missing longitude or latitude values, filtering those from 1992 onwards, and keeping only cyclones within 1000km of reef locations. After completing all merging and filtering steps, the final dataset contained 152 observations across 10 reefs and 17 cyclones, with a combination of numeric, categorical, and temporal variables related to wave conditions, cyclone exposure, and coral cover. The final dataset contained no missing values. The primary response variable of interest was the change in live coral cover, calculated using the difference between pre- and post-cyclone coral measurements. ```{r}#| message: false#| warning: false#| code-summary: "Code (Reading in Data)"data <-read_csv("merging datasets/merged_cyclone_coral_v4.csv", show_col_types =FALSE)all_vars <-c("Mean_hs","Max_hs","Intervals_hs_gt4","Intervals_hs_gt3","Intervals_hs_gt25","Mean_fp","Max_fp","Mean_dir","Std_dir","Min_distance_km","Duration_hours","Mean_E","Max_E","Mean_P","Max_P")all_data <- data %>% dplyr::select(all_of(all_vars)) %>%mutate(across(everything(), as.numeric))selected_vars <-c("Intervals_hs_gt3","Mean_hs","Mean_P","Mean_fp","Mean_dir","Min_distance_km","Duration_hours")cyclone_exposure <- data %>% dplyr::select(all_of(selected_vars))```Multicollinearity between predictor variables was also checked using correlation analysis, where several wave-related variables showed strong correlations, such as mean and maximum significant wave height as seen in @fig-correlation-heatmap. Final variable selection, while considering multicollinearity, included mean significant wave height, hours with significant wave height exceeding 3m, mean power, mean peak wave frequency, mean wave direction, minimum distance from cyclone to reef, and cyclone duration hours. In particular, mean significant wave height and mean wave power were highly correlated since wave power is partially derived from wave height. However, both variables were retained due to their environmental importance and potential contribution to predicting coral cover change [@Ferrario2014].```{r fig.height=8, fig.width=10}#| label: fig-correlation-heatmap#| message: false#| warning: false#| code-summary: Code (Correlation Heatmap)#| fig-cap: Pearson correlation heatmap of cyclone exposure variables used in the modelling process. Red colours indicate positive correlations, while blue colours indicate negative correlations. Darker red cells indicate stronger positive correlations and highlight potential multicollinearity among several wave height and energy-related predictors.correlation_matrix <- cor( all_data, use = "pairwise.complete.obs", method = "pearson")# Keep only lower trianglecorrelation_matrix[upper.tri(correlation_matrix)] <- NAcorrelation_long <- as.data.frame(as.table(correlation_matrix)) %>% as_tibble() %>% rename( variable_y = Var1, variable_x = Var2, correlation = Freq ) %>% filter(!is.na(correlation)) %>% mutate( variable_x = factor(variable_x, levels = all_vars), variable_y = factor(variable_y, levels = rev(all_vars)), correlation_label = sprintf("%.2f", correlation) )ggplot(correlation_long, aes(x = variable_x, y = variable_y, fill = correlation)) + geom_tile(color = "white", linewidth = 0.5) + geom_text(aes(label = correlation_label), size = 3) + scale_fill_gradient2( low = "#2166AC", mid = "white", high = "#B2182B", midpoint = 0, limits = c(-1, 1), name = "Pearson\ncorrelation" ) + coord_fixed() + labs( title = "Correlation Heatmap of Cyclone Exposure Variables", x = NULL, y = NULL ) + theme_minimal(base_size = 12) + theme( plot.title = element_text(face = "bold", size = 15), axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1), axis.text.y = element_text(hjust = 1), panel.grid = element_blank() )```## Model AssumptionsSeveral modelling assumptions were considered throughout the analysis. One key assumption for all models was data independence. However, this assumption is inherently violated to some extent because geographically proximate reefs may share similar environmental conditions and cyclone exposure. Despite this spatial dependence being unavoidable in ecological datasets of this kind, each reef–cyclone observation was treated as independent for modelling purposes.Also, the training and testing datasets were assumed to come from similar distributions, with the training data being representative of broader reef and cyclone exposure conditions. In particular, for linear regression, additional assumptions included linearity between predictor variables and coral cover change, homoscedasticity (constant variance), and normality of residuals, along with multicollinearity between variables checked before modelling.# Results## Comparison of Models```{r}#| message: false#| warning: false#| label: setup-5-fold-CV#| code-summary: Code (Setup 5-fold CV)# Create target variablemodel_data <- data %>%mutate(live_coral_change = Post_MEAN_LIVE_CORAL - Pre_MEAN_LIVE_CORAL )selected_vars <- selected_vars[selected_vars %in%names(model_data)]# Prepare modelling datamodel_data <- model_data %>% dplyr::select(live_coral_change, all_of(selected_vars)) %>%drop_na()# Train-test splitset.seed(3888)train_index <-sample(1:nrow(model_data), size =0.75*nrow(model_data))train_data <- model_data[train_index, ]test_data <- model_data[-train_index, ]# Evaluation functionevaluate_model <-function(actual, predicted) {tibble(RMSE =sqrt(mean((actual - predicted)^2)),MSE =mean((actual - predicted)^2),MAE =mean(abs(actual - predicted)),R_squared =1-sum((actual - predicted)^2) /sum((actual -mean(actual))^2) )}# Use the repeated 5-fold CV setting for all modelsset.seed(3888)cv_control <-trainControl(method ="repeatedcv",number =5,repeats =50,savePredictions ="final")set.seed(3888)shared_folds <-vfold_cv(train_data, v =5)summarise_cv_results <-function(cv_result, model_name, interpretability, notes) { cv_result %>%summarise(RMSE =mean(RMSE),MSE =mean(MSE),MAE =mean(MAE),R_squared =mean(R_squared),.groups ="drop" ) %>%mutate(Model = model_name,Interpretability = interpretability,Notes = notes )}``````{r}#| message: false#| warning: false#| label: linear-regression#| code-summary: Code (Linear Regression)# Linear Regression with backward selection using repeated 5-fold CVset.seed(3888)lm_cv_model <- caret::train( live_coral_change ~ .,data = train_data,method ="lmStepAIC",trControl = cv_control,metric ="RMSE",trace =FALSE)lm_cv_results <- lm_cv_model$pred %>%group_by(Resample) %>%summarise(RMSE =sqrt(mean((obs - pred)^2)),MSE =mean((obs - pred)^2),MAE =mean(abs(obs - pred)),R_squared =1-sum((obs - pred)^2) /sum((obs -mean(obs))^2),.groups ="drop" )lm_result <-summarise_cv_results( lm_cv_results,"Linear Regression","High","Backward selection evaluated with 5-fold CV")``````{r}#| message: false#| warning: false#| label: knn-regression#| code-summary: Code (k-NN Regression)# KNN Regression using repeated 5-fold CVset.seed(3888)knn_grid <-expand.grid(k =c(3, 5, 7, 9, 11, 13, 15))knn_cv_model <- caret::train( live_coral_change ~ .,data = train_data,method ="knn",trControl = cv_control,tuneGrid = knn_grid,metric ="RMSE",preProcess =c("center", "scale"))best_k <- knn_cv_model$bestTune$kknn_cv_results <- knn_cv_model$pred %>%filter(k == best_k) %>%group_by(Resample) %>%summarise(RMSE =sqrt(mean((obs - pred)^2)),MSE =mean((obs - pred)^2),MAE =mean(abs(obs - pred)),R_squared =1-sum((obs - pred)^2) /sum((obs -mean(obs))^2),.groups ="drop" )knn_result <-summarise_cv_results( knn_cv_results,paste0("KNN Regression (k = ", best_k, ")"),"Low-Medium","Tuned k and evaluated with 5-fold CV")``````{r}#| message: false#| warning: false#| label: random-forest#| code-summary: Code (Random Forest)#| # Random Forest using repeated 5-fold CVset.seed(3888)rf_grid <-expand.grid(mtry =c(2, 3, 4, 5),splitrule ="variance",min.node.size =c(2, 5, 10))tree_values <-c(300, 500, 700)rf_models <-list()rf_results <-data.frame()for (ntree_val in tree_values) { rf_temp <- caret::train( live_coral_change ~ .,data = train_data,method ="ranger",trControl = cv_control,tuneGrid = rf_grid,metric ="RMSE",num.trees = ntree_val,importance ="impurity" ) temp_results <- rf_temp$results temp_results$num.trees <- ntree_val rf_results <-rbind(rf_results, temp_results) rf_models[[paste0("trees_", ntree_val)]] <- rf_temp}best_rf <- rf_results %>%arrange(RMSE) %>%slice(1)best_mtry <- best_rf$mtrybest_min_node <- best_rf$min.node.sizebest_num_trees <- best_rf$num.treesbest_rf_model <- rf_models[[paste0("trees_", best_num_trees)]]rf_cv_results <- best_rf_model$pred %>%filter(mtry == best_mtry) %>%group_by(Resample) %>%summarise(RMSE =sqrt(mean((obs - pred)^2)),MSE =mean((obs - pred)^2),MAE =mean(abs(obs - pred)),R_squared =1-sum((obs - pred)^2) /sum((obs -mean(obs))^2),.groups ="drop" )rf_result <-summarise_cv_results( rf_cv_results,"Random Forest","Medium","Tuned mtry and evaluated with 5-fold CV")# Final Random Forest fitted on the training data for SHAP interpretationrf_model <-randomForest( live_coral_change ~ .,data = train_data,ntree =500,mtry = best_mtry,importance =TRUE)``````{r}#| message: false#| warning: false#| label: XGBoost#| code-summary: Code (XGBoost)# XGBoost using repeated 5-fold CVset.seed(3888)xgb_train <- train_dataxgb_test <- test_dataxgb_folds <- shared_foldsxgb_recipe <-recipe(live_coral_change ~ ., data = xgb_train) %>%step_zv(all_predictors())xgb_model <-boost_tree(trees =300,tree_depth =tune(),learn_rate =tune(),loss_reduction =tune(),sample_size =tune(),mtry =tune(),min_n =tune()) %>%set_engine("xgboost") %>%set_mode("regression")xgb_grid <-expand_grid(tree_depth =c(1, 2, 3),learn_rate =c(0.01, 0.03, 0.05),loss_reduction =c(0, 0.01),sample_size =c(0.7, 0.9),mtry =c(3, 5, 7),min_n =c(5, 10))xgb_workflow <-workflow() %>%add_recipe(xgb_recipe) %>%add_model(xgb_model)xgb_tuned <-tune_grid( xgb_workflow,resamples = xgb_folds,grid = xgb_grid,metrics = yardstick::metric_set( yardstick::rmse, yardstick::mae, yardstick::rsq),control =control_grid(save_pred =TRUE))best_xgb <-select_best(xgb_tuned, metric ="rmse")final_xgb_workflow <-finalize_workflow( xgb_workflow, best_xgb)final_xgb_fit <-fit( final_xgb_workflow,data = xgb_train)xgb_pred <-predict(final_xgb_fit, xgb_test) %>%pull(.pred)xgb_cv_results <-collect_predictions(xgb_tuned) %>%inner_join(best_xgb, by =names(best_xgb)) %>%group_by(id) %>%summarise(RMSE =sqrt(mean((live_coral_change - .pred)^2)),MSE =mean((live_coral_change - .pred)^2),MAE =mean(abs(live_coral_change - .pred)),R_squared =1-sum((live_coral_change - .pred)^2) /sum((live_coral_change -mean(live_coral_change))^2),.groups ="drop" ) %>%rename(Resample = id)xgb_result <-summarise_cv_results( xgb_cv_results,"XGBoost","Medium","Tuned boosted trees evaluated with 5-fold CV")``````{r}#| message: false#| warning: false#| label: extra-trees#| code-summary: Code (Extra Trees)# Extra Trees using repeated 5-fold CVextra_trees_recipe <-recipe(live_coral_change ~ ., data = xgb_train) %>%step_zv(all_predictors())extra_trees_model <-rand_forest(trees =500,mtry =tune(),min_n =tune()) %>%set_engine("ranger", splitrule ="extratrees", importance ="impurity") %>%set_mode("regression")extra_trees_workflow <-workflow() %>%add_recipe(extra_trees_recipe) %>%add_model(extra_trees_model)extra_trees_grid <-expand_grid(mtry =c(2, 4, 7),min_n =c(2, 5, 10))extra_trees_tuned <-tune_grid( extra_trees_workflow,resamples = xgb_folds,grid = extra_trees_grid,metrics = yardstick::metric_set( yardstick::rmse, yardstick::mae, yardstick::rsq),control =control_grid(save_pred =TRUE))best_extra_trees <-select_best(extra_trees_tuned, metric ="rmse")final_extra_trees_workflow <-finalize_workflow( extra_trees_workflow, best_extra_trees)final_extra_trees_fit <-fit( final_extra_trees_workflow,data = xgb_train)extra_trees_pred <-predict(final_extra_trees_fit, xgb_test) %>%pull(.pred)extra_trees_cv_results <-collect_predictions(extra_trees_tuned) %>%inner_join(best_extra_trees, by =names(best_extra_trees)) %>%group_by(id) %>%summarise(RMSE =sqrt(mean((live_coral_change - .pred)^2)),MSE =mean((live_coral_change - .pred)^2),MAE =mean(abs(live_coral_change - .pred)),R_squared =1-sum((live_coral_change - .pred)^2) /sum((live_coral_change -mean(live_coral_change))^2),.groups ="drop" ) %>%rename(Resample = id)extra_trees_result <-summarise_cv_results( extra_trees_cv_results,"Extra Trees","Medium","Extremely randomized trees evaluated with 5-fold CV")```Repeated 5-fold cross-validation (CV) with 50 repeats was applied to evaluate model performance more reliably, with performance metrics including Root Mean Square Error (RMSE) and R^2^ values calculated for each fold before averaging the results for comparison, summarised in @tbl-model-comparison and @fig-boxplot-5-fold-CV. Based on the overall model performance, Random Forest emerged as the strongest candidate model, achieving the lowest mean RMSE of `{r} round(rf_result$RMSE, 3)` in @fig-boxplot-5-fold-CV-1 and the highest mean R^2^ value of `{r} round(rf_result$R_squared, 3)` in @fig-boxplot-5-fold-CV-2. This indicates that the model was able to explain approximately `{r} round(rf_result$R_squared*100, 0)`% of the variance in coral cover change while also producing the lowest prediction error among all evaluated models.Extra Trees also demonstrated relatively strong performance with a mean RMSE of `{r} round(extra_trees_result$RMSE, 3)` and an R^2^ value of `{r} round(extra_trees_result$R_squared, 3)`, while XGBoost and KNN Regression showed moderate predictive ability. In contrast, Linear Regression produced the highest RMSE and the lowest R^2^ value, suggesting that simpler linear relationships were less capable of capturing the complexity of environmental influences on coral cover change.The repeated cross-validation boxplots further showed that Random Forest maintained a comparatively stable spread in both RMSE and R^2^ values across folds compared with several other models, particularly Linear Regression and KNN Regression. This suggests that the Random Forest model produced more consistent predictive performance and generalised more reliably across different validation subsets. Therefore, Random Forest was selected as the final model for this project.```{r}#| message: false#| warning: false#| label: tbl-model-comparison#| tbl-cap: Model Performance Comparison Based on repeated 5-fold Cross-validation#| code-summary: Code (Combined Model Comparison)# Combine model comparisonmodel_comparison <-bind_rows( lm_result, knn_result, rf_result, xgb_result, extra_trees_result) %>% dplyr::select(Model, RMSE, MSE, MAE, R_squared, Interpretability, Notes) %>%arrange(RMSE)kable( model_comparison,digits =3,col.names =c("Model","RMSE","MSE","MAE","R-squared","Interpretability","Notes" ),align =c("l", "r", "r", "r", "r", "l", "l"),caption ="")# Prepare fold-level CV results for boxplotscv_plot_data <-bind_rows( lm_cv_results %>%mutate(Model ="Linear Regression"), knn_cv_results %>%mutate(Model =paste0("KNN Regression (k = ", best_k, ")")), rf_cv_results %>%mutate(Model ="Random Forest"), xgb_cv_results %>%mutate(Model ="XGBoost"), extra_trees_cv_results %>%mutate(Model ="Extra Trees"))``````{r}#| message: false#| warning: false#| label: fig-boxplot-5-fold-CV#| code-summary: Code (Boxplot 5-fold CV performance)#| fig-cap: "Boxplot comparisons of our chosen candidate classification models using repeated 5-fold cross-validation."#| fig-subcap:#| - "Root Mean Square Error (RMSE). Lower RMSE values indicate better predictive performance."#| - "Coefficient of determination R-squared. Higher R-squared values indicate better model fit and explanatory performance."#| layout-ncol: 2# Boxplot RMSE comparison using repeated 5-fold CV resultscv_plot_data %>%mutate(Model =reorder(Model, RMSE, median)) %>%ggplot(aes(x = Model, y = RMSE)) +geom_boxplot(width =0.6,fill ="grey85",color ="black",outlier.shape =NA ) +coord_flip() +labs(title ="Repeated 5-fold CV RMSE Comparison",x ="Model",y ="CV RMSE" ) +theme_minimal(base_size =13) +theme(plot.title =element_text(face ="bold", size =15),axis.title =element_text(face ="bold"),panel.grid.major.y =element_blank(),panel.grid.minor =element_blank() )# Boxplot R-squared comparison using repeated 5-fold CV resultscv_plot_data %>%mutate(Model =reorder(Model, R_squared, median)) %>%ggplot(aes(x = Model, y = R_squared)) +geom_boxplot(width =0.6,fill ="grey85",color ="black",outlier.shape =NA ) +coord_flip() +labs(title ="R-squared Comparison using Repeated 5-fold CV",x ="Model",y ="CV R-squared" ) +theme_minimal(base_size =13) +theme(plot.title =element_text(face ="bold", size =15),axis.title =element_text(face ="bold"),panel.grid.major.y =element_blank(),panel.grid.minor =element_blank() )```To further interpret the Random Forest model, a SHAP (Shapley) beeswarm plot was generated to explore the contribution of each predictor variable to predicting coral cover change (@fig-SHAP-plot). The sign of SHAP values refers to the direction of model contribution, indicating whether the predictor variable increases or decreases predicted coral cover change relative to the baseline prediction, while the SHAP magnitude indicates the strength of contribution [@Lundberg2020]. The colour gradient represents the feature values of the predictor variables. From @fig-SHAP-plot, higher values of Mean_hs generally contributed towards more positive predicted coral cover changes, while higher values of Duration_hours contributed towards greater coral loss. The SHAP values also illustrated that the influence of these variables varied across observations, suggesting complex non-linear relationships captured by the model.```{r}#| label: fig-SHAP-plot#| message: false#| warning: false#| code-summary: Code (SHAP Beeswarm Plot)#| fig-cap: "SHAP (Shapley) beeswarm plot for the Random Forest model showing the contribution and direction of predictor variables on coral cover change predictions. Features with larger absolute SHAP values have greater influence on model output. Positive SHAP values indicate an increase in predicted coral cover change, while negative SHAP values indicate a decrease."# Random Forest SHAP beeswarm plotX_train <- train_data %>% dplyr::select(-live_coral_change) %>%as.data.frame()X_test <- test_data %>% dplyr::select(-live_coral_change) %>%as.data.frame()pred_fun <-function(object, newdata) {predict(object, newdata =as.data.frame(newdata))}set.seed(3888)shap_values <- fastshap::explain(object = rf_model,X = X_train,pred_wrapper = pred_fun,newdata = X_test,nsim =100,adjust =TRUE)shap_object <- shapviz::shapviz(as.matrix(shap_values),X = X_test)shapviz::sv_importance( shap_object,kind ="bee") + ggplot2::scale_colour_viridis_c(option ="viridis",direction =1,name ="Feature value" ) + ggplot2::ggtitle("Random Forest SHAP Beeswarm Plot")```## Product Deployment```{r}#| label: fig-shiny#| echo: false#| message: false#| warning: false#| code-summary: Code (Screenshots from Shiny exhibit)#| fig-cap: Shiny#| fig-subcap: Explore Past Cyclones tab.,Design Your Own Cyclone tab.#| layout-nrow: 2knitr::include_graphics("www/shiny-1.png")knitr::include_graphics("www/shiny-2.png")```# Discussionffff# Conclusionconclusions# References::: {#refs}:::# Appendix## Merging Datasets {#sec-merging-datasets .appendix}All netCDF wave files were combined into a single dataset, with each wave point linked to its corresponding reef. This produced the final wave dataset, **“combined_wave_data.csv”**, which included significant wave height, peak wave frequency, wave direction, and related variables.The wave dataset (**“combined_wave_data.csv”**) was then merged with the IBTrACS cyclone dataset (**“ibtracs.since1980.list.v04r01.csv”**) by matching observations on date, hour, and minute. The spatial distance between reef wave points and cyclone locations was calculated using the Haversine formula and stored as a new variable. Only cyclone observations within 1000 km of reef locations were retained, resulting in the final combined dataset, **“combined_wave_cyclone.csv”**.The dataset (**“combined_wave_cyclone.csv”**) was then merged with the AIMS manta tow coral cover dataset (**“capricornbunkermantatowdata.csv”**) for use in the final modelling analysis, producing **“merged_coral_cover.csv”**. The datasets were joined using **Reef_ID** as the common key, matching manta tow survey records to each cyclone–reef observation in the wave–cyclone dataset.Additional variables were then derived for analysis, including mean and maximum wave height, the number of hours during which wave height exceeded thresholds of 2.5 m, 3 m, and 4 m, wave direction, wave period, and duration of cyclone impact. Further calculations included minimum distance between cyclone centre and reef, as well as mean and maximum wave energy and wave power.## Final Variables and their Descriptions {#sec-final-vars .appendix}| Variable Name | Description ||--------------------|----------------------------------------------------|| **Mean_hs** | Mean significant wave height (m) || **Intervals_hs_gt3** | Hours of waves \> 3 metres significant wave height (hours) || **Mean_fp** | Mean peak frequency (Hz) || **Mean_dir** | Mean wave direction from North (°) || **Min_distance_km** | Minimum distance between cyclone and reef (km) || **Duration_hours** | Duration of time cyclone was in 1000 km radius of reef (hours) || **Mean_P** | Mean wave power (kW/m) |: Final ccewdfeio