Modelling the Impacts of Cyclone-Induced Waves on Coral Cover in the Capricorn Bunker, Great Barrier Reef

Author

Reef 1

Published

May 30, 2026

Code (Loading Packages)

library(tidyverse)
library(lubridate)
library(geosphere)
library(caret)
library(MASS)
library(randomForest)
library(FNN)
library(broom)
library(knitr)
library(tidymodels)
library(xgboost)
library(ranger)
library(fastshap)
library(shapviz)
library(patchwork)

1 Introduction

1.1 Project Aims

This project aims to use data modelling to explore the immediate impacts of cyclones on reef condition, specifically coral cover. In turn, we aim to examine which indicators best predict the impacts of cyclone conditions on immediate physical reef damage. Additionally, we use our model to develop an interactive platform through which to communicate scientific modelling results to general audiences, thereby increasing the accessibility of our results. Our research questions were:

a) How do wave conditions during tropical cyclones impact coral cover in Capricorn Bunker, Great Barrier Reef; and

b) How can these changes be communicated as an accessible educational tool?

1.2 Background

2 Methods

2.1 Data Collection

To model and examine the impacts of cyclones on the CBG, we extract variables from multiple datasets, using wave climates and cyclone windows as predictors and coral cover as response variables.

2.2 Study Site

The Capricorn-Bunker Group (CBG) is a well-studied region of the GBR that contains a total of 17 islands, approximately 10 km west of the edge of the continental shelf in the southernmost sector of the GBR (Jell and Flood 1978). The region experiences relatively fewer, but more temporally clustered cyclones relative to the central GBR (Wolff et al. 2018). Notably, Southern reefs have been less impacted by marine heatwaves and bleaching events compared to other regions of the GBR (Hughes et al., 2017), and the CBG has had no severe outbreaks of crown-of-thorns starfish since monitoring started (AIMS 2025). Thus, the CBG provides a robust case study for isolating cyclone impacts from other confounding environmental stressors.

The study consisted of 10 reefs in the Capricorn-Bunker Group (CBG) (see Figure 1). Coral cover has been monitored annually-biannually at these reefs since 2006 (for 4 reefs, monitoring started in 1992) under the Australian Institute of Marine Science Long-Term Monitoring Program (AIMS 2025). The LTMP uses manta tow surveys, where a diver towed at a constant speed behind a boat records percentage estimations of coral cover around the entire slope perimeter. As cyclone damage is generally concentrated in the orientation of incoming waves (Vila-Concejo and Kench 2017), manta tow data provides a robust indicator of reef-wide cyclone impacts that may be heterogeneously distributed.

Figure 1: Study site: 10 reefs in the Capricorn-Bunker Group, Southern Great Barrier Reef. a) Study context, the white circle represents the 1000km cyclone radius used to constrain cyclone tracks. b) Higher resolution view of the 10 reefs in this study: North Reef, Broomfield reef, Wreck Island Reef, One Tree Island Reef, Erskine Reef, Mast Head Reef, Boult Reef, Hoskyn Islands Reef, Fairfax Islands Reef, Lady Musgrave Reef. Orange markers indicate the island measured under the Australian Institute of Marine Science Long-Term Monitoring Program. Blue markers indicate the gridded locations used to collect wave hindcast data from the Centre for Australian Weather and Climate Wave Hindcast Model. Images from: AIMS (2025), Google Earth (2026).

2.3 Datasets

2.3.1 Wave Data: CAWCR Wave Hindcast Model

The Centre for Australian Weather and Climate Research (CAWCR) Wave Hindcast Model is a global model based on forcing conditions derived from surface winds and sea ice concentrations in the Climate Forecast System Reanalysis. It has a 0.4 x 0.4 ° global resolution (including a series of smaller, nested grids in Australia), and produces hourly wave hindcast outputs. Wave data was collected from unique grid points adjacent to the 10 reefs monitored in the AIMS LTMP in NetCDF format (see figure 1). Hourly data from 1992 to present was extracted for all 10 reefs, and merged into a single dataset. Each grid point was linked to its corresponding reef name, to generate a combined dataset accounting for hourly wave conditions at all 10 study sites since 1992.

Variables retained for analysis were significant wave height, H_s (in meters), peak wave frequency, Fp ( Hz, discretised from 0.038 to 0.5 Hz), and direction, Dir (° from North, discretised into 24 intervals of 15°). From these variables, wave energy, \(\textbf{E}\) (J/m²) and wave power, \(\textbf{P}\) (kW/m) were calculated to account for the cumulative impacts of height and speed. Deep-water Linear Wave Theory equations were used, as depths at each wave data point ranged from 8-60 m (Google Earth 2026). The following calculations were made using Microsoft Excel:

Wave Energy, \(\textbf{E}\)
\[ \mathbf{E}= \frac{1}{16} \, \rho \, g \, \mathbf{H}_{\mathrm{s}}^{2}\]

Where:
- \(\rho\) (seawater density) was 1025 kg/m³ in line with mean seawater density (Karnauskas 2020);
- \(g\) (gravitational acceleration) was 9.81 m/s² ; and
- the coefficient was reduced to \(\frac{1}{16}\) (Salles, personal communication), as significant wave height values are derived from a spectrum of many irregular sea waves, rather than a single idealised wave height.
Wave Power, \(\textbf{P}\)

\[ \textbf{P} = \textbf{E}\,C\,n \]

Where:
- \(n=\frac{1}{2}\) to reflect deep water equation parameters;
- celerity: \(C=1.56\,T_p\) ; and
- peak period \(T_p\) is the inverse of peak wave frequency \(F_p\) by: \(T_p =\frac{1}{F_p}\)

2.3.2 Cyclone Tracks

Cyclone tracks were used to identify where cyclones entered a 1000km radius of each study reef. While the spatial extent of TC gales is 300-500 km (Grossmann-Matheson et al. 2024), propagated waves can travel up to 20,000 km from their source (Hoeke et al. 2013). Recently, waves of 4.4 m height propagated during TC Justin were recorded 630 km away (Smith, Vila-Concejo, and Salles 2023). Thus, we chose the radius of 1000 km as a conservative estimate, to reduce confounding effects of regular sea-states conditions whilst still accounting for potential spatial and temporal lags in wave propagation.

Cyclone track data was accessed from International Best Track Archive for Climate Stewardship (IBTrACS, v04r01). For each TC, the dataset provides the storm location (latitude and longitude coordinates), and characteristics such as storm category and wind speed, recorded at 3 hour intervals. Cyclone data was merged with wave data by matching the same date, hour, and minute between datasets, filtering observations from 1992 onwards. The spatial distance between gridded reef wave points and cyclone locations was calculated using the Haversine formula (Robusto 1957), and only cyclone observations within 1000km of the reefs were kept. Thus, the combined wave and cyclone data contained reef-specific, three-hourly wave climates for the duration of time cyclones were within 1000km of the study sites.

2.3.3 Coral Cover

To quantify the impacts of wave climates on coral cover, the combined wave and cyclone dataset was merged with AIMS LTMP coral cover observations according to individual cyclone conditions. Datasets were merged using reef name as the common key, matching the surveys in the manta tow dataset for every cyclone reef row in the combined wave cyclone dataset. New columns were created to calculate descriptive wave climate characteristics across the duration of each observed cyclone for each specific reef. The following variables were retained in the final dataset:

2.3.4 Final Merged Dataset

Finally, live and dead coral cover data were extracted according to the nearest monitoring dates identified before and after each observed cyclone. Thus, the merged dataset is cyclone-frequency dependent, and includes both reef-specific wave conditions, and temporally related coral cover observations. The final merged dataset includes 152 observations across 10 reefs and 17 cyclones, and a combination of numeric, categorial and temporal variables related to wave conditions, cyclone exposure, and coral cover. (For detailed methods see Appendix Section 7.1).

2.4 Modelling

Data cleaning and preprocessing were performed during the dataset merging process by removing any observations with missing longitude or latitude values, filtering those from 1992 onwards, and keeping only cyclones within 1000km of reef locations. After completing all merging and filtering steps, the final dataset contained 152 observations across 10 reefs and 17 cyclones, with a combination of numeric, categorical, and temporal variables related to wave conditions, cyclone exposure, and coral cover. The final dataset contained no missing values. The primary response variable of interest was the change in live coral cover, calculated using the difference between pre- and post-cyclone coral measurements.

Code (Reading in Data)

data <- read_csv("merging datasets/merged_cyclone_coral_v4.csv", show_col_types = FALSE)

all_vars <- c(
  "Mean_hs",
  "Max_hs",
  "Intervals_hs_gt4",
  "Intervals_hs_gt3",
  "Intervals_hs_gt25",
  "Mean_fp",
  "Max_fp",
  "Mean_dir",
  "Std_dir",
  "Min_distance_km",
  "Duration_hours",
  "Mean_E",
  "Max_E",
  "Mean_P",
  "Max_P"
)
all_data <- data %>%
  dplyr::select(all_of(all_vars)) %>%
  mutate(across(everything(), as.numeric))

selected_vars <- c(
  "Intervals_hs_gt3",
  "Mean_hs",
  "Mean_P",
  "Mean_fp",
  "Mean_dir",
  "Min_distance_km",
  "Duration_hours"
)

cyclone_exposure <- data %>%
  dplyr::select(all_of(selected_vars))

Multicollinearity between predictor variables was also checked using correlation analysis, where several wave-related variables showed strong correlations, such as mean and maximum significant wave height as seen in Figure 2. Final variable selection, while considering multicollinearity, included mean significant wave height, hours with significant wave height exceeding 3m, mean power, mean peak wave frequency, mean wave direction, minimum distance from cyclone to reef, and cyclone duration hours. In particular, mean significant wave height and mean wave power were highly correlated since wave power is partially derived from wave height. However, both variables were retained due to their environmental importance and potential contribution to predicting coral cover change (Ferrario et al. 2014).

Code (Correlation Heatmap)

correlation_matrix <- cor(
  all_data,
  use = "pairwise.complete.obs",
  method = "pearson"
)

# Keep only lower triangle
correlation_matrix[upper.tri(correlation_matrix)] <- NA

correlation_long <- as.data.frame(as.table(correlation_matrix)) %>%
  as_tibble() %>%
  rename(
    variable_y = Var1,
    variable_x = Var2,
    correlation = Freq
  ) %>%
  filter(!is.na(correlation)) %>%
  mutate(
    variable_x = factor(variable_x, levels = all_vars),
    variable_y = factor(variable_y, levels = rev(all_vars)),
    correlation_label = sprintf("%.2f", correlation)
  )

ggplot(correlation_long, aes(x = variable_x, y = variable_y, fill = correlation)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = correlation_label), size = 3) +
  scale_fill_gradient2(
    low = "#2166AC",
    mid = "white",
    high = "#B2182B",
    midpoint = 0,
    limits = c(-1, 1),
    name = "Pearson\ncorrelation"
  ) +
  coord_fixed() +
  labs(
    title = "Correlation Heatmap of Cyclone Exposure Variables",
    x = NULL,
    y = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 15),
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
    axis.text.y = element_text(hjust = 1),
    panel.grid = element_blank()
  )

Figure 2: Pearson correlation heatmap of cyclone exposure variables used in the modelling process. Red colours indicate positive correlations, while blue colours indicate negative correlations. Darker red cells indicate stronger positive correlations and highlight potential multicollinearity among several wave height and energy-related predictors.

2.5 Model Assumptions

Several modelling assumptions were considered throughout the analysis. One key assumption for all models was data independence. However, this assumption is inherently violated to some extent because geographically proximate reefs may share similar environmental conditions and cyclone exposure. Despite this spatial dependence being unavoidable in ecological datasets of this kind, each reef–cyclone observation was treated as independent for modelling purposes.

Also, the training and testing datasets were assumed to come from similar distributions, with the training data being representative of broader reef and cyclone exposure conditions. In particular, for linear regression, additional assumptions included linearity between predictor variables and coral cover change, homoscedasticity (constant variance), and normality of residuals, along with multicollinearity between variables checked before modelling.

3 Results

3.1 Comparison of Models

Code (Setup 5-fold CV)

# Create target variable
model_data <- data %>%
  mutate(
    live_coral_change = Post_MEAN_LIVE_CORAL - Pre_MEAN_LIVE_CORAL
  )

selected_vars <- selected_vars[selected_vars %in% names(model_data)]

# Prepare modelling data
model_data <- model_data %>%
  dplyr::select(live_coral_change, all_of(selected_vars)) %>%
  drop_na()

# Train-test split
set.seed(3888)

train_index <- sample(1:nrow(model_data), size = 0.75 * nrow(model_data))

train_data <- model_data[train_index, ]
test_data  <- model_data[-train_index, ]

# Evaluation function
evaluate_model <- function(actual, predicted) {
  tibble(
    RMSE = sqrt(mean((actual - predicted)^2)),
    MSE = mean((actual - predicted)^2),
    MAE = mean(abs(actual - predicted)),
    R_squared = 1 - sum((actual - predicted)^2) /
      sum((actual - mean(actual))^2)
  )
}

# Use the repeated 5-fold CV setting for all models
set.seed(3888)
cv_control <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 50,
  savePredictions = "final"
)

set.seed(3888)
shared_folds <- vfold_cv(train_data, v = 5)

summarise_cv_results <- function(cv_result, model_name, interpretability, notes) {
  cv_result %>%
    summarise(
      RMSE = mean(RMSE),
      MSE = mean(MSE),
      MAE = mean(MAE),
      R_squared = mean(R_squared),
      .groups = "drop"
    ) %>%
    mutate(
      Model = model_name,
      Interpretability = interpretability,
      Notes = notes
    )
}

Code (Linear Regression)

# Linear Regression with backward selection using repeated 5-fold CV
set.seed(3888)

lm_cv_model <- caret::train(
  live_coral_change ~ .,
  data = train_data,
  method = "lmStepAIC",
  trControl = cv_control,
  metric = "RMSE",
  trace = FALSE
)

lm_cv_results <- lm_cv_model$pred %>%
  group_by(Resample) %>%
  summarise(
    RMSE = sqrt(mean((obs - pred)^2)),
    MSE = mean((obs - pred)^2),
    MAE = mean(abs(obs - pred)),
    R_squared = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2),
    .groups = "drop"
  )

lm_result <- summarise_cv_results(
  lm_cv_results,
  "Linear Regression",
  "High",
  "Backward selection evaluated with 5-fold CV"
)

Code (k-NN Regression)

# KNN Regression using repeated 5-fold CV
set.seed(3888)

knn_grid <- expand.grid(k = c(3, 5, 7, 9, 11, 13, 15))

knn_cv_model <- caret::train(
  live_coral_change ~ .,
  data = train_data,
  method = "knn",
  trControl = cv_control,
  tuneGrid = knn_grid,
  metric = "RMSE",
  preProcess = c("center", "scale")
)

best_k <- knn_cv_model$bestTune$k

knn_cv_results <- knn_cv_model$pred %>%
  filter(k == best_k) %>%
  group_by(Resample) %>%
  summarise(
    RMSE = sqrt(mean((obs - pred)^2)),
    MSE = mean((obs - pred)^2),
    MAE = mean(abs(obs - pred)),
    R_squared = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2),
    .groups = "drop"
  )

knn_result <- summarise_cv_results(
  knn_cv_results,
  paste0("KNN Regression (k = ", best_k, ")"),
  "Low-Medium",
  "Tuned k and evaluated with 5-fold CV"
)

Code (Random Forest)

set.seed(3888)

rf_grid <- expand.grid(
  mtry = c(2, 3, 4, 5),
  splitrule = "variance",
  min.node.size = c(2, 5, 10)
)

tree_values <- c(300, 500, 700)

rf_models <- list()
rf_results <- data.frame()

for (ntree_val in tree_values) {

  rf_temp <- caret::train(
    live_coral_change ~ .,
    data = train_data,
    method = "ranger",
    trControl = cv_control,
    tuneGrid = rf_grid,
    metric = "RMSE",
    num.trees = ntree_val,
    importance = "impurity"
  )

  temp_results <- rf_temp$results

  temp_results$num.trees <- ntree_val

  rf_results <- rbind(rf_results, temp_results)

  rf_models[[paste0("trees_", ntree_val)]] <- rf_temp
}

best_rf <- rf_results %>%
  arrange(RMSE) %>%
  slice(1)

best_mtry <- best_rf$mtry
best_min_node <- best_rf$min.node.size
best_num_trees <- best_rf$num.trees

best_rf_model <- rf_models[[paste0("trees_", best_num_trees)]]

rf_cv_results <- best_rf_model$pred %>%
  filter(mtry == best_mtry) %>%
  group_by(Resample) %>%
  summarise(
    RMSE = sqrt(mean((obs - pred)^2)),
    MSE = mean((obs - pred)^2),
    MAE = mean(abs(obs - pred)),
    R_squared = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2),
    .groups = "drop"
  )

rf_result <- summarise_cv_results(
  rf_cv_results,
  "Random Forest",
  "Medium",
  "Tuned mtry and evaluated with 5-fold CV"
)

# Final Random Forest fitted on the training data for SHAP interpretation
rf_model <- randomForest(
  live_coral_change ~ .,
  data = train_data,
  ntree = 500,
  mtry = best_mtry,
  importance = TRUE
)

Code (XGBoost)

# XGBoost using repeated 5-fold CV
set.seed(3888)

xgb_train <- train_data
xgb_test <- test_data
xgb_folds <- shared_folds

xgb_recipe <- recipe(live_coral_change ~ ., data = xgb_train) %>%
  step_zv(all_predictors())

xgb_model <- boost_tree(
  trees = 300,
  tree_depth = tune(),
  learn_rate = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

xgb_grid <- expand_grid(
  tree_depth = c(1, 2, 3),
  learn_rate = c(0.01, 0.03, 0.05),
  loss_reduction = c(0, 0.01),
  sample_size = c(0.7, 0.9),
  mtry = c(3, 5, 7),
  min_n = c(5, 10)
)

xgb_workflow <- workflow() %>%
  add_recipe(xgb_recipe) %>%
  add_model(xgb_model)

xgb_tuned <- tune_grid(
  xgb_workflow,
  resamples = xgb_folds,
  grid = xgb_grid,
  metrics = yardstick::metric_set(
  yardstick::rmse,
  yardstick::mae,
  yardstick::rsq
),
  control = control_grid(save_pred = TRUE)
)

best_xgb <- select_best(xgb_tuned, metric = "rmse")

final_xgb_workflow <- finalize_workflow(
  xgb_workflow,
  best_xgb
)

final_xgb_fit <- fit(
  final_xgb_workflow,
  data = xgb_train
)

xgb_pred <- predict(final_xgb_fit, xgb_test) %>%
  pull(.pred)

xgb_cv_results <- collect_predictions(xgb_tuned) %>%
  inner_join(best_xgb, by = names(best_xgb)) %>%
  group_by(id) %>%
  summarise(
    RMSE = sqrt(mean((live_coral_change - .pred)^2)),
    MSE = mean((live_coral_change - .pred)^2),
    MAE = mean(abs(live_coral_change - .pred)),
    R_squared = 1 - sum((live_coral_change - .pred)^2) /
      sum((live_coral_change - mean(live_coral_change))^2),
    .groups = "drop"
  ) %>%
  rename(Resample = id)

xgb_result <- summarise_cv_results(
  xgb_cv_results,
  "XGBoost",
  "Medium",
  "Tuned boosted trees evaluated with 5-fold CV"
)

Code (Extra Trees)

# Extra Trees using repeated 5-fold CV
extra_trees_recipe <- recipe(live_coral_change ~ ., data = xgb_train) %>%
  step_zv(all_predictors())

extra_trees_model <- rand_forest(
  trees = 500,
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("ranger", splitrule = "extratrees", importance = "impurity") %>%
  set_mode("regression")

extra_trees_workflow <- workflow() %>%
  add_recipe(extra_trees_recipe) %>%
  add_model(extra_trees_model)

extra_trees_grid <- expand_grid(
  mtry = c(2, 4, 7),
  min_n = c(2, 5, 10)
)

extra_trees_tuned <- tune_grid(
  extra_trees_workflow,
  resamples = xgb_folds,
  grid = extra_trees_grid,
  metrics = yardstick::metric_set(
  yardstick::rmse,
  yardstick::mae,
  yardstick::rsq
),
  control = control_grid(save_pred = TRUE)
)

best_extra_trees <- select_best(extra_trees_tuned, metric = "rmse")

final_extra_trees_workflow <- finalize_workflow(
  extra_trees_workflow,
  best_extra_trees
)

final_extra_trees_fit <- fit(
  final_extra_trees_workflow,
  data = xgb_train
)

extra_trees_pred <- predict(final_extra_trees_fit, xgb_test) %>%
  pull(.pred)

extra_trees_cv_results <- collect_predictions(extra_trees_tuned) %>%
  inner_join(best_extra_trees, by = names(best_extra_trees)) %>%
  group_by(id) %>%
  summarise(
    RMSE = sqrt(mean((live_coral_change - .pred)^2)),
    MSE = mean((live_coral_change - .pred)^2),
    MAE = mean(abs(live_coral_change - .pred)),
    R_squared = 1 - sum((live_coral_change - .pred)^2) /
      sum((live_coral_change - mean(live_coral_change))^2),
    .groups = "drop"
  ) %>%
  rename(Resample = id)

extra_trees_result <- summarise_cv_results(
  extra_trees_cv_results,
  "Extra Trees",
  "Medium",
  "Extremely randomized trees evaluated with 5-fold CV"
)

Repeated 5-fold cross-validation (CV) with 50 repeats was applied to evaluate model performance more reliably, with performance metrics including Root Mean Square Error (RMSE) and R² values calculated for each fold before averaging the results for comparison, summarised in Table 1 and Figure 3. Based on the overall model performance, Random Forest emerged as the strongest candidate model, achieving the lowest mean RMSE of 9.517 in Figure 3 (a) and the highest mean R² value of 0.371 in Figure 3 (b). This indicates that the model was able to explain approximately 37% of the variance in coral cover change while also producing the lowest prediction error among all evaluated models.

Extra Trees also demonstrated relatively strong performance with a mean RMSE of 9.686 and an R² value of 0.332, while XGBoost and KNN Regression showed moderate predictive ability. In contrast, Linear Regression produced the highest RMSE and the lowest R² value, suggesting that simpler linear relationships were less capable of capturing the complexity of environmental influences on coral cover change.

The repeated cross-validation boxplots further showed that Random Forest maintained a comparatively stable spread in both RMSE and R² values across folds compared with several other models, particularly Linear Regression and KNN Regression. This suggests that the Random Forest model produced more consistent predictive performance and generalised more reliably across different validation subsets. Therefore, Random Forest was selected as the final model for this project.

Code

#| code-summary: Code (Combined Model Comparison)
# Combine model comparison
model_comparison <- bind_rows(
  lm_result,
  knn_result,
  rf_result,
  xgb_result,
  extra_trees_result
) %>%
  dplyr::select(Model, RMSE, MSE, MAE, R_squared, Interpretability, Notes) %>%
  arrange(RMSE)

kable(
  model_comparison,
  digits = 3,
  col.names = c(
    "Model",
    "RMSE",
    "MSE",
    "MAE",
    "R-squared",
    "Interpretability",
    "Notes"
  ),
  align = c("l", "r", "r", "r", "r", "l", "l"),
  caption = ""
)
# Prepare fold-level CV results for boxplots
cv_plot_data <- bind_rows(
  lm_cv_results %>% mutate(Model = "Linear Regression"),
  knn_cv_results %>% mutate(Model = paste0("KNN Regression (k = ", best_k, ")")),
  rf_cv_results %>% mutate(Model = "Random Forest"),
  xgb_cv_results %>% mutate(Model = "XGBoost"),
  extra_trees_cv_results %>% mutate(Model = "Extra Trees")
)

Table 1: Model Performance Comparison Based on repeated 5-fold Cross-validation

Model	RMSE	MSE	MAE	R-squared	Interpretability	Notes
Random Forest	9.517	92.201	7.673	0.371	Medium	Tuned mtry and evaluated with 5-fold CV
Extra Trees	9.686	96.801	7.691	0.332	Medium	Extremely randomized trees evaluated with 5-fold CV
XGBoost	10.005	103.343	7.832	0.285	Medium	Tuned boosted trees evaluated with 5-fold CV
KNN Regression (k = 5)	10.189	105.920	8.321	0.270	Low-Medium	Tuned k and evaluated with 5-fold CV
Linear Regression	11.055	124.952	8.833	0.183	High	Backward selection evaluated with 5-fold CV

Code (Boxplot 5-fold CV performance)

# Boxplot RMSE comparison using repeated 5-fold CV results
cv_plot_data %>%
  mutate(Model = reorder(Model, RMSE, median)) %>%
  ggplot(aes(x = Model, y = RMSE)) +
  geom_boxplot(
    width = 0.6,
    fill = "grey85",
    color = "black",
    outlier.shape = NA
  ) +
  coord_flip() +
  labs(
    title = "Repeated 5-fold CV RMSE Comparison",
    x = "Model",
    y = "CV RMSE"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 15),
    axis.title = element_text(face = "bold"),
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank()
  )

# Boxplot R-squared comparison using repeated 5-fold CV results
cv_plot_data %>%
  mutate(Model = reorder(Model, R_squared, median)) %>%
  ggplot(aes(x = Model, y = R_squared)) +
  geom_boxplot(
    width = 0.6,
    fill = "grey85",
    color = "black",
    outlier.shape = NA
  ) +
  coord_flip() +
  labs(
    title = "R-squared Comparison using Repeated 5-fold CV",
    x = "Model",
    y = "CV R-squared"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 15),
    axis.title = element_text(face = "bold"),
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank()
  )

(a) Root Mean Square Error (RMSE). Lower RMSE values indicate better predictive performance.

To further interpret the Random Forest model, a SHAP (Shapley) beeswarm plot was generated to explore the contribution of each predictor variable to predicting coral cover change (Figure 4). The sign of SHAP values refers to the direction of model contribution, indicating whether the predictor variable increases or decreases predicted coral cover change relative to the baseline prediction, while the SHAP magnitude indicates the strength of contribution (Lundberg et al. 2020). The colour gradient represents the feature values of the predictor variables. From Figure 4, higher values of Mean_hs generally contributed towards more positive predicted coral cover changes, while higher values of Duration_hours contributed towards greater coral loss. The SHAP values also illustrated that the influence of these variables varied across observations, suggesting complex non-linear relationships captured by the model.

Code (SHAP Beeswarm Plot)

# Random Forest SHAP beeswarm plot
X_train <- train_data %>%
  dplyr::select(-live_coral_change) %>%
  as.data.frame()

X_test <- test_data %>%
  dplyr::select(-live_coral_change) %>%
  as.data.frame()

pred_fun <- function(object, newdata) {
  predict(object, newdata = as.data.frame(newdata))
}

set.seed(3888)

shap_values <- fastshap::explain(
  object = rf_model,
  X = X_train,
  pred_wrapper = pred_fun,
  newdata = X_test,
  nsim = 100,
  adjust = TRUE
)

shap_object <- shapviz::shapviz(
  as.matrix(shap_values),
  X = X_test
)

shapviz::sv_importance(
  shap_object,
  kind = "bee"
) +
  ggplot2::scale_colour_viridis_c(
    option = "viridis",
    direction = 1,
    name = "Feature value"
  ) +
  ggplot2::ggtitle("Random Forest SHAP Beeswarm Plot")

Figure 4: SHAP (Shapley) beeswarm plot for the Random Forest model showing the contribution and direction of predictor variables on coral cover change predictions. Features with larger absolute SHAP values have greater influence on model output. Positive SHAP values indicate an increase in predicted coral cover change, while negative SHAP values indicate a decrease.

3.2 Product Deployment

(a) Explore Past Cyclones tab.,Design Your Own Cyclone tab.

4 Discussion

ffff

5 Conclusion

conclusions

6 References

AIMS. 2025. “AIMS Long-Term Monitoring Program: Crown-of-Thorns Starfish and Benthos Manta Tow Data (Great Barrier Reef).” https://doi.org/10.25845/5c09b0abf315a.

Ferrario, Filippo, Michael W. Beck, Curt D. Storlazzi, Fiorenza Micheli, Christine C. Shepard, and Laura Airoldi. 2014. “The Effectiveness of Coral Reefs for Coastal Hazard Risk Reduction and Adaptation.” Nature Communications 5 (1). https://doi.org/10.1038/ncomms4794.

Google Earth. 2026. “Gladstone and the Capricorn Bunker Reef Group, Great Barrier Reef: 23.5 s, 152.07 e.” Satellite map. https://earth.google.com/web/.

Grossmann-Matheson, Georg, Ian R. Young, Andrea Meucci, and Jose-Henrique Alves. 2024. “Global Tropical Cyclone Extreme Wave Height Climatology.” Scientific Reports 14 (1): 4167. https://doi.org/10.1038/s41598-024-54691-9.

Hoeke, Ron K., Kathleen L. McInnes, Jonathan C. Kruger, Robin J. McNaught, John R. Hunter, and Scott G. Smithers. 2013. “Widespread Inundation of Pacific Islands Triggered by Distant-Source Wind-Waves.” Global and Planetary Change 108: 128–38. https://doi.org/10.1016/j.gloplacha.2013.06.006.

Jell, J. S., and P. G. Flood. 1978. “Guide to the Geology of Reefs of the Capricorn and Bunker Groups, Great Barrier Reef Province, with Special Reference to Heron Reef.” Australasian Sedimentologists Group 8 (3): 1–85.

Karnauskas, Kristopher B. 2020. “Physical Diagnosis of the 2016 Great Barrier Reef Bleaching Event.” Geophysical Research Letters 47 (11). https://doi.org/10.1029/2019gl086177.

Lundberg, Scott M., Gabriel Erion, H. Chen, A. DeGrave, Jordan M. Prutkin, B. Nair, R. Katz, Jonathan Himmelfarb, N. Bansal, and Su-In Lee. 2020. “From Local Explanations to Global Understanding with Explainable AI for Trees.” Nature Machine Intelligence 2: 56–67. https://doi.org/10.1038/s42256-019-0138-9.

Robusto, Carl C. 1957. “The Cosine-Haversine Formula.” The American Mathematical Monthly 64 (1): 38–40. https://doi.org/10.2307/2309088.

Smith, C., Alexandra Vila-Concejo, and Tom Salles. 2023. “Offshore Wave Climate of the Great Barrier Reef.” Coral Reefs. https://doi.org/10.1007/s00338-023-02377-5.

Vila-Concejo, Alexandra, and Paul Kench. 2017. “Storms in Coral Reefs: Processes and Impacts.” In Coastal Storms: Processes and Impacts, 127–49. https://doi.org/10.1002/9781118937099.ch7.

Wolff, Nicholas H., Peter J. Mumby, Marji Devlin, and Ken R. N. Anthony. 2018. “Vulnerability of the Great Barrier Reef to Climate Change and Local Pressures.” Global Change Biology 24 (5): 1978–91. https://doi.org/10.1111/gcb.14043.

7 Appendix

7.1 Merging Datasets

All netCDF wave files were combined into a single dataset, with each wave point linked to its corresponding reef. This produced the final wave dataset, “combined_wave_data.csv”, which included significant wave height, peak wave frequency, wave direction, and related variables.

The wave dataset (“combined_wave_data.csv”) was then merged with the IBTrACS cyclone dataset (“ibtracs.since1980.list.v04r01.csv”) by matching observations on date, hour, and minute. The spatial distance between reef wave points and cyclone locations was calculated using the Haversine formula and stored as a new variable. Only cyclone observations within 1000 km of reef locations were retained, resulting in the final combined dataset, “combined_wave_cyclone.csv”.

The dataset (“combined_wave_cyclone.csv”) was then merged with the AIMS manta tow coral cover dataset (“capricornbunkermantatowdata.csv”) for use in the final modelling analysis, producing “merged_coral_cover.csv”. The datasets were joined using Reef_ID as the common key, matching manta tow survey records to each cyclone–reef observation in the wave–cyclone dataset.

Additional variables were then derived for analysis, including mean and maximum wave height, the number of hours during which wave height exceeded thresholds of 2.5 m, 3 m, and 4 m, wave direction, wave period, and duration of cyclone impact. Further calculations included minimum distance between cyclone centre and reef, as well as mean and maximum wave energy and wave power.

7.2 Final Variables and their Descriptions

Final ccewdfeio
Variable Name	Description
Mean_hs	Mean significant wave height (m)
Intervals_hs_gt3	Hours of waves > 3 metres significant wave height (hours)
Mean_fp	Mean peak frequency (Hz)
Mean_dir	Mean wave direction from North (°)
Min_distance_km	Minimum distance between cyclone and reef (km)
Duration_hours	Duration of time cyclone was in 1000 km radius of reef (hours)
Mean_P	Mean wave power (kW/m)