1. Background and Motivation

The ocean plays a critical role in our global climate conditions. One of the main threats to our oceanic environment is the rise in sea surface temperatures. This analysis aims to analyze sea surface temperature data from the Catalina region to forecast future temperatures and assess whether oceans are expected to continue experiencing increasing temperature and, if so, whether this will lead them to experience irreversible harm. In addition, through this project, I would like to verify if claims made in the media about the health of our oceans are valid by comparing the results with research on the impact of temperature changes on our marine ecosystems.

2. Data

The dataset contains sea surface temperature data collected in the Catalina region from July 2002 to March 2023 by the NOAA. In total, the original dataset contains 40,000,000 observations. The data includes the date-time of collection, latitude, longitude, altitude, and temperature measurements in degrees Celsius. This dataset will be used to analyze the region’s spatial and temporal sea surface temperature trends.

3. Project Objectives

3.1. Primary Questions

How has the sea surface temperature changed over time in the Catalina region? What factors might be influencing the observed trends in sea surface temperature? Can I forecast future sea surface temperatures in the region, and if so, what are the expected temperature changes? Based on the forecasted temperature changes, when is the Catalina region expected to reach the critical coral bleaching thresholds?

3.2. Preliminary Research

As time passes, we have heard more and more about the threats global warming poses to our planet Earth. More recently, you may have heard about rising ocean temperatures in addition to issues like the ozone layer and greenhouse gases. News outlets and activists have reported that increasing ocean temperatures affect our oceans by melting ice caps, increasing sea levels, causing oceanic heat waves and coral reef bleaching, plastic and chemical composition from human waste dumping, changing carbon levels, and ocean acidification. According to Lindsey and Dahlman, the oceans have absorbed more than 90% of the excess heat generated by human-induced global warming, with ocean heat gain rates varying between 0.64 and 0.80 Watts per square meter across different depths from 1993 to 2021 (Lindsey and Dahlman). Additionally, Nick Bradford of the National Environmental Education Foundation reports that the average global sea surface temperature has increased about 1.5°F since 1901, with an average rate of 0.13°F per decade (Bradford). This is a claimed rise of about 0.013°F annually. These changes can carry significant consequences for Earth’s climate system and marine ecosystems in the future.

Through my preliminary research, I had difficulty finding a set scale of when all or most marine life would end at a given sea temperature. Given the vast diversity of creatures, life for some is possible at certain temperatures, whereas others may not be as fortunate. As some might have guessed, sea temperature alone is not the sole culprit to end all marine life should our planet be doomed to ever-increasing temperatures. Many things can affect marine ecosystems, such as carbon content, acidity, and biodiversity population levels. Without a magical temperature that would kill off our ocean, I had to get creative with a scaleable measure that could still measure catastrophic effects. Thus, I shifted my focus to coral reef bleaching, as coral reefs are among Earth’s most biologically diverse and valuable ecosystems. The EPA states that an estimated 25 percent of all marine life, including over 4,000 species of fish, depends on coral reefs at some point in their life cycle.

In this analysis, I will use data collected by NOAA from the Catalina region as a representative sample of the entire ocean. NOAA’s Coral Reef Watch (CRW) daily global 5km satellite coral bleaching Degree Heating Week (DHW) product reveals accumulated heat stress that can lead to coral bleaching and death. Initially, I planned to implement the DHW scale in my project, which indicates significant coral bleaching typically occurs when the DHW value reaches 4°C-weeks, while an 8°C increase risks mass bleaching. However, due to the data being monthly, forecasting weekly data in a specific sequence would be very complex.

To address this challenge, I will use NOAA’s “bleaching threshold” as an alternative scale. The critical temperature of 29.5°C (85.1°F) is considered the bleaching threshold, and a temperature 1°C above the threshold, 30.5°C (86.9°F), serves as an indicator of when corals are at risk of experiencing heat stress and mass bleaching (“Coral Reef Watch”). By analyzing the Catalina region as a representative sample, I will assume that if, after forecasting as many years as necessary, there are consistent months where temperatures are above 86.9°F, coral reefs are at risk of bleaching and dying out in the Catalina region.

To implement this approach, I will perform a time series analysis by fitting autoregression (AR) models to the data, incorporating the bleaching threshold as a critical reference point. I will then use these models to forecast future sea surface temperatures in the Catalina region, estimating when the area might experience temperatures above the bleaching threshold. Additionally, I will analyze spatial trends and relationships in the data, considering the bleaching threshold as a key parameter. This analysis will help me determine when the Catalina region, as a representative sample, is at risk of coral bleaching due to rising sea surface temperatures.

3.3. Methods

Time Series Analysis: I will fit auto regression (AR) models to the time series data and use them to forecast future sea surface temperatures. Spatial Analysis: I will use spatial point pattern techniques to analyze the spatial distribution of sea surface temperature and identify areas with significant temperature differences.

4. Data Processing and Data Analysis

4.1 Data preprocessing:

First, I will clean and prepare the dataset for analysis, including aggregating temperature data by date and location, converting time stamps to a monthly scale, converting temperature to a Fahrenheit scale, removing duplicates and null values, and adjusting longitude measurements. In addition I would like to mention that for the purposes of this analysis and to save on computational power, some data preparation was preformed outside of this R Markdown file render. The original data file was almost 2 GB. Once NA values being removed along with other prep work, the data file cata_sst.csv was only around 1.1 GB saving an immense amount of time and power in the R environment.

Below is the code used to import the original data, and prepare it for our analysis for future records.

if (!require('data.table')) install.packages('data.table'); library('data.table')
if (!require('lubridate')) install.packages('lubridate'); library('lubridate')

# Read in data
cata_sst <- fread("erdMWsstdmday_5e5a_ad79_87d0.csv", header = FALSE)
colnames(cata_sst) <- c("time", "altitude", "latitude", "longitude", "sst")

# Remove Unneeded Column
cata_sst[, altitude := NULL]

# Remove NA Rows
cata_sst <- cata_sst[complete.cases(cata_sst$sst), ]

# Convert the "time" column from character to POSIXct format
cata_sst[, time := as.POSIXct(time, format = "%Y-%m-%d %H:%M:%S")]

# Convert the "time" column from POSIXct to Date format
cata_sst[, date := as.Date(time)]

# Remove Unneeded Column
cata_sst[, time := NULL]

# Correct longitude
cata_sst[, longitude := longitude - 360]

# Rename columns
setnames(cata_sst, "latitude", "lats")
setnames(cata_sst, "longitude", "lon")

# Check for duplicates to be sure before writing to csv
any(duplicated(cata_sst[, .(lats, lon, date)])) # False

# No duplicates. Convert to monthly
cata_sst[, month_year := format(as.Date(date), "%Y-%m")]

# Remove Unneeded Column
cata_sst[, date := NULL]

4.2 Data Prep

# Read in data
cata_sst <- fread("cata_sst.csv", header = TRUE)
colnames(cata_sst) <- c("lats", "lon", "sst", "date")
cata_sst$date <- as.Date(paste0(cata_sst$date, "-01"), format = "%Y-%m-%d")
setnames(cata_sst, "month_year", "date", skip_absent = TRUE)

# Convert to Fahrenheit
cata_sst$sst <- cata_sst$sst * 9/5 + 32

4.3 EDA:

I will conduct exploratory data analysis to visualize and summarize the data, identify trends, and detect any outliers or inconsistencies. This will include a shiny app tool to cycle for a visual EDA between months. Code or a link will be provided to the professor along with data files if use is needed. I will also utilize statistical methods for EDA, as spatial data spanning many years will require efficient ways of analyzing such vast amounts of data.

4.3.1 Summary Statistics

# Summary Statistics
summary(cata_sst$sst)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   32.18   59.52   62.10   62.44   65.39   81.41

The median (62.10 F) is slightly lower than the mean (62.44 F), indicating a slight right skew in the data. Max and Min SST values show a wide range of regional temperatures. This wide range could have implications for marine life, including corals. In addition, the range of SST and (SD) values vary across the years. Years with high variability (e.g., 2013 and 2014) might indicate more unstable environmental conditions affecting marine life.

4.3.2 Annual Summary

annual_summary <- cata_sst[, .(mean_sst = mean(sst, na.rm = TRUE),
                               median_sst = median(sst, na.rm = TRUE),
                               min_sst = min(sst, na.rm = TRUE),
                               max_sst = max(sst, na.rm = TRUE),
                               range_sst = max(sst, na.rm = TRUE) - min(sst, na.rm = TRUE),
                               sd_sst = sd(sst, na.rm = TRUE)),
                           by = year(date)]
head(annual_summary)

##    year mean_sst median_sst min_sst max_sst range_sst   sd_sst
## 1: 2002 63.57079   63.60238  54.482  71.609    17.127 2.407717
## 2: 2003 62.26467   61.55420  50.702  73.859    23.157 3.600943
## 3: 2004 62.34296   61.64300  51.638  75.011    23.373 3.837297
## 4: 2005 62.17049   61.70400  50.558  73.247    22.689 3.152014
## 5: 2006 62.68221   62.40119  50.144  76.820    26.676 4.042645
## 6: 2007 61.25971   60.36620  50.441  76.397    25.956 3.903932

Mean SST for each year ranges between 58.49 F (in 2023) and 65.92 F (in 2015). Minimum SST across all years is 32.18 F, (in 2013). Maximum SST across all years is 81.41 F, (in 2018). Range of SST, Highest 46.03 F (in 2014) Lowest at 12.37 F (in 2023).

Each year’s standard deviation (SD) shows consistent variability across years, with a noticeable increase in 2013 and 2014. The highest standard deviation is 7.99 F (in 2014), and the lowest standard deviation is 2.03 F (in 2023). Mean SST increased from 2002 to 2015, peaked (at 65.92 F), then decreased again, reaching its lowest value (58.49 F) in 2023. Indicates fluctuations of SST over the years can impact marine ecosystems and the risk of coral bleaching.

4.3.3 Temporal Trends Monthly

# Note Eval = FALSE here to save on printed space
monthly_avg_sst <- cata_sst[, .(mean_sst = mean(sst, na.rm = TRUE)),
                            by = .(year = year(date), month = month(date))]
for (i in 1:nrow(monthly_avg_sst)) {
  cat(paste(monthly_avg_sst[i,]$year, monthly_avg_sst[i,]$month, monthly_avg_sst[i,]$mean_sst, sep = "\t"), "\n")
}

Seasonal trends are shown with higher summer temperatures and lower winter temperatures. There also seems to be an upward trend in temperatures overall, ranging from 60.9 to 64.9 in 2022 to 61.8 to 65.7 in 2022. 2013 had a significant drop in December to an average of 37.2 and 2014 had a sharp increase in July and August with averages of 68.2 and 69.4, respectively. The sharp increase and decrease of these temperatures may be caused by El Nino, La Nina, or some other ocean temperature affecting global events. If it is El Nino or La Nina, I must note that these typically occur every 3-5 years. Unfortunately, during my research, I could not pinpoint what may have caused this event. It is also possible that this was a region specific event where only the Catalina region experienced such low temperatures. It is odd, as most research articles I could find pointed to higher sea temperatures for the given time period, as opposed extremely cool sea temperatures.

4.3.4 Histograms and density plots

# Histogram
hist(cata_sst$sst, main = "Histogram of Sea Surface Temperatures", xlab = "Temperature (F)", breaks = 100)

The distribution is not uniform. Some temperature ranges are more common than others. Sea surface temperature value frequency tends to increase from 32 F to around 63-64 F, after which it decreases. Most Sea surface temperatures values are approximately 63-64 F. Very few observations are in extremely low and high temperatures.

4.3.5 Density Plot

plot(density(cata_sst$sst, na.rm = TRUE), main = "Density Plot of Sea Surface Temperatures", xlab = "Temperature (F)")

Our density plot shows a similar pattern to our histogram, however it is displayed in a smoothed manner.

4.3.6 Boxplots

boxplot(sst ~ year(date), data = cata_sst, main = "Boxplot of Sea Surface Temperatures by Year", xlab = "Year", ylab = "Temperature (F)")

Year Outliers Percentage

2002 0.48% (23/4800)

2003 0.85% (41/4800)

2004 0.60% (29/4800)

2005 4.87% (234/4800)

2006 0.27% (13/4800)

2007 61.02% (2929/4800)

2008 0.00% (0/4800)

2009 0.50% (24/4800)

2010 16.02% (769/4800)

2011 11.81% (567/4800)

2012 8.88% (426/4800)

2013 2304.17% (110396/4800)

2014 2857.08% (137140/4800)

2015 1.48% (71/4800)

2016 238.23% (11435/4800)

2017 0.06% (3/4800)

2018 45.46% (2182/4800)

2019 41.81% (2007/4800)

2020 0.00% (0/4800)

2021 0.00% (0/4800)

2022 0.83% (40/4800)

2023 1.04% (50/4800)

The increase in IQR between 2012 and 2014 shows a wider data spread. On the other hand, IQR decreases after 2014 and stays about the same until 2023. This dip in 2023 is due to only being from January to March, likely to have colder temperatures

4.3.7 Seasonal Decomposition

ts_data <- ts(cata_sst[, mean(sst, na.rm = TRUE), by = date]$V1, start = c(2002, 7), frequency = 12)
# Decompose the data
decomposedRes <- decompose(ts_data, type = 'multiplicative')  # use type = "additive" for additive components

# Plot the decomposed result
plot(decomposedRes)

# Plot the trend component
plot(decomposedRes$trend)

# STL decomposition
stlRes <- stl(ts_data, s.window = "periodic")
stlcomps <- stlRes$time.series
# Extract the trend component
Trends <- stlcomps[, 2]
# Plot the trend component
plot(Trends)

# Decompose the time series using STL
ts_stl <- stl(ts_data, "periodic")
# De-seasonalize the time series
ts_sa <- seasadj(ts_stl)
# Plot the original series
plot(ts_data, type = "l")

# Plot the seasonally adjusted series
plot(ts_sa, type = "l")

# Seasonal plot
# Split the time series into four parts
ts_sa_part1 <- window(ts_sa, start = start(ts_sa), end = c(2007, 12))
ts_sa_part2 <- window(ts_sa, start = c(2008, 1), end = c(2012, 12))
ts_sa_part3 <- window(ts_sa, start = c(2013, 1), end = c(2017, 12))
ts_sa_part4 <- window(ts_sa, start = c(2018, 1), end = end(ts_sa))

# Part 1: 2002 - 2007
colors_part1 <- c("red", "blue", "green", "purple", "orange", "cyan")
seasonplot(ts_sa_part1, 12, col = colors_part1, year.labels = TRUE, main = "Seasonal Plot: 2002-2007")

# Part 2: 2008 - 2012
colors_part2 <- c("magenta", "yellow", "brown", "pink", "gray")
seasonplot(ts_sa_part2, 12, col = colors_part2, year.labels = TRUE, main = "Seasonal Plot: 2008-2012")

# Part 3: 2013 - 2017
colors_part3 <- c("darkred", "darkblue", "darkgreen", "darkorange", "darkgray")
seasonplot(ts_sa_part3, 12, col = colors_part3, year.labels = TRUE, main = "Seasonal Plot: 2013-2017")

# Part 4: 2018 - 2023
colors_part4 <- c("limegreen", "gold", "coral", "skyblue", "navy", "firebrick")
seasonplot(ts_sa_part4, 12, col = colors_part4, year.labels = TRUE, main = "Seasonal Plot: 2018-2023")

Test for stationarity

adf.test(ts_data)

## Warning in adf.test(ts_data): p-value smaller than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_data
## Dickey-Fuller = -7.5536, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary

Augmented Dickey Fuller (ADF) Test has a p-value is 0.01,thus we reject null hypothesis, indicating it is stationary.

kpss.test(ts_data, null = "Trend")

## Warning in kpss.test(ts_data, null = "Trend"): p-value greater than printed
## p-value

## 
##  KPSS Test for Trend Stationarity
## 
## data:  ts_data
## KPSS Trend = 0.087426, Truncation lag parameter = 5, p-value = 0.1

Kwiatkowski–Phillips–Schmidt–Shin (KPSS) Test has a p-value is 0.1, thus we fail to reject null hypothesis, indicating trend stationarity.

kpss.test(ts_data, null = "Level")

## Warning in kpss.test(ts_data, null = "Level"): p-value greater than printed
## p-value

## 
##  KPSS Test for Level Stationarity
## 
## data:  ts_data
## KPSS Level = 0.28983, Truncation lag parameter = 5, p-value = 0.1

KPSS test has a p-value is 0.1, fail to reject null hypothesis, indicates level stationarity.

Now from deseasoning, Variance of the time series does not change over time, and there is no apparent trend or seasonality affecting the data.

Calculate the number of differences required to make the time series stationary

ndiffs(ts_data, test = "adf") # 0

## [1] 0

ndiffs(ts_data, test = "kpss") # 0

## [1] 0

ndiffs(ts_data) # 0

## [1] 0

Each test shows no differences are required to make the time series stationary.

Seasonal Data

The seasonal component shows a repeating pattern, enforcing seasonality of the data. The trend component shows an overall upward trend from the beginning to about T_52. It then shows a downward trend from T_53 to T_100. After that, the trend starts to increase again to end. The remainder component shows the residual fluctuations in data that can’t be explained by seasonal or trend components. It also shows positive and negative values through dataset which indicates variations not accounted for by other components.

Seasonal-Adjusted Data

There is an overall trend upward for temperatures as years move forward. Temperatures seem to fluctuate less than the seasonal data for most years, especially 2013-2017. Still there is a large drop in 2013 and some other potential events that took place. This is to be expected. Once again, I could not find information on an event at this time for 2013 so my assumption is that this was an anomaly in the Catalina region only.

4.3.8 Autocorrelation Check

# Plot
acf(ts_data)

# Plot
acf_result <- acf(ts_data, lag.max = 12, plot = FALSE)
plot(acf_result, main = "Autocorrelation (Monthly Lags)", xaxt = "n")
xticks <- seq(0, 1, length.out = 13)  # Create 13 equally spaced tick marks between 0 and 1
axis(side = 1, at = xticks, labels = 0:12)  # Add custom x-axis labels (lags 0 to 12)

The results from our autocorrelation plots indicate a strong positive correlation between sea temperatures in consecutive months. This means that month-to-month data follows seasonality and will be mostly similar to the previous month. As the time between lags is more significant, it shows a weak correlation, again matching with seasonality.

4.4 Time Series Analysis

It seems from our EDA that we have noticed some extreme outliers in the year of 2013. Overall though, it seems so far that claims of rising ocean temperatures are to be true. Now that we have seen trends in our data we can attempt to model our data for future forecasts. This will allow us to estimate if rising sea temperatures are supposed to continue, and if so at what rate? I have chosen to use ARIMA an model for our forecasts. ARIMA will capture the temporal structure and patterns within such as seasonality and autocorrelation. This should allow for accurate predictions of future values in the time series. Specifically, the auto.arima function is used in order to chose the best fit ARIMA model. Keep in mind, we are utilizing “ts_data”. This time series was used in our EDA and is composed of the mean value for all spatial points for each month of every year. Mean values were used for modeling due to the large size of the dataset. with over thirty million data points, it was not possible to compute models with all points on the equipment available to me. In addition, please not as well the “ts_data” has been de-seasonalized during our EDA and is stationary.

4.4.1 Fit the best ARIMA model

# Fit the best ARIMA model
# AR <- auto.arima(ts_data, stepwise=FALSE, approximation=FALSE, trace=TRUE)
# Best model: ARIMA(1,0,1)(0,1,1)[12]
AR <- arima(ts_data, order = c(1, 0, 1), seasonal = list(order = c(0, 1, 1), period = 12))
AR_fit <- ts_data - residuals(AR)

# Forecast 120 periods ahead
AR_forecast <- predict(AR, n.ahead = 120)

# Create a ts object for forecasts
forecast_start <- end(ts_data) + 1/12
forecast_ts <- ts(AR_forecast$pred, start = forecast_start, frequency = 12)

# Create a ts object for upper and lower bounds of the 95% prediction intervals
upper_bound_ts <- ts(AR_forecast$pred + 2 * AR_forecast$se, start = forecast_start, frequency = 12)
lower_bound_ts <- ts(AR_forecast$pred - 2 * AR_forecast$se, start = forecast_start, frequency = 12)

# Plot the original ts, forecasts, and 95% prediction intervals
ts.plot(ts_data, xlim = c(2002, 2033), main = "Original time series, AR(1) forecasts, and 95% prediction intervals")
lines(forecast_ts, col = 2, lwd = 3)
lines(upper_bound_ts, col = 2, lty = 2, lwd = 2)
lines(lower_bound_ts, col = 2, lty = 2, lwd = 2)

# Find peaks and valleys of forecasts
peak_months <- c()
valley_months <- c()
# Convert forecasts to a vector
forecast_values <- as.vector(AR_forecast$pred)
# Iterate through forecasts, excluding first and last values
for (i in 2:(length(forecast_values) - 1)) {
  if (forecast_values[i] > forecast_values[i - 1] && forecast_values[i] > forecast_values[i + 1]) {
    peak_months <- c(peak_months, i)
  } else if (forecast_values[i] < forecast_values[i - 1] && forecast_values[i] < forecast_values[i + 1]) {
    valley_months <- c(valley_months, i)
  }
}
# Convert month indices to month-year format
peak_dates <- as.Date("2002-01-01") + (length(ts_data) + peak_months - 1) * 30.44
valley_dates <- as.Date("2002-01-01") + (length(ts_data) + valley_months - 1) * 30.44
# Print peaks and valleys
cat("Peaks:\n")
for (i in 1:length(peak_months)) {
  cat(format(peak_dates[i], "%b %Y"), ": ", forecast_values[peak_months[i]], "\n")
}
cat("\nValleys:\n")
for (i in 1:length(valley_months)) {
  cat(format(valley_dates[i], "%b %Y"), ": ", forecast_values[valley_months[i]], "\n")
}

Peaks:

Nov 2022 : 68.49155

Nov 2023 : 68.49897

Nov 2024 : 68.49897

Valleys:

May 2023 : 58.87902

May 2024 : 58.87905

May 2025 : 58.87905

As shown in the plot above, and by our list of peaks and valleys, our 10 year predictive forecast assumes that temperatures will slightly increase over the first year by about 0.00742 °F. This is not extremely far off from our claimed estimated of 0.013°F annually. However, as you can see, this model assumes that within a years time, all subsequent years are subjected to be in the same repeating range. This is highly unlikely, and is probably caused by either overfitting or extreme outliers, such as 2013, affecting the overall model performance.

4.4.2 ARIMA Smoothed

To test this theory of overfitting and extreme outliers affecting my forecasts I will preform a new ARIMA model. This model will be built using smoothed data, in order to remove some of the more extreme outliers. Hopefully, this will give me a much more accurate presentation of the data as well as more accurate forecasts.

# Calculate the mean SST for each date
mean_sst_by_date <- cata_sst[, mean(sst, na.rm = TRUE), by = date]$V1

# Moving average of mean SST values at N window of N
N <- 7
moving_avg_mean_sst_by_date <- filter(mean_sst_by_date, rep(1/N, N), sides = 1)

# Add NA to beginning of ts to match original length
moving_avg_mean_sst_by_date <- c(rep(NA, N - 1), moving_avg_mean_sst_by_date)

# Create ts object with moving average mean values
ts_data_moving_avg <- ts(moving_avg_mean_sst_by_date, start = c(2002, 7), frequency = 12)

# auto.arima for best model
# AR_moving_avg <- auto.arima(ts_data_moving_avg, stationary=TRUE, stepwise=FALSE, approximation=FALSE)
# print(AR)
AR_moving_avg <- arima(ts_data_moving_avg, order = c(5, 0, 0)) # results show this is best model

# Alternative ARIMA model, From before, best ARIMA (keep for future reference)
# AR_moving_avg <- arima(ts_data_moving_avg, order = c(1, 0, 1), seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast 120 periods ahead
AR_forecast_moving_avg <- forecast(AR_moving_avg, h = 120)

# Bounds settings
upper_bound_ts <- AR_forecast_moving_avg$upper[,2] # 95% prediction interval upper bound
lower_bound_ts <- AR_forecast_moving_avg$lower[,2] # 95% prediction interval lower bound

# Plot original ts, forecasts, and 95% prediction intervals
ts.plot(ts_data_moving_avg, xlim = c(2002, 2033), main = "Smoothed time series, ARIMA forecasts, and 95% prediction intervals")
lines(AR_forecast_moving_avg$mean, col = 2, lwd = 3)
lines(upper_bound_ts, col = 2, lty = 2, lwd = 2)
lines(lower_bound_ts, col = 2, lty = 2, lwd = 2)

# Note Eval = FALSE here to save on printed space
# Initialize peak and valley months
peak_months <- c()
valley_months <- c()

# Convert forecasts to a vector
forecast_values <- as.vector(AR_forecast_moving_avg$mean)

# Iterate through forecasts, excluding first and last values
for (i in 2:(length(forecast_values) - 1)) {
  if (forecast_values[i] > forecast_values[i - 1] && forecast_values[i] > forecast_values[i + 1]) {
    peak_months <- c(peak_months, i)
  } else if (forecast_values[i] < forecast_values[i - 1] && forecast_values[i] < forecast_values[i + 1]) {
    valley_months <- c(valley_months, i)
  }
}

# Convert month indices to month-year format
peak_dates <- as.Date("2002-01-01") + (length(ts_data_moving_avg) + peak_months - 1) * 30.44
valley_dates <- as.Date("2002-01-01") + (length(ts_data_moving_avg) + valley_months - 1) * 30.44

# Print peaks and valleys
cat("Peaks:\n")
for (i in 1:length(peak_months)) {
  cat(format(peak_dates[i], "%b %Y"), ": ", forecast_values[peak_months[i]], "\n")
}

cat("\nValleys:\n")
for (i in 1:length(valley_months)) {
  cat(format(valley_dates[i], "%b %Y"), ": ", forecast_values[valley_months[i]], "\n")
}

Peaks:

Aug 2023 : 63.6071

Aug 2024 : 62.88364

Jul 2025 : 62.59479

Jul 2026 : 62.48095

Jul 2027 : 62.43551

Jul 2028 : 62.4177

Jul 2029 : 62.41081

Jun 2030 : 62.4082

Jun 2031 : 62.40722

Jun 2032 : 62.40683

Valleys:

Feb 2023 : 60.05424

Feb 2024 : 61.56719

Jan 2025 : 62.0962

Jan 2026 : 62.28597

Jan 2027 : 62.35983

Jan 2028 : 62.38859

Jan 2029 : 62.39972

Jan 2030 : 62.40399

Dec 2030 : 62.40557

Dec 2031 : 62.40619

As you can see our from smoothed plot and list of peaks and valleys, our confidence interval seems more concise in this model, as if it were more accurate. There seems to be a much larger difference from year to year starting out, showing an decrease of about 1.2 degrees from peaks and an equivalent increase of the valleys. Thus, we seem to still be running into an issue with our forecast. Again, this may be due to overfitting or our extreme values.This could additionally be a drawback of ARIMA models that is causing issues with this type of data specifically. The model gets more sure of itself as time goes on, and subsequently lowers the range of peaks and valleys until it evens out at around 62.4 degrees. Logically we know that ocean temperatures are not going to even out this way, even if it made sense statistically based on one model. To combat this we may need to switch up our model again.

4.4.3 auto.arima() with lambda parameter

Based on the seemingly inaccuracy of our last two models, I will implement a different methodology in hopes to overcome any limitations from this model. I will split the data to utilize testing and training data, as well as implemented use of the lambda parameter in the auto.arima function. We will utilize the auto lambda parameter so that it can chose the appropriate lambda for our data. This will ensure variance is efficiently stabilized for our model. In addition, we will be testing for normality of our data to try to get additional insights to troubleshoot our forecasting issue. Note, for this first ARIMA plot we will not be using smoothing data, we will follow up with a smoothed data plot next and compare results. We will compare forecasting accuracy of models by calculating Mean Absolute Error (MAE).

# Initialize ts
ts_data <- ts(cata_sst[, mean(sst, na.rm = TRUE), by = date]$V1, start = c(2002, 7), frequency = 12)
# Split into training and test data
train_data <- window(ts_data, end = c(2020, 12))
test_data <- window(ts_data, start = c(2021, 1))


train_arima <- auto.arima(train_data, lambda = "auto")

# Forecast 10 years (120 periods) using best model
arima_forecast <- forecast(train_arima, h = 120)
# print(arima_forecast$mean)

# Plot original ts, forecasts, and test with 95% confidence intervals
plot(train_data, xlim = c(2002, 2033), ylim = range(c(train_data, test_data, arima_forecast$mean)), main = "Train, Test and 10-Year Forecast")
lines(arima_forecast$mean, col = "blue")
lines(test_data, col = "black")
lines(arima_forecast$lower[,2], col = "red", lty = 2)  # 95% lower confidence interval
lines(arima_forecast$upper[,2], col = "red", lty = 2)  # 95% upper confidence interval

# Note Eval = FALSE here to save on printed space
# Find peaks and valleys for first plot
peak_months <- integer(0)
valley_months <- integer(0)

forecast_values <- arima_forecast$mean

for (i in 2:(length(forecast_values) - 1)) {
  if (forecast_values[i] > forecast_values[i - 1] && forecast_values[i] > forecast_values[i + 1]) {
    peak_months <- c(peak_months, i)
  } else if (forecast_values[i] < forecast_values[i - 1] && forecast_values[i] < forecast_values[i + 1]) {
    valley_months <- c(valley_months, i)
  }
}

cat("Peaks for Train, Test and 10-Year Forecast plot:\n")
for (i in peak_months) {
  cat(format(as.Date("2002-01-01") + (length(train_data) + i - 1) * 30.44, "%b %Y"), ": ", forecast_values[i], "\n")
}

cat("\nValleys for Train, Test and 10-Year Forecast plot:\n")
for (i in valley_months) {
  cat(format(as.Date("2002-01-01") + (length(train_data) + i - 1) * 30.44, "%b %Y"), ": ", forecast_values[i], "\n")
}

# Calculate Mean Absolute Error (MAE) to measure forecast accuracy
mae <- mean(abs(test_data - arima_forecast$mean[1:length(test_data)]))
cat("Mean Absolute Error (MAE):", mae, "\n")

# Check residuals
residuals <- residuals(train_arima)
plot(residuals)

qqnorm(residuals)
qqline(residuals)

# Print residuals data points
# print(residuals(train_arima))

# Test residuals for normality using the Shapiro-Wilk test
shapiro.test(residuals) # W = 0.66141, p-value < 2.2e-16

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals
## W = 0.66141, p-value < 2.2e-16

Peaks:

Dec 2020 : 68.63449

Dec 2021 : 68.64206

Dec 2022 : 68.64206

Valleys:

Apr 2021 : 59.42203

Apr 2022 : 59.42246

Apr 2023 : 59.42246

To begin, the p-value from out Shapiro-Wilk test is less than 2.2e-16, which is significantly smaller than the threshold of 0.05. This shows the data is not normally distributed. We assumed this already due to seasonality, and the extreme outliers of our 2013 year. This makes sense with what we know so far. As shown in this plot as well as in our peaks and valleys, this model yields near identical results to our first ARIMA model. However, the model predicts first year movement at about 0.012 per year. This is very close to the yearly claims of around 0.013 degrees. It seems this model did not overcome our forecasting issues. Next we will try the same model with smoothed data. For comparison we will note that our Mean Absolute Error (MAE) has a value of 1.62816.

4.4.3 auto.arima() with lambda parameter smoothed

Now we will try a similar method, only with smoothed data, in order to handle our large dip in 2013.

# First, split smoothed data to training and test sets
train_data_smoothed <- window(ts_data_moving_avg, start = c(2002, 7), end = c(2019, 12))
test_data_smoothed <- window(ts_data_moving_avg, start = c(2020, 1), end = c(2020, 12))

# Fit the ARIMA model using the smoothed training data
train_arima_smoothed <- auto.arima(train_data_smoothed, stationary = TRUE, stepwise = FALSE, approximation = FALSE)

# Forecast for the next 10 years (120 periods) using smoothed data
arima_forecast_10_years_smoothed <- forecast(train_arima_smoothed, h = 120)
# print(arima_forecast_10_years_smoothed$mean)

# Plot original ts, forecasts (10 years), and test data with 95% confidence intervals using smoothed data
min_val <- min(c(train_data_smoothed, test_data_smoothed, arima_forecast_10_years_smoothed$mean), na.rm = TRUE)
max_val <- max(c(train_data_smoothed, test_data_smoothed, arima_forecast_10_years_smoothed$mean), na.rm = TRUE)
plot(train_data_smoothed, xlim = c(2002, 2033), ylim = c(min_val, max_val), main = "Train, Test and 10-Year Forecast (Smoothed Data)")
lines(arima_forecast_10_years_smoothed$mean, col = "blue")
lines(test_data_smoothed, col = "black")
lines(arima_forecast_10_years_smoothed$lower[,2], col = "red", lty = 2)  # 95% lower confidence interval
lines(arima_forecast_10_years_smoothed$upper[,2], col = "red", lty = 2)  # 95% upper confidence interval

# Find peaks and valleys for second plot (Train, Test and 10-Year Forecast) using smoothed data
peak_months_10_years_smoothed <- integer(0)
valley_months_10_years_smoothed <- integer(0)

forecast_values_10_years_smoothed <- arima_forecast_10_years_smoothed$mean

# Note Eval = FALSE here to save on printed space
for (i in 2:(length(forecast_values_10_years_smoothed) - 1)) {
  if (forecast_values_10_years_smoothed[i] > forecast_values_10_years_smoothed[i - 1] && forecast_values_10_years_smoothed[i] > forecast_values_10_years_smoothed[i + 1]) {
    peak_months_10_years_smoothed <- c(peak_months_10_years_smoothed, i)
  } else if (forecast_values_10_years_smoothed[i] < forecast_values_10_years_smoothed[i - 1] && forecast_values_10_years_smoothed[i] < forecast_values_10_years_smoothed[i + 1]) {
    valley_months_10_years_smoothed <- c(valley_months_10_years_smoothed, i)
  }
}

cat("Peaks for Train, Test and 10-Year Forecast plot (Smoothed Data):\n")
for (i in peak_months_10_years_smoothed) {
  cat(format(as.Date("2002-01-01") + (length(train_data_smoothed) + i - 1) * 30.44, "%b %Y"), ": ", forecast_values_10_years_smoothed[i], "\n")
}

cat("\nValleys for Train, Test and 10-Year Forecast plot (Smoothed Data):\n")
for (i in valley_months_10_years_smoothed) {
  cat(format(as.Date("2002-01-01") + (length(train_data_smoothed) + i - 1) * 30.44, "%b %Y"), ": ", forecast_values_10_years_smoothed[i], "\n")
}

# Calculate Mean Absolute Error (MAE) to measure forecast accuracy for smoothed data
arima_forecast_smoothed <- forecast(train_arima_smoothed, h = length(test_data_smoothed))
mae_smoothed <- mean(abs(test_data_smoothed - arima_forecast_smoothed$mean))
cat("Mean Absolute Error (MAE) - Smoothed Data:", mae_smoothed, "\n")

## Mean Absolute Error (MAE) - Smoothed Data: 0.8112004

# Test residuals for normality using the Shapiro-Wilk test for smoothed data
shapiro_resid_smoothed <- shapiro.test(residuals(train_arima_smoothed))
cat("Shapiro-Wilk Test for Normality of Residuals - Smoothed Data:\n")

## Shapiro-Wilk Test for Normality of Residuals - Smoothed Data:

print(shapiro_resid_smoothed)

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(train_arima_smoothed)
## W = 0.67303, p-value < 2.2e-16

Peaks:

Sep 2019 : 65.59909

Sep 2020 : 64.38814

Sep 2021 : 64.05944

Sep 2022 : 63.76637

Sep 2023 : 63.5312

Sep 2024 : 63.34178

Sep 2025 : 63.18923

Sep 2026 : 63.06637

Sep 2027 : 62.96742

Sep 2028 : 62.88773

Valleys:

Mar 2020 : 61.98057

Mar 2021 : 62.08287

Mar 2022 : 62.17261

Mar 2023 : 62.24781

Mar 2024 : 62.30815

Mar 2025 : 62.35676

Mar 2026 : 62.39591

Mar 2027 : 62.42744

Mar 2028 : 62.45283

Mar 2029 : 62.47328

To begin, the p-value from out Shapiro-Wilk test is less than 2.2e-16, which is significantly smaller than the threshold of 0.05. This shows the data is not normally distributed. Again, we assumed this already due to seasonality, and the extreme outliers of our 2013 year. As shown in this plot as well as in our peaks and valleys, this model yields near identical results to our second ARIMA model, slowly averaging down until it reaches about 62.4 degrees on each side. It seems, yet again, this model did not overcome our forecasting issues. For comparison we will note that our Mean Absolute Error (MAE) has a value of 0.8112004. This implies the model accuracy was improved between the two models.Although it does not forecast more than the first few months, statistically, it seems likely the first few months are forecasted accurately.

To ensure statistical accuracy in all our methods, it seems ARIMA will not fit our needs for this specific analysis. It clearly cannot forecast ahead the amount of years that we need to discover when our region would be exposed to the bleaching threshold. Although we could likely due some quick-hand math and provide our own estimates off of the first few months of our ARIMA model, the accuracy will likely be very low for our forecasts. Instead we should look for a new model that is able to predict values we need. This will ensure the integrity of our results are sound for the analysis.

4.4.4 EMP Model

Given the limitations of the ARIMA model for long-term forecasting, I chose to explore alternative models. One model I chose to look at was an Empirical Model Prediction (EMP). EMP models can potentially provide more accurate long-term projections by leveraging historical data and patterns, offering a better estimation for this analysis. Similar to our last ARIMA model, this model will also utilize training and test data for enhanced accuracy. As our smoothed model proved a lower MAE than our non-smoothed, I will once again create a model for smoothed and non-smoothed to compare for accuracy.

# Convert time indices to Date format
train_data_dates <- as.Date("2002-01-01") + seq(0, length(train_data) - 1) * 30.44
test_data_dates <- as.Date("2002-01-01") + seq(length(train_data), length(train_data) + length(test_data) - 1) * 30.44

# Prepare data for prophet
prophet_data <- data.frame(ds = train_data_dates, y = as.numeric(train_data))
test_prophet_data <- data.frame(ds = test_data_dates, y = as.numeric(test_data))

# Train the prophet model
m <- prophet(prophet_data, weekly.seasonality = FALSE, daily.seasonality = FALSE)

# Forecast using the prophet model for the next 10 years
future <- make_future_dataframe(m, periods = length(test_data) + 120, freq = "month", include_history = FALSE)
forecast_prophet <- predict(m, future)

# Extract the forecast values and confidence interval for the test data period
test_forecast <- forecast_prophet$yhat[1:length(test_data)]
test_lower <- forecast_prophet$yhat_lower[1:length(test_data)]
test_upper <- forecast_prophet$yhat_upper[1:length(test_data)]

# Create a new time series object for the forecasts
full_forecast_ts <- ts(c(rep(NA, length(train_data)), forecast_prophet$yhat), start = c(2002, 1), frequency = 12)

# Create a new time series object for the lower and upper bounds of the 95% prediction intervals
full_lower_bound_ts <- ts(c(rep(NA, length(train_data)), forecast_prophet$yhat_lower), start = c(2002, 1), frequency = 12)
full_upper_bound_ts <- ts(c(rep(NA, length(train_data)), forecast_prophet$yhat_upper), start = c(2002, 1), frequency = 12)

# Plot the original ts, forecasts, and 95% prediction intervals
ts.plot(train_data, xlim = c(2002, 2033), ylim = range(c(train_data, test_data, forecast_prophet$yhat)), main = "Train, Test and 10-Year Forecast")
lines(full_forecast_ts, col = "blue")
lines(test_data, col = "black")
lines(full_lower_bound_ts, col = "red", lty = 2)
lines(full_upper_bound_ts, col = "red", lty = 2)

# Note Eval = FALSE here to save on printed space
# Find peaks and valleys for the Prophet model
prophet_peak_months <- integer(0)
prophet_valley_months <- integer(0)

prophet_forecast_values <- full_forecast_ts[(length(train_data) + 1):length(full_forecast_ts)]

for (i in 2:(length(prophet_forecast_values) - 1)) {
  if (prophet_forecast_values[i] > prophet_forecast_values[i - 1] && prophet_forecast_values[i] > prophet_forecast_values[i + 1]) {
    prophet_peak_months <- c(prophet_peak_months, i)
  } else if (prophet_forecast_values[i] < prophet_forecast_values[i - 1] && prophet_forecast_values[i] < prophet_forecast_values[i + 1]) {
    prophet_valley_months <- c(prophet_valley_months, i)
  }
}

cat("Peaks for Train, Test and 10-Year Forecast plot (Prophet):\n")
for (i in prophet_peak_months) {
  cat(format(as.Date("2023-04-01") + (i - 1) * 30.44, "%b %Y"), ": ", prophet_forecast_values[i], "\n")
}

cat("\nValleys for Train, Test and 10-Year Forecast plot (Prophet):\n")
for (i in prophet_valley_months) {
  cat(format(as.Date("2023-04-01") + (i - 1) * 30.44, "%b %Y"), ": ", prophet_forecast_values[i], "\n")
}

# Calculate the Mean Absolute Error (MAE) to measure forecast accuracy
mae_prophet <- mean(abs(as.numeric(test_data) - test_forecast))
cat("Mean Absolute Error (MAE) - Prophet:", mae_prophet, "\n")

## Mean Absolute Error (MAE) - Prophet: 4.034598

Peaks:

Sep 2023 : 70.28246

Nov 2023 : 67.21614

Apr 2024 : 60.9028

Sep 2024 : 69.56459

Valleys:

May 2023 : 57.39331

Oct 2023 : 65.33315

Mar 2024 : 59.90452

May 2024 : 58.56097

Oct 2024 : 65.39903

Mar 2025 : 59.35465

As you can see from the forecast as well as our peaks and valleys, this model displays our monthly data very well showing multiple peaks and valleys throughout a year. In addition, it can be seen that approximately every 3-4 years, there is a dip to low temperatures. This gives us more insight that our extreme low in 2013 could have been an extreme case of one of our El Nino/La Nina event theory. Although it cannot be confirmed, this adds new evidence to support our theory. Overall this model seems to match our goals and will allow us to forecast further out into the future to see approximately when the Catalina region is expected to meet the bleaching threshold. Our MAE for this model is 3.991828, and we will compare this to our next model using smoothed data to see which is more accurate.

4.4.5 EMP Model Smoothed

Now we will run a similar model, using smoothed data as our input, to see if we can further increase the accuracy of our model

# Convert time indices to Date format
train_data_dates_smoothed <- as.Date("2002-01-01") + seq(0, length(train_data_smoothed) - 1) * 30.44
test_data_dates_smoothed <- as.Date("2002-01-01") + seq(length(train_data_smoothed), length(train_data_smoothed) + length(test_data_smoothed) - 1) * 30.44

# Prepare data for prophet
prophet_data_smoothed <- data.frame(ds = train_data_dates_smoothed, y = as.numeric(train_data_smoothed))
test_prophet_data_smoothed <- data.frame(ds = test_data_dates_smoothed, y = as.numeric(test_data_smoothed))

# Train the prophet model with smoothed data
m_smoothed <- prophet(prophet_data_smoothed, weekly.seasonality = FALSE, daily.seasonality = FALSE)

# Forecast using the prophet model for the next 10 years with smoothed data
future_smoothed <- make_future_dataframe(m_smoothed, periods = length(test_data_smoothed) + 120, freq = "month")
forecast_prophet_smoothed <- predict(m_smoothed, future_smoothed)

# Extract the forecast values and confidence interval for the test data period using smoothed data
test_forecast_smoothed <- forecast_prophet_smoothed$yhat[(length(train_data_smoothed) + 1):(length(train_data_smoothed) + length(test_data_smoothed))]
test_lower_smoothed <- forecast_prophet_smoothed$yhat_lower[(length(train_data_smoothed) + 1):(length(train_data_smoothed) + length(test_data_smoothed))]
test_upper_smoothed <- forecast_prophet_smoothed$yhat_upper[(length(train_data_smoothed) + 1):(length(train_data_smoothed) + length(test_data_smoothed))]

# Create a new time series object for the forecasts using smoothed data
full_forecast_ts_smoothed <- ts(c(rep(NA, length(train_data_smoothed)), forecast_prophet_smoothed$yhat[(length(train_data_smoothed) + 1):length(forecast_prophet_smoothed$yhat)]), start = c(2002, 1), frequency = 12)

# Create a new time series object for the lower and upper bounds of the 95% prediction intervals using smoothed data
full_lower_bound_ts_smoothed <- ts(c(rep(NA, length(train_data_smoothed)), forecast_prophet_smoothed$yhat_lower[(length(train_data_smoothed) + 1):length(forecast_prophet_smoothed$yhat)]), start = c(2002, 1), frequency = 12)
full_upper_bound_ts_smoothed <- ts(c(rep(NA, length(train_data_smoothed)), forecast_prophet_smoothed$yhat_upper[(length(train_data_smoothed) + 1):length(forecast_prophet_smoothed$yhat)]), start = c(2002, 1), frequency = 12)

# Find the 1st and 99th percentiles for the y-axis limits
y_min <- quantile(c(train_data_smoothed, test_data_smoothed, forecast_prophet_smoothed$yhat), 0.01, na.rm = TRUE)
y_max <- quantile(c(train_data_smoothed, test_data_smoothed, forecast_prophet_smoothed$yhat), 0.99, na.rm = TRUE)

# Plot the original ts, forecasts, and 95% prediction intervals using smoothed data
ts.plot(train_data_smoothed, xlim = c(2002, 2033), ylim = c(y_min, y_max), main = "Train, Test and 10-Year Forecast (Smoothed Data)")
lines(full_forecast_ts_smoothed, col = "blue")
lines(test_data_smoothed, col = "black")
lines(full_lower_bound_ts_smoothed, col = "red", lty = 2)
lines(full_upper_bound_ts_smoothed, col = "red", lty = 2)

# Calculate the Mean Absolute Error (MAE) to measure forecast accuracy using smoothed data
mae_prophet_smoothed <- mean(abs(test_data_smoothed - test_forecast_smoothed))
cat("Mean Absolute Error (MAE) - Prophet (Smoothed Data):", mae_prophet_smoothed, "\n")

## Mean Absolute Error (MAE) - Prophet (Smoothed Data): 1.433764

# Note Eval = FALSE here to save on printed space
find_peaks_valleys <- function(ts, start_date, n_train) {
  peaks <- integer(0)
  valleys <- integer(0)
  
  for (i in 2:(length(ts) - 1)) {
    if (!is.na(ts[i]) && !is.na(ts[i - 1]) && !is.na(ts[i + 1])) {
      if (ts[i] > ts[i - 1] && ts[i] > ts[i + 1]) {
        peaks <- c(peaks, i)
      } else if (ts[i] < ts[i - 1] && ts[i] < ts[i + 1]) {
        valleys <- c(valleys, i)
      }
    }
  }
  
  cat("Peaks:\n")
  for (i in peaks) {
    cat(format(as.Date(start_date) + (i - 1) * 30.44, "%b %Y"), ": ", ts[i], "\n")
  }
  
  cat("\nValleys:\n")
  for (i in valleys) {
    cat(format(as.Date(start_date) + (i - 1) * 30.44, "%b %Y"), ": ", ts[i], "\n")
  }
}

# Print Peaks and valleys
find_peaks_valleys(full_forecast_ts[(length(train_data) + 1):length(full_forecast_ts)], "2023-04-01", 0)
find_peaks_valleys(full_forecast_ts_smoothed[(length(train_data_smoothed) + 1):length(full_forecast_ts_smoothed)], "2023-04-01", 0)

Peaks:

Sep 2023 : 70.28246

Nov 2023 : 67.21614

Apr 2024 : 60.9028

Sep 2024 : 69.56459

Nov 2024 : 68.47595

May 2025 : 61.37627

Sep 2025 : 68.86981

Nov 2025 : 69.73549

May 2026 : 61.88408

Sep 2026 : 68.19928

Valleys:

May 2023 : 57.39331

Oct 2023 : 65.33315

Mar 2024 : 59.90452

May 2024 : 58.56097

Oct 2024 : 65.39903

Mar 2025 : 59.35465

May 2025 : 59.73477

Oct 2025 : 65.50211

Mar 2026 : 58.83793

As you can see from the forecast as well as our peaks and valleys, this model also displays our monthly data very well showing multiple peaks and valleys throughout a year. Also, as expected, our smoothed data option did prove to have a lower MAE at 1.3845 meaning we will have more accurate forecasts using this model.

4.4.5 EMP Model Smoothed (300 Year Forecast)

Now that we have our best model selected we can forecast a longer time period to discover approximately when we can expect the Catalina region to experience mass bleaching due to having sea surface temperatures above the bleaching threshold of 86.9°F. I tried a forecast of 100 years, then 200, and finally 300 years in order to see when this expected to happen.

# Convert time indices to Date format
train_data_dates_smoothed <- as.Date("2002-01-01") + seq(0, length(train_data_smoothed) - 1) * 30.44
test_data_dates_smoothed <- as.Date("2002-01-01") + seq(length(train_data_smoothed), length(train_data_smoothed) + length(test_data_smoothed) - 1) * 30.44

# Prepare data for prophet
prophet_data_smoothed <- data.frame(ds = train_data_dates_smoothed, y = as.numeric(train_data_smoothed))
test_prophet_data_smoothed <- data.frame(ds = test_data_dates_smoothed, y = as.numeric(test_data_smoothed))

# Train the prophet model with smoothed data
m_smoothed <- prophet(prophet_data_smoothed, weekly.seasonality = FALSE, daily.seasonality = FALSE)

# Forecast using the prophet model for the next 300 years with smoothed data
future_smoothed <- make_future_dataframe(m_smoothed, periods = length(test_data_smoothed) + 3600, freq = "month")
forecast_prophet_smoothed <- predict(m_smoothed, future_smoothed)

# Extract the forecast values and confidence interval for the test data period using smoothed data
test_forecast_smoothed <- forecast_prophet_smoothed$yhat[(length(train_data_smoothed) + 1):(length(train_data_smoothed) + length(test_data_smoothed))]
test_lower_smoothed <- forecast_prophet_smoothed$yhat_lower[(length(train_data_smoothed) + 1):(length(train_data_smoothed) + length(test_data_smoothed))]
test_upper_smoothed <- forecast_prophet_smoothed$yhat_upper[(length(train_data_smoothed) + 1):(length(train_data_smoothed) + length(test_data_smoothed))]

# Create a new time series object for the forecasts using smoothed data
full_forecast_ts_smoothed <- ts(c(rep(NA, length(train_data_smoothed)), forecast_prophet_smoothed$yhat[(length(train_data_smoothed) + 1):length(forecast_prophet_smoothed$yhat)]), start = c(2002, 1), frequency = 12)

# Create a new time series object for the lower and upper bounds of the 95% prediction intervals using smoothed data
full_lower_bound_ts_smoothed <- ts(c(rep(NA, length(train_data_smoothed)), forecast_prophet_smoothed$yhat_lower[(length(train_data_smoothed) + 1):length(forecast_prophet_smoothed$yhat)]), start = c(2002, 1), frequency = 12)
full_upper_bound_ts_smoothed <- ts(c(rep(NA, length(train_data_smoothed)), forecast_prophet_smoothed$yhat_upper[(length(train_data_smoothed) + 1):length(forecast_prophet_smoothed$yhat)]), start = c(2002, 1), frequency = 12)

# Find the 1st and 99th percentiles for the y-axis limits
y_min <- quantile(c(train_data_smoothed, test_data_smoothed, forecast_prophet_smoothed$yhat), 0.01, na.rm = TRUE)
y_max <- quantile(c(train_data_smoothed, test_data_smoothed, forecast_prophet_smoothed$yhat), 0.99, na.rm = TRUE)

# Plot the original ts, forecasts, and 95% prediction intervals using smoothed data
ts.plot(train_data_smoothed, xlim = c(2002, 2333), ylim = c(y_min, y_max), main = "Train, Test and 300-Year Forecast (Smoothed Data)")
lines(full_forecast_ts_smoothed, col = "blue")
lines(test_data_smoothed, col = "black")
lines(full_lower_bound_ts_smoothed, col = "red", lty = 2)
lines(full_upper_bound_ts_smoothed, col = "red", lty = 2)

# Calculate the Mean Absolute Error (MAE) to measure forecast accuracy using smoothed data
mae_prophet_smoothed <- mean(abs(test_data_smoothed - test_forecast_smoothed))
cat("Mean Absolute Error (MAE) - Prophet (Smoothed Data):", mae_prophet_smoothed, "\n")

## Mean Absolute Error (MAE) - Prophet (Smoothed Data): 1.433764

# Note Eval = FALSE here to save on printed space
# Peaks and valleys function
find_peaks_valleys <- function(ts, start_date, n_train) {
  peaks <- integer(0)
  valleys <- integer(0)
  
  for (i in 2:(length(ts) - 1)) {
    if (!is.na(ts[i]) && !is.na(ts[i - 1]) && !is.na(ts[i + 1])) {
      if (ts[i] > ts[i - 1] && ts[i] > ts[i + 1]) {
        peaks <- c(peaks, i)
      } else if (ts[i] < ts[i - 1] && ts[i] < ts[i + 1]) {
        valleys <- c(valleys, i)
      }
    }
  }
  
  cat("Peaks:\n")
  for (i in peaks) {
    cat(format(as.Date(start_date) + (n_train + i - 1) * 30.44, "%b %Y"), ": ", ts[i], "\n")
  }
  
  cat("\nValleys:\n")
  for (i in valleys) {
    cat(format(as.Date(start_date) + (n_train + i - 1) * 30.44, "%b %Y"), ": ", ts[i], "\n")
  }
}

# Print Peaks and valleys 
# find_peaks_valleys(full_forecast_ts[(length(train_data) + 1):length(full_forecast_ts)], "2023-04-01", 0)
find_peaks_valleys(full_forecast_ts_smoothed[(length(train_data_smoothed) + 1):length(full_forecast_ts_smoothed)], "2023-04-01", 0)

Peaks:

Jun 2258 : 87.85517

Dec 2258 : 86.96098

Valleys:

Sep 2287 : 88.52364

Nov 2287 : 88.1054

As shown in this forecast chart, and through my printed results of these peaks and valleys for the Catalina region to experience the full effects of the bleaching threshold, it would take a range of 230 to 260 years. We will analyze what this means further in our conclusion.

4.5 Spatial Analysis

For this analysis I decided to do a simple raster visualization, based on all averaged spatial points, as well as a linear regression analysis. Although a full shiny app tool was developed to cycle through the yearly raster for all data points, I had yet to construct a spatial visualization that displayed the relationship of all points over the entire time period. This will allow readers to visualize temperature trends visually of certain areas, such as if some areas are typically warmer or colder than others. Linear regression was chosen to test if the relationship between time and sea surface temperatures is significant or not, and to provide information about

4.5.1 Raster - Point Pattern Analysis

# Calculate the average SST values for each grid cell
cata_sst_avg <- cata_sst[, list(avg_sst = mean(sst)), by = list(lon, lats)]

# Create a SpatialPointsDataFrame from the averaged data
coordinates(cata_sst_avg) <- ~lon+lats
proj4string(cata_sst_avg) <- CRS("+proj=longlat +datum=WGS84")

## Warning in CRS("+proj=longlat +datum=WGS84"): sf required for
## evolution_status==2L

# Set the raster resolution
res <- 0.1  # Adjust this value based on your desired resolution

# Create an empty raster with the desired resolution
r <- raster(extent(cata_sst_avg), res = res)

# Rasterize the averaged SST data
sst_avg_data <- rasterize(cata_sst_avg, r, field = "avg_sst")

# Create a custom color scale
custom_blue_red <- colorRampPalette(c("#0099FF", "#FF6666"))

# Set the temperature unit for the legend
unit <- bquote("Temperature " * degree * "F")

# Plot the averaged data with the custom color scale and adjusted x-axis limits
plot(sst_avg_data, col = custom_blue_red(50), main = "Average SST Values 2002-2023", xlab = "Longitude", ylab = "Latitude")
legend("topright", legend = unit, fill = custom_blue_red(50), bty = "n")

The raster analysis revealed a clear spatial pattern in the SST values. Cooler temperatures were observed in the top-left corner of the region, while warmer temperatures were found in the bottom-right and bottom-left areas. Additionally, warmer temperatures were observed along the coastline. The raster showed distinct square areas that are confirmed to be islands in the region. These squares in the raster aligned with the locations of the islands on a map, suggesting that the rasterization process accurately represented the spatial distribution of SST values, taking into account the absence of SST data in areas covered by land features. The raster was created with a 0.1-degree resolution, which provided a smooth representation of the data while retaining sufficient detail to identify spatial patterns and land features.

The summary statistics for the raster data showed a minimum SST value of 53.46°F, a maximum of 73.78°F, and a median of 62.88°F. The presence of 416 NA values in the raster summary corresponds to land features, such as islands and coastlines, where no SST data was available in the original dataset.

4.5.2 Linear Regression

# Calculate yearly mean SST values
cata_sst_yearly <- cata_sst[, list(avg_sst = mean(sst)), by = list(lon, lats, year = as.integer(format(date, "%Y")))]

# Fit a linear regression model
sst_trend <- lm(avg_sst ~ year, data = cata_sst_yearly)

# Plot the trend
plot(cata_sst_yearly$year, cata_sst_yearly$avg_sst)
abline(sst_trend, col = "blue")

# Print the summary of the linear regression model
summary(sst_trend)

## 
## Call:
## lm(formula = avg_sst ~ year, data = cata_sst_yearly)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.5031  -1.7518   0.2541   1.8464  18.3041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.536e+01  4.622e-01   54.88   <2e-16 ***
## year        1.835e-02  2.297e-04   79.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.534 on 3023849 degrees of freedom
## Multiple R-squared:  0.002108,   Adjusted R-squared:  0.002107 
## F-statistic:  6387 on 1 and 3023849 DF,  p-value: < 2.2e-16

The results of our linear regression model show a low p-value (< 2.2e-16), which indicates this model is statistically significant. This means a relationship between year and temperature is unlikely a result of random chance. A low R-Square also indicates that this is not likely a good model fit for the data. Likely other unknown factors cause variation in SST, although year may play a small role. This makes sense as temperatures in different years can have wildly varying temperatures due to things like weather events or even the earths position in the solar system.

# Note Eval = FALSE here to save on printed space
# Print the intercept and slope of the trend line
cat("Intercept:", coef(sst_trend)[1], "\n")
cat("Slope:", coef(sst_trend)[2], "\n")

# Intercept: 25.36399
# Slope: 0.01835476
Expected_SST_July_2002 <- 25.36399 + (0.01835476 * 2002)
print(Expected_SST_July_2002)

## [1] 62.11022

# 62.11022 is AB line starting point
Expected_SST_Mar_2023 <- 25.36399 + (0.01835476 * 2023)
print(Expected_SST_Mar_2023)

## [1] 62.49567

# 62.49567 is AB line ending point

Slope Calculations

Regression Model:

86.9 - 62.49567 = 24.40433

24.40433 / 0.01835476 = 1329.591

Activist Estimate Claims:

86.9 - 62.44 = 24.46

24.46 / 0.013 = 1881.538

Slope of our AB regression line shows an increase of about 0.0183 degrees annually. Based on this regression model it would take approximately 1,329 years for the Catalina region to experience an annual average temperature in the bleaching threshold. Although this number 1,881 is even further away if we use the estimates from activists provided from our research. These numbers also seem wildly different from our previous model estimate of approximately 230-260 years.

Something to note is that linear regression is not a very complex model. As such estimates are likely to be inaccurate when compared to a more complex model. As we already noted, the low R-Square indicates that this is likely not a good model fit. There are many additional factors that we could likely add in that may increase the accuracy of this linear model, but this is not something that was available for the purposes of this analysis. In addition, this regression line follows an annual average. As we discussed, mass coral bleaching is likely to happen when the bleaching threshold is reached for as little as a two week period. If the entire yearly average is already over the bleaching threshold, then likely all coral has been bleached for quite some time in this region.

5. Discussion and Conclusion

In this project, my aim was to analyze the sea surface temperature (SST) data in the Catalina region as a representative subset for the entire ocean to determine when coral reefs might be at risk of bleaching due to rising temperatures. Additionally, I aimed to verify this threat according to media claims and discuss the potential implications of forecasted temperature changes on marine ecosystems due to coral reef bleaching.

Initially, during my exploratory data analysis, it was discovered that there was a consistent warming trend over the years. Additionally, I noticed a strong seasonality pattern in the temperature data, with warmer temperatures typically occurring during specific months. A notable temperature drop in 2013 could possibly be attributed to an El Niño or La Niña like weather event, although specific information on the cause was not found. Interestingly, my later forecast also captured periodic temperature drops occurring every 3-4 years, further emphasizing the complex nature of SST patterns. My summary statistics also revealed a positive autocorrelation, indicating that the temperatures in one period are likely to be followed by similar temperatures in subsequent periods, contributing to the overall understanding of the SST patterns in the Catalina region.

My time series analysis employed several models, including ARIMA and Empirical Model Prediction (EMP), to project future sea surface temperatures. My EMP model with smoothed data offered the most precise prediction, with a Mean Absolute Error (MAE) of 1.3845. Based on this model’s forecast, coral reefs in warmer regions, particularly those closer to the equator, could face mass bleaching earlier than the Catalina region. This model predicted that mass bleaching could occur within a range of 230 to 260 years in the Catalina region, implying that regions with higher temperatures might experience these events even sooner.

My spatial analysis revealed a clear pattern in the SST values, with cooler temperatures in the top-left corner of the region and warmer temperatures in the bottom-right and bottom-left areas and along the coastline. As the Catalina region is above the equator, it makes sense that sea temperatures are more relaxed the further from the equator you are. My linear regression model showed a statistically significant relationship between year and temperature, with an annual increase of about 0.0183 degrees. However, this model’s low R-Square value suggests it may not fit the data best.

The forecasted increase in sea surface temperature (SST) aligns with media claims of ocean warming, posing a significant threat to marine life, particularly coral reefs. Highly susceptible to temperature changes, these reefs support an estimated 25% of all marine life. Rising SST could lead to more frequent and severe coral bleaching events, disrupting the rich biodiversity of marine life and potentially endangering all life in our oceans. This decline in coral reefs and associated marine life would upset the balance of the marine ecosystem. The implications could impact human activities, such as fishing and tourism, which rely on healthy coral reefs.

In conclusion, while the complete destruction of marine life due to rising sea surface temperatures is unlikely within our generation, the projection that mass bleaching events could occur in as few as 2.5 to 3 centuries is a cause for concern. This analysis is based on current conditions and patterns observed in early 2023. If other factors not accounted for in this analysis, such as human waste, carbon levels, and acidity, continue to worsen, these estimates might be overestimated. However, there are increasing efforts toward environmental conservation and cleanups, which could help mitigate these effects. Consequently, it is difficult to determine whether our current estimates will increase or decrease as we approach the projected time frame. Nevertheless, this study emphasizes the importance of addressing the issue of rising sea surface temperatures to protect marine ecosystems, particularly coral reefs in warmer regions, from bleaching events and other adverse impacts.

Works Cited:

Bradford, Nick. “A Warming Ocean.” National Environmental Education Foundation, 27 July 2017, https://www.neefusa.org/nature/water/warming-ocean. Accessed 1 May 2023.

“Coral Reef Watch.” National Oceanic and Atmospheric Administration, National Oceanic and Atmospheric Administration, 2021, coralreefwatch.noaa.gov/satellite/education/tutorial/crw22_bleachingthreshold.php.

Environmental Protection Agency. “Basic Information about Coral Reefs.” EPA, 1 June 2022, https://www.epa.gov/coral-reefs/basic-information-about-coral-reefs#:~:text=Coral%20reefs%20are%20among%20the,point%20in%20their%20life%20cycle.

Environmental Protection Agency. “Climate Change Indicators: Ocean Heat.” Environmental Protection Agency, 2016, https://www.epa.gov/climate-indicators/climate-change-indicators-ocean-heat. Accessed 16 August 2016.

Environmental Protection Agency. “Climate Change Indicators: Sea Surface Temperature.” Environmental Protection Agency, 2016, https://www.epa.gov/climate-indicators/climate-change-indicators-sea-surface-temperature. Accessed 16 August 2016.

Lindsey, Rebecca, and Luann Dahlman. “Climate Change: Ocean Heat Content.” NOAA Climate.gov, 17 Aug. 2020, https://www.climate.gov/news-features/understanding-climate/climate-change-ocean-heat-content.

National Climate Assessment. “Report Findings: Oceans.” U.S. Global Change Research Program, 2014, http://nca2014.globalchange.gov/highlights/report-findings/oceans. Accessed 16 August 2016.

NOAA Climate Program Office. “The Sustained Global Ocean Observing System for Climate.” National Oceanic and Atmospheric Administration, 2016, https://cpo.noaa.gov/Portals/0/Docs/OOM/OCO_ScienceBrochure_Web.pdfx. Accessed April 2020.

NOAA Coral Reef Watch. “Coral Bleaching Degree Heating Week (DHW) Product.” Coral Reef Watch, National Oceanic and Atmospheric Administration, 2021, https://coralreefwatch.noaa.gov/product/5km/index_5km_dhw.php.

NOAA. “ERDDAP: Easier Access to Scientific Data.” ERDDAP, NOAA NMFS SWFSC ERD, 3 May 2023, https://coastwatch.pfeg.noaa.gov/erddap/griddap/erdMWsstdmday.das. Accessed 3 May 2023.

Forecasting Sea Surface Temperatures in the Catalina Region A Time Series and Spatial Analysis Approach

Samuel Harold

2023-05-01

1. Background and Motivation

2. Data

3. Project Objectives

3.1. Primary Questions

3.2. Preliminary Research

3.3. Methods

4. Data Processing and Data Analysis

4.1 Data preprocessing:

4.2 Data Prep

4.3 EDA:

4.3.1 Summary Statistics

4.3.2 Annual Summary

4.3.3 Temporal Trends Monthly

4.3.4 Histograms and density plots

4.3.5 Density Plot

4.3.6 Boxplots

4.3.7 Seasonal Decomposition

Test for stationarity

Calculate the number of differences required to make the time series stationary

Seasonal Data

Seasonal-Adjusted Data

4.3.8 Autocorrelation Check

4.4 Time Series Analysis

4.4.1 Fit the best ARIMA model

4.4.2 ARIMA Smoothed

4.4.3 auto.arima() with lambda parameter

4.4.3 auto.arima() with lambda parameter smoothed

4.4.4 EMP Model

4.4.5 EMP Model Smoothed

4.4.5 EMP Model Smoothed (300 Year Forecast)

4.5 Spatial Analysis

4.5.1 Raster - Point Pattern Analysis

4.5.2 Linear Regression

Slope Calculations

5. Discussion and Conclusion

Works Cited: