4.3 EDA:
I will conduct exploratory data analysis to visualize and summarize
the data, identify trends, and detect any outliers or inconsistencies.
This will include a shiny app tool to cycle for a visual EDA between
months. Code or a link will be provided to the professor along with data
files if use is needed. I will also utilize statistical methods for EDA,
as spatial data spanning many years will require efficient ways of
analyzing such vast amounts of data.
4.3.1 Summary Statistics
# Summary Statistics
summary(cata_sst$sst)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 32.18 59.52 62.10 62.44 65.39 81.41
The median (62.10 F) is slightly lower than the mean (62.44 F),
indicating a slight right skew in the data. Max and Min SST values show
a wide range of regional temperatures. This wide range could have
implications for marine life, including corals. In addition, the range
of SST and (SD) values vary across the years. Years with high
variability (e.g., 2013 and 2014) might indicate more unstable
environmental conditions affecting marine life.
4.3.2 Annual Summary
annual_summary <- cata_sst[, .(mean_sst = mean(sst, na.rm = TRUE),
median_sst = median(sst, na.rm = TRUE),
min_sst = min(sst, na.rm = TRUE),
max_sst = max(sst, na.rm = TRUE),
range_sst = max(sst, na.rm = TRUE) - min(sst, na.rm = TRUE),
sd_sst = sd(sst, na.rm = TRUE)),
by = year(date)]
head(annual_summary)
## year mean_sst median_sst min_sst max_sst range_sst sd_sst
## 1: 2002 63.57079 63.60238 54.482 71.609 17.127 2.407717
## 2: 2003 62.26467 61.55420 50.702 73.859 23.157 3.600943
## 3: 2004 62.34296 61.64300 51.638 75.011 23.373 3.837297
## 4: 2005 62.17049 61.70400 50.558 73.247 22.689 3.152014
## 5: 2006 62.68221 62.40119 50.144 76.820 26.676 4.042645
## 6: 2007 61.25971 60.36620 50.441 76.397 25.956 3.903932
Mean SST for each year ranges between 58.49 F (in 2023) and 65.92 F
(in 2015). Minimum SST across all years is 32.18 F, (in 2013). Maximum
SST across all years is 81.41 F, (in 2018). Range of SST, Highest 46.03
F (in 2014) Lowest at 12.37 F (in 2023).
Each year’s standard deviation (SD) shows consistent variability
across years, with a noticeable increase in 2013 and 2014. The highest
standard deviation is 7.99 F (in 2014), and the lowest standard
deviation is 2.03 F (in 2023). Mean SST increased from 2002 to 2015,
peaked (at 65.92 F), then decreased again, reaching its lowest value
(58.49 F) in 2023. Indicates fluctuations of SST over the years can
impact marine ecosystems and the risk of coral bleaching.
4.3.3 Temporal Trends Monthly
# Note Eval = FALSE here to save on printed space
monthly_avg_sst <- cata_sst[, .(mean_sst = mean(sst, na.rm = TRUE)),
by = .(year = year(date), month = month(date))]
for (i in 1:nrow(monthly_avg_sst)) {
cat(paste(monthly_avg_sst[i,]$year, monthly_avg_sst[i,]$month, monthly_avg_sst[i,]$mean_sst, sep = "\t"), "\n")
}
Seasonal trends are shown with higher summer temperatures and lower
winter temperatures. There also seems to be an upward trend in
temperatures overall, ranging from 60.9 to 64.9 in 2022 to 61.8 to 65.7
in 2022. 2013 had a significant drop in December to an average of 37.2
and 2014 had a sharp increase in July and August with averages of 68.2
and 69.4, respectively. The sharp increase and decrease of these
temperatures may be caused by El Nino, La Nina, or some other ocean
temperature affecting global events. If it is El Nino or La Nina, I must
note that these typically occur every 3-5 years. Unfortunately, during
my research, I could not pinpoint what may have caused this event. It is
also possible that this was a region specific event where only the
Catalina region experienced such low temperatures. It is odd, as most
research articles I could find pointed to higher sea temperatures for
the given time period, as opposed extremely cool sea temperatures.
4.3.4 Histograms and density plots
# Histogram
hist(cata_sst$sst, main = "Histogram of Sea Surface Temperatures", xlab = "Temperature (F)", breaks = 100)

The distribution is not uniform. Some temperature ranges are more
common than others. Sea surface temperature value frequency tends to
increase from 32 F to around 63-64 F, after which it decreases. Most Sea
surface temperatures values are approximately 63-64 F. Very few
observations are in extremely low and high temperatures.
4.3.5 Density Plot
plot(density(cata_sst$sst, na.rm = TRUE), main = "Density Plot of Sea Surface Temperatures", xlab = "Temperature (F)")

Our density plot shows a similar pattern to our histogram, however it
is displayed in a smoothed manner.
4.3.6 Boxplots
boxplot(sst ~ year(date), data = cata_sst, main = "Boxplot of Sea Surface Temperatures by Year", xlab = "Year", ylab = "Temperature (F)")

Year Outliers Percentage
2002 0.48% (23/4800)
2003 0.85% (41/4800)
2004 0.60% (29/4800)
2005 4.87% (234/4800)
2006 0.27% (13/4800)
2007 61.02% (2929/4800)
2008 0.00% (0/4800)
2009 0.50% (24/4800)
2010 16.02% (769/4800)
2011 11.81% (567/4800)
2012 8.88% (426/4800)
2013 2304.17% (110396/4800)
2014 2857.08% (137140/4800)
2015 1.48% (71/4800)
2016 238.23% (11435/4800)
2017 0.06% (3/4800)
2018 45.46% (2182/4800)
2019 41.81% (2007/4800)
2020 0.00% (0/4800)
2021 0.00% (0/4800)
2022 0.83% (40/4800)
2023 1.04% (50/4800)
The increase in IQR between 2012 and 2014 shows a wider data spread.
On the other hand, IQR decreases after 2014 and stays about the same
until 2023. This dip in 2023 is due to only being from January to March,
likely to have colder temperatures
4.3.7 Seasonal Decomposition
ts_data <- ts(cata_sst[, mean(sst, na.rm = TRUE), by = date]$V1, start = c(2002, 7), frequency = 12)
# Decompose the data
decomposedRes <- decompose(ts_data, type = 'multiplicative') # use type = "additive" for additive components
# Plot the decomposed result
plot(decomposedRes)

# Plot the trend component
plot(decomposedRes$trend)

# STL decomposition
stlRes <- stl(ts_data, s.window = "periodic")
stlcomps <- stlRes$time.series
# Extract the trend component
Trends <- stlcomps[, 2]
# Plot the trend component
plot(Trends)

# Decompose the time series using STL
ts_stl <- stl(ts_data, "periodic")
# De-seasonalize the time series
ts_sa <- seasadj(ts_stl)
# Plot the original series
plot(ts_data, type = "l")

# Plot the seasonally adjusted series
plot(ts_sa, type = "l")

# Seasonal plot
# Split the time series into four parts
ts_sa_part1 <- window(ts_sa, start = start(ts_sa), end = c(2007, 12))
ts_sa_part2 <- window(ts_sa, start = c(2008, 1), end = c(2012, 12))
ts_sa_part3 <- window(ts_sa, start = c(2013, 1), end = c(2017, 12))
ts_sa_part4 <- window(ts_sa, start = c(2018, 1), end = end(ts_sa))
# Part 1: 2002 - 2007
colors_part1 <- c("red", "blue", "green", "purple", "orange", "cyan")
seasonplot(ts_sa_part1, 12, col = colors_part1, year.labels = TRUE, main = "Seasonal Plot: 2002-2007")

# Part 2: 2008 - 2012
colors_part2 <- c("magenta", "yellow", "brown", "pink", "gray")
seasonplot(ts_sa_part2, 12, col = colors_part2, year.labels = TRUE, main = "Seasonal Plot: 2008-2012")

# Part 3: 2013 - 2017
colors_part3 <- c("darkred", "darkblue", "darkgreen", "darkorange", "darkgray")
seasonplot(ts_sa_part3, 12, col = colors_part3, year.labels = TRUE, main = "Seasonal Plot: 2013-2017")

# Part 4: 2018 - 2023
colors_part4 <- c("limegreen", "gold", "coral", "skyblue", "navy", "firebrick")
seasonplot(ts_sa_part4, 12, col = colors_part4, year.labels = TRUE, main = "Seasonal Plot: 2018-2023")

Test for stationarity
adf.test(ts_data)
## Warning in adf.test(ts_data): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: ts_data
## Dickey-Fuller = -7.5536, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
Augmented Dickey Fuller (ADF) Test has a p-value is 0.01,thus we
reject null hypothesis, indicating it is stationary.
kpss.test(ts_data, null = "Trend")
## Warning in kpss.test(ts_data, null = "Trend"): p-value greater than printed
## p-value
##
## KPSS Test for Trend Stationarity
##
## data: ts_data
## KPSS Trend = 0.087426, Truncation lag parameter = 5, p-value = 0.1
Kwiatkowski–Phillips–Schmidt–Shin (KPSS) Test has a p-value is 0.1,
thus we fail to reject null hypothesis, indicating trend
stationarity.
kpss.test(ts_data, null = "Level")
## Warning in kpss.test(ts_data, null = "Level"): p-value greater than printed
## p-value
##
## KPSS Test for Level Stationarity
##
## data: ts_data
## KPSS Level = 0.28983, Truncation lag parameter = 5, p-value = 0.1
KPSS test has a p-value is 0.1, fail to reject null hypothesis,
indicates level stationarity.
Now from deseasoning, Variance of the time series does not change
over time, and there is no apparent trend or seasonality affecting the
data.
Calculate the number of differences required to make the time series
stationary
ndiffs(ts_data, test = "adf") # 0
## [1] 0
ndiffs(ts_data, test = "kpss") # 0
## [1] 0
ndiffs(ts_data) # 0
## [1] 0
Each test shows no differences are required to make the time series
stationary.
Seasonal Data
The seasonal component shows a repeating pattern, enforcing
seasonality of the data. The trend component shows an overall upward
trend from the beginning to about T_52. It then shows a downward trend
from T_53 to T_100. After that, the trend starts to increase again to
end. The remainder component shows the residual fluctuations in data
that can’t be explained by seasonal or trend components. It also shows
positive and negative values through dataset which indicates variations
not accounted for by other components.
Seasonal-Adjusted Data
There is an overall trend upward for temperatures as years move
forward. Temperatures seem to fluctuate less than the seasonal data for
most years, especially 2013-2017. Still there is a large drop in 2013
and some other potential events that took place. This is to be expected.
Once again, I could not find information on an event at this time for
2013 so my assumption is that this was an anomaly in the Catalina region
only.
4.3.8 Autocorrelation Check
# Plot
acf(ts_data)

# Plot
acf_result <- acf(ts_data, lag.max = 12, plot = FALSE)
plot(acf_result, main = "Autocorrelation (Monthly Lags)", xaxt = "n")
xticks <- seq(0, 1, length.out = 13) # Create 13 equally spaced tick marks between 0 and 1
axis(side = 1, at = xticks, labels = 0:12) # Add custom x-axis labels (lags 0 to 12)

The results from our autocorrelation plots indicate a strong positive
correlation between sea temperatures in consecutive months. This means
that month-to-month data follows seasonality and will be mostly similar
to the previous month. As the time between lags is more significant, it
shows a weak correlation, again matching with seasonality.
4.4 Time Series Analysis
It seems from our EDA that we have noticed some extreme outliers in
the year of 2013. Overall though, it seems so far that claims of rising
ocean temperatures are to be true. Now that we have seen trends in our
data we can attempt to model our data for future forecasts. This will
allow us to estimate if rising sea temperatures are supposed to
continue, and if so at what rate? I have chosen to use ARIMA an model
for our forecasts. ARIMA will capture the temporal structure and
patterns within such as seasonality and autocorrelation. This should
allow for accurate predictions of future values in the time series.
Specifically, the auto.arima function is used in order to chose the best
fit ARIMA model. Keep in mind, we are utilizing “ts_data”. This time
series was used in our EDA and is composed of the mean value for all
spatial points for each month of every year. Mean values were used for
modeling due to the large size of the dataset. with over thirty million
data points, it was not possible to compute models with all points on
the equipment available to me. In addition, please not as well the
“ts_data” has been de-seasonalized during our EDA and is stationary.
4.4.1 Fit the best ARIMA model
# Fit the best ARIMA model
# AR <- auto.arima(ts_data, stepwise=FALSE, approximation=FALSE, trace=TRUE)
# Best model: ARIMA(1,0,1)(0,1,1)[12]
AR <- arima(ts_data, order = c(1, 0, 1), seasonal = list(order = c(0, 1, 1), period = 12))
AR_fit <- ts_data - residuals(AR)
# Forecast 120 periods ahead
AR_forecast <- predict(AR, n.ahead = 120)
# Create a ts object for forecasts
forecast_start <- end(ts_data) + 1/12
forecast_ts <- ts(AR_forecast$pred, start = forecast_start, frequency = 12)
# Create a ts object for upper and lower bounds of the 95% prediction intervals
upper_bound_ts <- ts(AR_forecast$pred + 2 * AR_forecast$se, start = forecast_start, frequency = 12)
lower_bound_ts <- ts(AR_forecast$pred - 2 * AR_forecast$se, start = forecast_start, frequency = 12)
# Plot the original ts, forecasts, and 95% prediction intervals
ts.plot(ts_data, xlim = c(2002, 2033), main = "Original time series, AR(1) forecasts, and 95% prediction intervals")
lines(forecast_ts, col = 2, lwd = 3)
lines(upper_bound_ts, col = 2, lty = 2, lwd = 2)
lines(lower_bound_ts, col = 2, lty = 2, lwd = 2)

# Find peaks and valleys of forecasts
peak_months <- c()
valley_months <- c()
# Convert forecasts to a vector
forecast_values <- as.vector(AR_forecast$pred)
# Iterate through forecasts, excluding first and last values
for (i in 2:(length(forecast_values) - 1)) {
if (forecast_values[i] > forecast_values[i - 1] && forecast_values[i] > forecast_values[i + 1]) {
peak_months <- c(peak_months, i)
} else if (forecast_values[i] < forecast_values[i - 1] && forecast_values[i] < forecast_values[i + 1]) {
valley_months <- c(valley_months, i)
}
}
# Convert month indices to month-year format
peak_dates <- as.Date("2002-01-01") + (length(ts_data) + peak_months - 1) * 30.44
valley_dates <- as.Date("2002-01-01") + (length(ts_data) + valley_months - 1) * 30.44
# Print peaks and valleys
cat("Peaks:\n")
for (i in 1:length(peak_months)) {
cat(format(peak_dates[i], "%b %Y"), ": ", forecast_values[peak_months[i]], "\n")
}
cat("\nValleys:\n")
for (i in 1:length(valley_months)) {
cat(format(valley_dates[i], "%b %Y"), ": ", forecast_values[valley_months[i]], "\n")
}
Peaks:
Nov 2022 : 68.49155
Nov 2023 : 68.49897
Nov 2024 : 68.49897
Valleys:
May 2023 : 58.87902
May 2024 : 58.87905
May 2025 : 58.87905
As shown in the plot above, and by our list of peaks and valleys, our
10 year predictive forecast assumes that temperatures will slightly
increase over the first year by about 0.00742 °F. This is not extremely
far off from our claimed estimated of 0.013°F annually. However, as you
can see, this model assumes that within a years time, all subsequent
years are subjected to be in the same repeating range. This is highly
unlikely, and is probably caused by either overfitting or extreme
outliers, such as 2013, affecting the overall model performance.
4.4.2 ARIMA Smoothed
To test this theory of overfitting and extreme outliers affecting my
forecasts I will preform a new ARIMA model. This model will be built
using smoothed data, in order to remove some of the more extreme
outliers. Hopefully, this will give me a much more accurate presentation
of the data as well as more accurate forecasts.
# Calculate the mean SST for each date
mean_sst_by_date <- cata_sst[, mean(sst, na.rm = TRUE), by = date]$V1
# Moving average of mean SST values at N window of N
N <- 7
moving_avg_mean_sst_by_date <- filter(mean_sst_by_date, rep(1/N, N), sides = 1)
# Add NA to beginning of ts to match original length
moving_avg_mean_sst_by_date <- c(rep(NA, N - 1), moving_avg_mean_sst_by_date)
# Create ts object with moving average mean values
ts_data_moving_avg <- ts(moving_avg_mean_sst_by_date, start = c(2002, 7), frequency = 12)
# auto.arima for best model
# AR_moving_avg <- auto.arima(ts_data_moving_avg, stationary=TRUE, stepwise=FALSE, approximation=FALSE)
# print(AR)
AR_moving_avg <- arima(ts_data_moving_avg, order = c(5, 0, 0)) # results show this is best model
# Alternative ARIMA model, From before, best ARIMA (keep for future reference)
# AR_moving_avg <- arima(ts_data_moving_avg, order = c(1, 0, 1), seasonal = list(order = c(0, 1, 1), period = 12))
# Forecast 120 periods ahead
AR_forecast_moving_avg <- forecast(AR_moving_avg, h = 120)
# Bounds settings
upper_bound_ts <- AR_forecast_moving_avg$upper[,2] # 95% prediction interval upper bound
lower_bound_ts <- AR_forecast_moving_avg$lower[,2] # 95% prediction interval lower bound
# Plot original ts, forecasts, and 95% prediction intervals
ts.plot(ts_data_moving_avg, xlim = c(2002, 2033), main = "Smoothed time series, ARIMA forecasts, and 95% prediction intervals")
lines(AR_forecast_moving_avg$mean, col = 2, lwd = 3)
lines(upper_bound_ts, col = 2, lty = 2, lwd = 2)
lines(lower_bound_ts, col = 2, lty = 2, lwd = 2)

# Note Eval = FALSE here to save on printed space
# Initialize peak and valley months
peak_months <- c()
valley_months <- c()
# Convert forecasts to a vector
forecast_values <- as.vector(AR_forecast_moving_avg$mean)
# Iterate through forecasts, excluding first and last values
for (i in 2:(length(forecast_values) - 1)) {
if (forecast_values[i] > forecast_values[i - 1] && forecast_values[i] > forecast_values[i + 1]) {
peak_months <- c(peak_months, i)
} else if (forecast_values[i] < forecast_values[i - 1] && forecast_values[i] < forecast_values[i + 1]) {
valley_months <- c(valley_months, i)
}
}
# Convert month indices to month-year format
peak_dates <- as.Date("2002-01-01") + (length(ts_data_moving_avg) + peak_months - 1) * 30.44
valley_dates <- as.Date("2002-01-01") + (length(ts_data_moving_avg) + valley_months - 1) * 30.44
# Print peaks and valleys
cat("Peaks:\n")
for (i in 1:length(peak_months)) {
cat(format(peak_dates[i], "%b %Y"), ": ", forecast_values[peak_months[i]], "\n")
}
cat("\nValleys:\n")
for (i in 1:length(valley_months)) {
cat(format(valley_dates[i], "%b %Y"), ": ", forecast_values[valley_months[i]], "\n")
}
Peaks:
Aug 2023 : 63.6071
Aug 2024 : 62.88364
Jul 2025 : 62.59479
Jul 2026 : 62.48095
Jul 2027 : 62.43551
Jul 2028 : 62.4177
Jul 2029 : 62.41081
Jun 2030 : 62.4082
Jun 2031 : 62.40722
Jun 2032 : 62.40683
Valleys:
Feb 2023 : 60.05424
Feb 2024 : 61.56719
Jan 2025 : 62.0962
Jan 2026 : 62.28597
Jan 2027 : 62.35983
Jan 2028 : 62.38859
Jan 2029 : 62.39972
Jan 2030 : 62.40399
Dec 2030 : 62.40557
Dec 2031 : 62.40619
As you can see our from smoothed plot and list of peaks and valleys,
our confidence interval seems more concise in this model, as if it were
more accurate. There seems to be a much larger difference from year to
year starting out, showing an decrease of about 1.2 degrees from peaks
and an equivalent increase of the valleys. Thus, we seem to still be
running into an issue with our forecast. Again, this may be due to
overfitting or our extreme values.This could additionally be a drawback
of ARIMA models that is causing issues with this type of data
specifically. The model gets more sure of itself as time goes on, and
subsequently lowers the range of peaks and valleys until it evens out at
around 62.4 degrees. Logically we know that ocean temperatures are not
going to even out this way, even if it made sense statistically based on
one model. To combat this we may need to switch up our model again.
4.4.3 auto.arima() with lambda parameter
Based on the seemingly inaccuracy of our last two models, I will
implement a different methodology in hopes to overcome any limitations
from this model. I will split the data to utilize testing and training
data, as well as implemented use of the lambda parameter in the
auto.arima function. We will utilize the auto lambda parameter so that
it can chose the appropriate lambda for our data. This will ensure
variance is efficiently stabilized for our model. In addition, we will
be testing for normality of our data to try to get additional insights
to troubleshoot our forecasting issue. Note, for this first ARIMA plot
we will not be using smoothing data, we will follow up with a smoothed
data plot next and compare results. We will compare forecasting accuracy
of models by calculating Mean Absolute Error (MAE).
# Initialize ts
ts_data <- ts(cata_sst[, mean(sst, na.rm = TRUE), by = date]$V1, start = c(2002, 7), frequency = 12)
# Split into training and test data
train_data <- window(ts_data, end = c(2020, 12))
test_data <- window(ts_data, start = c(2021, 1))
train_arima <- auto.arima(train_data, lambda = "auto")
# Forecast 10 years (120 periods) using best model
arima_forecast <- forecast(train_arima, h = 120)
# print(arima_forecast$mean)
# Plot original ts, forecasts, and test with 95% confidence intervals
plot(train_data, xlim = c(2002, 2033), ylim = range(c(train_data, test_data, arima_forecast$mean)), main = "Train, Test and 10-Year Forecast")
lines(arima_forecast$mean, col = "blue")
lines(test_data, col = "black")
lines(arima_forecast$lower[,2], col = "red", lty = 2) # 95% lower confidence interval
lines(arima_forecast$upper[,2], col = "red", lty = 2) # 95% upper confidence interval

# Note Eval = FALSE here to save on printed space
# Find peaks and valleys for first plot
peak_months <- integer(0)
valley_months <- integer(0)
forecast_values <- arima_forecast$mean
for (i in 2:(length(forecast_values) - 1)) {
if (forecast_values[i] > forecast_values[i - 1] && forecast_values[i] > forecast_values[i + 1]) {
peak_months <- c(peak_months, i)
} else if (forecast_values[i] < forecast_values[i - 1] && forecast_values[i] < forecast_values[i + 1]) {
valley_months <- c(valley_months, i)
}
}
cat("Peaks for Train, Test and 10-Year Forecast plot:\n")
for (i in peak_months) {
cat(format(as.Date("2002-01-01") + (length(train_data) + i - 1) * 30.44, "%b %Y"), ": ", forecast_values[i], "\n")
}
cat("\nValleys for Train, Test and 10-Year Forecast plot:\n")
for (i in valley_months) {
cat(format(as.Date("2002-01-01") + (length(train_data) + i - 1) * 30.44, "%b %Y"), ": ", forecast_values[i], "\n")
}
# Calculate Mean Absolute Error (MAE) to measure forecast accuracy
mae <- mean(abs(test_data - arima_forecast$mean[1:length(test_data)]))
cat("Mean Absolute Error (MAE):", mae, "\n")
# Check residuals
residuals <- residuals(train_arima)
plot(residuals)

qqnorm(residuals)
qqline(residuals)

# Print residuals data points
# print(residuals(train_arima))
# Test residuals for normality using the Shapiro-Wilk test
shapiro.test(residuals) # W = 0.66141, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.66141, p-value < 2.2e-16
Peaks:
Dec 2020 : 68.63449
Dec 2021 : 68.64206
Dec 2022 : 68.64206
Valleys:
Apr 2021 : 59.42203
Apr 2022 : 59.42246
Apr 2023 : 59.42246
To begin, the p-value from out Shapiro-Wilk test is less than
2.2e-16, which is significantly smaller than the threshold of 0.05. This
shows the data is not normally distributed. We assumed this already due
to seasonality, and the extreme outliers of our 2013 year. This makes
sense with what we know so far. As shown in this plot as well as in our
peaks and valleys, this model yields near identical results to our first
ARIMA model. However, the model predicts first year movement at about
0.012 per year. This is very close to the yearly claims of around 0.013
degrees. It seems this model did not overcome our forecasting issues.
Next we will try the same model with smoothed data. For comparison we
will note that our Mean Absolute Error (MAE) has a value of 1.62816.
4.4.3 auto.arima() with lambda parameter smoothed
Now we will try a similar method, only with smoothed data, in order
to handle our large dip in 2013.
# First, split smoothed data to training and test sets
train_data_smoothed <- window(ts_data_moving_avg, start = c(2002, 7), end = c(2019, 12))
test_data_smoothed <- window(ts_data_moving_avg, start = c(2020, 1), end = c(2020, 12))
# Fit the ARIMA model using the smoothed training data
train_arima_smoothed <- auto.arima(train_data_smoothed, stationary = TRUE, stepwise = FALSE, approximation = FALSE)
# Forecast for the next 10 years (120 periods) using smoothed data
arima_forecast_10_years_smoothed <- forecast(train_arima_smoothed, h = 120)
# print(arima_forecast_10_years_smoothed$mean)
# Plot original ts, forecasts (10 years), and test data with 95% confidence intervals using smoothed data
min_val <- min(c(train_data_smoothed, test_data_smoothed, arima_forecast_10_years_smoothed$mean), na.rm = TRUE)
max_val <- max(c(train_data_smoothed, test_data_smoothed, arima_forecast_10_years_smoothed$mean), na.rm = TRUE)
plot(train_data_smoothed, xlim = c(2002, 2033), ylim = c(min_val, max_val), main = "Train, Test and 10-Year Forecast (Smoothed Data)")
lines(arima_forecast_10_years_smoothed$mean, col = "blue")
lines(test_data_smoothed, col = "black")
lines(arima_forecast_10_years_smoothed$lower[,2], col = "red", lty = 2) # 95% lower confidence interval
lines(arima_forecast_10_years_smoothed$upper[,2], col = "red", lty = 2) # 95% upper confidence interval

# Find peaks and valleys for second plot (Train, Test and 10-Year Forecast) using smoothed data
peak_months_10_years_smoothed <- integer(0)
valley_months_10_years_smoothed <- integer(0)
forecast_values_10_years_smoothed <- arima_forecast_10_years_smoothed$mean
# Note Eval = FALSE here to save on printed space
for (i in 2:(length(forecast_values_10_years_smoothed) - 1)) {
if (forecast_values_10_years_smoothed[i] > forecast_values_10_years_smoothed[i - 1] && forecast_values_10_years_smoothed[i] > forecast_values_10_years_smoothed[i + 1]) {
peak_months_10_years_smoothed <- c(peak_months_10_years_smoothed, i)
} else if (forecast_values_10_years_smoothed[i] < forecast_values_10_years_smoothed[i - 1] && forecast_values_10_years_smoothed[i] < forecast_values_10_years_smoothed[i + 1]) {
valley_months_10_years_smoothed <- c(valley_months_10_years_smoothed, i)
}
}
cat("Peaks for Train, Test and 10-Year Forecast plot (Smoothed Data):\n")
for (i in peak_months_10_years_smoothed) {
cat(format(as.Date("2002-01-01") + (length(train_data_smoothed) + i - 1) * 30.44, "%b %Y"), ": ", forecast_values_10_years_smoothed[i], "\n")
}
cat("\nValleys for Train, Test and 10-Year Forecast plot (Smoothed Data):\n")
for (i in valley_months_10_years_smoothed) {
cat(format(as.Date("2002-01-01") + (length(train_data_smoothed) + i - 1) * 30.44, "%b %Y"), ": ", forecast_values_10_years_smoothed[i], "\n")
}
# Calculate Mean Absolute Error (MAE) to measure forecast accuracy for smoothed data
arima_forecast_smoothed <- forecast(train_arima_smoothed, h = length(test_data_smoothed))
mae_smoothed <- mean(abs(test_data_smoothed - arima_forecast_smoothed$mean))
cat("Mean Absolute Error (MAE) - Smoothed Data:", mae_smoothed, "\n")
## Mean Absolute Error (MAE) - Smoothed Data: 0.8112004
# Test residuals for normality using the Shapiro-Wilk test for smoothed data
shapiro_resid_smoothed <- shapiro.test(residuals(train_arima_smoothed))
cat("Shapiro-Wilk Test for Normality of Residuals - Smoothed Data:\n")
## Shapiro-Wilk Test for Normality of Residuals - Smoothed Data:
print(shapiro_resid_smoothed)
##
## Shapiro-Wilk normality test
##
## data: residuals(train_arima_smoothed)
## W = 0.67303, p-value < 2.2e-16
Peaks:
Sep 2019 : 65.59909
Sep 2020 : 64.38814
Sep 2021 : 64.05944
Sep 2022 : 63.76637
Sep 2023 : 63.5312
Sep 2024 : 63.34178
Sep 2025 : 63.18923
Sep 2026 : 63.06637
Sep 2027 : 62.96742
Sep 2028 : 62.88773
Valleys:
Mar 2020 : 61.98057
Mar 2021 : 62.08287
Mar 2022 : 62.17261
Mar 2023 : 62.24781
Mar 2024 : 62.30815
Mar 2025 : 62.35676
Mar 2026 : 62.39591
Mar 2027 : 62.42744
Mar 2028 : 62.45283
Mar 2029 : 62.47328
To begin, the p-value from out Shapiro-Wilk test is less than
2.2e-16, which is significantly smaller than the threshold of 0.05. This
shows the data is not normally distributed. Again, we assumed this
already due to seasonality, and the extreme outliers of our 2013 year.
As shown in this plot as well as in our peaks and valleys, this model
yields near identical results to our second ARIMA model, slowly
averaging down until it reaches about 62.4 degrees on each side. It
seems, yet again, this model did not overcome our forecasting issues.
For comparison we will note that our Mean Absolute Error (MAE) has a
value of 0.8112004. This implies the model accuracy was improved between
the two models.Although it does not forecast more than the first few
months, statistically, it seems likely the first few months are
forecasted accurately.
To ensure statistical accuracy in all our methods, it seems ARIMA
will not fit our needs for this specific analysis. It clearly cannot
forecast ahead the amount of years that we need to discover when our
region would be exposed to the bleaching threshold. Although we could
likely due some quick-hand math and provide our own estimates off of the
first few months of our ARIMA model, the accuracy will likely be very
low for our forecasts. Instead we should look for a new model that is
able to predict values we need. This will ensure the integrity of our
results are sound for the analysis.
4.4.4 EMP Model
Given the limitations of the ARIMA model for long-term forecasting, I
chose to explore alternative models. One model I chose to look at was an
Empirical Model Prediction (EMP). EMP models can potentially provide
more accurate long-term projections by leveraging historical data and
patterns, offering a better estimation for this analysis. Similar to our
last ARIMA model, this model will also utilize training and test data
for enhanced accuracy. As our smoothed model proved a lower MAE than our
non-smoothed, I will once again create a model for smoothed and
non-smoothed to compare for accuracy.
# Convert time indices to Date format
train_data_dates <- as.Date("2002-01-01") + seq(0, length(train_data) - 1) * 30.44
test_data_dates <- as.Date("2002-01-01") + seq(length(train_data), length(train_data) + length(test_data) - 1) * 30.44
# Prepare data for prophet
prophet_data <- data.frame(ds = train_data_dates, y = as.numeric(train_data))
test_prophet_data <- data.frame(ds = test_data_dates, y = as.numeric(test_data))
# Train the prophet model
m <- prophet(prophet_data, weekly.seasonality = FALSE, daily.seasonality = FALSE)
# Forecast using the prophet model for the next 10 years
future <- make_future_dataframe(m, periods = length(test_data) + 120, freq = "month", include_history = FALSE)
forecast_prophet <- predict(m, future)
# Extract the forecast values and confidence interval for the test data period
test_forecast <- forecast_prophet$yhat[1:length(test_data)]
test_lower <- forecast_prophet$yhat_lower[1:length(test_data)]
test_upper <- forecast_prophet$yhat_upper[1:length(test_data)]
# Create a new time series object for the forecasts
full_forecast_ts <- ts(c(rep(NA, length(train_data)), forecast_prophet$yhat), start = c(2002, 1), frequency = 12)
# Create a new time series object for the lower and upper bounds of the 95% prediction intervals
full_lower_bound_ts <- ts(c(rep(NA, length(train_data)), forecast_prophet$yhat_lower), start = c(2002, 1), frequency = 12)
full_upper_bound_ts <- ts(c(rep(NA, length(train_data)), forecast_prophet$yhat_upper), start = c(2002, 1), frequency = 12)
# Plot the original ts, forecasts, and 95% prediction intervals
ts.plot(train_data, xlim = c(2002, 2033), ylim = range(c(train_data, test_data, forecast_prophet$yhat)), main = "Train, Test and 10-Year Forecast")
lines(full_forecast_ts, col = "blue")
lines(test_data, col = "black")
lines(full_lower_bound_ts, col = "red", lty = 2)
lines(full_upper_bound_ts, col = "red", lty = 2)

# Note Eval = FALSE here to save on printed space
# Find peaks and valleys for the Prophet model
prophet_peak_months <- integer(0)
prophet_valley_months <- integer(0)
prophet_forecast_values <- full_forecast_ts[(length(train_data) + 1):length(full_forecast_ts)]
for (i in 2:(length(prophet_forecast_values) - 1)) {
if (prophet_forecast_values[i] > prophet_forecast_values[i - 1] && prophet_forecast_values[i] > prophet_forecast_values[i + 1]) {
prophet_peak_months <- c(prophet_peak_months, i)
} else if (prophet_forecast_values[i] < prophet_forecast_values[i - 1] && prophet_forecast_values[i] < prophet_forecast_values[i + 1]) {
prophet_valley_months <- c(prophet_valley_months, i)
}
}
cat("Peaks for Train, Test and 10-Year Forecast plot (Prophet):\n")
for (i in prophet_peak_months) {
cat(format(as.Date("2023-04-01") + (i - 1) * 30.44, "%b %Y"), ": ", prophet_forecast_values[i], "\n")
}
cat("\nValleys for Train, Test and 10-Year Forecast plot (Prophet):\n")
for (i in prophet_valley_months) {
cat(format(as.Date("2023-04-01") + (i - 1) * 30.44, "%b %Y"), ": ", prophet_forecast_values[i], "\n")
}
# Calculate the Mean Absolute Error (MAE) to measure forecast accuracy
mae_prophet <- mean(abs(as.numeric(test_data) - test_forecast))
cat("Mean Absolute Error (MAE) - Prophet:", mae_prophet, "\n")
## Mean Absolute Error (MAE) - Prophet: 4.034598
Peaks:
Sep 2023 : 70.28246
Nov 2023 : 67.21614
Apr 2024 : 60.9028
Sep 2024 : 69.56459
Valleys:
May 2023 : 57.39331
Oct 2023 : 65.33315
Mar 2024 : 59.90452
May 2024 : 58.56097
Oct 2024 : 65.39903
Mar 2025 : 59.35465
As you can see from the forecast as well as our peaks and valleys,
this model displays our monthly data very well showing multiple peaks
and valleys throughout a year. In addition, it can be seen that
approximately every 3-4 years, there is a dip to low temperatures. This
gives us more insight that our extreme low in 2013 could have been an
extreme case of one of our El Nino/La Nina event theory. Although it
cannot be confirmed, this adds new evidence to support our theory.
Overall this model seems to match our goals and will allow us to
forecast further out into the future to see approximately when the
Catalina region is expected to meet the bleaching threshold. Our MAE for
this model is 3.991828, and we will compare this to our next model using
smoothed data to see which is more accurate.
4.4.5 EMP Model Smoothed
Now we will run a similar model, using smoothed data as our input, to
see if we can further increase the accuracy of our model
# Convert time indices to Date format
train_data_dates_smoothed <- as.Date("2002-01-01") + seq(0, length(train_data_smoothed) - 1) * 30.44
test_data_dates_smoothed <- as.Date("2002-01-01") + seq(length(train_data_smoothed), length(train_data_smoothed) + length(test_data_smoothed) - 1) * 30.44
# Prepare data for prophet
prophet_data_smoothed <- data.frame(ds = train_data_dates_smoothed, y = as.numeric(train_data_smoothed))
test_prophet_data_smoothed <- data.frame(ds = test_data_dates_smoothed, y = as.numeric(test_data_smoothed))
# Train the prophet model with smoothed data
m_smoothed <- prophet(prophet_data_smoothed, weekly.seasonality = FALSE, daily.seasonality = FALSE)
# Forecast using the prophet model for the next 10 years with smoothed data
future_smoothed <- make_future_dataframe(m_smoothed, periods = length(test_data_smoothed) + 120, freq = "month")
forecast_prophet_smoothed <- predict(m_smoothed, future_smoothed)
# Extract the forecast values and confidence interval for the test data period using smoothed data
test_forecast_smoothed <- forecast_prophet_smoothed$yhat[(length(train_data_smoothed) + 1):(length(train_data_smoothed) + length(test_data_smoothed))]
test_lower_smoothed <- forecast_prophet_smoothed$yhat_lower[(length(train_data_smoothed) + 1):(length(train_data_smoothed) + length(test_data_smoothed))]
test_upper_smoothed <- forecast_prophet_smoothed$yhat_upper[(length(train_data_smoothed) + 1):(length(train_data_smoothed) + length(test_data_smoothed))]
# Create a new time series object for the forecasts using smoothed data
full_forecast_ts_smoothed <- ts(c(rep(NA, length(train_data_smoothed)), forecast_prophet_smoothed$yhat[(length(train_data_smoothed) + 1):length(forecast_prophet_smoothed$yhat)]), start = c(2002, 1), frequency = 12)
# Create a new time series object for the lower and upper bounds of the 95% prediction intervals using smoothed data
full_lower_bound_ts_smoothed <- ts(c(rep(NA, length(train_data_smoothed)), forecast_prophet_smoothed$yhat_lower[(length(train_data_smoothed) + 1):length(forecast_prophet_smoothed$yhat)]), start = c(2002, 1), frequency = 12)
full_upper_bound_ts_smoothed <- ts(c(rep(NA, length(train_data_smoothed)), forecast_prophet_smoothed$yhat_upper[(length(train_data_smoothed) + 1):length(forecast_prophet_smoothed$yhat)]), start = c(2002, 1), frequency = 12)
# Find the 1st and 99th percentiles for the y-axis limits
y_min <- quantile(c(train_data_smoothed, test_data_smoothed, forecast_prophet_smoothed$yhat), 0.01, na.rm = TRUE)
y_max <- quantile(c(train_data_smoothed, test_data_smoothed, forecast_prophet_smoothed$yhat), 0.99, na.rm = TRUE)
# Plot the original ts, forecasts, and 95% prediction intervals using smoothed data
ts.plot(train_data_smoothed, xlim = c(2002, 2033), ylim = c(y_min, y_max), main = "Train, Test and 10-Year Forecast (Smoothed Data)")
lines(full_forecast_ts_smoothed, col = "blue")
lines(test_data_smoothed, col = "black")
lines(full_lower_bound_ts_smoothed, col = "red", lty = 2)
lines(full_upper_bound_ts_smoothed, col = "red", lty = 2)

# Calculate the Mean Absolute Error (MAE) to measure forecast accuracy using smoothed data
mae_prophet_smoothed <- mean(abs(test_data_smoothed - test_forecast_smoothed))
cat("Mean Absolute Error (MAE) - Prophet (Smoothed Data):", mae_prophet_smoothed, "\n")
## Mean Absolute Error (MAE) - Prophet (Smoothed Data): 1.433764
# Note Eval = FALSE here to save on printed space
find_peaks_valleys <- function(ts, start_date, n_train) {
peaks <- integer(0)
valleys <- integer(0)
for (i in 2:(length(ts) - 1)) {
if (!is.na(ts[i]) && !is.na(ts[i - 1]) && !is.na(ts[i + 1])) {
if (ts[i] > ts[i - 1] && ts[i] > ts[i + 1]) {
peaks <- c(peaks, i)
} else if (ts[i] < ts[i - 1] && ts[i] < ts[i + 1]) {
valleys <- c(valleys, i)
}
}
}
cat("Peaks:\n")
for (i in peaks) {
cat(format(as.Date(start_date) + (i - 1) * 30.44, "%b %Y"), ": ", ts[i], "\n")
}
cat("\nValleys:\n")
for (i in valleys) {
cat(format(as.Date(start_date) + (i - 1) * 30.44, "%b %Y"), ": ", ts[i], "\n")
}
}
# Print Peaks and valleys
find_peaks_valleys(full_forecast_ts[(length(train_data) + 1):length(full_forecast_ts)], "2023-04-01", 0)
find_peaks_valleys(full_forecast_ts_smoothed[(length(train_data_smoothed) + 1):length(full_forecast_ts_smoothed)], "2023-04-01", 0)
Peaks:
Sep 2023 : 70.28246
Nov 2023 : 67.21614
Apr 2024 : 60.9028
Sep 2024 : 69.56459
Nov 2024 : 68.47595
May 2025 : 61.37627
Sep 2025 : 68.86981
Nov 2025 : 69.73549
May 2026 : 61.88408
Sep 2026 : 68.19928
Valleys:
May 2023 : 57.39331
Oct 2023 : 65.33315
Mar 2024 : 59.90452
May 2024 : 58.56097
Oct 2024 : 65.39903
Mar 2025 : 59.35465
May 2025 : 59.73477
Oct 2025 : 65.50211
Mar 2026 : 58.83793
As you can see from the forecast as well as our peaks and valleys,
this model also displays our monthly data very well showing multiple
peaks and valleys throughout a year. Also, as expected, our smoothed
data option did prove to have a lower MAE at 1.3845 meaning we will have
more accurate forecasts using this model.
4.4.5 EMP Model Smoothed (300 Year Forecast)
Now that we have our best model selected we can forecast a longer
time period to discover approximately when we can expect the Catalina
region to experience mass bleaching due to having sea surface
temperatures above the bleaching threshold of 86.9°F. I tried a forecast
of 100 years, then 200, and finally 300 years in order to see when this
expected to happen.
# Convert time indices to Date format
train_data_dates_smoothed <- as.Date("2002-01-01") + seq(0, length(train_data_smoothed) - 1) * 30.44
test_data_dates_smoothed <- as.Date("2002-01-01") + seq(length(train_data_smoothed), length(train_data_smoothed) + length(test_data_smoothed) - 1) * 30.44
# Prepare data for prophet
prophet_data_smoothed <- data.frame(ds = train_data_dates_smoothed, y = as.numeric(train_data_smoothed))
test_prophet_data_smoothed <- data.frame(ds = test_data_dates_smoothed, y = as.numeric(test_data_smoothed))
# Train the prophet model with smoothed data
m_smoothed <- prophet(prophet_data_smoothed, weekly.seasonality = FALSE, daily.seasonality = FALSE)
# Forecast using the prophet model for the next 300 years with smoothed data
future_smoothed <- make_future_dataframe(m_smoothed, periods = length(test_data_smoothed) + 3600, freq = "month")
forecast_prophet_smoothed <- predict(m_smoothed, future_smoothed)
# Extract the forecast values and confidence interval for the test data period using smoothed data
test_forecast_smoothed <- forecast_prophet_smoothed$yhat[(length(train_data_smoothed) + 1):(length(train_data_smoothed) + length(test_data_smoothed))]
test_lower_smoothed <- forecast_prophet_smoothed$yhat_lower[(length(train_data_smoothed) + 1):(length(train_data_smoothed) + length(test_data_smoothed))]
test_upper_smoothed <- forecast_prophet_smoothed$yhat_upper[(length(train_data_smoothed) + 1):(length(train_data_smoothed) + length(test_data_smoothed))]
# Create a new time series object for the forecasts using smoothed data
full_forecast_ts_smoothed <- ts(c(rep(NA, length(train_data_smoothed)), forecast_prophet_smoothed$yhat[(length(train_data_smoothed) + 1):length(forecast_prophet_smoothed$yhat)]), start = c(2002, 1), frequency = 12)
# Create a new time series object for the lower and upper bounds of the 95% prediction intervals using smoothed data
full_lower_bound_ts_smoothed <- ts(c(rep(NA, length(train_data_smoothed)), forecast_prophet_smoothed$yhat_lower[(length(train_data_smoothed) + 1):length(forecast_prophet_smoothed$yhat)]), start = c(2002, 1), frequency = 12)
full_upper_bound_ts_smoothed <- ts(c(rep(NA, length(train_data_smoothed)), forecast_prophet_smoothed$yhat_upper[(length(train_data_smoothed) + 1):length(forecast_prophet_smoothed$yhat)]), start = c(2002, 1), frequency = 12)
# Find the 1st and 99th percentiles for the y-axis limits
y_min <- quantile(c(train_data_smoothed, test_data_smoothed, forecast_prophet_smoothed$yhat), 0.01, na.rm = TRUE)
y_max <- quantile(c(train_data_smoothed, test_data_smoothed, forecast_prophet_smoothed$yhat), 0.99, na.rm = TRUE)
# Plot the original ts, forecasts, and 95% prediction intervals using smoothed data
ts.plot(train_data_smoothed, xlim = c(2002, 2333), ylim = c(y_min, y_max), main = "Train, Test and 300-Year Forecast (Smoothed Data)")
lines(full_forecast_ts_smoothed, col = "blue")
lines(test_data_smoothed, col = "black")
lines(full_lower_bound_ts_smoothed, col = "red", lty = 2)
lines(full_upper_bound_ts_smoothed, col = "red", lty = 2)

# Calculate the Mean Absolute Error (MAE) to measure forecast accuracy using smoothed data
mae_prophet_smoothed <- mean(abs(test_data_smoothed - test_forecast_smoothed))
cat("Mean Absolute Error (MAE) - Prophet (Smoothed Data):", mae_prophet_smoothed, "\n")
## Mean Absolute Error (MAE) - Prophet (Smoothed Data): 1.433764
# Note Eval = FALSE here to save on printed space
# Peaks and valleys function
find_peaks_valleys <- function(ts, start_date, n_train) {
peaks <- integer(0)
valleys <- integer(0)
for (i in 2:(length(ts) - 1)) {
if (!is.na(ts[i]) && !is.na(ts[i - 1]) && !is.na(ts[i + 1])) {
if (ts[i] > ts[i - 1] && ts[i] > ts[i + 1]) {
peaks <- c(peaks, i)
} else if (ts[i] < ts[i - 1] && ts[i] < ts[i + 1]) {
valleys <- c(valleys, i)
}
}
}
cat("Peaks:\n")
for (i in peaks) {
cat(format(as.Date(start_date) + (n_train + i - 1) * 30.44, "%b %Y"), ": ", ts[i], "\n")
}
cat("\nValleys:\n")
for (i in valleys) {
cat(format(as.Date(start_date) + (n_train + i - 1) * 30.44, "%b %Y"), ": ", ts[i], "\n")
}
}
# Print Peaks and valleys
# find_peaks_valleys(full_forecast_ts[(length(train_data) + 1):length(full_forecast_ts)], "2023-04-01", 0)
find_peaks_valleys(full_forecast_ts_smoothed[(length(train_data_smoothed) + 1):length(full_forecast_ts_smoothed)], "2023-04-01", 0)
Peaks:
Jun 2258 : 87.85517
Dec 2258 : 86.96098
Valleys:
Sep 2287 : 88.52364
Nov 2287 : 88.1054
As shown in this forecast chart, and through my printed results of
these peaks and valleys for the Catalina region to experience the full
effects of the bleaching threshold, it would take a range of 230 to 260
years. We will analyze what this means further in our conclusion.