This data set presents the monthly mean total sunspot numbers from January 1749 up to the last elapsed month, with provisional values for recent 3 months. The monthly mean is calculated as the arithmetic mean of daily total sunspot numbers for each calendar month. The data extends back to 1749, the year from which consistent daily observations became available, allowing for reliable monthly averages. Prior to 1749, only yearly means are present due to the sparsity of data.
In this study, we will employ various time series forecasting techniques to unravel the underlying trends and seasonality in sunspot activity. We’ll be using 3 exponential models: Simple Exponential Smoothing to address level changes, Holt’s models which introduce trend components with an option for damping to capture potential deceleration in trend, and Holt-Winters models which are designed to encapsulate both trend and seasonal variations, in additive for linear and multiplicative for exponential seasonal effects, with the possibility of incorporating damping to temper future forecasts.
Missing values in the data set are denoted by a value of -1. The data set includes standard deviation values derived from daily observations, providing a measure of variability and the standard error on the monthly means, allowing for an assessment of the precision of the data. In addition, February 1824 is an exception in this historical record; lacking daily values, its monthly mean was interpolated from neighboring months by the compiler, R. Wolf. As we’ll be analyzing data that takes place in the 2000s and 2010s, this doesn’t pose an issue for us.
To get an idea of what our data set looks like, here are the first 6 observations in the data set.
| X1749.01.1749.042…96.7…1.0…..1.1 |
|---|
| 1749;02;1749.123; 104.3; -1.0; -1;1 |
| 1749;03;1749.204; 116.7; -1.0; -1;1 |
| 1749;04;1749.288; 92.8; -1.0; -1;1 |
| 1749;05;1749.371; 141.7; -1.0; -1;1 |
| 1749;06;1749.455; 139.2; -1.0; -1;1 |
| 1749;07;1749.538; 158.0; -1.0; -1;1 |
This data needs to be reformatted as it currently is 3299 observations of a single variable. To transform this, we will separate the variable delimited by semi-colons. The new data frame looks much better:
| Year | Month | Date | Mean.Sunspots | Mean.Standard.Deviation |
|---|---|---|---|---|
| 1749 | 1 | 1749 | 96.7 | -1 |
| 1749 | 2 | 1749 | 104.3 | -1 |
| 1749 | 3 | 1749 | 116.7 | -1 |
| 1749 | 4 | 1749 | 92.8 | -1 |
| 1749 | 5 | 1749 | 141.7 | -1 |
| 1749 | 6 | 1749 | 139.2 | -1 |
| Observation.Count | Provisional.Marker |
|---|---|
| -1 | 1 |
| -1 | 1 |
| -1 | 1 |
| -1 | 1 |
| -1 | 1 |
| -1 | 1 |
This data set records monthly mean and standard deviation for daily sunspot observations compiled into a per calendar month basis.
The link to the raw CSV file in posted on GitHub: https://raw.githubusercontent.com/JZhong01/STA321/main/Topic%206%20(Time%20Series)/Updated_Sunspot_Data.csv.
This data set is nearly comprehensive except for a missing monthly observation as well as some missing values towards the beginning of the data set. Since we’re concerned with the 200 most recent observations, this is not something that we’ll encounter.
Research Question: Can we effectively forecast future sunspot numbers based on historical trends and cyclic behavior observed in the data?
The objectives for this case study are as follows:
The hypotheses of the study are that:
Time series analysis is particularly well-suited for our sunspot number data set because it involves observations recorded sequentially over time. Time Series data is typically characterized by trends, seasonality, and error/residual components. By employing time series analysis, we aim to dissect these underlying patterns: the long-term trend shows the general movement of sunspot numbers over the years, while the seasonal component would reveal any regular fluctuations within a given period (e.g., an 11-year solar cycle). The irregular component, or error term, encapsulates random variations and anomalies not explained by trend or seasonality.
The code snippet provided defines a time series object, sunspots_ts, which represents the monthly mean sunspot numbers for a specific subset of the sunspots data frame. This subset spans from the 3100th to the 3300th observation, capturing the 201 most recent observations. The start parameter for the time series is dynamically set to correspond to the year and month of the 3100th observation in the sunspots data set, ensuring that the time series aligns with the actual chronological order of the data. The frequency parameter is set to 12 to reflect the monthly periodicity of the data. The resulting time series object is then plotted with the given labels and title, showing the variation in mean monthly sunspot numbers for the selected period.
sunspot_200 <- sunspots[3101:3300,]
sunspots_ts <- ts(sunspot_200$Mean.Sunspots, start=c(sunspots$Year[3100], sunspots$Month[3100]), frequency=12)
par(mar = c(2,2,2,2))
plot(sunspots_ts, main = "Mean Monthly sunspots from April 2007 to Present", ylab = "Month", xlab = "Mean Monthly sunspot")
Smoothing techniques are fundamental in time series analysis, primarily used to filter out noise and highlight underlying trends and patterns in the data. These methods work by averaging out variations to create a smoother line.
The primary goal is to simplify the complex, often noisy real-world data, making it easier to analyze and understand. They do this by adjusting for irregular components (e.g. random spikes or drops) so that these short-term fluctuations have less of an impact on the overall trend, seasonal cycle, or any other recurring patterns. In essence, it helps prepare the time series data for analysis by removing the noise caused by random fluctuations.
General Exponential Smoothing methods operate under the Exponential Smoothing (ETS) framework: a comprehensive approach to modeling time series data by encapsulating three fundamental components: error (E), trend (T), and seasonality (S). Each of these components can be modeled in different ways, either additively, multiplicatively, or left out (none), depending on the nature and behavior of the time series.
Error (E): The error component can be additive, where the variations around the trend-seasonal components are roughly constant over time, or multiplicative, where these variations change proportionally with the level of the time series.
Trend (T): The trend component can also be additive, suitable for a linear trend, or multiplicative, for an exponential trend. In addition, it can be ‘damped’ (either additively or multiplicatively), which anticipates the trend will level off in the future, or it can be non-existent if the series doesn’t display a trend.
Seasonality (S): Similarly, the seasonality component can be additive, where the seasonal fluctuations are consistent over time, or multiplicative, where they change in proportion to the level of the series. If there’s no seasonality in the data, this component is not included.
Simple Exponential Smoothing (SES) is an forecasting method designed to capture level patterns in time series data, or the consistent baseline value around which data points fluctuate. Unlike complex statistical models, SES relies on a single smoothing coefficient which ranges between 0 and 1.
This coefficient determines the weight of influence that each observation has on the forecast. The most recent observation has the most weight, and the weight decreases exponentially for older observations.
Smoothing Equation: It takes your most recent observation, weighted by \(\alpha\), then combines it with the observation before that, weighted with \((1 - \alpha)\).
Forecasting Equation: The forecast for the next period is the last smoothed value.
Error Correction Form: After calculating the forecast error term (actual - forecasted value), the smoothing coefficient \((\alpha)\) is applied to this error. This value is then added to the last smoothed value, giving yhou the new smoothed value.
Holt’s Linear Trend Model enhances the Simple Exponential Smoothing technique by incorporating a trend component. This modification is particularly useful for time series data that exhibit a linear trend over time. The model’s primary goal is to forecast data by recognizing not just the level of the series, but also its direction and rate of change.
Level: Similar to SES, we weight the most recent observation with the smoothing coefficient \(\alpha\) and weight the previous observation with (1 - \(\alpha\)). This time, it factors in last observation’s level and trend values.
Trend: This captures the estimated trend in the data at time t. It is a weighted average of the current estimated trend and the previous trend. It uses a different smoothing coefficient for trend, \(\beta^*\).
Error Term: The difference between the actual observation and the previous forecast.
Forecast function: The function add h trend term to the current level in order to predict h periods forward.
Damped trend methods are an extension of exponential smoothing techniques used in time series forecasting. The term “damped” refers to the gradual slowing down of the trend component over time. This means that instead of the trend continuing indefinitely in the same direction, it is expected to level off in the future.
The main reason for using a damped trend is to prevent over-forecasting. Trend methods can often be unreliably large or small if predicted too far in advance (e.g. a startup company’s sales forecast) - damping helps account for other factors which may eventually affect the trend.
The damped trend model introduces a damping parameter, commonly \(\phi\), which is a value between 0 and 1. This damping can be applied to additive or multiplicative trends.
Additive Damping: Since the trend value is added to the level, the damping parameter \(\phi\) is multiplied with the trend value. Predicting h periods forward makes the trend factor now: \((1 + \phi + \phi^2 + ... + \phi^h) * b_t\).
Multiplicative Damping: Since the trend value is multiplied with the level, the dampening parameter is applied exponentially to the trend value. This means that the new trend value for predicting h periods forwards is \(\beta_t^{(1 + \phi + \phi^2 + ... + \phi^h)}\).
Damping a trend value (additive or multiplicative) can affect any other exponential smoothing models whether that’s Holt’s or any of the Holt-Winter’s models.
Building upon the foundations laid by Holt’s Linear Trend Model and the concept of damped trends, the Holt-Winters method extends these ideas by addressing seasonality. This method refines the approach to time series forecasting by introducing additional coefficients that account for seasonal variations. The Holt-Winters method introduces a term for the seasonal component, \(s_t\), and a smoothing parameter for seasonality, \(\gamma\).
There are two different types of seasonal components:
Additive Seasonality: Used when seasonal variations are roughly constant throughout the series, the additive model assumes that the magnitude of seasonal fluctuations does not vary with the level of the time series. The seasonality component here is calculated as a series of fixed seasonal deviations from trend and level; its constant in magnitude regardless of magnitude.
Multiplicative Seasonality: Used when seasonal variations change in proportion to the level of the time series. It’s particularly useful when the seasonal pattern is amplified as the series grows, such as in sales data that show higher seasonal peaks with increased customer demand. The seasonality component here is expressed as a fraction or multiple of trend and level components; its magnitude changes proportionally to the level and scales with the time series.
As mentioned previously, you can include additive or multiplicative damping to the trend on top of having additive or multiplicative seasonality.
In order to forecast we begin with the division of the time series data into training and testing sets. The training set is used to fit the model, while the testing set is reserved to evaluate the model’s predictive performance. By using different sample sizes for the training set, we can assess how the amount of historical data impacts the model’s accuracy. For instance, larger training sets may capture more of the underlying patterns, potentially leading to more accurate forecasts. However, there’s also a risk of overfitting, where the model learns the noise in the training data rather than the underlying signal, which can negatively impact out-of-sample predictions.
We will reserve the last 12 months of data for testing our forecasts. The idea is that we want to assess the predictability based on how well it can forecast the 12 most recent observations.
We encountered two errors when creating Holt models of our time series data. These error both arose as a result of the training data containing a zero.
The first error occurred when trying to fit exponential damping to our models. Since the process of exponentially damping requires taking logarithms of the data and the log of zero is undefined, a Holt of the original data containing a zero can’t be run.
The second error occurred when calculating the Mean Percentage Error (MPE) of the models. When calculating the MPE, percentage errors are taken by finding the difference between actual and forecast value, all divided by the actual value then multiplied by 100. Since the actual value is 0, the resulting MPEs are all undefined.
To alleviate this issue, we’re imputing 0.01 in replace of the 0 value since it adequately explains how little in magnitude the observation was without being undefined. However, this is a temporary solution and I strongly discourage using both the MPE and MAPE for accuracy measurement since the result is unreliable.
train.data = sunspot_200[1:(200-12), 4]
## last 12 observations
test.data = sunspot_200[189:200,4]
train.data[28] = 0.01 # exponential = TRUE can't work with 0s, 28th value is a zero
sunspot.ts = ts(train.data, frequency = 12, start = c(2007, 4))
fit1 = ses(sunspot.ts, h=12)
fit2 = holt(sunspot.ts, initial="optimal", h=12)
fit3 = holt(sunspot.ts,damped=TRUE, h=12 )
fit4 = holt(sunspot.ts,exponential=TRUE, damped=TRUE, h =12)
fit5 = hw(sunspot.ts,h=12, seasonal="additive")
fit6 = hw(sunspot.ts,h=12, seasonal="multiplicative")
fit7 = hw(sunspot.ts,h=12, seasonal="additive",damped=TRUE)
fit8 = hw(sunspot.ts,h=12, seasonal="multiplicative",damped=TRUE)
To evaluate the performance of our exponential smoothing models, we will utilize a suite of accuracy metrics, each providing unique insights into the models’ forecasting capabilities.
As a result of our imputed value of 0.01, MPE and MAPE are very off, by the nature of how they’re calculated. As a result, we will exclude both accuracy measures from our accuracy table.
| ME | RMSE | MAE | MASE | ACF1 | |
|---|---|---|---|---|---|
| SES | 0.72 | 14.32 | 9.97 | 0.40 | 0.03 |
| Holt Linear | -0.07 | 14.31 | 10.01 | 0.40 | 0.04 |
| Holt Add. Damped | 0.81 | 14.30 | 9.89 | 0.40 | 0.05 |
| Holt Exp. Damped | 0.16 | 14.25 | 9.82 | 0.39 | 0.05 |
| HW Add. | 0.50 | 13.89 | 10.04 | 0.40 | 0.02 |
| HW Exp. | 1.02 | 35.01 | 26.58 | 1.06 | 0.82 |
| HW Add. Damp | 0.66 | 13.89 | 9.98 | 0.40 | 0.04 |
| HW Exp. Damp | 0.49 | 13.64 | 9.57 | 0.38 | 0.07 |
Holt-Winters with Exponential Damping appears to be the best model for minimizing the average error. Although it has an autocorrelation of 0.07, this is an acceptable amount because it indicates only a slight relationship between consecutive forecasting errors. It’s low enough that we can confidently say our model’s predictions are not systematically biased by errors in the previous forecast.
That being said, Holt-Winters with additive damping, Holt Winters with additive seasonality, and Holt Linear with Exponential Damping also performed similarly well on the training data. It is not yet conclusive what model we should use based off these accuracy measures.
We’ll now plot the historical data, our training data, alongside the forecasts of all 8 models we’ve produced.
The non-seasonal smoothing models depicted in the first graph suggests that Simple Exponential Smoothing (SES), Holt’s Linear, and both variants of Holt’s Damped models do not closely track the trend in sunspot data. Their forecasted values diverge notably from the observed series, particularly towards the end of the observation period, indicating their limited capability in capturing the underlying trend without considering seasonality.
In contrast, the second graph showcasing Holt-Winters models — incorporating seasonal adjustments — demonstrates a marked enhancement in fitting the observed sunspot data. The presence of seasonal components within these models evidently plays a crucial role in aligning the forecasts more closely with the actual data, underscoring the importance of seasonality in accurate forecasting of this series.
Now that we’ve chosen to use the Holt-Winters models, we’ll refit the model with all 200 observations for our final model’s updated smoothing parameters.
First, we’ll look at the accuracy of the exponentially smoothed models on the test data.
| MSE | MAPE | |
|---|---|---|
| SES | 962.75 | 23.56 |
| Holt.Add | 844.11 | 21.31 |
| Holt.Add.Damp | 820.72 | 20.94 |
| Holt.Exp | 651.98 | 18.48 |
| HW.Add | 1024.82 | 24.56 |
| HW.Exp | 66999.63 | 492.04 |
| HW.Add.Damp | 935.65 | 22.96 |
| HW.Exp.Damp | 1721.11 | 37.59 |
These accuracy measures differ drastically from the measures on the training data. Upon inspection of this data, it appears as though the Holt-Winters Model with exponential damping was overfitted to the training data - this means that the model learned the training data, including the noise and outliers, too well and performs poorly on unseen data.
Surprisingly, all 3 Holt models have a lower rate of error than both our additive Holt-Winters models despite the fact that we confirmed seasonality is significant from our previous plots.
One possible explanation is that our test data exhibited a pronounced and consistent trend that overshadowed the seasonal fluctuations. As a result, these non-seasonal models could inadvertently provide better short-term forecasts. This phenomenon occurs when the trend’s influence is larger in magnitude compared to seasonal patterns. We are currently near the solar maximum of the current solar cycle, which means increased sun activity and amount of sunspots, which means that trend has a stronger magnitude than seasonality.
We’ll explore these accuracy measurements by overlaying the actual data on a plot of the model-forecasted values.
Just to visually confirm our models’ accuracy, we overlaid the real data as black to compare with our models. We also added 2 of the Holt models to compare and better understand why they were numerically “more accurate”.
This plot helps to emphasize why the actual prediction capability of a model cannot be judged by accuracy measurements alone. The ability to track the actual data points, especially the peaks and troughs which are characteristic of sunspot activity, is crucial. The Holt models seem to underperform in this aspect, failing to predict the sharp increase towards the end of the series.
This graph gives us a clearer picture of how the forecasted values compare to the test data. It appears that while the Holt-Winters models underpredicted the forecasted values, it was able to mirror the seasonality relatively well.
Despite the better accuracy measurements of the Holt models, the visual analysis of the graph suggests that the Holt-Winters model with additive seasonality and damping is a more suitable choice. It offers a better blend of trend, level, and seasonal component adjustment that aligns more closely with the actual behavior of the sunspot series.
| x | |
|---|---|
| alpha | 0.6283305 |
| beta | 0.0067453 |
| gamma | 0.0001016 |
In our quest to forecast future sunspot numbers, we’ve delved into historical patterns and cyclical behaviors through a comprehensive time series analysis. Our case study’s objectives were twofold: to discern underlying trends in the last 200 recorded observations of sunspot activity (roughly past 15 years) and to leverage a range of forecasting models to project future numbers. We hypothesized that historical sunspot data would exhibit discernible periodicity and that changes in monthly sunspot variability could signal shifts in solar activity or observational precision.
Our initial findings, based on the training data, positioned the Holt-Winters Models as best types of models due to serial plots that showed that they modeled seasonality better. However, a subsequent examination of the test data unveiled that the Holt models had better accuracy measurements on the test data.
We observed that the non-seasonal smoothing models’ forecasts aligned more closely with the test data’s upward trend. This alignment, unexpected given the confirmed significance of seasonality from our previous plots, suggested that the current phase of the solar cycle — the solar maximum — exerted a dominant influence. The solar maximum, characterized by heightened solar activity and increased sunspot counts, likely amplified the trend component’s significance, overshadowing seasonal variations.
In light of these insights, we converged on the Holt-Winters model with additive seasonality and damping as our model of choice. Its proficient blend of trend, level, and seasonal adjustments promises a more accurate reflection of the sunspot series’ nuanced dynamics. The model’s final smoothing parameters, estimated using all 200 observations, are poised to refine our forecasts further.
Gupta, D. (2018). Applied Analytics through Case Studies Using SAS and R. APress.
Ciaburro G. (2018). Regression Analysis with R: Design and Develop Statistical Nodes to Identify Unique Relationships Within Data at Scale. Packt Publishing.
“Sunspot data from the World Data Center SILSO, Royal Observatory of Belgium, Brussels”
GegznaV. (1965, September 1). How to add more space between columns of knitr::kable() in Rstudio Notebooks?. Stack Overflow. https://stackoverflow.com/questions/57873293/how-to-add-more-space-between-columns-of-knitrkable-in-rstudio-notebooks
National Oceanic and Atmospheric Administration. (2014). Sunspots/Solar cycle. Sunspots/Solar Cycle | NOAA / NWS Space Weather Prediction Center. https://www.swpc.noaa.gov/phenomena/sunspotssolar-cycle