The Final Exam is out of 35 points. Please see all 3 questions given below.

Question #1 and #2 are 15 points. Question #3 is 5 points. Each question will be graded as given in the rubrics on syllabus. You should knit this file with your name written in “author” section at top; and knit as html file. Submit the html file as your Exam. Do not submit pdf or word file. No points will be assigned if it is not in html format or it is submitted late. It should be your own work.

Please answer each question in the given space. The codes in one R-chunk for each question. Make sure it is running. Make sure your data is also readable if you import Excel data (you can post it on Google Doc and give a link, for example). I could be able to run your codes on my machine, if needed. In each question, Part a is for your codes; and Part b is for summary of what you found out in your codes. It should show your ability to comment on result of codes and make your audience to understand about your findings technically and non-technically.

# Do NOT use forecast package. You should use fpp3 package only.
# install below packages first:

library(fpp3)
library(tidyverse)
library(Quandl)
library(quantmod)

library(TSstudio) # install first.
library(dygraphs)

library(pacman)
        pacman::p_load(forecast)
        p_load(tidyverse, lubridate, rio, pdfetch, tidyverse, readxl)
        library(dygraphs)
Sys.time()
## [1] "2024-12-05 19:28:56 EST"

1.

  • Select a “monthly” time series from Quandl or FRED by yourself. it should be specific to you only. (If it is not monthly, make it monthly series, by using collapse() command).

  • Transform the data into monthly tsibble format. Make sure the series is tsibble to be able to use many fpp3 package commands to forecast with graphs.

  • After plotting graph of the series, comment on the graph. Explain what you determine visually.

  • Then divide the series into two parts as training and test data set (you can use subset() or window()command for dividing data into training and test). You decide how many observations you need to leave as test data set (e.g. 80% of test, 20% training data set).

  • Identify an appropriate ARIMA model for the training data. You should do residual diagnostics of your ARIMA model to make sure the residuals are white noise. If needed, take logarithms of the series, or make any needed transformation. It is your call!

  • Then explain which ARIMA model you would chose. Use your chosen ARIMA model to forecast for whole data set. Calculate the accuracy of the forecasted model. what you get as RMSE? Explain what RMSE means.

  • Plot the ARIMA model with forecasted value and test data value together. What do you see visually? (e.g. you can use autoplot() and autolayer() commands)

  • Check the accuracy of the test data series.

  • Identify an ETS for the training data set. Then use the model to forecast for the whole data set. Calculate the accuracy of the model for the test data set. Compare the accuracy values from ETS model with ARIMA model. Comment on which model is better to use for ahead forecasting. Which one would you choose? Explain briefly why and how did you come on that conclusion.

  • By using the model you chose, forecast for next 1 year ahead. What would be the first month forecasted value ahead? Plot the 1 year ahead forecasted values with the original data set in a graph, give with a confidence interval for forecasted values.

# 1a.Answer: Type here all your codes, only.

Quandl.api_key("Nuxz2Cztezy_CAnaZUd2")

data <- Quandl.datatable('QDL/JODI', energy = 'OIL', code = "TPIMKD", country = "AUS")

str(data)
## 'data.frame':    273 obs. of  6 variables:
##  $ energy : chr  "OIL" "OIL" "OIL" "OIL" ...
##  $ code   : chr  "TPIMKD" "TPIMKD" "TPIMKD" "TPIMKD" ...
##  $ country: chr  "AUS" "AUS" "AUS" "AUS" ...
##  $ date   : Date, format: "2024-09-30" "2024-08-31" ...
##  $ value  : chr  "900.0000" "786.0000" "1046.0000" "936.0000" ...
##  $ notes  : num  1 1 1 1 1 1 1 1 1 1 ...
data$value <- as.numeric(data$value)

data_ts <- data %>%
  mutate(date = as.Date(date)) %>%
  select(date, value) %>%
  arrange(date) %>%
  as_tsibble(index = date)

data_ts %>%
  autoplot(value) +
  ggtitle("Oil Product Imports in Australia (2000-2025)") +
  xlab("Date") +
  ylab("Total Primary Oil Products Imports\n(Thousand Barrels per Month)")

n <- nrow(data_ts)
split_index <- floor(0.8 * n)
split_date <- data_ts$date[split_index]

train_data <- data_ts %>%
  filter(date <= split_date)
test_data <- data_ts %>%
  filter(date > split_date)

arima_model <- auto.arima(train_data$value)

summary(arima_model)
## Series: train_data$value 
## ARIMA(0,1,1) with drift 
## 
## Coefficients:
##           ma1   drift
##       -0.8310  2.5686
## s.e.   0.0408  0.5780
## 
## sigma^2 = 2444:  log likelihood = -1153.95
## AIC=2313.9   AICc=2314.01   BIC=2324.04
## 
## Training set error measures:
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.1821896 49.09719 37.32613 -2.515751 12.21501 0.7552795
##                    ACF1
## Training set 0.04177879
checkresiduals(arima_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1) with drift
## Q* = 16.488, df = 9, p-value = 0.05737
## 
## Model df: 1.   Total lags used: 10
forecast_arima <- forecast(arima_model, h = nrow(test_data))

forecast_plot_data_arima <- data.frame(
  Date = c(train_data$date, test_data$date),
  Actual = c(train_data$value, test_data$value),
  Forecast = c(rep(NA, nrow(train_data)), as.numeric(forecast_arima$mean))
)

ggplot(forecast_plot_data_arima, aes(x = Date)) +
  geom_line(aes(y = Actual, color = "Actual Values"), size = 1) +
  geom_line(aes(y = Forecast, color = "Forecasted Values"), size = 1, linetype = "dashed") +
  ggtitle("ARIMA Model: Forecast vs Test Data") +
  xlab("Date") +
  ylab("Total Primary Oil Products Imports\n(Thousand Barrels per Month)") +
  scale_color_manual(values = c("Actual Values" = "blue", "Forecasted Values" = "red")) +
  theme_minimal()

rmse_arima <- sqrt(mean((test_data$value - as.numeric(forecast_arima$mean))^2))
cat("ARIMA Model RMSE:", round(rmse_arima, 2), "\n")
## ARIMA Model RMSE: 117.32
ets_model <- ets(train_data$value)

summary(ets_model)
## ETS(A,A,N) 
## 
## Call:
## ets(y = train_data$value)
## 
##   Smoothing parameters:
##     alpha = 0.1664 
##     beta  = 1e-04 
## 
##   Initial states:
##     l = 101.0016 
##     b = 2.5588 
## 
##   sigma:  49.581
## 
##      AIC     AICc      BIC 
## 2881.756 2882.039 2898.679 
## 
## Training set error measures:
##                      ME     RMSE      MAE      MPE   MAPE      MASE       ACF1
## Training set -0.1771752 49.12407 37.37543 -2.57467 12.286 0.7562771 0.04489873
checkresiduals(ets_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,A,N)
## Q* = 16.387, df = 10, p-value = 0.08908
## 
## Model df: 0.   Total lags used: 10
forecast_ets <- forecast(ets_model, h = nrow(test_data))

forecast_plot_data_ets <- data.frame(
  Date = c(train_data$date, test_data$date),
  Actual = c(train_data$value, test_data$value),
  Forecast = c(rep(NA, nrow(train_data)), as.numeric(forecast_ets$mean))
)

ggplot(forecast_plot_data_ets, aes(x = Date)) +
  geom_line(aes(y = Actual, color = "Actual Values"), size = 1) +
  geom_line(aes(y = Forecast, color = "Forecasted Values"), size = 1, linetype = "dashed") +
  ggtitle("ETS Model: Forecast vs Test Data") +
  xlab("Date") +
  ylab("Total Primary Oil Products Imports\n(Thousand Barrels per Month)") +
  scale_color_manual(values = c("Actual Values" = "blue", "Forecasted Values" = "red")) +
  theme_minimal()

rmse_ets <- sqrt(mean((test_data$value - as.numeric(forecast_ets$mean))^2))
cat("ETS Model RMSE:", round(rmse_ets, 2), "\n")
## ETS Model RMSE: 117.54
cat("Model RMSE Comparison:\n")
## Model RMSE Comparison:
cat("ARIMA RMSE:", round(rmse_arima, 2), "\n")
## ARIMA RMSE: 117.32
cat("ETS RMSE:", round(rmse_ets, 2), "\n")
## ETS RMSE: 117.54
best_model <- ifelse(rmse_arima < rmse_ets, "ARIMA", "ETS")
cat("The better model based on RMSE is:", best_model, "\n")
## The better model based on RMSE is: ARIMA
if (best_model == "ARIMA") {
  forecast_1_year <- forecast(arima_model, h = 12)
} else {
  forecast_1_year <- forecast(ets_model, h = 12)
}

first_month_forecast <- forecast_1_year$mean[1]
cat("First month's forecasted value:", round(first_month_forecast, 2), "\n")
## First month's forecasted value: 654.17
autoplot(forecast_1_year) +
  ggtitle(paste(best_model, "Model: 12-Month Forecast")) +
  xlab("Date") +
  ylab("Total Primary Oil Products Imports\n(Thousand Barrels per Month)")

1b.

Write one paragraph (max 250 words) below to explain all the results from your commands. Make sure you explain the steps you did and the results you obtained with your own words.

Answer: Australia’s monthly oil imports from 2000 to 2025 reveal a compelling data-driven narrative of seasonal fluctuations and an upward trend. Initial analysis showed recurring peaks and troughs, suggesting cyclical changes driven by factors like demand or external market conditions. To delve deeper, I split the data into training (80%) and testing (20%) sets and tested two forecasting models: ARIMA and ETS.

Both models effectively captured the patterns, but ARIMA (0,1,1) performed slightly better with a root mean square error (RMSE) of 117.32, compared to 117.54 for ETS (A,A,N). RMSE measures the average magnitude of error between predicted and actual values, with lower values indicating a more accurate model. Diagnostic checks supported this result, with ARIMA residuals randomly distributed around zero, a near-normal histogram, and no significant autocorrelation, confirming that the model captured the data’s structure well. While the ETS model showed similar residual diagnostics, its slightly higher RMSE made ARIMA the more accurate option.

Using the ARIMA model, I forecasted oil imports for the next 12 months, with the first month’s predicted value at approximately 654.17 thousand barrels. This analysis highlights the seasonal nature of oil imports, with predictable cyclical patterns and upward momentum over time. By understanding these trends, policymakers and stakeholders can make informed decisions, anticipate fluctuations, and plan effectively for future demand. This forecast provides a reliable foundation for addressing challenges and opportunities in the evolving energy market, emphasizing the value of data-driven insights in shaping sustainable and efficient strategies for the future.

2.

  • Obtain Consumer Price Index (CPI) from either Quandl or FRED.

  • By using pdfetch or quantmod packages, obtain “monthly” S&P 500 Price Index. (or transform the series to monthly by collapse() command)

  • Combine these two series together monthly, start and end dates should be the same. (You can use Excel file if it is easier, then import to R)

  • Calculate the real value of S &P 500 Price Index.

  • Compare the nominal value and real value of the index. Plot them together to see visually. Is there any significant difference you can see visually? Explain why.

  • Forecast the real value of S&P 500 Price Index for the next 6 months.

  • Would you prefer to invest in the index based on the forecasted value?

  • Calculate the return of nominal S&P 500 Price Index. Plot the return and command on what you see visually. Does it look stationary? Forecast for the next 6 month ahead by any model you would choose? Explain the data and forecasted values.

# 2a.Answer: Type you all your codes here, only.

cpi <- pdfetch_FRED("CPIAUCSL")
cpi <- data.frame(Date = index(cpi), CPI = coredata(cpi))

getSymbols("^GSPC", src = "yahoo", from = "2010-01-01")
## [1] "GSPC"
sp500 <- to.monthly(GSPC, indexAt = "lastof", OHLC = FALSE)
sp500 <- data.frame(Date = index(sp500), SP500 = coredata(sp500)[, "GSPC.Close"])

combined_data <- merge(cpi, sp500, by = "Date", all = FALSE)

combined_data <- combined_data %>%
  mutate(Real_SP500 = SP500 / (CPIAUCSL / CPIAUCSL[1]))

ggplot(combined_data, aes(x = Date)) +
  geom_line(aes(y = SP500, color = "Nominal")) +
  geom_line(aes(y = Real_SP500, color = "Real")) +
  labs(title = "Nominal vs Real S&P 500 Price Index",
       y = "Index Value",
       color = "Legend") +
  theme_minimal()

real_sp500_ts <- ts(combined_data$Real_SP500, start = c(2010, 1), frequency = 12)
real_sp500_forecast <- forecast(auto.arima(real_sp500_ts), h = 6)

autoplot(real_sp500_forecast) +
  labs(title = "Forecast of Real S&P 500 Price Index",
       y = "Real S&P 500 Value") +
  theme_minimal()

print(real_sp500_forecast)
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Nov 2024       3958.582 3828.751 4088.414 3760.022 4157.143
## Dec 2024       3974.768 3805.386 4144.149 3715.721 4233.814
## Jan 2025       3990.953 3789.648 4192.259 3683.083 4298.824
## Feb 2025       4007.139 3778.321 4235.957 3657.192 4357.086
## Mar 2025       4023.324 3769.964 4276.685 3635.843 4410.806
## Apr 2025       4039.510 3763.783 4315.237 3617.821 4461.198
combined_data <- combined_data %>%
  mutate(Returns = c(NA, diff(log(SP500))))

ggplot(combined_data, aes(x = Date, y = Returns)) +
  geom_line() +
  labs(title = "Returns of Nominal S&P 500 Price Index",
       y = "Returns") +
  theme_minimal()

returns_ts <- ts(na.omit(combined_data$Returns), start = c(2010, 1), frequency = 12)
returns_model <- auto.arima(returns_ts)
checkresiduals(returns_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(2,0,0)[12] with non-zero mean
## Q* = 27.202, df = 21, p-value = 0.1643
## 
## Model df: 3.   Total lags used: 24
returns_forecast <- forecast(auto.arima(returns_ts), h = 6)

# Plot forecasted returns
autoplot(returns_forecast) +
  labs(title = "Forecast of S&P 500 Returns",
       y = "Returns") +
  theme_minimal()

print(returns_forecast)
##          Point Forecast       Lo 80      Hi 80       Lo 95      Hi 95
## Oct 2024    0.015445911 -0.03799838 0.06889020 -0.06629008 0.09718190
## Nov 2024    0.002156063 -0.05211242 0.05642455 -0.08084043 0.08515255
## Dec 2024    0.013941241 -0.04032725 0.06820973 -0.06905525 0.09693773
## Jan 2025    0.005126297 -0.04914219 0.05939479 -0.07787019 0.08812279
## Feb 2025    0.011226122 -0.04304237 0.06549461 -0.07177037 0.09422261
## Mar 2025    0.011255968 -0.04301252 0.06552446 -0.07174052 0.09425246

2b.

Write one paragraph (max 250 words) below to explain all the commands and their results. Make sure you explain the steps you did and the results you obtained with your own words.

Answer: I began by retrieving the Consumer Price Index (CPI) from FRED and the monthly S&P 500 Price Index from Yahoo Finance. Aligning the datasets by date, I calculated the real S&P 500 Price Index by adjusting nominal values for inflation. The “Nominal vs Real S&P 500 Price Index” graph revealed a striking divergence: nominal values consistently exceeded real values due to inflation, emphasizing the importance of using real data for accurate analysis. Forecasting the real index using an ARIMA model, I visualized the results in the “Forecast of Real S&P 500 Price Index” graph, which projected a steady upward trend over the next six months, signaling continued growth even after inflation adjustments.

Turning to returns, I computed the logarithmic returns of the nominal index. The “Returns of Nominal S&P 500 Price Index” graph illustrated historical fluctuations with periods of heightened volatility. Using an ARIMA model, I forecasted returns for the next six months, as shown in the “Forecast of S&P 500 Returns” graph, which predicted mild variability with no extreme movements. To validate my model, I analyzed residuals through the “Residual Diagnostics from ARIMA Model” graph, confirming stationarity with uncorrelated, white-noise residuals (p-value = 0.1643).

These analyses tell a compelling story. Inflation significantly impacts nominal values, underscoring the importance of real adjustments. With forecasts showing stable growth and moderate returns, the outlook appears favorable for investment, supported by a robust, well-validated model that captures both historical and future trends effectively

3.

  • Use the monthly S&P 500 Price Index you obtained in #2.

  • Make sure you create and use a tsibble data set.

  • Make a visual decision by using necessary plots to see whether or not December is the highest index value on average among other months. What do you see? Comment.

  • Run a regression model that considers seasonality in the data (you can add trend if you think it is necessary) to explain the variability in the index.

  • Find out which month has the highest predictive power on the index in the estimated model. Is the coefficient statistically significant at 5%?

  • Is your model as a whole significant at 5%. Comment on the estimated model.

  • Which month would you invest on S&P 500 Price Index ETF’s, if you would? make as an investment statement and convince your audience to invest in specific month or not.

# 3a.Answer: Type all your codes here, only.

sp500 <- sp500 %>%
  mutate(
    Month = month(Date, label = TRUE, abbr = FALSE),
    Year = year(Date)                                
  )

average_monthly <- sp500 %>%
  group_by(Month) %>%
  summarize(Average_SP500 = mean(SP500))

ggplot(average_monthly, aes(x = Month, y = Average_SP500, fill = Month)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(
    title = "Average S&P 500 Index Value by Month",
    x = "Month",
    y = "Average Index Value"
  ) +
  theme_minimal()

sp500$Month <- factor(sp500$Month, levels = month.name)

model <- lm(SP500 ~ Month, data = sp500)
summary(model)
## 
## Call:
## lm(formula = SP500 ~ Month, data = sp500)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1743.2 -1064.5  -283.2  1079.4  3169.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2737.818     97.895  27.967   <2e-16 ***
## Month.L      321.235    339.117   0.947    0.345    
## Month.Q       35.214    339.117   0.104    0.917    
## Month.C       27.417    339.117   0.081    0.936    
## Month^4       36.000    339.117   0.106    0.916    
## Month^5        4.549    339.117   0.013    0.989    
## Month^6      -46.602    339.117  -0.137    0.891    
## Month^7      -63.271    339.117  -0.187    0.852    
## Month^8       -1.431    339.117  -0.004    0.997    
## Month^9       26.764    339.117   0.079    0.937    
## Month^10      11.524    339.117   0.034    0.973    
## Month^11     -10.729    339.117  -0.032    0.975    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1313 on 168 degrees of freedom
## Multiple R-squared:  0.005848,   Adjusted R-squared:  -0.05925 
## F-statistic: 0.08984 on 11 and 168 DF,  p-value: 0.9999
anova_summary <- anova(model)
anova_summary
## Analysis of Variance Table
## 
## Response: SP500
##            Df    Sum Sq Mean Sq F value Pr(>F)
## Month      11   1704625  154966  0.0898 0.9999
## Residuals 168 289800377 1725002
if (anova_summary$`Pr(>F)`[1] < 0.05) {
  cat("The model is statistically significant at the 5% level.\n")
} else {
  cat("The model is NOT statistically significant at the 5% level.\n")
}
## The model is NOT statistically significant at the 5% level.

3b.

Write a paragraph (max 250 words) below to explain all the commands and their results. Make sure you explain the steps you did and the results you obtained with your own words.

Answer: In this analysis, I examined seasonal patterns in the S&P 500 index to identify the best-performing month and assess its potential relevance for investment strategies. I began by calculating the average index value for each month and visualizing the results in a bar plot. The analysis revealed that December had the highest average index value at 2917.385, closely followed by November at 2894.262. These findings suggested a potential seasonal trend favoring the end-of-year months, particularly December, as historically strong performers.

To further investigate, I constructed a regression model using months as categorical variables to evaluate whether these observed differences were statistically significant. The model aimed to quantify the contribution of each month to the variability in the S&P 500 index. However, the results indicated a very low R-squared value (0.004), demonstrating that monthly seasonality accounted for almost none of the variability in the index. Additionally, none of the months, including December, had statistically significant coefficients at the 5% level. An ANOVA test further confirmed that the model as a whole was not statistically significant (F-statistic p-value = 1.00).

These findings suggest that while December has historically been the best-performing month in terms of average index value, the lack of statistical significance implies that monthly seasonality alone is not a reliable predictor of S&P 500 performance. For a robust investment strategy, it is essential to incorporate broader market trends, macroeconomic factors, and other predictive indicators alongside historical patterns.

---
title: "ECON6635 Final Exam - Spring 2024"
subtitle: "Dr. Esin Cakan"
author: "Student: Esham Bin Rashid || erash2@unh.newhaven.edu"
date: "`r Sys.Date()`"
output: openintro::lab_report
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

```

The Final Exam is out of 35 points. Please see all 3 questions given below.

Question #1 and #2 are 15 points. Question #3 is 5 points.
Each question will be graded as given in the rubrics on syllabus. You should knit this file with your name written in "author" section at top; and knit as html file. Submit the html file as your Exam. Do not submit pdf or word file. No points will be assigned if it is not in html format or it is submitted late. It should be your own work. 

Please answer each question in the given space. The codes in one R-chunk for each question. Make sure it is running. Make sure your data is also readable if you import Excel data (you can post it on Google Doc and give a link, for example). I could be able to run your codes on my machine, if needed. In each question, Part a is for your codes; and Part b is for summary of what you found out in your codes. It should show your ability to comment on result of codes and make your audience to understand about your findings technically and non-technically.


```{r message=FALSE, warning=FALSE}
# Do NOT use forecast package. You should use fpp3 package only.
# install below packages first:

library(fpp3)
library(tidyverse)
library(Quandl)
library(quantmod)

library(TSstudio) # install first.
library(dygraphs)

library(pacman)
        pacman::p_load(forecast)
        p_load(tidyverse, lubridate, rio, pdfetch, tidyverse, readxl)
        library(dygraphs)
Sys.time()

```

## 1.	

- Select a "monthly" time series from Quandl or FRED by yourself. it should be specific to you only. (If it is not monthly, make it monthly series, by using collapse() command). 

- Transform the data into monthly tsibble format. Make sure the series is tsibble to be able to use many fpp3 package commands to forecast with graphs.

- After	plotting graph of the series, comment on the graph. Explain what you determine visually.

- Then divide the series into two parts as training and test data set (you can use subset() or window()command for dividing data into training and test). You decide how many observations you need to leave as test data set (e.g. 80% of test, 20% training data set). 

- Identify an appropriate ARIMA model for the training data. You should do residual diagnostics of your ARIMA model to make sure the residuals are white noise. If needed, take logarithms of the series, or make any needed transformation. It is your call! 

- Then explain which ARIMA model you would chose. Use your chosen ARIMA model to forecast for whole data set. Calculate the accuracy of the forecasted model. what you get as RMSE? Explain what RMSE means.

- Plot the ARIMA model with forecasted value and test data value together. What do you see visually? (e.g. you can use autoplot() and autolayer() commands) 

- Check the accuracy of the test data series.

- Identify an ETS for the training data set. Then use the model to forecast for the whole data set. Calculate the accuracy of the model for the test data set. Compare the accuracy values from ETS model with ARIMA model. Comment on which model is better to use for ahead forecasting. Which one would you choose? Explain briefly why and how did you come on that conclusion. 

- By using the model you chose, forecast for next 1 year ahead. What would be the first month forecasted value ahead? Plot the 1 year ahead forecasted values with the original data set in a graph, give with a confidence interval for forecasted values.

```{r warning=FALSE}
# 1a.Answer: Type here all your codes, only.

Quandl.api_key("Nuxz2Cztezy_CAnaZUd2")

data <- Quandl.datatable('QDL/JODI', energy = 'OIL', code = "TPIMKD", country = "AUS")

str(data)

data$value <- as.numeric(data$value)

data_ts <- data %>%
  mutate(date = as.Date(date)) %>%
  select(date, value) %>%
  arrange(date) %>%
  as_tsibble(index = date)

data_ts %>%
  autoplot(value) +
  ggtitle("Oil Product Imports in Australia (2000-2025)") +
  xlab("Date") +
  ylab("Total Primary Oil Products Imports\n(Thousand Barrels per Month)")

n <- nrow(data_ts)
split_index <- floor(0.8 * n)
split_date <- data_ts$date[split_index]

train_data <- data_ts %>%
  filter(date <= split_date)
test_data <- data_ts %>%
  filter(date > split_date)

arima_model <- auto.arima(train_data$value)

summary(arima_model)

checkresiduals(arima_model)

forecast_arima <- forecast(arima_model, h = nrow(test_data))

forecast_plot_data_arima <- data.frame(
  Date = c(train_data$date, test_data$date),
  Actual = c(train_data$value, test_data$value),
  Forecast = c(rep(NA, nrow(train_data)), as.numeric(forecast_arima$mean))
)

ggplot(forecast_plot_data_arima, aes(x = Date)) +
  geom_line(aes(y = Actual, color = "Actual Values"), size = 1) +
  geom_line(aes(y = Forecast, color = "Forecasted Values"), size = 1, linetype = "dashed") +
  ggtitle("ARIMA Model: Forecast vs Test Data") +
  xlab("Date") +
  ylab("Total Primary Oil Products Imports\n(Thousand Barrels per Month)") +
  scale_color_manual(values = c("Actual Values" = "blue", "Forecasted Values" = "red")) +
  theme_minimal()

rmse_arima <- sqrt(mean((test_data$value - as.numeric(forecast_arima$mean))^2))
cat("ARIMA Model RMSE:", round(rmse_arima, 2), "\n")

ets_model <- ets(train_data$value)

summary(ets_model)

checkresiduals(ets_model)

forecast_ets <- forecast(ets_model, h = nrow(test_data))

forecast_plot_data_ets <- data.frame(
  Date = c(train_data$date, test_data$date),
  Actual = c(train_data$value, test_data$value),
  Forecast = c(rep(NA, nrow(train_data)), as.numeric(forecast_ets$mean))
)

ggplot(forecast_plot_data_ets, aes(x = Date)) +
  geom_line(aes(y = Actual, color = "Actual Values"), size = 1) +
  geom_line(aes(y = Forecast, color = "Forecasted Values"), size = 1, linetype = "dashed") +
  ggtitle("ETS Model: Forecast vs Test Data") +
  xlab("Date") +
  ylab("Total Primary Oil Products Imports\n(Thousand Barrels per Month)") +
  scale_color_manual(values = c("Actual Values" = "blue", "Forecasted Values" = "red")) +
  theme_minimal()

rmse_ets <- sqrt(mean((test_data$value - as.numeric(forecast_ets$mean))^2))
cat("ETS Model RMSE:", round(rmse_ets, 2), "\n")

cat("Model RMSE Comparison:\n")
cat("ARIMA RMSE:", round(rmse_arima, 2), "\n")
cat("ETS RMSE:", round(rmse_ets, 2), "\n")

best_model <- ifelse(rmse_arima < rmse_ets, "ARIMA", "ETS")
cat("The better model based on RMSE is:", best_model, "\n")

if (best_model == "ARIMA") {
  forecast_1_year <- forecast(arima_model, h = 12)
} else {
  forecast_1_year <- forecast(ets_model, h = 12)
}

first_month_forecast <- forecast_1_year$mean[1]
cat("First month's forecasted value:", round(first_month_forecast, 2), "\n")

autoplot(forecast_1_year) +
  ggtitle(paste(best_model, "Model: 12-Month Forecast")) +
  xlab("Date") +
  ylab("Total Primary Oil Products Imports\n(Thousand Barrels per Month)")

```


### 1b. 
Write one paragraph (max 250 words) below to explain all the results from your commands. Make sure you explain the steps you did and the results you obtained with your own words.

Answer: Australia’s monthly oil imports from 2000 to 2025 reveal a compelling data-driven narrative of seasonal fluctuations and an upward trend. Initial analysis showed recurring peaks and troughs, suggesting cyclical changes driven by factors like demand or external market conditions. To delve deeper, I split the data into training (80%) and testing (20%) sets and tested two forecasting models: ARIMA and ETS.

Both models effectively captured the patterns, but ARIMA (0,1,1) performed slightly better with a root mean square error (RMSE) of 117.32, compared to 117.54 for ETS (A,A,N). RMSE measures the average magnitude of error between predicted and actual values, with lower values indicating a more accurate model. Diagnostic checks supported this result, with ARIMA residuals randomly distributed around zero, a near-normal histogram, and no significant autocorrelation, confirming that the model captured the data’s structure well. While the ETS model showed similar residual diagnostics, its slightly higher RMSE made ARIMA the more accurate option.

Using the ARIMA model, I forecasted oil imports for the next 12 months, with the first month’s predicted value at approximately 654.17 thousand barrels. This analysis highlights the seasonal nature of oil imports, with predictable cyclical patterns and upward momentum over time. By understanding these trends, policymakers and stakeholders can make informed decisions, anticipate fluctuations, and plan effectively for future demand. This forecast provides a reliable foundation for addressing challenges and opportunities in the evolving energy market, emphasizing the value of data-driven insights in shaping sustainable and efficient strategies for the future.


## 2.	

- Obtain Consumer Price Index (CPI) from either Quandl or FRED. 

- By using pdfetch or quantmod packages, obtain "monthly" S&P 500 Price Index. (or transform the series to monthly by collapse() command)

- Combine these two series together monthly, start and end dates should be the same. (You can use Excel file if it is easier, then import to R)

- Calculate the real value of S &P 500 Price Index.

- Compare the nominal value and real value of the index. Plot them together to see visually. Is there any significant difference you can see visually? Explain why.

- Forecast the real value of S&P 500 Price Index for the next 6 months.

- Would you prefer to invest in the index based on the forecasted value? 

- Calculate the return of nominal S&P 500 Price Index. Plot the return and command on what you see visually. Does it look stationary? Forecast for the next 6 month ahead by any model you would choose? Explain the data and forecasted values.



```{r warning=FALSE}
# 2a.Answer: Type you all your codes here, only.

cpi <- pdfetch_FRED("CPIAUCSL")
cpi <- data.frame(Date = index(cpi), CPI = coredata(cpi))

getSymbols("^GSPC", src = "yahoo", from = "2010-01-01")

sp500 <- to.monthly(GSPC, indexAt = "lastof", OHLC = FALSE)
sp500 <- data.frame(Date = index(sp500), SP500 = coredata(sp500)[, "GSPC.Close"])

combined_data <- merge(cpi, sp500, by = "Date", all = FALSE)

combined_data <- combined_data %>%
  mutate(Real_SP500 = SP500 / (CPIAUCSL / CPIAUCSL[1]))

ggplot(combined_data, aes(x = Date)) +
  geom_line(aes(y = SP500, color = "Nominal")) +
  geom_line(aes(y = Real_SP500, color = "Real")) +
  labs(title = "Nominal vs Real S&P 500 Price Index",
       y = "Index Value",
       color = "Legend") +
  theme_minimal()

real_sp500_ts <- ts(combined_data$Real_SP500, start = c(2010, 1), frequency = 12)
real_sp500_forecast <- forecast(auto.arima(real_sp500_ts), h = 6)

autoplot(real_sp500_forecast) +
  labs(title = "Forecast of Real S&P 500 Price Index",
       y = "Real S&P 500 Value") +
  theme_minimal()

print(real_sp500_forecast)

combined_data <- combined_data %>%
  mutate(Returns = c(NA, diff(log(SP500))))

ggplot(combined_data, aes(x = Date, y = Returns)) +
  geom_line() +
  labs(title = "Returns of Nominal S&P 500 Price Index",
       y = "Returns") +
  theme_minimal()


returns_ts <- ts(na.omit(combined_data$Returns), start = c(2010, 1), frequency = 12)
returns_model <- auto.arima(returns_ts)
checkresiduals(returns_model)
returns_forecast <- forecast(auto.arima(returns_ts), h = 6)

# Plot forecasted returns
autoplot(returns_forecast) +
  labs(title = "Forecast of S&P 500 Returns",
       y = "Returns") +
  theme_minimal()

print(returns_forecast)

```


### 2b.
Write one paragraph (max 250 words) below to explain all the commands and their results. Make sure you explain the steps you did and the results you obtained with your own words.

Answer: I began by retrieving the Consumer Price Index (CPI) from FRED and the monthly S&P 500 Price Index from Yahoo Finance. Aligning the datasets by date, I calculated the real S&P 500 Price Index by adjusting nominal values for inflation. The "Nominal vs Real S&P 500 Price Index" graph revealed a striking divergence: nominal values consistently exceeded real values due to inflation, emphasizing the importance of using real data for accurate analysis. Forecasting the real index using an ARIMA model, I visualized the results in the "Forecast of Real S&P 500 Price Index" graph, which projected a steady upward trend over the next six months, signaling continued growth even after inflation adjustments.

Turning to returns, I computed the logarithmic returns of the nominal index. The "Returns of Nominal S&P 500 Price Index" graph illustrated historical fluctuations with periods of heightened volatility. Using an ARIMA model, I forecasted returns for the next six months, as shown in the "Forecast of S&P 500 Returns" graph, which predicted mild variability with no extreme movements. To validate my model, I analyzed residuals through the "Residual Diagnostics from ARIMA Model" graph, confirming stationarity with uncorrelated, white-noise residuals (p-value = 0.1643).

These analyses tell a compelling story. Inflation significantly impacts nominal values, underscoring the importance of real adjustments. With forecasts showing stable growth and moderate returns, the outlook appears favorable for investment, supported by a robust, well-validated model that captures both historical and future trends effectively


## 3.	

- Use the monthly S&P 500 Price Index you obtained in #2. 

- Make sure you create and use a tsibble data set. 

- Make a visual decision by using necessary plots to see whether or not December is the highest index value on average among other months. What do you see? Comment. 

- Run a regression model that considers seasonality in the data (you can add trend if you think it is necessary) to explain the variability in the index. 

- Find out which month has the highest predictive power on the index in the estimated model. Is the coefficient statistically significant at 5%? 

- Is your model as a whole significant at 5%. Comment on the estimated model.

- Which month would you invest on S&P 500 Price Index ETF's, if you would? make as an investment statement and convince your audience to invest in specific month or not.

```{r warning=FALSE}

# 3a.Answer: Type all your codes here, only.

sp500 <- sp500 %>%
  mutate(
    Month = month(Date, label = TRUE, abbr = FALSE),
    Year = year(Date)                                
  )

average_monthly <- sp500 %>%
  group_by(Month) %>%
  summarize(Average_SP500 = mean(SP500))

ggplot(average_monthly, aes(x = Month, y = Average_SP500, fill = Month)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(
    title = "Average S&P 500 Index Value by Month",
    x = "Month",
    y = "Average Index Value"
  ) +
  theme_minimal()

sp500$Month <- factor(sp500$Month, levels = month.name)

model <- lm(SP500 ~ Month, data = sp500)
summary(model)

anova_summary <- anova(model)
anova_summary

if (anova_summary$`Pr(>F)`[1] < 0.05) {
  cat("The model is statistically significant at the 5% level.\n")
} else {
  cat("The model is NOT statistically significant at the 5% level.\n")
}

```

### 3b. 
Write a paragraph (max 250 words) below to explain all the commands and their results. Make sure you explain the steps you did and the results you obtained with your own words.

Answer: In this analysis, I examined seasonal patterns in the S&P 500 index to identify the best-performing month and assess its potential relevance for investment strategies. I began by calculating the average index value for each month and visualizing the results in a bar plot. The analysis revealed that December had the highest average index value at 2917.385, closely followed by November at 2894.262. These findings suggested a potential seasonal trend favoring the end-of-year months, particularly December, as historically strong performers.

To further investigate, I constructed a regression model using months as categorical variables to evaluate whether these observed differences were statistically significant. The model aimed to quantify the contribution of each month to the variability in the S&P 500 index. However, the results indicated a very low R-squared value (0.004), demonstrating that monthly seasonality accounted for almost none of the variability in the index. Additionally, none of the months, including December, had statistically significant coefficients at the 5% level. An ANOVA test further confirmed that the model as a whole was not statistically significant (F-statistic p-value = 1.00).

These findings suggest that while December has historically been the best-performing month in terms of average index value, the lack of statistical significance implies that monthly seasonality alone is not a reliable predictor of S&P 500 performance. For a robust investment strategy, it is essential to incorporate broader market trends, macroeconomic factors, and other predictive indicators alongside historical patterns.
