Time Series Analysis || An Introductory Start (Part 1)

1 Purpose of this notebook

The purpose of this notebook is to show how to use the timetk package in R to analyze time series data. When we think of time series analysis, the first thing that probably comes to mind is “forecasting”.

A brief introduction to the time series topic:

2 Sources

3 Loading and Gaining Insights

3.1 Loading Libraries for the Analysis and Importing the File

setwd("C:/Users/DellPC/Desktop/Corner/R_source_code/time_series/archive")

# Core  libraries 
library(tidyverse)
library(timetk)
library(tidyquant)

# Exploration 
library(DataExplorer)

# Visualization 
library(plotly)
library(highcharter)
library(ggthemes)

complete_stock_tbl <- read.csv("all_stocks_2006-01-01_to_2018-01-01.csv")

# Create sector by type of stock 
# AABA (Yahoo's final day of trading was in October)
# Yahoo was classified as Technology 

final_stock_tbl <- complete_stock_tbl %>% 
   janitor::clean_names() %>% 
   mutate(
    sector = case_when(
      name == "MMM" ~ "Business/Consumer Services",
      name == "AXP" ~ "Financial Services",
      name == "AAPL" ~ "Technology", 
      name == "BA" ~ "Industrial Goods",
      name == "CAT" ~ "Industrial Goods", 
      name == "CVX" ~ "Companies on the Energy Service",
      name == "CSCO" ~ "Technology",
      name == "KO" ~ "Consumer Goods", 
      name == "DIS" ~ "Media/Entertainment",
      name == "XOM" ~ "Companies on the Energy Service",
      name == "GE" ~ "Business/Consumer Services",
      name == "GS" ~ "Financial Services",
      name == "HD" ~ "Retail/Wholesale",
      name == "IBM" ~ "Business/Consumer Services",
      name == "INTC" ~ "Technology",
      name == "JNJ" ~ "Health Care/Life Sciences",
      name == "JPM" ~ "Financial Services",
      name == "MCD" ~ "Leisure/Arts/Hospitality",
      name == "MRK" ~ "Health Care/Life Sciences",
      name == "MSFT" ~ "Technology",
      name == "NKE" ~ "Consumer Goods",
      name == "PFE" ~ "Health Care/Life Sciences",
      name == "PG" ~ "Consumer Goods",
      name == "TRV" ~ "Financial Services",
      name == "UTX" ~ "Technology", 
      name == "UNH" ~ "Financial Services", 
      name == "VZ" ~ "Telecommunication Services", 
      name == "WMT" ~ "Retail/Wholesale",
      name == "GOOGL" ~ "Technology",
      name == "AMZN" ~ "Retail/Wholesale",
      TRUE ~ "Technology"
      
    )
  )
# Count by unique sector 

stocks_by_sector <- final_stock_tbl %>% 
   distinct(sector, name) %>% 
   group_by(sector) %>% 
   summarise(
      count = n()
   ) %>%
   ungroup() %>%
   arrange(desc(count))

stocks_by_sector
## # A tibble: 11 x 2
##    sector                          count
##    <chr>                           <int>
##  1 Technology                          7
##  2 Financial Services                  5
##  3 Business/Consumer Services          3
##  4 Consumer Goods                      3
##  5 Health Care/Life Sciences           3
##  6 Retail/Wholesale                    3
##  7 Companies on the Energy Service     2
##  8 Industrial Goods                    2
##  9 Leisure/Arts/Hospitality            1
## 10 Media/Entertainment                 1
## 11 Telecommunication Services          1
# Use palette_light() to know the color codes of tidyquant theme

# Treemap with Tidyquant colors 

treemap_by_sector <- stocks_by_sector %>%
   hctreemap2(group_vars = "sector",
              size_var = "count",
              layoutAlgorithm = "squarified") %>% 
   hc_plotOptions(treemap = list(colorByPoint = TRUE)) %>%
    hc_colors(c("#2C3E50", "#E31A1C", "#18BC9C", "#CCBE93", 
                "#A6CEE3", "#1F78B4", "#B2DF8A", "#FB9A99",
                "#FDBF6F", "#FF7F00", "#6A3D9A" )) %>%
    hc_colorAxis(dataClasses = color_classes(stocks_by_sector$sector)) %>%    
    hc_legend(enabled = FALSE) %>% 
  hc_title(text = "Number of Companies by Sector") %>% 
  hc_subtitle(text = "Dow Jones Index") %>% 
    hc_tooltip(pointFormat = "<b>{point.name}</b>:<br>
             Count: {point.value:,.0f}")

# Return the plot 

treemap_by_sector

3.2 Companies to Analyze

The following companies will be analyzed:

  • Apple (AAPL): Belongs to the Technology sector.
  • Caterpillar (CAT): Belongs to the Industrial Goods sector.
  • Google (GOOGL): Belongs to the Technology sector.
  • Exxon (XOM): Belongs to the Energy sector.
  • Goldman Sachs (GS): Belongs to the Financial Services sector.
  • Walmart (WMT): Belongs to the Retail/Wholesale sector.

3.3 Explaining Loess

As shown in the plots below, some companies are less volatile than others, and some have a steeper upward trend line than others. But have you ever asked yourself what the point of the trend line is and how it is calculated? LOESS stands for Locally Weighted Scatterplot Smoothing: think of it as a regression within a specific “window” (range of time) in which observations closest to a given “focal point” are given more weight than observations farther away. Within each window a weighted least squares fit (the best-fitting line) is performed. I would suggest looking at the explanation from StatQuest, since it explains in an easy-to-understand way how the smooth line is calculated (LOESS).
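To make the window idea concrete, here is a minimal sketch using base R's loess() on a made-up series (this is not the exact smoother that plot_time_series() fits, just an illustration of how the span, i.e. the size of the local window, changes the smooth line):

# Toy illustration of LOESS: a locally weighted regression inside a sliding window 
# The span argument controls the fraction of points used in each local fit 
set.seed(123)

toy_tbl <- tibble(
   t = 1:200,
   y = sin(t / 15) + rnorm(200, sd = 0.3)   # signal + noise
)

loess_wide   <- loess(y ~ t, data = toy_tbl, span = 0.75)   # wide window  -> smoother line
loess_narrow <- loess(y ~ t, data = toy_tbl, span = 0.15)   # narrow window -> wigglier line

plot(toy_tbl$t, toy_tbl$y, col = "grey60", pch = 16, cex = 0.6,
     xlab = "t", ylab = "y", main = "LOESS with a wide vs a narrow window")
lines(toy_tbl$t, predict(loess_wide),   col = "#2C3E50", lwd = 2)
lines(toy_tbl$t, predict(loess_narrow), col = "#18BC9C", lwd = 2)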

# Function to get the average price by week, month and quarter 
# If we use mean(close) daily we get the actual values, 
# since daily values are unique 

convert_date_ts <- function(data, unit = "day") {
   
   new_data <- data %>% 
      mutate(date = floor_date(date, unit = unit)) %>% 
      group_by(date, name) %>% 
      summarise(
         close = mean(close)
      ) %>% 
      ungroup() 
   
   return(new_data)
}

3.3.1 Day

# Pick by stock (6 stocks from different industries)

filtered_final_stock_tbl <- final_stock_tbl %>%
   filter(name %in% c("AAPL", "GS", "GOOGL", "XOM", "CAT", "WMT")) %>%
   mutate(
      date = ymd(date)
   )

# Daily 

filtered_final_stock_tbl %>% 
   convert_date_ts() %>% 
   group_by(name) %>% 
   plot_time_series(
      .date_var = date,
      .value = close,
      .facet_ncol = 2,
      .smooth_color = "#18BC9C",
      .smooth_size = 0.5
   )

3.3.2 Week

filtered_final_stock_tbl %>% 
   convert_date_ts(unit = "week") %>% 
   group_by(name) %>% 
   plot_time_series(
      .date_var = date, 
      .value = close, 
      .facet_ncol = 2, 
      .smooth_color = "#18BC9C",
      .smooth_size = 0.5
   )

3.3.3 Month

filtered_final_stock_tbl %>% 
   convert_date_ts(unit = "month") %>% 
   group_by(name) %>% 
   plot_time_series(
      .date_var = date, 
      .value = close, 
      .facet_ncol = 2, 
      .smooth_color = "#18BC9C",
      .smooth_size = 0.5
   )

3.3.4 Quarter

filtered_final_stock_tbl %>% 
   convert_date_ts(unit = "quarter") %>% 
   group_by(name) %>% 
   plot_time_series(
      .date_var = date,
      .value = close,
      .facet_ncol = 2,
      .smooth_color = "#18BC9C",
      .smooth_size = 0.5
   )

3.3.5 Year

filtered_final_stock_tbl %>% 
   convert_date_ts(unit = "year") %>% 
   group_by(name) %>% 
   plot_time_series(
      .date_var = date, 
      .value = close,
      .facet_ncol = 2,
      .smooth_color = "#18BC9C",
      .smooth_size = 0.5
   )

3.4 Determining Missing Values with DataExplorer

# Missing values on the 2017-07-31

filtered_final_stock_tbl %>% 
   plot_missing(
      ggtheme = theme_calc(),
      title = "Percent Missing Values By Column"
   )


3.5 Which Stocks Grew the Most Percent-Wise on Average

Note: As expected, the years 2007 and 2008 show the highest volatility, due to the financial crisis that occurred during those years, caused mainly by uncertainty in the markets.

  • Apple (AAPL): Not surprisingly, Apple comes out on top of the list when it comes to returns. However, once the time period moves to quarterly and yearly, the standard deviation is much higher, meaning this company faced a lot of volatility at certain periods of time.
  • Exxon (XOM): Had the lowest average stock return.
  • Goldman Sachs (GS): On average, this company had the highest level of volatility, probably mostly driven by the financial recession of 2007-2008.
# Daily stock return by company
# Using Tidyquant package 

daily_stock_return <- filtered_final_stock_tbl %>% 
   mutate(
      name = name %>% as.factor()
   ) %>% 
   group_by(name) %>%
   tq_transmute(
      select = close, 
      mutate_fun = periodReturn,
      period = "daily",
      col_rename = "stock_return"
   )

# Visualization with ggplot 

daily_stock_return %>% 
   ggplot(aes(x = date, y = stock_return)) + 
   geom_line(color = palette_light()[[1]]) + 
   facet_wrap(~name) +
   theme_calc() + 
   labs(
      title = "Stock Returns",
      subtitle = "by Company",
      y = "returns"
   ) + 
   theme(plot.title = element_text(hjust = 0.5))

return_volatility_tbl <- function(data, unit = "daily", ...){
   
   vars_col <- quos(...)
   
   tbl_return_volatility <- data %>% 
      mutate(
         name = name %>% as_factor()
      ) %>% 
      group_by(name) %>% 
      tq_transmute(
         select = close,
         mutate_fun = periodReturn,
         period = unit,
         col_rename = "stock_return"
      ) %>% 
      mutate(
         company = case_when(
            name == "AAPL" ~ "Apple",
            name == "GOOGL" ~ "Google",
            name == "XOM" ~ "Exxton",
            name == "GS" ~ "Goldman Sachs",
            name == "WMT" ~ "Walmart",
            TRUE ~ "Caterpillar"
         )
      ) %>% 
      group_by(!!!vars_col) %>% 
      summarise(
         `percent return` = mean(stock_return),
         `standard deviation` = sd(stock_return)
      ) %>%
      ungroup() %>% 
      arrange(desc(`percent return`)) 
   
   return(tbl_return_volatility)
}

3.5.1 Daily

library(DT)

filtered_final_stock_tbl %>%
   return_volatility_tbl(unit = "daily", company, name) %>%
   datatable()

3.5.2 Weekly

library(DT)

filtered_final_stock_tbl %>%
   return_volatility_tbl(unit = "weekly", company, name) %>%
   datatable()

3.5.3 Monthly

library(DT)

filtered_final_stock_tbl %>%
   return_volatility_tbl(unit = "monthly", company, name) %>%
   datatable()

3.5.4 Quarterly

library(DT)

filtered_final_stock_tbl %>%
   return_volatility_tbl(unit = "quarterly", company, name) %>%
   datatable()

3.5.5 Yearly

library(DT)

filtered_final_stock_tbl %>%
   return_volatility_tbl(unit = "yearly", company, name) %>%
   datatable()

3.6 Exploring Seasonality and Patterns

To better understand seasonality patterns, it is critical to see how the target variable behaves at certain periods of time. In this case we will analyze how the closing price of a company behaves on different days of the week and in different months, quarters and years. This could help us understand whether there are external factors that influence the price of the stock.

Summary:
  • Different periods in time: We will analyze the distribution across weekdays, weeks, months, quarters and years.
  • Exxon (XOM): For this case we will look at how the stock of Exxon behaves. We will see the distribution of the closing price at different points in time.
  • Weekdays: During weekdays we don't see an obvious difference in the distribution of the closing price.
  • Months: The months of June and December tend to have a slightly higher closing price.
  • Quarters: The second and fourth quarters have a slightly higher median in the distribution of the closing price.
  • Year: 2010 had the worst distribution of the closing price for the company (probably an external factor impacted the sector), while 2014 was the year in which the distribution of the closing price was the highest.
# Let's pick Exxon for seasonality 

filtered_final_stock_tbl %>% 
   filter(name == "XOM") %>% 
   plot_seasonal_diagnostics(
      .date_var = date,
      .value = close,
      .interactive = TRUE
   )

3.7 STL (Seasonal and Trend decomposition using LOESS)

When we talk about seasonal decomposition, think of it as a technique that decomposes a time series into three components: Trend, Seasonality and Residual (Remainder).

Explaining Trend: As we have seen before, the trend gives a general direction as to where the time series is heading. Remember, the trend is determined using “LOESS”, which is nothing more than a local regression within a specific window of time.

Explaining Seasonality: In the second panel we observe the seasonal swings across time. These are usually captured around the trend line we observe in the first panel. Capturing seasonal swings can help our forecast model if seasonality has an impact on the target variable, in this case the closing price. When the seasonality is constant, as shown below, we have an Additive Decomposition; an additive decomposition assumes that the swings in the second panel are the same every year (constant). When the seasonality is not constant we have a Multiplicative Decomposition, which is typically used when the seasonal swings grow (for example, exponentially) with the level of the series.

Combining Seasonality + Trend: By combining these two we try to replicate the actual time series; although they are not exactly the same, it is quite a close approximation.

Residuals: This is just the remainder between the actual data and the seasonal + trend components.

Formulas:
  • Additive Decomposition: \(Y_t = Seasonal_t + Trend_t + Remainder_t\)
  • Multiplicative Decomposition: \(Y_t = Seasonal_t \times Trend_t \times Remainder_t\)

Note: To see the seasonality, zoom in on the season panel for one of the stocks. You will clearly see an additive decomposition.
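As a sanity check of the additive formula above, here is a small sketch using base R's stl() on the monthly average close of XOM. This is separate from timetk's plot_stl_diagnostics() used below; it only verifies that trend + seasonal + remainder adds back to the observed series.

# Decompose the monthly XOM close with base R's stl() and check the additive identity 
xom_monthly <- filtered_final_stock_tbl %>% 
   filter(name == "XOM") %>% 
   convert_date_ts(unit = "month")

# Data starts in January 2006, monthly frequency
xom_ts  <- ts(xom_monthly$close, start = c(2006, 1), frequency = 12)
xom_stl <- stl(xom_ts, s.window = "periodic")

components <- as.data.frame(xom_stl$time.series)   # seasonal, trend, remainder

# For an additive decomposition the three parts reconstruct the observed series
all.equal(
   as.numeric(xom_ts),
   components$seasonal + components$trend + components$remainder
)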

3.7.1 Non-Interactive STL

# Function to create seasonal decomposition with loess for the stocks 

plot_seasonal_decomposition <- function(data, interactive = FALSE) {
   
   stl_plot <- data %>%
      group_by(name) %>%
      plot_stl_diagnostics(
         .date_var = date,
         .value = log1p(close),
         .interactive = interactive
      )
   
   return(stl_plot)
}

# Visualization 


filtered_final_stock_tbl %>% 
   filter(name %in% c("AAPL", "XOM", "GS")) %>%
   group_by(name) %>%
   plot_seasonal_decomposition()

3.8 Outlier Detection

Anomaly detection is mainly used to find events that were not expected in our time series. It can also happen that a mistake in the data caused an outlier. Detecting outliers gives us the possibility of investigating further whether unexpected events actually drove the time series to behave in an anomalous manner.
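To build intuition for what plot_anomaly_diagnostics() does below, here is a rough manual approximation (not the exact algorithm timetk uses): remove a smooth trend, then flag observations whose remainder is unusually far from the typical remainder. The 91-day smoothing period and the 1.5 * IQR cutoff are illustrative choices.

# Rough manual outlier flag for XOM: smooth trend + large-remainder rule 
xom_anomaly_check <- filtered_final_stock_tbl %>% 
   filter(name == "XOM") %>% 
   arrange(date) %>% 
   mutate(
      trend     = smooth_vec(close, period = 91),                       # LOESS-style trend
      remainder = close - trend,                                        # what the trend misses
      cutoff    = 1.5 * IQR(remainder, na.rm = TRUE),                   # illustrative threshold
      anomaly   = abs(remainder - median(remainder, na.rm = TRUE)) > cutoff
   )

# Dates flagged as potential anomalies
xom_anomaly_check %>% filter(anomaly) %>% select(date, close, remainder)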

filtered_final_stock_tbl %>%
   group_by(name) %>% 
   plot_anomaly_diagnostics(
      .date_var = date,
      .value = close,
      .facet_ncol = 2,
      .interactive = TRUE,
      .title = "Anomaly Diagnostics Dow Jones",
      .anom_color = "#FB3029",
      .max_anomalies = 0.07,
      .alpha = 0.05
   )

3.8.1 Investigate Anomaly of XOM (Exxon)

As we have seen previously, Exxon belongs to the energy industry, more specifically the oil industry.

filtered_final_stock_tbl %>%
   filter(name == "XOM") %>%
   plot_anomaly_diagnostics(
      .date_var = date,
      .value = close,
      .interactive = TRUE,
      .title = "Anomaly Diagnostics Exxon (XOM)",
      .anom_color = "#FB3029",
      .max_anomalies = 0.05,
      .alpha = 0.05
   )

4 Statistical Methods


4.0.1 Autocorrelation (ACF)

filtered_final_stock_tbl %>% 
   group_by(name) %>% 
   summarise_by_time(
      .date_var = date, 
      .value = mean(close, na.rm = TRUE),
      .by = "month"
   ) %>%
   tk_acf_diagnostics(
      .date_var = date,
      .value = log1p(.value)
   ) %>% 
   ggplot(aes(x = lag, y = ACF, color = name, group = name)) +
   # Add horizontal line at y = 0 
   geom_hline(yintercept = 0) + 
   # Plot autocorrelations 
   geom_point(size = 2) + 
   geom_segment(aes(xend = lag, yend = 0), size = 1) +
   # Add cutoffs 
   geom_line(aes(y = .white_noise_upper), color = "black", linetype = 2) + 
   geom_line(aes(y = .white_noise_lower), color = "black", linetype = 2) +
   # Add facets 
   facet_wrap(~name, ncol = 3) +
   # Aesthetics 
   expand_limits(y = c(-1, 1)) +
   scale_color_stata() +
   theme_calc() +
   theme(
      legend.position = "none",
      axis.text.x = element_text(angle = 45, hjust = 1),
      plot.title = element_text(hjust = 0.5)
   ) + 
   labs(
      title = "AutoCorrelation (ACF)",
      subtitle = "Dow Jones 30 Index"
   )

4.0.2 Partial Autocorrelation (PACF)

filtered_final_stock_tbl %>% 
   group_by(name) %>%
   summarise_by_time(
      .date_var = date,
      .value = mean(close, na.rm = TRUE),
      .by = "month"
   ) %>% 
  tk_acf_diagnostics(
    .date_var = date,
    .value = log1p(.value)
  ) %>% 
   ggplot(aes(x = lag, y = PACF, color = name, group = name)) + 
   # Add horizontal line at y = 0 
   geom_hline(yintercept = 0) + 
   # Plot partial autocorrelations 
   geom_point(size = 2) + 
   geom_segment(aes(xend = lag, yend = 0), size = 1) +
   # Add cutoffs 
   geom_line(aes(y = .white_noise_upper), color = "black", linetype = 2) +
   geom_line(aes(y = .white_noise_lower), color = "black", linetype = 2) + 
   # Add facets 
   facet_wrap(~name, ncol = 3) + 
   # Aesthetics 
   expand_limits(y = c(-1, 1)) + 
   scale_color_stata() + 
   theme_calc() + 
   theme(
      legend.position = "none",
      axis.text.x = element_text(angle = 45, hjust = 1),
      plot.title = element_text(hjust = 0.5)
   ) + 
   labs(
      title = "Partial AutoCorrelation (PACF)",
      subtitle = "Dow Jones 30 Index"
   )

tech_stocks_tbl <- filtered_final_stock_tbl %>% 
   filter(sector == "Technology") %>% 
   select(date, name, close) %>% 
   pivot_wider(names_from = name, values_from = close) %>% 
   summarise_by_time(
      .date_var = date,
      .by = "month",
      across(AAPL:GOOGL, .fns = mean)
   )

tech_stocks_tbl %>% 
   tk_acf_diagnostics(
      date,
      AAPL,
      .ccf_vars = GOOGL
   ) %>% 
   select(lag, CCF_GOOGL) %>% 
   datatable()

Summary:
  • CCF on AAPL Close Price: The closing price of AAPL is the target variable and the closing price of GOOGL is the independent variable.
  • High correlation at lag 0: This indicates that when the closing price of AAPL is high or increasing, the closing price of GOOGL also tends to be high or increasing on that same day, i.e. at lag 0.

# Plot
tech_stocks_tbl %>% 
  plot_acf_diagnostics(
    date,
    AAPL,
    .ccf_vars = GOOGL,
    .show_ccf_vars_only = TRUE,
    .interactive = FALSE, 
    .line_color = "black",
    .point_color = palette_light()[[2]],
    .line_size = 1.5,
    .title = "Cross Correlation of Technology Stocks"
  ) 

4.1 Simple Moving Averages

Moving averages help us smooth the time series and better understand where the trend is heading. They also help us detect seasonality in our time series.

The most common moving averages in the stock market:
  • 15-day MA
  • 30-day MA
  • 100-day MA
  • 200-day MA

Note: Notice how the longer the moving-average window, the smoother the line. Usually, we use moving averages to detect trends in our time series data.
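Before the full pipeline below, here is a tiny toy call to slidify_vec() (the function used throughout this section) showing how it turns mean() into a rolling mean; the numbers are made up and the commented output is approximate.

# 3-period rolling mean over a toy vector; .partial = TRUE keeps the edges 
# by averaging whatever points fall inside the incomplete window 
slidify_vec(
   .x = c(10, 12, 11, 15, 14, 16),
   .f = mean,
   .period  = 3,
   .align   = "right",
   .partial = TRUE
)
# roughly: 10.0 11.0 11.0 12.7 13.3 15.0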

gs_ma_sample <- filtered_final_stock_tbl %>% 
   select(date, close, name) %>%
   filter(name == "GS") %>% 
   filter_by_time(.date_var = date, .start_date = "2015", .end_date = "2017") %>% 
   mutate(
      adjusted_15_mm_close = slidify_vec(
         .x = close, 
         .period = 15,
         .f = mean, 
         na.rm = TRUE, 
         .align = "center", 
         .partial = TRUE
      ),
      adjusted_30_mm_close = slidify_vec(
         .x = close,
         .period = 30,
         .f = mean, 
         na.rm = TRUE,
         .align = "center",
         .partial = TRUE
      ),
      adjusted_100_mm_close = slidify_vec(
         .x = close, 
         .period = 100, 
         .f = mean, 
         na.rm = TRUE, 
         .align = "center",
         .partial = TRUE
      ),
      adjusted_300_mm_close = slidify_vec(
         .x = close,
         .period = 300, 
         .f = mean, 
         na.rm = TRUE, 
         .align = "center",
         .partial = TRUE
      )
   ) %>% 
   pivot_longer(contains("close"), names_repair = "unique") %>%
   rename(
      name = name...2,
      metric = name...3
   )
gs_ma_sample %>% 
   ggplot(aes(x = date, y = value)) + 
   geom_line(aes(color = metric), size = 1) + 
   theme_calc() + 
   theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5)) +
   guides(color = guide_legend(nrow = 2, byrow = TRUE)) + 
   labs(
      title = "Mopving Averager (MA) as a Metric",
      subtitle = str_glue("GS Stock 2015 - 2017 "),
      y = "Closing Price"
   ) + 
   scale_color_stata() +
   scale_x_date(expand = c(0,0))

mm_final_tbl <- filtered_final_stock_tbl %>% 
   select(date, close, name) %>% 
   mutate(
      adjusted_15_mm_close = slidify_vec(
         .x = close,
         .period = 15,
         .f = median,
         na.rm = TRUE,
         .align = "center",
         .partial = TRUE
      ),
      adjusted_30_mm_close = slidify_vec(
         .x = close,
         .period = 30,
         .f = median, 
         na.rm = TRUE, 
         .align = "center",
         .partial = TRUE
      ),
      adjusted_100_mm_close = slidify_vec(
         .x = close, 
         .period = 100,
         .f = median, 
         na.rm = TRUE, 
         .align = "center",
         .partial = TRUE
      ),
      adjusted_300_mm_close = slidify_vec(
         .x = close,
         .period = 300,
         .f = median,
         na.rm = TRUE,
         .align = "center",
         .partial = TRUE
      )
   ) %>% 
   pivot_longer(contains("close"), names_repair = "unique") %>%
   rename(
      name = name...2,
      metric = name...3
   )

# Visualization 

mm_final_tbl %>% 
   ggplot(aes(x = date, y = value, color = metric)) + 
   geom_line(size = 1, alpha = 0.8) + 
   facet_wrap(~name, scales = "free_y") + 
   theme_calc() + 
   scale_color_stata() + 
   labs(
      title = "Moving Medians",
      subtitle = "Dow Jones 30",
      y = "Price"
   ) +
   theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5),
         legend.text = element_text(size = 10)) + 
   guides(color = guide_legend(nrow = 2, byrow = TRUE))

4.2 Moving Average in Forecasting

The goal of this section is to showcase the scenarios in which moving averages can be useful and those in which they may not be.

Summary:
  • Easy to understand: Moving averages are easy to understand, so not many technicalities are needed to grasp the concept.
  • Expected volatility: When volatility is expected, you run the risk of large residuals when using moving averages.
  • Low volatility: Moving averages are useful during periods of low volatility, as they are able to capture the short-term trend.
gs_filtered_data <- filtered_final_stock_tbl %>%
   filter(name == "GS") %>% 
   filter_by_time(
      .date_var = date,
      .start_date = "2009",
      .end_date = "2012"
   ) %>% 
   select(date, name, close)

ma_60day_forecasting <- gs_filtered_data %>% 
   mutate(
      mavg_60_days = slidify_vec(
         .x = close,
         .f = mean,
         na.rm = TRUE,
         .period = 60,
         .align = "right"
      )
   ) %>% 
   bind_rows(
      future_frame(., .length_out = 60)
   ) %>%
   fill(
      mavg_60_days,
      .direction = "down"
   )

# Forecasting plot 

ma_60day_forecasting %>% 
   pivot_longer(close:mavg_60_days, names_repair = "unique") %>%
   rename(
      "type" = "name...3"
          ) %>%
   ggplot(aes(x=date, y = value)) + 
   geom_line(aes(color = type), size = 1) +
   theme_calc() +
   scale_color_stata() +
   labs(
      title = "60 day Moving Average",
      subtitle = "MA as a Forecasting Technique"
   )

4.3 Residuals of Forecasting with Moving Average

In this scenario we can see that there was an uplift that the moving average was not able to capture. Moving averages are therefore good during times of stability, since they are able to capture the trend. However, during times of volatility you will miss a lot of the uplift in stock prices, or you will not be able to hedge against downward trends! It is therefore important to explore other interesting techniques.

actual_60days <- filtered_final_stock_tbl %>%
   filter(name == "GS") %>%
   filter_by_time(
      .start_date = "2013",
      .end_date = "2013-03-01"
   ) %>% 
   select(date, name, close)


plot_series <- ma_60day_forecasting %>%
   filter_by_time(
      .start_date = "2013-01-02",
      .end_date = "2013-03-01"
   ) %>%
   left_join(actual_60days, by = "date") %>%
   select(date, mavg_60_days, close.y) %>%
   rename(
      "actual_close_price" = "close.y"
   ) %>%
   pivot_longer(mavg_60_days:actual_close_price) %>%
   drop_na() %>%
   ggplot(aes(x = date, y = value)) + 
   geom_line(aes(color = name), size = 1) +
   geom_point(aes(color = name), size = 2) +
   theme_calc() +
   scale_color_stata() + 
   labs(title = "Residuals from 60 day Moving average",
        subtitle = "GS stock Moving Average") +
   theme(legend.position = "bottom") + 
   guides(color = guide_legend(nrow = 1, byrow = TRUE))

plot_series
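To put a number on those residuals, here is a quick check using the ma_60day_forecasting and actual_60days objects built above; mae and rmse are just illustrative column names.

# Average size of the forecast error over the 60-day window 
ma_60day_forecasting %>% 
   filter_by_time(.start_date = "2013-01-02", .end_date = "2013-03-01") %>% 
   left_join(actual_60days, by = "date") %>% 
   mutate(residual = close.y - mavg_60_days) %>%    # actual close minus the frozen 60-day MA
   drop_na(residual) %>% 
   summarise(
      mae  = mean(abs(residual)),
      rmse = sqrt(mean(residual^2))
   )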

4.4 Rolling Correlation

Another way of finding a relationship between two time series is through the use of rolling correlations. What we do here is compute the correlation between two time series at different points in time and determine whether the relationship between the two is significant.

Summary:
  • AAPL vs GOOGL: Is the relationship between the two technology stocks significant enough?
  • Rolling window of 1 year: A rolling window of one year (12 months) will be used on the monthly time series we created. Do the two stocks tend to move together?
  • End result: In most cases we see a rolling correlation greater than 0.5, which is high enough to suggest that these two stocks move together.


# We will use the slidify function
# Turns function into a rolling / sliding function 


# We will get the monthly average for each stock 
# And find if there is a correlation 

wider_format_stocks_tbl <- filtered_final_stock_tbl %>%
   filter(name %in% c("AAPL", "GOOGL")) %>%
   select(date, close, name) %>%
   mutate(
      date = floor_date(date, unit = "month")
   ) %>%
   group_by(date, name) %>% 
   summarise(
      close_avg = mean(close)
   ) %>% ungroup() %>% 
   pivot_wider(names_from = name, values_from = close_avg)

rolling_cor <- slidify(
  .f = ~ cor(.x, .y, use = "pairwise.complete.obs"), 
  .period = 12,
  .align  = "right",
  .partial = FALSE
)
wider_format_stocks_tbl %>%
   mutate(
      rolling_cor_close_price = rolling_cor(AAPL, GOOGL)
   ) %>%
   ggplot(aes(x = date, y = rolling_cor_close_price)) + 
   geom_line(size = 1, color = "#045884") +
   theme_calc() + 
   scale_color_stata() + 
   geom_smooth(method = "loess", color = "red") +
   scale_x_date(expand = c(0,0)) + 
   labs(
      title = "Rolling Correlation in the Technology Sectior",
      subtitle = "AAPL vs GOOGL",
      y = "Rolling Correlation",
      x = "Date"
   ) + 
   theme(plot.title = element_text(hjust = 0.5)) + 
   scale_x_date(expand = c(0,0))

4.5 Linear Regression and the Use of Log Transformations

The main reason we use log transformations in time series analysis is to reduce the impact of extreme outliers, thereby stabilizing the variance across the time series. Log transformations are helpful in models like linear regression, in which an outlier can have a significant impact on the fit (the squared errors). Although in this case we don't have “significant” outliers, log transformations can still help mitigate their impact.

Log Transformation Formula:

\(\large y = \log_b(x)\)
\(\large b^y = x\)

Summary:
  • log1p function: This function takes log(1 + x), i.e. a log transformation after adding 1 to the value. We use it to mitigate the impact of values that are zero (which is not actually the case here).
  • Feature importance: Notice how most of the month terms are significant and contribute to the high adjusted R-squared, meaning the month has a significant impact on where the closing price of a stock is heading. This leads to the question: is there some sort of seasonality?
  • Getting from log values back to actual values: To transform back to the actual values we use the expm1 function, which gives us the original values back (see the quick round trip below). We definitely want to know this whenever we are forecasting!
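Here is a quick round trip between the log scale and the price scale, using a small made-up vector of closing prices:

# log1p() takes log(1 + x), so it is safe even when a value is 0; expm1() undoes it 
close_prices <- c(0, 120.5, 240.3)

log_scale <- log1p(close_prices)   # to the log scale
back      <- expm1(log_scale)      # back to the original scale

all.equal(close_prices, back)      # TRUE: we recover the original values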
# We will pick GS as it is one of the ones with the highest volatility

# GS linear regression model using log transformations
log_gs_lm_plot <- filtered_final_stock_tbl %>% 
  filter(name == "GS") %>% 
  filter_by_time(
    .start_date = "2016-01-01",
    .end_date = "2017-01-01"
  ) %>% 
  plot_time_series_regression(
    .date_var = date,
    .formula = log1p(close) ~ as.numeric(date) + 
      wday(date, label = TRUE) + 
      month(date, label = TRUE),
    .interactive=FALSE,
    .show_summary = TRUE,
    .title = "Log transformations in Linear Regression Models"
  )
## 
## Call:
## stats::lm(formula = .formula, data = df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.114757 -0.018312  0.001485  0.020826  0.105742 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -9.3685671  4.3585702  -2.149  0.03262 *  
## as.numeric(date)              0.0008535  0.0002566   3.326  0.00102 ** 
## wday(date, label = TRUE).L   -0.0018868  0.0050972  -0.370  0.71159    
## wday(date, label = TRUE).Q    0.0004393  0.0050410   0.087  0.93063    
## wday(date, label = TRUE).C   -0.0015788  0.0049871  -0.317  0.75184    
## wday(date, label = TRUE)^4    0.0031286  0.0049447   0.633  0.52753    
## month(date, label = TRUE).L   0.0212167  0.0939807   0.226  0.82159    
## month(date, label = TRUE).Q   0.2516840  0.0078410  32.098  < 2e-16 ***
## month(date, label = TRUE).C   0.1107287  0.0078093  14.179  < 2e-16 ***
## month(date, label = TRUE)^4   0.0581218  0.0078034   7.448 1.79e-12 ***
## month(date, label = TRUE)^5  -0.0174280  0.0077749  -2.242  0.02592 *  
## month(date, label = TRUE)^6   0.0410392  0.0078056   5.258 3.28e-07 ***
## month(date, label = TRUE)^7  -0.0437489  0.0077170  -5.669 4.19e-08 ***
## month(date, label = TRUE)^8  -0.0437221  0.0076994  -5.679 3.99e-08 ***
## month(date, label = TRUE)^9   0.0105762  0.0077319   1.368  0.17266    
## month(date, label = TRUE)^10  0.0131328  0.0076836   1.709  0.08874 .  
## month(date, label = TRUE)^11 -0.0166546  0.0077086  -2.161  0.03174 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03551 on 235 degrees of freedom
## Multiple R-squared:  0.933,  Adjusted R-squared:  0.9285 
## F-statistic: 204.6 on 16 and 235 DF,  p-value: < 2.2e-16
log_gs_lm_plot

Transforming back to the original values using expm1 function:

log_transformed_back_plot <- filtered_final_stock_tbl %>% 
  filter(name == "GS") %>% 
  filter_by_time(
    .start_date = "2016-01-01",
    .end_date = "2017-01-01"
  ) %>% 
  plot_time_series_regression(
    .date_var = date,
    .formula = expm1(log1p(close)) ~ as.numeric(date) + 
      wday(date, label = TRUE) + 
      month(date, label = TRUE),
    .interactive=FALSE,
    .show_summary = TRUE,
    .title = "Transforming to actual close price using expm1 function"
  )
## 
## Call:
## stats::lm(formula = .formula, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.0393  -3.0182   0.3419   3.3421  18.1935 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.804e+03  7.652e+02  -3.664 0.000306 ***
## as.numeric(date)              1.751e-01  4.505e-02   3.885 0.000133 ***
## wday(date, label = TRUE).L   -3.287e-01  8.948e-01  -0.367 0.713718    
## wday(date, label = TRUE).Q    3.947e-02  8.850e-01   0.045 0.964464    
## wday(date, label = TRUE).C   -2.123e-01  8.755e-01  -0.242 0.808638    
## wday(date, label = TRUE)^4    5.540e-01  8.681e-01   0.638 0.523987    
## month(date, label = TRUE).L  -2.077e+00  1.650e+01  -0.126 0.899952    
## month(date, label = TRUE).Q   4.869e+01  1.377e+00  35.375  < 2e-16 ***
## month(date, label = TRUE).C   2.412e+01  1.371e+00  17.596  < 2e-16 ***
## month(date, label = TRUE)^4   1.309e+01  1.370e+00   9.555  < 2e-16 ***
## month(date, label = TRUE)^5  -1.112e+00  1.365e+00  -0.814 0.416229    
## month(date, label = TRUE)^6   6.582e+00  1.370e+00   4.803 2.78e-06 ***
## month(date, label = TRUE)^7  -7.165e+00  1.355e+00  -5.289 2.82e-07 ***
## month(date, label = TRUE)^8  -7.312e+00  1.352e+00  -5.410 1.55e-07 ***
## month(date, label = TRUE)^9   1.411e+00  1.357e+00   1.040 0.299594    
## month(date, label = TRUE)^10  1.905e+00  1.349e+00   1.412 0.159213    
## month(date, label = TRUE)^11 -2.533e+00  1.353e+00  -1.872 0.062459 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.234 on 235 degrees of freedom
## Multiple R-squared:  0.9425, Adjusted R-squared:  0.9386 
## F-statistic: 240.6 on 16 and 235 DF,  p-value: < 2.2e-16
log_transformed_back_plot

4.6 Splitting Time Series for Forecasting

When we talk about implementing machine learning models we usually talk about splitting the dataset into:
  • Training sets
  • Validation sets
  • Testing Sets

You can compose the training set of sixty percent of the dataset, with twenty percent for validation and twenty percent for testing. However, this distribution should rely on your own judgement. Coming back to our time series topic, the same applies when implementing machine learning models on a time series: one of the first steps is splitting your time series data. Let's look at how we do that with an example. I will split the time series into a training and a testing set so you can see it visually.
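Before the timetk split below, here is a minimal sketch of the purely chronological 60/20/20 split described above, done with nothing more than dplyr (the proportions are illustrative, not the split used in the rest of this notebook):

# Chronological 60/20/20 split of one stock's history; rows are ordered by date, 
# so slicing by position keeps earlier data for training and later data for testing 
gs_tbl <- filtered_final_stock_tbl %>% 
   filter(name == "GS") %>% 
   arrange(date)

n_rows <- nrow(gs_tbl)

train_tbl      <- gs_tbl %>% slice(1:floor(0.6 * n_rows))
validation_tbl <- gs_tbl %>% slice((floor(0.6 * n_rows) + 1):floor(0.8 * n_rows))
test_tbl       <- gs_tbl %>% slice((floor(0.8 * n_rows) + 1):n_rows)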

# Test is 25% of our data
split_df <- filtered_final_stock_tbl %>%
  filter(name == "GS") %>% 
  time_series_split(
    date_var = date,
    assess = "3 year",
    cumulative = TRUE
  )



splitted_data <- split_df %>% 
  tk_time_series_cv_plan() 


splitted_data %>% 
  ggplot() + geom_line(aes(x=date, y=close, color=.key), size=1) +
  theme_calc() + scale_color_stata() +
  geom_rect(data=data.frame(xmin=as.Date(c("2015-01-02")),
                               xmax=as.Date(c("2017-12-29")),
                               ymin=-Inf,
                               ymax=Inf),
               aes(xmin=xmin,xmax=xmax,ymin=ymin,ymax=ymax),
               fill=palette_light()[2],alpha=0.3) + 
  geom_label(
    label = "Testing Set",
    vjust = 0.5,
    aes(x=as.Date(c("2016-07-02")),
        y = 100)) + 
       geom_rect(data=data.frame(xmin=as.Date(c("2006-01-03")),
                               xmax=as.Date(c("2014-12-31")),
                               ymin=-Inf,
                               ymax=Inf),
               aes(xmin=xmin,xmax=xmax,ymin=ymin,ymax=ymax),
               fill=palette_light()[3],alpha=0.3) + scale_x_date(expand = c(0,0)) + 
    geom_label(
    label = "Training Set",
    vjust = 0.5,
    aes(x=as.Date(c("2012-07-02")),
        y = 220)) + 
  labs(
    title = "Splitting Time Series into Training and Testing Sets",
    subtitle = "Dow Jones 30"
  ) + theme(plot.title = element_text(hjust=0.5), legend.position = "bottom") + 
  guides(color = guide_legend(nrow=1, byrow=TRUE)) + 
  scale_y_continuous(labels = scales::dollar)

4.7 Autoregressive “Integrated” Moving Average (ARIMA)

Note: Still working on the explanation

Most of us have heard of the concept of an Autoregressive Integrated Moving Average (ARIMA) model. Before understanding ARIMA we should answer the following questions:

  • What is stationarity? There is a concept called “stationarity”, which in its simplest form is reached by differencing, i.e. taking the difference t - (t-1). For example, in this case, the closing price on Jan 2nd 2016 minus the closing price on Jan 1st 2016, and so on (see the short sketch after this list).
  • What is the effect of making a dataset stationary? A stationary dataset has a constant mean and variance, which reduces the impact of the trend and seasonality that affect the time series at different points in time. The mean and variance should be constant, meaning the time series should look basically the same at any point in time.
  • Do we need to implement differencing ourselves when using ARIMA? The answer is no: ARIMA does the differencing automatically. That is what the “I” in ARIMA stands for; it implements differencing to make the time series stationary and avoid the impact of trend and seasonality.
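Here is a short sketch of first differencing by hand on the GS close price, which is the transformation the “I” (integrated) part of ARIMA applies for you; the differenced series hovers around a roughly constant mean instead of trending:

# First differences by hand: close_t minus close_(t-1) 
gs_diff_tbl <- filtered_final_stock_tbl %>% 
   filter(name == "GS") %>% 
   arrange(date) %>% 
   mutate(close_diff = close - lag(close)) %>% 
   drop_na(close_diff)

gs_diff_tbl %>% 
   ggplot(aes(x = date, y = close_diff)) + 
   geom_line(color = "#2C3E50") + 
   geom_hline(yintercept = 0, color = "#E31A1C") + 
   theme_calc() + 
   labs(
      title = "First Differences of the GS Close Price",
      y = "close(t) - close(t-1)"
   )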

What does it mean when we have a flat line in our forecast with ARIMA? (As shown below)

[Figure: flat-line ARIMA forecast (source: SAS blog)]

When we have a flat line in our forecast, it means the model was unable to capture either a trend or seasonality; the model could not capture the different dynamics within our time series. Even though it does not look as nice as we would want a forecast to be (moving up or down), this type of forecast can actually give much better results than just adding random noise to our forecast.


Are there ways to improve the output of the forecast? Definitely: there are transformations and feature engineering steps that will help your model capture any unforeseen trend or seasonality in your time series. I plan to talk more in depth about ARIMA and other types of transformations and feature engineering in another notebook, since I would like to keep this one as simple as possible for the moment.

5 Conclusion

I hope this notebook gave you a brief overview of the topics we will cover in this time series session. If you enjoyed going through the notebook, I am preparing another one that will help the Kaggle community start deploying time series models for forecasting. If you are interested, you can visit Time Series || Feature Engineering Concepts.