Time Series Analysis || An Introductory Start (Part 1)
The purpose of this notebook is to show how to use the timetk package in R to analyze time series data. When we think of time series analysis, the first thing that probably comes to mind is "forecasting".
Without understanding how your time series behaves, it is hard to judge whether a forecast based on past events is accurate.
setwd("C:/Users/DellPC/Desktop/Corner/R_source_code/time_series/archive")
# Core libraries
library(tidyverse)
library(timetk)
library(tidyquant)
# Exploration
library(DataExplorer)
# Visualization
library(plotly)
library(highcharter)
library(ggthemes)
complete_stock_tbl <- read.csv("all_stocks_2006-01-01_to_2018-01-01.csv")
# Create sector by type of stock
# AABA (Yahoo's final day of trading was in October)
# Yahoo was classified as Technology
final_stock_tbl <- complete_stock_tbl %>%
janitor::clean_names() %>%
mutate(
sector = case_when(
name == "MMM" ~ "Business/Consumer Services",
name == "AXP" ~ "Financial Services",
name == "AAPL" ~ "Technology",
name == "BA" ~ "Industrial Goods",
name == "CAT" ~ "Industrial Goods",
name == "CVX" ~ "Companies on the Energy Service",
name == "CSCO" ~ "Technology",
name == "KO" ~ "Consumer Goods",
name == "DIS" ~ "Media/Entertainment",
name == "XOM" ~ "Companies on the Energy Service",
name == "GE" ~ "Business/Consumer Services",
name == "GS" ~ "Financial Services",
name == "HD" ~ "Retail/Wholesale",
name == "IBM" ~ "Business/Consumer Services",
name == "INTC" ~ "Technology",
name == "JNJ" ~ "Health Care/Life Sciences",
name == "JPM" ~ "Financial Services",
name == "MCD" ~ "Leisure/Arts/Hospitality",
name == "MRK" ~ "Health Care/Life Sciences",
name == "MSFT" ~ "Technology",
name == "NKE" ~ "Consumer Goods",
name == "PFE" ~ "Health Care/Life Sciences",
name == "PG" ~ "Consumer Goods",
name == "TRV" ~ "Financial Services",
name == "UTX" ~ "Technology",
name == "UNH" ~ "Financial Services",
name == "VZ" ~ "Telecommunication Services",
name == "WMT" ~ "Retail/Wholesale",
name == "GOOGL" ~ "Technology",
name == "AMZN" ~ "Retail/Wholesale",
TRUE ~ "Technology"
)
)

# Count by unique sector
stocks_by_sector <- final_stock_tbl %>%
distinct(sector, name) %>%
group_by(sector) %>%
summarise(
count = n()
) %>%
ungroup() %>%
arrange(desc(count))
stocks_by_sector

## # A tibble: 11 x 2
## sector count
## <chr> <int>
## 1 Technology 7
## 2 Financial Services 5
## 3 Business/Consumer Services 3
## 4 Consumer Goods 3
## 5 Health Care/Life Sciences 3
## 6 Retail/Wholesale 3
## 7 Companies on the Energy Service 2
## 8 Industrial Goods 2
## 9 Leisure/Arts/Hospitality 1
## 10 Media/Entertainment 1
## 11 Telecommunication Services 1
# Use palette_light() to know the color codes of tidyquant theme
# Treemap with Tidyquant colors
treemap_by_sector <- stocks_by_sector %>%
hctreemap2(group_vars = "sector",
size_var = "count",
layoutAlgorithm = "squarified") %>%
hc_plotOptions(treemap = list(colorByPoint = TRUE)) %>%
hc_colors(c("#2C3E50", "#E31A1C", "#18BC9C", "#CCBE93",
"#A6CEE3", "#1F78B4", "#B2DF8A", "#FB9A99",
"#FDBF6F", "#FF7F00", "#6A3D9A" )) %>%
hc_colorAxis(dataClasses = color_classes(stocks_by_sector$sector)) %>%
hc_legend(enabled = FALSE) %>%
hc_title(text = "Number of Companies by Sector") %>%
hc_subtitle(text = "Dow Jones Index") %>%
hc_tooltip(pointFormat = "<b>{point.name}</b>:<br>
Count: {point.value:,.0f}")
# Return the plot
treemap_by_sector

The following companies will be analyzed:
As shown in the plots below, some companies are less volatile than others, and some have a steeper upward trend than others. But have you ever asked yourself what the trend line represents and how it is calculated? LOESS stands for Locally Weighted Scatterplot Smoothing; think of it as a regression within a specific "window" (range of time) in which the observations closest to a given "focal point" are given more weight than observations farther away, and a weighted least squares line (best-fitting line) is computed for each window. I would suggest looking at the StatQuest explanation of LOESS, since it explains in an easy-to-understand way how the smooth line is calculated.
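To make the idea more concrete, here is a minimal sketch (my own illustration, not part of the original notebook) of fitting a local regression directly with stats::loess. It assumes final_stock_tbl from above, and the span value of 0.3 is just an arbitrary choice for the size of the local window.
# Minimal LOESS sketch on the AAPL closing price (illustrative only)
aapl_tbl <- final_stock_tbl %>%
  filter(name == "AAPL") %>%
  mutate(date = ymd(date), date_num = as.numeric(date))
# Local regression of close on time; `span` controls how wide each local window is
loess_fit <- loess(close ~ date_num, data = aapl_tbl, span = 0.3)
# Overlay the fitted smooth on the raw series
aapl_tbl %>%
  mutate(trend = predict(loess_fit)) %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = close), color = "grey60") +
  geom_line(aes(y = trend), color = "#18BC9C", size = 1) +
  labs(title = "LOESS Trend (Illustration)", y = "Closing Price")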
# Function to get the average price by week, month and quarter
# If we use mean(close) daily we get the actual values,
# since daily values are unique
convert_date_ts <- function(data, unit = "day") {
new_data <- data %>%
mutate(date = floor_date(date, unit = unit)) %>%
group_by(date, name) %>%
summarise(
close = mean(close)
) %>%
ungroup()
return(new_data)
}

# Pick by stock (6 stocks from different industries)
filtered_final_stock_tbl <- final_stock_tbl %>%
filter(name %in% c("AAPL", "GS", "GOOGL", "XOM", "CAT", "WMT")) %>%
mutate(
date =ymd(date)
)
# Daily
filtered_final_stock_tbl %>%
convert_date_ts() %>%
group_by(name) %>%
plot_time_series(
.date_var = date,
.value = close,
.facet_ncol = 2,
.smooth_color = "#18BC9C",
.smooth_size = 0.5
)

filtered_final_stock_tbl %>%
convert_date_ts(unit = "week") %>%
group_by(name) %>%
plot_time_series(
.date_var = date,
.value = close,
.facet_ncol = 2,
.smooth_color = "#18BC9C",
.smooth_size = 0.5
)

filtered_final_stock_tbl %>%
convert_date_ts(unit = "month") %>%
group_by(name) %>%
plot_time_series(
.date_var = date,
.value = close,
.facet_ncol = 2,
.smooth_color = "#18BC9C",
.smooth_size = 0.5
)

filtered_final_stock_tbl %>%
convert_date_ts(unit = "quarter") %>%
group_by(name) %>%
plot_time_series(
.date_var = date,
.value = close,
.facet_ncol = 2,
.smooth_color = "#18BC9C",
.smooth_size = 0.5
)

filtered_final_stock_tbl %>%
convert_date_ts(unit = "year") %>%
group_by(name) %>%
plot_time_series(
.date_var = date,
.value = close,
.facet_ncol = 2,
.smooth_color = "#18BC9C",
.smooth_size = 0.5
)

# Missing values on 2017-07-31
filtered_final_stock_tbl %>%
plot_missing(
ggtheme = theme_calc(),
title = "Percent Missing Values By Column"
)

# Daily stock return by company
# Using Tidyquant package
daily_stock_return <- filtered_final_stock_tbl %>%
mutate(
name = name %>% as_factor()
) %>%
group_by(name) %>%
tq_transmute(
select = close,
mutate_fun = periodReturn,
period = "daily",
col_rename = "stock_return"
)

Note: As expected, 2007 and 2008 show the highest volatility, due to the financial crisis that occurred during those years, caused mainly by uncertainty in the markets.
# Daily stock return by company
# Using Tidyquant package
daily_stock_return <- filtered_final_stock_tbl %>%
mutate(
name = name %>% as.factor()
) %>%
group_by(name) %>%
tq_transmute(
select = close,
mutate_fun = periodReturn,
period = "daily",
col_rename = "stock_return"
)
# Visualization with ggplot
daily_stock_return %>%
ggplot(aes(x = date, y = stock_return)) +
geom_line(color = palette_light()[[1]]) +
facet_wrap(~name) +
theme_calc() +
labs(
title = "Stock Returns",
subtitle = "by Company",
y = "returns"
) +
theme(plot.title = element_text(hjust = 0.5))

return_volatility_tbl <- function(data, unit = "daily", ...) {
vars_col <- quos(...)
tbl_return_volatility <- data %>%
mutate(
name = name %>% as_factor()
) %>%
group_by(name) %>%
tq_transmute(
select = close,
mutate_fun = periodReturn,
period = unit,
col_rename = "stock_return"
) %>%
mutate(
company = case_when(
name == "AAPL" ~ "Apple",
name == "GOOGL" ~ "Google",
name == "XOM" ~ "Exxon",
name == "GS" ~ "Goldman Sachs",
name == "WMT" ~ "Walmart",
TRUE ~ "Caterpillar"
)
) %>%
group_by(!!!vars_col) %>%
summarise(
`percent return` = mean(stock_return),
`standard deviation` = sd(stock_return)
) %>%
ungroup() %>%
arrange(desc(`percent return`))
return(tbl_return_volatility)
}

library(DT)
filtered_final_stock_tbl %>%
return_volatility_tbl(unit = "daily", company, name) %>%
datatable()

library(DT)
filtered_final_stock_tbl %>%
return_volatility_tbl(unit = "weekly", company, name) %>%
datatable()

library(DT)
filtered_final_stock_tbl %>%
return_volatility_tbl(unit = "monthly", company, name) %>%
datatable()

library(DT)
filtered_final_stock_tbl %>%
return_volatility_tbl(unit = "quarterly", company, name) %>%
datatable()

library(DT)
filtered_final_stock_tbl %>%
return_volatility_tbl(unit = "yearly", company, name) %>%
datatable()

To better understand certain seasonality patterns, it is critical to see how the target variable behaves over specific periods of time. In this case we analyze how the closing price of a given company behaves on different days of the week and across months, quarters and years. This could help us understand whether other external factors influence the price of the stock.
Summary:

# Let's pick Exxon for seasonality
filtered_final_stock_tbl %>%
filter(name == "XOM") %>%
plot_seasonal_diagnostics(
.date_var = date,
.value =close,
.interactive = TRUE
)

When we talk about Seasonal Decomposition, think of it as a technique to decompose a time series into three components:
- Trend
- Seasonality
- Residual (Remainder)
Explaining Trend: As we have seen before, the trend gives a general direction as to where the time series is heading. Remember, the trend is determined using LOESS, which is nothing more than a local regression within a specific time window.
Explaining Seasonality: In the second panel we observe the seasonal swing across time. This is usually captured around the trend line we observe in the first panel. Capturing seasonal swings could help our forecast model if seasonality has an impact on the target variable, in this case the closing price. When the seasonality is constant, as shown below, we have an Additive Decomposition, which assumes that the swings in the second panel are the same every year. When the seasonality is not constant, we have a Multiplicative Decomposition, which is typically used when the seasonal swings grow exponentially over time.
Combining Seasonality + Trend: By combining these two, we try to replicate the actual time series; although they are not identical, it is quite a close approximation.
Residuals: This is just the remainder, the difference between the actual data and the seasonal + trend series.
Formula:
- Additive Decomposition: Seasonal + Trend + Random
- Multiplicative Decomposition: Seasonal × Trend × Random
Note: To see the seasonality, zoom in on the season panel for one of the stocks; you will clearly see an Additive Decomposition.
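As a rough check of the additive relationship (my own sketch, not from the original notebook; it assumes filtered_final_stock_tbl from above and that the series starts in January 2006, as the file name suggests), we can decompose a monthly AAPL series with stats::stl() and verify that the components add back up to the observed values:
# Monthly average close for AAPL
aapl_monthly <- filtered_final_stock_tbl %>%
  filter(name == "AAPL") %>%
  summarise_by_time(.date_var = date, .by = "month", close = mean(close, na.rm = TRUE))
# STL decomposition on the log scale (12 observations per year)
aapl_ts  <- ts(log1p(aapl_monthly$close), frequency = 12, start = c(2006, 1))
aapl_stl <- stl(aapl_ts, s.window = "periodic")
# Seasonal + Trend + Remainder reproduces the observed (log) series, up to floating point error
components <- as.data.frame(aapl_stl$time.series)
head(components$seasonal + components$trend + components$remainder - as.numeric(aapl_ts))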
# Function to create seasonal decomposition with loess for the stocks
plot_seasonal_decomposition <- function(data, interactive = FALSE) {
stl_plot <- data %>%
group_by(name) %>%
plot_stl_diagnostics(
.date_var = date,
.value = log1p(close),
.interactive = interactive
)
return(stl_plot)
}
# Visualization
filtered_final_stock_tbl %>%
filter(name %in% c("AAPL", "XOM", "GS")) %>%
group_by(name) %>%
plot_seasonal_decomposition()

Anomalies are mainly used to find events that were not expected in our time series. A mistake in the data can also cause an outlier. Detecting outliers gives us the opportunity to investigate further whether unexpected events actually drove the time series to behave in an anomalous manner.
filtered_final_stock_tbl %>%
group_by(name) %>%
plot_anomaly_diagnostics(
.date_var = date,
.value = close,
.facet_ncol = 2,
.interactive = TRUE,
.title = "Anomaly Diagnostics Dow Jones",
.anom_color = "#FB3029",
.max_anomalies = 0.07,
.alpha = 0.05
)

As we have seen previously, Exxon belongs to the energy industry, more specifically the oil industry.
filtered_final_stock_tbl %>%
filter(name == "XOM") %>%
plot_anomaly_diagnostics(
.date_var = date,
.value = close,
.interactive = TRUE,
.title = "Anomaly Diagnostics Exxon (XOM)",
.anom_color = "#FB3029",
.max_anomalies = 0.05,
.alpha = 0.05
)

filtered_final_stock_tbl %>%
group_by(name) %>%
summarise_by_time(
.date_var = date,
.value = mean(close, na.rm = TRUE),
.by = "month"
) %>%
tk_acf_diagnostics(
.date_var = date,
.value = log1p(.value)
) %>%
ggplot(aes(x = lag, y = ACF, color = name, group = name)) +
# Add horizontal line at y = 0
geom_hline(yintercept = 0) +
# Plot autocorrelations
geom_point(size = 2) +
geom_segment(aes(xend = lag, yend = 0), size = 1) +
# Add cutoffs
geom_line(aes(y = .white_noise_upper), color = "black", linetype = 2) +
geom_line(aes(y = .white_noise_lower), color = "black", linetype = 2) +
# Add facets
facet_wrap(~name, ncol = 3) +
# Aesthetics
expand_limits(y = c(-1, 1)) +
scale_color_stata() +
theme_calc() +
theme(
legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5)
) +
labs(
title = "AutoCorrelation (ACF)",
subtitle = "Dow Jones 30 Index"
)

filtered_final_stock_tbl %>%
group_by(name) %>%
summarise_by_time(
.date_var = date,
.value = mean(close, na.rm = TRUE),
.by = "month"
) %>%
tk_acf_diagnostics(
.date_var = date,
.value = log1p(.value)
) %>%
ggplot(aes(x = lag, y = PACF, color = name, group = name)) +
# Add horizontal line at y = 0
geom_hline(yintercept = 0) +
# Plot partial autocorrelations
geom_point(size = 2) +
geom_segment(aes(xend = lag, yend = 0), size = 1) +
# Add cutoffs
geom_line(aes(y = .white_noise_upper), color = "black", linetype = 2) +
geom_line(aes(y = .white_noise_lower), color = "black", linetype = 2) +
# Add facets
facet_wrap(~name, ncol = 3) +
# Aesthetics
expand_limits(y = c(-1, 1)) +
scale_color_stata() +
theme_calc() +
theme(
legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5)
) +
labs(
title = "Partial AutoCorrelation (PACF)",
subtitle = "Dow Jones 30 Index"
)

tech_stocks_tbl <- filtered_final_stock_tbl %>%
filter(sector == "Technology") %>%
select(date, name, close) %>%
pivot_wider(names_from = name, values_from = close) %>%
summarise_by_time(
.date_var = date,
.by = "month",
across(AAPL:GOOGL, .fns = mean)
)
tech_stocks_tbl %>%
tk_acf_diagnostics(
date,
AAPL,
.ccf_vars = GOOGL
) %>%
select(lag, CCF_GOOGL) %>%
datatable()

Summary:
- CCF on AAPL Close Price: The closing price of AAPL is the target variable and the closing price of GOOGL is the independent variable.
- High correlation at Lag 0: This indicates that when the AAPL closing price is high or increasing, the GOOGL closing price tends to move in the same direction on that same day, i.e. at lag 0.
# Plot
tech_stocks_tbl %>%
plot_acf_diagnostics(
date,
AAPL,
.ccf_vars = GOOGL,
.show_ccf_vars_only = TRUE,
.interactive=FALSE,
.line_color = "black",
.point_color =palette_light()[[2]],
.line_size = 1.5,
.title = "Cross Correlation of Technology Stocks"
)

Moving averages help us smooth the time series and better understand where the trend is heading. They also help us detect seasonality in our time series.
Most common moving averages in the stock market:
- 15-day MA
- 30-day MA
- 100-day MA
- 200-day MA
Note: Notice how the longer the moving-average window, the smoother the line. Usually, we use moving averages to detect trends in our time series data.
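To illustrate the mechanics before applying them to the stocks (my own toy example, not from the original notebook), timetk::slidify_vec() simply applies a function over a rolling window:
# A 3-period centered rolling mean on a toy vector
x <- c(10, 12, 11, 15, 14, 13)
slidify_vec(
  .x       = x,
  .period  = 3,
  .f       = mean,
  .align   = "center",
  .partial = TRUE
)
# Each output value is the mean of the value and its neighbours,
# e.g. the second element is mean(c(10, 12, 11)) = 11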
gs_ma_sample <- filtered_final_stock_tbl %>%
select(date, close, name) %>%
filter(name == "GS") %>%
filter_by_time(.date_var = date, .start_date = "2015", .end_date = "2017") %>%
mutate(
adjusted_15_mm_close = slidify_vec(
.x = close,
.period = 15,
.f = mean,
na.rm = TRUE,
.align = "center",
.partial = TRUE
),
adjusted_30_mm_close = slidify_vec(
.x = close,
.period = 30,
.f = mean,
na.rm = TRUE,
.align = "center",
.partial = TRUE
),
adjusted_100_mm_close = slidify_vec(
.x = close,
.period = 100,
.f = mean,
na.rm = TRUE,
.align = "center",
.partial = TRUE
),
adjusted_300_mm_close = slidify_vec(
.x = close,
.period = 300,
.f = mean,
na.rm = TRUE,
.align = "center",
.partial = TRUE
)
) %>%
pivot_longer(contains("close"), names_repair = "unique") %>%
rename(
name = name...2,
metric = name...3
)

gs_ma_sample %>%
ggplot(aes(x = date, y = value)) +
geom_line(aes(color = metric), size = 1) +
theme_calc() +
theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5)) +
guides(color = guide_legend(nrow = 2, byrow = TRUE)) +
labs(
title = "Moving Average (MA) as a Metric",
subtitle = str_glue("GS Stock 2015 - 2017 "),
y = "Closing Price"
) +
scale_color_stata() +
scale_x_date(expand = c(0,0))

mm_final_tbl <- filtered_final_stock_tbl %>%
select(date, close, name) %>%
mutate(
adjusted_15_mm_close = slidify_vec(
.x = close,
.period = 15,
.f = median,
na.rm = TRUE,
.align = "center",
.partial = TRUE
),
adjusted_30_mm_close =slidify_vec(
.x = close,
.period = 30,
.f = median,
na.rm = TRUE,
.align = "center",
.partial = TRUE
),
adjusted_100_mm_close = slidify_vec(
.x = close,
.period = 100,
.f = median,
na.rm = TRUE,
.align = "center",
.partial = TRUE
),
adjusted_300_mm_close = slidify_vec(
.x = close,
.period = 300,
.f = median,
na.rm = TRUE,
.align = "center",
.partial = TRUE
)
) %>%
pivot_longer(contains("close"), names_repair = "unique") %>%
rename(
name = name...2,
metric = name...3
)
# Visualization
mm_final_tbl %>%
ggplot(aes(x = date, y = value, color = metric)) +
geom_line(size = 1, alpha = 0.8) +
facet_wrap(~name, scales = "free_y") +
theme_calc() +
scale_color_stata() +
labs(
title = "Moving Medians",
subtitle = "Dow Jones 30",
y = "Price"
) +
theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5),
legend.text = element_text(size = 10)) +
guides(color = guide_legend(nrow = 2, byrow = TRUE))

The goal of this section is to showcase the scenarios in which a moving average is useful and those in which it is not.
Summary:

gs_filtered_data <- filtered_final_stock_tbl %>%
filter(name == "GS") %>%
filter_by_time(
.date_var = date,
.start_date = "2009",
.end_date = "2012"
) %>%
select(date, name, close)
ma_60day_forecasting <- gs_filtered_data %>%
mutate(
mavg_60_days = slidify_vec(
.date_var = date,
.x = close,
.f = mean,
na.rm = TRUE,
.period = 60,
.align = "right"
)
) %>%
bind_rows(
future_frame(., .length_out = 60)
) %>%
fill(
mavg_60_days,
.direction = "down"
)
# Forecasting plot
ma_60day_forecasting %>%
pivot_longer(close:mavg_60_days, names_repair = "unique") %>%
rename(
"type" = "name...3"
) %>%
ggplot(aes(x=date, y = value)) +
geom_line(aes(color = type), size = 1) +
theme_calc() +
scale_color_stata() +
labs(
title = "60 day Moving Average",
subtitle = "MA as a Forecasting Technique"
)

In this scenario we can see that there was an uplift the moving average was not able to capture. Moving averages work well during times of stability, since they capture the trend. During times of volatility, however, you will miss much of the uplift in stock prices, or you will not be able to hedge against downward trends. It is therefore important to explore other techniques.
actual_60days <- filtered_final_stock_tbl %>%
filter(name == "GS") %>%
filter_by_time(
.start_date = "2013",
.end_date = "2013-03-01"
) %>%
select(date, name, close)
plot_series <- ma_60day_forecasting %>%
filter_by_time(
.start_date = "2013-01-02",
.end_date = "2013-03-01"
) %>%
left_join(actual_60days, by = "date") %>%
select(date, mavg_60_days, close.y) %>%
rename(
"actual_close_price" = "close.y"
) %>%
pivot_longer(mavg_60_days:actual_close_price) %>%
drop_na() %>%
ggplot(aes(x=date, value)) +
geom_line(aes(color = name), size = 1) +
geom_point(aes(color = name), size = 2) +
theme_calc() +
scale_color_stata() +
labs(title = "Residuals from 60 day Moving average",
subtitle = "GS stock Moving Average") +
theme(legend.position = "bottom") +
guides(color = guide_legend(nrow = 1, byrow = TRUE))
plot_series

Another way of finding a relationship between two time series is through the use of rolling correlations. What we do here is compute the correlation between the two series over a moving window at different points in time and determine whether the relationship between them is significant.
Summary:

# We will use the slidify function
# Turns function into a rolling / sliding function
# We will get the monthly average for each stock
# And find if there is a correlation
wider_format_stocks_tbl <- filtered_final_stock_tbl %>%
filter(name %in% c("AAPL", "GOOGL")) %>%
select(date, close, name) %>%
mutate(
date = floor_date(date, unit = "month")
) %>%
group_by(date, name) %>%
summarise(
close_avg = mean(close)
) %>% ungroup() %>%
pivot_wider(names_from = name, values_from = close_avg)
rolling_cor <- slidify(
.f = ~ cor(.x, .y, use = "pairwise.complete.obs"),
.period = 12,
.align = "right",
.partial = FALSE
)

wider_format_stocks_tbl %>%
mutate(
rolling_cor_close_price = rolling_cor(AAPL, GOOGL)
) %>%
ggplot(aes(x = date, y = rolling_cor_close_price)) +
geom_line(size = 1, color = "#045884") +
theme_calc() +
scale_color_stata() +
geom_smooth(method = "loess", color = "red") +
scale_x_date(expand = c(0,0)) +
labs(
title = "Rolling Correlation in the Technology Sector",
subtitle = "AAPL vs GOOGL",
y = "Rolling Correlation",
x = "Date"
) +
theme(plot.title = element_text(hjust = 0.5)) +
scale_x_date(expand = c(0,0))

The main reason we use log transformations in time series analysis is to reduce the impact of extreme outliers, thereby stabilizing the variance across the series. Log transformations are especially helpful in models like linear regression, where an outlier can have a significant impact on the squared errors. Although in this case we don't have "significant" outliers, log transformations can still help mitigate their impact.
Log Transformation Formula:
\(\large y = \log_b(x)\)
\(\large b^y = x\)
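As a quick numeric illustration (my own example, with made-up prices), log1p() and expm1() are exact inverses, which is why we can fit a model on the log scale and transform the results back to prices:
# log1p(x) = log(1 + x); expm1(y) = exp(y) - 1
close_prices <- c(120.5, 135.2, 98.7)   # made-up closing prices
log_close    <- log1p(close_prices)     # work on the log scale (stabilizes variance)
expm1(log_close)                        # back-transform recovers the original prices
## [1] 120.5 135.2  98.7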
# We will pick GS as it is one of the ones with the highest volatility
# GS linear regression model using log transformations
log_gs_lm_plot <- filtered_final_stock_tbl %>%
filter(name == "GS") %>%
filter_by_time(
.start_date = "2016-01-01",
.end_date = "2017-01-01"
) %>%
plot_time_series_regression(
.date_var = date,
.formula = log1p(close) ~ as.numeric(date) +
wday(date, label = TRUE) +
month(date, label = TRUE),
.interactive=FALSE,
.show_summary = TRUE,
.title = "Log transformations in Linear Regression Models"
)

##
## Call:
## stats::lm(formula = .formula, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.114757 -0.018312 0.001485 0.020826 0.105742
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.3685671 4.3585702 -2.149 0.03262 *
## as.numeric(date) 0.0008535 0.0002566 3.326 0.00102 **
## wday(date, label = TRUE).L -0.0018868 0.0050972 -0.370 0.71159
## wday(date, label = TRUE).Q 0.0004393 0.0050410 0.087 0.93063
## wday(date, label = TRUE).C -0.0015788 0.0049871 -0.317 0.75184
## wday(date, label = TRUE)^4 0.0031286 0.0049447 0.633 0.52753
## month(date, label = TRUE).L 0.0212167 0.0939807 0.226 0.82159
## month(date, label = TRUE).Q 0.2516840 0.0078410 32.098 < 2e-16 ***
## month(date, label = TRUE).C 0.1107287 0.0078093 14.179 < 2e-16 ***
## month(date, label = TRUE)^4 0.0581218 0.0078034 7.448 1.79e-12 ***
## month(date, label = TRUE)^5 -0.0174280 0.0077749 -2.242 0.02592 *
## month(date, label = TRUE)^6 0.0410392 0.0078056 5.258 3.28e-07 ***
## month(date, label = TRUE)^7 -0.0437489 0.0077170 -5.669 4.19e-08 ***
## month(date, label = TRUE)^8 -0.0437221 0.0076994 -5.679 3.99e-08 ***
## month(date, label = TRUE)^9 0.0105762 0.0077319 1.368 0.17266
## month(date, label = TRUE)^10 0.0131328 0.0076836 1.709 0.08874 .
## month(date, label = TRUE)^11 -0.0166546 0.0077086 -2.161 0.03174 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03551 on 235 degrees of freedom
## Multiple R-squared: 0.933, Adjusted R-squared: 0.9285
## F-statistic: 204.6 on 16 and 235 DF, p-value: < 2.2e-16
log_gs_lm_plot

Transforming back to the original values using the expm1 function:
log_transformed_back_plot <- filtered_final_stock_tbl %>%
filter(name == "GS") %>%
filter_by_time(
.start_date = "2016-01-01",
.end_date = "2017-01-01"
) %>%
plot_time_series_regression(
.date_var = date,
.formula = expm1(log1p(close)) ~ as.numeric(date) +
wday(date, label = TRUE) +
month(date, label = TRUE),
.interactive=FALSE,
.show_summary = TRUE,
.title = "Transforming to actual close price using expm1 function"
)

##
## Call:
## stats::lm(formula = .formula, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.0393 -3.0182 0.3419 3.3421 18.1935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.804e+03 7.652e+02 -3.664 0.000306 ***
## as.numeric(date) 1.751e-01 4.505e-02 3.885 0.000133 ***
## wday(date, label = TRUE).L -3.287e-01 8.948e-01 -0.367 0.713718
## wday(date, label = TRUE).Q 3.947e-02 8.850e-01 0.045 0.964464
## wday(date, label = TRUE).C -2.123e-01 8.755e-01 -0.242 0.808638
## wday(date, label = TRUE)^4 5.540e-01 8.681e-01 0.638 0.523987
## month(date, label = TRUE).L -2.077e+00 1.650e+01 -0.126 0.899952
## month(date, label = TRUE).Q 4.869e+01 1.377e+00 35.375 < 2e-16 ***
## month(date, label = TRUE).C 2.412e+01 1.371e+00 17.596 < 2e-16 ***
## month(date, label = TRUE)^4 1.309e+01 1.370e+00 9.555 < 2e-16 ***
## month(date, label = TRUE)^5 -1.112e+00 1.365e+00 -0.814 0.416229
## month(date, label = TRUE)^6 6.582e+00 1.370e+00 4.803 2.78e-06 ***
## month(date, label = TRUE)^7 -7.165e+00 1.355e+00 -5.289 2.82e-07 ***
## month(date, label = TRUE)^8 -7.312e+00 1.352e+00 -5.410 1.55e-07 ***
## month(date, label = TRUE)^9 1.411e+00 1.357e+00 1.040 0.299594
## month(date, label = TRUE)^10 1.905e+00 1.349e+00 1.412 0.159213
## month(date, label = TRUE)^11 -2.533e+00 1.353e+00 -1.872 0.062459 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.234 on 235 degrees of freedom
## Multiple R-squared: 0.9425, Adjusted R-squared: 0.9386
## F-statistic: 240.6 on 16 and 235 DF, p-value: < 2.2e-16
log_transformed_back_plot

A common approach is to use sixty percent of the dataset for training, twenty percent for validation and twenty percent for testing. However, this distribution should rely on your own judgement. The same applies when implementing machine learning models on a time series: one of the first steps is splitting your time series data. Let's look at how we do that with an example; I will split the time series into a training and testing set so you can see it visually.
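As a rough sketch of the 60/20/20 idea (my own illustration; the notebook itself only builds a train/test split below), a chronological split can be obtained by slicing the date-ordered data by row position:
# Chronological 60/20/20 split for the GS series (no shuffling, to avoid leakage)
gs_tbl <- filtered_final_stock_tbl %>%
  filter(name == "GS") %>%
  arrange(date)
n_obs     <- nrow(gs_tbl)
train_tbl <- gs_tbl %>% slice(1:floor(0.6 * n_obs))                        # first 60%
valid_tbl <- gs_tbl %>% slice((floor(0.6 * n_obs) + 1):floor(0.8 * n_obs)) # next 20%
test_tbl  <- gs_tbl %>% slice((floor(0.8 * n_obs) + 1):n_obs)              # final 20%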
# Test is 25% of our data
split_df <- filtered_final_stock_tbl %>%
filter(name == "GS") %>%
time_series_split(
date_var = date,
assess = "3 year",
cumulative = TRUE
)
splitted_data <- split_df %>%
tk_time_series_cv_plan()
splitted_data %>%
ggplot() + geom_line(aes(x=date, y=close, color=.key), size=1) +
theme_calc() + scale_color_stata() +
geom_rect(data=data.frame(xmin=as.Date(c("2015-01-02")),
xmax=as.Date(c("2017-12-29")),
ymin=-Inf,
ymax=Inf),
aes(xmin=xmin,xmax=xmax,ymin=ymin,ymax=ymax),
fill=palette_light()[2],alpha=0.3) +
geom_label(
label = "Testing Set",
vjust = 0.5,
aes(x=as.Date(c("2016-07-02")),
y = 100)) +
geom_rect(data=data.frame(xmin=as.Date(c("2006-01-03")),
xmax=as.Date(c("2014-12-31")),
ymin=-Inf,
ymax=Inf),
aes(xmin=xmin,xmax=xmax,ymin=ymin,ymax=ymax),
fill=palette_light()[3],alpha=0.3) + scale_x_date(expand = c(0,0)) +
geom_label(
label = "Training Set",
vjust = 0.5,
aes(x=as.Date(c("2012-07-02")),
y = 220)) +
labs(
title = "Splitting Time Series into Training and Testing Sets",
subtitle = "Dow Jones 30"
) + theme(plot.title = element_text(hjust=0.5), legend.position = "bottom") +
guides(color = guide_legend(nrow=1, byrow=TRUE)) +
scale_y_continuous(labels = scales::dollar)

Note: Still working on the explanation.
Most of us have heard of the Autoregressive Integrated Moving Average (ARIMA) model. Before understanding the concept of ARIMA, we should answer the following question:
What does it mean when we have a flat line in our forecast with ARIMA? (As shown below)
[Figure: a flat-line ARIMA forecast. Source: SAS blog]
When we have a flat line in our forecast, it means the model is unable to capture either a trend or seasonality; it cannot pick up the different dynamics within our time series. Even though it does not look as nice as we would want a forecast to be (moving up or down), this type of forecast can actually give better results than simply adding random noise to the forecast.
Are there ways to improve the output of the forecast? Definitely: there are transformations and feature engineering steps that help your model better capture trend or seasonality in your time series. I plan to talk more in-depth about ARIMA and other transformations and feature engineering processes in another notebook, since I would like to keep this one as simple as possible for the moment.
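To give a concrete flavour of what fitting an ARIMA looks like, here is a minimal sketch using the forecast package (my own example; that package is not loaded in this notebook and the model is not tuned):
library(forecast)   # assumption: not among the libraries loaded above
# Monthly average close for GS as a ts object (assumes the data starts in January 2006)
gs_monthly <- filtered_final_stock_tbl %>%
  filter(name == "GS") %>%
  summarise_by_time(.date_var = date, .by = "month", close = mean(close, na.rm = TRUE))
gs_ts <- ts(gs_monthly$close, frequency = 12, start = c(2006, 1))
# auto.arima() searches over (p, d, q) orders; forecast() projects 12 months ahead
gs_arima <- auto.arima(gs_ts)
gs_fc    <- forecast(gs_arima, h = 12)
autoplot(gs_fc)   # if the chosen model is close to a random walk, the forecast line will look flat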
I hope this notebook gave you a brief overview of the topics we will cover in this time series session. If you enjoyed going through it, I am preparing another notebook that will help the Kaggle community start deploying time series models for forecasting. If you are interested, you can visit Time Series || Feature Engineering Concepts.