1.0 data - etfs lists: ticker & name
filtered to remove leveraged/short/inversed etfs, price >= 10$
& mktcap >= 100m as of 2018
1.1 data - etf prices
from 2015-06-05 to 2022-06-07
etf_prices <- foreach(var = 1:nrow(etf_list), .combine = "bind_rows", .packages = "data.table") %dopar% {
price_api(
ticker = etf_list$ticker[var],
from = from_date,
to = to_date
)
}
etf_prices <- etf_prices %>%
rename_all(~tolower(.)) %>%
select(ticker, everything()) %>%
drop_na(close) %>%
mutate(date = ymd(date)) %>%
arrange(ticker, date)
head(etf_prices, 5)
2.0 spy - filtered for spy prices & calculated returns
daily simple returns = (last/first)-1
hold =
(close/lag(close))-1
overnight = (open/lag(close))-1
intraday = (close/open)-1
the first days of all returns set
to 0
head(spy_returns, 5)
2.1 spy - cumulative return timeseries

2.2 spy - return/inferential metrics
2015-06-05 to 2022-06-07
spy_returns %>%
pivot_longer(hold:intraday, names_to = "spy") %>%
group_by(spy) %>%
summarise(return = port(value, total = TRUE),
mean = mean(value),
median = median(value),
min = min(value),
q10 = quantile(value, .1),
q90 = quantile(value, .9),
max = max(value))
2.3 spy - return/portfolio metrics
spy_returns %>%
pivot_longer(hold:intraday, names_to = "spy") %>%
group_by(spy) %>%
mutate(portfolio = port(value, balance = 10000)) %>%
summarise(return = port(value, total = TRUE),
cagr = cagr(portfolio, days = as.numeric(last(date)-first(date))),
stdev = sd(value),
annual_vol = stdev*sqrt(252),
sharpe = cagr/annual_vol)
3.0 all etfs - return summary tbl grouped by ticker
ie. first row shows the simple returns of hold:intraday for aaxj, for
2015-06 to 2022-06. overnight was filtered for <= q95(overnight).
seemed to have some bad data (possible outliers) for 5-8 tickers, quick
attempt to remove them.
3.1 all etfs - summary tbl of the table from 3.0
etf_returns %>%
summarise(across(hold:intraday, ~port(., total = TRUE)),
trading_days = n()) %>%
filter(overnight <= quantile(overnight, .95)) %>%
summary()
ticker hold overnight intraday trading_days
Length:344 Min. :-0.70217 Min. :-0.76216 Min. :-0.85631 Min. :1106
Class :character 1st Qu.:-0.04497 1st Qu.:-0.04444 1st Qu.:-0.27788 1st Qu.:1765
Mode :character Median : 0.26180 Median : 0.33977 Median :-0.02531 Median :1765
Mean : 0.39452 Mean : 0.64582 Mean :-0.01074 Mean :1749
3rd Qu.: 0.75331 3rd Qu.: 1.11882 3rd Qu.: 0.17950 3rd Qu.:1765
Max. : 3.43673 Max. : 4.77693 Max. : 2.31032 Max. :1765
3.2 all etfs - boxplots of 3.0 data

3.3 all etfs - top3 overnight return tickers
wanted to take a closer look at the top tickers for overnight
returns: gvip, ptf, & xme. graphed a cumulative return timeseries
for them. question: highest overnight return doesn’t necessarily mean
overnight outperforms the buyhold strat per ticker. creating a new
metric next section to look into this.

3.4 all etfs - new metric: ‘excess overnight returns’ / ‘overnight
premium’
created new metric called ‘premium’ = overnight returns - hold
returns. arranged by descending order, top5 listed below
3.5 all etfs - graphed cumulative returns of top3 etfs from 3.4
the data for some of these might not be right

4.0 text eda - tokenized words from etf names
- calculated the total return by ticker
- split etf names into words
- grouped words, then aggregated ticker returns + counted how many
times each word was in data (frequency)
- distinct’d all return data to avoid words with same return values
(happens when words are shown multiple times, same value, same etf. ie.
“S & P…”)
- filtered for word count to be between 10 & 100 (wanted to see a
grouping effect/relationship, doesn’t help if one word is from one
ticker or if the word is etf, index, or fund)
- graphed a verticle column chart by word, by return type (ie.
hold/overnight/intraday), colored/filled by count
etf_returns %>%
summarise(across(hold:intraday, ~port(., total = TRUE))) %>%
left_join(etf_list, by = "ticker") %>%
unnest_tokens(words, name) %>%
group_by(words) %>%
summarise(across(hold:intraday, ~mean(.)),
count = n()) %>%
distinct(hold, overnight, intraday, .keep_all = TRUE) %>%
filter(count >= 10 & count <= 100) %>%
pivot_longer(hold:intraday) %>%
mutate(words = reorder_within(words, value, name)) %>%
ggplot(aes(value, words, fill = count)) +
geom_col() +
scale_y_reordered() +
facet_wrap(~name, scales = "free") +
labs(title = "aggregated returns by tokenized words",
subtitle = "avg total returns of 300+ etfs by words from name, 2015-2020",
x = "avg total returns (%)", y = "")

4.1 text eda - same concept as 4.0 except for the metric:
premium

5.0 date eda - boxplots of etf returns by year
how do these returns pane out in years like 2020?
etf_returns %>%
mutate(year = year(date)) %>%
group_by(ticker, year) %>%
summarise(across(hold:intraday, ~port(., total = TRUE)),
count = n()) %>%
filter(count >= 220) %>%
pivot_longer(hold:intraday) %>%
ggplot(aes(name, value, fill = name)) +
geom_boxplot(outlier.shape=NA) +
facet_wrap(~year, scales = "free") +
coord_cartesian(ylim = c(-.5, 1)) +
geom_hline(yintercept = 0, linetype = 2, alpha = 0.5, color = "red") +
labs(title = "etf returns by year",
subtitle = "2016-2021, full years",
x = "", y = "return (%)")
`summarise()` has grouped output by 'ticker'. You can override using the `.groups` argument.

5.1 date eda - summary tbl of data from 5.0
etf returns are aggregated by year
etf_returns %>%
mutate(year = year(date)) %>%
group_by(ticker, year) %>%
summarise(across(hold:intraday, ~port(., total = TRUE)),
count = n()) %>%
filter(count >= 220) %>%
mutate(premium = overnight-hold) %>%
ungroup() %>%
group_by(year) %>%
summarise(mean = mean(premium),
median = median(premium),
sd = sd(premium),
q10 = quantile(premium, .1),
q90 = quantile(premium, .9),
etfs_count = n())
`summarise()` has grouped output by 'ticker'. You can override using the `.groups` argument.
5.2 date eda - boxplots of etf returns by weekday
similar concept as 5.1, except for weekday. should be noted, the day
of overnight returns are of the day that they are closed, not opened.
ie. overnight returns for monday are actually the returns held from
friday close to monday open.
etf_returns %>%
mutate(weekday = weekdays(date)) %>%
group_by(ticker, weekday) %>%
summarise(across(hold:intraday, ~port(., total = TRUE)),
count = n()) %>%
filter(count >= 220) %>%
pivot_longer(hold:intraday) %>%
ggplot(aes(name, value, fill = name)) +
geom_boxplot(outlier.shape=NA) +
facet_wrap(~weekday, scales = "free") +
coord_cartesian(ylim = c(-.5, 1.2)) +
geom_hline(yintercept = 0, linetype = 2, alpha = 0.5, color = "red") +
labs(title = "etf returns by weekday",
subtitle = "2016-2021, full years",
x = "", y = "return (%)")
`summarise()` has grouped output by 'ticker'. You can override using the `.groups` argument.

5.3 date eda - boxplots of etf ‘excess overnight returns’ by year +
weekdays
similar concept as above boxplots, but combined for one metric:
excess overnight returns (overnight returns - hold returns)
etf_returns %>%
mutate(year = year(date),
weekday = weekdays(date, abbreviate = TRUE)) %>%
group_by(ticker, year, weekday) %>%
summarise(across(hold:overnight, ~port(., total = TRUE)),
count = n()) %>%
filter(count >= 45) %>%
mutate(premium = overnight-hold) %>%
ggplot(aes(weekday, premium, fill = weekday)) +
geom_boxplot(outlier.shape=NA) +
facet_wrap(~year, scales = "free") +
coord_cartesian(ylim = c(-.3, 0.3)) +
geom_hline(yintercept = 0, linetype = 2, alpha = 0.5, color = "red") +
labs(title = "'excess overnight returns' by year + weekdays",
subtitle = "2016-2021, full years",
x = "", y = "return (%)")
`summarise()` has grouped output by 'ticker', 'year'. You can override using the `.groups` argument.

5.4 date eda - boxplots of etf ‘excess overnight returns’ by year +
quarter
similar concept as above boxplots, but combined for one metric:
excess overnight returns (overnight returns - hold returns)
etf_returns %>%
mutate(year = year(date),
quarter = as.factor(quarter(date))) %>%
group_by(ticker, year, quarter) %>%
summarise(across(hold:overnight, ~port(., total = TRUE)),
count = n()) %>%
filter(year %in% c(2016:2021)) %>%
mutate(premium = overnight-hold) %>%
ggplot(aes(quarter, premium, fill = quarter)) +
geom_boxplot(outlier.shape=NA) +
facet_wrap(~year, scales = "free") +
coord_cartesian(ylim = c(-.25, 0.35)) +
geom_hline(yintercept = 0, linetype = 2, alpha = 0.5, color = "red") +
labs(title = "'excess overnight returns' by year + quarter",
subtitle = "2016-2021, full years",
x = "", y = "return (%)")
`summarise()` has grouped output by 'ticker', 'year'. You can override using the `.groups` argument.

---
title: "overnight eda"
output: html_notebook
---

```{r, include=FALSE}
# libraries/utilities
source("utils/libraries.R")
source("utils/helper_functions.R")

# options
options(scipen = 100)
registerDoParallel(cores = 10)
knitr::opts_chunk$set(fig.width = 12, fig.height = 5)
```

# 1.0 data - etfs lists: ticker & name
filtered to remove leveraged/short/inversed etfs, price >= 10$ & mktcap >= 100m as of 2018

```{r}
etf_list %>%
  select(ticker, name)
```

# 1.1 data - etf prices
from 2015-06-05 to 2022-06-07

```{r}
etf_prices <- etf_prices %>%
  rename_all(~tolower(.)) %>%
  select(ticker, everything()) %>%
  drop_na(close) %>%
  mutate(date = ymd(date)) %>%
  arrange(ticker, date)

head(etf_prices, 5)
```

```{r, include=FALSE}
spy_returns <- etf_prices %>%
  filter(ticker == "SPY") %>%
  mutate(hold = close/lag(close)-1,
         overnight = open/lag(close)-1,
         intraday = close/open-1,
         across(hold:intraday, ~replace(., 1, 0)))
```

# 2.0 spy - filtered for spy prices & calculated returns
<br>daily simple returns = (last/first)-1</br>
<br>hold = (close/lag(close))-1</br>
<br>overnight = (open/lag(close))-1</br>
<br>intraday = (close/open)-1</br>
<br>the first days of all returns set to 0</br>

```{r}
head(spy_returns, 5)
```

# 2.1 spy - cumulative return timeseries

```{r}
spy_returns %>%
  mutate(across(hold:intraday, ~port(.))) %>%  # summary()
  pivot_longer(hold:intraday) %>%
  ggplot(aes(date, value, color = name)) +
  geom_line() +
  labs(title = "spy cumulative returns",
       subtitle = "2015-06-05 to 2022-06-07",
       x = "", y = "returns (%)")
```

# 2.2 spy - return/inferential metrics
2015-06-05 to 2022-06-07

```{r}
spy_returns %>%
  pivot_longer(hold:intraday, names_to = "spy") %>%
  group_by(spy) %>%
  summarise(return = port(value, total = TRUE),
            mean = mean(value),
            median = median(value),
            min = min(value),
            q10 = quantile(value, .1),
            q90 = quantile(value, .9),
            max = max(value))
```

# 2.3 spy - return/portfolio metrics

```{r}
spy_returns %>%
  pivot_longer(hold:intraday, names_to = "spy") %>%
  group_by(spy) %>%
  mutate(portfolio = port(value, balance = 10000)) %>% 
  summarise(return = port(value, total = TRUE),
            cagr = cagr(portfolio, days = as.numeric(last(date)-first(date))),
            stdev = sd(value),
            annual_vol = stdev*sqrt(252),
            sharpe = cagr/annual_vol)
```

# 3.0 all etfs - return summary tbl grouped by ticker
ie. first row shows the simple returns of hold:intraday for aaxj, for 2015-06 to 2022-06.
overnight was filtered for <= q95(overnight). seemed to have some bad data (possible outliers) for 5-8 tickers, quick attempt to remove them.

```{r}
# etf_returns <- etf_prices %>%
#   group_by(ticker) %>%
#   mutate(hold = close/lag(close)-1,
#          overnight = open/lag(close)-1,
#          intraday = close/open-1,
#          across(hold:intraday, ~replace(., 1, 0))) %>%
#   select(ticker, date, hold:intraday)

etf_returns %>%
  summarise(across(hold:intraday, ~port(., total = TRUE)),
            trading_days = n()) %>% 
  filter(overnight <= quantile(overnight, .95)) %>%
  head(5)
```

# 3.1 all etfs - summary tbl of the table from 3.0

```{r}
etf_returns %>%
  summarise(across(hold:intraday, ~port(., total = TRUE)),
            trading_days = n()) %>% 
  filter(overnight <= quantile(overnight, .95)) %>% 
  summary() 
```

# 3.2 all etfs - boxplots of 3.0 data

```{r}
etf_returns %>%
  summarise(across(hold:intraday, ~port(., total = TRUE))) %>% 
  filter(overnight <= quantile(overnight, .95)) %>% 
  pivot_longer(hold:intraday) %>%
  ggplot(aes(name, value)) +
  geom_boxplot(outlier.shape=NA) +
  geom_hline(yintercept = 0, linetype = 2, alpha = 0.5, color = "red") +
  coord_cartesian(ylim = c(-1,3)) +
  labs(title = "etf returns",
       subtitle = "2015-06-05 to 2022-06-07",
       x = "", y = "returns (%)")
```

# 3.3 all etfs - top3 overnight return tickers
wanted to take a closer look at the top tickers for overnight returns: gvip, ptf, & xme. graphed a cumulative return timeseries for them.
question: highest overnight return doesn't necessarily mean overnight outperforms the buyhold strat per ticker. creating a new metric next section to look into this.

```{r, fig.width=10}
etf_returns %>%
  filter(ticker %in% c("GVIP", "PTF", "XME")) %>%
  mutate(across(hold:intraday, ~port(.))) %>%
  pivot_longer(hold:intraday) %>%
  ggplot(aes(date, value, color = name)) +
  geom_line() +
  facet_wrap(~ticker, scales = "free") +
  labs(title = "top3 overnight cumulative returns",
       subtitle = "2015-06-05 to 2022-06-07",
       x = "", y = "returns (%)")
```

# 3.4 all etfs - new metric: 'excess overnight returns' / 'overnight premium'
created new metric called 'premium' = overnight returns - hold returns.
arranged by descending order, top5 listed below

```{r}
etf_returns %>%
  summarise(across(hold:overnight, ~port(., total = TRUE))) %>%
  mutate(premium = overnight-hold) %>%
  arrange(desc(premium)) %>%
  filter(overnight <= quantile(overnight, .95)) %>%
  head(5)
```

# 3.5 all etfs - graphed cumulative returns of top3 etfs from 3.4
the data for some of these might not be right

```{r, fig.width=10}
etf_returns %>%
  filter(ticker %in% c("KWEB", "FENY", "GVIP")) %>%
  mutate(across(hold:intraday, ~port(.))) %>%
  pivot_longer(hold:intraday) %>%
  ggplot(aes(date, value, color = name)) +
  geom_line() +
  facet_wrap(~ticker, scales = "free") +
  labs(title = "top3 overnight cumulative returns",
       subtitle = "2015-06-05 to 2022-06-07",
       x = "", y = "returns (%)")
```


# 4.0 text eda - tokenized words from etf names
1. calculated the total return by ticker
2. split etf names into words
3. grouped words, then aggregated ticker returns + counted how many times each word was in data (frequency)
4. distinct'd all return data to avoid words with same return values (happens when words are shown multiple times, same value, same etf. ie. "S & P...")
5. filtered for word count to be between 10 & 100 (wanted to see a grouping effect/relationship, doesn't help if one word is from one ticker or if the word is etf, index, or fund)
6. graphed a verticle column chart by word, by return type (ie. hold/overnight/intraday), colored/filled by count

```{r, fig.height=12}
etf_returns %>%
  summarise(across(hold:intraday, ~port(., total = TRUE))) %>%
  left_join(etf_list, by = "ticker") %>%
  unnest_tokens(words, name) %>%
  group_by(words) %>%
  summarise(across(hold:intraday, ~mean(.)),
            count = n()) %>%
  distinct(hold, overnight, intraday, .keep_all = TRUE) %>%
  filter(count >= 10 & count <= 100) %>%
  pivot_longer(hold:intraday) %>%
  mutate(words = reorder_within(words, value, name)) %>%
  ggplot(aes(value, words, fill = count)) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(~name, scales = "free") +
  labs(title = "aggregated returns by tokenized words",
       subtitle = "avg total returns of 300+ etfs by words from name, 2015-2020",
       x = "avg total returns (%)", y = "")
```

# 4.1 text eda - same concept as 4.0 except for the metric: premium

```{r, fig.height=8}
etf_returns %>%
  summarise(across(hold:overnight, ~port(., total = TRUE))) %>%
  mutate(premium = overnight-hold) %>%
  left_join(etf_list, by = "ticker") %>%
  unnest_tokens(words, name) %>%
  group_by(words) %>%
  summarise(premium = mean(premium),
            count = n()) %>%
  distinct(premium, .keep_all = TRUE) %>%
  filter(count >= 10 & count <= 100) %>% 
  ggplot(aes(premium, fct_reorder(words, premium), fill = count)) +
  geom_col() +
  labs(title = "aggregated overnight excess returns by tokenized words",
       subtitle = "avg total 'overnight excess returns' (overnight-hold) of 300+ etfs by words from name, 2015-2022",
       x = "avg overnight excess returns (%)", y = "")
```

# 5.0 date eda - boxplots of etf returns by year
how do these returns pane out in years like 2020?

```{r}
etf_returns %>%
  mutate(year = year(date)) %>%
  group_by(ticker, year) %>%
  summarise(across(hold:intraday, ~port(., total = TRUE)),
            count = n()) %>%
  filter(count >= 220) %>%
  pivot_longer(hold:intraday) %>%
  ggplot(aes(name, value, fill = name)) +
  geom_boxplot(outlier.shape=NA) +
  facet_wrap(~year, scales = "free") +
  coord_cartesian(ylim = c(-.5, 1)) +
  geom_hline(yintercept = 0, linetype = 2, alpha = 0.5, color = "red") +
  labs(title = "etf returns by year",
       subtitle = "2016-2021, full years",
       x = "", y = "return (%)")
```

# 5.1 date eda - summary tbl of data from 5.0
etf returns are aggregated by year

```{r}
etf_returns %>%
  mutate(year = year(date)) %>%
  group_by(ticker, year) %>%
  summarise(across(hold:intraday, ~port(., total = TRUE)),
            count = n()) %>%
  filter(count >= 220) %>%
  mutate(premium = overnight-hold) %>%
  ungroup() %>%
  group_by(year) %>%
  summarise(mean = mean(premium),
            median = median(premium),
            sd = sd(premium),
            q10 = quantile(premium, .1),
            q90 = quantile(premium, .9),
            etfs_count = n())
```

# 5.2 date eda - boxplots of etf returns by weekday
similar concept as 5.1, except for weekday. should be noted, the day of overnight returns are of the day that they are closed, not opened. ie. overnight returns for monday are actually the returns held from friday close to monday open.

```{r}
etf_returns %>%
  mutate(weekday = weekdays(date)) %>%
  group_by(ticker, weekday) %>%
  summarise(across(hold:intraday, ~port(., total = TRUE)),
            count = n()) %>%
  filter(count >= 220) %>%
  pivot_longer(hold:intraday) %>%
  ggplot(aes(name, value, fill = name)) +
  geom_boxplot(outlier.shape=NA) +
  facet_wrap(~weekday, scales = "free") +
  coord_cartesian(ylim = c(-.5, 1.2)) +
  geom_hline(yintercept = 0, linetype = 2, alpha = 0.5, color = "red") +
  labs(title = "etf returns by weekday",
       subtitle = "2016-2021, full years",
       x = "", y = "return (%)")
```

# 5.3 date eda - boxplots of etf 'excess overnight returns' by year + weekdays
similar concept as above boxplots, but combined for one metric: excess overnight returns (overnight returns - hold returns)

```{r}
etf_returns %>%
  mutate(year = year(date),
         weekday = weekdays(date, abbreviate = TRUE)) %>%
  group_by(ticker, year, weekday) %>%
  summarise(across(hold:overnight, ~port(., total = TRUE)),
            count = n()) %>%
  filter(count >= 45) %>%
  mutate(premium = overnight-hold) %>%
  ggplot(aes(weekday, premium, fill = weekday)) +
  geom_boxplot(outlier.shape=NA) +
  facet_wrap(~year, scales = "free") +
  coord_cartesian(ylim = c(-.3, 0.3)) +
  geom_hline(yintercept = 0, linetype = 2, alpha = 0.5, color = "red") +
  labs(title = "'excess overnight returns' by year + weekdays",
       subtitle = "2016-2021, full years",
       x = "", y = "return (%)")
```

# 5.4 date eda - boxplots of etf 'excess overnight returns' by year + quarter
similar concept as above boxplots, but combined for one metric: excess overnight returns (overnight returns - hold returns)

```{r}
etf_returns %>%
  mutate(year = year(date),
         quarter = as.factor(quarter(date))) %>%
  group_by(ticker, year, quarter) %>%
  summarise(across(hold:overnight, ~port(., total = TRUE)),
            count = n()) %>%
  filter(year %in% c(2016:2021)) %>%
  mutate(premium = overnight-hold) %>%
  ggplot(aes(quarter, premium, fill = quarter)) +
  geom_boxplot(outlier.shape=NA) +
  facet_wrap(~year, scales = "free") +
  coord_cartesian(ylim = c(-.25, 0.35)) +
  geom_hline(yintercept = 0, linetype = 2, alpha = 0.5, color = "red") +
  labs(title = "'excess overnight returns' by year + quarter",
       subtitle = "2016-2021, full years",
       x = "", y = "return (%)")
```


























