1.0 data - etfs lists: ticker & name

filtered to remove leveraged/short/inversed etfs, price >= 10$ & mktcap >= 100m as of 2018

1.1 data - etf prices

from 2015-06-05 to 2022-06-07

etf_prices <- foreach(var = 1:nrow(etf_list), .combine = "bind_rows", .packages = "data.table") %dopar% {
  price_api(
    ticker = etf_list$ticker[var],
    from = from_date,
    to = to_date
  )
}

etf_prices <- etf_prices %>%
  rename_all(~tolower(.)) %>%
  select(ticker, everything()) %>%
  drop_na(close) %>%
  mutate(date = ymd(date)) %>%
  arrange(ticker, date)

head(etf_prices, 5)

2.0 spy - filtered for spy prices & calculated returns


daily simple returns = (last/first)-1

hold = (close/lag(close))-1

overnight = (open/lag(close))-1

intraday = (close/open)-1

the first days of all returns set to 0

head(spy_returns, 5)

2.1 spy - cumulative return timeseries

2.2 spy - return/inferential metrics

2015-06-05 to 2022-06-07

spy_returns %>%
  pivot_longer(hold:intraday, names_to = "spy") %>%
  group_by(spy) %>%
  summarise(return = port(value, total = TRUE),
            mean = mean(value),
            median = median(value),
            min = min(value),
            q10 = quantile(value, .1),
            q90 = quantile(value, .9),
            max = max(value))

2.3 spy - return/portfolio metrics

spy_returns %>%
  pivot_longer(hold:intraday, names_to = "spy") %>%
  group_by(spy) %>%
  mutate(portfolio = port(value, balance = 10000)) %>% 
  summarise(return = port(value, total = TRUE),
            cagr = cagr(portfolio, days = as.numeric(last(date)-first(date))),
            stdev = sd(value),
            annual_vol = stdev*sqrt(252),
            sharpe = cagr/annual_vol)

3.0 all etfs - return summary tbl grouped by ticker

ie. first row shows the simple returns of hold:intraday for aaxj, for 2015-06 to 2022-06. overnight was filtered for <= q95(overnight). seemed to have some bad data (possible outliers) for 5-8 tickers, quick attempt to remove them.

3.1 all etfs - summary tbl of the table from 3.0

etf_returns %>%
  summarise(across(hold:intraday, ~port(., total = TRUE)),
            trading_days = n()) %>% 
  filter(overnight <= quantile(overnight, .95)) %>% 
  summary() 
    ticker               hold            overnight           intraday         trading_days 
 Length:344         Min.   :-0.70217   Min.   :-0.76216   Min.   :-0.85631   Min.   :1106  
 Class :character   1st Qu.:-0.04497   1st Qu.:-0.04444   1st Qu.:-0.27788   1st Qu.:1765  
 Mode  :character   Median : 0.26180   Median : 0.33977   Median :-0.02531   Median :1765  
                    Mean   : 0.39452   Mean   : 0.64582   Mean   :-0.01074   Mean   :1749  
                    3rd Qu.: 0.75331   3rd Qu.: 1.11882   3rd Qu.: 0.17950   3rd Qu.:1765  
                    Max.   : 3.43673   Max.   : 4.77693   Max.   : 2.31032   Max.   :1765  

3.2 all etfs - boxplots of 3.0 data

3.3 all etfs - top3 overnight return tickers

wanted to take a closer look at the top tickers for overnight returns: gvip, ptf, & xme. graphed a cumulative return timeseries for them. question: highest overnight return doesn’t necessarily mean overnight outperforms the buyhold strat per ticker. creating a new metric next section to look into this.

3.4 all etfs - new metric: ‘excess overnight returns’ / ‘overnight premium’

created new metric called ‘premium’ = overnight returns - hold returns. arranged by descending order, top5 listed below

3.5 all etfs - graphed cumulative returns of top3 etfs from 3.4

the data for some of these might not be right

4.0 text eda - tokenized words from etf names

  1. calculated the total return by ticker
  2. split etf names into words
  3. grouped words, then aggregated ticker returns + counted how many times each word was in data (frequency)
  4. distinct’d all return data to avoid words with same return values (happens when words are shown multiple times, same value, same etf. ie. “S & P…”)
  5. filtered for word count to be between 10 & 100 (wanted to see a grouping effect/relationship, doesn’t help if one word is from one ticker or if the word is etf, index, or fund)
  6. graphed a verticle column chart by word, by return type (ie. hold/overnight/intraday), colored/filled by count
etf_returns %>%
  summarise(across(hold:intraday, ~port(., total = TRUE))) %>%
  left_join(etf_list, by = "ticker") %>%
  unnest_tokens(words, name) %>%
  group_by(words) %>%
  summarise(across(hold:intraday, ~mean(.)),
            count = n()) %>%
  distinct(hold, overnight, intraday, .keep_all = TRUE) %>%
  filter(count >= 10 & count <= 100) %>%
  pivot_longer(hold:intraday) %>%
  mutate(words = reorder_within(words, value, name)) %>%
  ggplot(aes(value, words, fill = count)) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(~name, scales = "free") +
  labs(title = "aggregated returns by tokenized words",
       subtitle = "avg total returns of 300+ etfs by words from name, 2015-2020",
       x = "avg total returns (%)", y = "")

4.1 text eda - same concept as 4.0 except for the metric: premium

5.0 date eda - boxplots of etf returns by year

how do these returns pane out in years like 2020?

etf_returns %>%
  mutate(year = year(date)) %>%
  group_by(ticker, year) %>%
  summarise(across(hold:intraday, ~port(., total = TRUE)),
            count = n()) %>%
  filter(count >= 220) %>%
  pivot_longer(hold:intraday) %>%
  ggplot(aes(name, value, fill = name)) +
  geom_boxplot(outlier.shape=NA) +
  facet_wrap(~year, scales = "free") +
  coord_cartesian(ylim = c(-.5, 1)) +
  geom_hline(yintercept = 0, linetype = 2, alpha = 0.5, color = "red") +
  labs(title = "etf returns by year",
       subtitle = "2016-2021, full years",
       x = "", y = "return (%)")
`summarise()` has grouped output by 'ticker'. You can override using the `.groups` argument.

5.1 date eda - summary tbl of data from 5.0

etf returns are aggregated by year

etf_returns %>%
  mutate(year = year(date)) %>%
  group_by(ticker, year) %>%
  summarise(across(hold:intraday, ~port(., total = TRUE)),
            count = n()) %>%
  filter(count >= 220) %>%
  mutate(premium = overnight-hold) %>%
  ungroup() %>%
  group_by(year) %>%
  summarise(mean = mean(premium),
            median = median(premium),
            sd = sd(premium),
            q10 = quantile(premium, .1),
            q90 = quantile(premium, .9),
            etfs_count = n())
`summarise()` has grouped output by 'ticker'. You can override using the `.groups` argument.

5.2 date eda - boxplots of etf returns by weekday

similar concept as 5.1, except for weekday. should be noted, the day of overnight returns are of the day that they are closed, not opened. ie. overnight returns for monday are actually the returns held from friday close to monday open.

etf_returns %>%
  mutate(weekday = weekdays(date)) %>%
  group_by(ticker, weekday) %>%
  summarise(across(hold:intraday, ~port(., total = TRUE)),
            count = n()) %>%
  filter(count >= 220) %>%
  pivot_longer(hold:intraday) %>%
  ggplot(aes(name, value, fill = name)) +
  geom_boxplot(outlier.shape=NA) +
  facet_wrap(~weekday, scales = "free") +
  coord_cartesian(ylim = c(-.5, 1.2)) +
  geom_hline(yintercept = 0, linetype = 2, alpha = 0.5, color = "red") +
  labs(title = "etf returns by weekday",
       subtitle = "2016-2021, full years",
       x = "", y = "return (%)")
`summarise()` has grouped output by 'ticker'. You can override using the `.groups` argument.

5.3 date eda - boxplots of etf ‘excess overnight returns’ by year + weekdays

similar concept as above boxplots, but combined for one metric: excess overnight returns (overnight returns - hold returns)

etf_returns %>%
  mutate(year = year(date),
         weekday = weekdays(date, abbreviate = TRUE)) %>%
  group_by(ticker, year, weekday) %>%
  summarise(across(hold:overnight, ~port(., total = TRUE)),
            count = n()) %>%
  filter(count >= 45) %>%
  mutate(premium = overnight-hold) %>%
  ggplot(aes(weekday, premium, fill = weekday)) +
  geom_boxplot(outlier.shape=NA) +
  facet_wrap(~year, scales = "free") +
  coord_cartesian(ylim = c(-.3, 0.3)) +
  geom_hline(yintercept = 0, linetype = 2, alpha = 0.5, color = "red") +
  labs(title = "'excess overnight returns' by year + weekdays",
       subtitle = "2016-2021, full years",
       x = "", y = "return (%)")
`summarise()` has grouped output by 'ticker', 'year'. You can override using the `.groups` argument.

5.4 date eda - boxplots of etf ‘excess overnight returns’ by year + quarter

similar concept as above boxplots, but combined for one metric: excess overnight returns (overnight returns - hold returns)

etf_returns %>%
  mutate(year = year(date),
         quarter = as.factor(quarter(date))) %>%
  group_by(ticker, year, quarter) %>%
  summarise(across(hold:overnight, ~port(., total = TRUE)),
            count = n()) %>%
  filter(year %in% c(2016:2021)) %>%
  mutate(premium = overnight-hold) %>%
  ggplot(aes(quarter, premium, fill = quarter)) +
  geom_boxplot(outlier.shape=NA) +
  facet_wrap(~year, scales = "free") +
  coord_cartesian(ylim = c(-.25, 0.35)) +
  geom_hline(yintercept = 0, linetype = 2, alpha = 0.5, color = "red") +
  labs(title = "'excess overnight returns' by year + quarter",
       subtitle = "2016-2021, full years",
       x = "", y = "return (%)")
`summarise()` has grouped output by 'ticker', 'year'. You can override using the `.groups` argument.

---
title: "overnight eda"
output: html_notebook
---

```{r, include=FALSE}
# libraries/utilities
source("utils/libraries.R")
source("utils/helper_functions.R")

# options
options(scipen = 100)
registerDoParallel(cores = 10)
knitr::opts_chunk$set(fig.width = 12, fig.height = 5)
```

# 1.0 data - etfs lists: ticker & name
filtered to remove leveraged/short/inversed etfs, price >= 10$ & mktcap >= 100m as of 2018

```{r}
etf_list %>%
  select(ticker, name)
```

# 1.1 data - etf prices
from 2015-06-05 to 2022-06-07

```{r}
etf_prices <- etf_prices %>%
  rename_all(~tolower(.)) %>%
  select(ticker, everything()) %>%
  drop_na(close) %>%
  mutate(date = ymd(date)) %>%
  arrange(ticker, date)

head(etf_prices, 5)
```

```{r, include=FALSE}
spy_returns <- etf_prices %>%
  filter(ticker == "SPY") %>%
  mutate(hold = close/lag(close)-1,
         overnight = open/lag(close)-1,
         intraday = close/open-1,
         across(hold:intraday, ~replace(., 1, 0)))
```

# 2.0 spy - filtered for spy prices & calculated returns
<br>daily simple returns = (last/first)-1</br>
<br>hold = (close/lag(close))-1</br>
<br>overnight = (open/lag(close))-1</br>
<br>intraday = (close/open)-1</br>
<br>the first days of all returns set to 0</br>

```{r}
head(spy_returns, 5)
```

# 2.1 spy - cumulative return timeseries

```{r}
spy_returns %>%
  mutate(across(hold:intraday, ~port(.))) %>%  # summary()
  pivot_longer(hold:intraday) %>%
  ggplot(aes(date, value, color = name)) +
  geom_line() +
  labs(title = "spy cumulative returns",
       subtitle = "2015-06-05 to 2022-06-07",
       x = "", y = "returns (%)")
```

# 2.2 spy - return/inferential metrics
2015-06-05 to 2022-06-07

```{r}
spy_returns %>%
  pivot_longer(hold:intraday, names_to = "spy") %>%
  group_by(spy) %>%
  summarise(return = port(value, total = TRUE),
            mean = mean(value),
            median = median(value),
            min = min(value),
            q10 = quantile(value, .1),
            q90 = quantile(value, .9),
            max = max(value))
```

# 2.3 spy - return/portfolio metrics

```{r}
spy_returns %>%
  pivot_longer(hold:intraday, names_to = "spy") %>%
  group_by(spy) %>%
  mutate(portfolio = port(value, balance = 10000)) %>% 
  summarise(return = port(value, total = TRUE),
            cagr = cagr(portfolio, days = as.numeric(last(date)-first(date))),
            stdev = sd(value),
            annual_vol = stdev*sqrt(252),
            sharpe = cagr/annual_vol)
```

# 3.0 all etfs - return summary tbl grouped by ticker
ie. first row shows the simple returns of hold:intraday for aaxj, for 2015-06 to 2022-06.
overnight was filtered for <= q95(overnight). seemed to have some bad data (possible outliers) for 5-8 tickers, quick attempt to remove them.

```{r}
# etf_returns <- etf_prices %>%
#   group_by(ticker) %>%
#   mutate(hold = close/lag(close)-1,
#          overnight = open/lag(close)-1,
#          intraday = close/open-1,
#          across(hold:intraday, ~replace(., 1, 0))) %>%
#   select(ticker, date, hold:intraday)

etf_returns %>%
  summarise(across(hold:intraday, ~port(., total = TRUE)),
            trading_days = n()) %>% 
  filter(overnight <= quantile(overnight, .95)) %>%
  head(5)
```

# 3.1 all etfs - summary tbl of the table from 3.0

```{r}
etf_returns %>%
  summarise(across(hold:intraday, ~port(., total = TRUE)),
            trading_days = n()) %>% 
  filter(overnight <= quantile(overnight, .95)) %>% 
  summary() 
```

# 3.2 all etfs - boxplots of 3.0 data

```{r}
etf_returns %>%
  summarise(across(hold:intraday, ~port(., total = TRUE))) %>% 
  filter(overnight <= quantile(overnight, .95)) %>% 
  pivot_longer(hold:intraday) %>%
  ggplot(aes(name, value)) +
  geom_boxplot(outlier.shape=NA) +
  geom_hline(yintercept = 0, linetype = 2, alpha = 0.5, color = "red") +
  coord_cartesian(ylim = c(-1,3)) +
  labs(title = "etf returns",
       subtitle = "2015-06-05 to 2022-06-07",
       x = "", y = "returns (%)")
```

# 3.3 all etfs - top3 overnight return tickers
wanted to take a closer look at the top tickers for overnight returns: gvip, ptf, & xme. graphed a cumulative return timeseries for them.
question: highest overnight return doesn't necessarily mean overnight outperforms the buyhold strat per ticker. creating a new metric next section to look into this.

```{r, fig.width=10}
etf_returns %>%
  filter(ticker %in% c("GVIP", "PTF", "XME")) %>%
  mutate(across(hold:intraday, ~port(.))) %>%
  pivot_longer(hold:intraday) %>%
  ggplot(aes(date, value, color = name)) +
  geom_line() +
  facet_wrap(~ticker, scales = "free") +
  labs(title = "top3 overnight cumulative returns",
       subtitle = "2015-06-05 to 2022-06-07",
       x = "", y = "returns (%)")
```

# 3.4 all etfs - new metric: 'excess overnight returns' / 'overnight premium'
created new metric called 'premium' = overnight returns - hold returns.
arranged by descending order, top5 listed below

```{r}
etf_returns %>%
  summarise(across(hold:overnight, ~port(., total = TRUE))) %>%
  mutate(premium = overnight-hold) %>%
  arrange(desc(premium)) %>%
  filter(overnight <= quantile(overnight, .95)) %>%
  head(5)
```

# 3.5 all etfs - graphed cumulative returns of top3 etfs from 3.4
the data for some of these might not be right

```{r, fig.width=10}
etf_returns %>%
  filter(ticker %in% c("KWEB", "FENY", "GVIP")) %>%
  mutate(across(hold:intraday, ~port(.))) %>%
  pivot_longer(hold:intraday) %>%
  ggplot(aes(date, value, color = name)) +
  geom_line() +
  facet_wrap(~ticker, scales = "free") +
  labs(title = "top3 overnight cumulative returns",
       subtitle = "2015-06-05 to 2022-06-07",
       x = "", y = "returns (%)")
```


# 4.0 text eda - tokenized words from etf names
1. calculated the total return by ticker
2. split etf names into words
3. grouped words, then aggregated ticker returns + counted how many times each word was in data (frequency)
4. distinct'd all return data to avoid words with same return values (happens when words are shown multiple times, same value, same etf. ie. "S & P...")
5. filtered for word count to be between 10 & 100 (wanted to see a grouping effect/relationship, doesn't help if one word is from one ticker or if the word is etf, index, or fund)
6. graphed a verticle column chart by word, by return type (ie. hold/overnight/intraday), colored/filled by count

```{r, fig.height=12}
etf_returns %>%
  summarise(across(hold:intraday, ~port(., total = TRUE))) %>%
  left_join(etf_list, by = "ticker") %>%
  unnest_tokens(words, name) %>%
  group_by(words) %>%
  summarise(across(hold:intraday, ~mean(.)),
            count = n()) %>%
  distinct(hold, overnight, intraday, .keep_all = TRUE) %>%
  filter(count >= 10 & count <= 100) %>%
  pivot_longer(hold:intraday) %>%
  mutate(words = reorder_within(words, value, name)) %>%
  ggplot(aes(value, words, fill = count)) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(~name, scales = "free") +
  labs(title = "aggregated returns by tokenized words",
       subtitle = "avg total returns of 300+ etfs by words from name, 2015-2020",
       x = "avg total returns (%)", y = "")
```

# 4.1 text eda - same concept as 4.0 except for the metric: premium

```{r, fig.height=8}
etf_returns %>%
  summarise(across(hold:overnight, ~port(., total = TRUE))) %>%
  mutate(premium = overnight-hold) %>%
  left_join(etf_list, by = "ticker") %>%
  unnest_tokens(words, name) %>%
  group_by(words) %>%
  summarise(premium = mean(premium),
            count = n()) %>%
  distinct(premium, .keep_all = TRUE) %>%
  filter(count >= 10 & count <= 100) %>% 
  ggplot(aes(premium, fct_reorder(words, premium), fill = count)) +
  geom_col() +
  labs(title = "aggregated overnight excess returns by tokenized words",
       subtitle = "avg total 'overnight excess returns' (overnight-hold) of 300+ etfs by words from name, 2015-2022",
       x = "avg overnight excess returns (%)", y = "")
```

# 5.0 date eda - boxplots of etf returns by year
how do these returns pane out in years like 2020?

```{r}
etf_returns %>%
  mutate(year = year(date)) %>%
  group_by(ticker, year) %>%
  summarise(across(hold:intraday, ~port(., total = TRUE)),
            count = n()) %>%
  filter(count >= 220) %>%
  pivot_longer(hold:intraday) %>%
  ggplot(aes(name, value, fill = name)) +
  geom_boxplot(outlier.shape=NA) +
  facet_wrap(~year, scales = "free") +
  coord_cartesian(ylim = c(-.5, 1)) +
  geom_hline(yintercept = 0, linetype = 2, alpha = 0.5, color = "red") +
  labs(title = "etf returns by year",
       subtitle = "2016-2021, full years",
       x = "", y = "return (%)")
```

# 5.1 date eda - summary tbl of data from 5.0
etf returns are aggregated by year

```{r}
etf_returns %>%
  mutate(year = year(date)) %>%
  group_by(ticker, year) %>%
  summarise(across(hold:intraday, ~port(., total = TRUE)),
            count = n()) %>%
  filter(count >= 220) %>%
  mutate(premium = overnight-hold) %>%
  ungroup() %>%
  group_by(year) %>%
  summarise(mean = mean(premium),
            median = median(premium),
            sd = sd(premium),
            q10 = quantile(premium, .1),
            q90 = quantile(premium, .9),
            etfs_count = n())
```

# 5.2 date eda - boxplots of etf returns by weekday
similar concept as 5.1, except for weekday. should be noted, the day of overnight returns are of the day that they are closed, not opened. ie. overnight returns for monday are actually the returns held from friday close to monday open.

```{r}
etf_returns %>%
  mutate(weekday = weekdays(date)) %>%
  group_by(ticker, weekday) %>%
  summarise(across(hold:intraday, ~port(., total = TRUE)),
            count = n()) %>%
  filter(count >= 220) %>%
  pivot_longer(hold:intraday) %>%
  ggplot(aes(name, value, fill = name)) +
  geom_boxplot(outlier.shape=NA) +
  facet_wrap(~weekday, scales = "free") +
  coord_cartesian(ylim = c(-.5, 1.2)) +
  geom_hline(yintercept = 0, linetype = 2, alpha = 0.5, color = "red") +
  labs(title = "etf returns by weekday",
       subtitle = "2016-2021, full years",
       x = "", y = "return (%)")
```

# 5.3 date eda - boxplots of etf 'excess overnight returns' by year + weekdays
similar concept as above boxplots, but combined for one metric: excess overnight returns (overnight returns - hold returns)

```{r}
etf_returns %>%
  mutate(year = year(date),
         weekday = weekdays(date, abbreviate = TRUE)) %>%
  group_by(ticker, year, weekday) %>%
  summarise(across(hold:overnight, ~port(., total = TRUE)),
            count = n()) %>%
  filter(count >= 45) %>%
  mutate(premium = overnight-hold) %>%
  ggplot(aes(weekday, premium, fill = weekday)) +
  geom_boxplot(outlier.shape=NA) +
  facet_wrap(~year, scales = "free") +
  coord_cartesian(ylim = c(-.3, 0.3)) +
  geom_hline(yintercept = 0, linetype = 2, alpha = 0.5, color = "red") +
  labs(title = "'excess overnight returns' by year + weekdays",
       subtitle = "2016-2021, full years",
       x = "", y = "return (%)")
```

# 5.4 date eda - boxplots of etf 'excess overnight returns' by year + quarter
similar concept as above boxplots, but combined for one metric: excess overnight returns (overnight returns - hold returns)

```{r}
etf_returns %>%
  mutate(year = year(date),
         quarter = as.factor(quarter(date))) %>%
  group_by(ticker, year, quarter) %>%
  summarise(across(hold:overnight, ~port(., total = TRUE)),
            count = n()) %>%
  filter(year %in% c(2016:2021)) %>%
  mutate(premium = overnight-hold) %>%
  ggplot(aes(quarter, premium, fill = quarter)) +
  geom_boxplot(outlier.shape=NA) +
  facet_wrap(~year, scales = "free") +
  coord_cartesian(ylim = c(-.25, 0.35)) +
  geom_hline(yintercept = 0, linetype = 2, alpha = 0.5, color = "red") +
  labs(title = "'excess overnight returns' by year + quarter",
       subtitle = "2016-2021, full years",
       x = "", y = "return (%)")
```


























