Background

As someone interested in long-term investing (e.g. the lazy-portfolio method), I have preferred ETFs over individual stocks for quite some time. ETFs offer the same trading flexibility as individual stocks, but by construction they provide much greater portfolio diversification and better risk management.

Indexing against the S&P 500 or the Russell 2000 is a safe and sound long-term (10+ years) investment approach if you have neither the patience nor the time for day trading. With ETFs, you can often achieve stronger and more stable long-run returns than with the mega index funds.

In the following 4 steps, we will go through a nicely reproducible and fairly automated process of creating a small, but good enough, ETF investment portfolio. I will caveat all this by stating that this is not intended as investment advice. But the method below provides a quick and analytically rigorous way of selecting top ETFs, above and beyond what a typical brokerage account’s ETF selection/filtering platform provides. With the recent massive Dow drop and an inverted yield curve (which typically precedes a recession by 6-18 months), the next few months will be ripe for augmenting your personal portfolio with ETFs.

The motivation for many of the functions used in the example below came from the Business Science University’s Russell 2000 analysis. As someone who focuses on the practical aspects of data science, I have particularly enjoyed what this company is doing in terms of analytics education, training data scientists to deliver advanced, yet practical and actionable solutions with measurable ROIs. In the real world, analytics-driven, measurable business results always outweigh cool tools and analytical sophistication, so it’s critical that analytics practitioners focus on ROI both pre-project and in the post-mortem.

Okay, let’s get started by invoking some key libraries and baseline functions. You can see the source code for each section by clicking on the blue tabs.

knitr::opts_chunk$set(echo = TRUE, cache = TRUE)

library(rvest)
library(quantmod)
library(tidyverse)
library(stringr)
library(lubridate)
library(modelr)
library(plotly)
library(data.table)
library(readODS)
library(corrplot)
library(dtwclust)
library(dtw)
library(ggforce)
library(knitr)

#standardize column names: lowercase, snake_case, strip units and special characters
header_changes = function(mydata){
  setnames(mydata, names(mydata), tolower(names(mydata)))
  setnames(mydata, names(mydata), gsub(" ","_", names(mydata), fixed = TRUE))
  setnames(mydata, names(mydata), gsub("_(sum)","", names(mydata), fixed = TRUE))
  setnames(mydata, names(mydata), gsub("_(agg)","", names(mydata), fixed = TRUE))
  setnames(mydata, names(mydata), gsub("_(avg)","_avg", names(mydata), fixed = TRUE))
  setnames(mydata, names(mydata), gsub("(c)","", names(mydata), fixed = TRUE))
  setnames(mydata, names(mydata), gsub("$","dollar_volume", names(mydata), fixed = TRUE))
  setnames(mydata, names(mydata), gsub("%","pct", names(mydata), fixed = TRUE))
  setnames(mydata, names(mydata), gsub("w/o","without", names(mydata), fixed = TRUE))
  setnames(mydata, names(mydata), gsub("-","_", names(mydata), fixed = TRUE))
  setnames(mydata, names(mydata), gsub("&","", names(mydata), fixed = TRUE))
  setnames(mydata, names(mydata), gsub("/","_", names(mydata), fixed = TRUE))
  setnames(mydata, names(mydata), gsub("\\(","", names(mydata), fixed = TRUE))
  setnames(mydata, names(mydata), gsub("\\)","", names(mydata), fixed = TRUE))
  setnames(mydata, names(mydata), gsub(".","", names(mydata), fixed = TRUE))
  setnames(mydata, names(mydata), iconv(names(mydata), "UTF-8","ASCII", sub=""))
}

`%ni%` = Negate('%in%')  #this is equivalent to "NOT IN" in SQL
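
As a quick illustration of what header_changes does, here it is applied to a made-up table (the expected cleaned names are shown in the comment):

dt_example = data.table(`Adj. Close ($)` = 1, `Return %` = 2)
header_changes(dt_example)
names(dt_example)  #"adj_close_dollar_volume" "return_pct"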

Step 1: Data munging and exploratory analysis

Obtain historical stock prices via Python

We first read in an exhaustive list of 1,320 ETFs and their weekly price history for the past 20 years. We previously created this data via Python’s alpha_vantage package. For more information, check out the Alpha Vantage website. The API has a free tier, although, as usual, there are limitations (speed and call limits). If you want something more robust, there are fairly reasonably priced subscription options, or a host of other APIs that specialize in market data.

Use your Jupyter notebook, or RStudio via the reticulate package, to replicate the ETF price scraping. A sample Python script is below.
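
If you prefer to stay in RStudio, a minimal reticulate sketch could look like the following (get_etf_prices.py is just a stand-in name for the Python script below, saved to a file):

library(reticulate)
use_python("/usr/bin/python3")    #point this at your Python installation
py_run_file("get_etf_prices.py")  #stand-in name for the script below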

Note that the data.table package, which is very popular in R, is also available in Python under the name datatable. It is more efficient than pandas for large data manipulations (think 1MM+ rows) and offers fast, flexible file reading and writing. I highly recommend you check it out.

#download the list of ETFs at: https://www.nasdaq.com/etfs/list

from alpha_vantage.timeseries import TimeSeries
import pandas as pd
import datatable as dt

# Specify the API key
ts = TimeSeries(key='your API key', output_format='pandas', indexing_type='date')

#read in ETF listing and convert to pandas
ETF_df = dt.fread('ETF_symbols.csv').to_pandas()

#loop through the symbols, collecting each ETF's weekly price history
frames = []

for i in ETF_df.index:
    print(str(i) + " out of " + str(max(ETF_df.index)))
    try:
        stock = ETF_df.iloc[i, 0]
        data, meta_data = ts.get_weekly_adjusted(symbol=stock)
        data.index.name = 'date'
        data.reset_index(inplace=True)
        data['symbol'] = stock
        frames.append(data)
    except Exception:
        pass  #skip symbols the API cannot return

#concatenate, then write out the csv via datatable (faster than pandas for large files)
stock_price_history = pd.concat(frames, ignore_index=True)

dt.Frame(stock_price_history).to_csv('ETF_weekly_prices.csv')

Read in ETF price history in R

We will read in the ETF price history and specify a couple of conditions (you can easily change these below) for our analytical data set:
1. We will limit our evaluation to the last 10 years of weekly prices
2. We will only include ETFs that have traded for at least the last 3 years

The 2 .csv files used in our scripts are “ETF_weekly_prices.csv” and “ETF_symbols.csv”.

An example of our raw data is below:

n_year_length = 10 #this is how many years of returns we will analyze (we are interested in long term investments)
n_trading_weeks = 3 * 52 #we will only include ETFs that have traded for at least this many weeks (ideally, at least 3 years, or 156 weeks)

#read in ETF list that has already completed parsing
stocklist = fread('/home/kakasa09/stock_investing/ETF_weekly_prices.csv')

stocklist = stocklist %>% as_tibble() %>% select(date, `5. adjusted close`, `7. dividend amount`, symbol) %>%
  rename(Date = date,
         Adjusted = `5. adjusted close`,
         Dividend = `7. dividend amount`)

stocklist$Date <- as.Date(stocklist$Date)
latest_trading_date_global = max(stocklist$Date)

stocklist = stocklist %>% group_by(symbol) %>%
  mutate(trading_weeks = n_distinct(Date),
         latest_trading_date = max(Date)) %>% ungroup() %>%
  filter(trading_weeks >= n_trading_weeks) %>%
  filter(Date >= today() - (365 * n_year_length)) %>%
  filter(latest_trading_date == latest_trading_date_global) %>%
  select(-latest_trading_date, -trading_weeks)

#this is how our raw data looks after some filtering

kable(head(stocklist, n=5), format ='markdown')
|Date       | Adjusted| Dividend|symbol |
|:----------|--------:|--------:|:------|
|2009-08-14 |  18.9262|        0|QABA   |
|2009-08-21 |  19.1747|        0|QABA   |
|2009-08-28 |  18.5694|        0|QABA   |
|2009-09-04 |  18.0266|        0|QABA   |
|2009-09-11 |  18.4807|        0|QABA   |

Calculate weekly log returns and dividend performance for each ETF

The get_log_returns function is a slightly modified version of the elegant code provided through the Business Science University’s Russell 2000 analysis. As a general rule in R programming, if you need to use a set of complex operations 3-4 times, then make it into a function.

In addition to evaluating the weekly log returns, we are also interested in the average annual dividend payout for each ETF. Dividend performance will play a very small role in our final ETF selection, but it is an important consideration nonetheless.

Below is how our data looks after we apply the get_log_returns function and obtain, for each of our ETFs, the average weekly log return and the standard deviation of those weekly log returns. Note that the newly created data frame contains both list-columns (weekly stock prices and log returns for each ETF) and numeric variables.

Below is the average annual dividend payout by ETF (expressed in dollars).

#get the weekly log returns of our adjusted ETF prices    
#function modified from original source: https://www.business-science.io/investments/2016/11/30/Russell2000_Analysis.html

get_log_returns <- function(x, return_format = "tibble", period = 'weekly', ...) {
  # Convert tibble to xts
  x <- as.xts(x[, names(x) != "Date"], order.by = x$Date)
  # Get periodic log returns of the adjusted ETF prices
  log_returns_xts <- periodReturn(x = x$Adjusted, type = 'log', period = period, ...)
  # Rename
  names(log_returns_xts) <- "Log_Returns"
  stock_dates = index(log_returns_xts)
  # Return in xts format if tibble is not specified
  if (return_format != "tibble") return(log_returns_xts)

  log_returns <- log_returns_xts %>%
    tibble::as_tibble() %>%
    mutate(Date = stock_dates) %>%
    mutate(Date = ymd(Date))

  log_returns
}


# Use purrr to map the functions on the scraped ETF prices (scraping done in Python)
stocklist <- stocklist %>% group_by(symbol) %>% nest(.key = 'stock_prices') %>%
      mutate(
        log_returns  = map(stock_prices,
                           function(.x) get_log_returns(.x, return_format = "tibble")),
        mean_log_returns = map_dbl(log_returns, ~ mean(.$Log_Returns)),
        sd_log_returns   = map_dbl(log_returns, ~ sd(.$Log_Returns)),
        n_trade_weeks = map_dbl(stock_prices, nrow)
      )
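
#quick peek (not part of the original pipeline): the list-columns
#stock_prices and log_returns now sit alongside the numeric summary columns
head(stocklist, 3)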
    
# Calculate the average annual dividend payout:
dividend_profiles = stocklist %>% select(symbol, stock_prices) %>% unnest() %>% mutate(year  = year(Date)) %>% 
            group_by(symbol, year) %>% summarise(dividend_sum = sum(Dividend, na.rm = T)) %>%
              group_by(symbol) %>% summarise(dividend_payout_by_year_mean = mean(dividend_sum, na.rm = T)) %>% ungroup()
   

dividend_profiles[is.na(dividend_profiles)] <- 0

# Let's see how the first few rows look:

kable(head(dividend_profiles, n=5), format = 'markdown')
|symbol | dividend_payout_by_year_mean|
|:------|----------------------------:|
|AADR   |                     0.228060|
|AAXJ   |                     1.024218|
|ACIM   |                     1.389725|
|ACWF   |                     0.426460|
|ACWI   |                     1.093864|

Visualize reward-to-risk performance

This is a neat, interactive visual using plotly, originally featured in the Business Science University article referenced earlier. I personally like it because you can dynamically explore and zoom in on ETFs with high average weekly log returns AND a low standard deviation of those returns, and vice versa. Ideally, we are interested in ETFs with the former characteristics (strong, stable price appreciation over a longer time horizon).

The reward metric aptly captures the above phenomena: ETFs with strongly positive reward metrics have strong, stable growth, while those with strongly negative reward metrics have strong, stable losses. In our graph, just like in business, the greener the better.
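
To make the metric concrete, here is a back-of-the-envelope calculation with made-up numbers (not from our data):

#hypothetical ETF: mean weekly log return of 0.003 over 520 trading weeks,
#with a standard deviation of weekly log returns of 0.02
(0.003 * 520) / 0.02  #reward metric = 78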

#An ETF with high average log returns and low standard deviation of log returns will get a very high reward metric score.
#This makes intuitive sense: we are interested in ETFs with high returns that are also stable over time
#conversely, an ETF with a low average log return and high standard deviation will get a very low reward metric score

stocklist <- stocklist %>%
  mutate(reward_metric = (mean_log_returns * n_trade_weeks) / sd_log_returns) %>%
          left_join(dividend_profiles)


#Visualize the results with Plotly
#the darker the green, the better the reward metric is for the ETF

# Inputs
trade_day_thresh <- min(stocklist$n_trade_weeks)
lab <- "Select ETFs"
back_col <- '#2C3E50'
font_col <- '#FFFFFF'
line_col <- "#FFFFFF"
grid_col <- 'rgb(255, 255, 255)'
col_brew_pal <- 'YlGn'
# Plotly
plot_ly(data   = stocklist %>% filter(n_trade_weeks >= trade_day_thresh),
        type   = "scatter",
        mode   = "markers",
        x      = ~ sd_log_returns,
        y      = ~ mean_log_returns,
        color  = ~ reward_metric,
        colors = col_brew_pal,
        size   = ~ reward_metric,
        text   = ~ str_c("Ticker: ", symbol, "<br>",
                         "No. of Trading Weeks: ", n_trade_weeks, "<br>",
                         "Reward to Risk: ", round(reward_metric, 1)),
        marker = list(opacity = 0.8,
                      symbol = 'circle',
                      sizemode = 'diameter',
                      sizeref = 4.0,
                      line = list(width = 2, color = line_col))
) %>%
  layout(title   = str_c(lab, 'Analysis: ETF Risk vs Reward', sep = " "),
         xaxis   = list(title = 'Risk: StDev of Weekly Log Returns (SDWLR)',
                        gridcolor = grid_col,
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwidth = 2),
         yaxis   = list(title = 'Reward: Mean Weekly Log Returns (MWLR)',
                        gridcolor = grid_col,
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwidth = 2),
         margin = list(l = 100,
                       t = 100,
                       b = 100),
         font   = list(color = font_col),
         paper_bgcolor = back_col,
         plot_bgcolor = back_col)

Step 2: Filter ETFs based on performance clusters

Cluster ETFs based on normalized prices

In this section, we create 15 distinct time series clusters of our ETFs, using their normalized weekly prices (scaled and centered). 15 clusters provided nice separation in the weekly price trends of 1,300+ ETFs, but you can certainly try a different number of clusters (see the sketch after the next code chunk). The idea is that we separate the ETFs into groups with similar temporal behavior, and in subsequent sections only focus on the clusters that show strong, positive trends over the clustering window (the past 3 years).

The performance clusters also serve as a proxy for industry concentration as similar industries will exhibit similar time series patterns.

For our time series clustering, we used the excellent dtwclust package. You can find more information on it below:
1. https://github.com/asardaes/dtwclust
2. https://en.wikipedia.org/wiki/Dynamic_time_warping

#limit time series to evaluating performance for the last 156 weeks (you can change this)
all_stock_cordata <- stocklist %>% 
  select(symbol, stock_prices) %>%
  unnest() %>% filter(Date >= today()-(n_trading_weeks*7))

#eliminate dates that are sparse
num_stocklist = length(unique(all_stock_cordata$symbol))

dates_sparsity = all_stock_cordata %>% group_by(Date) %>%
                          summarise(sparsity = 1- n_distinct(symbol)/num_stocklist ) %>% ungroup() %>%
                          filter(sparsity <= 0.01)

stocks_sparsity  = all_stock_cordata %>% filter(Date %in% unique(dates_sparsity$Date)) %>%
                                  group_by(symbol) %>% summarise(num_weeks = n()) %>%
                                    ungroup() %>% filter(num_weeks == dim(dates_sparsity)[1])

all_stock_cordata = all_stock_cordata %>% filter(symbol %in% unique(stocks_sparsity$symbol))

all_stock_cordata$adjusted_normalized = as.numeric(scale(all_stock_cordata$Adjusted, scale = TRUE, center = TRUE))


# Spread format conducive to cor()
all_stock_cordata <- all_stock_cordata %>%
  select(symbol, Date, adjusted_normalized) %>% 
  spread(key = symbol, value = adjusted_normalized) %>%
  na.omit()

all_stock_cordata_matrix = t(as.matrix(all_stock_cordata[,-1]))

#Create 15 time series clusters: 15 seems to provide nice separation (be sure to try a few clusters to visually confirm)
#Clusters group ETFs together that have similar price trends over the last 3 years (they can also be a proxy for industry concentration within the ETFs)
num_ts_clust = 15

dtw_cluster = tsclust(all_stock_cordata_matrix, type="partitional",k=num_ts_clust,
                      distance="dtw_basic",centroid = "pam",seed=1234,trace=T,
                      args = tsclust_args(dist = list(window.size = 20)))
plot(dtw_cluster)
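
If you want a more principled check than eyeballing the separation, the sketch below (my own addition, not part of the original analysis) compares an internal cluster validity index across a few candidate values of k via dtwclust’s cvi function. It re-runs the clustering for each candidate, so it can take a while.

#compare silhouette scores for a few candidate cluster counts
candidate_k = c(10, 15, 20)
fits = lapply(candidate_k, function(k)
  tsclust(all_stock_cordata_matrix, type = "partitional", k = k,
          distance = "dtw_basic", centroid = "pam", seed = 1234,
          args = tsclust_args(dist = list(window.size = 20))))
sapply(fits, cvi, type = "Sil")  #higher silhouette suggests better separation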

Visualize cluster centroids

Next, we visualize each cluster centroid and select the clusters that have a positive slope AND are in the top 50th percentile among all ETF clusters. We can immediately tell that clusters 1 and 2 exhibit strong growth with acceptable volatility levels, while the weekly prices in clusters 10 and 15 have trended down over time.

#visualize the slope of centroids with ggplot
clust_centroids = tibble(centroids = dtw_cluster@centroids) %>% mutate(cluster = 1:num_ts_clust) %>%
            unnest() %>% mutate(week_index = rep(1:(n()/num_ts_clust),num_ts_clust))


clust_centroids %>%
  ggplot(aes(x = week_index, y = centroids)) +
  geom_ref_line(h = 0) +
  geom_line(aes(col = cluster)) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~cluster, scales = "free_y") +
  theme(legend.position = "None", axis.text.x = element_text(angle=90)) +
  labs(title = "Cluster centroid trends over the last 3 years",
       x = "Week Index",
       y = "Normalized ETF Price")

We will limit subsequent analysis to the top-performing ETF clusters.

# Function to return linear model of cluster centroids
centroid_model <- function(data) {
  lm(centroids ~ week_index, data = data)
}


# Function to return slope of the linear model
slope_centroid <- function(centroid_model) {
  centroid_model$coefficients[[2]]
}

cluster_slopes = clust_centroids %>% 
                group_by(cluster) %>% nest() %>%
                    mutate(
                      centroid_model = map(data, centroid_model),
                      slope = map_dbl(centroid_model, slope_centroid)) 

median_cluster_slope = median(cluster_slopes$slope, na.rm = T)

clusters_keep =   cluster_slopes %>% filter(slope > median_cluster_slope) %>%
                                                filter(slope > 0)

clusters_keep = unique(clusters_keep$cluster)

clusters_exclude = unique(clust_centroids$cluster)[unique(clust_centroids$cluster) %ni% clusters_keep]


#filter stocklist to only include the ETFs that belong to the clusters with positive trends
cluster_info = data.frame(symbol = names(dtw_cluster@datalist), ts_cluster = dtw_cluster@cluster) %>% 
                              filter(ts_cluster %in% clusters_keep)
  
stocklist = stocklist %>% inner_join(cluster_info)

We will focus on the following ETF clusters in Steps 3 and 4 as they are in the top 50th percentile among all clusters AND have a positive slope: 1, 2, 6, 8, 11, 12, 13.

On the other hand, the following ETF clusters will be excluded from our analysis going forward (these are the bottom performers): 3, 4, 5, 7, 9, 10, 14, 15.

Step 3: Filter ETFs based on Growth-To-Consistency metric

This section contains another amazingly useful function borrowed from Business Science University. It takes the previously calculated reward metric and augments it a bit: the growth metric rewards strong ETF price growth (a positive time series slope), penalizes high volatility (the standard deviation of log returns) and rewards consistency (penalizing each year in which the average annual log return dropped below zero). The result is a score that lets you focus on ETFs that provide strong and steady growth with low volatility. Exactly what we need!
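
To make the formula concrete before we apply it, here is a toy calculation with made-up numbers (not from our data):

#slope of the annual MWLR trend = 0.001; one year dipped below zero;
#standard deviation of the annual means = 0.005
0.001 / ((1 + 1) * 0.005)  #growth metric = 0.1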

Specify top performers within each cluster based on reward metric

Recall that we were left with 7 out of 15 ETF performance clusters (we eliminated 8 groups of ETFs due to undesirable historical performance). We first keep the top 10 ETFs per cluster based on the reward metric, then use the growth metric to select the top 5 (you can easily change that) within each of the 7 clusters. In essence, we are focusing our subsequent analysis on the top 35 ETFs, stratified by their temporal behavior (the clusters).

top_n_limit <- 10  #select this many top performing ETFs within each time series cluster based on their Reward Metric ranking

hp <- stocklist %>% group_by(ts_cluster) %>% 
  mutate(rank = reward_metric %>% desc() %>% min_rank()) %>% ungroup() %>% 
  filter(rank <= top_n_limit) %>%
  arrange(rank)


# Function to return MWLR by year
means_by_year <- function(log_returns) {
  log_returns %>%
    mutate(year = year(Date)) %>%
    group_by(year) %>%
    summarize(mean_log_returns = mean(Log_Returns))
}


# Map function to data frame
hp <- hp %>%
  mutate(means_by_year = map(log_returns, means_by_year))


# Compute three attributes of high performing stocks and further filter---------------------------

# Attribute 1: Number of Times MWLR by Year Drops Below Zero
# Function to return number of times a stock's MWLR by year drops below zero
means_below_zero <- function(means_by_year) {
  means_by_year %>%
    filter(mean_log_returns < 0) %>%
    nrow()
}

# Map function to data frame for all stocks
hp <- hp %>%
  mutate(means_below_zero = map_dbl(means_by_year, means_below_zero))


# Attribute 2: Slope of MWLR by Year
# Function to return linear model
means_by_year_model <- function(means_by_year) {
  lm(mean_log_returns ~ year, data = means_by_year)
}


# Function to return slope of linear model
slope <- function(means_by_year_model) {
  means_by_year_model$coefficients[[2]]
}  


# Map modeling and slope functions
hp <- hp %>%
  mutate(
    means_by_year_model = map(means_by_year, means_by_year_model),
    slope = map_dbl(means_by_year_model, slope)
  )

# Attribute 3: Standard deviation of MWLR by year
sd_of_means_by_year <- function(means_by_year) {
  sd(means_by_year$mean_log_returns)
}

# Map to data frame
hp <- hp %>%
  mutate(sd_of_means_by_year = map_dbl(means_by_year, sd_of_means_by_year))


# Develop Growth-to-Consistency Metric -----------------------------------------

hp <- hp %>%
  mutate(growth_metric = slope /((means_below_zero + 1) * sd_of_means_by_year)) 


# Visualize Performance of Top N number of ETFs by Cluster --------------------------------------

top_n_limit <- 5

#Further trim list based on "Growth to Consistency" metric: select top N number of ETFs within each cluster

top_n_stock_profiles = hp %>% group_by(ts_cluster) %>% 
  mutate(rank = growth_metric %>% desc() %>% min_rank()) %>% ungroup() %>% 
  filter(rank <= top_n_limit) %>% 
  arrange(rank)

hp_stock_list = unique(top_n_stock_profiles$symbol)


ncol_hp_gg = 4
nrow_hp_gg = 5
num_pages_ggplot = ceiling(dim(top_n_stock_profiles)[1]/(ncol_hp_gg*nrow_hp_gg))

Visualize the average annual MWLR for the selected ETFs

Below is a visual of our reduced list of the top 5 ETFs per cluster. The blue line indicates the slope of each ETF’s average annual weekly log returns. Each point on the graph represents the weekly log returns, averaged by year, for the past 10 years. Again, you can change the time horizon if you wish. Notice that ETFs like IGV have a strong, positive slope over the last 10 years, and their average weekly log returns are above zero in almost every year. Again, exactly what we are looking for!

for(i in 1:num_pages_ggplot){
  x= top_n_stock_profiles %>%
    mutate(symbol_cluster = paste("(",ts_cluster,") ",symbol, sep = "")) %>% 
    select(ts_cluster, symbol_cluster, means_by_year)  %>% 
    unnest() %>%
    mutate(ts_cluster = as.factor(ts_cluster)) %>% rename(ETF_cluster = ts_cluster) %>% 
    ggplot(aes(x = year, y = mean_log_returns)) +
    geom_ref_line(h = 0) +
    geom_line(aes(col = ETF_cluster)) +
    geom_smooth(method = "lm", se = FALSE) +
    facet_wrap_paginate(facets = 'symbol_cluster', ncol = ncol_hp_gg, nrow = nrow_hp_gg, page = i, scales = "free_y") +
    theme(legend.position = "None", axis.text.x = element_text(angle=90)) +
    labs(title = paste0("Average MWLR per annum for select list of ",top_n_limit*length(clusters_keep), " (top ",top_n_limit," ETFs per cluster)"),
         subtitle = paste0("page ",i," of ",num_pages_ggplot),
         x = "Year",
         y = "Avg. Annual Mean Weekly Log Returns (MWLR)")
  print(x) }

Step 4: Creating our final list of ETFs to invest in

Final list of ETFs to consider

From the list of 35, we will narrow down to the top 10 ETFs based on the final rank metric, which is a weighted measure of our growth metric (95% weight) and dividend payout performance (5% weight). Note that we are no longer stratifying by cluster, since the previous step has taken care of that. Below is the list of 10 ETFs for investment consideration. We called out the performance trends of IGV in Step 3, and this ETF appears as the 3rd most desirable on our list.
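
As a quick arithmetic check of the weighting, take IGV from the table below (stock_growth_rank = 3, dividend_payout_rank = 35):

0.95 * 3 + 0.05 * 35  #weighted score = 4.6; row_number() then converts these
                      #scores into the ordinal final_rank shown in the table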

#apply weights (we will weight dividend performance at 5%)
dividend_wt = 0.05
stock_growth_wt = 1-dividend_wt

#re-rank it, but without the clusters this time
top_n_stock_profiles2 = hp %>% filter(symbol %in% hp_stock_list) %>%
  mutate(rank = growth_metric %>% desc() %>% min_rank()) %>% 
  select(ts_cluster, symbol, dividend_payout_by_year_mean, rank) %>%
  arrange(rank) %>% rename(stock_growth_rank = rank) %>%
                        mutate(dividend_payout_rank = row_number(desc(dividend_payout_by_year_mean))) %>%
                        mutate(final_rank = stock_growth_wt*stock_growth_rank+
                                 dividend_payout_rank * dividend_wt) %>%
                          mutate(final_rank = row_number(final_rank)) %>%
                          arrange(final_rank) %>% select(ts_cluster, symbol, final_rank, stock_growth_rank, dividend_payout_rank)



# Purchase Qty Recommendation --------------------------------------
top_n_recent_prices = hp %>% filter(symbol %in% hp_stock_list) %>%
                            select(symbol, stock_prices) %>% unnest() 

max_date= max(top_n_recent_prices$Date)

top_n_recent_prices = top_n_recent_prices %>% filter(Date == max_date) %>%
                      select(symbol, Adjusted) %>% rename(recent_price = Adjusted)


#specify how many ETFs to consider 
n_stocks_consider = 10

#Specify # of ETFs we want to invest in:
final_invest_ETF_count = 5

#recommendation on what ETFs to consider: 
top_n_stock_profiles_consider = top_n_stock_profiles2 %>% left_join(top_n_recent_prices) %>% filter(final_rank <= n_stocks_consider)

consideration_stock_list = unique(top_n_stock_profiles_consider$symbol)

kable(top_n_stock_profiles_consider, format = 'markdown')
| ts_cluster|symbol | final_rank| stock_growth_rank| dividend_payout_rank| recent_price|
|----------:|:------|----------:|-----------------:|--------------------:|------------:|
|         13|IHI    |          1|                 1|                   33|       244.02|
|          1|VGT    |          2|                 2|                   23|       217.64|
|          1|IGV    |          3|                 3|                   35|       224.55|
|          8|VPU    |          4|                 5|                    4|       134.20|
|          1|IGM    |          5|                 4|                   31|       222.00|
|         13|XLG    |          6|                 6|                    5|       213.55|
|         12|IDU    |          7|                 7|                    3|       152.26|
|         12|IWF    |          8|                 8|                   22|       161.00|
|          6|VIG    |          9|                 9|                   16|       117.34|
|         11|MGK    |         10|                10|                   28|       132.31|

Visualize Correlations for ETFs in consideration

In order to narrow our top 10 down to the 5 ETFs we actually want to invest in (again, arbitrary parameters; change these if you wish), we need to look at how correlated they are with each other. We visualize the correlation coefficients and apply 5 hierarchical clusters, corresponding to the number of ETFs we want to invest in. Notice the middle portion of the correlation matrix: these 4 ETFs are almost perfectly correlated with each other, so we will select only the top-performing one from that group.

top_n_stock_profiles %>% filter(symbol %in% consideration_stock_list) %>%
  select(symbol, log_returns) %>%
  unnest() %>% spread(key = symbol, value = Log_Returns) %>%
  na.omit()  %>% select(-Date) %>%
  cor() %>%
  corrplot(order   = "hclust", method = 'number',
           addrect = final_invest_ETF_count, rect.col = 'yellow', rect.lwd = 5, number.digits = 2)

Visualize Correlations for ETFs we will invest in

We will create the final list of 5 ETFs to invest in based on 2 factors:
1. The final_rank of the ETF from the top_n_stock_profiles_consider table AND
2. The correlation of log returns with other ETFs. Minimizing the correlations is a good risk mitigation measure.

We achieve the above by selecting the highest-ranking ETF within each hierarchical cluster (a sketch automating this follows the list below). Our final list of top 5:
1. IHI: it is #1 in terms of final_rank from the top_n_stock_profiles_consider table, and not super highly correlated with other ETFs (mostly below 0.8).
2. VGT: #2 rank. The returns of this ETF are almost perfectly correlated with IGM, therefore we leave the latter out of consideration (despite IGM being ranked #5).
3. IGV: #3 based on final rank.
4. VPU: #4 final rank and perfectly correlated with IDU. So we throw IDU out of consideration.
5. XLG: #6 based on final rank, but moved forward to #5 in our investment list. Why? We excluded IGM previously from consideration.
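
For those who would rather automate the per-cluster pick, below is a hedged sketch. It assumes cor_mat holds the correlation matrix from the chunk above (the same pipeline, but ending in cor() instead of corrplot()); note that corrplot’s own hclust rectangles may group slightly differently.

hc = hclust(as.dist(1 - cor_mat))                  #distance = 1 - correlation
corr_group = cutree(hc, k = final_invest_ETF_count)

data.frame(symbol = names(corr_group), corr_group = corr_group,
           stringsAsFactors = FALSE) %>%
  left_join(top_n_stock_profiles_consider, by = "symbol") %>%
  group_by(corr_group) %>%
  filter(final_rank == min(final_rank)) %>%        #keep the best-ranked ETF per group
  ungroup() %>% arrange(final_rank)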

After we have determined, in a systematic and fairly automated fashion, the 5 ETFs we want in our investment portfolio, we may want to add a few more ETFs manually. Perhaps these are from industries we are intimately familiar with, or industries we think will benefit in the long run from global economic or infrastructure trends.

I decided to mix in a couple of water industry ETFs (CGW and TBLU). It was somewhat of an ethical dilemma for me, as water scarcity has been, and will increasingly be a major global issue. With these personal additions, our final list of ETFs will grow to 7.

Let’s see how correlated their weekly log returns are with each other. As you can see, there is fairly decent diversification in our final investment portfolio. While ~50% of the coefficients are in the 0.8-0.9 range, the other half hover in the neighborhood of 0.3-0.7.

#final list based on our analysis
final_invest_stock_list = c('IHI','VGT','IGV','XLG','VPU')


#make any manual additions for specific ETFs you are interested in (in this case, I'm interested in water ETFs) 
ETF_additions = data.frame(symbol = c('CGW', 'TBLU'),
                           recent_price = c( 37.46, 31), stringsAsFactors = F)


final_invest_stock_list = c(final_invest_stock_list, ETF_additions$symbol)



#Correlation plot of final ETF selections:
final_cor_data = fread('/home/kakasa09/stock_investing/ETF_weekly_prices.csv')

final_cor_data = final_cor_data %>% as_tibble() %>% select(date, `5. adjusted close`, `7. dividend amount`, symbol) %>%
  rename(Date = date,
         Adjusted = `5. adjusted close`,
         Dividend = `7. dividend amount`)

final_cor_data$Date <- as.Date(final_cor_data$Date )    

final_cor_data = final_cor_data %>%  filter(Date >= today()-(365*n_year_length)) %>%
                  filter(symbol %in% c(final_invest_stock_list)) %>% group_by(symbol) %>% nest(.key = 'stock_prices') %>%
                    mutate(log_returns  = map(stock_prices,
                                              function(.x)
                                                get_log_returns(.x, return_format = "tibble"))) 

final_cor_data %>% select(symbol, log_returns) %>%
                        unnest() %>% spread(key = symbol, value = Log_Returns) %>%
                          na.omit()  %>% select(-Date) %>%
                              cor() %>%
                                  corrplot(order = "hclust", method = 'number', number.digits = 2)