Volatility estimation seems straightforward on the surface, but in reality it can be challenging. For instance, take a look at the chart generated below.

# illustrative hand-picked prices (not randomly generated)
sa <- c(1, 1.1, 0.9, 0.8, 0.92, 1.05, 1)
sb <- c(1, 1, 1, 1, 1, 1, 1)
hr <- c(1, 2, 3, 4, 5, 6, 7)

# plot data
plot(hr, sa, type = "l", col = "blue", main = "Stock Price Over Day", xlab = "Time", ylab = "Price")
lines(hr, sb, type = "l", col = "red")
legend("bottomright", legend = c("Stock A", "Stock B"), col = c("blue", "red"), lty = 1, cex = 0.8)

The two lines represent the prices of two separate stocks (A and B) over the course of a day. The stocks start and end the day at the same price, yet take very different paths to get there: Stock A swings up and down dramatically, whereas Stock B is constant. However, if we looked only at the daily close-close volatility of the two stocks, they would appear identical.
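To put numbers on the chart, here is a minimal sketch using the same hand-picked prices. It treats the first and last price of the day as stand-ins for the previous and current close, which is my simplifying assumption; the helper names `day_ret` and `hl_range` are made up for illustration.

```r
# same illustrative prices as in the chart
sa <- c(1, 1.1, 0.9, 0.8, 0.92, 1.05, 1)
sb <- rep(1, 7)

# log return from the first to the last price of the day
day_ret <- function(p) log(p[length(p)] / p[1])

# log of the high-low range, a crude measure of intraday movement
hl_range <- function(p) log(max(p) / min(p))

ret_a <- day_ret(sa)
ret_b <- day_ret(sb)
range_a <- hl_range(sa)
range_b <- hl_range(sb)

# both returns are zero, but only Stock A has a non-zero range
print(c(ret_a = ret_a, ret_b = ret_b, range_a = range_a, range_b = range_b))
```

An estimator that only sees the two returns cannot tell the stocks apart, while one that also sees the high and low can.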

A different estimator that attempts to correct for this is the Garman-Klass Yang-Zhang (GKYZ) volatility estimator, which incorporates the open, high, low, and closing prices of the stock, not just the closing price. Because it accounts for intraday price movement and opening jumps, it can capture the kind of movement that the close-close measure misses in the example above.
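As a rough illustration, the formula that the TTR documentation gives for its `gk.yz` option can be sketched in base R for a single window of bars. The helper name `gkyz_vol` and the toy OHLC values are hypothetical, `N = 260` mirrors TTR's default annualisation factor, and the first bar is simply dropped because it has no previous close, so treat this as a sketch rather than a reimplementation of the package.

```r
# Hypothetical helper: annualised GKYZ volatility over one window of OHLC bars
gkyz_vol <- function(open, high, low, close, N = 260) {
  n <- length(close)
  prev_close <- c(NA, close[-n])                    # previous day's close
  s2 <- log(open / prev_close)^2 +                  # overnight jump term
    0.5 * log(high / low)^2 -                       # high-low range term
    (2 * log(2) - 1) * log(close / open)^2          # open-to-close term
  sqrt(N * mean(s2, na.rm = TRUE))                  # first bar has no previous close
}

# toy data: both stocks open and close at 1 every day
open_px  <- rep(1, 5)
close_px <- rep(1, 5)

gkyz_flat   <- gkyz_vol(open_px, high = rep(1, 5),   low = rep(1, 5),   close_px)
gkyz_choppy <- gkyz_vol(open_px, high = rep(1.1, 5), low = rep(0.8, 5), close_px)

# close-close volatility: annualised standard deviation of daily log returns
cc_choppy <- sd(diff(log(close_px))) * sqrt(260)

print(c(cc_choppy = cc_choppy, gkyz_flat = gkyz_flat, gkyz_choppy = gkyz_choppy))
```

With identical closes the close-close measure is zero for both toy stocks, while the range term makes the GKYZ estimate positive for the one that moves intraday.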

Given the two estimators, it is natural to ask how much they differ and what characterizes stocks where they diverge widely. To find out, I took the stocks in the S&P 500 and their pricing data going back to 2000 (or less if the stock was not trading in 2000), computed the rolling close-close and GKYZ volatility of each stock, took the difference between the two, and then regressed average returns on that difference to see whether it predicts them.

To get all of the S&P 500 tickers, I followed a tutorial by Bryant Crocker, linked below, that scrapes them from Wikipedia. The composition of the S&P 500 has changed over the last twenty years, but my goal was to get a large number of tickers to test rather than to replicate the index exactly, so I was fine with using the current constituents.

# scraping came from here to get symbols: https://towardsdatascience.com/exploring-the-sp500-with-r-part-1-scraping-data-acquisition-and-functional-programming-56c9498f38e8

library(rvest)
library(tidyverse)
url <- "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
# use that URL to scrape the SP500 table using rvest
tickers <- url %>%
  # read the HTML from the webpage
  read_html() %>%
  # one way to get table
  #html_nodes(xpath='//*[@id="mw-content-text"]/div/table[1]') %>%
  # easier way to get table
  html_nodes(xpath = '//*[@id="constituents"]') %>% 
  html_table()
#create a vector of tickers
sp500tickers <- tickers[[1]]
sp500tickers = sp500tickers %>% 
  mutate(Symbol = case_when(Symbol == "BRK.B" ~ "BRK-B", 
                            Symbol == "BF.B" ~ "BF-B",TRUE~as.character(Symbol))) 

To get the pricing data I used the riingo package in R. I then summarized the data by ticker to get the average daily log return, the average rolling 20-day close-close volatility, the average rolling 20-day GKYZ volatility, and the average spread, defined as the average difference between the rolling 20-day GKYZ volatility and the rolling 20-day close-close volatility. A positive spread implies that the stock was more volatile intraday than its close-close volatility would suggest.

To make this document render quickly, I load data that I saved earlier, but the code used to download it is shown below.

library(purrr)
library(riingo) # data
library(TTR) # for volatility

# RIINGO_TOKEN = INSERT HERE
riingo_set_token(RIINGO_TOKEN)

get_dat = function(ticker){
  # download daily prices, then add log returns and both rolling 20-day vol estimates
  riingo_prices(ticker, start_date = "2000-01-01", end_date = "2023-01-01") %>% 
    mutate(lag1 = lag(adjClose, 1), 
           dr_log = log(adjClose / lag1), 
           cc_20vol = volatility(adjClose, n = 20, calc = "close"), 
           gkyz20 = volatility(as.matrix(cbind(adjOpen, adjHigh, adjLow, adjClose)), n = 20, calc = "gk.yz"), 
           spread_vol = gkyz20 - cc_20vol) %>% 
    drop_na()
}

safe_fn = possibly(get_dat, otherwise = NULL) # some tickers error; return NULL for those

final_df2 = map(sp500tickers$Symbol, safe_fn, .progress = TRUE) %>% bind_rows() # only got 487 tickers, assuming something to do w tiingo api
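One plausible reason for ending up with fewer tickers than were requested is that `possibly()` swallows the error for any ticker that fails, and `bind_rows()` then drops the resulting `NULL`. The sketch below shows that behaviour with a made-up `fake_fetch()` function and made-up tickers standing in for `get_dat()` and the real symbols; it does not call the Tiingo API.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for get_dat(): fails for one made-up ticker
fake_fetch <- function(ticker) {
  if (ticker == "BAD") stop("no data for ticker")
  tibble(ticker = ticker, adjClose = c(10, 11))
}

# return NULL instead of raising an error
safe_fetch <- possibly(fake_fetch, otherwise = NULL)

# the failing ticker yields NULL, which bind_rows() silently drops
res <- map(c("AAA", "BAD", "CCC"), safe_fetch) %>% bind_rows()
fetched <- unique(res$ticker)
print(fetched)
```

Comparing `fetched` against the requested symbols (for example with `setdiff()`) is one way to see which tickers were skipped.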

I first wanted to see if there was any correlation between daily log returns and the spread between vol estimators. Turns out there was a small one.

library(readr)
final_df2 = read_csv("sp500_vol_dat2.csv") # only got 487 tickers, assuming something to do w tiingo api

smry = final_df2 %>%
  group_by(ticker) %>%
  summarise(avg_log_ret = mean(dr_log),
            avg_cc = mean(cc_20vol),
            avg_gkyz = mean(gkyz20),
            avg_spread = mean(spread_vol)) 

head(smry, 5)
## # A tibble: 5 × 5
##   ticker avg_log_ret avg_cc avg_gkyz avg_spread
##   <chr>        <dbl>  <dbl>    <dbl>      <dbl>
## 1 A         0.000215  0.352    0.352   0.000659
## 2 AAL      -0.000116  0.569    0.586   0.0172  
## 3 AAP       0.000444  0.299    0.318   0.0187  
## 4 AAPL      0.000885  0.354    0.368   0.0141  
## 5 ABBV      0.000750  0.252    0.274   0.0226
library(corrplot)
corrplot(cor(smry[,-1]), method = "number")

Finally, I plotted the average spread against average returns by stock and ran a regression of average returns on the spread.

plot(smry$avg_spread, smry$avg_log_ret)
m = lm(avg_log_ret ~ avg_spread, data = smry)
abline(m, col = "blue")
legend("bottomright", bty = "n", legend = paste0("Adj. R2 = ", format(summary(m)$adj.r.squared, digits = 3), "; p-value of avg_spread = ", round(summary(m)$coefficients["avg_spread", "Pr(>|t|)"], 5)))

summary(m)
## 
## Call:
## lm(formula = avg_log_ret ~ avg_spread, data = smry)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.075e-03 -1.633e-04 -7.920e-06  1.562e-04  2.218e-03 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.757e-04  2.228e-05  16.862  < 2e-16 ***
## avg_spread  4.514e-03  1.197e-03   3.771 0.000183 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0003054 on 484 degrees of freedom
## Multiple R-squared:  0.02854,    Adjusted R-squared:  0.02653 
## F-statistic: 14.22 on 1 and 484 DF,  p-value: 0.0001828

As you can see, there is a slightly positive relationship between the volatility spread and average daily returns. The positive slope says that the higher a stock's intraday volatility relative to its close-close volatility, the higher its average return. This is consistent with the usual intuition that higher volatility should be compensated with a higher return, although with an adjusted R-squared of about 0.03 the spread explains only a small part of the variation in returns across stocks.

Additional Sources:
https://medium.com/swlh/the-realized-volatility-puzzle-588a74ab3896
https://towardsdatascience.com/exploring-the-sp500-with-r-part-1-scraping-data-acquisition-and-functional-programming-56c9498f38e8