Background and Motivation

The primary motivation of the project is to highlight the “risk” associated with Stocks by applying various statistical methods to measure risk factors such as “Market Sentiment” (“Beta”, “Standard Deviation”) and “Social Sentiment” (“Twitter feeds”). The fundamental reason to choose a topic in finance is partly current experience working in the industry and partly to understand the significance of “risk” being an end investor. Many of us manage different types of brokerage accounts (stock trading, 401k and IRA accounts) without clearly understanding the differing risk associated with the performance of these accounts. The risk appetite itself differs widely with different classes of investors depending on their personal situation and other factors. Hence it is prudent for such investors to understand the risk factors associated with their individual stocks or funds. Further, the recent advances in the social media has provided a wealth of information in the form of Twitter feeds or Google trends that can further provide great insight on the sentiment of investors influenced by market and other events. This research is an attempt to highlight concepts around “risk” factors by performing a thorough statistical analysis backed by appropriate models.

Project Objectives

Volatility is a statistical measure of the dispersion of returns for a given security or market-index. Commonly, the higher the volatility, the riskier the security. The volatility itself is affected by multiple factors such as “Beta”, “Standard Deviation”, “Earnings”, “Balance Sheet”, “Analyst Recommendations” and “Social Sentiment”. We will start with the concept of Beta as a measure of volatility of a stock relative to the benchmark (like S&P 500) and analyze how it affects the returns of a stock relative to the benchmark. In particular, we will run a simple case study by choosing three stocks with widely differing returns relative to the benchmark (S&P 500) to emphasize the effect of beta on such classes of stocks. We will then generalize this a bit more, by comparing Beta (risk) for a universe of stocks against their respective returns for a chosen period and more important, we want to understand if a portfolio of low-volatility stocks earn returns different from a portfolio of high-volatility stocks. This is one of the fundamental questions that this research ponders upon by providing a statistical insight into the results. We will then repeat the analysis using Standard Deviation as the volatility measurement. The way we present this part of the material is by running a multiple regression model on stock’s return using beta and standard deviation as predictors. While the main focus of this research paper is on stock’s beta and standard deviation, we will also introduce other significant measures of volatility such as Twitter/Stocktwits feeds that is gaining momentum since the advent of Social Media Revolution. This is primarily done to make the investigation complete. .

## put your code here
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)
opts_chunk$set(cache = TRUE, message = FALSE)
library(readr)
library(ggplot2)
theme_set(theme_bw(base_size = 16))
library(gganimate)
library(broom)

Load beta of a universe of stocks

url <- "https://raw.githubusercontent.com/goodwillyoga/E107project/master/Mohan/data/2500%20Stocks%20Financial%20Data.csv"
financial_data <- read_csv(url)

# data wrangling - rename columns etc
names(financial_data)[names(financial_data) == 'Beta (5 Yr Annualized)'] <- 'Beta_5Year'
names(financial_data)[names(financial_data) == 'Beta (1 Year Annualized)'] <- 'Beta_1Year'
names(financial_data)[names(financial_data) == 'Standard Deviation (5 Yr Annualized)'] <- 'SD_5Year'
names(financial_data)[names(financial_data) == 'P/E (Next Year\'s Estimate)'] <- 'PE_NextYear'
names(financial_data)[names(financial_data) == 'Total Return (5 Yr Annualized)'] <- 'Annualized_Return_5Year'
names(financial_data)[names(financial_data) == 'Sector/Industry'] <- 'Sector'
names(financial_data)[names(financial_data) == 'Security Type'] <- 'SecurityType'

select(financial_data, Symbol, SecurityType, Sector, Beta_5Year, Beta_1Year, SD_5Year, PE_NextYear, Annualized_Return_5Year) %>% head() %>% kable
Symbol SecurityType Sector Beta_5Year Beta_1Year SD_5Year PE_NextYear Annualized_Return_5Year
TALN CS Consumer Discretionary 0.0066691 0.7587322 2.21670 15.00000 10.75663
ED CS Utilities 0.0173725 0.2749312 0.15303 15.28586 9.82558
IIJI DR Information Technology 0.0184750 1.1182022 0.40247 22.30952 10.56512
NEM CS Materials 0.0252339 0.6348882 0.36968 21.13001 -19.88953
APPS CS Information Technology 0.0275968 -0.2365876 0.86178 19.10256 -2.60258
ITC CS Utilities 0.0302839 0.3943955 0.16944 16.26724 12.80138
ggplot(financial_data, aes(x=Beta_5Year)) + 
  geom_histogram(aes(y=..density..), bins=20, colour="black", fill="white") + 
  geom_density(alpha=0.2, fill="#FF6666") +
  geom_vline(aes(xintercept=median(Beta_5Year)), color="red", linetype="dashed", size=1) +
  ggtitle("Beta distribution")

# Plot beta group_by sector to understand patterns
sectors<-c('Utilities', 'Consumer Staples', 'Health Care', 'Information Technology', 'Energy', 'Industrials', 'Consumer Discretionary' , 'Materials', 'Financials')
filter(financial_data, Sector %in% sectors) %>%
  ggplot(aes(x=Beta_5Year)) + 
  geom_histogram(aes(y=..density..), bins=20, colour="black", fill="white") + 
  geom_density(alpha=0.2, fill="#FF6666") +
  facet_wrap(~Sector)

# Summarize beta per sector
sectors_beta<-financial_data %>% filter(Sector %in% sectors) %>%
  group_by(Sector) %>%
  summarize(Beta_Median=round(median(Beta_5Year), 3), 
            Beta_Mean=round(mean(Beta_5Year), 3), 
            Beta_SD=round(sd(Beta_5Year), 3),
            Returns_Mean=round(mean(Annualized_Return_5Year), 3),
            Returns_SD=round(sd(Annualized_Return_5Year), 3)) %>% 
  arrange(Beta_Median)

all_beta<-financial_data %>% 
  summarize(Beta_Median=round(median(Beta_5Year), 3), 
            Beta_Mean=round(mean(Beta_5Year), 3), 
            Beta_SD=round(sd(Beta_5Year), 3),
            Returns_Mean=round(mean(Annualized_Return_5Year), 3),
            Returns_SD=round(sd(Annualized_Return_5Year), 3))
all_beta$Sector<-"ALL"
all_beta<-all_beta[,c("Sector", "Beta_Median", "Beta_Mean", "Beta_SD", "Returns_Mean", "Returns_SD")]

sectors_beta<-rbind(sectors_beta, all_beta)
sectors_beta %>% kable
Sector Beta_Median Beta_Mean Beta_SD Returns_Mean Returns_SD
Utilities 0.439 0.568 0.438 8.121 11.119
Consumer Staples 0.770 0.805 0.412 15.474 12.483
Financials 0.990 1.078 0.510 13.707 11.086
Health Care 1.011 1.084 0.480 18.709 16.902
Consumer Discretionary 1.194 1.278 0.583 12.253 17.247
Information Technology 1.284 1.361 0.569 8.020 16.070
Energy 1.300 1.365 0.577 -0.969 17.245
Industrials 1.313 1.373 0.565 10.186 15.397
Materials 1.420 1.386 0.556 3.334 20.459
ALL 1.145 1.201 0.575 10.578 16.064
# bar plot showing mean beta and stock-returns across sectors
ggplot(sectors_beta, aes(x=factor(Sector), y=Beta_Mean)) + 
  geom_bar(stat="identity", width=0.5, colour="black", fill="#FF6666", alpha=0.7) + 
  geom_bar(data=sectors_beta, aes(x=factor(Sector), y=Returns_Mean), stat="identity", width=0.5, colour="black", fill="red", alpha=0.2) + 
  xlab("Sector") +
  ylab("%") +
  scale_y_log10() +
  scale_x_discrete(labels = abbreviate) +
  ggtitle("Mean beta and returns across sectors")

# It is clear that beta for different industry sectors are different with utilities and consumer staples having very low beta values as expected. Utilities and consumer staples tend to fluctuate less with the overall market sentiment as they are basic necessities. On the other hand, industrials and materials are cyclical in nature and their beta tends to be very high. Financials have almost perfect correlation with S&P500 with beta almost 1.

# Also it is clear that returns are different for different beta. We will later see if there is a clear relation between returns and beta for a universe of 2500 stocks.

Data wranging with historical prices - SPY and 3 chosen stocks

SPY_url<-"https://raw.githubusercontent.com/goodwillyoga/E107project/master/Mohan/data/SPY%20weekly%2010%20years.csv"
SPY_history<-read_csv(SPY_url)

ED_url<-"https://raw.githubusercontent.com/goodwillyoga/E107project/master/Mohan/data/ED%20weekly%2010%20years.csv"
ED_history<-read_csv(ED_url)

ADSK_url<-"https://raw.githubusercontent.com/goodwillyoga/E107project/master/Mohan/data/ADSK%20weekly%2010%20years.csv"
ADSK_history<-read_csv(ADSK_url)

IBM_url<-"https://raw.githubusercontent.com/goodwillyoga/E107project/master/Mohan/data/IBM%20weekly%2010%20years.csv"
IBM_history<-read_csv(IBM_url)

# build a consolidated frame with weekly returns (%) for SPY/ED/ADSK/IBM
stocks<-data.frame(Date=SPY_history$Date, 
                   SPY.Close=SPY_history$Close, SPY.Volume=SPY_history$Volume, SPY.LastClose=SPY_history$Close, 
                   ED.Close=ED_history$Close, ED.Volume=ED_history$Volume, ED.LastClose=ED_history$Close, 
                   ADSK.Close=ADSK_history$Close, ADSK.Volume=ADSK_history$Volume, ADSK.LastClose=ADSK_history$Close,
                   IBM.Close=IBM_history$Close, IBM.Volume=IBM_history$Volume, IBM.LastClose=IBM_history$Close)

numRows<-nrow(SPY_history)

# populate last week return to facilitate calculating weekly % returns
stocks$SPY.LastClose[1:numRows-1]<-SPY_history$Close[2:numRows]
stocks$ED.LastClose[1:numRows-1]<-ED_history$Close[2:numRows]
stocks$ADSK.LastClose[1:numRows-1]<-ADSK_history$Close[2:numRows]
stocks$IBM.LastClose[1:numRows-1]<-IBM_history$Close[2:numRows]


# compuate weekly % returns
stocks<-mutate(stocks, SPY.Returns=round((SPY.Close-SPY.LastClose)*100/SPY.LastClose, 3),
               ED.Returns=round((ED.Close-ED.LastClose)*100/ED.LastClose, 3),
               ADSK.Returns=round((ADSK.Close-ADSK.LastClose)*100/ADSK.LastClose, 3),
               IBM.Returns=round((IBM.Close-IBM.LastClose)*100/IBM.LastClose, 3))


head(stocks, 2)
##        Date SPY.Close SPY.Volume SPY.LastClose ED.Close ED.Volume
## 1 4/18/2016    209.24   75277700        207.78    71.10   2222300
## 2 4/11/2016    207.78   83455600        204.50    75.85   1273700
##   ED.LastClose ADSK.Close ADSK.Volume ADSK.LastClose IBM.Close IBM.Volume
## 1        75.85      60.07     1365700          58.42    149.30    7887200
## 2        76.04      58.42     1607000          56.76    151.72    3428700
##   IBM.LastClose SPY.Returns ED.Returns ADSK.Returns IBM.Returns
## 1        151.72       0.703     -6.262        2.824      -1.595
## 2        149.35       1.604     -0.250        2.925       1.587
# Summarize weekly returns
SPX_Weekly_Returns<-summary(stocks$SPY.Returns)
ED_Weekly_Returns<-summary(stocks$ED.Returns)
ADSK_Weekly_Returns<-summary(stocks$ADSK.Returns)
IBM_Weekly_Returns<-summary(stocks$IBM.Returns)
Weekly_Returns<-rbind(SPX_Weekly_Returns, ED_Weekly_Returns)
Weekly_Returns<-rbind(Weekly_Returns, ADSK_Weekly_Returns)
Weekly_Returns<-rbind(Weekly_Returns, IBM_Weekly_Returns)
Weekly_Returns %>% kable
Min. 1st Qu. Median Mean 3rd Qu. Max.
SPX_Weekly_Returns -19.79 -0.9742 0.2685 0.1236 1.492 13.290
ED_Weekly_Returns -12.86 -1.1780 0.2165 0.1213 1.412 8.115
ADSK_Weekly_Returns -26.42 -2.7590 0.3265 0.2244 3.256 20.020
IBM_Weekly_Returns -15.17 -1.3650 0.2075 0.1633 1.777 14.470

Betacomputation: case Study of 3 stocks

# Correlation
ED_cor<-cor(stocks$ED.Returns, stocks$SPY.Returns)
ADSK_cor<-cor(stocks$ADSK.Returns, stocks$SPY.Returns)
IBM_cor<-cor(stocks$IBM.Returns, stocks$SPY.Returns)

# Beta is the slope of the regression line between SPY returns and the stock returns
# Use SPY returns to predict individual stock returns
ED_beta<-tidy(lm(stocks$ED.Returns ~ stocks$SPY.Returns), conf.int=TRUE) %>% 
  filter(term == "stocks$SPY.Returns")
ADSK_beta<-tidy(lm(stocks$ADSK.Returns ~ stocks$SPY.Returns), conf.int=TRUE) %>% 
  filter(term == "stocks$SPY.Returns")
IBM_beta<-tidy(lm(stocks$IBM.Returns ~ stocks$SPY.Returns), conf.int=TRUE) %>% 
  filter(term == "stocks$SPY.Returns")

# use kable to show beta estimates, p values, r-squared and CI values
ED_beta$term[1]<-"Beta (ED)"
ADSK_beta$term[1]<-"Beta (ADSK)"
IBM_beta$term[1]<-"Beta (IBM)"
beta_est<-union(IBM_beta, union(ED_beta, ADSK_beta))
beta_est<-mutate(beta_est, r=c(ED_cor, ADSK_cor, IBM_cor))
beta_est<-mutate(beta_est, r.squared_perc=round(r*r*100, 3))
beta_est<-mutate(beta_est, estimate=round(as.numeric(estimate), 2))
beta_est<-rbind(beta_est, c('Beta (SPY)', 1.0, 0, 0, 0, 1, 1, 0, 0))
beta_est %>% kable
term estimate std.error statistic p.value conf.low conf.high r r.squared_perc
Beta (IBM) 0.81 0.0386624990621484 21.0047585679583 2.19355547257144e-71 0.736142567805647 0.888050349063038 0.480272895860774 23.066
Beta (ADSK) 1.41 0.0697787761685322 20.2732056560925 8.9916796773494e-68 1.27755652691469 1.5517224324755 0.664426455822036 44.146
Beta (ED) 0.41 0.0331447254388387 12.4862254750735 1.77815348658691e-31 0.348738492197471 0.478966538080018 0.677502402383787 45.901
Beta (SPY) 1 0 0 0 1 1 0 0
# bar plot showing estimated beta and calculated returns
ggplot(beta_est, aes(x=factor(term), y=estimate)) + 
  geom_bar(stat="identity", width=0.5, colour="black", fill="#FF6666", alpha=0.3) + 
  xlab("Case Study") +
  ylab("Beta Estimate") +
  ggtitle("Case Study: Estimated Beta")

# plot using smoothing
ggplot_lm<-function(stock_col) {
  stocks %>% filter(SPY.Returns >= -10) %>% 
    ggplot(aes_string(x="SPY.Returns", y=stock_col)) + 
    geom_point() +
    geom_smooth(method="lm", se = TRUE) +
    ggtitle(paste(stock_col, " returns against SPY"))
}

ggplot_lm("ED.Returns")

ggplot_lm("ADSK.Returns")

ggplot_lm("IBM.Returns")

We now want to study if a portfolio of low-volatility stocks earn returns different from a portfolio of high-volatility stocks. We will do this using regression as well as simulation

Regression of stock returns using beta (predictor)

beta_cor<-cor(financial_data$Annualized_Return_5Year, financial_data$Beta_5Year)
beta_returns<-tidy(lm(Annualized_Return_5Year ~ Beta_5Year, financial_data), conf.int=TRUE) %>% 
  filter(term == "Beta_5Year")
beta_returns$r<-beta_cor
beta_returns$r.squared_perc<-beta_cor * beta_cor * 100

beta_returns %>% kable
term estimate std.error statistic p.value conf.low conf.high r r.squared_perc
Beta_5Year -5.475609 0.5409108 -10.12294 0 -6.536275 -4.414944 -0.1958902 3.837296
financial_data %>%  
  ggplot(aes_string(x="Beta_5Year", y="Annualized_Return_5Year")) + 
  geom_point() +
  geom_smooth(method="lm", se = TRUE) +
  ggtitle("Beta vs Annualized_Returns (5 Year)")

# From the regression, it appears that annualized returns goes by about 5.47% for every one unit increase in beta.

MonteCarlo simulation

beta_sorted<-financial_data %>% arrange(Beta_5Year) %>% select(Symbol, Beta_5Year, Annualized_Return_5Year)

# Monte Carlo simulation parameters
set.seed(131)
num_bins<-5                     # distinct bins after sorting betas
num_stocks<-nrow(beta_sorted)   # total number of stocks in our universe
bin_size<-num_stocks/num_bins   # number of stocks in each bin
low_beta_bin<-1:bin_size        # low beta stocks position
high_beta_bin<-(num_stocks-bin_size+1):num_stocks  # high beta stocks position
num_samples<-50                 # number of samples randomly chosen from each bin for each run of the simulation

run_expt<-function(bin) {
  samples<-beta_sorted[sample(bin, size=num_samples), ]$Annualized_Return_5Year
  mean_returns<-mean(samples)
  mean_returns
}

# run the simulation to compare returns of low beta stocks vs high beta stocks
B<-10^5
returns_low_beta<-replicate(B, run_expt(low_beta_bin))
mean_returns_low_beta<-mean(returns_low_beta)
sd_returns_low_beta<-sd(returns_low_beta)

returns_high_beta<-replicate(B, run_expt(high_beta_bin))
mean_returns_high_beta<-mean(returns_high_beta)
sd_returns_high_beta<-sd(returns_high_beta)

# tabulate the results
returns_results<-data_frame(Classification="Low Beta", 
                         Mean_Returns=round(mean_returns_low_beta, 3), SD_Returns=round(sd_returns_low_beta, 3))
returns_results<-bind_rows(returns_results, 
                    data_frame(Classification="High Beta", 
                               Mean_Returns=round(mean_returns_high_beta, 3), SD_Returns= round(sd_returns_high_beta, 3)))
returns_results %>% kable
Classification Mean_Returns SD_Returns
Low Beta 13.426 1.778
High Beta 5.013 2.550
# Monte Carlo simulation appears to be in agreement with the regression model.