FIN488 Data Science

Questions

Q1. Write a function returning a dataframe of moments. (10 points)

In the first exam, we computed moments for tickers using a tidy workflow.
Now write a function to automate the task:
- Arguments are ticker(s), fromDate and toDate.
- The function must return a data frame of moments mean, sd, skewness and kurtosis (mean, sd, skewness and kurtosis).
- Use log returns on “adjusted” prices.

library(tidyverse)
library(tidyquant)
library(tibbletime)
library(moments)
ticks <- c("QQQ","GLD","ALB")

yourfunction <- function(tick=ticks,fromDate="2019-01-01",toDate="2019-10-31") {
 x <- tq_get(tick, get = "stock.prices", from = fromDate, to = toDate)
 x %>% group_by(symbol) %>% 
   dplyr::mutate(value = log(adjusted)-log(dplyr::lag(adjusted))) %>% 
   stats::na.omit() %>% 
   dplyr::summarise(mean = mean(value),
                    sd = sd(value),
                    skewness= moments::skewness(value),
                    kurt = moments::kurtosis(value))
  
}

# test your function as per below
yourfunction(tick=c("FB","V","PFE"),fromDate="2019-01-01",toDate=as.character(Sys.Date()))

## # A tibble: 3 x 5
##   symbol      mean     sd skewness  kurt
##   <chr>      <dbl>  <dbl>    <dbl> <dbl>
## 1 FB      0.00169  0.0183    0.418  8.30
## 2 PFE    -0.000546 0.0126   -0.978  6.18
## 3 V       0.00135  0.0120   -0.479  5.05

Q2. Multiple Regression (20 points)

You aim at constructing a portfolio of maximum three sector ETFs to replicate the S&P 500 performance.
You have chosen the following sector ETFs:
- XLV - Health Care.
- XLE - Energy.
- XLK - Technology.
Use adjusted log returns and assume the portfolio is automatically rebalanced daily to simplify our analysis.
Use the period from “2018-01-01” to “2019-10-31”.

A) Extract Data and Compute Returns (3 points)

tick=c("SPY","XLV","XLE","XLK")
df <- tq_get(tick, get = "stock.prices",
             from = "2018-01-01",
             to = "2019-10-31")
df.wide <- df %>% group_by(symbol) %>% dplyr::mutate(value = log(adjusted)-log(dplyr::lag(adjusted))) %>% 
  stats::na.omit() %>% select(date, symbol, value) %>% 
  pivot_wider(names_from = symbol, values_from = value)

B) Multicollinearity (5 points)

Compute the correlations between the three independent variables and, in a few words, discuss its implications on assumptions made in multivariate regression models.

library(PerformanceAnalytics)
  dftxs <- df.wide %>% 
  timetk::tk_xts(rename_index = "date")

PerformanceAnalytics::chart.Correlation(dftxs)

#or to calculate it by formula
cor(df.wide$XLV, df.wide$XLE)

## [1] 0.5875933

cor(df.wide$XLV, df.wide$XLK)

## [1] 0.7651036

cor(df.wide$XLE, df.wide$XLK)

## [1] 0.6265774

Correlations are shown in the graph. In a multivariate regression analysis, we assume no multicollinearity. According to the correlation between the three variables, there are multicollinearity between them, because they are all correlated. This can be a problem for the regression model since the “indenpendent varaibles” are not indenpendent. With multicollinearity, one should be careful when adding variables into the model, since it can falsely increase R square from correlation between the variables. #### C) Build Your Multiple Regressions (6 points)

Build your regression starting with XLK.
Then add XLV.
Then add XLE.

Print your results and explain in a few bullets your thoughts and conclusions on the quality and significance of this multiple regression.

fit1 <- lm(SPY ~ XLK, df.wide)
summary(fit1)

## 
## Call:
## lm(formula = SPY ~ XLK, data = df.wide)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0139371 -0.0018300 -0.0000934  0.0019978  0.0107294 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.133e-05  1.597e-04  -0.447    0.655    
## XLK          6.722e-01  1.171e-02  57.421   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003422 on 458 degrees of freedom
## Multiple R-squared:  0.878,  Adjusted R-squared:  0.8778 
## F-statistic:  3297 on 1 and 458 DF,  p-value: < 2.2e-16

fit2 <- lm(SPY ~ XLK + XLV, df.wide)
summary(fit2)

## 
## Call:
## lm(formula = SPY ~ XLK + XLV, data = df.wide)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0067542 -0.0016756 -0.0000728  0.0017316  0.0075759 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.707e-05  1.175e-04  -0.571    0.568    
## XLK          4.703e-01  1.337e-02  35.163   <2e-16 ***
## XLV          3.561e-01  1.805e-02  19.729   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002518 on 457 degrees of freedom
## Multiple R-squared:  0.9341, Adjusted R-squared:  0.9338 
## F-statistic:  3241 on 2 and 457 DF,  p-value: < 2.2e-16

fit3 <- lm(SPY ~ XLK + XLV + XLE, df.wide)
summary(fit3)

## 
## Call:
## lm(formula = SPY ~ XLK + XLV + XLE, data = df.wide)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0080237 -0.0013128 -0.0000201  0.0013766  0.0063243 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.573e-05  9.868e-05   0.362    0.717    
## XLK         4.137e-01  1.191e-02  34.740   <2e-16 ***
## XLV         3.094e-01  1.548e-02  19.989   <2e-16 ***
## XLE         1.369e-01  9.787e-03  13.989   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002108 on 456 degrees of freedom
## Multiple R-squared:  0.9539, Adjusted R-squared:  0.9536 
## F-statistic:  3146 on 3 and 456 DF,  p-value: < 2.2e-16

plot(fit3)

The model, at first glance, looks very significant with an adjusted R sqaure of 95.36% and very low P values. However, due to the multicollinearity between the variables, the incremental value increased in R square might be a results from the correlation between the variables. This could lead to inaccurate coefficient estimate since the model has hard time isolating variables. The residue analysis shows that the model has high goodness of fit. In conclusion, the regression model shows that SPY is positively correlated with all three variables and the correlations are significant. One should also take multicollinearity into the consideration when discussing the reliability of the model.

D) Hedging and Replication (6 points)

Thinking hedge ratios, how many units of each of the three sectors ETFs would you buy to replicate the SPY return?

cor(df.wide$SPY, df.wide$XLV)*sd(df.wide$SPY)/sd(df.wide$XLV)

## [1] 0.8417804

cor(df.wide$SPY, df.wide$XLE)*sd(df.wide$SPY)/sd(df.wide$XLE)

## [1] 0.5437735

cor(df.wide$SPY, df.wide$XLK)*sd(df.wide$SPY)/sd(df.wide$XLK)

## [1] 0.6721947

One can buy 0.8417804 unit of XLV, 0.5437735 unit of XLE, 0.6721947 units of XLK to replicate SPY return.

FIN488 Data Science

Exam 1

Tszue Tse

2019-11-06

Format and Guidance

Grading