Abstract
This is a solution for Workshop 3. Not all of the workshop is displayed. Only the sections where students needed to work on an exercise or answer questions are shown.
First, I load the libraries:
library(quantmod)
Then, I download the data from Yahoo Finance using getSymbols from January 2016 to Aug 2021:
getSymbols(c("AAPL", "MSFT"), from="2016-01-01", to="2021-08-31",
periodicity="monthly", src="yahoo")
## [1] "AAPL" "MSFT"
Notice that the objects AAPL and MSFT were created in your environment (right pane).
Double-click the objects in your environment or use the View() function to take a look at them.
Since the data only contains prices, you have to calculate returns. You can do this either by creating new objects or by adding a new column to the existing objects.
Calculate continuously compounded returns (the logarithmic difference of the Adjusted prices of Apple & Microsoft):
# You can create new objects for the returns by getting the first difference
# of the log prices:
r_AAPL <- diff(log(AAPL$AAPL.Adjusted))
r_MSFT <- diff(log(MSFT$MSFT.Adjusted))
# The first month will have NA since you do not have a price before the first
# month. Then, you can delete the NA values using the na.omit function:
r_AAPL <- na.omit(r_AAPL)
r_MSFT <- na.omit(r_MSFT)
# Or you can add a new column to the original R objects using the same formula:
AAPL$r_AAPL <- na.omit(diff(log(AAPL$AAPL.Adjusted)))
MSFT$r_MSFT <- na.omit(diff(log(MSFT$MSFT.Adjusted)))
Notice how the $ operator is useful to create new columns and access existing columns in a data frame.
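To see what diff(log()) does, here is a minimal sketch with made-up prices (these are NOT real AAPL or MSFT prices, and the object names are only for illustration). It uses a plain numeric vector, so diff() simply returns one element fewer instead of putting an NA in the first position as it does with xts objects.

```r
# Toy example with made-up prices (not real AAPL or MSFT data)
prices <- c(100, 105, 102, 110)
n <- length(prices)

# Continuously compounded return: r_t = log(P_t) - log(P_{t-1})
r_log <- diff(log(prices))

# The same return written as the log of the price ratio
r_ratio <- log(prices[-1] / prices[-n])

# Simple returns, for comparison
r_simple <- prices[-1] / prices[-n] - 1

print(round(r_log, 4))
print(round(r_simple, 4))

# Log returns add up over time: their sum is the log return of the whole period
total_log_return <- sum(r_log)
print(total_log_return)
```

The last line shows one reason why log returns are convenient: the sum of the monthly log returns equals log(110/100), the log return from the first to the last price.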
I use the summary() function to examine the descriptive statistics of the stock returns:
summary(r_AAPL)
## Index AAPL.Adjusted
## Min. :2016-02-01 Min. :-0.20340
## 1st Qu.:2017-06-16 1st Qu.:-0.01776
## Median :2018-11-01 Median : 0.04664
## Mean :2018-10-31 Mean : 0.02851
## 3rd Qu.:2020-03-16 3rd Qu.: 0.08786
## Max. :2021-08-01 Max. : 0.19423
summary(r_MSFT)
## Index MSFT.Adjusted
## Min. :2016-02-01 Min. :-0.102087
## 1st Qu.:2017-06-16 1st Qu.: 0.001902
## Median :2018-11-01 Median : 0.028075
## Mean :2018-10-31 Mean : 0.026824
## 3rd Qu.:2020-03-16 3rd Qu.: 0.056479
## Max. :2021-08-01 Max. : 0.127800
Now we write the null and alternative hypotheses to check whether the mean returns of AAPL and MSFT are greater than zero:
H0: mean(r_AAPL) = 0
Ha: mean(r_AAPL) > 0
H0: mean(r_MSFT) = 0
Ha: mean(r_MSFT) > 0
Remember that your hypothesis is always the ALTERNATIVE hypothesis (Ha), and the NULL hypothesis (H0) is the opposite of your hypothesis. You start by assuming that the NULL hypothesis is TRUE (not your hypothesis).
The standard error (se) is the standard deviation of the VARIABLE OF STUDY. In this case, the variable of study is the mean of stock returns, so the standard error is the standard deviation of the MEAN of stock returns.
The standard deviation of the mean is equal to the standard deviation of the individual returns divided by the square root of N.
See the Note Basics of Hypothesis Testing to see the explanation of this formula.
N <- nrow(r_AAPL)
# nrow is a function that brings the # of rows of an R object. In this case,
# the # of rows of the r_AAPL will be the total # of months selected
# In this case, N is the same for MSFT since both have return data for the
# months selected
se_AAPL <- sd(r_AAPL) / sqrt(N)
se_AAPL
## [1] 0.01010732
se_MSFT <- sd(r_MSFT) / sqrt(N)
se_MSFT
## [1] 0.0061237
sd() is a function that calculates the standard deviation, while sqrt() calculates the square root. nrow() returns the number of rows of a matrix, data frame or xts object (for a plain vector, use length() instead). By using nrow(r_AAPL) we get N.
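If the formula se = sd / sqrt(N) looks arbitrary, a small simulation can make it more tangible. The sketch below uses hypothetical parameters (a mean of 0.02 and a standard deviation of 0.08 for the monthly returns, drawn from a normal distribution). These are assumptions for illustration, not estimates from the AAPL or MSFT data.

```r
# Simulation sketch: the standard deviation of many sample means is close to
# sd(individual observations) / sqrt(N). All parameters are hypothetical.
set.seed(123)
N <- 67          # sample size (same number of months as in this workshop)
mu <- 0.02       # assumed mean monthly return
sigma <- 0.08    # assumed standard deviation of individual monthly returns

# Draw 10,000 samples of size N and keep the mean of each sample
sample_means <- replicate(10000, mean(rnorm(N, mean = mu, sd = sigma)))

empirical_se <- sd(sample_means)     # spread of the simulated means
theoretical_se <- sigma / sqrt(N)    # formula from the text

cat("Empirical SE:  ", empirical_se, "\n")
cat("Theoretical SE:", theoretical_se, "\n")
```

Both numbers should be very close to each other: the means of many samples vary much less than the individual returns do, and the reduction factor is the square root of N.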
Remember that the t-Statistic is always calculated as:
xa = Value of the variable of study you got from the data
x0 = Value of the variable of study according to the Null Hypothesis (H0)
t = (xa - x0) / standard error(xa)
In this case, the variable of study is the mean of returns, so xa= mean(returns):
t = (mean(returns) - 0) / se
The numerator is the difference between the mean return of your data and the mean return under the NULL HYPOTHESIS, which is zero. The denominator is the standard error of the variable of study (se), which is the individual standard deviation divided by the square root of N.
Then, the t value is actually the distance between the actual value of the mean return and the hypothetical value, which is zero. This distance is measured in # of standard errors, that is, in standard deviations of the mean return.
Then,
se = sd(returns) / sqrt(N)
t = (mean(returns) - 0) / (sd(returns) / sqrt(N))
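The formula above can be wrapped in a small function. This is only a sketch: t_stat is a hypothetical helper name (it is not part of any package), and the returns in x are made up so that the arithmetic is easy to follow by hand.

```r
# Hypothetical helper: one-sample t-statistic for H0: mean = mu0
t_stat <- function(x, mu0 = 0) {
  n <- length(x)
  se <- sd(x) / sqrt(n)     # standard error of the mean
  (mean(x) - mu0) / se      # distance from mu0 measured in standard errors
}

# Made-up returns: mean = 0.02, variance = 0.0005, so se = 0.01
x <- c(0.02, -0.01, 0.05, 0.03, 0.01)
t_toy <- t_stat(x)
print(t_toy)
```

With these numbers the mean (0.02) is two standard errors (0.01 each) away from zero, so the t-value is 2.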
mean_r_AAPL <- mean(r_AAPL)
t_value_AAPL <- (mean_r_AAPL - 0) / se_AAPL
t_value_AAPL
## [1] 2.821079
# Repeat for MSFT
mean_r_MSFT <- mean(r_MSFT)
t_value_MSFT <- ((mean_r_MSFT - 0) / se_MSFT)
t_value_MSFT
## [1] 4.38034
Since the t-value for the AAPL test is greater than 2, there is enough statistical evidence to say that the mean return of AAPL is greater than 0.
Similarly, the t-value for the MSFT test is greater than 2, so we also conclude that there is enough statistical evidence to say that the mean return of MSFT is significantly greater than zero.
In both cases we can reject the null hypothesis, so we have evidence to support our hypothesis (the alternative hypothesis), which states that the mean return of each stock is greater than zero.
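The "greater than 2" rule is an approximation. As a rough sketch, the exact one-sided 5% critical value and the p-values can be obtained with the base R functions qt() and pt(), using the t-values and N reported above. I am assuming N - 1 degrees of freedom, which is the usual choice for a one-sample t-test.

```r
# t-values and sample size as reported in this workshop
t_AAPL <- 2.821079
t_MSFT <- 4.38034
N <- 67
df <- N - 1      # degrees of freedom of a one-sample t-test

# One-sided 5% critical value: reject H0 when the t-value exceeds it
t_crit <- qt(0.95, df)

# One-sided p-values: probability of a t-value this large if H0 were true
p_AAPL <- pt(t_AAPL, df, lower.tail = FALSE)
p_MSFT <- pt(t_MSFT, df, lower.tail = FALSE)

cat("Critical value:", t_crit, "\n")
cat("p-value AAPL:  ", p_AAPL, "\n")
cat("p-value MSFT:  ", p_MSFT, "\n")
```

The critical value is somewhat below 2, so comparing against 2 is a slightly conservative shortcut for this one-sided test.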
We can easily run the t-test using the function t.test:
ttest_AAPL <- t.test(as.numeric(r_AAPL), alternative = "greater")
t.test only accepts vectors as arguments.
We use the as.numeric() function in order to turn r_AAPL into a vector (it was an xts object).
Also, we specify the alternative argument as greater to tell the function that our alternative hypothesis is that r_AAPL is greater than 0.
class(ttest_AAPL)
## [1] "htest"
# ttest_AAPL is an htest object, which contains a list of objects
names(ttest_AAPL)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
# The names function shows the objects contained in the list
tvalAAPL <- as.numeric(ttest_AAPL$statistic)
# I store the statistic object of ttest_AAPL in a new object called tvalAAPL
# Now, I compare the t-value that I manually computed and the t-value from
# t.test()
round(tvalAAPL, 10) == round(t_value_AAPL, 10)
## [1] TRUE
# The console shows TRUE. Hence, they are the same.
# round() is a function that rounds numbers to x decimal places.
# Differences after the 10th decimal place are not significant.
# Repeat process for MSFT
ttest_MSFT <- t.test(as.numeric(r_MSFT), alternative = "greater")
tvalMSFT <- as.numeric(ttest_MSFT$statistic)
round(t_value_MSFT, 10) == round(tvalMSFT, 10)
## [1] TRUE
I got exactly the same t-value from t.test() as when I computed it manually.
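Another way to compare two computed numbers is the base R function all.equal(), which checks equality up to a small numerical tolerance, so no rounding is needed. The sketch below uses simulated returns (hypothetical mean and standard deviation, not the real AAPL or MSFT data) so that it can run on its own.

```r
# Simulated monthly returns (hypothetical parameters, not real data)
set.seed(42)
x <- rnorm(67, mean = 0.02, sd = 0.08)

# Manual t-statistic for H0: mean = 0
t_manual <- (mean(x) - 0) / (sd(x) / sqrt(length(x)))

# The same statistic taken from the htest object returned by t.test()
tt <- t.test(x, alternative = "greater", mu = 0)
t_builtin <- as.numeric(tt$statistic)

# all.equal() compares the two values up to a small numerical tolerance
same <- isTRUE(all.equal(t_manual, t_builtin))
print(same)
```

I wrap all.equal() in isTRUE() because all.equal() returns a text description of the difference, not FALSE, when the values differ.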
Run a t-test to compare whether the average monthly return of AAPL is greater than the average monthly returns of MSFT.
H0: mean(r_AAPL) = mean(r_MSFT)
Ha: mean(r_AAPL) <> mean(r_MSFT)
I have to re-arrange the equality to leave a number to the right:
H0: mean(r_AAPL) - mean(r_MSFT) = 0
Ha: mean(r_AAPL) - mean(r_MSFT) <>0
The null hypothesis says that there is no difference between the means, and our hypothesis says that there is a difference.
Check that the VARIABLE OF STUDY of this test is the DIFFERENCE OF BOTH MEAN RETURNS. We can re-define the hypothesis using a variable for this difference:
dif = mean(r_AAPL) - mean(r_MSFT)
H0: dif = 0
Ha: dif <>0
Then, the variable of analysis is the DIFFERENCE of 2 MEANS of returns (dif): the mean of monthly returns of AAPL minus the mean of monthly returns of MSFT.
The t-statistic is estimated as:
t = (mean(r_AAPL) - mean(r_MSFT) - 0) / se
Remember that the standard error is the standard deviation of the VARIABLE OF STUDY. Then, the standard error is equal to:
se = SD( mean(r_AAPL) - mean(r_MSFT) )
se = SD(dif)
Then:
t = (mean(r_AAPL) - mean(r_MSFT) - 0) / SD(mean(r_AAPL) - mean(r_MSFT))
How can I calculate the standard deviation of the difference of the means? I can start by calculating the variance of this difference, and then take the square root.
The variances of the stock returns might be different, so we can estimate the variance of the DIFFERENCE as:
Var(mean(r_AAPL) - mean(r_MSFT) ) = Var(mean(r_AAPL)) + Var(mean(r_MSFT))
Note that the variance of a difference is the SUM of both variances, not their difference.
This sounds counterintuitive, but remember that each mean is a random variable, and subtracting one random quantity from another does not cancel their variability.
Variance is a measure of variability, so the uncertainty of both means accumulates in the difference. You can check this formula in any probability book.
Also, I am assuming that both returns are independent of each other, which implies that there is no linear correlation between them. This might not be true, but for now, I will make this assumption.
If I do not make this assumption, the formula for the variance of the difference is more complicated, since I need to subtract 2 times the covariance between both returns!
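The general rule for the variance of a difference, Var(X - Y) = Var(X) + Var(Y) - 2 * Cov(X, Y), can be checked numerically. The sketch below uses two short made-up series that move together (they are NOT real returns), so the covariance term matters.

```r
# Made-up series that tend to move together (not real returns)
x <- c(0.03, -0.02, 0.05, 0.01, -0.04, 0.06)
y <- c(0.02, -0.01, 0.03, 0.02, -0.03, 0.04)

# Variance of the difference, computed directly
var_diff_direct <- var(x - y)

# General formula, including the covariance term
var_diff_formula <- var(x) + var(y) - 2 * cov(x, y)

# Formula under the independence assumption (covariance treated as zero)
var_diff_indep <- var(x) + var(y)

print(c(direct = var_diff_direct,
        formula = var_diff_formula,
        independent = var_diff_indep))
```

The direct calculation and the general formula agree. Since the covariance of these two series is positive, the independence version overstates the variance of the difference, which would make a t-test more conservative.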
Since the variance of a mean is equal to the variance of the individuals divided by N, then:
Var(mean(r_AAPL) - mean(r_MSFT) ) = Var(r_AAPL)/N + Var(r_MSFT)/N
Var(mean(r_AAPL) - mean(r_MSFT) ) = (1/N) * (Var(r_AAPL) + Var(r_MSFT) )
SD(mean(r_AAPL) - mean(r_MSFT) ) = sqrt((1/N)* (Var(r_AAPL) + Var(r_MSFT)))
t = (mean(r_AAPL) - mean(r_MSFT) - 0) / sqrt( (1/N)* (Var(r_AAPL) + Var(r_MSFT) ) )
cat("mean_r_AAPL = ", mean_r_AAPL)
## mean_r_AAPL = 0.02851356
cat("mean_r_MSFT = ", mean_r_MSFT)
## mean_r_MSFT = 0.02682389
N <- nrow(r_AAPL) # (same as number of rows of MSFT)
cat("N =", N)
## N = 67
cat("var_r_AAPL = ", var(r_AAPL) )
## var_r_AAPL = 0.006844584
cat("var_r_MSFT = ", var(r_MSFT))
## var_r_MSFT = 0.00251248
t <- (mean_r_AAPL - mean_r_MSFT - 0) /
  sqrt((1/N) * (var(r_AAPL) + var(r_MSFT)))
# I just convert the result to a plain numeric value:
t <- as.numeric(t)
print(t)
## [1] 0.1429782
I can do the same t-test much faster by using the t.test function:
ttest <- t.test(as.numeric(r_AAPL), as.numeric(r_MSFT), paired = FALSE, var.equal = FALSE)
ttest
##
## Welch Two Sample t-test
##
## data: as.numeric(r_AAPL) and as.numeric(r_MSFT)
## t = 0.14298, df = 108.7, p-value = 0.8866
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.02173332 0.02511266
## sample estimates:
## mean of x mean of y
## 0.02851356 0.02682389
I set the paired argument to FALSE since I am comparing the means of 2 groups of returns without pairing the observations by date. I set the var.equal argument to FALSE since I am assuming that the variances of the two return series are different.
cat("The t-value using t.test is ", ttest$statistic, "\n")
## The t-value using t.test is 0.1429782
cat("The t-value calculated manually is ", t, "\n" )
## The t-value calculated manually is 0.1429782
cat("The p-value of the test is ", ttest$p.value)
## The p-value of the test is 0.8865721
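The degrees of freedom (df = 108.7) and the p-value in the t.test output can be approximately reproduced from the summary numbers printed earlier. The sketch below uses the Welch-Satterthwaite approximation for the degrees of freedom, which is what a Welch two-sample test relies on. The inputs are the rounded values shown in this document, so small differences in the last decimals are expected.

```r
# Summary numbers reported in this workshop (rounded as printed)
m1 <- 0.02851356    # mean return of AAPL
m2 <- 0.02682389    # mean return of MSFT
v1 <- 0.006844584   # variance of AAPL returns
v2 <- 0.00251248    # variance of MSFT returns
n1 <- 67
n2 <- 67

# Standard error of the difference of means (variances not assumed equal)
se_diff <- sqrt(v1 / n1 + v2 / n2)
t_welch <- (m1 - m2 - 0) / se_diff

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v1 / n1 + v2 / n2)^2 /
  ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))

# Two-sided p-value: probability of a t-value at least this far from zero
p_two_sided <- 2 * pt(-abs(t_welch), df_welch)

cat("t  =", t_welch, "\n")
cat("df =", df_welch, "\n")
cat("p  =", p_two_sided, "\n")
```

Because the variances are not assumed equal, the degrees of freedom are not simply n1 + n2 - 2 = 132. The approximation gives a smaller, non-integer value.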
Since the absolute value of the t-value is NOT greater than 2, I CANNOT say that the mean return of Apple is significantly different from, and therefore not significantly greater than, the mean return of MSFT. In this sample the mean return of Apple is greater than the mean return of MSFT, but the difference is NOT big enough to say that there is a significant difference between both mean returns.
In conclusion, there is NO statistical evidence at the 95% confidence level that the mean return of Apple is greater than the mean return of MSFT; statistically, they are about the same.