Abstract
In this workshop we review the basics of RStudio and R objects. In addition, we will learn how to apply hypothesis testing to finance. You will work in RStudio. Create an R Notebook document to write whatever is asked in this workshop.
At the beginning of the R Notebook write Workshop 3 - Financial Econometrics II and your name (as we did in the previous workshop).
You have to replicate all the steps explained in this workshop, and ALSO you have to do whatever is asked. Any QUESTION or any STEP you need to do will be written in CAPITAL LETTERS. For ANY QUESTION, you have to RESPOND IN CAPITAL LETTERS right after the question.
It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your notebook. Your own workshop/notebook will be very helpful for your further study.
Keep saving your .Rmd file, and ONLY SUBMIT the .html version of your .Rmd file.
This section was designed for those who want to be more familiar with the R language. This section is optional.
In R, any piece of information is stored in an object. Each R object belongs to a specific data class, and each R class has its own data structure. In other programming languages, a piece of data is usually called a variable. R objects are classified according to their data structures; the ones covered in this section are vectors, matrices and data frames.
A vector is a collection of values. The values of a vector can be of any of the following atomic classes: numeric, character, integer, logical (true/false) or complex. Each vector can have only one type of data class.
Now we will create a small collection of numbers in a numeric vector. A numeric vector is the simplest type of data structure in R. In fact, even a single number is considered a vector of length one.
To create a vector we can use the function c(), which means combine. We can define a vector with the stock prices 1.3, 1.5 and 1.28 using the c() function and separating each element by a comma, as shown below.
stock_1 <- c(1.3, 1.5, 1.28)
stock_1
## [1] 1.30 1.50 1.28
We can also create an integer vector with consecutive numbers as follows:
stock_2 <- 1:10
stock_2
## [1] 1 2 3 4 5 6 7 8 9 10
We can do arithmetic operations with a numeric vector. For example, we can subtract a number from each element of the vector and assign the result to another numeric vector as follows:
stock_3 <- stock_1 - 1
stock_3
## [1] 0.30 0.50 0.28
You can also do arithmetic operations with two or more vectors:
stock_4 <- stock_2 + stock_3
## Warning in stock_2 + stock_3: longer object length is not a multiple of
## shorter object length
stock_4
## [1] 1.30 2.50 3.28 4.30 5.50 6.28 7.30 8.50 9.28 10.30
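When given two vectors of the same length, R performs the specified arithmetic operation (+, -, *, etc.) element-by-element. When the lengths differ, as in the example above, R recycles the elements of the shorter vector and issues a warning if the longer length is not a multiple of the shorter one.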
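The warning in the example above appears because stock_2 has 10 elements and stock_3 has only 3. A minimal sketch of this recycling behavior, using small illustrative vectors (the names long_vec and short_vec are made up for this example):

```r
# Illustrative vectors (hypothetical names, not part of the workshop data)
long_vec  <- 1:6
short_vec <- c(10, 20, 30)

# Lengths 6 and 3: the shorter vector is reused twice, and no warning is
# raised because 6 is a multiple of 3
recycled <- long_vec + short_vec
recycled

# Vectors of the same length are simply added element by element
same_len <- c(1, 2, 3) + short_vec
same_len
```

When the longer length is not a multiple of the shorter one, R still returns a result but warns, as it did for stock_4.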
The elements of a vector are R objects, so these elements belong to one of the five basic or atomic data classes:
x <- c(2, 3.43, 4.21)                 # numeric
x <- c("2", "hi", "Econometrics I")  # character
x <- c(2L, 3L, 4L)                    # integer
x <- c(TRUE, FALSE, FALSE, TRUE)      # logical
z <- c(5i, 4i)                        # complex
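To check which atomic class a vector has, one option is the class() function. The sketch below builds one vector per class (the v_ names are hypothetical) and also shows what happens when classes are mixed inside c():

```r
# One vector per atomic class (hypothetical names chosen for this sketch)
v_num <- c(2, 3.43, 4.21)
v_chr <- c("2", "hi", "Econometrics I")
v_int <- c(2L, 3L, 4L)
v_lgl <- c(TRUE, FALSE, FALSE, TRUE)
v_cpl <- c(5i, 4i)

# class() reports the data class of each vector
classes <- sapply(list(v_num, v_chr, v_int, v_lgl, v_cpl), class)
classes

# Mixing classes inside c() coerces every element to a single class
v_mixed <- c(1, "a", TRUE)
class(v_mixed)
```

Because a vector can hold only one class, the mixed vector ends up as character.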
Now we will learn how to access a particular element of a vector. First we will create a vector with the numbers from 101 to 105 using the function seq(), which generates a sequence of numbers according to its arguments, in this case from 101 to 105.
vector1 <- seq(from = 101, to = 105)
vector1
## [1] 101 102 103 104 105
If we want to access the third element of this vector, which is 103, we write the element number inside the square brackets operator []:
vector1[3]
## [1] 103
Imagine that we want to access not just one but several elements of the vector. This is done by passing a vector of numbers that indicate the positions of the elements inside the original vector. We do this as follows:
vector1[c(1,3,5)]
## [1] 101 103 105
If you want to modify a certain element of the existing vector vector1, all you have to do is assign (<-) the new value to the specific vector location indicated by the value inside the square brackets operator [].
The example below can be read as follows: the second element of the vector vector1 gets the value 109.
vector1[2] <- 109
vector1
## [1] 101 109 103 104 105
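The square brackets operator also accepts logical conditions and negative positions. The following sketch, on a hypothetical vector demo_vec, shows these variants together with an assignment to several positions at once:

```r
# Hypothetical vector used only for this sketch
demo_vec <- seq(from = 101, to = 105)

# Logical indexing: keep only the elements greater than 103
above_103 <- demo_vec[demo_vec > 103]
above_103

# Negative indexing: drop the first element
without_first <- demo_vec[-1]
without_first

# Assign new values to several positions at once
demo_vec[c(1, 5)] <- c(0, 0)
demo_vec
```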
R considers all vectors as VERTICAL vectors. However, when you display a vector on the screen you will see it as HORIZONTAL. Let’s see an example:
vector2 <- seq(1,100)
vector2
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100
We can see that the elements of the vector are displayed horizontally. Why is it important to know whether a vector is vertical or horizontal? When we do matrix operations with matrices and/or vectors, we have to know whether the vector is horizontal or vertical.
When we display a vector, R always shows an index [1] indicating that the row starts with element 1. If the vector has too many elements to be displayed in one row, it continues in the following row, which starts with [#], where # is the position of the first element shown in that row.
Actually, when we declare a numeric variable with only 1 value, R defines it as a vector of 1 element. So the simplest R object is a vector, no matter whether it has 1 or more elements.
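To make the orientation explicit before a matrix operation, one possible approach is to turn the vector into a one-column matrix and transpose it when a row is needed. This is only a sketch with made-up names (v, col_v, row_v):

```r
# A plain vector has no dimension attribute
v <- c(1, 2, 3)
dim(v)

# Store it explicitly as a VERTICAL (3 x 1) vector
col_v <- matrix(v, ncol = 1)
dim(col_v)

# t() transposes it into a HORIZONTAL (1 x 3) vector
row_v <- t(col_v)
dim(row_v)

# The orientation changes the result of a matrix product
inner_prod <- row_v %*% col_v   # 1 x 1 matrix
outer_prod <- col_v %*% row_v   # 3 x 3 matrix
inner_prod
outer_prod
```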
Matrices are vectors with a dimension attribute. A typical matrix object has two dimensions: rows and columns. Those dimensions define the size of the matrix and must be defined every time you create a new matrix object.
To create a matrix you can use the function matrix(). The first argument is the data in your matrix, the nrow argument is the desired number of rows, and the ncol argument is the desired number of columns.
matrix(1:4, nrow = 2, ncol = 2)
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
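Note that matrix() fills the matrix column by column by default, which is why 3 appears in the first row. If the data should be filled row by row, the byrow argument can be used. A short sketch (m_col and m_row are illustrative names):

```r
# Default behavior: the data fills the matrix column by column
m_col <- matrix(1:4, nrow = 2, ncol = 2)
m_col

# byrow = TRUE fills the matrix row by row instead
m_row <- matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)
m_row
```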
Alternatively, you can create a matrix by joining 2 or more different vectors, using the functions rbind() or cbind(). rbind joins the vectors as rows, while cbind joins them as columns. In the example below, we create two vectors, stock1 and stock2, each having 3 elements, and then we combine them into a matrix.
# Create vectors
stock1 <- c(1,2,3)
stock2 <- c(90,91,92)
# Combine vectors as VERTICAL vectors and save as `matrix1`
matrix1 <- cbind(stock1,stock2)
# Print object matrix1
matrix1
## stock1 stock2
## [1,] 1 90
## [2,] 2 91
## [3,] 3 92
# Combine vectors as HORIZONTAL vectors and save as `matrix2`
matrix2 <- rbind(stock1,stock2)
# Print object matrix2
matrix2
## [,1] [,2] [,3]
## stock1 1 2 3
## stock2 90 91 92
To access a certain element of a matrix we use the square brackets operator [] after the name. The first element inside the square brackets, before the comma, indicates the row number, while the second element refers to the column. This way, you can point to any specific location in the matrix and extract the value saved in that location. In the case below, we extract the element in the first row and second column, which corresponds to the number 90:
matrix1[1,2]
## stock2
## 90
We can also refer to a whole column or row simply by leaving the other element empty. If we want to see the whole second row, we just write:
matrix1[2,]
## stock1 stock2
## 2 91
A very important point to note is that, just like vectors, matrices can only store one class of object, e.g. numeric objects. In other words, you cannot have both string class elements and numeric class elements in the same matrix. To store both, we will use a different class of object called data frame.
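The one-class restriction of matrices can be seen in a small experiment. In this sketch (tickers, px, mixed and mixed_df are hypothetical names), combining text and numbers with cbind() turns the numbers into text, while a data frame keeps each column in its own class:

```r
# Hypothetical example data
tickers <- c("ALSEA", "AMXL", "FEMSA")
px      <- c(9.4, 5.5, 7.8)

# In a matrix, the numeric prices are coerced to character
mixed <- cbind(tickers, px)
class(mixed[1, "px"])

# In a data frame, the price column stays numeric
mixed_df <- data.frame(tickers, px)
class(mixed_df$px)
```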
Data frames are used to store tabular data. A data frame is like an Excel spreadsheet kind of table. However, unlike matrices, data frames can store different classes of objects in different columns.
You can create a data frame using the data.frame() function. In the example shown below, we combine both string and numeric data in the same prices object.
prices <- data.frame(Stocks = c("ALSEA","AMXL","FEMSA"), price = c(9.4, 5.5, 7.8))
prices
## Stocks price
## 1 ALSEA 9.4
## 2 AMXL 5.5
## 3 FEMSA 7.8
We can see the dimensions of the data frame using the dim() function.
dim(prices)
## [1] 3 2
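Beyond dim(), it is often useful to pull out one column or filter rows of a data frame. The sketch below rebuilds a small data frame under the hypothetical name prices_demo so that it runs on its own:

```r
# Hypothetical copy of the example data, so the sketch is self-contained
prices_demo <- data.frame(Stocks = c("ALSEA", "AMXL", "FEMSA"),
                          price  = c(9.4, 5.5, 7.8))

# The dollar sign extracts a single column as a vector
price_col <- prices_demo$price
avg_price <- mean(price_col)
avg_price

# A logical condition before the comma keeps only the matching rows
expensive <- prices_demo[prices_demo$price > 6, ]
expensive
```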
You can also create a data frame object by transforming an existing matrix via the function as.data.frame(). In the example below, we first create a matrix called my.matrix and then transform it into my.df. In the third line of code we check whether the new object is actually a data.frame class object.
my.matrix <- matrix(1:4, nrow = 2, ncol = 2)
my.df <- as.data.frame(my.matrix)
class(my.df)
## [1] "data.frame"
The rbind or cbind functions can also be used to append rows or columns to an existing data frame. In the example below I will add a new column to the data frame prices previously created, but first I will create a copy and call it prices.df just to make it more descriptive.
# Make a copy
prices.df <- prices
# Append a new column called behavior
prices.df <- cbind(prices.df, behavior=c("up", "down", "up"))
prices.df
## Stocks price behavior
## 1 ALSEA 9.4 up
## 2 AMXL 5.5 down
## 3 FEMSA 7.8 up
As we can see, each of the three columns has a name. Data frames, like any object in R, have attributes:
attributes(prices.df)
## $names
## [1] "Stocks" "price" "behavior"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3
In this case the data frame has 3 attributes: $names, $row.names and $class. We can not only see them but also manipulate them. The $names attribute stores the names of the columns in your data frame, so if you do not like the names of your columns you can always change them using the colnames() function as follows:
colnames(prices.df) <- c("ALSEA","AMXL","FEMSA")
To see the changes you have made, you can always print out the attributes of the data frame:
attributes(prices.df)
## $names
## [1] "ALSEA" "AMXL" "FEMSA"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3
Or simply get the names of the columns or rows using the functions colnames() or rownames().
colnames(prices.df)
## [1] "ALSEA" "AMXL" "FEMSA"
rownames(prices.df)
## [1] "1" "2" "3"
Changing the names of a few columns is easy, but now imagine that we had 100 or 1,000 columns: writing all the column names again is not practical. So we need to learn how to modify only one. To do this we have to know how to pinpoint the element we want to modify, so again we use the square brackets [] to point to the specific location of the column that will be renamed. Let’s say you want to change the name of the third column to up-down:
colnames(prices.df)[3] <- "up-down"
# See the changes
colnames(prices.df)
## [1] "ALSEA" "AMXL" "up-down"
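When the position of a column is not known, it can also be located by its current name. This is a sketch on a hypothetical data frame df_demo; it uses up_down rather than up-down because names with a hyphen need backticks when used with the dollar sign:

```r
# Hypothetical data frame used only for this sketch
df_demo <- data.frame(a = 1:2, b = 3:4, behavior = c("up", "down"))

# Find the column called "behavior" and rename only that one
names(df_demo)[names(df_demo) == "behavior"] <- "up_down"
names(df_demo)

# The renamed column is available through the dollar sign
df_demo$up_down
```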
We can use the assignment operator <-, also known as the back arrow operator, to assign a value to a variable. As we have seen in the previous workshops, we can construct objects with information that will be useful for us. In this case, I will download information about a stock with the getSymbols function and store the Apple adjusted price in a variable.
To do that, remember to always load the package of the function you will use:
library(quantmod)
Now I will use the getSymbols function to download daily data from August 2020 onwards from Yahoo Finance:
getSymbols("AAPL", from="2020-08-01", src="yahoo", periodicity="daily")
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
## [1] "AAPL"
# I see the last rows of this dataset:
tail(AAPL)
## AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
## 2021-08-17 150.23 151.68 149.09 150.19 92229700 150.19
## 2021-08-18 149.80 150.72 146.15 146.36 86326000 146.36
## 2021-08-19 145.03 148.00 144.50 146.70 86960300 146.70
## 2021-08-20 147.44 148.50 146.78 148.19 59947400 148.19
## 2021-08-23 148.31 150.19 147.89 149.71 60131800 149.71
## 2021-08-24 149.45 150.86 149.15 149.62 48606400 149.62
In this case, the object created with the stock information, AAPL, is an xts object.
This type of object is similar to a data frame, but with a time/date index. You can refer to a column of this object by using the dollar sign. I will store the adjusted prices column in a new variable:
# When I use the dollar sign, I am referring to a specific column of the object
AdPrice <- AAPL$AAPL.Adjusted
To view the first rows of the new object AdPrice, just type head(AdPrice):
head(AdPrice)
## AAPL.Adjusted
## 2020-08-03 108.0465
## 2020-08-04 108.7681
## 2020-08-05 109.1623
## 2020-08-06 112.9709
## 2020-08-07 110.4024
## 2020-08-10 112.0071
R replies by printing the first rows of the AdPrice variable on the screen. We can do more calculations using AdPrice. For example, I will get the natural logarithm of the adjusted prices:
lnAPPL <- log(AdPrice)
R did this calculation for every element in the AdPrice variable and stored the result in a new variable called lnAPPL.
Using the getSymbols function, download monthly data for Apple (AAPL) and Microsoft (MSFT) from 2016 to date.
# Always load the library first
library(quantmod)
getSymbols(c("AAPL", "MSFT"), from="2016-01-01", periodicity="monthly", src="yahoo")
## [1] "AAPL" "MSFT"
Calculate the continuously compounded returns for each stock. Remember that the cc returns can be calculated as the first difference of the log of adjusted prices:
r_AAPL <- diff(log(Ad(AAPL)))
r_MSFT <- diff(log(Ad(MSFT)))
head(r_AAPL, 5)
## AAPL.Adjusted
## 2016-01-01 NA
## 2016-02-01 -0.006699996
## 2016-03-01 0.125157717
## 2016-04-01 -0.150731161
## 2016-05-01 0.063244216
head(r_MSFT, 5)
## MSFT.Adjusted
## 2016-01-01 NA
## 2016-02-01 -0.07949864
## 2016-03-01 0.08919059
## 2016-04-01 -0.10208674
## 2016-05-01 0.06087242
We have NA in the first period since it is not possible to calculate a return for the first period. We can drop the NA values to avoid computation problems later:
r_AAPL <- na.omit(r_AAPL)
r_MSFT <- na.omit(r_MSFT)
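The logic of continuously compounded returns can be checked on a tiny example that does not need an internet connection. The prices below are made up, and the sketch uses a plain numeric vector, for which diff() simply returns one element less instead of a leading NA as with xts objects:

```r
# Made-up adjusted prices for four periods (not real market data)
adj_prices <- c(100, 105, 102, 110)

# Continuously compounded returns: first difference of the log prices
cc_returns <- diff(log(adj_prices))
cc_returns

# Simple returns, for comparison
simple_returns <- adj_prices[-1] / adj_prices[-length(adj_prices)] - 1

# Each cc return equals log(1 + simple return)
all.equal(cc_returns, log(1 + simple_returns))

# cc returns add up over time to the log of the total price growth
all.equal(sum(cc_returns), log(110 / 100))
```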
Run a t-test to check whether the mean return of a stock is different from zero.
To do a hypothesis test, we usually follow these steps:
DEFINE THE VARIABLE OF ANALYSIS.
WRITE THE NULL AND THE ALTERNATIVE HYPOTHESIS.
CALCULATE THE STANDARD ERROR, WHICH IS THE STANDARD DEVIATION OF THE VARIABLE OF STUDY.
CALCULATE THE t-statistic (t-value). EXPLAIN/INTERPRET THE t-statistic.
WRITE YOUR CONCLUSION OF THE t-TEST
We will do these steps for the case of Apple return:
In this case, the variable of analysis is the mean return of Apple. We want to test whether the mean return of Apple is greater than zero. We start believing that the mean return of Apple is greater than zero.
The alternative hypothesis is always our belief, and the null hypothesis is the opposite of our belief. The null hypothesis is usually named H0, while the alternative hypothesis is named Ha. Then we define the hypotheses as follows:
H0: mean(r_AAPL) = 0
Ha: mean(r_AAPL) > 0
In any hypothesis test we assume that the null hypothesis is true. The alternative hypothesis is our belief, which we want to provide evidence for.
In other words, we start out being very skeptical about our belief, so we assume that we are wrong and that the null hypothesis is true.
Then, the purpose of any hypothesis test is to provide strong evidence against the null hypothesis, so we can say with a certain confidence (level of probability) that our alternative hypothesis might be true.
Then, how do we provide this evidence against the null hypothesis?
We start by assuming that H0 is true. Then we collect sample data, calculate the variable of analysis with the data, and calculate the t-statistic of the test. So what is the t-statistic?
The t-statistic or t-value is the standardized distance between the variable of analysis (calculated with the data) and the value stated in the null hypothesis. This standardized distance is measured in number of standard deviations of the variable of analysis.
For the case of this example, I can re-write the definition of the t-statistic as follows:
The t-statistic or t-value is the standardized distance between the mean return of Apple (calculated with historical returns) and zero (the value of the H0). This standardized distance is measured in number of standard deviations of the Apple mean returns.
The standard deviation of the variable of analysis is usually called the standard error of the test.
Then, we need to first calculate the standard error of the test, and then the t-statistic.
The standard error of the test is the standard deviation of the variable of analysis.
For the case of Apple returns, the standard error is the standard deviation of the mean return of Apple. Note that the standard deviation of Apple returns is not the same as the standard deviation of the Apple mean return.
From the Central Limit Theorem we learned that the standard deviation of the mean of a group is much smaller than the standard deviation of the individuals.
In this case, the standard deviation of the Apple mean return (the mean of a group of returns) must be much less than the standard deviation of Apple historical individual returns.
Then, how can we calculate the standard error, which is the standard deviation of the mean of a group? Remember what we learned from the Central Limit Theorem: the standard deviation of the mean of a group is equal to the standard deviation of the individuals divided by the square root of N (N = the number of elements in the group).
Then, for the case of the Apple mean return, the standard error is equal to the standard deviation of Apple historical returns divided by the square root of the number of historical periods. We can manually calculate the standard error as follows:
# I set N equal to the # of rows of the historical return dataset:
N <- nrow(r_AAPL)
# I calculate the standard error:
se_AAPL <- sd(r_AAPL) / sqrt(N)
# Note that sd is a function to calculate the standard deviation of a variable
se_AAPL
## [1] 0.01010589
We got a standard error equal to 0.0101059, which is much less than the original standard deviation of the monthly Apple returns, which is equal to 0.0827202.
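The relation between the standard deviation of individual returns and the standard error of their mean can also be checked with a simulation. This is only a sketch under stated assumptions: the returns are simulated as normal with made-up parameters (mean 0.02, standard deviation 0.08, 67 observations per sample), not taken from the Apple data.

```r
# Simulation settings (illustrative values, not estimated from real data)
set.seed(123)
n_obs  <- 67      # observations in each sample
n_sims <- 5000    # number of simulated samples
sigma  <- 0.08    # standard deviation of individual returns

# Draw many samples and keep the mean of each one
sample_means <- replicate(n_sims, mean(rnorm(n_obs, mean = 0.02, sd = sigma)))

# Standard deviation of the simulated means vs the formula sigma / sqrt(N)
sd_of_means    <- sd(sample_means)
theoretical_se <- sigma / sqrt(n_obs)
sd_of_means
theoretical_se
```

The two numbers should be close to each other, and both are much smaller than sigma.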
We will calculate the t-statistic a) by hand (manually calculated), and b) using the t.test function.
We start with the manual calculation to better understand what the t-statistic is:
Since the t-statistic is the standardized distance between the real value of the variable of analysis and the value stated in the null hypothesis, then:
t_val <- (mean(r_AAPL) - 0) / se_AAPL
t_val
## [1] 2.799825
The numerator of t_val is the distance between the Apple mean return and the hypothetical value of Apple mean return (stated in H0), which is zero. To measure this distance in number of standard deviations of Apple mean returns, we divide this distance by its standard error. By doing this division we get a standardized distance from the actual (real) Apple mean returns and zero, the hypothetical Apple mean return stated in the null hypothesis H0.
Now calculate the t-statistic using the t.test function:
ttest_AAPL <- t.test(as.numeric(r_AAPL), alternative = "greater")
ttest_AAPL
##
## One Sample t-test
##
## data: as.numeric(r_AAPL)
## t = 2.7998, df = 66, p-value = 0.003349
## alternative hypothesis: true mean is greater than 0
## 95 percent confidence interval:
## 0.01143536 Inf
## sample estimates:
## mean of x
## 0.02829471
We can display only the t-value and the corresponding p-value of the test:
cat("t-value from t.test =", ttest_AAPL$statistic,"\n")
## t-value from t.test = 2.799825
cat("p-value = ", ttest_AAPL$p.value)
## p-value = 0.003349462
WE GOT THE SAME t-VALUE USING THE t.test FUNCTION compared with our MANUAL CALCULATION!
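The p-value reported by t.test can be reproduced from the t-statistic with the pt() function, which gives probabilities of the t distribution. Since downloading prices requires an internet connection, this sketch uses simulated returns (r_sim is a hypothetical series with made-up parameters) instead of the Apple data:

```r
# Simulated monthly returns (hypothetical data, for illustration only)
set.seed(42)
r_sim <- rnorm(67, mean = 0.03, sd = 0.08)

# Manual t-statistic for H0: mean = 0
n        <- length(r_sim)
se_sim   <- sd(r_sim) / sqrt(n)
t_manual <- (mean(r_sim) - 0) / se_sim

# One-sided p-value: probability of a t-value at least this large under H0,
# using a t distribution with n - 1 degrees of freedom
p_manual <- pt(t_manual, df = n - 1, lower.tail = FALSE)

# The same test with t.test
tt <- t.test(r_sim, alternative = "greater")

cat("manual t =", t_manual, " t.test t =", tt$statistic, "\n")
cat("manual p =", p_manual, " t.test p =", tt$p.value, "\n")
```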
The t-value of the test is 2.7998249.
As we mentioned, the t-value or t-statistic is a measure of standardized distance between the real actual value of the variable of analysis and the hypothetical value stated in the null hypothesis.
In this case we can say that the real Apple mean return is 2.7998249 standard deviations away from zero, the hypothetical Apple mean return stated in H0.
When the t-value is bigger than 2, we have strong statistical evidence (at least at the 95% confidence level) to reject the null hypothesis H0. Why is this the case? Let’s quickly review the t-Student distribution vs the z normal distribution.
In a t-test we assume that the standardized variable of analysis behaves like a t-Student distribution. The t-Student distribution is a probability distribution that is very similar to the normal probability distribution. The main difference is that the t-Student distribution better models extreme values for small samples, and real financial returns usually have more extreme values compared to the normal distribution.
When the sample size is bigger than 30, the normal z distribution behaves almost the same as the t-Student distribution.
Since the variable of analysis is assumed to behave like a t-Student distribution, a t-statistic greater than 2 means that the hypothetical distribution with mean zero is very far away from the real distribution of the data. If the hypothetical distribution with mean = 0 were true, the probability of getting a mean that is 2 or more standard deviations above the true value would be only about 2.5%! In other words, it would be very unlikely to observe such a mean. Then, we can say that we have strong evidence to reject the null hypothesis when the t-value is 2 or greater!
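The rule of thumb of 2 can be compared with the exact critical values of the t distribution using qt() and pt(). The sketch assumes 66 degrees of freedom, the value reported in the t.test output for Apple; the variable names are illustrative:

```r
# Degrees of freedom taken from the Apple t.test output (N - 1 = 66)
df_test <- 66

# Critical t-value that leaves 2.5% in the upper tail
crit_975 <- qt(0.975, df = df_test)

# Critical t-value that leaves 5% in the upper tail (one-sided test at 95%)
crit_95 <- qt(0.95, df = df_test)

# The corresponding value of the normal z distribution
crit_z <- qnorm(0.975)

# Probability of a t-value of 2 or more when H0 is true
p_above_2 <- pt(2, df = df_test, lower.tail = FALSE)

crit_975
crit_95
crit_z
p_above_2
```

With this many observations the t and z critical values are close to each other, which is consistent with the remark about samples bigger than 30.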
Since the t-value of the test is greater than 2, then we have strong statistical evidence to reject the null hypothesis that states that Apple mean return is zero. Therefore, AAPL mean return is statistically greater than 0.
Another more detailed interpretation using the p-value and the confidence level of the test is the following:
Since the t-value of the mean return of AAPL is greater than 2 and the corresponding p-value is less than 0.05, we can reject the null hypothesis at the 99.6650538% confidence level (1 - p-value). Therefore, the AAPL mean return is statistically greater than 0 with a confidence level of 99.6650538%.
Do a hypothesis test to check whether the Apple mean monthly return is greater than the Microsoft mean monthly return.
We follow the same steps of hypothesis testing we described in the previous example:
In this case, the variable of analysis is the difference of two mean returns. More specifically, the variable of analysis is the difference between the Apple mean return and the Microsoft mean return. If this difference is bigger than zero, then we say that the Apple mean return is greater than the Microsoft mean return.
We start by calculating the mean return of both stocks:
mean_AAPL_r <- mean(r_AAPL)
mean_MSFT_r <- mean(r_MSFT)
print(mean_AAPL_r)
## [1] 0.02829471
print(mean_MSFT_r)
## [1] 0.02686043
Remember that the alternative hypothesis (Ha) is always our belief, and the null hypothesis (H0) is the opposite of our belief. Then we define the hypotheses as follows.
Since the mean return of Apple is higher than the mean return of Microsoft we start believing that Apple is significantly offering higher average monthly returns compared to Microsoft. Then:
H0: mean(r_AAPL) = mean(r_MSFT)
Ha: mean(r_AAPL) > mean(r_MSFT)
However, the null hypothesis always has to be stated with a variable that is equal to a specific value. Remember that our variable of study is the difference of both means. Then, we can re-arrange the equality to leave a number on the right:
H0: mean(r_AAPL) - mean(r_MSFT) = 0
Ha: mean(r_AAPL) - mean(r_MSFT) > 0
We can define our variable of study as meandif:
meandif = mean(r_AAPL) - mean(r_MSFT)
Then, the final setup of the hypotheses is:
H0: meandif = 0
Ha: meandif > 0
In this case, the variable of study of this test is meandif, which is the difference of 2 means. The mean returns of AAPL and MSFT are random variables, so the variable of this test is also a random variable.
To calculate the t-value of this test, we have to estimate the standard error, which is the standard deviation of meandif.
In this case the standard error is the standard deviation of the difference of both means (meandif). Remember that the mean return of each stock is a random variable.
From basic probability theory, if both random variables (Apple mean returns and Microsoft mean returns) are independent, then the variance of the difference of 2 random variables is the SUM of the variances! This sounds counter-intuitive.
WHY IS THE VARIANCE OF THE DIFFERENCE OF 2 RANDOM VARIABLES THE SUM OF THE 2 VARIANCES INSTEAD OF THE DIFFERENCE OF BOTH VARIANCES? DO YOUR OWN RESEARCH AND BRIEFLY EXPLAIN.
Here we do the calculation of the t-value manually in R:
N <- nrow(r_AAPL)
t <- (mean_AAPL_r - mean_MSFT_r - 0) / sqrt( (1/N) * (var(r_AAPL) + var(r_MSFT) ))
t
## AAPL.Adjusted
## AAPL.Adjusted 0.1213646
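Written as a formula, the manual calculation above is the following. The notation is introduced here only for illustration: the barred terms are the sample mean returns, the s-squared terms are the sample variances of each return series, and N is the common number of observations.

```latex
t = \frac{\left(\bar{r}_{AAPL} - \bar{r}_{MSFT}\right) - 0}
         {\sqrt{\dfrac{s^{2}_{AAPL} + s^{2}_{MSFT}}{N}}}
```

The denominator is the standard error of meandif under the independence assumption stated earlier.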
Now we calculate the t value of this test with the t.test function:
ttest <- t.test(as.numeric(r_AAPL), as.numeric(r_MSFT), paired = FALSE, var.equal = FALSE)
ttest
##
## Welch Two Sample t-test
##
## data: as.numeric(r_AAPL) and as.numeric(r_MSFT)
## t = 0.12136, df = 108.74, p-value = 0.9036
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.02198917 0.02485773
## sample estimates:
## mean of x mean of y
## 0.02829471 0.02686043
We can also display only the t-value and the corresponding p-value:
cat("t-value from t.test =", ttest$statistic,"\n")
## t-value from t.test = 0.1213646
cat("p-value = ", ttest$p.value)
## p-value = 0.9036263
We got the same t-value as in the manual calculation.
We got a t-value = 0.1213646. This means that the difference between both mean returns is only 0.1213646 standard errors away from zero. The real distribution of our variable of analysis is therefore not far enough from the hypothetical distribution with difference = 0 to reject the null hypothesis.
Since the t-value of the test is less than 2 and the p-value is greater than 0.05, the null hypothesis (H0) cannot be rejected. Therefore, we conclude that there is no statistically significant difference between the average monthly returns of AAPL and MSFT over this period.
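The t.test call in this example uses the default two-sided alternative, while the hypotheses were stated as one-sided (Ha: meandif > 0). The sketch below, on simulated returns with made-up parameters (x and y are hypothetical series, not the AAPL and MSFT data), reproduces the Welch test by hand and shows how the one-sided p-value could be requested with alternative = "greater":

```r
# Simulated monthly returns for two stocks (hypothetical data)
set.seed(7)
x <- rnorm(67, mean = 0.028, sd = 0.08)
y <- rnorm(67, mean = 0.027, sd = 0.06)

# Variance of each sample mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t-statistic: difference of means over its standard error
t_welch <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided and one-sided p-values computed by hand
p_two_sided <- 2 * pt(abs(t_welch), df = df_welch, lower.tail = FALSE)
p_greater   <- pt(t_welch, df = df_welch, lower.tail = FALSE)

# The same results with t.test
res_two     <- t.test(x, y, paired = FALSE, var.equal = FALSE)
res_greater <- t.test(x, y, paired = FALSE, var.equal = FALSE,
                      alternative = "greater")

cat("two-sided p:", p_two_sided, "vs", res_two$p.value, "\n")
cat("one-sided p:", p_greater, "vs", res_greater$p.value, "\n")
```

When the t-value is positive, the one-sided p-value is half of the two-sided one, so the choice of alternative matters when the result is close to the 0.05 threshold.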
Read carefully: Hypothesis Testing
Read: Basics of Linear Regression Models.
Go to Canvas and respond Quiz 3 about Basics of Return and Risk. You will be able to try this quiz up to 3 times. Questions in this Quiz are related to concepts of the readings related to this Workshop. The grade of this Workshop will be the following: