1 General directions for this Workshop

You will work in RStudio. Create an R Notebook document (File -> New File -> R Notebook), where you have to write whatever is asked in this workshop.

You have to replicate all the steps explained in this workshop, and ALSO you have to do whatever is asked. Any QUESTION or any STEP you need to do will be written in CAPITAL LETTERS. For ANY QUESTION, you have to RESPOND IN CAPITAL LETTERS right after the question.

It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your notebook. Your own workshop/notebook will be very helpful for your further study.

You have to keep saving your .Rmd file, and ONLY SUBMIT the .html version of your .Rmd file. Pay attention in class to know how to generate an html file from your .Rmd.

2 Downloading and visualizing online financial data

  1. Once you have created a new R Notebook, you will see a sample R Notebook document. You must DELETE all the lines of this sample document except the first lines related to title and output. As title, write the workshop # and course, and add a new line with your name. You have to end up with something like:

title: "Workshop 1, Econometric Models"

author: YourName (First and Last names)

output: html_notebook


Now you are ready to continue writing your first R Notebook.

You can start writing your own notes and explanations of what we cover in this workshop. When you need to write lines of R code, click Insert at the top of the RStudio window and select R. A chunk of R code will be set up immediately so you can start writing your R code. You can execute this piece of code by clicking the play button (green triangle).

Note that you can open and edit several R Notebooks, which will appear as tabs at the top of the window. You can visualize the output (results) of your code in the console, located at the bottom of the window. Also, the created variables are listed in the environment, located in the top-right pane. The bottom-right pane shows the files, plots, installed packages, help, and viewer tabs.

We start by clearing our R environment:

rm(list=ls())
# To avoid scientific notation for numbers: 
options(scipen=999)
  1. We will use the quantmod R package to download online real financial data from Yahoo Finance. This package contains the getSymbols() function, which creates an xts (extensible time series) object in the environment with the downloaded data from the Internet. In order to install packages in R, use the install.packages() function:
install.packages("quantmod")
  1. Once you have installed a package, it stays on your computer, so it is not necessary to install it again. However, each time you want to use it in a new session, you have to load the package into memory by using the library() function:
library(quantmod)
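Because install.packages() only needs to run once, a common pattern is to install a package only when it is missing. The sketch below is one possible way to do this; the helper name install_if_missing is an assumption made for illustration and is not part of quantmod or base R, while requireNamespace() and install.packages() are standard R functions.

```r
# Hypothetical helper (the name is an assumption, not an existing R function):
# install a package only when it is not already available.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Return TRUE/FALSE invisibly so the caller can check the result
  invisible(requireNamespace(pkg, quietly = TRUE))
}
```

You could call install_if_missing("quantmod") once before library(quantmod). This may also help when knitting, since running install.packages() inside a document every time it is knitted is a frequent source of errors.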

The getSymbols() function downloads online, up-to-date financial data, such as stock prices, ETF prices, interest rates, and exchange rates. It can download this data from multiple sources, including Yahoo Finance, FRED, Oanda, and Tiingo. These sources carry thousands of financial and economic data series from many market exchanges, as well as macroeconomic variables from around the world.

  1. Type ?function in the console or the R Script and run it to learn more about the syntax of any function. This will display the R documentation of the function in the bottom-right pane. Use this trick to get help on getSymbols.
?getSymbols
  1. Now, we will work with historical data of the market and the stocks we want to analyze from the Healthcare and Technology industries. Using getSymbols(), download the monthly prices of S&P 500 (^GSPC), Pfizer Inc. (PFE), AstraZeneca PLC (AZN), Sanofi (SNY), Novartis AG (NVS), Netflix Inc. (NFLX), Tesla Inc. (TSLA), Apple Inc. (AAPL) and Microsoft Corporation (MSFT), from January 1, 2017 to date from Yahoo Finance:
# We define a vector for the tickers:
tickers<-c("^GSPC","PFE","AZN","SNY","NVS","NFLX","TSLA","AAPL","MSFT")
getSymbols(Symbols=tickers, from="2017-01-01", src="yahoo", periodicity="monthly")
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
## 
## This message is shown once per session and may be disabled by setting 
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
## pausing 1 second between requests for more than 5 symbols
## pausing 1 second between requests for more than 5 symbols
## pausing 1 second between requests for more than 5 symbols
## pausing 1 second between requests for more than 5 symbols
## pausing 1 second between requests for more than 5 symbols
## [1] "^GSPC" "PFE"   "AZN"   "SNY"   "NVS"   "NFLX"  "TSLA"  "AAPL"  "MSFT"

This function will create an xts-zoo R object for each ticker. Each object has the corresponding historical monthly prices. xts stands for extensible time-series. An xts-zoo object is designed to easily manipulate time series data.

In the Symbols argument you can specify more than one ticker by combining them, separated by commas, with the c() function. The from argument indicates the initial date of the data you want to download. The to argument is the end date of the series. In this case we omit the to argument in order to download the most recent data. The src argument indicates the source of the data, in this case Yahoo Finance. Finally, the periodicity argument specifies the granularity of the data (for example daily, weekly, or monthly).
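To illustrate the to argument, which the workshop call omits, here is a minimal sketch of a wrapper around getSymbols(). The wrapper name download_monthly and the example dates are assumptions made for illustration. It also sets auto.assign = FALSE, so that the data is returned as a value instead of being created in the environment.

```r
# Hypothetical wrapper (the name download_monthly is an assumption).
# It returns the xts object instead of creating it in the environment.
download_monthly <- function(ticker, from, to) {
  quantmod::getSymbols(ticker, from = from, to = to, src = "yahoo",
                       periodicity = "monthly", auto.assign = FALSE)
}

# Example call (needs an Internet connection, so it is left commented out):
# pfe <- download_monthly("PFE", from = "2017-01-01", to = "2020-12-31")
```

Returning the object makes it possible to choose your own variable name, which can be convenient for tickers such as ^GSPC whose symbol is not a valid R name.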

  1. You can check the content of a dataset with the View() function, for example View(GSPC). This will take you to a different tab showing the data as a table.

DO THE SAME WITH THE STOCK DATASETS.

We can list the FIRST 5 rows of the S&P500 index by using head() function:

head(GSPC,5)
##            GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
## 2017-01-01   2251.57   2300.99  2245.13    2278.87 70483180000       2278.87
## 2017-02-01   2285.59   2371.54  2271.65    2363.64 69162420000       2363.64
## 2017-03-01   2380.13   2400.98  2322.25    2362.72 81547770000       2362.72
## 2017-04-01   2362.34   2398.16  2328.95    2384.20 65265670000       2384.20
## 2017-05-01   2388.50   2418.71  2352.72    2411.80 79607170000       2411.80

DO THE SAME WITH THE STOCK DATASETS.

Also, you can list the LAST 5 rows of any dataset. Note that you can change the number of rows you want to display.

tail(GSPC, 5)
##            GSPC.Open GSPC.High GSPC.Low GSPC.Close  GSPC.Volume GSPC.Adjusted
## 2020-10-01   3385.87   3549.85  3233.94    3269.96  89737600000       3269.96
## 2020-11-01   3296.20   3645.99  3279.74    3621.63 100977880000       3621.63
## 2020-12-01   3645.87   3760.20  3633.40    3756.07  96056410000       3756.07
## 2021-01-01   3764.61   3870.90  3662.71    3714.24 105548790000       3714.24
## 2021-02-01   3731.17   3894.56  3725.62    3886.83  25430390000       3886.83

DO THE SAME FOR ALL STOCKS.

For each period, Yahoo Finance keeps track of the open, high, low, close (OHLC) and adjusted prices. It also keeps track of the volume traded in each period. Adjusted prices account for dividend payments and stock splits, so they matter for stocks but not for currencies. For a series without dividends or splits, such as a Bitcoin series, we can use either the close or the adjusted price to calculate returns.

Let’s see some of the benefits of using xts-zoo objects. We can, for example, select columns using any of the following functions, where x represents a generic xts-zoo object:

  • Op(x): Extract the Opening prices of the period.
  • Hi(x): Extract the Highest price of the period.
  • Lo(x): Extract the Lowest price of the period.
  • Cl(x): Extract the closing prices of the period.
  • Vo(x): Extract the volume traded of the period.
  • Ad(x): Extract the Adjusted prices of the period.
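As a small illustration of these extractor functions, the sketch below builds an xts object by hand and applies Cl() and Ad() to it. The ticker "XYZ" and all the numbers are made up for the example; they are not downloaded data.

```r
library(quantmod)  # also loads xts and zoo

# Made-up OHLC data for a hypothetical ticker "XYZ" (not real prices)
dates <- as.Date(c("2021-01-01", "2021-02-01", "2021-03-01"))
xyz <- xts(
  cbind(XYZ.Open     = c(10, 11, 12),
        XYZ.High     = c(12, 13, 14),
        XYZ.Low      = c(9, 10, 11),
        XYZ.Close    = c(11, 12, 13),
        XYZ.Volume   = c(1000, 1500, 1200),
        XYZ.Adjusted = c(10.5, 11.5, 12.5)),
  order.by = dates
)

# The extractors pick columns by name (e.g. the one containing "Close")
closing  <- Cl(xyz)
adjusted <- Ad(xyz)
```

The same calls work on the downloaded objects, for example Ad(GSPC).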
  1. Save your file as W1-YourName.Rmd in your computer. Go to the File menu and select Save As.

  2. Visualize how our stocks have been valued over time. Use the following command to display the graph and save it.

plot(GSPC$GSPC.Adjusted)

DO THE SAME FOR OTHER 2 STOCKS.

We can also use another function to better visualize not only price over time, but also the volume traded each month:

chartSeries(GSPC, theme = "white")

DO THE SAME FOR OTHER 2 STOCKS. YOU CAN CHANGE THE COLOR OF THE THEME.

  1. Run all your code up to now by clicking the Knit to HTML button (top-left of the RStudio window). Since this is your first R document, it is likely that you will have an error. Read what the error says and try to solve it. If there is no error, you will see your first R document in HTML with your first R program! An .html file will be saved in your working folder.

3 Data management: merging and cleaning datasets

  1. With xts objects/datasets it is easy to merge them into 1 integrated dataset, so we can apply several functions or calculations to only one dataset instead of doing the same process to each dataset. We use the merge function to do this:
prices<-merge(GSPC,AAPL,AZN,MSFT,NFLX,NVS,PFE,SNY,TSLA)
  1. To calculate returns we MUST use adjusted prices to consider any historical stock split or dividend payments of the stocks. We can apply the Ad function to the integrated dataset:
adjprices<-Ad(prices)
  1. We can rename the columns with the ticker names:
colnames(adjprices) <- c("GSPC","AAPL","AZN","MSFT","NFLX","NVS","PFE","SNY","TSLA")
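Typing the ticker names by hand works, but the order must match the columns exactly. As an alternative sketch, the names can be derived from the existing column names by removing the ".Adjusted" suffix with sub(). The vector below is an illustrative stand-in for colnames(adjprices), not downloaded data.

```r
# Column names in the style left by Ad() (illustrative vector)
old_names <- c("GSPC.Adjusted", "AAPL.Adjusted", "AZN.Adjusted")

# Remove the ".Adjusted" suffix; "\\." matches a literal dot, "$" the end
new_names <- sub("\\.Adjusted$", "", old_names)
print(new_names)
```

With the workshop data, the equivalent call would be colnames(adjprices) <- sub("\\.Adjusted$", "", colnames(adjprices)).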
  1. In Finance, when managing daily data it is very common to have gaps in the series. What does this mean? It means that the series contains missing data in some periods. For example, for stock series there is no data for weekends or holidays. R does not deal automatically with missing values (called NA’s). It is a good idea to have a data set free of NA’s. We can use the function na.omit to drop the periods with NA values:
adjprices<-na.omit(adjprices) 
View(adjprices)
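To see how NA values can appear after a merge and how na.omit removes them, here is a minimal sketch with two made-up series (the names a and b and their values are assumptions for illustration). Series b has no observation for February, so the merged dataset gets an NA in that row.

```r
library(xts)

# Two made-up monthly series; b is missing the February observation
a <- xts(c(1, 2, 3),
         order.by = as.Date(c("2021-01-01", "2021-02-01", "2021-03-01")))
b <- xts(c(10, 30),
         order.by = as.Date(c("2021-01-01", "2021-03-01")))

m <- merge(a, b)      # keeps all dates; b is NA on 2021-02-01
clean <- na.omit(m)   # drops every row that contains at least one NA
```

Note that na.omit drops the whole row, so one missing value in a single series removes that period for all series in the merged dataset.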

4 Return calculation

A financial simple return for a stock (\(R_{t}\)) is calculated as a percentage change of price from the previous period (t-1) to the present period (t):

\[ R_{t}=\frac{\left(Adjprice_{t}-Adjprice_{t-1}\right)}{Adjprice_{t-1}}=\frac{Adjprice_{t}}{Adjprice_{t-1}}-1 \]

For example, if the adjusted price of a stock at the end of January 2021 was $100.00, and its previous (December 2020) adjusted price was $80.00, then the monthly simple return of the stock in January 2021 will be:

\[ R_{Jan2021}=\frac{Adjprice_{Jan2021}}{Adjprice_{Dec2020}}-1=\frac{100}{80}-1=0.25 \]

We can use returns in decimal or in percentage (multiplying by 100). We will keep using decimals.

In Finance it is highly recommended to calculate continuously compounded returns (cc returns) and to use cc returns instead of simple returns for data analysis, statistics and econometric models.

One way to calculate cc returns is by subtracting the log of the previous adjusted price (at t-1) from the log of the current adjusted price (at t):

\[ r_{t}=log(Adjprice_{t})-log(Adjprice_{t-1}) \]

This is also called the difference of the log of the price.

We can also calculate cc returns as the log of the current adjusted price (at t) divided by the previous adjusted price (at t-1):

\[ r_{t}=log\left(\frac{Adjprice_{t}}{Adjprice_{t-1}}\right) \]

cc returns are usually represented by small r, while simple returns are represented by capital R.
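The two kinds of return can be checked numerically with the same prices used in the example above ($80.00 and $100.00). This is only a sketch with plain numbers; the variable names are chosen for illustration.

```r
p_prev <- 80    # adjusted price in December 2020 (from the example above)
p_now  <- 100   # adjusted price in January 2021

R_simple <- p_now / p_prev - 1          # simple return
r_cc     <- log(p_now) - log(p_prev)    # cc return, first formula
r_cc_2   <- log(p_now / p_prev)         # cc return, second formula

# The two returns are linked by r = log(1 + R) and R = exp(r) - 1
r_from_R <- log(1 + R_simple)
R_from_r <- exp(r_cc) - 1
```

Both cc formulas give the same number, and the cc return is somewhat smaller than the simple return of 0.25 because log(1 + R) is less than R for any positive R.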

  1. Calculate historical monthly simple and cc returns for each stock and the S&P500.

We have historical monthly adjusted prices for each stock. We will start by calculating returns for the S&P 500 index. We can use the log function to calculate the natural logarithm, and the lag function to get the previous value of the adjusted prices. Let’s calculate the monthly simple and cc returns for the S&P 500 in a new dataset:

# We calculate the simple returns of the S&P 500.
# For xts objects, the lag() argument for the number of periods is k:
SP500_R = adjprices$GSPC / lag(adjprices$GSPC,k=1) - 1
# We calculate the cc returns of the S&P 500
SP500_r = log(adjprices$GSPC) - log(lag(adjprices$GSPC,k=1) )
# We can also do the same calculation of cc returns using the
#  second formula:
SP500_r_2 = log(adjprices$GSPC/lag(adjprices$GSPC,k=1) )

We calculated cc returns using two formulas and we got exactly the same result. The first formula gets the first difference of the log of the price. The first difference refers to the current value of a time series minus its value of the previous period.

We can simplify the calculation of cc returns by using the function diff(), which calculates the first difference of any time series xts dataset.

SP500_r <- na.omit(diff(log(adjprices$GSPC)))

As you see, we also apply the na.omit function to drop the first row, since the return for the first month cannot be calculated (it gets an NA value).
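The behavior of lag(), diff() and na.omit() can be checked on a tiny made-up series. The three prices below are assumptions for illustration, not market data.

```r
library(xts)

# Made-up adjusted prices for three months
prices <- xts(c(100, 110, 99),
              order.by = as.Date(c("2021-01-01", "2021-02-01", "2021-03-01")))

prev     <- lag(prices, k = 1)          # previous price; first row is NA
r_manual <- log(prices) - log(prev)     # cc returns with lag()
r_diff   <- diff(log(prices))           # same cc returns with diff()
r_clean  <- na.omit(r_diff)             # drop the first (NA) row
```

Both approaches leave an NA in the first row, which is why the workshop wraps the calculation in na.omit().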

  1. Now we can see the power of doing effective financial data management in R. If we apply the same calculations we did before to the integrated dataset, let’s see what happens:
# We calculate cc returns for the integrated adjprices dataset:
ccr = na.omit(diff(log(adjprices)))
View(ccr)
# We calculate simple returns for the integrated dataset:
R = adjprices / lag(adjprices,k=1) - 1
  1. Visualize the monthly returns over time
plot(ccr$GSPC)

Since all returns are integrated in one dataset, we can also see all returns in one plot:

plot(ccr)

5 Descriptive Statistics

  1. Calculate the mean, standard deviation and variance of continuously compounded (cc) monthly returns using the summary command (Do the same with all stocks):
summary(ccr$GSPC)
##      Index                 GSPC           
##  Min.   :2017-02-01   Min.   :-0.1336677  
##  1st Qu.:2018-02-01   1st Qu.:-0.0003893  
##  Median :2019-02-01   Median : 0.0177655  
##  Mean   :2019-01-30   Mean   : 0.0108962  
##  3rd Qu.:2020-02-01   3rd Qu.: 0.0353880  
##  Max.   :2021-02-01   Max.   : 0.1194208

As you can see, summary() does not show standard deviation or variance. You can also try the table.Stats() function. However, you must install and load the PerformanceAnalytics package first, since table.Stats() belongs to that package.

install.packages("PerformanceAnalytics")
library(PerformanceAnalytics)
table.Stats(ccr$GSPC)
##                    GSPC
## Observations    49.0000
## NAs              0.0000
## Minimum         -0.1337
## Quartile 1      -0.0004
## Median           0.0178
## Arithmetic Mean  0.0109
## Geometric Mean   0.0098
## Quartile 3       0.0354
## Maximum          0.1194
## SE Mean          0.0067
## LCL Mean (0.95) -0.0027
## UCL Mean (0.95)  0.0245
## Variance         0.0022
## Stdev            0.0472
## Skewness        -0.7343
## Kurtosis         1.3865

This function calculates the most common measures of central tendency and dispersion. As central tendency it calculates median, arithmetic and geometric mean. As dispersion measures it calculates the minimum, the maximum values, quartile 1, quartile 3, standard deviation, and variance.
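Some of these measures can also be computed one by one with base R functions, which helps to see how they relate to each other. The sketch below uses a short made-up vector of returns (the values are assumptions for illustration); with the workshop data you would use as.numeric(ccr$GSPC) instead.

```r
# Made-up monthly cc returns (illustrative values only)
x <- c(0.02, -0.01, 0.03, 0.00, 0.01)

n        <- length(x)
mean_x   <- mean(x)             # arithmetic mean
var_x    <- var(x)              # sample variance (divides by n - 1)
sd_x     <- sd(x)               # standard deviation = sqrt(variance)
se_mean  <- sd_x / sqrt(n)      # standard error of the mean ("SE Mean")
```

The standard error of the mean computed here is the same quantity used later for the t-test.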

  1. Do a histogram of returns (Do the same with all stocks).
hist(ccr$GSPC, main="Histogram of S&P 500 monthly returns", 
     xlab="Continuously Compounded returns", col="dark green")

  1. With the descriptive statistics you computed and with this histogram, WRITE A MEANINGFUL INTERPRETATION ABOUT THE MAIN CENTRAL TENDENCY AND DISPERSION MEASURES, AND ALSO INTERPRET THE HISTOGRAM OF CONTINUOUSLY COMPOUNDED RETURNS. You can read the note Basic Statistics for Finance before you write your own interpretation.

6 Introduction to Hypothesis testing

  1. Select 2 stocks you want to further analyze. For each stock, run a t-test to check whether the average monthly return over time is significantly different from zero. You have to do the calculations MANUALLY and then use the t.test function in R. You have to INTERPRET your results. As comments in your code, explain your calculations.

You have to:

  1. WRITE THE NULL AND THE ALTERNATIVE HYPOTHESIS

  2. Calculate the Standard error, which is the standard deviation of the MEAN of returns.

  3. Calculate the t-statistic. EXPLAIN/INTERPRET THE VALUE OF t YOU GOT.

  4. WRITE YOUR CONCLUSION OF THE t-TEST

Here is an example of a t-test to check whether the average monthly return of the S&P 500 is significantly greater than zero:

# a)
# H0: mean(ccr$GSPC) = 0
# Ha: mean(ccr$GSPC) > 0

# b)
se_GSPC.r <- sd(ccr$GSPC) / sqrt(nrow(ccr$GSPC) )
print(paste("Standard error S&P 500 =" , se_GSPC.r))
## [1] "Standard error S&P 500 = 0.00674725047827303"
# c)
t_GSPC.r <- (mean(ccr$GSPC) - 0) / se_GSPC.r
print(paste("t-value S&P 500 = ", t_GSPC.r))
## [1] "t-value S&P 500 =  1.61491080702246"

Since the t-value of the mean return of the S&P 500 (1.61) is lower than 2, I can’t reject the null hypothesis. Therefore, the S&P 500 mean monthly return is not statistically greater than 0 at the 95% confidence level.
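The value 2 is a rule of thumb. As a sketch, the exact one-sided critical value and p-value can be obtained from the t distribution with qt() and pt(). The inputs below are taken from this example: the t-value calculated above and 49 monthly observations, which gives the 48 degrees of freedom that t.test reports. The variable names are chosen for illustration.

```r
t_value  <- 1.614911   # t-statistic from the manual calculation above
deg_free <- 49 - 1     # 49 monthly observations minus 1

# Critical value for a one-sided test at the 95% confidence level
crit_one_sided <- qt(0.95, df = deg_free)

# One-sided p-value: probability of a t at least this large under H0
p_one_sided <- pt(t_value, df = deg_free, lower.tail = FALSE)
```

If t_value is below crit_one_sided, or equivalently if p_one_sided is above 0.05, the null hypothesis is not rejected.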

  1. Run the t-test using the t.test function.
  1. DID YOU GET THE SAME RESULT? BRIEFLY EXPLAIN
ttest_GSPC.r <- t.test(as.numeric(ccr$GSPC), alternative = "greater")
ttest_GSPC.r
## 
##  One Sample t-test
## 
## data:  as.numeric(ccr$GSPC)
## t = 1.6149, df = 48, p-value = 0.05644
## alternative hypothesis: true mean is greater than 0
## 95 percent confidence interval:
##  -0.000420444          Inf
## sample estimates:
##  mean of x 
## 0.01089621
print(ttest_GSPC.r$statistic)
##        t 
## 1.614911
print(ttest_GSPC.r$statistic == t_GSPC.r)
##    t 
## TRUE
# I got the same result
  1. Run a t-test to compare whether the average monthly returns of two stocks over time are equal. Do this using the t.test function and also manually. INTERPRET your result with your own words (use comments and CAPITAL LETTERS for your answer).

Here is an example with PFE and AZN:

We start by calculating the mean of returns for both stocks:

mean_PFE.r <- mean (ccr$PFE)
mean_AZN.r <- mean (ccr$AZN)

print(mean_PFE.r)
## [1] 0.006314097
print(mean_AZN.r)
## [1] 0.01532175

Now we set up the hypotheses. Since the mean return of AstraZeneca is higher than the mean return of Pfizer, our initial belief is that AstraZeneca offers significantly higher average monthly returns than Pfizer. The alternative hypothesis will be this initial belief:

#'  H0: mean(ccr$AZN) = mean(ccr$PFE)
#'  Ha: mean(ccr$AZN) > mean(ccr$PFE)

#' I have to re-arrange the equality to leave a number to the right:

#'   H0: mean(ccr$AZN) - mean(ccr$PFE) = 0
#'   Ha: mean(ccr$AZN) - mean(ccr$PFE) > 0

# We can set up this hypotheses as follows:

# meandif = mean(ccr$AZN) - mean(ccr$PFE)

#    H0: meandif = 0
#    Ha: meandif > 0

In this case, the random variable of this test is meandif, which is the difference of 2 means. The mean returns of PFE and AZN are random variables, so the variable of this test is the difference of 2 random variables. To calculate the t value of this test, we have to know how to estimate the standard deviation of meandif, which is the difference of 2 random variables.

From basic probability theory, if both random variables are independent, then the variance of the difference of 2 random variables is the SUM of the variances! This sounds counter-intuitive. WHY IS THE VARIANCE OF THE DIFFERENCE OF 2 RANDOM VARIABLES THE SUM OF THE 2 VARIANCES INSTEAD OF THE DIFFERENCE OF BOTH VARIANCES? DO YOUR OWN RESEARCH AND BRIEFLY EXPLAIN.

Here we do the calculation of t-value manually in R:

#'  t = (mean(ccr$AZN) - mean(ccr$PFE) - 0) / sqrt( (1/N)* (Var(ccr$AZN) + Var(ccr$PFE)) )

N <- nrow(ccr$AZN)
t <- (mean_AZN.r - mean_PFE.r - 0) / sqrt( (1/N) * (var(ccr$AZN) + var(ccr$PFE) ))

cat("t-value = ", t)
## t-value =  0.712505

Now we do the t-test using the t.test function and check whether we got the same calculation for t:

ttest <- t.test(as.numeric(ccr$AZN), as.numeric(ccr$PFE), paired = FALSE, var.equal = FALSE)
ttest
## 
##  Welch Two Sample t-test
## 
## data:  as.numeric(ccr$AZN) and as.numeric(ccr$PFE)
## t = 0.71251, df = 95.73, p-value = 0.4779
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01608788  0.03410319
## sample estimates:
##   mean of x   mean of y 
## 0.015321750 0.006314097
cat("t-value from t.test =", ttest$statistic)
## t-value from t.test = 0.712505
cat("p-value = ", ttest$p.value)
## p-value =  0.4778851

We got the same t-value with the t.test function and our manual calculation.
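The agreement between the manual formula and t.test can be reproduced with simulated data. The sketch below uses made-up returns generated with rnorm(); the means, standard deviation and seed are assumptions for illustration and do not come from the workshop data.

```r
set.seed(123)  # arbitrary seed so the simulated numbers are reproducible

# Simulated monthly returns for two hypothetical stocks (49 months each)
x <- rnorm(49, mean = 0.015, sd = 0.06)
y <- rnorm(49, mean = 0.006, sd = 0.06)

# Manual t-value: difference of means over its standard error
n <- length(x)
t_manual <- (mean(x) - mean(y) - 0) / sqrt((1 / n) * (var(x) + var(y)))

# Same test with t.test (Welch version, unequal variances)
tt <- t.test(x, y, paired = FALSE, var.equal = FALSE)
same_t <- isTRUE(all.equal(t_manual, unname(tt$statistic)))
```

The manual formula with a single N matches t.test here because both samples have the same number of observations; with different sample sizes each variance would be divided by its own N.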

Conclusion of the test: Since the absolute value of t (0.71) is much less than 2 and the p-value of the (two-sided) test is 0.48, we do NOT have statistical evidence to reject the null hypothesis. In other words, we do not have enough statistical evidence at the 95% confidence level to say that the average monthly return of AstraZeneca is significantly higher than the average monthly return of Pfizer. In this sample the average monthly return of AstraZeneca is higher than that of Pfizer, but the difference is small relative to its standard error, so it could easily be due to chance.