For this assignment we practice data transformation using a dataset of the daily prices and daily trading volumes of a group of large technology stocks that trade on US stock exchanges. Click here to download stocks.csv, which contains data going back to 2000. The dataset contains several variables, including:
symbol: the ticker symbol for the stock
date: the trading date
open, high, low, and close: the opening price, the high price during the day, the low price during the day, and the closing price (in USD)
adjusted: the closing price adjusted for the financial effects of special events such as dividends (in USD)
volume: the number of shares traded during a given trading day.
All of the stock prices and trading volumes have been adjusted for stock splits, so the data provide a continuous record of how prices and trading volume changed.
There are several important functions and packages that you will need to use to complete this exercise.
We will use the TTR package (which is part of the tidyquant family of packages, see here). The main function we will use is runMean. An alternative package that is also very nice, but not part of the tidyverse, is RcppRoll.
If you aren’t very comfortable with logarithms, you should read more about them. They are one of the most important mathematical functions for data science. We aren’t using their mathematical properties much this week, but they will be important throughout your data science journey. Khan Academy has a decent video, and this article in the journal Nature has some more context.
We will calculate some correlation coefficients, using the cor function from base R (?cor to see how it is used). There is also a tidyverse package called corrr that is useful for calculating correlations on data frames, but we won’t use it for this lab.
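For example (the vectors x and y below are made up purely for illustration), cor returns NA if either input contains missing values unless you tell it how to handle them:

x <- c(1, 2, NA, 4)
y <- c(2, 4, 6, 8)
cor(x, y)                       # returns NA because x has a missing value
cor(x, y, use = "complete.obs") # drops the incomplete pair before computing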
The motivation for today’s assignment came from some news articles a few years ago about how big tech stocks collectively had a miniature meltdown after powering the stock market for several consecutive years; see this article at Morningstar.
Problem 1: The price of a stock on a given day only conveys information in relation to the stock price on other days. One useful measure is the daily return of the stock, which we will define as the ratio of the adjusted closing price on the current day of trading to the adjusted closing price on the previous day of trading. Read the following article: window functions in dplyr.
Find a function there that will help you calculate the daily return and use it along with mutate to add a return column to the data frame containing the daily return.
Hint: make sure to use group_by(symbol); otherwise your calculation might pull in prices from a different stock at the beginning of each time series.
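A minimal sketch of one way this could look, assuming stocks has been read in from stocks.csv and lag() is the window function chosen:

library(tidyverse)

stocks <- stocks |>
  group_by(symbol) |>
  arrange(date) |>
  mutate(return = adjusted / lag(adjusted)) |>  # today's adjusted close over yesterday's
  ungroup()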
Differences between the adjusted return and the return measured on the close price should indicate special corporate events such as dividends.
Calculate the unadjusted return using the same technique you used to calculate the return, but replacing the adjusted variable with the close variable, and find the datapoint in the dataset where the return exceeded the unadjusted return by the greatest margin. (Hint to check you have done it right: it happened in November 2004.) The close price and the adjusted price differ because stock prices typically decrease when a dividend is paid (to account for the cash paid out). The adjusted series has been modified throughout the data record to compensate for dividends. A dividend is just a payment that a company makes periodically to those who hold its stock.
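A sketch of one way to produce the row printed below, reusing the same lag() approach (the use of slice_max is my own choice; any way of picking out the largest margin would do):

stocks |>
  group_by(symbol) |>
  arrange(date) |>
  mutate(unadjusted_return = close / lag(close),
         margin = return - unadjusted_return) |>
  ungroup() |>
  slice_max(margin, n = 1)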
# A tibble: 1 × 11
symbol date open high low close volume adjusted return
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 MSFT 2004-11-15 27.3 27.5 27.2 27.4 104468000 19.0 1.02
# ℹ 2 more variables: unadjusted_return <dbl>, margin <dbl>
If you are curious: look for an old news article describing the significance of that event and tell me what happened.
Problem 2: When working with stock price fluctuations or other processes where a quantity increases or decreases according to some multiplicative process such as a growth rate (for example, population growth), it is often better to work with the log of the growth rate rather than the growth rate itself. This allows standard summary statistics such as the mean to have a useful interpretation (otherwise you would have to use the geometric mean). Furthermore, the log transform is often useful for variables that are strictly positive, such as population growth rates or daily stock returns. To see why, consider a hypothetical stock which had a return of 0.5 (a 50% loss) on one day and 1.8 (an 80% gain) on the next day. The mean of these two returns would be 1.15, or +15% per day. However, at the end of the two-day period the stock would have lost 10% of its value (0.5*1.8 = 0.9). If we had computed the mean of log(return) instead, we would have found that (log(0.5)+log(1.8))/2 = log(0.9^(1/2)), or approximately -5.3% per day, matching the observed price change.
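A quick check of that arithmetic in R, using the two hypothetical returns from the example:

returns <- c(0.5, 1.8)
mean(returns)           # 1.15: the arithmetic mean suggests +15% per day
prod(returns)           # 0.9: but the stock actually lost 10% over the two days
mean(log(returns))      # about -0.053, i.e. roughly -5.3% per day in log terms
exp(mean(log(returns))) # about 0.949 = sqrt(0.9), matching the observed change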
Create a new variable called log_return which is the log of the return variable you calculated in the previous problem. Generate either a histogram or density plot of the distribution of log_return for the entire dataset. Then create a QQ-plot of log_return using geom_qq() and geom_qq_line(). What do you notice about the “tails” (right and left side/extreme edges) of the distribution from the QQ-plot? Are there visible signs of this in the density plot/histogram that you made?
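A minimal sketch of one way to do this (note that geom_qq() and geom_qq_line() map the variable to the sample aesthetic; the plot labels are my own choices):

stocks <- stocks |>
  mutate(log_return = log(return))

ggplot(stocks, aes(x = log_return)) +
  geom_density() +
  labs(title = "Density of daily log returns")

ggplot(stocks, aes(sample = log_return)) +
  geom_qq() +
  geom_qq_line() +
  labs(title = "QQ-plot of daily log returns")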
Warning: Removed 14 rows containing non-finite outside the scale range
(`stat_qq()`).
Warning: Removed 14 rows containing non-finite outside the scale range
(`stat_qq_line()`).
With the QQ-plot, we are checking how closely log_return follows a normal distribution. The points in both tails fall far from the reference line, which indicates that the extremes of the distribution are not well described by a normal distribution. There are visible signs of this in the density plot as well: the long ‘tails’ at the low and high ends of the x-axis. In both plots we can see the same signs of outliers far from the median log_return values.
Problem 3: Volume measures how many shares of a given stock were traded over a set time period, and high-volume days are often associated with important events or market dynamics.
Make a scatter plot of volume versus log_return, faceted by symbol to account for the fact that different stocks have different trading volumes. Do you see an association between volume and log_return in these scatter plots?
stocks |>
  group_by(symbol) |>
  ggplot(aes(x = log_return, y = volume)) +
  geom_point() +
  facet_wrap(~ symbol, scales = "fixed") +
  labs(title = "Scatterplot of Volume vs log_return by Symbol",
       subtitle = "Fixed scales")
Warning: Removed 14 rows containing missing values or values outside the scale range
(`geom_point()`).
stocks |>
  group_by(symbol) |>
  ggplot(aes(x = log_return, y = volume)) +
  geom_point() +
  facet_wrap(~ symbol, scales = "free") +
  labs(title = "Scatterplot of Volume vs log_return by Symbol",
       subtitle = "Free scales")
Warning: Removed 14 rows containing missing values or values outside the scale range
(`geom_point()`).
It’s hard to say whether there is a visible association here. I viewed these faceted scatterplots with both “free” and “fixed” scales to see if an association was easier to spot with one or the other; it wasn’t. To my untrained eye, volume does not look like an indicator of log_return or vice versa: I see large volume with both low and high log_return, and low volume with both low and high log_return.
Use the cor function to compute the Pearson correlation coefficient between volume and log_return for each symbol. Why do you think the correlations are close to 0?
Hint: use it with summarize, and don’t forget that cor is a base R function, so you will either need to filter NA values for volume and log_return or appropriately set the use argument; see ?cor for more info.
stocks |>
  group_by(symbol) |>
  summarise(pearsonscorr = cor(volume, log_return,
                               method = "pearson", use = "complete.obs"))
I think the correlations are close to zero for the same reason that we didn’t see much association in the scatterplots above: large positive and large negative values of log_return both occur across the range of volumes, so there is little linear relationship between volume and the signed log_return for Pearson’s correlation to pick up.
Next compute the correlation in the same manner but this time transform log_return using the absolute value function. Recreate the faceted scatter-plots from the first part of the problem but with the absolute-value transformed log_return. How have the correlations changed from the previous summary?
stocks |>
  group_by(symbol) |>
  summarise(pearsonscorr = cor(volume, abs(log_return),
                               method = "pearson", use = "complete.obs"))
stocks |>
  group_by(symbol) |>
  ggplot(aes(x = abs(log_return), y = volume)) +
  geom_point() +
  facet_wrap(~ symbol, scales = "fixed") +
  labs(title = "Scatterplot of Volume vs absolute value of log_return by Symbol",
       subtitle = "Fixed scales")
Warning: Removed 14 rows containing missing values or values outside the scale range
(`geom_point()`).
stocks |>
  group_by(symbol) |>
  ggplot(aes(x = abs(log_return), y = volume)) +
  geom_point() +
  facet_wrap(~ symbol, scales = "free") +
  labs(title = "Scatterplot of Volume vs absolute value of log_return by Symbol",
       subtitle = "Free scales")
Warning: Removed 14 rows containing missing values or values outside the scale range
(`geom_point()`).
There is more of a visible association between volume and the absolute value of log_return than there was between volume and log_return itself in the prior question. The correlations are no longer close to zero: both the Pearson correlation coefficients and the faceted scatterplots above show a clearer association between volume and the size of the daily move.
Problem 4: For this problem we will implement a more complicated mathematical transformation of data by calculating a measure of liquidity for each stock.
Liquidity is defined loosely as the ability of a given asset to be bought or sold without a large impact on its price. Liquid assets can be bought and sold quickly and easily, whereas illiquid assets show large increases or decreases in price when someone tries to buy or sell them in large quantities. Liquidity is considered an important property of a well-functioning financial market, and declines in liquidity have been blamed for worsening or triggering stock market crashes.
Many methods have been invented to measure liquidity, but for this problem we will focus on a method called “Kyle’s \(\lambda\)”. Kyle’s \(\lambda\) estimates liquidity by using a linear regression between the absolute value of the daily return of a stock and the logarithm of the dollar volume of that stock. The time periods used to estimate this regression can vary, but here we will use daily returns and a one-month time period (defined as 20 trading days). You will learn a lot about linear models in DATA 606 and other classes, but to be complete, \(\lambda\) is a coefficient in the following linear model: \[
|R_t-1| = c + \lambda \log((\mathrm{Volume})_t (\mathrm{close})_t) + \epsilon_t
\] where the coefficients \(c\) and \(\lambda\) will be calculated to minimize the error \(\epsilon_t\) over the past 20 trading days.
\(\lambda\) stands for the amount that the stock price will move in units of basis points for a given \(\log\) dollar volume of trade. A small \(\lambda\) indicates high liquidity, and a high \(\lambda\) indicates low liquidity.
\(\lambda\) can be calculated using rolling averages on the time series data with the TTR package, specifically the function runMean, which when used within a dplyr pipeline will calculate the mean over the past \(n\) data points. For example, a command along the lines of:
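stocks |>
  group_by(symbol) |>
  mutate(mean_log_return = runMean(log_return, n = 20))  # 20-day rolling mean; the new variable name is illustrative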
adds a new variable which is equal to the mean of the log_return over the past 20 days. The mathematical formula for \(\lambda\) is: \[
\lambda = \frac{\mathrm{mean}\left(R_a \log(p_c V)\right)
- \mathrm{mean}\left(R_a\right) \mathrm{mean}\left(\log(p_c V)\right)}
{\mathrm{mean}\left(\log(p_c V)^2\right)
- \mathrm{mean}\left(\log(p_c V)\right)^2}
\] where to make the formula easier to read we have defined \(R_a = |\mathrm{return} -1|\), \(p_c = \mathrm{close}\) and \(V = \mathrm{volume}\), and the averages have been taken over the past 20 days of data.
Add a new variable called kyle to the data frame by implementing the above formula for \(\lambda\). Make sure to read and implement the formula very carefully, and to use the runMean function to calculate the rolling average correctly.
stocks <- stocks |>
  group_by(symbol) |>
  arrange(date) |>
  mutate(
    abs_return = abs(return - 1),
    log_close_x_volume = log(close * volume)
  ) |>
  mutate(
    numerator_left = runMean(abs_return * log_close_x_volume, n = 20),
    numerator_right = runMean(abs_return, n = 20) * runMean(log_close_x_volume, n = 20),
    denominator_left = runMean(log_close_x_volume^2, n = 20),
    denominator_right = runMean(log_close_x_volume, n = 20)^2
  ) |>
  mutate(
    kyle = (numerator_left - numerator_right) / (denominator_left - denominator_right)
  ) |>
  ungroup()
Plot Kyle’s lambda for each stock over time (I would use a faceted scatterplot). What do you notice about how this measure of liquidity behaves (remember liquidity is high when \(\lambda\) is small)?
stocks |>
  group_by(symbol) |>
  ggplot(aes(x = date, y = kyle)) +
  geom_point() +
  facet_wrap(~ symbol, scales = "fixed") +
  labs(title = "Scatterplot of Kyle's lambda over time by Symbol",
       subtitle = "Fixed scales")
Warning: Removed 280 rows containing missing values or values outside the scale range
(`geom_point()`).
stocks |>
  group_by(symbol) |>
  ggplot(aes(x = date, y = kyle)) +
  geom_point() +
  facet_wrap(~ symbol, scales = "free") +
  labs(title = "Scatterplot of Kyle's lambda over time by Symbol",
       subtitle = "Free scales")
Warning: Removed 280 rows containing missing values or values outside the scale range
(`geom_point()`).
To be honest, I’m not sure exactly what I’m supposed to see here, since I’m not very familiar with viewing stocks this way. What I do see is a lot of variability in this measure of liquidity. It doesn’t seem to move in a particular direction over time so much as fluctuate.
Next add a new variable to the dataframe called extreme which is true when the log_return for a given stock is either greater than 95% of the other values of log_return or less than 95% of all values of log_return. Use the percent_rank dplyr window function along with logical operators to create this variable. Then for each stock calculate the mean value of Kyle’s lambda for the days when the log_return had extreme values and for the days when it didn’t (as identified by the extreme variable). What do your calculations and figures indicate about liquidity during extreme events?
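A sketch of one way this could be set up, assuming percent_rank is computed within each symbol and using the mean_kyle_extreme / mean_kyle_not_extreme names referred to below:

stocks <- stocks |>
  group_by(symbol) |>
  mutate(extreme = percent_rank(log_return) > 0.95 | percent_rank(log_return) < 0.05) |>
  ungroup()

stocks |>
  group_by(symbol) |>
  summarise(
    mean_kyle_extreme = mean(kyle[extreme], na.rm = TRUE),
    mean_kyle_not_extreme = mean(kyle[!extreme], na.rm = TRUE)
  )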
Based on these calculations, Kyle’s lambda is larger during extreme events than during non-extreme events: for every stock symbol in the stocks dataset, mean_kyle_extreme is larger than mean_kyle_not_extreme (above). Since a small lambda indicates high liquidity and a large lambda indicates low liquidity, this suggests that liquidity is lower during extreme events.