This document gives examples of common data operations needed for empirical finance.
Ideally, we should start our code by specifying the working directory, i.e. the folder R will look in when it tries to read and save files.
The easiest way is to specify the folder in which we saved the script. This way we can move the folder or save the file in the cloud to work from different machines.
setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) ##set work directory
Then, load all the libraries that you will use in the code. We will use
library(readr)
library(data.table)
library(lubridate)
library(ggplot2)
library(zoo)
library(dplyr)
If you get an error, install the packages first by using e.g. install.packages(‘readr’).
Ideally you save your data as a csv (using Excel, save as), where the first row contains the names of the columns. Put it in the same folder or a sub-folder as the script.
For our purpose here, let’s generate data:
set.seed(123)
date <- seq(from = as.Date('2019-01-01'), to = as.Date('2019-12-31'), by = 'days')
date <- date[wday(date) %in% 2:6]
T. <- length(date)
Mkt <- data.table(
date = date,
Mkt_return = rnorm(T., 0.02/T., 0.1/(T.**0.5))
)
asset_prices_and_volumes <- data.table(
date = as.character(date),
A1.Price = exp(log(100) + cumsum(0.8 * Mkt$Mkt_return + rnorm(T., 0, 0.1/(T.**0.5)))) %>% round(2),
A1.Volume = exp(rnorm(T., 10, 1)) %>% floor,
A2.Price = exp(log(50) + cumsum(0.4 * Mkt$Mkt_return + rnorm(T., 0, 0.15/(T.**0.5)))) %>% round(2),
A2.Volume = exp(rnorm(T., 10, 1)) %>% floor,
A3.Price = exp(log(120) + cumsum(1.5 * Mkt$Mkt_return + rnorm(T., 0, 0.2/(T.**0.5)))) %>% round(2),
A3.Volume = exp(rnorm(T., 10, 1)) %>% floor
)
Let’s assume we loaded a dataset and named it asset_prices_and_volumes. To get to know how our data actually look like, we can use
View()str()head()tail()summary()If the dataset is a tibble or a data.table, we can also just print it in the console. Here, our data is a data.table:
asset_prices_and_volumes
## date A1.Price A1.Volume A2.Price A2.Volume A3.Price A3.Volume
## 1: 2019-01-01 99.43 10955 49.39 35034 120.35 54981
## 2: 2019-01-02 99.35 16714 48.76 7939 118.31 27947
## 3: 2019-01-03 100.93 67147 48.68 5922 120.21 17342
## 4: 2019-01-04 102.42 38179 49.06 13433 118.90 23605
## 5: 2019-01-07 103.48 75862 48.19 126976 119.06 82942
## ---
## 257: 2019-12-25 111.26 35415 56.60 10407 144.85 25832
## 258: 2019-12-26 110.90 16662 56.16 15968 145.79 9355
## 259: 2019-12-27 110.72 49682 56.46 6989 146.62 30002
## 260: 2019-12-30 110.55 54417 56.20 31393 143.63 52598
## 261: 2019-12-31 110.20 22085 56.00 33684 140.93 5521
A tibble also shows us the column classes
as_tibble(asset_prices_and_volumes)
## # A tibble: 261 x 7
## date A1.Price A1.Volume A2.Price A2.Volume A3.Price A3.Volume
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2019-01-01 99.4 10955 49.4 35034 120. 54981
## 2 2019-01-02 99.4 16714 48.8 7939 118. 27947
## 3 2019-01-03 101. 67147 48.7 5922 120. 17342
## 4 2019-01-04 102. 38179 49.1 13433 119. 23605
## 5 2019-01-07 103. 75862 48.2 126976 119. 82942
## 6 2019-01-08 104. 25313 47.6 23289 123. 10958
## 7 2019-01-09 103. 33199 47.3 30682 123. 10410
## 8 2019-01-10 103. 12601 47.4 18217 122. 11853
## 9 2019-01-11 102. 40350 46.9 35259 121. 4514
## 10 2019-01-14 103. 13275 47.2 8504 119. 49992
## # ... with 251 more rows
With many columns, str() is more concise
str(asset_prices_and_volumes)
## Classes 'data.table' and 'data.frame': 261 obs. of 7 variables:
## $ date : chr "2019-01-01" "2019-01-02" "2019-01-03" "2019-01-04" ...
## $ A1.Price : num 99.4 99.3 100.9 102.4 103.5 ...
## $ A1.Volume: num 10955 16714 67147 38179 75862 ...
## $ A2.Price : num 49.4 48.8 48.7 49.1 48.2 ...
## $ A2.Volume: num 35034 7939 5922 13433 126976 ...
## $ A3.Price : num 120 118 120 119 119 ...
## $ A3.Volume: num 54981 27947 17342 23605 82942 ...
## - attr(*, ".internal.selfref")=<externalptr>
As we have seen, the column date is a character. Let’s parse this column to a proper date, i.e. tell R that these are actually dates. We use the ymd() function (for yearmonthday) from the package lubridate. If the date is in the format e.g. ‘DD-MM-YYYY’, we use dmy().
asset_prices_and_volumes[, date := ymd(date)]
Now, the date column should be parsed as a date.
str(asset_prices_and_volumes)
## Classes 'data.table' and 'data.frame': 261 obs. of 7 variables:
## $ date : Date, format: "2019-01-01" "2019-01-02" ...
## $ A1.Price : num 99.4 99.3 100.9 102.4 103.5 ...
## $ A1.Volume: num 10955 16714 67147 38179 75862 ...
## $ A2.Price : num 49.4 48.8 48.7 49.1 48.2 ...
## $ A2.Volume: num 35034 7939 5922 13433 126976 ...
## $ A3.Price : num 120 118 120 119 119 ...
## $ A3.Volume: num 54981 27947 17342 23605 82942 ...
## - attr(*, ".internal.selfref")=<externalptr>
The logic is: data[row, column]
Select the entry in the first row and second column
asset_prices_and_volumes[1, 2]
## A1.Price
## 1: 99.43
This is particularly important if you work with loops.
We can exclude rows and columns with a minus. E.g., select all but the first column:
asset_prices_and_volumes[, -1]
## A1.Price A1.Volume A2.Price A2.Volume A3.Price A3.Volume
## 1: 99.43 10955 49.39 35034 120.35 54981
## 2: 99.35 16714 48.76 7939 118.31 27947
## 3: 100.93 67147 48.68 5922 120.21 17342
## 4: 102.42 38179 49.06 13433 118.90 23605
## 5: 103.48 75862 48.19 126976 119.06 82942
## ---
## 257: 111.26 35415 56.60 10407 144.85 25832
## 258: 110.90 16662 56.16 15968 145.79 9355
## 259: 110.72 49682 56.46 6989 146.62 30002
## 260: 110.55 54417 56.20 31393 143.63 52598
## 261: 110.20 22085 56.00 33684 140.93 5521
Our dataset contains columns that we do not need. We want to calculate returns, but are not interested in volume data. So let’s take a subset of the data, that contains only the date and price columns. We name this subset of our data simply ‘data’.
cols <- grepl('(date)|(Price)', names(asset_prices_and_volumes), fixed = F)
data <- asset_prices_and_volumes[, ..cols]
data
## date A1.Price A2.Price A3.Price
## 1: 2019-01-01 99.43 49.39 120.35
## 2: 2019-01-02 99.35 48.76 118.31
## 3: 2019-01-03 100.93 48.68 120.21
## 4: 2019-01-04 102.42 49.06 118.90
## 5: 2019-01-07 103.48 48.19 119.06
## ---
## 257: 2019-12-25 111.26 56.60 144.85
## 258: 2019-12-26 110.90 56.16 145.79
## 259: 2019-12-27 110.72 56.46 146.62
## 260: 2019-12-30 110.55 56.20 143.63
## 261: 2019-12-31 110.20 56.00 140.93
grepl() is a function that returns TRUE or FALSE for each element of a character vector. It returns TRUE if the given pattern is in the element. If fixed = F, then the pattern is entered as a regular expression, a format for search patterns.
Data.table is a package that builds on the logic of data[row, column] that is both extremely fast and memory efficient.
To use the data.table syntax, we need to make sure that our dataset is in fact a data.table. We can check that with
class(data)
## [1] "data.table" "data.frame"
If that wasn’t the case, we could simple use
data <- as.data.table(data)
The idea is, that columns are variables.
We can select a subset of the dataset by e.g.
data[date < '2019-05-01', ]
Create and modify variables using the logic
data[rows, variable := function(variable)]
Note that := modifies the dataset itself and does not create a copy. That means we do not have to write data <- data[, x := f(x)].
data[rows, ] however only selects and returns the subset but does not overwrite the complete data.table.
For example
data[, double_A1.Price := 2 * A1.Price]
data
## date A1.Price A2.Price A3.Price double_A1.Price
## 1: 2019-01-01 99.43 49.39 120.35 198.86
## 2: 2019-01-02 99.35 48.76 118.31 198.70
## 3: 2019-01-03 100.93 48.68 120.21 201.86
## 4: 2019-01-04 102.42 49.06 118.90 204.84
## 5: 2019-01-07 103.48 48.19 119.06 206.96
## ---
## 257: 2019-12-25 111.26 56.60 144.85 222.52
## 258: 2019-12-26 110.90 56.16 145.79 221.80
## 259: 2019-12-27 110.72 56.46 146.62 221.44
## 260: 2019-12-30 110.55 56.20 143.63 221.10
## 261: 2019-12-31 110.20 56.00 140.93 220.40
Or combine both ideas and perform an operation on a subset
data[date < '2019-05-01', yearmon := as.yearmon(date)]
data
## date A1.Price A2.Price A3.Price double_A1.Price yearmon
## 1: 2019-01-01 99.43 49.39 120.35 198.86 Jan 2019
## 2: 2019-01-02 99.35 48.76 118.31 198.70 Jan 2019
## 3: 2019-01-03 100.93 48.68 120.21 201.86 Jan 2019
## 4: 2019-01-04 102.42 49.06 118.90 204.84 Jan 2019
## 5: 2019-01-07 103.48 48.19 119.06 206.96 Jan 2019
## ---
## 257: 2019-12-25 111.26 56.60 144.85 222.52 <NA>
## 258: 2019-12-26 110.90 56.16 145.79 221.80 <NA>
## 259: 2019-12-27 110.72 56.46 146.62 221.44 <NA>
## 260: 2019-12-30 110.55 56.20 143.63 221.10 <NA>
## 261: 2019-12-31 110.20 56.00 140.93 220.40 <NA>
We can also group the dataset and perform an operation on each group by adding ‘by’. It follows the logic
data[rows, variable := function(variable), by = grouping_variable]
For example
data[date < '2019-05-01', A1.av.monthly.price := mean(A1.Price), by = yearmon]
data
## date A1.Price A2.Price A3.Price double_A1.Price yearmon
## 1: 2019-01-01 99.43 49.39 120.35 198.86 Jan 2019
## 2: 2019-01-02 99.35 48.76 118.31 198.70 Jan 2019
## 3: 2019-01-03 100.93 48.68 120.21 201.86 Jan 2019
## 4: 2019-01-04 102.42 49.06 118.90 204.84 Jan 2019
## 5: 2019-01-07 103.48 48.19 119.06 206.96 Jan 2019
## ---
## 257: 2019-12-25 111.26 56.60 144.85 222.52 <NA>
## 258: 2019-12-26 110.90 56.16 145.79 221.80 <NA>
## 259: 2019-12-27 110.72 56.46 146.62 221.44 <NA>
## 260: 2019-12-30 110.55 56.20 143.63 221.10 <NA>
## 261: 2019-12-31 110.20 56.00 140.93 220.40 <NA>
## A1.av.monthly.price
## 1: 103.1726
## 2: 103.1726
## 3: 103.1726
## 4: 103.1726
## 5: 103.1726
## ---
## 257: NA
## 258: NA
## 259: NA
## 260: NA
## 261: NA
We can also perform multiple computations in one line by using ‘:=’(variable = function(variable), variable = function(variable)), i.e.
data[date < '2019-05-01', ':='(A1.av.monthly.price = mean(A1.Price), A2.av.monthly.price = mean(A2.Price)), by = yearmon]
data
## date A1.Price A2.Price A3.Price double_A1.Price yearmon
## 1: 2019-01-01 99.43 49.39 120.35 198.86 Jan 2019
## 2: 2019-01-02 99.35 48.76 118.31 198.70 Jan 2019
## 3: 2019-01-03 100.93 48.68 120.21 201.86 Jan 2019
## 4: 2019-01-04 102.42 49.06 118.90 204.84 Jan 2019
## 5: 2019-01-07 103.48 48.19 119.06 206.96 Jan 2019
## ---
## 257: 2019-12-25 111.26 56.60 144.85 222.52 <NA>
## 258: 2019-12-26 110.90 56.16 145.79 221.80 <NA>
## 259: 2019-12-27 110.72 56.46 146.62 221.44 <NA>
## 260: 2019-12-30 110.55 56.20 143.63 221.10 <NA>
## 261: 2019-12-31 110.20 56.00 140.93 220.40 <NA>
## A1.av.monthly.price A2.av.monthly.price
## 1: 103.1726 48.16
## 2: 103.1726 48.16
## 3: 103.1726 48.16
## 4: 103.1726 48.16
## 5: 103.1726 48.16
## ---
## 257: NA NA
## 258: NA NA
## 259: NA NA
## 260: NA NA
## 261: NA NA
If we want to perform the same function on multiple columns (=variables) (e.g. the mean here), we can use lapply(.SD, FUN = function).
.SD stands for subset of data and must be specified in the end by .SDcols = c(variable, variabe, …).
The logic is
data[rows, c(variable, variable, …) := lapply(.SD, FUN = function), by = grouping_variable, .SDcols = c(variable, variable, …)]
For example
data[date < '2019-05-01', c('A1.av.monthly.price', 'A2.av.monthly.price', 'A3.av.monthly.price') := lapply(.SD, mean), by = yearmon, .SDcols = c('A1.Price', 'A2.Price', 'A3.Price')]
data
## date A1.Price A2.Price A3.Price double_A1.Price yearmon
## 1: 2019-01-01 99.43 49.39 120.35 198.86 Jan 2019
## 2: 2019-01-02 99.35 48.76 118.31 198.70 Jan 2019
## 3: 2019-01-03 100.93 48.68 120.21 201.86 Jan 2019
## 4: 2019-01-04 102.42 49.06 118.90 204.84 Jan 2019
## 5: 2019-01-07 103.48 48.19 119.06 206.96 Jan 2019
## ---
## 257: 2019-12-25 111.26 56.60 144.85 222.52 <NA>
## 258: 2019-12-26 110.90 56.16 145.79 221.80 <NA>
## 259: 2019-12-27 110.72 56.46 146.62 221.44 <NA>
## 260: 2019-12-30 110.55 56.20 143.63 221.10 <NA>
## 261: 2019-12-31 110.20 56.00 140.93 220.40 <NA>
## A1.av.monthly.price A2.av.monthly.price A3.av.monthly.price
## 1: 103.1726 48.16 121.7135
## 2: 103.1726 48.16 121.7135
## 3: 103.1726 48.16 121.7135
## 4: 103.1726 48.16 121.7135
## 5: 103.1726 48.16 121.7135
## ---
## 257: NA NA NA
## 258: NA NA NA
## 259: NA NA NA
## 260: NA NA NA
## 261: NA NA NA
Sometimes, we do not want to write (repeated) values in our existing dataset, but create a summary of it.
This is done using the syntax
data[rows, .(summary_variable = function(variable)), by = grouping_variable]
For example
data[date < '2019-05-01', .(A1.av.monthly.price = mean(A1.Price)), by = yearmon]
## yearmon A1.av.monthly.price
## 1: Jan 2019 103.1726
## 2: Feb 2019 105.8635
## 3: Mrz 2019 108.5805
## 4: Apr 2019 110.9936
By using lapply() without c(variable, variable, ...) in front, we can also use it to summarise data.
data[date < '2019-05-01', lapply(.SD, mean), by = yearmon, .SDcols = c('A1.Price', 'A2.Price', 'A3.Price')]
## yearmon A1.Price A2.Price A3.Price
## 1: Jan 2019 103.1726 48.16000 121.7135
## 2: Feb 2019 105.8635 48.52350 117.6435
## 3: Mrz 2019 108.5805 46.82476 115.3705
## 4: Apr 2019 110.9936 48.71455 116.8477
Note that such summaries do not overwrite the file.
We have 3 assets. In wide format, each asset has its own column and each date is unique. If we downloaded prices of more assets, we would add them as new columns by matching dates.
It is easier to use log-returns (continuously compounded returns) because they are additive. They are simply first differences of log-prices.
data <- asset_prices_and_volumes[, ..cols]
cols <- names(data)[-1]
cols
## [1] "A1.Price" "A2.Price" "A3.Price"
data[, paste(cols) := lapply(.SD, FUN = function(x) c(NA, diff(log(x)))), .SDcols = cols]
data
## date A1.Price A2.Price A3.Price
## 1: 2019-01-01 NA NA NA
## 2: 2019-01-02 -0.000804910 -0.012837670 -0.017095866
## 3: 2019-01-03 0.015778238 -0.001642036 0.015931915
## 4: 2019-01-04 0.014654799 0.007775771 -0.010957410
## 5: 2019-01-07 0.010296351 -0.017892508 0.001344764
## ---
## 257: 2019-12-25 0.007306875 0.019625964 0.022480701
## 258: 2019-12-26 -0.003240910 -0.007804225 0.006468506
## 259: 2019-12-27 -0.001624402 0.005327663 0.005676976
## 260: 2019-12-30 -0.001536585 -0.004615666 -0.020603657
## 261: 2019-12-31 -0.003171011 -0.003565066 -0.018977235
Normal operations like \(*\) or \(/\) are performed element by element. If you want R to perform matrix operations, you have to place % around them, e.g. %*% or %/%. If A is a matrix,
solve(A);t(A).Let’s compute
\[w = \frac{\Sigma^{-1}\mu}{1 \Sigma^{-1}\mu}\]
sigma <- as.matrix(cov(data[,-1], use = 'pairwise.complete.obs'))
mu <- colMeans(data[, -1], na.rm = T)
ones <- rep(1, length(mu))
w <- t((solve(sigma) %*% mu) / as.numeric(ones %*% solve(sigma) %*% mu))
w
## A1.Price A2.Price A3.Price
## [1,] 0.4409991 0.4192866 0.1397143
For tasks it is easier to work with long format data. Instead of giving each asset its own column, we create a new column with the assets’ names and stack all observations in one column.
We take the dataset, keep the column data, take all other column names as a new variable.
The logic is: Wide to long means column names become a new variable.
Here, the column names A1.Price, …, A3.Volume become entries of a new variable.
data <- asset_prices_and_volumes
# wide format
data
## date A1.Price A1.Volume A2.Price A2.Volume A3.Price A3.Volume
## 1: 2019-01-01 99.43 10955 49.39 35034 120.35 54981
## 2: 2019-01-02 99.35 16714 48.76 7939 118.31 27947
## 3: 2019-01-03 100.93 67147 48.68 5922 120.21 17342
## 4: 2019-01-04 102.42 38179 49.06 13433 118.90 23605
## 5: 2019-01-07 103.48 75862 48.19 126976 119.06 82942
## ---
## 257: 2019-12-25 111.26 35415 56.60 10407 144.85 25832
## 258: 2019-12-26 110.90 16662 56.16 15968 145.79 9355
## 259: 2019-12-27 110.72 49682 56.46 6989 146.62 30002
## 260: 2019-12-30 110.55 54417 56.20 31393 143.63 52598
## 261: 2019-12-31 110.20 22085 56.00 33684 140.93 5521
data <- melt(data, id.vars = 'date', measure.vars = names(data)[-1])
# long format
data
## date variable value
## 1: 2019-01-01 A1.Price 99.43
## 2: 2019-01-02 A1.Price 99.35
## 3: 2019-01-03 A1.Price 100.93
## 4: 2019-01-04 A1.Price 102.42
## 5: 2019-01-07 A1.Price 103.48
## ---
## 1562: 2019-12-25 A3.Volume 25832.00
## 1563: 2019-12-26 A3.Volume 9355.00
## 1564: 2019-12-27 A3.Volume 30002.00
## 1565: 2019-12-30 A3.Volume 52598.00
## 1566: 2019-12-31 A3.Volume 5521.00
Now we have the problem, that prices and volumes are in the same columm. Let’s split them. We use the logic that in this case, the asset name and the variable name are separated by a ‘.’.
data[, c("asset", "variable") := tstrsplit(variable, ".", fixed = TRUE)]
data
## date variable value asset
## 1: 2019-01-01 Price 99.43 A1
## 2: 2019-01-02 Price 99.35 A1
## 3: 2019-01-03 Price 100.93 A1
## 4: 2019-01-04 Price 102.42 A1
## 5: 2019-01-07 Price 103.48 A1
## ---
## 1562: 2019-12-25 Volume 25832.00 A3
## 1563: 2019-12-26 Volume 9355.00 A3
## 1564: 2019-12-27 Volume 30002.00 A3
## 1565: 2019-12-30 Volume 52598.00 A3
## 1566: 2019-12-31 Volume 5521.00 A3
Now, let’s give volume its own column by transforming a part of the dataset to wide format again, i.e. we take the dataset, keep the columns date and asset and take the entries ofthe column variable as new columns and assign the value in the column values.
The logic of long to wide is: A variable gets transformed into columns.
Here, the entries in variable gets split into its two possible values Price and Volume as new column names.
data <- dcast(data, date + asset ~ variable, value.var = 'value')
data
## date asset Price Volume
## 1: 2019-01-01 A1 99.43 10955
## 2: 2019-01-01 A2 49.39 35034
## 3: 2019-01-01 A3 120.35 54981
## 4: 2019-01-02 A1 99.35 16714
## 5: 2019-01-02 A2 48.76 7939
## ---
## 779: 2019-12-30 A2 56.20 31393
## 780: 2019-12-30 A3 143.63 52598
## 781: 2019-12-31 A1 110.20 22085
## 782: 2019-12-31 A2 56.00 33684
## 783: 2019-12-31 A3 140.93 5521
We make use of the function tail() for prices and sum for volume.
monthly <- data
monthly[, yearmon := as.yearmon(date)]
monthly <- monthly[, .(Price = tail(Price, 1), Volume = sum(Volume)), by = c('yearmon', 'asset')]
monthly
## yearmon asset Price Volume
## 1: Jan 2019 A1 102.85 833457
## 2: Jan 2019 A2 48.95 804212
## 3: Jan 2019 A3 125.31 675417
## 4: Feb 2019 A1 105.92 863971
## 5: Feb 2019 A2 47.80 656993
## 6: Feb 2019 A3 111.26 917942
## 7: Mrz 2019 A1 109.52 544261
## 8: Mrz 2019 A2 46.99 692849
## 9: Mrz 2019 A3 120.64 1404199
## 10: Apr 2019 A1 111.19 666089
## 11: Apr 2019 A2 49.04 1354337
## 12: Apr 2019 A3 114.09 728136
## 13: Mai 2019 A1 111.74 795945
## 14: Mai 2019 A2 52.49 647896
## 15: Mai 2019 A3 119.78 817161
## 16: Jun 2019 A1 109.24 531981
## 17: Jun 2019 A2 54.21 615135
## 18: Jun 2019 A3 125.22 730113
## 19: Jul 2019 A1 104.71 850617
## 20: Jul 2019 A2 56.34 892919
## 21: Jul 2019 A3 118.82 1086321
## 22: Aug 2019 A1 108.28 701452
## 23: Aug 2019 A2 54.45 639586
## 24: Aug 2019 A3 118.55 635428
## 25: Sep 2019 A1 105.41 619049
## 26: Sep 2019 A2 50.88 558426
## 27: Sep 2019 A3 128.58 738536
## 28: Okt 2019 A1 111.90 796581
## 29: Okt 2019 A2 51.45 904534
## 30: Okt 2019 A3 137.54 941291
## 31: Nov 2019 A1 114.26 1384163
## 32: Nov 2019 A2 53.96 700942
## 33: Nov 2019 A3 147.62 744607
## 34: Dez 2019 A1 110.20 833115
## 35: Dez 2019 A2 56.00 623935
## 36: Dez 2019 A3 140.93 596355
## yearmon asset Price Volume
Here we grouped by two variables: yearmon and asset.
If we had log-returns instead of prices, we would also just use sum() instead of tail().
Let’s sort the monthly data first by asset and then by yearmon.
monthly <- monthly[order(asset, yearmon)]
monthly
## yearmon asset Price Volume
## 1: Jan 2019 A1 102.85 833457
## 2: Feb 2019 A1 105.92 863971
## 3: Mrz 2019 A1 109.52 544261
## 4: Apr 2019 A1 111.19 666089
## 5: Mai 2019 A1 111.74 795945
## 6: Jun 2019 A1 109.24 531981
## 7: Jul 2019 A1 104.71 850617
## 8: Aug 2019 A1 108.28 701452
## 9: Sep 2019 A1 105.41 619049
## 10: Okt 2019 A1 111.90 796581
## 11: Nov 2019 A1 114.26 1384163
## 12: Dez 2019 A1 110.20 833115
## 13: Jan 2019 A2 48.95 804212
## 14: Feb 2019 A2 47.80 656993
## 15: Mrz 2019 A2 46.99 692849
## 16: Apr 2019 A2 49.04 1354337
## 17: Mai 2019 A2 52.49 647896
## 18: Jun 2019 A2 54.21 615135
## 19: Jul 2019 A2 56.34 892919
## 20: Aug 2019 A2 54.45 639586
## 21: Sep 2019 A2 50.88 558426
## 22: Okt 2019 A2 51.45 904534
## 23: Nov 2019 A2 53.96 700942
## 24: Dez 2019 A2 56.00 623935
## 25: Jan 2019 A3 125.31 675417
## 26: Feb 2019 A3 111.26 917942
## 27: Mrz 2019 A3 120.64 1404199
## 28: Apr 2019 A3 114.09 728136
## 29: Mai 2019 A3 119.78 817161
## 30: Jun 2019 A3 125.22 730113
## 31: Jul 2019 A3 118.82 1086321
## 32: Aug 2019 A3 118.55 635428
## 33: Sep 2019 A3 128.58 738536
## 34: Okt 2019 A3 137.54 941291
## 35: Nov 2019 A3 147.62 744607
## 36: Dez 2019 A3 140.93 596355
## yearmon asset Price Volume
It orders alphabetically, increasing or from earlier to later. We can invert the order by putting a minus in front. E.g. monthly[order(-asset, yearmon)].
Let’s return to our dataset of daily data. We transformed it to long format. In long format, it is now way easier to calculate returns.
data[, returns := log(Price) - dplyr::lag(log(Price)), by = asset]
The dataset Mkt contains market returns.
Mkt
## date Mkt_return
## 1: 2019-01-01 -0.0033926288
## 2: 2019-01-02 -0.0013481345
## 3: 2019-01-03 0.0097247906
## 4: 2019-01-04 0.0005130643
## 5: 2019-01-07 0.0008768993
## ---
## 257: 2019-12-25 0.0067666093
## 258: 2019-12-26 0.0039808983
## 259: 2019-12-27 -0.0006267849
## 260: 2019-12-30 -0.0094117968
## 261: 2019-12-31 -0.0031490069
Let’s add the market return as a new column. We could use cbind(), but we want to make sure that the date columns actually match. For doing so, we perform a so called left join. Our dataset will be our left and the market dataset will be our right. The function looks at the date in data, looks up the matching date in Mkt and adds Mkt_return to data where they match. If Mkt contains dates that are not in data, they are dropped and not added.
data <- Mkt[data, on = 'date']
data
## date Mkt_return asset Price Volume yearmon returns
## 1: 2019-01-01 -0.003392629 A1 99.43 10955 Jan 2019 NA
## 2: 2019-01-01 -0.003392629 A2 49.39 35034 Jan 2019 NA
## 3: 2019-01-01 -0.003392629 A3 120.35 54981 Jan 2019 NA
## 4: 2019-01-02 -0.001348135 A1 99.35 16714 Jan 2019 -0.000804910
## 5: 2019-01-02 -0.001348135 A2 48.76 7939 Jan 2019 -0.012837670
## ---
## 779: 2019-12-30 -0.009411797 A2 56.20 31393 Dez 2019 -0.004615666
## 780: 2019-12-30 -0.009411797 A3 143.63 52598 Dez 2019 -0.020603657
## 781: 2019-12-31 -0.003149007 A1 110.20 22085 Dez 2019 -0.003171011
## 782: 2019-12-31 -0.003149007 A2 56.00 33684 Dez 2019 -0.003565066
## 783: 2019-12-31 -0.003149007 A3 140.93 5521 Dez 2019 -0.018977235
Let’s estimate the beta of each asset using OLS.
beta <- function(y, x) coef(lm(y ~ x))[2] # a function that returns the slope
data[, .(beta = beta(returns, Mkt_return)), by = asset]
## asset beta
## 1: A1 0.8487769
## 2: A2 0.2788113
## 3: A3 1.5661917
Let’s compute the 20 day rolling minimum and maximum price of each asset. We want the result of each calculation to be written at the end of the window (if you think of a window moving along a horizontal time-line).
data[, min_P := rollapply(Price, width = 20, align = 'right', FUN = min, na.rm = T, fill = NA), by = asset]
data[, max_P := rollapply(Price, width = 20, align = 'right', FUN = max, na.rm = T, fill = NA), by = asset]
data <- data[order(asset, date)] # order the data
data[15:25, ] # show some of the data
## date Mkt_return asset Price Volume yearmon returns
## 1: 2019-01-21 -0.003363942 A1 103.77 70633 Jan 2019 -0.0054778860
## 2: 2019-01-22 0.011137343 A1 104.82 31535 Jan 2019 0.0100676818
## 3: 2019-01-23 0.003158245 A1 105.13 11985 Jan 2019 0.0029530862
## 4: 2019-01-24 -0.012096426 A1 104.39 17993 Jan 2019 -0.0070637942
## 5: 2019-01-25 0.004417912 A1 104.78 16760 Jan 2019 0.0037290286
## 6: 2019-01-28 -0.002849877 A1 103.47 13784 Jan 2019 -0.0125811984
## 7: 2019-01-29 -0.006533034 A1 103.40 44541 Jan 2019 -0.0006767535
## 8: 2019-01-30 -0.001272602 A1 103.54 6651 Jan 2019 0.0013530494
## 9: 2019-01-31 -0.006274180 A1 102.85 52384 Jan 2019 -0.0066863954
## 10: 2019-02-01 -0.004435095 A1 102.56 52268 Feb 2019 -0.0028236229
## 11: 2019-02-04 -0.003792268 A1 102.34 6643 Feb 2019 -0.0021473898
## min_P max_P
## 1: NA NA
## 2: NA NA
## 3: NA NA
## 4: NA NA
## 5: NA NA
## 6: 99.35 105.13
## 7: 99.35 105.13
## 8: 100.93 105.13
## 9: 102.22 105.13
## 10: 102.22 105.13
## 11: 102.22 105.13
Let’s estimate the beta of each asset using a 30-day rolling window approach. First, we write a function that returns beta.
beta <- function(x) coef(lm(x[,1] ~ x[,2]))[2] # a function that returns the slope
Now let’s apply the function using rollapply again. This time rollapply will need a two-column data frame as input, since our beta function requires that. The option by.column = F tells rollapply that our function needs multiple columns as input instead of applying it to each column separately
data[, beta := rollapply(data.frame(returns, Mkt_return), width = 30, align = 'right', FUN = beta, by.column = F, fill = NA), by = asset]
data[25:35, ] # show some of the data
## date Mkt_return asset Price Volume yearmon returns
## 1: 2019-02-04 -0.003792268 A1 102.34 6643 Feb 2019 -0.0021473898
## 2: 2019-02-05 -0.010363741 A1 101.63 41751 Feb 2019 -0.0069618362
## 3: 2019-02-06 0.005262400 A1 103.10 250252 Feb 2019 0.0143606239
## 4: 2019-02-07 0.001025984 A1 103.05 12616 Feb 2019 -0.0004850837
## 5: 2019-02-08 -0.006968262 A1 102.58 51272 Feb 2019 -0.0045713254
## 6: 2019-02-11 0.007837548 A1 103.98 10074 Feb 2019 0.0135555910
## 7: 2019-02-12 0.002716376 A1 104.88 66883 Feb 2019 0.0086182665
## 8: 2019-02-13 -0.001749818 A1 105.48 28277 Feb 2019 0.0057045220
## 9: 2019-02-14 0.005617317 A1 105.58 114911 Feb 2019 0.0009475979
## 10: 2019-02-15 0.005512138 A1 107.37 5120 Feb 2019 0.0168118541
## 11: 2019-02-18 0.005162088 A1 107.86 20925 Feb 2019 0.0045532765
## min_P max_P beta
## 1: 102.22 105.13 NA
## 2: 101.63 105.13 NA
## 3: 101.63 105.13 NA
## 4: 101.63 105.13 NA
## 5: 101.63 105.13 NA
## 6: 101.63 105.13 0.9677070
## 7: 101.63 105.13 0.9788179
## 8: 101.63 105.48 0.9681462
## 9: 101.63 105.58 0.8879060
## 10: 101.63 107.37 0.9364054
## 11: 101.63 107.86 0.9222229
Let’s plot the Prices of all three asset in one plot.
ggplot(data, aes(x = date, y = Price, color = asset)) + geom_line()
Let’s plot the prices with their 20 day rolling minimum and maximum. This can get quite messy, so we want to give each asset its own plot.
ggplot(data, aes(x = date)) +
geom_line(aes(y = data$Price)) +
geom_line(aes(y = data$min_P), color = "grey") +
geom_line(aes(y = data$max_P), color = "grey") +
facet_wrap(~ asset, scales = 'free_y')
When f is a function and x is an object, then these expressions are identical:
f(x)x %>% f%>% : ctrl + shift + m?function_name or move cursor into function in your code and press F1