This document gives examples of common data operations needed for empirical finance.

1 Set up R and prepare data

1.1 Setting up the environment

Ideally, we should start our code by specifying the working directory, i.e. the folder R will look in when it tries to read and save files.

The easiest way is to specify the folder in which we saved the script. This way we can move the folder or save the file in the cloud to work from different machines.

setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) ##set work directory

Then, load all the libraries that you will use in the code. We will use

library(readr)
library(data.table)
library(lubridate)
library(ggplot2)
library(zoo)
library(dplyr)

If you get an error, install the packages first by using e.g. install.packages(‘readr’).

1.2 Load raw data

Ideally you save your data as a csv (using Excel, save as), where the first row contains the names of the columns. Put it in the same folder or a sub-folder as the script.

Then, click ‘Import Dataset’ in the top-right.
Select ‘From Text (readr)’ and a window should open.
Select the file under ‘Browse’.
Then adjust the settings until the data look alright.
If it shows only one column, try a different delimiter.
If decimals are wrong, adjust them by > Configure > Decimal Mark.
Pay attention to the class (character, double, etc.) of the columns. Are they correct? If not, click on the column and select the correct class.

For our purpose here, let’s generate data:

set.seed(123)
date <- seq(from = as.Date('2019-01-01'), to = as.Date('2019-12-31'), by = 'days')
date <- date[wday(date) %in% 2:6]
T. <- length(date)
Mkt <- data.table(
  date = date,
  Mkt_return = rnorm(T., 0.02/T., 0.1/(T.**0.5))
)
asset_prices_and_volumes <- data.table(
  date = as.character(date),
  A1.Price = exp(log(100) + cumsum(0.8 * Mkt$Mkt_return + rnorm(T., 0, 0.1/(T.**0.5)))) %>% round(2),
  A1.Volume = exp(rnorm(T., 10, 1)) %>% floor,
  A2.Price = exp(log(50) + cumsum(0.4 * Mkt$Mkt_return + rnorm(T., 0, 0.15/(T.**0.5)))) %>% round(2),
  A2.Volume = exp(rnorm(T., 10, 1)) %>% floor,
  A3.Price = exp(log(120) + cumsum(1.5 * Mkt$Mkt_return + rnorm(T., 0, 0.2/(T.**0.5)))) %>% round(2),
  A3.Volume = exp(rnorm(T., 10, 1)) %>% floor
)

1.3 Explore data

Let’s assume we loaded a dataset and named it asset_prices_and_volumes. To get to know how our data actually look like, we can use

View()
str()
head()
tail()
summary()

If the dataset is a tibble or a data.table, we can also just print it in the console. Here, our data is a data.table:

asset_prices_and_volumes

##            date A1.Price A1.Volume A2.Price A2.Volume A3.Price A3.Volume
##   1: 2019-01-01    99.43     10955    49.39     35034   120.35     54981
##   2: 2019-01-02    99.35     16714    48.76      7939   118.31     27947
##   3: 2019-01-03   100.93     67147    48.68      5922   120.21     17342
##   4: 2019-01-04   102.42     38179    49.06     13433   118.90     23605
##   5: 2019-01-07   103.48     75862    48.19    126976   119.06     82942
##  ---                                                                    
## 257: 2019-12-25   111.26     35415    56.60     10407   144.85     25832
## 258: 2019-12-26   110.90     16662    56.16     15968   145.79      9355
## 259: 2019-12-27   110.72     49682    56.46      6989   146.62     30002
## 260: 2019-12-30   110.55     54417    56.20     31393   143.63     52598
## 261: 2019-12-31   110.20     22085    56.00     33684   140.93      5521

A tibble also shows us the column classes

as_tibble(asset_prices_and_volumes)

## # A tibble: 261 x 7
##    date       A1.Price A1.Volume A2.Price A2.Volume A3.Price A3.Volume
##    <chr>         <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
##  1 2019-01-01     99.4     10955     49.4     35034     120.     54981
##  2 2019-01-02     99.4     16714     48.8      7939     118.     27947
##  3 2019-01-03    101.      67147     48.7      5922     120.     17342
##  4 2019-01-04    102.      38179     49.1     13433     119.     23605
##  5 2019-01-07    103.      75862     48.2    126976     119.     82942
##  6 2019-01-08    104.      25313     47.6     23289     123.     10958
##  7 2019-01-09    103.      33199     47.3     30682     123.     10410
##  8 2019-01-10    103.      12601     47.4     18217     122.     11853
##  9 2019-01-11    102.      40350     46.9     35259     121.      4514
## 10 2019-01-14    103.      13275     47.2      8504     119.     49992
## # ... with 251 more rows

With many columns, str() is more concise

str(asset_prices_and_volumes)

## Classes 'data.table' and 'data.frame':   261 obs. of  7 variables:
##  $ date     : chr  "2019-01-01" "2019-01-02" "2019-01-03" "2019-01-04" ...
##  $ A1.Price : num  99.4 99.3 100.9 102.4 103.5 ...
##  $ A1.Volume: num  10955 16714 67147 38179 75862 ...
##  $ A2.Price : num  49.4 48.8 48.7 49.1 48.2 ...
##  $ A2.Volume: num  35034 7939 5922 13433 126976 ...
##  $ A3.Price : num  120 118 120 119 119 ...
##  $ A3.Volume: num  54981 27947 17342 23605 82942 ...
##  - attr(*, ".internal.selfref")=<externalptr>

1.4 Dates

As we have seen, the column date is a character. Let’s parse this column to a proper date, i.e. tell R that these are actually dates. We use the ymd() function (for yearmonthday) from the package lubridate. If the date is in the format e.g. ‘DD-MM-YYYY’, we use dmy().

asset_prices_and_volumes[, date := ymd(date)]

Now, the date column should be parsed as a date.

str(asset_prices_and_volumes)

## Classes 'data.table' and 'data.frame':   261 obs. of  7 variables:
##  $ date     : Date, format: "2019-01-01" "2019-01-02" ...
##  $ A1.Price : num  99.4 99.3 100.9 102.4 103.5 ...
##  $ A1.Volume: num  10955 16714 67147 38179 75862 ...
##  $ A2.Price : num  49.4 48.8 48.7 49.1 48.2 ...
##  $ A2.Volume: num  35034 7939 5922 13433 126976 ...
##  $ A3.Price : num  120 118 120 119 119 ...
##  $ A3.Volume: num  54981 27947 17342 23605 82942 ...
##  - attr(*, ".internal.selfref")=<externalptr>

1.5 Select entries within datasets

The logic is: data[row, column]

Select the entry in the first row and second column

asset_prices_and_volumes[1, 2]

##    A1.Price
## 1:    99.43

This is particularly important if you work with loops.

We can exclude rows and columns with a minus. E.g., select all but the first column:

asset_prices_and_volumes[, -1]

##      A1.Price A1.Volume A2.Price A2.Volume A3.Price A3.Volume
##   1:    99.43     10955    49.39     35034   120.35     54981
##   2:    99.35     16714    48.76      7939   118.31     27947
##   3:   100.93     67147    48.68      5922   120.21     17342
##   4:   102.42     38179    49.06     13433   118.90     23605
##   5:   103.48     75862    48.19    126976   119.06     82942
##  ---                                                         
## 257:   111.26     35415    56.60     10407   144.85     25832
## 258:   110.90     16662    56.16     15968   145.79      9355
## 259:   110.72     49682    56.46      6989   146.62     30002
## 260:   110.55     54417    56.20     31393   143.63     52598
## 261:   110.20     22085    56.00     33684   140.93      5521

1.6 Keep/Drop columns by name

Our dataset contains columns that we do not need. We want to calculate returns, but are not interested in volume data. So let’s take a subset of the data, that contains only the date and price columns. We name this subset of our data simply ‘data’.

cols <- grepl('(date)|(Price)', names(asset_prices_and_volumes), fixed = F)
data <- asset_prices_and_volumes[, ..cols]
data

##            date A1.Price A2.Price A3.Price
##   1: 2019-01-01    99.43    49.39   120.35
##   2: 2019-01-02    99.35    48.76   118.31
##   3: 2019-01-03   100.93    48.68   120.21
##   4: 2019-01-04   102.42    49.06   118.90
##   5: 2019-01-07   103.48    48.19   119.06
##  ---                                      
## 257: 2019-12-25   111.26    56.60   144.85
## 258: 2019-12-26   110.90    56.16   145.79
## 259: 2019-12-27   110.72    56.46   146.62
## 260: 2019-12-30   110.55    56.20   143.63
## 261: 2019-12-31   110.20    56.00   140.93

grepl() is a function that returns TRUE or FALSE for each element of a character vector. It returns TRUE if the given pattern is in the element. If fixed = F, then the pattern is entered as a regular expression, a format for search patterns.

1.7 data.table syntax

Data.table is a package that builds on the logic of data[row, column] that is both extremely fast and memory efficient.

To use the data.table syntax, we need to make sure that our dataset is in fact a data.table. We can check that with

class(data)

## [1] "data.table" "data.frame"

If that wasn’t the case, we could simple use

data <- as.data.table(data)

The idea is, that columns are variables.

We can select a subset of the dataset by e.g.

data[date < '2019-05-01', ]

Create and modify variables using the logic

data[rows, variable := function(variable)]

Note that := modifies the dataset itself and does not create a copy. That means we do not have to write data <- data[, x := f(x)].

data[rows, ] however only selects and returns the subset but does not overwrite the complete data.table.

For example

data[, double_A1.Price := 2 * A1.Price]
data

##            date A1.Price A2.Price A3.Price double_A1.Price
##   1: 2019-01-01    99.43    49.39   120.35          198.86
##   2: 2019-01-02    99.35    48.76   118.31          198.70
##   3: 2019-01-03   100.93    48.68   120.21          201.86
##   4: 2019-01-04   102.42    49.06   118.90          204.84
##   5: 2019-01-07   103.48    48.19   119.06          206.96
##  ---                                                      
## 257: 2019-12-25   111.26    56.60   144.85          222.52
## 258: 2019-12-26   110.90    56.16   145.79          221.80
## 259: 2019-12-27   110.72    56.46   146.62          221.44
## 260: 2019-12-30   110.55    56.20   143.63          221.10
## 261: 2019-12-31   110.20    56.00   140.93          220.40

Or combine both ideas and perform an operation on a subset

data[date < '2019-05-01', yearmon := as.yearmon(date)]
data

##            date A1.Price A2.Price A3.Price double_A1.Price  yearmon
##   1: 2019-01-01    99.43    49.39   120.35          198.86 Jan 2019
##   2: 2019-01-02    99.35    48.76   118.31          198.70 Jan 2019
##   3: 2019-01-03   100.93    48.68   120.21          201.86 Jan 2019
##   4: 2019-01-04   102.42    49.06   118.90          204.84 Jan 2019
##   5: 2019-01-07   103.48    48.19   119.06          206.96 Jan 2019
##  ---                                                               
## 257: 2019-12-25   111.26    56.60   144.85          222.52     <NA>
## 258: 2019-12-26   110.90    56.16   145.79          221.80     <NA>
## 259: 2019-12-27   110.72    56.46   146.62          221.44     <NA>
## 260: 2019-12-30   110.55    56.20   143.63          221.10     <NA>
## 261: 2019-12-31   110.20    56.00   140.93          220.40     <NA>

We can also group the dataset and perform an operation on each group by adding ‘by’. It follows the logic

data[rows, variable := function(variable), by = grouping_variable]

For example

data[date < '2019-05-01', A1.av.monthly.price := mean(A1.Price), by = yearmon]
data

##            date A1.Price A2.Price A3.Price double_A1.Price  yearmon
##   1: 2019-01-01    99.43    49.39   120.35          198.86 Jan 2019
##   2: 2019-01-02    99.35    48.76   118.31          198.70 Jan 2019
##   3: 2019-01-03   100.93    48.68   120.21          201.86 Jan 2019
##   4: 2019-01-04   102.42    49.06   118.90          204.84 Jan 2019
##   5: 2019-01-07   103.48    48.19   119.06          206.96 Jan 2019
##  ---                                                               
## 257: 2019-12-25   111.26    56.60   144.85          222.52     <NA>
## 258: 2019-12-26   110.90    56.16   145.79          221.80     <NA>
## 259: 2019-12-27   110.72    56.46   146.62          221.44     <NA>
## 260: 2019-12-30   110.55    56.20   143.63          221.10     <NA>
## 261: 2019-12-31   110.20    56.00   140.93          220.40     <NA>
##      A1.av.monthly.price
##   1:            103.1726
##   2:            103.1726
##   3:            103.1726
##   4:            103.1726
##   5:            103.1726
##  ---                    
## 257:                  NA
## 258:                  NA
## 259:                  NA
## 260:                  NA
## 261:                  NA

We can also perform multiple computations in one line by using ‘:=’(variable = function(variable), variable = function(variable)), i.e.

data[date < '2019-05-01', ':='(A1.av.monthly.price = mean(A1.Price), A2.av.monthly.price = mean(A2.Price)), by = yearmon]
data

##            date A1.Price A2.Price A3.Price double_A1.Price  yearmon
##   1: 2019-01-01    99.43    49.39   120.35          198.86 Jan 2019
##   2: 2019-01-02    99.35    48.76   118.31          198.70 Jan 2019
##   3: 2019-01-03   100.93    48.68   120.21          201.86 Jan 2019
##   4: 2019-01-04   102.42    49.06   118.90          204.84 Jan 2019
##   5: 2019-01-07   103.48    48.19   119.06          206.96 Jan 2019
##  ---                                                               
## 257: 2019-12-25   111.26    56.60   144.85          222.52     <NA>
## 258: 2019-12-26   110.90    56.16   145.79          221.80     <NA>
## 259: 2019-12-27   110.72    56.46   146.62          221.44     <NA>
## 260: 2019-12-30   110.55    56.20   143.63          221.10     <NA>
## 261: 2019-12-31   110.20    56.00   140.93          220.40     <NA>
##      A1.av.monthly.price A2.av.monthly.price
##   1:            103.1726               48.16
##   2:            103.1726               48.16
##   3:            103.1726               48.16
##   4:            103.1726               48.16
##   5:            103.1726               48.16
##  ---                                        
## 257:                  NA                  NA
## 258:                  NA                  NA
## 259:                  NA                  NA
## 260:                  NA                  NA
## 261:                  NA                  NA

If we want to perform the same function on multiple columns (=variables) (e.g. the mean here), we can use lapply(.SD, FUN = function).

.SD stands for subset of data and must be specified in the end by .SDcols = c(variable, variabe, …).

The logic is

data[rows, c(variable, variable, …) := lapply(.SD, FUN = function), by = grouping_variable, .SDcols = c(variable, variable, …)]

For example

data[date < '2019-05-01', c('A1.av.monthly.price', 'A2.av.monthly.price', 'A3.av.monthly.price') := lapply(.SD, mean), by = yearmon, .SDcols = c('A1.Price', 'A2.Price', 'A3.Price')]
data

##            date A1.Price A2.Price A3.Price double_A1.Price  yearmon
##   1: 2019-01-01    99.43    49.39   120.35          198.86 Jan 2019
##   2: 2019-01-02    99.35    48.76   118.31          198.70 Jan 2019
##   3: 2019-01-03   100.93    48.68   120.21          201.86 Jan 2019
##   4: 2019-01-04   102.42    49.06   118.90          204.84 Jan 2019
##   5: 2019-01-07   103.48    48.19   119.06          206.96 Jan 2019
##  ---                                                               
## 257: 2019-12-25   111.26    56.60   144.85          222.52     <NA>
## 258: 2019-12-26   110.90    56.16   145.79          221.80     <NA>
## 259: 2019-12-27   110.72    56.46   146.62          221.44     <NA>
## 260: 2019-12-30   110.55    56.20   143.63          221.10     <NA>
## 261: 2019-12-31   110.20    56.00   140.93          220.40     <NA>
##      A1.av.monthly.price A2.av.monthly.price A3.av.monthly.price
##   1:            103.1726               48.16            121.7135
##   2:            103.1726               48.16            121.7135
##   3:            103.1726               48.16            121.7135
##   4:            103.1726               48.16            121.7135
##   5:            103.1726               48.16            121.7135
##  ---                                                            
## 257:                  NA                  NA                  NA
## 258:                  NA                  NA                  NA
## 259:                  NA                  NA                  NA
## 260:                  NA                  NA                  NA
## 261:                  NA                  NA                  NA

Sometimes, we do not want to write (repeated) values in our existing dataset, but create a summary of it.

This is done using the syntax

data[rows, .(summary_variable = function(variable)), by = grouping_variable]

For example

data[date < '2019-05-01', .(A1.av.monthly.price = mean(A1.Price)), by = yearmon]

##     yearmon A1.av.monthly.price
## 1: Jan 2019            103.1726
## 2: Feb 2019            105.8635
## 3: Mrz 2019            108.5805
## 4: Apr 2019            110.9936

By using lapply() without c(variable, variable, ...) in front, we can also use it to summarise data.

data[date < '2019-05-01', lapply(.SD, mean), by = yearmon, .SDcols = c('A1.Price', 'A2.Price', 'A3.Price')]

##     yearmon A1.Price A2.Price A3.Price
## 1: Jan 2019 103.1726 48.16000 121.7135
## 2: Feb 2019 105.8635 48.52350 117.6435
## 3: Mrz 2019 108.5805 46.82476 115.3705
## 4: Apr 2019 110.9936 48.71455 116.8477

Note that such summaries do not overwrite the file.

2 Operations on wide format data

2.1 Wide format data

We have 3 assets. In wide format, each asset has its own column and each date is unique. If we downloaded prices of more assets, we would add them as new columns by matching dates.

2.2 Returns using lapply

It is easier to use log-returns (continuously compounded returns) because they are additive. They are simply first differences of log-prices.

data <- asset_prices_and_volumes[, ..cols]
cols <- names(data)[-1]
cols

## [1] "A1.Price" "A2.Price" "A3.Price"

data[, paste(cols) := lapply(.SD, FUN = function(x) c(NA, diff(log(x)))), .SDcols = cols]
data

##            date     A1.Price     A2.Price     A3.Price
##   1: 2019-01-01           NA           NA           NA
##   2: 2019-01-02 -0.000804910 -0.012837670 -0.017095866
##   3: 2019-01-03  0.015778238 -0.001642036  0.015931915
##   4: 2019-01-04  0.014654799  0.007775771 -0.010957410
##   5: 2019-01-07  0.010296351 -0.017892508  0.001344764
##  ---                                                  
## 257: 2019-12-25  0.007306875  0.019625964  0.022480701
## 258: 2019-12-26 -0.003240910 -0.007804225  0.006468506
## 259: 2019-12-27 -0.001624402  0.005327663  0.005676976
## 260: 2019-12-30 -0.001536585 -0.004615666 -0.020603657
## 261: 2019-12-31 -0.003171011 -0.003565066 -0.018977235

2.3 Mean-Variance-Efficient portfolio weights (using matrix operations in R)

Normal operations like \(*\) or \(/\) are performed element by element. If you want R to perform matrix operations, you have to place % around them, e.g. %*% or %/%. If A is a matrix,

then \(A^{-1}\), the inverse, is solve(A);
\(A^t\), the transpose, is t(A).

Let’s compute

\[w = \frac{\Sigma^{-1}\mu}{1 \Sigma^{-1}\mu}\]

sigma <- as.matrix(cov(data[,-1], use = 'pairwise.complete.obs'))
mu <- colMeans(data[, -1], na.rm = T)
ones <- rep(1, length(mu))
w <- t((solve(sigma) %*% mu) / as.numeric(ones %*% solve(sigma) %*% mu))
w

##       A1.Price  A2.Price  A3.Price
## [1,] 0.4409991 0.4192866 0.1397143

3 Operations on long format data

3.1 Wide to long format data

For tasks it is easier to work with long format data. Instead of giving each asset its own column, we create a new column with the assets’ names and stack all observations in one column.

We take the dataset, keep the column data, take all other column names as a new variable.

The logic is: Wide to long means column names become a new variable.

Here, the column names A1.Price, …, A3.Volume become entries of a new variable.

data <- asset_prices_and_volumes
# wide format
data

##            date A1.Price A1.Volume A2.Price A2.Volume A3.Price A3.Volume
##   1: 2019-01-01    99.43     10955    49.39     35034   120.35     54981
##   2: 2019-01-02    99.35     16714    48.76      7939   118.31     27947
##   3: 2019-01-03   100.93     67147    48.68      5922   120.21     17342
##   4: 2019-01-04   102.42     38179    49.06     13433   118.90     23605
##   5: 2019-01-07   103.48     75862    48.19    126976   119.06     82942
##  ---                                                                    
## 257: 2019-12-25   111.26     35415    56.60     10407   144.85     25832
## 258: 2019-12-26   110.90     16662    56.16     15968   145.79      9355
## 259: 2019-12-27   110.72     49682    56.46      6989   146.62     30002
## 260: 2019-12-30   110.55     54417    56.20     31393   143.63     52598
## 261: 2019-12-31   110.20     22085    56.00     33684   140.93      5521

data <- melt(data, id.vars = 'date', measure.vars = names(data)[-1])
# long format
data

##             date  variable    value
##    1: 2019-01-01  A1.Price    99.43
##    2: 2019-01-02  A1.Price    99.35
##    3: 2019-01-03  A1.Price   100.93
##    4: 2019-01-04  A1.Price   102.42
##    5: 2019-01-07  A1.Price   103.48
##   ---                              
## 1562: 2019-12-25 A3.Volume 25832.00
## 1563: 2019-12-26 A3.Volume  9355.00
## 1564: 2019-12-27 A3.Volume 30002.00
## 1565: 2019-12-30 A3.Volume 52598.00
## 1566: 2019-12-31 A3.Volume  5521.00

Now we have the problem, that prices and volumes are in the same columm. Let’s split them. We use the logic that in this case, the asset name and the variable name are separated by a ‘.’.

data[, c("asset", "variable") := tstrsplit(variable, ".", fixed = TRUE)]
data

##             date variable    value asset
##    1: 2019-01-01    Price    99.43    A1
##    2: 2019-01-02    Price    99.35    A1
##    3: 2019-01-03    Price   100.93    A1
##    4: 2019-01-04    Price   102.42    A1
##    5: 2019-01-07    Price   103.48    A1
##   ---                                   
## 1562: 2019-12-25   Volume 25832.00    A3
## 1563: 2019-12-26   Volume  9355.00    A3
## 1564: 2019-12-27   Volume 30002.00    A3
## 1565: 2019-12-30   Volume 52598.00    A3
## 1566: 2019-12-31   Volume  5521.00    A3

3.2 Long to wide format data

Now, let’s give volume its own column by transforming a part of the dataset to wide format again, i.e. we take the dataset, keep the columns date and asset and take the entries ofthe column variable as new columns and assign the value in the column values.

The logic of long to wide is: A variable gets transformed into columns.

Here, the entries in variable gets split into its two possible values Price and Volume as new column names.

data <- dcast(data, date + asset ~ variable, value.var = 'value')
data

##            date asset  Price Volume
##   1: 2019-01-01    A1  99.43  10955
##   2: 2019-01-01    A2  49.39  35034
##   3: 2019-01-01    A3 120.35  54981
##   4: 2019-01-02    A1  99.35  16714
##   5: 2019-01-02    A2  48.76   7939
##  ---                               
## 779: 2019-12-30    A2  56.20  31393
## 780: 2019-12-30    A3 143.63  52598
## 781: 2019-12-31    A1 110.20  22085
## 782: 2019-12-31    A2  56.00  33684
## 783: 2019-12-31    A3 140.93   5521

3.3 Transform daily data to monthly

We make use of the function tail() for prices and sum for volume.

monthly <- data
monthly[, yearmon := as.yearmon(date)]
monthly <- monthly[, .(Price = tail(Price, 1), Volume = sum(Volume)), by = c('yearmon', 'asset')]
monthly

##      yearmon asset  Price  Volume
##  1: Jan 2019    A1 102.85  833457
##  2: Jan 2019    A2  48.95  804212
##  3: Jan 2019    A3 125.31  675417
##  4: Feb 2019    A1 105.92  863971
##  5: Feb 2019    A2  47.80  656993
##  6: Feb 2019    A3 111.26  917942
##  7: Mrz 2019    A1 109.52  544261
##  8: Mrz 2019    A2  46.99  692849
##  9: Mrz 2019    A3 120.64 1404199
## 10: Apr 2019    A1 111.19  666089
## 11: Apr 2019    A2  49.04 1354337
## 12: Apr 2019    A3 114.09  728136
## 13: Mai 2019    A1 111.74  795945
## 14: Mai 2019    A2  52.49  647896
## 15: Mai 2019    A3 119.78  817161
## 16: Jun 2019    A1 109.24  531981
## 17: Jun 2019    A2  54.21  615135
## 18: Jun 2019    A3 125.22  730113
## 19: Jul 2019    A1 104.71  850617
## 20: Jul 2019    A2  56.34  892919
## 21: Jul 2019    A3 118.82 1086321
## 22: Aug 2019    A1 108.28  701452
## 23: Aug 2019    A2  54.45  639586
## 24: Aug 2019    A3 118.55  635428
## 25: Sep 2019    A1 105.41  619049
## 26: Sep 2019    A2  50.88  558426
## 27: Sep 2019    A3 128.58  738536
## 28: Okt 2019    A1 111.90  796581
## 29: Okt 2019    A2  51.45  904534
## 30: Okt 2019    A3 137.54  941291
## 31: Nov 2019    A1 114.26 1384163
## 32: Nov 2019    A2  53.96  700942
## 33: Nov 2019    A3 147.62  744607
## 34: Dez 2019    A1 110.20  833115
## 35: Dez 2019    A2  56.00  623935
## 36: Dez 2019    A3 140.93  596355
##      yearmon asset  Price  Volume

Here we grouped by two variables: yearmon and asset.

If we had log-returns instead of prices, we would also just use sum() instead of tail().

3.4 Sort

Let’s sort the monthly data first by asset and then by yearmon.

monthly <- monthly[order(asset, yearmon)]
monthly

##      yearmon asset  Price  Volume
##  1: Jan 2019    A1 102.85  833457
##  2: Feb 2019    A1 105.92  863971
##  3: Mrz 2019    A1 109.52  544261
##  4: Apr 2019    A1 111.19  666089
##  5: Mai 2019    A1 111.74  795945
##  6: Jun 2019    A1 109.24  531981
##  7: Jul 2019    A1 104.71  850617
##  8: Aug 2019    A1 108.28  701452
##  9: Sep 2019    A1 105.41  619049
## 10: Okt 2019    A1 111.90  796581
## 11: Nov 2019    A1 114.26 1384163
## 12: Dez 2019    A1 110.20  833115
## 13: Jan 2019    A2  48.95  804212
## 14: Feb 2019    A2  47.80  656993
## 15: Mrz 2019    A2  46.99  692849
## 16: Apr 2019    A2  49.04 1354337
## 17: Mai 2019    A2  52.49  647896
## 18: Jun 2019    A2  54.21  615135
## 19: Jul 2019    A2  56.34  892919
## 20: Aug 2019    A2  54.45  639586
## 21: Sep 2019    A2  50.88  558426
## 22: Okt 2019    A2  51.45  904534
## 23: Nov 2019    A2  53.96  700942
## 24: Dez 2019    A2  56.00  623935
## 25: Jan 2019    A3 125.31  675417
## 26: Feb 2019    A3 111.26  917942
## 27: Mrz 2019    A3 120.64 1404199
## 28: Apr 2019    A3 114.09  728136
## 29: Mai 2019    A3 119.78  817161
## 30: Jun 2019    A3 125.22  730113
## 31: Jul 2019    A3 118.82 1086321
## 32: Aug 2019    A3 118.55  635428
## 33: Sep 2019    A3 128.58  738536
## 34: Okt 2019    A3 137.54  941291
## 35: Nov 2019    A3 147.62  744607
## 36: Dez 2019    A3 140.93  596355
##      yearmon asset  Price  Volume

It orders alphabetically, increasing or from earlier to later. We can invert the order by putting a minus in front. E.g. monthly[order(-asset, yearmon)].

3.5 Returns in long format

Let’s return to our dataset of daily data. We transformed it to long format. In long format, it is now way easier to calculate returns.

data[, returns := log(Price) - dplyr::lag(log(Price)), by = asset]

3.6 Left-Join

The dataset Mkt contains market returns.

Mkt

##            date    Mkt_return
##   1: 2019-01-01 -0.0033926288
##   2: 2019-01-02 -0.0013481345
##   3: 2019-01-03  0.0097247906
##   4: 2019-01-04  0.0005130643
##   5: 2019-01-07  0.0008768993
##  ---                         
## 257: 2019-12-25  0.0067666093
## 258: 2019-12-26  0.0039808983
## 259: 2019-12-27 -0.0006267849
## 260: 2019-12-30 -0.0094117968
## 261: 2019-12-31 -0.0031490069

Let’s add the market return as a new column. We could use cbind(), but we want to make sure that the date columns actually match. For doing so, we perform a so called left join. Our dataset will be our left and the market dataset will be our right. The function looks at the date in data, looks up the matching date in Mkt and adds Mkt_return to data where they match. If Mkt contains dates that are not in data, they are dropped and not added.

data <- Mkt[data, on = 'date']
data

##            date   Mkt_return asset  Price Volume  yearmon      returns
##   1: 2019-01-01 -0.003392629    A1  99.43  10955 Jan 2019           NA
##   2: 2019-01-01 -0.003392629    A2  49.39  35034 Jan 2019           NA
##   3: 2019-01-01 -0.003392629    A3 120.35  54981 Jan 2019           NA
##   4: 2019-01-02 -0.001348135    A1  99.35  16714 Jan 2019 -0.000804910
##   5: 2019-01-02 -0.001348135    A2  48.76   7939 Jan 2019 -0.012837670
##  ---                                                                  
## 779: 2019-12-30 -0.009411797    A2  56.20  31393 Dez 2019 -0.004615666
## 780: 2019-12-30 -0.009411797    A3 143.63  52598 Dez 2019 -0.020603657
## 781: 2019-12-31 -0.003149007    A1 110.20  22085 Dez 2019 -0.003171011
## 782: 2019-12-31 -0.003149007    A2  56.00  33684 Dez 2019 -0.003565066
## 783: 2019-12-31 -0.003149007    A3 140.93   5521 Dez 2019 -0.018977235

3.7 Linear Regression: Estimate Beta

Let’s estimate the beta of each asset using OLS.

beta <- function(y, x) coef(lm(y ~ x))[2] # a function that returns the slope
data[, .(beta = beta(returns, Mkt_return)), by = asset]

##    asset      beta
## 1:    A1 0.8487769
## 2:    A2 0.2788113
## 3:    A3 1.5661917

3.8 Rolling window using rollapply

Let’s compute the 20 day rolling minimum and maximum price of each asset. We want the result of each calculation to be written at the end of the window (if you think of a window moving along a horizontal time-line).

data[, min_P := rollapply(Price, width = 20, align = 'right',  FUN = min, na.rm = T, fill = NA), by = asset]
data[, max_P := rollapply(Price, width = 20, align = 'right', FUN = max, na.rm = T, fill = NA), by = asset]
data <- data[order(asset, date)] # order the data
data[15:25, ] # show some of the data

##           date   Mkt_return asset  Price Volume  yearmon       returns
##  1: 2019-01-21 -0.003363942    A1 103.77  70633 Jan 2019 -0.0054778860
##  2: 2019-01-22  0.011137343    A1 104.82  31535 Jan 2019  0.0100676818
##  3: 2019-01-23  0.003158245    A1 105.13  11985 Jan 2019  0.0029530862
##  4: 2019-01-24 -0.012096426    A1 104.39  17993 Jan 2019 -0.0070637942
##  5: 2019-01-25  0.004417912    A1 104.78  16760 Jan 2019  0.0037290286
##  6: 2019-01-28 -0.002849877    A1 103.47  13784 Jan 2019 -0.0125811984
##  7: 2019-01-29 -0.006533034    A1 103.40  44541 Jan 2019 -0.0006767535
##  8: 2019-01-30 -0.001272602    A1 103.54   6651 Jan 2019  0.0013530494
##  9: 2019-01-31 -0.006274180    A1 102.85  52384 Jan 2019 -0.0066863954
## 10: 2019-02-01 -0.004435095    A1 102.56  52268 Feb 2019 -0.0028236229
## 11: 2019-02-04 -0.003792268    A1 102.34   6643 Feb 2019 -0.0021473898
##      min_P  max_P
##  1:     NA     NA
##  2:     NA     NA
##  3:     NA     NA
##  4:     NA     NA
##  5:     NA     NA
##  6:  99.35 105.13
##  7:  99.35 105.13
##  8: 100.93 105.13
##  9: 102.22 105.13
## 10: 102.22 105.13
## 11: 102.22 105.13

3.9 Rolling window regression using rollapply

Let’s estimate the beta of each asset using a 30-day rolling window approach. First, we write a function that returns beta.

beta <- function(x) coef(lm(x[,1] ~ x[,2]))[2] # a function that returns the slope

Now let’s apply the function using rollapply again. This time rollapply will need a two-column data frame as input, since our beta function requires that. The option by.column = F tells rollapply that our function needs multiple columns as input instead of applying it to each column separately

data[, beta := rollapply(data.frame(returns, Mkt_return), width = 30, align = 'right', FUN = beta, by.column = F, fill = NA), by = asset]
data[25:35, ] # show some of the data

##           date   Mkt_return asset  Price Volume  yearmon       returns
##  1: 2019-02-04 -0.003792268    A1 102.34   6643 Feb 2019 -0.0021473898
##  2: 2019-02-05 -0.010363741    A1 101.63  41751 Feb 2019 -0.0069618362
##  3: 2019-02-06  0.005262400    A1 103.10 250252 Feb 2019  0.0143606239
##  4: 2019-02-07  0.001025984    A1 103.05  12616 Feb 2019 -0.0004850837
##  5: 2019-02-08 -0.006968262    A1 102.58  51272 Feb 2019 -0.0045713254
##  6: 2019-02-11  0.007837548    A1 103.98  10074 Feb 2019  0.0135555910
##  7: 2019-02-12  0.002716376    A1 104.88  66883 Feb 2019  0.0086182665
##  8: 2019-02-13 -0.001749818    A1 105.48  28277 Feb 2019  0.0057045220
##  9: 2019-02-14  0.005617317    A1 105.58 114911 Feb 2019  0.0009475979
## 10: 2019-02-15  0.005512138    A1 107.37   5120 Feb 2019  0.0168118541
## 11: 2019-02-18  0.005162088    A1 107.86  20925 Feb 2019  0.0045532765
##      min_P  max_P      beta
##  1: 102.22 105.13        NA
##  2: 101.63 105.13        NA
##  3: 101.63 105.13        NA
##  4: 101.63 105.13        NA
##  5: 101.63 105.13        NA
##  6: 101.63 105.13 0.9677070
##  7: 101.63 105.13 0.9788179
##  8: 101.63 105.48 0.9681462
##  9: 101.63 105.58 0.8879060
## 10: 101.63 107.37 0.9364054
## 11: 101.63 107.86 0.9222229

4 Plots and shortcuts

4.1 Plots of multiple lines in one plot

Let’s plot the Prices of all three asset in one plot.

ggplot(data, aes(x = date, y = Price, color = asset)) + geom_line()

4.2 Generate multiple plots at once

Let’s plot the prices with their 20 day rolling minimum and maximum. This can get quite messy, so we want to give each asset its own plot.

ggplot(data, aes(x = date)) +
  geom_line(aes(y = data$Price)) +
  geom_line(aes(y = data$min_P), color = "grey") +
  geom_line(aes(y = data$max_P), color = "grey") +
  facet_wrap(~ asset, scales = 'free_y')

4.3 Piping

When f is a function and x is an object, then these expressions are identical:

f(x)
x %>% f

4.4 Shortcuts

run single line of code: move cursor in that line and press ctrl + enter
run certain part of a line of code or multiple lines: highlight, then control + enter
clear console: ctrl + l
jump to script: ctrl + 1
jump to console: ctrl + 2
indent lines: highlight code, then ctrl + i
comment/uncomment code: ctrl + shift + c
%>% : ctrl + shift + m
help: ?function_name or move cursor into function in your code and press F1

R for Finance

Tobias

27 October 2019

1 Set up R and prepare data

1.1 Setting up the environment

1.2 Load raw data

1.3 Explore data

1.4 Dates

1.5 Select entries within datasets

1.6 Keep/Drop columns by name

1.7 data.table syntax

2 Operations on wide format data

2.1 Wide format data

2.2 Returns using lapply

2.3 Mean-Variance-Efficient portfolio weights (using matrix operations in R)

3 Operations on long format data

3.1 Wide to long format data

3.2 Long to wide format data

3.3 Transform daily data to monthly

3.4 Sort

3.5 Returns in long format

3.6 Left-Join

3.7 Linear Regression: Estimate Beta

3.8 Rolling window using rollapply

3.9 Rolling window regression using rollapply

4 Plots and shortcuts

4.1 Plots of multiple lines in one plot

4.2 Generate multiple plots at once

4.3 Piping

4.4 Shortcuts