In this vignette, I work with the ReinforcementLearning R package to introduce Q-learning in the context of asset allocation. Specifically, I demonstrate how to train an agent based on state-action-reward tuples. The illustration is model- and environment-free: the idea is to determine an optimal policy from randomized choices and rewards, from which the agent learns and adapts. Based on the resulting policies, we evaluate the agent's performance and consider different parametrizations for sensitivity analysis.
A famous example is the gridworld, where the agent has to go from point A to point D via a \(2 \times 2\) grid as follows: \[ \left[\begin{array}{cc} A| & D\\ B & C \end{array}\right] \] At each cell, the agent can take one of four actions: up, right, down, or left. The agent cannot go directly from \(A\) to \(D\) by moving to the right, as there is a barrier between the two cells. Clearly, to the naked eye, we can determine that the optimal policy is to take action “down” at state A, action “right” at state B, and action “up” at state C in order to reach the goal. At the same time, the agent might stall by looping between \(B\) and \(C\). Hence, the idea is also to find the path as quickly as possible. After all, the longer we wait, the less utility we enjoy in consuming goods.
Such an environment is built into the
ReinforcementLearning library. We can load the library and
retrieve the environment as follows:
library(ReinforcementLearning)
env <- gridworldEnvironment
print(env)
function (state, action)
{
next_state <- state
if (state == state("s1") && action == "down")
next_state <- state("s2")
if (state == state("s2") && action == "up")
next_state <- state("s1")
if (state == state("s2") && action == "right")
next_state <- state("s3")
if (state == state("s3") && action == "left")
next_state <- state("s2")
if (state == state("s3") && action == "up")
next_state <- state("s4")
if (next_state == state("s4") && state != state("s4")) {
reward <- 10
}
else {
reward <- -1
}
out <- list(NextState = next_state, Reward = reward)
return(out)
}
<bytecode: 0x55a9986a8788>
<environment: namespace:ReinforcementLearning>
The environment function takes two arguments and returns two outputs. The arguments denote the current state \(s_t\) and the action \(a_t\) taken at time \(t\). In such a dynamic environment, the action leads to a new state. In the grid case, taking action “down” while at state A leads to a new state B. This interaction can be described using a tuple \((s_t,a_t,s_{t+1},r_{t+1})\), where \(r_{t+1}\) denotes the reward for taking action \(a_t\) at \(s_t\) and transitioning to state \(s_{t+1}\) as a result. The above environment determines the next state \(s_{t+1}\) and reward \(r_{t+1}\) based on the current state and action. The idea behind setting up the environment infrastructure is to capture the correct “physics” and the right incentives.
To better understand the above, consider the case where we stand at cell A. Taking any action other than “down” leads to stalling, where the agent remains stuck in cell A. Such stalling results in dis-utility since we are not making progress. Nonetheless, even if we take the right action, we are still far from reaching the goal, i.e., cell D. Hence, the idea is to train the agent not to “celebrate” too early. Celebration eventually comes later, when the agent receives the “trophy.”
In total, we have four different states, which we denote by
states <- paste("s",1:4,sep = "")
names(states) <- LETTERS[1:4]
states
A B C D
"s1" "s2" "s3" "s4"
On the other hand, we have four different actions:
actions <- c("up","down","left","right")
Because we have the environment set up with respect to different actions and states, we can easily simulate data. For instance, we can create a single tuple by running the following command
env("s1","up")
$NextState
[1] "s1"
$Reward
[1] -1
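Conversely, based on the environment function printed above, an action that reaches the goal cell returns the positive reward. For instance, moving “up” from s3 should land in s4 with a reward of 10:
env("s3","up")
# expected: $NextState is "s4" and $Reward is 10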
We could simulate some data that provides insights into how the agent interacts with the environment using randomized choices. Let us “cook” some data in this regard:
N <- 10^3
set.seed(1234)
ds <- data.frame(State = sample(states,N,replace = TRUE),
Action = sample(actions,N,replace = TRUE))
ds2 <- lapply(1:N, function(i) env(ds[i,1],ds[i,2]))
ds2 <- lapply(ds2,data.frame)
ds2 <- Reduce(rbind,ds2)
ds <- data.frame(ds,ds2)
ds$Action <- as.character(ds$Action)
ds$State <- as.character(ds$State)
ds$NextState <- as.character(ds$NextState)
head(ds)
Based on the environment and randomized states and actions, we created 1000 observations denoting the \((s_t,a_t,s_{t+1},r_{t+1})\) tuple. To train the agent using the package, all we need is such data, as we shall see shortly. Alternatively, we can generate such data directly using the package as follows
set.seed(1234)
data <- sampleExperience(N = N,env = env, states, actions)
head(data)
To compare the two we can look into the transition probabilities based on the frequency of the \((s_t,s_{t+1})\) pairs:
table(ds$State,ds$NextState)/N
s1 s2 s3 s4
s1 0.176 0.057 0.000 0.000
s2 0.055 0.134 0.069 0.000
s3 0.000 0.078 0.133 0.070
s4 0.000 0.000 0.000 0.228
table(data$State,data$NextState)/N
s1 s2 s3 s4
s1 0.177 0.056 0.000 0.000
s2 0.069 0.120 0.069 0.000
s3 0.000 0.062 0.148 0.071
s4 0.000 0.000 0.000 0.228
We observe that in both cases, we get a similar transition matrix.
Given either dataset, we can train the agent using the ReinforcementLearning function. The function takes the data as its first input. Additionally, it requires the names of the state, next state, action, and reward columns in the data. In terms of tuning, the command takes the control parameters as its last input. These parameters are summarized as follows:
Learning Rate: \(\alpha \in (0,1)\) relates to \(Q\)-learning and captures the learning rate. Similar to iterative algorithms, this parameter determines how fast the algorithm adapts. If it is too small, the learning speed is low, whereas a larger value denotes a faster learning rate.
Discount Factor: \(\gamma \in (0,1)\) denotes the discount factor. We prefer consuming goods immediately. Something to be consumed in a later period is not as enjoyable unless one expects to have a greater reward to compensate for the waiting period. Hence, the smaller the value of \(\gamma\) is, the less patient the agent is, such that it attributes greater weights to current rewards.
Exploration Rate: \(\epsilon \in (0,1)\) is the fraction of training episodes in which the agent ignores past learning experiences and takes a random action. This input captures the trade-off between exploration and exploitation. Once the agent starts figuring things out, it might get too comfortable repeating the same actions; the idea is to encourage the agent to explore, since there might be something else out there worth trying. Hence, a larger \(\epsilon\) value denotes a greater exploration rate (a minimal sketch of this selection rule follows below).
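The sketch below is purely illustrative; the helper choose_action and its arguments are hypothetical and not part of the ReinforcementLearning package:
# Illustrative epsilon-greedy rule: explore with probability epsilon,
# otherwise exploit the action with the highest estimated Q-value.
choose_action <- function(q_row, actions, epsilon) {
  if (runif(1) < epsilon) {
    sample(actions, 1)               # explore: pick a random action
  } else {
    actions[which.max(q_row)]        # exploit: pick the greedy action
  }
}
choose_action(q_row = c(up = 0.2, down = 1.5, left = 0.1, right = 0.4),
              actions = c("up","down","left","right"), epsilon = 0.1)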
For now, we set the following configurations:
control <- list(alpha = 0.1, gamma = 0.95, epsilon = 0.1)
Given the above, all we need to do is to run the following:
model <- ReinforcementLearning(data,
s = "State",
a = "Action",
r = "Reward",
s_new = "NextState",
control = control)
model
State-Action function Q
right up down left
s1 3.281970 3.442551 4.819512 3.443300
s2 5.827021 3.320670 4.706644 4.827368
s3 5.939138 6.767722 5.852077 4.770131
s4 -3.959678 -4.150526 -3.970640 -3.993890
Policy
s1 s2 s3 s4
"down" "right" "up" "right"
Reward (last iteration)
[1] -219
A couple of comments are in order. We observe that the algorithm returns
the \(Q\) function. Such a function captures the optimal policy: it tells
us how much reward there is to be had by taking each of the four actions
at each state. For instance, we observe that the largest value at state
\(A\) (s1) occurs when the action “down” is taken. The same applies to
the other states. However, at cell \(D\) (state s4), we observe that the
action no longer matters, since the main trophy has been received and
taking further action would not result in a further reward. Indeed, we
can see this from the “Policy” output, which provides guidance on which
action to take at each state and is consistent with the shortest path.
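To make the link between the \(Q\) table and the policy explicit, the greedy policy is simply the state-wise argmax over actions. A small sketch, re-typing (rounded) the Q-values printed above for s1 to s3:
# Greedy policy: for each state, pick the action with the largest Q-value.
Q <- matrix(c(3.28, 3.44, 4.82, 3.44,
              5.83, 3.32, 4.71, 4.83,
              5.94, 6.77, 5.85, 4.77),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("s1","s2","s3"),
                            c("right","up","down","left")))
apply(Q, 1, function(q) names(q)[which.max(q)])
#     s1      s2      s3
# "down" "right"    "up"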
As a confirmation, we repeat the above but use our simulated data to confirm whether we get the same result:
model2 <- ReinforcementLearning(ds,
s = "State",
a = "Action",
r = "Reward",
s_new = "NextState",
control = control)
model2
State-Action function Q
right up down left
s1 3.517780 3.455531 4.939327 3.378204
s2 6.053089 3.455700 4.988266 4.946638
s3 5.926172 6.779527 5.907719 5.009214
s4 -4.150526 -3.970640 -3.993890 -3.959678
Policy
s1 s2 s3 s4
"down" "right" "up" "left"
Reward (last iteration)
[1] -230
In the following, we will work with the data object for
brevity, but the results follow regardless. Before we move on, note
that either model returns a reward. The reward here denotes the
sum of the total rewards in the data, i.e.,
sum(data$Reward) == model$Reward
[1] TRUE
and
sum(ds$Reward) == model2$Reward
[1] TRUE
Since we know the solution, we can conclude that the final result is the optimal policy. However, what if the environment were more complex than the one considered in this example? Also, suppose that we obtain new data by the time we are done training the model. Can we improve the policy further? Without delving into technical details, we could argue that the model learns better when provided with more data generated from interacting with the environment under guidance. In this case, we are interested in updating the existing policy as data become available.
Specifically, in the above, we generated data using random
interactions without guidance. Now that we have figured something out,
we can perhaps generate data in certain directions. For instance, we
know to some extent what the agent should do at state s1 given the
policy from model. We can use the original model as a
reference point and generate additional data as follows
data_new <- sampleExperience(N = N,
env = env,
states = states,
actions = actions,
model = model, # important
actionSelection = "epsilon-greedy",
control = control)
Different from before, sampleExperience now takes two
additional arguments. The first specifies an existing model we can
rely on to generate the data. The second is the action-selection
method. In the first run, the action selection was completely random,
whereas now it is \(\epsilon\)-greedy, i.e., the agent follows the
policy \(1-\epsilon\) of the time and chooses random actions, as we did
originally, in the remaining \(\epsilon\) of the cases. We can see that
the sampled data, in this case, results in a much higher reward since
the agent can rely on some policy to determine the best action
sum(data_new$Reward)
[1] 1486
To make things more interesting, let us create a function that updates the policy \(M\) times, summarizing all the above steps in one place.
M <- 10
RL_update_fun <- function(control_i) {
set.seed(1)
data <- sampleExperience(N = N,env = env, states, actions)
model <- ReinforcementLearning(data,
s = "State",
a = "Action",
r = "Reward",
s_new = "NextState",
control = control_i)
reward_seq <- sum(model$Reward)
for (iter in 2:M) {
set.seed(iter)
data_new <- sampleExperience(N = N,
env = env,
states = states,
actions = actions,
model = model,
actionSelection = "epsilon-greedy",
control = control_i)
# update model
model <- ReinforcementLearning(data_new,
s = "State",
a = "Action",
r = "Reward",
s_new = "NextState",
control = control_i)
reward_seq <- c(reward_seq,sum(model$Reward))
}
return(reward_seq)
}
Given the above function, we can analyze how the policy updates depend on the inputs. In the following, I consider different values of \(\epsilon\), which denotes the proportion of time the algorithm takes a random action. Naturally, we expect more noise to be associated with higher values. By noise, I mean that if the algorithm figures out the optimal policy, it should reach the maximum possible reward earlier and exhibit more stability across the remaining iterations.
library(parallel)
control1 <- list(alpha = 0.1, gamma = 0.95, epsilon = 0.5)
control2 <- list(alpha = 0.1, gamma = 0.95, epsilon = 0.25)
control3 <- list(alpha = 0.1, gamma = 0.95, epsilon = 0.1)
control4 <- list(alpha = 0.1, gamma = 0.95, epsilon = 0.01)
control_list <- list(control1,control2,control3,control4)
RL_list <- mclapply(control_list,RL_update_fun,mc.cores = 4)
Let us plot the rewards from each algorithm
library(ggplot2)
library(plotly)
rewards_iter <- lapply(1:length(control_list),
function(i) data.frame(Reward = RL_list[[i]],
Iteration = 1:M,
epsilon = control_list[[i]]$epsilon ))
rewards_iter <- Reduce(rbind,rewards_iter)
rewards_iter$epsilon <- as.factor(rewards_iter$epsilon)
p <- ggplot(data = rewards_iter,aes(x = Iteration,y = Reward,colour = epsilon)) +
geom_line() + geom_point()
ggplotly(p)
We can see that the specification with the lowest \(\epsilon\) value results in higher rewards during the last \(M - 1\) iterations. At the same time, we note that the algorithm always follows a random policy a fraction \(\epsilon\) of the time, regardless of the number of iterations. One potential adjustment is to reduce the value of \(\epsilon\) as we perform more iterations. The following function does so by setting \(\epsilon/m\), where \(m = 1,...,50\). If we start with \(\epsilon = 0.5\), by the 50th iteration the algorithm sets \(\epsilon = 0.01\).
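As a quick check on this schedule, we can print the implied sequence of \(\epsilon\) values when starting from 0.5:
round(0.5/(1:50), 3)   # decays from 0.5 at m = 1 down to 0.01 at m = 50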
RL_update_fun <- function(control_i) {
set.seed(1)
data <- sampleExperience(N = N,env = env, states, actions)
model <- ReinforcementLearning(data,
s = "State",
a = "Action",
r = "Reward",
s_new = "NextState",
control = control_i)
reward_seq <- sum(model$Reward)
eps0 <- control_i$epsilon # keep the initial exploration rate
for (iter in 2:M) {
control_i$epsilon <- eps0/iter # decay the exploration rate as epsilon/m
set.seed(iter)
data_new <- sampleExperience(N = N,
env = env,
states = states,
actions = actions,
model = model,
actionSelection = "epsilon-greedy",
control = control_i)
# update model
model <- ReinforcementLearning(data_new,
s = "State",
a = "Action",
r = "Reward",
s_new = "NextState",
control = control_i)
reward_seq <- c(reward_seq,sum(model$Reward))
}
return(reward_seq)
}
M <- 50
rew_seq <- RL_update_fun(control1)
We can visualize the total rewards as a function of iterations as follows:
ds_plot <- data.frame(Reward = rew_seq, Iteration = 1:M)
p <- ggplot(data = ds_plot,aes(x = Iteration,y = Reward)) +
geom_line() + geom_point()
ggplotly(p)
We observe that the algorithm demonstrates lower volatility as we approach the last iteration.
Similar to other models in R, such as linear regressions or machine
learning algorithms via caret, the RL model can be used with
the generic predict function. For instance, given our
initial data, we can see how the actions get updated based on learning.
The table below shows the difference between the random actions
taken and those guided by the learned policy. Greater values in the
off-diagonal elements indicate a greater discrepancy. If the model did
not learn/update, we would expect to get the same actions as the random ones.
action_predict <- predict(model,data$State)
table(data$Action,action_predict)
action_predict
down right up
down 56 116 70
left 57 124 62
right 52 114 78
up 68 132 71
This function is helpful for evaluating the model after each training episode. Additionally, it allows us to determine the reward implied by the learned policy in whatever context it is applied.
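For instance, here is a minimal sketch that feeds the predicted actions back through the environment and totals the resulting rewards; the replay object below is constructed purely for illustration and is not returned by the package:
# Replay the learned policy through the environment and total the rewards.
replay <- lapply(seq_len(nrow(data)),
                 function(i) env(data$State[i], as.character(action_predict[i])))
sum(sapply(replay, function(z) z$Reward))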
I will work with the quantmod and lubridate packages to manipulate time series and dates, along with PerformanceAnalytics for performance evaluation. In terms of assets, I consider the problem of a tactical asset allocation in which the agent rotates between a high-risk and a low-risk asset. For the high-risk asset, I consider the SPY ETF to represent the stock market. For the low-risk asset, I consider cash with zero return for simplicity. This could be replaced by a Treasury bond or a more defensive asset. In this example, I assume that the return on cash is zero regardless of interest/inflation rates.
library(quantmod)
library(lubridate)
library(PerformanceAnalytics)
tics <- "SPY"
P_list <- lapply(tics, function(x) get( getSymbols(x,from = "1990-01-01") ) )
P_list <- lapply(P_list,function(x) x[,6])
P <- Reduce(merge,P_list)
P <- apply.monthly(P,last)
R <- na.omit(P/lag(P)) - 1
ds_plot <- R
R_port <- 0.6*R + 0.4*0
R_port <- merge(R,R_port)
chart.CumReturns(R_port)
The black line above denotes the return on a 100% position in the SPY, whereas the red line denotes the 60-40 portfolio that allocates 60% to the SPY and 40% to cash. Overall, we observe that the SPY outperforms the 60-40 portfolio in terms of total return since it has a greater loading on the stock market risk premium. At the same time, we observe that the 60-40 portfolio provides a greater hedge during market sell-offs, mitigating downside risk. Traditionally, the 60-40 allocates 40% to Treasury bonds, which serve as a safe-haven asset (flight-to-quality) during periods of increased market uncertainty. These dynamics imply a negative correlation between the two assets, which results in enhanced risk-adjusted returns. However, in our example with cash, the correlation is zero.
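To put numbers on this comparison, one could tabulate annualized performance for both series; a quick sketch using PerformanceAnalytics (already loaded above), assuming monthly returns and hence scale = 12:
# Annualized return, volatility, and Sharpe ratio for SPY vs. the 60-40 mix
table.AnnualizedReturns(R_port, scale = 12, Rf = 0)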
The idea behind RL is to train the agent to learn the dynamics of the tactical asset allocation problem. Specifically, we are interested in a model-free environment where the agent observes certain actions and rewards to determine the optimal policy. To do so, we will consider descriptive analysis first, i.e., in-sample. Then, we will consider some backtesting to evaluate the appeal of RL out-of-sample.
Let us set up the platform so we can run the RL function. Recall that the RL main function takes the data in the format of \((s_t,a_t,s_{t+1},r_{t+1})\) tuples. Hence, we need to determine the state at time \(t\), corresponding to month \(t\) in our data. Note that the asset return is continuous, whereas the state \(s_t\) is discrete. Hence, we need to utilize a signal from the returns to determine the market state. This can be done in different ways, e.g., economic index or other financial indicators such as VIX. In the following illustration, I consider a simple case where we have three states determined based on the return level of the asset, denoting low, medium, and high levels.
Before we get started, we set up a number of main parameters
alph <- 0.1 # learning speed
r <- 0.02 # risk free rate
gam <- 1/(1+r)^(1/12) # monthly discount factor
eps <- 0.1
tau <- 0.01 # sets the threshold which determines the state
We determine \(\gamma\) by assuming a risk-free annual rate of 2%. Hence, the discount factor on a monthly basis is \[ \gamma = \left(\frac{1}{1+r} \right)^{1/12} \] The parameter \(\tau\) denotes the threshold of the SPY return that captures the state of the market. In our case, a return below \(-\tau\) (above \(\tau\)) denotes state \(s_t = -1\) (\(s_t = 1\)). Otherwise, the state is \(s_t = 0\). We can easily define the states as follows
ds <- R
ds <- data.frame(Date = date(ds), ds)
ds$State <- (ds$SPY.Adjusted> tau)*1 + -(ds$SPY.Adjusted < -tau)*1
ds$State <- as.character(ds$State)
ds$NextState <- data.table::shift(ds$State,-1)
head(ds)
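Before assigning actions, it is worth a quick check of how often each of the three market states occurs in the sample:
table(ds$State)   # frequency of the low (-1), medium (0), and high (1) states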
Next, we need to determine the actions and rewards. The action space is also discrete: for simplicity, the agent longs a fraction of the risky asset equal to \(a_t = k/10\) for \(k = 1,...,10\). This imposes that the agent always longs some portion of the stock market regardless of the current state. As in the grid example, we start with random actions, where the agent makes random portfolio choices:
set.seed(1)
ds$Action <- sample((1:10)/10,nrow(ds),replace = TRUE)
head(ds)
We have determined the \((s_t,a_t,s_{t+1})\) tuple so far. What is left
to determine is the reward for taking action \(a_t\) while transitioning from state \(s_t\) to state \(s_{t+1}\). To do so, we need a reward
function that takes into account the dis-utility of losses. A natural
candidate is to map returns via a constant absolute risk aversion (CARA)
utility function with a risk aversion parameter, denoted by
RA, that controls the agent's risk aversion:
RA <- 5 # sets the risk aversion level
U <- function(x,RA) 1-exp(-RA*x)
x <- seq(-0.1,1,length = 100)
U_3 <- function(x) U(x,3)
U_10 <- function(x) U(x,10)
y1 <- sapply(x,U_3)
y2 <- sapply(x,U_10)
plot(y1 ~ x,type = "l",ylim = range(c(y1,y2)), ylab = "Utility", xlab = "Return")
lines(y2 ~ x,col = 2)
abline(v = 0,lty = 2)
grid(10)
The above plot demonstrates the utility of two agents where the red line is the one with higher risk aversion. For negative returns, we observe that the agent with higher risk aversion “suffers” more. On the other hand, such an agent enjoys higher utility as returns become positive. Given the CARA utility, we can map the portfolio return positions into rewards as follows
ds$Reward <- ds$Action*data.table::shift(ds$SPY.Adjusted,-1)
ds$Action <- as.character(ds$Action)
ds$Reward <- U(ds$Reward,RA)
ds <- na.omit(ds)
head(ds)
Now that we have represented the data as \((s_t,a_t,s_{t+1},r_{t+1})\) tuples, we can move on to training the model. We set the controls using the parameters defined above:
control <- list(alpha = alph, gamma = gam, epsilon = eps)
model_spy <- ReinforcementLearning(ds,
s = "State",
a = "Action",
r = "Reward",
s_new = "NextState",
control = control)
ds$RL_Action <- as.numeric(predict(model_spy,ds$State))
ds$Next_Ret <- data.table::shift(ds$SPY.Adjusted,-1)
ds$Portfolio_Ret_RL <- ds$RL_Action*ds$Next_Ret
ret <- na.omit(ds$Portfolio_Ret_RL)
SR_RL <- mean(U(ret,RA))
# compare to random choices
ret<- as.numeric( ds$Action)*ds$Next_Ret
ret <- na.omit(ret)
SR_random <- mean(U(ret,RA))
round(data.frame(Random = SR_random,RL = SR_RL)*100,2)
The above commands train the agent and result in a higher utility relative to random actions. However, the question remains whether this is an optimal policy. The above is based on a single random iteration. Perhaps if we feed the model more data, the agent can attain a higher reward.
The major difference here is that we do not know the environment of how each action affects the transition from state \(s_t\) to \(s_{t+1}\). Hence, updating the policy is not straightforward. One potential extension is to consider an equilibrium model such that returns are endogenously determined based on action and state. Nonetheless, in this example, we will refrain from such analysis and consider an environment-free framework. The challenge here, however, is how to simulate data without knowing the environment.
To overcome the above challenge, we consider an ad-hoc modification. Similar to the \(\epsilon\)-greedy approach, we simulate a proportion of actions that are chosen randomly, whereas the other actions follow an initial model. At each iteration, we choose \(\epsilon\) proportion of months where the agent takes random actions, whereas, in the \(1-\epsilon\) months, the agent follows the previous policy. We iterate this several times until we attain a certain satisfactory threshold, which we set based on a benchmark. Specifically, the benchmark is set relative to holding the SPY passively, which corresponds to the following reward
x <- R$SPY.Adjusted
SPY_SR <- mean(U(x,RA))
SPY_SR
[1] 0.01924738
We observe that the initial policy above underperforms the passive
one that longs the SPY 100% of the time. Our goal is to find a policy
that outperforms this benchmark. We implement this in the following
while loop, which we run until we beat the benchmark or
exhaust a certain number of iterations, which we set to 100.
tot_iter <- 100
iter <- 2
model_list <- list(model_spy)
rewards_seq <- SR_RL
i <- 0
while (SR_RL <= SPY_SR) {
i <- i + 1
ds <- R
ds <- data.frame(Date = date(ds), ds)
ds$State <- (ds$SPY.Adjusted > tau)*1 + -(ds$SPY.Adjusted < -tau)*1
ds$State <- as.character(ds$State)
ds$NextState <- data.table::shift(ds$State,-1)
# take action with respect to the previous model
ds$RL_Action <- as.numeric(predict(model_spy,ds$State))
sample_index <- sample(1:nrow(ds),1 + floor(nrow(ds)*0.5*(1 - iter/tot_iter)))
ds$Action <- ds$RL_Action
ds$Action[sample_index] <- sample(1:10/10,length(sample_index),replace = TRUE)
ds$Reward <- ds$Action*data.table::shift(ds$SPY.Adjusted,-1)
ds$Reward <- U(ds$Reward,RA)
ds$Action <- as.character(ds$Action)
ds <- na.omit(ds)
# train the model based on random choices
model_spy_new <- ReinforcementLearning(ds,
s = "State",
a = "Action",
r = "Reward",
s_new = "NextState",
control = control,
model = model_spy)
ds$RL_Action <- as.numeric(predict(model_spy_new,ds$State))
ds$Next_Ret <- data.table::shift(ds$SPY.Adjusted,-1)
ds$Portfolio_Ret_RL <- ds$RL_Action*ds$Next_Ret
ret <- na.omit(ds$Portfolio_Ret_RL)
SR_RL_new <- mean(U(ret,RA))
# compute the benchmark for consistent evaluation
x <- na.omit(ds$Next_Ret)
SPY_reward <- mean(U(x,RA))
SPY_SR <- SPY_reward
if(i > 1000)
break
# update the model only if performance is better than previous
if(SR_RL_new > SR_RL) {
model_spy <- model_spy_new
SR_RL <- SR_RL_new
iter <- iter + 1
cat("Model Updated ", iter, "\n")
model_list <- c(model_list,list(model_spy))
rewards_seq <- c(rewards_seq,SR_RL)
i <- 0
}
if(iter > tot_iter)
break
}
Model Updated 3
For the above risk aversion, which we set to 5, the loop terminates during the third iteration. In unreported results, the model takes longer to find a policy that outperforms the benchmark when risk aversion is low. Indeed, if the agent is risk-tolerant, it is harder to find a policy that outperforms the SPY, since the agent could be better off holding the SPY passively. On the other hand, a risk-averse agent would be better off allocating less than 100% to the SPY to mitigate risk.
Let us take a look at the final model and policy:
model_spy_new
State-Action function Q
0.7 0.8 0.9 1 0.1 0.2 0.3 0.4
-1 0.3816035 0.2078880 0.1597925 0.26549405 0.1971610 0.19540204 0.1638846 0.1659740
0 0.1944525 0.3260284 0.1370731 0.05240283 0.1051063 0.04714781 0.1214657 0.1488058
1 0.3037000 0.2732329 0.3564709 0.21113729 0.2755970 0.20727959 0.2733207 0.2663940
0.5 0.6
-1 0.1820516 0.2452856
0 0.1285650 0.1093987
1 0.3162321 0.2457833
Policy
-1 0 1
"0.7" "0.8" "0.9"
Reward (last iteration)
[1] 5.267385
Consistent with common sense, the policy indicates that the agent reduces exposure to the risky asset in states accompanied by low returns. Specifically, the exposure is reduced to 70% equity. On the other hand, during the medium (high) state, the exposure is 80% (90%).
Let us take a look at how such a policy performs in practice:
ds_plot <- ds[,c("Next_Ret","Portfolio_Ret_RL")]
date_names <- rownames(ds_plot)[-1]
ds_plot <- na.omit(ds_plot)
rownames(ds_plot) <- date_names
ds_plot <- as.xts(ds_plot)
chart.CumReturns(ds_plot,geometric = FALSE)
We observe that the RL agent underperforms the passive fund in terms of terminal wealth. However, let us also take risk into account:
my_sum <- function(x) {
m <- mean(x)*12
s <- sd(x)*sqrt(12)
sr <- m/s
sv <- sd(x[x<0])*sqrt(12)
sr2 <- m/sv
VaR <- mean(x) - quantile(x,0.05)
return(c(m,s,sv,sr,sr2,VaR))
}
result <- data.frame(apply(ds_plot,2,my_sum))
rownames(result) <- c("Mean","Volatility","Semi-Volatility","Sharpe","Sortino","VaR")
round(result,3)
Consistent with the above plot, we see that the RL policy results in lower returns and risk simultaneously. Regarding risk-adjusted returns, we observe that the RL results in higher Sharpe and Sortino ratios. Additionally, we measure downside risk using value-at-risk (VaR). Finally, we observe that the RL strategy results in lower tail risk.
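As a complementary downside measure, one could also compare the maximum drawdowns of the two series; a quick check using PerformanceAnalytics:
# Maximum drawdown of the passive SPY position vs. the RL-guided portfolio
maxDrawdown(ds_plot, geometric = FALSE)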
The above analysis is conducted in-sample and, hence, is descriptive by design. Nevertheless, one can address interesting questions regarding setting the right investment policy for investors with different risk preferences. This sheds interesting light on the appeal of Robo-advising. Moreover, in terms of portfolio selection, future research should consider the out-of-sample analysis and evaluate the appeal of RL from a predictive point of view. I leave these for future research.