# Packages assumed loaded in an earlier setup chunk: dplyr, tidyr, ggplot2, GGally, knitr
full_df <- read.csv('reply_editor_ppo.csv')  # Load the logged PPO runs (long format: one metric value per row)
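# ff() is used below but not defined in this section; presumably it is a small
# helper, defined in an earlier chunk, that maps filename substrings to condition
# labels (first matching pattern wins, unmatched files get `fill`). A minimal
# sketch of such a helper, as an assumption about its behaviour (skip this if
# ff() is already defined earlier):
ff <- function(x, patterns, replacements, fill = NA, ignore.case = FALSE) {
  out <- rep(fill, length(x))
  assigned <- rep(FALSE, length(x))
  for (i in seq_along(patterns)) {
    hit <- !assigned & grepl(patterns[i], x, ignore.case = ignore.case)
    out[hit] <- replacements[i]
    assigned <- assigned | hit
  }
  out
}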
full_df$condition <- ff(full_df$file,
                        c("-curr-", "curr8K", "2M", "curr_rev", "200K", "nocur"),
                        c("400K", "800K", "2M", "0", "200K", "never"),
                        NA, ignore.case = TRUE)
full_df$condition <- factor(full_df$condition, levels = c("0", "200K", "400K", "800K", "2M", "never"))
wide_df <- full_df %>%
  pivot_wider(names_from = metric, values_from = value)
Reward clearly drops after the onset of the movement penalty (sensible). Note that with the immediate-onset penalty (condition = 0), some agents perform badly (although they all start the same; is this actually onset = 1?), others approach a reward of 0, and a third group does even better, coming close to the no-movement-penalty performance, which is a good ceiling. What are these three distinct strategies? One rough way to quantify the groups is sketched after the plot.
wide_df %>%
  ggplot(aes(x = step, y = `Environment/Cumulative Reward`, color = condition)) +
  geom_point(alpha = .3) +
  geom_line(aes(group = file), alpha = .3) +
  theme_classic()
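One rough way to quantify the three apparent strategies in the 0-onset condition (a sketch, not part of the original analysis): bin each run by its mean late-training reward and count runs per bin. The cut points of 0 and 0.5 are eyeballed from the plot and are assumptions, not fitted values.

wide_df %>%
  filter(condition == "0", step > 3000000) %>%
  group_by(file) %>%
  summarise(late_reward = mean(`Environment/Cumulative Reward`), .groups = "drop") %>%
  mutate(strategy = cut(late_reward,
                        breaks = c(-Inf, 0, 0.5, Inf),
                        labels = c("poor", "near zero", "near ceiling"))) %>%
  count(strategy)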
Mean performance after 3 million episodes, by which point agent performance has converged (although not for the agents with movement-penalty onset at 2M episodes!).
wide_df %>%
  filter(step > 3000000) %>%
  group_by(condition) %>%
  summarise(`Mean Reward` = mean(`Environment/Cumulative Reward`),
            `Mean distance moved` = mean(`Distance moved`),
            `Mean RT` = mean(RT, na.rm = TRUE),
            `Mean distance to base` = mean(`Distance to base`),
            `sd(Reward)` = sd(`Environment/Cumulative Reward`),
            `sd(Distance moved)` = sd(`Distance moved`),
            `sd(RT)` = sd(RT, na.rm = TRUE),
            `sd(Distance to base)` = sd(`Distance to base`)) %>%
  # arrange(desc(`Mean Reward`)) %>%
  kable(digits = 2, format = "html", table.attr = "style='width:80%;'")
condition | Mean Reward | Mean distance moved | Mean RT | Mean distance to base | sd(Reward) | sd(Distance moved) | sd(RT) | sd(Distance to base) |
---|---|---|---|---|---|---|---|---|
0 | 0.35 | 0.04 | 6.26 | 5.29 | 0.29 | 0.02 | 3.40 | 0.29 |
200K | -0.13 | 0.18 | 25.09 | 5.53 | 0.16 | 0.03 | 13.10 | 0.12 |
400K | 0.55 | 0.06 | 6.94 | 5.63 | 0.08 | 0.01 | 2.34 | 0.30 |
800K | 0.55 | 0.06 | 6.78 | 5.29 | 0.07 | 0.01 | 2.26 | 0.13 |
2M | -0.92 | 0.31 | 98.01 | 5.24 | 0.17 | 0.05 | 22.72 | 0.10 |
never | 0.74 | 0.89 | 15.97 | 8.86 | 0.05 | 0.11 | 5.63 | 0.49 |
The no-movement-penalty condition achieves the best average reward (of course), but its agents also move the most, stay farthest from the base, and have moderately slow RTs. Of the movement-penalized conditions, the 400K and 800K onsets show the highest rewards, along with little movement, fast RTs, and agents that stay fairly close to the base. The 200K and 2M onsets show long RTs and more distance moved than the 400K/800K onsets.
Note that there are 577 NA values in RT, all but one of which come from the 0-onset penalty condition (presumably the group of agents that never move and thus never reach a target; we should report the proportion of agents adopting this strategy, as sketched below).
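One way to check this and report that proportion (a hedged sketch, not part of the original analysis): count RT NAs by condition, then, within the 0-onset condition, compute the share of runs whose agents never record an RT late in training. This assumes that `file` identifies a run and that a run's agent "never reaches a target" exactly when all of its late RT values are NA.

# RT NAs by condition (over all logged steps)
wide_df %>%
  group_by(condition) %>%
  summarise(n_RT_NA = sum(is.na(RT)), .groups = "drop")

# Share of 0-onset runs that never record an RT late in training
wide_df %>%
  filter(condition == "0", step > 3000000) %>%
  group_by(file) %>%
  summarise(never_reaches = all(is.na(RT)), .groups = "drop") %>%
  summarise(prop_never_reach = mean(never_reaches))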
wide_df %>%
  filter(step > 3000000) %>%
  ggpairs(columns = c("Environment/Cumulative Reward", "Distance moved", "RT",
                      "Distance to base"),  # "Policy/Learning Rate", "Losses/Policy Loss"
          mapping = ggplot2::aes(colour = condition, alpha = .3)) +
  theme_bw()