Reproducibility Report for Sequential Inverse Plan Search by Zhi-Xuan et al. (2020, NeurIPS)

Background

Sequential Inverse Plan Search (SIPS) is a probabilistic program designed to perform efficient Bayesian inference over the goals of another angent whose internal planning process is unknown. In addition to SIPS, inverse reinforcement learning methods are used to invert the planning process of another agent by observing its behavior. However, inverse reinforcement learning methods typically assume that agents act optimally, which can quickly lead to an agent solving an intractable and cognitively implausible problem. SIPS models other agents as planners who are boundedly rational, meaning that the amount of planning conducted by an agent is bounded and resource-limited. This leads SIPS observers to discover sub-optimal trajectories that can be improved through online Bayesian inference. The results in the paper show that SIPS predicts human judgements for goal inference better than inverse reinforcement learning methods across 4 different domains.

For this project I will re-implement SIPS and evaluate it by reproducing results over several experiments that demonstrate its human-likeness, accuracy, and efficiency. To evaluate human-likeness I will reproduce a set of qualitative experiments by running SIPS over a domain in which an agent navigates a maze with doors, keys, and gems. In this domain, the agent can use keys to unlock doors to collect gems. My goal here is not to reproduce the human experiments part of the paper, but instead, I plan to re-implement SIPS, and observe the goals inferred by a SIPS observer, and finally compare those to the results from the original paper. For the purposes of this project, I will assume the results from the human experiments conducted in the paper are valid and use them to additionally verify that SIPS is more human-like than the results from other algorithms reported in the paper. Thus, the goal here is to reproduce SIPS human-likeness results by achieving the same inferences as those made in the paper.

Analyses for this project will include a correlation between the inferences made over time by my re-implementation of SIPS and those reported in the original paper for a suboptimal plan with a given start state and goal (in this case a bue gem) for the gem, doors, and keys domain. Additionally, I will run SIPS on several problems consistising of intital start states and goal pairs on the doors, keys, and gems domain, and compute the average probability of the true goal conditioned on observed history at three different quartiles of the trajectory, the proportion of time the true goal is the top 1 accuracy in the inferred distribution, and the total number of states visted during the entire trajectory.

Finally I will also compare to one baseline presented in the supplement of the paper known as Plan Recognition as Planning (PRP).

Justification for choice of study

Please describe why you chose to reproduce the results of this study.

I am interested in planning in physical environments. This paper will provide me an opportunity to become acquainted with some of the state of the art tools for planning and decision making. Furthermore, I will learn about Julia, a relatively new numerical programming language that is both high level and fast, and Gen.jl, the probabilistic programming language, which is central to many of the paradigms around which planning has been modeled, and is developed by researchers I am interested in collaborating with at MIT. Finally, I am not just interested in planning per se, but also, in topics revolving around planning and computational theory of mind. This paper sits right at the intersection of these research areas and I am highly interested in how these algorithms are implemented and potentially, beyond this course, how they might be extended to 3D physical environments with multiagent social interactions.

Anticipated challenges

Do you anticipate running into any challenges when attempting to reproduce these result(s)? If so please, list them here.

Challenges will mostly be around interpretation of the pseudo algorithm presented in the paper. I have decided to re-implement the paper, but not to translate it into another language like Python. The reason for this is because I am interested in learning Julia and Gen.jl, and as is usually the case, the same function can be implemented in many ways in code.

Another challenge I may face is the amount of training time required to get results. Planning algorithms can take long periods of training and subsequently may be hard to debug efficiently when errors are made. I anticipate needing to devote a significant amount of time and attention to not only coding, but possibly reaching out to the authors to understand bits of psudo code that I may not understand to reduce errors in my own implementation.

Links

Project repository (on Github): repository

Original paper (as hosted in your repo): original paper

Preregistration link: preregistration

Methods

The reimplementation of SIPS for online bayesian search will be based on the algorithm pseudocode provided in the paper and is shown below.

Algorithm Pseudocode for a SIPS Observer. The folowing figures show the trajectories for the inferences humans, SIPS, and PRP make for an agent in the doors keys, and gems domain executing a suboptimal plab with backtracking.

SIPS Inferences

Human Trajectories

Qualitative Comparisons

Exclusions and Differences from the original study

There will be to my knowledge no known differences in the environments (experiments). The main difference in the planning algorithms however will come from the re-implementation from Algorithm 1 (SIPS). These difference my be minor quantitative differences in final values for distribution predictions over the trajectories but also in the efficiency of the algorithm (i.e. how fast it runs).

An agent executing SIPS over another agent acting within an environment is called an observer. This observer will, as the name denotes, observe an agent execute its plans in one of the four environments described above. As the observer receives observations at each time step, it is tasked with modeling the agent’s goals and achieves this by also modeling the plans it forms in order to achieve those goals. Most importantly, SIPS observers assume agents are modeling their environment through generative processes that are resource rational and instead of modeling agents as full horizon planners, they model them as agents that form partial plans, that do not search for further plans until they have executed their partial ones. The horizon of these partial plans is modeled as a random variable which is sampled from a negative binomial distribution. The search process for these partial plans is itself a stochastic version of the A* search algorithm. Furthermore, Monte Carlo sampling it used to perform inference in an online fashion. Thus, this re-implementation will require several steps of random sampling. Given that random sampling depends on random seeds which have inherent variability I anticipate that the stochastic nature of this algorithm will lead to minor variations on the final results.

I will not compute analysis for 3 other domains Taxi, Intrusion Detection, and Word Blocks domains. Also I wil only compute the trajectory correlations for one problem in the Doors, keys, gems domain while computing the average true goall probabiltiy, top 1 accuracy, and efficiency of SIPS and PRP over severall problems in the doors, keys, gems domain. I will not run or compute any results Bayesian Reinforcement learning baeslines, only PRP.

Description of the steps required to reproduce the results

Re-implement Algorithm 1 (SIPS) using Julia within the Gen.jl and PDDL.jl framework
Execute an SIPs observer over an agent acting in the the doors, keys, and gems environment for both sub-optimal plans (online sequential partial plans) and failed plans (plans that fail to achieve their intended goals).
Record the distributions predicted over time (trajectories predicted) for the doors, keys and gems environment over the three possible goals (Red Gem, Yellow Gem, and Blue Gem) that a SIPs observer makes over agents under a suboptimal plan with backtracking.
Compute the inferences of a SIPs observer for the the doorsm keysm and gems domain over several distinct problems (problems consist of distinct intitate state and goal pairs). Infernces to be compared in the analysis are the probability of the true goal given the history of trajectories, the proportion of that probability being the top 1 probability in the distribution both at each of three quartiels, and the efficiency of the inferences over the entire trajectory
Rerun the same computations using off the shelf PRP.

Data Preparation

See code.

This was done by cloning the original repository, implementing a new particle filter reimplementation for the SIPS aglorithm to call in and then running experiements with new implementation with commands as detailed in the README.md.

Reproduced data is found in the results folder. This folder is not from the origina data from the paper, but is from running the reimplementation of SIPS and from running the original baseline PRP. Original data for running these algorithms is found in the supplementary for the paper, which was manually copied into a datafram using Julia, the notebook where this is done can be found here Playground.ipynb.

Results

Load Relevant Libraries and Functions

library(tidyverse)
library(reshape2)
library(ggplot2)
library(cowplot)
library(latex2exp)
library(knitr)
library(lemon)

Import Data

The data below are from running the SIPS reimplemenation and the baseline PRP as well as original data from the paper manually copied to a dataframe in Julia and printed as a csv file imported here. See Julia notebook at .

# import reproduced summary data for sips and prp under optimal plans for all four domains over several problems (intial state and goal pairs)
reproduced_optimal_df <- read_csv("/Users/juliomartinez/Documents/Repositories/zhi-xuan2020/data/reproduced_optimal.csv", show_col_types = FALSE)
#head(reproduced_optimal_df)

# import reproduced summary data for sips and prp under suboptimal plans for all four domains over several problems (intial state and goal pairs)
reproduced_suboptimal_df <- read_csv("/Users/juliomartinez/Documents/Repositories/zhi-xuan2020/data/reproduced_suboptimal.csv", show_col_types = FALSE)
#head(reproduced_suboptimal_df)

# import original summary data (copied from paper) for sips and prp under optimal plans for all four domains over several problems (intial state and goal pairs)
original_optimal_df <- read_csv("/Users/juliomartinez/Documents/Repositories/zhi-xuan2020/data/original_optimal.csv", show_col_types = FALSE)
#head(original_optimal_df)

# import original summary data (copied from paper) for sips and prp under suboptimal plans for all four domains over several problems (intial state and goal pairs)
original_suboptimal_df <- read_csv("/Users/juliomartinez/Documents/Repositories/zhi-xuan2020/data/original_suboptimal.csv", show_col_types = FALSE)
#head(original_suboptimal_df)

# import original summary data for human inferences over doors-gems-gems domain for several problems (intial state and goal pairs)
human_inferences_df <- read_csv("/Users/juliomartinez/Documents/Repositories/zhi-xuan2020/data/original_human_inferences.csv", show_col_types = FALSE)
#head(human_inferences_df)

# import reproduced data for sips for a problem (initial state and goal pair) and specific suboptimal plan on the doors-keys-gems domain 
reproduced_sips_supoptimal_inferences_df <- read_csv("/Users/juliomartinez/Documents/Repositories/zhi-xuan2020/code/experiments/results/doors-keys-gems-run-0-sips-suboptimal/doors-keys-gems_problem_6_goal2_0.csv", show_col_types = FALSE)
#head(reproduced_sips_supoptimal_inferences_df)

# import reproduced data for prp for a problem (initial state and goal pair) and specific suboptimal plan on the doors-keys-gems domain 
reproduced_prp_supoptimal_inferences_df <- read_csv("/Users/juliomartinez/Documents/Repositories/zhi-xuan2020/code/experiments/results/doors-keys-gems-run-0-prp-suboptimal/doors-keys-gems_problem_6_goal2_0.csv", show_col_types = FALSE)
#head(reproduced_prp_supoptimal_inferences_df)

Data Analysis

human_dist = human_inferences_df %>%
  filter(problem == "problem_6_goal_2_0") %>%
  select(timestep, goal_probs_0, goal_probs_1, goal_probs_2) %>%
  pivot_longer(cols=-c("timestep"),
               names_to='goal', 
               values_to='prob') %>%
  group_by(timestep, goal) %>%
  summarize_at(c("prob"), list(mean = mean, sd = sd))

human_ribbon_dist <- ggplot(human_dist, aes(x=timestep, y=mean, group=goal, color=goal)) + 
  scale_color_manual(
    labels = c("red gem", "yellow gem", "blue gem"), 
    values=c("#e6102d","#ffc533", "#669bcc")) + 
  scale_fill_manual(values=c("#e6102d","#ffc533", "#669bcc")) +
  geom_line(mapping=aes(colour=goal)) + 
  geom_point() + 
  geom_ribbon(aes(x=timestep, ymin=mean-sd, ymax=mean+sd, fill=goal), 
              alpha = 0.5, 
              color = NA) +
  ylab("probability") + 
  xlab("time") +
  ggtitle("Human Inferences (n=7)") +
  coord_fixed(ratio=10)

suboptimal_sips_dist = reproduced_sips_supoptimal_inferences_df %>%
  tibble::rowid_to_column("timestep") %>%
  select(timestep, goal_probs_0, goal_probs_1, goal_probs_2) %>%
  pivot_longer(cols=-c("timestep"),
               names_to='goal', 
               values_to='prob') 

suboptimal_sips_plot <- ggplot(suboptimal_sips_dist, 
                               aes(x=timestep, y=prob, group=goal, color=goal)) + 
  scale_color_manual(
    labels = c("yellow gem", "red gem", "blue gem"), 
    values=c("#ffc533", "#e6102d", "#669bcc")) + 
  scale_fill_manual(values=c("#e6102d","#ffc533", "#669bcc")) +
  geom_line(mapping=aes(colour=goal)) + 
  geom_point() + 
  ylab("probability") + 
  xlab("time") + 
  ggtitle("SIPS Inferences") +
  coord_fixed(ratio=10)

suboptimal_prp_dist = reproduced_prp_supoptimal_inferences_df %>%
  tibble::rowid_to_column("timestep") %>%
  select(timestep, goal_probs_0, goal_probs_1, goal_probs_2) %>%
  pivot_longer(cols=-c("timestep"),
               names_to='goal', 
               values_to='prob') 


  
suboptimal_prp_plot <- ggplot(suboptimal_prp_dist, 
                               aes(x=timestep, y=prob, group=goal, color=goal)) + 
  scale_color_manual(
    labels = c("yellow gem","red gem", "blue gem"), 
    values=c("#ffc533", "#e6102d", "#669bcc")) + 
  scale_fill_manual(values=c("#e6102d","#ffc533", "#669bcc")) +
  geom_line(mapping=aes(colour=goal)) + 
  geom_point() + 
  ylab("probability") + 
  xlab("time") + 
  ggtitle("PRP Inferences") +
  coord_fixed(ratio=10)

xout = seq(1,36,length.out=50) 

temp_humans <- human_dist  %>%
  filter(goal=="goal_probs_0") 
humans_p0  <- approx(x=temp_humans$timestep , y=temp_humans$mean, xout=xout)

temp_humans <- human_dist  %>%
  filter(goal=="goal_probs_1") 
humans_p1  <- approx(x=temp_humans$timestep , y=temp_humans$mean, xout=xout)

temp_humans <- human_dist  %>%
  filter(goal=="goal_probs_2") 
humans_p2  <- approx(x=temp_humans$timestep , y=temp_humans$mean, xout=xout)

temp_sips <- suboptimal_sips_dist  %>%
  filter(goal=="goal_probs_0") 
sips_p0  <- approx(x=temp_sips$timestep , y=temp_sips$prob, xout=xout)

temp_sips <- suboptimal_sips_dist  %>%
  filter(goal=="goal_probs_1") 
sips_p1  <- approx(x=temp_sips$timestep , y=temp_sips$prob, xout=xout)

temp_sips <- suboptimal_sips_dist  %>%
  filter(goal=="goal_probs_2") 
sips_p2  <- approx(x=temp_sips$timestep , y=temp_sips$prob, xout=xout)

temp_prp <- suboptimal_prp_dist  %>%
  filter(goal=="goal_probs_0") 

prp_p0  <- approx(x=temp_prp$timestep , y=temp_prp$prob, xout=xout)

temp_prp <- suboptimal_prp_dist  %>%
  filter(goal=="goal_probs_1") 
prp_p1  <- approx(x=temp_prp$timestep , y=temp_prp$prob, xout=xout)

temp_prp <- suboptimal_prp_dist  %>%
  filter(goal=="goal_probs_2") 
prp_p2  <- approx(x=temp_prp$timestep , y=temp_prp$prob, xout=xout)

sips_trajectory_vector <- c(sips_p0$y, sips_p1$y, sips_p2$y)
prp_trajectory_vector <- c(prp_p0$y, prp_p1$y, prp_p2$y)
human_trajectory_vector <- c(humans_p0$y, humans_p1$y, humans_p2$y)

r_sips <- cor(x=sips_trajectory_vector, human_trajectory_vector, method = c("pearson"))
#print(r_sips)

r_prp <- cor(x=prp_trajectory_vector, human_trajectory_vector, method = c("pearson"))
#print(r_prp)

correlations <- data.frame(c(r_sips, r_prp), c("sips", "prp"))    
names(correlations) <- c("Correlation", "Method")

mean_prob_true_goal_quartiles <- function(optimality, domain, df, stat, data) {

  if (data=="original"){
    colnames(df) <- c("Domain", "Method", 'Q1_mean','Q1_std', 'Q2_mean','Q2_std', 'Q3_mean','Q3_std', 
                              "Q1_top_ranked_mean", "Q1_top_ranked_std", "Q2_top_ranked_mean", "Q2_top_ranked_std", 
                              "Q3_top_ranked_mean", "Q3_top_ranked_std","N_mean", "N_std")
  }
  
  if (stat=="proportion_top1") {
    selections <- c("Q1_top_ranked_mean", "Q2_top_ranked_mean", "Q3_top_ranked_mean", "Q1_top_ranked_std", "Q2_top_ranked_std", "Q3_top_ranked_std")
    x_y_names <- c("x1", "x2", "x3", "y1","y2","y3")
  } else if (stat=="average_probability") {
    selections <- c("Q1_mean", "Q2_mean", "Q3_mean", "Q1_std", "Q2_std", "Q3_std")
    x_y_names <- c("x1", "x2", "x3", "y1","y2","y3")
  } else if (stat=="average_states_visited") {
    selections <- c("N_mean", "N_std")
    x_y_names <- c("x1", "y1")
  }
  
  # filter out method (sips or prp) and domain (keys-doors-gems, taxi, etc)
  if (optimality=="Optimal") {
    # filter out unused baselines
    df2 <- df %>%
      filter(Method != "BIRL-U")  %>%
      filter(Method != "BIRL-O") 
    
    sips_df <- df2 %>%
      filter(Domain == domain, Method == "sips") %>%
      select(contains(selections))
    prp_df <- df2 %>%
      filter(Domain == domain, Method == "prp") %>%
      select(contains(selections))
  } else {
    sips_df <- df%>%
      filter(Domain == domain, Method == "sips") %>%
      select(contains(selections))
    prp_df <- df%>%
      filter(Domain == domain, Method == "prp") %>%
      select(contains(selections))
  }
  
  # rewrite column names for grouping
  colnames(sips_df) <- x_y_names
  colnames(prp_df) <- x_y_names
  
  # prep to summarize
  sips_df <- sips_df %>%
    pivot_longer(everything(),
                 names_to = c(".value", "set"),
                 names_pattern = "(.)(.)")
  
  # prep to summarize
  prp_df <- prp_df %>%
    pivot_longer(everything(),
                 names_to = c(".value", "set"),
                 names_pattern = "(.)(.)")
  
  sips_df$method <- "sips"
  prp_df$method <- "prp"
  
  rbind_df <- rbind(sips_df, prp_df)
  
  colnames(rbind_df) <- c("quartile", "mean", "sd", "method")
  
  return(rbind_df) 
}

get_mean_prob_true_goal_quartiles_plot <- function(quartiles_df, optimality, stat, data){
  if (stat=="proportion_top1"){
    y_label <- TeX(r'(Prop $P(g_{true}|o)$ is Top 1)')
    x_label <- "Trajectory Quartile"
  } else if (stat=="average_probability") {
    y_label <- TeX(r'(Average $P(g_{true}|o)$)') 
    x_label <- "Trajectory Quartile"
  } else if (stat=="average_states_visited"){
    y_label <- TeX(r'(Average N)')
    x_label <- ""
  }
  if (optimality=="optimal"){
    title_label<- "Optimal"
  } else {
    title_label <- "Suboptimal"
  }
  if (data=="new"){
    title_label <- paste("New Data,", title_label)
  } else {
    title_label <- paste("Original Data,", title_label)
  }
  quartiles_plot <- ggplot(quartiles_df, aes(x=quartile, y=mean, fill=method)) + 
    geom_bar(stat="identity", color="black", 
             position=position_dodge()) +
    geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), width=.2,
                   position=position_dodge(.9)) +
    labs(title=title_label, x=x_label, y=y_label)+
     theme_classic() #+ ylim(0,1.25)
  return(quartiles_plot)
}

orig_opt_df_avg_prob <- mean_prob_true_goal_quartiles(optimality="Optimal", 
                                             domain="doors-keys-gems", 
                                             df=original_optimal_df, 
                                             stat="average_probability",
                                             data="original") 
orig_opt_plot_avg_prob = get_mean_prob_true_goal_quartiles_plot(quartiles_df=orig_opt_df_avg_prob, 
                                                       optimality="optimal",
                                                       stat="average_probability",
                                                       data="original")

orig_sub_df_avg_prob <- mean_prob_true_goal_quartiles(optimality="Suboptimal", 
                                             domain="doors-keys-gems", 
                                             df=original_suboptimal_df,
                                             stat="average_probability",
                                             data="original")  
orig_sub_plot_avg_prob = get_mean_prob_true_goal_quartiles_plot(quartiles_df=orig_sub_df_avg_prob,
                                                       optimality="suboptimal",
                                                       stat="average_probability",
                                                       data="original")

new_opt_df_avg_prob <- mean_prob_true_goal_quartiles(optimality="Optimal", 
                                             domain="doors-keys-gems", 
                                             df=reproduced_optimal_df, 
                                             stat="average_probability",
                                             data="new") 
new_opt_plot_avg_prob = get_mean_prob_true_goal_quartiles_plot(quartiles_df=new_opt_df_avg_prob, 
                                                      optimality="optimal",
                                                      stat="average_probability",
                                                      data="new")

new_sub_df_avg_prob <- mean_prob_true_goal_quartiles(optimality="Suboptimal", 
                                             domain="doors-keys-gems", 
                                             df=reproduced_suboptimal_df,
                                             stat="average_probability",
                                             data="new") 
new_sub_plot_avg_prob = get_mean_prob_true_goal_quartiles_plot(quartiles_df=new_sub_df_avg_prob, 
                                                      optimality="suboptimal",
                                                      stat="average_probability",
                                                      data="new")

orig_opt_df_top1 <- mean_prob_true_goal_quartiles(optimality="Optimal", 
                                             domain="doors-keys-gems", 
                                             df=original_optimal_df, 
                                             stat="proportion_top1",
                                             data="original") 
orig_opt_plot_top1 = get_mean_prob_true_goal_quartiles_plot(quartiles_df=orig_opt_df_top1, 
                                                       optimality="optimal",
                                                       stat="proportion_top1",
                                                       data="original")

orig_sub_df_top1 <- mean_prob_true_goal_quartiles(optimality="Suboptimal", 
                                             domain="doors-keys-gems", 
                                             df=original_suboptimal_df,
                                             stat="proportion_top1",
                                             data="original")  
orig_sub_plot_top1 = get_mean_prob_true_goal_quartiles_plot(quartiles_df=orig_sub_df_top1,
                                                       optimality="suboptimal",
                                                       stat="proportion_top1",
                                                       data="original")

new_opt_df_top1 <- mean_prob_true_goal_quartiles(optimality="Optimal", 
                                             domain="doors-keys-gems", 
                                             df=reproduced_optimal_df, 
                                             stat="proportion_top1",
                                             data="new") 
new_opt_plot_top1 = get_mean_prob_true_goal_quartiles_plot(quartiles_df=new_opt_df_top1, 
                                                      optimality="optimal",
                                                      stat="proportion_top1",
                                                      data="new")

new_sub_df_top1 <- mean_prob_true_goal_quartiles(optimality="Suboptimal", 
                                             domain="doors-keys-gems", 
                                             df=reproduced_suboptimal_df,
                                             stat="proportion_top1",
                                             data="new") 
new_sub_plot_top1 = get_mean_prob_true_goal_quartiles_plot(quartiles_df=new_sub_df_top1, 
                                                      optimality="suboptimal",
                                                      stat="proportion_top1",
                                                      data="new")

orig_opt_df_avg_states <- mean_prob_true_goal_quartiles(optimality="Optimal", 
                                             domain="doors-keys-gems", 
                                             df=original_optimal_df, 
                                             stat="average_states_visited",
                                             data="original") 
orig_opt_plot_avg_states = get_mean_prob_true_goal_quartiles_plot(quartiles_df=orig_opt_df_avg_states, 
                                                       optimality="optimal",
                                                       stat="average_states_visited",
                                                       data="original")

orig_sub_df_avg_states <- mean_prob_true_goal_quartiles(optimality="Suboptimal", 
                                             domain="doors-keys-gems", 
                                             df=original_suboptimal_df,
                                             stat="average_states_visited",
                                             data="original")  
orig_sub_plot_avg_states = get_mean_prob_true_goal_quartiles_plot(quartiles_df=orig_sub_df_avg_states,
                                                       optimality="suboptimal",
                                                       stat="average_states_visited",
                                                       data="original")

new_opt_df_avg_states <- mean_prob_true_goal_quartiles(optimality="Optimal", 
                                             domain="doors-keys-gems", 
                                             df=reproduced_optimal_df, 
                                             stat="average_states_visited",
                                             data="new") 
new_opt_plot_avg_states = get_mean_prob_true_goal_quartiles_plot(quartiles_df=new_opt_df_avg_states, 
                                                      optimality="optimal",
                                                      stat="average_states_visited",
                                                      data="new")

new_sub_df_avg_states <- mean_prob_true_goal_quartiles(optimality="Suboptimal", 
                                             domain="doors-keys-gems", 
                                             df=reproduced_suboptimal_df,
                                             stat="average_states_visited",
                                             data="new") 
new_sub_plot_avg_states = get_mean_prob_true_goal_quartiles_plot(quartiles_df=new_sub_df_avg_states, 
                                                      optimality="suboptimal",
                                                      stat="average_states_visited",
                                                      data="new")

t.test2 <- function(m1,m2,s1,s2,n1,n2,m0=0,equal.variance=FALSE)
{
    if( equal.variance==FALSE ) 
    {
        se <- sqrt( (s1^2/n1) + (s2^2/n2) )
        # welch-satterthwaite df
        df <- ( (s1^2/n1 + s2^2/n2)^2 )/( (s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1) )
    } else
    {
        # pooled standard deviation, scaled by the sample sizes
        se <- sqrt( (1/n1 + 1/n2) * ((n1-1)*s1^2 + (n2-1)*s2^2)/(n1+n2-2) ) 
        df <- n1+n2-2
    }      
    t <- (m1-m2-m0)/se 
    dat <- c(m1-m2, se, t, 2*pt(-abs(t),df))    
    names(dat) <- c("Diff", "StdError", "t", "p_value")
    return(dat) 
}

orig_opt_df_avg_prob$optimality <- "optimal"
orig_opt_df_avg_prob$metric <- "avg_probability"
orig_sub_df_avg_prob$optimality <- "suboptimal"
orig_sub_df_avg_prob$metric <- "avg_probability"

orig_opt_df_top1$optimality <- "optimal"
orig_opt_df_top1$metric <- "top1"
orig_sub_df_top1$optimality <- "suboptimal"
orig_sub_df_top1$metric <- "top1"

orig_opt_df_avg_states$optimality <- "optimal"
orig_opt_df_avg_states$metric <- "avg_states"
orig_sub_df_avg_states$optimality <- "suboptimal"
orig_sub_df_avg_states$metric <- "avg_states"


new_opt_df_avg_prob$optimality <- "optimal"
new_opt_df_avg_prob$metric <- "avg_probability"
new_sub_df_avg_prob$optimality <- "suboptimal"
new_sub_df_avg_prob$metric <- "avg_probability"

new_opt_df_top1$optimality <- "optimal"
new_opt_df_top1$metric <- "top1"
new_sub_df_top1$optimality <- "suboptimal"
new_sub_df_top1$metric <- "top1"

new_opt_df_avg_states$optimality <- "optimal"
new_opt_df_avg_states$metric <- "avg_states"
new_sub_df_avg_states$optimality <- "suboptimal"
new_sub_df_avg_states$metric <- "avg_states"



original_data_summary <- rbind(orig_opt_df_avg_prob, 
                               orig_sub_df_avg_prob, 
                               orig_opt_df_top1, 
                               orig_sub_df_top1,
                               orig_opt_df_avg_states, 
                               orig_sub_df_avg_states)

new_data_summary <- rbind(new_opt_df_avg_prob, 
                          new_sub_df_avg_prob, 
                          new_opt_df_top1, 
                          new_sub_df_top1,
                          new_opt_df_avg_states, 
                          new_sub_df_avg_states)

n = nrow(original_data_summary)

difference_of_means <- c()
std_errors <- c()
ts <- c()
p_values <- c()
optimality <- c()
metric <- c()
method <- c()

for (row in 1:n) {
  m1 <- original_data_summary$mean[row]
  s1 <- original_data_summary$sd[row]
  m2 <- new_data_summary$mean[row] 
  s2 <- new_data_summary$sd[row] 
  n1 <-13.0
  n2 <-13.0
  
  ttest_result <- t.test2(m1,m2,s1,s2,n1,n2,m0=0,equal.variance=FALSE)
  
  difference_of_means <- c(difference_of_means, ttest_result["Diff"])
  std_errors <- c(std_errors , ttest_result["StdError"])
  ts <- c(ts, ttest_result["t"])
  p_values <- c(p_values, ttest_result["p_value"])
  optimality <- c(optimality, original_data_summary$optimality[row])
  metric <- c(metric, original_data_summary$metric[row])
  method <- c(method, original_data_summary$method[row])
}

significance_tests <- data.frame(difference_of_means, std_errors, ts, p_values, optimality, metric, method)
colnames(significance_tests) <- c("Difference", "Std Error", 't','p-Value', "Optimality", "Metric", "Method")

significance_tests <- significance_tests %>% 
  mutate(across(where(is.numeric), ~ round(., digits = 2)))

Key Analysis 1: Qualitative Comparison of a Suboptimal Plan

plot_grid(human_ribbon_dist, 
          suboptimal_sips_plot, 
          suboptimal_prp_plot,
          ncol = 1)

Key Analysis 2: Correlations for SIPS vs Humans, and PRP vs Humans

kable(correlations,caption="Correlation to Human Judgements")

Correlation to Human Judgements
Correlation	Method
0.7619832	sips
0.7482961	prp

Key Analysis 3: Accuracy and Efficiency Comparison

plot_grid(orig_opt_plot_avg_prob, 
          orig_sub_plot_avg_prob,
          new_opt_plot_avg_prob,
          new_sub_plot_avg_prob,
          ncol = 2)

plot_grid(orig_opt_plot_top1, 
          orig_sub_plot_top1,
          new_opt_plot_top1,
          new_sub_plot_top1,
          ncol = 2)

plot_grid(orig_opt_plot_avg_states, 
          orig_sub_plot_avg_states,
          new_opt_plot_avg_states,
          new_sub_plot_avg_states,
          ncol = 2)

Key Analysis 4: Significance Tests

knit_print.data.frame <- lemon_print
kable(significance_tests,caption='Two Tailed T-Tests: New vs Original Results')

Two Tailed T-Tests: New vs Original Results
Difference	Std Error	t	p-Value	Optimality	Metric	Method
0.00	0.07	-0.03	0.98	optimal	avg_probability	sips
-0.13	0.13	-1.06	0.30	optimal	avg_probability	sips
-0.20	0.11	-1.80	0.09	optimal	avg_probability	sips
0.00	0.07	0.04	0.97	optimal	avg_probability	prp
0.00	0.12	0.02	0.98	optimal	avg_probability	prp
0.00	0.10	-0.04	0.97	optimal	avg_probability	prp
-0.05	0.07	-0.72	0.49	suboptimal	avg_probability	sips
0.02	0.14	0.13	0.90	suboptimal	avg_probability	sips
-0.09	0.15	-0.57	0.57	suboptimal	avg_probability	sips
0.00	0.07	-0.01	0.99	suboptimal	avg_probability	prp
0.00	0.13	0.00	1.00	suboptimal	avg_probability	prp
0.00	0.16	0.02	0.98	suboptimal	avg_probability	prp
-0.07	0.17	-0.41	0.69	optimal	top1	sips
-0.07	0.17	-0.41	0.69	optimal	top1	sips
-0.20	0.11	-1.76	0.10	optimal	top1	sips
0.00	0.00	NaN	NaN	optimal	top1	prp
0.00	0.00	NaN	NaN	optimal	top1	prp
0.00	0.00	NaN	NaN	optimal	top1	prp
-0.02	0.17	-0.10	0.92	suboptimal	top1	sips
0.05	0.18	0.28	0.78	suboptimal	top1	sips
-0.03	0.18	-0.18	0.85	suboptimal	top1	sips
0.00	0.12	0.00	1.00	suboptimal	top1	prp
0.00	0.18	0.00	1.00	suboptimal	top1	prp
0.00	0.15	-0.02	0.98	suboptimal	top1	prp
-16768.47	3183.39	-5.27	0.00	optimal	avg_states	sips
-272900.60	63771.02	-4.28	0.00	optimal	avg_states	prp
-15171.83	2263.82	-6.70	0.00	suboptimal	avg_states	sips
-296794.28	74590.09	-3.98	0.00	suboptimal	avg_states	prp

Discussion

Key Analysis 1 and 2:

For my results, I did not see significant differences between human judgements and SIPS inferences. However, unlike the paper, SIPS inferences become even sharper, where inferences become overly confident earlier than those shown in the paper. This also happens to be the case for PRP although like the results in the paper it reacts less sharply to observations in comparison to SIPS. The paper concluded that SIPS deomnstrated much more human like characteristics, but from my results, I see that both SIPS and PRP are not that different. This can be more clearly seen by the correlations of 0.76 vs 0.75 which does not demonstrate a big difference in qualitative difference between tthe two algorithms. In the paper a correlation of 0.99 was found between human judgements and SIPS, which is much higher than was was found here. Nevertheless, the characteristic is still indicative of showing that SIPS picks up the cues that humans do when making new observations of an agents trajectory leading it to sharply change its inferences to be based on reqwighting the likelihood of its plans given new observations.

Key Analysis 3 and 4:

In almost al cases of the optimal and suboptimal plans, SIPS and PRP are the same with little chance. There are a few cases where in the original data, the average probability of the true goal is greater for PRP than SIPS and this observation is reversed in the new data from the reproduction. However, looking at the significance test we can conclude that these differences are not significant.

In fact, most of the reproduction here was a success, without any p-values dropping below 0.05 with the exception of the reprroduction for the average number of states visited by both prp and sips where we have significant differences between my new data and the original data. Despite these significant differences, the results still hold primarily because the main claim here is that SIPS is significantly more efficeient that PRP. That is still true despite the fact that in both cases the number of states visited is significantly higher, two orders of magnitude higher, than that reported in the paper.

There are a few NaN values reported in the t-tests. This is dude to the lack of variance for some of the runs when PRP becomes very confident, there is 0 standard deviation in both the original data and the new results and hence no significant difference between the two and hence a p-value is unecessary.

Summary of Reproduction Attempt

The primary results are broken into several parts. 1. SIPS shows human like inferences. This was a success, however, not significantly more than PRP which was a claim made about one of the significant advantages in the paper. 2. All differences between the average probability of the underlying true goal and the proportion of time the underlying prbability of true goal is correct are insignificant and were reproduced successfully. The efficiency claim also holds despite both methods being more costly to compute. Because both increased the same relationship, namely that SIPS is more efficent than PRP still holds and to a large degree. Overall I would say the reproduction was a success with the exception of the main benefit of SIPS is efficiency. It is not clear from my results that is is more human like. However, I only tested the trajectory characteristic of SIPS vs PRP one a single problem. It is possible that over several runs of the same problem that SIPS demonstrates significant differences and due to the variation in sampling procedures I sampled an edge case where SIPS does not show an strong advantage over PRP to modeling human behavior (aside from efficinecy). In computing the average probility, the top 1 accuracy over several problems I found no significant differences, and comuting these values over several probem is likelly what yiielded good results, since the inherent nature of smapling would indicate that many runs are need to get a good sence of the performance of SIPS and PRP.

Commentary

I learned alot from this project. I was able to get in touch with the first author and was given clarity of some of the questions I had when I felt some details were missing from the paper. Getting the code to work required some backporting in order for the original code to work despite is being only a year since code was made available. I learend about Julia the programming lanauge and Gen which I am excited to use more in my own research.

I initially set out to reproduce results for several domains listed inthe paper (a tota of 4 each on several problems). However, while I did run SIPS and PRP on all four domains, I did not have time to analyze all of the results. These are included in the data folder in the github repo and my future plan is to rerun my analyses on the the left out three domains taxi, intrusion detection, and world bocks, respectively.