Reproducibility of Ranking Visualization of Correlation using Weber’s Law by Harrison et al. (2014, TVCG)

Author

Fuling Sun

Published

December 8, 2024

Introduction

My research focuses on information visualization, particularly how visualizations can be design effectively to align with authors’ communication goals and support readers’ analytic tasks. The target paper, Ranking Visualizations of Correlation Using Weber’s Law, models the perception of correlation afforded by visualizations, providing quantitative methods to compare and rank them. Replicating this study will contribute to my understandings in both theoretical development and tool design, while also enhancing my skills in statistical analysis.

To reproduce the analysis from the paper, I will use the open-source data provided by the authors, follow the analysis pipeline, and implement the process using R. Specifically, I will conduct the following tasks:

  1. Data processing: I will understand the meaning of the variables in the dataset, and aggregate the data based on the experimental conditions, as described in Harrison et al.
  2. Statistical analysis: I will perform analysis on both the aggregated data (from Harrison et al.) and the individual judgments (from Kay and Heer), applying techniques such as the Kruskal-Wallis test, Mann-Whitney-Wilcoxon tests, linear models, log-linear models and Bayesian estimation.
  3. Result comparison: I will compare my results with those reported in the original papers to identify any discrepancies.
  4. Additional analysis: I will propose and conduct further analyses to explore new questions that arise during the process.

Some potential challenges in this project include learning each statistic method and implementing them in R, which I am not yet fully familiar with. Additionally, since part of the analysis will draw from the follow-up study, I will need to understand the differences between these methods, their strengths and limitations, and when to apply each one.

Links:

Methods

This section presents the methodology, procedure, and analysis pipeline from the original study. Quoted materials from the original paper are indicated with quotation marks, while certain sections have been summarized for brevity due to the report’s length.

Methodology

The original study aimed to infer just-noticeable differences (JNDs) of the perception of correlation on different visualizations. There were nine types of visualization, six correlation values r (0.3, 0.4, 0.5, 0.6, 0.7 and 0.8) and two approach conditions (above and below).

The staircase procedure was used. “In the staircase procedure, given a target value for correlation, r, participants are given two visualization stimuli side-by-side (two scatterplots in this case) and asked to choose which they perceive to have a higher correlation. With an”above” approach, the participant is given one visualization with the target r, and another with an r value higher than the target. For example, if the target r is 0.7, then the second r value would be 0.8 (assuming a starting distance of 0.1). Conversely, with an “below” approach, the participant would be given a visualization with the target r, and another that has an r value lower than the target.”

“In both cases, if a participant chooses correctly, the distance in correlation between the two visualizations is decreased by 0.01 while keeping the target r constant (e.g. 0.7 versus 0.79 in the”above” condition, or 0.7 versus 0.61 in the “below” condition). If a participant chooses incorrectly, the distance in correlation between two visualizations is increased by 0.03, making the next judgment easier. The staircase procedure “hones in” on the JND by penalizing incorrect choices more than correct choices.”

“The staircase procedure ends when either 50 individual judgments are reached or when a convergence criteria is met.”

Planned Sample

The original experiment was conducted with 1687 participants (834 female) on Amazon Mechanical Turk (MTurk). In total, there were 6772 records. No other demographics or pre-selection rules were mentioned in the paper.

Materials

“Our study used a total of nine visualizations, two correlation directions (positive/negative), and six correlation values (0.3 to 0.8) yielding 54 main groups. Since each participant was assigned to one visualization, one correlation direction, and two correlation values (above and below), roughly 30 participants were assigned to each visualization×direction×r-value group.”

“We chose nine visualizations for this experiment based on two main criteria: a) they must be commonly used in either infovis or commercial software (external validity), and b) they must be viable within the constraints of the experiment methodology. The nine visualizations chosen include: scatterplots, parallel coordinates plots, stacked area charts, stacked bar charts, stacked line charts, line charts, ordered line charts, radar charts, and donut charts.”

“All visualizations were 300×300 pixels, contained 100 data points and displayed datasets generated from same algorithm. For visualizations that required more than one color, we chose a single color scheme from ColorBrewer.”

Figure 1 (from the original paper): a) A sample starting comparison from the experiment: r = 0.7 on the left and r = 0.6 on the right. Participants were asked to choose which of the two appeared to be more highly correlated. b) The staircase procedure hones in on the just-noticeable difference by gradually making comparisons more difficult: r = 0.7 on the left and r = 0.65 on the right.

Procedure

To accommodate the diverse education background of participants from MTurk, each study started with a training and a practice session. Participants first read the definition of correlation and examples of visualization with different correlation values. Then they were given practice sessions of 30 judgments of which of the two visualization showed the higher correlation. Correct answers were presented after each judgment.

“Informed by early pilot testing, participants were randomly assigned to two correlation values: one from [0.3, 0.4, 0.5], and one from [0.6, 0.7, 0.8]. These groups roughly correspond to”hard” and “easy”, since high correlations are more easily discriminated than low correlations. For the two correlation values chosen, participants complete both the above and below approach, resulting in a total of four trials (easy×above, easy×below, hard×above and hard×below).”

“After completing the training and practice sessions, participants began the four main trials. The order of the correlation-approach pairs was randomized in this session. Upon completing a trial set (either by reaching the convergence criteria or 50 individual judgments), participants were given the option to take a short break, and notified as to how many experiment trials remained. Following the completion of all four trials, a demographics questionnaire was given, which included a question that asked participants to describe the strategy they used to assess correlation. Finally, participants were given a short debrief explaining the purpose of the experiment.”

Analysis Plan

The analysis plan includes data processing and cleaning process, Weber’s Law model fitting, and comparison of visualization types on correlation perception.

Data Processing and Cleaning

According to the paper, the authors aggregated the JNDs of each r value under each condition (visualization type x direction x approach).

“The resulting data were non-normally distributed, so to mitigate the effect of outliers, JNDs that fell outside 3 median-absolute deviations from the median (within one of the 54 groups) were excluded from the following analyses. Because the staircase methodology penalizes incorrect responses and controls for guessing by defining a convergence criteria, this exclusion criteria also mitigates the effect of”click-through” responses that often impact crowdsourced experiments.”

“An exclusion criteria was also enforced for visualization × direction pairs that exceeded 20% of values falling on or outside the”chance” boundary of JND = 0.45 established previously. Six of the eighteen pairs met this exclusion criteria: stacked area-positive, stacked bar-positive, stacked line-positive, donut-positive, radar-negative, and line-negative. The following analyses include the remaining twelve visualization×direction pairs.”

Weber’s Law Model Fitting

“Specifically, each correlation value r was moved by half of the average JND from the above and below approach. For the above approach, the correlation r was moved towards r = 1, while the r from the below approach was moved towards r = 0. Specifically, correlation r is transformed into adjusted-correlation r_A by:”

r_A = r ± 0.5*{\tt {jnd}}(r)

“Linear models were then fit to the data, and errors were computed based on the square root of the mean squares of the residuals (RMS error).”

Comparison among Visualization Types

“Examining the JND data alone, there appear to be large differences in performance between many of the visualizations, as well as asymmetries between many of the positive/negative pairs. In order to confirm these observations, an overall Kruskal-Wallis test was conducted on the raw JNDs to evaluate whether there is an interaction between visualization and correlation direction conditions.”

“To explore further, several visualization × direction pairs were compared via Mann-Whitney-Wilcoxon tests. Rather than compare all possible pairs, we instead investigate 14 pairings reflecting the original motivations for choosing the visualizations tested.”

Additional Analysis: Winsorization and Permutation Test

One of the data exclusion criteria was removing data of JNDs that fell outside 3 median-absolute deviations from the median (within one of the 54 groups). Winsorizing JNDs means replacing values of these “outliers” with a specified percentile of the data, rather than eliminating them directly. By Winsorizing JNDs and refitting the model, we can observe the differences of model performance between excluded data and winsorized data.

For the permutation test, the r values will be randomly shuffled within each visualization x direction pair and refitted to the Weber’s Law model. By comparing the two models, we can observe if the relationship is stronger than random chance.

Differences from Original Study

The overall analysis is the same as the one in the paper. For comparing visualization x direction pairs, the original analysis only considered 14 pairs, I will analyze more pairs. For example, comparing visualization using the same type of visual marks (e.g., line, area) can help us understand which specific visualization types work better for depicting correlation. Another difference is the additional model robustness analysis with Winsorization and permutation test.

Design Overview

Manipulated Factors

In the study, the visualization type, the direction and the value of correlation r and the approach (i.e., above and below) were manipulated.

Measures

The measure was the JND of each condition from participants’ judgement. The measures were repeated with different correlation values, above and below approaches with the staircase method.

Study Design

This study was a between-participants design. Each participant was assigned to a specific visualization type, two random correlation values for both above and below approaches.

Alternative Study Design

Running this study with the within-subject design would introduce the issues of carryover effect and fatigue. Since there were 54 groups of conditions, participants may gradually learn how to make judgment. On the other hand, it would take too long to complete each study, which may cause fatigue and influence the performance of participants. Therefore, the between-subjects design taken in the paper was more reasonable.

Reducing Demand Characteristics

This study had little concern on demand characteristics, since participants were aware of the purpose of the trial (to pick the visualization with higher correlation values).

Potential Issues of the Study

The distance to the screen when participants did the study might be a confound to the results. Since the study was conducted through MTurk, participants might view the screen, such as different distances, angles and tilt. It might be easier or harder for people to identify some patterns from the stimuli.

The study covered nine types of visualization initially, but skipped some for later analysis due to the unreliable results. Therefore, the results might be generalizable to these visualizations, as well as those not used in the study. The paper also did not provide demographic information other than sex. The lack of knowledge of the recruited participants made it hard to evaluate the generalizability of population.

Results

Power Analysis

For the Kruskal-Wallis test, the authors used “the raw JNDs to evaluate whether there is an interaction between visualization and correlation direction conditions.” Since Kruskal-Wallis test does not yeild an R^2 effect size directly, I estimate an f^2 effect size from \chi^2 :

f^2 = \frac{\chi^2}{N-k}

where \chi^2 is the Kruskal-Wallis test statistic, N is the total sample size (6772 judgments), and k is the number of groups (18). Given the results reported in the paper is \chi^2(17) = 3147.70, p < 0.001, α = 0.05, the derived f^2 is

f^2 = \frac{3147.70}{6772-18} = 0.4660497483

Using G*Power for a post-hoc power analysis, using F-test with “ANOVA: fixed effects, omnibus, one-way” and the above parameters, the power is 1, indicating a very low probability of a Type II error.

Indicators of Success of Reproduction

Since it is a reproduction based on the original study data, the measures for success are the direction comparison of the results reported in the paper and the results from my code. I will compare the statistics of each analysis, and if the numbers match, the reproduction is successful.

Data preparation

Data preparation following the analysis plan.

### Data Preparation

#### Load Relevant Libraries and Functions
library(ggplot2)
library(dplyr)
library(tidyverse)
library(tibble)
library(cowplot)
library(ggplot2)
library(knitr)


#### Import data
df = read.csv("../data/master.csv")

Each record is a participant’s JND on the given visualization x direction x r x approach. Approaches (i.e., below and above) indicated whether the second stimulus had a larger or smaller r compared to the base r in the first stimulus.

head(df)
  participant                 vis rdirection sign                  visandsign
1    hrp9vbr9 parallelCoordinates   positive    1 parallelCoordinatespositive
2    hrp9vbr9 parallelCoordinates   positive    1 parallelCoordinatespositive
3    hrp9vbr9 parallelCoordinates   positive    1 parallelCoordinatespositive
4    hrp9vbr9 parallelCoordinates   positive    1 parallelCoordinatespositive
5    hs1u042j                line   positive    1                linepositive
6    hs1u042j                line   positive    1                linepositive
  rbase approach       jnd
1   0.6    above 0.2733333
2   0.6    below 0.4716667
3   0.3    below 0.2841667
4   0.3    above 0.4833333
5   0.8    below 0.2650000
6   0.5    below 0.3500000

To prepare the data for analysis, I first calculated some derived attributes as described in the paper, including proportion_cutoff and mad_cutoff, the two data exclusion metrics.

#### Data exclusion / filtering
#### Prepare data for analysis - create columns etc.

# rename the visualization x direction condition to make it more readable
renames <- function(value) {
  value <- str_replace(value, 'negative', ' - negative')
  value <- str_replace(value, 'positive', ' - positive')
  value <- str_replace(value, 'parallelCoordinates', 'parallel coordinates')
  value <- str_replace(value, 'ordered_line', 'ordered line')
  return (value)
}

data = df |> 
  group_by(visandsign) |>
  mutate(proportion_worse_than_chance = mean(jnd > 0.45)) |> 
  mutate(proportion_cutoff = proportion_worse_than_chance > 0.2) |>
  group_by(visandsign, rbase, approach) |>
  mutate(mad_cutoff = abs(jnd - median(jnd)) > 3 * mad(jnd, constant=1)) |> #fell outside 3 median-absolute deviations from the median (within one of the 54 groups)
  ungroup() |>
  mutate(approach_value = ifelse(approach == "above", -1, 1)) 

data = data |> mutate(visandsign = renames(visandsign)) |>
  mutate(visandsign = factor(visandsign, levels=c("scatterplot - positive", "scatterplot - negative", "parallel coordinates - positive", "parallel coordinates - negative", "stackedarea - positive", "stackedarea - negative", "stackedline - positive", "stackedline - negative", "stackedbar - positive", "stackedbar - negative", "donut - positive", "donut - negative", "line - positive", "line - negative", "radar - positive", "radar - negative", "ordered line - positive", "ordered line - negative")))

Confirmatory analysis

Weber’s Law Model Fit

Each correlation value r was moved by the half of the average JND from the above and below approach, following the model fitting procedure of Weber’s Law.

r_A = r \pm 0.5 {\tt jnd}(r)

# remove jnd data outside of 3mad
weber_data = filter(data, !mad_cutoff)
  
# group data by conditions to get the mean and sem
weber_data = weber_data |>
  group_by(visandsign, rbase, approach, approach_value, sign, vis, proportion_cutoff) |>
  summarise(sem_jnd=sd(jnd)/sqrt(length(jnd)), jnd = mean(jnd), .groups = "drop") |>
  group_by(rbase, visandsign) |>
  mutate(mean_jnd_within_r = mean(jnd)) |>
  mutate(r_A = rbase + approach_value * (-0.5) * mean_jnd_within_r) |>
  mutate(approach = stringr::str_to_title(approach))
overall_dis = ggplot(weber_data, aes(x=rbase, y=jnd, col=approach)) +
  geom_point(size=2) +
  xlim(0, 1) + 
  geom_hline(aes(yintercept=0.45), colour=c("gray75"), linetype = 'dotted') +
  geom_abline(slope=-1, intercept = 1, colour=c("gray75"), linetype = 'dotted') +
  facet_wrap(~visandsign, ncol=4, scales = "free") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line = element_line(colour = "black"), strip.text = element_text(color = "black", face = "bold", size=12),
    strip.background = element_rect(fill = "white", color = "white")) +
  ggh4x::facetted_pos_scales(
    x = lapply(1:18, function(i) scale_x_continuous(expand = c(0, 0), limits = c(0, 1), breaks = seq(0, 1, by = 0.2), labels = scales::label_number(drop0trailing = TRUE))),
    y = lapply(1:18, function(i) scale_y_continuous(expand = c(0, 0), limits = c(0, 0.6), labels = scales::label_number(drop0trailing = TRUE)))
  ) +
  geom_errorbar(aes(ymin=jnd - sem_jnd, ymax=jnd+sem_jnd), color="black", width = 0.05)

overall_dis
Figure 1
# to store fitting results of each visxdirection condition
model_results <- list()

# only report conditions that have no over 20% data fall out of chance
valid_vis_sign = unique(filter(weber_data, !proportion_cutoff)$visandsign)

for (vis_sign in valid_vis_sign) {
  this_data <- filter(weber_data, visandsign ==vis_sign)
  model <- lm(jnd ~ r_A, this_data)
  correlation <- cor(this_data$jnd, this_data$r_A)
  
  # storing results
  model_results[[vis_sign]] <- list('visandsign'=vis_sign, 'inter'=model$coefficients[1], 'slope'=model$coefficients[2], 'r'=correlation, 'r2'=correlation^2, 'rms'=sqrt(mean((residuals(model))^2)))
  
}
  
# round up numbers
model_results <- lapply(model_results, function(model) {
  lapply(model, function(x) if(is.numeric(x)) round(x, 3) else x)
})

# create a dataframe of model results (to display)
model_results_df = data.frame(matrix(unlist(model_results), nrow=length(model_results), byrow=TRUE))

# make the column name readable
colnames(model_results_df) <- c(
  "vis_and_sign", 'intercept', 'slope', 'r', 'r^2', 'RMS')
kable(model_results_df, format = "html")
vis_and_sign intercept slope r r^2 RMS
scatterplot - positive 0.174 -0.174 -0.991 0.982 0.004
scatterplot - negative 0.206 -0.222 -0.95 0.903 0.013
parallel coordinates - positive 0.368 -0.266 -0.863 0.745 0.032
parallel coordinates - negative 0.159 -0.144 -0.948 0.898 0.009
stackedarea - negative 0.27 -0.218 -0.928 0.862 0.016
stackedline - negative 0.355 -0.321 -0.916 0.84 0.027
stackedbar - negative 0.221 -0.193 -0.951 0.904 0.011
donut - negative 0.256 -0.232 -0.962 0.925 0.012
line - positive 0.459 -0.323 -0.859 0.738 0.043
radar - positive 0.443 -0.356 -0.954 0.911 0.024
ordered line - positive 0.256 -0.239 -0.955 0.911 0.014
ordered line - negative 0.322 -0.305 -0.883 0.78 0.031
model_results_df[, c("intercept", "slope")] <- lapply(model_results_df[, c("intercept", "slope")], as.numeric)

model_results_plot <- model_results_df |> 
  tidyr::expand_grid(x = seq(0, 1, length.out=100)) |>
  mutate(y = intercept + slope * x) |>
  rowwise() |>
  mutate(vis = strsplit(vis_and_sign, split=" - ")[[1]][1], 
         direction = strsplit(vis_and_sign, split=" - ")[[1]][2])

model_results_plot$direction <- factor(model_results_plot$direction, levels = c("positive", "negative"))

ggplot(model_results_plot, aes(x = x, y=y, group=vis_and_sign,color = vis, linetype = direction)) +
  theme_minimal()+
  geom_line() +
  labs(x = "R", y="JND") +
  scale_y_continuous(expand=c(0,0), limits=c(0, 0.6)) +
  scale_x_continuous(expand=c(0,0), limits=c(0, 1), labels = scales::label_number(drop0trailing=TRUE)) +
  theme(
    legend.position = c(0.6, 1), 
    legend.justification = c(0.5, 1), 
    legend.direction = "horizontal", # Arranges legend items horizontally
    legend.background = element_rect(fill = "transparent", color = "transparent"),
    legend.title = element_blank(), # Removes legend titles
    legend.box = "horizontal"
  ) + 
  guides( # organize legend items
    color = guide_legend(nrow = 3), 
    linetype = guide_legend(nrow = 2)
  )
Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
3.5.0.
ℹ Please use the `legend.position.inside` argument of `theme()` instead.
Warning: Removed 8 rows containing missing values or values outside the scale range
(`geom_line()`).

All numerical results were the same as in the original paper. The comparison of visualizations is shown in Figure 2,

p1 <- ggdraw() + draw_image("./figures/regression_my.png", scale = 1) + theme(aspect.ratio = 1)
p2 <- ggdraw() + draw_image("./figures/regression_paper.png", scale = 0.8) + theme(aspect.ratio = 1)

plot_grid(p1, p2, align = "v", axis = "lr", labels = c("A: My reproduction", "B: From original paper"), label_size = 8)     
Figure 2

Kruskal-Wallis Test

From the Figure 1, we can spot the difference JND distributions of different visualization x direction. Following the paper, I run the Kruskal-Wallis test to confirm the difference.

kruskal_res = kruskal.test(jnd ~ visandsign, data)

kruskal_res

    Kruskal-Wallis rank sum test

data:  jnd by visandsign
Kruskal-Wallis chi-squared = 3147.7, df = 17, p-value < 2.2e-16

The Kruskal-Wallis test showed an overall effect between visualization and correlation direction conditions (\chi^2=3147.70, p < 0.001, \alpha = 0.05 ). And the result is the same from the paper.

Mann-Whitney-Wilcoxon Tests

To understand how different are the conditions from each other, I conducted the Mann-Whitney-Wilcoxon tests. Mann-Whitney-Wilcoxon tests compare two samples or groups to determine if their population distributions differ (non-parametric).

# store all test results
wilcox_res <- list()

# helper function to run wilcoxon test
wilcox_pair <- function (con_1, con_2, data) {
  data_1 = subset(data, visandsign == con_1 & !mad_cutoff)$jnd
  data_2 = subset(data, visandsign == con_2 & !mad_cutoff)$jnd
  
  res = wilcox.test(data_1, data_2)
  return (list("con_1"=con_1, "con_2"=con_2, "stats"=res$statistic, "p"=format(res$p.value, scientific = FALSE)))
}

Following the paper, there were four groups of comparisons:

# comparison pairs
# scatterplot and parallel coordinates
con_1_arr <- c("scatterplot - negative", "scatterplot - negative", "scatterplot - positive", "parallel coordinates - negative", "parallel coordinates - negative", "parallel coordinates - negative")
con_2_arr <- c("scatterplot - positive", "parallel coordinates - positive", "parallel coordinates - positive", "scatterplot - negative", "scatterplot - positive", "parallel coordinates - positive")

for (i in 1:length(con_1_arr)) {
  res = wilcox_pair(con_1_arr[i], con_2_arr[i], data)
  wilcox_res[[paste("group_1", i)]] = res
}
# comparison pairs
# stacked-style charts
con_1_arr <- c("stackedbar - negative", "stackedbar - negative", "stackedbar - negative", "stackedline - negative")
con_2_arr <- c("stackedline - negative", "stackedarea - negative", "donut - negative", "stackedarea - negative")

for (i in 1:length(con_1_arr)) {
  res = wilcox_pair(con_1_arr[i], con_2_arr[i], data)
  wilcox_res[[paste("group_2", i)]] = res
}
# comparison pairs
# line-style charts
con_1_arr <- c("line - positive", "line - positive", "line - positive")
con_2_arr <- c("radar - positive", "ordered line - positive", "ordered line - negative")

for (i in 1:length(con_1_arr)) {
  res = wilcox_pair(con_1_arr[i], con_2_arr[i], data)
  wilcox_res[[paste("group_3", i)]] = res
}
# comparison pairs
# order lines
con_1_arr <- c("ordered line - negative")
con_2_arr <- c("ordered line - positive")

for (i in 1:length(con_1_arr)) {
  res = wilcox_pair(con_1_arr[i], con_2_arr[i], data)
  wilcox_res[[paste("group_4", i)]] = res
}

Because 14 pairs of comparison are conducted here, the chances of Type-I error increase. Therefore, a Bonferroni correction is needed to adjust the significant level.

num_compair <- length(wilcox_res)
original_alpha <- 0.05

adjusted_alpha <- original_alpha / num_compair

As a result, the adjusted significant level is 0.0035714.

wilcox_results_df = data.frame(matrix(unlist(wilcox_res), nrow=length(wilcox_res), byrow=TRUE))

colnames(wilcox_results_df) <- c(
  "visualization - direction 1", 'visualization - direction 2', 'W', 'p')

# format the p values as they were in the paper
format_p <- function(p_value, alpha) {
  p_value = as.numeric(p_value)
  p_value = round(p_value, 4)
  
  if (p_value == 0) {
    p_value = "< 0.001*"
  } else if (p_value < alpha) {
    p_value = paste(p_value, '*')
  } else {
    p_value = as.character(p_value)
  }
  return (p_value)
  
}

wilcox_results_df = wilcox_results_df |> 
  rowwise() |>
  mutate(p = format_p(p, adjusted_alpha)) |> # format p values
  ungroup()
kable(as.data.frame(wilcox_results_df), format = "html")
visualization - direction 1 visualization - direction 2 W p
scatterplot - negative scatterplot - positive 51165.5 0.5406
scatterplot - negative parallel coordinates - positive 10885.5 < 0.001*
scatterplot - positive parallel coordinates - positive 8623 < 0.001*
parallel coordinates - negative scatterplot - negative 51291 0.4179
parallel coordinates - negative scatterplot - positive 51491 0.1624
parallel coordinates - negative parallel coordinates - positive 8641.5 < 0.001*
stackedbar - negative stackedline - negative 34421 < 0.001*
stackedbar - negative stackedarea - negative 33348.5 < 0.001*
stackedbar - negative donut - negative 43361 0.0368
stackedline - negative stackedarea - negative 66646 0.0139
line - positive radar - positive 73775.5 0.0017 *
line - positive ordered line - positive 104163.5 < 0.001*
line - positive ordered line - negative 101883 < 0.001*
ordered line - negative ordered line - positive 66292 0.0075

The tests successfully replicated the findings from the paper. The results highlight the symmetric performance of scatterplots in representing both correlation directions, whereas parallel coordinates exhibit asymmetrical performance. Additionally, comparisons between charts with similar design styles (e.g., stacked visual marks) reveal notable differences in performance.

Additional Analyses

Winsorization

The original analysis excludes JNDs that fall more than 3 MADs from the median. Another method is to Winsorize the JND data by capping extreme values at a lower threshold.

win_model_results <- list()

# re-process the original data
for (vis_sign in valid_vis_sign) {
  this_data <- filter(data, visandsign == vis_sign)
  lower_threshold <- quantile(this_data$jnd, 0.05)
  upper_threshold <- quantile(this_data$jnd, 0.95)
  
  # capping the extreme data in jnd
  this_jnd_win <- pmin(pmax(this_data$jnd, lower_threshold), upper_threshold)
  
  # update the data df
  this_data_win = this_data |> mutate(jnd = this_jnd_win)
  
  # re-process / aggregate the data
  this_weber_data = this_data_win |>
    group_by(visandsign, rbase, approach, approach_value, sign, vis, proportion_cutoff) |>
    summarise(sem_jnd=sd(jnd)/sqrt(length(jnd)), jnd = mean(jnd), .groups = "drop") |>
    group_by(rbase, visandsign) |>
    mutate(mean_jnd_within_r = mean(jnd)) |>
    mutate(r_A = rbase + approach_value * (-0.5) * mean_jnd_within_r) |>
    mutate(approach = stringr::str_to_title(approach))
  
  # re-fit the model
  model <- lm(jnd ~ r_A, this_weber_data)
  correlation <- cor(this_weber_data$jnd, this_weber_data$r_A)
  
  win_model_results[[vis_sign]] <- list('visandsign'=vis_sign, 'inter'=model$coefficients[1], 'slope'=model$coefficients[2], 'r'=correlation, 'r2'=correlation^2, 'rms'=sqrt(mean((residuals(model))^2)))
  
}
  

win_model_results <- lapply(win_model_results, function(model) {
  lapply(model, function(x) if(is.numeric(x)) round(x, 3) else x)
})


win_model_results_df = data.frame(matrix(unlist(win_model_results), nrow=length(win_model_results), byrow=TRUE))

colnames(win_model_results_df) <- c(
  "vis_and_sign", 'intercept', 'slope', 'r', 'r^2', 'RMS')
kable(win_model_results_df, format = "html")
vis_and_sign intercept slope r r^2 RMS
scatterplot - positive 0.19 -0.186 -0.99 0.981 0.005
scatterplot - negative 0.219 -0.219 -0.966 0.933 0.011
parallel coordinates - positive 0.34 -0.197 -0.731 0.535 0.038
parallel coordinates - negative 0.174 -0.134 -0.887 0.787 0.012
stackedarea - negative 0.287 -0.199 -0.921 0.847 0.016
stackedline - negative 0.368 -0.283 -0.909 0.826 0.026
stackedbar - negative 0.238 -0.179 -0.972 0.944 0.008
donut - negative 0.266 -0.219 -0.949 0.901 0.014
line - positive 0.457 -0.309 -0.883 0.779 0.037
radar - positive 0.426 -0.304 -0.934 0.873 0.025
ordered line - positive 0.27 -0.222 -0.923 0.853 0.017
ordered line - negative 0.335 -0.282 -0.936 0.876 0.021

Take a closer look at the winsorized with the scatterplot - positive condition as shown in Table 1. The results are slightly different but the r^2 are both close to 1, indicating that the model fits the data very well under both methods. This suggests that the relationship between r_A and JND is robust and largely unaffected by the choice of outlier-handling method—whether extreme values are capped (Winsorization) or removed (3 MAD rule). The close agreement in r^2 highlights that the core pattern in the data is not driven by these extreme values, and the Weber’s Law model captures the underlying relationship effectively.

Table 1
cond = "scatterplot - positive"
rbind(filter(win_model_results_df, vis_and_sign==cond)|>mutate(method = "winsorizating"), filter(model_results_df, vis_and_sign==cond) |> mutate(method = "removing"))
            vis_and_sign intercept  slope      r   r^2   RMS        method
1 scatterplot - positive      0.19 -0.186  -0.99 0.981 0.005 winsorizating
2 scatterplot - positive     0.174 -0.174 -0.991 0.982 0.004      removing
# visually compare model results (original method and winsorizating)
win_compare_df <- rbind(model_results_df |> mutate(method = "Original Method"), win_model_results_df |> mutate(method = "Winsorizating"))

win_compare_df[, c("intercept", "slope")] <- lapply(win_compare_df[, c("intercept", "slope")], as.numeric)

win_compare_plot <- win_compare_df %>%
  tidyr::expand_grid(x = seq(0, 1, length.out = 100)) %>% # Create x values
  mutate(y = intercept + slope * x)  # Calculate y values based on slope and intercept

# Create the facet_wrap plot
ggplot(win_compare_plot, aes(x = x, y = y, col=method)) +
  geom_line() + # Add lines
  facet_wrap(~ vis_and_sign, ncol=4) + 
  labs(
    x = "R",
    y = "JND"
  ) +
  theme_minimal() +
  theme(panel.spacing = unit(1.5, "lines"),
        strip.placement = "outside") +
  ylim(0, 0.6) +
  scale_x_continuous(expand = c(0, 0), limits = c(0, 1), labels = scales::label_number(drop0trailing = TRUE))

We can see that across different visualization x direction conditions, the predicted linear models have similar outcomes.

Permutation

Permutation test can examine whether the observed Weber’s Law relationship is stronger than what might be occur by random chance. Therefore, I shuffle the r_A values within each visualization x direction condition and refit the regression model.

# Number of permutations
n_permutations <- 500

# Permutation results
permutation_results <- replicate(n_permutations, {
  # Shuffle r within each vis_and_sign group
  permuted_df <- weber_data %>%
    group_by(visandsign) %>%
    mutate(r_A = sample(r_A))
  
  # Fit the Weber's Law model to the permuted data
  permuted_fit <- permuted_df %>%
    group_by(visandsign) %>%
    summarise(R2 = summary(lm(jnd ~ r_A))$r.squared)
  
  # Average R2 across all groups
  mean(permuted_fit$R2)
}, simplify = TRUE)

model_results_df$`r^2` = as.numeric(model_results_df$`r^2`)

original_mean_R2 <- mean(model_results_df$`r^2`)
# Visualize results
hist(permutation_results, main = "Permutation Test: r2 Distribution",
     xlab = "Permuted r2", breaks = 30, col = "lightblue", xlim=c(0, 1))
abline(v = original_mean_R2, col = "red", lwd = 2)
legend("topright", legend = c("Original r2 (Mean)"), col = "red", lwd = 2)

From the histogram, we can see that the original averaged r^2 value is far outside the range of the distribution generated by the permutation test, this suggests that the original relationship between r_A and JND is stronger than what would be expected by chance.

Discussion

Summary of Reproduction Attempt

Overall, the reproduction included the major analysis in the original paper and was able to derive same results from the paper.

Commentary

One future direction of this line of work can be focusing on the reason of the symmetric or asymmetric performance of the same type of visualization with different correlation directions.

One major challenge in this reproduction was to come up with specific analysis in code from text description in the paper. Sometimes word is arbitrary and can mean different ways to implement.