My scripts:

  • rendered html versions: Rpubs/thomas-weissensteiner
  • .rmd files with executable code chunks: www.github.com/thomas-weissensteiner/portfolio/tree/main/

About myself: www.linkedin.com/in/ThomasWs-Mopfair

## - chunk libraries_and_data

## Load R libraries and the data file                             - #

library(dplyr)        # version 1.1.4  # general R grammar
library(tidyr)        # version 1.3.1  # pivot_longer, pivot_wider, nest
library(purrr)        # version 1.0.2  # map
library(ggplot2)      # version 3.4.0  # plots

HyKal_simData <- 
  read.csv("https://raw.githubusercontent.com/VIS-SIG/Wonderful-Wednesdays/refs/heads/master/data/2025/2025-06-11/serum_potassium_study_data.csv") %>% 
  # Avoid confusion with the function "group"
  rename(., Treatment = Group)


1. Background and aim of the visualisations


The concentration of potassium ions (K+) in human serum is maintained at 4.0–4.9 mEq/L under normal conditions.1
Serum or plasma potassium levels above the normal range, usually greater than 5.0 mEq/L to 5.5 mEq/L, are defined as hyperkalemia. Mild hyperkalemia is usually asymptomatic, but high potassium levels may cause life-threatening cardiac arrhythmias, muscle weakness, or paralysis. Symptoms usually develop at higher levels, 6.5 mEq/L to 7 mEq/L, but the rate of change is more important than the numerical value. Chronic hyperkalemia may be asymptomatic, whereas acute concentration shifts may cause severe symptoms, even at lower K+ levels.2
Major risk factors for hyperkalemia among chronic kidney disease patients include lower estimated glomerular filtration rate (eGFR), use of renin–angiotensin–aldosterone system inhibitors (RAASis), diabetes, older age and male gender.3

A recently published trial compared sodium zirconium cyclosilicate versus sodium polystyrene sulfonate for treatment of hyperkalemia in hemodialysis patients.1
The primary results were presented as a single figure showing mean serum potassium concentrations versus time in a separate panel for each treatment group. Error bars probably represent the standard deviation (see Methods/Statistical Analysis, no figure caption):

Mean serum K⁺ in patients receiving SPS and SZC vs. week of treatment

Published summary results: a) SZC (orange line), b) SPS(blue line)

<br> Means +/- 1 standard deviation


Means +/- 1 standard deviation


The preceding PSI “Wonderful Wednesday” challenge invited improvements for this figure. In this follow-up, the task was to generate a plot that made use of patient level information mentioned in the table of baseline characteristics. Because the publication reported only summary data, “Wonderful Wednesday” provided a simulated data-set. Plotting mean serum K[+] versus time with these simulated data shows a general agreement with the published figure. I used colors similar to the original and included the standard deviations as lighter shaded bands for comparison:

# Normal serum K[+] as defined in reference 2
normokalemia <- c(4, 4.9)

HyKal_simData %>% 
  select(c(Treatment, Week, Serum_Potassium_mEq_L)) %>% 
  group_by(Treatment, Week) %>% 
  summarise(., n = n(), mean = mean(Serum_Potassium_mEq_L), sd = sd(Serum_Potassium_mEq_L)) %>% 
  mutate(
    Treatment = factor(Treatment),
    SDhigh = mean -sd,
    SDlow = mean +sd,
    CI = 1.96*sd/sqrt(n),
    CIlow = mean - CI,
    CIhigh = mean + CI) %>% 
  ggplot(
    aes(
      x = Week, 
      y = mean,
      colour = Treatment,
      fill = Treatment
      ) 
    ) +
  geom_ribbon(
    inherit.aes = FALSE,
    aes(
      x = Week,
      ymin = min(normokalemia), 
      ymax = max(normokalemia)),
      fill = "lightgrey",
      alpha = 0.5
      ) +
  geom_line(
    size = 1
  ) +
  geom_ribbon(
    aes(
      ymin = SDlow, 
      ymax = SDhigh), 
    linetype = 0,
    alpha = 0.3
    ) +
  geom_ribbon(
    aes(
      ymin = CIlow, 
      ymax = CIhigh), 
    linetype = 0,
    alpha = 0.4
    ) +
  annotate(
    "text", 
    label = "Normal range",
    size = 3,
    x = 0.3, 
    y = range(normokalemia) %>% sum()/2,
    angle = 90  
    ) +
   geom_label(
    inherit.aes = FALSE,
    aes(
      x = 7.5, 
      y = 6.3 - as.numeric(Treatment), 
      label = Treatment,
      color = Treatment
      ),
      fontface = "bold",
      size = 3,
      label.padding = unit(0.25, "lines"),
      label.size = NA,
      fill = "white"
    ) +
  scale_color_manual(values = c("#27C8F5", "#F58E27") ) +
  scale_fill_manual(values = c("#27C8F5", "#F58E27") ) +
  scale_y_continuous(
    breaks = c(3:6), 
    limits = c(3.5, 6.5),
    minor_breaks = NULL,
    ) +
  labs(
    title = "Mean serum K[+] in patients receiving SPS and SZC \nvs. week of treatment",
    subtitle = "Simulated summary results",
    y = "Mean serum K[+] (mEq/L)",
    caption = "Dark colored bands: 95% confidence intervals. \nLight colored bands: standard deviation."
    ) +
  theme_light() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    plot.caption = element_text(hjust = 0.5, size = 8),
    legend.position = "none"
    ) 


The published study showed that the average serum K[+] response in patients on SZC was greater than in patients treated with SPS. My aim for this challenge was to visualise patient subgroups in which the efficacy of SPS and SZC differed most, based on the simulated data.



2. A look at the raw data


2.1. Weekly measurements of serum potassium levels in patients receiving SPS or SZC


## - Chunk 1
# Required objects and functions: HyKal_simData (chunk libraries_and_data)


## Individual patient responses vs. week of treatment

# Normal serum K[+] as defined in reference 2
normokalemia <- c(4, 4.9)

# K_threshold is set based on results in chunk 3
K_threshold <- 5.2

# Number of patients in each treatment group
N_group <- 
  HyKal_simData %>% 
  group_by(Treatment) %>% 
  summarize(., N = unique(Patient_ID) %>% length )
    
HyKal_simData %>%
  # Add a column showing whether baseline levels were below or above K_threshold
  mutate(
    K_base = case_when(
      Serum_Potassium_mEq_L < K_threshold & Time_Point == "Baseline" 
        ~ paste0(Treatment, ".K_low"), 
      Serum_Potassium_mEq_L >= K_threshold & Time_Point == "Baseline" 
        ~ paste0(Treatment, ".K_high"),
      .default = NA_character_
      ) %>% factor()
    ) %>% 
  group_by(Patient_ID) %>%
  fill(K_base, .direction = "downup") %>% 
  # Add patient numbers in each treatment group, and a column showing whether serum K[+] levels were normal at any time during treatment
  left_join(., N_group) %>% 
  mutate(
    hyKal = ifelse(
      Serum_Potassium_mEq_L <= max(normokalemia), 
      Week, NA),
    hyKal = if (all(is.na(hyKal[-1]))) "Yes" else "No",
    color = paste(K_base, hyKal, sep = "_") %>% factor(),
    Treatment = paste0(Treatment, " (N = ", N, ")") 
    ) %>% 
  ungroup() %>%

  ggplot(
    aes(
      x = Week, 
      y = Serum_Potassium_mEq_L,
      group = Patient_ID,
      color = color,
      size = K_base
      )
    ) +
  geom_ribbon(
    inherit.aes = FALSE,
    aes(
      x = Week,
      ymin = min(normokalemia), 
      ymax = max(normokalemia)),
      fill = "lightgrey",
      alpha = 0.7
      ) +
  geom_line() +
  annotate(
    "text", 
    label = "Normal range",
    size = 3,
    x = 0.3, 
    y = range(normokalemia) %>% sum()/2,
    angle = 90  
    ) +
  scale_color_manual(
    values = c(
      SPS.K_low_No = "#27C8F5", SPS.K_high_No = "#4795A8", 
      SPS.K_low_Yes = "#551182", SPS.K_high_Yes = "#CC2DB7",  
      SZC.K_high_No = "#F58E27", SZC.K_low_No = "#A34018"
      # Levels that do not exist in this data set:
      # SZC.K_low_Yes = "#551182", SZC.K_high_Yes = "#CC2DB7",  
      ) 
    ) +
  scale_size_manual(values = c(0.7 ,1.4 , 0.7, 1.4) ) +
  scale_y_continuous(
    breaks = c(4:6, K_threshold),
    minor_breaks = NULL,
    ) +
  facet_wrap(.~Treatment) +
  labs(
    title = "Serum potassium levels vs. week of treatment in individual patients receiving SPS or SZC",
    y = "Serum K[+] (mEq/L)",
    caption = paste("Darker and thicker lines represent patients with a baseline serum K[+] below the", K_threshold, "mEq/L threshold (purple y-axis tick label). \nPurple lines indicate patients who never achieved normokalemia. See the following figures for an explanation.")
    ) +
  theme_light() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.y = element_text(
      face = c("plain", "plain", "plain", "bold"),
      color = c("black", "black", "black", "#CC2DB7")),
    strip.background = element_blank(),
    strip.text.x = element_text(
      color = "black",
      size = 11
      ),
    plot.caption = element_text(hjust = 0, size = 8),
    legend.position = "none"
    ) 



The raw data already show a more consistent decline of serum K[+] levels, with lower inter- and intra-individual variability in SZC- compared to SPS-treated patients. In the SPS group only, several patient progressed directly between hyper- and hypokalemia at consecutive time points, while eight patients never achieved normal levels throughout the treatment period (purple lines).


2.2. Defining the outcomes to be visualised


At the previous PSI-webinar, it was noted that reaching normokalemia took on average about five weeks in patients treated with SPS, compared to 1.5 weeks in patients receiving SZC. Marking these time points was a suggested improvement of the figure. Accordingly, I choose to estimate the times at which normal serum K[+] were first achieved in individual patients, and compare the means between treatment groups.
However, the first crossing point into the normal range is determined by the patients’ treatment response as well as their baseline serum K[+] levels. Moreover, baseline levels of serum K[+] might also be covariate associated with the type and/or strength of the treatment response. Therefore, I decided to compare treatment response as a separate outcome, modelling it as a linear function of serum K[+] vs. time. Change from baseline could have been another option, but with only a single baseline value for each patient “rate of change” seemed a better metric for the efficacy of each drug for lowering serum K[+] concentrations.

K[+] measurements in individual patients fluctuated considerably over time, especially in the SPS group. This means that estimates of the response rate or the time of reaching normokalemia are uncertain, and might even be inappropriate for comparing the two treatment groups. A better and clinically more meaningful outcome might therefore be the total time during the trial when patients achieved normokalemia.

3. R Functions for analyses and visualitions


3.1. Plot outcomes vs. continuous baseline variables


## - quantBase_dotPlot
# Function for plotting outcomes vs. continuous baseline variables

# Setting argument "lambda" to N (number of patients in each treatment group) adds median regression lines, smaller values will result in median smoothing similar to the panels for individual groups in a STEPP plot
# Moving the facet strip text to the bottom is a workaround for generating individual x-axis titles for each facet

quantBase_dotPlot <- function(data, outcome, label_nudge, label_offset, lambda = 1) {
    
  outcome <- sym(outcome)
    
# Reformat input data-frame
  data_long <- 
    data %>%
      rename(
        "Age \n(years)" = Age,
        "Baseline eGFR \n(ml/min/1.73m^2)" = Baseline_eGFR,
        "Baseline serum K[+] \n(mEq/L)" = Serum_Potassium_mEq_L) %>% 
      pivot_longer(
        cols = !c(Treatment, Gender, Diabetes, Hypertension, !!outcome),
        names_to = "variables"
        ) %>% 
      mutate(., variables =
        factor(variables, 
          levels = c(
            "Baseline serum K[+] \n(mEq/L)", 
            "Baseline eGFR \n(ml/min/1.73m^2)", 
            "Age \n(years)"
            )
          )
        )

    # Labelled horizontal lines for SPS and SZC group medians
    median_labels <- 
      left_join(
        # Group median values
        data_long %>%
        group_by(variables, Treatment) %>%
        summarise(median_outcome = median(!!outcome), .groups = "drop"),
        # x-limits for label positioning
        data_long %>%
        group_by(variables) %>%
        summarise(x_pos = max(value, na.rm = TRUE), .groups = "drop")
        ) %>% 
        # y-values for label positioning, with +/- nudging from median line
        mutate(
          y_pos = median_outcome + 
            ((as.factor(median_outcome) %>% as.numeric)*label_nudge - label_offset),
          Label = Treatment
        )
    # Show treatment group labels only in last facet
    median_labels[1:(nrow(median_labels)-2), "Label"] <- ""
    
    data_long %>% 
      ggplot(
        aes(
          x = value,
          y = !!outcome,
          color = Treatment
          )
        ) +
        geom_point() +
        geom_hline(
          data = median_labels,
          aes(yintercept = median_outcome, color = Treatment),
          linetype = "dotted",
          linewidth = 0.8
          ) +
      geom_quantile(
        method = "rqss", quantiles = 0.5, lambda = lambda, size = 2, alpha = 0.3 )  +
      geom_text(
        data = median_labels,
        aes(
          x = x_pos, y = y_pos, 
          label = paste(Label),
          color = Treatment
          ),
        fontface = "bold",
        hjust = 0.65, 
        size = 4
        ) +
      scale_color_manual(values = c("#27C8F5", "#F58E27") ) +
      facet_wrap(
        .~variables,
        scales = "free_x",
        strip.position = "bottom",
        ncol = 3
        ) +
      theme_minimal() +
      theme(
        strip.placement = "outside",
        strip.text = element_text(size = 11),
        axis.title.x = element_blank(),
        plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0, size = 8),
        legend.position = "none",
        )
}


3.2. Split violin plots


## - geom_split_violin
# Custom ggproto object for generating split violin plots (source: https://stackoverflow.com/questions/35717353/split-violin-plot-with-ggplot2)


GeomSplitViolin <- ggplot2::ggproto(
    "GeomSplitViolin",
    ggplot2::GeomViolin,
    draw_group = function(self,
                          data,
                          ...,
                          # add the nudge here
                          nudge = 0,
                          draw_quantiles = NULL) {
        data <- transform(data,
                          xminv = x - violinwidth * (x - xmin),
                          xmaxv = x + violinwidth * (xmax - x))
        grp <- data[1, "group"]
        newdata <- plyr::arrange(transform(data,
                                           x = if (grp %% 2 == 1) xminv else xmaxv),
                                 if (grp %% 2 == 1) y else -y)
        newdata <- rbind(newdata[1, ],
                         newdata,
                         newdata[nrow(newdata), ],
                         newdata[1, ])
        newdata[c(1, nrow(newdata)-1, nrow(newdata)), "x"] <- round(newdata[1, "x"])

        # now nudge them apart
        newdata$x <- ifelse(newdata$group %% 2 == 1,
                            newdata$x - nudge,
                            newdata$x + nudge)

        if (length(draw_quantiles) > 0 & !scales::zero_range(range(data$y))) {

            stopifnot(all(draw_quantiles >= 0), all(draw_quantiles <= 1))

            quantiles <- ggplot2:::create_quantile_segment_frame(data,
                                                             draw_quantiles)
            aesthetics <- data[rep(1, nrow(quantiles)),
                               setdiff(names(data), c("x", "y")),
                               drop = FALSE]
            aesthetics$alpha <- rep(1, nrow(quantiles))
            both <- cbind(quantiles, aesthetics)
            quantile_grob <- ggplot2::GeomPath$draw_panel(both, ...)
            ggplot2:::ggname("geom_split_violin",
                             grid::grobTree(ggplot2::GeomPolygon$draw_panel(newdata, ...),
                                            quantile_grob))
        }
    else {
            ggplot2:::ggname("geom_split_violin",
                             ggplot2::GeomPolygon$draw_panel(newdata, ...))
        }
    }
)

geom_split_violin <- function(mapping = NULL,
                              data = NULL,
                              stat = "ydensity",
                              position = "identity",
                              # nudge param here
                              nudge = 0,
                              ...,
                              draw_quantiles = NULL,
                              trim = TRUE,
                              scale = "area",
                              na.rm = FALSE,
                              show.legend = NA,
                              inherit.aes = TRUE) {

    ggplot2::layer(data = data,
                   mapping = mapping,
                   stat = stat,
                   geom = GeomSplitViolin,
                   position = position,
                   show.legend = show.legend,
                   inherit.aes = inherit.aes,
                   params = list(trim = trim,
                                 scale = scale,
                                 # don't forget the nudge
                                 nudge = nudge,
                                 draw_quantiles = draw_quantiles,
                                 na.rm = na.rm,
                                 ...))
}


3.3. Custom labels for each half violin in split violin plots


## - geom_split_violin_labels
# Custom geom for split violin labels
# Default "colour = NULL" in geom_split_violin_labels inherits color aesthetics from geom_split_violin, aesthetics cannot be defined in this geom

splitViolinLabels <- 
  ggproto(
    "splitViolinLabels", Stat,
    required_aes = c("x", "y", "fill"
    ),
  setup_params = function(data, params) {
    params
  },
  
  # Compute panel for same y-coordinates
  compute_panel = function(data, scales, nudge_amount = 0.4, x_offset = 0, 
                          y_offset = 0, y_position_pct = 0.9, label_colour = NULL) {
    
    # Calculate y position based on OVERALL data range (not per group)
    y_max <- max(data$y)
    y_min <- min(data$y)
    y_pos <- y_min + (y_max - y_min) * y_position_pct + y_offset
    
    # Process each group separately but use the same y_pos for all
    result_list <- split(data, data$group)
    results <- lapply(result_list, function(group_data) {
      # Get the group information
      group_val <- unique(group_data$fill)
      x_val <- unique(group_data$x)
      group_num <- unique(group_data$group)
      
      # Calculate x position based on ggplot2's internal group number
      x_pos <- if(group_num %% 2 == 1) {
        x_val - nudge_amount/2 + x_offset  # Odd groups (left)
      } else {
        x_val + nudge_amount/2 + x_offset  # Even groups (right)
      }
      
      # Create result data-frame
      result <- data.frame(
        x = x_pos,
        y = y_pos,
        label = as.character(group_val),
        fill = group_val,
        group = group_num,
        stringsAsFactors = FALSE
      )
      
      # Handle colour aesthetic
      if ("colour" %in% names(group_data)) {
        # Use mapped colour aesthetic
        result$colour <- unique(group_data$colour)
      } else if (!is.null(label_colour)) {
        # Use parameter-specified colour
        result$colour <- label_colour
      }
      # If neither, let default geom aesthetic handle it
      
      return(result)
    })
    
    # Combine all results
    do.call(rbind, results)
  }
)

# Create custom Geom for the labels
GeomSplitViolinLabels <- ggproto("GeomSplitViolinLabels", GeomText,
  default_aes = aes(colour = "black", size = 4, angle = 0, hjust = 0.5, 
                    vjust = 0.5, alpha = 1, family = "", fontface = "bold", 
                    lineheight = 1.2)
)

geom_split_violin_labels <- 
  function(
    mapping = NULL, data = NULL, 
    nudge_amount = 0.4, x_offset = 0, y_offset = 0, 
    y_position_pct = 0.9, size = 4, fontface = "bold",
    colour = NULL, color = NULL, ..., na.rm = FALSE, 
    show.legend = FALSE, inherit.aes = TRUE) {
    
      # Handle color/colour argument
      if (!is.null(color)) colour <- color
    
      # Create the layer parameters
      params <- list(
        nudge_amount = nudge_amount,
        x_offset = x_offset,
        y_offset = y_offset,
        y_position_pct = y_position_pct,
        size = size,
        fontface = fontface,
        na.rm = na.rm,
        label_colour = colour,  # Pass colour to stat like original
        ...
        )
    
      # Don't remove colour from params if using default
      if (!is.null(colour) && is.null(mapping)) {
        # Only remove colour from params if we have aesthetic mapping
        # Otherwise keep it for the geom
        params$colour <- colour
        params <- params[names(params) != "label_colour"]
      }
    
    layer(
      data = data,
      mapping = mapping,
      stat = splitViolinLabels,
      geom = GeomSplitViolinLabels,
      position = "identity",
      show.legend = show.legend,
      inherit.aes = inherit.aes,
      params = params
      )
    }


3.4. Plot outcomes vs. categorical baseline variables, with p-value annotations


## - split_violin_plots_and_annotations
# Functions for generating faceted split violin plots comparing outcomes across treatments and patient subgroups

# Required objects and functions: geom_split_violin, geom_split_violin_labels


## Function for thresholding continuous variables and transforming the input dataframes to long format
# Input: outcome and comparison tables derived from Hykal_simData ("data", "data_p"), thresholds for categorising continuous baseline characteristic variables ("Age_threshold", "eGFR_threshold", "K_threshold")

reformat_data <- function(data, Age_threshold, eGFR_threshold, K_threshold) {
  data %>% 
    mutate(
            Age_level = ifelse(
                Age < Age_threshold, 
                "Yes", "No"
            ), 
            eGFR_level = ifelse(
                Baseline_eGFR < eGFR_threshold, 
                "Yes", "No"
            ),
            K_level = ifelse(
                Serum_Potassium_mEq_L < K_threshold, 
                "Yes", "No"
            )
        ) %>%
        pivot_longer(
            cols = c(Age_level, eGFR_level, K_level, Gender, Diabetes, Hypertension),
            names_to = "pChar",
            values_to = "value"
        ) %>%
        mutate(
            value = factor(value, levels = c("Yes", "No", "Male", "Female") ),
            pChar = case_when(
                pChar == "Age_level" ~ paste("Age <", Age_threshold, "years"), 
                pChar == "eGFR_level" ~ paste("Baseline eGFR <", eGFR_threshold, "ml/min/1.73m^2"),
                pChar == "K_level" ~ paste("Baseline K[+] <", K_threshold, "mEq/mL"),
                .default = pChar
            ),
            Treatment = factor(Treatment)
        )
  }



## Standalone functions for treatment comparisons (SPS vs SZC), for generating p-values for plot annotations
# Input: outcome table derived from Hykal_simData ("data"), name of outcome variable ("outcome")

treatment_comparisons <- function(data, outcome) {
  data %>%
    group_by(pChar, value) %>%
    summarise(
      p_value_groups = tryCatch({
        # Use string column name to extract values
        sps_values <- .data[[outcome]][Treatment == "SPS"]
        szc_values <- .data[[outcome]][Treatment == "SZC"]
        wilcox.test(sps_values, szc_values)$p.value
        },
        error = function(e) NA_real_
        ),
      delta_group = tryCatch({
        # Use string column name to extract values
        sps_values <- .data[[outcome]][Treatment == "SPS"]
        szc_values <- .data[[outcome]][Treatment == "SZC"]
        median(sps_values) - median(szc_values)
        },
        error = function(e) NA_real_
        ),
      n_SPS = sum(Treatment == "SPS"),
      n_SZC = sum(Treatment == "SZC"),
      .groups = "drop"
      ) %>%
    group_by(pChar) %>%
    mutate(
      min.p_value = min(p_value_groups, na.rm = TRUE),
      max_delta = abs(delta_group) %>% max(., na.rm = TRUE)) %>%
    ungroup() %>%
    # Clean and format
    filter(!is.na(p_value_groups)) %>%
    mutate(
      
      # Order patient subgroups by differences of their medians or greatest significance
      pChar = factor(
        pChar, 
        levels = unique(
          pChar[order(max_delta) %>% rev]
          # pChar[order(min.p_value)]
          )
      )
    )
  }


# Standalone functions for level comparisons (factor levels within groups)

level_comparisons <- function(data, outcome, pChar_order = NULL) {
  # Helper function for safe t-tests
  perform_comparisons <- function(rates, values, outcome_col) {
    val_numeric <- as.numeric(values)
    
    safe_ttest <- function(group1_idx, group2_idx) {
      rates1 <- rates[val_numeric == group1_idx]
      rates2 <- rates[val_numeric == group2_idx]
      
      if(length(rates1) > 0 & length(rates2) > 0) {
        tryCatch(
          wilcox.test(rates1, rates2)$p.value,
          error = function(e) NA_real_
        )
      } else {
        NA_real_
      }
    }
    
    tibble(
      comparison = c(
        paste(levels(values)[1], "vs", levels(values)[2]),
        paste(levels(values)[3], "vs", levels(values)[4])
      ),
      p_value_levels = c(
        safe_ttest(1, 2),
        safe_ttest(3, 4)
      ),
      n_group1 = c(
        sum(val_numeric == 1),
        sum(val_numeric == 3)
      ),
      n_group2 = c(
        sum(val_numeric == 2),
        sum(val_numeric == 4)
      )
    )
  }
  
  result <- data %>%
    group_by(pChar, Treatment) %>%
    group_modify(~ perform_comparisons(.x[[outcome]], .x$value, outcome)) %>%
    # Clean and format
    filter(!is.na(p_value_levels)) 
  
  # Apply pChar ordering if provided
  if (!is.null(pChar_order)) {
    result <- result %>%
      mutate(pChar = factor(pChar, levels = pChar_order))
  }
  
  return(result)
}


## Combine the two comparison functions 

all_comparisons <- function(data, outcome, K_threshold) {
  
  # Run treatment comparison
  treatment_results <- treatment_comparisons(
    data, 
    outcome
    )

  # Run level comparisons (with same pChar ordering as treatment_results)
  level_results <- level_comparisons(
    data, 
    outcome, 
    pChar_order = levels(treatment_results$pChar)
    )

  # Join results and re-format data
  left_join(
    treatment_results,
    level_results,
    relationship = "many-to-many"
    ) %>%
  mutate(
    max_delta_value = ifelse(max_delta == abs(delta_group), 0, 1),
    Treatment = factor(Treatment)
    )
}


## Function for generating the annotated split-violin plots
# Input: outcome and comparison tables derived from Hykal_simData ("data", "data_p"), name of outcome variable ("outcome")

halfviolin_facetPlot <- function(data, outcome, data_p) {

  outcome <- sym(outcome)
  labels_yOffset <- min(data[[outcome]]) %>% abs()
  data_p <- data_p %>% 
    mutate(., labels_yOffset = labels_yOffset)

 data %>% 
 # If facets are to be arranged in order of greatest difference between the group medians or decreasing significance of the differences
 # (see options for function reformat_data above)
  mutate(
    pChar = factor(pChar, levels = levels(data_p$pChar) )
    )%>% 
  ggplot(
    aes(
      x = value,
      y = !!outcome,
      color = Treatment,
      fill = Treatment,
      label = Treatment
    )
  ) +
  geom_tile(
    inherit.aes = FALSE,
    data = data_p,
    aes(
      x = value,
      y = 0,
      height = max_delta_value*Inf,
      width = 1
      ),
    fill = "lightgrey",
    alpha = 0.15
    ) +
  geom_split_violin( 
    width = 0.9,
    nudge = 0.01,
    alpha = 0.3,
    color = "grey") +
  geom_point(
    pch = 1,
    size = 2,
    position = 
      position_jitterdodge(
        dodge.width = 0.5, 
        jitter.width = 0.35
        ),
    stroke = 1
    ) +
  geom_boxplot(
      color = "black",
      alpha = 0,
      width = 0.3, 
      outlier.shape = NA) +
  # p-value annotation for treatment comparisons
   geom_text(
    data = data_p,
    inherit.aes = FALSE,
    aes(
      alpha = 1 - max_delta_value/1.5,
      label = 
       ifelse (
          p_value_groups > 0.05, "ns", 
          paste0(
            "p == ", 
            signif(p_value_groups, digits = 2)
          )
        ),
      x = value,
      y = -labels_yOffset - 1
      ), 
    parse = TRUE
    ) +
  # Brackets and p-value annotation for level comparisons
  geom_errorbarh(
        inherit.aes = FALSE,
        aes(
            color = Treatment,
            y = - as.numeric(Treatment) * 2 - labels_yOffset -1,
            xmin = as.numeric(Treatment)/5.75 + 0.733, 
            xmax = as.numeric(Treatment)/5.75 + 1.733
        )
    ) +
   geom_text(
    data = data_p,
    inherit.aes = FALSE,
    aes(
      label = 
       ifelse (
          p_value_levels > 0.05, "ns", 
          paste0(
            "p == ", 
            signif(p_value_levels, digits = 2)
          )
        ),
      y = -as.numeric(Treatment) * 2 - labels_yOffset,
      color = Treatment
      ), 
    x = 1.5,
    parse = TRUE
    ) +   
  geom_split_violin_labels() +
  # Prevent automatic re-scaling of alpha by ggplot2 to ensures alpha is controlled by max_delta_value in geom_text
  scale_alpha_identity() +
  scale_color_manual(values = c("#27C8F5", "#F58E27") ) +
  scale_fill_manual(values = c("#27C8F5", "#F58E27") ) +
  scale_x_discrete(expand = c(0,0)) +
  facet_wrap(
    .~pChar,
    scales = "free_x") + 
  theme_light () +
  theme(
    strip.background = element_rect(fill = "#f2f3f4", color = "grey"),
    strip.text = element_text(color = "black"),
    text = element_text( size = 13), 
    plot.title = element_text(
      size = 14,
      hjust = 0.5,
      vjust = 5
      ),
    axis.text = element_text(size = 12),
    plot.caption = element_text(
      size = 8,
      hjust = 0,
      vjust = -5
      ),
    plot.margin = unit(c(0.5,0.1,0.5,0), "cm"),
    legend.position = "none"
    )
  }


3.5. Find threshold values for outcomes that result in the greatest or most significant differences between the two treatment groups


# Required objects and functions: HyKal_simData_rate (chunk 2), HyKal_simData_normalFirst (chunk 3), HyKal_simData_normalPeriod (chunk 4), reformat_data (split_violin_plots_and_annotations)

optimize_thresholds <- function(data, 
                                 Age_threshold, 
                                 eGFR_threshold, 
                                 K_threshold,
                                 analysis_type = "Rate",
                                 plot_results = TRUE) {
  
  # Function to optimize a single parameter
  optimize_single <- function(param_name, param_values, fixed_age, fixed_eGFR, fixed_K) {
    
    # Define the pattern to filter by based on optimization parameter
    filter_pattern <- switch(param_name,
                            "Age" = "^Age",
                            "eGFR" = "^Baseline eGFR", 
                            "K" = "^Baseline K")
    
    # Perform optimisation for this parameter
    results <- lapply(param_values, function(x) {
      
      # Set the threshold values based on which parameter we're optimizing
      current_Age <- ifelse(param_name == "Age", x, fixed_age)
      current_eGFR <- ifelse(param_name == "eGFR", x, fixed_eGFR)
      current_K <- ifelse(param_name == "K", x, fixed_K)
      
      # Run the analysis pipeline
      reformat_data(data, current_Age, current_eGFR, current_K) %>% 
        treatment_comparisons(., analysis_type) %>% 
        filter(grepl(filter_pattern, pChar)) %>% 
        select(max_delta) %>% 
        unique()
    })
    
    # Combine results
    results_vector <- as.numeric(list_cbind(results))
    
    return(list(
      parameter = param_name,
      threshold_values = param_values,
      max_delta_values = results_vector
    ))
  }
  
  # Use first values as fixed values for optimization
  fixed_age <- Age_threshold[1]
  fixed_eGFR <- eGFR_threshold[1] 
  fixed_K <- K_threshold[1]
  
  # Initialize results list
  optimization_results <- list()
  
  # Optimize Age threshold (if vector provided)
  if (length(Age_threshold) > 1) {
    optimization_results$Age <- optimize_single("Age", Age_threshold, fixed_age, fixed_eGFR, fixed_K)
    
    if (plot_results) {
      plot(optimization_results$Age$threshold_values, 
           optimization_results$Age$max_delta_values,
           xlab = "Age threshold", 
           ylab = "max_delta",
           main = paste("Age Threshold Optimization -", analysis_type),
           type = "b")
    }
  }
  
  # Optimize eGFR threshold (if vector provided)
  if (length(eGFR_threshold) > 1) {
    optimization_results$eGFR <- optimize_single("eGFR", eGFR_threshold, fixed_age, fixed_eGFR, fixed_K)
    
    if (plot_results) {
      plot(optimization_results$eGFR$threshold_values, 
           optimization_results$eGFR$max_delta_values,
           xlab = "eGFR threshold", 
           ylab = "max_delta", 
           main = paste("eGFR Threshold Optimization -", analysis_type),
           type = "b")
    }
  }
  
  # Optimize K threshold (if vector provided)  
  if (length(K_threshold) > 1) {
    optimization_results$K <- optimize_single("K", K_threshold, fixed_age, fixed_eGFR, fixed_K)
    
    if (plot_results) {
      plot(optimization_results$K$threshold_values, 
           optimization_results$K$max_delta_values,
           xlab = "K threshold", 
           ylab = "max_delta",
           main = paste("K Threshold Optimization -", analysis_type), 
           type = "b")
    }
  }
  
  return(optimization_results)
}


# Only run after HyKal_simData_rate has been created (chunk 2)
optimize_thresholds(
    data = HyKal_simData_rate,
    Age_threshold = seq(50, 80, 1),
    eGFR_threshold = seq(0, 60, 1), 
     K_threshold = seq(4, 7, 0.1),
    analysis_type = "Rate"
    )
# Only run after HyKal_simData_normalFirst has been created (chunk 3)
optimize_thresholds(
    data = HyKal_simData_normalFirst,
    Age_threshold = seq(50, 80, 1),
    eGFR_threshold = seq(0, 60, 1), 
     K_threshold = seq(4, 7, 0.1),
    analysis_type = "First_Week"
    )

# Only run after HyKal_simData_normalPeriod has been created (chunk 4)
optimize_thresholds(
    data = HyKal_simData_normalPeriod,
    Age_threshold = seq(50, 80, 1),
    eGFR_threshold = seq(0, 60, 1), 
    K_threshold = seq(4, 7, 0.1),
    analysis_type = "Period"
    )



4. Exploratory analyses of patient subgroups showing the greatest divergence in clinical response to SPS and SZC


4.1. Outcomes vs. continuous baseline characteristic variables


4.1.1. Rate of serum K[+] change
4.1.1. Rate of serum K[+] change


## - Chunk 2
# Required objects and functions: HyKal_simData (chunk libraries_and_data), quantBase_dotPlot


## Rate of serum K[+] change vs. continuous baseline characteristic variables

# Calculate treatments response rates (change in serum K[+] per week, fitted as a linear model)

HyKal_simData_rate <- 
  HyKal_simData %>%
    group_by(Patient_ID) %>%
    summarise(
        Rate = coef(lm(Week ~ Serum_Potassium_mEq_L))[2], 
        .groups = "drop"
    ) %>% 
    right_join(HyKal_simData, by = "Patient_ID") %>% 
    filter(Time_Point == "Baseline") %>% 
    select(
    Treatment,
    Age, Gender, 
    Baseline_eGFR, Serum_Potassium_mEq_L,
    Diabetes, Hypertension,
    Rate
    )

# Plot rate of K[+] level change vs. continuous baseline variables 

quantBase_dotPlot(HyKal_simData_rate, outcome = "Rate", label_nudge = 6, label_offset = 9, lambda = 20) +
    labs(
      title = "Rate of serum K[+] decline vs. continuous baseline characteristic variables",
      x ="",
      y = bquote("Serum K[+] change (" ~ Delta ~ "mEq /L /Week)"),
      caption = "Thick solid lines: smoothed medians of treatment response rates. Thin dotted lines: overall median."
    ) 



For age and baseline eGFR, smoothed medians of the treatment response rate (thick solid lines) diverged minimally from the overall median (thin dotted lines), showing no evidence of an association. Baseline serum K[+] levels appeared to be negatively associated with response rate variance. Weak trends existed for declining response rates with increasing K[+] levels in SPS-treated patients, and increasing rates in patients treated with SZC.

4.1.2. Time to first observation of normal serum K[+]
4.1.2. Time to first observation of normal serum K[+]


## - Chunk 3
# Required objects and functions: HyKal_simData (chunk libraries_and_data), normokalemia (chunk 1), quantBase_dotPlot


## Time of first normal serum K[+] vs. continuous baseline characteristic variables

# Interpolate time points at which serum K[+] levels move in and out of the normal range

HyKal_simData_normal <-
  HyKal_simData %>% 
    select(Patient_ID, Week, Serum_Potassium_mEq_L) %>% 
    nest(response = !Patient_ID) %>% 
    mutate(response_diff = map(response, ~ {
        x <- .
        Week1 <- x$Week[-length(x$Week)]
        K1 <- x$Serum_Potassium_mEq_L[-length(x$Serum_Potassium_mEq_L)]
        Delta_Week <- diff(x$Week)
        Delta_K <- diff(x$Serum_Potassium_mEq_L)
        m <- Delta_K / Delta_Week
        
        # Estimate crossing points in and out of normal K[+] range
        hyper_normokal <- (max(normokalemia) - K1) / m
        hyper_normokal <- ifelse(hyper_normokal < 0 | hyper_normokal > Delta_Week, NA, hyper_normokal + Week1)
        hypo_normokal <- (min(normokalemia) - K1) / m
        hypo_normokal <- ifelse(hypo_normokal < 0 | hypo_normokal > Delta_Week, NA, hypo_normokal + Week1)      
        
        tibble(
            Week1 = Week1,
            K1 = K1,
            Delta_Week = Delta_Week,
            Delta_K = Delta_K,
            m = m,
            hyper_normokal = hyper_normokal,
            hypo_normokal = hypo_normokal
        )
    })) %>% 
    select(Patient_ID, response_diff) %>% 
    unnest(cols = c(response_diff))

# Select first crossing point from elevated to normal serum K[+] levels (all patients start as hyperkalemic) 

HyKal_simData_normalFirst <- 
  HyKal_simData_normal %>% 
  # Remove duplicate when a crossing point coincides with a measured time-point
  filter(hyper_normokal > Week1 | hypo_normokal > Week1) %>% 
  select("Patient_ID", "hyper_normokal") %>% 
  group_by(Patient_ID) %>%
  slice_min(., hyper_normokal) %>%
  ungroup() %>%
  na.omit %>% 
  left_join(
    ., 
    HyKal_simData %>% filter(Time_Point == "Baseline") %>%    
      select(!c(Week, Time_Point) )
  ) %>% 
  rename("First_Week" = "hyper_normokal") %>% 
  select(!Patient_ID)


# Plot first week within normal range vs. continuous baseline characteristic variables

HyKal_simData_normalFirst %>% 
quantBase_dotPlot(., outcome = "First_Week", label_nudge = 0.1, label_offset = -0.9, lambda =  20) +
  labs(
      title = "Time of first normal serum K[+] vs. continuous baseline characteristic variables",
      x ="",
      y = bquote("Time of first normal serum K[+] (Week)"),
      caption = "Thick solid lines: smoothed medians of treatment response rates. Thin dotted lines: overall median."
    )



Time of first normal serum K[+] does not appear to correlate with age, baseline eGFR, or baseline serum K[+] levels.

4.1.3. Total length of time a normal serum K[+] was observed
4.1.3. Total length of time a normal serum K[+] was observed


## - Chunk 4
# Required objects and functions: HyKal_simData (chunk libraries_and_data), HyKal_simData_normal (chunk 3), quantBase_dotPlot


## Length of time normal serum K[+] levels were observed vs. continuous baseline characteristic variables

# Calculate time intervals in the normal range and their sums

HyKal_simData_normalPeriod <-
  HyKal_simData_normal %>% 
    select(Patient_ID, hyper_normokal, hypo_normokal) %>% 
    nest(normal_periods = !Patient_ID) %>% 
    mutate(
      Period = map(
        normal_periods, ~ {
          x <- .
          unlist(x) %>% 
            # Times of all crossing points, in and out of normokalemia, plus time of last observation (right censored)
            {c(.[!is.na(.)], max(HyKal_simData$Week) )} %>%
            # Calculate time intervals between consecutive crossing points
            sort() %>% diff() %>% 
            # Keep only the intervals when K[+] is in the normal range (first crossing is into the normal range because all patients start hyperkalemic)
            .[seq(1,max(HyKal_simData_normal$Week1),2)] %>% 
            sum(na.rm = TRUE)
          }
        )
      ) %>% 
    select(Patient_ID, Period) %>% 
    unnest(cols = c(Period) ) %>% 
  left_join(
    ., 
    HyKal_simData %>% filter(Time_Point == "Baseline") 
    ) %>% 
  select(., !c(Patient_ID, Time_Point, Week))


# Plot first week within normal range vs. continuous baseline variables

HyKal_simData_normalPeriod %>%
quantBase_dotPlot(., outcome = "Period", label_nudge = 1.7, label_offset = 2.5, lambda =  20) +
    labs(
      title = "Length of time normal serum K[+] levels were observed vs. continuous baseline characteristic variables",
      x ="",
      y = bquote("Total time at normal serum K[+] levels (Weeks)"),
      caption = "Thick solid lines: smoothed medians of treatment response rates. Thin dotted lines: overall median."
    )



The total length of time normal serum K[+] levels were observed correlates with age, baseline eGFR, and baseline serum K[+] levels in SZC-treated patients.

4.2. Outcomes vs. categorical or categorized baseline characteristic variables


Continuous baseline characteristic variables were categorised with thresholds that resulted in the greatest differences between median outcomes in SPS and SZC-treated patients. I decided to show the distribution of outcome values for all patient subgroups. The R script dynamically shades the comparison with the smaller difference for each baseline category value pair, and arranges the baseline variables from left to right in order of decreasing maximum difference.


4.2.1 Rate of serum K[+] change
4.2.1 Rate of serum K[+] change
## - Chunk 5
# Required objects and functions: HyKal_simData_rate (chunk 2), reformat_data, all_comparisons, halfviolin_facetPlot


## Plot distribution of treatment response rates and their differences between treatment groups

# Thresholds that resulted in the greatest or most significant differences between serum K[+] response rates in the two treatment groups

Age_threshold <- 61 # greatest significance: 60
eGFR_threshold <- 28 # greatest significance: 41
K_threshold <- 5.3 # greatest significance: 5.4


# Generate a data-frame with test statistics for annotating the plot and sorting comparisons in stratified groups by significance

HyKal_simData_rate_p <-
  HyKal_simData_rate %>% 
  reformat_data(. , Age_threshold, eGFR_threshold, K_threshold) %>% 
  all_comparisons(., "Rate")

# Plot first week within normal range vs. categorical and categorised baseline variables

HyKal_simData_rate %>% 
  reformat_data(. , Age_threshold, eGFR_threshold, K_threshold) %>%
  halfviolin_facetPlot(., "Rate", HyKal_simData_rate_p) +
  labs(
    title = paste(
    "Rate of serum K[+] change in patients treated with SPS or SZC, \nstratified by categorical or categorised baseline patient characteristics"),
    x = NULL,
    y = bquote("Rate of serum K[+] " ~ "change (" ~ Delta ~ "mEq /L /Week)"),
    caption = "Facets with baseline characteristics are arranged in order of decreasing maximum difference between the median responses in SPS and SZC groups (shaded background: \nbaseline characteristic value associated with the smaller difference ). \nP-values: two-sided Wilcox test, no adjustment for multiple comparisons."
    )


4.2.2. Time to first observation of normal serum K[+]
4.2.2. Time to first observation of normal serum K[+]


## - Chunk 6
# Required objects and functions: HyKal_simData_normalFirst (chunk 4), reformat_data, all_comparisons, halfviolin_facetPlot


## Plot distribution of times to first normal serum K[+] and their differences between treatment groups

# Thresholds that resulted in the greatest or most significant differences between the times of first normokalemia in the two treatment groups

Age_threshold <- 65 # greatest significance: 60
eGFR_threshold <- 23.5 # greatest significance: 41
K_threshold <- 5.2 # greatest significance: 5.2


# Generate a data-frame with test statistics for annotating the plot and sorting comparisons in stratified groups by significance

HyKal_simData_normalFirst_p <-
  HyKal_simData_normalFirst %>% 
  reformat_data(. , Age_threshold, eGFR_threshold, K_threshold) %>% 
  all_comparisons(., "First_Week")

# Plot time to first observation of normal serum K[+] vs. categorical and categorised baseline variables

HyKal_simData_normalFirst %>% 
  reformat_data(. , Age_threshold, eGFR_threshold, K_threshold) %>%
  halfviolin_facetPlot(., "First_Week", HyKal_simData_normalFirst_p) +
  # Do not show negative scale for First_Week
  scale_y_continuous(breaks = seq(0, 8, 2)) +
  labs(
    title = paste(
    "Time of first normal serum K[+] in patients treated with SPS or SZC \nand stratified by categorical or categorised baseline patient characteristics"),
    x = NULL,
    caption = "Facets with baseline characteristics are arranged in order of decreasing maximum difference between the median responses in SPS and SZC groups (shaded background: \nbaseline characteristic value associated with the smaller difference ). \nP-values: two-sided Wilcox test, no adjustment for multiple comparisons."
    ) 


4.2.3. Total length of time a normal serum K[+] was observed
4.2.3. Total length of time a normal serum K[+] was observed


## - Chunk 7
# Required objects and functions: HyKal_simData_normalPeriod (chunk 6), reformat_data, all_comparisons, halfviolin_facetPlot


## Plot distribution of total lengths of time with normal serum K[+], and their differences between treatment groups

# Thresholds that resulted in the greatest or most significant differences between total times of normokalemia in the two treatment groups

Age_threshold <- 69 # greatest significance: 60
eGFR_threshold <- 28 # greatest significance: 41
K_threshold <- 5.2 # greatest significance: 5.2

# Generate a data-frame with test statistics for annotating the plot and sorting comparisons in stratified groups by significance

HyKal_simData_normalPeriod_p <-
  HyKal_simData_normalPeriod %>% 
  reformat_data(. , Age_threshold, eGFR_threshold, K_threshold) %>% 
  all_comparisons(., "Period")

# Plot total lengths of time when serum K[+] was within the normal range vs. categorical and categorised baseline variables

HyKal_simData_normalPeriod %>% 
  reformat_data(. , Age_threshold, eGFR_threshold, K_threshold) %>%
  halfviolin_facetPlot(., "Period", HyKal_simData_normalPeriod_p) +
  # Do not show negative scale for First_Week
  scale_y_continuous(breaks = seq(0, 8, 2)) +
  labs(
    title = paste(
    "Total time serum K[+] levels were within the normal range in patients treated with SPS or SZC \nand stratified by categorical or categorise baseline patient characteristics"),
    x = NULL,
    y = "Total lenght of time (Weeks)",
    caption = "Facets with baseline characteristics are arranged in order of decreasing maximum difference between the median responses in SPS and SZC groups (shaded background: \nbaseline characteristic value associated with the smaller difference ). \nP-values: two-sided Wilcox test, no adjustment for multiple comparisons."
    )



The largest difference between the two treatments existed in patients with low baseline serum K[+], for all three outcomes. However, the significance of these difference was low (insignificant after correction for multible comparisons) because the patient numbers in this subgroup are small.
The next-largest more significant differences were in diabetics (rate), patients aged 65 years or older (time to first normal K[+]), and baseline eGFR (total time of normokalemia). Total length of time with normokalemia was the outcome that showed the most significant differences between SPS and SZC across patient subgroups.
[Note: Because I “optimised” the threshold for categorising the continuous variables, Wilcox test p-values are not really appropriate here, and just serve as illustration for how a visualisation of the data might work.]

4.3. Patients who achieved normal serum K[+] levels at some time during treatment vs. patients who never did


First week of normal serum K[+] levels differed from the other two analyses by not showing patients in whom normokalemia was never observed at any time. Only eight of these non-responders existed, but for completeness compared their baseline characteristics with those of the responders in both treatment groups.

4.3.1. Achievement of normokalemia at any timepoint vs. continuous baseline characteristic variables


# Required objects and functions: R libraries, HyKal_simData (chunk libraries_and_data)

## Association of baseline variables with achieving normokalemia in at least one week of treatment


# Identify patients whose serum K[+] was outside the normal range throughout treatment

HyKal_simData_normoKal <- 
  HyKal_simData %>%
    # Add a column showing whether baseline levels were below or above K_threshold
    mutate(
        K_base = case_when(
            Serum_Potassium_mEq_L < K_threshold & Time_Point == "Baseline" 
            ~ paste0(Treatment, ".K_low"), 
            Serum_Potassium_mEq_L > K_threshold & Time_Point == "Baseline" 
            ~ paste0(Treatment, ".K_high"),
            .default = NA_character_
        ) %>% factor()
    ) %>% 
    group_by(Patient_ID) %>%
    fill(K_base, .direction = "downup") %>% 
    # Add a column showing whether serum K[+] levels were normal at any time during treatment
    mutate(
        normoKal = ifelse(
            Serum_Potassium_mEq_L < max(normokalemia), 
            Week, NA),
        normoKal = if (all(is.na(normoKal[-1]))) "Never" else "At least once"
    ) %>%
    ungroup() 


# Plot distribution of continuous baseline variable values for patients who achieved normokalemia and those who did not 

HyKal_simData_normoKal %>% 
  filter(Time_Point == "Baseline") %>%  
  mutate(
    normoKal = paste0(normoKal, "\n(", Treatment, ")"),
    normoKal = factor(normoKal, levels = rev(unique(normoKal)))
    ) %>%
  select(Treatment, Age, Baseline_eGFR, Serum_Potassium_mEq_L, normoKal) %>% 
  rename(
    "Age \n(years)" = Age,
    "Baseline eGFR \n(ml/min/1.73m^2)" = Baseline_eGFR,
    "Baseline serum K[+] \n(mEq/L)" = Serum_Potassium_mEq_L) %>% 
  pivot_longer(
    cols = !c(Treatment, normoKal), 
    names_to = "pChar"
    ) %>% 
  ggplot(
    aes(
      x = normoKal,
      y = value,
      fill = Treatment,
      color = Treatment
      )
    ) + 
  geom_violin( 
    width = 0.5,
    alpha = 0.4,
    color = "grey") +
  geom_point(
    pch = 1,
    size = 2,
    position = 
      position_jitterdodge(
        dodge.width = 0.6, 
        jitter.width = 0.2
        ),
    stroke = 1,
    alpha = 0.7
    ) +
  geom_boxplot(
    color = "black",
    alpha = 0,
    width = 0.3, 
    outlier.shape = NA) +
  scale_color_manual(values = c("#27C8F5", "#F58E27") ) +
  scale_fill_manual(values = c("#27C8F5", "#F58E27") ) +
    facet_grid(
      pChar ~ Treatment,
      switch = "y", 
      scales = "free", space = "free_x"
      ) +
  labs(
    x = "Number of times normal serum K[+] levels were observed",
    y = "",
    title = "Association of continuous baseline characteristic variables \nwith achievement of normokalemia at any timepoint during treatment",
    caption = "All patients treated with SZC had normal serum K[+] levels in at least one week of treatment. \nHigh eGFR and low K[+] at baseline could be risk factors for not achieving normokalemia at any time in patients treated with SPS."
    ) +
  theme_light () +
  theme(
    strip.background = element_blank(),
    strip.text.x = element_text(
      color = "black",
      vjust = 3,
      size = 12,
      ),
    strip.text.y = element_text(
      color = "black",
      size = 12
      ),
    strip.placement = "outside",
    plot.title = element_text(
      size = 14,
      hjust = 0.5,
      vjust = 5
      ),
    axis.text = element_text(size = 11),
    plot.caption = element_text(
      size = 8,
      hjust = 0,
      vjust = -5
      ),
    plot.margin = unit(c(0.5,0,0.5,0), "cm"),
    legend.position = "none"
    ) 



4.3.2. Achievement of normokalemia at any timepoint vs. categorical baseline characteristic variables


# Required objects and functions: - R libraries (chunk libraries_and_data)
#           - HyKal_simData_normoKal (chunk 8)


## Plot fraction of categorical baseline characteristics among who did and did not achieve normokalemia 

HyKal_simData_normoKal %>% 
    filter(Time_Point == "Baseline") %>%  
    mutate(normoKal = paste0(normoKal, "\n(", Treatment, ")")) %>%
    select(Treatment, Gender, Diabetes, Hypertension, normoKal) %>% 
    pivot_longer(
        cols = !c(Treatment, normoKal), 
        names_to = "pChar",
        values_to = "pChar_value"
    ) %>% 
    group_by(pChar, normoKal) %>%
    summarise(
        "Yes" = mean(pChar_value == "Yes") * 100,
        "No" = mean(pChar_value == "No") * 100,
        "Male" = mean(pChar_value == "Male") * 100,
        "Female" = mean(pChar_value == "Female") * 100,
        n = n()
    ) %>% ungroup() %>% 
    pivot_longer(
        cols = c(Yes, No, Male, Female),
        names_to = "pChar_value",
        values_to = "pChar_value_perc"
    ) %>% 
    mutate(
      Treatment = ifelse(grepl("SZC", normoKal), "SZC", "SPS"),
      Treatment_value = paste0(Treatment, "_", pChar_value),
      normoKal = paste0(normoKal, "\nn = ", n),
      normoKal = factor(normoKal, levels = rev(unique(normoKal)))
      ) %>% 
    filter(!pChar_value_perc == 0) %>% 
    ggplot(
      aes(
        x = normoKal,
        y = pChar_value_perc,
        fill = Treatment_value,
        group = Treatment_value
        )
      ) + 
    geom_col(
      width = 0.5
    ) +
    geom_text(
      aes(label = pChar_value),
      color = "white",
      fontface = "bold",
      size = 3.5,
                position = position_stack(vjust = 0.5)
            ) +
    scale_fill_manual(
      values = c(
        "#27C8F5", "#4795A8", "#27C8F5", "#4795A8",
        "#F58E27", "#A34018", "#F58E27", "#A34018"
        )
      ) +
    facet_grid(
      pChar ~ Treatment,
      switch = "y", 
      scales = "free_x", space = "free_x"
      ) +
    labs(
      x = "Number of times normal serum K[+] levels were observed",
      y = "Fraction (%)",
      title = "Association of categorical baseline characteristic variables \nwith achievement of normokalemia at any timepoint during treatment"
      ) +
    theme_light () +
    theme(
        strip.background = element_blank(),
        strip.text.x = element_text(
            color = "black",
            vjust = 3,
            size = 12,
        ),
        strip.text.y = element_text(
            color = "black",
            size = 12
            ),
        strip.placement = "outside",
        text = element_text( size = 13), 
        plot.title = element_text(
            size = 14,
            hjust = 0.5,
            vjust = 5
        ),
        axis.text = element_text(size = 11),
        plot.caption = element_text(
            hjust = 0,
            vjust = -5
        ),
        plot.margin = unit(c(0.5,0,0.5,0), "cm"),
        legend.position = "none"
    ) 



5. Outlook

I am grateful to the panelists of the PSI VIS webinar for suggestions for improving the figures. One that I did not implement is adding a Subpopulation Treatment Effect Pattern Plot (STEPP). STEPP plots were originally developed for the analysis of time-to-event (i.e., survival), or generic data where the outcome has an exponential distribution 5. They commonly consist of three panels: the first is similar to the smoothed median lines in the quantitative baseline characteristic variable plots of section 4.1. of this document. The second and third panel show difference and ratio, respectively, of the smoothed medians of two groups with confidence intervals. Visual inspection may help to identify covariate value ranges for which the treatment effect might differ. However, STEPP is intended as an exploratory tool rather than for determining specific cut-points. An R package “stepp” is available at CRAN 6.

6. References


  1. Chronic Hyperkaliemia in Chronic Kidney Disease: An Old Concern with New Answers. Borrelli S, Matarazzo I, Lembo E, Peccarino L, Annoiato C, Scognamiglio MR, Foderini A, Ruotolo C, Franculli A, Capozzi F, Yavorskiy P, Merheb F, Provenzano M, La Manna G, De Nicola L, Minutolo R, Garofalo C. Int J Mol Sci. 2022 Jun 7;23(12):6378 https://doi.org/10.3390/ijms23126378

  2. https://www.ncbi.nlm.nih.gov/books/NBK470284/

  3. Stephen L Seliger, Hyperkalemia in patients with chronic renal failure, Nephrology Dialysis Transplantation, Volume 34, Issue Supplement_3, December 2019, Pages iii12–iii18, https://doi.org/10.1093/ndt/gfz231

  4. Sodium zirconium cyclosilicate versus sodium polystyrene sulfonate for treatment of hyperkalemia in hemodialysis patients: a randomized clinical trial. Elsayed MM, Abdelrahman MA, Sorour AM, Rizk IG, Hassab MAA. BMC Nephrol. 2025 May 6;26(1):227 https://doi.org/10.1186/s12882-025-04129-9

  5. Sergio Venturini, Marco Bonetti, Ann A. Lazar, Bernard F. Cole, Xin Victoria Wang, Richard D. Gelber, Wai-Ki Yip, Subpopulation Treatment Effect Pattern Plot (STEPP) Methods with R and Stata, J. data sci. 21(2022), no. 1, 106-126 https://doi.org/10.6339/22-JDS1060

  6. https://cran.r-project.org/web/packages/stepp/stepp.pdf