My scripts:

  • rendered html versions: Rpubs/thomas-weissensteiner
  • .rmd files with executable code chunks: www.github.com/thomas-weissensteiner/portfolio/tree/main/

About myself: www.linkedin.com/in/ThomasWs-Mopfair




1. Background and motivation

These plots show the distribution of patient age at the onset of CLN6 disease, and how age at onset relates to age at diagnosis, gender, and type of genetic variant (1).

2. Code and figures

## Load required R libraries

library(tidyverse)
library(ggpubr)

## Get example data file
# Note that although the example file produces summary results that are similar to those in the 
# publication, its data differ from those of the actual patient population

variants_sim <- 
  read.delim(
    "https://raw.githubusercontent.com/thomas-weissensteiner/portfolio/main/Summarizing_Human_Phenotype_Ontology_Terms/variants_simulated.csv", 
    quote = ""    # disable quote handling so that quote characters in fields are read literally
  ) %>% 
  select(
    Gender, Coding_Effect, Clinical_Significance, 
    Age.at.onset, Age.at.referral, Age.at.diagnosis
  )


2.1. Example file and data structure

The input file is a tab-delimited text file (despite its .csv extension). Data in the example file are for illustration only and are not the actual study data (1).

Gender Coding_Effect Clinical_Significance Age.at.onset Age.at.referral Age.at.diagnosis
43 F In-frame P 3 4.92 5
44 ? In-frame P Not provided unknown unknown
49 F Splicing mutation LP 8.75 8.83
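
The age columns mix numbers with placeholders such as "Not provided" or "unknown", so R reads them as text. As a minimal sketch of how such entries behave under numeric conversion, the snippet below uses a made-up vector (the values are illustrative assumptions, not taken from the example file):

```r
# Toy values only (hypothetical): mimic the mixed entries seen in the age columns
age_raw <- c("3", "Not provided", "unknown", "", "8.75")

# as.numeric() returns NA for anything it cannot parse; suppressWarnings()
# hides the "NAs introduced by coercion" warning
age_num <- suppressWarnings(as.numeric(age_raw))

age_num              # numeric vector, NA for placeholders and blanks
sum(is.na(age_num))  # number of entries that would later be dropped
```

Converting first and filtering on NA afterwards avoids having to list every placeholder spelling explicitly.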



2.2. Patient ages at inferred disease onset, and time of referral for genetic testing

(corresponds to figure 3 in (1)).

(A) Ages at onset of symptoms (left) and at referral for genetic testing that resulted in a diagnosis of CLN6 disease (right).
(B) Time intervals from symptom onset to referral/diagnosis.
Each dot represents one of 34 (A) or 32 (B) patients for whom sufficient information was available. Grey areas show the smoothed distribution of ages or years, white boxes the interquartile ranges, and the line inside each box the median.

# Here I skipped an explicit data cleaning step, but "as.numeric" coerces all blanks and non-numeric values in the age columns to "NA", allowing them to be filtered out by "na.rm = T" when reshaping the dataframe

p1 <- 
  variants_sim %>% 
    mutate(
      # Non-numeric entries become NA (R emits a coercion warning)
      Onset = as.numeric(Age.at.onset), 
      Referral = as.numeric(Age.at.referral)
      ) %>% 
    # Reshape to long format: one row per patient and observation type
    gather(
      key = observation, 
      value = age, Onset, Referral, 
      na.rm = TRUE) %>%
    ggplot(
        aes(
         x = factor(
           observation, 
           levels = c("Onset", "Referral")
           ), 
         y = age)
        ) +
        geom_violin(
          fill = "lightgrey", width = 0.75) +
        geom_boxplot(
          width = 0.2, outlier.shape = NA) +
        geom_jitter(
          height = 0.025, width = 0.05, size = 2, alpha = 0.2) +
        theme_bw() +
        theme(
          text = element_text(size = 13), 
          axis.text.x = element_text(vjust = -1)
          ) +
        labs(x = NULL, y = "Age [years]")

p2 <- 
  variants_sim %>% 
  mutate(
    # Interval is NA whenever either age is missing or non-numeric
    age = as.numeric(Age.at.referral) - as.numeric(Age.at.onset)
    ) %>% 
  ggplot(
    aes(x = "", y = age)) +
  geom_violin(
    fill = "lightgrey", width = 0.75) +
  geom_boxplot(
    width = 0.2, outlier.shape = NA) +
  geom_jitter(
    height = 0.025, width = 0.05, size = 2, alpha = 0.2) +
  theme_bw() +
  theme(
    text = element_text(size = 13), 
    axis.text.x = element_text(vjust = -1)) +
  labs(
    x = NULL, 
    y = "Time from onset to referral [years]")

ggarrange(
  p1, p2, 
  labels = c("A", "B"),
  font.label = list(size = 14),
  ncol = 2, widths = c(2.5, 1)
  )


Most patients develop CLN6 disease around the age of three, but it can take years before they are referred for genetic testing and receive a diagnosis.
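
The medians and interquartile ranges drawn by the boxplots can also be read off numerically. The following is a minimal sketch on a made-up data frame; the column names `onset`, `referral`, and `interval` and all values are assumptions for illustration, not the study data:

```r
# Toy data only (hypothetical values, not from the study or the example file)
toy <- data.frame(
  onset    = c(2.5, 3.0, 3.5, 4.0, NA),
  referral = c(4.0, 5.5, 6.0, 9.0, 7.0)
)
toy$interval <- toy$referral - toy$onset   # NA wherever either age is missing

# Quartiles and median per column, ignoring missing values
summary_stats <- sapply(
  toy,
  function(x) quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
)
summary_stats
```

The row labelled "50%" holds the medians; the "25%" and "75%" rows bound the interquartile range.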

2.3. Age at disease onset, stratified by patient gender

# Distributions are compared with the Wilcoxon test, using the stat_compare_means function in ggpubr (2).

variants_sim %>% 
  select(
    Gender, Age.at.onset) %>% 
  # Convert ages first, so that non-numeric entries become NA and can be dropped
  mutate(
    Age.at.onset = as.numeric(Age.at.onset)) %>% 
  filter(
    Gender != "?", !is.na(Age.at.onset)) %>%
  ggplot(
    aes(
      x = Gender, y = Age.at.onset
      )
    ) +
  geom_violin(
    fill = "lightgrey", width = 0.6) +
  geom_boxplot(
    width = 0.2, outlier.shape = NA) +
  geom_jitter(
    height = 0.05, width = 0.025, size = 3, alpha = 0.3) +
  theme_bw() +
  theme(
    text = element_text(size = 21), 
    axis.text.x = element_text(vjust = -1)) +
  labs(x = NULL, y = "Age of onset [years]") + 
  stat_compare_means(size = 6, hjust = -0.2, vjust = 0.2)
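
For two groups, stat_compare_means uses the Wilcoxon test by default. As a rough sketch of the underlying comparison, base R's wilcox.test can be run directly on a made-up data frame. The values below are assumptions for illustration, and the call is not necessarily identical to the one ggpubr makes internally:

```r
# Toy data only (hypothetical values, not from the study or the example file)
toy <- data.frame(
  gender = c("F", "F", "F", "F", "M", "M", "M", "M"),
  onset  = c(2.0, 3.0, 3.5, 4.0, 2.5, 3.0, 5.0, 6.0)
)

# Two-sided Wilcoxon rank-sum test; exact = FALSE requests the normal
# approximation, which avoids the warning that tied values would trigger
res <- wilcox.test(onset ~ gender, data = toy, exact = FALSE)
res$p.value
```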



2.4. Association between age at onset and predicted coding effect of CLN6 variants

(corresponds to figure 7 in (1))

# Note: The Wilcoxon test could not be performed because of multiple ties between groups

variants_sim %>% 
  filter(
    Coding_Effect != "Effect_unknown", 
    !Age.at.onset %in% c("Not provided", "")
    ) %>% 
  mutate(
    # Pool all loss-of-function categories into a single "LoF" level
    Coding_Effect = ifelse(
      Coding_Effect %in% c(
        "Frameshift", "Splicing mutation", "Nonsense", "Start loss"), 
      "LoF", 
      Coding_Effect  
      ) %>% 
      factor(
        levels = c("LoF", "In-frame", "Missense")
        ),
    Age.at.onset = as.numeric(Age.at.onset)
    ) %>%
  ggplot(
    aes(
      x = Coding_Effect, 
      y = Age.at.onset)) +      
  geom_violin(
    fill = "lightgrey", width = 0.75) +
  geom_boxplot(
    width = 0.2, outlier.shape = NA) +
  geom_jitter(
    height = 0.025, width = 0.05, size = 3, alpha = 0.5) +
  theme_bw() +
  theme(
    text = element_text(size = 20), 
    axis.text.x = element_text(vjust = -1)) +
  labs(
    x = NULL, 
    y = "Age at onset [years]"
    ) 



The term “Loss of Function” (LoF) is used for variants causing frame shifts in the protein sequence, errors in translation initiation or RNA processing (“Start loss”, “Splicing mutation”), and RNA instability (“Nonsense”). Their impact on protein function is generally greater than that of changes in a few amino acid residues (“In-frame” deletion or insertion, “Missense” substitution). In CLN6 disease, more severely impaired protein function can lead to faster accumulation of toxic metabolites, and thereby to earlier clinical symptoms.
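
The grouping described above can be sketched in isolation on a made-up vector. The category names follow the plotting code in section 2.4, while the vector itself and the helper name `lof_terms` are assumptions for illustration:

```r
# Toy vector only (hypothetical entries; category names as in section 2.4)
coding_effect <- c("Frameshift", "Missense", "Nonsense", "In-frame",
                   "Splicing mutation", "Start loss")

# Categories pooled as loss of function
lof_terms <- c("Frameshift", "Splicing mutation", "Nonsense", "Start loss")

# Collapse all LoF categories into one level and keep the others unchanged
effect_group <- factor(
  ifelse(coding_effect %in% lof_terms, "LoF", coding_effect),
  levels = c("LoF", "In-frame", "Missense")
)

table(effect_group)
```

Any category not listed in `levels` would become NA in the factor, which is why unknown effects are filtered out before this step.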

3. References

  1. Rus CM, Weissensteiner T, Pereira C, …, Beetz C. Clinical and genetic characterization of a cohort of 97 CLN6 patients tested at a single center. Orphanet J Rare Dis. 2022 May 3;17(1):179.

  2. Kassambara. Add P-values and Significance Levels to ggplots. STHDA (ggpubr: Publication Ready Plots), 31/08/2017.