About myself: www.linkedin.com/in/ThomasWs-Mopfair
These plots show the distribution of patient age at the onset of CLN6
disease, and how age at onset relates to the age at which a
diagnosis was obtained, to gender, and to the type of genetic variant (1).
## Load required R libraries
library(tidyverse)
library(ggpubr)
## Get example data file
# Note that although the example file produces summary results that are similar to those in the
# publication, its data differ from those of the actual patient population
variants_sim <-
  read.delim(
    "https://raw.githubusercontent.com/thomas-weissensteiner/portfolio/main/Summarizing_Human_Phenotype_Ontology_Terms/variants_simulated.csv",
    quote = ""
  ) %>%
  select(
    Gender, Coding_Effect, Clinical_Significance,
    Age.at.onset, Age.at.referral, Age.at.diagnosis
  )
The input file is a tab-delimited text file, despite its .csv extension. The data in the example file are for illustration only and are not the actual data of the study (1).
| | Gender | Coding_Effect | Clinical_Significance | Age.at.onset | Age.at.referral | Age.at.diagnosis |
|---|---|---|---|---|---|---|
| 43 | F | In-frame | P | 3 | 4.92 | 5 |
| 44 | ? | In-frame | P | Not provided | unknown | unknown |
| 49 | F | Splicing mutation | LP | 8.75 | 8.83 | |
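The age columns in the example rows mix numbers with text placeholders such as "Not provided" and "unknown". As a minimal sketch of how such entries behave under `as.numeric()`, the toy vector below (hypothetical values modelled on the rows above, not read from the file) shows that text placeholders and blanks become `NA` and can then be dropped.

```r
# Toy vector modelled on the example rows; not read from the actual file
age_raw <- c("3", "Not provided", "8.75", "")

# as.numeric() returns NA for entries it cannot parse;
# suppressWarnings() hides the "NAs introduced by coercion" warning
age_num <- suppressWarnings(as.numeric(age_raw))

# Flag the entries that failed to parse, then keep only the usable ages
failed <- is.na(age_num)
age_clean <- age_num[!failed]
age_clean
```

In the plotting code the same effect is achieved by coercing the columns and removing `NA` values before plotting.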
(corresponds to figure 3 in (1)).
(A) Ages at onset of symptoms (left) and at referral for genetic
testing that resulted in a diagnosis of CLN6 disease (right).
(B) Time intervals from symptom onset to referral/diagnosis.
Each dot represents one of 34 (A) or 32 (B) patients for whom sufficient
information was available. Grey areas show the smoothed distribution of
ages or years, white boxes the interquartile ranges, and the line inside
each box the median.
# There is no explicit data cleaning step here: as.numeric() coerces blanks and
# non-numeric values in the age columns to NA (with a warning for non-numeric text),
# and na.rm = TRUE removes them when the data frame is reshaped
p1 <-
  variants_sim %>%
  mutate(
    Onset = as.numeric(Age.at.onset),
    Referral = as.numeric(Age.at.referral)
  ) %>%
  # Reshape to long format: one row per patient and observation type
  gather(
    key = observation, value = age,
    Onset, Referral,
    na.rm = TRUE
  ) %>%
  ggplot(
    aes(
      x = factor(observation, levels = c("Onset", "Referral")),
      y = age
    )
  ) +
  geom_violin(fill = "lightgrey", width = 0.75) +
  geom_boxplot(width = 0.2, outlier.shape = NA) +
  geom_jitter(height = 0.025, width = 0.05, size = 2, alpha = 0.2) +
  theme_bw() +
  theme(
    text = element_text(size = 13),
    axis.text.x = element_text(vjust = -1)
  ) +
  labs(x = NULL, y = "Age [years]")
p2 <-
  variants_sim %>%
  mutate(
    age = as.numeric(Age.at.referral) - as.numeric(Age.at.onset)
  ) %>%
  # Keep only patients for whom both ages are available
  filter(!is.na(age)) %>%
  ggplot(aes(x = "", y = age)) +
  geom_violin(fill = "lightgrey", width = 0.75) +
  geom_boxplot(width = 0.2, outlier.shape = NA) +
  geom_jitter(height = 0.025, width = 0.05, size = 2, alpha = 0.2) +
  theme_bw() +
  theme(
    text = element_text(size = 13),
    axis.text.x = element_text(vjust = -1)
  ) +
  labs(
    x = NULL,
    y = "Time from onset to referral [years]"
  )
# Combine both panels, giving panel A more width because it holds two violins
ggarrange(
  p1, p2,
  labels = c("A", "B"),
  font.label = list(size = 14),
  ncol = 2, widths = c(2.5, 1)
)
Most patients develop CLN6 disease around the age of three, but it
can take years before referral for genetic testing and diagnosis.
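A summary like this can also be checked numerically. The following is a minimal sketch, assuming a data frame with the same `Age.at.onset` and `Age.at.referral` columns as `variants_sim`. The `toy` data frame holds only the three example rows shown earlier, so its medians do not reproduce the figure.

```r
# Toy data: the three example rows shown earlier (not the full data set)
toy <- data.frame(
  Age.at.onset    = c("3", "Not provided", "8.75"),
  Age.at.referral = c("4.92", "unknown", "8.83"),
  stringsAsFactors = FALSE
)

# Coerce to numeric; text placeholders become NA
onset    <- suppressWarnings(as.numeric(toy$Age.at.onset))
referral <- suppressWarnings(as.numeric(toy$Age.at.referral))
delay    <- referral - onset

# Medians over the patients with usable values
median_onset <- median(onset, na.rm = TRUE)
median_delay <- median(delay, na.rm = TRUE)
n_complete   <- sum(!is.na(delay))
```

Replacing `toy` with `variants_sim` should give the medians drawn as the lines inside the boxes of the two panels.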
# Distributions are compared with the Wilcoxon test, using the stat_compare_means function in ggpubr (2).
variants_sim %>%
  select(Gender, Age.at.onset) %>%
  # Coerce ages to numeric first, so that non-numeric entries become NA and can be filtered out
  mutate(Age.at.onset = as.numeric(Age.at.onset)) %>%
  filter(Gender != "?", !is.na(Age.at.onset)) %>%
  ggplot(aes(x = Gender, y = Age.at.onset)) +
  geom_violin(fill = "lightgrey", width = 0.6) +
  geom_boxplot(width = 0.2, outlier.shape = NA) +
  geom_jitter(height = 0.05, width = 0.025, size = 3, alpha = 0.3) +
  theme_bw() +
  theme(
    text = element_text(size = 21),
    axis.text.x = element_text(vjust = -1)
  ) +
  labs(x = NULL, y = "Age at onset [years]") +
  stat_compare_means(size = 6, hjust = -0.2, vjust = 0.2)
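For two groups, `stat_compare_means()` uses the Wilcoxon rank-sum test by default. As a rough sketch of the same kind of comparison in base R, the example below runs `wilcox.test()` on a small hypothetical data frame. The values in `toy` are invented for illustration and are not patient data.

```r
# Hypothetical ages at onset (years) for two groups; not patient data
toy <- data.frame(
  Gender = rep(c("F", "M"), each = 4),
  Age.at.onset = c(2.5, 3.0, 3.5, 4.0, 2.75, 3.25, 5.0, 6.0)
)

# Wilcoxon rank-sum test of age at onset by gender
res <- wilcox.test(Age.at.onset ~ Gender, data = toy)

# Test statistic W and p-value
res$statistic
res$p.value
```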
(corresponds to figure 7 in (1))
# Note: The Wilcoxon test could not be performed because of multiple ties between groups
variants_sim %>%
  filter(
    Coding_Effect != "Effect_unknown",
    !Age.at.onset %in% c("Not provided", "")
  ) %>%
  mutate(
    # Group all loss-of-function variant types under a single label
    Coding_Effect = ifelse(
      Coding_Effect %in% c(
        "Frameshift", "Splicing mutation", "Nonsense", "Start loss"),
      "LoF",
      Coding_Effect
    ) %>%
      factor(levels = c("LoF", "In-frame", "Missense")),
    Age.at.onset = as.numeric(Age.at.onset)
  ) %>%
  ggplot(aes(x = Coding_Effect, y = Age.at.onset)) +
  geom_violin(fill = "lightgrey", width = 0.75) +
  geom_boxplot(width = 0.2, outlier.shape = NA) +
  geom_jitter(height = 0.025, width = 0.05, size = 3, alpha = 0.5) +
  theme_bw() +
  theme(
    text = element_text(size = 20),
    axis.text.x = element_text(vjust = -1)
  ) +
  labs(
    x = NULL,
    y = "Age at onset [years]"
  )
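On the note about ties: in base R, tied values prevent `wilcox.test()` from computing an exact p-value, but a normal-approximation p-value is still available with `exact = FALSE`, and `kruskal.test()` compares more than two groups. The following is a hedged sketch on hypothetical data. The values are invented, and the source does not say whether either test would be appropriate for the study data.

```r
# Hypothetical ages at onset with tied values across three groups; not patient data
toy <- data.frame(
  Coding_Effect = rep(c("LoF", "In-frame", "Missense"), each = 4),
  Age.at.onset  = c(2, 3, 3, 4, 3, 3, 4, 5, 3, 4, 4, 6)
)

# Kruskal-Wallis test across all three groups (includes a correction for ties)
kw <- kruskal.test(Age.at.onset ~ Coding_Effect, data = toy)

# Pairwise Wilcoxon test; exact = FALSE requests the normal approximation,
# which remains available when ties are present
two <- subset(toy, Coding_Effect %in% c("LoF", "Missense"))
wt <- wilcox.test(Age.at.onset ~ Coding_Effect, data = two, exact = FALSE)

c(kruskal = kw$p.value, wilcoxon = wt$p.value)
```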
The term “Loss of Function” (LoF) is used for variants causing
frameshifts in the protein sequence, errors in RNA transcription and
processing (i.e., “Start loss”, “Splicing mutation”), and RNA
instability (“Nonsense”). Their impact on protein function is generally
greater than that of changes in a few amino acid residues (“In-frame” deletion
or insertion, “Missense” substitution). In CLN6 disease, more severely
impaired protein function can lead to faster accumulation of toxic
metabolites, and thereby to earlier clinical symptoms.
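The grouping described above can be sketched in base R. The function name `collapse_lof` and the toy input are hypothetical; the category labels are the ones used in the plotting code.

```r
# Labels treated as loss-of-function, as in the plotting code
lof_effects <- c("Frameshift", "Splicing mutation", "Nonsense", "Start loss")

# collapse_lof() is a hypothetical helper: it maps LoF-type effects to "LoF"
# and returns a factor ordered LoF, In-frame, Missense.
# `effect` must be a character vector (not a factor).
collapse_lof <- function(effect) {
  grouped <- ifelse(effect %in% lof_effects, "LoF", effect)
  factor(grouped, levels = c("LoF", "In-frame", "Missense"))
}

# Toy input, not taken from the data file
effects <- c("Frameshift", "Missense", "In-frame", "Nonsense", "Start loss")
grouped <- collapse_lof(effects)
table(grouped)
```

Fixing the factor levels keeps the LoF group first on the x-axis, which makes the comparison with the milder variant types easier to read.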
(1) Rus CM, Weissensteiner T, Pereira C, …, Beetz C. Clinical and genetic characterization of a cohort of 97 CLN6 patients tested at a single center. Orphanet J Rare Dis. 2022 May 3;17(1):179.
(2) Kassambara. Add P-values and Significance Levels to ggplots. 31/08/2017. STHDA: ggpubr: Publication Ready Plots.