This vignette gives an overview of how to perform second-order, concordance network-based unsupervised clustering to identify communities of patients with unique symptom network patterns. For more information on this clustering method and examples of its applications, please see Henry et al., Plos One, 2018, Bergsneider et al., Neuro-Oncology Advances, 2022, and Bergsneider et al., Cancer Medicine, 2024.
For this example, we will use the data from the introduction vignette to construct a GGM. Please refer to the introduction vignette for more details on where this data is from and how the network is constructed.
library(PRONA)
reduced_df <- read.csv('../../PRONA_additional_files/test_scripts/output/reduced_data.csv', stringsAsFactors = FALSE)
ptsd_network <- construct_ggm(reduced_df, normal = FALSE)## Loading required namespace: huge
## Conducting nonparanormal (npn) transformation via skeptic....done.
To run second-order unsupervised clustering, we use the get_communities function, which has 6 parameters:
data: Dataframe of symptom severities (formatted as required by PRONA)
nrows: Number of rows in the dataframe (Defaults to the number of rows in data, you should not change this parameter)
ncols: Number of columns in the dataframe (Defaults to the number of columns in data, you should not change this parameter)
detectAlgo: The community detection algorithm to use. Options include FG (fast greedy), IM (infomap), LP (label propagation), LE (leading eigen), LV (louvain), or WT (walktrap). (Default: WT)
simil_measure: The similarity measure to use, either ‘Euclidean’ or ‘ARI’ (Adjusted Rand Index) (Default: ARI)
simplify_graphs: Boolean that indicates whether to simplify the graph by removing multi-edges and loops (Default: TRUE)
It returns a new dataframe that is a copy of the original symptom severities dataframe, but with a new column representing which community each patient belongs to.
Lets see how many patients are in each community
##
## 1 2 3
## 136 218 5
Next, lets plot a heatmap of symptom severities for each community using the plot_community_heatmap function, which has four parameters:
df: The output of the get_communities function
cluster_rows: Boolean representing whether or not to cluster the rows (symptoms) via heirarchical clustering (Default: TRUE)
network: A network object output by construct_ggm. If a network is passed into this function, the heatmap will automatically order rows (symptoms) by the symptom clusters identified in the network. This overrides the cluster_rows parameter. (Default: NULL)
row_label_size: The font size of the row labels (Default: 10)
Here, “PC” stands for “Patient Community”:
Here’s how to order rows based on the symptom clusters identified during network construction:
We can also visualize the symptom severities of each community using line plots. The community_line_plot function has two parameters:
data: output of get_communities
communities: ector specifying which communities to plot. If you include 0 in the vector, it will plot the severity for all patients combined. (Default: c(0))
We can also get a summary of each symptom in each community using the get_community_summary function, which returns a dataframe with the mean, median, upper bound, and lower bound of each symptom in each community.
## # A tibble: 45 × 6
## # Groups: community [3]
## community variable mean median upper lower
## <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 AVOID.REMINDERS.OF.THE.TRAUMA 0.551 0 0.716 0.387
## 2 1 BAD.DREAMS.ABOUT.THE.TRAUMA 0.309 0 0.431 0.187
## 3 1 DISTANT.OR.CUT.OFF.FROM.PEOPLE 0.838 0 1.02 0.657
## 4 1 FEELING.EMOTIONALLY.NUMB 0.640 0 0.802 0.478
## 5 1 FEELING.IRRITABLE 0.853 1 1.01 0.699
## 6 1 FEELING.PLANS.WONT.COME.TRUE 0.625 0 0.789 0.461
## 7 1 HAVING.TROUBLE.CONCENTRATING 1.37 1 1.55 1.18
## 8 1 HAVING.TROUBLE.SLEEPING 1.60 2 1.80 1.40
## 9 1 LESS.INTEREST.IN.ACTIVITIES 0.309 0 0.445 0.172
## 10 1 NOT.ABLE.TO.REMEMBER 0.397 0 0.535 0.259
## # ℹ 35 more rows
We can construct GGM networks and conduct statistical assessment for individual communities using the same pipeline as described in the introduction and statistical assessment vignettes.
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
comm1_df <- community_df[community_df$community==1,] %>% select(-community)
comm2_df <- community_df[community_df$community==2,] %>% select(-community)## Conducting nonparanormal (npn) transformation via skeptic....done.
## Warning: The network input contains unconnected nodes:
## AVOID.REMINDERS.OF.THE.TRAUMA, BAD.DREAMS.ABOUT.THE.TRAUMA, FEELING.PLANS.WONT.COME.TRUE, HAVING.TROUBLE.CONCENTRATING, HAVING.TROUBLE.SLEEPING, NOT.THINKING.ABOUT.TRAUMA, PHYSICAL.REACTIONS, RELIVING.THE.TRAUMA, JUMPY.STARTLED.OVER.ALERT
## Warning: Some variables did not belong to a dimension: AVOID.REMINDERS.OF.THE.TRAUMA, BAD.DREAMS.ABOUT.THE.TRAUMA, FEELING.PLANS.WONT.COME.TRUE, HAVING.TROUBLE.CONCENTRATING, HAVING.TROUBLE.SLEEPING, NOT.THINKING.ABOUT.TRAUMA, PHYSICAL.REACTIONS, RELIVING.THE.TRAUMA, JUMPY.STARTLED.OVER.ALERT
##
## Use caution: These variables have been removed from the TEFI calculation
## Model: GLASSO (EBIC with gamma = 0)
## Correlations: auto
## Lambda: 0.301079593387959 (n = 100, ratio = 0.1)
##
## Number of nodes: 15
## Number of edges: 4
## Edge density: 0.038
##
## Non-zero edge weights:
## M SD Min Max
## 0.087 0.063 0.038 0.178
##
## ----
##
## Algorithm: Walktrap
##
## Number of communities: 2
##
## AVOID.REMINDERS.OF.THE.TRAUMA BAD.DREAMS.ABOUT.THE.TRAUMA
## NA NA
## DISTANT.OR.CUT.OFF.FROM.PEOPLE FEELING.EMOTIONALLY.NUMB
## 1 1
## FEELING.IRRITABLE FEELING.PLANS.WONT.COME.TRUE
## 1 NA
## HAVING.TROUBLE.CONCENTRATING HAVING.TROUBLE.SLEEPING
## NA NA
## LESS.INTEREST.IN.ACTIVITIES NOT.ABLE.TO.REMEMBER
## 1 2
## NOT.THINKING.ABOUT.TRAUMA PHYSICAL.REACTIONS
## NA NA
## RELIVING.THE.TRAUMA JUMPY.STARTLED.OVER.ALERT
## NA NA
## UPSETTING.REMINDERS
## 2
##
## ----
##
## Unidimensional Method: Louvain
## Unidimensional: No
##
## ----
##
## TEFI: -2.513
## Conducting nonparanormal (npn) transformation via skeptic....done.
## Model: GLASSO (EBIC with gamma = 0.25)
## Correlations: auto
## Lambda: 0.129502975706062 (n = 100, ratio = 0.1)
##
## Number of nodes: 15
## Number of edges: 47
## Edge density: 0.448
##
## Non-zero edge weights:
## M SD Min Max
## 0.081 0.068 0.001 0.343
##
## ----
##
## Algorithm: Walktrap
##
## Number of communities: 2
##
## AVOID.REMINDERS.OF.THE.TRAUMA BAD.DREAMS.ABOUT.THE.TRAUMA
## 1 1
## DISTANT.OR.CUT.OFF.FROM.PEOPLE FEELING.EMOTIONALLY.NUMB
## 2 2
## FEELING.IRRITABLE FEELING.PLANS.WONT.COME.TRUE
## 2 2
## HAVING.TROUBLE.CONCENTRATING HAVING.TROUBLE.SLEEPING
## 2 2
## LESS.INTEREST.IN.ACTIVITIES NOT.ABLE.TO.REMEMBER
## 2 2
## NOT.THINKING.ABOUT.TRAUMA PHYSICAL.REACTIONS
## 1 1
## RELIVING.THE.TRAUMA JUMPY.STARTLED.OVER.ALERT
## 1 1
## UPSETTING.REMINDERS
## 1
##
## ----
##
## Unidimensional Method: Louvain
## Unidimensional: No
##
## ----
##
## TEFI: -7.476
Because of its relatively low number of patients and low overall symptom severity, the network for PC1 is sparse.
We can statistically assess the differences between the two networks using the Network Comparison Test. For details on how the Network Comparison Test works, please see Borkulo et al., Psychol Methods, 2022. The functions in PRONA serve as a wrapper for the functions in the NetworkComparisonTest package.
The run_NCT function takes 9 parameters:
df1: First dataframe of symptom severity/frequency data
df2: Second dataframe of symptom severity/frequency data
normal: Boolean. Whether to consider all variables normally distrubted. If false, conducts a non-paranormal transformation (Default: FALSE)
it: Number of bootstrapping iterations to run (Default: 2500)
p.adjust.methods: Character. Can be one of “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, or “none”. To control (or not) for testing of multiple edges. Defaults to “none”.
test.edges: Boolean. Whether to test differences in individual edge weights. (Default: TRUE)
test.centrality: Boolean. Whether to test differences in centrality measures (Default: TRUE)
centrality: Vector of which centrality measures to test (Default: c(‘closeness’,‘betweenness’,‘strength’,‘expectedInfluence’))
gamma: The EBIC tuning parameter to use. Must be 0, 0.25, or 0.5. (Default: 0)
# Replace file path with wherever you want to save the results of NCT
nct.file <- "../../PRONA_additional_files/test_scripts/output/nct.RDS"
if (!file.exists(nct.file)){
# since the construct_ggm function for comm1_ega used a gamma = 0, we will do the same here
nct.results <- run_NCT(comm1_df, comm2_df, gamma = 0)
saveRDS(nct.results, nct.file)}
nct.results <- readRDS(nct.file)We can plot the results of the NCT comparing global network strength and overall network structure.
We can also directly get the p-values from comparing each edge weight and centrality measure
## Var1 Var2 p-value
## 16 AVOID.REMINDERS.OF.THE.TRAUMA BAD.DREAMS.ABOUT.THE.TRAUMA 1.000
## 31 AVOID.REMINDERS.OF.THE.TRAUMA DISTANT.OR.CUT.OFF.FROM.PEOPLE 1.000
## 32 BAD.DREAMS.ABOUT.THE.TRAUMA DISTANT.OR.CUT.OFF.FROM.PEOPLE 1.000
## 46 AVOID.REMINDERS.OF.THE.TRAUMA FEELING.EMOTIONALLY.NUMB 1.000
## 47 BAD.DREAMS.ABOUT.THE.TRAUMA FEELING.EMOTIONALLY.NUMB 1.000
## 48 DISTANT.OR.CUT.OFF.FROM.PEOPLE FEELING.EMOTIONALLY.NUMB 0.922
## 61 AVOID.REMINDERS.OF.THE.TRAUMA FEELING.IRRITABLE 1.000
## 62 BAD.DREAMS.ABOUT.THE.TRAUMA FEELING.IRRITABLE 1.000
## 63 DISTANT.OR.CUT.OFF.FROM.PEOPLE FEELING.IRRITABLE 0.594
## 64 FEELING.EMOTIONALLY.NUMB FEELING.IRRITABLE 0.325
## 76 AVOID.REMINDERS.OF.THE.TRAUMA FEELING.PLANS.WONT.COME.TRUE 1.000
## 77 BAD.DREAMS.ABOUT.THE.TRAUMA FEELING.PLANS.WONT.COME.TRUE 1.000
## 78 DISTANT.OR.CUT.OFF.FROM.PEOPLE FEELING.PLANS.WONT.COME.TRUE 0.508
## 79 FEELING.EMOTIONALLY.NUMB FEELING.PLANS.WONT.COME.TRUE 0.506
## 80 FEELING.IRRITABLE FEELING.PLANS.WONT.COME.TRUE 0.441
## 91 AVOID.REMINDERS.OF.THE.TRAUMA HAVING.TROUBLE.CONCENTRATING 0.242
## 92 BAD.DREAMS.ABOUT.THE.TRAUMA HAVING.TROUBLE.CONCENTRATING 1.000
## 93 DISTANT.OR.CUT.OFF.FROM.PEOPLE HAVING.TROUBLE.CONCENTRATING 0.192
## 94 FEELING.EMOTIONALLY.NUMB HAVING.TROUBLE.CONCENTRATING 0.481
## 95 FEELING.IRRITABLE HAVING.TROUBLE.CONCENTRATING 0.103
## 96 FEELING.PLANS.WONT.COME.TRUE HAVING.TROUBLE.CONCENTRATING 0.598
## 106 AVOID.REMINDERS.OF.THE.TRAUMA HAVING.TROUBLE.SLEEPING 1.000
## 107 BAD.DREAMS.ABOUT.THE.TRAUMA HAVING.TROUBLE.SLEEPING 1.000
## 108 DISTANT.OR.CUT.OFF.FROM.PEOPLE HAVING.TROUBLE.SLEEPING 1.000
## 109 FEELING.EMOTIONALLY.NUMB HAVING.TROUBLE.SLEEPING 0.804
## 110 FEELING.IRRITABLE HAVING.TROUBLE.SLEEPING 1.000
## 111 FEELING.PLANS.WONT.COME.TRUE HAVING.TROUBLE.SLEEPING 0.714
## 112 HAVING.TROUBLE.CONCENTRATING HAVING.TROUBLE.SLEEPING 0.464
## 121 AVOID.REMINDERS.OF.THE.TRAUMA LESS.INTEREST.IN.ACTIVITIES 1.000
## 122 BAD.DREAMS.ABOUT.THE.TRAUMA LESS.INTEREST.IN.ACTIVITIES 1.000
## 123 DISTANT.OR.CUT.OFF.FROM.PEOPLE LESS.INTEREST.IN.ACTIVITIES 0.007
## 124 FEELING.EMOTIONALLY.NUMB LESS.INTEREST.IN.ACTIVITIES 0.793
## 125 FEELING.IRRITABLE LESS.INTEREST.IN.ACTIVITIES 0.692
## 126 FEELING.PLANS.WONT.COME.TRUE LESS.INTEREST.IN.ACTIVITIES 0.088
## 127 HAVING.TROUBLE.CONCENTRATING LESS.INTEREST.IN.ACTIVITIES 0.832
## 128 HAVING.TROUBLE.SLEEPING LESS.INTEREST.IN.ACTIVITIES 1.000
## 136 AVOID.REMINDERS.OF.THE.TRAUMA NOT.ABLE.TO.REMEMBER 1.000
## 137 BAD.DREAMS.ABOUT.THE.TRAUMA NOT.ABLE.TO.REMEMBER 1.000
## 138 DISTANT.OR.CUT.OFF.FROM.PEOPLE NOT.ABLE.TO.REMEMBER 0.352
## 139 FEELING.EMOTIONALLY.NUMB NOT.ABLE.TO.REMEMBER 1.000
## 140 FEELING.IRRITABLE NOT.ABLE.TO.REMEMBER 1.000
## 141 FEELING.PLANS.WONT.COME.TRUE NOT.ABLE.TO.REMEMBER 1.000
## 142 HAVING.TROUBLE.CONCENTRATING NOT.ABLE.TO.REMEMBER 1.000
## 143 HAVING.TROUBLE.SLEEPING NOT.ABLE.TO.REMEMBER 1.000
## 144 LESS.INTEREST.IN.ACTIVITIES NOT.ABLE.TO.REMEMBER 1.000
## 151 AVOID.REMINDERS.OF.THE.TRAUMA NOT.THINKING.ABOUT.TRAUMA 0.525
## 152 BAD.DREAMS.ABOUT.THE.TRAUMA NOT.THINKING.ABOUT.TRAUMA 1.000
## 153 DISTANT.OR.CUT.OFF.FROM.PEOPLE NOT.THINKING.ABOUT.TRAUMA 0.839
## 154 FEELING.EMOTIONALLY.NUMB NOT.THINKING.ABOUT.TRAUMA 1.000
## 155 FEELING.IRRITABLE NOT.THINKING.ABOUT.TRAUMA 1.000
## 156 FEELING.PLANS.WONT.COME.TRUE NOT.THINKING.ABOUT.TRAUMA 1.000
## 157 HAVING.TROUBLE.CONCENTRATING NOT.THINKING.ABOUT.TRAUMA 0.573
## 158 HAVING.TROUBLE.SLEEPING NOT.THINKING.ABOUT.TRAUMA 1.000
## 159 LESS.INTEREST.IN.ACTIVITIES NOT.THINKING.ABOUT.TRAUMA 1.000
## 160 NOT.ABLE.TO.REMEMBER NOT.THINKING.ABOUT.TRAUMA 0.515
## 166 AVOID.REMINDERS.OF.THE.TRAUMA PHYSICAL.REACTIONS 0.296
## 167 BAD.DREAMS.ABOUT.THE.TRAUMA PHYSICAL.REACTIONS 0.610
## 168 DISTANT.OR.CUT.OFF.FROM.PEOPLE PHYSICAL.REACTIONS 1.000
## 169 FEELING.EMOTIONALLY.NUMB PHYSICAL.REACTIONS 0.805
## 170 FEELING.IRRITABLE PHYSICAL.REACTIONS 0.824
## 171 FEELING.PLANS.WONT.COME.TRUE PHYSICAL.REACTIONS 1.000
## 172 HAVING.TROUBLE.CONCENTRATING PHYSICAL.REACTIONS 0.609
## 173 HAVING.TROUBLE.SLEEPING PHYSICAL.REACTIONS 1.000
## 174 LESS.INTEREST.IN.ACTIVITIES PHYSICAL.REACTIONS 1.000
## 175 NOT.ABLE.TO.REMEMBER PHYSICAL.REACTIONS 1.000
## 176 NOT.THINKING.ABOUT.TRAUMA PHYSICAL.REACTIONS 0.689
## 181 AVOID.REMINDERS.OF.THE.TRAUMA RELIVING.THE.TRAUMA 1.000
## 182 BAD.DREAMS.ABOUT.THE.TRAUMA RELIVING.THE.TRAUMA 1.000
## 183 DISTANT.OR.CUT.OFF.FROM.PEOPLE RELIVING.THE.TRAUMA 1.000
## 184 FEELING.EMOTIONALLY.NUMB RELIVING.THE.TRAUMA 1.000
## 185 FEELING.IRRITABLE RELIVING.THE.TRAUMA 1.000
## 186 FEELING.PLANS.WONT.COME.TRUE RELIVING.THE.TRAUMA 1.000
## 187 HAVING.TROUBLE.CONCENTRATING RELIVING.THE.TRAUMA 1.000
## 188 HAVING.TROUBLE.SLEEPING RELIVING.THE.TRAUMA 1.000
## 189 LESS.INTEREST.IN.ACTIVITIES RELIVING.THE.TRAUMA 1.000
## 190 NOT.ABLE.TO.REMEMBER RELIVING.THE.TRAUMA 0.952
## 191 NOT.THINKING.ABOUT.TRAUMA RELIVING.THE.TRAUMA 0.897
## 192 PHYSICAL.REACTIONS RELIVING.THE.TRAUMA 0.290
## 196 AVOID.REMINDERS.OF.THE.TRAUMA JUMPY.STARTLED.OVER.ALERT 0.532
## 197 BAD.DREAMS.ABOUT.THE.TRAUMA JUMPY.STARTLED.OVER.ALERT 1.000
## 198 DISTANT.OR.CUT.OFF.FROM.PEOPLE JUMPY.STARTLED.OVER.ALERT 0.724
## 199 FEELING.EMOTIONALLY.NUMB JUMPY.STARTLED.OVER.ALERT 1.000
## 200 FEELING.IRRITABLE JUMPY.STARTLED.OVER.ALERT 1.000
## 201 FEELING.PLANS.WONT.COME.TRUE JUMPY.STARTLED.OVER.ALERT 1.000
## 202 HAVING.TROUBLE.CONCENTRATING JUMPY.STARTLED.OVER.ALERT 0.690
## 203 HAVING.TROUBLE.SLEEPING JUMPY.STARTLED.OVER.ALERT 1.000
## 204 LESS.INTEREST.IN.ACTIVITIES JUMPY.STARTLED.OVER.ALERT 1.000
## 205 NOT.ABLE.TO.REMEMBER JUMPY.STARTLED.OVER.ALERT 1.000
## 206 NOT.THINKING.ABOUT.TRAUMA JUMPY.STARTLED.OVER.ALERT 0.319
## 207 PHYSICAL.REACTIONS JUMPY.STARTLED.OVER.ALERT 0.480
## 208 RELIVING.THE.TRAUMA JUMPY.STARTLED.OVER.ALERT 1.000
## 211 AVOID.REMINDERS.OF.THE.TRAUMA UPSETTING.REMINDERS 1.000
## 212 BAD.DREAMS.ABOUT.THE.TRAUMA UPSETTING.REMINDERS 0.118
## 213 DISTANT.OR.CUT.OFF.FROM.PEOPLE UPSETTING.REMINDERS 1.000
## 214 FEELING.EMOTIONALLY.NUMB UPSETTING.REMINDERS 0.432
## 215 FEELING.IRRITABLE UPSETTING.REMINDERS 0.759
## 216 FEELING.PLANS.WONT.COME.TRUE UPSETTING.REMINDERS 1.000
## 217 HAVING.TROUBLE.CONCENTRATING UPSETTING.REMINDERS 0.487
## 218 HAVING.TROUBLE.SLEEPING UPSETTING.REMINDERS 0.376
## 219 LESS.INTEREST.IN.ACTIVITIES UPSETTING.REMINDERS 1.000
## 220 NOT.ABLE.TO.REMEMBER UPSETTING.REMINDERS 0.704
## 221 NOT.THINKING.ABOUT.TRAUMA UPSETTING.REMINDERS 0.045
## 222 PHYSICAL.REACTIONS UPSETTING.REMINDERS 0.255
## 223 RELIVING.THE.TRAUMA UPSETTING.REMINDERS 0.015
## 224 JUMPY.STARTLED.OVER.ALERT UPSETTING.REMINDERS 1.000
## closeness betweenness strength expectedInfluence
## AVOID.REMINDERS.OF.THE.TRAUMA NA 1.0000 0.1024 0.0964
## BAD.DREAMS.ABOUT.THE.TRAUMA NA 1.0000 0.2308 0.1804
## DISTANT.OR.CUT.OFF.FROM.PEOPLE 0 0.0452 0.0024 0.0016
## FEELING.EMOTIONALLY.NUMB 0 0.0280 0.0304 0.0276
## FEELING.IRRITABLE 0 1.0000 0.0272 0.0256
## FEELING.PLANS.WONT.COME.TRUE NA 1.0000 0.0116 0.0032
## HAVING.TROUBLE.CONCENTRATING NA 0.1076 0.0004 0.0004
## HAVING.TROUBLE.SLEEPING NA 1.0000 0.3428 0.2264
## LESS.INTEREST.IN.ACTIVITIES 0 0.6756 0.0024 0.0024
## NOT.ABLE.TO.REMEMBER NA 1.0000 0.5332 0.4704
## NOT.THINKING.ABOUT.TRAUMA NA 0.1420 0.0008 0.0008
## PHYSICAL.REACTIONS NA 0.1376 0.0000 0.0000
## RELIVING.THE.TRAUMA NA 1.0000 0.0464 0.0272
## JUMPY.STARTLED.OVER.ALERT NA 1.0000 0.0752 0.0756
## UPSETTING.REMINDERS NA 0.0000 0.0000 0.0000
To assess the stability of patient communities (aka how reliable they are), we can use sub-sampling analysis. This technique takes random subsamples of patients (default is subsamples of 100%, 99%, 95%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10% of the original dataset) and reruns concordance network-based patient clustering on each of the subsamples.
To perform this, we can use the cluster_subsamples function, which takes 5 parameters:
data: Original dataframe of symptom severities (formatted as required by PRONA)
detectAlgo: The community detection algorithm to use. It should be the same as what you used when you ran get_communities. (Default: WT)
simil_measure: The similarity measure to use. It should be the same as what you used when you ran get_communities. (Default: ARI)
simplify_graphs: Boolean that indicates whether to simplify the graph by removing multi-edges and loops. It should be the same as what you used when you ran get_communities. (Default: TRUE)
sampling_rates: A vector containing the percentages to subset the data by. Default is c(1, 0.99, 0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1). If you change this, the vector must start with 1 (meaning the first subsample is 100% of the dataset) for downstream analysis to work.
The output is a dataframe that contains the community designations for each patient in each of the subsampling runs:
## ID community_1 community_0.99 community_0.95 community_0.9
## 1 15_000651 2 2 1 1
## 2 15_000786 1 1 2 2
## 3 15_000969 2 2 3 3
## 4 15_000989 2 2 3 3
## 5 15_001332 2 3 1 1
## community_0.8 community_0.7 community_0.6 community_0.5 community_0.4
## 1 4 2 1 1 3
## 2 4 2 3 3 NA
## 3 4 2 1 1 4
## 4 1 3 4 4 4
## 5 3 2 1 1 3
## community_0.3 community_0.2 community_0.1
## 1 3 NA NA
## 2 NA NA NA
## 3 2 1 NA
## 4 2 1 NA
## 5 3 3 1
To see the distribution of patients across different communities in each subsampling run, we can use the following code. It is important to note that cluster labels do not necessarily overlap across different runs (for example, community #2 is “subsample_communities_df$community_1” does not necessarily correspond to community #2 in “subsample_communities_df$community_0.99”). Thus, you should compare the overall distribution of patients across different communities in different runs, but not the number of patients under each specific label:
##
## 1 2 3
## 136 218 5
##
## 1 2 3 4
## 99 124 128 4
##
## 1 2 3 4 5
## 159 79 70 30 3
##
## 1 2 3 4 5
## 155 80 62 24 2
##
## 1 2 3 4 5
## 78 33 125 49 2
##
## 1 2 3 4 5
## 49 71 56 73 2
##
## 1 2 3 4 5
## 110 32 36 35 2
##
## 1 2 3 4 5
## 92 29 26 30 2
##
## 1 2 3 4 5
## 13 33 70 25 2
##
## 1 2 3 4
## 9 36 60 2
##
## 1 2 3 4 5 6
## 17 11 33 7 2 1
##
## 1 2 3 4 5
## 12 4 2 15 2
Next, to compute the stability of each of the original clusters across all the subsampling runs, we can use the compute_cluster_stabilities and plot_cluster_stabilities functions. These functions define “stability” as how often data points that belong to the same cluster in the original data are still clustered together in the subsamples (this is also known as Prediction Strength). It is calculated by taking all the patients that belong to one of the individual clusters, finding which communities those patients have been grouped into in the new clustering iteration, and dividing the size of the largest community in the new iteration by the size of the original community.
For example, say clustering on a dataset of 1000 patients yields three communities, one of 500 patients (community 1), one of 400 (community 2), and one of 100 (community 3). We rerun clustering on a subsample of the data, and out of the 500 patients that originally belonged to community 1, 400 of them still belong to the same community, 50 belong to a different community, and 50 were randomly removed during subsampling. The “stability” of the original community 1 for this subsampling iteration is 400/500 = 0.8 (aka 80%).
To perform this analysis on each of the original clusters for each of the subsampling iterations, use the following code:
## Warning in max(freq_clusters[names(freq_clusters) != ""]): no non-missing
## arguments to max; returning -Inf
## SamplingRate Cluster 1 Stability Cluster 2 Stability Cluster 3 Stability
## 1 1 1.00000000 1.00000000 1.0
## 2 0.99 0.58715596 0.72794118 0.8
## 3 0.95 0.72935780 0.52941176 0.6
## 4 0.9 0.68348624 0.58823529 0.4
## 5 0.8 0.57339450 0.30147059 0.4
## 6 0.7 0.33486239 0.36029412 0.4
## 7 0.6 0.49082569 0.25735294 0.4
## 8 0.5 0.39908257 0.21323529 0.4
## 9 0.4 0.32110092 0.23529412 0.4
## 10 0.3 0.27522936 0.20588235 0.4
## 11 0.2 0.15137615 0.07352941 0.2
## 12 0.1 0.06880734 0.02941176 -Inf
To plot these results, use the plot_cluster_stabilities function:
## `geom_smooth()` using formula = 'y ~ x'
Finally, we can use the Adjusted Rand Index (ARI) to assess the global stability of the community groupings across the different subsampling iterations. This helps us understand if the overall structure of clusters is consistent across different subsamples. We can calculate the ARI between the original clusters (represented by subsample_communities_df$community_1) and each of the subsamples using the following code:
## Warning: package 'mclust' was built under R version 4.2.3
## Package 'mclust' version 6.1.1
## Type 'citation("mclust")' for citing this R package in publications.
## [1] 0.4325011
## [1] 0.4971679
## [1] 0.5135012
## [1] 0.3745413
## [1] 0.2428197
## [1] 0.5322329
## [1] 0.4751524
## [1] 0.5615193
## [1] 0.664666
## [1] 0.4221146
## [1] 0.3372875
Although ARI is a helpful metric, it can be disproportionately skewed if there are specific clusters that are particularly unstable, even if the other clusters are generally stable. For this reason, we generally prefer the above analysis of individual cluster stability over ARI calculation because it gives you more specific information on which clusters are reliable and which are not.