Perform concordance network-based clustering using PRONA

This vignette gives an overview of how to perform second-order, concordance network-based unsupervised clustering to identify communities of patients with unique symptom network patterns. For more information on this clustering method and examples of its applications, please see Henry et al., Plos One, 2018, Bergsneider et al., Neuro-Oncology Advances, 2022, and Bergsneider et al., Cancer Medicine, 2024.

1. Load PRONA, download data, and construct a GGM

For this example, we will use the data from the introduction vignette to construct a GGM. Please refer to the introduction vignette for more details on where this data is from and how the network is constructed.

library(PRONA)
reduced_df <- read.csv('../../PRONA_additional_files/test_scripts/output/reduced_data.csv', stringsAsFactors = FALSE)
ptsd_network <- construct_ggm(reduced_df, normal = FALSE)
## Loading required namespace: huge
## Conducting nonparanormal (npn) transformation via skeptic....done.
plot_ggm(ptsd_network, label.size = 2.5)

2. Get communities of patients with unique symptom severity/frequency patterns

To run second-order unsupervised clustering, we use the get_communities function, which has 6 parameters:

It returns a new dataframe that is a copy of the original symptom severities dataframe, but with a new column representing which community each patient belongs to.

community_df <- get_communities(reduced_df)

Lets see how many patients are in each community

table(community_df$community)
## 
##   1   2   3 
## 136 218   5

Next, lets plot a heatmap of symptom severities for each community using the plot_community_heatmap function, which has four parameters:

Here, “PC” stands for “Patient Community”:

plot_community_heatmap(community_df, row_label_size = 9)

Here’s how to order rows based on the symptom clusters identified during network construction:

plot_community_heatmap(community_df, network = ptsd_network, row_label_size = 9)

We can also visualize the symptom severities of each community using line plots. The community_line_plot function has two parameters:

community_line_plot(community_df, c(0,1,2))

We can also get a summary of each symptom in each community using the get_community_summary function, which returns a dataframe with the mean, median, upper bound, and lower bound of each symptom in each community.

get_community_summary(community_df)
## # A tibble: 45 × 6
## # Groups:   community [3]
##    community variable                        mean median upper lower
##        <dbl> <fct>                          <dbl>  <dbl> <dbl> <dbl>
##  1         1 AVOID.REMINDERS.OF.THE.TRAUMA  0.551      0 0.716 0.387
##  2         1 BAD.DREAMS.ABOUT.THE.TRAUMA    0.309      0 0.431 0.187
##  3         1 DISTANT.OR.CUT.OFF.FROM.PEOPLE 0.838      0 1.02  0.657
##  4         1 FEELING.EMOTIONALLY.NUMB       0.640      0 0.802 0.478
##  5         1 FEELING.IRRITABLE              0.853      1 1.01  0.699
##  6         1 FEELING.PLANS.WONT.COME.TRUE   0.625      0 0.789 0.461
##  7         1 HAVING.TROUBLE.CONCENTRATING   1.37       1 1.55  1.18 
##  8         1 HAVING.TROUBLE.SLEEPING        1.60       2 1.80  1.40 
##  9         1 LESS.INTEREST.IN.ACTIVITIES    0.309      0 0.445 0.172
## 10         1 NOT.ABLE.TO.REMEMBER           0.397      0 0.535 0.259
## # ℹ 35 more rows

3. Construct networks for individual communities

We can construct GGM networks and conduct statistical assessment for individual communities using the same pipeline as described in the introduction and statistical assessment vignettes.

# Get symptom data for individual communities
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
comm1_df <- community_df[community_df$community==1,] %>% select(-community)
comm2_df <- community_df[community_df$community==2,] %>% select(-community)
comm1_ega <- construct_ggm(comm1_df, normal = FALSE)
## Conducting nonparanormal (npn) transformation via skeptic....done.
## Warning: The network input contains unconnected nodes:
## AVOID.REMINDERS.OF.THE.TRAUMA, BAD.DREAMS.ABOUT.THE.TRAUMA, FEELING.PLANS.WONT.COME.TRUE, HAVING.TROUBLE.CONCENTRATING, HAVING.TROUBLE.SLEEPING, NOT.THINKING.ABOUT.TRAUMA, PHYSICAL.REACTIONS, RELIVING.THE.TRAUMA, JUMPY.STARTLED.OVER.ALERT
## Warning: Some variables did not belong to a dimension: AVOID.REMINDERS.OF.THE.TRAUMA, BAD.DREAMS.ABOUT.THE.TRAUMA, FEELING.PLANS.WONT.COME.TRUE, HAVING.TROUBLE.CONCENTRATING, HAVING.TROUBLE.SLEEPING, NOT.THINKING.ABOUT.TRAUMA, PHYSICAL.REACTIONS, RELIVING.THE.TRAUMA, JUMPY.STARTLED.OVER.ALERT 
## 
## Use caution: These variables have been removed from the TEFI calculation
summary(comm1_ega)
## Model: GLASSO (EBIC with gamma = 0)
## Correlations: auto
## Lambda: 0.301079593387959 (n = 100, ratio = 0.1)
## 
## Number of nodes: 15
## Number of edges: 4
## Edge density: 0.038
## 
## Non-zero edge weights: 
##      M    SD   Min   Max
##  0.087 0.063 0.038 0.178
## 
## ----
## 
## Algorithm:  Walktrap
## 
## Number of communities:  2
## 
##  AVOID.REMINDERS.OF.THE.TRAUMA    BAD.DREAMS.ABOUT.THE.TRAUMA 
##                             NA                             NA 
## DISTANT.OR.CUT.OFF.FROM.PEOPLE       FEELING.EMOTIONALLY.NUMB 
##                              1                              1 
##              FEELING.IRRITABLE   FEELING.PLANS.WONT.COME.TRUE 
##                              1                             NA 
##   HAVING.TROUBLE.CONCENTRATING        HAVING.TROUBLE.SLEEPING 
##                             NA                             NA 
##    LESS.INTEREST.IN.ACTIVITIES           NOT.ABLE.TO.REMEMBER 
##                              1                              2 
##      NOT.THINKING.ABOUT.TRAUMA             PHYSICAL.REACTIONS 
##                             NA                             NA 
##            RELIVING.THE.TRAUMA      JUMPY.STARTLED.OVER.ALERT 
##                             NA                             NA 
##            UPSETTING.REMINDERS 
##                              2 
## 
## ----
## 
## Unidimensional Method: Louvain
## Unidimensional: No
## 
## ----
## 
## TEFI: -2.513
plot_ggm(comm1_ega, label.size = 2.5)

comm2_ega <- construct_ggm(comm2_df, normal = FALSE)
## Conducting nonparanormal (npn) transformation via skeptic....done.
summary(comm2_ega)
## Model: GLASSO (EBIC with gamma = 0.25)
## Correlations: auto
## Lambda: 0.129502975706062 (n = 100, ratio = 0.1)
## 
## Number of nodes: 15
## Number of edges: 47
## Edge density: 0.448
## 
## Non-zero edge weights: 
##      M    SD   Min   Max
##  0.081 0.068 0.001 0.343
## 
## ----
## 
## Algorithm:  Walktrap
## 
## Number of communities:  2
## 
##  AVOID.REMINDERS.OF.THE.TRAUMA    BAD.DREAMS.ABOUT.THE.TRAUMA 
##                              1                              1 
## DISTANT.OR.CUT.OFF.FROM.PEOPLE       FEELING.EMOTIONALLY.NUMB 
##                              2                              2 
##              FEELING.IRRITABLE   FEELING.PLANS.WONT.COME.TRUE 
##                              2                              2 
##   HAVING.TROUBLE.CONCENTRATING        HAVING.TROUBLE.SLEEPING 
##                              2                              2 
##    LESS.INTEREST.IN.ACTIVITIES           NOT.ABLE.TO.REMEMBER 
##                              2                              2 
##      NOT.THINKING.ABOUT.TRAUMA             PHYSICAL.REACTIONS 
##                              1                              1 
##            RELIVING.THE.TRAUMA      JUMPY.STARTLED.OVER.ALERT 
##                              1                              1 
##            UPSETTING.REMINDERS 
##                              1 
## 
## ----
## 
## Unidimensional Method: Louvain
## Unidimensional: No
## 
## ----
## 
## TEFI: -7.476
plot_ggm(comm2_ega, label.size = 2.5)

Because of its relatively low number of patients and low overall symptom severity, the network for PC1 is sparse.

4. Perform Network Comparison Test (NCT)

We can statistically assess the differences between the two networks using the Network Comparison Test. For details on how the Network Comparison Test works, please see Borkulo et al., Psychol Methods, 2022. The functions in PRONA serve as a wrapper for the functions in the NetworkComparisonTest package.

The run_NCT function takes 9 parameters:

# Replace file path with wherever you want to save the results of NCT
nct.file <- "../../PRONA_additional_files/test_scripts/output/nct.RDS"
if (!file.exists(nct.file)){
  # since the construct_ggm function for comm1_ega used a gamma = 0, we will do the same here
  nct.results <- run_NCT(comm1_df, comm2_df, gamma = 0)
  saveRDS(nct.results, nct.file)}
nct.results <- readRDS(nct.file)

We can plot the results of the NCT comparing global network strength and overall network structure.

library(NetworkComparisonTest)
plot_NCT(nct.results, what = "strength")

plot_NCT(nct.results, what = "network")

We can also directly get the p-values from comparing each edge weight and centrality measure

get_NCT_pvalues(nct.results, what = "edges")
##                               Var1                           Var2 p-value
## 16   AVOID.REMINDERS.OF.THE.TRAUMA    BAD.DREAMS.ABOUT.THE.TRAUMA   1.000
## 31   AVOID.REMINDERS.OF.THE.TRAUMA DISTANT.OR.CUT.OFF.FROM.PEOPLE   1.000
## 32     BAD.DREAMS.ABOUT.THE.TRAUMA DISTANT.OR.CUT.OFF.FROM.PEOPLE   1.000
## 46   AVOID.REMINDERS.OF.THE.TRAUMA       FEELING.EMOTIONALLY.NUMB   1.000
## 47     BAD.DREAMS.ABOUT.THE.TRAUMA       FEELING.EMOTIONALLY.NUMB   1.000
## 48  DISTANT.OR.CUT.OFF.FROM.PEOPLE       FEELING.EMOTIONALLY.NUMB   0.922
## 61   AVOID.REMINDERS.OF.THE.TRAUMA              FEELING.IRRITABLE   1.000
## 62     BAD.DREAMS.ABOUT.THE.TRAUMA              FEELING.IRRITABLE   1.000
## 63  DISTANT.OR.CUT.OFF.FROM.PEOPLE              FEELING.IRRITABLE   0.594
## 64        FEELING.EMOTIONALLY.NUMB              FEELING.IRRITABLE   0.325
## 76   AVOID.REMINDERS.OF.THE.TRAUMA   FEELING.PLANS.WONT.COME.TRUE   1.000
## 77     BAD.DREAMS.ABOUT.THE.TRAUMA   FEELING.PLANS.WONT.COME.TRUE   1.000
## 78  DISTANT.OR.CUT.OFF.FROM.PEOPLE   FEELING.PLANS.WONT.COME.TRUE   0.508
## 79        FEELING.EMOTIONALLY.NUMB   FEELING.PLANS.WONT.COME.TRUE   0.506
## 80               FEELING.IRRITABLE   FEELING.PLANS.WONT.COME.TRUE   0.441
## 91   AVOID.REMINDERS.OF.THE.TRAUMA   HAVING.TROUBLE.CONCENTRATING   0.242
## 92     BAD.DREAMS.ABOUT.THE.TRAUMA   HAVING.TROUBLE.CONCENTRATING   1.000
## 93  DISTANT.OR.CUT.OFF.FROM.PEOPLE   HAVING.TROUBLE.CONCENTRATING   0.192
## 94        FEELING.EMOTIONALLY.NUMB   HAVING.TROUBLE.CONCENTRATING   0.481
## 95               FEELING.IRRITABLE   HAVING.TROUBLE.CONCENTRATING   0.103
## 96    FEELING.PLANS.WONT.COME.TRUE   HAVING.TROUBLE.CONCENTRATING   0.598
## 106  AVOID.REMINDERS.OF.THE.TRAUMA        HAVING.TROUBLE.SLEEPING   1.000
## 107    BAD.DREAMS.ABOUT.THE.TRAUMA        HAVING.TROUBLE.SLEEPING   1.000
## 108 DISTANT.OR.CUT.OFF.FROM.PEOPLE        HAVING.TROUBLE.SLEEPING   1.000
## 109       FEELING.EMOTIONALLY.NUMB        HAVING.TROUBLE.SLEEPING   0.804
## 110              FEELING.IRRITABLE        HAVING.TROUBLE.SLEEPING   1.000
## 111   FEELING.PLANS.WONT.COME.TRUE        HAVING.TROUBLE.SLEEPING   0.714
## 112   HAVING.TROUBLE.CONCENTRATING        HAVING.TROUBLE.SLEEPING   0.464
## 121  AVOID.REMINDERS.OF.THE.TRAUMA    LESS.INTEREST.IN.ACTIVITIES   1.000
## 122    BAD.DREAMS.ABOUT.THE.TRAUMA    LESS.INTEREST.IN.ACTIVITIES   1.000
## 123 DISTANT.OR.CUT.OFF.FROM.PEOPLE    LESS.INTEREST.IN.ACTIVITIES   0.007
## 124       FEELING.EMOTIONALLY.NUMB    LESS.INTEREST.IN.ACTIVITIES   0.793
## 125              FEELING.IRRITABLE    LESS.INTEREST.IN.ACTIVITIES   0.692
## 126   FEELING.PLANS.WONT.COME.TRUE    LESS.INTEREST.IN.ACTIVITIES   0.088
## 127   HAVING.TROUBLE.CONCENTRATING    LESS.INTEREST.IN.ACTIVITIES   0.832
## 128        HAVING.TROUBLE.SLEEPING    LESS.INTEREST.IN.ACTIVITIES   1.000
## 136  AVOID.REMINDERS.OF.THE.TRAUMA           NOT.ABLE.TO.REMEMBER   1.000
## 137    BAD.DREAMS.ABOUT.THE.TRAUMA           NOT.ABLE.TO.REMEMBER   1.000
## 138 DISTANT.OR.CUT.OFF.FROM.PEOPLE           NOT.ABLE.TO.REMEMBER   0.352
## 139       FEELING.EMOTIONALLY.NUMB           NOT.ABLE.TO.REMEMBER   1.000
## 140              FEELING.IRRITABLE           NOT.ABLE.TO.REMEMBER   1.000
## 141   FEELING.PLANS.WONT.COME.TRUE           NOT.ABLE.TO.REMEMBER   1.000
## 142   HAVING.TROUBLE.CONCENTRATING           NOT.ABLE.TO.REMEMBER   1.000
## 143        HAVING.TROUBLE.SLEEPING           NOT.ABLE.TO.REMEMBER   1.000
## 144    LESS.INTEREST.IN.ACTIVITIES           NOT.ABLE.TO.REMEMBER   1.000
## 151  AVOID.REMINDERS.OF.THE.TRAUMA      NOT.THINKING.ABOUT.TRAUMA   0.525
## 152    BAD.DREAMS.ABOUT.THE.TRAUMA      NOT.THINKING.ABOUT.TRAUMA   1.000
## 153 DISTANT.OR.CUT.OFF.FROM.PEOPLE      NOT.THINKING.ABOUT.TRAUMA   0.839
## 154       FEELING.EMOTIONALLY.NUMB      NOT.THINKING.ABOUT.TRAUMA   1.000
## 155              FEELING.IRRITABLE      NOT.THINKING.ABOUT.TRAUMA   1.000
## 156   FEELING.PLANS.WONT.COME.TRUE      NOT.THINKING.ABOUT.TRAUMA   1.000
## 157   HAVING.TROUBLE.CONCENTRATING      NOT.THINKING.ABOUT.TRAUMA   0.573
## 158        HAVING.TROUBLE.SLEEPING      NOT.THINKING.ABOUT.TRAUMA   1.000
## 159    LESS.INTEREST.IN.ACTIVITIES      NOT.THINKING.ABOUT.TRAUMA   1.000
## 160           NOT.ABLE.TO.REMEMBER      NOT.THINKING.ABOUT.TRAUMA   0.515
## 166  AVOID.REMINDERS.OF.THE.TRAUMA             PHYSICAL.REACTIONS   0.296
## 167    BAD.DREAMS.ABOUT.THE.TRAUMA             PHYSICAL.REACTIONS   0.610
## 168 DISTANT.OR.CUT.OFF.FROM.PEOPLE             PHYSICAL.REACTIONS   1.000
## 169       FEELING.EMOTIONALLY.NUMB             PHYSICAL.REACTIONS   0.805
## 170              FEELING.IRRITABLE             PHYSICAL.REACTIONS   0.824
## 171   FEELING.PLANS.WONT.COME.TRUE             PHYSICAL.REACTIONS   1.000
## 172   HAVING.TROUBLE.CONCENTRATING             PHYSICAL.REACTIONS   0.609
## 173        HAVING.TROUBLE.SLEEPING             PHYSICAL.REACTIONS   1.000
## 174    LESS.INTEREST.IN.ACTIVITIES             PHYSICAL.REACTIONS   1.000
## 175           NOT.ABLE.TO.REMEMBER             PHYSICAL.REACTIONS   1.000
## 176      NOT.THINKING.ABOUT.TRAUMA             PHYSICAL.REACTIONS   0.689
## 181  AVOID.REMINDERS.OF.THE.TRAUMA            RELIVING.THE.TRAUMA   1.000
## 182    BAD.DREAMS.ABOUT.THE.TRAUMA            RELIVING.THE.TRAUMA   1.000
## 183 DISTANT.OR.CUT.OFF.FROM.PEOPLE            RELIVING.THE.TRAUMA   1.000
## 184       FEELING.EMOTIONALLY.NUMB            RELIVING.THE.TRAUMA   1.000
## 185              FEELING.IRRITABLE            RELIVING.THE.TRAUMA   1.000
## 186   FEELING.PLANS.WONT.COME.TRUE            RELIVING.THE.TRAUMA   1.000
## 187   HAVING.TROUBLE.CONCENTRATING            RELIVING.THE.TRAUMA   1.000
## 188        HAVING.TROUBLE.SLEEPING            RELIVING.THE.TRAUMA   1.000
## 189    LESS.INTEREST.IN.ACTIVITIES            RELIVING.THE.TRAUMA   1.000
## 190           NOT.ABLE.TO.REMEMBER            RELIVING.THE.TRAUMA   0.952
## 191      NOT.THINKING.ABOUT.TRAUMA            RELIVING.THE.TRAUMA   0.897
## 192             PHYSICAL.REACTIONS            RELIVING.THE.TRAUMA   0.290
## 196  AVOID.REMINDERS.OF.THE.TRAUMA      JUMPY.STARTLED.OVER.ALERT   0.532
## 197    BAD.DREAMS.ABOUT.THE.TRAUMA      JUMPY.STARTLED.OVER.ALERT   1.000
## 198 DISTANT.OR.CUT.OFF.FROM.PEOPLE      JUMPY.STARTLED.OVER.ALERT   0.724
## 199       FEELING.EMOTIONALLY.NUMB      JUMPY.STARTLED.OVER.ALERT   1.000
## 200              FEELING.IRRITABLE      JUMPY.STARTLED.OVER.ALERT   1.000
## 201   FEELING.PLANS.WONT.COME.TRUE      JUMPY.STARTLED.OVER.ALERT   1.000
## 202   HAVING.TROUBLE.CONCENTRATING      JUMPY.STARTLED.OVER.ALERT   0.690
## 203        HAVING.TROUBLE.SLEEPING      JUMPY.STARTLED.OVER.ALERT   1.000
## 204    LESS.INTEREST.IN.ACTIVITIES      JUMPY.STARTLED.OVER.ALERT   1.000
## 205           NOT.ABLE.TO.REMEMBER      JUMPY.STARTLED.OVER.ALERT   1.000
## 206      NOT.THINKING.ABOUT.TRAUMA      JUMPY.STARTLED.OVER.ALERT   0.319
## 207             PHYSICAL.REACTIONS      JUMPY.STARTLED.OVER.ALERT   0.480
## 208            RELIVING.THE.TRAUMA      JUMPY.STARTLED.OVER.ALERT   1.000
## 211  AVOID.REMINDERS.OF.THE.TRAUMA            UPSETTING.REMINDERS   1.000
## 212    BAD.DREAMS.ABOUT.THE.TRAUMA            UPSETTING.REMINDERS   0.118
## 213 DISTANT.OR.CUT.OFF.FROM.PEOPLE            UPSETTING.REMINDERS   1.000
## 214       FEELING.EMOTIONALLY.NUMB            UPSETTING.REMINDERS   0.432
## 215              FEELING.IRRITABLE            UPSETTING.REMINDERS   0.759
## 216   FEELING.PLANS.WONT.COME.TRUE            UPSETTING.REMINDERS   1.000
## 217   HAVING.TROUBLE.CONCENTRATING            UPSETTING.REMINDERS   0.487
## 218        HAVING.TROUBLE.SLEEPING            UPSETTING.REMINDERS   0.376
## 219    LESS.INTEREST.IN.ACTIVITIES            UPSETTING.REMINDERS   1.000
## 220           NOT.ABLE.TO.REMEMBER            UPSETTING.REMINDERS   0.704
## 221      NOT.THINKING.ABOUT.TRAUMA            UPSETTING.REMINDERS   0.045
## 222             PHYSICAL.REACTIONS            UPSETTING.REMINDERS   0.255
## 223            RELIVING.THE.TRAUMA            UPSETTING.REMINDERS   0.015
## 224      JUMPY.STARTLED.OVER.ALERT            UPSETTING.REMINDERS   1.000
get_NCT_pvalues(nct.results, what = "centralities")
##                                closeness betweenness strength expectedInfluence
## AVOID.REMINDERS.OF.THE.TRAUMA         NA      1.0000   0.1024            0.0964
## BAD.DREAMS.ABOUT.THE.TRAUMA           NA      1.0000   0.2308            0.1804
## DISTANT.OR.CUT.OFF.FROM.PEOPLE         0      0.0452   0.0024            0.0016
## FEELING.EMOTIONALLY.NUMB               0      0.0280   0.0304            0.0276
## FEELING.IRRITABLE                      0      1.0000   0.0272            0.0256
## FEELING.PLANS.WONT.COME.TRUE          NA      1.0000   0.0116            0.0032
## HAVING.TROUBLE.CONCENTRATING          NA      0.1076   0.0004            0.0004
## HAVING.TROUBLE.SLEEPING               NA      1.0000   0.3428            0.2264
## LESS.INTEREST.IN.ACTIVITIES            0      0.6756   0.0024            0.0024
## NOT.ABLE.TO.REMEMBER                  NA      1.0000   0.5332            0.4704
## NOT.THINKING.ABOUT.TRAUMA             NA      0.1420   0.0008            0.0008
## PHYSICAL.REACTIONS                    NA      0.1376   0.0000            0.0000
## RELIVING.THE.TRAUMA                   NA      1.0000   0.0464            0.0272
## JUMPY.STARTLED.OVER.ALERT             NA      1.0000   0.0752            0.0756
## UPSETTING.REMINDERS                   NA      0.0000   0.0000            0.0000

5. Assess the stability of patient communities using sub-sampling analysis

To assess the stability of patient communities (aka how reliable they are), we can use sub-sampling analysis. This technique takes random subsamples of patients (default is subsamples of 100%, 99%, 95%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10% of the original dataset) and reruns concordance network-based patient clustering on each of the subsamples.

To perform this, we can use the cluster_subsamples function, which takes 5 parameters:

subsample_communities_df <- cluster_subsamples(reduced_df)

The output is a dataframe that contains the community designations for each patient in each of the subsampling runs:

head(subsample_communities_df, 5)
##          ID community_1 community_0.99 community_0.95 community_0.9
## 1 15_000651           2              2              1             1
## 2 15_000786           1              1              2             2
## 3 15_000969           2              2              3             3
## 4 15_000989           2              2              3             3
## 5 15_001332           2              3              1             1
##   community_0.8 community_0.7 community_0.6 community_0.5 community_0.4
## 1             4             2             1             1             3
## 2             4             2             3             3            NA
## 3             4             2             1             1             4
## 4             1             3             4             4             4
## 5             3             2             1             1             3
##   community_0.3 community_0.2 community_0.1
## 1             3            NA            NA
## 2            NA            NA            NA
## 3             2             1            NA
## 4             2             1            NA
## 5             3             3             1

To see the distribution of patients across different communities in each subsampling run, we can use the following code. It is important to note that cluster labels do not necessarily overlap across different runs (for example, community #2 is “subsample_communities_df$community_1” does not necessarily correspond to community #2 in “subsample_communities_df$community_0.99”). Thus, you should compare the overall distribution of patients across different communities in different runs, but not the number of patients under each specific label:

table(subsample_communities_df$community_1)
## 
##   1   2   3 
## 136 218   5
table(subsample_communities_df$community_0.99)
## 
##   1   2   3   4 
##  99 124 128   4
table(subsample_communities_df$community_0.95)
## 
##   1   2   3   4   5 
## 159  79  70  30   3
table(subsample_communities_df$community_0.9)
## 
##   1   2   3   4   5 
## 155  80  62  24   2
table(subsample_communities_df$community_0.8)
## 
##   1   2   3   4   5 
##  78  33 125  49   2
table(subsample_communities_df$community_0.7)
## 
##  1  2  3  4  5 
## 49 71 56 73  2
table(subsample_communities_df$community_0.6)
## 
##   1   2   3   4   5 
## 110  32  36  35   2
table(subsample_communities_df$community_0.5)
## 
##  1  2  3  4  5 
## 92 29 26 30  2
table(subsample_communities_df$community_0.4)
## 
##  1  2  3  4  5 
## 13 33 70 25  2
table(subsample_communities_df$community_0.3)
## 
##  1  2  3  4 
##  9 36 60  2
table(subsample_communities_df$community_0.2)
## 
##  1  2  3  4  5  6 
## 17 11 33  7  2  1
table(subsample_communities_df$community_0.1)
## 
##  1  2  3  4  5 
## 12  4  2 15  2

Next, to compute the stability of each of the original clusters across all the subsampling runs, we can use the compute_cluster_stabilities and plot_cluster_stabilities functions. These functions define “stability” as how often data points that belong to the same cluster in the original data are still clustered together in the subsamples (this is also known as Prediction Strength). It is calculated by taking all the patients that belong to one of the individual clusters, finding which communities those patients have been grouped into in the new clustering iteration, and dividing the size of the largest community in the new iteration by the size of the original community.

For example, say clustering on a dataset of 1000 patients yields three communities, one of 500 patients (community 1), one of 400 (community 2), and one of 100 (community 3). We rerun clustering on a subsample of the data, and out of the 500 patients that originally belonged to community 1, 400 of them still belong to the same community, 50 belong to a different community, and 50 were randomly removed during subsampling. The “stability” of the original community 1 for this subsampling iteration is 400/500 = 0.8 (aka 80%).

To perform this analysis on each of the original clusters for each of the subsampling iterations, use the following code:

stabilities_df <- compute_cluster_stabilities(subsample_communities_df)
## Warning in max(freq_clusters[names(freq_clusters) != ""]): no non-missing
## arguments to max; returning -Inf
stabilities_df
##    SamplingRate Cluster 1 Stability Cluster 2 Stability Cluster 3 Stability
## 1             1          1.00000000          1.00000000                 1.0
## 2          0.99          0.58715596          0.72794118                 0.8
## 3          0.95          0.72935780          0.52941176                 0.6
## 4           0.9          0.68348624          0.58823529                 0.4
## 5           0.8          0.57339450          0.30147059                 0.4
## 6           0.7          0.33486239          0.36029412                 0.4
## 7           0.6          0.49082569          0.25735294                 0.4
## 8           0.5          0.39908257          0.21323529                 0.4
## 9           0.4          0.32110092          0.23529412                 0.4
## 10          0.3          0.27522936          0.20588235                 0.4
## 11          0.2          0.15137615          0.07352941                 0.2
## 12          0.1          0.06880734          0.02941176                -Inf

To plot these results, use the plot_cluster_stabilities function:

plot_cluster_stabilities(stabilities_df)
## `geom_smooth()` using formula = 'y ~ x'

Finally, we can use the Adjusted Rand Index (ARI) to assess the global stability of the community groupings across the different subsampling iterations. This helps us understand if the overall structure of clusters is consistent across different subsamples. We can calculate the ARI between the original clusters (represented by subsample_communities_df$community_1) and each of the subsamples using the following code:

library(mclust)
## Warning: package 'mclust' was built under R version 4.2.3
## Package 'mclust' version 6.1.1
## Type 'citation("mclust")' for citing this R package in publications.
adjustedRandIndex(subsample_communities_df$community_1, subsample_communities_df$community_0.99)
## [1] 0.4325011
adjustedRandIndex(subsample_communities_df$community_1, subsample_communities_df$community_0.95)
## [1] 0.4971679
adjustedRandIndex(subsample_communities_df$community_1, subsample_communities_df$community_0.9)
## [1] 0.5135012
adjustedRandIndex(subsample_communities_df$community_1, subsample_communities_df$community_0.8)
## [1] 0.3745413
adjustedRandIndex(subsample_communities_df$community_1, subsample_communities_df$community_0.7)
## [1] 0.2428197
adjustedRandIndex(subsample_communities_df$community_1, subsample_communities_df$community_0.6)
## [1] 0.5322329
adjustedRandIndex(subsample_communities_df$community_1, subsample_communities_df$community_0.5)
## [1] 0.4751524
adjustedRandIndex(subsample_communities_df$community_1, subsample_communities_df$community_0.4)
## [1] 0.5615193
adjustedRandIndex(subsample_communities_df$community_1, subsample_communities_df$community_0.3)
## [1] 0.664666
adjustedRandIndex(subsample_communities_df$community_1, subsample_communities_df$community_0.2)
## [1] 0.4221146
adjustedRandIndex(subsample_communities_df$community_1, subsample_communities_df$community_0.1)
## [1] 0.3372875

Although ARI is a helpful metric, it can be disproportionately skewed if there are specific clusters that are particularly unstable, even if the other clusters are generally stable. For this reason, we generally prefer the above analysis of individual cluster stability over ARI calculation because it gives you more specific information on which clusters are reliable and which are not.