Sequence Analysis - R Tutorial

Content

Introduction to State Sequence Analysis
R Libraries & example dataset
Create a STate Sequence (STS) object
Sequences’ visualisation: representation plots
- Index plots
- Frequency plots and tables
Sequences’ visualisation: summarisation plots
- Sequence distribution plots (chronograms)
- Entropy
- Transition rate matrix
Sankey diagrams
Sequence clustering

All code is available on GitHUb

Introduction to State Sequence Analysis

We will look at some techniques that allow to visualise and analyse temporal sequences of categorical data (“states”).
The techniques were originally created in computer science to detect dissimilarities between long strings of codes and then used computational biology to analyse DNA sequences.
They have then be adapted by the social sciences to visualise and analyse life trajectories (e.g. work, family, residential)
Within healthcare, they can be adapted to model, for example, longitudinal care pathways for patients or career trajectories of healthcare workers

R Libraries

TraMineR: a package for mining (“life Trajectory Miner for R”), describing and visualising sequences of states or events, and more generally discrete sequence data.
ggseqplot: it contains functions that reproduce the sequence plots from TraMineR using ggplot2.

library(dplyr) ## data wrangling
library(ggplot2) ## general plotting
library(gt) ## format tables

library(TraMineR) ## event sequence analysis
library(ggseqplot) ## event sequence analysis ggplots
library(networkD3) ## for Sankey diagrams
library(cluster) ## for hierarchical clustering
library(WeightedCluster) ## compare clustering solutions

Example dataset (`mvad`)

We will look at data of transitions school-to-work in Northern Ireland.

Cohort survey of 712 students from Northern Ireland.
Followed from September 1993 to June 1999 (70 months), after the end of high school. The time sequence includes the monthly occupation state for each student (6 possible states: “Employment”, “Further Education”, “Higher Education”, “Gap”, “School”, “Training”)

This is an extract of the first 10 individuals in the dataset for the first 5 months available:

head(mock_data[,c(1:7)], 11)

   id GCSE_results      Sep.93     Oct.93     Nov.93     Dec.93   Jan.94
1   1  Lower score  employment employment employment employment training
2   2 Higher score          FE         FE         FE         FE       FE
3   3  Lower score    training   training   training   training training
4   4  Lower score    training   training   training   training training
5   5  Lower score          FE         FE         FE         FE       FE
6   6  Lower score joblessness   training   training   training training
7   7  Lower score          FE         FE         FE         FE       FE
8   8  Lower score          FE         FE         FE         FE       FE
9   9  Lower score    training   training   training   training training
10 10  Lower score      school     school     school     school   school
11 11  Lower score          FE         FE         FE         FE       FE

Background characteristic (at survey time):

Binary GCSE score, measuring how well each student has done in their final GCSE exam (Lower/Higher score),

Data wrangling

For this example, the data is organised in “wide” format (one row per individual) and ready to be used with TraMineR library
However, it is often the case that the raw data is available in long/spell format (multiple rows per individual) and needs to be wrangled to wide format.

The TraMineR::seqformat() function allows to convert data from long/spell format into wide format

Create a STate Sequence (STS) object

A matrix storing all the state sequences from the data
A series of attributes to encode for:
- The alphabet of all possible states
- Labels associated to the alphabet states (for tables and plots)
- Colours of state legends

We first define the STS object’s attributes:

STS_object_attributes<-
  data.frame(
    short_label = c("EM", "FE", "HE", "GA", "SC", "TR"),
    long_label = c("Employment", "Further Education", 
               "Higher Education", "Gap", 
               "School", "Training"),
    colour = c("green","blue", "red", 
               "yellow","cyan", "grey")
  )

STS_object_attributes

  short_label        long_label colour
1          EM        Employment  green
2          FE Further Education   blue
3          HE  Higher Education    red
4          GA               Gap yellow
5          SC            School   cyan
6          TR          Training   grey

Then we use TraMineR::seqdef to create the STS object:

STS_object <-
  TraMineR::seqdef(
    ## sequences' data in wide format
    data = mock_data, 
    
    ## list of columns containing sequences
    var = 3:72,
    
    ## state labels used in plots
    labels = STS_object_attributes$long_label,
    
    ## state alphabet labels
    states = STS_object_attributes$short_label
  )

 [>] state coding:

       [alphabet]  [label]  [long label]

     1  employment  EM       Employment

     2  FE          FE       Further Education

     3  HE          HE       Higher Education

     4  joblessness GA       Gap

     5  school      SC       School

     6  training    TR       Training

 [>] 712 sequences in the data set

 [>] min/max sequence length: 70/70

We can now set up the colout palette using TraMineR::cpal:

## set colour palette for each state category
TraMineR::cpal(seqdata = STS_object) <- STS_object_attributes$colour

This is how the first 4 sequences of the STS object appear:

head(STS_object, 4)

  Sequence                                                                                                                                                                                                         
1 EM-EM-EM-EM-TR-TR-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM
2 FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE
3 TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-GA-GA
4 TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-GA-GA-GA-GA-GA-GA-GA-GA-GA

We can have a more compact view using this command: complete SPS (State Permanent Sequence) format

print(head(STS_object, 4), format = "SPS")

  Sequence                      
1 (EM,4)-(TR,2)-(EM,64)         
2 (FE,36)-(HE,34)               
3 (TR,24)-(FE,34)-(EM,10)-(GA,2)
4 (TR,47)-(EM,14)-(GA,9)

Sequences’ visualisation: representation plots

They display actually observed sequences.

Representation plots/1: Index plot

index_plot <-
  ggseqplot::ggseqiplot(
    seqdata = STS_object
    )

index_plot +
  labs(title = "Sequence index plot")

All observed sequences are displayed
First glance at the data, although the result can be a bit messy…

Representation plots/2: Sorted index plot

To facilitate visual identification of patterns, the set of sequences can be sorted according to some specified criterion

## calculate pairwise dissimilarities 
## from a reference sequence
dist.mostfreq <- 
  TraMineR::seqdist(
    seqdata = STS_object, ## the sequence object
    
    refseq = 0, ## the most frequent sequence
    
    method = "LCS" ## Longest Common Subsequence
    ) 

index_plot_sorted <-
  ggseqplot::ggseqiplot(
    seqdata = STS_object, ## the sequence object
    sortv = dist.mostfreq  ## sorting vector
    )

index_plot_sorted +
  labs(title = "Sorted sequence index plot")

In this example the most common sequence was first identified as reference sequence (refseq = 0) and all other sequences were sorted according to their similarity to the reference (dist.mostfreq).
In the plot, the reference sequence is represented at the bottom.
An interesting pattern emerges: many individuals start in “School” (light blue) or “Further Education” (blue) and then progress into “Employment” (green) or “Higher Education” (red)

We can also visualise the same plot by group, adding it as a parameter. For example here we look at the same data by “GCSE results” and notice the difference in the most common sequences.

ggseqplot::ggseqiplot(
  seqdata = STS_object, ## the sequence object
  sortv = dist.mostfreq,  ## sorting vector
  group = mock_data$GCSE_results
    )

Representation plots/3: Frequency plot

We can also choose to represent only the most frequent sequences in the data and extract their frequency as well. For example, to visualise the top 10 most frequent sequences:

freqtop10_plot <-
  ggseqplot::ggseqfplot(
    seqdata = STS_object,
    ranks = 1:10,
    border = TRUE
  )

freqtop10_plot +
  labs(title = "Top 10 most frequent sequences")

A table with counts and percentages of the most frequent sequences can also be extracted:

freq_table<-attr(
  x = 
    TraMineR::seqtab(
      seqdata = STS_object,
      idxs=0,
      ## specify SPS = State Permanent Sequence format
      format = "SPS"
      ),
  which = "freq") %>% 
    tibble::rownames_to_column(var = "Sequence")

head(freq_table, 10) |> 
  gt() |>
  tab_options(table.font.size = 12)

Sequence	Freq	Percent
EM/70	50	7.0224719
TR/22-EM/48	18	2.5280899
FE/22-EM/48	17	2.3876404
SC/24-HE/46	16	2.2471910
SC/25-HE/45	13	1.8258427
FE/25-HE/45	8	1.1235955
FE/34-EM/36	7	0.9831461
FE/46-EM/24	7	0.9831461
FE/10-EM/60	6	0.8426966
FE/24-HE/46	6	0.8426966

Again, we can look at the same plot by GCSE group. The most frequent sequence of events for those with “Higher score” is to start in “School” (light blue) at the beginning of their trajectories and then go into “Higher Education”, while those with a “Lower Score” most commonly start directly into “Employment” (green) and remain there

ggseqplot::ggseqfplot(
  seqdata = STS_object,
  ranks = 1:10,
  border = TRUE,
  group = mock_data$GCSE_results
  )

Sequences’ visualisation: summarisation plots

These plots provide an aggregated overview of the data (rather than representing directly individual trajectories).

Summarisation plots/1: Sequence Distribution plot (Chronogram)

A chronogram is a series of stacked bar charts that visualise the cross-sectional state distributions of individuals (i.e. at each time point of the sequence).

chronogram <-
  ggseqplot::ggseqdplot(
    seqdata = STS_object,
    border = TRUE
  )

chronogram +
  labs(title = "Sequence distribution plot (Chronogram)")

Summarisation plots/2: Entropy

The entropy curve shows the degree of diversity across state distributions at each time point of the sequence. Entropy values range from 0 to 1:

entropy = 0 –> all individuals are in the same state
entropy = 1 –> each state is associated to the same number of individuals

chronogram_with_entropy <-
  ggseqplot::ggseqdplot(
    seqdata = STS_object,
    border = TRUE,
    with.entropy = TRUE
  )

chronogram_with_entropy +
  labs(title = "Sequence distribution plot (Chronogram) 
       with Entropy")

Note how the cross-sectional entropy of the student cohort decreases at the end of the time sequence (as more and more individuals enter in the same state “Employment”)

Summarisation plots/3: Transition Rates

The following code allows to show the transition rates between states as a matrix.

The transition rates are first calculated using the function TraMiner::seqtrate()
ggseqplot has its own function ggseqtrplot() to create the matrix, however I prefer aesthetically to build it directly with ggplot2 (more customisation on colours, font sizes and labels)

STS_object_trates <- 
  seqtrate(STS_object) |> ## Calculate transition rates
  reshape2::melt() ## reshape data from wide to long format

## plot the heatmap
ggplot(STS_object_trates, aes(Var2, Var1)) +
  geom_tile(aes(fill = value), colour = "black") +
  geom_text(aes(label = round(value, 3), fontface = 2), colour = "red") +
  scale_fill_continuous(high = "#132B43", low = "#FFFFFF", name="Transition rates")+
  xlab("Transition to state")+
  ylab("Transition from state")+
  theme(axis.text=element_text(size=10),
        axis.title=element_text(size=14))

The Start State is represented vertically while the End State is represented horizontally
The higher the transition rates, the more intense the blue
The highest rates (in the diagonal, all > 0.9) indicate that individuals in a particular state tend to remain in the same state
The next highest transition rate (0.042) is [GA] –> [EM] (from Gap to Employment)

Sankey diagrams

We can also look at sequences of transitions from one state into another, without considering the time spent in each state. These are called DSS (Discrete Successive State) sequences and can be visualised using all the tools shown previously and also using Sankey diagrams.

These are the explicit State sequences (STS):

head(STS_object, 4)

  Sequence                                                                                                                                                                                                         
1 EM-EM-EM-EM-TR-TR-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM
2 FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE-HE
3 TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-GA-GA
4 TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-TR-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-EM-GA-GA-GA-GA-GA-GA-GA-GA-GA

They can also be written in “compact” SPS (State Permanent Sequence) format:

print(head(STS_object, 4), format = "SPS")

  Sequence                      
1 (EM,4)-(TR,2)-(EM,64)         
2 (FE,36)-(HE,34)               
3 (TR,24)-(FE,34)-(EM,10)-(GA,2)
4 (TR,47)-(EM,14)-(GA,9)

These are the same sequences in DSS (Discrete Successive State) format, removing the information of time spent in each state. The DSS states can be extracted using TraMineR::seqdss function:

## extract DSS frequencies

# extract sequence of distinct states 
DSS_sequences<-
  TraMineR::seqdss(
    seqdata = STS_object,
    
    ## to consider missing values as a valid "state"
    with.missing=TRUE 
    )

head(DSS_sequences, 4)

  Sequence   
1 EM-TR-EM   
2 FE-HE      
3 TR-FE-EM-GA
4 TR-EM-GA

We can use a Sankey diagram to visualise the DSS sequences. Since plotting the whole dataset would look a bit messy, I will just focus on those students starting their trajectories from a “Further Education” state (FE)

STS_object_FEstart <- 
  ## we can extract only those individuals who
  ## started their trajectories from "Further Education" state
  subset(STS_object,mock_data$`Sep.93` == "FE")

# extract sequence of distinct states 
DSS_sequences_FEstart<-
  TraMineR::seqdss(
    seqdata = STS_object_FEstart,
    
    ## to consider missing values as a valid "state"
    with.missing=TRUE 
    )

## create DSS_freqTable_FEstart
DSS_freq_table_FEstart<-
  attr(
    TraMineR::seqtab(
      seqdata = DSS_sequences_FEstart,
      idxs=0,
      format = "SPS"),
    which="freq") %>% 
  tibble::rownames_to_column(var = "Sequence") 


head(DSS_freq_table_FEstart  |>
  mutate(Sequence = gsub('/1', '', Sequence)), 10)

      Sequence Freq   Percent
1        FE-EM   78 28.363636
2        FE-HE   22  8.000000
3     FE-GA-EM   16  5.818182
4     FE-TR-EM   16  5.818182
5     FE-HE-EM   13  4.727273
6     FE-EM-HE   10  3.636364
7  FE-EM-FE-EM    8  2.909091
8        FE-GA    8  2.909091
9  FE-EM-TR-EM    7  2.545455
10    FE-EM-GA    6  2.181818

A Sankey diagram represents flows of individuals moving from one state to another. It uses weighted connections going from one node to another. Each node represents a state and the weighted connections correspond to the number of people moving between different states (the flows). I will use a custom made function created with the networkD3 library.

source("SankeyDiagram_function.R")

sankey_diagram <- 
  sankey_diagram_outputs(
  DSS_freqTable = DSS_freq_table_FEstart,
  
  ## assign the same colour palette previously used
  colour_palette_file = 
    STS_object_attributes |>
    select(short_label,colour) |>
    rename(Labels = short_label, Exadecimal.Codes = colour),
  
  plot_title = "Sankey diagram tutorial"
    
)

sankey_diagram$Sankey_plot

Sankey diagram tutorial

On the plot, ST1 = State 1, ST2 = State 2 etc, going forward in the sequence of states.
At the beginning (State 1 = ST1), we can see there are, for example, 275 individuals who start their trajectories in the “Further Education” (FE/ST1 = blue) state. If you follow the connections, you can see that 154 of them flows into the “Employment” state (EM/ST2 = green), 38 into the “Higher Education” state (HE/ST2 = red), 51 into the “Gap” state (GA/ST2 = yellow), 25 into the “Training” state (TR/ST2 = grey), 6 into the “School” state (SC/ST2 = light blue), while 1 remains into the “Further Education” state (FE/ST2 = blue). And so on…

Sequence clustering

Clustering creates group objects (e.g. state sequences) in such a way that the groups obtained are as internally homogeneous as possible and as different from one another as possible.
The method discussed here is based on comparing state sequences and quantifying how dissimilar they are, by using a metric of sequence dissimilarity.
Sequence dissimilarity can be computed with the Optimal Matching (OM) algorithm, which expresses distances between sequences in terms of amount of effort required to transform one sequence into the other one. Three operations are possible to transform one sequence into another:
- insertion (one state is inserted into the first sequence to obtain the second)
- deletion (one state is deleted from the first sequence to obtain the second)
- substitution (one state is replaced by another state)
Each operation is associated to a cost. The distance between two sequences can be defined as the minimum cost to transform one sequence into the other. Definition of costs is often subjective, although it is possible to use data driven methods (e.g. between-states transition rates to define the substitution costs)

The following code uses TraMineR::seqdist() to compute the pairwise sequence dissimilarity matrix (e.g. distances between each pair of sequences) using OM. The pairwise distances for the first 4 sequences are shown:

diss_matrix_OM <- 
  TraMineR::seqdist(
    seqdata = STS_object,
    
    ## Optimal Matching
    method = "OM", 
    
    ## Insertion/Deletion cost (subjective)
    indel = 1, 
    
    ## Substitution cost matrix 
    ## (TRATE = Transition Rates of States --> data-driven)
    sm = "TRATE") 

diss_matrix_OM[1:4,1:4]

         1         2         3         4
1   0.0000 138.51418 113.99505 105.74294
2 138.5142   0.00000  71.82969 139.32555
3 113.9950  71.82969   0.00000  67.70437
4 105.7429 139.32555  67.70437   0.00000

We can now carry out clustering. Different methods are available and we will focus on hierarchical clustering since it does not require to specify in advance the number of clusters (e.g. compare to k-means)
We will use the Ward algorithm, a hierarchical clustering method that attempts to minimise the within-cluster variance.

The function cluster::agnes is used to compute hierarchical clustering (“agnes” = Agglomerative Nesting). The function’s output can be visualised as a dendrogram, a diagram with a tree structures showing the grouping hierarchy of the sequences:

## compute hierarchical clustering
cluster_ward <- 
  cluster::agnes(
    
    ## x is the dissimilarity matrix
    x = diss_matrix_OM, 
    diss = TRUE, 
    
    ## we choose the Ward algorithm
    method = "ward")

## plot the dendrogram
plot(
  as.dendrogram(cluster_ward), 
  type = "rectangle", 
  ylab = "Height")

…but, how many clusters should we use?
There might be knowledge domain-driven reasons to expect a specific number of clusters
Otherwise, we can use data-driven clustering quality metrics, which give an indication on the optimal number of clusters for the specific dataset.

The following code will use the WeightedCluster::as.clustrange function to compare different clustering solutions (from 2 to 25 clusters). Multiple clustering quality metrics for each number of cluster considered. We will focus only on metrics:

Point Bi-serial Correlation (PBC): to maximise
Average Silhouette Width (ASW): to maximise
Hubert’s C coefficient (HC): to minimise

ward_test <- 
  WeightedCluster::as.clustrange(
    
    ## the hierarchical clustering output
    object = cluster_ward,
    
    ## the dissimilarity matrix 
    diss = diss_matrix_OM, 
    
    ## the maximum number of cluster we want to evaluate
    ncluster = 25)

names(ward_test$stats)

 [1] "PBC"  "HG"   "HGSD" "ASW"  "ASWw" "CH"   "R2"   "CHsq" "R2sq" "HC"

head(ward_test$stats[,c("PBC","ASW","HC")],10)

                PBC       ASW         HC
cluster2  0.5964016 0.4338651 0.12844021
cluster3  0.5092477 0.2960673 0.19392389
cluster4  0.6154016 0.3462502 0.11295295
cluster5  0.6312816 0.3631951 0.09881787
cluster6  0.6303661 0.3779820 0.08855149
cluster7  0.6429058 0.4047889 0.06410220
cluster8  0.5632632 0.3692324 0.07625703
cluster9  0.5469581 0.3514947 0.07154070
cluster10 0.5482885 0.3662150 0.06846215
cluster11 0.5427097 0.3681408 0.06156989

We will plot these metrics so it is easier to visually inspect them:

plot(
  ward_test, 
  stat = c("PBC", "HC", "ASW"), 
  norm = "zscore", 
  lwd = 4, 
  xaxp =c(0,25,25))

Looking at both metrics simultaneously, 7 clusters seems a good compromise. We can use the cutree function to cut the clustering tree into 7 groups:

The following code assigns each individual to one of the 7 clusters and add this new information to the original dataset. The new cluster variable can be used as a group variable to look at the sequences now. We can have a quick look at how many individuals belong to each cluster:

## assigning each individual in the cohort to 
## one of the 7 extracted clusters
cluster_groups <- 
  factor(
    # cutree "cuts" the dendrogram into 7 groups
    x = cutree(cluster_ward,k = 7), 
    labels = paste("Cluster", 1:7)
    )

# add clusters' groups to the original dataset:
mock_data$cluster_group<-cluster_groups

# sample sizes by cluster group
mock_data |>
  group_by(cluster_group) |>
  summarise (
    `N individuals` = n()
  )

# A tibble: 7 × 2
  cluster_group `N individuals`
  <fct>                   <int>
1 Cluster 1                 265
2 Cluster 2                  63
3 Cluster 3                 149
4 Cluster 4                  59
5 Cluster 5                  45
6 Cluster 6                  90
7 Cluster 7                  41

barplot(
  table(mock_data$cluster_group), 
  main="Clusters' distribution", 
  xlab="Cluster group", 
  ylab = "Number of individuals")

We can use all the plotting and analysis tools available in TraMineR to analyse the sequence data divided by cluster group. For example we can look at their chronograms to have an idea of how individuals in each cluster might differ in their trajectories:

ggseqplot::ggseqdplot(
    seqdata = STS_object, 
    group = cluster_groups,
    border = FALSE
  )

Some interesting trajectory patterns emerge from each cluster group:

Cluster 1: individuals who enter the “Employment” state relatively early
Cluster 2: individuals who start in “Further Education” and mainly progress into “Higher Education”
Cluster 3: individuals who start in “Further Education” and mainly progress into “Employment”
Cluster 4: individuals who start in “Training” and mainly progress into “Employment”
Cluster 5: individuals who start in “School” and mainly progress into “Further Education” and “Employment”
Cluster 6: individuals who start in “School” and mainly progress into “Higher Education”
Cluster 7: individuals who end into “Gap” (Unemployment) state.

Finally, we can carry out some statistical analysis to identify whether there is any relevant association between clusters an other variables of interest. In this example we have a binary variable “GCSE results”. We carry out a \(\chi^2\) test to assess its statistical association with the clusters

table(mock_data$cluster_group,mock_data$GCSE_results)

           
            Higher score Lower score
  Cluster 1           37         228
  Cluster 2           51          12
  Cluster 3           56          93
  Cluster 4           12          47
  Cluster 5           22          23
  Cluster 6           76          14
  Cluster 7            6          35

mock_data |>
  group_by(cluster_group,GCSE_results) |>
  summarise(
    Counts = n()
  ) |>
  ungroup() |>
  ggplot(
    aes(x = cluster_group, y = Counts, fill = GCSE_results)
    ) +
  geom_bar(stat="identity", position = position_dodge())

chisq.test(
  table(mock_data$cluster_group,
        as.factor(mock_data$GCSE_results))
  )


    Pearson's Chi-squared test

data:  table(mock_data$cluster_group, as.factor(mock_data$GCSE_results))
X-squared = 219.17, df = 6, p-value < 2.2e-16

chisq.posthoc.test::chisq.posthoc.test(
  table(mock_data$cluster_group,
        as.factor(mock_data$GCSE_results))
  )

   Dimension     Value      Higher score        Lower score
1  Cluster 1 Residuals -9.62427906103099   9.62427906103099
2  Cluster 1  p values                0*                 0*
3  Cluster 2 Residuals  7.67259652544357  -7.67259652544357
4  Cluster 2  p values                0*                 0*
5  Cluster 3 Residuals 0.304216804824907 -0.304216804824906
6  Cluster 3  p values                 1                  1
7  Cluster 4 Residuals -2.69497585862664   2.69497585862664
8  Cluster 4  p values            0.0986             0.0986
9  Cluster 5 Residuals  1.78093708475993  -1.78093708475993
10 Cluster 5  p values                 1                  1
11 Cluster 6 Residuals  10.1035910021959  -10.1035910021959
12 Cluster 6  p values                0*                 0*
13 Cluster 7 Residuals -2.99775284936756   2.99775284936756
14 Cluster 7  p values           0.0381*            0.0381*

The test Post-hoc tests reveal a significant statistical difference between the two GCSE groups in Clusters 1 and 7 (more individuals with GCSE Lower scores), and Clusters 2 and 6 (more individuals with GCSE Higher scores)