Sequence Analysis - First Example

Data Set: School to Work Transition

This example use data from McVicar and Anyadike-Danes (2002) which is included with the package TraMineR. The data consist of static background characteristics and a time series sequence of 72 monthly labour market activities for each of 712 individuals in a cohort survey. The individuals were followed up from July 1993 to June 1999. The monthly states are recorded in columns 15 (Jul.93) to 86 (Jun.99).

States are:

States            |  Notation
----------------  |  --------
employment        | (EM)
further education | (FE)
higher education  | (HE)
joblessness       | (JL)
school            | (SC)
training          | (TR)

The data set contains also ids (id) and sample weights (weight) as well as the following binary covariates:

male
catholic
Belfast, N.Eastern, Southern, S.Eastern, Western (location of school, one of five Education and Library Board areas in Northern Ireland)
Grammar (type of secondary education, 1=grammar school)
funemp (father’s employment status at time of survey, 1=father unemployed)
gcse5eq (qualifications gained by the end of compulsory education, 1=5+ GCSEs at grades A-C, or equivalent)
fmpr (SOC code of father’s current or most recent job, 1=SOC1 (professional, managerial or related))
livboth (living arrangements at time of first sweep of survey (June 1995), 1=living with both parents)

Format A data frame containing 712 rows, 72 state variables, 1 id variable and 13 covariates. The following analysis excludes the first two months of 1993 (summer holidays after compulsory education), thus observing from Sep 1993 to Jun 1999 (70 monthly statuses)

rm(list = ls())
library(TraMineR)
data(mvad)

Sequence Presentations

The weight variable in mvad contains case weights that account for the selective attrition during the survey and we attach them to the sequence object as shown below.

mvad.labels <- c("employment", "further education", "higher education", "joblessness", "school", "training")
mvad.scode <- c("EM","FE","HE","JL","SC","TR")
mvad.seq <- seqdef(mvad,var = 17:86,states = mvad.scode, labels = mvad.labels, weights = mvad$weight, xtstep = 6) #17:86 state variables in dataframe, xtstep = 6 -> display separation every 6 months

mvad[1:5, 17:22]

##       Sep.93     Oct.93     Nov.93     Dec.93   Jan.94   Feb.94
## 1 employment employment employment employment training training
## 2         FE         FE         FE         FE       FE       FE
## 3   training   training   training   training training training
## 4   training   training   training   training training training
## 5         FE         FE         FE         FE       FE       FE

print(mvad.seq[1:5, 17:22], format = "STS") #STate-Sequence

##   Sequence         
## 1 EM-EM-EM-EM-EM-EM
## 2 FE-FE-FE-FE-FE-FE
## 3 TR-TR-TR-TR-TR-TR
## 4 TR-TR-TR-TR-TR-TR
## 5 FE-FE-FE-FE-FE-FE

print(mvad.seq[1:5, ], format = "SPS") #State-Permanence-Sequence

##   Sequence                      
## 1 (EM,4)-(TR,2)-(EM,64)         
## 2 (FE,36)-(HE,34)               
## 3 (TR,24)-(FE,34)-(EM,10)-(JL,2)
## 4 (TR,47)-(EM,14)-(JL,9)        
## 5 (FE,25)-(HE,45)

Visualizing State Sequences

Sequence Index

# Draw an index plot for whole sample, sorted by initial state (See Section 5.1 for other sorting mechanisms)
seqIplot(mvad.seq, sortv = "from.start", with.legend = F, border = NA, main = "Sequence Index Plot (Whole sample)") #idxs = 1:10 is default to plot first 10 seq

# Draw an index plot of the first 10 sequences
seqiplot(mvad.seq, with.legend = "right",cex.legend = 0.8, border = NA, main = "Index Plot (10 first sequences)") #idxs = 1:10 is default to plot first 10 seq

This plot shows early life trajectories for first 10 persons
The bar width of each sequence is proportional to its weight, could remove this by setting weighted = FALSE
Note that the individuals are in the same cohort

seqIplot(mvad.seq, group = mvad$gcse5eq, border = NA, sortv = "from.start", main = c("GCSE below C","GCSE at A-C")) #, with.legend = "right",cex.legend = 0.8, border = NA, main = "Index Plot (10 first sequences)")

Sequence Frequency

seqtab(mvad.seq, tlim = 1:4)

##             Freq Percent
## SC/24-HE/46   33     4.7
## SC/25-HE/45   25     3.5
## TR/22-EM/48   18     2.5
## EM/70         15     2.1

The most frequent sequence in the mvad.seq object is a spell of two years of school followed by 46 months of higher education. It accounts, however, for only 4.7% of the total weights of the 712 cases considered. The second most frequent sequence, which concerns 3.5% of the weighted individuals, is indeed very similar to the previous one.

# Set a 2x2 graph display
par(mfrow=c(1,2))

# Plot 10 most frequent sequences with bar width proportional to the frequencies"
seqfplot(mvad.seq, with.legend = FALSE, border = NA, main = "Weighted Frequencies")

# Plot 10 most frequent sequences with unweighted frequencies
seqfplot(mvad.seq, weighted = FALSE, with.legend = FALSE, border = NA, main = "Unweighted Frequencies")

Interpretation for the unweighted plot
This plot shows 10 most frequent sequences or 10 most popular trajectories among this cohort of individuals. As seen, the most popular trajectories is full employed (all green), followed by school-higher education, training-working, ….
We could notice that joblessness is not common among this cohort during this time period (this could also be seen with in the State Distribution Plot following)

State Distribution

Mean-time Plot

Mean time spent in each state

seqmtplot(mvad.seq, group = mvad$male, ylim = c(0,30), main = c("Female", "Male"))

Retrieve values using seqmeant() function

#Compute mean time spent in each state for group of male/female; check `seqistatd` for individual
by(mvad.seq, mvad$male, seqmeant)

## mvad$male: no
##    Mean
## EM 24.8
## FE  9.8
## HE 12.6
## JL  6.3
## SC 12.0
## TR  4.5
## ------------------------------------------------------------ 
## mvad$male: yes
##    Mean
## EM 29.2
## FE  8.5
## HE  8.5
## JL  5.9
## SC  8.0
## TR 10.0

Transition rates

The transition rate between states \(s_i\) and \(s_j\) is obtained as:

\[p(s_j|s_i) = \frac{\sum_{t=1}^{L-1}n_{t,t+1}(s_i,s_j)}{\sum_{t=1}^{L-1}n_t(s_i)}\]

where \(n_t(s_i)\) is the number of sequences that is longer than \(t\) and is with state \(s_i\) at position \(t\); \(n_{t,t+1}\) is the number of sequences with state \(s_i\) at position \(t\) and state \(s_j\) at position \(t+1\); \(L\) is the maximal observed sequence length.

Transition Matrix

mvad.trate <- seqtrate(mvad.seq)
round(mvad.trate, 2)

##         [-> EM] [-> FE] [-> HE] [-> JL] [-> SC] [-> TR]
## [EM ->]    0.99    0.00    0.00    0.01    0.00    0.00
## [FE ->]    0.03    0.95    0.01    0.01    0.00    0.00
## [HE ->]    0.01    0.00    0.99    0.00    0.00    0.00
## [JL ->]    0.04    0.01    0.00    0.94    0.00    0.01
## [SC ->]    0.01    0.01    0.02    0.01    0.95    0.00
## [TR ->]    0.04    0.00    0.00    0.01    0.00    0.94

The figures along the diagonal capture the stability of each state, while each figure off-diagonal represents the probability of changing from the row-state to the column-state. Notice that the total by each row should be 1. The highest transition rates (0.04) are observed between JL (joblessness) and EM (employment), and between TR (training) and EM. Moreover, states JL and TR are the most unstable with a probability of 6% to leave the state at each position \(t\) (although by itself 6% is not a big number)

Time-varying transition rates

The following code produces the t-dependent transition matrix, i.e transition rates that are time-varying: depending on each point in time. The results is a 3-dimensional array (kxkxL-1) where kxk is transition matrix for k distinct states and (L-1) is the number of times when transition takes place.

mvad.tvrate <- seqtrate(mvad.seq, time.varying = TRUE)
trans_at1 <- round(mvad.tvrate[,,1],2)
trans_at2 <- round(mvad.tvrate[,,2],2)

Transversal state distributions

seqstatd(mvad.seq[,1:8])

##     [State frequencies]
##    Sep.93 Oct.93 Nov.93 Dec.93 Jan.94 Feb.94 Mar.94 Apr.94
## EM  0.034  0.036  0.044  0.051  0.055  0.058  0.066  0.067
## FE  0.223  0.217  0.215  0.215  0.209  0.206  0.204  0.200
## HE  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
## JL  0.048  0.055  0.051  0.050  0.058  0.068  0.064  0.067
## SC  0.472  0.466  0.464  0.461  0.451  0.451  0.446  0.446
## TR  0.223  0.227  0.226  0.223  0.227  0.218  0.221  0.220
## 
##      [Valid states]
##    Sep.93 Oct.93 Nov.93 Dec.93 Jan.94 Feb.94 Mar.94 Apr.94
## N     712    712    712    712    712    712    712    712
## 
##      [Entropy index]
##    Sep.93 Oct.93 Nov.93 Dec.93 Jan.94 Feb.94 Mar.94 Apr.94
## H    0.72   0.73   0.73   0.74   0.75   0.76   0.77   0.77

State Distribution Plot

# State distribution by time points 
seqdplot(mvad.seq, group = mvad$gcse5eq, border = NA, main = c("GCSE below C","GCSE at A-C"))

This plot shows stacked distributions (chronogram) of the states at every observed period.
During the first few years (93-95) individual states vary notably between further education, school, training,.. while towards the middle most individuals finished formal education/training and either starting employed or taking higher education; near the end most of them were working, some stay at higher education and a small proportion was unemployed.
Not that this plot does not render individual sequences or individual follow-ups as in the Index Plot or Frequency plots, here the plot provide aggregated views made of successive slices, each of which represents transversal characteristics (a slice is transversal in the sense that it scans over every individual at a give point). We read state distribution plot VERTICALLY (state distribution at a given time point) not HORIZONTALLY (as per individual sequence).

Modal State Plot

Provide a summary of State Distribution by plotting only the most frequent state (mode) at each position. The height of the bar at each position is proportional to the frequency of the displayed state (mode) at that position. "O occurences imply that the modal state sequences do not coincide with any actual sequence in the dataset (no individual sequence exactly coincides with the mode sequence). (Recall trade-off between readability/simplification and loss of information)

seqmsplot(mvad.seq, group = mvad$gcse5eq, border= NA, main = c("GCSE below C","GCSE at A-C"))

Entropy plot

Transversal entropy of state distributions

The Shannon’s entropy, also known as the entropy index is a measure of the diversity of states observed at a given time point. The following formula applied for \(h_t\) (entropy index at time \(t\)) is the same for every \(t\) so we ignore the subscript \(t\) to avoid clutter.

\[h(p_1,...,p_a) = - \sum_{i=1}^a p_i log(p_i) \]

where \(a\) is the size of the alphabet (the set of letter-representation of states, in the current context, \(a = 6\)); \(p_i\) is the proportion of cases in state i at a given position. The entropy is 0 when all cases are in the same state and is maximal when we have the same proportion of cases in each states.

Self-note: Since sequential data by nature is categorical, a mean state wouldn’t make much sense (as in the case of numeric data), so as variance (defined as square distance from mean). Instead, the mode and entropy are of more usual discussion.

Sequence analysts, however, also concern about numerical characteristic of sequences (e.g “distance” from mode) - we’ll talk about those measures later.

# Plot the entropy of the state distribution at each time point

seqHtplot(mvad.seq, group = mvad$gcse5eq, main = c("GCSE below C","GCSE at A-C"))

Comment: For group of individuals with lower qualification at the end of compulsory education, the entropy index reduces notably overtime. This is because most of them get to work after some training or schooling.

On the other hand, the group of higher qualified individuals tends to experience a more diverse trajectories: after compulsory education, half of them continued schooling, then pursue higher education, while the other half either took some further education then started working or some training the working.

It could also be interesting to look at the trajectories of male and female after compulsory education.

Individual Sequence Characteristics

Unidimensional indicators

Number of transitions

\(N(x) = l_d(x) - 1\) where \(l_d(x)\) is the length of the sequence’s DSS (Distinct Successive States)

library(dplyr)
mvad %>% mutate(trans = seqtransn(mvad.seq)) -> mvad
summary(mvad$trans)

##      Trans.     
##  Min.   :0.000  
##  1st Qu.:1.000  
##  Median :2.000  
##  Mean   :2.097  
##  3rd Qu.:3.000  
##  Max.   :9.000

Number of subsequences

A subsequence \(y\) of \(x\) is composed of elements of \(x\) occurring in the same order than in \(x\). The number \(\phi(x)\) of subsequences is used in the turbulence measure presented below.

In Figure 9 Below, sequences 5 and 9 have the maximal number of transitions, while the number of subsequences is maximal for sequence 9 only.

longitudinal characteristics

mvad %>% mutate(nsubseq = seqsubsn(mvad.seq)) -> mvad

Within-sequence entropy

Time spent (total duration) at each state for each sequence

seqistatd(mvad.seq[1:4,])

##   EM FE HE JL SC TR
## 1 68  0  0  0  0  2
## 2  0 36 34  0  0  0
## 3 10 34  0  2  0 24
## 4 14  0  0  9  0 47

Within-sequence entropy: This index captures the diversity of states within each sequence, that is the diversity of states that each individual experience.

The same Shannon-entropy formula is applied here:

\[h(\pi_1,...,\pi_a) = - \sum_{i=1}^a \pi_i log(\pi_i)\]

where \(a\) is the size of the alphabet and \(\pi_i\) is the proportion of occurrences of the \(i\)th state in the considered sequence (do not consider order of occurrences here). Interpretation similar to transversal entropy. By default the function seqient normalize the entropy calculated via formula above by diving its value to the theoretical maximum, \(log a\). Since within-sequence entropy does not account for the state order in the sequence, sequences 7 and 9 have the same maximal normalized entropy of 1.

entropies <- seqient(mvad.seq)
summary(entropies)

##     Entropy      
##  Min.   :0.0000  
##  1st Qu.:0.3409  
##  Median :0.3866  
##  Mean   :0.3996  
##  3rd Qu.:0.5508  
##  Max.   :0.8548

OLS Regression explaining individual’s entropy

lm.ent <- lm(entropies ~ male + gcse5eq + fmpr + funemp, mvad)
summary(lm.ent)

## 
## Call:
## lm(formula = entropies ~ male + gcse5eq + fmpr + funemp, data = mvad)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47838 -0.08812  0.00405  0.12789  0.47112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.38367    0.01243  30.868  < 2e-16 ***
## maleyes     -0.04030    0.01325  -3.042  0.00244 ** 
## gcse5eqyes   0.06327    0.01405   4.503 7.85e-06 ***
## fmpryes      0.03145    0.01568   2.006  0.04523 *  
## funempyes    0.03647    0.01803   2.023  0.04349 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1736 on 707 degrees of freedom
## Multiple R-squared:  0.06037,    Adjusted R-squared:  0.05505 
## F-statistic: 11.36 on 4 and 707 DF,  p-value: 6.157e-09

Notice that so far we haven’t talked about sequence characteristics that capture (normative) nature of states, for example: good/bad, favorable/unfavorable, positive/negative. In the current context, for instance, a sequence of EM/70 (employed all times) could be more favorable to JL/70 (joblessness all time). We’ll turn to measures that reflect these characteristics latter.

Composite Complexity Measures

Sequence Turbulence

The turbulence \(T(x)\) of a sequence \(x\) accounts for the number of \(\phi(x)\) of distinct subsequences of the DSS sequence and the variance \(s_t^2(x)\) of the consecutive times \(t_j\) spent in the \(l_d(x)\) distinct states.

\[T(x) = log_2 \left( \phi(x) \frac{s^2_{t, max}(x) + 1}{s^2_t(x) +1 } \right)\]

where \(s^2_{t, max}(x)\) is the maximum value that \(s^2_t(x)\) can take given the total duration \(l(x) = \sum_j t_j\) of that sequence. This maximum is \(s^2_{t, max}(x) = (l_d(x) - 1)(1 - \bar t(x))^2\) where \(\bar t(x)\) is the mean consecutive time spent in the distinct states. Ritschard (2021) provide a revised version of Turbulence index taking into account states that are not visited.

From a prediction point of view, the higher the differences in state durations and hence the higher their variance, the less uncertain the sequence. In that sense, small duration variance indicates high complexity.

Turbulence <- seqST(mvad.seq)
summary(Turbulence)

##    Turbulence    
##  Min.   : 1.000  
##  1st Qu.: 4.621  
##  Median : 5.454  
##  Mean   : 5.770  
##  3rd Qu.: 7.020  
##  Max.   :12.959

hist(Turbulence, col = "cyan", main = "Sequence Turbulence")

Complexity Index

This combines the number of transitions in the sequence with the within-sequence entropy.

\[C(x) = \sqrt{\frac{l_d(x)}{l(x)} \frac{h(x)}{h_{max}}}\] where \(h_max = log a\) is the theoretical maximum value of the entropy given the alphabet. * A sequence with a single distinct state (no transition, zero-entropy) has minimal complexity, 0. * A sequence \(x\) such that i) \(x\) contains each of the states in the alphabet, ii) the same time \(l(x)/a\) is spent in each state, and iii) the number of transitions is \(l(x)-1\) has the maximal complexity, 1.

Complexity vs Turbulence

Look at sequences 3, 4, 5 and 7.

[3] EM\6, FE\6

[4] EM\4, FE\4, EM\4

[5] EM\1, FE\1, EM\1, FE\1, EM\1, FE\1, EM\1, FE\1, EM\1, FE\1, EM\1, FE\1

[7] EM\3, FE\3, HE\3, JL\3

Complexity rank: [5] (0.73), [7] (0.45), [4] (0.3), [3] (0.22)

Turbulence rank: [5] (0.80), [7] (0.70), [4] (0.6), [3] (0.59)

Complexity “performs” better than turbulence when discriminating between sequences with zero duration variance (as all above 4)

Turbulence index does not account for states that are not visited, which tends to give a high turbulence value for seemingly simple sequences like [3].

Between-sequence: Measuring (dis)similarity

Counts of common attributes

Edit distances

Optimal matching

Compute the optimal matching distances using substitution costs based on transition rates observed in the data and a 1 indel cost (see Section 9.4). The resulting distance matrix is stored in the dist.om1 object.

Dissimilarity-based SA

Typology of Trajectories

### Plotting: 

seqfplot(mvad.seq, group = cl1.4fac, border = NA) #SFP within each cluster

seqdplot(mvad.seq, group = cl1.4fac, border = NA) # State Distribution at each time point within each cluster

Comment:

Type 1: Moderate Training, Little Education, Mostly Work
Type 2: Intensive School/Higher Education/Further Education, Little Training, Moderate Work
Type 3: Further Education -> Work
Type 4: Training + Joblessness + Little Work

Event Sequence Analysis

Instead of focusing on sequences of states, we can look at sequences of transitions or events (i.e change between states)

Define the sequence of transitions

mvad.seqe <- seqecreate(mvad.seq)  #create a list for every individual

Look for frequent event subsequences and plot the 15 most frequent ones Recall: event = change from one state to the next

fsubseq <- seqefsub(mvad.seqe, pmin.support = 0.05)

plot(fsubseq[1:15], col = "cyan")

Interpreting this plot:

Most frequent transition is further education implying that further education takes a while to finish
Second most frequent: from further education to employment. In other words, after (period of) further education most likely being employed
Third most frequent: training to employment

Most discriminating transitions between clusters (see Section 10.4)