This example uses data from McVicar and Anyadike-Danes (2002), which is included with the TraMineR package. The data consist of static background characteristics and a time series of 72 monthly labour market activities for each of 712 individuals in a cohort survey. The individuals were followed from July 1993 to June 1999, and the monthly states are recorded in columns 15 (Jul.93) to 86 (Jun.99).
States are:
States | Notation
---------------- | --------
employment | (EM)
further education | (FE)
higher education | (HE)
joblessness | (JL)
school | (SC)
training | (TR)
The data set also contains ids (id) and sample weights (weight), as well as a number of binary covariates such as male, gcse5eq, fmpr and funemp.
Format: a data frame containing 712 rows, 72 state variables, 1 id variable and 13 covariates. The following analysis excludes the first two monthly records (Jul.93 and Aug.93, the summer holidays after compulsory education), thus covering Sep 1993 to Jun 1999 (70 monthly statuses).
rm(list = ls())
library(TraMineR)
data(mvad)
The weight variable in mvad contains case weights that account for the selective attrition during the survey; we attach them to the sequence object as shown below.
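As a quick check (my own addition, not part of the original analysis), the distinct states appearing in the monthly status columns can be listed with seqstatl():
seqstatl(mvad[, 17:86]) # distinct states found in the Sep.93 to Jun.99 columns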
mvad.labels <- c("employment", "further education", "higher education", "joblessness", "school", "training")
mvad.scode <- c("EM","FE","HE","JL","SC","TR")
mvad.seq <- seqdef(mvad, var = 17:86, states = mvad.scode, labels = mvad.labels, weights = mvad$weight, xtstep = 6) # 17:86 are the state variables in the data frame; xtstep = 6 -> display separation every 6 months
mvad[1:5, 17:22]
## Sep.93 Oct.93 Nov.93 Dec.93 Jan.94 Feb.94
## 1 employment employment employment employment training training
## 2 FE FE FE FE FE FE
## 3 training training training training training training
## 4 training training training training training training
## 5 FE FE FE FE FE FE
print(mvad.seq[1:5, 17:22], format = "STS") # STS: State-Sequence format
## Sequence
## 1 EM-EM-EM-EM-EM-EM
## 2 FE-FE-FE-FE-FE-FE
## 3 TR-TR-TR-TR-TR-TR
## 4 TR-TR-TR-TR-TR-TR
## 5 FE-FE-FE-FE-FE-FE
print(mvad.seq[1:5, ], format = "SPS") # SPS: State-Permanence-Sequence format
## Sequence
## 1 (EM,4)-(TR,2)-(EM,64)
## 2 (FE,36)-(HE,34)
## 3 (TR,24)-(FE,34)-(EM,10)-(JL,2)
## 4 (TR,47)-(EM,14)-(JL,9)
## 5 (FE,25)-(HE,45)
# Draw an index plot for the whole sample, sorted by initial state (see Section 5.1 for other sorting mechanisms)
seqIplot(mvad.seq, sortv = "from.start", with.legend = FALSE, border = NA, main = "Sequence Index Plot (Whole sample)")
# Draw an index plot of the first 10 sequences (idxs = 1:10 is the default); set weighted = FALSE for unweighted output
seqiplot(mvad.seq, with.legend = "right", cex.legend = 0.8, border = NA, main = "Index Plot (10 first sequences)")
# Index plots by GCSE qualification group
seqIplot(mvad.seq, group = mvad$gcse5eq, border = NA, sortv = "from.start", main = c("GCSE below C", "GCSE at A-C"))
# Four most frequent sequences
seqtab(mvad.seq, tlim = 1:4)
## Freq Percent
## SC/24-HE/46 33 4.7
## SC/25-HE/45 25 3.5
## TR/22-EM/48 18 2.5
## EM/70 15 2.1
The most frequent sequence in the mvad.seq object is a spell of two years of school followed by 46 months of higher education. It accounts, however, for only 4.7% of the total weight of the 712 cases considered. The second most frequent sequence, covering 3.5% of the weighted individuals, is very similar to the first.
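For comparison (my own addition), seqtab() also accepts a weighted argument, so the unweighted counts of the most frequent sequences can be obtained as:
seqtab(mvad.seq, tlim = 1:4, weighted = FALSE) # 4 most frequent sequences, unweighted counts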
# Set a 1x2 graph display
par(mfrow = c(1, 2))
# Plot the 10 most frequent sequences with bar widths proportional to their (weighted) frequencies
seqfplot(mvad.seq, with.legend = FALSE, border = NA, main = "Weighted Frequencies")
# Plot 10 most frequent sequences with unweighted frequencies
seqfplot(mvad.seq, weighted = FALSE, with.legend = FALSE, border = NA, main = "Unweighted Frequencies")
Mean time spent in each state
seqmtplot(mvad.seq, group = mvad$male, ylim = c(0, 30), main = c("Female", "Male"))
Retrieve the values with the seqmeant() function:
# Compute the mean time spent in each state for males and females; see seqistatd() for individual-level values
by(mvad.seq, mvad$male, seqmeant)
## mvad$male: no
## Mean
## EM 24.8
## FE 9.8
## HE 12.6
## JL 6.3
## SC 12.0
## TR 4.5
## ------------------------------------------------------------
## mvad$male: yes
## Mean
## EM 29.2
## FE 8.5
## HE 8.5
## JL 5.9
## SC 8.0
## TR 10.0
Transition rates
The transition rate between states \(s_i\) and \(s_j\) is obtained as:
\[p(s_j|s_i) = \frac{\sum_{t=1}^{L-1}n_{t,t+1}(s_i,s_j)}{\sum_{t=1}^{L-1}n_t(s_i)}\]
where \(n_t(s_i)\) is the number of sequences that are longer than \(t\) and have state \(s_i\) at position \(t\); \(n_{t,t+1}(s_i,s_j)\) is the number of sequences with state \(s_i\) at position \(t\) and state \(s_j\) at position \(t+1\); and \(L\) is the maximal observed sequence length.
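To illustrate the formula (my own addition, not in the original), one rate can be computed by hand; the sketch below ignores the case weights, so it matches seqtrate(..., weighted = FALSE) rather than the weighted matrix shown next.
# Count FE -> EM moves across adjacent positions and divide by the number of
# times FE is observed before the last position
states <- as.matrix(mvad.seq)
from <- states[, -ncol(states)]
to <- states[, -1]
sum(from == "FE" & to == "EM") / sum(from == "FE")
seqtrate(mvad.seq, weighted = FALSE)[2, 1] # unweighted p(EM | FE) from TraMineR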
Transition Matrix
mvad.trate <- seqtrate(mvad.seq)
round(mvad.trate, 2)
## [-> EM] [-> FE] [-> HE] [-> JL] [-> SC] [-> TR]
## [EM ->] 0.99 0.00 0.00 0.01 0.00 0.00
## [FE ->] 0.03 0.95 0.01 0.01 0.00 0.00
## [HE ->] 0.01 0.00 0.99 0.00 0.00 0.00
## [JL ->] 0.04 0.01 0.00 0.94 0.00 0.01
## [SC ->] 0.01 0.01 0.02 0.01 0.95 0.00
## [TR ->] 0.04 0.00 0.00 0.01 0.00 0.94
The figures along the diagonal capture the stability of each state, while each off-diagonal figure gives the probability of changing from the row state to the column state; each row sums to 1. The highest off-diagonal rates (0.04) are observed from JL (joblessness) to EM (employment) and from TR (training) to EM. States JL and TR are also the most unstable, with a 6% probability of leaving the state at each position \(t\) (although 6% is not a large figure by itself).
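A quick sanity check (my own addition): each row of the transition matrix should sum to 1.
round(rowSums(mvad.trate), 2) # each row should return 1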
Time-varying transition rates
The following code produces the time-dependent transition matrices, i.e. transition rates that vary over time. The result is a 3-dimensional array of dimension \(k \times k \times (L-1)\), where each \(k \times k\) slice is the transition matrix between the \(k\) distinct states at a given position and \(L-1\) is the number of positions at which a transition can occur.
mvad.tvrate <- seqtrate(mvad.seq, time.varying = TRUE)
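# Added check (my own, not in the original): the result should be an array of
# dimension k x k x (L-1), i.e. 6 x 6 x 69 for 6 states over 70 monthly positions
dim(mvad.tvrate)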
trans_at1 <- round(mvad.tvrate[,,1],2)
trans_at2 <- round(mvad.tvrate[,,2], 2)
Transversal state distributions
seqstatd(mvad.seq[, 1:8])
## [State frequencies]
## Sep.93 Oct.93 Nov.93 Dec.93 Jan.94 Feb.94 Mar.94 Apr.94
## EM 0.034 0.036 0.044 0.051 0.055 0.058 0.066 0.067
## FE 0.223 0.217 0.215 0.215 0.209 0.206 0.204 0.200
## HE 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## JL 0.048 0.055 0.051 0.050 0.058 0.068 0.064 0.067
## SC 0.472 0.466 0.464 0.461 0.451 0.451 0.446 0.446
## TR 0.223 0.227 0.226 0.223 0.227 0.218 0.221 0.220
##
## [Valid states]
## Sep.93 Oct.93 Nov.93 Dec.93 Jan.94 Feb.94 Mar.94 Apr.94
## N 712 712 712 712 712 712 712 712
##
## [Entropy index]
## Sep.93 Oct.93 Nov.93 Dec.93 Jan.94 Feb.94 Mar.94 Apr.94
## H 0.72 0.73 0.73 0.74 0.75 0.76 0.77 0.77
# State distribution by time points
seqdplot(mvad.seq, group = mvad$gcse5eq, border = NA, main = c("GCSE below C", "GCSE at A-C"))
# Modal state plot by group
seqmsplot(mvad.seq, group = mvad$gcse5eq, border = NA, main = c("GCSE below C", "GCSE at A-C"))
Transversal entropy of state distributions
Shannon's entropy, also known as the entropy index, is a measure of the diversity of states observed at a given time point. The formula below gives the entropy index \(h_t\) at time \(t\); since it has the same form for every \(t\), we drop the subscript to avoid clutter.
\[h(p_1,...,p_a) = - \sum_{i=1}^a p_i \log(p_i) \]
where \(a\) is the size of the alphabet (the set of letter representations of the states; here \(a = 6\)) and \(p_i\) is the proportion of cases in state \(i\) at the given position. The entropy is 0 when all cases are in the same state and is maximal when the cases are equally distributed across the \(a\) states.
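As a check (my own addition), the entropy reported above for Sep.93 can be recomputed from the state proportions returned by seqstatd(), which by default normalizes by \(\log a\):
p <- seqstatd(mvad.seq)$Frequencies[, "Sep.93"]
p <- p[p > 0] # drop empty states (0 * log(0) is taken as 0)
-sum(p * log(p)) / log(length(alphabet(mvad.seq))) # should be about 0.72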
Self-note: since sequence data are by nature categorical, a mean state would not make much sense (unlike for numeric data), nor would a variance (defined as the squared distance from the mean). Instead, the mode and the entropy are the more usual summaries.
Sequence analysts, however, are also interested in numerical characteristics of sequences (e.g. the "distance" from the mode); we will turn to those measures later.
# Plot the entropy of the state distribution at each time point
seqHtplot(mvad.seq, group = mvad$gcse5eq, main = c("GCSE below C", "GCSE at A-C"))
Comment: for the group of individuals with lower qualifications at the end of compulsory education, the entropy index decreases notably over time, because most of them move into work after some training or schooling.
The group of more highly qualified individuals, on the other hand, tends to experience more diverse trajectories: after compulsory education, half of them continued schooling and then pursued higher education, while the other half either took some further education and then started working, or took some training and then started working.
It could also be interesting to look at the trajectories of males and females after compulsory education.
The number of transitions in a sequence \(x\) is \(N(x) = l_d(x) - 1\), where \(l_d(x)\) is the length of the sequence's DSS (sequence of Distinct Successive States).
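As an aside (my own illustration), the DSS itself can be extracted with seqdss(), and its length minus one gives the number of transitions:
seqdss(mvad.seq[1:4, ]) # distinct successive states (DSS) of the first 4 sequences
seqlength(seqdss(mvad.seq[1:4, ])) - 1 # number of transitions N(x)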
library(dplyr)
mvad %>% mutate(trans = seqtransn(mvad.seq)) -> mvad
summary(mvad$trans)
## Trans.
## Min. :0.000
## 1st Qu.:1.000
## Median :2.000
## Mean :2.097
## 3rd Qu.:3.000
## Max. :9.000
A subsequence \(y\) of \(x\) is composed of elements of \(x\) occurring in the same order as in \(x\). The number \(\phi(x)\) of distinct subsequences is used in the turbulence measure presented below.
In Figure 9 below, sequences 5 and 9 have the maximal number of transitions, while the number of subsequences is maximal for sequence 9 only.
Longitudinal characteristics
mvad %>% mutate(nsubseq = seqsubsn(mvad.seq)) -> mvad
seqistatd(mvad.seq[1:4, ]) # total time spent in each state by the first four sequences
## EM FE HE JL SC TR
## 1 68 0 0 0 0 2
## 2 0 36 34 0 0 0
## 3 10 34 0 2 0 24
## 4 14 0 0 9 0 47
The same Shannon-entropy formula is applied here:
\[h(\pi_1,...,\pi_a) = - \sum_{i=1}^a \pi_i \log(\pi_i)\]
where \(a\) is the size of the alphabet and \(\pi_i\) is the proportion of occurrences of the \(i\)th state within the considered sequence (the order of occurrences is ignored). The interpretation is similar to that of the transversal entropy. By default, the function seqient normalizes the entropy computed with the formula above by dividing it by its theoretical maximum, \(\log a\). Since the within-sequence entropy does not account for the order of the states in the sequence, sequences 7 and 9 have the same maximal normalized entropy of 1.
entropies <- seqient(mvad.seq)
summary(entropies)
## Entropy
## Min. :0.0000
## 1st Qu.:0.3409
## Median :0.3866
## Mean :0.3996
## 3rd Qu.:0.5508
## Max. :0.8548
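As a small check of the normalization (my own addition), dividing the unnormalized entropies by \(\log a\) reproduces the values summarized above:
head(seqient(mvad.seq, norm = FALSE) / log(length(alphabet(mvad.seq))))
head(entropies) # seqient() normalizes by log(a) by default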
OLS Regression explaining individual’s entropy
lm.ent <- lm(entropies ~ male + gcse5eq + fmpr + funemp, data = mvad)
summary(lm.ent)
##
## Call:
## lm(formula = entropies ~ male + gcse5eq + fmpr + funemp, data = mvad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.47838 -0.08812 0.00405 0.12789 0.47112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.38367 0.01243 30.868 < 2e-16 ***
## maleyes -0.04030 0.01325 -3.042 0.00244 **
## gcse5eqyes 0.06327 0.01405 4.503 7.85e-06 ***
## fmpryes 0.03145 0.01568 2.006 0.04523 *
## funempyes 0.03647 0.01803 2.023 0.04349 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1736 on 707 degrees of freedom
## Multiple R-squared: 0.06037, Adjusted R-squared: 0.05505
## F-statistic: 11.36 on 4 and 707 DF, p-value: 6.157e-09
Notice that so far we have not discussed sequence characteristics that capture the (normative) nature of the states, e.g. good/bad, favorable/unfavorable, positive/negative. In the current context, for instance, a sequence of EM/70 (employed throughout) could be considered more favorable than JL/70 (jobless throughout). We will turn to measures that reflect these characteristics later.
The turbulence \(T(x)\) of a sequence \(x\) accounts for the number \(\phi(x)\) of distinct subsequences of the DSS sequence and for the variance \(s_t^2(x)\) of the consecutive times \(t_j\) spent in the \(l_d(x)\) distinct states.
\[T(x) = log_2 \left( \phi(x) \frac{s^2_{t, max}(x) + 1}{s^2_t(x) +1 } \right)\]
where \(s^2_{t, max}(x)\) is the maximum value that \(s^2_t(x)\) can take given the total duration \(l(x) = \sum_j t_j\) of the sequence. This maximum is \(s^2_{t, max}(x) = (l_d(x) - 1)(1 - \bar t(x))^2\), where \(\bar t(x)\) is the mean consecutive time spent in the distinct states. Ritschard (2021) provides a revised version of the turbulence index that also takes into account states that are not visited.
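As an aside (an assumption about recent TraMineR versions, not something stated in the original text), this revised measure appears to be available through the type argument of seqST():
Turbulence2 <- seqST(mvad.seq, type = 2) # assumed: type = 2 requests the Ritschard (2021) variant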
From a prediction point of view, the larger the differences in state durations, and hence the higher their variance, the less uncertain the sequence. In that sense, a small duration variance indicates high complexity.
Turbulence <- seqST(mvad.seq)
summary(Turbulence)
## Turbulence
## Min. : 1.000
## 1st Qu.: 4.621
## Median : 5.454
## Mean : 5.770
## 3rd Qu.: 7.020
## Max. :12.959
hist(Turbulence, col = "cyan", main = "Sequence Turbulence")
The complexity index \(C(x)\) combines the number of transitions in the sequence with the within-sequence entropy.
\[C(x) = \sqrt{\frac{l_d(x)}{l(x)} \frac{h(x)}{h_{max}}}\]
where \(h_{max} = \log a\) is the theoretical maximum value of the entropy given the alphabet.
* A sequence with a single distinct state (no transitions, zero entropy) has the minimal complexity, 0.
* A sequence \(x\) such that (i) \(x\) contains each of the states in the alphabet, (ii) the same time \(l(x)/a\) is spent in each state, and (iii) the number of transitions is \(l(x) - 1\), has the maximal complexity, 1.
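In TraMineR the complexity index is computed with seqici(); a minimal call (my addition, not shown in the original) is:
ci <- seqici(mvad.seq) # complexity index C(x) of each sequence
summary(ci)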
Complexity vs Turbulence
Look at sequences 3, 4, 5 and 7.
[3] EM/6, FE/6
[4] EM/4, FE/4, EM/4
[5] EM/1, FE/1, EM/1, FE/1, EM/1, FE/1, EM/1, FE/1, EM/1, FE/1, EM/1, FE/1
[7] EM/3, FE/3, HE/3, JL/3
Complexity rank: [5] (0.73), [7] (0.45), [4] (0.3), [3] (0.22)
Turbulence rank: [5] (0.80), [7] (0.70), [4] (0.6), [3] (0.59)
Complexity "performs" better than turbulence when discriminating between sequences with zero duration variance (as is the case for all four sequences above).
The turbulence index does not account for states that are not visited, which tends to give a high turbulence value to seemingly simple sequences such as [3].
Compute the optimal matching (OM) distances using substitution costs based on the transition rates observed in the data and an indel cost of 1 (see Section 9.4). The resulting distance matrix is stored in the dist.om1 object.
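The code for this step is not reproduced here. A minimal sketch (my reconstruction, assuming the usual TraMineR/cluster workflow and that the grouping factor cl1.4fac used below comes from a 4-group Ward clustering of dist.om1) would be:
# Substitution costs derived from the observed transition rates
ccost <- seqsubm(mvad.seq, method = "TRATE")
# Optimal matching distances with an indel cost of 1
dist.om1 <- seqdist(mvad.seq, method = "OM", indel = 1, sm = ccost)
# Ward clustering of the OM distances, cut into 4 groups
library(cluster)
clusterward1 <- agnes(dist.om1, diss = TRUE, method = "ward")
cl1.4 <- cutree(clusterward1, k = 4)
cl1.4fac <- factor(cl1.4, labels = paste("Cluster", 1:4))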
### Plotting:
seqfplot(mvad.seq, group = cl1.4fac, border = NA) # sequence frequency plot within each cluster
seqdplot(mvad.seq, group = cl1.4fac, border = NA) # state distribution at each time point within each cluster
Instead of focusing on sequences of states, we can look at sequences of transitions or events (i.e. changes between states).
mvad.seqe <- seqecreate(mvad.seq) # create an event sequence object (one per individual)
fsubseq <- seqefsub(mvad.seqe, pmin.support = 0.05) # frequent event subsequences with a minimum support of 5%
plot(fsubseq[1:15], col = "cyan")
Interpreting this plot: