This tutorial introduces how to conduct sequence-based analysis in R. The example data represents the fertility history of respondents from National Longitudinal Survey of Youth 1997 (NLSY97). The fertility history data documents the number of children of the respondents from age 18 to 35, i.e., 18 age-specific fertility status in total. This tutorial incorporates some recent developments in R packages, e.g., ggseqplot and seqhandbook, for visualization and analysis.

Basic Setup

Install and load packages needed for conducting sequence-based analysis.

## Install all the packages needed for conducting sequence analysis and cluster analysis. 
pkgs <- c("colorspace", "glue", "haven", "Hmisc", "kableExtra", "knitr", 
          "patchwork", "pdftools", "RColorBrewer", "reshape2", "sjPlot", 
          "tidyverse", "TraMineR", "TraMineRextras", "foreign", "here", 
          "plot.matrix","gtsummary", "cluster", "ggseqplot",    
          "WeightedCluster","seqhandbook", "stargazer","nnet")

## Load all packages to library and adjust options
lapply(pkgs, library, character.only = TRUE)

# Note: lapply() is a function to apply a function over a list or a vector 

Import Data: NLSY97 Fertility History (Age 18-35)

Open the R script from the local folder you store the practicum materials (R script and demo data). The “here” package will recognize the path and use is as current path.

here() 
## [1] "C:/Users/whuang60/OneDrive - Johns Hopkins/HPC_work_archive_2022/sequence analysis short course/Practicum/SA_practicum material"
## Import the NLSY97 fertility history data (age 18-35) and name it "fertility")  
fertility <- read_dta(here("fertility_seq.dta")) 

# The here() function specifies the path for getting the example data, which is more efficient than setwd(). 
# All the output (e.g., result tables) will be saved in the current path unless it is specified otherwise. 

Inspect the data structure

## Number of observations and variables
dim(fertility) 
## [1] 8984   22

There are 8,984 respondents and 22 variables in the NLSY97 fertility history data.

## List all the variable names
ls(fertility)
##  [1] "C18"       "C19"       "C20"       "C21"       "C22"       "C23"      
##  [7] "C24"       "C25"       "C26"       "C27"       "C28"       "C29"      
## [13] "C30"       "C31"       "C32"       "C33"       "C34"       "C35"      
## [19] "education" "female"    "pubid"     "race"

The “pubid” is the unique identifier for the respondents. The fertility sequence data is stored in wide format. The variables starting with “C” are the sequence variables ranging from age 18 to 35 with equal interval. There are also 3 sociodemographic variables (i.e., gender, race, and education) that we will use later to conduct analysis to examine sequence-covariate associations.

## Check the unique values of the state variable, e.g., C35
unique(fertility$C35) 
## <labelled<double>[8]>: number of child at age 35
## [1] 8 1 3 2 4 5 6 7
## 
## Labels:
##  value      label
##      1    0 child
##      2    1 child
##      3    2 child
##      4    3 child
##      5    4 child
##      6    5 child
##      7 6 or above
##      8    missing

The state space includes 8 unique values, ranging from having “0 child” to “6+ child”, and the missing value is coded as a unique state.

Define Sequence Object

Extract the 18 sequence variables

## Select the 18 sequence variables: C18-C35
fertility.vars <- fertility %>%
  select(starts_with("C"))

Define labels for the state space

## Create state labels 
shortlabel <- c("C=0", "C=1", "C=2", "C=3", "C=4", "C=5", "C=6/+", "NA")
# The shortlabel will show in plotting as default. 
longlabel  <- c("0 child", "1 child", "2 child", "3 child", 
                "4 child", "5 child", "6 or above child", "attrited") 

Create state sequence object using seqdef() in package “TraMineR”

fertility.seq <- seqdef(fertility.vars,    #the previously extracted state variables
                        states=shortlabel, #the predefined short labels for states
                        labels=longlabel,  #the long labels for states
                        id=fertility$pubid, 
                        xtstep=1)          #the default is 1, the tick-mark is displayed at each position 
## Type ?seqdef() in R console to learn more about the available options

Set the color scheme for the state space

library(RColorBrewer)
display.brewer.all()

nb.cols <- 8
color1<- colorRampPalette(brewer.pal(8, "Spectral"))(nb.cols)
color1<- colorRampPalette(rev(brewer.pal(8, "Spectral")))(nb.cols) #reverse the color scheme
cpal(fertility.seq) <-color1

## the state variables are ordinal except for the attrited, thus use "Spectral" color scheme
## Plot the legend for the states in the defined fertility sequence object 
seqlegend(fertility.seq, cpal=color1, ncol=1, cex=1.5)

Basic Descriptives of Sequences

Mean time spent in each state

## Extract mean time spent in each of the 8 states and calculate the standard deviation 
statemeant<- as_tibble(round(seqmeant(fertility.seq, serr = TRUE)[,c(1,3)],1), 
                             rownames = "State")
print(statemeant)
## # A tibble: 8 × 3
##   State  Mean Stdev
##   <chr> <dbl> <dbl>
## 1 C=0     9.2   6.6
## 2 C=1     3.2   4.1
## 3 C=2     2.4   3.7
## 4 C=3     1.1   2.8
## 5 C=4     0.4   1.7
## 6 C=5     0.1   0.9
## 7 C=6/+   0.1   0.8
## 8 NA      1.5   4
## Plot mean time spent in each state using seqmtplot() 
seqmtplot(fertility.seq, with.legend=F,
          main="Mean Duration of Each State in Fertility Sequences")

The average time spent in childless status is 9.2 years. With the increase in the number of children, the mean time declines. Only a very small proportion of respondents ever had more than 3 children by age 35.

Mean number of episodes by state

Episode is a substring containing identical element. If the mean number of episodes for a state is large, it means that a large proportion of sequences have this state-episode. For the fertility history data, the average number of episodes will be low for the states with multiple children.

episodes <- as_tibble(round(seqmeant(seqdss(fertility.seq),  serr = TRUE)[,c(1,3)],1),
                          rownames = "State") %>%
                          rename(ep_Mean = 2, ep_SD = 3)
print(episodes)
## # A tibble: 8 × 3
##   State ep_Mean ep_SD
##   <chr>   <dbl> <dbl>
## 1 C=0       0.9   0.3
## 2 C=1       0.6   0.5
## 3 C=2       0.4   0.5
## 4 C=3       0.2   0.4
## 5 C=4       0.1   0.3
## 6 C=5       0     0.2
## 7 C=6/+     0     0.1
## 8 NA        0.2   0.4

Create a table containing both mean time spent in each state and number of episodes by state (mean & SD)

## Generate table containing both descriptive statistics 
table1 <- statemeant %>% inner_join(episodes, by = "State")

## Print table
kable(table1,
      col.names = c("State", "Mean Time", "SD", "Mean Episodes", "SD")) %>%
  kable_styling(bootstrap_options =  c("responsive", "hover", "condensed"),
                full_width = F)
State Mean Time SD Mean Episodes SD
C=0 9.2 6.6 0.9 0.3
C=1 3.2 4.1 0.6 0.5
C=2 2.4 3.7 0.4 0.5
C=3 1.1 2.8 0.2 0.4
C=4 0.4 1.7 0.1 0.3
C=5 0.1 0.9 0.0 0.2
C=6/+ 0.1 0.8 0.0 0.1
NA 1.5 4.0 0.2 0.4
## Or simply print
print(table1)
## # A tibble: 8 × 5
##   State  Mean Stdev ep_Mean ep_SD
##   <chr> <dbl> <dbl>   <dbl> <dbl>
## 1 C=0     9.2   6.6     0.9   0.3
## 2 C=1     3.2   4.1     0.6   0.5
## 3 C=2     2.4   3.7     0.4   0.5
## 4 C=3     1.1   2.8     0.2   0.4
## 5 C=4     0.4   1.7     0.1   0.3
## 6 C=5     0.1   0.9     0     0.2
## 7 C=6/+   0.1   0.8     0     0.1
## 8 NA      1.5   4       0.2   0.4

Transition Rate

Transition rates are the probabilities of transition from one state to another observed in the sequence data.The transition matrix shows that the respondents are most likely to stay in the same states of childlessness, i.e., transition rate is 0.93. Since number of children can only increase from lower to higher number, many of the transition rates are zero, e.g., it is not possible to transition from 3 children to 1 child (unless deceased, which is not accounted for in the demo data). That’s why only several cells have non-zero values.

## Calculate the transition rates and round up to 2 decimal places. 
trate <- round(seqtrate(fertility.seq),2) 
print(trate) 
##            [-> C=0] [-> C=1] [-> C=2] [-> C=3] [-> C=4] [-> C=5] [-> C=6/+]
## [C=0 ->]       0.93     0.06     0.00      0.0     0.00     0.00       0.00
## [C=1 ->]       0.00     0.85     0.14      0.0     0.00     0.00       0.00
## [C=2 ->]       0.00     0.00     0.89      0.1     0.00     0.00       0.00
## [C=3 ->]       0.00     0.00     0.00      0.9     0.08     0.00       0.00
## [C=4 ->]       0.00     0.00     0.00      0.0     0.91     0.07       0.01
## [C=5 ->]       0.00     0.00     0.00      0.0     0.00     0.90       0.09
## [C=6/+ ->]     0.00     0.00     0.00      0.0     0.00     0.00       0.98
## [NA ->]        0.00     0.00     0.00      0.0     0.00     0.00       0.00
##            [-> NA]
## [C=0 ->]      0.01
## [C=1 ->]      0.01
## [C=2 ->]      0.01
## [C=3 ->]      0.01
## [C=4 ->]      0.01
## [C=5 ->]      0.01
## [C=6/+ ->]    0.02
## [NA ->]       1.00
## Plot the transition matrix using ggseqtrplot() available in package "ggseqplot"
ggseqtrplot(fertility.seq, x_n.dodge = 2, dss = FALSE) +
  ggtitle("STS format")

Visualization of State Sequences: Transversal VS. Longitudinal

Transversal: Examing the data position-by-position

State Distribution Plot

State distribution plot (or known as chronogram) is a series of stacked bar charts at each position (i.e., age) of the sequence. It shows the proportions of states at each position (i.e., age for our example).

par(mar=c(4,4,4,4)) #set the margins of plotting area 
seqdplot(fertility.seq, xtlab=18:35, with.legend=F, ylab= "Relative Frequency", xlab= "Age", cex.axis=0.8)

With the increase of age, a higher proportion of respondents start to distribute in the states of having 1 or more children.

## The seqplot() can all be stratified by group of interest. 
# For illustration, we stratify the sequence distribution plot by education level. 
seqdplot(fertility.seq, xtlab=18:35, with.legend=F, 
         xlab= "Age",  group=fertility$education, cex.axis=0.8)

The plotting functions in the main workhorse of sequence analysis “TraMineR” is not too flexible to modify different components of the figure. The new package “ggseqplot” can be used. See an example below:

ggseqdplot(fertility.seq, group=fertility$education, facet_ncol=2, facet_nrow=2) 

From the state distribution plot by education level, we can see that college-educated respondents start to have children at later age. By age 30, over one third of them remain child-free.

Transversal Shannon’s Entropy

Transversal entropy is a measure of diversity of states observed at each position. A value of 0 means that all the cases in are in the same state and its value is maximal when the same proportion of cases are in each state.

## seqstatd() function returns the state relative frequencies, the number of valid states and the entropy of the state distribution at each position (i.e., age) in the sequence. 
sd <- seqstatd(fertility.seq)
plot(sd, type="Ht", main="Age-specific Shannon's Entropy", 
     xlab= "Age", xtlab=18:35)

The age-specific entropy increases over time, indicating that the fertility statuses become more diverse as individuals age.

Sequence Frequency Plot

The sequences are presented as stacks of successive states, ordered by their relative frequency in the dataset.

par(mar=c(4,4,4,4)) #set the margins of plotting area 
seqfplot(fertility.seq, with.legend=F, xlab= "Age", xtlab=18:35, cex.axis=0.8)

Child-free sequence is the most frequent, followed by the sequences with no fertility status observed.

Longitudinal: Within-Sequence Characteristics

Sequence Index Plot

Sequence index plot shows individual sequences as stacked bars. All the dimensions of temporality, i.e., timing, duration, and order, can be seen in such plot.

par(mar=c(4,4,4,4)) #set the margins of plotting area 
## seqiplot() shows the first 10 sequences
seqiplot(fertility.seq, with.legend=F, xlab= "Age", xtlab=18:35)  

## seqIplot() shows all the sequences
seqIplot(fertility.seq, sortv="from.start", with.legend=F, xlab= "Age", xtlab=18:35, cex.axis=0.8) 

Basic Characteristics within Sequences

## number of transitions 
ntrans <- seqtransn(fertility.seq)
## number of subsequences
subsn  <- seqsubsn(fertility.seq)
## longitudinal entropy 
entropy_long <- seqient(fertility.seq) 
## check the distribution of longitudinal entropy in the sample
hist(entropy_long ) 

By inspecting the distribution of longitudinal entropy, a substantial portion of sequences have value 0. It means that these individuals remain in the same state across the 18 years of observation, e.g., childless or missing all the time.

## turbulence 
turbulence <- seqST(fertility.seq)
## complexity 
complexity <- seqici(fertility.seq)
## Combine all the within-sequence characteristics just calculated
indices <-tibble(ntrans, subsn, turbulence, complexity, entropy_long) 

## correlation between the longitudinal characteristics
cor(indices)  
##                 ntrans     subsn turbulence complexity entropy_long
## ntrans       1.0000000 0.7980644  0.8156056  0.9873341    0.9450613
## subsn        0.7980644 1.0000000  0.5678313  0.7436734    0.6657299
## turbulence   0.8156056 0.5678313  1.0000000  0.8823426    0.9357408
## complexity   0.9873341 0.7436734  0.8823426  1.0000000    0.9847759
## entropy_long 0.9450613 0.6657299  0.9357408  0.9847759    1.0000000

In addition to these within-sequence characteristics, there are also sequence indices reflecting meaningful aspects of sequences, e.g., precarity index. These indices can be used as predictors or outcome variables in further analysis (Ritschard, 2021).

Regress within-sequence characteristics on sociodemographic variables

Here, we use turbulence and complexity as examples.

## prepare the data used for regression analysis
regdata <- fertility %>%
  select(female, race, education) %>%
  mutate(complexity = as.numeric(seqici(fertility.seq)),
         turbulence = as.numeric(seqST(fertility.seq, norm = TRUE))) 
## List all the variable names in the prepared dataset
ls(regdata)
## [1] "complexity" "education"  "female"     "race"       "turbulence"
## Use tbl_summary() in package "gtsummary" to produce a table 
# This table shows the summary statistics of all the variables in the prepared dataset (unweighted)
regdata  %>%
  tbl_summary(include = c(complexity, turbulence, female, race, education))
Characteristic N = 8,9841
complexity 0.14 (0.00, 0.24)
turbulence 0.19 (0.00, 0.29)
female
    male 4,599 (51%)
    female 4,385 (49%)
race
    White 4,665 (52%)
    Black 2,335 (26%)
    Hispanic 1,901 (21%)
    Mixed 83 (0.9%)
education
    None 942 (10%)
    GED/HS 4,755 (53%)
    Associate Degree 757 (8.4%)
    BA or above 2,530 (28%)
1 Median (IQR); n (%)
## Estimate linear regression models for complexity and turbulence
lm.complexity <- lm(complexity ~ female + race + education , data = regdata )
lm.turbulence <- lm(turbulence ~ female + race + education, data = regdata )

## Generate regression table using "gtsummary" package
t1 <- tbl_regression(lm.complexity)
t2 <- tbl_regression(lm.turbulence)
## Merge tables: t1 & t2
  tbl_merge(
    tbls = list(t1, t2),
    tab_spanner = c("**Complexity**", "**Turbulence**")
  )
Characteristic Complexity Turbulence
Beta 95% CI1 p-value Beta 95% CI1 p-value
female
    male
    female 0.03 0.03, 0.04 <0.001 0.03 0.02, 0.03 <0.001
race
    White
    Black 0.01 0.01, 0.02 <0.001 0.01 0.01, 0.02 <0.001
    Hispanic 0.02 0.01, 0.03 <0.001 0.02 0.01, 0.03 <0.001
    Mixed 0.03 0.00, 0.06 0.047 0.01 -0.01, 0.04 0.3
education
    None
    GED/HS -0.01 -0.01, 0.00 0.3 0.01 0.00, 0.02 0.046
    Associate Degree -0.02 -0.03, -0.01 0.002 0.00 -0.01, 0.01 0.9
    BA or above -0.05 -0.06, -0.04 <0.001 -0.04 -0.05, -0.02 <0.001
1 CI = Confidence Interval

The regression results show that women, racial minorities, low-educated have higher turbulence and complexity in their fertility history.

Dissimilarity-based Analysis

Optimal matching and alternatives

Calculate the dissimilarity matrix using optimal matching (OM) method. Set substitution cost as transition rate and indel cost is calculated using “auto” argument. For illustration, only OM and Hamming distance are calculated. Only OM-based dissimilarity matrix is further used to conduct cluster analysis.

## The illustration of cluster analysis uses OM distance as an example. The substitution cost is specified as the transition rate. 
OM  <- seqdist(fertility.seq, method="OM", sm="TRATE", indel="auto")
# When "auto", the indel is set as max(sm)/2 when sm is a matrix and is computed by means of seqcost when sm is a string specifying a cost method.

Given any two sequences of equal length, the Hamming distance is the number of positions at which corresponding states are different. Thus, no insertion or deletion is considered.

## Calculate dissimilarity matrix using Hamming method
HAM <- seqdist(fertility.seq, method= "HAM") 

Different dimensions of temporarily, i.e., timing, duration, and order, can be taken into account by applying different methods for calculating dissimilarity matrix. See a comparison of method-temporarily in Studer and Ritschard (2016) paper.

Identifying typical patterns among sequences: Cluster analyis using the package “WeightedCluster”

Hierarchical clustering

## Perform hierarchical clustering using "ward.D" method, the input is the OM-based dissimilarity matrix (specified in as.dist)
wardCluster <- hclust(as.dist(OM), method="ward.D")
## Dendrogram represents the agglomerative (bottom up) clustering procedure (see slide 39)
as.dendrogram(wardCluster) %>% plot(leaflab="none")

“One starts out from the observations, each of them being considered as a group. On each iteration, the two closest groups (or initially observations) are grouped, until all the observations form a single group. The agglomeration schedule, i.e. the succession of groupings performed, represents the clustering procedure in the form of a tree which is called a dendrogram” (Studer, 2013:6).

## seq_heatmap(): index plot of sequences ordered according to a dendrogram
seq_heatmap(fertility.seq, wardCluster)

Determine the optimal number of clusters

Various quality measures can be computed to determine the optimal number of clusters. Good clustering needs to maximize the homogeneity within cluster and heterogeneity between group (i.e., whether clusters are distinct from each other). For details, please refer to the “WeightedCluster” manual (Studer, 2013, page 33-34).

## as.clustrange() produces the all the quality indices 
wardRange<-as.clustrange(wardCluster, diss=OM)
## print the quality indices for the two best solutions
summary(wardRange, max.rank=2)
##      1. N groups     1.  stat 2. N groups     2.  stat
## PBC            9 6.669443e-01           8 6.662702e-01
## HG            20 9.443307e-01          19 9.403195e-01
## HGSD          20 9.439064e-01          19 9.399031e-01
## ASW            7 4.556711e-01           8 4.524956e-01
## ASWw           7 4.561074e-01           8 4.530312e-01
## CH             2 3.209675e+03           3 2.726594e+03
## R2            20 7.550712e-01          19 7.483761e-01
## CHsq           7 5.445387e+03           6 5.427821e+03
## R2sq          20 8.988933e-01          19 8.954413e-01
## HC            20 4.379441e-02           9 4.428889e-02
# 7-cluster is the likely best solution 
## Plot the quality indices
plot(wardRange, stat=c("ASWw", "HG", "PBC", "HC"), norm="zscore") 

## cut the dendrogram at 7 clusters 
cluster7 <- cutree(wardCluster, k=7)
table(cluster7) # the distribution of cluster membership 
# Use package "ggseqplot" to produce state distribution plot by cluster membership 
ggseqdplot(fertility.seq, group=cluster7, facet_ncol=3, facet_nrow=3)

Partitioning around medoids (PAM)

PAM aims to “obtain the best partitioning of a data set into a predefined number k of groups.” It aims to identify the k best representatives of groups, called medoids. A medoid is defined as the observation of a group having the smallest weighed sum of distances from the other observations in this group (Studer, 2013:10)

## Considering the result of hierarchical clustering, we set 7 as the number of groups:
pamcluster7 <- wcKMedoids(OM, k=7)

## print the medoids in SPS format
print(fertility.seq[unique(pamcluster7$clustering),], format="SPS")
##      Sequence                               
## 9020 (C=0,18)                               
## 8970 (C=0,5)-(C=1,13)                       
## 1521 (C=0,3)-(C=1,4)-(C=2,11)               
## 634  (C=0,9)-(C=1,3)-(C=2,6)                
## 846  (C=0,13)-(C=1,4)-(C=2,1)               
## 7996 (C=0,2)-(C=1,3)-(C=2,3)-(C=3,8)-(C=4,2)
## 8812 (C=0,5)-(NA,13)

The medoids represented in SPS format can inform the naming of the clusters.

## Calculate quality measures 
pamRange <- wcKMedRange(OM, kvals=2:20)
summary(pamRange, max.rank=2) 
##      1. N groups     1.  stat 2. N groups     2.  stat
## PBC            6 6.690566e-01           8 6.514679e-01
## HG            20 9.695848e-01          16 9.669670e-01
## HGSD          20 9.693359e-01          16 9.667432e-01
## ASW            9 5.063722e-01          10 4.966375e-01
## ASWw           9 5.069428e-01          10 4.972795e-01
## CH             2 3.112973e+03           3 2.893972e+03
## R2            20 7.751902e-01          19 7.714938e-01
## CHsq           6 6.599013e+03           9 6.509341e+03
## R2sq          20 9.217484e-01          19 9.160056e-01
## HC            16 2.411017e-02          20 2.459379e-02

The quality measures show that 6 or 9 clusters may be better than 7 clusters. For illustration purpose, 7 is used. We can always inspect different solutions to make sure that the clustering makes substantive sense while improving quality measures.

Naming the clusters

Labeling the clusters is an important step. Here, each cluster is named according to the medoid of each cluster. The labels have to be succinct and informative to account for the state change and temporality.

## Label the resulting clusters yielded by PAM
type <-factor(pamcluster7$clustering , levels=c(8982, 8933, 1520, 634, 845, 7969, 8778),
              labels=c("Childless",
                       "Early 20, single child",
                       "Early 20, two children",
                       "Mid-20, two children",
                       "Early-30, two children",
                       "Late-teens, multiple children",
                       "incomplete"))
table(type) 
## type
##                     Childless        Early 20, single child 
##                          2675                          1032 
##        Early 20, two children          Mid-20, two children 
##                           984                          1009 
##        Early-30, two children Late-teens, multiple children 
##                          1103                          1300 
##                    incomplete 
##                           881
## Plot the state distribution by cluster membership
ggseqdplot(fertility.seq, group=type, facet_ncol=3, facet_nrow=3)

From the state distribution plots stratified by PAM cluster membership, we can see that the most prevalent fertility pattern is the childless from age 18 to 35 (n=2,675). The other clusters are distinct from each other in terms of timing and number of children. There is also a cluster (i.e., “incomplete”) representing the sequences loaded with missing values.

Regress cluster membership on covariates

In addition to identification of the typical patterns of fertility history among NLSY97 cohort. Researchers would like to know what factors predict the cluster membership. To do so, the resulting cluster membership is used as the outcome variable in the multinomial logistic regression model. OR, the cluster membership can be used to predict other outcomes of interest, e.g., fertility history and midlife health. We focus on the first senario in this practicum.

## Prepare the data set for the multinomial logistic regression. 
mdata <- fertility %>%
  select(female, race, education) %>%
  mutate(type = as_factor(type)) 
ls(mdata)
## [1] "education" "female"    "race"      "type"
## Estimate multinomial logistic regression model to predict cluster membership. 
multinomial <- multinom(type ~ female + race + education, data=mdata)
## Produce the result table for estimated multinomial logistic regression model. 
# Create a summary table with tbl_regression
model_table <- tbl_regression(multinomial, 
               exponentiate = TRUE) %>% 
  add_glance_table(c(nobs, AIC)) 
## ℹ Multinomial models have a different underlying structure than the models
## gtsummary was designed for. Other gtsummary functions designed to work with
## tbl_regression objects may yield unexpected results.
model_table
Characteristic OR1 95% CI1 p-value
Early 20, single child
female
    male
    female 1.91 1.64, 2.22 <0.001
race
    White
    Black 2.06 1.73, 2.44 <0.001
    Hispanic 1.25 1.02, 1.52 0.030
    Mixed 1.56 0.70, 3.49 0.3
education
    None
    GED/HS 0.68 0.51, 0.90 0.008
    Associate Degree 0.48 0.33, 0.69 <0.001
    BA or above 0.16 0.11, 0.22 <0.001
Early 20, two children
female
    male
    female 3.33 2.85, 3.91 <0.001
race
    White
    Black 2.02 1.68, 2.42 <0.001
    Hispanic 1.71 1.40, 2.07 <0.001
    Mixed 1.34 0.55, 3.27 0.5
education
    None
    GED/HS 0.45 0.35, 0.59 <0.001
    Associate Degree 0.31 0.22, 0.44 <0.001
    BA or above 0.08 0.06, 0.11 <0.001
Mid-20, two children
female
    male
    female 1.44 1.24, 1.67 <0.001
race
    White
    Black 0.77 0.64, 0.93 0.007
    Hispanic 0.94 0.78, 1.14 0.6
    Mixed 1.34 0.64, 2.79 0.4
education
    None
    GED/HS 0.89 0.63, 1.26 0.5
    Associate Degree 0.78 0.52, 1.18 0.2
    BA or above 0.71 0.50, 1.02 0.061
Early-30, two children
female
    male
    female 1.27 1.10, 1.47 0.001
race
    White
    Black 0.72 0.60, 0.87 <0.001
    Hispanic 0.83 0.69, 1.01 0.065
    Mixed 1.33 0.66, 2.68 0.4
education
    None
    GED/HS 1.44 0.93, 2.24 0.10
    Associate Degree 2.00 1.23, 3.25 0.005
    BA or above 2.04 1.32, 3.18 0.001
Late-teens, multiple children
female
    male
    female 3.76 3.25, 4.36 <0.001
race
    White
    Black 2.30 1.94, 2.73 <0.001
    Hispanic 1.90 1.59, 2.28 <0.001
    Mixed 2.53 1.25, 5.14 0.010
education
    None
    GED/HS 0.30 0.24, 0.39 <0.001
    Associate Degree 0.19 0.14, 0.27 <0.001
    BA or above 0.04 0.03, 0.06 <0.001
incomplete
female
    male
    female 1.54 1.31, 1.82 <0.001
race
    White
    Black 0.55 0.44, 0.68 <0.001
    Hispanic 0.59 0.48, 0.73 <0.001
    Mixed 0.81 0.32, 2.08 0.7
education
    None
    GED/HS 0.16 0.12, 0.20 <0.001
    Associate Degree 0.05 0.04, 0.08 <0.001
    BA or above 0.04 0.03, 0.05 <0.001
NA
No. Obs. 8,984
AIC 31,115
1 OR = Odds Ratio, CI = Confidence Interval

Taking the cluster “late-teens, multiple children” as an example, women are more likely to enter this fertility trajectory than men. Racial minorities are more likely than their White counterparts to have this childbearing pattern. There is an educational gradient in the probability of entering this fertility trajectory, e.g., college-educated are least likely to enter this trajectory.

## OR use stargazer to produce publication-quality table
multinomial.rrr <-exp(coef(multinomial))
stargazer(multinomial, type="html", coef=list(multinomial.rrr), out="multinomial_rrr.htm")

# You can find the table in the current folder here() specifies

Reference

  • Raab, M., & Struffolino, E. (2022). Sequence Analysis (1st edition). SAGE Publications, Inc.
  • Ritschard, G. (2021). Measuring the Nature of Individual Sequences. Sociological Methods & Research, 00491241211036156. https://doi.org/10.1177/00491241211036156
  • Studer M (2013). “WeightedCluster Library Manual: A practical guide to creating typologies of trajectories in the social sciences with R.”
  • Studer, M., & Ritschard, G. (2016). What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures. Journal of the Royal Statistical Society: Series A (Statistics in Society), 179(2), 481–511. https://doi.org/10.1111/rssa.12125

Please contact whuang60@jhu.edu if you have any questions.