This tutorial introduces how to conduct sequence-based analysis in R. The example data represents the fertility history of respondents from National Longitudinal Survey of Youth 1997 (NLSY97). The fertility history data documents the number of children of the respondents from age 18 to 35, i.e., 18 age-specific fertility status in total. This tutorial incorporates some recent developments in R packages, e.g., ggseqplot and seqhandbook, for visualization and analysis.
Basic Setup
Install and load packages needed for conducting sequence-based analysis.
## Install all the packages needed for conducting sequence analysis and cluster analysis.
pkgs <- c("colorspace", "glue", "haven", "Hmisc", "kableExtra", "knitr",
"patchwork", "pdftools", "RColorBrewer", "reshape2", "sjPlot",
"tidyverse", "TraMineR", "TraMineRextras", "foreign", "here",
"plot.matrix","gtsummary", "cluster", "ggseqplot",
"WeightedCluster","seqhandbook", "stargazer","nnet")
## Load all packages to library and adjust options
lapply(pkgs, library, character.only = TRUE)
# Note: lapply() is a function to apply a function over a list or a vector Import Data: NLSY97 Fertility History (Age 18-35)
Open the R script from the local folder you store the practicum materials (R script and demo data). The “here” package will recognize the path and use is as current path.
## [1] "C:/Users/whuang60/OneDrive - Johns Hopkins/HPC_work_archive_2022/sequence analysis short course/Practicum/SA_practicum material"
## Import the NLSY97 fertility history data (age 18-35) and name it "fertility")
fertility <- read_dta(here("fertility_seq.dta"))
# The here() function specifies the path for getting the example data, which is more efficient than setwd().
# All the output (e.g., result tables) will be saved in the current path unless it is specified otherwise. Inspect the data structure
## [1] 8984 22
There are 8,984 respondents and 22 variables in the NLSY97 fertility history data.
## [1] "C18" "C19" "C20" "C21" "C22" "C23"
## [7] "C24" "C25" "C26" "C27" "C28" "C29"
## [13] "C30" "C31" "C32" "C33" "C34" "C35"
## [19] "education" "female" "pubid" "race"
The “pubid” is the unique identifier for the respondents. The fertility sequence data is stored in wide format. The variables starting with “C” are the sequence variables ranging from age 18 to 35 with equal interval. There are also 3 sociodemographic variables (i.e., gender, race, and education) that we will use later to conduct analysis to examine sequence-covariate associations.
## <labelled<double>[8]>: number of child at age 35
## [1] 8 1 3 2 4 5 6 7
##
## Labels:
## value label
## 1 0 child
## 2 1 child
## 3 2 child
## 4 3 child
## 5 4 child
## 6 5 child
## 7 6 or above
## 8 missing
The state space includes 8 unique values, ranging from having “0 child” to “6+ child”, and the missing value is coded as a unique state.
Define Sequence Object
Extract the 18 sequence variables
Define labels for the state space
Create state sequence object using seqdef() in package “TraMineR”
fertility.seq <- seqdef(fertility.vars, #the previously extracted state variables
states=shortlabel, #the predefined short labels for states
labels=longlabel, #the long labels for states
id=fertility$pubid,
xtstep=1) #the default is 1, the tick-mark is displayed at each position
## Type ?seqdef() in R console to learn more about the available optionsSet the color scheme for the state space
nb.cols <- 8
color1<- colorRampPalette(brewer.pal(8, "Spectral"))(nb.cols)
color1<- colorRampPalette(rev(brewer.pal(8, "Spectral")))(nb.cols) #reverse the color scheme
cpal(fertility.seq) <-color1
## the state variables are ordinal except for the attrited, thus use "Spectral" color scheme## Plot the legend for the states in the defined fertility sequence object
seqlegend(fertility.seq, cpal=color1, ncol=1, cex=1.5)Basic Descriptives of Sequences
Mean time spent in each state
## Extract mean time spent in each of the 8 states and calculate the standard deviation
statemeant<- as_tibble(round(seqmeant(fertility.seq, serr = TRUE)[,c(1,3)],1),
rownames = "State")
print(statemeant)## # A tibble: 8 × 3
## State Mean Stdev
## <chr> <dbl> <dbl>
## 1 C=0 9.2 6.6
## 2 C=1 3.2 4.1
## 3 C=2 2.4 3.7
## 4 C=3 1.1 2.8
## 5 C=4 0.4 1.7
## 6 C=5 0.1 0.9
## 7 C=6/+ 0.1 0.8
## 8 NA 1.5 4
## Plot mean time spent in each state using seqmtplot()
seqmtplot(fertility.seq, with.legend=F,
main="Mean Duration of Each State in Fertility Sequences")The average time spent in childless status is 9.2 years. With the increase in the number of children, the mean time declines. Only a very small proportion of respondents ever had more than 3 children by age 35.
Mean number of episodes by state
Episode is a substring containing identical element. If the mean number of episodes for a state is large, it means that a large proportion of sequences have this state-episode. For the fertility history data, the average number of episodes will be low for the states with multiple children.
episodes <- as_tibble(round(seqmeant(seqdss(fertility.seq), serr = TRUE)[,c(1,3)],1),
rownames = "State") %>%
rename(ep_Mean = 2, ep_SD = 3)
print(episodes)## # A tibble: 8 × 3
## State ep_Mean ep_SD
## <chr> <dbl> <dbl>
## 1 C=0 0.9 0.3
## 2 C=1 0.6 0.5
## 3 C=2 0.4 0.5
## 4 C=3 0.2 0.4
## 5 C=4 0.1 0.3
## 6 C=5 0 0.2
## 7 C=6/+ 0 0.1
## 8 NA 0.2 0.4
Create a table containing both mean time spent in each state and number of episodes by state (mean & SD)
## Generate table containing both descriptive statistics
table1 <- statemeant %>% inner_join(episodes, by = "State")
## Print table
kable(table1,
col.names = c("State", "Mean Time", "SD", "Mean Episodes", "SD")) %>%
kable_styling(bootstrap_options = c("responsive", "hover", "condensed"),
full_width = F)| State | Mean Time | SD | Mean Episodes | SD |
|---|---|---|---|---|
| C=0 | 9.2 | 6.6 | 0.9 | 0.3 |
| C=1 | 3.2 | 4.1 | 0.6 | 0.5 |
| C=2 | 2.4 | 3.7 | 0.4 | 0.5 |
| C=3 | 1.1 | 2.8 | 0.2 | 0.4 |
| C=4 | 0.4 | 1.7 | 0.1 | 0.3 |
| C=5 | 0.1 | 0.9 | 0.0 | 0.2 |
| C=6/+ | 0.1 | 0.8 | 0.0 | 0.1 |
| NA | 1.5 | 4.0 | 0.2 | 0.4 |
## # A tibble: 8 × 5
## State Mean Stdev ep_Mean ep_SD
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 C=0 9.2 6.6 0.9 0.3
## 2 C=1 3.2 4.1 0.6 0.5
## 3 C=2 2.4 3.7 0.4 0.5
## 4 C=3 1.1 2.8 0.2 0.4
## 5 C=4 0.4 1.7 0.1 0.3
## 6 C=5 0.1 0.9 0 0.2
## 7 C=6/+ 0.1 0.8 0 0.1
## 8 NA 1.5 4 0.2 0.4
Transition Rate
Transition rates are the probabilities of transition from one state to another observed in the sequence data.The transition matrix shows that the respondents are most likely to stay in the same states of childlessness, i.e., transition rate is 0.93. Since number of children can only increase from lower to higher number, many of the transition rates are zero, e.g., it is not possible to transition from 3 children to 1 child (unless deceased, which is not accounted for in the demo data). That’s why only several cells have non-zero values.
## Calculate the transition rates and round up to 2 decimal places.
trate <- round(seqtrate(fertility.seq),2)
print(trate) ## [-> C=0] [-> C=1] [-> C=2] [-> C=3] [-> C=4] [-> C=5] [-> C=6/+]
## [C=0 ->] 0.93 0.06 0.00 0.0 0.00 0.00 0.00
## [C=1 ->] 0.00 0.85 0.14 0.0 0.00 0.00 0.00
## [C=2 ->] 0.00 0.00 0.89 0.1 0.00 0.00 0.00
## [C=3 ->] 0.00 0.00 0.00 0.9 0.08 0.00 0.00
## [C=4 ->] 0.00 0.00 0.00 0.0 0.91 0.07 0.01
## [C=5 ->] 0.00 0.00 0.00 0.0 0.00 0.90 0.09
## [C=6/+ ->] 0.00 0.00 0.00 0.0 0.00 0.00 0.98
## [NA ->] 0.00 0.00 0.00 0.0 0.00 0.00 0.00
## [-> NA]
## [C=0 ->] 0.01
## [C=1 ->] 0.01
## [C=2 ->] 0.01
## [C=3 ->] 0.01
## [C=4 ->] 0.01
## [C=5 ->] 0.01
## [C=6/+ ->] 0.02
## [NA ->] 1.00
## Plot the transition matrix using ggseqtrplot() available in package "ggseqplot"
ggseqtrplot(fertility.seq, x_n.dodge = 2, dss = FALSE) +
ggtitle("STS format")Visualization of State Sequences: Transversal VS. Longitudinal
Transversal: Examing the data position-by-position
Modal state plot: the most frequent state at each position
par(mar=c(4,4,4,4)) #set the margins of plotting area
seqmsplot(fertility.seq, ylab="Relative Frequency", xlab= "Age", xtlab=18:35, with.legend=F, cex.axis=0.8)The modal state remains to be childless from age 18 to 35, reflecting the overall delaying of childbearing in NLSY97 cohort.
State Distribution Plot
State distribution plot (or known as chronogram) is a series of stacked bar charts at each position (i.e., age) of the sequence. It shows the proportions of states at each position (i.e., age for our example).
par(mar=c(4,4,4,4)) #set the margins of plotting area
seqdplot(fertility.seq, xtlab=18:35, with.legend=F, ylab= "Relative Frequency", xlab= "Age", cex.axis=0.8)With the increase of age, a higher proportion of respondents start to distribute in the states of having 1 or more children.
## The seqplot() can all be stratified by group of interest.
# For illustration, we stratify the sequence distribution plot by education level.
seqdplot(fertility.seq, xtlab=18:35, with.legend=F,
xlab= "Age", group=fertility$education, cex.axis=0.8)The plotting functions in the main workhorse of sequence analysis “TraMineR” is not too flexible to modify different components of the figure. The new package “ggseqplot” can be used. See an example below:
From the state distribution plot by education level, we can see that college-educated respondents start to have children at later age. By age 30, over one third of them remain child-free.
Transversal Shannon’s Entropy
Transversal entropy is a measure of diversity of states observed at each position. A value of 0 means that all the cases in are in the same state and its value is maximal when the same proportion of cases are in each state.
## seqstatd() function returns the state relative frequencies, the number of valid states and the entropy of the state distribution at each position (i.e., age) in the sequence.
sd <- seqstatd(fertility.seq)
plot(sd, type="Ht", main="Age-specific Shannon's Entropy",
xlab= "Age", xtlab=18:35)The age-specific entropy increases over time, indicating that the fertility statuses become more diverse as individuals age.
Sequence Frequency Plot
The sequences are presented as stacks of successive states, ordered by their relative frequency in the dataset.
par(mar=c(4,4,4,4)) #set the margins of plotting area
seqfplot(fertility.seq, with.legend=F, xlab= "Age", xtlab=18:35, cex.axis=0.8)Child-free sequence is the most frequent, followed by the sequences with no fertility status observed.
Longitudinal: Within-Sequence Characteristics
Sequence Index Plot
Sequence index plot shows individual sequences as stacked bars. All the dimensions of temporality, i.e., timing, duration, and order, can be seen in such plot.
par(mar=c(4,4,4,4)) #set the margins of plotting area
## seqiplot() shows the first 10 sequences
seqiplot(fertility.seq, with.legend=F, xlab= "Age", xtlab=18:35) ## seqIplot() shows all the sequences
seqIplot(fertility.seq, sortv="from.start", with.legend=F, xlab= "Age", xtlab=18:35, cex.axis=0.8) Basic Characteristics within Sequences
## number of transitions
ntrans <- seqtransn(fertility.seq)
## number of subsequences
subsn <- seqsubsn(fertility.seq)## longitudinal entropy
entropy_long <- seqient(fertility.seq)
## check the distribution of longitudinal entropy in the sample
hist(entropy_long ) By inspecting the distribution of longitudinal entropy, a substantial portion of sequences have value 0. It means that these individuals remain in the same state across the 18 years of observation, e.g., childless or missing all the time.
## Combine all the within-sequence characteristics just calculated
indices <-tibble(ntrans, subsn, turbulence, complexity, entropy_long)
## correlation between the longitudinal characteristics
cor(indices) ## ntrans subsn turbulence complexity entropy_long
## ntrans 1.0000000 0.7980644 0.8156056 0.9873341 0.9450613
## subsn 0.7980644 1.0000000 0.5678313 0.7436734 0.6657299
## turbulence 0.8156056 0.5678313 1.0000000 0.8823426 0.9357408
## complexity 0.9873341 0.7436734 0.8823426 1.0000000 0.9847759
## entropy_long 0.9450613 0.6657299 0.9357408 0.9847759 1.0000000
In addition to these within-sequence characteristics, there are also sequence indices reflecting meaningful aspects of sequences, e.g., precarity index. These indices can be used as predictors or outcome variables in further analysis (Ritschard, 2021).
Regress within-sequence characteristics on sociodemographic variables
Here, we use turbulence and complexity as examples.
## prepare the data used for regression analysis
regdata <- fertility %>%
select(female, race, education) %>%
mutate(complexity = as.numeric(seqici(fertility.seq)),
turbulence = as.numeric(seqST(fertility.seq, norm = TRUE))) ## [1] "complexity" "education" "female" "race" "turbulence"
## Use tbl_summary() in package "gtsummary" to produce a table
# This table shows the summary statistics of all the variables in the prepared dataset (unweighted)
regdata %>%
tbl_summary(include = c(complexity, turbulence, female, race, education))| Characteristic | N = 8,9841 |
|---|---|
| complexity | 0.14 (0.00, 0.24) |
| turbulence | 0.19 (0.00, 0.29) |
| female | |
| male | 4,599 (51%) |
| female | 4,385 (49%) |
| race | |
| White | 4,665 (52%) |
| Black | 2,335 (26%) |
| Hispanic | 1,901 (21%) |
| Mixed | 83 (0.9%) |
| education | |
| None | 942 (10%) |
| GED/HS | 4,755 (53%) |
| Associate Degree | 757 (8.4%) |
| BA or above | 2,530 (28%) |
| 1 Median (IQR); n (%) | |
## Estimate linear regression models for complexity and turbulence
lm.complexity <- lm(complexity ~ female + race + education , data = regdata )
lm.turbulence <- lm(turbulence ~ female + race + education, data = regdata )
## Generate regression table using "gtsummary" package
t1 <- tbl_regression(lm.complexity)
t2 <- tbl_regression(lm.turbulence)## Merge tables: t1 & t2
tbl_merge(
tbls = list(t1, t2),
tab_spanner = c("**Complexity**", "**Turbulence**")
)| Characteristic | Complexity | Turbulence | ||||
|---|---|---|---|---|---|---|
| Beta | 95% CI1 | p-value | Beta | 95% CI1 | p-value | |
| female | ||||||
| male | — | — | — | — | ||
| female | 0.03 | 0.03, 0.04 | <0.001 | 0.03 | 0.02, 0.03 | <0.001 |
| race | ||||||
| White | — | — | — | — | ||
| Black | 0.01 | 0.01, 0.02 | <0.001 | 0.01 | 0.01, 0.02 | <0.001 |
| Hispanic | 0.02 | 0.01, 0.03 | <0.001 | 0.02 | 0.01, 0.03 | <0.001 |
| Mixed | 0.03 | 0.00, 0.06 | 0.047 | 0.01 | -0.01, 0.04 | 0.3 |
| education | ||||||
| None | — | — | — | — | ||
| GED/HS | -0.01 | -0.01, 0.00 | 0.3 | 0.01 | 0.00, 0.02 | 0.046 |
| Associate Degree | -0.02 | -0.03, -0.01 | 0.002 | 0.00 | -0.01, 0.01 | 0.9 |
| BA or above | -0.05 | -0.06, -0.04 | <0.001 | -0.04 | -0.05, -0.02 | <0.001 |
| 1 CI = Confidence Interval | ||||||
The regression results show that women, racial minorities, low-educated have higher turbulence and complexity in their fertility history.
Dissimilarity-based Analysis
Optimal matching and alternatives
Calculate the dissimilarity matrix using optimal matching (OM) method. Set substitution cost as transition rate and indel cost is calculated using “auto” argument. For illustration, only OM and Hamming distance are calculated. Only OM-based dissimilarity matrix is further used to conduct cluster analysis.
## The illustration of cluster analysis uses OM distance as an example. The substitution cost is specified as the transition rate.
OM <- seqdist(fertility.seq, method="OM", sm="TRATE", indel="auto")
# When "auto", the indel is set as max(sm)/2 when sm is a matrix and is computed by means of seqcost when sm is a string specifying a cost method.Given any two sequences of equal length, the Hamming distance is the number of positions at which corresponding states are different. Thus, no insertion or deletion is considered.
Different dimensions of temporarily, i.e., timing, duration, and order, can be taken into account by applying different methods for calculating dissimilarity matrix. See a comparison of method-temporarily in Studer and Ritschard (2016) paper.
Identifying typical patterns among sequences: Cluster analyis using the package “WeightedCluster”
Hierarchical clustering
## Perform hierarchical clustering using "ward.D" method, the input is the OM-based dissimilarity matrix (specified in as.dist)
wardCluster <- hclust(as.dist(OM), method="ward.D")## Dendrogram represents the agglomerative (bottom up) clustering procedure (see slide 39)
as.dendrogram(wardCluster) %>% plot(leaflab="none")“One starts out from the observations, each of them being considered as a group. On each iteration, the two closest groups (or initially observations) are grouped, until all the observations form a single group. The agglomeration schedule, i.e. the succession of groupings performed, represents the clustering procedure in the form of a tree which is called a dendrogram” (Studer, 2013:6).
## seq_heatmap(): index plot of sequences ordered according to a dendrogram
seq_heatmap(fertility.seq, wardCluster)Determine the optimal number of clusters
Various quality measures can be computed to determine the optimal number of clusters. Good clustering needs to maximize the homogeneity within cluster and heterogeneity between group (i.e., whether clusters are distinct from each other). For details, please refer to the “WeightedCluster” manual (Studer, 2013, page 33-34).
## as.clustrange() produces the all the quality indices
wardRange<-as.clustrange(wardCluster, diss=OM)## 1. N groups 1. stat 2. N groups 2. stat
## PBC 9 6.669443e-01 8 6.662702e-01
## HG 20 9.443307e-01 19 9.403195e-01
## HGSD 20 9.439064e-01 19 9.399031e-01
## ASW 7 4.556711e-01 8 4.524956e-01
## ASWw 7 4.561074e-01 8 4.530312e-01
## CH 2 3.209675e+03 3 2.726594e+03
## R2 20 7.550712e-01 19 7.483761e-01
## CHsq 7 5.445387e+03 6 5.427821e+03
## R2sq 20 8.988933e-01 19 8.954413e-01
## HC 20 4.379441e-02 9 4.428889e-02
## cut the dendrogram at 7 clusters
cluster7 <- cutree(wardCluster, k=7)
table(cluster7) # the distribution of cluster membership # Use package "ggseqplot" to produce state distribution plot by cluster membership
ggseqdplot(fertility.seq, group=cluster7, facet_ncol=3, facet_nrow=3)Partitioning around medoids (PAM)
PAM aims to “obtain the best partitioning of a data set into a predefined number k of groups.” It aims to identify the k best representatives of groups, called medoids. A medoid is defined as the observation of a group having the smallest weighed sum of distances from the other observations in this group (Studer, 2013:10)
## Considering the result of hierarchical clustering, we set 7 as the number of groups:
pamcluster7 <- wcKMedoids(OM, k=7)
## print the medoids in SPS format
print(fertility.seq[unique(pamcluster7$clustering),], format="SPS")## Sequence
## 9020 (C=0,18)
## 8970 (C=0,5)-(C=1,13)
## 1521 (C=0,3)-(C=1,4)-(C=2,11)
## 634 (C=0,9)-(C=1,3)-(C=2,6)
## 846 (C=0,13)-(C=1,4)-(C=2,1)
## 7996 (C=0,2)-(C=1,3)-(C=2,3)-(C=3,8)-(C=4,2)
## 8812 (C=0,5)-(NA,13)
The medoids represented in SPS format can inform the naming of the clusters.
## 1. N groups 1. stat 2. N groups 2. stat
## PBC 6 6.690566e-01 8 6.514679e-01
## HG 20 9.695848e-01 16 9.669670e-01
## HGSD 20 9.693359e-01 16 9.667432e-01
## ASW 9 5.063722e-01 10 4.966375e-01
## ASWw 9 5.069428e-01 10 4.972795e-01
## CH 2 3.112973e+03 3 2.893972e+03
## R2 20 7.751902e-01 19 7.714938e-01
## CHsq 6 6.599013e+03 9 6.509341e+03
## R2sq 20 9.217484e-01 19 9.160056e-01
## HC 16 2.411017e-02 20 2.459379e-02
The quality measures show that 6 or 9 clusters may be better than 7 clusters. For illustration purpose, 7 is used. We can always inspect different solutions to make sure that the clustering makes substantive sense while improving quality measures.
Naming the clusters
Labeling the clusters is an important step. Here, each cluster is named according to the medoid of each cluster. The labels have to be succinct and informative to account for the state change and temporality.
## Label the resulting clusters yielded by PAM
type <-factor(pamcluster7$clustering , levels=c(8982, 8933, 1520, 634, 845, 7969, 8778),
labels=c("Childless",
"Early 20, single child",
"Early 20, two children",
"Mid-20, two children",
"Early-30, two children",
"Late-teens, multiple children",
"incomplete"))
table(type) ## type
## Childless Early 20, single child
## 2675 1032
## Early 20, two children Mid-20, two children
## 984 1009
## Early-30, two children Late-teens, multiple children
## 1103 1300
## incomplete
## 881
## Plot the state distribution by cluster membership
ggseqdplot(fertility.seq, group=type, facet_ncol=3, facet_nrow=3)From the state distribution plots stratified by PAM cluster membership, we can see that the most prevalent fertility pattern is the childless from age 18 to 35 (n=2,675). The other clusters are distinct from each other in terms of timing and number of children. There is also a cluster (i.e., “incomplete”) representing the sequences loaded with missing values.
Regress cluster membership on covariates
In addition to identification of the typical patterns of fertility history among NLSY97 cohort. Researchers would like to know what factors predict the cluster membership. To do so, the resulting cluster membership is used as the outcome variable in the multinomial logistic regression model. OR, the cluster membership can be used to predict other outcomes of interest, e.g., fertility history and midlife health. We focus on the first senario in this practicum.
## Prepare the data set for the multinomial logistic regression.
mdata <- fertility %>%
select(female, race, education) %>%
mutate(type = as_factor(type))
ls(mdata)## [1] "education" "female" "race" "type"
## Estimate multinomial logistic regression model to predict cluster membership.
multinomial <- multinom(type ~ female + race + education, data=mdata)## Produce the result table for estimated multinomial logistic regression model.
# Create a summary table with tbl_regression
model_table <- tbl_regression(multinomial,
exponentiate = TRUE) %>%
add_glance_table(c(nobs, AIC)) ## ℹ Multinomial models have a different underlying structure than the models
## gtsummary was designed for. Other gtsummary functions designed to work with
## tbl_regression objects may yield unexpected results.
| Characteristic | OR1 | 95% CI1 | p-value |
|---|---|---|---|
| Early 20, single child | |||
| female | |||
| male | — | — | |
| female | 1.91 | 1.64, 2.22 | <0.001 |
| race | |||
| White | — | — | |
| Black | 2.06 | 1.73, 2.44 | <0.001 |
| Hispanic | 1.25 | 1.02, 1.52 | 0.030 |
| Mixed | 1.56 | 0.70, 3.49 | 0.3 |
| education | |||
| None | — | — | |
| GED/HS | 0.68 | 0.51, 0.90 | 0.008 |
| Associate Degree | 0.48 | 0.33, 0.69 | <0.001 |
| BA or above | 0.16 | 0.11, 0.22 | <0.001 |
| Early 20, two children | |||
| female | |||
| male | — | — | |
| female | 3.33 | 2.85, 3.91 | <0.001 |
| race | |||
| White | — | — | |
| Black | 2.02 | 1.68, 2.42 | <0.001 |
| Hispanic | 1.71 | 1.40, 2.07 | <0.001 |
| Mixed | 1.34 | 0.55, 3.27 | 0.5 |
| education | |||
| None | — | — | |
| GED/HS | 0.45 | 0.35, 0.59 | <0.001 |
| Associate Degree | 0.31 | 0.22, 0.44 | <0.001 |
| BA or above | 0.08 | 0.06, 0.11 | <0.001 |
| Mid-20, two children | |||
| female | |||
| male | — | — | |
| female | 1.44 | 1.24, 1.67 | <0.001 |
| race | |||
| White | — | — | |
| Black | 0.77 | 0.64, 0.93 | 0.007 |
| Hispanic | 0.94 | 0.78, 1.14 | 0.6 |
| Mixed | 1.34 | 0.64, 2.79 | 0.4 |
| education | |||
| None | — | — | |
| GED/HS | 0.89 | 0.63, 1.26 | 0.5 |
| Associate Degree | 0.78 | 0.52, 1.18 | 0.2 |
| BA or above | 0.71 | 0.50, 1.02 | 0.061 |
| Early-30, two children | |||
| female | |||
| male | — | — | |
| female | 1.27 | 1.10, 1.47 | 0.001 |
| race | |||
| White | — | — | |
| Black | 0.72 | 0.60, 0.87 | <0.001 |
| Hispanic | 0.83 | 0.69, 1.01 | 0.065 |
| Mixed | 1.33 | 0.66, 2.68 | 0.4 |
| education | |||
| None | — | — | |
| GED/HS | 1.44 | 0.93, 2.24 | 0.10 |
| Associate Degree | 2.00 | 1.23, 3.25 | 0.005 |
| BA or above | 2.04 | 1.32, 3.18 | 0.001 |
| Late-teens, multiple children | |||
| female | |||
| male | — | — | |
| female | 3.76 | 3.25, 4.36 | <0.001 |
| race | |||
| White | — | — | |
| Black | 2.30 | 1.94, 2.73 | <0.001 |
| Hispanic | 1.90 | 1.59, 2.28 | <0.001 |
| Mixed | 2.53 | 1.25, 5.14 | 0.010 |
| education | |||
| None | — | — | |
| GED/HS | 0.30 | 0.24, 0.39 | <0.001 |
| Associate Degree | 0.19 | 0.14, 0.27 | <0.001 |
| BA or above | 0.04 | 0.03, 0.06 | <0.001 |
| incomplete | |||
| female | |||
| male | — | — | |
| female | 1.54 | 1.31, 1.82 | <0.001 |
| race | |||
| White | — | — | |
| Black | 0.55 | 0.44, 0.68 | <0.001 |
| Hispanic | 0.59 | 0.48, 0.73 | <0.001 |
| Mixed | 0.81 | 0.32, 2.08 | 0.7 |
| education | |||
| None | — | — | |
| GED/HS | 0.16 | 0.12, 0.20 | <0.001 |
| Associate Degree | 0.05 | 0.04, 0.08 | <0.001 |
| BA or above | 0.04 | 0.03, 0.05 | <0.001 |
| NA | |||
| No. Obs. | 8,984 | ||
| AIC | 31,115 | ||
| 1 OR = Odds Ratio, CI = Confidence Interval | |||
Taking the cluster “late-teens, multiple children” as an example, women are more likely to enter this fertility trajectory than men. Racial minorities are more likely than their White counterparts to have this childbearing pattern. There is an educational gradient in the probability of entering this fertility trajectory, e.g., college-educated are least likely to enter this trajectory.
Reference
- Raab, M., & Struffolino, E. (2022). Sequence Analysis (1st edition). SAGE Publications, Inc.
- Ritschard, G. (2021). Measuring the Nature of Individual Sequences. Sociological Methods & Research, 00491241211036156. https://doi.org/10.1177/00491241211036156
- Studer M (2013). “WeightedCluster Library Manual: A practical guide to creating typologies of trajectories in the social sciences with R.”
- Studer, M., & Ritschard, G. (2016). What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures. Journal of the Royal Statistical Society: Series A (Statistics in Society), 179(2), 481–511. https://doi.org/10.1111/rssa.12125
Please contact whuang60@jhu.edu if you have any questions.