Some general resources:
http://www.ats.ucla.edu/stat/mplus/seminars/lca
http://www.john-uebersax.com/stat/faq.htm An excellent chapter explaining LCA (categorical indicators) and LPA (continuous indicators): http://daob.nl/wp-content/uploads/2015/06/oberski-LCA.pdf
http://stats.stackexchange.com/questions/122213/latent-class-analysis-vs-cluster-analysis-differences-in-inferences
Includes an example (in mplus) of a model with both continuous and categorical indicators: http://www.ats.ucla.edu/stat/mplus/seminars/introMplus_part2/lca.htm
…translating it to r: http://stats.stackexchange.com/questions/47426/latent-class-model-with-both-continuous-and-categorical-indicators-in-r
The r package being used: https://cran.r-project.org/web/packages/depmixS4/vignettes/depmixS4.pdf
Some packages I’m not using:
library(mclust)
library(poLCA) # only categorical indicators
library(nlsem)
From IDRE (example 2 is most relevant):
The examples on this page use a dataset with information on high school students’ academic histories. In the first example below, a 2 class model is estimated using four dichotomous variables as indicators (category 1 = no, category 2 = yes). The variables are whether the student had taken honors math (hm), honors English (he), or vocational classes (voc); and whether the student reported they were unlikely to go to college (nocol). The expected classes are academically oriented students (i.e. students who took honors classes, did not take vocational classes and reported they were likely to go to college), and students who are less academically oriented. The dataset for this example is lca.dat.
…
Above we estimated a specific case of a mixture model, a latent class analysis, in which all of the indicators are categorical, in this example the model contains both categorical and continuous indicators. In addition to the four categorical variables used in the example above, this model includes four continuous variables, the students score on a measure of academic achievement for each of the four years of high school (ach09-ach12). The achievement variables have been centered so that each has a mean of zero.
library(depmixS4)
## Loading required package: nnet
## Loading required package: MASS
## Loading required package: Rsolnp
lca <- read.table("http://www.ats.ucla.edu/stat/mplus/seminars/introMplus_part2/lca.dat", sep=",")
names(lca) <- c( "hm", "he", "voc", "nocol" ,"ach09", "ach10", "ach11","ach12")
summary(lca)
## hm he voc nocol
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :1.000 Median :1.000
## Mean :0.322 Mean :0.314 Mean :0.678 Mean :0.666
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :1.000 Max. :1.000 Max. :1.000 Max. :1.000
## ach09 ach10 ach11 ach12
## Min. :-5.0540 Min. :-4.7775 Min. :-4.1112 Min. :-4.0433
## 1st Qu.:-2.5189 1st Qu.:-2.5041 1st Qu.:-1.4652 1st Qu.:-1.3778
## Median :-1.5821 Median :-1.5782 Median :-0.4943 Median :-0.5779
## Mean :-0.9866 Mean :-0.9929 Mean :-0.4641 Mean :-0.5082
## 3rd Qu.: 0.4671 3rd Qu.: 0.5383 3rd Qu.: 0.4026 3rd Qu.: 0.3187
## Max. : 5.3017 Max. : 4.5390 Max. : 3.1273 Max. : 2.7939
mod1 <- mix(list(hm~1, he~1, voc~1, nocol~1),
data=lca, # the dataset to use
nstates=2, # the number of latent classes
family=list(multinomial("identity"), multinomial("identity"), multinomial("identity"), multinomial("identity")),
respstart=runif(16))
fmod1 <- fit(mod1, verbose=FALSE)
## converged at iteration 25 with logLik: -965.2441
fmod1
## Convergence info: Log likelihood converged to within tol. (relative change)
## 'log Lik.' -965.2441 (df=9)
## AIC: 1948.488
## BIC: 1986.42
summary(fmod1)
## Mixture probabilities model
## pr1 pr2
## 0.2728013 0.7271987
##
## Response parameters
## Resp 1 : multinomial
## Resp 2 : multinomial
## Resp 3 : multinomial
## Resp 4 : multinomial
## Re1.1 Re1.2 Re2.1 Re2.2 Re3.1 Re3.2 Re4.1
## St1 0.1128241 0.8871759 0.1514345 0.8485655 0.9113818 0.08861818 0.8887037
## St2 0.8900200 0.1099800 0.8865369 0.1134631 0.1008994 0.89910064 0.1259085
## Re4.2
## St1 0.1112963
## St2 0.8740915
For binary indicators, the first parameter for each response varilable (e.g. Re1.1), gives the probability of level 1 for that variable in latent state 1 and latent state 2. The second parameter (e.g. Re1.2) gives the probability of level 2 for that variable in each latent state — so the two parameters for a response variable should sum to 1 for each state.
posterior.states <- depmixS4::posterior(fmod1) # Saving Class Assignments
head(round(posterior.states, 3)) # show the first 6 cases, rounded to 3 decimal places to match IDRE output
## state S1 S2
## 1 1 0.963 0.037
## 2 1 0.971 0.029
## 3 2 0.000 1.000
## 4 1 0.999 0.001
## 5 1 0.999 0.001
## 6 2 0.004 0.996
posterior.states$state <- as.factor(posterior.states$state)
summary(posterior.states$state) # how many people in each latent class?
## 1 2
## 127 373
fmod1@npars
## [1] 18
Note about npars: The total number of parameters of the model. This is not the degrees of freedom, ie there are redundancies in the parameters, in particular in the multinomial models for the transitions and prior.
Starting values: “instart for the parameters relating to the prior probabilities, trstart, relating the transition models, and respstart for the parameters of the response models. Note that the starting values for the initial and transition models as well as multinomial logit response models are interpreted as probabilities, and internally converted to multinomial logit parameters (if a logit link function is used).”
mod2 <- mix(list(hm~1, he~1, voc~1, nocol~1, ach09~1, ach10~1, ach11~1, ach12~1),
data=lca, # the dataset to use
nstates=2, # the number of latent classes
family=list(multinomial(),multinomial(),multinomial(),multinomial(), gaussian(),gaussian(),gaussian(),gaussian()),
respstart=runif(32))
fmod2 <- fit(mod2, verbose=FALSE)
## converged at iteration 9 with logLik: -3840.115
fmod2
## Convergence info: Log likelihood converged to within tol. (relative change)
## 'log Lik.' -3840.115 (df=25)
## AIC: 7730.229
## BIC: 7835.594
summary(fmod2)
## Mixture probabilities model
## pr1 pr2
## 0.7352477 0.2647523
##
## Response parameters
## Resp 1 : multinomial
## Resp 2 : multinomial
## Resp 3 : multinomial
## Resp 4 : multinomial
## Resp 5 : gaussian
## Resp 6 : gaussian
## Resp 7 : gaussian
## Resp 8 : gaussian
## Re1.(Intercept).0 Re1.(Intercept).1 Re2.(Intercept).0
## St1 0 -2.021450 0
## St2 0 2.104972 0
## Re2.(Intercept).1 Re3.(Intercept).0 Re3.(Intercept).1
## St1 -2.075222 0 2.075223
## St2 1.957646 0 -2.271522
## Re4.(Intercept).0 Re4.(Intercept).1 Re5.(Intercept) Re5.sd
## St1 0 1.929259 -2.057901 1.062136
## St2 0 -2.305800 1.988539 1.039787
## Re6.(Intercept) Re6.sd Re7.(Intercept) Re7.sd Re8.(Intercept)
## St1 -2.060223 0.9700232 -0.9866289 1.0463418 -0.9895619
## St2 1.971368 0.9990478 0.9869759 0.9249991 0.8284875
## Re8.sd
## St1 0.9939486
## St2 0.9123412
posterior.states <- depmixS4::posterior(fmod2)
posterior.states$state <- as.factor(posterior.states$state)
Plotting, to understand the latent classes
library(ggplot2)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
plot.data <- cbind(lca, posterior.states) %>%
gather(key="measure", value="value", hm:ach12)
max_lca_probs <- cbind(lca, posterior.states) %>%
mutate(id=1:nrow(.)) %>%
gather("state_prob", "value", starts_with("S", ignore.case=FALSE)) %>%
group_by(id) %>%
summarize(max_prob=max(value))
summary(max_lca_probs) # with what probability are cases assigned to classes?
## id max_prob
## Min. : 1.0 Min. :0.6249
## 1st Qu.:125.8 1st Qu.:1.0000
## Median :250.5 Median :1.0000
## Mean :250.5 Mean :0.9992
## 3rd Qu.:375.2 3rd Qu.:1.0000
## Max. :500.0 Max. :1.0000
plot.data$var_type <- ifelse(plot.data$measure == "hm" |
plot.data$measure == "he" |
plot.data$measure == "voc" |
plot.data$measure == "nocol", "categorical",
ifelse(grepl(pattern="ach", x=plot.data$measure), "continuous", NA))
ggplot(filter(plot.data, var_type=="continuous"), aes(y=value, x=state)) +
geom_boxplot() +
facet_wrap(~ measure)
ggplot(filter(plot.data, var_type=="categorical"), aes(x=state, fill=as.factor(value))) +
geom_bar() +
facet_wrap(~ measure) +
ggtitle("displayed as counts")
ggplot(filter(plot.data, var_type=="categorical"), aes(x=state, fill=as.factor(value))) +
geom_bar(position="fill") +
facet_wrap(~ measure) +
ggtitle("displayed as proportion")
summary.plot.data <- plot.data %>%
group_by(state, measure) %>%
summarize(z=mean(value)/sd(value))
ggplot(summary.plot.data, aes(y=z, x=measure, group=state, color=state)) +
geom_point() +
geom_line() +
ggtitle("typical LCA plot")
data("balance")
head(balance)
## sex agedays age t1 t2 t3 t4 d1 d2 d3 d4
## 1 1 2531 6 2 3 2 2 0 1 0 0
## 2 1 2583 7 2 2 1 3 0 0 1 1
## 3 1 2582 7 2 2 2 2 0 0 0 0
## 4 1 2379 6 2 2 2 2 0 0 0 0
## 5 1 2408 6 2 2 2 2 0 0 0 0
## 6 1 1967 5 2 2 2 2 0 0 0 0
balance$age_c <- balance$agedays - mean(balance$agedays, na.rm=TRUE) #centering age
mod3 <- mix(list(d1 ~ 1, d2 ~ 1, d3 ~ 1, d4 ~ 1),
prior = ~ age_c, # a predictor for class membership (probability of being in each class depends on age)
data=balance,
nstates = 3,
family = list(multinomial("identity"), multinomial("identity"), multinomial("identity"), multinomial("identity")),
respstart = runif(24))
fmod3 <- fit(mod3, verbose=FALSE)
## converged at iteration 175 with logLik: -916.3218
fmod3
## Convergence info: Log likelihood converged to within tol. (relative change)
## 'log Lik.' -916.3218 (df=16)
## AIC: 1864.644
## BIC: 1939.172
summary(fmod3)
## Mixture probabilities model
## Model of type multinomial (mlogit), formula: ~age_c
## Coefficients:
## St1 St2 St3
## (Intercept) 0 -1.482141153 -1.5430781170
## age_c 0 -0.001850357 -0.0008010738
## Probalities at zero values of the covariates.
## 0.6940237 0.157648 0.1483283
##
## Response parameters
## Resp 1 : multinomial
## Resp 2 : multinomial
## Resp 3 : multinomial
## Resp 4 : multinomial
## Re1.1 Re1.2 Re2.1 Re2.2 Re3.1 Re3.2
## St1 0.02312694 9.768731e-01 0.02109371 0.97890629 0.02622178 0.9737782
## St2 0.99999730 2.702059e-06 0.97241254 0.02758746 1.00000000 0.0000000
## St3 0.57575200 4.242480e-01 0.46279954 0.53720046 0.60274391 0.3972561
## Re4.1 Re4.2
## St1 0.01753395 9.824660e-01
## St2 0.99999602 3.982813e-06
## St3 0.34579116 6.542088e-01
summary(fmod3, which="response")
## Response parameters
## Resp 1 : multinomial
## Resp 2 : multinomial
## Resp 3 : multinomial
## Resp 4 : multinomial
## Re1.1 Re1.2 Re2.1 Re2.2 Re3.1 Re3.2
## St1 0.02312694 9.768731e-01 0.02109371 0.97890629 0.02622178 0.9737782
## St2 0.99999730 2.702059e-06 0.97241254 0.02758746 1.00000000 0.0000000
## St3 0.57575200 4.242480e-01 0.46279954 0.53720046 0.60274391 0.3972561
## Re4.1 Re4.2
## St1 0.01753395 9.824660e-01
## St2 0.99999602 3.982813e-06
## St3 0.34579116 6.542088e-01
posterior.states <- depmixS4::posterior(fmod3) # Returns a data.frame with the number of latent states + 1 columns; the first column has the viterbi states, the other columns have the delta probabilities, see Rabiner (1989)
posterior.states$state <- as.factor(posterior.states$state)
balance <- cbind(balance, posterior.states)
library(ggplot2)
library(dplyr)
library(tidyr)
ggplot(tidyr::gather(balance, key="key", value="value", S1:S3), aes(x=agedays, y=value, color=key)) +
geom_point(alpha=.6) +
ggtitle("estimated probability of membership in each class by age")
probs <- balance %>%
group_by(age) %>%
summarize_each(funs(mean), S1, S2, S3)
ggplot(tidyr::gather(probs, key="key", value="value", -age), aes(y=value, x=age, color=key)) +
geom_line() +
ggtitle("mean probability of being in each class by age")
“The intercept values of the multinomial logistic parameters provide the prior probabilities of each class when the covariate has value zero (note that in this case the value zero does not have much significance, scaling and/or centering of covariates may be useful in such cases). The summary function prints these values. As can be seen from those values, at age zero, the prior probability is overwhelmingly at the second class.”
mod4 <- mix(list(hm~1, he~1, voc~1, nocol~1, ach09~1, ach10~1, ach11~1, ach12~1),
data=lca, # the dataset to use
nstates=6, # the number of latent classes
family=list(multinomial(),multinomial(),multinomial(),multinomial(), gaussian(),gaussian(),gaussian(),gaussian()),
respstart=runif(96))
fmod4 <- fit(mod4, verbose=FALSE)
## converged at iteration 226 with logLik: -3800.719
fmod4
## Convergence info: Log likelihood converged to within tol. (relative change)
## 'log Lik.' -3800.719 (df=77)
## AIC: 7755.439
## BIC: 8079.964
summary(fmod4)
## Mixture probabilities model
## pr1 pr2 pr3 pr4 pr5 pr6
## 0.20266033 0.25733251 0.07340352 0.26525866 0.03796349 0.16338149
##
## Response parameters
## Resp 1 : multinomial
## Resp 2 : multinomial
## Resp 3 : multinomial
## Resp 4 : multinomial
## Resp 5 : gaussian
## Resp 6 : gaussian
## Resp 7 : gaussian
## Resp 8 : gaussian
## Re1.(Intercept).0 Re1.(Intercept).1 Re2.(Intercept).0
## St1 0 -2.127109 0
## St2 0 -10.022130 0
## St3 0 -1.043076 0
## St4 0 2.087646 0
## St5 0 -1.990554 0
## St6 0 -1.102753 0
## Re2.(Intercept).1 Re3.(Intercept).0 Re3.(Intercept).1
## St1 -1.735630 0 1.429993
## St2 -2.574189 0 9.980436
## St3 -1.079536 0 2.470319
## St4 1.942428 0 -2.251462
## St5 -10.562149 0 -0.176333
## St6 -2.307389 0 2.187539
## Re4.(Intercept).0 Re4.(Intercept).1 Re5.(Intercept) Re5.sd
## St1 0 2.164853 -2.063200 0.8315798
## St2 0 2.315340 -2.344630 1.1413112
## St3 0 1.879144 -2.449179 0.7421640
## St4 0 -2.307788 1.984102 1.0437334
## St5 0 -0.200037 -1.743805 0.6287552
## St6 0 2.079230 -1.502248 1.1352961
## Re6.(Intercept) Re6.sd Re7.(Intercept) Re7.sd Re8.(Intercept)
## St1 -2.152611 1.0291546 0.1104593 0.5314518 -1.1633083
## St2 -1.980557 1.0708202 -1.2956392 0.6340077 -1.0030311
## St3 -2.042375 0.6062562 -2.1560338 0.5735642 -1.5678935
## St4 1.968405 1.0003883 0.9851922 0.9250128 0.8308456
## St5 -3.011547 0.7447023 -1.1845080 0.3867454 -1.1624769
## St6 -1.865754 0.7298824 -1.2926215 1.1819973 -0.4622840
## Re8.sd
## St1 0.8874967
## St2 1.0993131
## St3 0.4555784
## St4 0.9130559
## St5 1.2589837
## St6 0.7759304
posterior.states <- depmixS4::posterior(fmod4)
posterior.states$state <- as.factor(posterior.states$state)
summary(posterior.states$state) # how many cases assigned to each class?
## 1 2 3 4 5 6
## 110 149 37 133 16 55
Only works with continuous predictors: finite normal (aka Gaussian) mixture modelling.
library(mclust)
cont_data <- dplyr::select(lca, ach09:ach12) # continuous variables only
# shows model fit (BIC) on y-axis and number of latent classes on x-axis
BIC = mclustBIC(cont_data,
G=1:9) # G tells it how many latent classes to try
plot(BIC) # different lines represent different assumptions about the covariance structure
summary(BIC) # the best models
## Best BIC values:
## EII,2 VII,2 EEI,2
## BIC -6335.883 -6340.821255 -6349.47355
## BIC diff 0.000 -4.938359 -13.59065
mod1 = Mclust(cont_data,
G=1:9, # the number of latent classes to try (it will select the best one)
x = BIC) # giving it this BIC object we made before means it won't have to recalculate it
summary(mod1, parameters = TRUE)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EII (spherical, equal volume) model with 2 components:
##
## log.likelihood n df BIC ICL
## -3136.868 500 10 -6335.883 -6336.19
##
## Clustering table:
## 1 2
## 133 367
##
## Mixing probabilities:
## 1 2
## 0.266173 0.733827
##
## Means:
## [,1] [,2]
## ach09 1.9761826 -2.0612532
## ach10 1.9624250 -2.0647845
## ach11 0.9817233 -0.9885447
## ach12 0.8330294 -0.9947292
##
## Variances:
## [,,1]
## ach09 ach10 ach11 ach12
## ach09 1.010061 0.000000 0.000000 0.000000
## ach10 0.000000 1.010061 0.000000 0.000000
## ach11 0.000000 0.000000 1.010061 0.000000
## ach12 0.000000 0.000000 0.000000 1.010061
## [,,2]
## ach09 ach10 ach11 ach12
## ach09 1.010061 0.000000 0.000000 0.000000
## ach10 0.000000 1.010061 0.000000 0.000000
## ach11 0.000000 0.000000 1.010061 0.000000
## ach12 0.000000 0.000000 0.000000 1.010061
summary(as.factor(mod1$classification))
## 1 2
## 133 367
head(round(mod1$uncertainty, 3)) # uncertainty associated with each classification
## [1] 0 0 0 0 0 0
# add class assignment and uncertainty to data
cont_data$class <- as.factor(mod1$classification)
cont_data$uncertainty <- mod1$uncertainty
summary(cont_data$uncertainty)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000000 0.00000000 0.00000000 0.00030120 0.00000014 0.06808000
plot.data <- cont_data %>%
gather(key="measure", value="value", ach09:ach12)
ggplot(plot.data, aes(y=value, x=class)) +
geom_boxplot() +
facet_wrap(~ measure)
ggplot(plot.data, aes(x=value, fill=class)) +
geom_histogram(position = "identity", alpha=.5) +
facet_wrap(~ measure)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary.plot.data <- plot.data %>%
group_by(class, measure) %>%
summarize(z=mean(value)/sd(value))
ggplot(summary.plot.data, aes(y=z, x=measure, group=class, color=class)) +
geom_point() +
geom_line() +
ggtitle("typical LCA plot")
Compare the assigned latent classes to some other categorical variable.
lca$class <- as.factor(mod1$classification)
# how does the latent class assignment based on academic acheivement in grades 9 through 12 compare to whether or not students took vocational classes?
xtabs(~ class + voc, data=lca)
## voc
## class 0 1
## 1 120 13
## 2 41 326
lcaxtabs <- xtabs(~ class + voc, data=lca)
summary(lcaxtabs)
## Call: xtabs(formula = ~class + voc, data = lca)
## Number of cases in table: 500
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 279.45, df = 1, p-value = 9.881e-63
library(vcd)
## Loading required package: grid
assocstats(lcaxtabs)
## X^2 df P(> X^2)
## Likelihood Ratio 286.26 1 0
## Pearson 279.45 1 0
##
## Phi-Coefficient : 0.748
## Contingency Coeff.: 0.599
## Cramer's V : 0.748