Some general resources:
http://www.ats.ucla.edu/stat/mplus/seminars/lca
http://www.john-uebersax.com/stat/faq.htm An excellent chapter explaining LCA (categorical indicators) and LPA (continuous indicators): http://daob.nl/wp-content/uploads/2015/06/oberski-LCA.pdf
http://stats.stackexchange.com/questions/122213/latent-class-analysis-vs-cluster-analysis-differences-in-inferences

Includes an example (in mplus) of a model with both continuous and categorical indicators: http://www.ats.ucla.edu/stat/mplus/seminars/introMplus_part2/lca.htm
…translating it to r: http://stats.stackexchange.com/questions/47426/latent-class-model-with-both-continuous-and-categorical-indicators-in-r
The r package being used: https://cran.r-project.org/web/packages/depmixS4/vignettes/depmixS4.pdf

Some packages I’m not using:

library(mclust)
library(poLCA) # only categorical indicators
library(nlsem)

From IDRE (example 2 is most relevant):
The examples on this page use a dataset with information on high school students’ academic histories. In the first example below, a 2 class model is estimated using four dichotomous variables as indicators (category 1 = no, category 2 = yes). The variables are whether the student had taken honors math (hm), honors English (he), or vocational classes (voc); and whether the student reported they were unlikely to go to college (nocol). The expected classes are academically oriented students (i.e. students who took honors classes, did not take vocational classes and reported they were likely to go to college), and students who are less academically oriented. The dataset for this example is lca.dat.
…
Above we estimated a specific case of a mixture model, a latent class analysis, in which all of the indicators are categorical, in this example the model contains both categorical and continuous indicators. In addition to the four categorical variables used in the example above, this model includes four continuous variables, the students score on a measure of academic achievement for each of the four years of high school (ach09-ach12). The achievement variables have been centered so that each has a mean of zero.

library(depmixS4)

## Loading required package: nnet

## Loading required package: MASS

## Loading required package: Rsolnp

lca <- read.table("http://www.ats.ucla.edu/stat/mplus/seminars/introMplus_part2/lca.dat", sep=",")
names(lca) <- c( "hm", "he", "voc", "nocol" ,"ach09", "ach10", "ach11","ach12")

summary(lca)

##        hm              he             voc            nocol      
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000  
##  Median :0.000   Median :0.000   Median :1.000   Median :1.000  
##  Mean   :0.322   Mean   :0.314   Mean   :0.678   Mean   :0.666  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :1.000   Max.   :1.000   Max.   :1.000   Max.   :1.000  
##      ach09             ach10             ach11             ach12        
##  Min.   :-5.0540   Min.   :-4.7775   Min.   :-4.1112   Min.   :-4.0433  
##  1st Qu.:-2.5189   1st Qu.:-2.5041   1st Qu.:-1.4652   1st Qu.:-1.3778  
##  Median :-1.5821   Median :-1.5782   Median :-0.4943   Median :-0.5779  
##  Mean   :-0.9866   Mean   :-0.9929   Mean   :-0.4641   Mean   :-0.5082  
##  3rd Qu.: 0.4671   3rd Qu.: 0.5383   3rd Qu.: 0.4026   3rd Qu.: 0.3187  
##  Max.   : 5.3017   Max.   : 4.5390   Max.   : 3.1273   Max.   : 2.7939

Example 1: Only categorical indicators

mod1 <- mix(list(hm~1, he~1, voc~1, nocol~1), 
           data=lca, # the dataset to use
           nstates=2, # the number of latent classes
           family=list(multinomial("identity"), multinomial("identity"), multinomial("identity"), multinomial("identity")),
           respstart=runif(16))

fmod1 <- fit(mod1, verbose=FALSE)

## converged at iteration 25 with logLik: -965.2441

fmod1

## Convergence info: Log likelihood converged to within tol. (relative change) 
## 'log Lik.' -965.2441 (df=9)
## AIC:  1948.488 
## BIC:  1986.42

summary(fmod1)

## Mixture probabilities model 
##       pr1       pr2 
## 0.2728013 0.7271987 
## 
## Response parameters 
## Resp 1 : multinomial 
## Resp 2 : multinomial 
## Resp 3 : multinomial 
## Resp 4 : multinomial 
##         Re1.1     Re1.2     Re2.1     Re2.2     Re3.1      Re3.2     Re4.1
## St1 0.1128241 0.8871759 0.1514345 0.8485655 0.9113818 0.08861818 0.8887037
## St2 0.8900200 0.1099800 0.8865369 0.1134631 0.1008994 0.89910064 0.1259085
##         Re4.2
## St1 0.1112963
## St2 0.8740915

For binary indicators, the first parameter for each response varilable (e.g. Re1.1), gives the probability of level 1 for that variable in latent state 1 and latent state 2. The second parameter (e.g. Re1.2) gives the probability of level 2 for that variable in each latent state — so the two parameters for a response variable should sum to 1 for each state.

posterior.states <- depmixS4::posterior(fmod1) # Saving Class Assignments
head(round(posterior.states, 3)) # show the first 6 cases, rounded to 3 decimal places to match IDRE output

##   state    S1    S2
## 1     1 0.963 0.037
## 2     1 0.971 0.029
## 3     2 0.000 1.000
## 4     1 0.999 0.001
## 5     1 0.999 0.001
## 6     2 0.004 0.996

posterior.states$state <- as.factor(posterior.states$state)
summary(posterior.states$state) # how many people in each latent class?

##   1   2 
## 127 373

fmod1@npars

## [1] 18

Note about npars: The total number of parameters of the model. This is not the degrees of freedom, ie there are redundancies in the parameters, in particular in the multinomial models for the transitions and prior.

Starting values: “instart for the parameters relating to the prior probabilities, trstart, relating the transition models, and respstart for the parameters of the response models. Note that the starting values for the initial and transition models as well as multinomial logit response models are interpreted as probabilities, and internally converted to multinomial logit parameters (if a logit link function is used).”

Example 2: Both categorical and continuous indicators

mod2 <- mix(list(hm~1, he~1, voc~1, nocol~1, ach09~1, ach10~1, ach11~1, ach12~1), 
           data=lca, # the dataset to use
           nstates=2, # the number of latent classes
           family=list(multinomial(),multinomial(),multinomial(),multinomial(), gaussian(),gaussian(),gaussian(),gaussian()),
           respstart=runif(32))

fmod2 <- fit(mod2, verbose=FALSE)

## converged at iteration 9 with logLik: -3840.115

fmod2

## Convergence info: Log likelihood converged to within tol. (relative change) 
## 'log Lik.' -3840.115 (df=25)
## AIC:  7730.229 
## BIC:  7835.594

summary(fmod2)

## Mixture probabilities model 
##       pr1       pr2 
## 0.7352477 0.2647523 
## 
## Response parameters 
## Resp 1 : multinomial 
## Resp 2 : multinomial 
## Resp 3 : multinomial 
## Resp 4 : multinomial 
## Resp 5 : gaussian 
## Resp 6 : gaussian 
## Resp 7 : gaussian 
## Resp 8 : gaussian 
##     Re1.(Intercept).0 Re1.(Intercept).1 Re2.(Intercept).0
## St1                 0         -2.021450                 0
## St2                 0          2.104972                 0
##     Re2.(Intercept).1 Re3.(Intercept).0 Re3.(Intercept).1
## St1         -2.075222                 0          2.075223
## St2          1.957646                 0         -2.271522
##     Re4.(Intercept).0 Re4.(Intercept).1 Re5.(Intercept)   Re5.sd
## St1                 0          1.929259       -2.057901 1.062136
## St2                 0         -2.305800        1.988539 1.039787
##     Re6.(Intercept)    Re6.sd Re7.(Intercept)    Re7.sd Re8.(Intercept)
## St1       -2.060223 0.9700232      -0.9866289 1.0463418      -0.9895619
## St2        1.971368 0.9990478       0.9869759 0.9249991       0.8284875
##        Re8.sd
## St1 0.9939486
## St2 0.9123412

posterior.states <- depmixS4::posterior(fmod2)
posterior.states$state <- as.factor(posterior.states$state)

Plotting, to understand the latent classes

library(ggplot2)
library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:MASS':
## 
##     select

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

plot.data <- cbind(lca, posterior.states) %>% 
  gather(key="measure", value="value", hm:ach12)

max_lca_probs <- cbind(lca, posterior.states) %>% 
  mutate(id=1:nrow(.)) %>% 
  gather("state_prob", "value", starts_with("S", ignore.case=FALSE)) %>% 
  group_by(id) %>% 
  summarize(max_prob=max(value))

summary(max_lca_probs)  # with what probability are cases assigned to classes?

##        id           max_prob     
##  Min.   :  1.0   Min.   :0.6249  
##  1st Qu.:125.8   1st Qu.:1.0000  
##  Median :250.5   Median :1.0000  
##  Mean   :250.5   Mean   :0.9992  
##  3rd Qu.:375.2   3rd Qu.:1.0000  
##  Max.   :500.0   Max.   :1.0000

plot.data$var_type <- ifelse(plot.data$measure == "hm" | 
                               plot.data$measure == "he" | 
                               plot.data$measure == "voc" | 
                               plot.data$measure == "nocol", "categorical", 
                             ifelse(grepl(pattern="ach", x=plot.data$measure), "continuous", NA))

ggplot(filter(plot.data, var_type=="continuous"), aes(y=value, x=state)) + 
  geom_boxplot() + 
  facet_wrap(~ measure)

ggplot(filter(plot.data, var_type=="categorical"), aes(x=state, fill=as.factor(value))) + 
  geom_bar() + 
  facet_wrap(~ measure) +
  ggtitle("displayed as counts")

ggplot(filter(plot.data, var_type=="categorical"), aes(x=state, fill=as.factor(value))) + 
  geom_bar(position="fill") + 
  facet_wrap(~ measure) +
  ggtitle("displayed as proportion")

summary.plot.data <- plot.data %>% 
  group_by(state, measure) %>% 
  summarize(z=mean(value)/sd(value))

ggplot(summary.plot.data, aes(y=z, x=measure, group=state, color=state)) + 
  geom_point() + 
  geom_line() + 
  ggtitle("typical LCA plot")

Example 3: Using a predictor for class (a prior)

data("balance")
head(balance)

##   sex agedays age t1 t2 t3 t4 d1 d2 d3 d4
## 1   1    2531   6  2  3  2  2  0  1  0  0
## 2   1    2583   7  2  2  1  3  0  0  1  1
## 3   1    2582   7  2  2  2  2  0  0  0  0
## 4   1    2379   6  2  2  2  2  0  0  0  0
## 5   1    2408   6  2  2  2  2  0  0  0  0
## 6   1    1967   5  2  2  2  2  0  0  0  0

balance$age_c <- balance$agedays - mean(balance$agedays, na.rm=TRUE) #centering age


mod3 <- mix(list(d1 ~ 1, d2 ~ 1, d3 ~ 1, d4 ~ 1), 
            prior = ~ age_c, # a predictor for class membership (probability of being in each class depends on age)
            data=balance,
            nstates = 3, 
            family = list(multinomial("identity"), multinomial("identity"), multinomial("identity"), multinomial("identity")), 
            respstart = runif(24))

fmod3 <- fit(mod3, verbose=FALSE)

## converged at iteration 175 with logLik: -916.3218

fmod3

## Convergence info: Log likelihood converged to within tol. (relative change) 
## 'log Lik.' -916.3218 (df=16)
## AIC:  1864.644 
## BIC:  1939.172

summary(fmod3)

## Mixture probabilities model 
## Model of type multinomial (mlogit), formula: ~age_c
## Coefficients: 
##             St1          St2           St3
## (Intercept)   0 -1.482141153 -1.5430781170
## age_c         0 -0.001850357 -0.0008010738
## Probalities at zero values of the covariates.
## 0.6940237 0.157648 0.1483283 
## 
## Response parameters 
## Resp 1 : multinomial 
## Resp 2 : multinomial 
## Resp 3 : multinomial 
## Resp 4 : multinomial 
##          Re1.1        Re1.2      Re2.1      Re2.2      Re3.1     Re3.2
## St1 0.02312694 9.768731e-01 0.02109371 0.97890629 0.02622178 0.9737782
## St2 0.99999730 2.702059e-06 0.97241254 0.02758746 1.00000000 0.0000000
## St3 0.57575200 4.242480e-01 0.46279954 0.53720046 0.60274391 0.3972561
##          Re4.1        Re4.2
## St1 0.01753395 9.824660e-01
## St2 0.99999602 3.982813e-06
## St3 0.34579116 6.542088e-01

summary(fmod3, which="response")

## Response parameters 
## Resp 1 : multinomial 
## Resp 2 : multinomial 
## Resp 3 : multinomial 
## Resp 4 : multinomial 
##          Re1.1        Re1.2      Re2.1      Re2.2      Re3.1     Re3.2
## St1 0.02312694 9.768731e-01 0.02109371 0.97890629 0.02622178 0.9737782
## St2 0.99999730 2.702059e-06 0.97241254 0.02758746 1.00000000 0.0000000
## St3 0.57575200 4.242480e-01 0.46279954 0.53720046 0.60274391 0.3972561
##          Re4.1        Re4.2
## St1 0.01753395 9.824660e-01
## St2 0.99999602 3.982813e-06
## St3 0.34579116 6.542088e-01

posterior.states <- depmixS4::posterior(fmod3) # Returns a data.frame with the number of latent states + 1 columns; the first column has the viterbi states, the other columns have the delta probabilities, see Rabiner (1989)
posterior.states$state <- as.factor(posterior.states$state)

balance <- cbind(balance, posterior.states)

library(ggplot2)
library(dplyr)
library(tidyr)
ggplot(tidyr::gather(balance, key="key", value="value", S1:S3), aes(x=agedays, y=value, color=key)) + 
  geom_point(alpha=.6) + 
  ggtitle("estimated probability of membership in each class by age")

probs <- balance %>% 
  group_by(age) %>% 
  summarize_each(funs(mean), S1, S2, S3)
  
ggplot(tidyr::gather(probs, key="key", value="value", -age), aes(y=value, x=age, color=key)) + 
  geom_line() + 
  ggtitle("mean probability of being in each class by age")

“The intercept values of the multinomial logistic parameters provide the prior probabilities of each class when the covariate has value zero (note that in this case the value zero does not have much significance, scaling and/or centering of covariates may be useful in such cases). The summary function prints these values. As can be seen from those values, at age zero, the prior probability is overwhelmingly at the second class.”

Example 4: More classes

mod4 <- mix(list(hm~1, he~1, voc~1, nocol~1, ach09~1, ach10~1, ach11~1, ach12~1), 
           data=lca, # the dataset to use
           nstates=6, # the number of latent classes
           family=list(multinomial(),multinomial(),multinomial(),multinomial(), gaussian(),gaussian(),gaussian(),gaussian()),
           respstart=runif(96))

fmod4 <- fit(mod4, verbose=FALSE)

## converged at iteration 226 with logLik: -3800.719

fmod4

## Convergence info: Log likelihood converged to within tol. (relative change) 
## 'log Lik.' -3800.719 (df=77)
## AIC:  7755.439 
## BIC:  8079.964

summary(fmod4)

## Mixture probabilities model 
##        pr1        pr2        pr3        pr4        pr5        pr6 
## 0.20266033 0.25733251 0.07340352 0.26525866 0.03796349 0.16338149 
## 
## Response parameters 
## Resp 1 : multinomial 
## Resp 2 : multinomial 
## Resp 3 : multinomial 
## Resp 4 : multinomial 
## Resp 5 : gaussian 
## Resp 6 : gaussian 
## Resp 7 : gaussian 
## Resp 8 : gaussian 
##     Re1.(Intercept).0 Re1.(Intercept).1 Re2.(Intercept).0
## St1                 0         -2.127109                 0
## St2                 0        -10.022130                 0
## St3                 0         -1.043076                 0
## St4                 0          2.087646                 0
## St5                 0         -1.990554                 0
## St6                 0         -1.102753                 0
##     Re2.(Intercept).1 Re3.(Intercept).0 Re3.(Intercept).1
## St1         -1.735630                 0          1.429993
## St2         -2.574189                 0          9.980436
## St3         -1.079536                 0          2.470319
## St4          1.942428                 0         -2.251462
## St5        -10.562149                 0         -0.176333
## St6         -2.307389                 0          2.187539
##     Re4.(Intercept).0 Re4.(Intercept).1 Re5.(Intercept)    Re5.sd
## St1                 0          2.164853       -2.063200 0.8315798
## St2                 0          2.315340       -2.344630 1.1413112
## St3                 0          1.879144       -2.449179 0.7421640
## St4                 0         -2.307788        1.984102 1.0437334
## St5                 0         -0.200037       -1.743805 0.6287552
## St6                 0          2.079230       -1.502248 1.1352961
##     Re6.(Intercept)    Re6.sd Re7.(Intercept)    Re7.sd Re8.(Intercept)
## St1       -2.152611 1.0291546       0.1104593 0.5314518      -1.1633083
## St2       -1.980557 1.0708202      -1.2956392 0.6340077      -1.0030311
## St3       -2.042375 0.6062562      -2.1560338 0.5735642      -1.5678935
## St4        1.968405 1.0003883       0.9851922 0.9250128       0.8308456
## St5       -3.011547 0.7447023      -1.1845080 0.3867454      -1.1624769
## St6       -1.865754 0.7298824      -1.2926215 1.1819973      -0.4622840
##        Re8.sd
## St1 0.8874967
## St2 1.0993131
## St3 0.4555784
## St4 0.9130559
## St5 1.2589837
## St6 0.7759304

posterior.states <- depmixS4::posterior(fmod4)
posterior.states$state <- as.factor(posterior.states$state)
summary(posterior.states$state) # how many cases assigned to each class?

##   1   2   3   4   5   6 
## 110 149  37 133  16  55

Example 5: mclust package

Only works with continuous predictors: finite normal (aka Gaussian) mixture modelling.

library(mclust)

cont_data <- dplyr::select(lca, ach09:ach12) # continuous variables only

# shows model fit (BIC) on y-axis and number of latent classes on x-axis
BIC = mclustBIC(cont_data, 
              G=1:9) # G tells it how many latent classes to try 
plot(BIC) # different lines represent different assumptions about the covariance structure

summary(BIC) # the best models

## Best BIC values:
##              EII,2        VII,2       EEI,2
## BIC      -6335.883 -6340.821255 -6349.47355
## BIC diff     0.000    -4.938359   -13.59065

mod1 = Mclust(cont_data, 
              G=1:9, # the number of latent classes to try (it will select the best one)
              x = BIC) # giving it this BIC object we made before means it won't have to recalculate it
summary(mod1, parameters = TRUE)

## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust EII (spherical, equal volume) model with 2 components:
## 
##  log.likelihood   n df       BIC      ICL
##       -3136.868 500 10 -6335.883 -6336.19
## 
## Clustering table:
##   1   2 
## 133 367 
## 
## Mixing probabilities:
##        1        2 
## 0.266173 0.733827 
## 
## Means:
##            [,1]       [,2]
## ach09 1.9761826 -2.0612532
## ach10 1.9624250 -2.0647845
## ach11 0.9817233 -0.9885447
## ach12 0.8330294 -0.9947292
## 
## Variances:
## [,,1]
##          ach09    ach10    ach11    ach12
## ach09 1.010061 0.000000 0.000000 0.000000
## ach10 0.000000 1.010061 0.000000 0.000000
## ach11 0.000000 0.000000 1.010061 0.000000
## ach12 0.000000 0.000000 0.000000 1.010061
## [,,2]
##          ach09    ach10    ach11    ach12
## ach09 1.010061 0.000000 0.000000 0.000000
## ach10 0.000000 1.010061 0.000000 0.000000
## ach11 0.000000 0.000000 1.010061 0.000000
## ach12 0.000000 0.000000 0.000000 1.010061

summary(as.factor(mod1$classification))

##   1   2 
## 133 367

head(round(mod1$uncertainty, 3)) # uncertainty associated with each classification

## [1] 0 0 0 0 0 0

# add class assignment and uncertainty to data
cont_data$class <- as.factor(mod1$classification)

cont_data$uncertainty <- mod1$uncertainty
summary(cont_data$uncertainty)

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## 0.00000000 0.00000000 0.00000000 0.00030120 0.00000014 0.06808000

plot.data <- cont_data %>% 
  gather(key="measure", value="value", ach09:ach12)

ggplot(plot.data, aes(y=value, x=class)) + 
  geom_boxplot() + 
  facet_wrap(~ measure)

ggplot(plot.data, aes(x=value, fill=class)) + 
  geom_histogram(position = "identity", alpha=.5) + 
  facet_wrap(~ measure)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary.plot.data <- plot.data %>% 
  group_by(class, measure) %>% 
  summarize(z=mean(value)/sd(value))

ggplot(summary.plot.data, aes(y=z, x=measure, group=class, color=class)) + 
  geom_point() + 
  geom_line() + 
  ggtitle("typical LCA plot")

Follow-up analyses

Compare the assigned latent classes to some other categorical variable.

lca$class <- as.factor(mod1$classification)

# how does the latent class assignment based on academic acheivement in grades 9 through 12 compare to whether or not students took vocational classes?
xtabs(~ class + voc, data=lca)

##      voc
## class   0   1
##     1 120  13
##     2  41 326

lcaxtabs <- xtabs(~ class + voc, data=lca)
summary(lcaxtabs)

## Call: xtabs(formula = ~class + voc, data = lca)
## Number of cases in table: 500 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 279.45, df = 1, p-value = 9.881e-63

library(vcd)

## Loading required package: grid

assocstats(lcaxtabs)

##                     X^2 df P(> X^2)
## Likelihood Ratio 286.26  1        0
## Pearson          279.45  1        0
## 
## Phi-Coefficient   : 0.748 
## Contingency Coeff.: 0.599 
## Cramer's V        : 0.748

LCA

Rose Maier

June 5, 2016