LCA

Author

Giuseppe A. Veltri

Starting point

Observed indicators are caused by an unobserved, or latent, variable of interest.
Study the patterns of interrelationships among the observed indicators to understand and characterise the underlying latent variable.
Factor analysis is used for continuous latent variables (generally with continuous observed indicators).
Factor analysis and PCA reduce many observed variables to a few latent factors (or, in PCA, components).
Latent class analysis (LCA) is a method for studying categorically scored variables that is comparable to factor analysis.

	Continuous latent	Categorical latent
Continuous observed	Factor Analysis	Latent Profile Analysis
Categorical observed	Latent trait analysis or Item Response Theory	Latent Class Analysis

FA vs LCA

The latent class model is directly analogous to the factor analysis model. Both models posit an underlying latent variable that is measured by observed variables.
The key difference between the latent class and factor analysis models lies in the nature and distribution of the latent variable.
As mentioned above, in LCA the latent variable is categorical. This categorical latent variable has a multinomial distribution.
By contrast, in classic factor analysis the latent variable is continuous (sometimes referred to as dimensional) and normally distributed.
We define categorical latent variables as those in which “qualitative differences exist between groups of people or objects”, and continuous (or dimensional) latent variables as those in which “people or objects differ quantitatively along one or more continua”.

Variable-oriented vs person-oriented approaches

In the literature, some have drawn a distinction between variable-oriented and person-oriented approaches to statistical analysis of empirical data in the social and behavioral sciences.

In variable-oriented approaches the emphasis is on identifying relations between variables, and it is assumed that these relations apply across all people. Traditional factor analysis is an example of a variable-oriented approach. The emphasis in factor analysis is on identifying a factor structure that accounts for the linear relations among a set of observed variables.

The factor structure is assumed to hold for all individuals.

In contrast, in person-oriented approaches the emphasis is on the individual as a whole. This focus often involves studying individuals on the basis of their patterns of individual characteristics that are relevant for the problem under consideration.

[the focus of most scientific endeavors is nomothetic; that is, the goal is not merely to study individuals, but to reason inductively to draw broad conclusions and identify general laws]

One way to do this in a person-oriented framework is to look for subtypes of individuals that exhibit similar patterns of individual characteristics. LCA does exactly this, and therefore is usually considered a person-oriented approach.

Why select a model that posits a categorical latent variable, like LCA, instead of a model that posits a continuous latent variable, like factor analysis?

One reason is to identify an organizing principle for a complex array of empirical categorical data. With LCA an investigator can use an array of observed variables representing characteristics, behaviors, symptoms, or the like as the basis for organizing people into two or more meaningful homogeneous subgroups.

The array of observed data is usually much too large and complex for the subgroups to be evident from inspection.

A latent class model is characterized by having a categorical latent variable and categorical observed variables.
The levels of the categorical latent variable represent groups in the population and are called classes.
We are interested in identifying and understanding these unobserved classes.

It is important to stress, though, that LCA is an unsupervised technique. The researcher stipulates the number of classes the model will estimate, but the solution which will be found cannot be determined from the start.

There is no guarantee, then, that we would find groups that match up to our conceptions.

Instead, as with factor analysis or PCA, it is up to the researcher to interpret what the latent groups signify on the basis of the distribution of their responses to the various inputs.

There is one fundamental assumption made by latent class models. The assumption of local independence specifies that conditional on the latent variable, the observed variables are independent.

In other words, if it were possible to create separate contingency tables for the observed variables corresponding to each latent class, the observed variables would be independent within each of these contingency tables.
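The contingency-table reading of local independence can be checked numerically. The following is a minimal sketch in base R, assuming two latent classes and two binary items; all parameter values and object names are hypothetical and chosen only for illustration. It builds the within-class tables as products of the class-conditional margins and then mixes them into the table that would actually be observed.

```r
# Hypothetical parameters (illustration only): two classes, two binary items A and B.
class_share <- c(0.3, 0.7)   # P(X = t)
p_A <- c(0.9, 0.2)           # P(A = 1 | X = t)
p_B <- c(0.8, 0.1)           # P(B = 1 | X = t)

# Under local independence, the A-by-B table within class t is the
# outer product of the two class-conditional margins.
within_class_table <- function(t) {
  outer(c(1 - p_A[t], p_A[t]), c(1 - p_B[t], p_B[t]))
}

# Odds ratio of a 2x2 table; a value of 1 means no association.
odds_ratio <- function(tab) (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# The observed (marginal) table is the class-share-weighted mixture.
marginal_table <- class_share[1] * within_class_table(1) +
  class_share[2] * within_class_table(2)

or_class1   <- odds_ratio(within_class_table(1))  # equals 1 (independent within class)
or_class2   <- odds_ratio(within_class_table(2))  # equals 1 (independent within class)
or_marginal <- odds_ratio(marginal_table)         # well above 1 (associated overall)
```

The items are unrelated inside each class but associated in the pooled table. This is the sense in which the latent variable "explains" the observed association.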

Applications

Behavioral Research

Classify people who are more likely to exhibit specific behaviors
Different kinds of social phobias
Categories of eating disorders

Medicine and Health

Identify patients with different disease risk profiles

Marketing Research

Differentiate subsets of customers and their buying habits

And many, many more…

The formal latent class model

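Let A and B be two observed (manifest) categorical variables, with categories indexed by i and j respectively.

E.g. if A is the respondent’s religious identification, with 1=Protestant, 2=Catholic, 3=Jewish, 4=Other, 5=None (i.e. i = 1, …, 5), then A2 represents the Catholics.

Let X be the latent variable (“Variable C”).

Let t index the latent classes (levels) of X.

Let \(\pi\) denote a probability (when ‘Variable C’ is latent): \(\pi^{X}_{t}\) is the probability of belonging to latent class t, \(\pi^{\overline{A}X}_{it}\) is the conditional probability of response i on A given membership in class t, and \(\pi^{\overline{B}X}_{jt}\) is the corresponding conditional probability of response j on B.

Then the formal LC model can be expressed as

\[ \pi^{ABX}_{ijt} = \pi^{\overline{A}X}_{it}\times \pi^{\overline{B}X}_{jt}\times \pi^{X}_{t} \]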
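Because X is unobserved, only the A-by-B table is seen in the data. Two standard consequences of the product above are worth writing out as a sketch. Here T stands for the total number of latent classes, with the subscript t running from 1 to T; the symbols T, \(\pi^{AB}_{ij}\) and \(\pi^{AB\overline{X}}_{ijt}\) are notation assumed for this illustration. First, the probability of an observed cell is the sum of the joint probabilities over the classes:

\[ \pi^{AB}_{ij} = \sum_{t=1}^{T} \pi^{X}_{t}\, \pi^{\overline{A}X}_{it}\, \pi^{\overline{B}X}_{jt} \]

Second, Bayes' rule gives the posterior probability that a respondent with responses (i, j) belongs to class t:

\[ \pi^{AB\overline{X}}_{ijt} = \frac{\pi^{X}_{t}\, \pi^{\overline{A}X}_{it}\, \pi^{\overline{B}X}_{jt}}{\sum_{s=1}^{T} \pi^{X}_{s}\, \pi^{\overline{A}X}_{is}\, \pi^{\overline{B}X}_{js}} \]

Assigning each respondent to the class with the largest posterior probability is what is meant by modal assignment.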

LCA in R using poLCA

We will use the R package poLCA

library(poLCA)

Loading required package: scatterplot3d

Loading required package: MASS

data(gss82)

gss82: Attitudes towards survey taking across two dichotomous and two trichotomous items among 1202 white respondents to the 1982 General Social Survey. Source: McCutcheon

Next, we need to bind the manifest variables that will be used to estimate the latent classes, and save the resulting formula in an object called f. Because there are no covariates, the model is regressed only on an intercept term (~1).

f <- cbind(PURPOSE,ACCURACY,UNDERSTA,COOPERAT)~1

Once this is done, we can fit the two-class model with the following single line of code:

gss.lc2 <- poLCA(f,gss82,nclass=2, graphs = TRUE)

Conditional item response (column) probabilities,
 by outcome variable, for each class (row) 
 
$PURPOSE
            Good Depends Waste of time
class 1:  0.2154  0.2066        0.5780
class 2:  0.8953  0.0579        0.0468

$ACCURACY
          Mostly true Not true
class 1:       0.0297   0.9703
class 2:       0.6367   0.3633

$UNDERSTA
            Good Fair/Poor
class 1:  0.7422    0.2578
class 2:  0.8327    0.1673

$COOPERAT
          Interested Cooperative Impatient
class 1:      0.6478      0.2498    0.1024
class 2:      0.8840      0.1043    0.0117

Estimated class population shares 
 0.1923 0.8077 
 
Predicted class memberships (by modal posterior prob.) 
 0.1864 0.8136 
 
========================================================= 
Fit for 2 latent classes: 
========================================================= 
number of observations: 1202 
number of estimated parameters: 13 
residual degrees of freedom: 22 
maximum log-likelihood: -2783.268 
 
AIC(2): 5592.536
BIC(2): 5658.729
G^2(2): 79.33723 (Likelihood ratio/deviance statistic) 
X^2(2): 93.25329 (Chi-square goodness of fit)
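The quantities printed above are also stored in the fitted object, which is convenient for further work. The following is a minimal sketch, assuming the poLCA package is installed; the seed and the object names fit, modal_shares and n_par are arbitrary choices made for this example. It also recomputes the number of estimated parameters by hand to show where the 13 comes from.

```r
library(poLCA)

data(gss82)
f <- cbind(PURPOSE, ACCURACY, UNDERSTA, COOPERAT) ~ 1

set.seed(1982)  # arbitrary seed; the EM algorithm starts from random values
fit <- poLCA(f, gss82, nclass = 2, verbose = FALSE)

fit$P                  # estimated class population shares
fit$probs$PURPOSE      # class-conditional response probabilities for one item
head(fit$posterior)    # posterior class membership probabilities per respondent

# Shares of respondents by modal (highest-posterior) class assignment.
modal_shares <- prop.table(table(fit$predclass))

# Parameter count by hand: each item contributes (categories - 1) free
# probabilities per class, plus (classes - 1) free class shares.
n_classes <- 2
n_par <- n_classes * sum(sapply(gss82, nlevels) - 1) + (n_classes - 1)
```

Class numbering is arbitrary, so a rerun may report the same solution with the labels "class 1" and "class 2" swapped.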

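We can further explore the fitted model with poLCA.table, which returns the cell frequencies predicted by the model. For example, the expected counts of COOPERAT among respondents with PURPOSE=3 (Waste of time), ACCURACY=1 (Mostly true) and UNDERSTA=2 (Fair/Poor):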

poLCA.table(formula=COOPERAT~1,condition=list(PURPOSE=3,ACCURACY=1,UNDERSTA=2),lc=gss.lc2)

 Interested Cooperative Impatient
   4.940146   0.7603696 0.1613188

poLCA.table(formula=COOPERAT~UNDERSTA,condition=list(PURPOSE=3,ACCURACY=1),lc=gss.lc2)

                  Good Fair/Poor
Interested  23.1977243 4.9401464
Cooperative  3.2481239 0.7603696
Impatient    0.5831002 0.1613188

Evaluating the model

Model fit:
- BIC and AIC
- X^2 statistic
- Lo-Mendell-Rubin Test
- Standardized Residuals
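Several of these fit measures can be compared across models with different numbers of classes. The following is a minimal sketch, assuming the poLCA package is installed; the range of one to three classes, the seed, and the nrep and maxiter settings are illustrative choices, and the names fits and fit_table are made up for this example. The Lo-Mendell-Rubin test and standardized residuals are not computed here.

```r
library(poLCA)

data(gss82)
f <- cbind(PURPOSE, ACCURACY, UNDERSTA, COOPERAT) ~ 1

set.seed(1982)  # arbitrary seed for reproducible random starts
# Fit models with 1, 2 and 3 classes; nrep reruns the estimation from
# several random starts and keeps the solution with the highest likelihood.
fits <- lapply(1:3, function(k) {
  poLCA(f, gss82, nclass = k, nrep = 5, maxiter = 3000, verbose = FALSE)
})

# Collect the fit statistics stored in each fitted object.
fit_table <- data.frame(
  nclass = 1:3,
  npar   = sapply(fits, function(m) m$npar),
  df     = sapply(fits, function(m) m$resid.df),
  AIC    = sapply(fits, function(m) m$aic),
  BIC    = sapply(fits, function(m) m$bic),
  Gsq    = sapply(fits, function(m) m$Gsq),
  Chisq  = sapply(fits, function(m) m$Chisq)
)

# Approximate p-value of the likelihood ratio statistic against its
# residual degrees of freedom (reliable only when the table is not sparse).
fit_table$p_Gsq <- pchisq(fit_table$Gsq, df = fit_table$df, lower.tail = FALSE)

print(fit_table)
```

Lower AIC and BIC indicate a better trade-off between fit and parsimony. The information criteria should be read together with the substantive interpretability of the classes.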

Model Usefulness

Substantive Interpretation
Classification Quality
- Classification Tables
- Entropy
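Both classification-quality measures can be computed from a matrix of posterior class membership probabilities. The following is a minimal sketch in base R. The function names relative_entropy and average_posterior_by_modal_class are hypothetical helpers written for this example and are not part of poLCA. The entropy used is the common relative (normalised) version, which is 1 for perfect classification and 0 when the posteriors carry no information; the toy matrix is invented.

```r
# Relative entropy: 1 - sum(-p * log(p)) / (n * log(K)),
# for an n-by-K matrix of posterior probabilities (rows sum to 1).
relative_entropy <- function(posterior) {
  n <- nrow(posterior)
  k <- ncol(posterior)
  if (k < 2) stop("relative entropy needs at least two classes")
  # Treat 0 * log(0) as 0, its limiting value.
  plogp <- ifelse(posterior > 0, posterior * log(posterior), 0)
  1 - sum(-plogp) / (n * log(k))
}

# Classification table: rows are modal class assignments, columns are the
# average posterior probabilities of each class among those respondents.
average_posterior_by_modal_class <- function(posterior) {
  modal <- max.col(posterior, ties.method = "first")
  groups <- sort(unique(modal))
  tab <- t(sapply(groups, function(g) {
    colMeans(posterior[modal == g, , drop = FALSE])
  }))
  rownames(tab) <- paste("assigned", groups)
  colnames(tab) <- paste("class", seq_len(ncol(posterior)))
  tab
}

# Toy posterior matrix (invented values) for four respondents.
toy_posterior <- rbind(c(0.9, 0.1), c(0.8, 0.2), c(0.2, 0.8), c(0.4, 0.6))
toy_entropy <- relative_entropy(toy_posterior)
toy_table   <- average_posterior_by_modal_class(toy_posterior)
```

With a model fitted by poLCA, the same functions could be applied to the posterior component of the fitted object, for example relative_entropy(gss.lc2$posterior). Large diagonal values in the classification table indicate well-separated classes.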

Before running any analysis, there are several things to think about:

Sample size
Response patterns/Sparseness
Model identification
Theory
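Sample size and sparseness can be inspected before fitting anything by counting how often each possible response pattern occurs. The following is a minimal sketch, assuming the poLCA package is installed (for the gss82 data). The cut-off of five observations per pattern is an arbitrary rule of thumb used only for illustration, and the object names are made up for this example.

```r
library(poLCA)

data(gss82)
items <- gss82[, c("PURPOSE", "ACCURACY", "UNDERSTA", "COOPERAT")]

# Number of possible response patterns: product of the category counts.
n_possible <- prod(sapply(items, nlevels))

# Frequency of every possible pattern, including patterns never observed.
pattern_counts <- as.data.frame(table(items))

n_observed <- sum(pattern_counts$Freq > 0)  # patterns that actually occur
n_sparse   <- sum(pattern_counts$Freq < 5)  # patterns below the illustrative cut-off

# Rarest patterns first.
head(pattern_counts[order(pattern_counts$Freq), ])
```

Many empty or nearly empty cells suggest that the chi-square approximations for G^2 and X^2 may be unreliable, and that models with many classes may be weakly identified.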