Observed indicators are caused by an unobserved, or latent, variable of interest.
Study the patterns of interrelationships among the observed indicators to understand and characterise the underlying latent variable
Factor analysis is used when the latent variable is continuous (generally with continuous observed indicators).
Factor analysis and PCA reduce many observed variables to a few latent factors
Latent class analysis (LCA) is a method for studying categorically scored variables that is comparable to factor analysis.
| | Continuous latent | Categorical latent |
|---|---|---|
| Continuous observed | Factor Analysis | Latent Profile Analysis |
| Categorical observed | Latent trait analysis or Item Response Theory | Latent Class Analysis |
The latent class model is directly analogous to the factor analysis model. Both models posit an underlying latent variable that is measured by observed variables.
The key difference between the latent class and factor analysis models lies in the nature and distribution of the latent variable.
As mentioned above, in LCA the latent variable is categorical. This categorical latent variable has a multinomial distribution.
By contrast, in classic factor analysis the latent variable is continuous (sometimes referred to as dimensional) and normally distributed.
We define categorical latent variables as those in which “qualitative differences exist between groups of people or objects”, and continuous (or dimensional) latent variables as those in which “people or objects differ quantitatively along one or more continua”.
In the literature, some have drawn a distinction between variable-oriented and person-oriented approaches to statistical analysis of empirical data in the social and behavioral sciences.
In variable-oriented approaches the emphasis is on identifying relations between variables, and it is assumed that these relations apply across all people. Traditional factor analysis is an example of a variable-centered approach. The emphasis in factor analysis is on identifying a factor structure that accounts for the linear relations among a set of observed variables.
The factor structure is assumed to hold for all individuals.
In contrast, in person-oriented approaches the emphasis is on the individual as a whole. This focus often involves studying individuals on the basis of their patterns of individual characteristics that are relevant for the problem under consideration.
[the focus of most scientific endeavors is nomothetic; that is, the goal is not merely to study individuals, but to reason inductively to draw broad conclusions and identify general laws]
One way to do this in a person-oriented framework is to look for subtypes of individuals that exhibit similar patterns of individual characteristics. LCA does exactly this, and therefore is usually considered a person-oriented approach.
Why select a model that posits a categorical latent variable, like LCA, instead of a model that posits a continuous latent variable, like factor analysis?
One reason is to identify an organizing principle for a complex array of empirical categorical data. With LCA an investigator can use an array of observed variables representing characteristics, behaviors, symptoms, or the like as the basis for organizing people into two or more meaningful homogeneous subgroups.
The array of observed data is usually much too large and complex for the subgroups to be evident from inspection.
A latent class model is characterized by having a categorical latent variable and categorical observed variables.
The levels of the categorical latent variable represent groups in the population and are called classes.
We are interested in identifying and understanding these unobserved classes.
It is important to stress, though, that LCA is an unsupervised technique. The researcher stipulates the number of classes the model will estimate, but the solution which will be found cannot be determined from the start.
There is no guarantee, then, that we will find groups that match our preconceptions.
Instead, as with factor analysis or PCA, it is up to the researcher to interpret what the latent groups signify on the basis of the distribution of their responses to the various inputs.
There is one fundamental assumption made by latent class models. The assumption of local independence specifies that conditional on the latent variable, the observed variables are independent.
In other words, if it were possible to create separate contingency tables for the observed variables corresponding to each latent class, the observed variables would be independent within each of these contingency tables.
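A small numeric sketch may make local independence concrete. The class shares and response probabilities below are made-up illustrative values, not estimates from any data set. Within each class the A-by-B table is built as a product of its margins, yet the pooled table that would actually be observed shows an association.

```r
# Hypothetical two-class example with two binary items A and B.
class_share <- c(0.4, 0.6)               # P(X = k), assumed values
p_A <- rbind(c(0.9, 0.1),                # P(A = i | X = 1)
             c(0.2, 0.8))                # P(A = i | X = 2)
p_B <- rbind(c(0.8, 0.2),                # P(B = j | X = 1)
             c(0.3, 0.7))                # P(B = j | X = 2)

# Local independence: within class k, the A-by-B table is the outer
# product of the two conditional margins.
within <- lapply(1:2, function(k) outer(p_A[k, ], p_B[k, ]))

# The observed (pooled) table mixes the class tables by class share.
pooled <- class_share[1] * within[[1]] + class_share[2] * within[[2]]

# An odds ratio of 1 indicates independence in a 2 x 2 table.
odds_ratio <- function(tab) (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

sapply(within, odds_ratio)   # both equal 1, up to floating point
odds_ratio(pooled)           # greater than 1: A and B are associated overall
```

The association in the pooled table comes entirely from mixing the two classes, which is the sense in which the latent variable "explains" the relationships among the observed variables.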
Behavioral Research
Classify people who are more likely to exhibit specific behaviors
Different kinds of social phobias
Categories of eating disorders
Medicine and Health
Marketing Research
And many, many more…
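If A and B are observed categorical variables (with levels indexed by i and j)
If X is the latent variable (“Variable C”)
If t indexes the latent classes (levels) of X
If \(\pi\) denotes a probability (with ‘Variable C’ latent)
Then the formal LC model can be expressed as
\[ \pi^{ABX}_{ijt} = \pi^{\overline{A}X}_{it}\times \pi^{\overline{B}X}_{jt}\times \pi^{X}_{t} \]
Here \(\pi^{X}_{t}\) is the probability of belonging to latent class t, and \(\pi^{\overline{A}X}_{it}\) and \(\pi^{\overline{B}X}_{jt}\) are the conditional probabilities of response i on A and response j on B for a member of class t.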
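Two quantities follow from this product form and may help when reading the output later on. Summing over the latent classes gives the probability of an observed response pattern, and Bayes' rule gives the posterior probability of class membership for that pattern. The symbols \(\pi^{AB}_{ij}\), \(\pi^{X\mid AB}_{t\mid ij}\) and T (the number of latent classes) are notation introduced here for illustration, with A indexed by i, B by j, and the classes by t = 1, ..., T.

\[ \pi^{AB}_{ij} = \sum_{t=1}^{T} \pi^{X}_{t}\,\pi^{\overline{A}X}_{it}\,\pi^{\overline{B}X}_{jt}, \qquad \pi^{X\mid AB}_{t\mid ij} = \frac{\pi^{X}_{t}\,\pi^{\overline{A}X}_{it}\,\pi^{\overline{B}X}_{jt}}{\pi^{AB}_{ij}} \]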
We will use the R package poLCA:
library(poLCA)
Loading required package: scatterplot3d
Loading required package: MASS
data(gss82)
gss82: Attitudes towards survey taking across two dichotomous and two trichotomous items among 1202 white respondents to the 1982 General Social Survey. Source: McCutcheon
Next, we need to bind the observed variables that will be used to create the latent classes and save the resulting formula in an object called f. Because there are no covariates, the right-hand side of the formula contains only an intercept term (~1).
f <- cbind(PURPOSE,ACCURACY,UNDERSTA,COOPERAT)~1
Once this is done, we can run the program with the following single line of code:
gss.lc2 <- poLCA(f,gss82,nclass=2, graphs = TRUE)
Conditional item response (column) probabilities,
by outcome variable, for each class (row)
$PURPOSE
Good Depends Waste of time
class 1: 0.2154 0.2066 0.5780
class 2: 0.8953 0.0579 0.0468
$ACCURACY
Mostly true Not true
class 1: 0.0297 0.9703
class 2: 0.6367 0.3633
$UNDERSTA
Good Fair/Poor
class 1: 0.7422 0.2578
class 2: 0.8327 0.1673
$COOPERAT
Interested Cooperative Impatient
class 1: 0.6478 0.2498 0.1024
class 2: 0.8840 0.1043 0.0117
Estimated class population shares
0.1923 0.8077
Predicted class memberships (by modal posterior prob.)
0.1864 0.8136
=========================================================
Fit for 2 latent classes:
=========================================================
number of observations: 1202
number of estimated parameters: 13
residual degrees of freedom: 22
maximum log-likelihood: -2783.268
AIC(2): 5592.536
BIC(2): 5658.729
G^2(2): 79.33723 (Likelihood ratio/deviance statistic)
X^2(2): 93.25329 (Chi-square goodness of fit)
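The information criteria in this summary can be reproduced by hand from the log-likelihood, which may help when comparing models with different numbers of classes. The sketch below simply copies the printed values into variables (the variable names are chosen here for illustration) and uses base R only.

```r
# Values copied from the printed fit summary for the two-class model.
loglik <- -2783.268   # maximum log-likelihood
n_par  <- 13          # number of estimated parameters
n_obs  <- 1202        # number of observations
res_df <- 22          # residual degrees of freedom
g_sq   <- 79.33723    # likelihood ratio / deviance statistic
x_sq   <- 93.25329    # Pearson chi-square statistic

# Information criteria: -2 * log-likelihood plus a complexity penalty.
aic <- -2 * loglik + 2 * n_par
bic <- -2 * loglik + log(n_obs) * n_par
round(c(AIC = aic, BIC = bic), 3)

# Asymptotic chi-square p-values for the two goodness-of-fit statistics.
p_values <- pchisq(c(G2 = g_sq, X2 = x_sq), df = res_df, lower.tail = FALSE)
p_values
```

Both statistics are large relative to 22 degrees of freedom, so these tests would reject the two-class model at conventional levels. The chi-square approximation can be unreliable when many response patterns have small expected counts, so the p-values are best read alongside AIC and BIC.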
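We can further explore the fitted model with poLCA.table, which returns the predicted cell counts for a chosen combination of responses: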
poLCA.table(formula=COOPERAT~1,condition=list(PURPOSE=3,ACCURACY=1,UNDERSTA=2),lc=gss.lc2)
Interested Cooperative Impatient
4.940146 0.7603696 0.1613188
poLCA.table(formula=COOPERAT~UNDERSTA,condition=list(PURPOSE=3,ACCURACY=1),lc=gss.lc2)
Good Fair/Poor
Interested 23.1977243 4.9401464
Cooperative 3.2481239 0.7603696
Impatient 0.5831002 0.1613188
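To see where these predicted counts come from, the first table can be approximated by hand from the rounded estimates printed in the two-class output: multiply the class share by the conditional probability of each response, sum over classes, and scale by the sample size. This is a sketch using base R and the four-decimal values shown above, so small rounding differences are expected. The variable names are chosen here for illustration.

```r
# Rounded estimates copied from the two-class output.
n_obs       <- 1202
class_share <- c(0.1923, 0.8077)

# Conditional probabilities of the conditioning responses, one value per class.
p_purpose_waste <- c(0.5780, 0.0468)   # PURPOSE  = 3 ("Waste of time")
p_accuracy_true <- c(0.0297, 0.6367)   # ACCURACY = 1 ("Mostly true")
p_understa_fair <- c(0.2578, 0.1673)   # UNDERSTA = 2 ("Fair/Poor")

# Conditional probabilities for COOPERAT (rows = classes).
p_cooperat <- rbind(c(0.6478, 0.2498, 0.1024),
                    c(0.8840, 0.1043, 0.0117))
colnames(p_cooperat) <- c("Interested", "Cooperative", "Impatient")

# Probability of the conditioning pattern jointly with each class.
pattern_by_class <- class_share * p_purpose_waste * p_accuracy_true * p_understa_fair

# Weight each class row, sum over classes, then scale by the sample size.
expected <- n_obs * colSums(pattern_by_class * p_cooperat)
round(expected, 2)   # approximately 4.94, 0.76, 0.16
```

The result agrees with the first poLCA.table output to about two decimal places; the remaining gap comes from using rounded inputs.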
Model fit:
BIC and AIC
X2 Statistic
Lo-Mendell-Rubin Test
Standardized Residuals
Model Usefulness
Substantive Interpretation
Classification Quality
Classification Tables
Entropy
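Classification quality is often summarised with a relative entropy statistic computed from the posterior class-membership probabilities. The sketch below is one common formulation, not a function provided by poLCA: `relative_entropy` is a hypothetical helper and the posterior matrix is made up. Values near 1 indicate that respondents are assigned to classes with little ambiguity, and values near 0 indicate that the classes are poorly separated.

```r
# Hypothetical helper: relative entropy of an N x K matrix of posterior
# class-membership probabilities (each row sums to 1).
relative_entropy <- function(posterior) {
  n <- nrow(posterior)
  k <- ncol(posterior)
  # Treat 0 * log(0) as 0 so that certain assignments contribute nothing.
  terms <- ifelse(posterior > 0, -posterior * log(posterior), 0)
  1 - sum(terms) / (n * log(k))
}

# Toy posteriors (made up): three respondents, two classes.
toy <- rbind(c(0.95, 0.05),
             c(0.10, 0.90),
             c(0.50, 0.50))
relative_entropy(toy)

# Modal assignment, the basis of a classification table.
modal_class <- apply(toy, 1, which.max)
table(modal_class)
```

With a fitted poLCA model, the `posterior` component of the returned object (for example `gss.lc2$posterior`) could be passed to the helper in place of the toy matrix.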
Before running any analysis, there are several things to think about:
Sample size
Response patterns/Sparseness
Model identification
Theory
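Some of these checks can be done with simple arithmetic before fitting anything. The sketch below counts the parameters of an unrestricted latent class model without covariates and compares them with the number of possible response patterns. `lca_df` is a hypothetical helper written for this illustration. Non-negative residual degrees of freedom are a necessary condition for identification, not a sufficient one.

```r
# Hypothetical helper: parameter count and residual degrees of freedom for an
# unrestricted latent class model with no covariates.
lca_df <- function(n_levels, n_class) {
  n_cells <- prod(n_levels)                               # possible response patterns
  n_par   <- n_class * sum(n_levels - 1) + (n_class - 1)  # item probabilities + class shares
  c(n_class = n_class, n_par = n_par, resid_df = n_cells - 1 - n_par)
}

# Items in gss82: PURPOSE (3 levels), ACCURACY (2), UNDERSTA (2), COOPERAT (3).
n_levels <- c(3, 2, 2, 3)
t(sapply(1:6, function(k) lca_df(n_levels, k)))

# Sparseness: average number of respondents per possible response pattern.
1202 / prod(n_levels)
```

For two classes this reproduces the 13 parameters and 22 residual degrees of freedom reported in the fit summary, and it shows that six classes would need more parameters than these four items can support.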