Eduardo Ogawa Cardoso (IDUSP 10864890) - M.Sc Student, Marketing
Maria Carolina Dias Cavalcante (IDUSP 12436263) - Ph.D Candidate, Marketing
eogawac@usp.br; mcarolinadias@usp.br
Latent class analysis (LCA) is a statistical technique for identifying qualitatively distinct subgroups within populations whose members typically share certain outward characteristics. LCA is based on the idea that patterns of scores across survey questions, assessment markers, or scales can explain membership in previously undetected groups (or classes). In more technical terms, LCA is used to detect latent (or unobserved) heterogeneity in samples (Hagenaars & McCutcheon, 2002).
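For readers who prefer a formula, the basic latent class model can be sketched as follows. The notation follows the convention of Linzer & Lewis (2011) as we understand it; the symbols are introduced here only for illustration. With R classes, J survey items, and K_j response options for item j, the probability of respondent i's response pattern is a weighted sum over classes:

```latex
P(Y_i \mid \pi, p) \;=\; \sum_{r=1}^{R} p_r \prod_{j=1}^{J} \prod_{k=1}^{K_j} \left(\pi_{jrk}\right)^{Y_{ijk}}
```

Here p_r is the share of class r in the population, \pi_{jrk} is the probability that a member of class r gives response k to item j, and Y_{ijk} equals 1 if respondent i gave response k to item j and 0 otherwise. The product form reflects the assumption that responses are independent of one another within a class.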
LCA can be done with both commercial and free statistical software, such as Stata (StataCorp LLC, 1985–2019), SAS (SAS Institute Inc., 2016), R (Venables & Smith, 2019), and Mplus (L. K. Muthén & Muthén, 1998–2017). There are also packages built specifically for LCA, such as the commercial Latent GOLD (Vermunt & Magidson, 2016) and the free poLCA package for R (Linzer & Lewis, 2011).
In this exercise we use poLCA for R because it is free and versatile. It also ships with several sample data sets that can be used to learn the technique.
The data come from the 2000 American National Election Study and are included in the poLCA package as the election data set. It contains two sets of six questions, each with four response options, in which respondents were asked how well certain traits describe Al Gore and George W. Bush: moral, caring, knowledgeable, good leader, dishonest, and intelligent. Variables ending in G refer to Gore and variables ending in B refer to Bush (for example, MORALG and MORALB). The responses are (1) Extremely well; (2) Quite well; (3) Not too well; and (4) Not well at all. Some respondents have missing values on these items, and the number of missing values varies by item.
The data set also includes potential covariates: VOTE3, the respondent’s 2000 vote choice (when asked); AGE, the respondent’s age; EDUC, the respondent’s level of education; GENDER, the respondent’s gender; and PARTY, the respondent’s party identification on a Democratic-Republican scale. VOTE3 is coded (1) Gore; (2) Bush; (3) Other.
EDUC falls into the following categories: (1) 8 grades or less; (2) 9–11 grades, no further education; (3) High school diploma or equivalency; (4) More than 12 years of schooling, no higher degree; (5) Junior or community college level degree; (6) BA level degree, no further degree; (7) Advanced degree.
GENDER is coded (1) Male; (2) Female.
The PARTY variable is coded as follows: (1) Strong Democrat; (2) Weak Democrat; (3) Independent-Democrat; (4) Independent-Independent; (5) Independent-Republican; (6) Weak Republican; (7) Strong Republican.
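The numeric codes above are easier to read in tables and plots once they carry labels. The following is a minimal sketch in base R, using a short vector of made-up PARTY codes (the values are illustrative and not taken from the data); the same pattern would apply to EDUC or GENDER.

```r
# Hypothetical PARTY codes for six respondents (illustrative values only)
party_codes <- c(1, 4, 7, 2, NA, 6)

party_labels <- c("Strong Democrat", "Weak Democrat", "Independent-Democrat",
                  "Independent-Independent", "Independent-Republican",
                  "Weak Republican", "Strong Republican")

# factor() maps each integer code to its label; NA stays NA
party_factor <- factor(party_codes, levels = 1:7, labels = party_labels)

# Tabulate the labelled values, keeping the missing value visible
party_table <- table(party_factor, useNA = "ifany")
print(party_table)
```

Note that the covariate is best kept numeric when it enters a model as a single ordered scale, as PARTY does later in this exercise; the labelled copy is meant for description only.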
install.packages("poLCA")
library("poLCA")
data(election)
Warning in data(election) : data set ‘election’ not found
head(election)
| | MORALG <fctr> | CARESG <fctr> | KNOWG <fctr> | LEADG <fctr> | DISHONG <fctr> |
|---|---|---|---|---|---|
| 1 | 3 Not too well | 1 Extremely well | 2 Quite well | 2 Quite well | 3 Not too well |
| 2 | 4 Not well at all | 3 Not too well | 4 Not well at all | 3 Not too well | 2 Quite well |
| 3 | 1 Extremely well | 2 Quite well | 2 Quite well | 1 Extremely well | 3 Not too well |
| 4 | 2 Quite well | 2 Quite well | 2 Quite well | 2 Quite well | 2 Quite well |
| 5 | 2 Quite well | 4 Not well at all | 2 Quite well | 3 Not too well | 2 Quite well |
| 6 | 2 Quite well | 3 Not too well | 3 Not too well | 2 Quite well | 2 Quite well |
summary(election)
| Response | MORALG | CARESG | KNOWG | LEADG | DISHONG | INTELG | MORALB | CARESB | KNOWB | LEADB | DISHONB | INTELB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 Extremely well | 423 | 277 | 461 | 258 | 133 | 494 | 340 | 155 | 274 | 266 | 70 | 329 |
| 2 Quite well | 820 | 713 | 997 | 728 | 312 | 995 | 841 | 625 | 933 | 842 | 288 | 967 |
| 3 Not too well | 287 | 464 | 212 | 522 | 629 | 182 | 330 | 562 | 379 | 407 | 653 | 306 |
| 4 Not well at all | 133 | 232 | 59 | 185 | 557 | 65 | 98 | 342 | 133 | 166 | 574 | 110 |
| NA's | 122 | 99 | 56 | 92 | 154 | 49 | 176 | 101 | 66 | 104 | 200 | 73 |

| Statistic | VOTE3 | AGE | EDUC | GENDER | PARTY |
|---|---|---|---|---|---|
| Min. | 1.000 | 18.00 | 1.000 | 1.00 | 1.000 |
| 1st Qu. | 1.000 | 34.00 | 3.000 | 1.00 | 2.000 |
| Median | 1.000 | 45.00 | 4.000 | 2.00 | 3.000 |
| Mean | 1.534 | 47.12 | 4.305 | 1.56 | 3.726 |
| 3rd Qu. | 2.000 | 58.00 | 6.000 | 2.00 | 6.000 |
| Max. | 3.000 | 97.00 | 7.000 | 2.00 | 7.000 |
| NA's | 625 | 9 | 6 | | 25 |
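Because the items have different numbers of missing values, it helps to know how many respondents remain once incomplete rows are dropped. By default poLCA removes cases with missing values on the manifest variables (its na.rm argument defaults to TRUE). The sketch below shows the counting logic on a small made-up data frame; count_complete is a hypothetical helper written for this illustration, not a poLCA function.

```r
# Count rows with no missing values on a chosen set of columns.
# (Hypothetical helper for illustration; not part of poLCA.)
count_complete <- function(df, cols) {
  sum(complete.cases(df[, cols, drop = FALSE]))
}

# Toy data with the same kind of missingness (all values are made up)
toy <- data.frame(
  MORALG = c(1, 2, NA, 4, 2),
  CARESG = c(2, NA, 3, 1, 2),
  AGE    = c(34, 51, 28, NA, 45)
)

n_items <- count_complete(toy, c("MORALG", "CARESG"))  # rows usable for the items alone
n_all   <- count_complete(toy, names(toy))             # rows usable once AGE is a covariate
print(c(items_only = n_items, with_covariate = n_all))
```

On the real data, the analogous call would be count_complete(election, 1:12) for the twelve trait items, which is presumably the source of the observation count that poLCA reports in its fit summaries. Adding a covariate with missing values can reduce that count further.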
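# Basic latent class model: the 12 trait ratings are the manifest variables, with no covariates (~ 1)
f <- cbind(MORALG, CARESG, KNOWG, LEADG, DISHONG, INTELG,
           MORALB, CARESB, KNOWB, LEADB, DISHONB, INTELB) ~ 1
# Start with one class (the loglinear independence model) as a baseline; 2 and 3 classes follow
nes1 <- poLCA(f, election, nclass = 1)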
Conditional item response (column) probabilities,
by outcome variable, for each class (row)
$MORALG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.2578 0.4783 0.1808 0.0831
$CARESG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.1594 0.4272 0.2807 0.1327
$KNOWG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.2723 0.5622 0.1297 0.0359
$LEADG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.148 0.4249 0.3143 0.1129
$DISHONG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.0877 0.1869 0.3898 0.3356
$INTELG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.286 0.5667 0.1098 0.0374
$MORALB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.209 0.5195 0.2136 0.058
$CARESB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.0999 0.3791 0.3349 0.1861
$KNOWB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.1594 0.5454 0.2174 0.0778
$LEADB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.1632 0.5072 0.2426 0.087
$DISHONB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.0427 0.18 0.4218 0.3555
$INTELB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.1937 0.5545 0.1869 0.0648
Estimated class population shares
1
Predicted class memberships (by modal posterior prob.)
1
=========================================================
Fit for 1 latent classes:
=========================================================
number of observations: 1311
number of estimated parameters: 36
residual degrees of freedom: 1275
maximum log-likelihood: -18647.31
AIC(1): 37366.62
BIC(1): 37553.05
G^2(1): 18882 (Likelihood ratio/deviance statistic)
X^2(1): 29318840470 (Chi-square goodness of fit)
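The fit block above can be checked by hand, which also clarifies what each number means. The sketch below recomputes the parameter count, residual degrees of freedom, AIC, and BIC from the reported log-likelihood and sample size. It assumes the standard definitions AIC = -2 logL + 2p and BIC = -2 logL + p log(N), and that the residual degrees of freedom are N minus p here (the number of possible response patterns is far larger than N); count_params is a hypothetical helper.

```r
# Values read from the 1-class output above
n_obs   <- 1311
loglik  <- -18647.31
n_items <- 12   # manifest variables in the formula
n_resp  <- 4    # response categories per item

# Free parameters: (K - 1) response probabilities per item and class,
# plus (R - 1) class shares. Hypothetical helper for illustration.
count_params <- function(n_class) {
  n_class * n_items * (n_resp - 1) + (n_class - 1)
}

npar     <- count_params(1)
aic      <- -2 * loglik + 2 * npar
bic      <- -2 * loglik + npar * log(n_obs)
resid_df <- n_obs - npar

print(round(c(npar = npar, resid_df = resid_df, AIC = aic, BIC = bic), 2))
```

The same helper gives 73 and 110 parameters for two and three classes, which agrees with the counts reported for the larger models.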
nes2 = poLCA(f,election,nclass=2)
Conditional item response (column) probabilities,
by outcome variable, for each class (row)
$MORALG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.1172 0.4647 0.2806 0.1375
class 2: 0.4247 0.4944 0.0623 0.0186
$CARESG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.0426 0.3407 0.4154 0.2014
class 2: 0.2982 0.5298 0.1208 0.0512
$KNOWG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.1349 0.6121 0.2049 0.0481
class 2: 0.4354 0.5029 0.0404 0.0214
$LEADG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.0375 0.3188 0.4596 0.1840
class 2: 0.2791 0.5508 0.1417 0.0284
$DISHONG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.1321 0.287 0.3879 0.193
class 2: 0.0350 0.068 0.3921 0.505
$INTELG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.1606 0.6271 0.1641 0.0481
class 2: 0.4349 0.4951 0.0454 0.0246
$MORALB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.3177 0.5809 0.0933 0.0081
class 2: 0.0799 0.4465 0.3564 0.1172
$CARESB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.1724 0.5825 0.2216 0.0234
class 2: 0.0138 0.1376 0.4693 0.3793
$KNOWB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.2449 0.6786 0.0763 0.0002
class 2: 0.0579 0.3873 0.3849 0.1699
$LEADB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.2789 0.6532 0.0628 0.0050
class 2: 0.0259 0.3339 0.4560 0.1842
$DISHONB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.0208 0.0961 0.3854 0.4977
class 2: 0.0688 0.2796 0.4650 0.1865
$INTELB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.2763 0.6686 0.0551 0.0000
class 2: 0.0958 0.4191 0.3433 0.1418
Estimated class population shares
0.5428 0.4572
Predicted class memberships (by modal posterior prob.)
0.5454 0.4546
=========================================================
Fit for 2 latent classes:
=========================================================
number of observations: 1311
number of estimated parameters: 73
residual degrees of freedom: 1238
maximum log-likelihood: -17344.92
AIC(2): 34835.85
BIC(2): 35213.88
G^2(2): 16277.22 (Likelihood ratio/deviance statistic)
X^2(2): 257217240886 (Chi-square goodness of fit)
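The output above reports two slightly different sets of class proportions: the estimated class population shares and the predicted class memberships by modal posterior probability. The sketch below illustrates why they can differ, using a made-up posterior matrix for five respondents. In a fitted poLCA object the corresponding pieces should be available as the posterior, predclass, and P components (for example nes2$posterior); the toy matrix here stands in for them.

```r
# Made-up posterior class-membership probabilities for five respondents
# (each row sums to 1; columns are classes 1 and 2)
posterior <- rbind(
  c(0.90, 0.10),
  c(0.55, 0.45),
  c(0.20, 0.80),
  c(0.60, 0.40),
  c(0.05, 0.95)
)

# Estimated class shares: average posterior probability per class
est_shares <- colMeans(posterior)

# Modal assignment: each respondent goes to the most probable class
pred_class  <- max.col(posterior, ties.method = "first")
pred_shares <- as.numeric(table(factor(pred_class, levels = 1:2))) / nrow(posterior)

print(rbind(estimated = est_shares, modal = pred_shares))
```

Modal assignment discards the uncertainty in borderline cases (such as the second respondent), so the two sets of proportions agree closely only when most respondents are classified with high probability.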
nes3 = poLCA(f,election,nclass=3)
Conditional item response (column) probabilities,
by outcome variable, for each class (row)
$MORALG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.1013 0.5990 0.2552 0.0446
class 2: 0.1851 0.3668 0.2350 0.2131
class 3: 0.5224 0.4109 0.0390 0.0277
$CARESG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.0227 0.5090 0.3819 0.0864
class 2: 0.0879 0.2546 0.3524 0.3051
class 3: 0.3970 0.4605 0.0895 0.0529
$KNOWG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.0429 0.8058 0.1408 0.0105
class 2: 0.2542 0.4253 0.2382 0.0822
class 3: 0.5878 0.3543 0.0266 0.0312
$LEADG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.0207 0.4886 0.4268 0.0639
class 2: 0.0747 0.2582 0.3856 0.2816
class 3: 0.3747 0.4772 0.1085 0.0396
$DISHONG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.0519 0.2182 0.4998 0.2301
class 2: 0.2007 0.3060 0.2774 0.2158
class 3: 0.0426 0.0487 0.3371 0.5717
$INTELG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.0600 0.7953 0.1317 0.0130
class 2: 0.2936 0.4407 0.1852 0.0805
class 3: 0.5763 0.3698 0.0197 0.0341
$MORALB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.0642 0.7137 0.2164 0.0057
class 2: 0.5797 0.3787 0.0248 0.0168
class 3: 0.0966 0.3795 0.3638 0.1601
$CARESB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.0093 0.4847 0.4380 0.0679
class 2: 0.3483 0.5362 0.0836 0.0319
class 3: 0.0162 0.1125 0.4045 0.4668
$KNOWB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.0191 0.7583 0.2182 0.0045
class 2: 0.4970 0.4899 0.0132 0.0000
class 3: 0.0682 0.3115 0.3828 0.2374
$LEADB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.0537 0.7119 0.2217 0.0127
class 2: 0.5106 0.4648 0.0187 0.0058
class 3: 0.0236 0.2735 0.4524 0.2505
$DISHONB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.0134 0.1471 0.5683 0.2711
class 2: 0.0365 0.0740 0.1993 0.6902
class 3: 0.0862 0.3096 0.4111 0.1931
$INTELB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.0435 0.7741 0.1796 0.0028
class 2: 0.5284 0.4656 0.0060 0.0000
class 3: 0.1179 0.3391 0.3440 0.1990
Estimated class population shares
0.4194 0.2608 0.3198
Predicted class memberships (by modal posterior prob.)
0.4256 0.2578 0.3166
=========================================================
Fit for 3 latent classes:
=========================================================
number of observations: 1311
number of estimated parameters: 110
residual degrees of freedom: 1201
maximum log-likelihood: -16714.66
AIC(3): 33649.32
BIC(3): 34218.96
G^2(3): 15016.7 (Likelihood ratio/deviance statistic)
X^2(3): 16748681903 (Chi-square goodness of fit)
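With three models fitted, the usual next step is to compare them on information criteria. The sketch below collects the values printed in the three fit blocks above into one table and picks the model with the lowest AIC and BIC. The numbers are copied from the output rather than recomputed; with the fitted objects at hand, the same values should be available as components such as nes1$aic and nes1$bic.

```r
# Fit statistics copied from the three outputs above
fit_table <- data.frame(
  classes = 1:3,
  npar    = c(36, 73, 110),
  loglik  = c(-18647.31, -17344.92, -16714.66),
  AIC     = c(37366.62, 34835.85, 33649.32),
  BIC     = c(37553.05, 35213.88, 34218.96)
)

# Lower AIC/BIC indicates a better trade-off between fit and complexity
best_aic <- fit_table$classes[which.min(fit_table$AIC)]
best_bic <- fit_table$classes[which.min(fit_table$BIC)]

print(fit_table)
cat("Lowest AIC:", best_aic, "classes; lowest BIC:", best_bic, "classes\n")
```

Both criteria favour the three-class model among the candidates considered. Since both keep decreasing from one to three classes, models with more classes may fit better still; that is not examined here.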
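# Add PARTY as a covariate that predicts latent class membership
f2a <- cbind(MORALG,CARESG,KNOWG,LEADG,DISHONG,INTELG,
MORALB,CARESB,KNOWB,LEADB,DISHONB,INTELB)~PARTY
# nrep=5 re-estimates the model from five sets of starting values and keeps the best fit
nes2a <- poLCA(f2a,election,nclass=3,nrep=5)
# Design matrix: intercept plus PARTY values 1 to 7
pidmat <- cbind(1,c(1:7))
# Exponentiated linear predictors for classes 2 and 3 (class 1 is the reference)
exb <- exp(pidmat %*% nes2a$coeff)
# Plot predicted class membership probabilities across party identification
matplot(c(1:7),(cbind(1,exb)/(1+rowSums(exb))),ylim=c(0,1),type="l",
main="Party ID as a predictor of candidate affinity class",
xlab="Party ID: strong Democratic (1) to strong Republican (7)",
ylab="Probability of latent class membership",lwd=2,col=1)
# Class order can change between runs; check the output and move the labels if needed
text(5.9,0.35,"Other")
text(5.4,0.7,"Bush affinity")
text(1.8,0.6,"Gore affinity")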
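The plotting code above turns the estimated coefficients into class membership probabilities through a multinomial logit transformation. The sketch below isolates that step with a hypothetical coefficient matrix, so that it runs without fitting a model. The numbers are made up and only mimic the layout assumed for nes2a$coeff: one row per term (intercept, PARTY) and one column per non-reference class.

```r
# Hypothetical coefficients: rows = (intercept, PARTY slope),
# columns = classes 2 and 3, with class 1 as the reference
coeff <- rbind(c(-3.0,  1.5),
               c( 0.8, -0.4))

# Design matrix: a column of ones and PARTY values 1 to 7
pidmat <- cbind(1, 1:7)

# Multinomial logit: exponentiate the linear predictors, then normalize
# so that the three class probabilities sum to 1 for each PARTY value
exb   <- exp(pidmat %*% coeff)
probs <- cbind(1, exb) / (1 + rowSums(exb))
colnames(probs) <- paste("class", 1:3)

print(round(probs, 3))
```

Each row of probs corresponds to one value of PARTY, and each column is one of the curves drawn by matplot. The reference class contributes the constant 1 in the numerator because its linear predictor is fixed at zero.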