Latent Class Analysis (LCA)
主要是基于数据集中的显分类变量,来推测潜在的分类变量。
LCA的核心假设为条件独立,即任意两个外显变量分布是独立的。例如,\(A\)和\(B\)都是显变量,\(A\)变量有\(J\)个类别,\(B\)变量有\(K\)个类别,则个体属于\(j\)和\(k\)类别的联合概率为:\(P_{j k}=P_{j}{ }^{A} P_{k}{ }^{B}\)
该假设其实与朴素贝叶斯的假设比较像。区别在于朴素贝叶斯是一种监督学习,而LCA不知道个体所属具体潜在类别的。
设潜类变量为\(X\),共有\(T\)个类别,则基于极大似然原则,LPA使用EM等算法,可以估计:
1. 潜在类别\(t\)的占比:\(p\left(x_i = t\right)\)
2. \(t\)类别中外显变量的概率分布:\(f\left(a_i = j, b_i = k \mid x_{i}=t\right)\)
进而可以基于贝叶斯后验分布来估计某个体\(i\)属于某类别的概率: \[ p\left(x_{i}=t \mid a_i = j, b_i = k\right)=\frac{p\left(x_{i}=t\right) f\left(a_i = j, b_i = k \mid x_{i}=t\right)}{f\left(a_i = j, b_i = k\right)} \tag{1} \]
这里使用的是National Mental Health Services Survey (N-MHSS)
数据,该数据调查了美国50个州12472家精神卫生中心的设备、诊疗情况。
数据详细情况及下载:
https://www.datafiles.samhsa.gov/dataset/national-mental-health-services-survey-2019-n-mhss-2019-ds0001
# 设定示例中需要的5个变量
f1 <- as.formula(cbind(MHINTAKE, MHDIAGEVAL, MHREFERRAL, TREATMT, ADMINSERV)~1)
# 估计
LCA3 <- poLCA(f1, data=samhsa2015, nclass=3) # 3个潜类别
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $MHINTAKE
## Pr(1) Pr(2)
## class 1: 0.0094 0.9906
## class 2: 0.0572 0.9428
## class 3: 0.8089 0.1911
##
## $MHDIAGEVAL
## Pr(1) Pr(2)
## class 1: 0.0212 0.9788
## class 2: 0.0641 0.9359
## class 3: 0.7570 0.2430
##
## $MHREFERRAL
## Pr(1) Pr(2)
## class 1: 0.0011 0.9989
## class 2: 0.2347 0.7653
## class 3: 0.8385 0.1615
##
## $TREATMT
## Pr(1) Pr(2)
## class 1: 0.2460 0.7540
## class 2: 0.5481 0.4519
## class 3: 0.7970 0.2030
##
## $ADMINSERV
## Pr(1) Pr(2)
## class 1: 0.1000 0.9000
## class 2: 0.4687 0.5313
## class 3: 0.7316 0.2684
##
## Estimated class population shares
## 0.3876 0.5082 0.1042
##
## Predicted class memberships (by modal posterior prob.)
## 0.3378 0.5575 0.1047
##
## =========================================================
## Fit for 3 latent classes:
## =========================================================
## number of observations: 12826
## number of estimated parameters: 17
## residual degrees of freedom: 14
## maximum log-likelihood: -29888.89
##
## AIC(3): 59811.77
## BIC(3): 59938.58
## G^2(3): 159.4307 (Likelihood ratio/deviance statistic)
## X^2(3): 177.1366 (Chi-square goodness of fit)
##
## ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND
##
输出结果已经对估计出的各变量说明的很清楚了,这里就不再赘述了。
特别的,我们可以观察不同类别展示某种外显特征的概率:
## [,1] [,2] [,3]
## MHINTAKE 0.9905772 0.9428064 0.1910962
## MHDIAGEVAL 0.9788447 0.9358906 0.2430108
## MHREFERRAL 0.9988919 0.7653179 0.1615213
## TREATMT 0.7539865 0.4518728 0.2030123
## ADMINSERV 0.9000497 0.5312699 0.2683611
我们引入显变量\(Z\),用来解释潜类别\(X\)。
换句话说,类似于估计出每个个体所属类别后,用类别对变量\(Z\)做一个多元logit回归(不要这样做,会造成有偏估计,最好直接在poLCA
中完成)。
f2 <- as.formula(cbind(MHINTAKE, MHDIAGEVAL, MHREFERRAL, TREATMT, ADMINSERV)~PAYASST)
LCA3c <- poLCA(f2, data=samhsa2015, nclass=3, maxiter=3000, nrep=5)
## Model 1: llik = -29884.48 ... best llik = -29884.48
## Model 2: llik = -29954.09 ... best llik = -29884.48
## Model 3: llik = -33041.62 ... best llik = -29884.48
## Model 4: llik = -30086.04 ... best llik = -29884.48
## Model 5: llik = -29884.48 ... best llik = -29884.48
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $MHINTAKE
## Pr(1) Pr(2)
## class 1: 0.8094 0.1906
## class 2: 0.0081 0.9919
## class 3: 0.0574 0.9426
##
## $MHDIAGEVAL
## Pr(1) Pr(2)
## class 1: 0.7576 0.2424
## class 2: 0.0198 0.9802
## class 3: 0.0644 0.9356
##
## $MHREFERRAL
## Pr(1) Pr(2)
## class 1: 0.8399 0.1601
## class 2: 0.0000 1.0000
## class 3: 0.2303 0.7697
##
## $TREATMT
## Pr(1) Pr(2)
## class 1: 0.7964 0.2036
## class 2: 0.2301 0.7699
## class 3: 0.5530 0.4470
##
## $ADMINSERV
## Pr(1) Pr(2)
## class 1: 0.7344 0.2656
## class 2: 0.1057 0.8943
## class 3: 0.4558 0.5442
##
## Estimated class population shares
## 0.1039 0.3759 0.5202
##
## Predicted class memberships (by modal posterior prob.)
## 0.1047 0.3391 0.5562
##
## =========================================================
## Fit for 3 latent classes:
## =========================================================
## 2 / 1
## Coefficient Std. error t value Pr(>|t|)
## (Intercept) 1.31792 0.09948 13.248 0.000
## PAYASST -0.09678 0.04964 -1.950 0.075
## =========================================================
## 3 / 1
## Coefficient Std. error t value Pr(>|t|)
## (Intercept) 1.62401 0.06623 24.520 0.000
## PAYASST -0.03601 0.05381 -0.669 0.516
## =========================================================
## number of observations: 12826
## number of estimated parameters: 19
## residual degrees of freedom: 12
## maximum log-likelihood: -29884.48
##
## AIC(3): 59806.96
## BIC(3): 59948.69
## X^2(3): 178.1568 (Chi-square goodness of fit)
##
## ALERT: estimation algorithm automatically restarted with new initial values
##
下面显示:
Class1
:五项服务都不提供Class2
:五项服务都提供Class3
:只提供基本的服务因此,上面对协变量的估计结果说明,与啥服务基本都不提供的Class1
相比,剩下两个类别较少提供付费服务(PAYASST
)
## [,1] [,2] [,3]
## MHINTAKE 0.1906281 0.9918511 0.9426321
## MHDIAGEVAL 0.2424217 0.9802387 0.9355783
## MHREFERRAL 0.1601107 1.0000000 0.7697368
## TREATMT 0.2036498 0.7698715 0.4469578
## ADMINSERV 0.2656099 0.8942964 0.5441649