1 简介

Latent Class Analysis (LCA) 主要是基于数据集中的显分类变量，来推测潜在的分类变量。

LCA的核心假设为条件独立，即任意两个外显变量分布是独立的。例如，\(A\)和\(B\)都是显变量，\(A\)变量有\(J\)个类别，\(B\)变量有\(K\)个类别，则个体属于\(j\)和\(k\)类别的联合概率为：\(P_{j k}=P_{j}{ }^{A} P_{k}{ }^{B}\)

该假设其实与朴素贝叶斯的假设比较像。区别在于朴素贝叶斯是一种监督学习，而LCA不知道个体所属具体潜在类别的。

设潜类变量为\(X\)，共有\(T\)个类别，则基于极大似然原则，LPA使用EM等算法，可以估计:
1. 潜在类别\(t\)的占比：\(p\left(x_i = t\right)\)
2. \(t\)类别中外显变量的概率分布：\(f\left(a_i = j, b_i = k \mid x_{i}=t\right)\)

进而可以基于贝叶斯后验分布来估计某个体\(i\)属于某类别的概率： \[ p\left(x_{i}=t \mid a_i = j, b_i = k\right)=\frac{p\left(x_{i}=t\right) f\left(a_i = j, b_i = k \mid x_{i}=t\right)}{f\left(a_i = j, b_i = k\right)} \tag{1} \]

2 工具包

pacman::p_load(poLCA, readr, tidyverse)

3 数据

这里使用的是National Mental Health Services Survey (N-MHSS)数据，该数据调查了美国50个州12472家精神卫生中心的设备、诊疗情况。
数据详细情况及下载：
https://www.datafiles.samhsa.gov/dataset/national-mental-health-services-survey-2019-n-mhss-2019-ds0001

samhsa2015 <- read_csv("N-MHSS-2015-DS0001-data.csv") %>%
  dplyr::select(MHINTAKE, MHDIAGEVAL, MHREFERRAL, TREATMT, ADMINSERV, PAYASST) %>%
  # 由于poLCA要求显变量必须从1开始，而原分类变量为0，所以我们将所有变量+1
  mutate(across(.cols = 1:5,.fns = ~. + 1))

4 估计

# 设定示例中需要的5个变量
f1 <- as.formula(cbind(MHINTAKE, MHDIAGEVAL, MHREFERRAL, TREATMT, ADMINSERV)~1) 

# 估计
LCA3 <- poLCA(f1, data=samhsa2015, nclass=3) # 3个潜类别

## Conditional item response (column) probabilities,
##  by outcome variable, for each class (row) 
##  
## $MHINTAKE
##            Pr(1)  Pr(2)
## class 1:  0.0094 0.9906
## class 2:  0.0572 0.9428
## class 3:  0.8089 0.1911
## 
## $MHDIAGEVAL
##            Pr(1)  Pr(2)
## class 1:  0.0212 0.9788
## class 2:  0.0641 0.9359
## class 3:  0.7570 0.2430
## 
## $MHREFERRAL
##            Pr(1)  Pr(2)
## class 1:  0.0011 0.9989
## class 2:  0.2347 0.7653
## class 3:  0.8385 0.1615
## 
## $TREATMT
##            Pr(1)  Pr(2)
## class 1:  0.2460 0.7540
## class 2:  0.5481 0.4519
## class 3:  0.7970 0.2030
## 
## $ADMINSERV
##            Pr(1)  Pr(2)
## class 1:  0.1000 0.9000
## class 2:  0.4687 0.5313
## class 3:  0.7316 0.2684
## 
## Estimated class population shares 
##  0.3876 0.5082 0.1042 
##  
## Predicted class memberships (by modal posterior prob.) 
##  0.3378 0.5575 0.1047 
##  
## ========================================================= 
## Fit for 3 latent classes: 
## ========================================================= 
## number of observations: 12826 
## number of estimated parameters: 17 
## residual degrees of freedom: 14 
## maximum log-likelihood: -29888.89 
##  
## AIC(3): 59811.77
## BIC(3): 59938.58
## G^2(3): 159.4307 (Likelihood ratio/deviance statistic) 
## X^2(3): 177.1366 (Chi-square goodness of fit) 
##  
## ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND 
##

输出结果已经对估计出的各变量说明的很清楚了，这里就不再赘述了。

特别的，我们可以观察不同类别展示某种外显特征的概率：

map(LCA3$probs, ~.[,2] ) %>%
  as_tibble() %>% 
  t()

##                 [,1]      [,2]      [,3]
## MHINTAKE   0.9905772 0.9428064 0.1910962
## MHDIAGEVAL 0.9788447 0.9358906 0.2430108
## MHREFERRAL 0.9988919 0.7653179 0.1615213
## TREATMT    0.7539865 0.4518728 0.2030123
## ADMINSERV  0.9000497 0.5312699 0.2683611

5 绘图

主要可以通过画图的形式来展示结果

plot(LCA3)

6 协变量设置

我们引入显变量\(Z\)，用来解释潜类别\(X\)。
换句话说，类似于估计出每个个体所属类别后，用类别对变量\(Z\)做一个多元logit回归（不要这样做，会造成有偏估计，最好直接在poLCA中完成）。

f2 <- as.formula(cbind(MHINTAKE, MHDIAGEVAL, MHREFERRAL, TREATMT, ADMINSERV)~PAYASST)

LCA3c <- poLCA(f2, data=samhsa2015, nclass=3, maxiter=3000, nrep=5)

## Model 1: llik = -29884.48 ... best llik = -29884.48
## Model 2: llik = -29954.09 ... best llik = -29884.48
## Model 3: llik = -33041.62 ... best llik = -29884.48
## Model 4: llik = -30086.04 ... best llik = -29884.48
## Model 5: llik = -29884.48 ... best llik = -29884.48
## Conditional item response (column) probabilities,
##  by outcome variable, for each class (row) 
##  
## $MHINTAKE
##            Pr(1)  Pr(2)
## class 1:  0.8094 0.1906
## class 2:  0.0081 0.9919
## class 3:  0.0574 0.9426
## 
## $MHDIAGEVAL
##            Pr(1)  Pr(2)
## class 1:  0.7576 0.2424
## class 2:  0.0198 0.9802
## class 3:  0.0644 0.9356
## 
## $MHREFERRAL
##            Pr(1)  Pr(2)
## class 1:  0.8399 0.1601
## class 2:  0.0000 1.0000
## class 3:  0.2303 0.7697
## 
## $TREATMT
##            Pr(1)  Pr(2)
## class 1:  0.7964 0.2036
## class 2:  0.2301 0.7699
## class 3:  0.5530 0.4470
## 
## $ADMINSERV
##            Pr(1)  Pr(2)
## class 1:  0.7344 0.2656
## class 2:  0.1057 0.8943
## class 3:  0.4558 0.5442
## 
## Estimated class population shares 
##  0.1039 0.3759 0.5202 
##  
## Predicted class memberships (by modal posterior prob.) 
##  0.1047 0.3391 0.5562 
##  
## ========================================================= 
## Fit for 3 latent classes: 
## ========================================================= 
## 2 / 1 
##             Coefficient  Std. error  t value  Pr(>|t|)
## (Intercept)     1.31792     0.09948   13.248     0.000
## PAYASST        -0.09678     0.04964   -1.950     0.075
## ========================================================= 
## 3 / 1 
##             Coefficient  Std. error  t value  Pr(>|t|)
## (Intercept)     1.62401     0.06623   24.520     0.000
## PAYASST        -0.03601     0.05381   -0.669     0.516
## ========================================================= 
## number of observations: 12826 
## number of estimated parameters: 19 
## residual degrees of freedom: 12 
## maximum log-likelihood: -29884.48 
##  
## AIC(3): 59806.96
## BIC(3): 59948.69
## X^2(3): 178.1568 (Chi-square goodness of fit) 
##  
## ALERT: estimation algorithm automatically restarted with new initial values 
##

下面显示:

Class1:五项服务都不提供
Class2:五项服务都提供
Class3:只提供基本的服务

因此，上面对协变量的估计结果说明，与啥服务基本都不提供的Class1相比，剩下两个类别较少提供付费服务（PAYASST）

plot(LCA3c)

map(LCA3c$probs, ~.[,2] ) %>%
  as_tibble() %>% 
  t()

##                 [,1]      [,2]      [,3]
## MHINTAKE   0.1906281 0.9918511 0.9426321
## MHDIAGEVAL 0.2424217 0.9802387 0.9355783
## MHREFERRAL 0.1601107 1.0000000 0.7697368
## TREATMT    0.2036498 0.7698715 0.4469578
## ADMINSERV  0.2656099 0.8942964 0.5441649