Latent Prfile Analysis (LPA)
主要是基于数据集中外显连续变量\(Y\)(也可以是序次变量),来推测潜在的分类变量\(C\)。其假设是观测对象所属的潜在类别,决定了其外显变量服从的分布。
基于极大似然原则,LPA使用EM等算法,可以估计:
1. 潜在类别\(k\)的占比:\(p\left(c_{i}=k\right)\)
2. \(k\)类别中外显变量的概率分布:\(f\left(y_{i} \mid c_{i}=k\right)\)
进而可以基于贝叶斯后验分布来估计某个体\(i\)属于某类别的概率:
\[ p\left(c_{i}=k \mid y_{i}\right)=\frac{p\left(c_{i}=k\right) f\left(y_{i} \mid c_{i}=k\right)}{f\left(y_{i}\right)} \tag{1} \]
与因子分析
相比:
1. LPA的潜变量是定类变量,而因子分析是连续变量。
2. 因子分析从变量相关性入手,LPA的处理对象是个体
与聚类分析
相比:
1. 二者都是为了将个体分为不同组群
2. LPA是基于模型的方法,模型假设合理时,所得结果会更可靠
3. LPA可以计算个体所属各类别的概率
这里使用系统自带的pisaUSA15
数据集做演示。
由于该数据集存在部分缺失值,所以用single_imputation
做插补
# 准备数据集
data <- pisaUSA15 %>%
slice(1:100) %>%
select(broad_interest, enjoyment, self_efficacy) %>%
single_imputation()
head(data, 5)
## broad_interest enjoyment self_efficacy
## 1 3.8 4.0 1.000
## 2 3.0 3.0 2.750
## 3 1.8 2.8 3.375
## 4 1.4 1.0 2.750
## 5 1.8 2.2 2.000
这是使用的tidyLPA
包需调用其他算法包。包括R本身的mclust
以及广为使用的Mplus
。
不同算法可以自由的选择。
## tidyLPA analysis using mclust:
##
## Model Classes AIC BIC Entropy prob_min prob_max n_min n_max BLRT_p
## 1 3 628.20 664.67 0.80 0.84 0.94 0.03 0.65 0.01
plot_profiles
data %>%
scale() %>% # 只是为了展示方便(有些变量量纲很大,看不到其他变量了),不影响结果
estimate_profiles(n_profiles = 3) %>%
plot_profiles()
plot_profiles
## # A tibble: 5 x 7
## broad_interest enjoyment self_efficacy CPROB1 CPROB2 CPROB3 Class
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3.8 4 1 1.00 7.54e-17 0.000333 1
## 2 3 3 2.75 0.0704 1.66e- 6 0.930 3
## 3 1.8 2.8 3.38 0.000807 1.86e- 3 0.997 3
## 4 1.4 1 2.75 0.0000000935 9.12e- 1 0.0884 2
## 5 1.8 2.2 2 0.00305 1.08e- 4 0.997 3
get_fit()
## # A tibble: 1 x 18
## Model Classes LogLik AIC AWE BIC CAIC CLC KIC SABIC ICL Entropy
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 3 -300. 628. 770. 665. 679. 602. 645. 620. -686. 0.796
## # … with 6 more variables: prob_min <dbl>, prob_max <dbl>, n_min <dbl>,
## # n_max <dbl>, BLRT_val <dbl>, BLRT_p <dbl>
Mclust_result$model_1_class_3$estimates %>%
mutate(across(where(is.numeric), ~round(., 2))) %>%
head(10)
## Category Parameter Estimate se p Class Model Classes
## 1 Means broad_interest 3.06 0.13 0 1 1 3
## 2 Means enjoyment 3.51 0.18 0 1 1 3
## 3 Means self_efficacy 1.56 0.15 0 1 1 3
## 4 Variances broad_interest 0.48 0.09 0 1 1 3
## 5 Variances enjoyment 0.23 0.06 0 1 1 3
## 6 Variances self_efficacy 0.24 0.04 0 1 1 3
## 7 Means broad_interest 1.14 0.35 0 2 1 3
## 8 Means enjoyment 1.15 0.32 0 2 1 3
## 9 Means self_efficacy 3.39 0.52 0 2 1 3
## 10 Variances broad_interest 0.48 0.09 0 2 1 3
## $pro
## [1] 0.33998137 0.03198932 0.62802931
##
## $mean
## [,1] [,2] [,3]
## broad_interest 3.057033 1.141835 2.300432
## enjoyment 3.512280 1.149478 2.508222
## self_efficacy 1.556769 3.386804 2.255146
##
## $variance
## $variance$modelName
## [1] "EEI"
##
## $variance$d
## [1] 3
##
## $variance$G
## [1] 3
##
## $variance$sigma
## , , 1
##
## broad_interest enjoyment self_efficacy
## broad_interest 0.4808715 0.0000000 0.0000000
## enjoyment 0.0000000 0.2320632 0.0000000
## self_efficacy 0.0000000 0.0000000 0.2447314
##
## , , 2
##
## broad_interest enjoyment self_efficacy
## broad_interest 0.4808715 0.0000000 0.0000000
## enjoyment 0.0000000 0.2320632 0.0000000
## self_efficacy 0.0000000 0.0000000 0.2447314
##
## , , 3
##
## broad_interest enjoyment self_efficacy
## broad_interest 0.4808715 0.0000000 0.0000000
## enjoyment 0.0000000 0.2320632 0.0000000
## self_efficacy 0.0000000 0.0000000 0.2447314
##
##
## $variance$Sigma
## broad_interest enjoyment self_efficacy
## broad_interest 0.4808715 0.0000000 0.0000000
## enjoyment 0.0000000 0.2320632 0.0000000
## self_efficacy 0.0000000 0.0000000 0.2447314
##
## $variance$scale
## [1] 0.3011445
##
## $variance$shape
## [1] 1.596813 0.770604 0.812671
可以看到,我们刚才执行的LPA,默认的协方差矩阵是一个对角阵,而且不同类别对应的对角线元素相等。
estimate_profiles()
中主要有以下参数可以设定:
n_profiles
:分为多少类variances:
:
covariances
:
variances
和covariances
可以组合成6种模型:
Model | Mclust | variances | covariances |
---|---|---|---|
1 | EEI | equal | zero |
2 | VVI | varying | zero |
3 | EEE | equal | equal |
4 | varying | equal | |
5 | equal | varying | |
6 | VVV | varying | varying |
在上一部分可以看到,LPA有非常多的参数可供选择,那么选择哪一组合进行估计最好呢?
在这里,可以使用compare_solutions()
来帮助我们:
data %>%
estimate_profiles(1:3,
variances = c("equal", "varying"),
covariances = c("zero", "varying")) %>%
compare_solutions(statistics = c("AIC", "BIC"))
## Compare tidyLPA solutions:
##
## Model Classes AIC BIC
## 1 1 675.138 690.769
## 1 2 634.718 660.770
## 1 3 627.453 663.925
## 6 1 619.386 642.832
## 6 2 617.629 667.127
## 6 3 624.324 699.874
##
## Best model according to AIC is Model 6 with 2 classes.
## Best model according to BIC is Model 6 with 1 classes.
##
## An analytic hierarchy process, based on the fit indices AIC, AWE, BIC, CLC, and KIC (Akogul & Erisoglu, 2017), suggests the best solution is Model 6 with 1 classes.
在上面的代码中,我们对比了Model1和Model6两种协方差结构设置下,1-3种成分模型的AIC和BIC。
注意:AIC和BIC准则永远只是一种候补方法。也就是说,能基于理论进行设置就基于理论。