在真实世界中,缺失数据的现象是极其普遍的。统计教科书仅用很少的篇幅介绍;统计软件提供的自动处理缺失值的方法也可能不是最优的。大部分情况下,在处理收集了真实数据的问题之前,你不得不消除缺失数据:(1) 删除含有缺失数据的实例 ;(2) 用合理的替代值替换缺失值。本文介绍一下处理缺失数据的传统方法和现代方法,主要使用VIM和mice包。
一个完整的处理方法通常包含以下几个步骤:
识别缺失数据;
检查导致数据缺失的原因;
删除包含缺失值的实例或用合理的数值代替(插补)缺失值。
可以参考的相关文献
Allison的Missing Data(2001)
Schafer和Graham的“Missing Data: Our View of the State of the Art”(2002)
Schlomer、Bauman和Card的“Best Practices for Missing Data Management in Counseling Psychology”(2010)
缺失数据的分类 统计学家通常将缺失数据分为三类。以sleep研究中对做梦时长的测量(12种动物有缺失值)来依次阐述三种类型。
完全随机缺失
若某变量的缺失数据与其他任何观测或未观测变量都不相关,则数据 为完全随机缺失(MCAR, Missing completely at random)。若12种动物的做梦时长值缺失不是出于系统原因,那么可以认为数据是MCAR。注意,如果每个有缺失值的变量都是MCAR,那么可以将数据完整的实例看作对更大数据集的一个简单随机抽样。
随机缺失
若某变量上的缺失数据与其他观测变量相关,与它自己的未观测值不相关,则数据为随机缺失(MAR, Missing at random)。例如,如果体重较小的动物更可能有做梦时长的缺失值(可能因为较小的动物更难观察),而且该“缺失”与动物的做梦时长无关,那么就可以认为该数据是MAR。此时,一旦控制了体重变量,做梦时长数据的缺失与出现将是随机的。
非随机缺失
若缺失数据不属于MCAR和MAR,则数据为非随机缺失(NMAR, not Missing completely at random)。例如,做梦时长越短的动物更可能有做梦数据的缺失(可能由于难以测量时长较短的事件),那么可认为数据是NMAR。
大部分处理缺失数据的方法都假定数据是MCAR或MAR。此时,你可以忽略缺失数据的生成机制,并且(在替换或删除缺失数据后)可以直接对感兴趣的关系进行建模。
library(VIM)
library(mice)
library(tidyverse)
VIMsleep <- VIM::sleep
glimpse(VIMsleep)
## Rows: 62
## Columns: 10
## $ BodyWgt <dbl> 6654.000, 1.000, 3.385, 0.920, 2547.000, 10.550, 0.023, 16...
## $ BrainWgt <dbl> 5712.0, 6.6, 44.5, 5.7, 4603.0, 179.5, 0.3, 169.0, 25.6, 4...
## $ NonD <dbl> NA, 6.3, NA, NA, 2.1, 9.1, 15.8, 5.2, 10.9, 8.3, 11.0, 3.2...
## $ Dream <dbl> NA, 2.0, NA, NA, 1.8, 0.7, 3.9, 1.0, 3.6, 1.4, 1.5, 0.7, 2...
## $ Sleep <dbl> 3.3, 8.3, 12.5, 16.5, 3.9, 9.8, 19.7, 6.2, 14.5, 9.7, 12.5...
## $ Span <dbl> 38.6, 4.5, 14.0, NA, 69.0, 27.0, 19.0, 30.4, 28.0, 50.0, 7...
## $ Gest <dbl> 645, 42, 60, 25, 624, 180, 35, 392, 63, 230, 112, 281, NA,...
## $ Pred <int> 3, 3, 1, 5, 3, 4, 1, 4, 1, 1, 5, 5, 2, 5, 1, 2, 2, 2, 1, 1...
## $ Exp <int> 5, 1, 1, 2, 5, 4, 1, 5, 2, 1, 4, 5, 1, 5, 1, 2, 2, 2, 2, 1...
## $ Danger <int> 3, 3, 1, 3, 4, 4, 1, 4, 1, 1, 4, 5, 2, 5, 1, 2, 2, 2, 1, 1...
sum(is.na(VIMsleep))
## [1] 38
mean(is.na(VIMsleep))
## [1] 0.06129032
VIMsleep %>%
dplyr::summarise(across(everything(), ~sum(is.na(.))))
BodyWgt | BrainWgt | NonD | Dream | Sleep | Span | Gest | Pred | Exp | Danger |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 14 | 12 | 4 | 4 | 4 | 0 | 0 | 0 |
sum(!complete.cases(VIMsleep))
## [1] 20
mean(!complete.cases(VIMsleep))
## [1] 0.3225806
aggr(VIMsleep, prop = F, number = T)
matrixplot(VIMsleep)
# scattMiss(VIMsleep)
md.pattern(VIMsleep)
## BodyWgt BrainWgt Pred Exp Danger Sleep Span Gest Dream NonD
## 42 1 1 1 1 1 1 1 1 1 1 0
## 9 1 1 1 1 1 1 1 1 0 0 2
## 3 1 1 1 1 1 1 1 0 1 1 1
## 2 1 1 1 1 1 1 0 1 1 1 1
## 1 1 1 1 1 1 1 0 1 0 0 3
## 1 1 1 1 1 1 1 0 0 1 1 2
## 2 1 1 1 1 1 0 1 1 1 0 2
## 2 1 1 1 1 1 0 1 1 0 0 3
## 0 0 0 0 0 4 4 4 12 14 38
表中的1和0显示了缺失值模式:0表示变量的列中有缺失值,1则表示没有缺失值。
第一行表述了“无缺失值”的模式(所有元素都为1)。
第二行表述了“除了Span之外无缺失值”的模式。
第一列表示各缺失值模式的实例个数,最后一列表示各模式中有缺失值的变量的个数。
此处可以看到,有42个实例没有缺失值,仅2个实例缺失了Span。9个实例同时缺失了NonD和Dream的值。数据集包含了总共(42×0)+(2×1)+…+(1×3)=38个缺失值。最后一行给出了每个变量中缺失值的数目。
VIMsleep %>%
select(Gest, Dream) %>%
marginplot(pch=c(20),
col=c("darkgray", "red", "blue"))
VIMsleep %>%
as_tibble() %>%
rownames_to_column() %>%
filter(!complete.cases(.))
rowname | BodyWgt | BrainWgt | NonD | Dream | Sleep | Span | Gest | Pred | Exp | Danger |
---|---|---|---|---|---|---|---|---|---|---|
1 | 6654.000 | 5712.0 | NA | NA | 3.3 | 38.6 | 645 | 3 | 5 | 3 |
3 | 3.385 | 44.5 | NA | NA | 12.5 | 14.0 | 60 | 1 | 1 | 1 |
4 | 0.920 | 5.7 | NA | NA | 16.5 | NA | 25 | 5 | 2 | 3 |
13 | 0.550 | 2.4 | 7.6 | 2.7 | 10.3 | NA | NA | 2 | 1 | 2 |
14 | 187.100 | 419.0 | NA | NA | 3.1 | 40.0 | 365 | 5 | 5 | 5 |
19 | 1.410 | 17.5 | 4.8 | 1.3 | 6.1 | 34.0 | NA | 1 | 2 | 1 |
20 | 60.000 | 81.0 | 12.0 | 6.1 | 18.1 | 7.0 | NA | 1 | 1 | 1 |
21 | 529.000 | 680.0 | NA | 0.3 | NA | 28.0 | 400 | 5 | 5 | 5 |
24 | 207.000 | 406.0 | NA | NA | 12.0 | 39.3 | 252 | 1 | 4 | 1 |
26 | 36.330 | 119.5 | NA | NA | 13.0 | 16.2 | 63 | 1 | 1 | 1 |
30 | 100.000 | 157.0 | NA | NA | 10.8 | 22.4 | 100 | 1 | 1 | 1 |
31 | 35.000 | 56.0 | NA | NA | NA | 16.3 | 33 | 3 | 5 | 4 |
35 | 0.122 | 3.0 | 8.2 | 2.4 | 10.6 | NA | 30 | 2 | 1 | 1 |
36 | 1.350 | 8.1 | 8.4 | 2.8 | 11.2 | NA | 45 | 3 | 1 | 3 |
41 | 250.000 | 490.0 | NA | 1.0 | NA | 23.6 | 440 | 5 | 5 | 5 |
47 | 4.288 | 39.2 | NA | NA | 12.5 | 13.7 | 63 | 2 | 2 | 2 |
53 | 14.830 | 98.2 | NA | NA | 2.6 | 17.0 | 150 | 5 | 5 | 5 |
55 | 1.400 | 12.5 | NA | NA | 11.0 | 12.7 | 90 | 2 | 2 | 2 |
56 | 0.060 | 1.0 | 8.1 | 2.2 | 10.3 | 3.5 | NA | 3 | 1 | 2 |
62 | 4.050 | 17.0 | NA | NA | NA | 13.0 | 38 | 3 | 1 | 1 |
na.omit(VIMsleep) %>%
rownames_to_column() %>%
as_tibble()
rowname | BodyWgt | BrainWgt | NonD | Dream | Sleep | Span | Gest | Pred | Exp | Danger |
---|---|---|---|---|---|---|---|---|---|---|
2 | 1.000 | 6.60 | 6.3 | 2.0 | 8.3 | 4.5 | 42.0 | 3 | 1 | 3 |
5 | 2547.000 | 4603.00 | 2.1 | 1.8 | 3.9 | 69.0 | 624.0 | 3 | 5 | 4 |
6 | 10.550 | 179.50 | 9.1 | 0.7 | 9.8 | 27.0 | 180.0 | 4 | 4 | 4 |
7 | 0.023 | 0.30 | 15.8 | 3.9 | 19.7 | 19.0 | 35.0 | 1 | 1 | 1 |
8 | 160.000 | 169.00 | 5.2 | 1.0 | 6.2 | 30.4 | 392.0 | 4 | 5 | 4 |
9 | 3.300 | 25.60 | 10.9 | 3.6 | 14.5 | 28.0 | 63.0 | 1 | 2 | 1 |
10 | 52.160 | 440.00 | 8.3 | 1.4 | 9.7 | 50.0 | 230.0 | 1 | 1 | 1 |
11 | 0.425 | 6.40 | 11.0 | 1.5 | 12.5 | 7.0 | 112.0 | 5 | 4 | 4 |
12 | 465.000 | 423.00 | 3.2 | 0.7 | 3.9 | 30.0 | 281.0 | 5 | 5 | 5 |
15 | 0.075 | 1.20 | 6.3 | 2.1 | 8.4 | 3.5 | 42.0 | 1 | 1 | 1 |
16 | 3.000 | 25.00 | 8.6 | 0.0 | 8.6 | 50.0 | 28.0 | 2 | 2 | 2 |
17 | 0.785 | 3.50 | 6.6 | 4.1 | 10.7 | 6.0 | 42.0 | 2 | 2 | 2 |
18 | 0.200 | 5.00 | 9.5 | 1.2 | 10.7 | 10.4 | 120.0 | 2 | 2 | 2 |
22 | 27.660 | 115.00 | 3.3 | 0.5 | 3.8 | 20.0 | 148.0 | 5 | 5 | 5 |
23 | 0.120 | 1.00 | 11.0 | 3.4 | 14.4 | 3.9 | 16.0 | 3 | 1 | 2 |
25 | 85.000 | 325.00 | 4.7 | 1.5 | 6.2 | 41.0 | 310.0 | 1 | 3 | 1 |
27 | 0.101 | 4.00 | 10.4 | 3.4 | 13.8 | 9.0 | 28.0 | 5 | 1 | 3 |
28 | 1.040 | 5.50 | 7.4 | 0.8 | 8.2 | 7.6 | 68.0 | 5 | 3 | 4 |
29 | 521.000 | 655.00 | 2.1 | 0.8 | 2.9 | 46.0 | 336.0 | 5 | 5 | 5 |
32 | 0.005 | 0.14 | 7.7 | 1.4 | 9.1 | 2.6 | 21.5 | 5 | 2 | 4 |
33 | 0.010 | 0.25 | 17.9 | 2.0 | 19.9 | 24.0 | 50.0 | 1 | 1 | 1 |
34 | 62.000 | 1320.00 | 6.1 | 1.9 | 8.0 | 100.0 | 267.0 | 1 | 1 | 1 |
37 | 0.023 | 0.40 | 11.9 | 1.3 | 13.2 | 3.2 | 19.0 | 4 | 1 | 3 |
38 | 0.048 | 0.33 | 10.8 | 2.0 | 12.8 | 2.0 | 30.0 | 4 | 1 | 3 |
39 | 1.700 | 6.30 | 13.8 | 5.6 | 19.4 | 5.0 | 12.0 | 2 | 1 | 1 |
40 | 3.500 | 10.80 | 14.3 | 3.1 | 17.4 | 6.5 | 120.0 | 2 | 1 | 1 |
42 | 0.480 | 15.50 | 15.2 | 1.8 | 17.0 | 12.0 | 140.0 | 2 | 2 | 2 |
43 | 10.000 | 115.00 | 10.0 | 0.9 | 10.9 | 20.2 | 170.0 | 4 | 4 | 4 |
44 | 1.620 | 11.40 | 11.9 | 1.8 | 13.7 | 13.0 | 17.0 | 2 | 1 | 2 |
45 | 192.000 | 180.00 | 6.5 | 1.9 | 8.4 | 27.0 | 115.0 | 4 | 4 | 4 |
46 | 2.500 | 12.10 | 7.5 | 0.9 | 8.4 | 18.0 | 31.0 | 5 | 5 | 5 |
48 | 0.280 | 1.90 | 10.6 | 2.6 | 13.2 | 4.7 | 21.0 | 3 | 1 | 3 |
49 | 4.235 | 50.40 | 7.4 | 2.4 | 9.8 | 9.8 | 52.0 | 1 | 1 | 1 |
50 | 6.800 | 179.00 | 8.4 | 1.2 | 9.6 | 29.0 | 164.0 | 2 | 3 | 2 |
51 | 0.750 | 12.30 | 5.7 | 0.9 | 6.6 | 7.0 | 225.0 | 2 | 2 | 2 |
52 | 3.600 | 21.00 | 4.9 | 0.5 | 5.4 | 6.0 | 225.0 | 3 | 2 | 3 |
54 | 55.500 | 175.00 | 3.2 | 0.6 | 3.8 | 20.0 | 151.0 | 5 | 5 | 5 |
57 | 0.900 | 2.60 | 11.0 | 2.3 | 13.3 | 4.5 | 60.0 | 2 | 1 | 2 |
58 | 2.000 | 12.30 | 4.9 | 0.5 | 5.4 | 7.5 | 200.0 | 3 | 1 | 3 |
59 | 0.104 | 2.50 | 13.2 | 2.6 | 15.8 | 2.3 | 46.0 | 3 | 2 | 2 |
60 | 4.190 | 58.00 | 9.7 | 0.6 | 10.3 | 24.0 | 210.0 | 4 | 3 | 4 |
61 | 3.500 | 3.90 | 12.8 | 6.6 | 19.4 | 3.0 | 14.0 | 2 | 1 | 1 |
VIMsleep %>%
as_tibble() %>%
rownames_to_column() %>%
filter(complete.cases(.))
rowname | BodyWgt | BrainWgt | NonD | Dream | Sleep | Span | Gest | Pred | Exp | Danger |
---|---|---|---|---|---|---|---|---|---|---|
2 | 1.000 | 6.60 | 6.3 | 2.0 | 8.3 | 4.5 | 42.0 | 3 | 1 | 3 |
5 | 2547.000 | 4603.00 | 2.1 | 1.8 | 3.9 | 69.0 | 624.0 | 3 | 5 | 4 |
6 | 10.550 | 179.50 | 9.1 | 0.7 | 9.8 | 27.0 | 180.0 | 4 | 4 | 4 |
7 | 0.023 | 0.30 | 15.8 | 3.9 | 19.7 | 19.0 | 35.0 | 1 | 1 | 1 |
8 | 160.000 | 169.00 | 5.2 | 1.0 | 6.2 | 30.4 | 392.0 | 4 | 5 | 4 |
9 | 3.300 | 25.60 | 10.9 | 3.6 | 14.5 | 28.0 | 63.0 | 1 | 2 | 1 |
10 | 52.160 | 440.00 | 8.3 | 1.4 | 9.7 | 50.0 | 230.0 | 1 | 1 | 1 |
11 | 0.425 | 6.40 | 11.0 | 1.5 | 12.5 | 7.0 | 112.0 | 5 | 4 | 4 |
12 | 465.000 | 423.00 | 3.2 | 0.7 | 3.9 | 30.0 | 281.0 | 5 | 5 | 5 |
15 | 0.075 | 1.20 | 6.3 | 2.1 | 8.4 | 3.5 | 42.0 | 1 | 1 | 1 |
16 | 3.000 | 25.00 | 8.6 | 0.0 | 8.6 | 50.0 | 28.0 | 2 | 2 | 2 |
17 | 0.785 | 3.50 | 6.6 | 4.1 | 10.7 | 6.0 | 42.0 | 2 | 2 | 2 |
18 | 0.200 | 5.00 | 9.5 | 1.2 | 10.7 | 10.4 | 120.0 | 2 | 2 | 2 |
22 | 27.660 | 115.00 | 3.3 | 0.5 | 3.8 | 20.0 | 148.0 | 5 | 5 | 5 |
23 | 0.120 | 1.00 | 11.0 | 3.4 | 14.4 | 3.9 | 16.0 | 3 | 1 | 2 |
25 | 85.000 | 325.00 | 4.7 | 1.5 | 6.2 | 41.0 | 310.0 | 1 | 3 | 1 |
27 | 0.101 | 4.00 | 10.4 | 3.4 | 13.8 | 9.0 | 28.0 | 5 | 1 | 3 |
28 | 1.040 | 5.50 | 7.4 | 0.8 | 8.2 | 7.6 | 68.0 | 5 | 3 | 4 |
29 | 521.000 | 655.00 | 2.1 | 0.8 | 2.9 | 46.0 | 336.0 | 5 | 5 | 5 |
32 | 0.005 | 0.14 | 7.7 | 1.4 | 9.1 | 2.6 | 21.5 | 5 | 2 | 4 |
33 | 0.010 | 0.25 | 17.9 | 2.0 | 19.9 | 24.0 | 50.0 | 1 | 1 | 1 |
34 | 62.000 | 1320.00 | 6.1 | 1.9 | 8.0 | 100.0 | 267.0 | 1 | 1 | 1 |
37 | 0.023 | 0.40 | 11.9 | 1.3 | 13.2 | 3.2 | 19.0 | 4 | 1 | 3 |
38 | 0.048 | 0.33 | 10.8 | 2.0 | 12.8 | 2.0 | 30.0 | 4 | 1 | 3 |
39 | 1.700 | 6.30 | 13.8 | 5.6 | 19.4 | 5.0 | 12.0 | 2 | 1 | 1 |
40 | 3.500 | 10.80 | 14.3 | 3.1 | 17.4 | 6.5 | 120.0 | 2 | 1 | 1 |
42 | 0.480 | 15.50 | 15.2 | 1.8 | 17.0 | 12.0 | 140.0 | 2 | 2 | 2 |
43 | 10.000 | 115.00 | 10.0 | 0.9 | 10.9 | 20.2 | 170.0 | 4 | 4 | 4 |
44 | 1.620 | 11.40 | 11.9 | 1.8 | 13.7 | 13.0 | 17.0 | 2 | 1 | 2 |
45 | 192.000 | 180.00 | 6.5 | 1.9 | 8.4 | 27.0 | 115.0 | 4 | 4 | 4 |
46 | 2.500 | 12.10 | 7.5 | 0.9 | 8.4 | 18.0 | 31.0 | 5 | 5 | 5 |
48 | 0.280 | 1.90 | 10.6 | 2.6 | 13.2 | 4.7 | 21.0 | 3 | 1 | 3 |
49 | 4.235 | 50.40 | 7.4 | 2.4 | 9.8 | 9.8 | 52.0 | 1 | 1 | 1 |
50 | 6.800 | 179.00 | 8.4 | 1.2 | 9.6 | 29.0 | 164.0 | 2 | 3 | 2 |
51 | 0.750 | 12.30 | 5.7 | 0.9 | 6.6 | 7.0 | 225.0 | 2 | 2 | 2 |
52 | 3.600 | 21.00 | 4.9 | 0.5 | 5.4 | 6.0 | 225.0 | 3 | 2 | 3 |
54 | 55.500 | 175.00 | 3.2 | 0.6 | 3.8 | 20.0 | 151.0 | 5 | 5 | 5 |
57 | 0.900 | 2.60 | 11.0 | 2.3 | 13.3 | 4.5 | 60.0 | 2 | 1 | 2 |
58 | 2.000 | 12.30 | 4.9 | 0.5 | 5.4 | 7.5 | 200.0 | 3 | 1 | 3 |
59 | 0.104 | 2.50 | 13.2 | 2.6 | 15.8 | 2.3 | 46.0 | 3 | 2 | 2 |
60 | 4.190 | 58.00 | 9.7 | 0.6 | 10.3 | 24.0 | 210.0 | 4 | 3 | 4 |
61 | 3.500 | 3.90 | 12.8 | 6.6 | 19.4 | 3.0 | 14.0 | 2 | 1 | 1 |
VIMsleep %>%
# 把各变量值缺省值计算为1,其余非缺省值为0, 生成影子矩阵shadow matrix
mutate(across(everything(), ~ abs(is.na(.)))) %>%
# 选择有缺省值存在的变量
dplyr::select(where(~sum(.x) > 0)) %>%
# 计算相关矩阵
cor() %>%
# 可视化
corrplot::corrplot(addCoef.col = "grey50")
可以看到Dream和NonD常常一起缺失(r=0.91)。相对可能性较小的是Sleep和NonD一起缺失(r=0.49),以及Sleep和Dream(r=0.20)
y <- VIMsleep %>%
# 把各变量值缺省值计算为1,其余非缺省值为0, 生成影子矩阵shadow matrix
mutate(across(everything(), ~ abs(is.na(.)))) %>%
# 选择有缺省值存在的变量
dplyr::select(where(~sum(.x) > 0))
y
NonD | Dream | Sleep | Span | Gest |
---|---|---|---|---|
1 | 1 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 0 |
1 | 1 | 0 | 1 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 1 | 1 |
1 | 1 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 1 |
0 | 0 | 0 | 0 | 1 |
1 | 0 | 1 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 0 |
1 | 1 | 1 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 1 | 0 |
0 | 0 | 0 | 1 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
1 | 0 | 1 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 1 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
1 | 1 | 1 | 0 | 0 |
cor(y)
## NonD Dream Sleep Span Gest
## NonD 1.00000000 0.90711474 0.48626454 0.01519577 -0.14182716
## Dream 0.90711474 1.00000000 0.20370138 0.03752394 -0.12865350
## Sleep 0.48626454 0.20370138 1.00000000 -0.06896552 -0.06896552
## Span 0.01519577 0.03752394 -0.06896552 1.00000000 0.19827586
## Gest -0.14182716 -0.12865350 -0.06896552 0.19827586 1.00000000
cor(VIMsleep, y, use = "pairwise.complete.obs") %>% corrplot::corrplot(addCoef.col = "grey50", number.digits = 3)
可以看到含缺失值变量与其他可观测变量间的关系
从相关系数矩阵的第一列可以看到,体重越大(r=0.227)、妊娠期越长(r=0.202)、睡眠暴露度越大(r=0.245)的动物无梦睡眠的评分更可能缺失。
其他列的信息也可以按类似方式得出。
表中的相关系数并不特别大,表明数据是MCAR的可能性比较小,更可能为MAR。
不过也绝不能排除数据是NMAR的可能性
当缺乏强力的外部证据时,我们通常假设数据是MCAR或者MAR。
sleep_imp <- mice(VIMsleep)
##
## iter imp variable
## 1 1 NonD Dream Sleep Span Gest
## 1 2 NonD Dream Sleep Span Gest
## 1 3 NonD Dream Sleep Span Gest
## 1 4 NonD Dream Sleep Span Gest
## 1 5 NonD Dream Sleep Span Gest
## 2 1 NonD Dream Sleep Span Gest
## 2 2 NonD Dream Sleep Span Gest
## 2 3 NonD Dream Sleep Span Gest
## 2 4 NonD Dream Sleep Span Gest
## 2 5 NonD Dream Sleep Span Gest
## 3 1 NonD Dream Sleep Span Gest
## 3 2 NonD Dream Sleep Span Gest
## 3 3 NonD Dream Sleep Span Gest
## 3 4 NonD Dream Sleep Span Gest
## 3 5 NonD Dream Sleep Span Gest
## 4 1 NonD Dream Sleep Span Gest
## 4 2 NonD Dream Sleep Span Gest
## 4 3 NonD Dream Sleep Span Gest
## 4 4 NonD Dream Sleep Span Gest
## 4 5 NonD Dream Sleep Span Gest
## 5 1 NonD Dream Sleep Span Gest
## 5 2 NonD Dream Sleep Span Gest
## 5 3 NonD Dream Sleep Span Gest
## 5 4 NonD Dream Sleep Span Gest
## 5 5 NonD Dream Sleep Span Gest
fit <- with(sleep_imp, lm(Dream ~ Span + Gest))
fit
## call :
## with.mids(data = sleep_imp, expr = lm(Dream ~ Span + Gest))
##
## call1 :
## mice(data = VIMsleep)
##
## nmis :
## BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred
## 0 0 14 12 4 4 4 0
## Exp Danger
## 0 0
##
## analyses :
## [[1]]
##
## Call:
## lm(formula = Dream ~ Span + Gest)
##
## Coefficients:
## (Intercept) Span Gest
## 2.348972 -0.010315 -0.001985
##
##
## [[2]]
##
## Call:
## lm(formula = Dream ~ Span + Gest)
##
## Coefficients:
## (Intercept) Span Gest
## 2.499860 -0.005153 -0.003815
##
##
## [[3]]
##
## Call:
## lm(formula = Dream ~ Span + Gest)
##
## Coefficients:
## (Intercept) Span Gest
## 2.469375 -0.004486 -0.003857
##
##
## [[4]]
##
## Call:
## lm(formula = Dream ~ Span + Gest)
##
## Coefficients:
## (Intercept) Span Gest
## 2.486298 -0.008068 -0.003252
##
##
## [[5]]
##
## Call:
## lm(formula = Dream ~ Span + Gest)
##
## Coefficients:
## (Intercept) Span Gest
## 2.520242 -0.002314 -0.003971
pooled <- pool(fit)
# pooled
base::summary(pooled)
term | estimate | std.error | statistic | df | p.value |
---|---|---|---|---|---|
(Intercept) | 2.4649495 | 0.2567150 | 9.6018931 | 48.04727 | 0.0000000 |
Span | -0.0060674 | 0.0121486 | -0.4994334 | 48.42345 | 0.6197358 |
Gest | -0.0033758 | 0.0016977 | -1.9884443 | 22.40373 | 0.0591145 |
sleep_imp1 <- complete(sleep_imp, action = 1)
fit <- lm(Dream ~ Span + Gest, data = sleep_imp1)
fit %>% broom::tidy()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 2.3489725 | 0.2541382 | 9.2428932 | 0.0000000 |
Span | -0.0103154 | 0.0119829 | -0.8608438 | 0.3928090 |
Gest | -0.0019845 | 0.0014850 | -1.3364294 | 0.1865403 |
参考文献
以上摘录自