问题描述:

坦尼克号沉船事故是西方航海事故之中影响最为深远的一次事故,其引发的社会学思考引起了学术界的广泛关注。很多人认为这场海难发生后的撤离过程是海上骑士精神的一个典范。当时,男人选择最后撤离,让妇女和儿童首先登上救生艇。

但是事实是否如此呢?有研究者发现乘客的舱位等级与生还几率是直接相关的,社会等级差异在巨大的突发灾难降临之时仍然决定着人们的命运,社会等级决定着风险的差异并决定风险降临之后的伤害差异,作为社会等级标志的舱位显然在这里成为生命的重要筹码。

本文运用关联规则,分析幸存者与舱位等级、性别、年龄等相关性,来验证这一假设。
然后请从中随机抽取80%的样本单位,准备用判别分析、随机森林、logistic回归等方法来构建合适的模型,并预测另外20%的样本单位的生存状态。

一、关联分析(Apriori算法)

需要加载 arulesarulesViz 包。

首先载入数据并查看数据结构

setwd("C:/Users/asus/Desktop/我的文件/统计实验作业/3")
load("titanic.raw.rdata")
head(titanic.raw)
##   Class  Sex   Age Survived
## 1   3rd Male Child       No
## 2   3rd Male Child       No
## 3   3rd Male Child       No
## 4   3rd Male Child       No
## 5   3rd Male Child       No
## 6   3rd Male Child       No
summary(titanic.raw)
##   Class         Sex          Age       Survived  
##  1st :325   Female: 470   Adult:2092   No :1490  
##  2nd :285   Male  :1731   Child: 109   Yes: 711  
##  3rd :706                                        
##  Crew:885

从上面对数据有了一个大致的了解。下面运用Apriori算法来找到与存活关联性较大的因素。

1.1 第1组参数 支持度0.005,可信度0.8

(由于组合太多,先分析存活率较高的)

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## 
## The following objects are masked from 'package:base':
## 
##     %in%, write
# 只跟存活相关项
rules <- apriori(titanic.raw,
                 control = list(verbose=F),
                 parameter = list(minlen=2, supp=0.005, conf=0.8),
                 appearance = list(rhs=c("Survived=No",
                                         "Survived=Yes"),
                                   default="lhs"))
## 保留三位小数点
quality(rules) <- round(quality(rules), digits=3)
## 按提升度 lift 排序
rules.sorted <- sort(rules, by="lift")
inspect(rules.sorted)
##    lhs             rhs            support confidence  lift
## 1  {Class=2nd,                                            
##     Age=Child}  => {Survived=Yes}   0.011      1.000 3.096
## 2  {Class=2nd,                                            
##     Sex=Female,                                           
##     Age=Child}  => {Survived=Yes}   0.006      1.000 3.096
## 3  {Class=1st,                                            
##     Sex=Female} => {Survived=Yes}   0.064      0.972 3.010
## 4  {Class=1st,                                            
##     Sex=Female,                                           
##     Age=Adult}  => {Survived=Yes}   0.064      0.972 3.010
## 5  {Class=2nd,                                            
##     Sex=Female} => {Survived=Yes}   0.042      0.877 2.716
## 6  {Class=Crew,                                           
##     Sex=Female} => {Survived=Yes}   0.009      0.870 2.692
## 7  {Class=Crew,                                           
##     Sex=Female,                                           
##     Age=Adult}  => {Survived=Yes}   0.009      0.870 2.692
## 8  {Class=2nd,                                            
##     Sex=Female,                                           
##     Age=Adult}  => {Survived=Yes}   0.036      0.860 2.663
## 9  {Class=2nd,                                            
##     Sex=Male,                                             
##     Age=Adult}  => {Survived=No}    0.070      0.917 1.354
## 10 {Class=2nd,                                            
##     Sex=Male}   => {Survived=No}    0.070      0.860 1.271
## 11 {Class=3rd,                                            
##     Sex=Male,                                             
##     Age=Adult}  => {Survived=No}    0.176      0.838 1.237
## 12 {Class=3rd,                                            
##     Sex=Male}   => {Survived=No}    0.192      0.827 1.222

从上面的结果来看,有些是冗余的项。

比如Sex与Age重复了。从第一条可以看出二等舱的小孩全部存活,但是第二条有重复出现,不能提供更多的信息,我们称之为冗余项。

## 找出冗余项
subset.matrix <- is.subset(rules.sorted, rules.sorted)
subset.matrix[lower.tri(subset.matrix, diag = T)] <- NA
redundant <- colSums(subset.matrix, na.rm = T) >= 1
## 具体位置
which(redundant)
## [1] 2 4 7 8
## 去掉冗余项
rules.pruned <- rules.sorted[! redundant]
inspect(rules.pruned)
##   lhs             rhs            support confidence  lift
## 1 {Class=2nd,                                            
##    Age=Child}  => {Survived=Yes}   0.011      1.000 3.096
## 2 {Class=1st,                                            
##    Sex=Female} => {Survived=Yes}   0.064      0.972 3.010
## 3 {Class=2nd,                                            
##    Sex=Female} => {Survived=Yes}   0.042      0.877 2.716
## 4 {Class=Crew,                                           
##    Sex=Female} => {Survived=Yes}   0.009      0.870 2.692
## 5 {Class=2nd,                                            
##    Sex=Male,                                             
##    Age=Adult}  => {Survived=No}    0.070      0.917 1.354
## 6 {Class=2nd,                                            
##    Sex=Male}   => {Survived=No}    0.070      0.860 1.271
## 7 {Class=3rd,                                            
##    Sex=Male,                                             
##    Age=Adult}  => {Survived=No}    0.176      0.838 1.237
## 8 {Class=3rd,                                            
##    Sex=Male}   => {Survived=No}    0.192      0.827 1.222

可信度公式为:confidence(A => b)=p(B|A)= p(AUB)/p(A) 可信度高表示A发生引起B发生的可能性高,以及B依赖于A的可能性比较高。

从这个结果的前四条来看,从可信度来看二等舱儿童全部存活,其他各等级舱的女性存活可信度高。也就是说女性跟小孩存活率相对较高,这与西方的骑士精神很符合。

支持度公式:support(A => b)= p(AUB)= AB同时发生的件数除以总事件数。

后四条从支持度来看,三等舱成年男性的死亡率支持度为0.176,而二等舱成年男性的死亡率支持度为0.07,舱位等级与男性的存活率还是有很大关系的。

我应该注意到一等舱的数据都没有显示出来,考虑到一等舱的人数相对较少,我们降低支持度来看看,并只关注存活率与舱位等级的关联性。

1.2 第2组参数 支持度0.002,可信度0.2,成人和儿童与舱位等级的关系

rule2 <- apriori(titanic.raw, control = list(verbose=F),
                 parameter = list(minlen=3, supp=0.002, conf=0.1),
                 appearance = list(default="none", rhs=c("Survived=Yes"),
                                  lhs=c("Class=1st", "Class=2nd", "Class=3rd",
                                        "Age=Child", "Age=Adult")))
## 按可信度来排序
rule2.sorted <- sort(rule2, by="confidence")
inspect(rule2.sorted)
##   lhs            rhs                support confidence      lift
## 1 {Class=2nd,                                                   
##    Age=Child} => {Survived=Yes} 0.010904134  1.0000000 3.0956399
## 2 {Class=1st,                                                   
##    Age=Child} => {Survived=Yes} 0.002726034  1.0000000 3.0956399
## 3 {Class=1st,                                                   
##    Age=Adult} => {Survived=Yes} 0.089504771  0.6175549 1.9117275
## 4 {Class=2nd,                                                   
##    Age=Adult} => {Survived=Yes} 0.042707860  0.3601533 1.1149048
## 5 {Class=3rd,                                                   
##    Age=Child} => {Survived=Yes} 0.012267151  0.3417722 1.0580035
## 6 {Class=3rd,                                                   
##    Age=Adult} => {Survived=Yes} 0.068605179  0.2408293 0.7455209

首先从儿童与舱位来看,一二等舱儿童全部存活,三等舱儿童的存活率为0.342;
其次从成人与舱位来看,一等舱成人存活率最高,其次是二等,最后是三等舱;
最后我们看到第三条数据,也就是一等舱的成人存活率仅次于一二等舱的儿童,高于三等舱的儿童。
从上面的细节可看出:

  • 1并没有完全落实儿童优先原则,尤其是三等舱儿童
  • 2成人的存活率与舱位等级相关度很高,舱位等级越高,成人存活率越高

1.3 性别与舱位等级关联性

rule3 <- apriori(titanic.raw, control = list(verbose=F),
                 parameter = list(minlen=3, supp=0.01, conf=0.1),
                 appearance = list(default="none", rhs=c("Survived=Yes"),
                              lhs=c("Class=1st", "Class=2nd", "Class=3rd",
                                    "Sex=Female","Sex=Male")))
## 按可信度来排序
rule3.sorted <- sort(rule3, by="confidence")
inspect(rule3.sorted)
##   lhs             rhs               support confidence      lift
## 1 {Class=1st,                                                   
##    Sex=Female} => {Survived=Yes} 0.06406179  0.9724138 3.0102430
## 2 {Class=2nd,                                                   
##    Sex=Female} => {Survived=Yes} 0.04225352  0.8773585 2.7159860
## 3 {Class=3rd,                                                   
##    Sex=Female} => {Survived=Yes} 0.04089050  0.4591837 1.4214673
## 4 {Class=1st,                                                   
##    Sex=Male}   => {Survived=Yes} 0.02816901  0.3444444 1.0662760
## 5 {Class=3rd,                                                   
##    Sex=Male}   => {Survived=Yes} 0.03998183  0.1725490 0.5341496
## 6 {Class=2nd,                                                   
##    Sex=Male}   => {Survived=Yes} 0.01135847  0.1396648 0.4323519
当lift小于1时,规则意义不显著,所以我主要看前四个。我们看以看出女性优先政策确实是执行的比较好。

2、可视化

library(arulesViz)
## Loading required package: grid
## 
## Attaching package: 'arulesViz'
## 
## The following object is masked from 'package:base':
## 
##     abbreviate
plot(rules, method="graph", control=list(type="items"))

上图反映了性别是存活最重要的因素,其次是儿童跟舱位等级影响存活率。

plot(rule2, method="graph", control=list(type="items"))

从图中可以看出一等舱存活率最高

plot(rules, method = "grouped")

三等舱的男性死亡率最高,一等舱存活率最高

3、总结

通过关联分析,我们可以得出这样的结论:

  • 在这次事故中,大体是妇女和儿童优先,但是存在三等舱儿童优先的政策未落实的情况
  • 就成人与舱位等级而言,舱位等级越高,成人存活率越高。尤其是一等舱的成人存活率超过了三等舱儿童的存活率。