坦尼克号沉船事故是西方航海事故之中影响最为深远的一次事故,其引发的社会学思考引起了学术界的广泛关注。很多人认为这场海难发生后的撤离过程是海上骑士精神的一个典范。当时,男人选择最后撤离,让妇女和儿童首先登上救生艇。
但是事实是否如此呢?有研究者发现乘客的舱位等级与生还几率是直接相关的,社会等级差异在巨大的突发灾难降临之时仍然决定着人们的命运,社会等级决定着风险的差异并决定风险降临之后的伤害差异,作为社会等级标志的舱位显然在这里成为生命的重要筹码。
本文运用关联规则,分析幸存者与舱位等级、性别、年龄等相关性,来验证这一假设。
然后请从中随机抽取80%的样本单位,准备用判别分析、随机森林、logistic回归等方法来构建合适的模型,并预测另外20%的样本单位的生存状态。
需要加载 arules 和 arulesViz 包。
setwd("C:/Users/asus/Desktop/我的文件/统计实验作业/3")
load("titanic.raw.rdata")
head(titanic.raw)
## Class Sex Age Survived
## 1 3rd Male Child No
## 2 3rd Male Child No
## 3 3rd Male Child No
## 4 3rd Male Child No
## 5 3rd Male Child No
## 6 3rd Male Child No
summary(titanic.raw)
## Class Sex Age Survived
## 1st :325 Female: 470 Adult:2092 No :1490
## 2nd :285 Male :1731 Child: 109 Yes: 711
## 3rd :706
## Crew:885
(由于组合太多,先分析存活率较高的)
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
##
## The following objects are masked from 'package:base':
##
## %in%, write
# 只跟存活相关项
rules <- apriori(titanic.raw,
control = list(verbose=F),
parameter = list(minlen=2, supp=0.005, conf=0.8),
appearance = list(rhs=c("Survived=No",
"Survived=Yes"),
default="lhs"))
## 保留三位小数点
quality(rules) <- round(quality(rules), digits=3)
## 按提升度 lift 排序
rules.sorted <- sort(rules, by="lift")
inspect(rules.sorted)
## lhs rhs support confidence lift
## 1 {Class=2nd,
## Age=Child} => {Survived=Yes} 0.011 1.000 3.096
## 2 {Class=2nd,
## Sex=Female,
## Age=Child} => {Survived=Yes} 0.006 1.000 3.096
## 3 {Class=1st,
## Sex=Female} => {Survived=Yes} 0.064 0.972 3.010
## 4 {Class=1st,
## Sex=Female,
## Age=Adult} => {Survived=Yes} 0.064 0.972 3.010
## 5 {Class=2nd,
## Sex=Female} => {Survived=Yes} 0.042 0.877 2.716
## 6 {Class=Crew,
## Sex=Female} => {Survived=Yes} 0.009 0.870 2.692
## 7 {Class=Crew,
## Sex=Female,
## Age=Adult} => {Survived=Yes} 0.009 0.870 2.692
## 8 {Class=2nd,
## Sex=Female,
## Age=Adult} => {Survived=Yes} 0.036 0.860 2.663
## 9 {Class=2nd,
## Sex=Male,
## Age=Adult} => {Survived=No} 0.070 0.917 1.354
## 10 {Class=2nd,
## Sex=Male} => {Survived=No} 0.070 0.860 1.271
## 11 {Class=3rd,
## Sex=Male,
## Age=Adult} => {Survived=No} 0.176 0.838 1.237
## 12 {Class=3rd,
## Sex=Male} => {Survived=No} 0.192 0.827 1.222
比如Sex与Age重复了。从第一条可以看出二等舱的小孩全部存活,但是第二条有重复出现,不能提供更多的信息,我们称之为冗余项。
## 找出冗余项
subset.matrix <- is.subset(rules.sorted, rules.sorted)
subset.matrix[lower.tri(subset.matrix, diag = T)] <- NA
redundant <- colSums(subset.matrix, na.rm = T) >= 1
## 具体位置
which(redundant)
## [1] 2 4 7 8
## 去掉冗余项
rules.pruned <- rules.sorted[! redundant]
inspect(rules.pruned)
## lhs rhs support confidence lift
## 1 {Class=2nd,
## Age=Child} => {Survived=Yes} 0.011 1.000 3.096
## 2 {Class=1st,
## Sex=Female} => {Survived=Yes} 0.064 0.972 3.010
## 3 {Class=2nd,
## Sex=Female} => {Survived=Yes} 0.042 0.877 2.716
## 4 {Class=Crew,
## Sex=Female} => {Survived=Yes} 0.009 0.870 2.692
## 5 {Class=2nd,
## Sex=Male,
## Age=Adult} => {Survived=No} 0.070 0.917 1.354
## 6 {Class=2nd,
## Sex=Male} => {Survived=No} 0.070 0.860 1.271
## 7 {Class=3rd,
## Sex=Male,
## Age=Adult} => {Survived=No} 0.176 0.838 1.237
## 8 {Class=3rd,
## Sex=Male} => {Survived=No} 0.192 0.827 1.222
可信度公式为:confidence(A => b)=p(B|A)= p(AUB)/p(A) 可信度高表示A发生引起B发生的可能性高,以及B依赖于A的可能性比较高。
从这个结果的前四条来看,从可信度来看二等舱儿童全部存活,其他各等级舱的女性存活可信度高。也就是说女性跟小孩存活率相对较高,这与西方的骑士精神很符合。
支持度公式:support(A => b)= p(AUB)= AB同时发生的件数除以总事件数。
后四条从支持度来看,三等舱成年男性的死亡率支持度为0.176,而二等舱成年男性的死亡率支持度为0.07,舱位等级与男性的存活率还是有很大关系的。
我应该注意到一等舱的数据都没有显示出来,考虑到一等舱的人数相对较少,我们降低支持度来看看,并只关注存活率与舱位等级的关联性。
rule2 <- apriori(titanic.raw, control = list(verbose=F),
parameter = list(minlen=3, supp=0.002, conf=0.1),
appearance = list(default="none", rhs=c("Survived=Yes"),
lhs=c("Class=1st", "Class=2nd", "Class=3rd",
"Age=Child", "Age=Adult")))
## 按可信度来排序
rule2.sorted <- sort(rule2, by="confidence")
inspect(rule2.sorted)
## lhs rhs support confidence lift
## 1 {Class=2nd,
## Age=Child} => {Survived=Yes} 0.010904134 1.0000000 3.0956399
## 2 {Class=1st,
## Age=Child} => {Survived=Yes} 0.002726034 1.0000000 3.0956399
## 3 {Class=1st,
## Age=Adult} => {Survived=Yes} 0.089504771 0.6175549 1.9117275
## 4 {Class=2nd,
## Age=Adult} => {Survived=Yes} 0.042707860 0.3601533 1.1149048
## 5 {Class=3rd,
## Age=Child} => {Survived=Yes} 0.012267151 0.3417722 1.0580035
## 6 {Class=3rd,
## Age=Adult} => {Survived=Yes} 0.068605179 0.2408293 0.7455209
首先从儿童与舱位来看,一二等舱儿童全部存活,三等舱儿童的存活率为0.342;
其次从成人与舱位来看,一等舱成人存活率最高,其次是二等,最后是三等舱;
最后我们看到第三条数据,也就是一等舱的成人存活率仅次于一二等舱的儿童,高于三等舱的儿童。
从上面的细节可看出:
rule3 <- apriori(titanic.raw, control = list(verbose=F),
parameter = list(minlen=3, supp=0.01, conf=0.1),
appearance = list(default="none", rhs=c("Survived=Yes"),
lhs=c("Class=1st", "Class=2nd", "Class=3rd",
"Sex=Female","Sex=Male")))
## 按可信度来排序
rule3.sorted <- sort(rule3, by="confidence")
inspect(rule3.sorted)
## lhs rhs support confidence lift
## 1 {Class=1st,
## Sex=Female} => {Survived=Yes} 0.06406179 0.9724138 3.0102430
## 2 {Class=2nd,
## Sex=Female} => {Survived=Yes} 0.04225352 0.8773585 2.7159860
## 3 {Class=3rd,
## Sex=Female} => {Survived=Yes} 0.04089050 0.4591837 1.4214673
## 4 {Class=1st,
## Sex=Male} => {Survived=Yes} 0.02816901 0.3444444 1.0662760
## 5 {Class=3rd,
## Sex=Male} => {Survived=Yes} 0.03998183 0.1725490 0.5341496
## 6 {Class=2nd,
## Sex=Male} => {Survived=Yes} 0.01135847 0.1396648 0.4323519
library(arulesViz)
## Loading required package: grid
##
## Attaching package: 'arulesViz'
##
## The following object is masked from 'package:base':
##
## abbreviate
plot(rules, method="graph", control=list(type="items"))
plot(rule2, method="graph", control=list(type="items"))
plot(rules, method = "grouped")
通过关联分析,我们可以得出这样的结论: