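I ran a few analyses on Titanic passenger data that I found on the internet.
The dataset can be downloaded from http://dl.dropbox.com/u/8686172/titanic.csv .
What I want to know is how the survival rate differs by passenger class, age, and sex.
Keep in mind that the male and female leads of Titanic were far apart in class: one traveled 1st class and the other 3rd.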
```r
library(data.table)
library(ggplot2)
# Load the file.
titanic <- read.csv("http://dl.dropboxusercontent.com/u/8686172/titanic.csv")
# Convert to a data.table for easier analysis.
titanic.dt <- as.data.table(titanic)
# Preview the column names and the first few rows.
names(titanic.dt)
## [1] "X" "pclass" "survived" "name" "sex"
## [6] "age" "sibsp" "parch" "ticket" "fare"
## [11] "cabin" "embarked" "boat" "body" "home.dest"
head(titanic.dt)
## X pclass survived name sex age sibsp
## 1: 1 1st 1 Allen, Miss. Elisabeth Walton female 29.0000 0
## 2: 2 1st 1 Allison, Master. Hudson Trevor male 0.9167 1
## 3: 3 1st 0 Allison, Miss. Helen Loraine female 2.0000 1
## 4: 4 1st 0 Allison, Mr. Hudson Joshua Crei male 30.0000 1
## 5: 5 1st 0 Allison, Mrs. Hudson J C (Bessi female 25.0000 1
## 6: 6 1st 1 Anderson, Mr. Harry male 48.0000 0
## parch ticket fare cabin embarked boat body
## 1: 0 24160 211.34 B5 Southampton 2 NA
## 2: 2 113781 151.55 C22 C26 Southampton 11 NA
## 3: 2 113781 151.55 C22 C26 Southampton NA
## 4: 2 113781 151.55 C22 C26 Southampton 135
## 5: 2 113781 151.55 C22 C26 Southampton NA
## 6: 0 19952 26.55 E12 Southampton 3 NA
## home.dest
## 1: St Louis, MO
## 2: Montreal, PQ / Chesterville, ON
## 3: Montreal, PQ / Chesterville, ON
## 4: Montreal, PQ / Chesterville, ON
## 5: Montreal, PQ / Chesterville, ON
## 6: New York, NY
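# Convert the 'survived' variable to a factor.
titanic.dt$survived <- as.factor(titanic.dt$survived)
# Treat passengers under 15 as children; everyone starts out as "adult".
# Rows with a missing age (e.g. row 1306) never match `age < 15` and stay "adult".
titanic.dt[, isminor := "adult"]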
## X pclass survived name sex age
## 1: 1 1st 1 Allen, Miss. Elisabeth Walton female 29.0000
## 2: 2 1st 1 Allison, Master. Hudson Trevor male 0.9167
## 3: 3 1st 0 Allison, Miss. Helen Loraine female 2.0000
## 4: 4 1st 0 Allison, Mr. Hudson Joshua Crei male 30.0000
## 5: 5 1st 0 Allison, Mrs. Hudson J C (Bessi female 25.0000
## ---
## 1305: 1305 3rd 0 Zabour, Miss. Hileni female 14.5000
## 1306: 1306 3rd 0 Zabour, Miss. Thamine female NA
## 1307: 1307 3rd 0 Zakarian, Mr. Mapriededer male 26.5000
## 1308: 1308 3rd 0 Zakarian, Mr. Ortin male 27.0000
## 1309: 1309 3rd 0 Zimmerman, Mr. Leo male 29.0000
## sibsp parch ticket fare cabin embarked boat body
## 1: 0 0 24160 211.337 B5 Southampton 2 NA
## 2: 1 2 113781 151.550 C22 C26 Southampton 11 NA
## 3: 1 2 113781 151.550 C22 C26 Southampton NA
## 4: 1 2 113781 151.550 C22 C26 Southampton 135
## 5: 1 2 113781 151.550 C22 C26 Southampton NA
## ---
## 1305: 1 0 2665 14.454 Cherbourg 328
## 1306: 1 0 2665 14.454 Cherbourg NA
## 1307: 0 0 2656 7.225 Cherbourg 304
## 1308: 0 0 2670 7.225 Cherbourg NA
## 1309: 0 0 315082 7.875 Southampton NA
## home.dest isminor
## 1: St Louis, MO adult
## 2: Montreal, PQ / Chesterville, ON adult
## 3: Montreal, PQ / Chesterville, ON adult
## 4: Montreal, PQ / Chesterville, ON adult
## 5: Montreal, PQ / Chesterville, ON adult
## ---
## 1305: adult
## 1306: adult
## 1307: adult
## 1308: adult
## 1309: adult
titanic.dt[age < 15, isminor := "child"]
## X pclass survived name sex age
## 1: 1 1st 1 Allen, Miss. Elisabeth Walton female 29.0000
## 2: 2 1st 1 Allison, Master. Hudson Trevor male 0.9167
## 3: 3 1st 0 Allison, Miss. Helen Loraine female 2.0000
## 4: 4 1st 0 Allison, Mr. Hudson Joshua Crei male 30.0000
## 5: 5 1st 0 Allison, Mrs. Hudson J C (Bessi female 25.0000
## ---
## 1305: 1305 3rd 0 Zabour, Miss. Hileni female 14.5000
## 1306: 1306 3rd 0 Zabour, Miss. Thamine female NA
## 1307: 1307 3rd 0 Zakarian, Mr. Mapriededer male 26.5000
## 1308: 1308 3rd 0 Zakarian, Mr. Ortin male 27.0000
## 1309: 1309 3rd 0 Zimmerman, Mr. Leo male 29.0000
## sibsp parch ticket fare cabin embarked boat body
## 1: 0 0 24160 211.337 B5 Southampton 2 NA
## 2: 1 2 113781 151.550 C22 C26 Southampton 11 NA
## 3: 1 2 113781 151.550 C22 C26 Southampton NA
## 4: 1 2 113781 151.550 C22 C26 Southampton 135
## 5: 1 2 113781 151.550 C22 C26 Southampton NA
## ---
## 1305: 1 0 2665 14.454 Cherbourg 328
## 1306: 1 0 2665 14.454 Cherbourg NA
## 1307: 0 0 2656 7.225 Cherbourg 304
## 1308: 0 0 2670 7.225 Cherbourg NA
## 1309: 0 0 315082 7.875 Southampton NA
## home.dest isminor
## 1: St Louis, MO adult
## 2: Montreal, PQ / Chesterville, ON child
## 3: Montreal, PQ / Chesterville, ON child
## 4: Montreal, PQ / Chesterville, ON adult
## 5: Montreal, PQ / Chesterville, ON adult
## ---
## 1305: child
## 1306: adult
## 1307: adult
## 1308: adult
## 1309: adult
titanic.dt$isminor <- as.factor(titanic.dt$isminor)
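# Aside, sketched on hypothetical toy data (`toy_age` and its values are made
# up for illustration, not taken from titanic.csv): a missing age never
# satisfies `age < 15`, so that row keeps the default "adult" label.
# One possible way to count the affected rows in the real table would be
# titanic.dt[is.na(age), .N].
library(data.table)
toy_age <- data.table(age = c(0.9167, 30, NA))
toy_age[, isminor := "adult"]
toy_age[age < 15, isminor := "child"]
toy_age$isminor  # "child" "adult" "adult"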
# Survival rate by passenger class
titanic.dt[, length(which(survived == 1))/nrow(.SD), by = pclass]
## pclass V1
## 1: 1st 0.6192
## 2: 2nd 0.4296
## 3: 3rd 0.2553
# Survival rate by sex
titanic.dt[, length(which(survived == 1))/nrow(.SD), by = sex]
## sex V1
## 1: female 0.7275
## 2: male 0.1910
# survived == 1 marks a survivor, so counting those rows and dividing by the
# total number of rows in each group gives the survival rate.
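# Aside, as a minimal sketch on hypothetical toy data (`toy_surv` and its
# values are made up for illustration): counting survivors and dividing by the
# group size should match mean(survived == 1), because the mean of a logical
# vector is the share of TRUE values. This assumes `survived` has no NA.
library(data.table)
toy_surv <- data.table(sex = c("female", "female", "male", "male", "male"),
                       survived = factor(c(1, 0, 0, 0, 1)))
toy_rates <- toy_surv[, list(by_count = length(which(survived == 1))/nrow(.SD),
                             by_mean = mean(survived == 1)), by = sex]
# female: 1 of 2 survived; male: 1 of 3 survived
stopifnot(isTRUE(all.equal(toy_rates$by_count, toy_rates$by_mean)))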
# Survival rate by class and sex
survived_pclass_sex <- titanic.dt[, list(cntsurv = length(which(survived == 1)),
                                         cntdie = length(which(survived == 0))),
                                  by = list(pclass, sex)][,
    list(psurvived = cntsurv/(cntsurv + cntdie)), by = list(pclass, sex)]
# Survival rate by class, sex, and age group
survived_pclass_sex_isminor <- titanic.dt[, list(cntsurv = length(which(survived == 1)),
                                                 cntdie = length(which(survived == 0))),
                                          by = list(pclass, sex, isminor)][,
    list(psurvived = cntsurv/(cntsurv + cntdie)), by = list(pclass, sex, isminor)]
# Combine sex and age group into one label so the result fits on two axes.
survived_pclass_sex_isminor$sex_age <- paste(survived_pclass_sex_isminor$sex,
    survived_pclass_sex_isminor$isminor, sep = "_")
library(scales)
ggplot(survived_pclass_sex_isminor, aes(pclass, sex_age)) +
    geom_tile(aes(fill = psurvived)) +
    scale_fill_gradient2("Survival rate", low = muted("white"), high = muted("blue")) +
    xlab("Passenger class") + ylab("Sex and age group")
```