Titanic passenger survival data set
Data source: https://www.kaggle.com/c/titanic/data
To visualize the full passenger list, we combine ‘titanic_train’ and ‘titanic_test’ into a single data frame. The test set has no ‘Survived’ column, so bind_rows() fills it with NA for those rows.
library(dplyr)    # bind_rows(), glimpse(), %>%
library(ggplot2)  # all plots below
titanic_all <- bind_rows(titanic_train, titanic_test)
glimpse(titanic_all)
## Rows: 1,309
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley ...
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "...
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ...
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6",...
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", ...
dim(titanic_all)
## [1] 1309 12
names(titanic_all)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"
head(titanic_all)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
| variable | definition | key |
|---|---|---|
| Survived | Survival | 0 = No, 1 = Yes |
| Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| SibSp | # of siblings / spouses aboard the Titanic | |
| Parch | # of parents / children aboard the Titanic | |
| Ticket | Ticket number | |
| Fare | Passenger fare | |
| Cabin | Cabin number | |
| Embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Let us check whether Jack Dawson and Rose DeWitt Bukater, the characters from the movie, are on the passenger list.
grep("Jack|Rose", titanic_all$Name, value = TRUE)
## [1] "Brewe, Dr. Arthur Jackson"
## [2] "Aks, Mrs. Sam (Leah Rosen)"
## [3] "Rosenbaum, Miss. Edith Louise"
## [4] "Rosenshine, Mr. George (Mr George Thorne\")\""
None of the matches is Jack Dawson or Rose DeWitt Bukater, so the movie characters are not on the Titanic passenger list.
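The pattern `Jack|Rose` matches substrings, which is why names such as "Jackson" and "Rosenbaum" are returned. As a minimal sketch, assuming we only care about the whole words "Jack" and "Rose", word boundaries can be added to the pattern. The vector `hits` below simply re-types the four matches shown above; it is an illustration, not part of the original analysis.

```r
# The four substring matches returned by the original grep() call.
hits <- c(
  "Brewe, Dr. Arthur Jackson",
  "Aks, Mrs. Sam (Leah Rosen)",
  "Rosenbaum, Miss. Edith Louise",
  "Rosenshine, Mr. George (Mr George Thorne\")\""
)

# \\b marks a word boundary, so "Jackson" and "Rosen..." no longer match.
whole_word <- grep("\\b(Jack|Rose)\\b", hits, value = TRUE, perl = TRUE)
print(whole_word)
```

Every whole-word match is also a substring match, so if none of these four names passes the stricter pattern, no name in the full column does either.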
ggplot(data = titanic_all, aes(x = Sex)) +
  geom_bar()
The bar chart shows that almost twice as many men as women were on board.
#ggplot(data = titanic_all, aes(x = Sex, fill = factor(Pclass))) +
# geom_bar()
#ggplot(data = titanic_all, aes(x = Sex, fill = factor(Pclass))) +
# geom_bar(position = "dodge") +
# ggtitle('position = "dodge"') #add title
ggplot(data = titanic_all, aes(x = Sex, fill = factor(Pclass))) +
  geom_bar(position = "fill") +
  ggtitle('position = "fill"')  # add title
For both women and men, class 3 (lower class) makes up the largest share of passengers.
Note that the share of class 1 (upper class) passengers among women is noticeably larger than among men.
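With position = "fill", each bar is rescaled to height 1, so the coloured segments show the class shares within each sex rather than raw counts. The sketch below reproduces that computation in base R on a small made-up data frame (`toy` is hypothetical and is not the Titanic data).

```r
# Hypothetical toy data: 4 women and 6 men spread over three classes.
toy <- data.frame(
  Sex    = c("female", "female", "female", "female",
             "male", "male", "male", "male", "male", "male"),
  Pclass = c(1, 1, 2, 3, 1, 2, 3, 3, 3, 3)
)

counts <- table(toy$Sex, toy$Pclass)      # counts per sex and class
shares <- prop.table(counts, margin = 1)  # rescale so each row (sex) sums to 1
print(round(shares, 2))
```

Applied to the real data, the same two calls on `titanic_all$Sex` and `titanic_all$Pclass` would give the numbers behind the filled bar chart.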
ggplot(titanic_all, aes(x = Embarked, y = Fare)) +
  geom_boxplot()
The first, unlabeled box corresponds to passengers whose port of embarkation is missing, which is why it has no name.
On this scale, the median fares (the line inside each box) for the ports ‘C’, ‘Q’ and ‘S’ differ only slightly.
However, the box plot shows that fares for ‘C’ are more spread out than for the other ports, and both ‘C’ and ‘S’ have many outliers.
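For reading the plot: geom_boxplot() draws the median as the line inside each box, the first and third quartiles as the box edges, and shows points lying more than 1.5 times the interquartile range beyond the box as outliers. The sketch below computes these quantities in base R for a made-up fare vector (`fares` is hypothetical), and also shows how a single extreme fare pulls the mean far away from the median.

```r
# Hypothetical fares with one very expensive ticket.
fares <- c(7, 8, 8, 10, 13, 26, 30, 512)

q   <- quantile(fares, c(0.25, 0.5, 0.75))  # box edges and median
iqr <- q[["75%"]] - q[["25%"]]              # interquartile range (box height)

lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- fares[fares < lower_fence | fares > upper_fence]

cat("median:", q[["50%"]], " mean:", mean(fares), "\n")
cat("outliers:", outliers, "\n")
```

Per-port versions of these numbers could be obtained by applying the same steps to each level of ‘Embarked’, for example with tapply().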
p1 <- ggplot(titanic_all, aes(x = Embarked, y = Fare, fill = factor(Pclass))) +
  geom_boxplot()
p1
#p1 + theme_gray() # the default
#p1 + theme_bw()
#p1 + theme_linedraw()
#p1 + theme_light()
#p1 + theme_dark()
#p1 + theme_minimal()
#p1 + theme_classic()
#p1 + theme_void()
Splitting ‘Embarked’ by ‘Pclass’ shows that class 1 passengers who boarded at ‘C’ paid more for their tickets than class 1 passengers who boarded at ‘S’.
ggplot(titanic_all, aes(x = Fare, fill = factor(Pclass))) +
  geom_density() +
  theme_light() +
  facet_grid(Pclass ~ .)  # one panel per class, all sharing the same y scale
It is easy to see that passengers in lower classes tended to buy cheaper tickets.
#ggplot(titanic_all, aes(x = Fare, fill = factor(Pclass))) +
# geom_density() +
# theme_light() +
# facet_grid(Pclass ~ . , scales = "free") #scales = "free"
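Next, we group passengers by ‘Sex’ and ‘Pclass’ and compute the mean of ‘Survived’ in each group. Because ‘Survived’ is coded 0/1, this mean is the group's survival rate.
Only ‘titanic_train’ is used from here on, since ‘Survived’ is not available for the test set.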
lineData <- titanic_train %>%
  group_by(Sex, Pclass) %>%                # one group per Sex x Pclass combination
  summarise(SurvivedAvg = mean(Survived))  # survival rate within each group
head(lineData)
## # A tibble: 6 x 3
## # Groups: Sex [2]
## Sex Pclass SurvivedAvg
## <chr> <int> <dbl>
## 1 female 1 0.968
## 2 female 2 0.921
## 3 female 3 0.5
## 4 male 1 0.369
## 5 male 2 0.157
## 6 male 3 0.135
ggplot(data = lineData, aes(x = Pclass, y = SurvivedAvg, color = Sex)) +
  geom_line() +
  geom_text(aes(label = round(SurvivedAvg, 2)), nudge_y = 0.05, show.legend = FALSE)
Consistent with intuition, lower classes have lower survival rates.
The survival rate of women is far higher than that of men regardless of ‘Pclass’.
Moreover, the survival rate of women in classes 1 and 2 is close to 1 (0.97 and 0.92) and drops sharply to 0.50 in class 3, whereas for men the sharp drop occurs between class 1 (0.37) and class 2 (0.16).
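The group means above can be cross-checked without dplyr. The following base R sketch runs on a small made-up data frame (`toy_survival` is hypothetical, not the Titanic data) and confirms that the mean of a 0/1 column equals survivors divided by group size.

```r
# Hypothetical toy data with the same three columns used above.
toy_survival <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male"),
  Pclass   = c(1, 1, 3, 1, 3, 3, 3),
  Survived = c(1, 1, 0, 1, 0, 0, 1)
)

# Mean of Survived within each Sex x Pclass combination.
rates <- aggregate(Survived ~ Sex + Pclass, data = toy_survival, FUN = mean)
print(rates)

# Same number for one group, computed by hand: survivors / passengers.
male3 <- toy_survival[toy_survival$Sex == "male" & toy_survival$Pclass == 3, ]
by_hand <- sum(male3$Survived) / nrow(male3)
```

aggregate() only returns combinations that actually occur in the data, which matches the behaviour of group_by() followed by summarise().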
ggplot(data = titanic_train, aes(x = Age, y = Survived, color = Sex)) +
  geom_point(alpha = 0.2) +
  geom_smooth(se = FALSE) +  # smoothed trend without the confidence band
  facet_grid(Pclass ~ .) +
  theme_classic() +
  theme(text = element_text(size = 10)) +
  scale_color_manual(values = c("#EF5350", "#64B5F6"))
In every class, the smoothed survival rate of men shows a small dip at around 20 years of age.
Children and the elderly have relatively high survival rates. This may indicate that young men gave priority to the elderly, women and children when boarding the lifeboats.
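A cruder but easier to verify view of the same pattern is to bin ages and compute the survival rate per bin. The sketch below does this in base R on made-up data (`toy_age` and the age bands are hypothetical choices, not taken from the original analysis); rows with a missing ‘Age’ are dropped first, since the real data also contain NA ages.

```r
# Hypothetical toy data, including one passenger with unknown age.
toy_age <- data.frame(
  Age      = c(2, 8, 19, 21, 24, 35, 47, 63, NA),
  Survived = c(1, 1, 0, 0, 1, 0, 1, 1, 0)
)

known <- toy_age[!is.na(toy_age$Age), ]  # keep rows with a known age

# Right-closed bands: (0,15], (15,30], (30,60], (60,Inf].
known$AgeBand <- cut(known$Age,
                     breaks = c(0, 15, 30, 60, Inf),
                     labels = c("0-15", "16-30", "31-60", "60+"))

band_rate <- tapply(known$Survived, known$AgeBand, mean)  # survival rate per band
print(round(band_rate, 2))
```

On the real data, the same steps could be run separately for each ‘Sex’ and ‘Pclass’ to check whether the dip around age 20 is visible without smoothing.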
#ggplot(data = titanic_train, aes(x = Age, y = Survived, color = Sex)) +
# geom_point(alpha = 0.2) +
# geom_smooth(se = TRUE) + #C.I. TRUE
# facet_grid(Pclass ~ .) +
# theme_classic() +
# theme(text = element_text(size=10)) +
# scale_color_manual(values = c("#EF5350", "#64B5F6"))