The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
This Titanic dataset is taken from Kaggle.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
titanic <- readr::read_csv('train.csv')
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(titanic)
## [1] 891 12
Our Titanic dataset contains 891 rows and 12 columns.
head(titanic)
## # A tibble: 6 × 12
##   PassengerId Survived Pclass Name    Sex     Age SibSp Parch Ticket  Fare Cabin
##         <dbl>    <dbl>  <dbl> <chr>   <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
## 1           1        0      3 Braund… male     22     1     0 A/5 2…  7.25 <NA>
## 2           2        1      1 Cuming… fema…    38     1     0 PC 17… 71.3  C85
## 3           3        1      3 Heikki… fema…    26     0     0 STON/…  7.92 <NA>
## 4           4        1      1 Futrel… fema…    35     1     0 113803 53.1  C123
## 5           5        0      3 Allen,… male     35     0     0 373450  8.05 <NA>
## 6           6        0      3 Moran,… male     NA     0     0 330877  8.46 <NA>
## # … with 1 more variable: Embarked <chr>
Column explanation
str(titanic)
## spec_tbl_df [891 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ PassengerId: num [1:891] 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : num [1:891] 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : num [1:891] 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr [1:891] "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr [1:891] "male" "female" "female" "female" ...
## $ Age : num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : num [1:891] 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : num [1:891] 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr [1:891] "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num [1:891] 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr [1:891] NA "C85" NA "C123" ...
## $ Embarked : chr [1:891] "S" "C" "S" "S" ...
## - attr(*, "spec")=
## .. cols(
## .. PassengerId = col_double(),
## .. Survived = col_double(),
## .. Pclass = col_double(),
## .. Name = col_character(),
## .. Sex = col_character(),
## .. Age = col_double(),
## .. SibSp = col_double(),
## .. Parch = col_double(),
## .. Ticket = col_character(),
## .. Fare = col_double(),
## .. Cabin = col_character(),
## .. Embarked = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
From the output above, we find that some columns are not stored in the correct data type. The categorical columns are read as numbers or character strings, so we convert them to factors.
titanic[,c("Survived", "Pclass", "Sex", "SibSp", "Parch", "Cabin", "Embarked")] <-
lapply(titanic[,c("Survived", "Pclass", "Sex", "SibSp", "Parch", "Cabin", "Embarked")],
as.factor)
str(titanic)
## spec_tbl_df [891 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ PassengerId: num [1:891] 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr [1:891] "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Ticket : chr [1:891] "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num [1:891] 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
## - attr(*, "spec")=
## .. cols(
## .. PassengerId = col_double(),
## .. Survived = col_double(),
## .. Pclass = col_double(),
## .. Name = col_character(),
## .. Sex = col_character(),
## .. Age = col_double(),
## .. SibSp = col_double(),
## .. Parch = col_double(),
## .. Ticket = col_character(),
## .. Fare = col_double(),
## .. Cabin = col_character(),
## .. Embarked = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Each of these columns has now been converted to the desired data type.
Next, we check whether there are missing values in the data.
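The conversion above repeats the list of column names twice. As a minimal sketch, the same idea can be written by storing the names once. The data frame `toy` below is made-up data standing in for a few Titanic columns, and the name `factor_cols` is only illustrative.

```r
# Toy data standing in for a few Titanic columns (values are made up).
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass   = c(3, 1, 3, 2),
  Sex      = c("male", "female", "female", "male"),
  Age      = c(22, 38, 26, 35),
  stringsAsFactors = FALSE
)

# Name the categorical columns once, then convert them in place.
factor_cols <- c("Survived", "Pclass", "Sex")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Check the resulting class of every column.
col_classes <- sapply(toy, class)
print(col_classes)
```

Keeping the names in one vector means a column only has to be added or removed in a single place.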
colSums(is.na(titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age
##           0           0           0           0           0         177
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked
##           0           0           0           0         687           2
There are missing values in three columns:
Age : 177 missing values
Cabin : 687 missing values
Embarked : 2 missing values
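Raw counts are easier to judge next to the share of rows they represent; with 891 rows, the counts above come to roughly 19.9% for Age and 77.1% for Cabin. The sketch below shows one way to get both figures at once. It uses a small made-up data frame (`toy`), and the names `na_count`, `na_share` and `na_summary` are only illustrative.

```r
# Toy data with some missing values (values are made up).
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c(NA, "C85", NA, NA),
  Embarked = c("S", "C", NA, "S"),
  stringsAsFactors = FALSE
)

# Number of missing values per column.
na_count <- colSums(is.na(toy))

# Percentage of missing values per column, rounded to one decimal.
na_share <- round(100 * colMeans(is.na(toy)), 1)

# Combine both into one summary table.
na_summary <- data.frame(missing = na_count, percent = na_share)
print(na_summary)
```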
For column Age, we replace the missing values with the mean age.
titanic$Age[is.na(titanic$Age)] <- mean(titanic$Age, na.rm = TRUE)
For column Cabin, we replace the missing values with “Unknown”.
titanic$Cabin <- as.character(titanic$Cabin)
titanic$Cabin[is.na(titanic$Cabin)] <- "Unknown"
titanic$Cabin <- as.factor(titanic$Cabin)
For column Embarked, we remove the rows that contain missing values, because there are only two of them.
titanic <- na.omit(titanic)
Make sure there are no more missing values in the data.
colSums(is.na(titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age
##           0           0           0           0           0           0
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked
##           0           0           0           0           0           0
Now the Titanic dataset is ready to be processed and analyzed.
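Note that `na.omit()` drops every row that has a missing value in any column. It removes only the two Embarked rows here because Age and Cabin were filled in first. As a minimal sketch on made-up data (`toy`), the three cleaning steps can also be written so that the row removal targets Embarked explicitly:

```r
# Toy data with missing values in three columns (values are made up).
toy <- data.frame(
  Age      = c(20, NA, 40, 30),
  Cabin    = c(NA, "C85", NA, "E46"),
  Embarked = c("S", "C", NA, "S"),
  stringsAsFactors = FALSE
)

# Mean imputation for Age: the mean of 20, 40 and 30 is 30.
toy$Age[is.na(toy$Age)] <- mean(toy$Age, na.rm = TRUE)

# Placeholder label for missing Cabin values.
toy$Cabin[is.na(toy$Cabin)] <- "Unknown"

# Drop only the rows where Embarked is missing.
toy <- toy[!is.na(toy$Embarked), ]

print(toy)
```

Filtering on one column keeps the result the same even if the order of the cleaning steps changes.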
Total Passengers from Each Port of Embarkation
titanic$Embarked <- recode(titanic$Embarked,
"C" = "Cherbourg", "Q" = "Queenstown", "S" = "Southampton")
x <- data.frame(xtabs(~ titanic$Embarked))
x
##   titanic.Embarked Freq
## 1        Cherbourg  168
## 2       Queenstown   77
## 3      Southampton  644
Total Passengers in Each Class
table(titanic$Pclass)
##
##   1   2   3
## 214 184 491
pie(table(titanic$Pclass))
Average Age of Passengers in Each Class by Gender
aggregate(Age ~ Pclass + Sex, titanic, mean)
##   Pclass    Sex      Age
## 1      1 female 33.79665
## 2      2 female 28.74866
## 3      3 female 24.06849
## 4      1   male 39.28772
## 5      2   male 30.65391
## 6      3   male 27.37215
Percentage of Women and Men by Survival Status
Each figure below is a percentage of all passengers in the dataset.
xtabs(~ Sex, titanic[titanic$Survived == 1,])['female']/nrow(titanic)*100
##   female
## 25.98425
xtabs(~ Sex, titanic[titanic$Survived == 1,])['male']/nrow(titanic)*100
##     male
## 12.26097
xtabs(~ Sex, titanic[titanic$Survived == 0,])['female']/nrow(titanic)*100
##   female
## 9.111361
xtabs(~ Sex, titanic[titanic$Survived == 0,])['male']/nrow(titanic)*100
##     male
## 52.64342
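The four calls above each divide by the total number of passengers, so together they describe how all passengers split across sex and survival, and they add up to 100%. They do not give the survival rate within each sex. The sketch below, on made-up data (`toy`), shows how `prop.table()` can produce both views from a single table. The names `counts`, `joint_pct` and `within_sex_pct` are only illustrative.

```r
# Toy data: 3 women (2 survived) and 5 men (1 survived). Values are made up.
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "male"),
  Survived = c(1, 1, 0, 1, 0, 0, 0, 0)
)

# Cross-tabulate sex against survival.
counts <- table(Sex = toy$Sex, Survived = toy$Survived)

# Share of all passengers in each Sex x Survived cell (cells sum to 100).
joint_pct <- round(100 * prop.table(counts), 1)

# Survival rate within each sex (each row sums to 100).
within_sex_pct <- round(100 * prop.table(counts, margin = 1), 1)

print(joint_pct)
print(within_sex_pct)
```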
Correlation between Age and Survived
cor(titanic$Age, as.numeric(titanic$Survived))
## [1] -0.07467292
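One detail worth knowing: calling `as.numeric()` on a factor returns the internal level codes (1 and 2 here), not the original 0 and 1. The Pearson correlation does not change when one variable is shifted by a constant, so the result above is unaffected. The sketch below illustrates this on made-up values; the variable names are only illustrative.

```r
# Made-up ages and survival flags for six passengers.
age      <- c(22, 38, 26, 35, 54, 2)
survived <- factor(c(0, 1, 1, 1, 0, 1))

# Level codes of the factor: 1 for "0", 2 for "1".
codes <- as.numeric(survived)

# Original 0/1 values, recovered through the level labels.
values <- as.numeric(as.character(survived))

# Both versions give the same correlation with age.
r_codes  <- cor(age, codes)
r_values <- cor(age, values)
print(c(r_codes, r_values))
```

Going through `as.character()` matters whenever the actual values are needed, for example when taking the mean of Survived to get a survival rate.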
Average Ticket Prices for Each Class
aggregate(Fare ~ Pclass, titanic, mean)
##   Pclass     Fare
## 1      1 84.19352
## 2      2 20.66218
## 3      3 13.67555
Standard Deviation of Ticket Prices
sd(titanic$Fare)
## [1] 49.6975
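The overall standard deviation mixes the spread within each class with the large differences between classes. As a minimal sketch on made-up fares (`toy`), the same `aggregate()` pattern used for the class means can report the spread per class. The name `sd_by_class` is only illustrative.

```r
# Made-up fares for two passengers in each class.
toy <- data.frame(
  Pclass = factor(c(1, 1, 2, 2, 3, 3)),
  Fare   = c(80, 90, 20, 22, 7, 9)
)

# Standard deviation of Fare within each class.
sd_by_class <- aggregate(Fare ~ Pclass, data = toy, FUN = sd)
print(sd_by_class)

# Overall standard deviation, for comparison.
sd_overall <- sd(toy$Fare)
print(sd_overall)
```

In this toy example the overall value is much larger than any per-class value, because most of the variation comes from the gaps between classes.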
From the above data processing, several conclusions can be drawn