Exploratory data analysis of the Titanic dataset: an investigation of what your chances of surviving the disaster would have been.

Load the cleaned data into the data frame titanic.

titanic <- read.csv("titanic_clean.csv", header = TRUE, sep = ",")

Inspect the structure of titanic.

str(titanic)
## 'data.frame':    1310 obs. of  15 variables:
##  $ pclass          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ survived        : int  1 1 0 0 0 1 1 0 1 0 ...
##  $ name            : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 22 24 25 26 27 31 46 47 51 55 ...
##  $ sex             : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ age             : num  29 0.917 2 30 25 ...
##  $ sibsp           : int  0 1 1 1 1 0 1 0 2 0 ...
##  $ parch           : int  0 2 2 2 2 0 0 0 0 0 ...
##  $ ticket          : Factor w/ 929 levels "110152","110413",..: 188 50 50 50 50 125 93 16 77 826 ...
##  $ fare            : num  211 152 152 152 152 ...
##  $ cabin           : Factor w/ 186 levels "A10","A11","A14",..: 44 80 80 80 80 150 146 16 62 NA ...
##  $ embarked        : Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 3 3 3 1 ...
##  $ boat            : Factor w/ 27 levels "1","10","11",..: 12 3 NA NA NA 13 2 NA 27 NA ...
##  $ body            : int  NA NA NA 135 NA NA NA NA NA 22 ...
##  $ home.dest       : Factor w/ 369 levels "?Havana, Cuba",..: 309 231 231 231 231 237 162 24 22 229 ...
##  $ has_cabin_number: int  1 1 1 1 1 1 1 1 1 0 ...
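The str() output above shows the text columns as factors. Since R 4.0.0, read.csv() defaults to stringsAsFactors = FALSE, so on a recent R version the same call would return character columns instead. A minimal sketch of requesting factors explicitly, using a tiny inline CSV (the csv_text and toy names and their values are illustrative assumptions, not rows from titanic_clean.csv):

```r
# Illustrative inline CSV; not real rows from titanic_clean.csv.
csv_text <- "pclass,survived,sex,embarked
1,1,female,S
3,0,male,C"

# stringsAsFactors = TRUE turns text columns into factors,
# matching the structure shown in the str() output.
toy <- read.csv(text = csv_text, header = TRUE, sep = ",",
                stringsAsFactors = TRUE)
str(toy)
```

With the real file, the same argument could be added to the read.csv("titanic_clean.csv", ...) call.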

The last row (1310) is missing almost every field: only age, embarked and has_cabin_number hold values. It was therefore excluded from the dataset.

tail(titanic)
##      pclass survived                      name    sex      age sibsp parch
## 1305      3        0      Zabour, Miss. Hileni female 14.50000     1     0
## 1306      3        0     Zabour, Miss. Thamine female 29.88113     1     0
## 1307      3        0 Zakarian, Mr. Mapriededer   male 26.50000     0     0
## 1308      3        0       Zakarian, Mr. Ortin   male 27.00000     0     0
## 1309      3        0        Zimmerman, Mr. Leo   male 29.00000     0     0
## 1310     NA       NA                      <NA>   <NA> 29.88113    NA    NA
##      ticket    fare cabin embarked boat body home.dest has_cabin_number
## 1305   2665 14.4542  <NA>        C <NA>  328      <NA>                0
## 1306   2665 14.4542  <NA>        C <NA>   NA      <NA>                0
## 1307   2656  7.2250  <NA>        C <NA>  304      <NA>                0
## 1308   2670  7.2250  <NA>        C <NA>   NA      <NA>                0
## 1309 315082  7.8750  <NA>        S <NA>   NA      <NA>                0
## 1310   <NA>      NA  <NA>        S <NA>   NA      <NA>                0
titanic <- titanic[-1310,]
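Dropping the row by its index works here, but it depends on the row order of the file. A possible alternative, sketched on a toy data frame (the toy name and its values are illustrative assumptions), is to count the missing values per column and then keep only the rows where the outcome variable survived is known:

```r
# Toy stand-in for titanic; the values are illustrative, not real rows.
toy <- data.frame(
  pclass   = c(1, 3, 3, NA),
  survived = c(1, 0, 0, NA),
  sex      = c("female", "male", "male", NA),
  age      = c(29, 26.5, 27, 29.88)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep only rows with a known outcome instead of hard-coding the row index.
toy_complete <- toy[!is.na(toy$survived), ]
print(nrow(toy_complete))
```

Applied to the real data, the filter would read titanic[!is.na(titanic$survived), ].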

Use ggplot() to plot the distribution of sexes within the classes of the ship.

require(ggplot2)
## Loading required package: ggplot2
ggplot(titanic, aes(x = factor(pclass), fill = factor(sex))) +
  geom_bar(position = "dodge")
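The bar heights in this plot are plain counts, so the same information can be read as a table. A minimal sketch with base R on toy data (the toy name and its values are illustrative assumptions, not the real passenger list):

```r
# Toy stand-in for titanic; the values are illustrative only.
toy <- data.frame(
  pclass = c(1, 1, 2, 3, 3, 3),
  sex    = c("female", "male", "female", "male", "male", "female")
)

# Cross-tabulate class against sex: these are the bar heights.
counts <- table(pclass = toy$pclass, sex = toy$sex)
print(counts)

# Row-wise proportions: the share of each sex within a class.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

On the real data, table(titanic$pclass, titanic$sex) would give the numbers behind the bars.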

Use ggplot() to estimate your chances of survival: facet the same plot by survived (0 = died, 1 = survived) to compare the distribution of sexes within each class between the two groups.

ggplot(titanic, aes(x = factor(pclass), fill = factor(sex))) +
  geom_bar(position = "dodge") +
  facet_grid(. ~ survived)
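The faceted bar chart compares counts, which makes survival rates hard to read off directly. Because survived is coded 0/1, its mean within a group is that group's survival rate. A minimal sketch using aggregate() on toy data (the toy name and its values are illustrative assumptions, so the resulting rates say nothing about the real ship):

```r
# Toy stand-in for titanic; the values are illustrative only.
toy <- data.frame(
  pclass   = c(1, 1, 1, 1, 3, 3, 3, 3),
  sex      = c("female", "female", "male", "male",
               "female", "female", "male", "male"),
  survived = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# Mean of a 0/1 variable = proportion of 1s, i.e. the survival rate
# for each combination of class and sex.
rates <- aggregate(survived ~ pclass + sex, data = toy, FUN = mean)
print(rates)
```

The same call with data = titanic would give one survival rate per class and sex.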

Use ggplot() to estimate your chances of survival based on your age: plot age against class, coloured by sex and faceted by survived. The points are jittered horizontally so that passengers of the same class do not overlap.

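# Jitter horizontally only, so the age values on the y axis stay exact.
posn.j <- position_jitter(width = 0.5, height = 0)
ggplot(titanic, aes(x = factor(pclass), y = age, col = factor(sex))) +
  geom_jitter(size = 3, alpha = 0.5, position = posn.j) +
  facet_grid(. ~ survived)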
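To put numbers on the age pattern in the scatter plot, the ages can be binned and a survival rate computed per bin. A minimal sketch on toy data; the toy name, its values, and the cut points (16 and 50) are all illustrative assumptions rather than choices made in the analysis above:

```r
# Toy stand-in for titanic; the values are illustrative only.
toy <- data.frame(
  age      = c(2, 8, 25, 30, 45, 62, NA),
  survived = c(1, 1, 0, 1, 0, 0, 1)
)

# Bin ages into left-closed intervals: [0,16), [16,50), [50,Inf).
# The break points are arbitrary and only serve the example.
toy$age_group <- cut(toy$age, breaks = c(0, 16, 50, Inf), right = FALSE,
                     labels = c("child", "adult", "senior"))

# Survival rate per age group; rows with a missing age are left out.
rate_by_age <- tapply(toy$survived, toy$age_group, mean)
print(rate_by_age)
```

On the real data, the grouping could be extended with pclass and sex to mirror the facets and colours of the plot.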