Exploratory data analysis of the Titanic dataset: investigating whether you would have had a chance of surviving the disaster.
Load the cleaned data into the data frame titanic.
titanic <- read.csv("titanic_clean.csv", header = TRUE, sep = ",")
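The str() output in this analysis lists name, sex, ticket and similar columns as factors. Since R 4.0.0, read.csv() reads text columns as character by default, so reproducing that output on a current R version would probably need stringsAsFactors = TRUE. A minimal sketch of the difference, using a small hypothetical CSV written to a temporary file (not the real titanic_clean.csv):

```r
# Hypothetical two-row CSV, written to a temp file for illustration only
tmp <- tempfile(fileext = ".csv")
writeLines(c("pclass,sex,age", "1,female,29", "3,male,27"), tmp)

# Default behaviour depends on the R version
default_read <- read.csv(tmp)
# Explicitly request factors, as in the str() output of this analysis
factor_read <- read.csv(tmp, stringsAsFactors = TRUE)

class(default_read$sex)  # "character" on R >= 4.0.0, "factor" on older versions
class(factor_read$sex)   # "factor"

unlink(tmp)  # remove the temp file
```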
Check out the structure of titanic.
str(titanic)
## 'data.frame': 1310 obs. of 15 variables:
## $ pclass : int 1 1 1 1 1 1 1 1 1 1 ...
## $ survived : int 1 1 0 0 0 1 1 0 1 0 ...
## $ name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 22 24 25 26 27 31 46 47 51 55 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
## $ age : num 29 0.917 2 30 25 ...
## $ sibsp : int 0 1 1 1 1 0 1 0 2 0 ...
## $ parch : int 0 2 2 2 2 0 0 0 0 0 ...
## $ ticket : Factor w/ 929 levels "110152","110413",..: 188 50 50 50 50 125 93 16 77 826 ...
## $ fare : num 211 152 152 152 152 ...
## $ cabin : Factor w/ 186 levels "A10","A11","A14",..: 44 80 80 80 80 150 146 16 62 NA ...
## $ embarked : Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 3 3 3 1 ...
## $ boat : Factor w/ 27 levels "1","10","11",..: 12 3 NA NA NA 13 2 NA 27 NA ...
## $ body : int NA NA NA 135 NA NA NA NA NA 22 ...
## $ home.dest : Factor w/ 369 levels "?Havana, Cuba",..: 309 231 231 231 231 237 162 24 22 229 ...
## $ has_cabin_number: int 1 1 1 1 1 1 1 1 1 0 ...
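str() already hints at missing values (the NA entries in cabin, boat and body). One way to quantify them is to count the NAs per column. The sketch below assumes nothing about the real data: toy is a made-up three-row stand-in for titanic.

```r
# Made-up stand-in for the titanic data frame (illustration only)
toy <- data.frame(
  survived = c(1, 0, NA),
  age      = c(29, NA, 29.88),
  cabin    = c("B5", NA, NA),
  stringsAsFactors = TRUE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Row numbers where the outcome variable itself is missing
missing_outcome <- which(is.na(toy$survived))
missing_outcome
```

On the real data the same call would be colSums(is.na(titanic)).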
The last row (1310) has missing values in nearly every field, including pclass, survived, name and sex; only age, embarked and has_cabin_number are filled in. It is therefore excluded from the dataset.
tail(titanic)
## pclass survived name sex age sibsp parch
## 1305 3 0 Zabour, Miss. Hileni female 14.50000 1 0
## 1306 3 0 Zabour, Miss. Thamine female 29.88113 1 0
## 1307 3 0 Zakarian, Mr. Mapriededer male 26.50000 0 0
## 1308 3 0 Zakarian, Mr. Ortin male 27.00000 0 0
## 1309 3 0 Zimmerman, Mr. Leo male 29.00000 0 0
## 1310 NA NA <NA> <NA> 29.88113 NA NA
## ticket fare cabin embarked boat body home.dest has_cabin_number
## 1305 2665 14.4542 <NA> C <NA> 328 <NA> 0
## 1306 2665 14.4542 <NA> C <NA> NA <NA> 0
## 1307 2656 7.2250 <NA> C <NA> 304 <NA> 0
## 1308 2670 7.2250 <NA> C <NA> NA <NA> 0
## 1309 315082 7.8750 <NA> S <NA> NA <NA> 0
## 1310 <NA> NA <NA> S <NA> NA <NA> 0
titanic <- titanic[-1310,]
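Removing the row by its hard-coded index works here, but it would silently drop the wrong passenger if the file changed. A possible alternative is to filter on the missing outcome instead. A minimal sketch on a made-up stand-in data frame (toy is hypothetical, not the real data):

```r
# Made-up stand-in for the titanic data frame (illustration only)
toy <- data.frame(
  pclass   = c(1, 3, 3, NA),
  survived = c(1, 0, 0, NA),
  sex      = c("female", "male", "male", NA),
  age      = c(29, 27, 29, 29.88)
)

# Keep only rows where both class and outcome are known
toy_clean <- toy[!is.na(toy$survived) & !is.na(toy$pclass), ]
nrow(toy_clean)
```

Applied to the real data this would read titanic[!is.na(titanic$survived) & !is.na(titanic$pclass), ].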
Use ggplot() to plot the distribution of sexes within the classes of the ship.
library(ggplot2)  # unlike require(), library() stops with an error if the package is missing
# One group of bars per class, with the sexes side by side
ggplot(titanic, aes(x = factor(pclass), fill = factor(sex))) +
  geom_bar(position = "dodge")
Use ggplot() to estimate your chances of survival: facet the same plot by survived (0 = did not survive, 1 = survived) and compare the distribution of sexes within the classes of the ship.
# Same bar chart, split into one panel per value of survived
ggplot(titanic, aes(x = factor(pclass), fill = factor(sex))) +
  geom_bar(position = "dodge") +
  facet_grid(. ~ survived)
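The faceted bars show counts, so the survival chance has to be read off by comparing bar heights across the two panels. Because survived is coded 0/1, its mean within a group is the share of survivors, which gives the same information as a table. A minimal sketch with base R's aggregate(), run on made-up values rather than the real data:

```r
# Made-up stand-in for the titanic data frame (illustration only)
toy <- data.frame(
  pclass   = c(1, 1, 1, 1, 3, 3, 3, 3),
  sex      = c("female", "female", "male", "male",
               "female", "female", "male", "male"),
  survived = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# Mean of a 0/1 variable = proportion of 1s, i.e. the survival rate per group
rates <- aggregate(survived ~ pclass + sex, data = toy, FUN = mean)
rates
```

On the real data the call would be aggregate(survived ~ pclass + sex, data = titanic, FUN = mean).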
Use ggplot() to estimate your chances of survival based on your age: plot age against class, colored by sex, again faceted by survived.
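# Jitter horizontally only, so the age values on the y axis stay exact
posn.j <- position_jitter(width = 0.5, height = 0)
ggplot(titanic, aes(x = factor(pclass), y = age, col = factor(sex))) +
  geom_jitter(size = 3, alpha = 0.5, position = posn.j) +
  facet_grid(. ~ survived)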
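To put rough numbers on the age effect seen in the jitter plot, age can be binned with cut() and the survival rate computed per bin. The sketch below is only an illustration: the data are made up, and the break points (16 and 40) and the labels are arbitrary assumptions, not taken from this analysis.

```r
# Made-up stand-in for the titanic data frame (illustration only)
toy <- data.frame(
  age      = c(2, 10, 25, 30, 35, 50, 60, 70),
  survived = c(1, 1, 1, 0, 0, 0, 1, 0)
)

# Bin age into [0, 16), [16, 40) and [40, Inf); the breaks are arbitrary
toy$age_group <- cut(
  toy$age,
  breaks = c(0, 16, 40, Inf),
  labels = c("child", "adult", "older"),
  right  = FALSE
)

# Share of survivors within each age group
age_rates <- aggregate(survived ~ age_group, data = toy, FUN = mean)
age_rates
```

Adding pclass and sex to the formula (survived ~ age_group + pclass + sex) would mirror the grouping used in the plot.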