Introduction

This analysis explores the relationship between variables in the Titanic dataset and the survival of Titanic passengers.

1) What is the relationship between Titanic variables and survival?

Variable Description:

Pclass = Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)

Survived = Survival (0 = No, 1 = Yes)

Sex = Sex

Embarked = Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

#Load Data
titanic.new <- read.csv("titanic_original.csv", header = TRUE, sep = ",")
library(ggplot2)
library(knitr)
## Warning: package 'knitr' was built under R version 3.4.3

Show the mean, median, 1st and 3rd quartiles (25th and 75th percentiles), min, and max for the Titanic dataset

summary(titanic.new)
##      pclass         survived                                   name     
##  Min.   :1.000   Min.   :0.000   Connolly, Miss. Kate            :   2  
##  1st Qu.:2.000   1st Qu.:0.000   Kelly, Mr. James                :   2  
##  Median :3.000   Median :0.000   Abbing, Mr. Anthony             :   1  
##  Mean   :2.297   Mean   :0.381   Abbott, Master. Eugene Joseph   :   1  
##  3rd Qu.:3.000   3rd Qu.:1.000   Abbott, Mr. Rossmore Edward     :   1  
##  Max.   :3.000   Max.   :1.000   Abbott, Mrs. Stanton (Rosa Hunt):   1  
##                                  (Other)                         :1299  
##      sex           age              sibsp            parch       
##  female:464   Min.   : 0.1667   Min.   :0.0000   Min.   :0.0000  
##  male  :843   1st Qu.:21.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##               Median :28.0000   Median :0.0000   Median :0.0000  
##               Mean   :29.8426   Mean   :0.4996   Mean   :0.3856  
##               3rd Qu.:39.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##               Max.   :80.0000   Max.   :8.0000   Max.   :9.0000  
##               NA's   :263                                        
##       ticket          fare                     cabin      embarked
##  CA. 2343:  11   Min.   :  0.000                  :1014   C:270   
##  1601    :   8   1st Qu.:  7.896   C23 C25 C27    :   6   Q:123   
##  CA 2144 :   8   Median : 14.454   B57 B59 B63 B66:   5   S:914   
##  3101295 :   7   Mean   : 33.224   G6             :   5           
##  347077  :   7   3rd Qu.: 31.275   B96 B98        :   4           
##  347082  :   7   Max.   :512.329   C22 C26        :   4           
##  (Other) :1259   NA's   :1         (Other)        : 269           
##       boat          body                      home.dest  
##         :823   Min.   :  1.0                       :563  
##  13     : 39   1st Qu.: 72.0   New York, NY        : 64  
##  C      : 38   Median :155.0   London              : 14  
##  15     : 37   Mean   :160.8   Montreal, PQ        : 10  
##  14     : 33   3rd Qu.:256.0   Cornwall / Akron, OH:  9  
##  4      : 31   Max.   :328.0   Paris, France       :  9  
##  (Other):306   NA's   :1186    (Other)             :638

Find Missing values

colSums(is.na(titanic.new))
##    pclass  survived      name       sex       age     sibsp     parch 
##         0         0         0         0       263         0         0 
##    ticket      fare     cabin  embarked      boat      body home.dest 
##         0         1         0         0         0      1186         0
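Note that is.na() only counts values R already treats as missing. The summary() output above shows blank strings as well (for example, 1014 blank cabin entries), and those are not counted here. A minimal sketch of counting both kinds of gap per column, using a small made-up data frame (toy, its values, and the helper count_missing are hypothetical, not part of the original analysis):

```r
# Hypothetical toy data standing in for titanic.new
toy <- data.frame(
  age   = c(22, NA, 35, NA),
  cabin = c("C85", "", "", "E46"),
  fare  = c(7.25, 71.28, NA, 8.05),
  stringsAsFactors = FALSE
)

# Count entries that are either NA or an empty string in one column
count_missing <- function(col) sum(is.na(col) | as.character(col) == "")

# Apply the helper to every column of the data frame
missing_counts <- sapply(toy, count_missing)
print(missing_counts)
```

The same helper could be applied to titanic.new to see blank cabin, boat, and home.dest entries alongside the NA counts.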

Calculate the mean of age and fill in missing values with the mean

titanic.new$age[is.na(titanic.new$age)] <- mean(titanic.new$age, na.rm=TRUE)
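To make the effect of this step concrete, here is a small sketch of mean imputation on a made-up age vector (the values are hypothetical). It also prints the standard deviation before and after, since filling gaps with the mean adds observations without adding spread:

```r
# Hypothetical ages with two missing entries
age <- c(20, NA, 30, 40, NA)

# Mean of the observed ages only (na.rm = TRUE drops the NAs)
age_mean <- mean(age, na.rm = TRUE)

# Copy the vector, then overwrite only the missing positions
age_filled <- age
age_filled[is.na(age_filled)] <- age_mean
print(age_filled)

# Compare spread of the observed values with spread after imputation
sd_observed <- sd(age, na.rm = TRUE)
sd_filled   <- sd(age_filled)
print(c(sd_observed = sd_observed, sd_filled = sd_filled))
```

With 263 of the ages imputed in titanic.new, a similar narrowing of the age distribution around the mean is to be expected.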

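Mark missing fare values as NA. The fare column is numeric and read.csv already stores its one missing value as NA, so this step leaves the column unchanged.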

farenew <- titanic.new$fare
farenew[farenew == ""] <- NA
titanic.new$fare <- farenew

Replace missing values with NA for cabin

cabinNumber<- titanic.new$cabin
cabinNumber[cabinNumber == ""] <- NA
titanic.new$cabin <- cabinNumber
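As an alternative to replacing blanks after loading, read.csv can convert them at read time through its na.strings argument. The sketch below uses a made-up three-row CSV passed in as text (csv_text and its contents are hypothetical) rather than titanic_original.csv:

```r
# Hypothetical three-row CSV; passenger B has a blank cabin
csv_text <- "name,cabin,fare
A,C85,71.28
B,,7.25
C,E46,"

# na.strings lists the strings that should be read as NA
toy <- read.csv(text = csv_text, header = TRUE,
                na.strings = c("", "NA"), stringsAsFactors = FALSE)

# Blank cabin and blank fare both arrive as NA
print(is.na(toy$cabin))
print(is.na(toy$fare))
```

Applied to the real file, this would make colSums(is.na(...)) report blank cabin values directly.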

Create new column has_cabin_number

titanic.new$has_cabin_number <- 0

Replace 0 with 1 if cabin number is known for each passenger

for (i in seq_along(titanic.new$cabin)) {
  if (!is.na(titanic.new$cabin[i])) {
    titanic.new$has_cabin_number[i] <- 1
  }
}
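The loop above can also be written without explicit iteration, since is.na() works on a whole vector at once. A minimal sketch on a made-up cabin vector (the values are hypothetical):

```r
# Hypothetical cabin values; NA marks an unknown cabin
cabin <- c("C85", NA, "E46", NA)

# !is.na() is TRUE where the cabin is known;
# as.integer() maps TRUE/FALSE to 1/0
has_cabin_number <- as.integer(!is.na(cabin))
print(has_cabin_number)
```

The one difference from the loop is that the result is stored as integer rather than double, which does not matter for plotting or tabulation.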

Convert pclass and survived from integer to factor so they plot as discrete categories

titanic.new$pclass <- as.factor(titanic.new$pclass)
titanic.new$survived <- as.factor(titanic.new$survived)

Finding the relationship between passenger class and survival

ggplot(data = titanic.new, aes(x = pclass, fill = survived)) + geom_bar(position = "fill") + ylab("Proportion")

It appears that passengers in lower-numbered classes (1st class being the highest) had a better chance of survival.
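The heights that geom_bar(position = "fill") draws are within-group proportions, and they can be computed directly to put numbers behind the plot. A sketch on a made-up eight-passenger data frame (toy and its values are hypothetical, not taken from the Titanic data):

```r
# Hypothetical toy data: class and survival for eight passengers
toy <- data.frame(
  pclass   = factor(c(1, 1, 1, 2, 2, 3, 3, 3)),
  survived = factor(c(1, 1, 0, 1, 0, 0, 0, 1))
)

# Cross-tabulate class (rows) against survival (columns)
counts <- table(toy$pclass, toy$survived)

# margin = 1 divides each row by its total, giving proportions per class
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

The same two calls on titanic.new$pclass and titanic.new$survived would give the survival proportion for each class; swapping in sex or embarked gives the figures for the later plots.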

Survival as a function of sex

ggplot(data = titanic.new, aes(x = sex, fill = survived)) + geom_bar(position = "fill") + ylab("Proportion")

It appears that females had a higher probability of survival than males.

Survival as a function of port of embarkation

ggplot(data = titanic.new, aes(x = embarked, fill = survived)) + geom_bar(position = "fill") + ylab("Proportion")

It appears that passengers who embarked at Cherbourg had a noticeably higher chance of survival than those who embarked at Queenstown or Southampton.

Set up jitter levels

position.j <- position_jitter(0.5, 0)

Create plot with jitter

ggplot(titanic.new, aes(pclass, age, col = sex)) + geom_jitter(size = 3, alpha = 0.5, position = position.j) + facet_grid(. ~ survived)

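It appears that passengers of both sexes between about 18 and 30 years of age were more likely to survive. Note that the 263 passengers whose missing ages were filled with the mean (about 29.8) all fall inside this band, so part of the pattern may reflect the imputation.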
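A jitter plot makes age ranges hard to compare by eye, so one way to check a statement about an age band is to bin age and compute the survival proportion per bin. The sketch below uses a made-up data frame with survived kept as numeric 0/1 (toy, its values, and the chosen break points are all assumptions for illustration):

```r
# Hypothetical toy data; survived is numeric 0/1 here.
# In titanic.new it was converted to a factor, so it would first need
# as.integer(as.character(titanic.new$survived)).
toy <- data.frame(
  age      = c(5, 17, 22, 29, 31, 45, 60, 70),
  survived = c(1, 1, 1, 0, 0, 0, 0, 1)
)

# Bin ages; right = FALSE makes each interval closed on the left
toy$age_group <- cut(toy$age, breaks = c(0, 18, 30, 50, Inf), right = FALSE)

# The mean of a 0/1 column within each bin is the survival proportion
rate_by_group <- tapply(toy$survived, toy$age_group, mean)
print(rate_by_group)
```

Running the same steps on the real data, ideally on the rows with observed rather than imputed ages, would show whether the 18 to 30 band really stands out.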

Conclusion

The plots suggest that pclass, age, sex, and embarked are each associated with passenger survival in the Titanic dataset.