Introduction
This analysis explores how several variables in the Titanic dataset relate to the survival of Titanic passengers.
1) What is the relationship between the Titanic variables and survival?
Variable Description:
Pclass = Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
Survived = Survival (0 = No, 1 = Yes)
Sex = Sex
Embarked = Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
#Load Data
titanic.new <- read.csv("titanic_original.csv", header = TRUE, sep = ",")
library(ggplot2)
library(knitr)
## Warning: package 'knitr' was built under R version 3.4.3
Find missing values
colSums(is.na(titanic.new))
## pclass survived name sex age sibsp parch
## 0 0 0 0 263 0 0
## ticket fare cabin embarked boat body home.dest
## 0 1 0 0 0 1186 0
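The count above reports 0 missing values for cabin because `is.na()` does not treat empty strings as missing. A minimal sketch of counting both kinds of gap per column follows; the `toy` data frame and its values are made up for illustration and are not taken from titanic_original.csv.

```r
# Toy data (hypothetical values): one NA in age, two empty strings in cabin
toy <- data.frame(
  age   = c(22, NA, 35),
  cabin = c("C85", "", ""),
  stringsAsFactors = FALSE
)

# Counts only true NA values, as in the report
na_counts <- colSums(is.na(toy))

# Counts NA values and empty strings together, column by column
blank_counts <- sapply(toy, function(col) {
  sum(is.na(col) | as.character(col) == "")
})

print(na_counts)
print(blank_counts)
```

Comparing the two results column by column shows which variables hide missing entries as empty strings.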
Calculate the mean age and fill in the missing values with it
titanic.new$age[is.na(titanic.new$age)] <- mean(titanic.new$age, na.rm=TRUE)
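Mean imputation keeps the average age unchanged but pulls the spread of the variable inward, which is worth keeping in mind when reading any age plot. The sketch below illustrates this on a made-up vector; the values are assumptions for illustration, not ages from the dataset.

```r
# Toy ages (hypothetical), with two missing values
age <- c(20, 30, NA, 40, NA)

# Same pattern as in the report: overwrite NA entries with the observed mean
age_imputed <- age
age_imputed[is.na(age_imputed)] <- mean(age, na.rm = TRUE)

# Compare the mean and standard deviation before and after imputation
mean_before <- mean(age, na.rm = TRUE)
mean_after  <- mean(age_imputed)
sd_before   <- sd(age, na.rm = TRUE)
sd_after    <- sd(age_imputed)

print(c(mean_before = mean_before, mean_after = mean_after))
print(c(sd_before = sd_before, sd_after = sd_after))
```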
Replace empty values with NA for fare (its one missing value is already NA, as the output above shows)
farenew <- titanic.new$fare
farenew[farenew == ""] <- NA
titanic.new$fare <- farenew
Replace empty strings with NA for cabin
cabinNumber <- titanic.new$cabin
cabinNumber[cabinNumber == ""] <- NA
titanic.new$cabin <- cabinNumber
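An alternative to replacing empty strings column by column is to have `read.csv` mark them as NA while loading, through its `na.strings` argument. The sketch below reads a small inline CSV rather than the real file; the `csv_text` contents are made up for illustration.

```r
# Inline CSV standing in for the real file (hypothetical rows)
csv_text <- "name,cabin,fare\nA,C85,71.28\nB,,7.25\nC,,\n"

# Treat both empty fields and the literal string "NA" as missing
toy <- read.csv(text = csv_text,
                na.strings = c("", "NA"),
                stringsAsFactors = FALSE)

# Empty cabin and fare fields are now counted as missing
na_counts <- colSums(is.na(toy))
print(na_counts)
```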
Create new column has_cabin_number
titanic.new$has_cabin_number <- 0
Set has_cabin_number to 1 for each passenger whose cabin number is known
for (i in seq_along(titanic.new$cabin)) {
  if (!is.na(titanic.new$cabin[i])) {
    titanic.new$has_cabin_number[i] <- 1
  }
}
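The loop above can also be written as a single vectorized expression, which is the more common idiom in R. A minimal sketch on a made-up `cabin` vector (the values are assumptions for illustration) that checks both approaches give the same result:

```r
# Toy cabin values (hypothetical); NA marks an unknown cabin
cabin <- c("C85", NA, "E46", NA)

# Loop version, mirroring the report
has_cabin_loop <- rep(0, length(cabin))
for (i in seq_along(cabin)) {
  if (!is.na(cabin[i])) {
    has_cabin_loop[i] <- 1
  }
}

# Vectorized version: TRUE/FALSE converted to 1/0 in one step
has_cabin_vec <- as.integer(!is.na(cabin))

print(has_cabin_vec)
```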
Convert pclass and survived from integer to factor for plotting
titanic.new$pclass <- as.factor(titanic.new$pclass)
titanic.new$survived <- as.factor(titanic.new$survived)
Finding the relationship between passenger class and survival
ggplot(data = titanic.new, aes(x = pclass, fill = survived)) + geom_bar(position = "fill") + ylab("Proportion")

It appears that passengers in lower-numbered classes (1st rather than 3rd) had a better chance of survival.
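The proportions that a filled bar chart displays can also be computed as numbers, which makes a visual reading easier to check. The sketch below uses `table` and `prop.table` on a small made-up data frame; the `toy` rows are assumptions for illustration, and the same call should apply to sex or embarked in place of pclass.

```r
# Toy data (hypothetical rows), with factors as in the report
toy <- data.frame(
  pclass   = factor(c(1, 1, 1, 2, 2, 3, 3, 3, 3)),
  survived = factor(c(1, 1, 0, 1, 0, 0, 0, 0, 1))
)

# Cross-tabulate, then convert each row (class) to proportions
counts <- table(toy$pclass, toy$survived)
rates  <- prop.table(counts, margin = 1)

# Column "1" holds the share of survivors within each class
print(round(rates, 2))
```

With `margin = 1` each row sums to 1, matching the stacked bars produced by `position = "fill"`.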
Survival as a function of sex
ggplot(data = titanic.new, aes(x = sex, fill = survived)) + geom_bar(position = "fill") + ylab("Proportion")

It appears that females had a higher probability of survival than males.
Survival as a function of port of embarkation
ggplot(data = titanic.new, aes(x = embarked, fill = survived)) + geom_bar(position = "fill") + ylab("Proportion")

It appears that passengers who embarked from the port of Southampton were more likely to survive.
Set up jitter levels
position.j <- position_jitter(width = 0.5, height = 0)
Create plot with jitter
ggplot(titanic.new, aes(pclass, age, col = sex)) + geom_jitter(size = 3, alpha = 0.5, position = position.j) + facet_grid(. ~ survived)

It appears that passengers of both sexes between 18 and 30 years of age were more likely to survive.
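A reading of the jitter plot by age range can be checked by grouping ages into bands and computing the survival rate in each. The sketch below is one possible way to do that with `cut` and `tapply`; the `toy` rows and the band boundaries are assumptions for illustration. It keeps survived as numeric 0/1, so with the factor version used above it would first need `as.integer(as.character(...))`.

```r
# Toy data (hypothetical rows); survived kept as numeric 0/1
toy <- data.frame(
  age      = c(5, 17, 22, 29, 45, 63),
  survived = c(1, 0, 1, 1, 0, 0)
)

# Bands are left-closed: [0, 18), [18, 30), [30, Inf)
toy$age_band <- cut(toy$age,
                    breaks = c(0, 18, 30, Inf),
                    right  = FALSE,
                    labels = c("under 18", "18-29", "30+"))

# Mean of a 0/1 variable is the survival rate within each band
rate_by_band <- tapply(toy$survived, toy$age_band, mean)
print(rate_by_band)
```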
Conclusion
It can be inferred that the variables pclass, age, sex and embarked are associated with passenger survival in the Titanic dataset.