I am going to combine the train and test data to create a complete dataset for analysis.
traindata <- read.csv('titanic_training.csv', stringsAsFactors = F)
testdata <- read.csv('titanic_test.csv', stringsAsFactors = F)
full <- bind_rows(traindata, testdata)
nrow(full)
## [1] 1309
Now that we have our complete dataset, let’s start our analysis. As we all know already, children, women and passengers from the first class were given preference during evacuation. Let us confirm this first through our analysis.
Percentage survival among children, women, men, first class passeners, second class passengers, and third class passengers. (For the purposes of our analysis, let us treat everyone below age 12 as children)
totalchildren <- traindata %>%
filter(Age <= 12)
#No of children Survived vs deceased
table(totalchildren$Survived)
##
## 0 1
## 29 40
adulttable <- traindata %>%
filter(Age > 12)
#Survival of men and women
table(adulttable$Sex, adulttable$Survived)
##
## 0 1
## female 51 178
## male 344 72
#Survival of passengers of different class
table(adulttable$Pclass, adulttable$Survived)
##
## 0 1
## 1 63 119
## 2 90 66
## 3 242 65
From the table output itself we can see that the commonly held view of women, children and first class passengers given preference is true.Let’s plot these values in a graph to get an accurate percentage.
ggplot(totalchildren, aes(x=Survived)) +
geom_bar(aes(y = (..count..)/sum(..count..)))+
scale_y_continuous(name = "Percentage", labels=percent)+
theme_few()
womentable <- adulttable %>%
filter(Sex =="female")
mentable <- adulttable %>%
filter(Sex =="male")
ggplot(womentable, aes(x=Sex, fill=factor(Survived))) +
geom_bar(aes(y = (..count..)/sum(..count..)), position ='dodge')+
scale_y_continuous(name = "Percentage", labels=percent)+
theme_few()
From the graph we can see that around 80% of adolescent and adult women were rescued, while 20% died.
Now plotting the same graph for men.
ggplot(mentable, aes(x=Sex, fill=factor(Survived))) +
geom_bar(aes(y = (..count..)/sum(..count..)), position ='dodge')+
scale_y_continuous(name = "Percentage", labels=percent)+
theme_few()
We can see that the stats are reversed here. Aroudn 80% of adolescent and adult men died, while only less 20% were rescued.
ggplot(adulttable, aes(x=Pclass, fill=factor(Survived))) +
geom_bar(aes(y = (..count..)/sum(..count..)), position ='dodge')+
scale_y_continuous(name = "Percentage", labels=percent)+
theme_few()
Now that we have tested our hypothesis, let’s go on to see if any other variable has this kind of influence on survival. We will see if port of embarkment has an impact on survival.
ggplot(adulttable, aes(x=Embarked, fill=factor(Survived))) +
geom_bar(aes(y = (..count..)/sum(..count..)), position ='dodge')+
scale_y_continuous(name = "Percentage", labels=percent)+
theme_few()
This shows that those who embarked on SouthAmpton has lesser chances of survival than those who embarked at other ports. But this may be a false reasoning, because SouthAmpton was the beginning place and most people who boarded there were adult men and third class passengers. To confirm this, let us see if the distribution of sexes and classes among people from different port of embarkment.
##Plotting embarkment and Sex
ggplot(adulttable, aes(Embarked, fill=factor(Sex))) +
geom_bar(aes(y = (..count..)/sum(..count..)),position ='dodge')+
scale_y_continuous(name = "Percentage", labels=percent)+
theme_few()
##Plotting embarkment and passenger class
ggplot(adulttable, aes(Embarked, fill=factor(Pclass))) +
geom_bar(aes(y = (..count..)/sum(..count..)),position ='dodge')+
scale_y_continuous(name = "Percentage", labels=percent)+
theme_few()
From the graphs, we can verify our hypothesis that passsengers from SouthAmpton had relatively more men and more third class passengers than passengers from other embarkments.
Let us first see the distribution of age variable.
hist(full$Age, freq=F, main ="Age Distribution", col = 'lightblue')
Seems like there were a lot of young adults on the ship. Now let’s impute values to the missing age values using the rpart method.
# # Make variables factors into factors
factor_vars <- c('PassengerId','Pclass','Sex','Embarked')
full[factor_vars] <- lapply(full[factor_vars], function(x) as.factor(x))
# # Set a random seed
set.seed(129)
# # Perform mice imputation, excluding certain less-than-useful variables:
mice_mod <- mice(full[, !names(full) %in% c('PassengerId','Name','Ticket','Cabin','Fare','Survived')], method='rf')
##
## iter imp variable
## 1 1 Age
## 1 2 Age
## 1 3 Age
## 1 4 Age
## 1 5 Age
## 2 1 Age
## 2 2 Age
## 2 3 Age
## 2 4 Age
## 2 5 Age
## 3 1 Age
## 3 2 Age
## 3 3 Age
## 3 4 Age
## 3 5 Age
## 4 1 Age
## 4 2 Age
## 4 3 Age
## 4 4 Age
## 4 5 Age
## 5 1 Age
## 5 2 Age
## 5 3 Age
## 5 4 Age
## 5 5 Age
mice_output <- complete(mice_mod)
par(mfrow=c(1,2))
hist(full$Age, freq=F, main='Age: Original Data',
col='darkgreen', ylim=c(0,0.04))
hist(mice_output$Age, freq=F, main='Age: MICE Output',
col='lightgreen', ylim=c(0,0.04))
full$Age <- mice_output$Age
completedata <- full
completedata$Embarked[c(62, 830)] <- 'C'
completedata$Fare[1044] <- median(completedata[completedata$Pclass == '3' & completedata$Embarked == 'S', ]$Fare, na.rm = TRUE)
I am going to classify people into Six different age groups - Baby (0-2), toddler(3-6), child (7-12), adolescent (13-18), adult (19-60), elderly(60+)
#Create new variable with different levels
completedata$ageGroup[completedata$Age<=2] <- "Baby"
completedata$ageGroup[(completedata$Age >2) & (completedata$Age<= 6)] <- "Toddler"
completedata$ageGroup[(completedata$Age >6) & (completedata$Age<= 12)] <- "Child"
completedata$ageGroup[(completedata$Age >12) & (completedata$Age<= 18)] <- "Adolescent"
completedata$ageGroup[(completedata$Age >18) & (completedata$Age<= 60)] <- "Adult"
completedata$ageGroup[(completedata$Age >60)] <- "Elderly"
Before categorizing women, let us see the distribution of age among women.
##Distribution of age of women
ggplot(completedata[completedata$Sex == "female", ], aes(x = Age)) + geom_density(fill = '#99d6ff', alpha=0.4) +
geom_vline(aes(xintercept=median(Age)),
colour='black', linetype='dashed', lwd=1) +
theme_few()
The filters I am going to use for creating a subgroup of ‘Mothers’ are - Sex is female. - No of children greater than zero. - No of Spouses equal to one(assuming monogamy) - Age greater than 18 and less than 45.
If the women’s age is above 45, her kids would most probably have grown up into adults already. So we will have to exclude women aged above 45.
#Creation of new attribute
completedata$Nwomen <- "Other"
completedata$Nwomen[completedata$Sex == 'female' & completedata$Parch > 0 & completedata$Age > 20 & completedata$Age < 40] <- "Mother"
#comparing mothers chances across class
table(completedata$Nwomen, completedata$Pclass, completedata$Survived)
## , , = 0
##
##
## 1 2 3
## Mother 1 1 11
## Other 79 96 361
##
## , , = 1
##
##
## 1 2 3
## Mother 12 13 9
## Other 124 74 110
We can see the rate of survival of mothers is 5/6, 8/9, and 1/3 for first, second and third class respectively.
We are going to repeat the same process for men as well, categorizing them into fathers and non-fathers. The filters are - Sex is male - Age is between 18 and 45 - No of children greater than zero. - No of Spouses equal to one
#Creation of new attribute
completedata$Nmen <- "Other"
completedata$Nmen[completedata$Sex == 'male' & completedata$Parch > 0 & completedata$SibSp ==1 & completedata$Age > 18 & completedata$Age < 45] <- "Father"
Let us now see if fathers had more chances of survival.
ggplot(completedata, aes(x=Nmen, fill=factor(Survived)))+
geom_bar(stat='count', position = 'dodge')+
theme_few()
ferdinora <- c(34,35)
Never expected this output. Seems like it’s actually a negative thing to be father. Other men survived better than fathers did. But the sample size of fathers was really small, so we can’t really say whether there was no impact on survival if a guy was a father. For now, let’s just keep this aside.
# Factorizing our predictive attributes
completedata$Pclass <- as.factor(completedata$Pclass)
completedata$Sex <- as.factor(completedata$Sex)
completedata$ageGroup <- as.factor(completedata$ageGroup)
completedata$Nwomen <- as.factor(completedata$Nwomen)
completedata$Nmen <- as.factor(completedata$Nmen)
completedata$Survived <- as.factor(completedata$Survived)
##completedata$Survived <- as.integer(completedata$Survived)
#splitting the dataset
train <- completedata[1:891,]
test <- completedata[892:1309,]
# Set a random seed
set.seed(759)
# Build the model
rf_model <- randomForest(factor(Survived) ~ Pclass + Sex + ageGroup + Nwomen, data = train)
plot(rf_model)
legend('topright', colnames(rf_model$err.rate), col=1:3, fill=1:3)
# Get importance
importance <- importance(rf_model)
varImportance <- data.frame(Variables = row.names(importance),
Importance = round(importance[ ,'MeanDecreaseGini'],2))
# Create a rank variable based on importance
rankImportance <- varImportance %>%
mutate(Rank = paste0('#',dense_rank(desc(Importance))))
# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, Importance),
y = Importance, fill = Importance)) +
geom_bar(stat='identity') +
geom_text(aes(x = Variables, y = 0.5, label = Rank),
hjust=0, vjust=0.55, size = 4, colour = 'red') +
labs(x = 'Variables') +
coord_flip() +
theme_few()
# Predict using the test set
prediction <- predict(rf_model, test)
# Save the solution to a dataframe with two columns: PassengerId and Survived (prediction)
solution <- data.frame(PassengerID = test$PassengerId, Survived = prediction)
# Write the solution to file
write.csv(solution, file = 'resultfinal_vasanth.csv', row.names = F)