Setting up my environment

Installing and loading helpful packages

install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("tidyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("stringr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("readr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Note: dplyr, tidyr, stringr, readr, and lubridate are already attached by
# library(tidyverse), so reloading them below is redundant but harmless.
library(dplyr)
library(tidyr)
library(stringr)
library(readr)
library(lubridate)

Titanic Survival Analysis and Prediction Using R

🔗 Dataset Source: Kaggle competition Titanic - Machine Learning from Disaster

Download these CSV files from Kaggle:

train.csv – Training dataset (used for model building)

test.csv – Test dataset (used for prediction submission)

gender_submission.csv – Sample output format

Introduction

The sinking of the RMS Titanic in 1912 remains one of the most infamous maritime disasters in history. Of the roughly 2,200 passengers and crew on board, more than 1,500 lost their lives when the ship struck an iceberg and sank in the North Atlantic. The passenger records from this tragedy have become a classic dataset for exploring survival patterns and predictive modeling in data science.

In this project, we use the Titanic dataset from Kaggle to uncover which factors most influenced a passenger's chance of survival. By applying the six-step data analysis process (Ask, Prepare, Process, Analyze, Share, and Act), we explore data patterns, visualize key trends, and build a predictive machine learning model using R. This project not only demonstrates technical skills in data wrangling and visualization but also emphasizes evidence-based decision making in a real-world scenario.

🧭 Step 1: Ask

In this project, we aim to explore and predict survival outcomes from the Titanic disaster using data-driven decisions.

Key Questions:

  • Who was more likely to survive?

  • How do age, gender, and passenger class affect survival?

  • Can we predict survival on unseen data using a machine learning model?

🧹 Step 2: Prepare

We will use the Titanic dataset from the Kaggle competition:

🔗 Titanic: Machine Learning from Disaster

train <- read_csv("train.csv")
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test <- read_csv("test.csv")
## Rows: 418 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (6): PassengerId, Pclass, Age, SibSp, Parch, Fare
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

🧼 Step 3: Process (Clean the Data)

We'll check and clean the training dataset:

# Check structure
glimpse(train)
## Rows: 891
## Columns: 12
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
# Check for missing values
colSums(is.na(train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2
# Fill missing Age with median
train$Age[is.na(train$Age)] <- median(train$Age, na.rm = TRUE)

# Fill missing Embarked with most common port
train$Embarked[is.na(train$Embarked)] <- "S"

# Convert to factors
train <- train %>%
  mutate(
    Survived = as.factor(Survived),
    Pclass = as.factor(Pclass),
    Sex = as.factor(Sex),
    Embarked = as.factor(Embarked),
    FamilySize = SibSp + Parch + 1
  )
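
As an optional sanity check (not part of the original workflow), we can confirm that the imputed columns no longer contain missing values:

# Optional sanity check: the imputed columns should have no remaining NAs
sum(is.na(train$Age))       # expected: 0 after median imputation
sum(is.na(train$Embarked))  # expected: 0 after filling with "S"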

We'll also clean the test data in a similar way:

# Fill Age and Fare missing values
test$Age[is.na(test$Age)] <- median(train$Age, na.rm = TRUE)
test$Fare[is.na(test$Fare)] <- median(train$Fare, na.rm = TRUE)

# Fill Embarked if needed
test$Embarked[is.na(test$Embarked)] <- "S"

# Convert types
test <- test %>%
  mutate(
    Pclass = as.factor(Pclass),
    Sex = as.factor(Sex),
    Embarked = as.factor(Embarked),
    FamilySize = SibSp + Parch + 1
  )
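
One optional check worth adding here (not in the original workflow): the model we fit later expects the factor predictors in the test set to use the same levels as in training, so we can verify that the levels line up:

# Optional check: factor levels should match between train and test,
# since predict() on a fitted model expects consistent levels
identical(levels(train$Pclass), levels(test$Pclass))
identical(levels(train$Sex), levels(test$Sex))
identical(levels(train$Embarked), levels(test$Embarked))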

🔎 Step 4: Analyze

Let's explore survival distributions:

ggplot(train, aes(x = Survived)) +
  geom_bar(fill = "steelblue") +
  scale_x_discrete(labels = c("0" = "Did Not Survive", "1" = "Survived")) +
  labs(title = "Overall Survival", x = "Survival Status", y = "Count")

Survival by Gender

ggplot(train, aes(x = Sex, fill = Survived)) +
  geom_bar(position = "fill") +
  scale_fill_discrete(labels = c("0" = "Did Not Survive", "1" = "Survived")) +
  labs(title = "Survival Rate by Gender", y = "Proportion", fill = "Survival Status")

Survival by Passenger Class

ggplot(train, aes(x = Pclass, fill = Survived)) +
  geom_bar(position = "fill") +
  scale_fill_discrete(labels = c("0" = "Did Not Survive", "1" = "Survived")) +
  labs(title = "Survival Rate by Passenger Class", y = "Proportion", x = "Passenger Class", fill = "Survival Status")

Age Distribution by Survival

ggplot(train, aes(x = Age, fill = Survived)) +
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.6) +
  scale_fill_discrete(labels = c("0" = "Did Not Survive", "1" = "Survived")) +
  labs(title = "Age Distribution by Survival", x = "Age", y = "Count", fill = "Survival Status")
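
The plots above can be backed up with a quick numeric summary (an optional addition to the original analysis) of survival rate by sex and passenger class:

# Optional numeric summary: survival rate by sex and passenger class
train %>%
  group_by(Sex, Pclass) %>%
  summarise(
    n = n(),
    survival_rate = mean(Survived == "1"),
    .groups = "drop"
  )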

📣 Step 5: Share (Insights and Interpretation)

Key Findings:

  • Females had a significantly higher survival rate.

  • 1st class passengers had better chances of survival than those in lower classes.

  • Children were more likely to survive than adults.

These findings align with historical reports that women and children were prioritized during rescue.
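
As a quick check of the finding about children (an optional calculation, using age 16 as an illustrative cutoff), we can compare survival rates for children and adults. Keep in mind that missing ages were filled with the median, so some imputed passengers land in the adult group.

# Optional: compare survival rates for children (under 16) and adults
train %>%
  mutate(AgeGroup = if_else(Age < 16, "Child", "Adult")) %>%
  group_by(AgeGroup) %>%
  summarise(n = n(), survival_rate = mean(Survived == "1"), .groups = "drop")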

🚀 Step 6: Act (Model and Predict)

Let's train a Random Forest model using key predictors:

install.packages("randomForest")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Select relevant features
model_data <- train %>%
  select(Survived, Pclass, Sex, Age, Fare, Embarked, FamilySize)

# Train the model
set.seed(123)
rf_model <- randomForest(Survived ~ ., data = model_data, ntree = 100, importance = TRUE)
print(rf_model)
## 
## Call:
##  randomForest(formula = Survived ~ ., data = model_data, ntree = 100,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 16.84%
## Confusion matrix:
##     0   1 class.error
## 0 507  42  0.07650273
## 1 108 234  0.31578947
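
Because the model was fit with importance = TRUE, we can also inspect which predictors the forest relies on most (an optional step, not part of the original write-up):

# Optional: variable importance (available because importance = TRUE above)
importance(rf_model)
varImpPlot(rf_model, main = "Variable Importance")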

Predict on Test Set

# Prepare test set with same features
test_model <- test %>%
  select(Pclass, Sex, Age, Fare, Embarked, FamilySize)

# Predict
predictions <- predict(rf_model, newdata = test_model)

# Prepare submission (Kaggle expects exactly two columns: PassengerId and Survived)
submission <- data.frame(PassengerId = test$PassengerId, Survived = predictions)

# Write to CSV
write_csv(submission, "titanic_submission.csv")

# Visualize prediction outcome distribution
ggplot(submission, aes(x = factor(Survived))) +
  geom_bar(fill = "darkorange") +
  scale_x_discrete(labels = c("0" = "Did Not Survive", "1" = "Survived")) +
  labs(title = "Predicted Survival Distribution", x = "Survival Status", y = "Count")
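
As a final format check (optional), we can confirm that the submission matches what Kaggle expects: 418 rows and the columns PassengerId and Survived, as in gender_submission.csv.

# Optional format check: 418 rows, columns PassengerId and Survived
dim(submission)
names(submission)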

πŸ“ Conclusion

In this project, I:

  • Applied the 6-step data analysis process to understand the Titanic dataset.

  • Explored the influence of gender, class, and age on survival.

  • Built and evaluated a Random Forest model.

  • Prepared a submission file for the Kaggle Titanic competition.

📎 Appendix

  • Tools: R, RStudio, ggplot2, dplyr, randomForest

  • Data Source: Kaggle Titanic Competition

  • Author: Elyshea Devore

Citation: Will Cukierski. Titanic - Machine Learning from Disaster. Kaggle, 2012. https://kaggle.com/competitions/titanic