Installing and loading helpful packages
# dplyr, tidyr, stringr, readr, and lubridate all ship with the tidyverse,
# so a single install covers them
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# library(tidyverse) has already attached dplyr, tidyr, stringr, readr, and lubridate,
# and base is always loaded, so no further library() calls are needed here
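Calling install.packages() on every run downloads each package again, so it can help to check first what is actually missing. The following is a minimal sketch rather than part of the original notebook; the `required` vector is an assumption based on the packages this analysis uses, and the install call is left commented out on purpose.

```r
# Packages this analysis relies on (assumed list; adjust as needed)
required <- c("tidyverse", "randomForest")

# requireNamespace() returns TRUE when a package is installed and loadable
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install the missing ones
}
```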
Dataset Source
Kaggle Competition: Titanic - Machine Learning from Disaster
Download these CSV files from Kaggle:
train.csv: Training dataset (used for model building)
test.csv: Test dataset (used for prediction submission)
gender_submission.csv: Sample output format
The sinking of the RMS Titanic in 1912 remains one of the most infamous maritime disasters in history. Of the roughly 2,200 passengers and crew on board, more than 1,500 lost their lives after the ship struck an iceberg and sank in the North Atlantic. The passenger records from the disaster have since become a widely used dataset for exploring survival patterns and practicing predictive modeling in data science.
In this project, we use the Titanic dataset from Kaggle to uncover which factors most influenced a passenger's chance of survival. Following the six-step data analysis process (Ask, Prepare, Process, Analyze, Share, and Act), we explore patterns in the data, visualize key trends, and build a predictive machine learning model in R. The project demonstrates technical skills in data wrangling and visualization and emphasizes evidence-based decision making in a real-world scenario.
Our aim is to explore and predict survival outcomes from the Titanic disaster using a data-driven approach.
Key Questions:
Who was more likely to survive?
How do age, gender, and passenger class affect survival?
Can we predict survival on unseen data using a machine learning model?
We will use the Titanic dataset from the Kaggle competition:
Titanic: Machine Learning from Disaster
train <- read_csv("train.csv")
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test <- read_csv("test.csv")
## Rows: 418 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (6): PassengerId, Pclass, Age, SibSp, Parch, Fare
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check structure
glimpse(train)
## Rows: 891
## Columns: 12
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
# Check for missing values
colSums(is.na(train))
## PassengerId    Survived      Pclass        Name         Sex         Age
##           0           0           0           0           0         177
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked
##           0           0           0           0         687           2
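Raw counts are easier to judge as a share of all rows. The sketch below redoes that arithmetic using the counts printed above and the 891 rows reported when the file was read; the variable names are illustrative only.

```r
# Missing-value counts copied from the colSums() output above
n_rows <- 891
missing_counts <- c(Age = 177, Cabin = 687, Embarked = 2)

# Share of rows missing each variable, as a percentage
missing_pct <- round(100 * missing_counts / n_rows, 1)
print(missing_pct)

# On the real data frame the same shares come from: colMeans(is.na(train))
```

Age is missing for about a fifth of passengers and Embarked for only two rows, so simple imputation is a reasonable first step for both. Cabin is missing for roughly three quarters of rows, which may be why it is not used as a predictor later on.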
# Fill missing Age with median
train$Age[is.na(train$Age)] <- median(train$Age, na.rm = TRUE)
# Fill missing Embarked with most common port
train$Embarked[is.na(train$Embarked)] <- "S"
# Convert to factors
train <- train %>%
  mutate(
    Survived = as.factor(Survived),
    Pclass = as.factor(Pclass),
    Sex = as.factor(Sex),
    Embarked = as.factor(Embarked),
    FamilySize = SibSp + Parch + 1
  )
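The imputation above hard-codes "S" as the most common port. As a hedged alternative, the fill values can be computed from the data so that the code still holds if the input changes. The sketch below uses a small made-up data frame (`toy`) rather than the real training set.

```r
# Toy data standing in for the Age and Embarked columns (values are made up)
toy <- data.frame(
  Age = c(22, NA, 35, 54, NA),
  Embarked = c("S", "C", NA, "S", "Q"),
  stringsAsFactors = FALSE
)

# Median of the observed ages
age_median <- median(toy$Age, na.rm = TRUE)

# Most frequent port; table() drops NA values by default
port_mode <- names(which.max(table(toy$Embarked)))

# Fill the gaps with the computed values
toy$Age[is.na(toy$Age)] <- age_median
toy$Embarked[is.na(toy$Embarked)] <- port_mode
print(toy)
```

If two ports are tied, which.max() returns the first one in alphabetical order, so a tie would need an explicit rule.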
We'll also clean the test data in a similar way:
# Fill Age and Fare missing values
test$Age[is.na(test$Age)] <- median(train$Age, na.rm = TRUE)
test$Fare[is.na(test$Fare)] <- median(train$Fare, na.rm = TRUE)
# Fill Embarked if needed
test$Embarked[is.na(test$Embarked)] <- "S"
# Convert types
test <- test %>%
  mutate(
    Pclass = as.factor(Pclass),
    Sex = as.factor(Sex),
    Embarked = as.factor(Embarked),
    FamilySize = SibSp + Parch + 1
  )
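Because the training and test sets are cleaned in separate chunks, the two can drift apart, for example if a factor ends up with different levels in each. One way to guard against that is a single helper applied to both data frames. The function below is a hypothetical sketch, not the notebook's own code: `clean_passengers()` and its arguments are invented for illustration, and the fill values passed in are placeholders that would normally be computed from the training data.

```r
# Hypothetical helper: applies identical cleaning to any Titanic-style data frame.
# Fill values and factor levels are passed in so they always come from training data.
clean_passengers <- function(df, age_fill, fare_fill, port_fill, port_levels) {
  df$Age[is.na(df$Age)] <- age_fill
  df$Fare[is.na(df$Fare)] <- fare_fill
  df$Embarked[is.na(df$Embarked)] <- port_fill
  df$Pclass <- factor(df$Pclass, levels = c(1, 2, 3))
  df$Sex <- factor(df$Sex, levels = c("female", "male"))
  df$Embarked <- factor(df$Embarked, levels = port_levels)
  df$FamilySize <- df$SibSp + df$Parch + 1
  df
}

# Toy rows (made-up values) to show the helper in use
toy_test <- data.frame(
  Pclass = c(3, 1), Sex = c("male", "female"), Age = c(NA, 40),
  SibSp = c(0, 1), Parch = c(0, 2), Fare = c(7.25, NA),
  Embarked = c("Q", NA), stringsAsFactors = FALSE
)
cleaned <- clean_passengers(toy_test, age_fill = 28, fare_fill = 14.45,
                            port_fill = "S", port_levels = c("C", "Q", "S"))
str(cleaned)
```

Fixing the factor levels explicitly matters because prediction with a fitted model can fail when a factor in the new data has levels that differ from those seen during training.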
Let's explore survival distributions:
ggplot(train, aes(x = Survived)) +
  geom_bar(fill = "steelblue") +
  scale_x_discrete(labels = c("0" = "Did Not Survive", "1" = "Survived")) +
  labs(title = "Overall Survival", x = "Survival Status", y = "Count")
Survival by Gender
ggplot(train, aes(x = Sex, fill = Survived)) +
  geom_bar(position = "fill") +
  scale_fill_discrete(labels = c("0" = "Did Not Survive", "1" = "Survived")) +
  labs(title = "Survival Rate by Gender", y = "Proportion", fill = "Survival Status")
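A proportional bar chart shows the pattern, but the underlying rates are often useful as numbers too. The sketch below computes row-wise survival proportions with base R on a small made-up data frame. The values are illustrative and are not the real Titanic rates; on the actual data the same call would take `train$Sex` and `train$Survived`.

```r
# Toy data (made-up values) with the same column names as the training set
toy <- data.frame(
  Sex = c("female", "female", "female", "male", "male", "male", "male", "male"),
  Survived = c(1, 1, 0, 0, 0, 1, 0, 0)
)

# Row-wise proportions: each row (one Sex) sums to 1
rates <- prop.table(table(toy$Sex, toy$Survived), margin = 1)
print(round(rates, 2))
```

Setting `margin = 1` normalizes within each row, which matches what `position = "fill"` does within each bar of the chart.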
Survival by Passenger Class
ggplot(train, aes(x = Pclass, fill = Survived)) +
  geom_bar(position = "fill") +
  scale_fill_discrete(labels = c("0" = "Did Not Survive", "1" = "Survived")) +
  labs(title = "Survival Rate by Passenger Class", y = "Proportion",
       x = "Passenger Class", fill = "Survival Status")
Age Distribution by Survival
ggplot(train, aes(x = Age, fill = Survived)) +
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.6) +
  scale_fill_discrete(labels = c("0" = "Did Not Survive", "1" = "Survived")) +
  labs(title = "Age Distribution by Survival", x = "Age", y = "Count", fill = "Survival Status")
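The histogram groups ages into 5-year bins. Since missing ages were filled with the median earlier, the bin containing the median will look taller than the observed data alone would make it. For comparing groups, named age bands can be easier to read than fixed-width bins. The sketch below is one possible banding using cut(); the break points, labels, and sample ages are assumptions chosen for illustration.

```r
# Made-up sample ages
ages <- c(2, 14, 22, 28, 35, 54, 58, 71)

# right = FALSE makes each band include its lower bound: [0, 12), [12, 18), ...
age_band <- cut(
  ages,
  breaks = c(0, 12, 18, 40, 60, Inf),
  labels = c("Child", "Teen", "Adult", "Middle-aged", "Senior"),
  right = FALSE
)
table(age_band)
```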
Let's train a Random Forest model using key predictors:
install.packages("randomForest")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# Select relevant features
model_data <- train %>%
  select(Survived, Pclass, Sex, Age, Fare, Embarked, FamilySize)
# Train the model
set.seed(123)
rf_model <- randomForest(Survived ~ ., data = model_data, ntree = 100, importance = TRUE)
print(rf_model)
##
## Call:
##  randomForest(formula = Survived ~ ., data = model_data, ntree = 100, importance = TRUE)
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 2
##
##         OOB estimate of  error rate: 16.84%
## Confusion matrix:
##     0   1 class.error
## 0 507  42  0.07650273
## 1 108 234  0.31578947
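The summary figures in this output all follow from the four cells of the confusion matrix. The sketch below copies those counts and rederives the out-of-bag error rate and per-class accuracy; the variable names are illustrative.

```r
# Out-of-bag confusion matrix copied from the print(rf_model) output above
# Rows are actual classes, columns are predicted classes
cm <- matrix(c(507, 108, 42, 234), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

total <- sum(cm)                              # 891 passengers
oob_error <- (cm["0", "1"] + cm["1", "0"]) / total
accuracy <- 1 - oob_error
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # survivors correctly identified
specificity <- cm["0", "0"] / sum(cm["0", ])  # non-survivors correctly identified

round(c(oob_error = oob_error, accuracy = accuracy,
        sensitivity = sensitivity, specificity = specificity), 4)
```

The 150 misclassified passengers out of 891 give the reported 16.84% error. The class errors show the model is considerably better at identifying passengers who did not survive (about 8% wrong) than those who did (about 32% wrong).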
Predict on Test Set
# Prepare test set with same features
test_model <- test %>%
  select(Pclass, Sex, Age, Fare, Embarked, FamilySize)
# Predict
predictions <- predict(rf_model, newdata = test_model)
# Prepare submission: only PassengerId and Survived, matching gender_submission.csv
submission <- data.frame(PassengerId = test$PassengerId, Survived = predictions)
# Write to CSV
write_csv(submission, "titanic_submission.csv")
# Visualize prediction outcome distribution
ggplot(submission, aes(x = factor(Survived))) +
  geom_bar(fill = "darkorange") +
  scale_x_discrete(labels = c("0" = "Did Not Survive", "1" = "Survived")) +
  labs(title = "Predicted Survival Distribution", x = "Survival Status", y = "Count")
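Before uploading, it is worth checking that the submission matches the layout of the sample file gender_submission.csv, which has a PassengerId column and a Survived column. The helper below is a hypothetical sketch of such a check; `is_valid_submission()` is not part of any package, and the IDs and predictions are made up. On the real objects the call would be `is_valid_submission(submission, test$PassengerId)`.

```r
# Hypothetical helper: returns TRUE only if df matches the expected layout
is_valid_submission <- function(df, expected_ids) {
  identical(names(df), c("PassengerId", "Survived")) &&
    nrow(df) == length(expected_ids) &&
    all(df$PassengerId == expected_ids) &&
    all(as.character(df$Survived) %in% c("0", "1"))
}

# Toy example with made-up IDs and predictions
ids <- c(892, 893, 894)
good <- data.frame(PassengerId = ids, Survived = factor(c(0, 1, 0)))
bad <- cbind(good, Sex = c("male", "female", "male"))

is_valid_submission(good, ids)  # TRUE
is_valid_submission(bad, ids)   # FALSE: extra column
```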
In this project, I:
Applied the 6-step data analysis process to understand the Titanic dataset.
Explored the influence of gender, class, and age on survival.
Built and evaluated a Random Forest model.
Prepared a submission file for the Kaggle Titanic competition.
Tools: R, RStudio, ggplot2, dplyr, randomForest
Data Source: Kaggle Titanic Competition
Author: Elyshea Devore
Citation: Will Cukierski. Titanic - Machine Learning from Disaster. https://kaggle.com/competitions/titanic, 2012. Kaggle.