Introduction

This project explores the Titanic dataset to identify patterns and trends related to passenger survival. The goal is to demonstrate initial exploratory analysis and outline plans for predictive modeling and a Shiny app.

Data Loading

Load Titanic dataset from Downloads folder

data <- read.csv("~/Downloads/titanic.csv", stringsAsFactors = FALSE)

Quick look at the data

head(data)
str(data)

Check for missing values

sum(is.na(data))
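
Because the file is read with stringsAsFactors = FALSE, blank text fields (for example in Cabin or Embarked) may come through as empty strings rather than NA, so the single total above can understate the gaps. A minimal sketch of a per-column count follows. The helper name count_missing and the toy data frame are assumptions made for illustration and are not part of the original analysis.

```r
# Count missing entries per column, treating both NA and "" as missing.
# `count_missing` is a hypothetical helper name used only in this sketch.
count_missing <- function(df) {
  vapply(
    df,
    function(col) sum(is.na(col) | (is.character(col) & col == "")),
    integer(1)
  )
}

# Toy data frame standing in for `data` (values are made up for illustration).
toy <- data.frame(
  Age = c(22, NA, 35),
  Cabin = c("", "C85", ""),
  Embarked = c("S", "", "Q"),
  stringsAsFactors = FALSE
)

missing_counts <- count_missing(toy)
print(missing_counts)
```

On the real dataset the same helper would be called as count_missing(data).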

Summary Statistics

Basic summary

summary(data)

Detailed summary using summarytools

library(summarytools)
dfSummary(data)

Exploratory Data Analysis

Numeric Variables

Age distribution

library(ggplot2)

ggplot(data, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black") +
  theme_minimal() +
  labs(title = "Age Distribution", x = "Age", y = "Frequency")

Fare distribution

ggplot(data, aes(x = Fare)) +
  geom_histogram(binwidth = 10, fill = "green", color = "black") +
  theme_minimal() +
  labs(title = "Fare Distribution", x = "Fare", y = "Frequency")

Categorical Variables

Survival counts

ggplot(data, aes(x = factor(Survived))) +
  geom_bar(fill = "orange") +
  theme_minimal() +
  labs(title = "Survival Count", x = "Survived (0 = No, 1 = Yes)", y = "Count")

Passenger class counts

ggplot(data, aes(x = factor(Pclass))) +
  geom_bar(fill = "purple") +
  theme_minimal() +
  labs(title = "Passenger Class Count", x = "Pclass", y = "Count")

Sex counts

ggplot(data, aes(x = Sex)) +
  geom_bar(fill = "pink") +
  theme_minimal() +
  labs(title = "Sex Count", x = "Sex", y = "Count")

Relationships Between Variables

Age vs Survival

ggplot(data, aes(x = Age, fill = factor(Survived))) +
  geom_histogram(binwidth = 5, position = "dodge") +
  theme_minimal() +
  labs(title = "Age vs Survival", x = "Age", y = "Count", fill = "Survived")

Fare vs Survival

ggplot(data, aes(x = Fare, fill = factor(Survived))) +
  geom_histogram(binwidth = 10, position = "dodge") +
  theme_minimal() +
  labs(title = "Fare vs Survival", x = "Fare", y = "Count", fill = "Survived")

Key Findings

Survival rate was higher among passengers in Pclass 1 than in Pclass 2 or 3.

Female passengers had a higher survival rate than males.

Children (younger Age) had a slightly higher chance of survival.

Higher Fare was associated with higher survival, suggesting that wealthier passengers had better survival chances.
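
The class and sex findings above are not backed by a plot in this report, so one way to check them is to compute the survival rate per group directly. The sketch below assumes Survived is coded 0/1, as in the plots above. The helper name survival_rate_by is hypothetical, and the toy rows are invented purely to make the example runnable.

```r
# Survival rate (mean of the 0/1 Survived column) within each level of a
# grouping column, together with the group size.
# `survival_rate_by` is a hypothetical helper name used only in this sketch.
survival_rate_by <- function(df, group_col) {
  rates <- tapply(df$Survived, df[[group_col]], mean)
  data.frame(
    group = names(rates),
    survival_rate = as.numeric(rates),
    n = as.integer(table(df[[group_col]])[names(rates)]),
    stringsAsFactors = FALSE
  )
}

# Toy rows standing in for `data` (made up, not taken from the Titanic file).
toy <- data.frame(
  Survived = c(1, 1, 0, 0, 1, 0),
  Pclass = c(1, 1, 1, 3, 3, 3),
  Sex = c("female", "female", "male", "male", "female", "male"),
  stringsAsFactors = FALSE
)

by_class <- survival_rate_by(toy, "Pclass")
by_sex <- survival_rate_by(toy, "Sex")
print(by_class)
print(by_sex)
```

On the real dataset, survival_rate_by(data, "Pclass") and survival_rate_by(data, "Sex") would give the rates behind the first two findings.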

Prediction Algorithm Plan

Clean the dataset (handle missing values in Age, Cabin, Embarked).

Encode categorical variables (Sex, Embarked, Pclass).

Split dataset into training and testing sets.

Build models: Logistic Regression, Random Forest, or Gradient Boosting to predict Survived.

Evaluate models using accuracy, precision, recall, and ROC-AUC.

Select the best-performing model for deployment.
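
The steps above can be sketched end to end in base R. This is only an illustration under stated assumptions: it covers logistic regression alone (Random Forest and Gradient Boosting would need extra packages), it uses median and most-frequent-value imputation as one simple choice, it leaves Cabin out of the model, and it runs on simulated rows rather than the real file. All helper names (prepare_titanic, split_data, evaluate) are hypothetical.

```r
# Clean and encode: fill Age with the median, Embarked with the most frequent
# value, and turn the categorical columns into factors.
prepare_titanic <- function(df) {
  df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)
  embarked_mode <- names(which.max(table(df$Embarked[df$Embarked != ""])))
  df$Embarked[is.na(df$Embarked) | df$Embarked == ""] <- embarked_mode
  df$Sex <- factor(df$Sex)
  df$Embarked <- factor(df$Embarked)
  df$Pclass <- factor(df$Pclass)
  df
}

# Random train/test split.
split_data <- function(df, train_frac = 0.8, seed = 42) {
  set.seed(seed)
  train_idx <- sample(nrow(df), size = floor(train_frac * nrow(df)))
  list(train = df[train_idx, ], test = df[-train_idx, ])
}

# Accuracy, precision, recall, and ROC-AUC on the held-out set.
evaluate <- function(model, test, threshold = 0.5) {
  prob <- predict(model, newdata = test, type = "response")
  pred <- as.integer(prob >= threshold)
  actual <- test$Survived
  tp <- sum(pred == 1 & actual == 1)
  fp <- sum(pred == 1 & actual == 0)
  fn <- sum(pred == 0 & actual == 1)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  # AUC from the rank-sum (Mann-Whitney) identity, no extra package needed.
  auc <- (sum(rank(prob)[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  c(accuracy = mean(pred == actual),
    precision = tp / (tp + fp),
    recall = tp / (tp + fn),
    auc = auc)
}

# Simulated stand-in for `data`. The probabilities below are arbitrary and are
# NOT estimates from the real Titanic dataset.
set.seed(1)
n <- 200
sim <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex = sample(c("female", "male"), n, replace = TRUE),
  Age = ifelse(runif(n) < 0.2, NA, round(runif(n, 1, 70))),
  Fare = round(rexp(n, rate = 1 / 30), 2),
  Embarked = sample(c("S", "C", "Q", ""), n, replace = TRUE,
                    prob = c(0.70, 0.18, 0.10, 0.02)),
  stringsAsFactors = FALSE
)
sim$Survived <- rbinom(n, 1, ifelse(sim$Sex == "female", 0.7, 0.25))

clean <- prepare_titanic(sim)
parts <- split_data(clean)
model <- glm(Survived ~ Pclass + Sex + Age + Fare + Embarked,
             data = parts$train, family = binomial())
metrics <- evaluate(model, parts$test)
print(round(metrics, 3))
```

Encoding the factors before splitting keeps the factor levels identical in the training and test sets, which predict() relies on.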

Shiny App Plan

Inputs: Age, Sex, Pclass, Fare, SibSp, Parch, Embarked.

Outputs: Predicted probability of survival (a value between 0 and 1, where Survived = 1 means the passenger survived) for a given passenger.

Visualizations: Interactive table of predictions and bar charts showing survival probability by category.
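
A rough sketch of how such an app might be wired together is shown below. It is not the planned app itself: it uses only four of the listed inputs (Age, Sex, Pclass, Fare), shows a single text output rather than the table and charts, and relies on a stand-in logistic model fitted to simulated rows so that the example runs on its own. The helper predict_survival and the simulated data are assumptions made for illustration.

```r
library(shiny)

# Stand-in model fitted on simulated rows (arbitrary probabilities, not real
# Titanic estimates). The planned app would use the selected model instead,
# for example one saved earlier and loaded with readRDS().
set.seed(1)
n <- 200
sim <- data.frame(
  Age = round(runif(n, 1, 70)),
  Sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  Pclass = factor(sample(1:3, n, replace = TRUE)),
  Fare = round(rexp(n, rate = 1 / 30), 2)
)
sim$Survived <- rbinom(n, 1, ifelse(sim$Sex == "female", 0.7, 0.25))
model <- glm(Survived ~ Age + Sex + Pclass + Fare, data = sim, family = binomial())

# Turn one set of inputs into a survival probability.
# `predict_survival` is a hypothetical helper name.
predict_survival <- function(model, age, sex, pclass, fare) {
  passenger <- data.frame(
    Age = age,
    Sex = factor(sex, levels = c("female", "male")),
    Pclass = factor(pclass, levels = 1:3),
    Fare = fare
  )
  unname(predict(model, newdata = passenger, type = "response"))
}

ui <- fluidPage(
  titlePanel("Titanic Survival Prediction"),
  sidebarLayout(
    sidebarPanel(
      numericInput("age", "Age", value = 30, min = 0, max = 100),
      selectInput("sex", "Sex", choices = c("female", "male")),
      selectInput("pclass", "Pclass", choices = 1:3),
      numericInput("fare", "Fare", value = 30, min = 0)
    ),
    mainPanel(textOutput("prob"))
  )
)

server <- function(input, output) {
  output$prob <- renderText({
    p <- predict_survival(model, input$age, input$sex, input$pclass, input$fare)
    sprintf("Predicted survival probability: %.2f", p)
  })
}

# Build the app object; it only starts when printed or passed to runApp().
app <- shinyApp(ui, server)
```

In an interactive session, runApp(app) starts the app. Keeping the prediction logic in a plain function makes it possible to check it without launching the interface.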

Conclusion

This report demonstrates initial exploratory analysis of the Titanic dataset and outlines plans for building a prediction algorithm and a Shiny app to predict passenger survival. It provides a foundation for further modeling and deployment.

Exploratory Analysis and Prediction Plan - Titanic Dataset

Mahavir Rajpurohit

2026-01-13