This project explores the Titanic dataset to identify patterns and trends related to passenger survival. The goal is to demonstrate initial exploratory analysis and outline plans for predictive modeling and a Shiny app.
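library(ggplot2)       # ggplot() and geoms used in the plots below
library(summarytools)  # provides dfSummary()
data <- read.csv("~/Downloads/titanic.csv", stringsAsFactors = FALSE)
head(data)             # first rows
str(data)              # column types
sum(is.na(data))       # total number of NA cells
summary(data)
dfSummary(data)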
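The single total from sum(is.na(data)) does not show which columns are affected. With read.csv defaults, empty text fields in character columns are read as empty strings rather than NA, so they are not counted either. A minimal sketch of a per-column count follows; the helper name missing_by_column is hypothetical, and whether columns such as Cabin actually contain blanks depends on the file.

```r
# Hypothetical helper: counts NA values and blank strings in each column.
missing_by_column <- function(df) {
  vapply(df, function(col) {
    # Blank strings only make sense for character columns.
    blank <- if (is.character(col)) !is.na(col) & trimws(col) == "" else FALSE
    sum(is.na(col) | blank)
  }, integer(1))
}

# Example call on the data frame loaded above:
# missing_by_column(data)
```

The result is a named integer vector with one entry per column, which makes it easier to decide where imputation is needed.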
ggplot(data, aes(x = Age)) + geom_histogram(binwidth = 5, fill="blue", color="black") + theme_minimal() + labs(title="Age Distribution", x="Age", y="Frequency")
ggplot(data, aes(x = Fare)) + geom_histogram(binwidth = 10, fill="green", color="black") + theme_minimal() + labs(title="Fare Distribution", x="Fare", y="Frequency")
ggplot(data, aes(x = factor(Survived))) + geom_bar(fill="orange") + theme_minimal() + labs(title="Survival Count", x="Survived (0 = No, 1 = Yes)", y="Count")
ggplot(data, aes(x = factor(Pclass))) + geom_bar(fill="purple") + theme_minimal() + labs(title="Passenger Class Count", x="Pclass", y="Count")
ggplot(data, aes(x = Sex)) + geom_bar(fill="pink") + theme_minimal() + labs(title="Sex Count", x="Sex", y="Count")
ggplot(data, aes(x = Age, fill = factor(Survived))) + geom_histogram(binwidth = 5, position="dodge") + theme_minimal() + labs(title="Age vs Survival", x="Age", y="Count", fill="Survived")
ggplot(data, aes(x = Fare, fill = factor(Survived))) + geom_histogram(binwidth = 10, position="dodge") + theme_minimal() + labs(title="Fare vs Survival", x="Fare", y="Count", fill="Survived")
Survival rate was higher among passengers in Pclass 1.
Female passengers had a higher survival rate than males.
Children (lower Age values) had a slightly higher chance of survival.
Higher Fare was associated with higher survival, suggesting that wealthier passengers had better survival chances.
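The observations above are read from the plots. One way to check them numerically is to compute the share of survivors within each group. The sketch below is illustrative only: the helper name survival_rate and the age cut-offs in the example calls are assumptions, not part of the original analysis.

```r
# Hypothetical helper: mean of Survived (0/1) within each group,
# which equals the survival rate for that group.
survival_rate <- function(survived, group) {
  round(tapply(survived, group, mean), 3)
}

# Example calls on the data frame loaded above (age bands are an assumption;
# rows with a missing Age are dropped by tapply):
# survival_rate(data$Survived, data$Pclass)
# survival_rate(data$Survived, data$Sex)
# survival_rate(data$Survived, cut(data$Age, breaks = c(0, 12, 18, 60, Inf), right = FALSE))
```

Each call returns a named vector of rates, so the groups can be compared directly instead of judging bar heights by eye.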
Clean the dataset (handle missing values in Age, Cabin, Embarked).
Encode categorical variables (Sex, Embarked, Pclass).
Split dataset into training and testing sets.
Build models: Logistic Regression, Random Forest, or Gradient Boosting to predict Survived.
Evaluate models using accuracy, precision, recall, and ROC-AUC.
Select the best-performing model for deployment.
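The steps above could be sketched in base R roughly as follows, using logistic regression only. The helper names, the 80/20 split, the seed, median and most-frequent-value imputation, and the 0.5 threshold are all illustrative assumptions rather than decisions made in this report. Cabin is simply dropped here, and Random Forest or Gradient Boosting would need additional packages that are not shown.

```r
# Hypothetical helper: keeps the modelling columns, fills gaps, encodes factors.
prepare_titanic <- function(data) {
  cols <- c("Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")
  df <- data[, cols]
  # Median imputation for Age is one simple option among several.
  df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)
  # Missing or blank Embarked values get the most frequent port.
  gap <- is.na(df$Embarked) | df$Embarked == ""
  df$Embarked[gap] <- names(which.max(table(df$Embarked[!gap])))
  df$Sex <- factor(df$Sex)
  df$Embarked <- factor(df$Embarked)
  df$Pclass <- factor(df$Pclass)
  df
}

# Hypothetical helper: random train/test split (fraction and seed are assumptions).
split_train_test <- function(df, train_frac = 0.8, seed = 42) {
  set.seed(seed)
  train_idx <- sample(nrow(df), size = floor(train_frac * nrow(df)))
  list(train = df[train_idx, ], test = df[-train_idx, ])
}

# Logistic regression on all remaining columns.
fit_logistic <- function(train) {
  glm(Survived ~ ., data = train, family = binomial)
}

# Hypothetical helper: accuracy, precision, recall and ROC-AUC on a held-out set.
evaluate_model <- function(model, test, threshold = 0.5) {
  prob <- predict(model, newdata = test, type = "response")
  pred <- as.integer(prob >= threshold)
  actual <- test$Survived
  tp <- sum(pred == 1 & actual == 1)
  fp <- sum(pred == 1 & actual == 0)
  fn <- sum(pred == 0 & actual == 1)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  # AUC from the rank-sum (Mann-Whitney) identity; ties get average ranks.
  auc <- (sum(rank(prob)[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  c(accuracy = mean(pred == actual),
    precision = tp / (tp + fp),  # NaN if no passenger is predicted to survive
    recall = tp / (tp + fn),
    auc = auc)
}

# Example workflow on the data frame loaded above:
# parts <- split_train_test(prepare_titanic(data))
# model <- fit_logistic(parts$train)
# evaluate_model(model, parts$test)
```

Keeping preparation, splitting, fitting, and evaluation in separate functions means another model can be swapped in later while the same evaluation code is reused for comparison.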
Inputs: Age, Sex, Pclass, Fare, SibSp, Parch, Embarked.
Outputs: Predicted survival probability (between 0 and 1) for a given passenger, together with the corresponding class label (0 = No, 1 = Yes).
Visualizations: Interactive table of predictions and bar charts showing survival probability by category.
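A minimal sketch of how such an app might be wired together with the shiny package is shown below. It assumes an already-fitted binomial glm whose predictors are Age, Sex, Pclass, Fare, SibSp, Parch and Embarked, with Sex, Pclass and Embarked stored as factors. The function names predict_passenger and make_survival_app, the default input values, and the 0.5 cut-off are hypothetical, and the planned bar charts are left out.

```r
library(shiny)

# Hypothetical helper: one-row prediction from a fitted binomial glm.
# Factor levels are taken from the model so new values match the training data.
predict_passenger <- function(model, age, sex, pclass, fare, sibsp, parch, embarked) {
  passenger <- data.frame(
    Age = age,
    Sex = factor(sex, levels = model$xlevels$Sex),
    Pclass = factor(pclass, levels = model$xlevels$Pclass),
    Fare = fare,
    SibSp = sibsp,
    Parch = parch,
    Embarked = factor(embarked, levels = model$xlevels$Embarked)
  )
  prob <- unname(predict(model, newdata = passenger, type = "response"))
  # 0.5 is an assumed cut-off for turning the probability into a class label.
  data.frame(Probability = round(prob, 3), Predicted = as.integer(prob >= 0.5))
}

# Hypothetical helper: builds the app object around the fitted model.
make_survival_app <- function(model) {
  ui <- fluidPage(
    titlePanel("Titanic Survival Prediction"),
    sidebarLayout(
      sidebarPanel(
        numericInput("age", "Age", value = 30, min = 0, max = 100),
        selectInput("sex", "Sex", choices = model$xlevels$Sex),
        selectInput("pclass", "Pclass", choices = model$xlevels$Pclass),
        numericInput("fare", "Fare", value = 30, min = 0),
        numericInput("sibsp", "SibSp", value = 0, min = 0),
        numericInput("parch", "Parch", value = 0, min = 0),
        selectInput("embarked", "Embarked", choices = model$xlevels$Embarked)
      ),
      mainPanel(tableOutput("prediction"))
    )
  )
  server <- function(input, output, session) {
    # Recomputed whenever any input changes.
    output$prediction <- renderTable({
      predict_passenger(model, input$age, input$sex, input$pclass,
                        input$fare, input$sibsp, input$parch, input$embarked)
    })
  }
  shinyApp(ui, server)
}

# Example: app <- make_survival_app(model); runApp(app)
```

Keeping the prediction logic in a plain function outside the server makes it possible to check predictions without launching the app.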
This report demonstrates initial exploratory analysis of the Titanic dataset and outlines plans for building a prediction algorithm and a Shiny app to predict passenger survival. It provides a foundation for further modeling and deployment.