Machine Learning from Disaster: Titanic Survival Analysis with Logistic Regression in Python

Independent Data Analysis Project

Published: November 14, 2024

Modified: November 14, 2024

Executive Summary

This project explores the use of logistic regression to predict passenger survival on the Titanic using a dataset of 891 passengers. The analysis begins with exploratory data analysis (EDA) to identify key factors influencing survival, such as passenger class, gender, and age. Significant missing data in the Age and Cabin columns was handled through imputation and column removal, respectively. Categorical variables were transformed into numerical features to prepare the data for model training, and a logistic regression model was developed to predict the likelihood of survival from the selected features. The model achieved an accuracy of 83%, with strong precision for both classes, although recall for survivors (68%) lagged behind recall for non-survivors (92%). The analysis showed that female passengers, younger individuals, and those in first class had higher survival rates. While the model provided valuable insights, there is room for further improvement by incorporating additional features and exploring more sophisticated machine learning algorithms. The project demonstrates practical applications of data analysis, statistical modeling, and machine learning in deriving actionable insights from historical data.

Keywords

Data analysis, Python, pandas, seaborn, NumPy, descriptive analysis, data science, machine learning, scikit-learn

Introduction

This report presents an independent data analysis project aimed at predicting passenger survival on the Titanic using a logistic regression model. The analysis leverages Python’s data science libraries to explore, clean, and model the data obtained from the popular “Titanic” dataset. The objective is to understand the factors that influenced survival rates and to build a predictive model using machine learning techniques.

Key Insights

  • Female passengers and those in 1st class had a much higher likelihood of survival.
  • Age played an important role, with younger passengers slightly more likely to survive.
  • The model suggests that further improvements could be achieved by incorporating additional features or using more advanced models like Random Forest or Gradient Boosting.

Data Overview

The dataset used in this analysis contains information about passengers aboard the Titanic, including attributes like class (Pclass), sex, age, fare, number of siblings/spouses (SibSp), number of parents/children (Parch), and the port of embarkation. The dataset includes 891 observations with multiple categorical and numerical variables.

Initial Data Exploration

import pandas as pd
import seaborn as sns
import numpy as np

# Load the dataset
titanic = pd.read_csv("titanic_train.csv")
titanic.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Upon inspecting the first few rows of the data, we observed missing values in the Age and Cabin columns. The dataset summary indicated that while most columns had complete data, Age and Cabin had significant gaps that required addressing.
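A quick way to verify these gaps is pandas' column-level summary (a routine supplementary check; the exact counts depend on the copy of the dataset used):

# Non-null counts and dtypes for every column
titanic.info()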

Exploratory Data Analysis (EDA)

To understand the data distribution and identify potential issues, several summary statistics and visualizations were produced.

Data Description

The summary statistics showed that:

  • The average age of passengers was approximately 29.7 years.
  • About 38.4% of passengers survived.
  • There was a noticeable class divide, with most passengers travelling in 3rd class.
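These figures can be reproduced directly from the dataframe (a minimal sketch; exact decimals depend on the dataset copy):

# Summary statistics for the numerical columns
print(titanic.describe())

# Overall survival rate (~0.384) and the class distribution
print(titanic["Survived"].mean())
print(titanic["Pclass"].value_counts())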

Visualizing Missing Data

# Missing values appear as contrasting cells; Age and Cabin stand out
sns.heatmap(titanic.isnull(), cbar=False)

The heatmap revealed that the Age and Cabin columns had many missing values. Cabin was dropped due to excessive missing data, while missing Age values were imputed using a function based on passenger class.
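The same conclusion can be reached numerically rather than visually (a quick check, not part of the original pipeline):

# Count of missing values per column, largest first
print(titanic.isnull().sum().sort_values(ascending=False))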

Distribution of Survival by Gender and Class

sns.countplot(x="Survived", data=titanic, hue="Sex")

  • Gender: Female passengers had a significantly higher survival rate compared to males.
sns.countplot(x="Survived", data=titanic, hue="Pclass")

  • Class: Passengers in 1st class were more likely to survive than those in lower classes.

Age Distribution

sns.displot(x="Age", data=titanic, kde=True)

The age distribution was right-skewed, with a peak around 30 years.
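To quantify the age effect more directly, survival rates can be compared across age bands (an illustrative sketch; the band edges are arbitrary choices, and rows with missing Age are ignored at this stage):

# Mean survival rate within each age band
age_bands = pd.cut(titanic["Age"], bins=[0, 12, 18, 35, 60, 80])
print(titanic.groupby(age_bands, observed=True)["Survived"].mean())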

Data Cleaning and Feature Engineering

Handling Missing Values

To address the missing data in the Age column, we used an imputation function based on passenger class:

def impute_age(cols):
    # cols holds the Age and Pclass values for a single passenger
    Age = cols["Age"]
    Pclass = cols["Pclass"]
    if pd.isnull(Age):
        # Fall back to a typical age for the passenger's class
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

titanic["Age"] = titanic[["Age", "Pclass"]].apply(impute_age, axis=1)

After imputing Age, we dropped the Cabin column and rows with missing values in the Embarked column:

titanic = titanic.drop("Cabin", axis=1).dropna()

Converting Categorical Features to Dummies

To prepare the data for modeling, categorical variables such as Sex and Embarked were converted into numerical form using dummy variables:

# Encode each categorical variable as 0/1 indicators, dropping one level each
sex = pd.get_dummies(titanic["Sex"], drop_first=True)
embarked = pd.get_dummies(titanic["Embarked"], drop_first=True)
pclass = pd.get_dummies(titanic["Pclass"], drop_first=True)

# Attach the indicator columns, then drop the originals plus identifier fields
mydata = pd.concat([titanic, sex, embarked, pclass], axis=1)
mydata = mydata.drop(["Sex", "Ticket", "Embarked", "Name", "PassengerId", "Pclass"], axis=1)
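The same encoding can also be done in one step, letting pandas handle the column replacement (an equivalent alternative sketch; mydata_alt is a hypothetical name, not used elsewhere):

# One-step dummy encoding of the three categorical columns
mydata_alt = pd.get_dummies(
    titanic.drop(["Ticket", "Name", "PassengerId"], axis=1),
    columns=["Sex", "Embarked", "Pclass"],
    drop_first=True,
)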

Building the Logistic Regression Model

With the cleaned dataset, we proceeded to build a logistic regression model to predict passenger survival.

Splitting the Data

We split the data into a training set and a test set, setting aside 30% of the observations for testing. Note that scikit-learn requires column names to be strings, and the dummy variables derived from Pclass carry integer names (2 and 3); hence, we first convert all column names to strings.

from sklearn.model_selection import train_test_split

mydata.columns = mydata.columns.astype(str)
X = mydata.drop("Survived", axis=1)
y = mydata["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
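Since survivors are the minority class, it is worth confirming that the split preserves the class balance; a stratified split guarantees it (an optional refinement, not used in the split above):

# Class balance in each split
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))

# Stratifying on the target keeps the survived/died ratio equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y
)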

Training the Model

To train the model, we first instantiate LogisticRegression() and then call its fit method with X_train and y_train as inputs, as shown below.

from sklearn.linear_model import LogisticRegression

logit_model = LogisticRegression()
logit_model.fit(X_train, y_train)
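Once fitted, the coefficients give a rough sense of each feature's direction of influence (a quick inspection; since the features are unscaled, the signs are more trustworthy than the magnitudes). If a convergence warning appears with the default solver, raising max_iter, e.g. LogisticRegression(max_iter=1000), usually resolves it.

# Pair each coefficient with its feature name, sorted by value
coefs = pd.Series(logit_model.coef_[0], index=X_train.columns)
print(coefs.sort_values())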

Model Evaluation

The model’s performance was evaluated using a classification report and confusion matrix:

from sklearn.metrics import classification_report, confusion_matrix

predictions = logit_model.predict(X_test)

Classification Report:

The classification report shows that the model has an accuracy of 0.83; the precision, recall, and F1 scores are also reported. We note that the model does especially well at identifying passengers who did not survive, as opposed to those who did.

print(classification_report(y_test, predictions))
              precision    recall  f1-score   support

           0       0.82      0.92      0.87       163
           1       0.85      0.68      0.76       104

    accuracy                           0.83       267
   macro avg       0.83      0.80      0.81       267
weighted avg       0.83      0.83      0.82       267

Confusion Matrix:

We visualize the confusion matrix below:

# Annotated heatmap of the confusion matrix (rows: true labels, columns: predictions)
conf_matrix = confusion_matrix(y_test, predictions)
sns.heatmap(conf_matrix, annot=True, cmap="Blues", fmt="d")
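The individual cells of the matrix can also be read off programmatically (a small sketch; for a binary problem, ravel() returns the counts in the order shown below):

# Unpack true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = conf_matrix.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")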

  • Accuracy: The model achieved an accuracy of 83%.
  • Precision: The precision for predicting survival was 85%, while the precision for predicting non-survival was 82%.
  • Recall: The recall for survival prediction was 68%, meaning roughly a third of actual survivors were misclassified as non-survivors.

Conclusion

The logistic regression model developed in this project demonstrated a reasonable level of accuracy (83%) in predicting the survival of Titanic passengers. The analysis revealed that factors like gender, passenger class, and age significantly influenced survival chances.

Recommendations for Future Work

  • Conduct feature engineering to derive additional insights (e.g., family size, titles from names).
  • Experiment with other machine learning algorithms to improve predictive performance.
  • Use cross-validation techniques to enhance model generalizability (a starting sketch follows below).
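As a starting point for the cross-validation suggestion, scikit-learn's cross_val_score estimates out-of-sample accuracy across several folds (a minimal sketch reusing the feature matrix built earlier):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Mean and spread of accuracy over 5 folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())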

Closing Remarks

This analysis highlights the power of data science techniques in deriving actionable insights from historical data. The project demonstrates proficiency in Python, data visualization, logistic regression, and machine learning techniques, showcasing my capabilities in applied data analysis (Muddana and Vinayakam 2024; James et al. 2013).

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
Muddana, A Lakshmi, and Sandhya Vinayakam. 2024. Python for Data Science. Springer.