This project explores the use of logistic regression to predict passenger survival on the Titanic using a dataset of 891 passengers. The analysis begins with an exploratory data analysis (EDA) to identify key factors influencing survival rates, such as passenger class, gender, and age. Substantial missing data in the Age and Cabin columns was handled through imputation and column removal, respectively. Categorical variables were transformed into numerical features to prepare the data for model training. A logistic regression model was developed to predict the likelihood of survival based on selected features. The model achieved an accuracy of 83%, with a precision of 85% and a recall of 68% for the survival class. The analysis revealed that female passengers, younger individuals, and those in first class had higher survival rates. While the model provided valuable insights, there is potential for further enhancement by incorporating additional features and exploring more sophisticated machine learning algorithms. The project demonstrates practical applications of data analysis, statistical modeling, and machine learning in deriving actionable insights from historical data.
Keywords
Data analysis, Python, Pandas, Seaborn, Numpy, Descriptive Analysis, Data Science, Machine Learning, Scikit-learn
Introduction
This report presents an independent data analysis project aimed at predicting passenger survival on the Titanic using a logistic regression model. The analysis leverages Python’s data science libraries to explore, clean, and model the data obtained from the popular “Titanic” dataset. The objective is to understand the factors that influenced survival rates and to build a predictive model using machine learning techniques.
Key Insights
Female passengers and those in 1st class had a much higher likelihood of survival.
Age also played a role: younger passengers were slightly more likely to survive.
Further improvements may be possible by incorporating additional features or using more advanced models such as Random Forest or Gradient Boosting.
Data Overview
The dataset used in this analysis contains information about passengers aboard the Titanic, including attributes like class (Pclass), sex, age, fare, number of siblings/spouses (SibSp), number of parents/children (Parch), and the port of embarkation. The dataset includes 891 observations with multiple categorical and numerical variables.
Initial Data Exploration
import pandas as pd
import seaborn as sns
import numpy as np
import plotly.express as px

# Load the dataset
titanic = pd.read_csv("titanic_train.csv")
titanic.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Upon inspecting the first few rows of the data, we observed missing values in the Age and Cabin columns. The dataset summary indicated that while most columns had complete data, Age and Cabin had significant gaps that required addressing.
Exploratory Data Analysis (EDA)
To understand the data distribution and identify potential issues, several visualizations were created:
Data Description
The summary statistics showed that:
- The average age of passengers was approximately 29.7 years.
- About 38.4% of passengers survived.
- There was a noticeable class divide, with most passengers in 3rd class.
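Figures like these can be read off a DataFrame with a few pandas calls. The sketch below uses a small made-up table (the name `toy` and its values are assumptions for illustration, not the real data); in the project the same calls would be applied to `titanic`.

```python
import pandas as pd

# Made-up stand-in rows (NOT the real data); in the project this would be
# the `titanic` DataFrame loaded from titanic_train.csv.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass":   [3, 1, 3, 1, 3, 2],
    "Age":      [22.0, 38.0, 26.0, 35.0, 35.0, None],
})

mean_age = toy["Age"].mean()            # missing ages are skipped by default
survival_rate = toy["Survived"].mean()  # mean of a 0/1 column = share of 1s
class_counts = toy["Pclass"].value_counts().sort_index()

print(f"Mean age: {mean_age:.1f}")
print(f"Survival rate: {survival_rate:.1%}")
print(class_counts)
```

Calling `describe()` on the DataFrame gives the same means alongside counts, quartiles, and extremes in one table.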
Visualizing Missing Data
sns.heatmap(titanic.isnull(), cbar=False)
The heatmap revealed that the Age and Cabin columns had many missing values. Cabin was dropped due to excessive missing data, while missing Age values were imputed using a function based on passenger class.
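The imputation function itself is not reproduced in this report. As a minimal sketch of the idea, assuming each missing age is filled with the median age of the passenger's class (the median is an assumption; the original function may use a different statistic), the cleaning step could look like this on a small made-up table:

```python
import pandas as pd

# Made-up rows for illustration only (NOT the real Titanic data).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age":    [38.0, None, 40.0, 30.0, None, 22.0, 26.0, None],
    "Cabin":  ["C85", None, "C123", None, None, None, None, None],
})

# Share of missing values per column, the numeric counterpart of the heatmap.
missing_share = toy.isnull().mean()

# Drop the mostly empty Cabin column.
cleaned = toy.drop(columns="Cabin")

# Assumption: fill each missing Age with the median age of that passenger's class.
class_median_age = cleaned.groupby("Pclass")["Age"].transform("median")
cleaned["Age"] = cleaned["Age"].fillna(class_median_age)

print(missing_share)
print(cleaned)
```

Using `transform` keeps the per-class medians aligned with the original rows, so `fillna` can consume them directly without a row-wise `apply`.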
With the cleaned dataset, we proceeded to build a logistic regression model to predict passenger survival.
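Before modeling, the categorical columns have to be turned into numbers. The encoding code is not shown in this report, so the following is only one plausible sketch, assuming indicator (dummy) columns built with `pd.get_dummies`. The table `cleaned` and its values are made up; the name `mydata` is taken from the modeling code.

```python
import pandas as pd

# Made-up, already-cleaned rows for illustration (NOT the real data).
cleaned = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass":   [3, 1, 2, 3],
    "Sex":      ["male", "female", "female", "male"],
    "Age":      [22.0, 38.0, 26.0, 35.0],
    "Embarked": ["S", "C", "Q", "S"],
})

# One indicator column per category; drop_first removes the redundant first level.
sex = pd.get_dummies(cleaned["Sex"], drop_first=True, dtype=int)            # column 'male'
embarked = pd.get_dummies(cleaned["Embarked"], drop_first=True, dtype=int)  # columns 'Q', 'S'
pclass = pd.get_dummies(cleaned["Pclass"], drop_first=True, dtype=int)      # columns 2, 3

mydata = pd.concat(
    [cleaned.drop(columns=["Sex", "Embarked", "Pclass"]), sex, embarked, pclass],
    axis=1,
)

# Dummies built from the numeric Pclass column carry integer labels, so the
# column index mixes strings and integers until it is converted.
print(list(mydata.columns))
mydata.columns = mydata.columns.astype(str)
print(list(mydata.columns))
```

Passing `prefix="Pclass"` to `get_dummies` would be an alternative that yields string column names from the start.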
Splitting the Data
We split the prepared data (mydata) into a training set and a test set, setting aside 30% of the rows as the test set. Recent versions of scikit-learn reject a DataFrame whose column names mix strings with other types, so we first convert all column names to strings.
from sklearn.model_selection import train_test_split

mydata.columns = mydata.columns.astype(str)
X = mydata.drop("Survived", axis=1)
y = mydata["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
Training the Model
To train the model, we first create an instance of LogisticRegression and then call its fit method with X_train and y_train as inputs, as shown below.
from sklearn.linear_model import LogisticRegression

logit_model = LogisticRegression()
logit_model.fit(X_train, y_train)
Model Evaluation
The model’s performance was evaluated using a classification report and confusion matrix:
from sklearn.metrics import classification_report, confusion_matrix

predictions = logit_model.predict(X_test)
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
Classification Report:
The classification report shows that the model has an accuracy of 0.83. The precision, recall, and F1 scores are also reported. We note that the model does especially well at flagging passengers who did not survive, as opposed to those who survived.
Precision: The precision for predicting survival was 85%, while the precision for predicting non-survival was 82%.
Recall: The recall for survival prediction was 68%, indicating some missed predictions for actual survivors.
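To make the relationship between these metrics concrete, here is a small worked calculation. The four counts are hypothetical, chosen only so that the arithmetic lands on a precision of 0.85 and a recall of 0.68 for the survival class; they are not the confusion matrix from this project's test set.

```python
# Hypothetical confusion-matrix counts, chosen only so the arithmetic is easy
# to follow; they are NOT the counts from this project's test set.
tp = 17  # survivors predicted as survivors
fp = 3   # non-survivors predicted as survivors
fn = 8   # survivors predicted as non-survivors
tn = 32  # non-survivors predicted as non-survivors

# Precision: of everyone predicted to be in a class, how many really were.
precision_survived = tp / (tp + fp)
precision_not_survived = tn / (tn + fn)

# Recall: of everyone really in a class, how many the model found.
recall_survived = tp / (tp + fn)
recall_not_survived = tn / (tn + fp)

accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision (survived):     {precision_survived:.2f}")
print(f"recall (survived):        {recall_survived:.2f}")
print(f"precision (not survived): {precision_not_survived:.2f}")
print(f"recall (not survived):    {recall_not_survived:.2f}")
print(f"accuracy:                 {accuracy:.2f}")
```

A recall well below precision for the survival class, as in these counts, means the model is cautious about predicting survival: its positive predictions are mostly right, but it misses a sizeable share of actual survivors.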
Conclusion
The logistic regression model developed in this project demonstrated a reasonable level of accuracy (83%) in predicting the survival of Titanic passengers. The analysis revealed that factors like gender, passenger class, and age significantly influenced survival chances.
Recommendations for Future Work
Conduct feature engineering to derive additional insights (e.g., family size, titles from names).
Experiment with other machine learning algorithms to improve predictive performance.
Use cross-validation techniques to enhance model generalizability.
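As a minimal sketch of the first recommendation, the two example features could be derived as follows. The rows are taken from the data preview earlier in this report; the column names `FamilySize` and `Title`, and the rule used to extract the title, are assumptions made for illustration.

```python
import pandas as pd

# A few rows as displayed in the data preview of this report.
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "SibSp": [1, 0, 1, 0],
    "Parch": [0, 0, 0, 0],
})

# Assumption: family size = siblings/spouses + parents/children + the passenger.
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

# Assumption: the title is the text between the comma and the first period.
sample["Title"] = (
    sample["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
)

print(sample[["Name", "FamilySize", "Title"]])
```

Both new columns would still need to be encoded or scaled like the other features before being passed to a model.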
Conclusion
This analysis highlights the power of data science techniques in deriving actionable insights from historical data. The project demonstrates proficiency in Python, data visualization, logistic regression, and machine learning techniques, showcasing my capabilities in applied data analysis (Muddana and Vinayakam 2024; James et al. 2013).
References
James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, et al. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
Muddana, A Lakshmi, and Sandhya Vinayakam. 2024. Python for Data Science. Springer.