What is Logistic Regression?

Logistic Regression is a statistical method used to predict the probability of a binary outcome.

  • Used for classification problems (yes/no, 0/1, true/false)
  • Models the relationship between input features and a binary outcome
  • Outputs a probability between 0 and 1
  • Widely used in medicine, finance, and machine learning

We will use the Titanic dataset to predict whether a passenger survived, based on age, gender, and ticket class.

The Math Behind It

Logistic Regression models the probability of an outcome using the sigmoid function:

\[P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}\]

Where:

  • \(P(Y=1|X)\) = probability of the outcome being 1 given input \(X\)

  • \(\beta_0\) = intercept

  • \(\beta_1\) = coefficient for feature \(X\)

  • \(e\) = Euler’s number
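
As a minimal sketch of the formula above, the sigmoid can be written directly in R. The function name `sigmoid` and the values of `beta0` and `beta1` are made up for illustration; they are not estimates from the Titanic data. Base R's `plogis()` computes the same function.

```r
# Sigmoid (logistic) function: maps any real number into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients, chosen only for illustration
beta0 <- -1
beta1 <- 0.5

x <- c(-4, 0, 2, 8)
p <- sigmoid(beta0 + beta1 * x)   # P(Y = 1 | X = x)

# Base R's plogis() is the same function
all.equal(p, plogis(beta0 + beta1 * x))

# Taking the log-odds of p recovers the linear predictor
log_odds <- log(p / (1 - p))
all.equal(log_odds, beta0 + beta1 * x)
```

The last check shows why the model is called linear: it is the log-odds, not the probability itself, that is a linear function of \(X\).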

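The model estimates the coefficients by maximizing the log-likelihood, where \(y_i\) is the observed outcome for observation \(i\) and \(p_i = P(Y=1|x_i)\) is its predicted probability: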

\[\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]\]
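
The log-likelihood above can be sketched as a small R function and checked against what `glm()` reports. The helper `log_lik` and the toy vectors `x` and `y` are assumptions made up for this illustration; they are not part of the Titanic data.

```r
# Log-likelihood of a one-feature logistic regression
log_lik <- function(beta, x, y) {
  p <- plogis(beta[1] + beta[2] * x)        # p_i for each observation
  sum(y * log(p) + (1 - y) * log(1 - p))
}

# Made-up toy data (not from the Titanic dataset)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(0, 0, 1, 0, 1, 0, 1, 1)

fit <- glm(y ~ x, family = binomial)

# Evaluated at the fitted coefficients, the hand-written log-likelihood
# should agree with the value glm() reports
ll_manual <- log_lik(unname(coef(fit)), x, y)
ll_glm <- as.numeric(logLik(fit))
all.equal(ll_manual, ll_glm)

# Other coefficient values, such as (0, 0), should not do better
log_lik(c(0, 0), x, y) <= ll_manual
```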

The Dataset

Survival by Class and Gender

Interactive Plot: Age vs Survival Probability

R Code: Building the Model

library(titanic)
# Prepare the data
titanic_data <- titanic_train
titanic_data <- titanic_data[complete.cases(
  titanic_data[, c("Survived", "Age", "Sex", "Pclass")]),]
# Fit logistic regression model
model_glm <- glm(
  Survived ~ Age + factor(Pclass) + Sex,
  data = titanic_data,
  family = binomial
)
# View summary
summary(model_glm)

# Get predicted probabilities
titanic_data$predicted_prob <- predict(model_glm, type = "response")
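
One way to read the fitted model is to exponentiate the coefficients into odds ratios and to predict for a single passenger. The sketch below refits the same model so that it runs on its own; the passenger in `new_passenger` is hypothetical, and the names `odds_ratios` and `p_new` are introduced here for illustration.

```r
library(titanic)

# Refit the same model as above so this block is self-contained
titanic_data <- titanic_train
titanic_data <- titanic_data[complete.cases(
  titanic_data[, c("Survived", "Age", "Sex", "Pclass")]), ]
model_glm <- glm(
  Survived ~ Age + factor(Pclass) + Sex,
  data = titanic_data,
  family = binomial
)

# Exponentiated coefficients are odds ratios:
# values below 1 lower the odds of survival, values above 1 raise them
odds_ratios <- exp(coef(model_glm))
odds_ratios

# Predicted survival probability for a hypothetical passenger
new_passenger <- data.frame(Age = 30, Pclass = 1, Sex = "female")
p_new <- predict(model_glm, newdata = new_passenger, type = "response")
p_new
```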

Model Performance
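
A common way to summarize performance is to convert the predicted probabilities into class labels and compare them with the actual outcomes. The sketch below is one such approach, not necessarily the one used in the original slide; the helper `classify_and_score`, the 0.5 threshold, and the toy vectors are all assumptions made for illustration.

```r
# Turn probabilities into 0/1 predictions and compare with actual outcomes
classify_and_score <- function(prob, actual, threshold = 0.5) {
  predicted <- as.integer(prob >= threshold)
  confusion <- table(
    Actual = factor(actual, levels = c(0, 1)),
    Predicted = factor(predicted, levels = c(0, 1))
  )
  list(confusion = confusion, accuracy = mean(predicted == actual))
}

# Made-up probabilities and outcomes, for illustration only
toy_prob <- c(0.9, 0.2, 0.7, 0.4, 0.1, 0.6)
toy_actual <- c(1, 0, 0, 1, 0, 1)

result <- classify_and_score(toy_prob, toy_actual)
result$confusion   # rows: actual class, columns: predicted class
result$accuracy    # share of correct predictions
```

With the model fitted earlier, the same helper could be called as `classify_and_score(titanic_data$predicted_prob, titanic_data$Survived)`.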

Conclusion

Logistic Regression is powerful because:

  • Models the probability of a binary outcome using the sigmoid function
  • Coefficients are interpretable: they show the direction and strength of each effect
  • Works well even with small datasets like Titanic
  • Female passengers in 1st class had the highest survival probability

Limitations:

  • Assumes a linear relationship between the features and the log-odds
  • Does not handle complex non-linear decision boundaries well
  • Sensitive to outliers and correlated features

Logistic Regression remains one of the most widely used models in medicine, finance, and machine learning due to its simplicity and interpretability.