Week 9 - Data Dive - GLMs (Alex Weber)

Breakdown of the Data Dive:

Part 1: Data Preparation
Part 2: Building the Logistic Regression Model
Part 3: Model Assessment
Part 4: Conclusion and Questions:

Set-up:

# Load the typical libs
library(dplyr)

## Warning: package 'dplyr' was built under R version 4.2.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.2.1     ✔ stringr 1.4.0
## ✔ tidyr   1.2.0     ✔ forcats 0.5.1
## ✔ readr   2.1.2

## Warning: package 'tibble' was built under R version 4.2.3

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggplot2)

Part 1: Data Preparation

I will start by loading the dataset and recoding the type attribute into a binary attribute named binary_outcome.

# Read the data set and assign it to 'df', the data frame.
df <- read.csv("../data/COVID_country.csv")

# Recode "type" as a binary outcome: 1 for "Interest_Cat", 0 otherwise
df$binary_outcome <- ifelse(df$type == "Interest_Cat", 1, 0)
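
Before modeling, it can help to confirm that the recode produced both 0s and 1s, since a logistic regression cannot be estimated sensibly from a single class. The sketch below uses a small made-up data frame (toy_df and its values are hypothetical, not the COVID_country.csv data) to show one way to run that check.

```r
# Hypothetical toy data standing in for the real data set
toy_df <- data.frame(
  type = c("Interest_Cat", "Other", "Other", "Interest_Cat", "Other"),
  sex  = c("Male", "Female", "Male", "Female", "Female")
)

# Same recode as in the report
toy_df$binary_outcome <- ifelse(toy_df$type == "Interest_Cat", 1, 0)

# Count how many rows fall into each outcome value
outcome_counts <- table(toy_df$binary_outcome)
outcome_counts

# TRUE only if both 0 and 1 are present in the outcome
has_both_classes <- length(unique(toy_df$binary_outcome)) == 2
has_both_classes
```

Running the same two checks on df$binary_outcome would show whether any row actually matches "Interest_Cat".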

Part 2: Building the Logistic Regression Model

Next, I will build the logistic regression model with binary_outcome as the response and the sex variable as a possible explanatory variable, and then interpret the model coefficients.

# Build a logistic regression model using glm
model <- glm(binary_outcome ~ sex, data = df, family = binomial(link = 'logit'))

## Warning: glm.fit: algorithm did not converge

# Display model coefficients
model$coefficients

##   (Intercept)       sexMale 
## -2.656607e+01 -1.484977e-13

Findings:

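The intercept coefficient is approximately -26.57.
The coefficient for the sexMale variable is approximately -1.48e-13.

These coefficients are on the log-odds scale for the binary_outcome variable. The intercept is the log-odds of being in the Interest_Cat category for the baseline sex level (the level other than Male), and a value this strongly negative implies a predicted probability of essentially zero. The sexMale coefficient is effectively zero, so the model estimates no difference between the sexes. Note also the warning that the algorithm did not converge: an intercept near -26.57 together with that warning is the pattern glm() typically produces when the response contains no 1s at all, so these estimates should not be read as a real effect until the recoding of type has been checked.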
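
To make the log-odds scale concrete, the printed coefficients can be converted to a probability and an odds ratio. This is a minimal sketch that only reuses the two values shown in the output above; the variable names are my own.

```r
# Coefficients copied from the model output above
intercept <- -26.56607
b_male    <- -1.484977e-13

# Inverse logit: convert log-odds to a probability
p_baseline <- plogis(intercept)           # baseline sex level
p_male     <- plogis(intercept + b_male)  # sex == "Male"

# Odds ratio for sexMale relative to the baseline level
or_male <- exp(b_male)

p_baseline
p_male
or_male
```

Both probabilities come out vanishingly small and the odds ratio is indistinguishable from 1, which matches the reading that the model predicts essentially no cases for either sex.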

Part 3: Model Assessment

To evaluate the model’s fit and assess its performance, I will complete the following tasks:

Calculate the standard errors of the coefficients and build 95% confidence intervals (C.I.).
Create a scatter plot to visualize the relationship between sex and the binary_outcome.

# Calculate standard errors for coefficients
se <- summary(model)$coefficients[, "Std. Error"]

# Build 95% confidence intervals for coefficients
ci_lower <- coef(model) - 1.96 * se
ci_upper <- coef(model) + 1.96 * se

# Display standard errors and confidence intervals
se

## (Intercept)     sexMale 
##    6392.053    9039.728

cbind(Coefficient = coef(model), `95% CI Lower` = ci_lower, `95% CI Upper` = ci_upper)

##               Coefficient 95% CI Lower 95% CI Upper
## (Intercept) -2.656607e+01    -12554.99     12501.86
## sexMale     -1.484977e-13    -17717.87     17717.87
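
The manual interval above is a Wald interval, and base R offers confint.default() for the same calculation. The sketch below cross-checks the two on a stand-in model fitted to the built-in mtcars data (an assumption for illustration only, not the COVID data), using the exact normal quantile rather than the rounded 1.96.

```r
# Stand-in logistic model on a built-in data set (not the COVID data)
toy_model <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

toy_est <- coef(toy_model)
toy_se  <- sqrt(diag(vcov(toy_model)))  # same values as the "Std. Error" column

# Manual Wald interval with the exact 97.5% normal quantile
z <- qnorm(0.975)
manual_ci <- cbind(lower = toy_est - z * toy_se, upper = toy_est + z * toy_se)

# Built-in Wald interval
wald_ci <- confint.default(toy_model, level = 0.95)

manual_ci
wald_ci
```

The two matrices should agree up to floating-point error; the only difference from the report's version is 1.96 versus qnorm(0.975).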

Findings:

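Intercept standard error: 6392.053

This is the standard error of the intercept, not the coefficient itself (the coefficient is approximately -26.57). The intercept is the log-odds of the binary outcome when the “sex” variable is 0 (not male). Because it is strongly negative, the estimated chance of being in the “Interest_Cat” category is close to zero, but a standard error this large means the estimate is essentially undetermined.

sexMale standard error: 9039.728

This is the standard error of the sexMale coefficient, which measures the influence of the “sex” variable being male (coded as 1) on the binary outcome’s log-odds. The coefficient itself is approximately -1.48e-13, so it is effectively zero rather than positive, and the model gives no evidence that being male changes the log-odds of being in the “Interest_Cat” group.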

The 95% CI for the Intercept ranges from approximately -12554.99 to 12501.86.
The 95% CI for sexMale ranges from approximately -17717.87 to 17717.87.

It is difficult to determine how much the “sex” variable influenced the binary outcome, because the confidence intervals are extremely wide and the coefficient for “sexMale” is essentially zero. Both intervals span zero by a huge margin, so the data show no detectable association between “sex” and the outcome variable “Interest_Cat.” In practice, the logistic regression model provides no useful information about the effect of the “sex” variable on the likelihood of falling into the “Interest_Cat” category.
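
One situation that produces a strongly negative intercept together with standard errors in the thousands is a response with no 1s at all. The sketch below reproduces that pattern on made-up data (toy_df is hypothetical); it illustrates the mechanism and does not establish that this is what happened with COVID_country.csv.

```r
# Hypothetical data: the outcome is 0 for every row
toy_df <- data.frame(
  sex = rep(c("Female", "Male"), each = 500),
  binary_outcome = 0
)

# suppressWarnings() hides any fitting warnings so the sketch runs quietly
toy_model <- suppressWarnings(
  glm(binary_outcome ~ sex, data = toy_df, family = binomial(link = "logit"))
)

toy_est <- coef(toy_model)
toy_se  <- summary(toy_model)$coefficients[, "Std. Error"]

toy_est
toy_se
toy_model$converged
```

With no events to learn from, the fitted log-odds keep drifting toward minus infinity until the iteration limit is hit, and the standard errors blow up because almost no information is left in the weights.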

Scatter plot:

# Create a scatter plot
ggplot(data = df, aes(x = sex, y = binary_outcome)) +
  geom_point() +
  labs(x = "Sex", y = "Binary Outcome", title = "Scatter Plot of Sex vs. Binary Outcome")

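Umm… this scatter plot is, you know, not very informative: with a categorical x and a binary y, every point lands on one of a handful of positions and overplots. No further comment… :D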
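
With a categorical x and a 0/1 y, one possible alternative is to plot the proportion of 1s per sex instead of the raw points. The sketch below uses hypothetical toy data and two helper functions of my own naming (prop_by_sex and plot_prop_by_sex); the plotting helper assumes ggplot2 is installed and is only defined here, not called.

```r
# Hypothetical toy data (not the COVID data)
toy_df <- data.frame(
  sex            = c("Female", "Female", "Female", "Male", "Male", "Male"),
  binary_outcome = c(1, 0, 0, 1, 1, 0)
)

# Proportion of rows with binary_outcome == 1 within each sex
prop_by_sex <- function(d) {
  aggregate(binary_outcome ~ sex, data = d, FUN = mean)
}

# Bar chart of those proportions (requires ggplot2)
plot_prop_by_sex <- function(d) {
  ggplot2::ggplot(prop_by_sex(d), ggplot2::aes(x = sex, y = binary_outcome)) +
    ggplot2::geom_col() +
    ggplot2::labs(x = "Sex", y = "Proportion with outcome = 1",
                  title = "Proportion of Binary Outcome by Sex")
}

props <- prop_by_sex(toy_df)
props
```

Calling plot_prop_by_sex(df) on the real data frame would draw one bar per sex, which makes a difference in rates (or the lack of one) visible at a glance.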

Part 4: Conclusion and Questions:

In this data dive, I built a logistic regression model that runs but does not produce useful results: the fit did not converge, and the coefficient for sex is effectively zero. As a result, the model cannot meaningfully predict the binary_outcome (Interest_Cat) from the sex variable. I interpreted the coefficients and assessed the model’s precision with standard errors and confidence intervals (as mentioned as a bonus exercise).

Further Questions:

Does the sex variable alone provide sufficient information to predict the Interest_Cat category, or do I need to consider additional explanatory variables?
Are there possible interactions or nonlinear relationships between variables that should (or need to) be explored?
How well does this model fit the data? What are other statistical metrics that should be considered for model evaluation?
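
As a starting point for the model-fit question, a few standard checks are the drop in deviance, a likelihood-ratio test against the intercept-only model, AIC, and simple classification accuracy. The sketch below shows them on a stand-in model fitted to the built-in mtcars data (an assumption for illustration, not the COVID data); the 0.5 cutoff is an arbitrary choice.

```r
# Stand-in models on a built-in data set (not the COVID data)
null_model <- glm(am ~ 1,  data = mtcars, family = binomial(link = "logit"))
toy_model  <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

# Null vs. residual deviance: a large drop suggests the predictor helps
dev_summary <- c(null = toy_model$null.deviance, residual = deviance(toy_model))

# Likelihood-ratio test of the predictor against the intercept-only model
lrt <- anova(null_model, toy_model, test = "Chisq")

# AIC for both models (lower is better)
aic_table <- AIC(null_model, toy_model)

# In-sample classification accuracy at a 0.5 probability cutoff
pred     <- as.integer(fitted(toy_model) > 0.5)
accuracy <- mean(pred == mtcars$am)

dev_summary
lrt
aic_table
accuracy
```

The same calls applied to the report's model would only be meaningful once the fit converges, so checking model$converged first is a reasonable precaution.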

Overall, additional analyses, model refinements, and data exploration are required to provide more insights and enhance the predictive power of the model.