Breakdown of the Data Dive:

Set-up:

# Load the typical libs
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse) 
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.2.1     ✔ stringr 1.4.0
## ✔ tidyr   1.2.0     ✔ forcats 0.5.1
## ✔ readr   2.1.2
## Warning: package 'tibble' was built under R version 4.2.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2) 

Part 1: Data Preparation

I will start by loading the dataset and recoding the type attribute into a binary attribute named binary_outcome.

# Read the data set and assign it to the data frame 'df'.
df <- read.csv("../data/COVID_country.csv")

# Recode "type" as a binary outcome: 1 for "Interest_Cat", 0 otherwise
df$binary_outcome <- ifelse(df$type == "Interest_Cat", 1, 0)
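A quick frequency table after a recode like this shows whether both outcome values actually occur. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the COVID_country data) in which the target label never appears, so every row is coded 0.

```r
# Toy data standing in for the real file (hypothetical values)
toy <- data.frame(type = c("A", "B", "A", "C", "B"))

# Same recode as above, applied to a column where the label never occurs
toy$binary_outcome <- ifelse(toy$type == "Interest_Cat", 1, 0)

# Frequency tables of the raw labels and of the recoded outcome
type_counts <- table(toy$type)
outcome_counts <- table(toy$binary_outcome)
print(type_counts)
print(outcome_counts)

# TRUE when the target label is absent, i.e. every outcome is 0
label_missing <- !("Interest_Cat" %in% toy$type)
print(label_missing)
```

If the same check on the real data showed only one outcome value, a logistic regression on it could not be estimated meaningfully.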

Part 2: Building the Logistic Regression Model

Next, I will build the logistic regression model using the sex variable as a possible explanatory variable and binary_outcome as the response, and then interpret the model coefficients.

# Build a logistic regression model using glm
model <- glm(binary_outcome ~ sex, data = df, family = binomial(link = 'logit'))
## Warning: glm.fit: algorithm did not converge
# Display model coefficients
model$coefficients
##   (Intercept)       sexMale 
## -2.656607e+01 -1.484977e-13

Findings:

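These coefficients are on the log-odds scale. The intercept (about -26.57) is the log-odds of falling into the Interest_Cat category for the reference level of sex, which corresponds to a predicted probability of essentially zero. The sexMale coefficient (about -1.5e-13) is the difference in log-odds between males and the reference level, and it is indistinguishable from zero. Note that glm warned that the algorithm did not converge, so these estimates are not reliable. An intercept this extreme usually means the outcome is 0 for all (or nearly all) rows, which is worth checking, for example by confirming that "Interest_Cat" actually occurs in the type column.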
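To make the log-odds scale concrete, the reported coefficients can be converted to probabilities and an odds ratio. This is a minimal sketch that copies the two numbers from the output above (rounded); the variable names are my own.

```r
# Coefficients as reported in the model output above (rounded)
intercept <- -26.56607
sex_male <- -1.484977e-13

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
p_reference <- plogis(intercept)            # reference level of sex
p_male <- plogis(intercept + sex_male)      # sexMale level

# exp() of a coefficient gives the odds ratio versus the reference level
odds_ratio_male <- exp(sex_male)

print(c(p_reference = p_reference,
        p_male = p_male,
        odds_ratio_male = odds_ratio_male))
```

Both predicted probabilities are vanishingly small and the odds ratio is 1 to within rounding, which matches the reading that the fitted model predicts the outcome essentially never occurs for either sex.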

Part 3: Model Assessment

To assess the model’s fit and performance, I will complete the following tasks:

  1. Calculate the standard errors of the coefficients and build 95% confidence intervals.

  2. Create a scatter plot to visualize the relationship between sex and the binary_outcome.

# Calculate standard errors for coefficients
se <- summary(model)$coefficients[, "Std. Error"]

# Build 95% confidence intervals for coefficients
ci_lower <- coef(model) - 1.96 * se
ci_upper <- coef(model) + 1.96 * se

# Display standard errors and confidence intervals
se
## (Intercept)     sexMale 
##    6392.053    9039.728
cbind(Coefficient = coef(model), `95% CI Lower` = ci_lower, `95% CI Upper` = ci_upper)
##               Coefficient 95% CI Lower 95% CI Upper
## (Intercept) -2.656607e+01    -12554.99     12501.86
## sexMale     -1.484977e-13    -17717.87     17717.87

Findings:

  1. Standard error of the intercept: 6392.053.

  2. Standard error of the sexMale coefficient: 9039.728.

  3. The 95% CI for the intercept ranges from approximately -12554.99 to 12501.86.

  4. The 95% CI for sexMale ranges from approximately -17717.87 to 17717.87.

The standard errors are several orders of magnitude larger than the coefficients themselves, so the confidence intervals span thousands of log-odds units on either side of zero. Together with the near-zero “sexMale” coefficient and the convergence warning, this means the data provide no evidence of an association between “sex” and the “Interest_Cat” outcome, and the size of any effect cannot be determined. In this exercise, the logistic regression model provides no useful information about the effect of the “sex” variable on the likelihood of falling into the “Interest_Cat” category.
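For comparison, here is what the same Wald interval calculation looks like on a model that does converge. The data are simulated (the data frame `sim`, the sample size, and the "true" coefficients are all assumptions for illustration), and the hand-built interval uses `qnorm(0.975)` rather than the rounded 1.96 so that it can be compared with `confint.default()`.

```r
set.seed(1)
n <- 200

# Simulated data: hypothetical log-odds of -0.5 for females, 0.8 higher for males
sim <- data.frame(sex = factor(sample(c("Female", "Male"), n, replace = TRUE)))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.8 * (sim$sex == "Male")))

fit <- glm(y ~ sex, data = sim, family = binomial(link = "logit"))

# Wald interval by hand: estimate +/- z * standard error
est <- coef(fit)
se <- sqrt(diag(vcov(fit)))
z <- qnorm(0.975)
wald_ci <- cbind(lower = est - z * se, upper = est + z * se)

# Built-in Wald interval for comparison
builtin_ci <- confint.default(fit, level = 0.95)

print(cbind(estimate = est, se = se, wald_ci))
```

With a well-behaved fit the standard errors are a fraction of a log-odds unit, in contrast to the values in the thousands reported above.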

Scatter plot:

# Create a scatter plot
ggplot(data = df, aes(x = sex, y = binary_outcome)) +
  geom_point() +
  labs(x = "Sex", y = "Binary Outcome", title = "Scatter Plot of Sex vs. Binary Outcome")

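This scatter plot is not very informative: both variables are categorical, so all of the points stack on top of each other at a handful of positions and the plot cannot show how many observations fall in each combination.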
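A cross-tabulation is one alternative way to look at two categorical variables. The sketch below uses a tiny made-up data frame (`toy` and its values are hypothetical) and shows counts and row proportions of the outcome by sex.

```r
# Hypothetical toy data, not the COVID_country file
toy <- data.frame(
  sex = c("Female", "Female", "Male", "Male", "Male", "Female"),
  binary_outcome = c(0, 1, 0, 0, 1, 0)
)

# Counts for each combination of sex and outcome
counts <- table(sex = toy$sex, outcome = toy$binary_outcome)

# Proportion of each outcome within each sex (rows sum to 1)
row_props <- prop.table(counts, margin = 1)

print(counts)
print(round(row_props, 2))
```

Applied to the real data, a table like this would show directly whether any rows have binary_outcome equal to 1.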

Part 4: Conclusion and Questions:

In this data dive, I built a logistic regression model of binary_outcome (membership in the Interest_Cat category) on the sex variable. The model runs, but it did not converge and does not produce useful results: sex shows no detectable effect on the outcome. I interpreted the coefficients and assessed their precision with standard errors and confidence intervals (mentioned as a bonus exercise).

Further Questions:

  1. Does the sex variable alone provide sufficient information to predict the Interest_Cat category, or do I need to consider additional explanatory variables?

  2. Are there possible interactions or nonlinear relationships between variables that should (or need to) be explored?

  3. How well does this model fit the data? What are other statistical metrics that should be considered for model evaluation?
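As a starting point for the third question, a few fit checks are available in base R: the convergence flag, null versus residual deviance, AIC, and a likelihood ratio test against the intercept-only model. The sketch below runs them on simulated data (`sim`, the sample size, and the coefficients are assumptions for illustration, not results from this data dive).

```r
set.seed(42)
n <- 300

# Simulated data with a hypothetical moderate effect of sex
sim <- data.frame(sex = factor(sample(c("Female", "Male"), n, replace = TRUE)))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.3 + 0.6 * (sim$sex == "Male")))

# Intercept-only model and model with sex as a predictor
null_fit <- glm(y ~ 1, data = sim, family = binomial(link = "logit"))
sex_fit <- glm(y ~ sex, data = sim, family = binomial(link = "logit"))

# Likelihood ratio test: does adding sex reduce the deviance significantly?
lrt <- anova(null_fit, sex_fit, test = "Chisq")
print(lrt)

# AIC comparison (lower is better)
print(AIC(null_fit, sex_fit))

# Convergence flag and deviances stored on the fitted object
print(c(converged = sex_fit$converged,
        null_deviance = sex_fit$null.deviance,
        residual_deviance = sex_fit$deviance))
```

Checking `converged` first is worthwhile, since the other metrics are hard to interpret for a fit that did not converge.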

Overall, additional analyses, model refinements, and data exploration are required to provide more insights and enhance the predictive power of the model.