Breakdown of the Data Dive:
Set-up:
# Load the typical libs
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.2.1 ✔ stringr 1.4.0
## ✔ tidyr 1.2.0 ✔ forcats 0.5.1
## ✔ readr 2.1.2
## Warning: package 'tibble' was built under R version 4.2.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
I will start by loading the dataset and recoding the
type attribute as a binary attribute named
binary_outcome.
# Read the data set and assign it to the alias 'df' for the data frame.
df <- read.csv("../data/COVID_country.csv")
# Recode "type" as a binary outcome: 1 for "Interest_Cat", 0 otherwise
df$binary_outcome <- ifelse(df$type == "Interest_Cat", 1, 0)
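Before modeling, it can help to confirm that the recoding produced both outcome classes. The following is a minimal sketch on a small made-up data frame standing in for COVID_country.csv; the `toy` object and its values are hypothetical and only illustrate the check.

```r
# Hypothetical stand-in for df; the real values of `type` are not shown here.
toy <- data.frame(
  type = c("Interest_Cat", "Other", "Other", "Interest_Cat", "Other"),
  sex  = c("Female", "Male", "Female", "Male", "Male")
)
toy$binary_outcome <- ifelse(toy$type == "Interest_Cat", 1, 0)

# Which labels does `type` actually contain, and how many of each?
type_counts <- table(toy$type, useNA = "ifany")
print(type_counts)

# Both 0 and 1 should appear; a single class means the label never matched.
outcome_counts <- table(toy$binary_outcome)
print(outcome_counts)
n_classes <- length(outcome_counts)
```

If the same check on the real df shows only one class, the string compared against type probably does not match any value in that column.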
Next, I will build a logistic regression model with
binary_outcome as the response and the sex variable as a
possible explanatory variable, and then interpret the model
coefficients.
# Build a logistic regression model using glm
model <- glm(binary_outcome ~ sex, data = df, family = binomial(link = 'logit'))
## Warning: glm.fit: algorithm did not converge
# Display model coefficients
model$coefficients
## (Intercept) sexMale
## -2.656607e+01 -1.484977e-13
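Findings:
The intercept is approximately -26.57 and the sexMale coefficient is
approximately -1.48e-13. These coefficients are on the log-odds scale of the
binary_outcome variable. The intercept is the log-odds of being in the
Interest_Cat category for the reference level of sex, and a value this
negative corresponds to a fitted probability of essentially zero. The
sexMale coefficient is the difference in log-odds between males and the
reference level; it is extremely close to zero, so the model estimates no
difference between the sexes. Because glm warned that the algorithm did not
converge, these estimates should be read with caution rather than as
evidence of a genuinely weak effect.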
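To read the log-odds as probabilities, the inverse logit can be applied to the printed coefficients. This is a small sketch that copies the two estimates from the output above by hand; the variable names are only illustrative.

```r
# Coefficients copied from the model output above.
intercept <- -26.56607
b_male    <- -1.484977e-13

# plogis() is the inverse logit: it maps log-odds to a probability.
p_reference <- plogis(intercept)           # reference level of sex
p_male      <- plogis(intercept + b_male)  # sexMale = 1

# exp() of a coefficient gives an odds ratio.
odds_ratio_male <- exp(b_male)

print(c(p_reference = p_reference, p_male = p_male,
        odds_ratio_male = odds_ratio_male))
```

Both fitted probabilities come out on the order of 1e-12, and the odds ratio for sexMale is indistinguishable from 1.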
To evaluate the model’s fit and assess its performance, I will complete the following tasks:
Calculate the standard errors of the coefficients and build 95% confidence intervals (C.I.).
Create a scatter plot to visualize the relationship between
sex and the binary_outcome.
# Calculate standard errors for coefficients
se <- summary(model)$coefficients[, "Std. Error"]
# Build 95% confidence intervals for coefficients
ci_lower <- coef(model) - 1.96 * se
ci_upper <- coef(model) + 1.96 * se
# Display standard errors and confidence intervals
se
## (Intercept) sexMale
## 6392.053 9039.728
cbind(Coefficient = coef(model), `95% CI Lower` = ci_lower, `95% CI Upper` = ci_upper)
## Coefficient 95% CI Lower 95% CI Upper
## (Intercept) -2.656607e+01 -12554.99 12501.86
## sexMale -1.484977e-13 -17717.87 17717.87
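The manual calculation above is a Wald interval, and base R's `confint.default()` produces the same kind of interval directly from a fitted model. The sketch below uses simulated toy data (hypothetical, not the COVID data) so that both outcome classes are present and the fit is well behaved.

```r
set.seed(1)
# Hypothetical toy data with both outcome classes present.
toy <- data.frame(sex = factor(rep(c("Female", "Male"), each = 100)))
toy$binary_outcome <- rbinom(200, size = 1,
                             prob = ifelse(toy$sex == "Male", 0.6, 0.4))

toy_model <- glm(binary_outcome ~ sex, data = toy,
                 family = binomial(link = "logit"))

# Manual Wald interval: estimate +/- z * standard error.
z  <- qnorm(0.975)  # about 1.96
se <- summary(toy_model)$coefficients[, "Std. Error"]
manual_ci <- cbind(coef(toy_model) - z * se, coef(toy_model) + z * se)

# confint.default() computes the same Wald interval from coef() and vcov().
wald_ci <- confint.default(toy_model, level = 0.95)
print(wald_ci)
```

Using `qnorm(0.975)` instead of the rounded 1.96 makes the two results agree to numerical precision.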
Findings:
The 95% CI for the Intercept ranges from approximately -12554.99 to 12501.86.
The 95% CI for sexMale ranges from approximately -17717.87 to 17717.87.
It is not possible to determine how much the “sex” variable influenced the binary outcome: the coefficient for “sexMale” is essentially zero, and both confidence intervals span thousands of log-odds units in either direction. The data, as modeled, show no association between “sex” and membership in “Interest_Cat.” Intervals this wide, together with the convergence warning, suggest that the model could not be estimated reliably, for example because the outcome has little or no variation. In this exercise, the logistic regression model provides no useful information about the effect of the “sex” variable on the likelihood of falling into the “Interest_Cat” category.
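Standard errors in the thousands alongside a convergence warning are a typical symptom of an outcome with no variation, or of a group that contains only one outcome class. The sketch below imitates that situation on made-up data where binary_outcome is 0 for every row. This is an assumption chosen to reproduce the symptoms, not a confirmed property of COVID_country.csv.

```r
# Hypothetical data: no row has binary_outcome == 1.
toy <- data.frame(
  sex            = factor(rep(c("Female", "Male"), each = 50)),
  binary_outcome = 0
)

# Cross-tabulate first: an empty outcome class shows up here immediately.
counts <- table(toy$sex, toy$binary_outcome)
print(counts)

# Fitting anyway drives the intercept toward -Inf; glm may also warn.
toy_model <- suppressWarnings(
  glm(binary_outcome ~ sex, data = toy, family = binomial(link = "logit"))
)
toy_se <- summary(toy_model)$coefficients[, "Std. Error"]
print(coef(toy_model))
print(toy_se)
print(toy_model$converged)  # FALSE means the iteration limit was hit
```

Running `table(df$sex, df$binary_outcome)` on the real data would show whether the same thing is happening there.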
Scatter plot:
# Create a scatter plot
ggplot(data = df, aes(x = sex, y = binary_outcome)) +
  geom_point() +
  labs(x = "Sex", y = "Binary Outcome", title = "Scatter Plot of Sex vs. Binary Outcome")
Umm… this scatter plot is not very informative: with a categorical x and a 0/1 y, all the points pile up on a few positions. :D
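Since sex is categorical and binary_outcome takes only the values 0 and 1, one alternative is to plot the proportion of 1s within each group. This is a minimal sketch on simulated toy data (hypothetical values); the bar chart is only built if ggplot2 is installed.

```r
set.seed(2)
# Hypothetical toy data; swap in df to use the real columns.
toy <- data.frame(sex = rep(c("Female", "Male"), each = 100))
toy$binary_outcome <- rbinom(200, size = 1,
                             prob = ifelse(toy$sex == "Male", 0.6, 0.4))

# Share of rows with binary_outcome == 1 within each sex.
props <- aggregate(binary_outcome ~ sex, data = toy, FUN = mean)
print(props)

# Build a bar chart of those shares; call print(p) to draw it.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(props, ggplot2::aes(x = sex, y = binary_outcome)) +
    ggplot2::geom_col() +
    ggplot2::labs(x = "Sex", y = "Proportion in Interest_Cat",
                  title = "Proportion of Binary Outcome by Sex")
}
```

A bar per group makes any difference between the sexes visible at a glance, which overlapping points cannot do.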
In this data dive, I built a logistic regression model that runs
but does not produce reliable results: the fit did not converge, so it
cannot meaningfully predict the binary_outcome (Interest_Cat) from
the sex variable. I interpreted the coefficients and
assessed the model's precision with standard errors and confidence
intervals (the bonus exercise).
Further Questions:
Does the sex variable alone provide sufficient
information to predict the Interest_Cat category, or do I
need to consider additional explanatory variables?
Are there possible interactions between variables, or nonlinear relationships, that should be explored?
How well does this model fit the data? What are other statistical metrics that should be considered for model evaluation?
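As a starting point for these questions, nested models can be compared with AIC, residual deviance, and a likelihood-ratio test. The sketch below uses simulated toy data with a made-up second predictor, `age`; both the data and the variable are hypothetical and stand in for whatever additional columns the real data set offers.

```r
set.seed(3)
# Hypothetical toy data with a second, made-up predictor `age`.
n   <- 300
toy <- data.frame(
  sex = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  age = round(runif(n, 20, 80))
)
toy$binary_outcome <- rbinom(n, size = 1, prob = plogis(-3 + 0.05 * toy$age))

m_sex  <- glm(binary_outcome ~ sex,       data = toy, family = binomial)
m_both <- glm(binary_outcome ~ sex + age, data = toy, family = binomial)

# Lower AIC suggests a better trade-off between fit and complexity.
aic_table <- AIC(m_sex, m_both)
print(aic_table)

# Likelihood-ratio test for the nested models.
lrt <- anova(m_sex, m_both, test = "Chisq")
print(lrt)

# Residual deviance cannot increase when a predictor is added.
deviance_drop <- deviance(m_sex) - deviance(m_both)
print(deviance_drop)
```

A large drop in deviance with a small p-value in the likelihood-ratio test would indicate that the added variable carries information that sex alone does not.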
Overall, additional analyses, model refinements, and data exploration are required to provide more insight and improve the predictive power of the model.