knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Olympics Data

We structure the data to add a new column called “Medals earned”

dataset_olympics <- read_delim("dataset_olympics.csv")
## Rows: 70000 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
## dbl  (5): ID, Age, Height, Weight, Year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dataset_olympics <- dataset_olympics %>% arrange(Name, Year)
dataset_olympics <- dataset_olympics %>%
  group_by(Name) %>%
  mutate(is_Male = ifelse(Sex == 'M',1,0)) %>%
  ungroup()
head(dataset_olympics$is_Male)

Binary Column of choice?

The best binary column of choice that would be worth modeling is of course the column ‘Sex’.

Building a Logistic Regression Model

dataset_olympics |>
  ggplot(mapping = aes(x = Height, y = is_Male)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_smooth(method = 'lm', se = FALSE) +
  labs(title = "Modeling a Binary Response with OLS") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 16254 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 16254 rows containing missing values (`geom_point()`).

Significance is negligent and very low for the model. A Sigmoid Function makes the most sense. We utilize the `logit` function.

\[ \text{log odds of $y$} = \log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 + \cdots + \hat{\beta}_kx_k \]

(where \(\log = \ln\), the natural log.)

sexModel <- glm(is_Male ~ Height, data=dataset_olympics,family = binomial(link = 'logit'))
sexModel$coefficients
## (Intercept)      Height 
## -22.9523566   0.1372508

Our equation is =

\[ \log\left(\frac{p}{1 - p}\right) = -22.952 + 0.1372\times\texttt{Height} \]

\[ Height = 22.952/0.1372 Height = 167.28863 \]

This means that when height is 167.3 units, there is a 50/50 chance of the athlete to be Male or Female.

sigmoid <- \(x) 1 / (1 + exp(-(-22.9523566 + 0.1372508 * x)))

dataset_olympics |>
  ggplot(mapping = aes(x = Height, y = is_Male)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Modeling a Binary Response with Sigmoid") +
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  theme_minimal()
## Warning: Removed 16254 rows containing missing values (`geom_point()`).