knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We structure the data to add a new column called “Medals earned”
dataset_olympics <- read_delim("dataset_olympics.csv")
## Rows: 70000 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
## dbl (5): ID, Age, Height, Weight, Year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dataset_olympics <- dataset_olympics %>% arrange(Name, Year)
dataset_olympics <- dataset_olympics %>%
group_by(Name) %>%
mutate(is_Male = ifelse(Sex == 'M',1,0)) %>%
ungroup()
head(dataset_olympics$is_Male)
The best binary column of choice that would be worth modeling is of course the column ‘Sex’.
dataset_olympics |>
ggplot(mapping = aes(x = Height, y = is_Male)) +
geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
geom_smooth(method = 'lm', se = FALSE) +
labs(title = "Modeling a Binary Response with OLS") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 16254 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 16254 rows containing missing values (`geom_point()`).
Significance is negligent and very low for the model. A Sigmoid Function makes the most sense. We utilize the `logit` function.
\[ \text{log odds of $y$} = \log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 + \cdots + \hat{\beta}_kx_k \]
(where \(\log = \ln\), the natural log.)
sexModel <- glm(is_Male ~ Height, data=dataset_olympics,family = binomial(link = 'logit'))
sexModel$coefficients
## (Intercept) Height
## -22.9523566 0.1372508
Our equation is =
\[ \log\left(\frac{p}{1 - p}\right) = -22.952 + 0.1372\times\texttt{Height} \]
\[ Height = 22.952/0.1372 Height = 167.28863 \]
This means that when height is 167.3 units, there is a 50/50 chance of the athlete to be Male or Female.
sigmoid <- \(x) 1 / (1 + exp(-(-22.9523566 + 0.1372508 * x)))
dataset_olympics |>
ggplot(mapping = aes(x = Height, y = is_Male)) +
geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
labs(title = "Modeling a Binary Response with Sigmoid") +
scale_y_continuous(breaks = c(0, 0.5, 1)) +
theme_minimal()
## Warning: Removed 16254 rows containing missing values (`geom_point()`).