Week 10 | Data Dive

Olympics Data

We structure the data to add a new column called “Medals earned”

dataset_olympics <- read_delim("dataset_olympics.csv")

## Rows: 70000 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
## dbl  (5): ID, Age, Height, Weight, Year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dataset_olympics <- dataset_olympics %>% arrange(Name, Year)
dataset_olympics <- dataset_olympics %>%
  group_by(Name) %>%
  mutate(is_Male = ifelse(Sex == 'M',1,0)) %>%
  ungroup()
head(dataset_olympics$is_Male)

Binary Column of choice?

The best binary column of choice that would be worth modeling is of course the column ‘Sex’.

Building a Logistic Regression Model

dataset_olympics |>
  ggplot(mapping = aes(x = Height, y = is_Male)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_smooth(method = 'lm', se = FALSE) +
  labs(title = "Modeling a Binary Response with OLS") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 16254 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 16254 rows containing missing values (`geom_point()`).

Significance is negligent and very low for the model. A Sigmoid Function makes the most sense. We utilize the `logit` function.

\[ \text{log odds of $y$} = \log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 + \cdots + \hat{\beta}_kx_k \]

(where $\log = \ln$, the natural log.)

sexModel <- glm(is_Male ~ Height, data=dataset_olympics,family = binomial(link = 'logit'))
sexModel$coefficients

## (Intercept)      Height 
## -22.9523566   0.1372508

Our equation is =

\[ \log\left(\frac{p}{1 - p}\right) = -22.952 + 0.1372\times\texttt{Height} \]

\[ Height = 22.952/0.1372 Height = 167.28863 \]

This means that when height is 167.3 units, there is a 50/50 chance of the athlete to be Male or Female.

sigmoid <- \(x) 1 / (1 + exp(-(-22.9523566 + 0.1372508 * x)))

dataset_olympics |>
  ggplot(mapping = aes(x = Height, y = is_Male)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Modeling a Binary Response with Sigmoid") +
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  theme_minimal()

## Warning: Removed 16254 rows containing missing values (`geom_point()`).

Week 10 | Data Dive — GLMs

2023-10-22

Olympics Data

Binary Column of choice?

Building a Logistic Regression Model