Week 8 | Data Dive — Regression

knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Olympics Data

We structure the data to add a new column called “Medals earned”

dataset_olympics <- read_delim("dataset_olympics.csv")

## Rows: 70000 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
## dbl  (5): ID, Age, Height, Weight, Year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dataset_olympics <- dataset_olympics %>% arrange(Name, Year)
dataset_olympics <- dataset_olympics %>%
  group_by(Name) %>%
  mutate(Total_Medal_Count = cumsum(!is.na(Medal))) %>%
  ungroup()
dataset_olympics <- dataset_olympics %>%
  group_by(Name) %>%
  mutate(Total_Participations = row_number()) %>%
  ungroup()
dataset_olympics <- dataset_olympics %>% arrange(Name, Year)
summary(dataset_olympics$Total_Medal_Count)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3168  0.0000 15.0000

summary(dataset_olympics$Total_Participations)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   2.447   3.000  29.000

Most valuable Continuous Integer

In the context of this data dive, the most valuable column is Total_Medal_Count and Total_Participations. These columns contain the data of cumulative wins and participations by each athlete in the history of the Olympics games till 2016

Explanatory Categorical Variable

Similarly, the categorical data most important here would be ‘Medal_Status’: If an athlete won a medal or not.

Null Hypothesis (Ho):

There is no increased likelihood of winning if you have more past participation in the history of the Olympics Games. We create the following categories for our one-way ANOVA test:

dataset_olympics <- dataset_olympics %>%
  mutate(Medal_Status = ifelse(!is.na(Medal),"Medal won","No Medal won"))

dataset_olympics %>% group_by(Medal_Status) %>% 
  summarize(count = n())

## # A tibble: 2 × 2
##   Medal_Status count
##   <chr>        <int>
## 1 Medal won     9690
## 2 No Medal won 60310

dataset_olympics %>%
  ggplot(aes(x = Medal_Status, y = Total_Participations)) +
  geom_boxplot()

m <- aov(Total_Participations ~ Medal_Status, data = dataset_olympics)
summary(m)

##                 Df Sum Sq Mean Sq F value Pr(>F)  
## Medal_Status     1     33   32.88   4.904 0.0268 *
## Residuals    69998 469334    6.70                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F value > 1 which implies that there is evidence that the means between the groups are not all the same. The visualizationg/boxplot above is indicative of the same. The p-value is assumed to be 0.05 and our result p-value is less than the assumed p-value therefore, we reject the null hypothesis. There is not enough evidence to conclude that there is no increased likelihood of winning if you have more past participations at the Olympics, so it will be safe to assume that having previous wins at the Olympics gives you a higher likelihood to win in a future game

Continuous Variable affected by the Response Variable (Medal_Status)

Another variable that can affect medals won is the number of participations of an athlete. Lets check the correlation between the two fields.

dataset_olympics %>% ggplot(aes(x = Total_Medal_Count, y = Total_Participations)) +
  geom_point()

There is a strong linear correlation between the two continuous columns. Let’s create a linear regression for the same:

linearreg = lm(Total_Participations ~ Total_Medal_Count, data = dataset_olympics)
summary(linearreg)

## 
## Call:
## lm(formula = Total_Participations ~ Total_Medal_Count, data = dataset_olympics)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.101 -1.077 -1.077 -0.077 24.586 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.076958   0.009736   213.3   <2e-16 ***
## Total_Medal_Count 1.168716   0.011035   105.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.404 on 69998 degrees of freedom
## Multiple R-squared:  0.1381, Adjusted R-squared:  0.1381 
## F-statistic: 1.122e+04 on 1 and 69998 DF,  p-value: < 2.2e-16

For the null hypothesis: there is no significant relationship between medal’s won and total participations, we can see that the p-value is very small. With an assumed p-value of 0.05, there is enough support to claim that there is a significant relationship between Meda;’s won and Total participation in the Olympic games.

dataset_olympics %>% ggplot(aes(x = Total_Medal_Count, y = Total_Participations)) +
  geom_point() + geom_abline(aes(intercept = coef(linearreg)[1],slope= coef(linearreg)[2]),color = "purple")

## Combining both variables

linearregcomb <- lm(Total_Participations ~ Medal_Status + Total_Medal_Count, data = dataset_olympics)
summary(linearregcomb)

## 
## Call:
## lm(formula = Total_Participations ~ Medal_Status + Total_Medal_Count, 
##     data = dataset_olympics)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3196 -1.2335 -0.4494  0.5506 23.1537 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -0.35703    0.03056  -11.68   <2e-16 ***
## Medal_StatusNo Medal won  2.59055    0.03099   83.60   <2e-16 ***
## Total_Medal_Count         1.80638    0.01300  139.00   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.292 on 69997 degrees of freedom
## Multiple R-squared:  0.2164, Adjusted R-squared:  0.2163 
## F-statistic:  9663 on 2 and 69997 DF,  p-value: < 2.2e-16

This gives us a very similar p-value but with a much higher R squared value which indicated this to be a better linear regression model. Upon visualization, we see:

dataset_olympics %>% ggplot(aes(x = Total_Medal_Count, y = Total_Participations)) +
  geom_point() + geom_abline(aes(intercept = coef(linearregcomb)[1],slope= coef(linearregcomb)[2]),color = "purple")

Based on this we can conclude that the combined model is a better linear regression model for this case.

Conclusion:

This has been eye-opening to see the significance of ANOVA and linear regression in understanding correlation between variables in a dataset!