knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We structure the data to add a new column called “Medals earned”
dataset_olympics <- read_delim("dataset_olympics.csv")
## Rows: 70000 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
## dbl (5): ID, Age, Height, Weight, Year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dataset_olympics <- dataset_olympics %>% arrange(Name, Year)
dataset_olympics <- dataset_olympics %>%
group_by(Name) %>%
mutate(Total_Medal_Count = cumsum(!is.na(Medal))) %>%
ungroup()
dataset_olympics <- dataset_olympics %>%
group_by(Name) %>%
mutate(Total_Participations = row_number()) %>%
ungroup()
dataset_olympics <- dataset_olympics %>% arrange(Name, Year)
summary(dataset_olympics$Total_Medal_Count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3168 0.0000 15.0000
summary(dataset_olympics$Total_Participations)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 2.447 3.000 29.000
In the context of this data dive, the most valuable column is Total_Medal_Count and Total_Participations. These columns contain the data of cumulative wins and participations by each athlete in the history of the Olympics games till 2016
Similarly, the categorical data most important here would be ‘Medal_Status’: If an athlete won a medal or not.
There is no increased likelihood of winning if you have more past participation in the history of the Olympics Games. We create the following categories for our one-way ANOVA test:
dataset_olympics <- dataset_olympics %>%
mutate(Medal_Status = ifelse(!is.na(Medal),"Medal won","No Medal won"))
dataset_olympics %>% group_by(Medal_Status) %>%
summarize(count = n())
## # A tibble: 2 × 2
## Medal_Status count
## <chr> <int>
## 1 Medal won 9690
## 2 No Medal won 60310
dataset_olympics %>%
ggplot(aes(x = Medal_Status, y = Total_Participations)) +
geom_boxplot()
m <- aov(Total_Participations ~ Medal_Status, data = dataset_olympics)
summary(m)
## Df Sum Sq Mean Sq F value Pr(>F)
## Medal_Status 1 33 32.88 4.904 0.0268 *
## Residuals 69998 469334 6.70
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F value > 1 which implies that there is evidence that the means between the groups are not all the same. The visualizationg/boxplot above is indicative of the same. The p-value is assumed to be 0.05 and our result p-value is less than the assumed p-value therefore, we reject the null hypothesis. There is not enough evidence to conclude that there is no increased likelihood of winning if you have more past participations at the Olympics, so it will be safe to assume that having previous wins at the Olympics gives you a higher likelihood to win in a future game
Another variable that can affect medals won is the number of participations of an athlete. Lets check the correlation between the two fields.
dataset_olympics %>% ggplot(aes(x = Total_Medal_Count, y = Total_Participations)) +
geom_point()
There is a strong linear correlation between the two continuous columns.
Let’s create a linear regression for the same:
linearreg = lm(Total_Participations ~ Total_Medal_Count, data = dataset_olympics)
summary(linearreg)
##
## Call:
## lm(formula = Total_Participations ~ Total_Medal_Count, data = dataset_olympics)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.101 -1.077 -1.077 -0.077 24.586
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.076958 0.009736 213.3 <2e-16 ***
## Total_Medal_Count 1.168716 0.011035 105.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.404 on 69998 degrees of freedom
## Multiple R-squared: 0.1381, Adjusted R-squared: 0.1381
## F-statistic: 1.122e+04 on 1 and 69998 DF, p-value: < 2.2e-16
For the null hypothesis: there is no significant relationship between medal’s won and total participations, we can see that the p-value is very small. With an assumed p-value of 0.05, there is enough support to claim that there is a significant relationship between Meda;’s won and Total participation in the Olympic games.
dataset_olympics %>% ggplot(aes(x = Total_Medal_Count, y = Total_Participations)) +
geom_point() + geom_abline(aes(intercept = coef(linearreg)[1],slope= coef(linearreg)[2]),color = "purple")
## Combining both variables
linearregcomb <- lm(Total_Participations ~ Medal_Status + Total_Medal_Count, data = dataset_olympics)
summary(linearregcomb)
##
## Call:
## lm(formula = Total_Participations ~ Medal_Status + Total_Medal_Count,
## data = dataset_olympics)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3196 -1.2335 -0.4494 0.5506 23.1537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.35703 0.03056 -11.68 <2e-16 ***
## Medal_StatusNo Medal won 2.59055 0.03099 83.60 <2e-16 ***
## Total_Medal_Count 1.80638 0.01300 139.00 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.292 on 69997 degrees of freedom
## Multiple R-squared: 0.2164, Adjusted R-squared: 0.2163
## F-statistic: 9663 on 2 and 69997 DF, p-value: < 2.2e-16
This gives us a very similar p-value but with a much higher R squared value which indicated this to be a better linear regression model. Upon visualization, we see:
dataset_olympics %>% ggplot(aes(x = Total_Medal_Count, y = Total_Participations)) +
geom_point() + geom_abline(aes(intercept = coef(linearregcomb)[1],slope= coef(linearregcomb)[2]),color = "purple")
Based on this we can conclude that the combined model is a better linear
regression model for this case.
This has been eye-opening to see the significance of ANOVA and linear regression in understanding correlation between variables in a dataset!