knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(miscset)
##
## Attaching package: 'miscset'
##
## The following object is masked from 'package:dplyr':
##
## collapse
dataset_olympics <- read_delim("dataset_olympics.csv")
## Rows: 70000 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
## dbl (5): ID, Age, Height, Weight, Year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Our dataset consists of 15 columns:
summary(dataset_olympics)
##        ID             Name               Sex                Age
##  Min.   :    1   Length:70000       Length:70000       Min.   :11.00
##  1st Qu.: 9326   Class :character   Class :character   1st Qu.:21.00
##  Median :18032   Mode  :character   Mode  :character   Median :25.00
##  Mean   :18082                                         Mean   :25.64
##  3rd Qu.:26978                                         3rd Qu.:28.00
##  Max.   :35658                                         Max.   :88.00
##                                                        NA's   :2732
##      Height          Weight           Team               NOC
##  Min.   :127.0   Min.   : 25.0   Length:70000       Length:70000
##  1st Qu.:168.0   1st Qu.: 61.0   Class :character   Class :character
##  Median :175.0   Median : 70.0   Mode  :character   Mode  :character
##  Mean   :175.5   Mean   : 70.9
##  3rd Qu.:183.0   3rd Qu.: 79.0
##  Max.   :223.0   Max.   :214.0
##  NA's   :16254   NA's   :17101
##      Games               Year          Season              City
##  Length:70000       Min.   :1896   Length:70000       Length:70000
##  Class :character   1st Qu.:1960   Class :character   Class :character
##  Mode  :character   Median :1984   Mode  :character   Mode  :character
##                     Mean   :1978
##                     3rd Qu.:2002
##                     Max.   :2016
##      Sport              Event              Medal
##  Length:70000       Length:70000       Length:70000
##  Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character
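The summary reports missing values (NA's) for Age, Height, and Weight. As a minimal sketch, the same counts could be tabulated directly per column. The tibble `toy_olympics` below is a hypothetical stand-in with made-up values, not the real data; the same calls should apply to `dataset_olympics`.

```r
library(dplyr)

# Hypothetical toy data standing in for dataset_olympics (values are made up)
toy_olympics <- tibble(
  ID     = c(1, 2, 3, 4),
  Age    = c(24, NA, 21, 30),
  Height = c(180, NA, NA, 175),
  Weight = c(80, 60, NA, NA)
)

# Number of missing values in every column
na_counts <- toy_olympics |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Share of rows that are missing, per column
na_share <- toy_olympics |>
  summarise(across(everything(), ~ mean(is.na(.x))))

print(na_counts)
print(na_share)
```

Looking at the share rather than the raw count makes it easier to judge whether a column such as Height or Weight is complete enough for later analysis.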
A few of the columns, such as Age, Height, Weight, and Name, are self-explanatory. At first glance, however, the other columns raise the following questions:
What does NOC refer to?
What is the difference between Sport and Event?
Does City refer to the athlete's home city or the location of the Olympic event?
Does Team refer to country? If so, how is it different from NOC?
We can deduce that NOC, Team, Sport, Event, and City are unclear without the provided documentation. After reading the documentation (https://www.kaggle.com/datasets/bhanupratapbiswas/olympic-data/data), we learn that:
Team refers to Team name
NOC refers to National Olympic Committee 3-letter code
City refers to Host City
Sport refers to Sport
Event refers to Event
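One way to see how Team differs from NOC is to count how many distinct Team names appear under each NOC code. The sketch below uses a hypothetical tibble `toy_teams` with illustrative values; on the real data, `dataset_olympics` would take its place.

```r
library(dplyr)

# Hypothetical toy data: Team names paired with NOC codes (illustrative values)
toy_teams <- tibble(
  Team = c("Denmark", "Denmark/Sweden", "China", "China"),
  NOC  = c("DEN", "DEN", "CHN", "CHN")
)

# Number of distinct Team names recorded under each NOC code
teams_per_noc <- toy_teams |>
  group_by(NOC) |>
  summarise(n_teams = n_distinct(Team)) |>
  arrange(desc(n_teams))

print(teams_per_noc)
```

If some NOC codes map to more than one Team name, Team cannot simply be read as "country", which would explain why the dataset carries both columns.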
The documentation settles our questions about Team, NOC, and City. However, Sport and Event are described only by their own names, and the documentation does not explain how the two variables relate or differ. A sample of the two columns shows:
dataset_olympics[,c("Sport","Event")]
## # A tibble: 70,000 × 2
##    Sport         Event
##    <chr>         <chr>
##  1 Basketball    Basketball Men's Basketball
##  2 Judo          Judo Men's Extra-Lightweight
##  3 Football      Football Men's Football
##  4 Tug-Of-War    Tug-Of-War Men's Tug-Of-War
##  5 Speed Skating Speed Skating Women's 500 metres
##  6 Speed Skating Speed Skating Women's 1,000 metres
##  7 Speed Skating Speed Skating Women's 500 metres
##  8 Speed Skating Speed Skating Women's 1,000 metres
##  9 Speed Skating Speed Skating Women's 500 metres
## 10 Speed Skating Speed Skating Women's 1,000 metres
## # ℹ 69,990 more rows
This shows that the two columns are related: each Event is a specific competition within a given Sport, and in the rows shown, the Event name begins with the name of its Sport.
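The claim that every Event belongs to exactly one Sport can be checked rather than read off ten rows. A minimal sketch, assuming a hypothetical tibble `toy_events` built from the Event names printed above:

```r
library(dplyr)

# Hypothetical toy data built from the sample rows (not the full dataset)
toy_events <- tibble(
  Sport = c("Judo", "Speed Skating", "Speed Skating", "Speed Skating"),
  Event = c("Judo Men's Extra-Lightweight",
            "Speed Skating Women's 500 metres",
            "Speed Skating Women's 1,000 metres",
            "Speed Skating Women's 500 metres")
)

# Events listed under more than one Sport; empty if Event nests within Sport
ambiguous_events <- toy_events |>
  distinct(Sport, Event) |>
  count(Event, name = "n_sports") |>
  filter(n_sports > 1)

# Does every Event name begin with the name of its Sport?
prefix_ok <- all(startsWith(toy_events$Event, toy_events$Sport))

print(ambiguous_events)
print(prefix_ok)
```

An empty `ambiguous_events` together with `prefix_ok` being `TRUE` would support reading Sport as the parent category of Event.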
# Count rows per Sport for the 2016 Summer Games.
# Each row is one athlete's entry in one Event, so this counts entries, not Events.
eventCount <- dataset_olympics |>
  filter(Games == "2016 Summer") |>
  group_by(Sport) |>
  summarise(Entries = n())

# Bar chart of athlete entries per Sport
ggplot(data = eventCount, aes(x = Sport, y = Entries)) +
  geom_col(fill = "skyblue") +
  labs(x = "Olympic Sport", y = "Number of Athletes in different Events", title = "Athletes in Events per Sport") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
For the 2016 Summer Games, the dataset records more than 700 athlete entries in the Sport of Athletics, spread across its different Events.
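Because an athlete who competes in several Events appears in several rows, the row count per Sport can exceed the number of individual athletes. A minimal sketch of the difference, assuming a hypothetical tibble `toy_entries` with made-up IDs:

```r
library(dplyr)

# Hypothetical toy data: athlete 1 and athlete 3 each appear in several rows
toy_entries <- tibble(
  ID    = c(1, 1, 2, 3, 3, 3),
  Games = rep("2016 Summer", 6),
  Sport = c("Athletics", "Athletics", "Athletics", "Archery", "Archery", "Judo")
)

# Rows (entries) versus distinct athlete IDs per Sport
athlete_counts <- toy_entries |>
  filter(Games == "2016 Summer") |>
  group_by(Sport) |>
  summarise(
    entries  = n(),
    athletes = n_distinct(ID)
  )

print(athlete_counts)
```

Which of the two numbers is appropriate depends on the question: `entries` measures participation volume, while `athletes` measures head count.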
# Count the distinct Events held under each Sport at the 2016 Summer Games
uniqueEventCount <- dataset_olympics |>
  filter(Games == "2016 Summer") |>
  distinct(Sport, Event) |>
  group_by(Sport) |>
  summarise(Event = n())
head(uniqueEventCount)
## # A tibble: 6 × 2
##   Sport            Event
##   <chr>            <int>
## 1 Archery              4
## 2 Athletics           47
## 3 Badminton            5
## 4 Basketball           2
## 5 Beach Volleyball     2
## 6 Boxing              13
ggplot(data = uniqueEventCount, aes(x = Sport, y = Event)) +
  geom_col(fill = "purple") +
  labs(x = "Olympic Sport", y = "Number of Events", title = "Events per Sport") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
There are 4 unique Events under Archery and 47 under Athletics. This clarifies the relation between the two columns: each Sport groups several Events. We can also contrast the number of athletes per Sport with the number of Events per Sport. Using this information, we can form new hypotheses about the dataset!
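The contrast between athletes per Sport and Events per Sport can be put in a single table. The sketch below computes both counts in one pass and adds their ratio. The tibble `toy_2016` and the column `entries_per_event` are hypothetical names introduced here for illustration.

```r
library(dplyr)

# Hypothetical toy data for a single Games (values are made up)
toy_2016 <- tibble(
  ID    = c(1, 2, 3, 4, 5, 5),
  Sport = c("Archery", "Archery", "Athletics", "Athletics", "Athletics", "Athletics"),
  Event = c("Archery Men's Individual", "Archery Men's Individual",
            "Athletics Men's 100 metres", "Athletics Men's 100 metres",
            "Athletics Women's Marathon", "Athletics Women's 10,000 metres")
)

# Entries and distinct Events per Sport, plus the average entries per Event
sport_summary <- toy_2016 |>
  group_by(Sport) |>
  summarise(
    entries = n(),
    events  = n_distinct(Event)
  ) |>
  mutate(entries_per_event = entries / events)

print(sport_summary)
```

A ratio like this could serve as a starting point for a hypothesis, for example whether Sports with many Events also have unusually large fields per Event.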