knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(miscset)
## 
## Attaching package: 'miscset'
## 
## The following object is masked from 'package:dplyr':
## 
##     collapse

Olympics Data

dataset_olympics <- read_delim("dataset_olympics.csv")
## Rows: 70000 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
## dbl  (5): ID, Age, Height, Weight, Year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Columns

Our dataset consists of 15 columns:

summary(dataset_olympics)
##        ID            Name               Sex                 Age       
##  Min.   :    1   Length:70000       Length:70000       Min.   :11.00  
##  1st Qu.: 9326   Class :character   Class :character   1st Qu.:21.00  
##  Median :18032   Mode  :character   Mode  :character   Median :25.00  
##  Mean   :18082                                         Mean   :25.64  
##  3rd Qu.:26978                                         3rd Qu.:28.00  
##  Max.   :35658                                         Max.   :88.00  
##                                                        NA's   :2732   
##      Height          Weight          Team               NOC           
##  Min.   :127.0   Min.   : 25.0   Length:70000       Length:70000      
##  1st Qu.:168.0   1st Qu.: 61.0   Class :character   Class :character  
##  Median :175.0   Median : 70.0   Mode  :character   Mode  :character  
##  Mean   :175.5   Mean   : 70.9                                        
##  3rd Qu.:183.0   3rd Qu.: 79.0                                        
##  Max.   :223.0   Max.   :214.0                                        
##  NA's   :16254   NA's   :17101                                        
##     Games                Year         Season              City          
##  Length:70000       Min.   :1896   Length:70000       Length:70000      
##  Class :character   1st Qu.:1960   Class :character   Class :character  
##  Mode  :character   Median :1984   Mode  :character   Mode  :character  
##                     Mean   :1978                                        
##                     3rd Qu.:2002                                        
##                     Max.   :2016                                        
##                                                                         
##     Sport              Event              Medal          
##  Length:70000       Length:70000       Length:70000      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
## 

A few of the columns are self-explanatory, such as Age, Height, Weight, and Name. However, at first glance the others raise the following questions:

  • What does NOC refer to?

  • What is the difference between Sport and Event?

  • Does City refer to the athlete's home city or to the location of the Olympic Games?

  • Does Team refer to country? If so, how is it different from NOC?

NOC, Team, Sport, Event, and City cannot be interpreted with confidence without the provided documentation. After reading the documentation (https://www.kaggle.com/datasets/bhanupratapbiswas/olympic-data/data), we learn that:

  • Team refers to Team name

  • NOC refers to National Olympic Committee 3-letter code

  • City refers to Host City

  • Sport refers to Sport

  • Event refers to Event
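The documentation still leaves open how Team differs from NOC. One way to probe this in the data is to count how many distinct Team names appear under each NOC code. The sketch below uses a small hypothetical tibble named `toy` (its rows are illustrative only, not taken from the file); with the real data, `dataset_olympics` would take its place.

```r
library(dplyr)

# Hypothetical stand-in for dataset_olympics; values are illustrative only
toy <- tibble::tibble(
  NOC  = c("USA", "USA", "DEN", "DEN", "NED"),
  Team = c("United States", "United States-1", "Denmark",
           "Denmark/Sweden", "Netherlands")
)

# Count the distinct Team names recorded under each NOC code
teams_per_noc <- toy |>
  distinct(NOC, Team) |>
  count(NOC, name = "n_teams") |>
  arrange(desc(n_teams))

teams_per_noc
```

If some NOC codes map to more than one Team name, that would suggest that NOC identifies the committee while Team is a free-form label for the competing squad.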

The documentation answers our questions about Team, NOC, and City. However, its descriptions of Sport and Event simply repeat the column names, and it does not explain how the two variables relate or differ. A sample of the two columns shows:

dataset_olympics[, c("Sport", "Event")]
## # A tibble: 70,000 × 2
##    Sport         Event                             
##    <chr>         <chr>                             
##  1 Basketball    Basketball Men's Basketball       
##  2 Judo          Judo Men's Extra-Lightweight      
##  3 Football      Football Men's Football           
##  4 Tug-Of-War    Tug-Of-War Men's Tug-Of-War       
##  5 Speed Skating Speed Skating Women's 500 metres  
##  6 Speed Skating Speed Skating Women's 1,000 metres
##  7 Speed Skating Speed Skating Women's 500 metres  
##  8 Speed Skating Speed Skating Women's 1,000 metres
##  9 Speed Skating Speed Skating Women's 500 metres  
## 10 Speed Skating Speed Skating Women's 1,000 metres
## # ℹ 69,990 more rows

This reveals that the two columns are related: each Event is a specific competition within a given Sport, and the Event name begins with the name of its Sport.
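The relation suggested by the sample can be checked rather than read off ten rows. The following is a minimal sketch, assuming a tibble `toy` built from four of the rows printed above; running the same steps on `dataset_olympics` would test the full table.

```r
library(dplyr)

# Four rows copied from the sample output above
toy <- tibble::tibble(
  Sport = c("Judo", "Speed Skating", "Speed Skating", "Speed Skating"),
  Event = c("Judo Men's Extra-Lightweight",
            "Speed Skating Women's 500 metres",
            "Speed Skating Women's 1,000 metres",
            "Speed Skating Women's 500 metres")
)

# For each Event, count the distinct Sports it is listed under
sports_per_event <- toy |>
  distinct(Sport, Event) |>
  count(Event, name = "n_sports")

# TRUE when every Event belongs to exactly one Sport
one_sport_each <- all(sports_per_event$n_sports == 1)

# TRUE when every Event name starts with the name of its Sport
prefixed <- all(startsWith(toy$Event, toy$Sport))

one_sport_each
prefixed
```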

# n() counts rows, so each value is the number of athlete entries
# (one row per athlete per Event) recorded for that Sport
eventCount <- dataset_olympics |>
  filter(Games == '2016 Summer') |>
  group_by(Sport) |>
  summarise(Event = n())

ggplot(data = eventCount, aes(x = Sport, y = Event)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(x = "Olympic Sport", y = "Number of Athletes in different Events", title = "Athletes in Events per Sport") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

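In the 2016 Summer Games, the Sport of Athletics accounts for more than 700 rows in this dataset. Because each row is one athlete in one Event, an athlete who entered several Events is counted once per Event.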
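Since `n()` counts rows, it counts entries rather than people. A possible way to separate the two, assuming ID identifies an athlete, is to compute `n_distinct(ID)` alongside `n()`. The tibble `toy` below is hypothetical and only illustrates the difference.

```r
library(dplyr)

# Hypothetical rows: athlete 1 enters two Athletics events
toy <- tibble::tibble(
  ID    = c(1, 1, 2, 3),
  Games = "2016 Summer",
  Sport = c("Athletics", "Athletics", "Athletics", "Archery")
)

counts <- toy |>
  filter(Games == "2016 Summer") |>
  group_by(Sport) |>
  summarise(
    Entries  = n(),            # one per row (athlete-event pair)
    Athletes = n_distinct(ID)  # each athlete counted once per Sport
  )

counts
```

In this toy example Athletics has three entries but only two athletes, which is the gap the row count hides.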

# Keep one row per Sport/Event pair, then count the Events in each Sport
uniqueEventCount <- dataset_olympics |>
  filter(Games == '2016 Summer') |>
  distinct(Sport, Event) |>
  count(Sport, name = "Event")

head(uniqueEventCount)
## # A tibble: 6 × 2
##   Sport            Event
##   <chr>            <int>
## 1 Archery              4
## 2 Athletics           47
## 3 Badminton            5
## 4 Basketball           2
## 5 Beach Volleyball     2
## 6 Boxing              13
ggplot(data = uniqueEventCount, aes(x = Sport, y = Event)) +
  geom_bar(stat = "identity", fill = "purple") +
  labs(x = "Olympic Sport", y = "Number of Events", title = "Events per Sport") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

We can see that there are 4 unique Events under Archery and 47 unique Events under Athletics, which confirms that a Sport groups several Events. Comparing the two charts also shows that the number of athlete entries per Sport differs from the number of Events per Sport. We can use this contrast to form new hypotheses about the dataset.
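To put the two counts side by side, both can be computed in a single summary and divided to give entries per Event. This is only a sketch: the tibble `toy` and the column names Entries, Events, and EntriesPerEvent are hypothetical, and `dataset_olympics` would replace `toy` in practice.

```r
library(dplyr)

# Hypothetical rows shaped like dataset_olympics; values are illustrative only
toy <- tibble::tibble(
  Games = "2016 Summer",
  Sport = c("Archery", "Archery",
            "Athletics", "Athletics", "Athletics", "Athletics"),
  Event = c("Archery Men's Individual", "Archery Women's Individual",
            "Athletics Men's 100 metres", "Athletics Men's 100 metres",
            "Athletics Women's Marathon", "Athletics Women's Marathon")
)

per_sport <- toy |>
  filter(Games == "2016 Summer") |>
  group_by(Sport) |>
  summarise(
    Entries = n(),                # rows, i.e. athlete entries
    Events  = n_distinct(Event)   # unique Events in the Sport
  ) |>
  mutate(EntriesPerEvent = Entries / Events)

per_sport
```

A ratio like this could support hypotheses such as whether Sports with many Events also attract more entries per Event.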