Investigating the probability of winning Olympic Gold

An exploration of the athlete dataset from Rio 2016 Olympics

Olivia Yammouni - s3832344

Introduction

The Olympics is an international event that brings together the best athletes in the world every 4 years.
Athletes compete for titles across a number of different sporting disciplines to be ultimately crowned the best in their field globally.
The 2020 Olympics was set to be held in Tokyo however the global outbreak of COVID-19 meant that the renowned event was to be postponed until a vaccine was created and athlete safety could be guaranteed at the event.
Now we arrive in 2021 where vaccines have been rolled out globally and The Olympics has been given the green light to go ahead over July and August ths year.

Introduction Cont.

With the 2021 Olympic games nearing, I have choen to investigate the question of the probability of winning gold at the Rio 2016 Olympics given age range, gender and sport. For each athlete, I will consider a number of factors including: - Sex - Age - Height - Weight - Sport - Medal Count

A lot of emphasis is always put on an athlete being in their “prime” age for competition therefore these variables will be used to calculate probability of winning gold, given an athlete’s age and gender for their respective sport. I have chosen the probability investigation for these specific variables as I believe the probability of winning gold in 2021 may have some different outcomes to the 2016 dataset as a result of the delayed olympic event. This may mean that some athletes that scraped into the ideal age range for their sport in 2016 might now overshoot this range in 2021 as they are a year older. Additionally, athletes are thought to be more likely to succeed if they have genetic advantages such as height in basketball or weight in weightlifting, thus genetic factors will be analysed as well. I am interested specifically in female athletes who are in their “prime age” as this group most closely reflects my personal athlete profile. I am also specifically interested in the weight variable as this can be controlled to some extent by athletes as height and age cannot.

Problem Statement

What is the probability of winning gold at the 2016 Olympics based on being a “prime aged” female? And is it possible to predict weight in specific sports for future olympic athletes?
I will use probability laws to calculate the probability for a number of different age ranges and for height and weight
I will use probability distributions to visualise the spread of data and make accurate estimations based on mean, median and standard deviation
The final output will be a number of probability tables with a number of age ranges, height (greater than or less than the median) and weight as the columns and each country as the rows
To continue my investigation I will then compare my results with the 2012 olympic athlete dataset
- Here I will take a sample of this dataset and assign likelihoods of winning as per probability in the 2016 dataset
- I will use sampling to estimate confidence intervals
- I will also use critical value hypothesis testing to compare the control group (2016 dataset) to the new dataset (2012)

Data

In this assigment I am using 2 complete files (no samples) called 2016athletes.csv and 2012athletes.csv. They were pulled from Kaggle and TopEndSports respectively with URL’s as follows: https://www.kaggle.com/rio2016/olympic-games & https://www.topendsports.com/events/summer/science/anthropometry-2012.htm

To preprocess the data, I will first load the dataset and do some exploratory data analysis: - Review format of the data - Rename columns in each dataset to match - Drop the ID column as it adds no benefit to finding the problem statement (noise). Also drop additional columns in 2012 dataset that aren’t in 2016 dataset - Check shape - Check variable types and change to factor and date variables for some - Convert M to male and F to Female in 2012 set to match 2016 set - Convert height to cm for 2016 dataset - Observe lowest and highest date of birth - no outliers observed - Check for missing values and input averages for that gender and sport - Check for obvious outliers and complete a google search these athletes to see if the data is entered incorrectly or if they are a physical anomaly - Complete feature construction and make a column with bins for age groups - each athlete is assigned an age range

#import data using the readr package, (file is located in the working directory)
athletes2016 <- read_csv("2016athletes.csv")
athletes2012 <- read_csv("2012athletes.csv")


#view the header of the data tables and view top 5 rows
head(athletes2016)

head(athletes2012)

Data Cont.

###understand and inspect the data - Data pre-processing

#check data types of the variables
lapply(athletes2016, class)

## $id
## [1] "numeric"
## 
## $name
## [1] "character"
## 
## $nationality
## [1] "character"
## 
## $sex
## [1] "character"
## 
## $dob
## [1] "character"
## 
## $height
## [1] "numeric"
## 
## $weight
## [1] "numeric"
## 
## $sport
## [1] "character"
## 
## $gold
## [1] "numeric"
## 
## $silver
## [1] "numeric"
## 
## $bronze
## [1] "numeric"

lapply(athletes2012, class)

## $Name
## [1] "character"
## 
## $Country
## [1] "character"
## 
## $Age
## [1] "numeric"
## 
## $`Height, cm`
## [1] "numeric"
## 
## $Weight
## [1] "numeric"
## 
## $Sex
## [1] "character"
## 
## $`Date of birth`
## [1] "character"
## 
## $`Birth Month`
## [1] "numeric"
## 
## $`Place of birth`
## [1] "character"
## 
## $Gold
## [1] "numeric"
## 
## $Silver
## [1] "numeric"
## 
## $Bronze
## [1] "numeric"
## 
## $Total
## [1] "numeric"
## 
## $Sport
## [1] "character"
## 
## $Event
## [1] "character"

#check dimensions
dim(athletes2016)

## [1] 11538    11

dim(athletes2012)

## [1] 10384    15

#check the structure
str(athletes2016)

## tibble [11,538 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id         : num [1:11538] 7.36e+08 5.32e+08 4.36e+08 5.21e+08 3.39e+07 ...
##  $ name       : chr [1:11538] "A Jesus Garcia" "A Lam Shin" "Aaron Brown" "Aaron Cook" ...
##  $ nationality: chr [1:11538] "ESP" "KOR" "CAN" "MDA" ...
##  $ sex        : chr [1:11538] "male" "female" "male" "male" ...
##  $ dob        : chr [1:11538] "10/17/69" "9/23/86" "5/27/92" "1/2/91" ...
##  $ height     : num [1:11538] 1.72 1.68 1.98 1.83 1.81 1.8 2.05 1.93 1.8 1.65 ...
##  $ weight     : num [1:11538] 64 56 79 80 71 67 98 100 62 54 ...
##  $ sport      : chr [1:11538] "athletics" "fencing" "athletics" "taekwondo" ...
##  $ gold       : num [1:11538] 0 0 0 0 0 0 0 0 0 0 ...
##  $ silver     : num [1:11538] 0 0 0 0 0 0 0 0 0 0 ...
##  $ bronze     : num [1:11538] 0 0 1 0 0 0 1 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   name = col_character(),
##   ..   nationality = col_character(),
##   ..   sex = col_character(),
##   ..   dob = col_character(),
##   ..   height = col_double(),
##   ..   weight = col_double(),
##   ..   sport = col_character(),
##   ..   gold = col_double(),
##   ..   silver = col_double(),
##   ..   bronze = col_double()
##   .. )

str(athletes2012)

## tibble [10,384 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Name          : chr [1:10384] "Hiroshi Hoketsu" "Ian Millar" "Afanasijs Kuzmins" "Carl Bouckaert" ...
##  $ Country       : chr [1:10384] "Japan" "Canada" "Latvia" "Belgium" ...
##  $ Age           : num [1:10384] 71 65 65 58 57 57 56 56 56 54 ...
##  $ Height, cm    : num [1:10384] 168 185 178 181 173 185 177 190 180 175 ...
##  $ Weight        : num [1:10384] 61 76 89 78 63 88 76 82 78 77 ...
##  $ Sex           : chr [1:10384] "M" "M" "M" "M" ...
##  $ Date of birth : chr [1:10384] "3/28/1941" "1/6/1947" "3/22/1947" "4/19/1954" ...
##  $ Birth Month   : num [1:10384] 3 1 3 4 12 3 9 3 6 12 ...
##  $ Place of birth: chr [1:10384] "CHUO (JPN)" "Halifax (CAN)" "BIKERNIEKI PARISH (LAT)" "WAREGEM (BEL)" ...
##  $ Gold          : num [1:10384] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Silver        : num [1:10384] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Bronze        : num [1:10384] 0 0 0 0 0 0 0 1 0 0 ...
##  $ Total         : num [1:10384] 0 0 0 0 0 0 0 1 0 0 ...
##  $ Sport         : chr [1:10384] "Equestrian" "Equestrian" "Shooting" "Equestrian" ...
##  $ Event         : chr [1:10384] "Individual Dressage, WHISPER" "Individual Jumping, Team Jumping, STAR POWER" "Men's 25m Rapid Fire Pistol" "Individual Eventing, Team Eventing, CYRANO Z" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Name = col_character(),
##   ..   Country = col_character(),
##   ..   Age = col_double(),
##   ..   `Height, cm` = col_double(),
##   ..   Weight = col_double(),
##   ..   Sex = col_character(),
##   ..   `Date of birth` = col_character(),
##   ..   `Birth Month` = col_double(),
##   ..   `Place of birth` = col_character(),
##   ..   Gold = col_double(),
##   ..   Silver = col_double(),
##   ..   Bronze = col_double(),
##   ..   Total = col_double(),
##   ..   Sport = col_character(),
##   ..   Event = col_character()
##   .. )

#drop additional columns in both sets

#first drop ID in 2016
athletes2016 <- select(athletes2016, -1)
head(athletes2016)

#next drop columns in 2012 dataframe that are not in 2016 dataframe
athletes2012 <- select(athletes2012, -3, -8, -9, -13, -15)
head(athletes2012)

#check current column names
colnames(athletes2016)

##  [1] "name"        "nationality" "sex"         "dob"         "height"     
##  [6] "weight"      "sport"       "gold"        "silver"      "bronze"

colnames(athletes2012)

##  [1] "Name"          "Country"       "Height, cm"    "Weight"       
##  [5] "Sex"           "Date of birth" "Gold"          "Silver"       
##  [9] "Bronze"        "Sport"

#rename the columns
colnames(athletes2016) <- c("Athlete Name", "Nationality", "Gender", "Date_of_Birth" ,
                        "Height", "Weight", "Sport", 
                        "Gold Medals","Silver Medals","Bronze Medals" )

colnames(athletes2012) <- c("Athlete Name", "Nationality",
                        "Height", "Weight", "Gender", "Date_of_Birth",
                        "Gold Medals","Silver Medals","Bronze Medals", "Sport")

#check the new column names
colnames(athletes2016)

##  [1] "Athlete Name"  "Nationality"   "Gender"        "Date_of_Birth"
##  [5] "Height"        "Weight"        "Sport"         "Gold Medals"  
##  [9] "Silver Medals" "Bronze Medals"

colnames(athletes2012)

##  [1] "Athlete Name"  "Nationality"   "Height"        "Weight"       
##  [5] "Gender"        "Date_of_Birth" "Gold Medals"   "Silver Medals"
##  [9] "Bronze Medals" "Sport"

#convert some categorical variables to factors
athletes2016$Gender <- as.factor(athletes2016$Gender)
athletes2012$Gender <- as.factor(athletes2012$Gender)
athletes2016$Sport <- as.factor(athletes2016$Sport)
athletes2012$Sport <- as.factor(athletes2012$Sport)


#Transform date of birth from character to date
athletes2016$Date_of_Birth <-as.Date(athletes2016$Date_of_Birth, format = "%m/%d/%y")
class(athletes2016$Date_of_Birth)

## [1] "Date"

x <-athletes2016$Date_of_Birth
foo <- function(x, year=1950){
  m <- year(x) %% 100
  year(x) <- ifelse(m > year %% 100, 1900+m, 2000+m)
  x
}

athletes2016$Date_of_Birth <- foo(x, 1950)
athletes2016

athletes2012$Date_of_Birth <-as.Date(athletes2012$Date_of_Birth, format = "%m/%d/%Y")
class(athletes2012$Date_of_Birth)

## [1] "Date"

#check for null values
lapply(athletes2016,function(x) { length(which(is.na(x)))})

## $`Athlete Name`
## [1] 0
## 
## $Nationality
## [1] 0
## 
## $Gender
## [1] 0
## 
## $Date_of_Birth
## [1] 1
## 
## $Height
## [1] 330
## 
## $Weight
## [1] 659
## 
## $Sport
## [1] 0
## 
## $`Gold Medals`
## [1] 0
## 
## $`Silver Medals`
## [1] 0
## 
## $`Bronze Medals`
## [1] 0

lapply(athletes2012,function(x) { length(which(is.na(x)))})

## $`Athlete Name`
## [1] 0
## 
## $Nationality
## [1] 0
## 
## $Height
## [1] 561
## 
## $Weight
## [1] 1280
## 
## $Gender
## [1] 0
## 
## $Date_of_Birth
## [1] 20
## 
## $`Gold Medals`
## [1] 0
## 
## $`Silver Medals`
## [1] 0
## 
## $`Bronze Medals`
## [1] 0
## 
## $Sport
## [1] 0

#calculate mean height and weight per sport and gender and fill empty values with these
athletes2016 <- athletes2016 %>% 
                      group_by(Gender, Sport) %>% 
                      mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))

athletes2016 <- athletes2016 %>% 
                     group_by(Gender, Sport) %>% 
                     mutate(Height = ifelse(is.na(Height), mean(Height, na.rm = TRUE), Height))


#do the same for the 2012 dataset
athletes2012 <- athletes2012 %>% 
                      group_by(Gender, Sport) %>% 
                      mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))

athletes2012 <- athletes2012 %>% 
                     group_by(Gender, Sport) %>% 
                     mutate(Height = ifelse(is.na(Height), mean(Height, na.rm = TRUE), Height))


#all boxing athletes have no entry for weight so I will replace these values with the average weight per their nationality and gender
athletes2016 <- athletes2016 %>% 
                      group_by(Nationality, Gender) %>% 
                      mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))

lapply(athletes2016,function(x) { length(which(is.na(x)))})

## $`Athlete Name`
## [1] 0
## 
## $Nationality
## [1] 0
## 
## $Gender
## [1] 0
## 
## $Date_of_Birth
## [1] 1
## 
## $Height
## [1] 0
## 
## $Weight
## [1] 0
## 
## $Sport
## [1] 0
## 
## $`Gold Medals`
## [1] 0
## 
## $`Silver Medals`
## [1] 0
## 
## $`Bronze Medals`
## [1] 0

#we still have one DOB entry empty, I will analyse this particular athlete
missing_dob <- athletes2016[rowSums(is.na(athletes2016)) > 0,]
missing_dob

#The athlete is a well known russian sailor and his date of birth is publicly available so I will replace it with the public data
athletes2016[is.na(athletes2016)] <- c(as.Date("12/25/1987", format = "%m/%d/%Y"))

#convert height to cm for 2016 dataset
athletes2016$Height <- (athletes2016$Height) * 100
head(athletes2016)

#convert M to male and F to Female in 2012 set to match 2016 set
athletes2012$Gender <- as.character(athletes2012$Gender)
athletes2012$Gender[athletes2012$Gender == "M"] <- "male"
athletes2012$Gender[athletes2012$Gender == "F"] <- "female"
athletes2012$Gender <- as.factor(athletes2012$Gender)
head(athletes2012)

#observe Height variable range
min(athletes2016[,"Height"])

## [1] 121

which(athletes2016$Height == 121)

## [1] 7847

athletes2016[7847,]

max(athletes2016[,"Height"])

## [1] 221

which(athletes2016$Height == 221)

## [1] 5691

athletes2016[5691,]

#after searching online, this value of height is incorrect. I will replace it with his actual height 196cm
athletes2016[5691, 5] = 196
athletes2016[5691,]

#the new maximum is a Chinese basketballer and is correct
which(athletes2016$Height == 217)

## [1] 8770

athletes2016[8770,]

#Observe range of weight - no need to replace or change
min(athletes2016[,6])

## [1] 31

which(athletes2016$Weight == 31)

## [1] 3430

athletes2016[3430,]

max(athletes2016[,6])

## [1] 170

which(athletes2016$Weight == 170)

## [1] 1388 2338

athletes2016[2338,]

#Observe lowest and highest date of birth - no outliers observed
min(athletes2016$Date_of_Birth)

## [1] "1954-05-20"

max(athletes2016$Date_of_Birth)

## [1] "2002-11-26"

#create age bracket bins - we know the youngest athlete is 12 and the oldest is 62
#brackets will be 12 - 21, 22 - 31, 32 - 41, 42 - 51 and 52 - 62

#first add a column for age
a <- as.Date(athletes2016$Date_of_Birth)
b <- as.Date("8/5/2016", format = "%m/%d/%Y")

athletes2016$Age <- floor(age_calc(a, enddate = b, units = "years", precise = FALSE))
athletes2016

#Now create age bins
athletes2016$AgeBracket <- cut(athletes2016$Age, breaks=c(12, 21, 31, 41, 51, 61, 71), labels=c("12-21","22-31", "32-41", "42-51", "52-61", "62+"))
athletes2016

#count how many athletes fall into each ge bracket
count(athletes2016, vars = AgeBracket)

#we can see that most athletes (7907) are 22-31 years old.

#check null values
lapply(athletes2016,function(x) { length(which(is.na(x)))})

## $`Athlete Name`
## [1] 0
## 
## $Nationality
## [1] 0
## 
## $Gender
## [1] 0
## 
## $Date_of_Birth
## [1] 0
## 
## $Height
## [1] 0
## 
## $Weight
## [1] 0
## 
## $Sport
## [1] 0
## 
## $`Gold Medals`
## [1] 0
## 
## $`Silver Medals`
## [1] 0
## 
## $`Bronze Medals`
## [1] 0
## 
## $Age
## [1] 0
## 
## $AgeBracket
## [1] 0

lapply(athletes2012,function(x) { length(which(is.na(x)))})

## $`Athlete Name`
## [1] 0
## 
## $Nationality
## [1] 0
## 
## $Height
## [1] 0
## 
## $Weight
## [1] 804
## 
## $Gender
## [1] 0
## 
## $Date_of_Birth
## [1] 20
## 
## $`Gold Medals`
## [1] 0
## 
## $`Silver Medals`
## [1] 0
## 
## $`Bronze Medals`
## [1] 0
## 
## $Sport
## [1] 0

Details of the 2016athletes variables [1]:

id: numeric variable, nominal, unique athlete ID
name: athlete full name, nominal character variable
nationality: athlete’s nationality, nominal character variable, represented as a 3-letter code for each country
sex: Male or Female categrical variable
dob: date of birth for each athlete in the form of mm/dd/yy (1 missing value)
height: height of athlete in metres, numerical variable (330 missing values)
weight: weight of athlete in kilograms, numerical variable (659 missing values)
sport: the sport the athlete is competing in, categorical variable
gold: tally of gold medals won by the athlete at the games, numerical variable
silver: tally of silver medals won by the athlete at the games, numerical variable
bronze:tally of bronze medals won by the athlete at the games, numerical variable

Details of the 2012athletes variables [2]:

Name: athlete full name, nominal character variable
Country: athlete’s nationality, nominal character variable, full name of the country of representation
Age: age at time of competition
Height, cm: height of athlete in centimetres, numerical variable (561 missing values)
Weight: weight of athlete in kilograms, numerical variable (1280 missing values)
Sex: Male or Female categrical variable represented as M or F (26 missing values)
Date of birth: date of birth for each athlete in the form of mm/dd/yyyy
Birth Month: number of the month of the year (1-12) that the athlete was born, ordinal variable
Place of birth: city of birth for the athlete, categorical variable represented as city(XXX) (where X = 3-letter country code) (2690 missing values)
Gold: tally of gold medals won by the athlete at the games, numerical variable
Silver: tally of silver medals won by the athlete at the games, numerical variable
Bronze: tally of bronze medals won by the athlete at the games, numerical variable
Total: total tally of all medals won by the athlete at the games, numerical variable
Sport: the sport the athlete is competing in, categorical variable
Event: event within the sport that the athlete competed in, categorical variable

Sport and sex will need to be factor variables, date will need to be in the date format and birth month for the second dataset will also have to be factor. The levels for these are self explanatory, it will be a simple 1 & 2 for sex, it will be 1-12 for birth month and 1-n for sport where n is the number of different sports in the dataset.

Some variables such as height and weight are continuous numerical variables even though in this dataset they are represented as whole numbers. Other variables such as gold, silver and bronze medal tally will be discrete numerical variables as they only represent whole numbers of medals won (you can’t win half a medal)

Descriptive Statistics and Visualisation

Now that my data is ready to be used and is clean, I will complete some descriptive statistics and visualisations. In the above data cleaning section I have checked for any missing variables. With regards to height and weight, I used the average for that sport and the athlete’s gender. For a missing DOB variable, I obtained this from an online search of the athlete profile.

The important variables in my investigation will be sport, height, weight, age and medals won

First we observe the percentage of athletes in each sport. We can see that athletics has the overwhelming majority of athletes sitting at around 20% of total. All other sports make up around 5% or lower for each.

#Handle null values

#calculate mean height and weight per sport and gender and fill empty values with these
athletes2016 <- athletes2016 %>% 
                      group_by(Gender, Sport) %>% 
                      mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))

athletes2016 <- athletes2016 %>% 
                     group_by(Gender, Sport) %>% 
                     mutate(Height = ifelse(is.na(Height), mean(Height, na.rm = TRUE), Height))


#do the same for the 2012 dataset
athletes2012 <- athletes2012 %>% 
                      group_by(Gender, Sport) %>% 
                      mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))

athletes2012 <- athletes2012 %>% 
                     group_by(Gender, Sport) %>% 
                     mutate(Height = ifelse(is.na(Height), mean(Height, na.rm = TRUE), Height))


#all boxing athletes have no entry for weight so I will replace these values with the average weight per their nationality and gender
athletes2016 <- athletes2016 %>% 
                      group_by(Nationality, Gender) %>% 
                      mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))

lapply(athletes2016,function(x) { length(which(is.na(x)))})

## $`Athlete Name`
## [1] 0
## 
## $Nationality
## [1] 0
## 
## $Gender
## [1] 0
## 
## $Date_of_Birth
## [1] 0
## 
## $Height
## [1] 0
## 
## $Weight
## [1] 0
## 
## $Sport
## [1] 0
## 
## $`Gold Medals`
## [1] 0
## 
## $`Silver Medals`
## [1] 0
## 
## $`Bronze Medals`
## [1] 0
## 
## $Age
## [1] 0
## 
## $AgeBracket
## [1] 0

#we still have one DOB entry empty, I will analyse this particular athlete
missing_dob <- athletes2016[rowSums(is.na(athletes2016)) > 0,]
missing_dob

#The athlete is a well known russian sailor and his date of birth is publicly available so I will replace it with the public data
athletes2016[is.na(athletes2016)] <- c(as.Date("12/25/1987", format = "%m/%d/%Y"))

#Visualisations: - First I view the height and weight relationship for all the atheletes - there appears to be some outliers as people who are very tall with low weight and people who are very short with high weight - we can also observe that the trend of the data exists that as height increases so does weight however there is a point where the data plateaus and athletes don’t get any taller however do get heavier.

Next lets view the proportion of our athletes in each sport
Then visualise proportion of athletes in each age bracket -> we can see the overwhelming majority of athletes are aged 22-31
Next visualise proportion of athletes that win gold -> again we can see the overwhelming majority of athletes win 0 gold medals.
lets dive deeper and see the exact percentage of winners of 1 - 5 gold medals.
we observe that roughly 5.3% of athletes win 1 Gold medal or more
Create box plots to view the relationship between height and winning gold.
- It appears a taller athlete is more lilely than a smaller athlete to win 1 (or 5) gold medals
Now compare athlete weight and winning gold
- Here we see that a lighter athlete is more likely to win than a heavier athlete.
- There are some outliers for the winners of 1 gold medal (where some very heavy athletes won)

#VISUALISATIONS

#First lets view the height and weight relationship for all the atheletes
plot(Height ~ Weight,  data = athletes2016)

#there appears to be some outliers as people who are very tall with low weight and people who are very short with high weight
#we can also observe that the trend of the data exists that as height increases so does weight however there is a point where the data plateaus and athletes don't get any taller however do get heavier.

#now lets view the proportion of our athletes in each sport
freqSports <- athletes2016$Sport %>% table() %>% prop.table()*100
barplot(freqSports, main = "Athlete Sport Participation - Percentage",ylab="Percent", xlab="Sport", cex.names=0.7, las=2)

#next visualise proportion of athletes in each age bracket
freqAges <- athletes2016$AgeBracket %>% table() %>% prop.table()*100
barplot(freqAges, main = "Athlete Age Brackets - Percentage",ylab="Percent", xlab="Age Bracket", cex.names=0.7, las=2)

#Again we can see the overwhelming majority of athletes are aged 22-31

#next visualise proportion of athletes that win gold
freqMedals <- athletes2016$`Gold Medals` %>% table() %>% prop.table()*100
barplot(freqMedals, main = "Athlete Gold Medal Winners - Percentage",ylab="Percent", xlab="Gold Medal Count", cex.names=0.7, las=2)

#Again we can see the overwhelming majority of athletes win 0 gold medals.

#lets dive deeper and see the exact percentage of winners of 1 - 5 gold medals.
#we observe that roughly 5.3% of athletes win 1 Gold medal or more
freqMedals

## .
##            0            1            2            3            4            5 
## 94.635118738  5.061535795  0.234009360  0.043335067  0.017334027  0.008667013

#Create box plots to view the relationship between height and winning gold. It appears a taller athlete is more lilely than a smaller athlete to win 1 (or 5) gold medals
athletes2016 %>% boxplot(Height ~ `Gold Medals`, data = ., main="Box Plot of Gold Medals by Height", 
        ylab = "Height", xlab = "Gold Medals", col="grey")

#Now compare athlete weight and winning gold
#Here we see that a lighter athlete is more likely to win than a heavier athlete. There are some outliers for the winners of 1 gold medal (where some very heavy athletes won)
athletes2016 %>% boxplot(Weight ~ `Gold Medals`, data = ., main="Box Plot of Gold Medals by Weight", 
        ylab = "Weight", xlab = "Gold Medals", col="grey")

Decsriptive Statistics Cont.

Create 3 descriptive statistic tables:
1. For age by each sport
2. For height by each sport and gender
3. Weight by each sport and gender, including calculations for normal distribution
- In this table we can see those that have a distNorm value of 0 follow a normal distribution for weight
- Therefore I will find these specific groups: Male and Female volleyball, female rowing, male football and male cycling

I will use male cycling weight as an example and plot a histogram with a normal overlay

#Create a table with the descriptive statistics for age by each sport
athletes2016 %>% group_by(Sport, `Gold Medals`) %>% summarise(Min = min(Age,na.rm = TRUE),
                                           Q1 = quantile(Age,probs = .25,na.rm = TRUE),
                                           Median = median(Age, na.rm = TRUE),
                                           Q3 = quantile(Age,probs = .75,na.rm = TRUE),
                                           Max = max(Age,na.rm = TRUE),
                                           Mean = mean(Age, na.rm = TRUE),
                                           SD = sd(Age, na.rm = TRUE),
                                           Range = Max - Min,
                                           IQR = Q3 - Q1,
                                           n = n()) -> table1
knitr::kable(table1)

Sport	Gold Medals	Min	Q1	Median	Q3	Max	Mean	SD	Range	IQR	n
aquatics	0	13	20.00	23.0	26.00	41	23.41976	4.3753138	28	6.00	1346
aquatics	1	15	21.00	22.0	26.00	33	23.47059	4.0195600	18	5.00	85
aquatics	2	19	20.00	23.5	26.75	35	24.30000	5.2925524	16	6.75	10
aquatics	3	21	22.50	24.0	25.50	27	24.00000	4.2426407	6	3.00	2
aquatics	4	19	19.00	19.0	19.00	19	19.00000	NA	0	0.00	1
aquatics	5	31	31.00	31.0	31.00	31	31.00000	NA	0	0.00	1
archery	0	16	22.00	24.5	28.00	44	25.68852	5.8846796	28	6.00	122
archery	1	20	20.75	22.5	25.00	28	23.25000	3.5939764	8	4.25	4
archery	2	23	24.50	26.0	27.50	29	26.00000	4.2426407	6	3.00	2
athletics	0	15	23.00	26.0	29.00	46	26.32002	4.8055656	31	6.00	2303
athletics	1	19	23.50	26.0	28.50	37	26.05455	3.8892929	18	5.00	55
athletics	2	24	28.50	30.0	30.75	33	29.25000	3.7749172	9	2.25	4
athletics	3	29	29.00	29.0	29.00	29	29.00000	NA	0	0.00	1
badminton	0	19	24.00	26.5	29.00	40	26.82317	3.9271526	21	5.00	164
badminton	1	23	25.50	26.5	29.25	32	27.12500	3.0443155	9	3.75	8
basketball	0	20	24.00	27.0	31.00	39	27.56439	4.3614738	19	7.00	264
basketball	1	21	26.00	27.0	30.50	37	28.20833	3.9997735	16	4.50	24
boxing	0	18	22.00	25.0	28.00	36	25.04396	3.8403204	18	6.00	273
boxing	1	21	23.00	24.0	25.00	33	24.61538	3.0149202	12	2.00	13
canoe	0	17	23.00	27.0	29.00	48	26.78387	4.5732157	31	6.00	310
canoe	1	21	24.00	29.5	33.00	36	28.62500	4.6743984	15	9.00	16
canoe	2	22	25.00	27.0	28.25	29	26.25000	3.0956959	7	3.25	4
canoe	3	29	29.00	29.0	29.00	29	29.00000	NA	0	0.00	1
cycling	0	18	23.00	27.0	30.00	44	26.91816	4.5920899	26	7.00	501
cycling	1	21	23.00	25.5	29.75	42	27.04545	5.3849638	21	6.75	22
cycling	2	24	24.00	24.0	24.00	24	24.00000	NA	0	0.00	1
cycling	3	28	28.00	28.0	28.00	28	28.00000	NA	0	0.00	1
equestrian	0	18	30.00	36.0	44.00	62	37.12560	9.6669626	44	14.00	207
equestrian	1	21	31.50	36.0	47.00	58	38.86667	10.5211805	37	15.50	15
fencing	0	15	24.00	27.0	31.00	41	27.47788	4.8909846	26	7.00	226
fencing	1	20	26.00	28.0	31.00	38	28.47368	4.1145868	18	5.00	19
fencing	2	22	22.00	22.0	22.00	22	22.00000	NA	0	0.00	1
football	0	16	21.00	22.0	25.00	40	23.48870	3.5948270	24	4.00	575
football	1	19	22.00	23.0	27.25	33	24.41667	3.6282621	14	5.25	36
golf	0	18	24.00	28.5	34.75	46	29.50000	7.0792220	28	10.75	118
golf	1	28	30.00	32.0	34.00	36	32.00000	5.6568542	8	4.00	2
gymnastics	0	15	19.00	21.0	24.50	41	22.05686	4.4025745	26	5.50	299
gymnastics	1	16	19.25	22.0	25.00	31	22.31818	3.8469080	15	5.75	22
gymnastics	2	23	24.00	25.0	26.00	27	25.00000	2.8284271	4	2.00	2
gymnastics	4	19	19.00	19.0	19.00	19	19.00000	NA	0	0.00	1
handball	0	19	25.00	27.0	31.00	43	27.82934	4.5451596	24	6.00	334
handball	1	20	27.00	29.0	31.00	35	28.68966	3.7806221	15	4.00	29
hockey	0	17	23.00	26.0	29.00	37	25.96482	3.6728658	20	6.00	398
hockey	1	21	25.25	28.5	31.00	37	28.55882	4.0690344	16	5.75	34
judo	0	18	24.00	26.0	29.00	39	26.22751	3.7425200	21	5.00	378
judo	1	21	24.00	25.0	26.75	30	25.28571	2.8670159	9	2.75	14
modern pentathlon	0	16	23.00	25.0	29.00	36	25.60000	3.8949987	20	6.00	70
modern pentathlon	1	24	25.00	26.0	27.00	28	26.00000	2.8284271	4	2.00	2
rowing	0	18	24.00	26.0	29.00	56	26.69940	4.3994591	38	5.00	499
rowing	1	22	26.00	28.0	31.00	37	28.77083	3.5921876	15	5.00	48
rugby sevens	0	18	23.00	26.0	29.00	36	26.12364	3.8264334	18	6.00	275
rugby sevens	1	21	23.00	25.0	28.00	32	25.32000	3.0784195	11	5.00	25
sailing	0	17	24.00	27.0	30.00	47	27.41096	5.2629631	30	6.00	365
sailing	1	25	26.50	28.0	30.00	54	30.00000	7.2308861	29	3.50	15
shooting	0	16	25.00	31.0	36.00	55	31.40426	8.3775180	39	11.00	376
shooting	1	19	25.00	29.0	33.00	49	30.23077	8.6231650	30	8.00	13
shooting	2	28	28.00	28.0	28.00	28	28.00000	NA	0	0.00	1
table tennis	0	15	23.00	27.0	32.00	54	27.77711	6.8058589	39	9.00	166
table tennis	1	25	25.75	27.0	28.00	28	26.75000	1.5000000	3	2.25	4
table tennis	2	26	26.25	26.5	26.75	27	26.50000	0.7071068	1	0.50	2
taekwondo	0	17	21.00	24.0	27.00	37	24.24167	4.2207886	20	6.00	120
taekwondo	1	20	21.50	22.0	23.75	28	22.87500	2.7998724	8	2.25	8
tennis	0	18	25.00	28.0	31.00	43	28.02128	4.6168228	25	6.00	188
tennis	1	22	26.75	29.5	30.25	34	28.37500	4.0333432	12	3.50	8
triathlon	0	19	25.00	27.0	30.00	38	27.66667	3.8499909	19	5.00	108
triathlon	1	28	28.50	29.0	29.50	30	29.00000	1.4142136	2	1.00	2
volleyball	0	17	25.00	27.0	30.00	41	27.45787	4.3288641	24	5.00	356
volleyball	1	19	24.00	27.0	30.00	40	27.21429	5.2518578	21	6.00	28
weightlifting	0	15	22.00	24.0	28.00	41	24.86420	4.1273721	26	6.00	243
weightlifting	1	21	22.00	24.0	25.00	33	24.26667	3.0814344	12	3.00	15
wrestling	0	18	24.00	26.0	29.00	38	26.67463	3.7189973	20	5.00	335
wrestling	1	20	21.25	24.0	25.00	33	24.55556	3.8535274	13	3.75	18

#Now create a table with the descriptive statistics for height by each sport and gender
athletes2016 %>% group_by(Sport, Gender) %>% summarise(Min = min(Height,na.rm = TRUE),
                                           Q1 = quantile(Height,probs = .25,na.rm = TRUE),
                                           Median = median(Height),
                                           Q3 = quantile(Height,probs = .75,na.rm = TRUE),
                                           Max = max(Height,na.rm = TRUE),
                                           Mean = mean(Height, na.rm = TRUE),
                                           SD = sd(Height, na.rm = TRUE),
                                           Range = Max - Min,
                                           IQR = Q3 - Q1,
                                           n = n()) -> table2
knitr::kable(table2)

Sport	Gender	Min	Q1	Median	Q3	Max	Mean	SD	Range	IQR	n
aquatics	female	143	167.00	171.5712	176.00	194	171.5712	7.403290	51	9.00	716
aquatics	male	153	181.00	186.0342	192.00	208	185.9999	8.794747	55	11.00	729
archery	female	153	163.00	168.0000	172.00	183	167.6190	6.168091	30	9.00	64
archery	male	164	175.00	180.0000	183.00	194	179.5714	5.800149	30	8.00	64
athletics	female	121	164.00	169.0500	174.00	193	169.0500	7.925731	72	10.00	1137
athletics	male	145	175.00	180.9234	187.00	208	180.9234	8.896243	63	12.00	1226
badminton	female	156	165.00	168.0000	172.00	184	168.6000	5.562902	28	7.00	86
badminton	male	167	176.25	180.2529	184.00	198	180.5059	6.316409	31	7.75	86
basketball	female	161	177.00	185.0000	190.00	203	183.3819	9.184965	42	13.00	144
basketball	male	181	194.75	201.0000	207.00	218	200.3611	8.281423	37	12.25	144
boxing	female	150	162.75	168.0000	174.25	184	168.2222	8.124429	34	11.50	36
boxing	male	152	170.00	175.7615	181.00	206	175.7615	9.132098	54	11.00	250
canoe	female	153	166.50	170.0000	174.00	185	170.0648	6.356490	32	7.50	111
canoe	male	159	178.00	183.0000	186.00	202	182.0455	6.852505	43	8.00	220
cycling	female	155	163.00	167.1564	170.25	185	167.3128	5.895198	30	7.25	200
cycling	male	164	175.00	179.3074	184.00	196	179.3074	6.331333	32	9.00	325
equestrian	female	151	163.00	168.0000	172.00	185	168.2530	6.843216	34	9.00	85
equestrian	male	161	174.00	179.0000	183.00	196	179.0000	7.407270	35	9.00	137
fencing	female	152	167.75	171.5000	176.00	185	171.2581	6.538032	33	8.25	124
fencing	male	165	178.00	183.1901	189.00	205	183.3802	7.478074	40	11.00	122
football	female	153	163.00	168.0000	173.00	188	168.2710	6.553599	35	10.00	264
football	male	160	174.00	180.0000	185.00	196	179.3746	7.206978	36	11.00	347
golf	female	158	165.00	168.0000	172.50	183	169.1404	5.694639	25	7.50	59
golf	male	160	173.00	180.0000	185.00	196	179.3500	7.999635	36	12.00	61
gymnastics	female	133	156.00	163.0000	168.00	181	161.6524	9.051058	48	12.00	210
gymnastics	male	152	163.00	167.0000	172.00	184	167.7105	6.168323	32	9.00	114
handball	female	161	172.00	176.0000	180.00	192	175.9722	5.801590	31	8.00	180
handball	male	177	186.00	191.0000	196.00	210	191.3169	6.816922	33	10.00	183
hockey	female	152	162.00	167.0000	171.00	196	166.7870	6.984436	44	9.00	216
hockey	male	161	176.00	180.0000	184.00	196	180.0093	5.989130	35	8.00	216
judo	female	150	161.00	165.0000	173.00	186	166.5132	8.130504	36	12.00	153
judo	male	155	172.00	178.1783	185.00	210	178.1783	9.474799	55	13.00	239
modern pentathlon	female	159	165.75	169.0000	175.00	182	169.9444	5.820871	23	9.25	36
modern pentathlon	male	172	180.00	181.5000	185.00	197	182.5278	5.140795	25	5.00	36
rowing	female	150	173.00	178.0000	183.00	196	177.3286	7.512572	46	10.00	216
rowing	male	152	186.00	190.0000	195.00	209	189.8708	7.557571	57	9.00	331
rugby sevens	female	153	163.75	168.0000	172.00	203	167.8552	6.756181	50	8.25	148
rugby sevens	male	168	177.00	183.5000	188.00	198	182.9145	7.272323	30	11.00	152
sailing	female	152	165.00	169.2767	173.00	184	169.2767	6.079628	32	8.00	163
sailing	male	165	177.00	181.0000	186.00	204	181.2770	6.549900	39	9.00	217
shooting	female	149	160.00	164.8400	169.00	181	164.8400	6.079561	32	9.00	151
shooting	male	160	174.00	178.0000	182.00	202	178.0343	6.763013	42	8.00	239
table tennis	female	149	161.00	165.3659	168.75	181	165.3659	6.301283	32	7.75	86
table tennis	male	162	174.00	178.5000	183.00	198	178.6860	7.099729	36	9.00	86
taekwondo	female	159	168.75	173.0000	176.50	188	173.0159	6.188807	29	7.75	64
taekwondo	male	160	180.00	186.5000	191.00	207	186.3750	8.820071	47	11.00	64
tennis	female	159	168.00	174.0000	178.00	186	173.4205	6.290595	27	10.00	91
tennis	male	172	181.00	186.0000	191.00	204	186.0388	7.107583	32	10.00	105
triathlon	female	153	164.00	168.0000	171.00	180	167.6545	5.618152	27	7.00	55
triathlon	male	166	175.00	179.0000	183.00	191	178.8704	6.212432	25	8.00	55
volleyball	female	159	178.00	183.0000	188.00	202	182.8691	7.756682	43	10.00	192
volleyball	male	172	191.00	197.0000	202.00	211	196.4010	7.638467	39	11.00	192
weightlifting	female	137	155.00	160.0000	165.00	178	160.3558	7.703131	41	10.00	104
weightlifting	male	148	162.25	170.0000	180.00	197	170.9221	10.776719	49	17.75	154
wrestling	female	150	160.25	165.0000	170.00	180	165.1239	6.796840	30	9.75	114
wrestling	male	152	169.00	175.0000	182.00	203	175.7215	9.581499	51	13.00	239

#Now create a table with the descriptive statistics for Weight by each sport and gender, including calculations for normal distribution
athletes2016 %>% group_by(Sport, Gender) %>% summarise(Min = min(Weight,na.rm = TRUE),
                                           Q1 = quantile(Weight,probs = .25,na.rm = TRUE),
                                           Median = median(Weight),
                                           Q3 = quantile(Weight,probs = .75,na.rm = TRUE),
                                           Max = max(Weight,na.rm = TRUE),
                                           Mean = mean(Weight, na.rm = TRUE),
                                           SD = sd(Weight, na.rm = TRUE),
                                           Variance = var(Weight),
                                           Range = Max - Min,
                                           IQR = Q3 - Q1,
                                           n = n(),
                                           valNorm = SD * 1.33,
                                           distNorm = IQR - valNorm) -> table3
knitr::kable(table3)

Sport	Gender	Min	Q1	Median	Q3	Max	Mean	SD	Variance	Range	IQR	n	valNorm	distNorm
aquatics	female	39.00000	57.00000	62.00000	66.00000	130	62.28448	8.666120	75.10164	91.00000	9.000000	716	11.525940	-2.5259399
aquatics	male	50.00000	75.00000	82.00000	89.00000	125	82.21906	11.393093	129.80256	75.00000	14.000000	729	15.152813	-1.1528133
archery	female	42.00000	56.75000	63.00000	73.00000	90	64.30159	10.225451	104.55984	48.00000	16.250000	64	13.599849	2.6501508
archery	male	46.00000	74.00000	79.50000	84.50000	130	80.07937	13.458108	181.12069	84.00000	10.500000	64	17.899284	-7.3992844
athletics	female	39.00000	52.00000	58.00000	63.00000	150	60.15254	13.679727	187.13494	111.00000	11.000000	1137	18.194037	-7.1940372
athletics	male	47.00000	63.00000	72.50000	80.00000	165	74.77768	17.330468	300.34510	118.00000	17.000000	1226	23.049522	-6.0495218
badminton	female	50.00000	58.00000	60.50000	64.75000	85	61.20988	5.637513	31.78155	35.00000	6.750000	86	7.497892	-0.7478924
badminton	male	60.00000	72.00000	76.15663	80.00000	93	76.15663	6.964586	48.50546	33.00000	8.000000	86	9.262899	-1.2628993
basketball	female	53.00000	68.00000	75.00000	83.25000	104	75.37762	10.354422	107.21404	51.00000	15.250000	144	13.771381	1.4786194
basketball	male	79.00000	90.00000	100.00000	108.25000	137	100.29787	11.835554	140.08035	58.00000	18.250000	144	15.741287	2.5087128
boxing	female	56.33205	60.15761	62.46073	63.84033	91	62.83780	5.469579	29.91629	34.66795	3.682722	36	7.274540	-3.5918181
boxing	male	63.47368	75.22449	80.67671	82.25725	114	79.81591	6.099835	37.20799	50.52632	7.032756	250	8.112781	-1.0800249
canoe	female	50.00000	63.00000	66.00000	70.00000	99	66.45794	7.164661	51.33237	49.00000	7.000000	111	9.528999	-2.5289994
canoe	male	57.00000	75.75000	83.00000	89.00000	115	82.15000	8.937646	79.88151	58.00000	13.250000	220	11.887068	1.3629315
cycling	female	45.00000	55.75000	60.00000	64.00000	90	60.20725	7.018894	49.26487	45.00000	8.250000	200	9.335129	-1.0851291
cycling	male	52.00000	66.00000	71.00000	78.00000	100	72.57605	9.065853	82.18970	48.00000	12.000000	325	12.057585	-0.0575852
equestrian	female	47.00000	55.00000	58.00000	62.00000	75	58.63415	5.842195	34.13124	28.00000	7.000000	85	7.770119	-0.7701194
equestrian	male	50.00000	69.00000	72.00000	79.00000	89	72.95489	7.268854	52.83624	39.00000	10.000000	137	9.667576	0.3324236
fencing	female	45.00000	58.00000	62.00000	67.25000	81	62.73387	7.362958	54.21315	36.00000	9.250000	124	9.792734	-0.5427343
fencing	male	60.00000	75.00000	78.00000	83.00000	102	78.78512	7.760045	60.21829	42.00000	8.000000	122	10.320859	-2.3208592
football	female	47.00000	56.00000	60.00000	65.00000	82	61.06107	6.199786	38.43735	35.00000	9.000000	264	8.245716	0.7542843
football	male	60.00000	70.00000	74.45171	79.00000	100	74.45171	6.774549	45.89451	40.00000	9.000000	347	9.010150	-0.0101501
golf	female	48.00000	58.50000	63.00000	68.00000	85	63.20000	7.219227	52.11724	37.00000	9.500000	59	9.601572	-0.1015722
golf	male	63.00000	72.00000	79.00000	86.00000	106	79.00000	9.150592	83.73333	43.00000	14.000000	61	12.170287	1.8297127
gymnastics	female	31.00000	47.00000	50.00000	54.00000	63	49.55502	5.567170	30.99338	32.00000	7.000000	210	7.404336	-0.4043363
gymnastics	male	50.00000	60.00000	63.12727	66.00000	81	63.25455	6.028599	36.34401	31.00000	6.000000	114	8.018037	-2.0180367
handball	female	52.00000	66.00000	70.00000	74.00000	105	70.78916	7.775344	60.45598	53.00000	8.000000	180	10.341208	-2.3412080
handball	male	72.00000	88.00000	95.00000	103.50000	118	95.43169	9.955090	99.10383	46.00000	15.500000	183	13.240270	2.2597298
hockey	female	45.00000	55.75000	60.00000	65.00000	78	60.42593	6.265748	39.25960	33.00000	9.250000	216	8.333445	0.9165545
hockey	male	58.00000	74.00000	77.00000	81.25000	95	77.37500	6.881936	47.36105	37.00000	7.250000	216	9.152975	-1.9029752
judo	female	46.00000	52.00000	63.00000	70.00000	132	65.39216	16.519221	272.88467	86.00000	18.000000	153	21.970564	-3.9705644
judo	male	60.00000	68.00000	81.00000	94.50000	170	84.61674	20.376463	415.20024	110.00000	26.500000	239	27.100696	-0.6006955
modern pentathlon	female	49.00000	53.75000	58.00000	60.00000	69	58.00000	4.974219	24.74286	20.00000	6.250000	36	6.615712	-0.3657116
modern pentathlon	male	64.00000	70.75000	74.00000	78.00000	90	73.91667	5.683937	32.30714	26.00000	7.250000	36	7.559637	-0.3096366
rowing	female	49.00000	63.75000	71.50000	75.00000	90	69.77619	8.472797	71.78828	41.00000	11.250000	216	11.268820	-0.0188196
rowing	male	53.00000	73.00000	90.00000	95.00000	110	86.50462	12.010506	144.25225	57.00000	22.000000	331	15.973973	6.0260272
rugby sevens	female	53.00000	63.00000	66.59589	71.00000	89	66.59589	6.629740	43.95345	36.00000	8.000000	148	8.817554	-0.8175542
rugby sevens	male	65.00000	83.00000	90.00000	98.00000	113	90.45033	9.400358	88.36674	48.00000	15.000000	152	12.502477	2.4975233
sailing	female	50.00000	59.00000	63.19497	68.00000	74	63.19497	5.534937	30.63553	24.00000	9.000000	163	7.361467	1.6385334
sailing	male	59.00000	72.00000	77.00000	82.00000	102	77.12207	8.971876	80.49457	43.00000	10.000000	217	11.932596	-1.9325956
shooting	female	43.00000	55.00000	61.00000	67.00000	95	62.67568	10.064602	101.29622	52.00000	12.000000	151	13.385921	-1.3859208
shooting	male	45.00000	73.00000	80.00000	88.00000	140	81.06897	13.352823	178.29788	95.00000	15.000000	239	17.759255	-2.7592547
table tennis	female	42.00000	53.25000	57.53012	61.50000	84	57.53012	6.728932	45.27853	42.00000	8.250000	86	8.949480	-0.6994796
table tennis	male	51.00000	67.00000	72.00000	77.00000	99	72.55814	8.652242	74.86129	48.00000	10.000000	86	11.507481	-1.5074814
taekwondo	female	48.00000	54.50000	61.25806	67.00000	88	61.25806	9.107707	82.95033	40.00000	12.500000	64	12.113251	0.3867493
taekwondo	male	57.00000	62.75000	72.40476	82.25000	108	74.80952	13.913402	193.58277	51.00000	19.500000	64	18.504825	0.9951748
tennis	female	52.00000	61.00000	65.00000	68.00000	84	64.67045	5.386653	29.01604	32.00000	7.000000	91	7.164249	-0.1642491
tennis	male	65.00000	75.00000	80.00000	85.00000	99	80.41748	7.405201	54.83701	34.00000	10.000000	105	9.848918	0.1510824
triathlon	female	44.00000	51.50000	55.00000	58.00000	65	54.56364	4.532858	20.54680	21.00000	6.500000	55	6.028701	0.4712989
triathlon	male	56.00000	63.50000	65.00000	70.00000	80	66.81481	5.343997	28.55830	24.00000	6.500000	55	7.107515	-0.6075154
volleyball	female	52.00000	66.00000	71.00000	75.00000	92	70.68421	6.709785	45.02122	40.00000	9.000000	192	8.924014	0.0759856
volleyball	male	64.00000	84.00000	89.00000	95.25000	115	89.42188	8.419880	70.89439	51.00000	11.250000	192	11.198441	0.0515589
weightlifting	female	47.00000	57.00000	63.00000	74.00000	143	68.78846	20.526409	421.33346	96.00000	17.000000	104	27.300124	-10.3001237
weightlifting	male	55.00000	69.00000	84.00000	104.00000	170	87.53896	27.938339	780.55076	115.00000	35.000000	154	37.157990	-2.1579902
wrestling	female	47.00000	54.25000	62.50000	69.00000	80	61.80531	9.109810	82.98864	33.00000	14.750000	114	12.116048	2.6339523
wrestling	male	39.00000	66.00000	82.00000	98.00000	130	85.37288	21.409295	458.35793	91.00000	32.000000	239	28.474363	3.5256372

#I will use male cycling as an example and plot
maleCyc <- athletes2016 %>% 
                      group_by(Gender, Sport) %>% 
                      filter(Sport == "cycling" , Gender == "male")

#Plot histogram with normal overlay - Male cyclists weight
h<-hist(maleCyc$Weight,breaks=15, main = "Normal Distribution of Male Cyclists' weight")

weight.plot <-c(min(h$breaks),h$breaks)
freq<-c(0,h$density,0)
xfit<-seq(min(maleCyc$Weight),max(maleCyc$Weight),length=40)
yfit<-dnorm(xfit,mean=mean(maleCyc$Weight),sd=sd(maleCyc$Weight)) 
plot(weight.plot,freq,type="s",ylim=c(0,max(freq,yfit)), main="Normal pdf and histogram - Male Cyclist weights")
lines(xfit,yfit, col="red")
maleCyc$Weight %>% mean() %>% abline(v=.,col='red',lw=2)

#Complete probability testing as per research problem

Find the gold medal count and therefore probability of winning gold per sport and gender
Find probability of winning in your sport based on gender, number of medals and athletes - we observe that the highest probability sports to win for athletes based on being female are swimming and rowing (both higher than 9%) whilst the lowest probabilities are in golf and triathlon where there is little over 1% change of winning by gender
Now complete the same for males - In the male competitions we see the highest probability of winning in gymnastics which is > 11%, and the lowest probabilities in golf and triathlon again. - It is noted that the reason for this is there is only one event (one medal) for each of these sports so athletes cannot participate in multiple events, therefore lowering their chances.
Now I know that the most athletes are in age bracket 22-31 so I will calculate the probability of winning based on gender and age - We have found the probability of winning gold if you are in the age bracket 22-31 compared to a different age bracket. - The highest values are for the events where there is only 1 medal and the athlete fell in this age bracket e.g golf and triathlon as well as a number of other events where all gold medals were won by people in this age bracket. - The lowest probabilities are in shooting, gymnastics and boxing. - Total probabilites can be found in the final table per sport for females

#Probability testing

library(janitor)

#Find the gold medal count and therefore probability of winning gold per sport and gender
m1 <- as.matrix(tabyl(athletes2016, Gender, Sport))
m2 <- as.numeric(c(m1[1, 2:29]))
m3 <- as.numeric(c(m1[2, 2:29]))

athletes2016 %>%
  group_by(Sport)%>%
  filter(`Gold Medals` > 0 , Gender == "female") %>% 
  summarise(totalGoldMedals = sum(`Gold Medals`)) -> table4

knitr::kable(table4)

Sport	totalGoldMedals
aquatics	65
archery	4
athletics	32
badminton	4
basketball	12
boxing	3
canoe	9
cycling	13
equestrian	5
fencing	11
football	18
golf	1
gymnastics	17
handball	15
hockey	16
judo	7
modern pentathlon	1
rowing	20
rugby sevens	12
sailing	7
shooting	6
table tennis	4
taekwondo	4
tennis	4
triathlon	1
volleyball	14
weightlifting	7
wrestling	6

#Find probability of winning in your sport based on gender, number of medals and athletes
as.data.frame(table4)

table4$TotalFemaleAthletes <- m2
table4$ProbWin <- ((table4$totalGoldMedals / table4$TotalFemaleAthletes) * 100)
table4

#Now complete for males
athletes2016 %>%
  group_by(Sport)%>%
  filter(`Gold Medals` > 0 , Gender == "male") %>% 
  summarise(totalGoldMedals = sum(`Gold Medals`)) -> table5

knitr::kable(table5)

Sport	totalGoldMedals
aquatics	55
archery	4
athletics	34
badminton	4
basketball	12
boxing	10
canoe	18
cycling	14
equestrian	10
fencing	10
football	18
golf	1
gymnastics	13
handball	14
hockey	18
judo	7
modern pentathlon	1
rowing	28
rugby sevens	13
sailing	8
shooting	9
table tennis	4
taekwondo	4
tennis	4
triathlon	1
volleyball	14
weightlifting	8
wrestling	12

as.data.frame(table5)

table5$TotalMaleAthletes <- m3
table5$ProbWin <- ((table5$totalGoldMedals / table5$TotalMaleAthletes) * 100)
table5

#Now I know that the most athletes are in age bracket 22-31 so I will calculate the probability of winning based on gender and age
as.matrix(tabyl(athletes2016, AgeBracket, Sport))

##      AgeBracket aquatics archery athletics badminton basketball boxing canoe
## [1,] "12-21"    "529"    "32"    " 350"    " 14"     " 19"      " 50"  " 38"
## [2,] "22-31"    "848"    "76"    "1668"    "138"     "209"      "219"  "238"
## [3,] "32-41"    " 68"    "17"    " 339"    " 20"     " 60"      " 17"  " 54"
## [4,] "42-51"    "  0"    " 3"    "   6"    "  0"     "  0"      "  0"  "  1"
## [5,] "52-61"    "  0"    " 0"    "   0"    "  0"     "  0"      "  0"  "  0"
## [6,] "62+"      "  0"    " 0"    "   0"    "  0"     "  0"      "  0"  "  0"
##      cycling equestrian fencing football golf gymnastics handball hockey judo 
## [1,] " 56"   " 8"       " 21"   "180"    "17" "165"      " 23"    " 51"  " 37"
## [2,] "385"   "63"       "172"   "412"    "64" "150"      "265"    "345"  "317"
## [3,] " 81"   "78"       " 53"   " 19"    "30" "  9"      " 74"    " 36"  " 38"
## [4,] "  3"   "53"       "  0"   "  0"    " 9" "  0"      "  1"    "  0"  "  0"
## [5,] "  0"   "19"       "  0"   "  0"    " 0" "  0"      "  0"    "  0"  "  0"
## [6,] "  0"   " 1"       "  0"   "  0"    " 0" "  0"      "  0"    "  0"  "  0"
##      modern pentathlon rowing rugby sevens sailing shooting table tennis
## [1,] "10"              " 47"  " 29"        " 43"   " 44"    "29"        
## [2,] "57"              "429"  "243"        "261"   "169"    "97"        
## [3,] " 5"              " 68"  " 28"        " 71"   "128"    "42"        
## [4,] " 0"              "  2"  "  0"        "  4"   " 42"    " 2"        
## [5,] " 0"              "  1"  "  0"        "  1"   "  7"    " 2"        
## [6,] " 0"              "  0"  "  0"        "  0"   "  0"    " 0"        
##      taekwondo tennis triathlon volleyball weightlifting wrestling
## [1,] "34"      " 12"  " 5"      " 32"      " 50"         " 29"    
## [2,] "85"      "143"  "86"      "288"      "194"         "286"    
## [3,] " 9"      " 39"  "19"      " 64"      " 14"         " 38"    
## [4,] " 0"      "  2"  " 0"      "  0"      "  0"         "  0"    
## [5,] " 0"      "  0"  " 0"      "  0"      "  0"         "  0"    
## [6,] " 0"      "  0"  " 0"      "  0"      "  0"         "  0"

table4$totalGoldMedals

##  [1] 65  4 32  4 12  3  9 13  5 11 18  1 17 15 16  7  1 20 12  7  6  4  4  4  1
## [26] 14  7  6

athletes2016 %>%
  group_by(Sport, Gender)%>%
  filter(`Gold Medals` >0, AgeBracket == "22-31" , Gender == "female") %>% 
  summarise(AgeBracketGoldMedals = sum(`Gold Medals`)) -> table6

knitr::kable(table6)

Sport	Gender	AgeBracketGoldMedals
aquatics	female	42
archery	female	3
athletics	female	26
badminton	female	4
basketball	female	6
boxing	female	1
canoe	female	7
cycling	female	11
equestrian	female	2
fencing	female	9
football	female	16
golf	female	1
gymnastics	female	5
handball	female	10
hockey	female	12
judo	female	7
modern pentathlon	female	1
rowing	female	18
rugby sevens	female	8
sailing	female	6
shooting	female	2
table tennis	female	4
taekwondo	female	4
tennis	female	4
triathlon	female	1
volleyball	female	9
weightlifting	female	5
wrestling	female	3

as.data.frame(table6)

table6$totalGoldMedals <- as.numeric(c(table4$totalGoldMedals))

table6

#Now we have found the probability of winning gold if you are in the age bracket 22-31 compared to a different age bracket. The highest values are for the events where there is only 1 medal and the athlete fell in this age bracket e.g golf and triathlon as well as a number of other events where all gold medals were won by people in this age bracket. The lowest probabilities are in shooting, gymnastics and boxing.
table6$ProbGoldAge <- (table6$AgeBracketGoldMedals/table6$totalGoldMedals) * 100
table6$totalProbWin <- ((table4$totalGoldMedals / table4$TotalFemaleAthletes) * (table6$ProbGoldAge/100))*100
table6

#Now complete cluster sampling on the 2012 dataset and see if male cyclists also follow the same normal distribution for weight

Take all 2012 male road cyclists into a dataframe
Complete simple random sampling without replacement on the male cylcists 2012 dataframe
Check for normal distribution in the sample set using IQR = 1.33 * SD - We can see that 1.33 * SD is higher than IQR therefore the sample is not normally distributed
Now plot the distribution - In the 2016 dataset we see mean 72.57605, SD 9.065853 and variance 82.18970 therefore it makes sense we see different SD and variance however a similar mean as the sample population is a lot fewer
Calculate the confidence interval for this mean as a representative sample of the whole dataset
Finalise 95% CI for the sample

#take all male cyclists into a dataframe

maleCyc2012 <- athletes2012 %>% group_by(Sport, Gender) %>% 
                                           filter(Sport == "Cycling - Road", Gender == "male")
maleCyc2012

#complete simple random sampling without replacement on the male cylcists 2012 dataframe
index <- sample(1:nrow(maleCyc2012), 50) #take 50 samples
cycleSamp <- maleCyc2012[index,]
cycleSamp

#check for normal distribution in the sample set
summary(cycleSamp$Weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   58.00   65.25   69.50   70.65   76.00   86.00

IQR(cycleSamp$Weight)

## [1] 10.75

#we can see that 1.33 * SD is higher than IQR therefore the sample is not normally distributed
normCycSamp <- sd(cycleSamp$Weight) * 1.33
print(normCycSamp)

## [1] 9.212591

#now plot the distribution
h<-hist(cycleSamp$Weight,breaks=15, main = "Distribution of 2012 Sample group: Male Cyclists' weight")

weight.plot <-c(min(h$breaks),h$breaks)
freq<-c(0,h$density,0)
xfit<-seq(min(cycleSamp$Weight),max(cycleSamp$Weight),length=40)
yfit<-dnorm(xfit,mean=mean(cycleSamp$Weight,sd=sd(cycleSamp$Weight))) 
plot(weight.plot,freq,type="s",ylim=c(0,max(freq,yfit)), main="Normal pdf and histogram - Male Cyclist weights")
lines(xfit,yfit, col="red")
cycleSamp$Weight %>% mean() %>% abline(v=.,col='red',lw=2)

cycleSamp$Weight %>% mean() #mean is 71.89kg

## [1] 70.65168

cycleSamp$Weight %>% var() #variance is 34

## [1] 47.98

cycleSamp$Weight %>% sd() #SD is 5.83

## [1] 6.92676

#in the 2016 dataset we see mean 72.57605, SD 9.065853 and variance 82.18970 therefore it makes sense we see different SD and variance however a similar mean as the sample population is a lot fewer

#calculate the confidence interval for this mean as a representative sample of the whole dataset:
tcrit<- qt(p = 0.975, df = 50 -1)
CI95<- tcrit*(cycleSamp$Weight %>% sd()/sqrt(50))
CI95

## [1] 1.968563

#Therefore 95% CI for the sample is:
floorCI <- cycleSamp$Weight %>% mean() - CI95
roofCI <- cycleSamp$Weight %>% mean() + CI95
floorCI

## [1] 68.68312

roofCI

## [1] 72.62024

Hypothesis Testing

I will use a critical value approach as the goal is to ascertain if the mean of the 2012 sample population is outside of the critical values determined for mean in comparison to the mean of the 2016 dataset. This allows comparison of the 2 datasets and assumes the null hypothesis is true. I will use a two-tailed approach as it will be also important to consider the upper critical mean boundary as it is just as likely a sample population mean will be heavier than it is they will be lighter.

The critical mean now sits between 70.922 and 74.24, this means Pr(70.92 <= x¯<= 74.24)=0.05. A such, assuming the population mean = 72.58, sampling 50 people from the population and finding a mean to be less than 70.92 and greater than 74.24 would happen less than 5% of the time. In the case of our sample dataset we have a mean of 71.89kg which falls within the 95% of mean values we’d expect. Therefore we can say that the mean of the 2012 dataset sample is statistically insignificant.

cycleSamp %>% summarise(Min = min(Weight,na.rm = TRUE),
                        Q1 = quantile(Weight,probs = .25,na.rm = TRUE),
                        Median = median(Weight, na.rm = TRUE),
                        Q3 = quantile(Weight,probs = .75,na.rm = TRUE),
                        Max = max(Weight,na.rm = TRUE),
                        Mean = mean(Weight, na.rm = TRUE),
                        SD = sd(Weight, na.rm = TRUE),
                        n = n(),
                        Missing = sum(is.na(Weight)))

#sample mean is 71.89kg
cycleSamp$Weight %>% boxplot(ylab = "Weight (Kg)")

#hypothesis is Mean = 72.57605 as per 2016 dataset

#This means the sample mean sits 1.68 SEs below the population mean of 72.576. 
qt(p = 0.05/2, df = 50-1, lower.tail = TRUE)

## [1] -2.009575

#Write a short R function, named ttox to convert this back to a critical mean.
tcrittomean <- function(t ,mu, s, n) {
  se <- s/sqrt(n)
  x_bar <- mu + (t*se) #Determine critical mean
  return(x_bar)
}
#lower critical mean - 70.922
tcrittomean(t = qt(0.05/2,50-1,lower.tail = TRUE),
            mu = 72.58, 
            s = sd(cycleSamp$Weight),
            n = length(cycleSamp$Weight))

## [1] 70.61144

#upper critical mean - 74.24
tcrittomean(t = qt(1-(0.05/2),50-1,lower.tail = TRUE),
            mu = 72.58, 
            s = sd(cycleSamp$Weight),
            n = length(cycleSamp$Weight))

## [1] 74.54856

Discussion

First I cleaned the data, then did some visual analysis on the variables. After completing this I went on to complete some probability calculations with regard to the probability of winning gold based on athlete specific data. I did this by putting athletes in age brackets and grouping by sport and gender. I then calculated probability of winning based on your sport and gender, and also if female athletes were in the main age bracket as opposed to any other bracket. I created tables with final probabilities of winning. Most sports were around 5% chance if you were a female and aged 22-31 however this also ranged from 1-7% depending on the sport and how many gold medals/events were up for grabs.

This probability would be significantly lower for the other age brackets as most medals were won by people in this age bracket.

I then went on to complete sampling and hypothesis testing on the 2012 dataset. I used cluster sampling methods to sample male cyclists as this group of athletes had normal weight distribution in the 2016 dataset and I wanted to explore the weight of the same group of athletes in a different dataset.

I chose a sample size of 50 athletes as this is considered fairly large and to be a better representative sample than a smaller sample set of the total group (there were 325 male cyclists in the 2016 dataset). I completed a number of statistical summaries and did critical value hypothesis testing to come to the final conclusion that the sample mean was within 95% confidence of a reflective mean of the population. Therefore the mean of the 2012 dataset sample is statistically insignificant.

The strengths of the investigation lie in that there is strong evidence from these 2 datasets to identify the mean weight of olympic level male cyclists and therefore it is also likely the other normally distributed weights for gender and sport can be easily predicted.

However the limitiations lie with the data where weights are not normally distributed. This leaves room for future investigation. There may be rationale to combine a number of olympic datasets just by sport and gender for the specific sport and gender to analyse. Therefore a larger representative sample can be calculated and compared.

In response to my problem statement, calculating the probability of winning gold within a dataset is simple and I have created a table of probability of winning gold in each sport if you’re a “prime aged” female. However when it comes to predicting specific weight (or other continuous variable) of a separate group such as a future dataset there comes uncertainty with non-normal data. This data would have to be collected, collated and handled in a different manner.

References

[1] “2016 Olympics in Rio de Janeiro”, Kaggle.com, 2021. [Online]. Available: https://www.kaggle.com/rio2016/olympic-games. [Accessed: 12- May- 2021].

[2] R. Wood, “Anthropometry of Olympic Athletes 2012”, Topend Sports Website, 2012. [Online]. Available: https://www.topendsports.com/events/summer/science/anthropometry-2012.htm. [Accessed: 09- May- 2021].