Investigating the probability of winning Olympic Gold

An exploration of the athlete dataset from Rio 2016 Olympics

Olivia Yammouni - s3832344

Introduction

Introduction Cont.

With the 2021 Olympic games nearing, I have choen to investigate the question of the probability of winning gold at the Rio 2016 Olympics given age range, gender and sport. For each athlete, I will consider a number of factors including: - Sex - Age - Height - Weight - Sport - Medal Count

A lot of emphasis is always put on an athlete being in their “prime” age for competition therefore these variables will be used to calculate probability of winning gold, given an athlete’s age and gender for their respective sport. I have chosen the probability investigation for these specific variables as I believe the probability of winning gold in 2021 may have some different outcomes to the 2016 dataset as a result of the delayed olympic event. This may mean that some athletes that scraped into the ideal age range for their sport in 2016 might now overshoot this range in 2021 as they are a year older. Additionally, athletes are thought to be more likely to succeed if they have genetic advantages such as height in basketball or weight in weightlifting, thus genetic factors will be analysed as well. I am interested specifically in female athletes who are in their “prime age” as this group most closely reflects my personal athlete profile. I am also specifically interested in the weight variable as this can be controlled to some extent by athletes as height and age cannot.

Problem Statement

Data

In this assigment I am using 2 complete files (no samples) called 2016athletes.csv and 2012athletes.csv. They were pulled from Kaggle and TopEndSports respectively with URL’s as follows: https://www.kaggle.com/rio2016/olympic-games & https://www.topendsports.com/events/summer/science/anthropometry-2012.htm

To preprocess the data, I will first load the dataset and do some exploratory data analysis: - Review format of the data - Rename columns in each dataset to match - Drop the ID column as it adds no benefit to finding the problem statement (noise). Also drop additional columns in 2012 dataset that aren’t in 2016 dataset - Check shape - Check variable types and change to factor and date variables for some - Convert M to male and F to Female in 2012 set to match 2016 set - Convert height to cm for 2016 dataset - Observe lowest and highest date of birth - no outliers observed - Check for missing values and input averages for that gender and sport - Check for obvious outliers and complete a google search these athletes to see if the data is entered incorrectly or if they are a physical anomaly - Complete feature construction and make a column with bins for age groups - each athlete is assigned an age range

#import data using the readr package, (file is located in the working directory)
athletes2016 <- read_csv("2016athletes.csv")
athletes2012 <- read_csv("2012athletes.csv")


#view the header of the data tables and view top 5 rows
head(athletes2016)
head(athletes2012)

Data Cont.

###understand and inspect the data - Data pre-processing

#check data types of the variables
lapply(athletes2016, class)
## $id
## [1] "numeric"
## 
## $name
## [1] "character"
## 
## $nationality
## [1] "character"
## 
## $sex
## [1] "character"
## 
## $dob
## [1] "character"
## 
## $height
## [1] "numeric"
## 
## $weight
## [1] "numeric"
## 
## $sport
## [1] "character"
## 
## $gold
## [1] "numeric"
## 
## $silver
## [1] "numeric"
## 
## $bronze
## [1] "numeric"
lapply(athletes2012, class)
## $Name
## [1] "character"
## 
## $Country
## [1] "character"
## 
## $Age
## [1] "numeric"
## 
## $`Height, cm`
## [1] "numeric"
## 
## $Weight
## [1] "numeric"
## 
## $Sex
## [1] "character"
## 
## $`Date of birth`
## [1] "character"
## 
## $`Birth Month`
## [1] "numeric"
## 
## $`Place of birth`
## [1] "character"
## 
## $Gold
## [1] "numeric"
## 
## $Silver
## [1] "numeric"
## 
## $Bronze
## [1] "numeric"
## 
## $Total
## [1] "numeric"
## 
## $Sport
## [1] "character"
## 
## $Event
## [1] "character"
#check dimensions
dim(athletes2016)
## [1] 11538    11
dim(athletes2012)
## [1] 10384    15
#check the structure
str(athletes2016)
## tibble [11,538 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id         : num [1:11538] 7.36e+08 5.32e+08 4.36e+08 5.21e+08 3.39e+07 ...
##  $ name       : chr [1:11538] "A Jesus Garcia" "A Lam Shin" "Aaron Brown" "Aaron Cook" ...
##  $ nationality: chr [1:11538] "ESP" "KOR" "CAN" "MDA" ...
##  $ sex        : chr [1:11538] "male" "female" "male" "male" ...
##  $ dob        : chr [1:11538] "10/17/69" "9/23/86" "5/27/92" "1/2/91" ...
##  $ height     : num [1:11538] 1.72 1.68 1.98 1.83 1.81 1.8 2.05 1.93 1.8 1.65 ...
##  $ weight     : num [1:11538] 64 56 79 80 71 67 98 100 62 54 ...
##  $ sport      : chr [1:11538] "athletics" "fencing" "athletics" "taekwondo" ...
##  $ gold       : num [1:11538] 0 0 0 0 0 0 0 0 0 0 ...
##  $ silver     : num [1:11538] 0 0 0 0 0 0 0 0 0 0 ...
##  $ bronze     : num [1:11538] 0 0 1 0 0 0 1 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   name = col_character(),
##   ..   nationality = col_character(),
##   ..   sex = col_character(),
##   ..   dob = col_character(),
##   ..   height = col_double(),
##   ..   weight = col_double(),
##   ..   sport = col_character(),
##   ..   gold = col_double(),
##   ..   silver = col_double(),
##   ..   bronze = col_double()
##   .. )
str(athletes2012)
## tibble [10,384 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Name          : chr [1:10384] "Hiroshi Hoketsu" "Ian Millar" "Afanasijs Kuzmins" "Carl Bouckaert" ...
##  $ Country       : chr [1:10384] "Japan" "Canada" "Latvia" "Belgium" ...
##  $ Age           : num [1:10384] 71 65 65 58 57 57 56 56 56 54 ...
##  $ Height, cm    : num [1:10384] 168 185 178 181 173 185 177 190 180 175 ...
##  $ Weight        : num [1:10384] 61 76 89 78 63 88 76 82 78 77 ...
##  $ Sex           : chr [1:10384] "M" "M" "M" "M" ...
##  $ Date of birth : chr [1:10384] "3/28/1941" "1/6/1947" "3/22/1947" "4/19/1954" ...
##  $ Birth Month   : num [1:10384] 3 1 3 4 12 3 9 3 6 12 ...
##  $ Place of birth: chr [1:10384] "CHUO (JPN)" "Halifax (CAN)" "BIKERNIEKI PARISH (LAT)" "WAREGEM (BEL)" ...
##  $ Gold          : num [1:10384] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Silver        : num [1:10384] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Bronze        : num [1:10384] 0 0 0 0 0 0 0 1 0 0 ...
##  $ Total         : num [1:10384] 0 0 0 0 0 0 0 1 0 0 ...
##  $ Sport         : chr [1:10384] "Equestrian" "Equestrian" "Shooting" "Equestrian" ...
##  $ Event         : chr [1:10384] "Individual Dressage, WHISPER" "Individual Jumping, Team Jumping, STAR POWER" "Men's 25m Rapid Fire Pistol" "Individual Eventing, Team Eventing, CYRANO Z" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Name = col_character(),
##   ..   Country = col_character(),
##   ..   Age = col_double(),
##   ..   `Height, cm` = col_double(),
##   ..   Weight = col_double(),
##   ..   Sex = col_character(),
##   ..   `Date of birth` = col_character(),
##   ..   `Birth Month` = col_double(),
##   ..   `Place of birth` = col_character(),
##   ..   Gold = col_double(),
##   ..   Silver = col_double(),
##   ..   Bronze = col_double(),
##   ..   Total = col_double(),
##   ..   Sport = col_character(),
##   ..   Event = col_character()
##   .. )
#drop additional columns in both sets

#first drop ID in 2016
athletes2016 <- select(athletes2016, -1)
head(athletes2016)
#next drop columns in 2012 dataframe that are not in 2016 dataframe
athletes2012 <- select(athletes2012, -3, -8, -9, -13, -15)
head(athletes2012)
#check current column names
colnames(athletes2016)
##  [1] "name"        "nationality" "sex"         "dob"         "height"     
##  [6] "weight"      "sport"       "gold"        "silver"      "bronze"
colnames(athletes2012)
##  [1] "Name"          "Country"       "Height, cm"    "Weight"       
##  [5] "Sex"           "Date of birth" "Gold"          "Silver"       
##  [9] "Bronze"        "Sport"
#rename the columns
colnames(athletes2016) <- c("Athlete Name", "Nationality", "Gender", "Date_of_Birth" ,
                        "Height", "Weight", "Sport", 
                        "Gold Medals","Silver Medals","Bronze Medals" )

colnames(athletes2012) <- c("Athlete Name", "Nationality",
                        "Height", "Weight", "Gender", "Date_of_Birth",
                        "Gold Medals","Silver Medals","Bronze Medals", "Sport")

#check the new column names
colnames(athletes2016)
##  [1] "Athlete Name"  "Nationality"   "Gender"        "Date_of_Birth"
##  [5] "Height"        "Weight"        "Sport"         "Gold Medals"  
##  [9] "Silver Medals" "Bronze Medals"
colnames(athletes2012)
##  [1] "Athlete Name"  "Nationality"   "Height"        "Weight"       
##  [5] "Gender"        "Date_of_Birth" "Gold Medals"   "Silver Medals"
##  [9] "Bronze Medals" "Sport"
#convert some categorical variables to factors
athletes2016$Gender <- as.factor(athletes2016$Gender)
athletes2012$Gender <- as.factor(athletes2012$Gender)
athletes2016$Sport <- as.factor(athletes2016$Sport)
athletes2012$Sport <- as.factor(athletes2012$Sport)


#Transform date of birth from character to date
athletes2016$Date_of_Birth <-as.Date(athletes2016$Date_of_Birth, format = "%m/%d/%y")
class(athletes2016$Date_of_Birth)
## [1] "Date"
x <-athletes2016$Date_of_Birth
foo <- function(x, year=1950){
  m <- year(x) %% 100
  year(x) <- ifelse(m > year %% 100, 1900+m, 2000+m)
  x
}

athletes2016$Date_of_Birth <- foo(x, 1950)
athletes2016
athletes2012$Date_of_Birth <-as.Date(athletes2012$Date_of_Birth, format = "%m/%d/%Y")
class(athletes2012$Date_of_Birth)
## [1] "Date"
#check for null values
lapply(athletes2016,function(x) { length(which(is.na(x)))})
## $`Athlete Name`
## [1] 0
## 
## $Nationality
## [1] 0
## 
## $Gender
## [1] 0
## 
## $Date_of_Birth
## [1] 1
## 
## $Height
## [1] 330
## 
## $Weight
## [1] 659
## 
## $Sport
## [1] 0
## 
## $`Gold Medals`
## [1] 0
## 
## $`Silver Medals`
## [1] 0
## 
## $`Bronze Medals`
## [1] 0
lapply(athletes2012,function(x) { length(which(is.na(x)))})
## $`Athlete Name`
## [1] 0
## 
## $Nationality
## [1] 0
## 
## $Height
## [1] 561
## 
## $Weight
## [1] 1280
## 
## $Gender
## [1] 0
## 
## $Date_of_Birth
## [1] 20
## 
## $`Gold Medals`
## [1] 0
## 
## $`Silver Medals`
## [1] 0
## 
## $`Bronze Medals`
## [1] 0
## 
## $Sport
## [1] 0
#calculate mean height and weight per sport and gender and fill empty values with these
athletes2016 <- athletes2016 %>% 
                      group_by(Gender, Sport) %>% 
                      mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))

athletes2016 <- athletes2016 %>% 
                     group_by(Gender, Sport) %>% 
                     mutate(Height = ifelse(is.na(Height), mean(Height, na.rm = TRUE), Height))


#do the same for the 2012 dataset
athletes2012 <- athletes2012 %>% 
                      group_by(Gender, Sport) %>% 
                      mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))

athletes2012 <- athletes2012 %>% 
                     group_by(Gender, Sport) %>% 
                     mutate(Height = ifelse(is.na(Height), mean(Height, na.rm = TRUE), Height))


#all boxing athletes have no entry for weight so I will replace these values with the average weight per their nationality and gender
athletes2016 <- athletes2016 %>% 
                      group_by(Nationality, Gender) %>% 
                      mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))

lapply(athletes2016,function(x) { length(which(is.na(x)))})
## $`Athlete Name`
## [1] 0
## 
## $Nationality
## [1] 0
## 
## $Gender
## [1] 0
## 
## $Date_of_Birth
## [1] 1
## 
## $Height
## [1] 0
## 
## $Weight
## [1] 0
## 
## $Sport
## [1] 0
## 
## $`Gold Medals`
## [1] 0
## 
## $`Silver Medals`
## [1] 0
## 
## $`Bronze Medals`
## [1] 0
#we still have one DOB entry empty, I will analyse this particular athlete
missing_dob <- athletes2016[rowSums(is.na(athletes2016)) > 0,]
missing_dob
#The athlete is a well known russian sailor and his date of birth is publicly available so I will replace it with the public data
athletes2016[is.na(athletes2016)] <- c(as.Date("12/25/1987", format = "%m/%d/%Y"))

#convert height to cm for 2016 dataset
athletes2016$Height <- (athletes2016$Height) * 100
head(athletes2016)
#convert M to male and F to Female in 2012 set to match 2016 set
athletes2012$Gender <- as.character(athletes2012$Gender)
athletes2012$Gender[athletes2012$Gender == "M"] <- "male"
athletes2012$Gender[athletes2012$Gender == "F"] <- "female"
athletes2012$Gender <- as.factor(athletes2012$Gender)
head(athletes2012)
#observe Height variable range
min(athletes2016[,"Height"])
## [1] 121
which(athletes2016$Height == 121)
## [1] 7847
athletes2016[7847,]
max(athletes2016[,"Height"])
## [1] 221
which(athletes2016$Height == 221)
## [1] 5691
athletes2016[5691,]
#after searching online, this value of height is incorrect. I will replace it with his actual height 196cm
athletes2016[5691, 5] = 196
athletes2016[5691,]
#the new maximum is a Chinese basketballer and is correct
which(athletes2016$Height == 217)
## [1] 8770
athletes2016[8770,]
#Observe range of weight - no need to replace or change
min(athletes2016[,6])
## [1] 31
which(athletes2016$Weight == 31)
## [1] 3430
athletes2016[3430,]
max(athletes2016[,6])
## [1] 170
which(athletes2016$Weight == 170)
## [1] 1388 2338
athletes2016[2338,]
#Observe lowest and highest date of birth - no outliers observed
min(athletes2016$Date_of_Birth)
## [1] "1954-05-20"
max(athletes2016$Date_of_Birth)
## [1] "2002-11-26"
#create age bracket bins - we know the youngest athlete is 12 and the oldest is 62
#brackets will be 12 - 21, 22 - 31, 32 - 41, 42 - 51 and 52 - 62

#first add a column for age
a <- as.Date(athletes2016$Date_of_Birth)
b <- as.Date("8/5/2016", format = "%m/%d/%Y")

athletes2016$Age <- floor(age_calc(a, enddate = b, units = "years", precise = FALSE))
athletes2016
#Now create age bins
athletes2016$AgeBracket <- cut(athletes2016$Age, breaks=c(12, 21, 31, 41, 51, 61, 71), labels=c("12-21","22-31", "32-41", "42-51", "52-61", "62+"))
athletes2016
#count how many athletes fall into each ge bracket
count(athletes2016, vars = AgeBracket)
#we can see that most athletes (7907) are 22-31 years old.

#check null values
lapply(athletes2016,function(x) { length(which(is.na(x)))})
## $`Athlete Name`
## [1] 0
## 
## $Nationality
## [1] 0
## 
## $Gender
## [1] 0
## 
## $Date_of_Birth
## [1] 0
## 
## $Height
## [1] 0
## 
## $Weight
## [1] 0
## 
## $Sport
## [1] 0
## 
## $`Gold Medals`
## [1] 0
## 
## $`Silver Medals`
## [1] 0
## 
## $`Bronze Medals`
## [1] 0
## 
## $Age
## [1] 0
## 
## $AgeBracket
## [1] 0
lapply(athletes2012,function(x) { length(which(is.na(x)))})
## $`Athlete Name`
## [1] 0
## 
## $Nationality
## [1] 0
## 
## $Height
## [1] 0
## 
## $Weight
## [1] 804
## 
## $Gender
## [1] 0
## 
## $Date_of_Birth
## [1] 20
## 
## $`Gold Medals`
## [1] 0
## 
## $`Silver Medals`
## [1] 0
## 
## $`Bronze Medals`
## [1] 0
## 
## $Sport
## [1] 0

Details of the 2016athletes variables [1]:

Details of the 2012athletes variables [2]:

Sport and sex will need to be factor variables, date will need to be in the date format and birth month for the second dataset will also have to be factor. The levels for these are self explanatory, it will be a simple 1 & 2 for sex, it will be 1-12 for birth month and 1-n for sport where n is the number of different sports in the dataset.

Some variables such as height and weight are continuous numerical variables even though in this dataset they are represented as whole numbers. Other variables such as gold, silver and bronze medal tally will be discrete numerical variables as they only represent whole numbers of medals won (you can’t win half a medal)

Descriptive Statistics and Visualisation

Now that my data is ready to be used and is clean, I will complete some descriptive statistics and visualisations. In the above data cleaning section I have checked for any missing variables. With regards to height and weight, I used the average for that sport and the athlete’s gender. For a missing DOB variable, I obtained this from an online search of the athlete profile.

The important variables in my investigation will be sport, height, weight, age and medals won

First we observe the percentage of athletes in each sport. We can see that athletics has the overwhelming majority of athletes sitting at around 20% of total. All other sports make up around 5% or lower for each.

#Handle null values

#calculate mean height and weight per sport and gender and fill empty values with these
athletes2016 <- athletes2016 %>% 
                      group_by(Gender, Sport) %>% 
                      mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))

athletes2016 <- athletes2016 %>% 
                     group_by(Gender, Sport) %>% 
                     mutate(Height = ifelse(is.na(Height), mean(Height, na.rm = TRUE), Height))


#do the same for the 2012 dataset
athletes2012 <- athletes2012 %>% 
                      group_by(Gender, Sport) %>% 
                      mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))

athletes2012 <- athletes2012 %>% 
                     group_by(Gender, Sport) %>% 
                     mutate(Height = ifelse(is.na(Height), mean(Height, na.rm = TRUE), Height))


#all boxing athletes have no entry for weight so I will replace these values with the average weight per their nationality and gender
athletes2016 <- athletes2016 %>% 
                      group_by(Nationality, Gender) %>% 
                      mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))

lapply(athletes2016,function(x) { length(which(is.na(x)))})
## $`Athlete Name`
## [1] 0
## 
## $Nationality
## [1] 0
## 
## $Gender
## [1] 0
## 
## $Date_of_Birth
## [1] 0
## 
## $Height
## [1] 0
## 
## $Weight
## [1] 0
## 
## $Sport
## [1] 0
## 
## $`Gold Medals`
## [1] 0
## 
## $`Silver Medals`
## [1] 0
## 
## $`Bronze Medals`
## [1] 0
## 
## $Age
## [1] 0
## 
## $AgeBracket
## [1] 0
#we still have one DOB entry empty, I will analyse this particular athlete
missing_dob <- athletes2016[rowSums(is.na(athletes2016)) > 0,]
missing_dob
#The athlete is a well known russian sailor and his date of birth is publicly available so I will replace it with the public data
athletes2016[is.na(athletes2016)] <- c(as.Date("12/25/1987", format = "%m/%d/%Y"))

#Visualisations: - First I view the height and weight relationship for all the atheletes - there appears to be some outliers as people who are very tall with low weight and people who are very short with high weight - we can also observe that the trend of the data exists that as height increases so does weight however there is a point where the data plateaus and athletes don’t get any taller however do get heavier.

#VISUALISATIONS

#First lets view the height and weight relationship for all the atheletes
plot(Height ~ Weight,  data = athletes2016)

#there appears to be some outliers as people who are very tall with low weight and people who are very short with high weight
#we can also observe that the trend of the data exists that as height increases so does weight however there is a point where the data plateaus and athletes don't get any taller however do get heavier.

#now lets view the proportion of our athletes in each sport
freqSports <- athletes2016$Sport %>% table() %>% prop.table()*100
barplot(freqSports, main = "Athlete Sport Participation - Percentage",ylab="Percent", xlab="Sport", cex.names=0.7, las=2)

#next visualise proportion of athletes in each age bracket
freqAges <- athletes2016$AgeBracket %>% table() %>% prop.table()*100
barplot(freqAges, main = "Athlete Age Brackets - Percentage",ylab="Percent", xlab="Age Bracket", cex.names=0.7, las=2)

#Again we can see the overwhelming majority of athletes are aged 22-31

#next visualise proportion of athletes that win gold
freqMedals <- athletes2016$`Gold Medals` %>% table() %>% prop.table()*100
barplot(freqMedals, main = "Athlete Gold Medal Winners - Percentage",ylab="Percent", xlab="Gold Medal Count", cex.names=0.7, las=2)

#Again we can see the overwhelming majority of athletes win 0 gold medals.

#lets dive deeper and see the exact percentage of winners of 1 - 5 gold medals.
#we observe that roughly 5.3% of athletes win 1 Gold medal or more
freqMedals
## .
##            0            1            2            3            4            5 
## 94.635118738  5.061535795  0.234009360  0.043335067  0.017334027  0.008667013
#Create box plots to view the relationship between height and winning gold. It appears a taller athlete is more lilely than a smaller athlete to win 1 (or 5) gold medals
athletes2016 %>% boxplot(Height ~ `Gold Medals`, data = ., main="Box Plot of Gold Medals by Height", 
        ylab = "Height", xlab = "Gold Medals", col="grey")

#Now compare athlete weight and winning gold
#Here we see that a lighter athlete is more likely to win than a heavier athlete. There are some outliers for the winners of 1 gold medal (where some very heavy athletes won)
athletes2016 %>% boxplot(Weight ~ `Gold Medals`, data = ., main="Box Plot of Gold Medals by Weight", 
        ylab = "Weight", xlab = "Gold Medals", col="grey")

Decsriptive Statistics Cont.

I will use male cycling weight as an example and plot a histogram with a normal overlay

#Create a table with the descriptive statistics for age by each sport
athletes2016 %>% group_by(Sport, `Gold Medals`) %>% summarise(Min = min(Age,na.rm = TRUE),
                                           Q1 = quantile(Age,probs = .25,na.rm = TRUE),
                                           Median = median(Age, na.rm = TRUE),
                                           Q3 = quantile(Age,probs = .75,na.rm = TRUE),
                                           Max = max(Age,na.rm = TRUE),
                                           Mean = mean(Age, na.rm = TRUE),
                                           SD = sd(Age, na.rm = TRUE),
                                           Range = Max - Min,
                                           IQR = Q3 - Q1,
                                           n = n()) -> table1
knitr::kable(table1)
Sport Gold Medals Min Q1 Median Q3 Max Mean SD Range IQR n
aquatics 0 13 20.00 23.0 26.00 41 23.41976 4.3753138 28 6.00 1346
aquatics 1 15 21.00 22.0 26.00 33 23.47059 4.0195600 18 5.00 85
aquatics 2 19 20.00 23.5 26.75 35 24.30000 5.2925524 16 6.75 10
aquatics 3 21 22.50 24.0 25.50 27 24.00000 4.2426407 6 3.00 2
aquatics 4 19 19.00 19.0 19.00 19 19.00000 NA 0 0.00 1
aquatics 5 31 31.00 31.0 31.00 31 31.00000 NA 0 0.00 1
archery 0 16 22.00 24.5 28.00 44 25.68852 5.8846796 28 6.00 122
archery 1 20 20.75 22.5 25.00 28 23.25000 3.5939764 8 4.25 4
archery 2 23 24.50 26.0 27.50 29 26.00000 4.2426407 6 3.00 2
athletics 0 15 23.00 26.0 29.00 46 26.32002 4.8055656 31 6.00 2303
athletics 1 19 23.50 26.0 28.50 37 26.05455 3.8892929 18 5.00 55
athletics 2 24 28.50 30.0 30.75 33 29.25000 3.7749172 9 2.25 4
athletics 3 29 29.00 29.0 29.00 29 29.00000 NA 0 0.00 1
badminton 0 19 24.00 26.5 29.00 40 26.82317 3.9271526 21 5.00 164
badminton 1 23 25.50 26.5 29.25 32 27.12500 3.0443155 9 3.75 8
basketball 0 20 24.00 27.0 31.00 39 27.56439 4.3614738 19 7.00 264
basketball 1 21 26.00 27.0 30.50 37 28.20833 3.9997735 16 4.50 24
boxing 0 18 22.00 25.0 28.00 36 25.04396 3.8403204 18 6.00 273
boxing 1 21 23.00 24.0 25.00 33 24.61538 3.0149202 12 2.00 13
canoe 0 17 23.00 27.0 29.00 48 26.78387 4.5732157 31 6.00 310
canoe 1 21 24.00 29.5 33.00 36 28.62500 4.6743984 15 9.00 16
canoe 2 22 25.00 27.0 28.25 29 26.25000 3.0956959 7 3.25 4
canoe 3 29 29.00 29.0 29.00 29 29.00000 NA 0 0.00 1
cycling 0 18 23.00 27.0 30.00 44 26.91816 4.5920899 26 7.00 501
cycling 1 21 23.00 25.5 29.75 42 27.04545 5.3849638 21 6.75 22
cycling 2 24 24.00 24.0 24.00 24 24.00000 NA 0 0.00 1
cycling 3 28 28.00 28.0 28.00 28 28.00000 NA 0 0.00 1
equestrian 0 18 30.00 36.0 44.00 62 37.12560 9.6669626 44 14.00 207
equestrian 1 21 31.50 36.0 47.00 58 38.86667 10.5211805 37 15.50 15
fencing 0 15 24.00 27.0 31.00 41 27.47788 4.8909846 26 7.00 226
fencing 1 20 26.00 28.0 31.00 38 28.47368 4.1145868 18 5.00 19
fencing 2 22 22.00 22.0 22.00 22 22.00000 NA 0 0.00 1
football 0 16 21.00 22.0 25.00 40 23.48870 3.5948270 24 4.00 575
football 1 19 22.00 23.0 27.25 33 24.41667 3.6282621 14 5.25 36
golf 0 18 24.00 28.5 34.75 46 29.50000 7.0792220 28 10.75 118
golf 1 28 30.00 32.0 34.00 36 32.00000 5.6568542 8 4.00 2
gymnastics 0 15 19.00 21.0 24.50 41 22.05686 4.4025745 26 5.50 299
gymnastics 1 16 19.25 22.0 25.00 31 22.31818 3.8469080 15 5.75 22
gymnastics 2 23 24.00 25.0 26.00 27 25.00000 2.8284271 4 2.00 2
gymnastics 4 19 19.00 19.0 19.00 19 19.00000 NA 0 0.00 1
handball 0 19 25.00 27.0 31.00 43 27.82934 4.5451596 24 6.00 334
handball 1 20 27.00 29.0 31.00 35 28.68966 3.7806221 15 4.00 29
hockey 0 17 23.00 26.0 29.00 37 25.96482 3.6728658 20 6.00 398
hockey 1 21 25.25 28.5 31.00 37 28.55882 4.0690344 16 5.75 34
judo 0 18 24.00 26.0 29.00 39 26.22751 3.7425200 21 5.00 378
judo 1 21 24.00 25.0 26.75 30 25.28571 2.8670159 9 2.75 14
modern pentathlon 0 16 23.00 25.0 29.00 36 25.60000 3.8949987 20 6.00 70
modern pentathlon 1 24 25.00 26.0 27.00 28 26.00000 2.8284271 4 2.00 2
rowing 0 18 24.00 26.0 29.00 56 26.69940 4.3994591 38 5.00 499
rowing 1 22 26.00 28.0 31.00 37 28.77083 3.5921876 15 5.00 48
rugby sevens 0 18 23.00 26.0 29.00 36 26.12364 3.8264334 18 6.00 275
rugby sevens 1 21 23.00 25.0 28.00 32 25.32000 3.0784195 11 5.00 25
sailing 0 17 24.00 27.0 30.00 47 27.41096 5.2629631 30 6.00 365
sailing 1 25 26.50 28.0 30.00 54 30.00000 7.2308861 29 3.50 15
shooting 0 16 25.00 31.0 36.00 55 31.40426 8.3775180 39 11.00 376
shooting 1 19 25.00 29.0 33.00 49 30.23077 8.6231650 30 8.00 13
shooting 2 28 28.00 28.0 28.00 28 28.00000 NA 0 0.00 1
table tennis 0 15 23.00 27.0 32.00 54 27.77711 6.8058589 39 9.00 166
table tennis 1 25 25.75 27.0 28.00 28 26.75000 1.5000000 3 2.25 4
table tennis 2 26 26.25 26.5 26.75 27 26.50000 0.7071068 1 0.50 2
taekwondo 0 17 21.00 24.0 27.00 37 24.24167 4.2207886 20 6.00 120
taekwondo 1 20 21.50 22.0 23.75 28 22.87500 2.7998724 8 2.25 8
tennis 0 18 25.00 28.0 31.00 43 28.02128 4.6168228 25 6.00 188
tennis 1 22 26.75 29.5 30.25 34 28.37500 4.0333432 12 3.50 8
triathlon 0 19 25.00 27.0 30.00 38 27.66667 3.8499909 19 5.00 108
triathlon 1 28 28.50 29.0 29.50 30 29.00000 1.4142136 2 1.00 2
volleyball 0 17 25.00 27.0 30.00 41 27.45787 4.3288641 24 5.00 356
volleyball 1 19 24.00 27.0 30.00 40 27.21429 5.2518578 21 6.00 28
weightlifting 0 15 22.00 24.0 28.00 41 24.86420 4.1273721 26 6.00 243
weightlifting 1 21 22.00 24.0 25.00 33 24.26667 3.0814344 12 3.00 15
wrestling 0 18 24.00 26.0 29.00 38 26.67463 3.7189973 20 5.00 335
wrestling 1 20 21.25 24.0 25.00 33 24.55556 3.8535274 13 3.75 18
#Now create a table with the descriptive statistics for height by each sport and gender
athletes2016 %>% group_by(Sport, Gender) %>% summarise(Min = min(Height,na.rm = TRUE),
                                           Q1 = quantile(Height,probs = .25,na.rm = TRUE),
                                           Median = median(Height),
                                           Q3 = quantile(Height,probs = .75,na.rm = TRUE),
                                           Max = max(Height,na.rm = TRUE),
                                           Mean = mean(Height, na.rm = TRUE),
                                           SD = sd(Height, na.rm = TRUE),
                                           Range = Max - Min,
                                           IQR = Q3 - Q1,
                                           n = n()) -> table2
knitr::kable(table2)
Sport Gender Min Q1 Median Q3 Max Mean SD Range IQR n
aquatics female 143 167.00 171.5712 176.00 194 171.5712 7.403290 51 9.00 716
aquatics male 153 181.00 186.0342 192.00 208 185.9999 8.794747 55 11.00 729
archery female 153 163.00 168.0000 172.00 183 167.6190 6.168091 30 9.00 64
archery male 164 175.00 180.0000 183.00 194 179.5714 5.800149 30 8.00 64
athletics female 121 164.00 169.0500 174.00 193 169.0500 7.925731 72 10.00 1137
athletics male 145 175.00 180.9234 187.00 208 180.9234 8.896243 63 12.00 1226
badminton female 156 165.00 168.0000 172.00 184 168.6000 5.562902 28 7.00 86
badminton male 167 176.25 180.2529 184.00 198 180.5059 6.316409 31 7.75 86
basketball female 161 177.00 185.0000 190.00 203 183.3819 9.184965 42 13.00 144
basketball male 181 194.75 201.0000 207.00 218 200.3611 8.281423 37 12.25 144
boxing female 150 162.75 168.0000 174.25 184 168.2222 8.124429 34 11.50 36
boxing male 152 170.00 175.7615 181.00 206 175.7615 9.132098 54 11.00 250
canoe female 153 166.50 170.0000 174.00 185 170.0648 6.356490 32 7.50 111
canoe male 159 178.00 183.0000 186.00 202 182.0455 6.852505 43 8.00 220
cycling female 155 163.00 167.1564 170.25 185 167.3128 5.895198 30 7.25 200
cycling male 164 175.00 179.3074 184.00 196 179.3074 6.331333 32 9.00 325
equestrian female 151 163.00 168.0000 172.00 185 168.2530 6.843216 34 9.00 85
equestrian male 161 174.00 179.0000 183.00 196 179.0000 7.407270 35 9.00 137
fencing female 152 167.75 171.5000 176.00 185 171.2581 6.538032 33 8.25 124
fencing male 165 178.00 183.1901 189.00 205 183.3802 7.478074 40 11.00 122
football female 153 163.00 168.0000 173.00 188 168.2710 6.553599 35 10.00 264
football male 160 174.00 180.0000 185.00 196 179.3746 7.206978 36 11.00 347
golf female 158 165.00 168.0000 172.50 183 169.1404 5.694639 25 7.50 59
golf male 160 173.00 180.0000 185.00 196 179.3500 7.999635 36 12.00 61
gymnastics female 133 156.00 163.0000 168.00 181 161.6524 9.051058 48 12.00 210
gymnastics male 152 163.00 167.0000 172.00 184 167.7105 6.168323 32 9.00 114
handball female 161 172.00 176.0000 180.00 192 175.9722 5.801590 31 8.00 180
handball male 177 186.00 191.0000 196.00 210 191.3169 6.816922 33 10.00 183
hockey female 152 162.00 167.0000 171.00 196 166.7870 6.984436 44 9.00 216
hockey male 161 176.00 180.0000 184.00 196 180.0093 5.989130 35 8.00 216
judo female 150 161.00 165.0000 173.00 186 166.5132 8.130504 36 12.00 153
judo male 155 172.00 178.1783 185.00 210 178.1783 9.474799 55 13.00 239
modern pentathlon female 159 165.75 169.0000 175.00 182 169.9444 5.820871 23 9.25 36
modern pentathlon male 172 180.00 181.5000 185.00 197 182.5278 5.140795 25 5.00 36
rowing female 150 173.00 178.0000 183.00 196 177.3286 7.512572 46 10.00 216
rowing male 152 186.00 190.0000 195.00 209 189.8708 7.557571 57 9.00 331
rugby sevens female 153 163.75 168.0000 172.00 203 167.8552 6.756181 50 8.25 148
rugby sevens male 168 177.00 183.5000 188.00 198 182.9145 7.272323 30 11.00 152
sailing female 152 165.00 169.2767 173.00 184 169.2767 6.079628 32 8.00 163
sailing male 165 177.00 181.0000 186.00 204 181.2770 6.549900 39 9.00 217
shooting female 149 160.00 164.8400 169.00 181 164.8400 6.079561 32 9.00 151
shooting male 160 174.00 178.0000 182.00 202 178.0343 6.763013 42 8.00 239
table tennis female 149 161.00 165.3659 168.75 181 165.3659 6.301283 32 7.75 86
table tennis male 162 174.00 178.5000 183.00 198 178.6860 7.099729 36 9.00 86
taekwondo female 159 168.75 173.0000 176.50 188 173.0159 6.188807 29 7.75 64
taekwondo male 160 180.00 186.5000 191.00 207 186.3750 8.820071 47 11.00 64
tennis female 159 168.00 174.0000 178.00 186 173.4205 6.290595 27 10.00 91
tennis male 172 181.00 186.0000 191.00 204 186.0388 7.107583 32 10.00 105
triathlon female 153 164.00 168.0000 171.00 180 167.6545 5.618152 27 7.00 55
triathlon male 166 175.00 179.0000 183.00 191 178.8704 6.212432 25 8.00 55
volleyball female 159 178.00 183.0000 188.00 202 182.8691 7.756682 43 10.00 192
volleyball male 172 191.00 197.0000 202.00 211 196.4010 7.638467 39 11.00 192
weightlifting female 137 155.00 160.0000 165.00 178 160.3558 7.703131 41 10.00 104
weightlifting male 148 162.25 170.0000 180.00 197 170.9221 10.776719 49 17.75 154
wrestling female 150 160.25 165.0000 170.00 180 165.1239 6.796840 30 9.75 114
wrestling male 152 169.00 175.0000 182.00 203 175.7215 9.581499 51 13.00 239
#Now create a table with the descriptive statistics for Weight by each sport and gender, including calculations for normal distribution
athletes2016 %>% group_by(Sport, Gender) %>% summarise(Min = min(Weight,na.rm = TRUE),
                                           Q1 = quantile(Weight,probs = .25,na.rm = TRUE),
                                           Median = median(Weight),
                                           Q3 = quantile(Weight,probs = .75,na.rm = TRUE),
                                           Max = max(Weight,na.rm = TRUE),
                                           Mean = mean(Weight, na.rm = TRUE),
                                           SD = sd(Weight, na.rm = TRUE),
                                           Variance = var(Weight),
                                           Range = Max - Min,
                                           IQR = Q3 - Q1,
                                           n = n(),
                                           valNorm = SD * 1.33,
                                           distNorm = IQR - valNorm) -> table3
knitr::kable(table3)
Sport Gender Min Q1 Median Q3 Max Mean SD Variance Range IQR n valNorm distNorm
aquatics female 39.00000 57.00000 62.00000 66.00000 130 62.28448 8.666120 75.10164 91.00000 9.000000 716 11.525940 -2.5259399
aquatics male 50.00000 75.00000 82.00000 89.00000 125 82.21906 11.393093 129.80256 75.00000 14.000000 729 15.152813 -1.1528133
archery female 42.00000 56.75000 63.00000 73.00000 90 64.30159 10.225451 104.55984 48.00000 16.250000 64 13.599849 2.6501508
archery male 46.00000 74.00000 79.50000 84.50000 130 80.07937 13.458108 181.12069 84.00000 10.500000 64 17.899284 -7.3992844
athletics female 39.00000 52.00000 58.00000 63.00000 150 60.15254 13.679727 187.13494 111.00000 11.000000 1137 18.194037 -7.1940372
athletics male 47.00000 63.00000 72.50000 80.00000 165 74.77768 17.330468 300.34510 118.00000 17.000000 1226 23.049522 -6.0495218
badminton female 50.00000 58.00000 60.50000 64.75000 85 61.20988 5.637513 31.78155 35.00000 6.750000 86 7.497892 -0.7478924
badminton male 60.00000 72.00000 76.15663 80.00000 93 76.15663 6.964586 48.50546 33.00000 8.000000 86 9.262899 -1.2628993
basketball female 53.00000 68.00000 75.00000 83.25000 104 75.37762 10.354422 107.21404 51.00000 15.250000 144 13.771381 1.4786194
basketball male 79.00000 90.00000 100.00000 108.25000 137 100.29787 11.835554 140.08035 58.00000 18.250000 144 15.741287 2.5087128
boxing female 56.33205 60.15761 62.46073 63.84033 91 62.83780 5.469579 29.91629 34.66795 3.682722 36 7.274540 -3.5918181
boxing male 63.47368 75.22449 80.67671 82.25725 114 79.81591 6.099835 37.20799 50.52632 7.032756 250 8.112781 -1.0800249
canoe female 50.00000 63.00000 66.00000 70.00000 99 66.45794 7.164661 51.33237 49.00000 7.000000 111 9.528999 -2.5289994
canoe male 57.00000 75.75000 83.00000 89.00000 115 82.15000 8.937646 79.88151 58.00000 13.250000 220 11.887068 1.3629315
cycling female 45.00000 55.75000 60.00000 64.00000 90 60.20725 7.018894 49.26487 45.00000 8.250000 200 9.335129 -1.0851291
cycling male 52.00000 66.00000 71.00000 78.00000 100 72.57605 9.065853 82.18970 48.00000 12.000000 325 12.057585 -0.0575852
equestrian female 47.00000 55.00000 58.00000 62.00000 75 58.63415 5.842195 34.13124 28.00000 7.000000 85 7.770119 -0.7701194
equestrian male 50.00000 69.00000 72.00000 79.00000 89 72.95489 7.268854 52.83624 39.00000 10.000000 137 9.667576 0.3324236
fencing female 45.00000 58.00000 62.00000 67.25000 81 62.73387 7.362958 54.21315 36.00000 9.250000 124 9.792734 -0.5427343
fencing male 60.00000 75.00000 78.00000 83.00000 102 78.78512 7.760045 60.21829 42.00000 8.000000 122 10.320859 -2.3208592
football female 47.00000 56.00000 60.00000 65.00000 82 61.06107 6.199786 38.43735 35.00000 9.000000 264 8.245716 0.7542843
football male 60.00000 70.00000 74.45171 79.00000 100 74.45171 6.774549 45.89451 40.00000 9.000000 347 9.010150 -0.0101501
golf female 48.00000 58.50000 63.00000 68.00000 85 63.20000 7.219227 52.11724 37.00000 9.500000 59 9.601572 -0.1015722
golf male 63.00000 72.00000 79.00000 86.00000 106 79.00000 9.150592 83.73333 43.00000 14.000000 61 12.170287 1.8297127
gymnastics female 31.00000 47.00000 50.00000 54.00000 63 49.55502 5.567170 30.99338 32.00000 7.000000 210 7.404336 -0.4043363
gymnastics male 50.00000 60.00000 63.12727 66.00000 81 63.25455 6.028599 36.34401 31.00000 6.000000 114 8.018037 -2.0180367
handball female 52.00000 66.00000 70.00000 74.00000 105 70.78916 7.775344 60.45598 53.00000 8.000000 180 10.341208 -2.3412080
handball male 72.00000 88.00000 95.00000 103.50000 118 95.43169 9.955090 99.10383 46.00000 15.500000 183 13.240270 2.2597298
hockey female 45.00000 55.75000 60.00000 65.00000 78 60.42593 6.265748 39.25960 33.00000 9.250000 216 8.333445 0.9165545
hockey male 58.00000 74.00000 77.00000 81.25000 95 77.37500 6.881936 47.36105 37.00000 7.250000 216 9.152975 -1.9029752
judo female 46.00000 52.00000 63.00000 70.00000 132 65.39216 16.519221 272.88467 86.00000 18.000000 153 21.970564 -3.9705644
judo male 60.00000 68.00000 81.00000 94.50000 170 84.61674 20.376463 415.20024 110.00000 26.500000 239 27.100696 -0.6006955
modern pentathlon female 49.00000 53.75000 58.00000 60.00000 69 58.00000 4.974219 24.74286 20.00000 6.250000 36 6.615712 -0.3657116
modern pentathlon male 64.00000 70.75000 74.00000 78.00000 90 73.91667 5.683937 32.30714 26.00000 7.250000 36 7.559637 -0.3096366
rowing female 49.00000 63.75000 71.50000 75.00000 90 69.77619 8.472797 71.78828 41.00000 11.250000 216 11.268820 -0.0188196
rowing male 53.00000 73.00000 90.00000 95.00000 110 86.50462 12.010506 144.25225 57.00000 22.000000 331 15.973973 6.0260272
rugby sevens female 53.00000 63.00000 66.59589 71.00000 89 66.59589 6.629740 43.95345 36.00000 8.000000 148 8.817554 -0.8175542
rugby sevens male 65.00000 83.00000 90.00000 98.00000 113 90.45033 9.400358 88.36674 48.00000 15.000000 152 12.502477 2.4975233
sailing female 50.00000 59.00000 63.19497 68.00000 74 63.19497 5.534937 30.63553 24.00000 9.000000 163 7.361467 1.6385334
sailing male 59.00000 72.00000 77.00000 82.00000 102 77.12207 8.971876 80.49457 43.00000 10.000000 217 11.932596 -1.9325956
shooting female 43.00000 55.00000 61.00000 67.00000 95 62.67568 10.064602 101.29622 52.00000 12.000000 151 13.385921 -1.3859208
shooting male 45.00000 73.00000 80.00000 88.00000 140 81.06897 13.352823 178.29788 95.00000 15.000000 239 17.759255 -2.7592547
table tennis female 42.00000 53.25000 57.53012 61.50000 84 57.53012 6.728932 45.27853 42.00000 8.250000 86 8.949480 -0.6994796
table tennis male 51.00000 67.00000 72.00000 77.00000 99 72.55814 8.652242 74.86129 48.00000 10.000000 86 11.507481 -1.5074814
taekwondo female 48.00000 54.50000 61.25806 67.00000 88 61.25806 9.107707 82.95033 40.00000 12.500000 64 12.113251 0.3867493
taekwondo male 57.00000 62.75000 72.40476 82.25000 108 74.80952 13.913402 193.58277 51.00000 19.500000 64 18.504825 0.9951748
tennis female 52.00000 61.00000 65.00000 68.00000 84 64.67045 5.386653 29.01604 32.00000 7.000000 91 7.164249 -0.1642491
tennis male 65.00000 75.00000 80.00000 85.00000 99 80.41748 7.405201 54.83701 34.00000 10.000000 105 9.848918 0.1510824
triathlon female 44.00000 51.50000 55.00000 58.00000 65 54.56364 4.532858 20.54680 21.00000 6.500000 55 6.028701 0.4712989
triathlon male 56.00000 63.50000 65.00000 70.00000 80 66.81481 5.343997 28.55830 24.00000 6.500000 55 7.107515 -0.6075154
volleyball female 52.00000 66.00000 71.00000 75.00000 92 70.68421 6.709785 45.02122 40.00000 9.000000 192 8.924014 0.0759856
volleyball male 64.00000 84.00000 89.00000 95.25000 115 89.42188 8.419880 70.89439 51.00000 11.250000 192 11.198441 0.0515589
weightlifting female 47.00000 57.00000 63.00000 74.00000 143 68.78846 20.526409 421.33346 96.00000 17.000000 104 27.300124 -10.3001237
weightlifting male 55.00000 69.00000 84.00000 104.00000 170 87.53896 27.938339 780.55076 115.00000 35.000000 154 37.157990 -2.1579902
wrestling female 47.00000 54.25000 62.50000 69.00000 80 61.80531 9.109810 82.98864 33.00000 14.750000 114 12.116048 2.6339523
wrestling male 39.00000 66.00000 82.00000 98.00000 130 85.37288 21.409295 458.35793 91.00000 32.000000 239 28.474363 3.5256372
#I will use male cycling as an example and plot
maleCyc <- athletes2016 %>% 
                      group_by(Gender, Sport) %>% 
                      filter(Sport == "cycling" , Gender == "male")

#Plot histogram with normal overlay - Male cyclists weight
h<-hist(maleCyc$Weight,breaks=15, main = "Normal Distribution of Male Cyclists' weight")

weight.plot <-c(min(h$breaks),h$breaks)
freq<-c(0,h$density,0)
xfit<-seq(min(maleCyc$Weight),max(maleCyc$Weight),length=40)
yfit<-dnorm(xfit,mean=mean(maleCyc$Weight),sd=sd(maleCyc$Weight)) 
plot(weight.plot,freq,type="s",ylim=c(0,max(freq,yfit)), main="Normal pdf and histogram - Male Cyclist weights")
lines(xfit,yfit, col="red")
maleCyc$Weight %>% mean() %>% abline(v=.,col='red',lw=2)

#Complete probability testing as per research problem

  1. Find the gold medal count and therefore probability of winning gold per sport and gender
  2. Find probability of winning in your sport based on gender, number of medals and athletes - we observe that the highest probability sports to win for athletes based on being female are swimming and rowing (both higher than 9%) whilst the lowest probabilities are in golf and triathlon where there is little over 1% change of winning by gender
  3. Now complete the same for males - In the male competitions we see the highest probability of winning in gymnastics which is > 11%, and the lowest probabilities in golf and triathlon again. - It is noted that the reason for this is there is only one event (one medal) for each of these sports so athletes cannot participate in multiple events, therefore lowering their chances.
  4. Now I know that the most athletes are in age bracket 22-31 so I will calculate the probability of winning based on gender and age - We have found the probability of winning gold if you are in the age bracket 22-31 compared to a different age bracket. - The highest values are for the events where there is only 1 medal and the athlete fell in this age bracket e.g golf and triathlon as well as a number of other events where all gold medals were won by people in this age bracket. - The lowest probabilities are in shooting, gymnastics and boxing. - Total probabilites can be found in the final table per sport for females
#Probability testing

library(janitor)

#Find the gold medal count and therefore probability of winning gold per sport and gender
m1 <- as.matrix(tabyl(athletes2016, Gender, Sport))
m2 <- as.numeric(c(m1[1, 2:29]))
m3 <- as.numeric(c(m1[2, 2:29]))

athletes2016 %>%
  group_by(Sport)%>%
  filter(`Gold Medals` > 0 , Gender == "female") %>% 
  summarise(totalGoldMedals = sum(`Gold Medals`)) -> table4

knitr::kable(table4)
Sport totalGoldMedals
aquatics 65
archery 4
athletics 32
badminton 4
basketball 12
boxing 3
canoe 9
cycling 13
equestrian 5
fencing 11
football 18
golf 1
gymnastics 17
handball 15
hockey 16
judo 7
modern pentathlon 1
rowing 20
rugby sevens 12
sailing 7
shooting 6
table tennis 4
taekwondo 4
tennis 4
triathlon 1
volleyball 14
weightlifting 7
wrestling 6
#Find probability of winning in your sport based on gender, number of medals and athletes
as.data.frame(table4)
table4$TotalFemaleAthletes <- m2
table4$ProbWin <- ((table4$totalGoldMedals / table4$TotalFemaleAthletes) * 100)
table4
#Now complete for males
athletes2016 %>%
  group_by(Sport)%>%
  filter(`Gold Medals` > 0 , Gender == "male") %>% 
  summarise(totalGoldMedals = sum(`Gold Medals`)) -> table5

knitr::kable(table5)
Sport totalGoldMedals
aquatics 55
archery 4
athletics 34
badminton 4
basketball 12
boxing 10
canoe 18
cycling 14
equestrian 10
fencing 10
football 18
golf 1
gymnastics 13
handball 14
hockey 18
judo 7
modern pentathlon 1
rowing 28
rugby sevens 13
sailing 8
shooting 9
table tennis 4
taekwondo 4
tennis 4
triathlon 1
volleyball 14
weightlifting 8
wrestling 12
as.data.frame(table5)
table5$TotalMaleAthletes <- m3
table5$ProbWin <- ((table5$totalGoldMedals / table5$TotalMaleAthletes) * 100)
table5
#Now I know that the most athletes are in age bracket 22-31 so I will calculate the probability of winning based on gender and age
as.matrix(tabyl(athletes2016, AgeBracket, Sport))
##      AgeBracket aquatics archery athletics badminton basketball boxing canoe
## [1,] "12-21"    "529"    "32"    " 350"    " 14"     " 19"      " 50"  " 38"
## [2,] "22-31"    "848"    "76"    "1668"    "138"     "209"      "219"  "238"
## [3,] "32-41"    " 68"    "17"    " 339"    " 20"     " 60"      " 17"  " 54"
## [4,] "42-51"    "  0"    " 3"    "   6"    "  0"     "  0"      "  0"  "  1"
## [5,] "52-61"    "  0"    " 0"    "   0"    "  0"     "  0"      "  0"  "  0"
## [6,] "62+"      "  0"    " 0"    "   0"    "  0"     "  0"      "  0"  "  0"
##      cycling equestrian fencing football golf gymnastics handball hockey judo 
## [1,] " 56"   " 8"       " 21"   "180"    "17" "165"      " 23"    " 51"  " 37"
## [2,] "385"   "63"       "172"   "412"    "64" "150"      "265"    "345"  "317"
## [3,] " 81"   "78"       " 53"   " 19"    "30" "  9"      " 74"    " 36"  " 38"
## [4,] "  3"   "53"       "  0"   "  0"    " 9" "  0"      "  1"    "  0"  "  0"
## [5,] "  0"   "19"       "  0"   "  0"    " 0" "  0"      "  0"    "  0"  "  0"
## [6,] "  0"   " 1"       "  0"   "  0"    " 0" "  0"      "  0"    "  0"  "  0"
##      modern pentathlon rowing rugby sevens sailing shooting table tennis
## [1,] "10"              " 47"  " 29"        " 43"   " 44"    "29"        
## [2,] "57"              "429"  "243"        "261"   "169"    "97"        
## [3,] " 5"              " 68"  " 28"        " 71"   "128"    "42"        
## [4,] " 0"              "  2"  "  0"        "  4"   " 42"    " 2"        
## [5,] " 0"              "  1"  "  0"        "  1"   "  7"    " 2"        
## [6,] " 0"              "  0"  "  0"        "  0"   "  0"    " 0"        
##      taekwondo tennis triathlon volleyball weightlifting wrestling
## [1,] "34"      " 12"  " 5"      " 32"      " 50"         " 29"    
## [2,] "85"      "143"  "86"      "288"      "194"         "286"    
## [3,] " 9"      " 39"  "19"      " 64"      " 14"         " 38"    
## [4,] " 0"      "  2"  " 0"      "  0"      "  0"         "  0"    
## [5,] " 0"      "  0"  " 0"      "  0"      "  0"         "  0"    
## [6,] " 0"      "  0"  " 0"      "  0"      "  0"         "  0"
table4$totalGoldMedals
##  [1] 65  4 32  4 12  3  9 13  5 11 18  1 17 15 16  7  1 20 12  7  6  4  4  4  1
## [26] 14  7  6
athletes2016 %>%
  group_by(Sport, Gender)%>%
  filter(`Gold Medals` >0, AgeBracket == "22-31" , Gender == "female") %>% 
  summarise(AgeBracketGoldMedals = sum(`Gold Medals`)) -> table6

knitr::kable(table6)
Sport Gender AgeBracketGoldMedals
aquatics female 42
archery female 3
athletics female 26
badminton female 4
basketball female 6
boxing female 1
canoe female 7
cycling female 11
equestrian female 2
fencing female 9
football female 16
golf female 1
gymnastics female 5
handball female 10
hockey female 12
judo female 7
modern pentathlon female 1
rowing female 18
rugby sevens female 8
sailing female 6
shooting female 2
table tennis female 4
taekwondo female 4
tennis female 4
triathlon female 1
volleyball female 9
weightlifting female 5
wrestling female 3
as.data.frame(table6)
table6$totalGoldMedals <- as.numeric(c(table4$totalGoldMedals))

table6
#Now we have found the probability of winning gold if you are in the age bracket 22-31 compared to a different age bracket. The highest values are for the events where there is only 1 medal and the athlete fell in this age bracket e.g golf and triathlon as well as a number of other events where all gold medals were won by people in this age bracket. The lowest probabilities are in shooting, gymnastics and boxing.
table6$ProbGoldAge <- (table6$AgeBracketGoldMedals/table6$totalGoldMedals) * 100
table6$totalProbWin <- ((table4$totalGoldMedals / table4$TotalFemaleAthletes) * (table6$ProbGoldAge/100))*100
table6

#Now complete cluster sampling on the 2012 dataset and see if male cyclists also follow the same normal distribution for weight

  1. Take all 2012 male road cyclists into a dataframe
  2. Complete simple random sampling without replacement on the male cylcists 2012 dataframe
  3. Check for normal distribution in the sample set using IQR = 1.33 * SD - We can see that 1.33 * SD is higher than IQR therefore the sample is not normally distributed
  4. Now plot the distribution - In the 2016 dataset we see mean 72.57605, SD 9.065853 and variance 82.18970 therefore it makes sense we see different SD and variance however a similar mean as the sample population is a lot fewer
  5. Calculate the confidence interval for this mean as a representative sample of the whole dataset
  6. Finalise 95% CI for the sample
#take all male cyclists into a dataframe

maleCyc2012 <- athletes2012 %>% group_by(Sport, Gender) %>% 
                                           filter(Sport == "Cycling - Road", Gender == "male")
maleCyc2012
#complete simple random sampling without replacement on the male cylcists 2012 dataframe
index <- sample(1:nrow(maleCyc2012), 50) #take 50 samples
cycleSamp <- maleCyc2012[index,]
cycleSamp
#check for normal distribution in the sample set
summary(cycleSamp$Weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   58.00   65.25   69.50   70.65   76.00   86.00
IQR(cycleSamp$Weight)
## [1] 10.75
#we can see that 1.33 * SD is higher than IQR therefore the sample is not normally distributed
normCycSamp <- sd(cycleSamp$Weight) * 1.33
print(normCycSamp)
## [1] 9.212591
#now plot the distribution
h<-hist(cycleSamp$Weight,breaks=15, main = "Distribution of 2012 Sample group: Male Cyclists' weight")

weight.plot <-c(min(h$breaks),h$breaks)
freq<-c(0,h$density,0)
xfit<-seq(min(cycleSamp$Weight),max(cycleSamp$Weight),length=40)
yfit<-dnorm(xfit,mean=mean(cycleSamp$Weight,sd=sd(cycleSamp$Weight))) 
plot(weight.plot,freq,type="s",ylim=c(0,max(freq,yfit)), main="Normal pdf and histogram - Male Cyclist weights")
lines(xfit,yfit, col="red")
cycleSamp$Weight %>% mean() %>% abline(v=.,col='red',lw=2)

cycleSamp$Weight %>% mean() #mean is 71.89kg
## [1] 70.65168
cycleSamp$Weight %>% var() #variance is 34
## [1] 47.98
cycleSamp$Weight %>% sd() #SD is 5.83
## [1] 6.92676
#in the 2016 dataset we see mean 72.57605, SD 9.065853 and variance 82.18970 therefore it makes sense we see different SD and variance however a similar mean as the sample population is a lot fewer

#calculate the confidence interval for this mean as a representative sample of the whole dataset:
tcrit<- qt(p = 0.975, df = 50 -1)
CI95<- tcrit*(cycleSamp$Weight %>% sd()/sqrt(50))
CI95
## [1] 1.968563
#Therefore 95% CI for the sample is:
floorCI <- cycleSamp$Weight %>% mean() - CI95
roofCI <- cycleSamp$Weight %>% mean() + CI95
floorCI
## [1] 68.68312
roofCI
## [1] 72.62024

Hypothesis Testing

I will use a critical value approach as the goal is to ascertain if the mean of the 2012 sample population is outside of the critical values determined for mean in comparison to the mean of the 2016 dataset. This allows comparison of the 2 datasets and assumes the null hypothesis is true. I will use a two-tailed approach as it will be also important to consider the upper critical mean boundary as it is just as likely a sample population mean will be heavier than it is they will be lighter.

The critical mean now sits between 70.922 and 74.24, this means Pr(70.92 <= x¯<= 74.24)=0.05. A such, assuming the population mean = 72.58, sampling 50 people from the population and finding a mean to be less than 70.92 and greater than 74.24 would happen less than 5% of the time. In the case of our sample dataset we have a mean of 71.89kg which falls within the 95% of mean values we’d expect. Therefore we can say that the mean of the 2012 dataset sample is statistically insignificant.

cycleSamp %>% summarise(Min = min(Weight,na.rm = TRUE),
                        Q1 = quantile(Weight,probs = .25,na.rm = TRUE),
                        Median = median(Weight, na.rm = TRUE),
                        Q3 = quantile(Weight,probs = .75,na.rm = TRUE),
                        Max = max(Weight,na.rm = TRUE),
                        Mean = mean(Weight, na.rm = TRUE),
                        SD = sd(Weight, na.rm = TRUE),
                        n = n(),
                        Missing = sum(is.na(Weight)))
#sample mean is 71.89kg
cycleSamp$Weight %>% boxplot(ylab = "Weight (Kg)")

#hypothesis is Mean = 72.57605 as per 2016 dataset

#This means the sample mean sits 1.68 SEs below the population mean of 72.576. 
qt(p = 0.05/2, df = 50-1, lower.tail = TRUE)
## [1] -2.009575
#Write a short R function, named ttox to convert this back to a critical mean.
tcrittomean <- function(t ,mu, s, n) {
  se <- s/sqrt(n)
  x_bar <- mu + (t*se) #Determine critical mean
  return(x_bar)
}
#lower critical mean - 70.922
tcrittomean(t = qt(0.05/2,50-1,lower.tail = TRUE),
            mu = 72.58, 
            s = sd(cycleSamp$Weight),
            n = length(cycleSamp$Weight))
## [1] 70.61144
#upper critical mean - 74.24
tcrittomean(t = qt(1-(0.05/2),50-1,lower.tail = TRUE),
            mu = 72.58, 
            s = sd(cycleSamp$Weight),
            n = length(cycleSamp$Weight))
## [1] 74.54856

Discussion

First I cleaned the data, then did some visual analysis on the variables. After completing this I went on to complete some probability calculations with regard to the probability of winning gold based on athlete specific data. I did this by putting athletes in age brackets and grouping by sport and gender. I then calculated probability of winning based on your sport and gender, and also if female athletes were in the main age bracket as opposed to any other bracket. I created tables with final probabilities of winning. Most sports were around 5% chance if you were a female and aged 22-31 however this also ranged from 1-7% depending on the sport and how many gold medals/events were up for grabs.

This probability would be significantly lower for the other age brackets as most medals were won by people in this age bracket.

I then went on to complete sampling and hypothesis testing on the 2012 dataset. I used cluster sampling methods to sample male cyclists as this group of athletes had normal weight distribution in the 2016 dataset and I wanted to explore the weight of the same group of athletes in a different dataset.

I chose a sample size of 50 athletes as this is considered fairly large and to be a better representative sample than a smaller sample set of the total group (there were 325 male cyclists in the 2016 dataset). I completed a number of statistical summaries and did critical value hypothesis testing to come to the final conclusion that the sample mean was within 95% confidence of a reflective mean of the population. Therefore the mean of the 2012 dataset sample is statistically insignificant.

The strengths of the investigation lie in that there is strong evidence from these 2 datasets to identify the mean weight of olympic level male cyclists and therefore it is also likely the other normally distributed weights for gender and sport can be easily predicted.

However the limitiations lie with the data where weights are not normally distributed. This leaves room for future investigation. There may be rationale to combine a number of olympic datasets just by sport and gender for the specific sport and gender to analyse. Therefore a larger representative sample can be calculated and compared.

In response to my problem statement, calculating the probability of winning gold within a dataset is simple and I have created a table of probability of winning gold in each sport if you’re a “prime aged” female. However when it comes to predicting specific weight (or other continuous variable) of a separate group such as a future dataset there comes uncertainty with non-normal data. This data would have to be collected, collated and handled in a different manner.

References

[1] “2016 Olympics in Rio de Janeiro”, Kaggle.com, 2021. [Online]. Available: https://www.kaggle.com/rio2016/olympic-games. [Accessed: 12- May- 2021].

[2] R. Wood, “Anthropometry of Olympic Athletes 2012”, Topend Sports Website, 2012. [Online]. Available: https://www.topendsports.com/events/summer/science/anthropometry-2012.htm. [Accessed: 09- May- 2021].