Olivia Yammouni - s3832344
With the 2021 Olympic games nearing, I have choen to investigate the question of the probability of winning gold at the Rio 2016 Olympics given age range, gender and sport. For each athlete, I will consider a number of factors including: - Sex - Age - Height - Weight - Sport - Medal Count
A lot of emphasis is always put on an athlete being in their “prime” age for competition therefore these variables will be used to calculate probability of winning gold, given an athlete’s age and gender for their respective sport. I have chosen the probability investigation for these specific variables as I believe the probability of winning gold in 2021 may have some different outcomes to the 2016 dataset as a result of the delayed olympic event. This may mean that some athletes that scraped into the ideal age range for their sport in 2016 might now overshoot this range in 2021 as they are a year older. Additionally, athletes are thought to be more likely to succeed if they have genetic advantages such as height in basketball or weight in weightlifting, thus genetic factors will be analysed as well. I am interested specifically in female athletes who are in their “prime age” as this group most closely reflects my personal athlete profile. I am also specifically interested in the weight variable as this can be controlled to some extent by athletes as height and age cannot.
In this assigment I am using 2 complete files (no samples) called 2016athletes.csv and 2012athletes.csv. They were pulled from Kaggle and TopEndSports respectively with URL’s as follows: https://www.kaggle.com/rio2016/olympic-games & https://www.topendsports.com/events/summer/science/anthropometry-2012.htm
To preprocess the data, I will first load the dataset and do some exploratory data analysis: - Review format of the data - Rename columns in each dataset to match - Drop the ID column as it adds no benefit to finding the problem statement (noise). Also drop additional columns in 2012 dataset that aren’t in 2016 dataset - Check shape - Check variable types and change to factor and date variables for some - Convert M to male and F to Female in 2012 set to match 2016 set - Convert height to cm for 2016 dataset - Observe lowest and highest date of birth - no outliers observed - Check for missing values and input averages for that gender and sport - Check for obvious outliers and complete a google search these athletes to see if the data is entered incorrectly or if they are a physical anomaly - Complete feature construction and make a column with bins for age groups - each athlete is assigned an age range
#import data using the readr package, (file is located in the working directory)
athletes2016 <- read_csv("2016athletes.csv")
athletes2012 <- read_csv("2012athletes.csv")
#view the header of the data tables and view top 5 rows
head(athletes2016)
###understand and inspect the data - Data pre-processing
#check data types of the variables
lapply(athletes2016, class)
## $id
## [1] "numeric"
##
## $name
## [1] "character"
##
## $nationality
## [1] "character"
##
## $sex
## [1] "character"
##
## $dob
## [1] "character"
##
## $height
## [1] "numeric"
##
## $weight
## [1] "numeric"
##
## $sport
## [1] "character"
##
## $gold
## [1] "numeric"
##
## $silver
## [1] "numeric"
##
## $bronze
## [1] "numeric"
## $Name
## [1] "character"
##
## $Country
## [1] "character"
##
## $Age
## [1] "numeric"
##
## $`Height, cm`
## [1] "numeric"
##
## $Weight
## [1] "numeric"
##
## $Sex
## [1] "character"
##
## $`Date of birth`
## [1] "character"
##
## $`Birth Month`
## [1] "numeric"
##
## $`Place of birth`
## [1] "character"
##
## $Gold
## [1] "numeric"
##
## $Silver
## [1] "numeric"
##
## $Bronze
## [1] "numeric"
##
## $Total
## [1] "numeric"
##
## $Sport
## [1] "character"
##
## $Event
## [1] "character"
## [1] 11538 11
## [1] 10384 15
## tibble [11,538 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:11538] 7.36e+08 5.32e+08 4.36e+08 5.21e+08 3.39e+07 ...
## $ name : chr [1:11538] "A Jesus Garcia" "A Lam Shin" "Aaron Brown" "Aaron Cook" ...
## $ nationality: chr [1:11538] "ESP" "KOR" "CAN" "MDA" ...
## $ sex : chr [1:11538] "male" "female" "male" "male" ...
## $ dob : chr [1:11538] "10/17/69" "9/23/86" "5/27/92" "1/2/91" ...
## $ height : num [1:11538] 1.72 1.68 1.98 1.83 1.81 1.8 2.05 1.93 1.8 1.65 ...
## $ weight : num [1:11538] 64 56 79 80 71 67 98 100 62 54 ...
## $ sport : chr [1:11538] "athletics" "fencing" "athletics" "taekwondo" ...
## $ gold : num [1:11538] 0 0 0 0 0 0 0 0 0 0 ...
## $ silver : num [1:11538] 0 0 0 0 0 0 0 0 0 0 ...
## $ bronze : num [1:11538] 0 0 1 0 0 0 1 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. name = col_character(),
## .. nationality = col_character(),
## .. sex = col_character(),
## .. dob = col_character(),
## .. height = col_double(),
## .. weight = col_double(),
## .. sport = col_character(),
## .. gold = col_double(),
## .. silver = col_double(),
## .. bronze = col_double()
## .. )
## tibble [10,384 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Name : chr [1:10384] "Hiroshi Hoketsu" "Ian Millar" "Afanasijs Kuzmins" "Carl Bouckaert" ...
## $ Country : chr [1:10384] "Japan" "Canada" "Latvia" "Belgium" ...
## $ Age : num [1:10384] 71 65 65 58 57 57 56 56 56 54 ...
## $ Height, cm : num [1:10384] 168 185 178 181 173 185 177 190 180 175 ...
## $ Weight : num [1:10384] 61 76 89 78 63 88 76 82 78 77 ...
## $ Sex : chr [1:10384] "M" "M" "M" "M" ...
## $ Date of birth : chr [1:10384] "3/28/1941" "1/6/1947" "3/22/1947" "4/19/1954" ...
## $ Birth Month : num [1:10384] 3 1 3 4 12 3 9 3 6 12 ...
## $ Place of birth: chr [1:10384] "CHUO (JPN)" "Halifax (CAN)" "BIKERNIEKI PARISH (LAT)" "WAREGEM (BEL)" ...
## $ Gold : num [1:10384] 0 0 0 0 0 0 0 0 0 0 ...
## $ Silver : num [1:10384] 0 0 0 0 0 0 0 0 0 0 ...
## $ Bronze : num [1:10384] 0 0 0 0 0 0 0 1 0 0 ...
## $ Total : num [1:10384] 0 0 0 0 0 0 0 1 0 0 ...
## $ Sport : chr [1:10384] "Equestrian" "Equestrian" "Shooting" "Equestrian" ...
## $ Event : chr [1:10384] "Individual Dressage, WHISPER" "Individual Jumping, Team Jumping, STAR POWER" "Men's 25m Rapid Fire Pistol" "Individual Eventing, Team Eventing, CYRANO Z" ...
## - attr(*, "spec")=
## .. cols(
## .. Name = col_character(),
## .. Country = col_character(),
## .. Age = col_double(),
## .. `Height, cm` = col_double(),
## .. Weight = col_double(),
## .. Sex = col_character(),
## .. `Date of birth` = col_character(),
## .. `Birth Month` = col_double(),
## .. `Place of birth` = col_character(),
## .. Gold = col_double(),
## .. Silver = col_double(),
## .. Bronze = col_double(),
## .. Total = col_double(),
## .. Sport = col_character(),
## .. Event = col_character()
## .. )
#drop additional columns in both sets
#first drop ID in 2016
athletes2016 <- select(athletes2016, -1)
head(athletes2016)
#next drop columns in 2012 dataframe that are not in 2016 dataframe
athletes2012 <- select(athletes2012, -3, -8, -9, -13, -15)
head(athletes2012)
## [1] "name" "nationality" "sex" "dob" "height"
## [6] "weight" "sport" "gold" "silver" "bronze"
## [1] "Name" "Country" "Height, cm" "Weight"
## [5] "Sex" "Date of birth" "Gold" "Silver"
## [9] "Bronze" "Sport"
#rename the columns
colnames(athletes2016) <- c("Athlete Name", "Nationality", "Gender", "Date_of_Birth" ,
"Height", "Weight", "Sport",
"Gold Medals","Silver Medals","Bronze Medals" )
colnames(athletes2012) <- c("Athlete Name", "Nationality",
"Height", "Weight", "Gender", "Date_of_Birth",
"Gold Medals","Silver Medals","Bronze Medals", "Sport")
#check the new column names
colnames(athletes2016)
## [1] "Athlete Name" "Nationality" "Gender" "Date_of_Birth"
## [5] "Height" "Weight" "Sport" "Gold Medals"
## [9] "Silver Medals" "Bronze Medals"
## [1] "Athlete Name" "Nationality" "Height" "Weight"
## [5] "Gender" "Date_of_Birth" "Gold Medals" "Silver Medals"
## [9] "Bronze Medals" "Sport"
#convert some categorical variables to factors
athletes2016$Gender <- as.factor(athletes2016$Gender)
athletes2012$Gender <- as.factor(athletes2012$Gender)
athletes2016$Sport <- as.factor(athletes2016$Sport)
athletes2012$Sport <- as.factor(athletes2012$Sport)
#Transform date of birth from character to date
athletes2016$Date_of_Birth <-as.Date(athletes2016$Date_of_Birth, format = "%m/%d/%y")
class(athletes2016$Date_of_Birth)
## [1] "Date"
x <-athletes2016$Date_of_Birth
foo <- function(x, year=1950){
m <- year(x) %% 100
year(x) <- ifelse(m > year %% 100, 1900+m, 2000+m)
x
}
athletes2016$Date_of_Birth <- foo(x, 1950)
athletes2016
athletes2012$Date_of_Birth <-as.Date(athletes2012$Date_of_Birth, format = "%m/%d/%Y")
class(athletes2012$Date_of_Birth)
## [1] "Date"
## $`Athlete Name`
## [1] 0
##
## $Nationality
## [1] 0
##
## $Gender
## [1] 0
##
## $Date_of_Birth
## [1] 1
##
## $Height
## [1] 330
##
## $Weight
## [1] 659
##
## $Sport
## [1] 0
##
## $`Gold Medals`
## [1] 0
##
## $`Silver Medals`
## [1] 0
##
## $`Bronze Medals`
## [1] 0
## $`Athlete Name`
## [1] 0
##
## $Nationality
## [1] 0
##
## $Height
## [1] 561
##
## $Weight
## [1] 1280
##
## $Gender
## [1] 0
##
## $Date_of_Birth
## [1] 20
##
## $`Gold Medals`
## [1] 0
##
## $`Silver Medals`
## [1] 0
##
## $`Bronze Medals`
## [1] 0
##
## $Sport
## [1] 0
#calculate mean height and weight per sport and gender and fill empty values with these
athletes2016 <- athletes2016 %>%
group_by(Gender, Sport) %>%
mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))
athletes2016 <- athletes2016 %>%
group_by(Gender, Sport) %>%
mutate(Height = ifelse(is.na(Height), mean(Height, na.rm = TRUE), Height))
#do the same for the 2012 dataset
athletes2012 <- athletes2012 %>%
group_by(Gender, Sport) %>%
mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))
athletes2012 <- athletes2012 %>%
group_by(Gender, Sport) %>%
mutate(Height = ifelse(is.na(Height), mean(Height, na.rm = TRUE), Height))
#all boxing athletes have no entry for weight so I will replace these values with the average weight per their nationality and gender
athletes2016 <- athletes2016 %>%
group_by(Nationality, Gender) %>%
mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))
lapply(athletes2016,function(x) { length(which(is.na(x)))})
## $`Athlete Name`
## [1] 0
##
## $Nationality
## [1] 0
##
## $Gender
## [1] 0
##
## $Date_of_Birth
## [1] 1
##
## $Height
## [1] 0
##
## $Weight
## [1] 0
##
## $Sport
## [1] 0
##
## $`Gold Medals`
## [1] 0
##
## $`Silver Medals`
## [1] 0
##
## $`Bronze Medals`
## [1] 0
#we still have one DOB entry empty, I will analyse this particular athlete
missing_dob <- athletes2016[rowSums(is.na(athletes2016)) > 0,]
missing_dob
#The athlete is a well known russian sailor and his date of birth is publicly available so I will replace it with the public data
athletes2016[is.na(athletes2016)] <- c(as.Date("12/25/1987", format = "%m/%d/%Y"))
#convert height to cm for 2016 dataset
athletes2016$Height <- (athletes2016$Height) * 100
head(athletes2016)
#convert M to male and F to Female in 2012 set to match 2016 set
athletes2012$Gender <- as.character(athletes2012$Gender)
athletes2012$Gender[athletes2012$Gender == "M"] <- "male"
athletes2012$Gender[athletes2012$Gender == "F"] <- "female"
athletes2012$Gender <- as.factor(athletes2012$Gender)
head(athletes2012)
## [1] 121
## [1] 7847
## [1] 221
## [1] 5691
#after searching online, this value of height is incorrect. I will replace it with his actual height 196cm
athletes2016[5691, 5] = 196
athletes2016[5691,]
## [1] 8770
## [1] 31
## [1] 3430
## [1] 170
## [1] 1388 2338
## [1] "1954-05-20"
## [1] "2002-11-26"
#create age bracket bins - we know the youngest athlete is 12 and the oldest is 62
#brackets will be 12 - 21, 22 - 31, 32 - 41, 42 - 51 and 52 - 62
#first add a column for age
a <- as.Date(athletes2016$Date_of_Birth)
b <- as.Date("8/5/2016", format = "%m/%d/%Y")
athletes2016$Age <- floor(age_calc(a, enddate = b, units = "years", precise = FALSE))
athletes2016
#Now create age bins
athletes2016$AgeBracket <- cut(athletes2016$Age, breaks=c(12, 21, 31, 41, 51, 61, 71), labels=c("12-21","22-31", "32-41", "42-51", "52-61", "62+"))
athletes2016
#we can see that most athletes (7907) are 22-31 years old.
#check null values
lapply(athletes2016,function(x) { length(which(is.na(x)))})
## $`Athlete Name`
## [1] 0
##
## $Nationality
## [1] 0
##
## $Gender
## [1] 0
##
## $Date_of_Birth
## [1] 0
##
## $Height
## [1] 0
##
## $Weight
## [1] 0
##
## $Sport
## [1] 0
##
## $`Gold Medals`
## [1] 0
##
## $`Silver Medals`
## [1] 0
##
## $`Bronze Medals`
## [1] 0
##
## $Age
## [1] 0
##
## $AgeBracket
## [1] 0
## $`Athlete Name`
## [1] 0
##
## $Nationality
## [1] 0
##
## $Height
## [1] 0
##
## $Weight
## [1] 804
##
## $Gender
## [1] 0
##
## $Date_of_Birth
## [1] 20
##
## $`Gold Medals`
## [1] 0
##
## $`Silver Medals`
## [1] 0
##
## $`Bronze Medals`
## [1] 0
##
## $Sport
## [1] 0
Details of the 2016athletes variables [1]:
Details of the 2012athletes variables [2]:
Sport and sex will need to be factor variables, date will need to be in the date format and birth month for the second dataset will also have to be factor. The levels for these are self explanatory, it will be a simple 1 & 2 for sex, it will be 1-12 for birth month and 1-n for sport where n is the number of different sports in the dataset.
Some variables such as height and weight are continuous numerical variables even though in this dataset they are represented as whole numbers. Other variables such as gold, silver and bronze medal tally will be discrete numerical variables as they only represent whole numbers of medals won (you can’t win half a medal)
Now that my data is ready to be used and is clean, I will complete some descriptive statistics and visualisations. In the above data cleaning section I have checked for any missing variables. With regards to height and weight, I used the average for that sport and the athlete’s gender. For a missing DOB variable, I obtained this from an online search of the athlete profile.
The important variables in my investigation will be sport, height, weight, age and medals won
First we observe the percentage of athletes in each sport. We can see that athletics has the overwhelming majority of athletes sitting at around 20% of total. All other sports make up around 5% or lower for each.
#Handle null values
#calculate mean height and weight per sport and gender and fill empty values with these
athletes2016 <- athletes2016 %>%
group_by(Gender, Sport) %>%
mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))
athletes2016 <- athletes2016 %>%
group_by(Gender, Sport) %>%
mutate(Height = ifelse(is.na(Height), mean(Height, na.rm = TRUE), Height))
#do the same for the 2012 dataset
athletes2012 <- athletes2012 %>%
group_by(Gender, Sport) %>%
mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))
athletes2012 <- athletes2012 %>%
group_by(Gender, Sport) %>%
mutate(Height = ifelse(is.na(Height), mean(Height, na.rm = TRUE), Height))
#all boxing athletes have no entry for weight so I will replace these values with the average weight per their nationality and gender
athletes2016 <- athletes2016 %>%
group_by(Nationality, Gender) %>%
mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))
lapply(athletes2016,function(x) { length(which(is.na(x)))})
## $`Athlete Name`
## [1] 0
##
## $Nationality
## [1] 0
##
## $Gender
## [1] 0
##
## $Date_of_Birth
## [1] 0
##
## $Height
## [1] 0
##
## $Weight
## [1] 0
##
## $Sport
## [1] 0
##
## $`Gold Medals`
## [1] 0
##
## $`Silver Medals`
## [1] 0
##
## $`Bronze Medals`
## [1] 0
##
## $Age
## [1] 0
##
## $AgeBracket
## [1] 0
#we still have one DOB entry empty, I will analyse this particular athlete
missing_dob <- athletes2016[rowSums(is.na(athletes2016)) > 0,]
missing_dob
#The athlete is a well known russian sailor and his date of birth is publicly available so I will replace it with the public data
athletes2016[is.na(athletes2016)] <- c(as.Date("12/25/1987", format = "%m/%d/%Y"))
#Visualisations: - First I view the height and weight relationship for all the atheletes - there appears to be some outliers as people who are very tall with low weight and people who are very short with high weight - we can also observe that the trend of the data exists that as height increases so does weight however there is a point where the data plateaus and athletes don’t get any taller however do get heavier.
we observe that roughly 5.3% of athletes win 1 Gold medal or more
#VISUALISATIONS
#First lets view the height and weight relationship for all the atheletes
plot(Height ~ Weight, data = athletes2016)
#there appears to be some outliers as people who are very tall with low weight and people who are very short with high weight
#we can also observe that the trend of the data exists that as height increases so does weight however there is a point where the data plateaus and athletes don't get any taller however do get heavier.
#now lets view the proportion of our athletes in each sport
freqSports <- athletes2016$Sport %>% table() %>% prop.table()*100
barplot(freqSports, main = "Athlete Sport Participation - Percentage",ylab="Percent", xlab="Sport", cex.names=0.7, las=2)
#next visualise proportion of athletes in each age bracket
freqAges <- athletes2016$AgeBracket %>% table() %>% prop.table()*100
barplot(freqAges, main = "Athlete Age Brackets - Percentage",ylab="Percent", xlab="Age Bracket", cex.names=0.7, las=2)
#Again we can see the overwhelming majority of athletes are aged 22-31
#next visualise proportion of athletes that win gold
freqMedals <- athletes2016$`Gold Medals` %>% table() %>% prop.table()*100
barplot(freqMedals, main = "Athlete Gold Medal Winners - Percentage",ylab="Percent", xlab="Gold Medal Count", cex.names=0.7, las=2)
#Again we can see the overwhelming majority of athletes win 0 gold medals.
#lets dive deeper and see the exact percentage of winners of 1 - 5 gold medals.
#we observe that roughly 5.3% of athletes win 1 Gold medal or more
freqMedals
## .
## 0 1 2 3 4 5
## 94.635118738 5.061535795 0.234009360 0.043335067 0.017334027 0.008667013
#Create box plots to view the relationship between height and winning gold. It appears a taller athlete is more lilely than a smaller athlete to win 1 (or 5) gold medals
athletes2016 %>% boxplot(Height ~ `Gold Medals`, data = ., main="Box Plot of Gold Medals by Height",
ylab = "Height", xlab = "Gold Medals", col="grey")
#Now compare athlete weight and winning gold
#Here we see that a lighter athlete is more likely to win than a heavier athlete. There are some outliers for the winners of 1 gold medal (where some very heavy athletes won)
athletes2016 %>% boxplot(Weight ~ `Gold Medals`, data = ., main="Box Plot of Gold Medals by Weight",
ylab = "Weight", xlab = "Gold Medals", col="grey")
I will use male cycling weight as an example and plot a histogram with a normal overlay
#Create a table with the descriptive statistics for age by each sport
athletes2016 %>% group_by(Sport, `Gold Medals`) %>% summarise(Min = min(Age,na.rm = TRUE),
Q1 = quantile(Age,probs = .25,na.rm = TRUE),
Median = median(Age, na.rm = TRUE),
Q3 = quantile(Age,probs = .75,na.rm = TRUE),
Max = max(Age,na.rm = TRUE),
Mean = mean(Age, na.rm = TRUE),
SD = sd(Age, na.rm = TRUE),
Range = Max - Min,
IQR = Q3 - Q1,
n = n()) -> table1
knitr::kable(table1)
Sport | Gold Medals | Min | Q1 | Median | Q3 | Max | Mean | SD | Range | IQR | n |
---|---|---|---|---|---|---|---|---|---|---|---|
aquatics | 0 | 13 | 20.00 | 23.0 | 26.00 | 41 | 23.41976 | 4.3753138 | 28 | 6.00 | 1346 |
aquatics | 1 | 15 | 21.00 | 22.0 | 26.00 | 33 | 23.47059 | 4.0195600 | 18 | 5.00 | 85 |
aquatics | 2 | 19 | 20.00 | 23.5 | 26.75 | 35 | 24.30000 | 5.2925524 | 16 | 6.75 | 10 |
aquatics | 3 | 21 | 22.50 | 24.0 | 25.50 | 27 | 24.00000 | 4.2426407 | 6 | 3.00 | 2 |
aquatics | 4 | 19 | 19.00 | 19.0 | 19.00 | 19 | 19.00000 | NA | 0 | 0.00 | 1 |
aquatics | 5 | 31 | 31.00 | 31.0 | 31.00 | 31 | 31.00000 | NA | 0 | 0.00 | 1 |
archery | 0 | 16 | 22.00 | 24.5 | 28.00 | 44 | 25.68852 | 5.8846796 | 28 | 6.00 | 122 |
archery | 1 | 20 | 20.75 | 22.5 | 25.00 | 28 | 23.25000 | 3.5939764 | 8 | 4.25 | 4 |
archery | 2 | 23 | 24.50 | 26.0 | 27.50 | 29 | 26.00000 | 4.2426407 | 6 | 3.00 | 2 |
athletics | 0 | 15 | 23.00 | 26.0 | 29.00 | 46 | 26.32002 | 4.8055656 | 31 | 6.00 | 2303 |
athletics | 1 | 19 | 23.50 | 26.0 | 28.50 | 37 | 26.05455 | 3.8892929 | 18 | 5.00 | 55 |
athletics | 2 | 24 | 28.50 | 30.0 | 30.75 | 33 | 29.25000 | 3.7749172 | 9 | 2.25 | 4 |
athletics | 3 | 29 | 29.00 | 29.0 | 29.00 | 29 | 29.00000 | NA | 0 | 0.00 | 1 |
badminton | 0 | 19 | 24.00 | 26.5 | 29.00 | 40 | 26.82317 | 3.9271526 | 21 | 5.00 | 164 |
badminton | 1 | 23 | 25.50 | 26.5 | 29.25 | 32 | 27.12500 | 3.0443155 | 9 | 3.75 | 8 |
basketball | 0 | 20 | 24.00 | 27.0 | 31.00 | 39 | 27.56439 | 4.3614738 | 19 | 7.00 | 264 |
basketball | 1 | 21 | 26.00 | 27.0 | 30.50 | 37 | 28.20833 | 3.9997735 | 16 | 4.50 | 24 |
boxing | 0 | 18 | 22.00 | 25.0 | 28.00 | 36 | 25.04396 | 3.8403204 | 18 | 6.00 | 273 |
boxing | 1 | 21 | 23.00 | 24.0 | 25.00 | 33 | 24.61538 | 3.0149202 | 12 | 2.00 | 13 |
canoe | 0 | 17 | 23.00 | 27.0 | 29.00 | 48 | 26.78387 | 4.5732157 | 31 | 6.00 | 310 |
canoe | 1 | 21 | 24.00 | 29.5 | 33.00 | 36 | 28.62500 | 4.6743984 | 15 | 9.00 | 16 |
canoe | 2 | 22 | 25.00 | 27.0 | 28.25 | 29 | 26.25000 | 3.0956959 | 7 | 3.25 | 4 |
canoe | 3 | 29 | 29.00 | 29.0 | 29.00 | 29 | 29.00000 | NA | 0 | 0.00 | 1 |
cycling | 0 | 18 | 23.00 | 27.0 | 30.00 | 44 | 26.91816 | 4.5920899 | 26 | 7.00 | 501 |
cycling | 1 | 21 | 23.00 | 25.5 | 29.75 | 42 | 27.04545 | 5.3849638 | 21 | 6.75 | 22 |
cycling | 2 | 24 | 24.00 | 24.0 | 24.00 | 24 | 24.00000 | NA | 0 | 0.00 | 1 |
cycling | 3 | 28 | 28.00 | 28.0 | 28.00 | 28 | 28.00000 | NA | 0 | 0.00 | 1 |
equestrian | 0 | 18 | 30.00 | 36.0 | 44.00 | 62 | 37.12560 | 9.6669626 | 44 | 14.00 | 207 |
equestrian | 1 | 21 | 31.50 | 36.0 | 47.00 | 58 | 38.86667 | 10.5211805 | 37 | 15.50 | 15 |
fencing | 0 | 15 | 24.00 | 27.0 | 31.00 | 41 | 27.47788 | 4.8909846 | 26 | 7.00 | 226 |
fencing | 1 | 20 | 26.00 | 28.0 | 31.00 | 38 | 28.47368 | 4.1145868 | 18 | 5.00 | 19 |
fencing | 2 | 22 | 22.00 | 22.0 | 22.00 | 22 | 22.00000 | NA | 0 | 0.00 | 1 |
football | 0 | 16 | 21.00 | 22.0 | 25.00 | 40 | 23.48870 | 3.5948270 | 24 | 4.00 | 575 |
football | 1 | 19 | 22.00 | 23.0 | 27.25 | 33 | 24.41667 | 3.6282621 | 14 | 5.25 | 36 |
golf | 0 | 18 | 24.00 | 28.5 | 34.75 | 46 | 29.50000 | 7.0792220 | 28 | 10.75 | 118 |
golf | 1 | 28 | 30.00 | 32.0 | 34.00 | 36 | 32.00000 | 5.6568542 | 8 | 4.00 | 2 |
gymnastics | 0 | 15 | 19.00 | 21.0 | 24.50 | 41 | 22.05686 | 4.4025745 | 26 | 5.50 | 299 |
gymnastics | 1 | 16 | 19.25 | 22.0 | 25.00 | 31 | 22.31818 | 3.8469080 | 15 | 5.75 | 22 |
gymnastics | 2 | 23 | 24.00 | 25.0 | 26.00 | 27 | 25.00000 | 2.8284271 | 4 | 2.00 | 2 |
gymnastics | 4 | 19 | 19.00 | 19.0 | 19.00 | 19 | 19.00000 | NA | 0 | 0.00 | 1 |
handball | 0 | 19 | 25.00 | 27.0 | 31.00 | 43 | 27.82934 | 4.5451596 | 24 | 6.00 | 334 |
handball | 1 | 20 | 27.00 | 29.0 | 31.00 | 35 | 28.68966 | 3.7806221 | 15 | 4.00 | 29 |
hockey | 0 | 17 | 23.00 | 26.0 | 29.00 | 37 | 25.96482 | 3.6728658 | 20 | 6.00 | 398 |
hockey | 1 | 21 | 25.25 | 28.5 | 31.00 | 37 | 28.55882 | 4.0690344 | 16 | 5.75 | 34 |
judo | 0 | 18 | 24.00 | 26.0 | 29.00 | 39 | 26.22751 | 3.7425200 | 21 | 5.00 | 378 |
judo | 1 | 21 | 24.00 | 25.0 | 26.75 | 30 | 25.28571 | 2.8670159 | 9 | 2.75 | 14 |
modern pentathlon | 0 | 16 | 23.00 | 25.0 | 29.00 | 36 | 25.60000 | 3.8949987 | 20 | 6.00 | 70 |
modern pentathlon | 1 | 24 | 25.00 | 26.0 | 27.00 | 28 | 26.00000 | 2.8284271 | 4 | 2.00 | 2 |
rowing | 0 | 18 | 24.00 | 26.0 | 29.00 | 56 | 26.69940 | 4.3994591 | 38 | 5.00 | 499 |
rowing | 1 | 22 | 26.00 | 28.0 | 31.00 | 37 | 28.77083 | 3.5921876 | 15 | 5.00 | 48 |
rugby sevens | 0 | 18 | 23.00 | 26.0 | 29.00 | 36 | 26.12364 | 3.8264334 | 18 | 6.00 | 275 |
rugby sevens | 1 | 21 | 23.00 | 25.0 | 28.00 | 32 | 25.32000 | 3.0784195 | 11 | 5.00 | 25 |
sailing | 0 | 17 | 24.00 | 27.0 | 30.00 | 47 | 27.41096 | 5.2629631 | 30 | 6.00 | 365 |
sailing | 1 | 25 | 26.50 | 28.0 | 30.00 | 54 | 30.00000 | 7.2308861 | 29 | 3.50 | 15 |
shooting | 0 | 16 | 25.00 | 31.0 | 36.00 | 55 | 31.40426 | 8.3775180 | 39 | 11.00 | 376 |
shooting | 1 | 19 | 25.00 | 29.0 | 33.00 | 49 | 30.23077 | 8.6231650 | 30 | 8.00 | 13 |
shooting | 2 | 28 | 28.00 | 28.0 | 28.00 | 28 | 28.00000 | NA | 0 | 0.00 | 1 |
table tennis | 0 | 15 | 23.00 | 27.0 | 32.00 | 54 | 27.77711 | 6.8058589 | 39 | 9.00 | 166 |
table tennis | 1 | 25 | 25.75 | 27.0 | 28.00 | 28 | 26.75000 | 1.5000000 | 3 | 2.25 | 4 |
table tennis | 2 | 26 | 26.25 | 26.5 | 26.75 | 27 | 26.50000 | 0.7071068 | 1 | 0.50 | 2 |
taekwondo | 0 | 17 | 21.00 | 24.0 | 27.00 | 37 | 24.24167 | 4.2207886 | 20 | 6.00 | 120 |
taekwondo | 1 | 20 | 21.50 | 22.0 | 23.75 | 28 | 22.87500 | 2.7998724 | 8 | 2.25 | 8 |
tennis | 0 | 18 | 25.00 | 28.0 | 31.00 | 43 | 28.02128 | 4.6168228 | 25 | 6.00 | 188 |
tennis | 1 | 22 | 26.75 | 29.5 | 30.25 | 34 | 28.37500 | 4.0333432 | 12 | 3.50 | 8 |
triathlon | 0 | 19 | 25.00 | 27.0 | 30.00 | 38 | 27.66667 | 3.8499909 | 19 | 5.00 | 108 |
triathlon | 1 | 28 | 28.50 | 29.0 | 29.50 | 30 | 29.00000 | 1.4142136 | 2 | 1.00 | 2 |
volleyball | 0 | 17 | 25.00 | 27.0 | 30.00 | 41 | 27.45787 | 4.3288641 | 24 | 5.00 | 356 |
volleyball | 1 | 19 | 24.00 | 27.0 | 30.00 | 40 | 27.21429 | 5.2518578 | 21 | 6.00 | 28 |
weightlifting | 0 | 15 | 22.00 | 24.0 | 28.00 | 41 | 24.86420 | 4.1273721 | 26 | 6.00 | 243 |
weightlifting | 1 | 21 | 22.00 | 24.0 | 25.00 | 33 | 24.26667 | 3.0814344 | 12 | 3.00 | 15 |
wrestling | 0 | 18 | 24.00 | 26.0 | 29.00 | 38 | 26.67463 | 3.7189973 | 20 | 5.00 | 335 |
wrestling | 1 | 20 | 21.25 | 24.0 | 25.00 | 33 | 24.55556 | 3.8535274 | 13 | 3.75 | 18 |
#Now create a table with the descriptive statistics for height by each sport and gender
athletes2016 %>% group_by(Sport, Gender) %>% summarise(Min = min(Height,na.rm = TRUE),
Q1 = quantile(Height,probs = .25,na.rm = TRUE),
Median = median(Height),
Q3 = quantile(Height,probs = .75,na.rm = TRUE),
Max = max(Height,na.rm = TRUE),
Mean = mean(Height, na.rm = TRUE),
SD = sd(Height, na.rm = TRUE),
Range = Max - Min,
IQR = Q3 - Q1,
n = n()) -> table2
knitr::kable(table2)
Sport | Gender | Min | Q1 | Median | Q3 | Max | Mean | SD | Range | IQR | n |
---|---|---|---|---|---|---|---|---|---|---|---|
aquatics | female | 143 | 167.00 | 171.5712 | 176.00 | 194 | 171.5712 | 7.403290 | 51 | 9.00 | 716 |
aquatics | male | 153 | 181.00 | 186.0342 | 192.00 | 208 | 185.9999 | 8.794747 | 55 | 11.00 | 729 |
archery | female | 153 | 163.00 | 168.0000 | 172.00 | 183 | 167.6190 | 6.168091 | 30 | 9.00 | 64 |
archery | male | 164 | 175.00 | 180.0000 | 183.00 | 194 | 179.5714 | 5.800149 | 30 | 8.00 | 64 |
athletics | female | 121 | 164.00 | 169.0500 | 174.00 | 193 | 169.0500 | 7.925731 | 72 | 10.00 | 1137 |
athletics | male | 145 | 175.00 | 180.9234 | 187.00 | 208 | 180.9234 | 8.896243 | 63 | 12.00 | 1226 |
badminton | female | 156 | 165.00 | 168.0000 | 172.00 | 184 | 168.6000 | 5.562902 | 28 | 7.00 | 86 |
badminton | male | 167 | 176.25 | 180.2529 | 184.00 | 198 | 180.5059 | 6.316409 | 31 | 7.75 | 86 |
basketball | female | 161 | 177.00 | 185.0000 | 190.00 | 203 | 183.3819 | 9.184965 | 42 | 13.00 | 144 |
basketball | male | 181 | 194.75 | 201.0000 | 207.00 | 218 | 200.3611 | 8.281423 | 37 | 12.25 | 144 |
boxing | female | 150 | 162.75 | 168.0000 | 174.25 | 184 | 168.2222 | 8.124429 | 34 | 11.50 | 36 |
boxing | male | 152 | 170.00 | 175.7615 | 181.00 | 206 | 175.7615 | 9.132098 | 54 | 11.00 | 250 |
canoe | female | 153 | 166.50 | 170.0000 | 174.00 | 185 | 170.0648 | 6.356490 | 32 | 7.50 | 111 |
canoe | male | 159 | 178.00 | 183.0000 | 186.00 | 202 | 182.0455 | 6.852505 | 43 | 8.00 | 220 |
cycling | female | 155 | 163.00 | 167.1564 | 170.25 | 185 | 167.3128 | 5.895198 | 30 | 7.25 | 200 |
cycling | male | 164 | 175.00 | 179.3074 | 184.00 | 196 | 179.3074 | 6.331333 | 32 | 9.00 | 325 |
equestrian | female | 151 | 163.00 | 168.0000 | 172.00 | 185 | 168.2530 | 6.843216 | 34 | 9.00 | 85 |
equestrian | male | 161 | 174.00 | 179.0000 | 183.00 | 196 | 179.0000 | 7.407270 | 35 | 9.00 | 137 |
fencing | female | 152 | 167.75 | 171.5000 | 176.00 | 185 | 171.2581 | 6.538032 | 33 | 8.25 | 124 |
fencing | male | 165 | 178.00 | 183.1901 | 189.00 | 205 | 183.3802 | 7.478074 | 40 | 11.00 | 122 |
football | female | 153 | 163.00 | 168.0000 | 173.00 | 188 | 168.2710 | 6.553599 | 35 | 10.00 | 264 |
football | male | 160 | 174.00 | 180.0000 | 185.00 | 196 | 179.3746 | 7.206978 | 36 | 11.00 | 347 |
golf | female | 158 | 165.00 | 168.0000 | 172.50 | 183 | 169.1404 | 5.694639 | 25 | 7.50 | 59 |
golf | male | 160 | 173.00 | 180.0000 | 185.00 | 196 | 179.3500 | 7.999635 | 36 | 12.00 | 61 |
gymnastics | female | 133 | 156.00 | 163.0000 | 168.00 | 181 | 161.6524 | 9.051058 | 48 | 12.00 | 210 |
gymnastics | male | 152 | 163.00 | 167.0000 | 172.00 | 184 | 167.7105 | 6.168323 | 32 | 9.00 | 114 |
handball | female | 161 | 172.00 | 176.0000 | 180.00 | 192 | 175.9722 | 5.801590 | 31 | 8.00 | 180 |
handball | male | 177 | 186.00 | 191.0000 | 196.00 | 210 | 191.3169 | 6.816922 | 33 | 10.00 | 183 |
hockey | female | 152 | 162.00 | 167.0000 | 171.00 | 196 | 166.7870 | 6.984436 | 44 | 9.00 | 216 |
hockey | male | 161 | 176.00 | 180.0000 | 184.00 | 196 | 180.0093 | 5.989130 | 35 | 8.00 | 216 |
judo | female | 150 | 161.00 | 165.0000 | 173.00 | 186 | 166.5132 | 8.130504 | 36 | 12.00 | 153 |
judo | male | 155 | 172.00 | 178.1783 | 185.00 | 210 | 178.1783 | 9.474799 | 55 | 13.00 | 239 |
modern pentathlon | female | 159 | 165.75 | 169.0000 | 175.00 | 182 | 169.9444 | 5.820871 | 23 | 9.25 | 36 |
modern pentathlon | male | 172 | 180.00 | 181.5000 | 185.00 | 197 | 182.5278 | 5.140795 | 25 | 5.00 | 36 |
rowing | female | 150 | 173.00 | 178.0000 | 183.00 | 196 | 177.3286 | 7.512572 | 46 | 10.00 | 216 |
rowing | male | 152 | 186.00 | 190.0000 | 195.00 | 209 | 189.8708 | 7.557571 | 57 | 9.00 | 331 |
rugby sevens | female | 153 | 163.75 | 168.0000 | 172.00 | 203 | 167.8552 | 6.756181 | 50 | 8.25 | 148 |
rugby sevens | male | 168 | 177.00 | 183.5000 | 188.00 | 198 | 182.9145 | 7.272323 | 30 | 11.00 | 152 |
sailing | female | 152 | 165.00 | 169.2767 | 173.00 | 184 | 169.2767 | 6.079628 | 32 | 8.00 | 163 |
sailing | male | 165 | 177.00 | 181.0000 | 186.00 | 204 | 181.2770 | 6.549900 | 39 | 9.00 | 217 |
shooting | female | 149 | 160.00 | 164.8400 | 169.00 | 181 | 164.8400 | 6.079561 | 32 | 9.00 | 151 |
shooting | male | 160 | 174.00 | 178.0000 | 182.00 | 202 | 178.0343 | 6.763013 | 42 | 8.00 | 239 |
table tennis | female | 149 | 161.00 | 165.3659 | 168.75 | 181 | 165.3659 | 6.301283 | 32 | 7.75 | 86 |
table tennis | male | 162 | 174.00 | 178.5000 | 183.00 | 198 | 178.6860 | 7.099729 | 36 | 9.00 | 86 |
taekwondo | female | 159 | 168.75 | 173.0000 | 176.50 | 188 | 173.0159 | 6.188807 | 29 | 7.75 | 64 |
taekwondo | male | 160 | 180.00 | 186.5000 | 191.00 | 207 | 186.3750 | 8.820071 | 47 | 11.00 | 64 |
tennis | female | 159 | 168.00 | 174.0000 | 178.00 | 186 | 173.4205 | 6.290595 | 27 | 10.00 | 91 |
tennis | male | 172 | 181.00 | 186.0000 | 191.00 | 204 | 186.0388 | 7.107583 | 32 | 10.00 | 105 |
triathlon | female | 153 | 164.00 | 168.0000 | 171.00 | 180 | 167.6545 | 5.618152 | 27 | 7.00 | 55 |
triathlon | male | 166 | 175.00 | 179.0000 | 183.00 | 191 | 178.8704 | 6.212432 | 25 | 8.00 | 55 |
volleyball | female | 159 | 178.00 | 183.0000 | 188.00 | 202 | 182.8691 | 7.756682 | 43 | 10.00 | 192 |
volleyball | male | 172 | 191.00 | 197.0000 | 202.00 | 211 | 196.4010 | 7.638467 | 39 | 11.00 | 192 |
weightlifting | female | 137 | 155.00 | 160.0000 | 165.00 | 178 | 160.3558 | 7.703131 | 41 | 10.00 | 104 |
weightlifting | male | 148 | 162.25 | 170.0000 | 180.00 | 197 | 170.9221 | 10.776719 | 49 | 17.75 | 154 |
wrestling | female | 150 | 160.25 | 165.0000 | 170.00 | 180 | 165.1239 | 6.796840 | 30 | 9.75 | 114 |
wrestling | male | 152 | 169.00 | 175.0000 | 182.00 | 203 | 175.7215 | 9.581499 | 51 | 13.00 | 239 |
#Now create a table with the descriptive statistics for Weight by each sport and gender, including calculations for normal distribution
athletes2016 %>% group_by(Sport, Gender) %>% summarise(Min = min(Weight,na.rm = TRUE),
Q1 = quantile(Weight,probs = .25,na.rm = TRUE),
Median = median(Weight),
Q3 = quantile(Weight,probs = .75,na.rm = TRUE),
Max = max(Weight,na.rm = TRUE),
Mean = mean(Weight, na.rm = TRUE),
SD = sd(Weight, na.rm = TRUE),
Variance = var(Weight),
Range = Max - Min,
IQR = Q3 - Q1,
n = n(),
valNorm = SD * 1.33,
distNorm = IQR - valNorm) -> table3
knitr::kable(table3)
Sport | Gender | Min | Q1 | Median | Q3 | Max | Mean | SD | Variance | Range | IQR | n | valNorm | distNorm |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
aquatics | female | 39.00000 | 57.00000 | 62.00000 | 66.00000 | 130 | 62.28448 | 8.666120 | 75.10164 | 91.00000 | 9.000000 | 716 | 11.525940 | -2.5259399 |
aquatics | male | 50.00000 | 75.00000 | 82.00000 | 89.00000 | 125 | 82.21906 | 11.393093 | 129.80256 | 75.00000 | 14.000000 | 729 | 15.152813 | -1.1528133 |
archery | female | 42.00000 | 56.75000 | 63.00000 | 73.00000 | 90 | 64.30159 | 10.225451 | 104.55984 | 48.00000 | 16.250000 | 64 | 13.599849 | 2.6501508 |
archery | male | 46.00000 | 74.00000 | 79.50000 | 84.50000 | 130 | 80.07937 | 13.458108 | 181.12069 | 84.00000 | 10.500000 | 64 | 17.899284 | -7.3992844 |
athletics | female | 39.00000 | 52.00000 | 58.00000 | 63.00000 | 150 | 60.15254 | 13.679727 | 187.13494 | 111.00000 | 11.000000 | 1137 | 18.194037 | -7.1940372 |
athletics | male | 47.00000 | 63.00000 | 72.50000 | 80.00000 | 165 | 74.77768 | 17.330468 | 300.34510 | 118.00000 | 17.000000 | 1226 | 23.049522 | -6.0495218 |
badminton | female | 50.00000 | 58.00000 | 60.50000 | 64.75000 | 85 | 61.20988 | 5.637513 | 31.78155 | 35.00000 | 6.750000 | 86 | 7.497892 | -0.7478924 |
badminton | male | 60.00000 | 72.00000 | 76.15663 | 80.00000 | 93 | 76.15663 | 6.964586 | 48.50546 | 33.00000 | 8.000000 | 86 | 9.262899 | -1.2628993 |
basketball | female | 53.00000 | 68.00000 | 75.00000 | 83.25000 | 104 | 75.37762 | 10.354422 | 107.21404 | 51.00000 | 15.250000 | 144 | 13.771381 | 1.4786194 |
basketball | male | 79.00000 | 90.00000 | 100.00000 | 108.25000 | 137 | 100.29787 | 11.835554 | 140.08035 | 58.00000 | 18.250000 | 144 | 15.741287 | 2.5087128 |
boxing | female | 56.33205 | 60.15761 | 62.46073 | 63.84033 | 91 | 62.83780 | 5.469579 | 29.91629 | 34.66795 | 3.682722 | 36 | 7.274540 | -3.5918181 |
boxing | male | 63.47368 | 75.22449 | 80.67671 | 82.25725 | 114 | 79.81591 | 6.099835 | 37.20799 | 50.52632 | 7.032756 | 250 | 8.112781 | -1.0800249 |
canoe | female | 50.00000 | 63.00000 | 66.00000 | 70.00000 | 99 | 66.45794 | 7.164661 | 51.33237 | 49.00000 | 7.000000 | 111 | 9.528999 | -2.5289994 |
canoe | male | 57.00000 | 75.75000 | 83.00000 | 89.00000 | 115 | 82.15000 | 8.937646 | 79.88151 | 58.00000 | 13.250000 | 220 | 11.887068 | 1.3629315 |
cycling | female | 45.00000 | 55.75000 | 60.00000 | 64.00000 | 90 | 60.20725 | 7.018894 | 49.26487 | 45.00000 | 8.250000 | 200 | 9.335129 | -1.0851291 |
cycling | male | 52.00000 | 66.00000 | 71.00000 | 78.00000 | 100 | 72.57605 | 9.065853 | 82.18970 | 48.00000 | 12.000000 | 325 | 12.057585 | -0.0575852 |
equestrian | female | 47.00000 | 55.00000 | 58.00000 | 62.00000 | 75 | 58.63415 | 5.842195 | 34.13124 | 28.00000 | 7.000000 | 85 | 7.770119 | -0.7701194 |
equestrian | male | 50.00000 | 69.00000 | 72.00000 | 79.00000 | 89 | 72.95489 | 7.268854 | 52.83624 | 39.00000 | 10.000000 | 137 | 9.667576 | 0.3324236 |
fencing | female | 45.00000 | 58.00000 | 62.00000 | 67.25000 | 81 | 62.73387 | 7.362958 | 54.21315 | 36.00000 | 9.250000 | 124 | 9.792734 | -0.5427343 |
fencing | male | 60.00000 | 75.00000 | 78.00000 | 83.00000 | 102 | 78.78512 | 7.760045 | 60.21829 | 42.00000 | 8.000000 | 122 | 10.320859 | -2.3208592 |
football | female | 47.00000 | 56.00000 | 60.00000 | 65.00000 | 82 | 61.06107 | 6.199786 | 38.43735 | 35.00000 | 9.000000 | 264 | 8.245716 | 0.7542843 |
football | male | 60.00000 | 70.00000 | 74.45171 | 79.00000 | 100 | 74.45171 | 6.774549 | 45.89451 | 40.00000 | 9.000000 | 347 | 9.010150 | -0.0101501 |
golf | female | 48.00000 | 58.50000 | 63.00000 | 68.00000 | 85 | 63.20000 | 7.219227 | 52.11724 | 37.00000 | 9.500000 | 59 | 9.601572 | -0.1015722 |
golf | male | 63.00000 | 72.00000 | 79.00000 | 86.00000 | 106 | 79.00000 | 9.150592 | 83.73333 | 43.00000 | 14.000000 | 61 | 12.170287 | 1.8297127 |
gymnastics | female | 31.00000 | 47.00000 | 50.00000 | 54.00000 | 63 | 49.55502 | 5.567170 | 30.99338 | 32.00000 | 7.000000 | 210 | 7.404336 | -0.4043363 |
gymnastics | male | 50.00000 | 60.00000 | 63.12727 | 66.00000 | 81 | 63.25455 | 6.028599 | 36.34401 | 31.00000 | 6.000000 | 114 | 8.018037 | -2.0180367 |
handball | female | 52.00000 | 66.00000 | 70.00000 | 74.00000 | 105 | 70.78916 | 7.775344 | 60.45598 | 53.00000 | 8.000000 | 180 | 10.341208 | -2.3412080 |
handball | male | 72.00000 | 88.00000 | 95.00000 | 103.50000 | 118 | 95.43169 | 9.955090 | 99.10383 | 46.00000 | 15.500000 | 183 | 13.240270 | 2.2597298 |
hockey | female | 45.00000 | 55.75000 | 60.00000 | 65.00000 | 78 | 60.42593 | 6.265748 | 39.25960 | 33.00000 | 9.250000 | 216 | 8.333445 | 0.9165545 |
hockey | male | 58.00000 | 74.00000 | 77.00000 | 81.25000 | 95 | 77.37500 | 6.881936 | 47.36105 | 37.00000 | 7.250000 | 216 | 9.152975 | -1.9029752 |
judo | female | 46.00000 | 52.00000 | 63.00000 | 70.00000 | 132 | 65.39216 | 16.519221 | 272.88467 | 86.00000 | 18.000000 | 153 | 21.970564 | -3.9705644 |
judo | male | 60.00000 | 68.00000 | 81.00000 | 94.50000 | 170 | 84.61674 | 20.376463 | 415.20024 | 110.00000 | 26.500000 | 239 | 27.100696 | -0.6006955 |
modern pentathlon | female | 49.00000 | 53.75000 | 58.00000 | 60.00000 | 69 | 58.00000 | 4.974219 | 24.74286 | 20.00000 | 6.250000 | 36 | 6.615712 | -0.3657116 |
modern pentathlon | male | 64.00000 | 70.75000 | 74.00000 | 78.00000 | 90 | 73.91667 | 5.683937 | 32.30714 | 26.00000 | 7.250000 | 36 | 7.559637 | -0.3096366 |
rowing | female | 49.00000 | 63.75000 | 71.50000 | 75.00000 | 90 | 69.77619 | 8.472797 | 71.78828 | 41.00000 | 11.250000 | 216 | 11.268820 | -0.0188196 |
rowing | male | 53.00000 | 73.00000 | 90.00000 | 95.00000 | 110 | 86.50462 | 12.010506 | 144.25225 | 57.00000 | 22.000000 | 331 | 15.973973 | 6.0260272 |
rugby sevens | female | 53.00000 | 63.00000 | 66.59589 | 71.00000 | 89 | 66.59589 | 6.629740 | 43.95345 | 36.00000 | 8.000000 | 148 | 8.817554 | -0.8175542 |
rugby sevens | male | 65.00000 | 83.00000 | 90.00000 | 98.00000 | 113 | 90.45033 | 9.400358 | 88.36674 | 48.00000 | 15.000000 | 152 | 12.502477 | 2.4975233 |
sailing | female | 50.00000 | 59.00000 | 63.19497 | 68.00000 | 74 | 63.19497 | 5.534937 | 30.63553 | 24.00000 | 9.000000 | 163 | 7.361467 | 1.6385334 |
sailing | male | 59.00000 | 72.00000 | 77.00000 | 82.00000 | 102 | 77.12207 | 8.971876 | 80.49457 | 43.00000 | 10.000000 | 217 | 11.932596 | -1.9325956 |
shooting | female | 43.00000 | 55.00000 | 61.00000 | 67.00000 | 95 | 62.67568 | 10.064602 | 101.29622 | 52.00000 | 12.000000 | 151 | 13.385921 | -1.3859208 |
shooting | male | 45.00000 | 73.00000 | 80.00000 | 88.00000 | 140 | 81.06897 | 13.352823 | 178.29788 | 95.00000 | 15.000000 | 239 | 17.759255 | -2.7592547 |
table tennis | female | 42.00000 | 53.25000 | 57.53012 | 61.50000 | 84 | 57.53012 | 6.728932 | 45.27853 | 42.00000 | 8.250000 | 86 | 8.949480 | -0.6994796 |
table tennis | male | 51.00000 | 67.00000 | 72.00000 | 77.00000 | 99 | 72.55814 | 8.652242 | 74.86129 | 48.00000 | 10.000000 | 86 | 11.507481 | -1.5074814 |
taekwondo | female | 48.00000 | 54.50000 | 61.25806 | 67.00000 | 88 | 61.25806 | 9.107707 | 82.95033 | 40.00000 | 12.500000 | 64 | 12.113251 | 0.3867493 |
taekwondo | male | 57.00000 | 62.75000 | 72.40476 | 82.25000 | 108 | 74.80952 | 13.913402 | 193.58277 | 51.00000 | 19.500000 | 64 | 18.504825 | 0.9951748 |
tennis | female | 52.00000 | 61.00000 | 65.00000 | 68.00000 | 84 | 64.67045 | 5.386653 | 29.01604 | 32.00000 | 7.000000 | 91 | 7.164249 | -0.1642491 |
tennis | male | 65.00000 | 75.00000 | 80.00000 | 85.00000 | 99 | 80.41748 | 7.405201 | 54.83701 | 34.00000 | 10.000000 | 105 | 9.848918 | 0.1510824 |
triathlon | female | 44.00000 | 51.50000 | 55.00000 | 58.00000 | 65 | 54.56364 | 4.532858 | 20.54680 | 21.00000 | 6.500000 | 55 | 6.028701 | 0.4712989 |
triathlon | male | 56.00000 | 63.50000 | 65.00000 | 70.00000 | 80 | 66.81481 | 5.343997 | 28.55830 | 24.00000 | 6.500000 | 55 | 7.107515 | -0.6075154 |
volleyball | female | 52.00000 | 66.00000 | 71.00000 | 75.00000 | 92 | 70.68421 | 6.709785 | 45.02122 | 40.00000 | 9.000000 | 192 | 8.924014 | 0.0759856 |
volleyball | male | 64.00000 | 84.00000 | 89.00000 | 95.25000 | 115 | 89.42188 | 8.419880 | 70.89439 | 51.00000 | 11.250000 | 192 | 11.198441 | 0.0515589 |
weightlifting | female | 47.00000 | 57.00000 | 63.00000 | 74.00000 | 143 | 68.78846 | 20.526409 | 421.33346 | 96.00000 | 17.000000 | 104 | 27.300124 | -10.3001237 |
weightlifting | male | 55.00000 | 69.00000 | 84.00000 | 104.00000 | 170 | 87.53896 | 27.938339 | 780.55076 | 115.00000 | 35.000000 | 154 | 37.157990 | -2.1579902 |
wrestling | female | 47.00000 | 54.25000 | 62.50000 | 69.00000 | 80 | 61.80531 | 9.109810 | 82.98864 | 33.00000 | 14.750000 | 114 | 12.116048 | 2.6339523 |
wrestling | male | 39.00000 | 66.00000 | 82.00000 | 98.00000 | 130 | 85.37288 | 21.409295 | 458.35793 | 91.00000 | 32.000000 | 239 | 28.474363 | 3.5256372 |
#I will use male cycling as an example and plot
maleCyc <- athletes2016 %>%
group_by(Gender, Sport) %>%
filter(Sport == "cycling" , Gender == "male")
#Plot histogram with normal overlay - Male cyclists weight
h<-hist(maleCyc$Weight,breaks=15, main = "Normal Distribution of Male Cyclists' weight")
weight.plot <-c(min(h$breaks),h$breaks)
freq<-c(0,h$density,0)
xfit<-seq(min(maleCyc$Weight),max(maleCyc$Weight),length=40)
yfit<-dnorm(xfit,mean=mean(maleCyc$Weight),sd=sd(maleCyc$Weight))
plot(weight.plot,freq,type="s",ylim=c(0,max(freq,yfit)), main="Normal pdf and histogram - Male Cyclist weights")
lines(xfit,yfit, col="red")
maleCyc$Weight %>% mean() %>% abline(v=.,col='red',lw=2)
#Complete probability testing as per research problem
#Probability testing
library(janitor)
#Find the gold medal count and therefore probability of winning gold per sport and gender
m1 <- as.matrix(tabyl(athletes2016, Gender, Sport))
m2 <- as.numeric(c(m1[1, 2:29]))
m3 <- as.numeric(c(m1[2, 2:29]))
athletes2016 %>%
group_by(Sport)%>%
filter(`Gold Medals` > 0 , Gender == "female") %>%
summarise(totalGoldMedals = sum(`Gold Medals`)) -> table4
knitr::kable(table4)
Sport | totalGoldMedals |
---|---|
aquatics | 65 |
archery | 4 |
athletics | 32 |
badminton | 4 |
basketball | 12 |
boxing | 3 |
canoe | 9 |
cycling | 13 |
equestrian | 5 |
fencing | 11 |
football | 18 |
golf | 1 |
gymnastics | 17 |
handball | 15 |
hockey | 16 |
judo | 7 |
modern pentathlon | 1 |
rowing | 20 |
rugby sevens | 12 |
sailing | 7 |
shooting | 6 |
table tennis | 4 |
taekwondo | 4 |
tennis | 4 |
triathlon | 1 |
volleyball | 14 |
weightlifting | 7 |
wrestling | 6 |
#Find probability of winning in your sport based on gender, number of medals and athletes
as.data.frame(table4)
table4$TotalFemaleAthletes <- m2
table4$ProbWin <- ((table4$totalGoldMedals / table4$TotalFemaleAthletes) * 100)
table4
#Now complete for males
athletes2016 %>%
group_by(Sport)%>%
filter(`Gold Medals` > 0 , Gender == "male") %>%
summarise(totalGoldMedals = sum(`Gold Medals`)) -> table5
knitr::kable(table5)
Sport | totalGoldMedals |
---|---|
aquatics | 55 |
archery | 4 |
athletics | 34 |
badminton | 4 |
basketball | 12 |
boxing | 10 |
canoe | 18 |
cycling | 14 |
equestrian | 10 |
fencing | 10 |
football | 18 |
golf | 1 |
gymnastics | 13 |
handball | 14 |
hockey | 18 |
judo | 7 |
modern pentathlon | 1 |
rowing | 28 |
rugby sevens | 13 |
sailing | 8 |
shooting | 9 |
table tennis | 4 |
taekwondo | 4 |
tennis | 4 |
triathlon | 1 |
volleyball | 14 |
weightlifting | 8 |
wrestling | 12 |
table5$TotalMaleAthletes <- m3
table5$ProbWin <- ((table5$totalGoldMedals / table5$TotalMaleAthletes) * 100)
table5
#Now I know that the most athletes are in age bracket 22-31 so I will calculate the probability of winning based on gender and age
as.matrix(tabyl(athletes2016, AgeBracket, Sport))
## AgeBracket aquatics archery athletics badminton basketball boxing canoe
## [1,] "12-21" "529" "32" " 350" " 14" " 19" " 50" " 38"
## [2,] "22-31" "848" "76" "1668" "138" "209" "219" "238"
## [3,] "32-41" " 68" "17" " 339" " 20" " 60" " 17" " 54"
## [4,] "42-51" " 0" " 3" " 6" " 0" " 0" " 0" " 1"
## [5,] "52-61" " 0" " 0" " 0" " 0" " 0" " 0" " 0"
## [6,] "62+" " 0" " 0" " 0" " 0" " 0" " 0" " 0"
## cycling equestrian fencing football golf gymnastics handball hockey judo
## [1,] " 56" " 8" " 21" "180" "17" "165" " 23" " 51" " 37"
## [2,] "385" "63" "172" "412" "64" "150" "265" "345" "317"
## [3,] " 81" "78" " 53" " 19" "30" " 9" " 74" " 36" " 38"
## [4,] " 3" "53" " 0" " 0" " 9" " 0" " 1" " 0" " 0"
## [5,] " 0" "19" " 0" " 0" " 0" " 0" " 0" " 0" " 0"
## [6,] " 0" " 1" " 0" " 0" " 0" " 0" " 0" " 0" " 0"
## modern pentathlon rowing rugby sevens sailing shooting table tennis
## [1,] "10" " 47" " 29" " 43" " 44" "29"
## [2,] "57" "429" "243" "261" "169" "97"
## [3,] " 5" " 68" " 28" " 71" "128" "42"
## [4,] " 0" " 2" " 0" " 4" " 42" " 2"
## [5,] " 0" " 1" " 0" " 1" " 7" " 2"
## [6,] " 0" " 0" " 0" " 0" " 0" " 0"
## taekwondo tennis triathlon volleyball weightlifting wrestling
## [1,] "34" " 12" " 5" " 32" " 50" " 29"
## [2,] "85" "143" "86" "288" "194" "286"
## [3,] " 9" " 39" "19" " 64" " 14" " 38"
## [4,] " 0" " 2" " 0" " 0" " 0" " 0"
## [5,] " 0" " 0" " 0" " 0" " 0" " 0"
## [6,] " 0" " 0" " 0" " 0" " 0" " 0"
## [1] 65 4 32 4 12 3 9 13 5 11 18 1 17 15 16 7 1 20 12 7 6 4 4 4 1
## [26] 14 7 6
athletes2016 %>%
group_by(Sport, Gender)%>%
filter(`Gold Medals` >0, AgeBracket == "22-31" , Gender == "female") %>%
summarise(AgeBracketGoldMedals = sum(`Gold Medals`)) -> table6
knitr::kable(table6)
Sport | Gender | AgeBracketGoldMedals |
---|---|---|
aquatics | female | 42 |
archery | female | 3 |
athletics | female | 26 |
badminton | female | 4 |
basketball | female | 6 |
boxing | female | 1 |
canoe | female | 7 |
cycling | female | 11 |
equestrian | female | 2 |
fencing | female | 9 |
football | female | 16 |
golf | female | 1 |
gymnastics | female | 5 |
handball | female | 10 |
hockey | female | 12 |
judo | female | 7 |
modern pentathlon | female | 1 |
rowing | female | 18 |
rugby sevens | female | 8 |
sailing | female | 6 |
shooting | female | 2 |
table tennis | female | 4 |
taekwondo | female | 4 |
tennis | female | 4 |
triathlon | female | 1 |
volleyball | female | 9 |
weightlifting | female | 5 |
wrestling | female | 3 |
#Now we have found the probability of winning gold if you are in the age bracket 22-31 compared to a different age bracket. The highest values are for the events where there is only 1 medal and the athlete fell in this age bracket e.g golf and triathlon as well as a number of other events where all gold medals were won by people in this age bracket. The lowest probabilities are in shooting, gymnastics and boxing.
table6$ProbGoldAge <- (table6$AgeBracketGoldMedals/table6$totalGoldMedals) * 100
table6$totalProbWin <- ((table4$totalGoldMedals / table4$TotalFemaleAthletes) * (table6$ProbGoldAge/100))*100
table6
#Now complete cluster sampling on the 2012 dataset and see if male cyclists also follow the same normal distribution for weight
#take all male cyclists into a dataframe
maleCyc2012 <- athletes2012 %>% group_by(Sport, Gender) %>%
filter(Sport == "Cycling - Road", Gender == "male")
maleCyc2012
#complete simple random sampling without replacement on the male cylcists 2012 dataframe
index <- sample(1:nrow(maleCyc2012), 50) #take 50 samples
cycleSamp <- maleCyc2012[index,]
cycleSamp
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 58.00 65.25 69.50 70.65 76.00 86.00
## [1] 10.75
#we can see that 1.33 * SD is higher than IQR therefore the sample is not normally distributed
normCycSamp <- sd(cycleSamp$Weight) * 1.33
print(normCycSamp)
## [1] 9.212591
#now plot the distribution
h<-hist(cycleSamp$Weight,breaks=15, main = "Distribution of 2012 Sample group: Male Cyclists' weight")
weight.plot <-c(min(h$breaks),h$breaks)
freq<-c(0,h$density,0)
xfit<-seq(min(cycleSamp$Weight),max(cycleSamp$Weight),length=40)
yfit<-dnorm(xfit,mean=mean(cycleSamp$Weight,sd=sd(cycleSamp$Weight)))
plot(weight.plot,freq,type="s",ylim=c(0,max(freq,yfit)), main="Normal pdf and histogram - Male Cyclist weights")
lines(xfit,yfit, col="red")
cycleSamp$Weight %>% mean() %>% abline(v=.,col='red',lw=2)
## [1] 70.65168
## [1] 47.98
## [1] 6.92676
#in the 2016 dataset we see mean 72.57605, SD 9.065853 and variance 82.18970 therefore it makes sense we see different SD and variance however a similar mean as the sample population is a lot fewer
#calculate the confidence interval for this mean as a representative sample of the whole dataset:
tcrit<- qt(p = 0.975, df = 50 -1)
CI95<- tcrit*(cycleSamp$Weight %>% sd()/sqrt(50))
CI95
## [1] 1.968563
#Therefore 95% CI for the sample is:
floorCI <- cycleSamp$Weight %>% mean() - CI95
roofCI <- cycleSamp$Weight %>% mean() + CI95
floorCI
## [1] 68.68312
## [1] 72.62024
I will use a critical value approach as the goal is to ascertain if the mean of the 2012 sample population is outside of the critical values determined for mean in comparison to the mean of the 2016 dataset. This allows comparison of the 2 datasets and assumes the null hypothesis is true. I will use a two-tailed approach as it will be also important to consider the upper critical mean boundary as it is just as likely a sample population mean will be heavier than it is they will be lighter.
The critical mean now sits between 70.922 and 74.24, this means Pr(70.92 <= x¯<= 74.24)=0.05. A such, assuming the population mean = 72.58, sampling 50 people from the population and finding a mean to be less than 70.92 and greater than 74.24 would happen less than 5% of the time. In the case of our sample dataset we have a mean of 71.89kg which falls within the 95% of mean values we’d expect. Therefore we can say that the mean of the 2012 dataset sample is statistically insignificant.
cycleSamp %>% summarise(Min = min(Weight,na.rm = TRUE),
Q1 = quantile(Weight,probs = .25,na.rm = TRUE),
Median = median(Weight, na.rm = TRUE),
Q3 = quantile(Weight,probs = .75,na.rm = TRUE),
Max = max(Weight,na.rm = TRUE),
Mean = mean(Weight, na.rm = TRUE),
SD = sd(Weight, na.rm = TRUE),
n = n(),
Missing = sum(is.na(Weight)))
#hypothesis is Mean = 72.57605 as per 2016 dataset
#This means the sample mean sits 1.68 SEs below the population mean of 72.576.
qt(p = 0.05/2, df = 50-1, lower.tail = TRUE)
## [1] -2.009575
#Write a short R function, named ttox to convert this back to a critical mean.
tcrittomean <- function(t ,mu, s, n) {
se <- s/sqrt(n)
x_bar <- mu + (t*se) #Determine critical mean
return(x_bar)
}
#lower critical mean - 70.922
tcrittomean(t = qt(0.05/2,50-1,lower.tail = TRUE),
mu = 72.58,
s = sd(cycleSamp$Weight),
n = length(cycleSamp$Weight))
## [1] 70.61144
#upper critical mean - 74.24
tcrittomean(t = qt(1-(0.05/2),50-1,lower.tail = TRUE),
mu = 72.58,
s = sd(cycleSamp$Weight),
n = length(cycleSamp$Weight))
## [1] 74.54856
First I cleaned the data, then did some visual analysis on the variables. After completing this I went on to complete some probability calculations with regard to the probability of winning gold based on athlete specific data. I did this by putting athletes in age brackets and grouping by sport and gender. I then calculated probability of winning based on your sport and gender, and also if female athletes were in the main age bracket as opposed to any other bracket. I created tables with final probabilities of winning. Most sports were around 5% chance if you were a female and aged 22-31 however this also ranged from 1-7% depending on the sport and how many gold medals/events were up for grabs.
This probability would be significantly lower for the other age brackets as most medals were won by people in this age bracket.
I then went on to complete sampling and hypothesis testing on the 2012 dataset. I used cluster sampling methods to sample male cyclists as this group of athletes had normal weight distribution in the 2016 dataset and I wanted to explore the weight of the same group of athletes in a different dataset.
I chose a sample size of 50 athletes as this is considered fairly large and to be a better representative sample than a smaller sample set of the total group (there were 325 male cyclists in the 2016 dataset). I completed a number of statistical summaries and did critical value hypothesis testing to come to the final conclusion that the sample mean was within 95% confidence of a reflective mean of the population. Therefore the mean of the 2012 dataset sample is statistically insignificant.
The strengths of the investigation lie in that there is strong evidence from these 2 datasets to identify the mean weight of olympic level male cyclists and therefore it is also likely the other normally distributed weights for gender and sport can be easily predicted.
However the limitiations lie with the data where weights are not normally distributed. This leaves room for future investigation. There may be rationale to combine a number of olympic datasets just by sport and gender for the specific sport and gender to analyse. Therefore a larger representative sample can be calculated and compared.
In response to my problem statement, calculating the probability of winning gold within a dataset is simple and I have created a table of probability of winning gold in each sport if you’re a “prime aged” female. However when it comes to predicting specific weight (or other continuous variable) of a separate group such as a future dataset there comes uncertainty with non-normal data. This data would have to be collected, collated and handled in a different manner.
[1] “2016 Olympics in Rio de Janeiro”, Kaggle.com, 2021. [Online]. Available: https://www.kaggle.com/rio2016/olympic-games. [Accessed: 12- May- 2021].
[2] R. Wood, “Anthropometry of Olympic Athletes 2012”, Topend Sports Website, 2012. [Online]. Available: https://www.topendsports.com/events/summer/science/anthropometry-2012.htm. [Accessed: 09- May- 2021].