Loading the required packages for investigation
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(readr)
## Warning: package 'readr' was built under R version 3.5.3
library(deductive)
## Warning: package 'deductive' was built under R version 3.5.3
library(validate)
## Warning: package 'validate' was built under R version 3.5.3
##
## Attaching package: 'validate'
## The following object is masked from 'package:dplyr':
##
## expr
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 3.5.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.5.3
## Loading required package: survival
## Warning: package 'survival' was built under R version 3.5.3
## Loading required package: Formula
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.5.3
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:validate':
##
## expr
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:validate':
##
## label, label<-
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(forecast)
## Warning: package 'forecast' was built under R version 3.5.3
The dataset “athlete”, which contains information about athletes participating at Olympics from the year 1896 to 2016, was merged with the other dataset “noc” which contains country names for the respective NOC codes. The resulting dataset had 271116 observations across 16 variables about the athlete, with “Age”, “Weight”, “Height” and “Medal” being the most important ones. The pre-processing of data was carried out with the aim of building a dataset that helps find relation between the Body Mass Index of an athlete and his/her success at the Olympics. The data structures of variables of the dataset were inspected and a few variables were converted into factor format. Body Mass Index for each athlete was calculated by using “Height” and “Weight” variables of the dataset and adding the result under a new variable “BMI”. The resulting dataset was scanned for missing values, which were handled by imputation followed by omission. Outlier scan detected numerous outliers, which were not manipulated as the data was about the physical attributes and performance of individual athletes at the Olympics. However, the critical BMI values, falling out of (15,60) range were filtered out to maintain the integrity and relevance of the dataset. Finally, the “Weight” variable’s values were found to be right-skewed, which were transformed into a more symmetrical, normally distributed graph, by using cube-root transformation.
The datasets chosen for investigation have been sourced from https://www.kaggle.com/ , which contains information about all the athletes who have participated in Olympics from the year 1896 to 2016. The first dataset “athlete” comprises of 271116 observations across 15 variables about each athlete, as given below:
The second dataset “noc” has 230 observations across 3 variables, containing the country name for each NOC code that exists in the aforementioned dataset about athletes.
Source: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results
The Country Name was added for each NOC code of an athlete by merging the “noc” into “athlete” by using a left join.
Steps undertaken:
## Loading athlete dataset
athlete <- read_csv("athlete_events.csv")
## Parsed with column specification:
## cols(
## ID = col_double(),
## Name = col_character(),
## Sex = col_character(),
## Age = col_double(),
## Height = col_double(),
## Weight = col_double(),
## Team = col_character(),
## NOC = col_character(),
## Games = col_character(),
## Year = col_double(),
## Season = col_character(),
## City = col_character(),
## Sport = col_character(),
## Event = col_character(),
## Medal = col_character()
## )
## Inspecting the head of dataset
head(athlete)
## Loading noc dataset
noc <- read_csv("noc_regions.csv")
## Parsed with column specification:
## cols(
## NOC = col_character(),
## region = col_character(),
## notes = col_character()
## )
## Inspecting the head of dataset
head(noc)
## Joining the dataset using left_join()
olympic <- athlete %>% left_join(noc, by = "NOC") %>% select(-notes)
The variables were summarized using summarized, and the data structures of the variables were checked. The data structure of variables “Sex”, “Season” and “Medal” was found to be improper. These character variables were converted into factors using as.factor() function. Furthermore, the levels in “Medal” variable were also defined as “Bronze” < “Silver” < “Gold”.
Steps taken:
## Summarizing the variables of olympic dataset
olympic %>% summary()
## ID Name Sex Age
## Min. : 1 Length:271116 Length:271116 Min. :10.00
## 1st Qu.: 34643 Class :character Class :character 1st Qu.:21.00
## Median : 68205 Mode :character Mode :character Median :24.00
## Mean : 68249 Mean :25.56
## 3rd Qu.:102097 3rd Qu.:28.00
## Max. :135571 Max. :97.00
## NA's :9474
## Height Weight Team NOC
## Min. :127.0 Min. : 25.0 Length:271116 Length:271116
## 1st Qu.:168.0 1st Qu.: 60.0 Class :character Class :character
## Median :175.0 Median : 70.0 Mode :character Mode :character
## Mean :175.3 Mean : 70.7
## 3rd Qu.:183.0 3rd Qu.: 79.0
## Max. :226.0 Max. :214.0
## NA's :60171 NA's :62875
## Games Year Season City
## Length:271116 Min. :1896 Length:271116 Length:271116
## Class :character 1st Qu.:1960 Class :character Class :character
## Mode :character Median :1988 Mode :character Mode :character
## Mean :1978
## 3rd Qu.:2002
## Max. :2016
##
## Sport Event Medal
## Length:271116 Length:271116 Length:271116
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## region
## Length:271116
## Class :character
## Mode :character
##
##
##
##
## Checking data structure of variables of olympic
str(olympic)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 271116 obs. of 16 variables:
## $ ID : num 1 2 3 4 5 5 5 5 5 5 ...
## $ Name : chr "A Dijiang" "A Lamusi" "Gunnar Nielsen Aaby" "Edgar Lindenau Aabye" ...
## $ Sex : chr "M" "M" "M" "M" ...
## $ Age : num 24 23 24 34 21 21 25 25 27 27 ...
## $ Height: num 180 170 NA NA 185 185 185 185 185 185 ...
## $ Weight: num 80 60 NA NA 82 82 82 82 82 82 ...
## $ Team : chr "China" "China" "Denmark" "Denmark/Sweden" ...
## $ NOC : chr "CHN" "CHN" "DEN" "DEN" ...
## $ Games : chr "1992 Summer" "2012 Summer" "1920 Summer" "1900 Summer" ...
## $ Year : num 1992 2012 1920 1900 1988 ...
## $ Season: chr "Summer" "Summer" "Summer" "Summer" ...
## $ City : chr "Barcelona" "London" "Antwerpen" "Paris" ...
## $ Sport : chr "Basketball" "Judo" "Football" "Tug-Of-War" ...
## $ Event : chr "Basketball Men's Basketball" "Judo Men's Extra-Lightweight" "Football Men's Football" "Tug-Of-War Men's Tug-Of-War" ...
## $ Medal : chr NA NA NA "Gold" ...
## $ region: chr "China" "China" "Denmark" "Denmark" ...
## Converting data structure of Sex to factor
olympic$Sex <- olympic$Sex %>% as.factor()
## Converting data structure of Season to factor
olympic$Season <- olympic$Season %>% as.factor()
## Converting data structure of Medal to ordered factor with 3 levels
olympic$Medal <- olympic$Medal %>% factor(levels = c("Bronze", "Silver","Gold"), ordered = TRUE)
## Checking data structure of variables of olympic
str(olympic)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 271116 obs. of 16 variables:
## $ ID : num 1 2 3 4 5 5 5 5 5 5 ...
## $ Name : chr "A Dijiang" "A Lamusi" "Gunnar Nielsen Aaby" "Edgar Lindenau Aabye" ...
## $ Sex : Factor w/ 2 levels "F","M": 2 2 2 2 1 1 1 1 1 1 ...
## $ Age : num 24 23 24 34 21 21 25 25 27 27 ...
## $ Height: num 180 170 NA NA 185 185 185 185 185 185 ...
## $ Weight: num 80 60 NA NA 82 82 82 82 82 82 ...
## $ Team : chr "China" "China" "Denmark" "Denmark/Sweden" ...
## $ NOC : chr "CHN" "CHN" "DEN" "DEN" ...
## $ Games : chr "1992 Summer" "2012 Summer" "1920 Summer" "1900 Summer" ...
## $ Year : num 1992 2012 1920 1900 1988 ...
## $ Season: Factor w/ 2 levels "Summer","Winter": 1 1 1 1 2 2 2 2 2 2 ...
## $ City : chr "Barcelona" "London" "Antwerpen" "Paris" ...
## $ Sport : chr "Basketball" "Judo" "Football" "Tug-Of-War" ...
## $ Event : chr "Basketball Men's Basketball" "Judo Men's Extra-Lightweight" "Football Men's Football" "Tug-Of-War Men's Tug-Of-War" ...
## $ Medal : Ord.factor w/ 3 levels "Bronze"<"Silver"<..: NA NA NA 3 NA NA NA NA NA NA ...
## $ region: chr "China" "China" "Denmark" "Denmark" ...
The dataset was found to be tidy, as per the following observations:
The variables “Weight” and “Height” were used to create a new variable “BMI”. BMI stands for Body Mass Index which is the factor of a person’s body weight and height. It is used to check if a person is underweight, normal weight, overweight, or obese. The formula for calculating BMI is:
\[ BMI = Weight(kg) * 10000 / Height^2(cm) \]
The mutate() function was used to add a variable to “olympic” dataset using the above mentioned formula for BMI.
## Creating new variable for Body Mass Index BMI, while preserving other columns
olympic <- mutate(olympic, BMI = (Weight*10000)/(Height^2))
## Checking the data structure of BMI
str(olympic$BMI)
## num [1:271116] 24.7 20.8 NA NA 24 ...
6 variables were found to be having missing values. These inconsistencies were imputed and omitted, depending upon the nature of error.
## Finding total missing values in each variable
colSums(is.na(olympic))
## ID Name Sex Age Height Weight Team NOC Games Year
## 0 0 0 9474 60171 62875 0 0 0 0
## Season City Sport Event Medal region BMI
## 0 0 0 0 231333 370 64263
## Defining function for NA values in Medal variable
nomedal <- "No Medal"
## Imputing missing values of Medal variable
olympic$Medal <- impute(olympic$Medal, fun = nomedal)
## Checking levels of Medal variable
levels(olympic$Medal)
## [1] "Bronze" "Silver" "Gold" "No Medal"
## Redefining the order of levels of Medal variable
olympic$Medal <- olympic$Medal %>% factor(levels = c("No Medal","Bronze", "Silver","Gold"), ordered = TRUE)
## Finding total missing values in each variable
colSums(is.na(olympic))
## ID Name Sex Age Height Weight Team NOC Games Year
## 0 0 0 9474 60171 62875 0 0 0 0
## Season City Sport Event Medal region BMI
## 0 0 0 0 0 370 64263
## Omitting the rows that have missing values
olympic <- na.omit(olympic)
## Finding total missing values in each variable
colSums(is.na(olympic))
## ID Name Sex Age Height Weight Team NOC Games Year
## 0 0 0 0 0 0 0 0 0 0
## Season City Sport Event Medal region BMI
## 0 0 0 0 0 0 0
The dataset consists of 4 crucial numeric variables, i.e. “Height”, “Weight”, “Age” and “BMI”. Boxplot was plotted for each of the 4 variables to observe the presence of any outliers using boxplot() function.
It was observed that all the 4 variables had numerous outliers present. BMI lower than 15 or greater than 60 is extremely rare, so the BMI variable “BMI” of dataset “olympic” was filtered for handling any outliers outside the rang (15,60). Outlier handling was not carried out further due to the fact that age, weight and height are recording for an athlete at the time of his/her participation at the Olympics, and should not be manipulated.
## Boxplot for Age
boxplot(olympic$Age, main = "Age of Athlete")
## Boxplot for Height
boxplot(olympic$Height, main = "Height of Athlete")
## Boxplot for Weight
boxplot(olympic$Weight, main = "Weight of Athlete")
## Boxplot for BMI
boxplot(olympic$BMI, main = "BMI of Athlete")
## Filtering out critical values of BMI
olympic <- olympic %>% filter(BMI >= 15, BMI <= 60)
A histogram was plotted for the variable “Weight”, which was found to be right-skewed. As per the concepts of transformation, right-skewness in the distribution can be reduced by taking roots of the values. The right skewness in Weight variable’s histogram was reduced here by taking the cube root of weights of athletes. Cube root values ensured a much symmetrical, normal distribution.
Steps undertaken:
It can be seen from the graph that the right-skewness of the variables has been significantly reduced.
## Plotting histogram for Weight
hist(olympic$Weight, breaks = 1000)
## Finding cube root of Weight values
olympic_Weight <- (olympic$Weight)^(1/3)
## Plotting histogram again
hist(olympic_Weight, breaks = 1000)