MATH2349 Semester 1, 2019

Required packages

Loading the required packages for investigation

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(readr)

## Warning: package 'readr' was built under R version 3.5.3

library(deductive)

## Warning: package 'deductive' was built under R version 3.5.3

library(validate)

## Warning: package 'validate' was built under R version 3.5.3

## 
## Attaching package: 'validate'

## The following object is masked from 'package:dplyr':
## 
##     expr

library(Hmisc)

## Warning: package 'Hmisc' was built under R version 3.5.3

## Loading required package: lattice

## Warning: package 'lattice' was built under R version 3.5.3

## Loading required package: survival

## Warning: package 'survival' was built under R version 3.5.3

## Loading required package: Formula

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.5.3

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:validate':
## 
##     expr

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:validate':
## 
##     label, label<-

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(forecast)

## Warning: package 'forecast' was built under R version 3.5.3

Executive Summary

The dataset “athlete”, which contains information about athletes participating at Olympics from the year 1896 to 2016, was merged with the other dataset “noc” which contains country names for the respective NOC codes. The resulting dataset had 271116 observations across 16 variables about the athlete, with “Age”, “Weight”, “Height” and “Medal” being the most important ones. The pre-processing of data was carried out with the aim of building a dataset that helps find relation between the Body Mass Index of an athlete and his/her success at the Olympics. The data structures of variables of the dataset were inspected and a few variables were converted into factor format. Body Mass Index for each athlete was calculated by using “Height” and “Weight” variables of the dataset and adding the result under a new variable “BMI”. The resulting dataset was scanned for missing values, which were handled by imputation followed by omission. Outlier scan detected numerous outliers, which were not manipulated as the data was about the physical attributes and performance of individual athletes at the Olympics. However, the critical BMI values, falling out of (15,60) range were filtered out to maintain the integrity and relevance of the dataset. Finally, the “Weight” variable’s values were found to be right-skewed, which were transformed into a more symmetrical, normally distributed graph, by using cube-root transformation.

Data

The datasets chosen for investigation have been sourced from https://www.kaggle.com/ , which contains information about all the athletes who have participated in Olympics from the year 1896 to 2016. The first dataset “athlete” comprises of 271116 observations across 15 variables about each athlete, as given below:

ID - Unique number for each athlete
Name - Athlete’s name
Sex - Gender of athlete
Age - Age of athlete
Height - Height of athlete in centimeters
Weight - Weight of athlete in kilograms
Team - Team name of athlete
NOC - National Olympic Committee 3-letter code
Games - Year and season of tournament
Year - Year of tournament
Season - Summer or Winter
City - Host city
Sport - Sport
Event - Event
Medal - Gold, Silver, Bronze, or NA

The second dataset “noc” has 230 observations across 3 variables, containing the country name for each NOC code that exists in the aforementioned dataset about athletes.

Source: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results

The Country Name was added for each NOC code of an athlete by merging the “noc” into “athlete” by using a left join.

Steps undertaken:

Datasets “athlete” and “noc” were loaded into R using read_csv() function
The head of datasets were displayed using head() function
The two datasets were merged by using left_join() function with “NOC” as the common variable
While merging, the variable “notes” from dataset “noc” was removed because of irrelevance
The final dataset created for further investigation was named “olympic”

## Loading athlete dataset
athlete <- read_csv("athlete_events.csv")

## Parsed with column specification:
## cols(
##   ID = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   Height = col_double(),
##   Weight = col_double(),
##   Team = col_character(),
##   NOC = col_character(),
##   Games = col_character(),
##   Year = col_double(),
##   Season = col_character(),
##   City = col_character(),
##   Sport = col_character(),
##   Event = col_character(),
##   Medal = col_character()
## )

## Inspecting the head of dataset
head(athlete)

## Loading noc dataset
noc <- read_csv("noc_regions.csv")

## Parsed with column specification:
## cols(
##   NOC = col_character(),
##   region = col_character(),
##   notes = col_character()
## )

## Inspecting the head of dataset
head(noc)

## Joining the dataset using left_join()
olympic <- athlete %>% left_join(noc, by = "NOC") %>% select(-notes)

Understand

The variables were summarized using summarized, and the data structures of the variables were checked. The data structure of variables “Sex”, “Season” and “Medal” was found to be improper. These character variables were converted into factors using as.factor() function. Furthermore, the levels in “Medal” variable were also defined as “Bronze” < “Silver” < “Gold”.

Steps taken:

The summary of variables was checked using summary() function
The data structure of variables was inspected using str() function
The variables “Sex” and “Season” were converted to a factor format using as.factor() function
The variable “Medal” was converted into an ordered factor, with levels “Bronze” < “Silver” < “Gold”

## Summarizing the variables of olympic dataset
olympic %>% summary()

##        ID             Name               Sex                 Age       
##  Min.   :     1   Length:271116      Length:271116      Min.   :10.00  
##  1st Qu.: 34643   Class :character   Class :character   1st Qu.:21.00  
##  Median : 68205   Mode  :character   Mode  :character   Median :24.00  
##  Mean   : 68249                                         Mean   :25.56  
##  3rd Qu.:102097                                         3rd Qu.:28.00  
##  Max.   :135571                                         Max.   :97.00  
##                                                         NA's   :9474   
##      Height          Weight          Team               NOC           
##  Min.   :127.0   Min.   : 25.0   Length:271116      Length:271116     
##  1st Qu.:168.0   1st Qu.: 60.0   Class :character   Class :character  
##  Median :175.0   Median : 70.0   Mode  :character   Mode  :character  
##  Mean   :175.3   Mean   : 70.7                                        
##  3rd Qu.:183.0   3rd Qu.: 79.0                                        
##  Max.   :226.0   Max.   :214.0                                        
##  NA's   :60171   NA's   :62875                                        
##     Games                Year         Season              City          
##  Length:271116      Min.   :1896   Length:271116      Length:271116     
##  Class :character   1st Qu.:1960   Class :character   Class :character  
##  Mode  :character   Median :1988   Mode  :character   Mode  :character  
##                     Mean   :1978                                        
##                     3rd Qu.:2002                                        
##                     Max.   :2016                                        
##                                                                         
##     Sport              Event              Medal          
##  Length:271116      Length:271116      Length:271116     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##     region         
##  Length:271116     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

## Checking data structure of variables of olympic
str(olympic)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 271116 obs. of  16 variables:
##  $ ID    : num  1 2 3 4 5 5 5 5 5 5 ...
##  $ Name  : chr  "A Dijiang" "A Lamusi" "Gunnar Nielsen Aaby" "Edgar Lindenau Aabye" ...
##  $ Sex   : chr  "M" "M" "M" "M" ...
##  $ Age   : num  24 23 24 34 21 21 25 25 27 27 ...
##  $ Height: num  180 170 NA NA 185 185 185 185 185 185 ...
##  $ Weight: num  80 60 NA NA 82 82 82 82 82 82 ...
##  $ Team  : chr  "China" "China" "Denmark" "Denmark/Sweden" ...
##  $ NOC   : chr  "CHN" "CHN" "DEN" "DEN" ...
##  $ Games : chr  "1992 Summer" "2012 Summer" "1920 Summer" "1900 Summer" ...
##  $ Year  : num  1992 2012 1920 1900 1988 ...
##  $ Season: chr  "Summer" "Summer" "Summer" "Summer" ...
##  $ City  : chr  "Barcelona" "London" "Antwerpen" "Paris" ...
##  $ Sport : chr  "Basketball" "Judo" "Football" "Tug-Of-War" ...
##  $ Event : chr  "Basketball Men's Basketball" "Judo Men's Extra-Lightweight" "Football Men's Football" "Tug-Of-War Men's Tug-Of-War" ...
##  $ Medal : chr  NA NA NA "Gold" ...
##  $ region: chr  "China" "China" "Denmark" "Denmark" ...

## Converting data structure of Sex to factor
olympic$Sex <- olympic$Sex %>% as.factor()

## Converting data structure of Season to factor
olympic$Season <- olympic$Season %>% as.factor()

## Converting data structure of Medal to ordered factor with 3 levels
olympic$Medal <- olympic$Medal %>% factor(levels = c("Bronze", "Silver","Gold"), ordered = TRUE)

## Checking data structure of variables of olympic
str(olympic)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 271116 obs. of  16 variables:
##  $ ID    : num  1 2 3 4 5 5 5 5 5 5 ...
##  $ Name  : chr  "A Dijiang" "A Lamusi" "Gunnar Nielsen Aaby" "Edgar Lindenau Aabye" ...
##  $ Sex   : Factor w/ 2 levels "F","M": 2 2 2 2 1 1 1 1 1 1 ...
##  $ Age   : num  24 23 24 34 21 21 25 25 27 27 ...
##  $ Height: num  180 170 NA NA 185 185 185 185 185 185 ...
##  $ Weight: num  80 60 NA NA 82 82 82 82 82 82 ...
##  $ Team  : chr  "China" "China" "Denmark" "Denmark/Sweden" ...
##  $ NOC   : chr  "CHN" "CHN" "DEN" "DEN" ...
##  $ Games : chr  "1992 Summer" "2012 Summer" "1920 Summer" "1900 Summer" ...
##  $ Year  : num  1992 2012 1920 1900 1988 ...
##  $ Season: Factor w/ 2 levels "Summer","Winter": 1 1 1 1 2 2 2 2 2 2 ...
##  $ City  : chr  "Barcelona" "London" "Antwerpen" "Paris" ...
##  $ Sport : chr  "Basketball" "Judo" "Football" "Tug-Of-War" ...
##  $ Event : chr  "Basketball Men's Basketball" "Judo Men's Extra-Lightweight" "Football Men's Football" "Tug-Of-War Men's Tug-Of-War" ...
##  $ Medal : Ord.factor w/ 3 levels "Bronze"<"Silver"<..: NA NA NA 3 NA NA NA NA NA NA ...
##  $ region: chr  "China" "China" "Denmark" "Denmark" ...

Tidy & Manipulate Data I

The dataset was found to be tidy, as per the following observations:

Each variable has its own column
Each observation has its own row
Each value has its own cell

Tidy & Manipulate Data II

The variables “Weight” and “Height” were used to create a new variable “BMI”. BMI stands for Body Mass Index which is the factor of a person’s body weight and height. It is used to check if a person is underweight, normal weight, overweight, or obese. The formula for calculating BMI is:

\[ BMI = Weight(kg) * 10000 / Height^2(cm) \]

The mutate() function was used to add a variable to “olympic” dataset using the above mentioned formula for BMI.

## Creating new variable for Body Mass Index BMI, while preserving other columns
olympic <- mutate(olympic, BMI = (Weight*10000)/(Height^2))

## Checking the data structure of BMI
str(olympic$BMI)

##  num [1:271116] 24.7 20.8 NA NA 24 ...

Scan I

6 variables were found to be having missing values. These inconsistencies were imputed and omitted, depending upon the nature of error.

The colSums() function was used to find the total missing values in each column
“Medal” variable had the most number of missing values, i.e. 231,331. The “NA” values in this column denoted no medal. Hence, these NA values were handled by imputing “No Medal” in the cells containing “NA”. After imputing, the levels of “Medal” variables were inspected and changed as per the new requirement, i,e, “No Medal” < “Bronze” < “Silver” < “Gold”
The missing values for each column was checked again using colSums()
The missing values were found to be mostly existing in “Height” and “Weight” variables, which are directly tied to the missing values in the column “BMI”. These missing values are due to incomplete data collection, and the inconsistency existed in the original datasets. Since, the purpose of this scan is to prepare the dataset for finding a relation between BMI and success at Olympics (medal), the missing values in variables “Weight” and “Height” should not be imputed with any constant. As a result, these rows were omitted using na.omit() function.
The colSums() was used to scan for missing values again, and it showed no inconsistency

## Finding total missing values in each variable
colSums(is.na(olympic))

##     ID   Name    Sex    Age Height Weight   Team    NOC  Games   Year 
##      0      0      0   9474  60171  62875      0      0      0      0 
## Season   City  Sport  Event  Medal region    BMI 
##      0      0      0      0 231333    370  64263

## Defining function for NA values in Medal variable
nomedal <- "No Medal"

## Imputing missing values of Medal variable
olympic$Medal <- impute(olympic$Medal, fun = nomedal)

## Checking levels of Medal variable
levels(olympic$Medal)

## [1] "Bronze"   "Silver"   "Gold"     "No Medal"

## Redefining the order of levels of Medal variable
olympic$Medal <- olympic$Medal %>% factor(levels = c("No Medal","Bronze", "Silver","Gold"), ordered = TRUE)

## Finding total missing values in each variable
colSums(is.na(olympic))

##     ID   Name    Sex    Age Height Weight   Team    NOC  Games   Year 
##      0      0      0   9474  60171  62875      0      0      0      0 
## Season   City  Sport  Event  Medal region    BMI 
##      0      0      0      0      0    370  64263

## Omitting the rows that have missing values
olympic <- na.omit(olympic)

## Finding total missing values in each variable
colSums(is.na(olympic))

##     ID   Name    Sex    Age Height Weight   Team    NOC  Games   Year 
##      0      0      0      0      0      0      0      0      0      0 
## Season   City  Sport  Event  Medal region    BMI 
##      0      0      0      0      0      0      0

Scan II

The dataset consists of 4 crucial numeric variables, i.e. “Height”, “Weight”, “Age” and “BMI”. Boxplot was plotted for each of the 4 variables to observe the presence of any outliers using boxplot() function.

It was observed that all the 4 variables had numerous outliers present. BMI lower than 15 or greater than 60 is extremely rare, so the BMI variable “BMI” of dataset “olympic” was filtered for handling any outliers outside the rang (15,60). Outlier handling was not carried out further due to the fact that age, weight and height are recording for an athlete at the time of his/her participation at the Olympics, and should not be manipulated.

## Boxplot for Age
boxplot(olympic$Age, main = "Age of Athlete")

## Boxplot for Height
boxplot(olympic$Height, main = "Height of Athlete")

## Boxplot for Weight
boxplot(olympic$Weight, main = "Weight of Athlete")

## Boxplot for BMI
boxplot(olympic$BMI, main = "BMI of Athlete")

## Filtering out critical values of BMI
olympic <- olympic %>% filter(BMI >= 15, BMI <= 60)

Transform

A histogram was plotted for the variable “Weight”, which was found to be right-skewed. As per the concepts of transformation, right-skewness in the distribution can be reduced by taking roots of the values. The right skewness in Weight variable’s histogram was reduced here by taking the cube root of weights of athletes. Cube root values ensured a much symmetrical, normal distribution.