Required packages

library(readr)
library(dplyr)
library(editrules)
library(MVN)

Executive Summary

The two tables in the Exploring 120 Years of Olympic History dataset are loaded in to R, stored as respective dataframe objects, and briefly explored by displaying the head() of each dataframe. The individual tables are combined with a join on a shared value, creating a single dataframe that contains all features in the Olympic History dataset. A sample of the joined dataframe is taken and data types of each variable are explored, using the class() function to explore typed and str() function to display the size of the dataset and a sample of each variable. These exploration processes highlight incorrectly assigned data types that need to be converted. After these transformations are applied, the structure of the data is explored using the Tidy Data Principles to determine if it is tidy. A new variable is created by multiplying existing ones with the mutate() function. Missing values, inconsistencies, & errors are explored with the use of the is.na() function (for the whole dataset with sum() and individual columns with colSums()). A user-defined function locates special values, and a rule set to identify obvious errors is applied to the data. Once located, the problematic values are handled for both numeric and categorical features; either by string replacements or imputations with statistics from similar observations. Any outliers in numerical features are explored and dealt with using capping for univariate outliers and mvn()s Mahalanobis distances for multivariate outliers. Methods of dealing with outliers are determined by calculating the percentage of outliers. After exploring their distributions, mathematical operations are used to transform numeric variables to make them more closely resemble normality. The resulting dataset is tidy, complete, and suitable to use for statistical analysis or machine learning applications.

Data

The 120 Years of Olympic History: Athletes and Results has been uploaded to Kaggle.com and falls under the Creative Commons License. The dataset consists of two tables; athlete_events.csv and noc_regions.csv. The athlete events dataset contains information about Olympic Athletes and events that they have competed in, including biological data (Age, Sex, Height, Weight etc.) and event data (Year, Season, City, Sport etc.). The NOC (National Organizing Committee) regions dataset possesses information about the countries that compete in the Olympics, including the country name and any notes about said country.

The Athletes dataset contains 15 variables:

The NOC Regions dataset contains 3 variables:

Load Data

Each dataset is individually loaded in to R using the read_csv() function from the readr library. read_csv() is used instead of base R’s read.csv() since it it faster at processing larger datasets. The athlete_events.csv dataset is assigned to an object named athletes. After which, the head() function displays the first 5 rows of the newly created data.frame to ensure that everything has been imported correctly.

athletes <- read_csv("Data/Olympics/athlete_events.csv", col_types = cols())
head(athletes, 5)

The same process used to load athlete_events.csv is applied to noc_regions.csv, with this dataset being assigned to an object called regions.

regions <- read_csv("Data/Olympics/noc_regions.csv", col_types = cols())
head(regions, 5)

Merge

The two datasets share a common variable: NOC. A left_join is used to join regions to athletes by NOC value. This preserves all of the athlete’s information within athletes and appends the corresponding region information from regions, applying an NA value where no matching information is available. These merged datasets are assigned to a new object called olympics. Once again calling the head() function shows the first 5 rows of the newly created data.frame to confirm that the two datasets have been merged correctly.

olympics <- left_join(athletes, regions, by = "NOC")
head(olympics, 5)

After the join is performed, due to the size of the dataset a random sample of 500 rows will be partitioned for analysis to limit processing time using the sample_n() function.

olympics_sample <- sample_n(olympics, 5000)

Understand

The function class() can be called to determine the data type stored within the variables in the olympics dataset. To prevent the need to call class() once for each individual variable, the sapply function administers it to each column automatically.

sapply(olympics_sample, class)
         ID        Name         Sex         Age      Height      Weight        Team         NOC       Games        Year      Season        City       Sport 
  "numeric" "character" "character"   "numeric"   "numeric"   "numeric" "character" "character" "character"   "numeric" "character" "character" "character" 
      Event       Medal      region       notes 
"character" "character" "character" "character" 

There are multiple data types within the olympics dataset, however the read_csv() function has assigned them all as either character or numeric. While this is correct in some cases (Name, Height, Weight, etc.) some have been mis-labelled. Specifically, Medals should be an ordered factor, Sex should be a factor and requires additional re-labelling of levels, and multiple other features require factor conversions.

Base R’s str() function displays the structure of the olympics dataset, including the attribute name, data type, and sample of values for each column. str() also returns the number of variables and observations in the dataset.

str(olympics_sample)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    5000 obs. of  17 variables:
 $ ID    : num  94906 67498 62645 80790 113748 ...
 $ Name  : chr  "Paulo Braune Pimenta" "Pierre Le Coq" "Primo Kopa" "Aubrey Maphosa Modiba" ...
 $ Sex   : chr  "M" "M" "M" "M" ...
 $ Age   : num  23 27 21 21 33 22 33 29 39 27 ...
 $ Height: num  169 186 NA 172 180 174 177 NA 183 178 ...
 $ Weight: num  65 72 NA 64 102 63 72 NA NA 68 ...
 $ Team  : chr  "Brazil" "France" "Slovenia" "South Africa" ...
 $ NOC   : chr  "BRA" "FRA" "SLO" "RSA" ...
 $ Games : chr  "1984 Summer" "2016 Summer" "1992 Winter" "2016 Summer" ...
 $ Year  : num  1984 2016 1992 2016 2014 ...
 $ Season: chr  "Summer" "Summer" "Winter" "Summer" ...
 $ City  : chr  "Los Angeles" "Rio de Janeiro" "Albertville" "Rio de Janeiro" ...
 $ Sport : chr  "Shooting" "Sailing" "Ski Jumping" "Football" ...
 $ Event : chr  "Shooting Men's Air Rifle, 10 metres" "Sailing Men's Windsurfer" "Ski Jumping Men's Large Hill, Team" "Football Men's Football" ...
 $ Medal : chr  NA "Bronze" NA NA ...
 $ region: chr  "Brazil" "France" "Slovenia" "South Africa" ...
 $ notes : chr  NA NA NA NA ...

The output of str() confirms that some variables have been mis-assigned data types. Additionally, the shape of the dataset is shown to be 5,000 observations of 17 variables. This further confirms that the two datasets have been correctly joined since athletes contained 15 variables, regions was made up of 3, and they each shared a common attribute to make up 17 total variables. Displaying a sample of the values within each variable provides some insight to the contents of the dataset. Most are to be expected, (Age is a 2-digit integer, Team is a Country name as a character string, etc.), however str() helps to explain the difference between Sport and Event: An Event looks to be the subset of a Sport, i.e. the Sport Basketball encompasses basketball-related events including Basketball Men's Basketball.

Conversions

Using the output of str() as a guide, conversions can be applied to the variables with mis-assigned data types. Firstly, the variables that should be un-ordered factors are each added to a character vector called factor_names. Using the lapply() function, each variable in the factor_names list is converted to a factor, without any changes to level names or ordering. Secondly, the Sex variable is of type Chr when it should be a factor. Using the factor() function, the levels M and F are re-named to Male and Female respectively and Sex is converted to a factor.

unique(olympics_sample$Year)
 [1] 1984 2016 1992 2014 2004 1998 1976 1924 2012 1996 1980 1956 2000 1948 1988 1920 1936 1952 1960 2008 1964 1928 1972 1912 1932 1968 2002 1994 2010 1908 2006
[32] 1906 1904 1900 1896

While Year contains 4-digit integers, these values skip every 2 to 4 years due to the amount of time between Olympic games. For this reason, Year will be converted to a factor.

unique(olympics_sample$Medal)
[1] NA       "Bronze" "Gold"   "Silver"

Finally, the Medal variable can be ordered by the podium positions: Gold > Silver > Bronze > NA. This is again performed with the factor() function, this time by also specifying the order of the levels and setting the ordered argument to T (or TRUE).

# Character to factor (no orderering or labelling)
factor_names <- c("Team", "NOC", "Games", "Season", "City", "Sport", "Event", "region")
olympics_sample[, factor_names] <- lapply(olympics_sample[, factor_names], factor)

# Character to factor (labelled)
olympics_sample$Sex <- factor(olympics_sample$Sex, levels = c("M", "F"), labels = c("Male", "Female"), ordered = F)

# Character to factor (ordered)
olympics_sample$Medal <- factor(olympics_sample$Medal, levels = c(NA, "Bronze", "Silver", "Gold"), ordered = T)

# number to factor
olympics_sample$Year <- factor(olympics_sample$Year)

str(olympics_sample)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    5000 obs. of  17 variables:
 $ ID    : num  94906 67498 62645 80790 113748 ...
 $ Name  : chr  "Paulo Braune Pimenta" "Pierre Le Coq" "Primo Kopa" "Aubrey Maphosa Modiba" ...
 $ Sex   : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 2 1 1 1 1 ...
 $ Age   : num  23 27 21 21 33 22 33 29 39 27 ...
 $ Height: num  169 186 NA 172 180 174 177 NA 183 178 ...
 $ Weight: num  65 72 NA 64 102 63 72 NA NA 68 ...
 $ Team  : Factor w/ 279 levels "Afghanistan",..: 34 94 224 225 15 204 204 15 257 94 ...
 $ NOC   : Factor w/ 185 levels "AFG","AHO","ALG",..: 23 57 148 138 12 139 139 12 174 57 ...
 $ Games : Factor w/ 51 levels "1896 Summer",..: 34 51 39 51 50 45 42 30 8 34 ...
 $ Year  : Factor w/ 35 levels "1896","1900",..: 21 35 23 35 34 29 26 19 8 21 ...
 $ Season: Factor w/ 2 levels "Summer","Winter": 1 1 2 1 2 1 2 1 1 1 ...
 $ City  : Factor w/ 42 levels "Albertville",..: 19 28 1 28 35 4 25 22 27 19 ...
 $ Sport : Factor w/ 59 levels "Aeronautics",..: 41 40 44 22 11 5 15 48 41 5 ...
 $ Event : Factor w/ 564 levels "Aeronautics Mixed Aeronautics",..: 369 349 429 253 124 78 160 463 404 47 ...
 $ Medal : Ord.factor w/ 3 levels "Bronze"<"Silver"<..: NA 1 NA NA NA NA NA NA 3 NA ...
 $ region: Factor w/ 170 levels "Afghanistan",..: 21 52 139 140 10 127 127 10 162 52 ...
 $ notes : chr  NA NA NA NA ...

Showing the structure of the newly-converted dataset shows that the factors originally mistakenly assigned as chr have been correctly changed to Factor, and additionally, the levels within Sex have been re-labelled as Male and Female from M and F respectively, and finally Medal is now an ordered factor corresponding to the podium placements.

Tidy & Manipulate Data I

Hadley Wickham and Grolemund’s Tidy Data principles state that in tidy data:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

Investigating each of these principles in the context of the olympics dataset will determine whether or not any conversions need to be applied to make the data tidy.

Firstly, each variable in the olympics dataset has its own column. The columns are always associated with a single variable, and any variable can simply be accessed with the standard R codes. For example: olympics$Name will return Athlete’s names and nothing else. olympics$Age will only return their ages, etc. Secondly, each observation in the olympics dataset has its own row; as discussed in the Data description, each row in the data frame corresponds to an athlete and the olympics that they competed in. Finally, each value in the olympics dataset has its own cell. At first glance there appears to be a breach of this principle in the variable Games as it contains a Year and Season (both of which are separate variables), however the description of the dataset on Kaggle.com states that “Winter and Summer Games were held in the same year up until 1992. After that, they staggered them such that Winter Games occur on a four year cycle starting with 1994, then Summer in 1996, then Winter in 1998, and so on.”, so for this purpose the Games variable is important as its own attribute.

From examining each of the tidy data principles, it can be concluded that the olympics dataset is tidy and reshaping is not required.

Tidy & Manipulate Data II

Since the olympics dataset contains multiple biological attributes, it is possible to calculate each athlete’s Body Mass Index (or BMI). Calculating BMI as a body-structure indicator can provide helpful insight to the different sizes of athletes between Sports, Age, Gender, etc.

BMI is calculated as (\(BMI = \frac{kg}{m^2}\)) or BMI = Weight (in kg) / Height (in m)^2. This calculation is performed within the mutate function of the dplyr package. A new variable BMI is created by dividing the existing variables Weight by Height by Height (to incorporate height^2) and multiplying the result by 10000 to factor in the incorrect measurement type for Height (Height is recorded in cm in the dataset, BMI calculations require meters). A sample of the resulting dataset shows the calculated BMIs for certain athletes.

# BMI = kg/m2 
olympics_mutate <- olympics_sample %>% mutate(
  BMI = round((Weight / Height / Height) * 10000, 2) 
)

olympics_mutate[c(1:5), ] %>% select(Name, Sex, Age, Height, Weight, BMI)

Scan I

To explore (and potentially deal with) inconsistencies and missing values, first their presence or absence needs to be determined. Starting with NA values; base R’s sum() and is.na() functions are applied to the olympics_mutate dataframe to determine a total count of all NA instances.

Missing Values

# Total missing values
sum(is.na(olympics_mutate))
[1] 12726

There are a total of 12,726 missing values in the olympics_mutate dataframe, which is too many to blindly remove or ignore as it would potentially distort the underlying message of the data. Calling the colSums() function on the dataframe will display which columns contain these NA instances.

# Missing values by column
colSums(is.na(olympics_mutate))
    ID   Name    Sex    Age Height Weight   Team    NOC  Games   Year Season   City  Sport  Event  Medal region  notes    BMI 
     0      0      0    165   1090   1133      0      0      0      0      0      0      0      0   4253     14   4909   1162 

The NA values are spread amongst the variables Age, Height, Weight, Medal, region, notes, and the mutated feature BMI. The numeric variables can be imputed but first the categorical variables will be addressed.

First, the nature of the Medal variable suggests that a missing value indicates that the athlete did not win a medal for that particular Event, Year, and Games combination. These NAs will be replaced by a new factor level called None. To accomplish this, the existing levels of Medals are assigned to a character vector medal_levels, the length of the vector is increased by 1 and the empty value is replaced by the string "None". Similar to the operations performed in 5. Understand, the Medal variable is converted to an ordered factor using the new medal_levels. Using the is.na() function, the NAs are replaced with the new factor level None. This process preserves the order of the Medal variable so that Gold > Silver > Bronze > None.

olympics_complete <- olympics_mutate

medal_levels <- levels(olympics_complete$Medal)
medal_levels[length(medal_levels) + 1] <- "None"

olympics_complete$Medal <- factor(olympics_complete$Medal, levels = medal_levels, order = T)
olympics_complete$Medal[is.na(olympics_complete$Medal)] <- "None"

Merging athletes and regions to form the initial olympics dataframe involved performing a left_join() which is prone to introducing NAs since each row in the left dataframe (athletes) is kept and a lack of corresponding rows in the right table (regions) results in an NA. The region and notes features contain numerous NAs, some of which are a result of the join and the remaining existed in the original table. A quick investigation of the rows whose region is NA shows the following:

head(olympics_complete[is.na(olympics_complete$region), "Team"], 5)

The data description states that the region variable is the country or region assosciated with the corresponding NOC code. In the above sample, all of the Teams contain the same value; Singapore.

unique(olympics_complete[is.na(olympics_complete$region), "Team"])
which("Singapore" %in% olympics_complete$region)              
integer(0)

Out of the 14 NA instances in regions, there is only 1 unique Team value, and it appears that Singapore never shows up in the region column despite showing up in Team. By variable definitions, Team and region aren’t always the same, however due to the small number of NAs, the Team value will be copied to region for these problematic rows via an ifelse() statement. As for notes, a similar approach to the solution for Medal will be applied, minus the factor ordering which will replace the NA values with "None".

# notes
notes_levels <- levels(olympics_complete$notes)
notes_levels[length(notes_levels) + 1] <- "None"

olympics_complete$notes <- factor(olympics_complete$notes, levels = medal_levels)
olympics_complete$notes[is.na(olympics_complete$notes)] <- "None"

# region
olympics_complete$region <- ifelse(is.na(olympics_complete$region), olympics_complete$Team, olympics_complete$region) 

The remaining variables that contain NA instances are Age, Height, Weight, and BMI. Being numeric variables, incorrectly changing or imputing these NAs can potentially distort the data and result in invalid conclusions. Starting with Age, summary statistics will be explored grouped by Sport to see if there is a difference in Age between different sports. This exploration will be conducted by using various functions from the dplyr package.

olympics_complete %>% 
  group_by(Sport) %>%  
  na.omit() %>% 
  summarize(mean=mean(Age), median = median(Age), min = min(Age), max = max(Age), SD = sd(Age)) %>% 
  sample_n(5)

The mean, median, and variance in Age appears to differ between sports which makes logical sense– an Olympic Gymnast or Swimmer would likely be younger than a Weightlifter or Shooter. The median Age of each Sport will be imputed to the missing Age variables.

olympics_complete <- olympics_complete %>%
 group_by(Sport) %>%
 mutate(
   Age = ifelse(
   is.na(Age),
   median(Age, na.rm = TRUE),
   Age)
 )

A similar approach will be applied to Height and Weight, this time using the median of the combination of Age and Sport. The logic here is that an 11-year-old Gymnast would likely weigh less and be shorter than a 40-year-old Gymnast, and be significantly smaller than a 40-year-old Weightlifter.

olympics_complete <- olympics_complete %>%
 na.omit() %>% 
 group_by(Sport, Age) %>%
 mutate(
   Height = ifelse(
   is.na(Height),
   median(Height, na.rm = TRUE),
   Height)
 )

olympics_complete <- olympics_complete %>%
 na.omit() %>% 
 group_by(Sport, Age) %>%
 mutate(
   Weight = ifelse(
   is.na(Weight),
   median(Weight, na.rm = TRUE),
   Weight)
 )

BMI is the final variable that contains NAs, which is a result of performing calculations on variables with missing values, Height and Weight. Since these NAs have been imputed, the now complete columns can be used to once again calculate BMI.

olympics_complete <- olympics_complete %>%
 mutate(
   BMI = ifelse(
   is.na(BMI),
   round((Weight / Height / Height) * 10000, 2),
   BMI)
 )

colSums(is.na(olympics_complete))
    ID   Name    Sex    Age Height Weight   Team    NOC  Games   Year Season   City  Sport  Event  Medal region  notes    BMI 
     0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0 

The result of the multiple imputations and renaming procedures is a new data.frame olympics_complete which contains some sort of value for every observation in the whole dataset.

Special Values

NA values aren’t the only problematic observations in an exploratory data analysis. To explore any potential Special Values, (Inf, -Inf, NaN), a custom function is_special is defined, which utilises base R’s functions is.infinite(), is.nan(), and is.na(). Applying this custom function to the olympics_complete data.frame will determine the number of these special values present.

is_special <- function(x){
 if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))
}
sum(is_special(olympics_complete))
[1] 0

The is_special() function has returned 0, indicating that there are no special values in the olympics_complete data.frame that require further investigation.

Inconsistencies and Errors

Another type of error to check for is inconsistencies and errors, or data values that are nonsensical in context. For example, negative values for Age, Height, or Weight, or exceedingly large values for those same variables (such as Age being over 200). Using the editset function from the editrules package, a set of rules is defined to identify any of these obvious errors. The rule set checks that:

  • Age is between 0 and 100,
  • Height is between 0 and 250,
  • Weight is between 0 and 300,
  • Season is one of either Summer or Winter,
  • and Medal is one of either Gold, Silver, Bronze or None

Some of the numerical boundaries are extreme, (i.e. an Olympic athlete being 0 or 100 years old is very unlikely), however the purpose of these functions is to locate inconsistencies and errors. Outliers will be explored later in this study. Some variables have too many factor levels to check for (Team and Event for example) so this ruleset is limited to numerical variables, and categorical variables that contain less than 5 unique levels.

(Rule1 <- editset(c("Age >= 0", "Age <= 100", "Height >= 0", 
                    "Height <= 250", "Weight >= 0", "Weight <= 300",
                    "Season %in% c('Summer', 'Winter')",
                    "Medal %in% c('Gold', 'Silver', 'Bronze', 'None')")))

Data model:
dat1 : Medal %in% c('Bronze', 'Gold', 'None', 'Silver')
dat2 : Season %in% c('Summer', 'Winter') 

Edit set:
num1 : 0 <= Age
num2 : Age <= 100
num3 : 0 <= Height
num4 : Height <= 250
num5 : 0 <= Weight
num6 : Weight <= 300 
sum(violatedEdits(Rule1, olympics_complete))
[1] 0

Using a combination of base R’s sum() and editrulesviolatedEdits shows that there are no instances in the olympics_complete dataset that violate the defined rules.

Scan II

Outliers

Missing or inconsistent data is one common issue in exploratory data analysis; another is outliers which can be present as a result of some form of error. A quick method of detecting the presence of outliers is to display boxplots for each numeric variable.

Boxplots show that there are outliers present in all of the numeric features (as represented by the dots) according to Tukey’s method of outlier detection (\(\pm1.5 × IQR\)). Shapes of the box plots give a general idea of each variable’s univariate distribution. Height is the only variable that appears to resemble normality, which suggests that the z-score approach can be used to detect outliers for these features. However since the combination of features Height, Weight, and BMI are not likely to be independent, multivariate outlier detection techniques will be used. Age will be assessed for univariate outliers.

Outliers in Age were highlighted in the initial boxplots. By saving a boxplot() of the Age variable to an object and accessing the $out attribute of the stored boxplot, the percentage of outliers according to Tukey’s method can be counted.

boxplot_age <- boxplot(olympics_complete$Age)

round(length(boxplot_age$out) / nrow(olympics_complete) * 100, 2)
[1] 2.55

Since only 2.55% of instances in Age are outliers, they can likely be removed without distorting the underlying data. As highlighted in the histogram, Age is not normally distributed so z-scores shouldn’t be used. Instead, capping (or Windsoring) will be implemented by invoking a user-defined function cap that replaces outliers with their nearest neighbour.

cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}

olympics_capped <- olympics_complete
olympics_capped$Age <- cap(olympics_capped$Age)

The remaining numeric variables Height and Weight will be investigated using multivariate outlier detection methods, given that it is likely that a person’s height will impact their weight.

plot(olympics_capped$Height~olympics_capped$Weight, main = "Height vs Weight", xlab = "Weight", ylab = "Height")

Logically, a taller person would likely weigh more than a shorter person, which is supported by the scatterplot of Height vs Weight. There is a strong, positive correlation between the two variables with some likely outliers spread toward the outside of the plot space. Using the mvn() function from the MVN package can detect outliers using Mahalanobis distances.

mvn_results <- mvn(data = olympics_capped[, c("Height", "Weight")], multivariateOutlierMethod = "quan", showOutliers = TRUE)

Using the quan method in mvn() uses the chi-square distribution critical value approach to detect multivariate outliers. Values to the right of the red Quantile line in the plot are determined to be outliers from this method. As displayed on the visualisation, there are n=200 multivariate Height and Weight outliers.

mvn_results$multivariateOutliers
round(nrow(mvn_results$multivariateOutliers) / nrow(olympics_capped) * 100, 2)
[1] 5.21

5.21% of values are multivariate outliers amongst the Height and Weight variables, so it is likely safe to remove them without distorting the underlying data. This is accomplished by directly subsetting the problematic rows from the olympics_capped dataframe.

olympics_clean <- olympics_capped[ -c(1:nrow(mvn_results$multivariateOutliers)), ]
dim(olympics_clean)
[1] 3637   18

The remaining numeric variable with potential outliers is BMI, however this is a special case since it is calculated from two existing variables Height and Weight. Since the outliers for these variables have been addressed, the resulting BMI values aren’t likely to be caused by any form of user input error. Since BMI is a product of variables that have been screened for outliers, they will remain as-is.

Transform

The distributions of numeric features was briefly covered in Scan II - Outliers. Historgrams will provide more accurate insight.

As initially suspected, Height appears normally distributed and will require no transformations. Post-outlier removal, Age also appears to approximately resemble normality, leaving Weight and BMI that are both left skewed, with BMI containing a very narrow peak. Mathematical operations for Weight and BMI will be explored in attempt to make them more closely resemble the normal distribution.

sqrt_weight <- sqrt(olympics_clean$Weight)
hist(sqrt_weight, main = "Histogram of sqrt(Weight)")

The square root operation sqrt() has transformed the Weight feature to more closely resemble normality.

bmi_recip <- 1/(olympics_clean$BMI)

hist(bmi_recip)

Applying the reciprocal function to BMI shows that the shape is closer to normal.

After converting the previously explored variables using sqrt() for Weight and the reciprocal of BMI, each of the numeric features now approximately resembles normality which is an assumption and sometimes a requirement for many statistical tests.

After investigating the Exploring 120 Years of Olympic History dataset, converting the necessary data types, handling outliers and missing values, and transforming numeric features to normal, the data is now in a clean & complete form that is suitable for statistical analysis and potentially machine learning applications.



---
title: "MATH2349 Semester 1, 2019 - Assignment 3"
author: "Caitlin Lonski, s3679850"
subtitle: Exploring 120 Years of Olympic History
output:
  html_notebook: default
---

## Required packages 

```{r include=FALSE}
options(warn=-1)
set.seed(123)
```

```{r}
library(readr)
library(dplyr)
library(editrules)
library(MVN)
```


## Executive Summary 

The two tables in the _Exploring 120 Years of Olympic History_ dataset are loaded in to R, stored as respective dataframe objects, and briefly explored by displaying the `head()` of each dataframe. The individual tables are combined with a join on a shared value, creating a single dataframe that contains all features in the Olympic History dataset. A sample of the joined dataframe is taken and data types of each variable are explored, using the `class()` function to explore typed and `str()` function to display the size of the dataset and a sample of each variable. These exploration processes highlight incorrectly assigned data types that need to be converted. After these transformations are applied, the structure of the data is explored using the _Tidy Data Principles_ to determine if it is tidy. A new variable is created by multiplying existing ones with the `mutate()` function. Missing values, inconsistencies, & errors are explored with the use of the `is.na()` function (for the whole dataset with `sum()` and individual columns with `colSums()`). A user-defined function locates special values, and a rule set to identify obvious errors is applied to the data. Once located, the problematic values are handled for both numeric and categorical features; either by string replacements or imputations with statistics from similar observations. Any outliers in numerical features are explored and dealt with using capping for univariate outliers and `mvn()`s Mahalanobis distances for multivariate outliers. Methods of dealing with outliers are determined by calculating the percentage of outliers. After exploring their distributions, mathematical operations are used to transform numeric variables to make them more closely resemble normality. The resulting dataset is tidy, complete, and suitable to use for statistical analysis or machine learning applications.


## Data 

The [120 Years of Olympic History: Athletes and Results](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results) has been uploaded to [Kaggle.com](https://www.kaggle.com/) and falls under the Creative Commons License. The dataset consists of two tables; `athlete_events.csv` and `noc_regions.csv`. The athlete events dataset contains information about Olympic Athletes and events that they have competed in, including biological data (Age, Sex, Height, Weight etc.) and event data (Year, Season, City, Sport etc.). The NOC (National Organizing Committee) regions dataset possesses information about the countries that compete in the Olympics, including the country name and any notes about said country. 

The Athletes dataset contains 15 variables:

* __ID__: A number used as a unique identifier for each athlete
* __Name__: The athlete's name(s) in the form of `First Middle Last` where available
* __Sex__: The athlete's gender; one of `M` or `F`
* __Age__: The athlete's age in years
* __Height__: The athlete's height in centimeters (cm)
* __Weight__: The athlete's weight in kilograms (kg)
* __Team__: The name of the team that the athlete competed for
* __NOC__: The National Organizing Committee's 3-letter code
* __Games__: The year and season of the Olympics the athlete competed in in the format `YYYY Season`
* __Year__: The year of the Olympics that the athlete competed in
* __Season__: The season of the Olympics that the athlete competed in
* __City__: The city that hosted the Olympics that the athlete competed in
* __Sport__: The sport that the athlete competed in
* __Event__: The event that the athlete competed in
* __Medal__: The medal won by the athlete; one of `Gold`, `Silver`, or `Bronze`. `NA` if no medal was won.

The NOC Regions dataset contains 3 variables: 

* __NOC__: The National Organizing Committee’s 3-letter code
* __region__: The name of the country/region associated with the NOC code
* __notes__: Any extra/miscellaneous information about the NOC region 

### Load Data

Each dataset is individually loaded in to R using the `read_csv()` function from the `readr` library. `read_csv()` is used instead of base R's `read.csv()` since it it faster at processing larger datasets. The `athlete_events.csv` dataset is assigned to an object named `athletes`. After which, the `head()` function displays the first 5 rows of the newly created `data.frame` to ensure that everything has been imported correctly.
```{r}
athletes <- read_csv("Data/Olympics/athlete_events.csv", col_types = cols())
head(athletes, 5)
```

The same process used to load `athlete_events.csv` is applied to `noc_regions.csv`, with this dataset being assigned to an object called `regions`.

```{r}
regions <- read_csv("Data/Olympics/noc_regions.csv", col_types = cols())
head(regions, 5)
```

### Merge

The two datasets share a common variable: NOC. A `left_join` is used to join `regions` to `athletes` by `NOC` value. This preserves all of the athlete's information within `athletes` and appends the corresponding region information from `regions`, applying an `NA` value where no matching information is available. These merged datasets are assigned to a new object called `olympics`. Once again calling the `head()` function shows the first 5 rows of the newly created `data.frame` to confirm that the two datasets have been merged correctly.

```{r}
olympics <- left_join(athletes, regions, by = "NOC")
head(olympics, 5)
```

After the join is performed, due to the size of the dataset a random sample of 500 rows will be partitioned for analysis to limit processing time using the `sample_n()` function.

```{r}
olympics_sample <- sample_n(olympics, 5000)
```

## Understand 

The function `class()` can be called to determine the data type stored within the variables in the `olympics` dataset. To prevent the need to call `class()` once for each individual variable, the `sapply` function administers it to each column automatically.

```{r}
sapply(olympics_sample, class)
```
There are multiple data types within the `olympics` dataset, however the `read_csv()` function has assigned them all as either `character` or `numeric`. While this is correct in some cases (`Name`, `Height`, `Weight`, etc.) some have been mis-labelled. Specifically, `Medals` should be an ordered factor, `Sex` should be a factor and requires additional re-labelling of levels, and multiple other features require factor conversions. <br><br>

Base R's `str()` function displays the structure of the `olympics` dataset, including the attribute name, data type, and sample of values for each column. `str()` also returns the number of variables and observations in the dataset.
```{r}
str(olympics_sample)
```
The output of `str()` confirms that some variables have been mis-assigned data types. Additionally, the shape of the dataset is shown to be 5,000 observations of 17 variables. This further confirms that the two datasets have been correctly joined since `athletes` contained 15 variables, `regions` was made up of 3, and they each shared a common attribute to make up 17 total variables. Displaying a sample of the values within each variable provides some insight to the contents of the dataset. Most are to be expected, (`Age` is a 2-digit integer, `Team` is a Country name as a character string, etc.), however `str()` helps to explain the difference between `Sport` and `Event`: An `Event` looks to be the subset of a `Sport`, i.e. the `Sport` `Basketball` encompasses basketball-related events including `Basketball Men's Basketball`. 

### Conversions
Using the output of `str()` as a guide, conversions can be applied to the variables with mis-assigned data types. Firstly, the variables that should be un-ordered factors are each added to a character vector called `factor_names`. Using the `lapply()` function, each variable in the `factor_names` list is converted to a factor, without any changes to level names or ordering. Secondly, the `Sex` variable is of type `Chr` when it should be a factor. Using the `factor()` function, the levels `M` and `F` are re-named to `Male` and `Female` respectively and `Sex` is converted to a factor. 

```{r}
unique(olympics_sample$Year)
```

While `Year` contains 4-digit integers, these values skip every 2 to 4 years due to the amount of time between Olympic games. For this reason, `Year` will be converted to a factor.

```{r}
unique(olympics_sample$Medal)
```

Finally, the `Medal` variable can be ordered by the podium positions: `Gold` > `Silver` > `Bronze` > `NA`. This is again performed with the `factor()` function, this time by also specifying the order of the levels and setting the `ordered` argument to `T` (or `TRUE`).

```{r}
# Character to factor (no orderering or labelling)
factor_names <- c("Team", "NOC", "Games", "Season", "City", "Sport", "Event", "region")
olympics_sample[, factor_names] <- lapply(olympics_sample[, factor_names], factor)

# Character to factor (labelled)
olympics_sample$Sex <- factor(olympics_sample$Sex, levels = c("M", "F"), labels = c("Male", "Female"), ordered = F)

# Character to factor (ordered)
olympics_sample$Medal <- factor(olympics_sample$Medal, levels = c(NA, "Bronze", "Silver", "Gold"), ordered = T)

# number to factor
olympics_sample$Year <- factor(olympics_sample$Year)

str(olympics_sample)
```
Showing the structure of the newly-converted dataset shows that the factors originally mistakenly assigned as `chr` have been correctly changed to `Factor`, and additionally, the levels within `Sex` have been re-labelled as `Male` and `Female` from `M` and `F` respectively, and finally `Medal` is now an ordered factor corresponding to the podium placements.

##	Tidy & Manipulate Data I 

Hadley Wickham and Grolemund's Tidy Data principles state that in tidy data:

1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

Investigating each of these principles in the context of the `olympics` dataset will determine whether or not any conversions need to be applied to make the data tidy. <br><br>

Firstly, each variable in the `olympics` dataset has its own column. The columns are always associated with a single variable, and any variable can simply be accessed with the standard R codes. For example: `olympics$Name` will return Athlete's names and nothing else. `olympics$Age` will only return their ages, etc. Secondly, each observation in the `olympics` dataset has its own row; as discussed in the Data description, each row in the data frame corresponds to an athlete and the olympics that they competed in. Finally, each value in the `olympics` dataset has its own cell. At first glance there appears to be a breach of this principle in the variable `Games` as it contains a `Year` and `Season` (both of which are separate variables), however the description of the dataset on Kaggle.com states that *"Winter and Summer Games were held in the same year up until 1992. After that, they staggered them such that Winter Games occur on a four year cycle starting with 1994, then Summer in 1996, then Winter in 1998, and so on."*, so for this purpose the `Games` variable is important as its own attribute. <br><br>

From examining each of the tidy data principles, it can be concluded that the `olympics` dataset is tidy and reshaping is not required.

##	Tidy & Manipulate Data II 

Since the `olympics` dataset contains multiple biological attributes, it is possible to calculate each athlete's Body Mass Index (or BMI). Calculating BMI as a body-structure indicator can provide helpful insight to the different sizes of athletes between `Sports`, `Age`, `Gender`, etc. <br><br> 

BMI is calculated as ($BMI = \frac{kg}{m^2}$) or BMI = Weight (in kg) / Height (in m)^2. This calculation is performed within the `mutate` function of the `dplyr` package. A new variable `BMI` is created by dividing the existing variables `Weight` by `Height` by `Height` (to incorporate height^2) and multiplying the result by 10000 to factor in the incorrect measurement type for `Height` (`Height` is recorded in cm in the dataset, BMI calculations require meters). A sample of the resulting dataset shows the calculated `BMI`s for certain athletes.

```{r}
# BMI = kg/m2 
olympics_mutate <- olympics_sample %>% mutate(
  BMI = round((Weight / Height / Height) * 10000, 2) 
)

olympics_mutate[c(1:5), ] %>% select(Name, Sex, Age, Height, Weight, BMI)
```


##	Scan I 

To explore (and potentially deal with) inconsistencies and missing values, first their presence or absence needs to be determined. Starting with `NA` values; base R's `sum()` and `is.na()` functions are applied to the `olympics_mutate` dataframe to determine a total count of all `NA` instances.

### Missing Values

```{r}
# Total missing values
sum(is.na(olympics_mutate))
```

There are a total of 12,726 missing values in the `olympics_mutate` dataframe, which is too many to blindly remove or ignore as it would potentially distort the underlying message of the data. Calling the `colSums()` function on the dataframe will display which columns contain these `NA` instances.

```{r}
# Missing values by column
colSums(is.na(olympics_mutate))
```

The `NA` values are spread amongst the variables `Age`, `Height`, `Weight`, `Medal`, `region`, `notes`, and the mutated feature `BMI`. The numeric variables can be imputed but first the categorical variables will be addressed. <br><br>

First, the nature of the `Medal` variable suggests that a missing value indicates that the athlete did not win a medal for that particular `Event`, `Year`, and `Games` combination. These `NA`s will be replaced by a new factor level called `None`. To accomplish this, the existing `levels` of `Medals` are assigned to a character vector `medal_levels`, the `length` of the vector is increased by 1 and the empty value is replaced by the string `"None"`. Similar to the operations performed in __5. Understand__, the `Medal` variable is converted to an ordered factor using the new `medal_levels`. Using the `is.na()` function, the `NA`s are replaced with the new factor level `None`. This process preserves the order of the `Medal` variable so that Gold > Silver > Bronze > None.

```{r}
olympics_complete <- olympics_mutate

medal_levels <- levels(olympics_complete$Medal)
medal_levels[length(medal_levels) + 1] <- "None"

olympics_complete$Medal <- factor(olympics_complete$Medal, levels = medal_levels, order = T)
olympics_complete$Medal[is.na(olympics_complete$Medal)] <- "None"
```

Merging `athletes` and `regions` to form the initial `olympics` dataframe involved performing a `left_join()` which is prone to introducing `NA`s since each row in the left dataframe (`athletes`) is kept and a lack of corresponding rows in the right table (`regions`) results in an `NA`. The `region` and `notes` features contain numerous `NA`s, some of which are a result of the join and the remaining existed in the original table. A quick investigation of the rows whose `region` is `NA` shows the following:

```{r}
head(olympics_complete[is.na(olympics_complete$region), "Team"], 5)
```

The data description states that the `region` variable is the country or region assosciated with the corresponding `NOC` code. In the above sample, all of the `Team`s contain the same value; `Singapore`. 

```{r}
unique(olympics_complete[is.na(olympics_complete$region), "Team"])
```


```{r}
which("Singapore" %in% olympics_complete$region)              
```
Out of the 14 `NA` instances in `regions`, there is only 1 unique `Team` value, and it appears that `Singapore` never shows up in the `region` column despite showing up in `Team`. By variable definitions, `Team` and `region` aren't always the same, however due to the small number of `NA`s, the `Team` value will be copied to `region` for these problematic rows via an `ifelse()` statement. As for `notes`, a similar approach to the solution for `Medal` will be applied, minus the factor ordering which will replace the `NA` values with `"None"`.


```{r}
# notes
notes_levels <- levels(olympics_complete$notes)
notes_levels[length(notes_levels) + 1] <- "None"

olympics_complete$notes <- factor(olympics_complete$notes, levels = medal_levels)
olympics_complete$notes[is.na(olympics_complete$notes)] <- "None"

# region
olympics_complete$region <- ifelse(is.na(olympics_complete$region), olympics_complete$Team, olympics_complete$region) 
```

The remaining variables that contain `NA` instances are `Age`, `Height`, `Weight`, and `BMI`. Being numeric variables, incorrectly changing or imputing these `NA`s can potentially distort the data and result in invalid conclusions. Starting with `Age`, summary statistics will be explored grouped by `Sport` to see if there is a difference in `Age` between different sports. This exploration will be conducted by using various functions from the `dplyr` package.

```{r}
olympics_complete %>% 
  group_by(Sport) %>%  
  na.omit() %>% 
  summarize(mean=mean(Age), median = median(Age), min = min(Age), max = max(Age), SD = sd(Age)) %>% 
  sample_n(5)
```
The `mean`, `median`, and variance in `Age` appears to differ between sports which makes logical sense-- an Olympic Gymnast or Swimmer would likely be younger than a Weightlifter or Shooter. The `median` `Age` of each `Sport` will be imputed to the missing `Age` variables.

```{r}
olympics_complete <- olympics_complete %>%
 group_by(Sport) %>%
 mutate(
   Age = ifelse(
   is.na(Age),
   median(Age, na.rm = TRUE),
   Age)
 )
```

A similar approach will be applied to `Height` and `Weight`, this time using the median of the combination of `Age` and `Sport`. The logic here is that an 11-year-old Gymnast would likely weigh less and be shorter than a 40-year-old Gymnast, and be significantly smaller than a 40-year-old Weightlifter.

```{r}
olympics_complete <- olympics_complete %>%
 na.omit() %>% 
 group_by(Sport, Age) %>%
 mutate(
   Height = ifelse(
   is.na(Height),
   median(Height, na.rm = TRUE),
   Height)
 )

olympics_complete <- olympics_complete %>%
 na.omit() %>% 
 group_by(Sport, Age) %>%
 mutate(
   Weight = ifelse(
   is.na(Weight),
   median(Weight, na.rm = TRUE),
   Weight)
 )

```

`BMI` is the final variable that contains `NA`s, which is a result of performing calculations on variables with missing values, `Height` and `Weight`. Since these `NA`s have been imputed, the now complete columns can be used to once again calculate `BMI`.

```{r}
olympics_complete <- olympics_complete %>%
 mutate(
   BMI = ifelse(
   is.na(BMI),
   round((Weight / Height / Height) * 10000, 2),
   BMI)
 )

colSums(is.na(olympics_complete))
```

The result of the multiple imputations and renaming procedures is a new `data.frame` `olympics_complete` which contains some sort of value for every observation in the whole dataset.

### Special Values

`NA` values aren't the only problematic observations in an exploratory data analysis. To explore any potential Special Values, (`Inf`, `-Inf`, `NaN`), a custom function `is_special` is defined, which utilises base R's functions `is.infinite()`, `is.nan()`, and `is.na()`. Applying this custom function to the `olympics_complete` `data.frame` will determine the number of these special values present.

```{r}
is_special <- function(x){
 if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))
}
sum(is_special(olympics_complete))
```

The `is_special()` function has returned 0, indicating that there are no special values in the `olympics_complete` `data.frame` that require further investigation. 

### Inconsistencies and Errors

Another type of error to check for is inconsistencies and errors, or data values that are nonsensical  in context. For example, negative values for `Age`, `Height`, or `Weight`, or exceedingly large values for those same variables (such as `Age` being over 200). Using the `editset` function from the `editrules` package, a set of rules is defined to identify any of these obvious errors. The rule set checks that: 

* `Age` is between `0` and `100`, 
* `Height` is between `0` and `250`, 
* `Weight` is between `0` and `300`, 
* `Season` is one of either `Summer` or `Winter`,
* and `Medal` is one of either `Gold`, `Silver`, `Bronze` or `None`

Some of the numerical boundaries are extreme, (i.e. an Olympic athlete being 0 or 100 years old is very unlikely), however the purpose of these functions is to locate inconsistencies and errors. Outliers will be explored later in this study. Some variables have too many factor levels to check for (`Team` and `Event` for example) so this ruleset is limited to numerical variables, and categorical variables that contain less than 5 unique levels.
```{r}
(Rule1 <- editset(c("Age >= 0", "Age <= 100", "Height >= 0", 
                    "Height <= 250", "Weight >= 0", "Weight <= 300",
                    "Season %in% c('Summer', 'Winter')",
                    "Medal %in% c('Gold', 'Silver', 'Bronze', 'None')")))
sum(violatedEdits(Rule1, olympics_complete))
```

Using a combination of base R's `sum()` and `editrules`' `violatedEdits` shows that there are no instances in the `olympics_complete` dataset that violate the defined rules.

##	Scan II

### Outliers

Missing or inconsistent data is one common issue in exploratory data analysis; another is outliers which can be present as a result of some form of error. A quick method of detecting the presence of outliers is to display boxplots for each numeric variable.

```{r,echo=FALSE, out.width='.49\\linewidth', fig.width=12, fig.height=3,fig.show='hold',fig.align='center'}
numeric_variables <- c("Age", "Height", "Weight", "BMI")
par(mfrow = c(1,4))
for (i in numeric_variables) {
  boxplot(olympics_complete[, i],
          main = paste("Boxplot of", i))
}
```
Boxplots show that there are outliers present in all of the numeric features (as represented by the dots) according to Tukey’s method of outlier detection ($\pm1.5 × IQR$). Shapes of the box plots give a general idea of each variable's univariate distribution. `Height` is the only variable that appears to resemble normality, which suggests that the z-score approach can be used to detect outliers for these features. However since the combination of features `Height`, `Weight`, and `BMI` are not likely to be independent, multivariate outlier detection techniques will be used. `Age` will be assessed for univariate outliers. <br><br>

Outliers in Age were highlighted in the initial boxplots. By saving a `boxplot()` of the `Age` variable to an object and accessing the `$out` attribute of the stored boxplot, the percentage of outliers according to Tukey's method can be counted.

```{r fig.show='hide'}
boxplot_age <- boxplot(olympics_complete$Age)
```
```{r}
round(length(boxplot_age$out) / nrow(olympics_complete) * 100, 2)
```

Since only 2.55% of instances in `Age` are outliers, they can likely be removed without distorting the underlying data. As highlighted in the histogram, `Age` is not normally distributed so z-scores shouldn't be used. Instead, capping (or Windsoring) will be implemented by invoking a user-defined function `cap` that replaces outliers with their nearest neighbour.

```{r}
cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}

olympics_capped <- olympics_complete
olympics_capped$Age <- cap(olympics_capped$Age)
```

The remaining numeric variables `Height` and `Weight` will be investigated using multivariate outlier detection methods, given that it is likely that a person's height will impact their weight.

```{r message=FALSE, warning=FALSE}
plot(olympics_capped$Height~olympics_capped$Weight, main = "Height vs Weight", xlab = "Weight", ylab = "Height")
```
Logically, a taller person would likely weigh more than a shorter person, which is supported by the scatterplot of `Height` vs `Weight`. There is a strong, positive correlation between the two variables with some likely outliers spread toward the outside of the plot space. Using the `mvn()` function from the `MVN` package can detect outliers using Mahalanobis distances. 

```{r message=FALSE, warning=FALSE}
mvn_results <- mvn(data = olympics_capped[, c("Height", "Weight")], multivariateOutlierMethod = "quan", showOutliers = TRUE)
```

Using the `quan` method in `mvn()` uses the chi-square distribution critical value approach to detect multivariate outliers. Values to the right of the red _Quantile_ line in the plot are determined to be outliers from this method. As displayed on the visualisation, there are `n=200` multivariate `Height` and `Weight` outliers.

```{r}
mvn_results$multivariateOutliers
```
```{r}
round(nrow(mvn_results$multivariateOutliers) / nrow(olympics_capped) * 100, 2)
```
5.21% of values are multivariate outliers amongst the `Height` and `Weight` variables, so it is likely safe to remove them without distorting the underlying data. This is accomplished by directly subsetting the problematic rows from the `olympics_capped` dataframe.

```{r}
olympics_clean <- olympics_capped[ -c(1:nrow(mvn_results$multivariateOutliers)), ]
dim(olympics_clean)
```

The remaining numeric variable with potential outliers is `BMI`, however this is a special case since it is calculated from two existing variables `Height` and `Weight`. Since the outliers for these variables have been addressed, the resulting `BMI` values aren't likely to be caused by any form of user input error. Since `BMI` is a product of variables that have been screened for outliers, they will remain as-is.

##	Transform 

The distributions of numeric features was briefly covered in __Scan II - Outliers__. Historgrams will provide more accurate insight.

```{r,echo=FALSE, out.width='.49\\linewidth', fig.width=12, fig.height=3,fig.show='hold',fig.align='center'}
par(mfrow = c(1,5))
for (i in numeric_variables) {
   x <- as.numeric(unlist(olympics_clean[, i]))
   hist(x, main = paste("Histogram of", i), xlab = paste(i), ylab = "Count")
}
```

As initially suspected, `Height` appears normally distributed and will require no transformations. Post-outlier removal, `Age` also appears to approximately resemble normality, leaving `Weight` and `BMI` that are both left skewed, with `BMI` containing a very narrow peak. Mathematical operations for `Weight` and `BMI` will be explored in attempt to make them more closely resemble the normal distribution.

```{r}
sqrt_weight <- sqrt(olympics_clean$Weight)
hist(sqrt_weight, main = "Histogram of sqrt(Weight)")
```
The square root operation `sqrt()` has transformed the `Weight` feature to more closely resemble normality. 

```{r}
bmi_recip <- 1/(olympics_clean$BMI)

hist(bmi_recip)
```
Applying the reciprocal function to `BMI` shows that the shape is closer to normal.

```{r,echo=FALSE, out.width='.49\\linewidth', fig.width=12, fig.height=3,fig.show='hold',fig.align='center'}
olympics_final <- olympics_clean
olympics_final$Weight <- sqrt(olympics_final$Weight)
olympics_final$BMI <- 1/(olympics_final$BMI)

par(mfrow = c(1,5))
for (i in numeric_variables) {
   x <- as.numeric(unlist(olympics_final[, i]))
   hist(x, main = paste("Histogram of", i), xlab = paste(i), ylab = "Count")
}
```
After converting the previously explored variables using `sqrt()` for `Weight` and the reciprocal of `BMI`, each of the numeric features now approximately resembles normality which is an assumption and sometimes a requirement for many statistical tests.

After investigating the Exploring 120 Years of Olympic History dataset, converting the necessary data types, handling outliers and missing values, and transforming numeric features to normal, the data is now in a clean & complete form that is suitable for statistical analysis and potentially machine learning applications. 

<br>
<br>
