Required packages
Following are the packages below that I have used for this assignment.
# This is the R chunk for the required packages
library(readr)
library(dplyr)
library(outliers)
library(ggplot2)
library(tidyr)
library(validate)
library(lubridate)
library(knitr)
library(Hmisc)
Executive Summary
The main aim of the assignment is to preprocess a open source untidy dataset. I have downloaded the dataset from Kaggle. The data file contains two datasets named athlete_events.csv and noc_regions.csv. Then the datasets are imported using read_csv() function from the readr library. After importing the datasets I have subsetted both the datasets using select() function as some of the variables are not required for the further process. Later I have used left join to join both the datasets using common variable NOC. Then using the str() function I get the data types of the variables. Data type of some variables like Sex, Season, Medal should be factor but it is shown as character thats why I converted it using as.factor() function. Later I used head() function to check whether the dataset is tidy or not. After concluding that the dataset is tidy I created a new variable named Body Mass Index using Height, Weight variables and mutate() function. In the scan section I have scanned the dataset for missing values using colSums() and is.na() function. I have imputed the Null values in numeric variables using the mean() function. Null values in the categorical variables are imputed by creating a new category. Then I have scanned for outliers by plotting the numeric variables using boxplot(). I have calculated the percentage of outliers using score() and which() function. Outliers were capped using cap() function. The variables were plotted using boxplot() to check if there are outliers. After removing outliers using hist() the distribution of the variables was analysed. Age and Weight were transformed using log10 transformation as their graph was right skewed. Height was normally distributed thats why it is not transformed.
Data
The following dataset is historical dataset of olyampic games from 1896 to 2016. The dataset is downloade from kaggle. Link of the data source is given below-
https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results
The dataset contains two tables named athlete_events.csv and noc_regions.csv. First table contains information about the athletes and the olyampic events that they have participated and second table contains information about the countries that participated in the olyampic games. The athlete_events.csv contains following variables- 1) ID- A unique identifier for each athlete 2) Name- Name of the athlete 3) Sex- Gender of athlete 4) Age Age of athlete 5) Height- Height of athlete in cm 6) Weight- Weight of an athlete in kgs 7) Team- Name of the Team 8) NOC- National olyampic committee 3-letter code 9) Games- Year and Season of the olyampic game 10) Year- Year of the olyampic 11) Season- Season of the olyampic 12) City- Name of the city that hosted the olyampics 13) Sport- Sport that the athlete participated 14) Event- Event of the sport in which the athlete participated 15) Medal- Medal won by the athlete.
The noc_regions.csv contains following variables- 1) NOC- National olyampic committee 3-letter code 2) region- Country name of the respective NOC 3) notes- Some information about the region.
Importing and Subsetting the Data
Here I have used the read_csv() function from the readr library to import both the datasets. athlete_events.csv dataset is assigned to object player_data and noc_regions.csv datadet is assigned to the object country_data. Games variable consist of year and season information combined and seperating the variable would give us the variables like Year and Season which already exist thats why I have excluded the Games variable using the select() function and store all the other variable in object Player_data. Similarly the note variable in the countries data is not useful for further analysis therefore I have selected only two variables NOC and region and stored it in an object country_data Now after subsetting both the datasets I have merged both the datasets using left_join based on the common key variable NOC. The merged datset is stored in the object new. After that using the head() function I have displayed the new dataset and checked if every column is included after joining both the datasets
#Importing the first dataset using read_csv
player_data <- read_csv("/Users/christangelfargose/120-years-of-olympic-history-athletes-and-results/athlete_events.csv")
Parsed with column specification:
cols(
ID = [32mcol_double()[39m,
Name = [31mcol_character()[39m,
Sex = [31mcol_character()[39m,
Age = [32mcol_double()[39m,
Height = [32mcol_double()[39m,
Weight = [32mcol_double()[39m,
Team = [31mcol_character()[39m,
NOC = [31mcol_character()[39m,
Games = [31mcol_character()[39m,
Year = [32mcol_double()[39m,
Season = [31mcol_character()[39m,
City = [31mcol_character()[39m,
Sport = [31mcol_character()[39m,
Event = [31mcol_character()[39m,
Medal = [31mcol_character()[39m
)
#Subsetting the dataset by removing the 9th column
player_data <- select(player_data,-9)
head(player_data)
#Importing country data using read_csv
country_data <- read_csv("/Users/christangelfargose/120-years-of-olympic-history-athletes-and-results/noc_regions.csv")
Parsed with column specification:
cols(
NOC = [31mcol_character()[39m,
region = [31mcol_character()[39m,
notes = [31mcol_character()[39m
)
#Subsetting the country_data
country_data <- country_data[,c(1,2)]
head(country_data)
#Joining both the datasets using left join
new <- left_join(player_data,country_data,by="NOC")
head(new)
NA
NA
Understand
The new dataset consist of total 15 variables and 271116 rows. The data type of variables can be found out using str() function
#Displaying the structure of dataset and checking data types
str(new)
tibble [271,116 × 15] (S3: tbl_df/tbl/data.frame)
$ ID : num [1:271116] 1 2 3 4 5 5 5 5 5 5 ...
$ Name : chr [1:271116] "A Dijiang" "A Lamusi" "Gunnar Nielsen Aaby" "Edgar Lindenau Aabye" ...
$ Sex : chr [1:271116] "M" "M" "M" "M" ...
$ Age : num [1:271116] 24 23 24 34 21 21 25 25 27 27 ...
$ Height: num [1:271116] 180 170 NA NA 185 185 185 185 185 185 ...
$ Weight: num [1:271116] 80 60 NA NA 82 82 82 82 82 82 ...
$ Team : chr [1:271116] "China" "China" "Denmark" "Denmark/Sweden" ...
$ NOC : chr [1:271116] "CHN" "CHN" "DEN" "DEN" ...
$ Year : num [1:271116] 1992 2012 1920 1900 1988 ...
$ Season: chr [1:271116] "Summer" "Summer" "Summer" "Summer" ...
$ City : chr [1:271116] "Barcelona" "London" "Antwerpen" "Paris" ...
$ Sport : chr [1:271116] "Basketball" "Judo" "Football" "Tug-Of-War" ...
$ Event : chr [1:271116] "Basketball Men's Basketball" "Judo Men's Extra-Lightweight" "Football Men's Football" "Tug-Of-War Men's Tug-Of-War" ...
$ Medal : chr [1:271116] NA NA NA "Gold" ...
$ region: chr [1:271116] "China" "China" "Denmark" "Denmark" ...
Conversion
As shown from the above output the data types of ID,Age,Height,Weight,Year is numeric which is correct. Also the data types of Name, Team, NOC,Event, Sport,City,region is character which is correct. Variables like Sex,Season,Medal should be factors but are shown as character thats why they should be changed to factors. I have used as.factor() function to change the data type of variables Sex, Season and Medal. I have also ordered the Medal variable in the order Gold, Silver and then Bronze. After converting the data type to factors I have used class() function to check the class of the variables and also used the str() function to check if the data types are changed.
#Converting Sex to factor and checking it using class
new$Sex <- as.factor(new$Sex)
class(new$Sex)
[1] "factor"
#Converting Season to factor and checking it using class
new$Season <- as.factor(new$Season)
class(new$Season)
[1] "factor"
#Converting Medal to factor and checking it using class
new$Medal <- as.factor(new$Medal)
new$Medal <- factor(new$Medal,levels = c("Gold","Silver","Bronze"),ordered = TRUE)
class(new$Medal)
[1] "ordered" "factor"
#Checking whether the data type is changed
str(new)
tibble [271,116 × 15] (S3: tbl_df/tbl/data.frame)
$ ID : num [1:271116] 1 2 3 4 5 5 5 5 5 5 ...
$ Name : chr [1:271116] "A Dijiang" "A Lamusi" "Gunnar Nielsen Aaby" "Edgar Lindenau Aabye" ...
$ Sex : Factor w/ 2 levels "F","M": 2 2 2 2 1 1 1 1 1 1 ...
$ Age : num [1:271116] 24 23 24 34 21 21 25 25 27 27 ...
$ Height: num [1:271116] 180 170 NA NA 185 185 185 185 185 185 ...
$ Weight: num [1:271116] 80 60 NA NA 82 82 82 82 82 82 ...
$ Team : chr [1:271116] "China" "China" "Denmark" "Denmark/Sweden" ...
$ NOC : chr [1:271116] "CHN" "CHN" "DEN" "DEN" ...
$ Year : num [1:271116] 1992 2012 1920 1900 1988 ...
$ Season: Factor w/ 2 levels "Summer","Winter": 1 1 1 1 2 2 2 2 2 2 ...
$ City : chr [1:271116] "Barcelona" "London" "Antwerpen" "Paris" ...
$ Sport : chr [1:271116] "Basketball" "Judo" "Football" "Tug-Of-War" ...
$ Event : chr [1:271116] "Basketball Men's Basketball" "Judo Men's Extra-Lightweight" "Football Men's Football" "Tug-Of-War Men's Tug-Of-War" ...
$ Medal : Ord.factor w/ 3 levels "Gold"<"Silver"<..: NA NA NA 1 NA NA NA NA NA NA ...
$ region: chr [1:271116] "China" "China" "Denmark" "Denmark" ...
Tidy & Manipulate Data I
We will check the dataset using head() function. In the dataset each variable have its own column and each observation is stored in a seperate row. Plus each value is stored in its own cell therfore the dataset is tidy and there is no necessity to reshape the dataset. The dataset can be used for further process.
# Using the head() function to check if the dataset is tidy or not
head(new)
Tidy & Manipulate Data II
Here we have created a new variable Body mass index using the mutate() function of the dplyr package. Body mass index was created using the variables Height and Weight. Body mass index is a measurement of persons weight with respect to their height.Following formula was used to create the variable-
Body Mass Index = Weight/(Height)^2
Here I have used 10000 because the height in the formula should be in meter but in the data we have it in cm so we need to convert it.
# Creating variable Body Mass Index using mutate() function
new <- mutate(new,Body_Mass_Index=round(new$Weight/(new$Height * new$Height) *10000))
head(new)
Scan I
First I have used colSums() function with is.na() to find the NULL values in each variables of the dataset.
# Displaying the count of Null values in each variable of dataset
colSums(is.na(new))
ID Name Sex Age
0 0 0 9474
Height Weight Team NOC
60171 62875 0 0
Year Season City Sport
0 0 0 0
Event Medal region Body_Mass_Index
0 231333 370 64263
According to the above output we can conclude that there are missing values in the Age, Height, Weight, Medal, region and Body_Mass_Index variables. Age, Height, Weight, Body_Mass_Index are numeric variables and Medal and region are categorical that is why they should be handled diffrently. The NULL value count of variables are high therfore they cannot be excluded as this might affect the analysis process therefore the null values must be imputed. I have imputed the numeric NULL values by using mean() function along with is.na()
#Imputing numeric variables with the mean of the particular column
new$Age[is.na(new$Age)] <- round(mean(new$Age, na.rm = TRUE))
new$Weight[is.na(new$Weight)] <- mean(new$Weight, na.rm = TRUE)
new$Height[is.na(new$Height)] <- round(mean(new$Height, na.rm = TRUE))
new$Body_Mass_Index[is.na(new$Body_Mass_Index)] <- round(mean(new$Body_Mass_Index, na.rm = TRUE))
head(new)
NA
NA
The categorical variables are imputed by forming a new category in the variable. Here I have formed a category “No Medal” in the variable Medal and replace all the null values with that category using is.na() function. The medal variable is factor so we need to convert it to character. Similarly new category named “Unknown” was formed in the variable region and all null values were imputed with this category using is.na() function
#Imputing categorical variable by creating a new category
new$Medal<- as.character(new$Medal)
new$Medal[is.na(new$Medal)] <- "No Medal"
new$Medal <- as.factor(new$Medal)
new$Medal <-factor(new$Medal,levels = c("Gold","Silver","Bronze","No Medal"),ordered = TRUE)
new$region[is.na(new$region)] <- "Unknown"
head(new)
NA
NA
Lastly all the numeric variables were scanned for missing values using sum() function along with is.infinite() function to check the special values in the dataset. Finally to check whether all the null values were removed I have used colSums() function along with is.na() function
#Scanning for special values in numeric variables of the dataset
sum(is.infinite(new$ID))
[1] 0
sum(is.infinite(new$Age))
[1] 0
sum(is.infinite(new$Height))
[1] 0
sum(is.infinite(new$Weight))
[1] 0
sum(is.infinite(new$Body_Mass_Index))
[1] 0
sum(is.infinite(new$Year))
[1] 0
colSums(is.na(new))
ID Name Sex Age
0 0 0 0
Height Weight Team NOC
0 0 0 0
Year Season City Sport
0 0 0 0
Event Medal region Body_Mass_Index
0 0 0 0
Scan II
An outlier is an observation which stand out from other observations. Outlier can be because of data entry error, Measurement error, Data processing error or many other reasons. To scan the outliers I have plotted the numeric variables using boxplot function from the ggplot2 library. I have not checked the numeric variables like ID and year for outliers because ID is unique and there cant be a outlier for ID. Similar is the case for year. Here I have used Turkeys method of outlier detection using boxplot.
# Ploting numeric variables using boxplot() to check the outliers
#Referred from week8 lecture slides and worksheet
new$Age %>% boxplot(main="Boxplot Age",ylab="Age",col = "green")

new$Height %>% boxplot(main="Boxplot Height",ylab="Height",col = "blue")

new$Weight %>% boxplot(main="Boxplot Weight",ylab="Weight",col = "red")

new$Body_Mass_Index %>% boxplot(main="Boxplot Body Mass Index",ylab="Body Mass Index",col = "orange")

From the above outputs we can see that there are outliers in each numeric variables that we have plotted using the boxplot.Now to check the number and percentage of outliers I have used the z-score value along with the score() and summary() function. Summary() function gives us the summary of the variable. By using lenght() function along with which() function and total number of rows I have calculated the percentage of outliers in a variable.The percentage of outliers value is stored in objects like age_outliers, height_outliers, weight_outliers, body_mass outliers
#Finding the percentage of outliers using scores, summary and which function
#Referred from week8 lecture slides and worksheet
z.age <- new$Age %>% scores(type = "z")
z.age %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.47913 -0.56873 -0.09112 0.00000 0.38648 11.37132
age_ouliers <- (length (which( abs(z.age) >3 )) / 271116) * 100
age_ouliers
[1] 1.625135
z.height <- new$Height %>% scores(type = "z")
z.height %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-5.20130 -0.56726 -0.02842 0.00000 0.51042 5.46776
height_ouliers <- (length (which( abs(z.height) >3 )) / 271116) * 100
height_ouliers
[1] 0.7513389
z.weight <- new$Weight %>% scores(type = "z")
z.weight %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.6345 -0.6125 0.0000 0.0000 0.3418 11.3957
weight_ouliers <- (length (which( abs(z.weight) >3 )) / 271116) * 100
weight_ouliers
[1] 1.144897
z.bodymass <- new$Body_Mass_Index %>% scores(type = "z")
z.bodymass %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-5.79814 -0.71697 0.06475 0.00000 0.45561 16.08999
bodymass_ouliers <- (length (which( abs(z.bodymass) >3 )) / 271116) * 100
bodymass_ouliers
[1] 1.451039
From the above data we can see that there are around 1.625% outliers in Age variable, 0.75% outliers in Height variable, 1.144% outliers in Weight and 1.45% outliers in Body Mass Index. I have used capping function along with the cap() method to replace the outliers in each variable with their nearest neighbour that are not outliers. Excluding the outliers might affect the analysis later and imputing it with mean or median will ultimately favour the count of one value thats why I choosed the cap method to impute the outliers.
#Capping the outliers using the cap function
#Referred from week8 lecture slides and worksheet
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
new_olyampic <- new
new_olyampic$Age <- cap(new_olyampic$Age)
new_olyampic$Height <- cap(new_olyampic$Height)
new_olyampic$Weight <- cap(new_olyampic$Weight)
new_olyampic$Body_Mass_Index <- cap(new_olyampic$Body_Mass_Index)
head(new_olyampic)
NA
NA
After using the cap function I have plotted the variables using boxplots to check if there are outliers present in the Age, Height, Weight and Body Mass Index.
#Plotting the varaibles to check if the outliers are removed or not.
new_olyampic$Age %>% boxplot(main="Boxplot Age",ylab="Age",col = "green")

new_olyampic$Height %>% boxplot(main="Boxplot Height",ylab="Height",col = "blue")

new_olyampic$Weight %>% boxplot(main="Boxplot Weight",ylab="Weight",col = "red")

new_olyampic$Body_Mass_Index %>% boxplot(main="Boxplot Body Mass Index",ylab="Body Mass Index",col = "orange")

From the above output we can conclude that there are no outliers present in the dataset. The outliers have been capped successfully.
Transform
Transformation is required to change the scale of the variable or standardise the value of variable for better understanding. It can also be used to change the linearity of a realtionship. I have used the hist() function to plot histogram of the numeric variable. The Age variable was plotted using the hist() function here xlim and ylim are scale limits of the x and y axis
#Using hist() to check the distribution of age
#Referred from week9 lecture slides and worksheet
hist(new$Age, main = "Histogram of Age",
xlab = "Age",
ylim = c(0,120000), xlim = c(0,80))

NA
NA
From the above output it is clear that the graph for the Age variable is right skewed. Therefore I have used the log10 transformation to Age variable to obtain a normal distribution. I have transformed Age by applying log10() function to it and store it in an object named age_log. Later with the help of hist() function I have plotted the age variable
#Using log10 to transform age variable
#Referred from week9 lecture slides and worksheet
age_log <- log10(new$Age)
hist(age_log,main = "Age histogram after transformation",
xlab = "Age", ylim = c(0,80000))

NA
NA
From the output we can conclude that age is normally distributed after tranformation .
Now I have used hist() function to plot the Height variable.
#Using hist() to check the distribution of age
#Referred from week9 lecture slides and worksheet
hist(new$Height, main = "Histogram of Height",
xlab = "Height",
ylim = c(0, 120000),
xlim = c(130, 220))

From the output we can conclude that Height is normally distributed and therefore we dont need to tranform it.
Now I have used hist() function to plot the Weight variable.
#Using hist() to check the distribution of weight
#Referred from week9 lecture slides and worksheet
hist(new$Weight, main = "Histogram of Weight",
xlab = "Weight",
ylim = c(0, 120000),
xlim = c(20,150))

From the output we can conclude that graph for weight variable is right skewed. Therefore I have used the log10 transformation to Weight variable to obtain a normal distribution. I have transformed Weight by applying log10() function to it and store it in an object named weight_log.
#Using log10 to transform age variable
#Referred from week9 lecture slides and worksheet
weight_log <- log10(new$Weight)
hist(weight_log,main = "Weight histogram after transformation",xlab = "Weight",
ylim = c(0,120000))

From the output we can conclude that weight is normally distributed after tranformation.
After processing and investigating the historical dataset of olyampic games I have transformed the dataset by converting the variables to appropraite data types, handling outliers and missing values, transforming the variables. The dataset is now clean and can be used for further analysis.
---
title: "MATH2349 Semester 1, 2020"
author: "Student name: Christangel Fargose and Student number:s3794800"
subtitle: Assignment 2
output:
  html_notebook: default
---

## Required packages 
Following are the packages below that I have used for this assignment.

```{r}
# Below packages are used
library(readr)
library(dplyr)
library(outliers)
library(ggplot2)
library(tidyr)
library(validate)
library(lubridate)
library(knitr)
library(Hmisc)


```


## Executive Summary 
The main aim of the assignment is to preprocess a open source untidy dataset. I have downloaded the dataset from Kaggle. The data file contains two datasets named athlete_events.csv and noc_regions.csv. Then the datasets are imported using read_csv() function from the readr library.
After importing the datasets I have subsetted both the datasets using select() function as some of the variables are not required for the further process. Later I have used left join to join both the datasets using common variable NOC.
Then using the str() function I get the data types of the variables. Data type of some variables like Sex, Season, Medal should be factor but it is shown as character thats why I converted it using as.factor() function.
Later I used head() function to check whether the dataset is tidy or not. After concluding that the dataset is tidy I created a new variable named Body Mass Index using Height, Weight variables and mutate() function.
In the scan section I have scanned the dataset for missing values using colSums() and is.na() function. I have imputed the Null values in numeric variables using the mean() function. Null values in the categorical variables are imputed by creating a new category.
Then I have scanned for outliers by plotting the numeric variables using boxplot(). I have calculated the percentage of outliers using score() and which() function. Outliers were capped using cap() function. The variables were plotted using boxplot() to check if there are outliers.
After removing outliers using hist() the distribution of the variables was analysed. Age and Weight were transformed using log10 transformation as their graph was right skewed. Height was normally distributed thats why it is not transformed.

## Data 

The following dataset is historical dataset of olyampic games from 1896 to 2016. The dataset is downloade from kaggle. Link of the data source is given below-

https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results

The dataset contains two tables named athlete_events.csv and noc_regions.csv. First table contains information about the athletes and the olyampic events that they have participated and second table contains information about the countries that participated in the olyampic games. The athlete_events.csv contains following variables-
1) ID- A unique identifier for each athlete
2) Name- Name of the athlete
3) Sex- Gender of athlete
4) Age Age of athlete
5) Height- Height of athlete in cm
6) Weight- Weight of an athlete in kgs
7) Team- Name of the Team
8) NOC- National olyampic committee 3-letter code
9) Games- Year and Season of the olyampic game
10) Year- Year of the olyampic
11) Season- Season of the olyampic
12) City- Name of the city that hosted the olyampics
13) Sport- Sport that the athlete participated
14) Event- Event of the sport in which the athlete participated
15) Medal- Medal won by the athlete.

The noc_regions.csv contains following variables-
1) NOC- National olyampic committee 3-letter code
2) region- Country name of the respective NOC
3) notes- Some information about the region.

## Importing and Subsetting the Data

Here I have used the read_csv() function from the readr library to import both the datasets. athlete_events.csv dataset is assigned to object player_data and
noc_regions.csv datadet is assigned to the object country_data. 
Games variable consist of year and season information combined and seperating the variable would give us the variables like Year and Season which already exist thats why I have excluded the Games variable using the select() function and store all the other variable in object Player_data. 
Similarly the note variable in the countries data is not useful for further analysis therefore I have selected only two variables NOC and region and stored it in an object country_data
Now after subsetting both the datasets I have merged both the datasets using left_join based on the common key variable NOC. The merged datset is stored in the object new. After that using the head() function I have displayed the new dataset and checked if every column is included after joining both the datasets


```{r}
#Importing the first dataset using read_csv
player_data <- read_csv("/Users/christangelfargose/120-years-of-olympic-history-athletes-and-results/athlete_events.csv")

#Subsetting the dataset by removing the 9th column
player_data <- select(player_data,-9)
head(player_data)

#Importing country data using read_csv
country_data <- read_csv("/Users/christangelfargose/120-years-of-olympic-history-athletes-and-results/noc_regions.csv")

#Subsetting the country_data
country_data <- country_data[,c(1,2)]
head(country_data)

#Joining both the datasets using left join
new <- left_join(player_data,country_data,by="NOC")
head(new)


```

## Understand 

The new dataset consist of total 15 variables and 271116 rows. The data type of variables can be found out using str() function


```{r}

#Displaying the structure of dataset and checking data types 
str(new)
```

## Conversion

As shown from the above output the data types of ID,Age,Height,Weight,Year is numeric which is correct. Also the data types of Name, Team, NOC,Event, Sport,City,region is character which is correct. Variables like Sex,Season,Medal should be factors but are shown as character thats why they should be changed to factors. I have used as.factor() function to change the data type of variables Sex, Season and Medal. I have also ordered the Medal variable in the order Gold, Silver and then Bronze. After converting the data type to factors I have used class() function to check the class of the variables and also used the str() function to check if the data types are changed.

```{r}
#Converting Sex to factor and checking it using class
new$Sex <- as.factor(new$Sex)
class(new$Sex)

#Converting Season to factor and checking it using class
new$Season <- as.factor(new$Season)
class(new$Season)

#Converting Medal to factor and checking it using class
new$Medal <- as.factor(new$Medal)
new$Medal <- factor(new$Medal,levels = c("Gold","Silver","Bronze"),ordered = TRUE)
class(new$Medal)

#Checking whether the data type is changed 
str(new)

```

##	Tidy & Manipulate Data I 

We will check the dataset using head() function. In the dataset each variable have its own column and each observation is stored in a seperate row. Plus each value is stored in its own cell therfore the dataset is tidy and there is no necessity to reshape the dataset. The dataset can be used for further process.


```{r}
# Using the head() function to check if the dataset is tidy or not 
head(new)
```

##	Tidy & Manipulate Data II 

Here we have created a new variable Body mass index using the mutate() function of the dplyr package. Body mass index was created using the variables Height and Weight. Body mass index is a measurement of persons weight with respect to their height.Following formula was used to create the variable-

Body Mass Index = Weight/(Height)^2

Here I have used 10000 because the height in the formula should be in meter but in the data we have it in cm so we need to convert it.


```{r}
# Creating variable Body Mass Index using mutate() function
new <-  mutate(new,Body_Mass_Index=round(new$Weight/(new$Height * new$Height) *10000))
head(new)
```


##	Scan I 

First I have used colSums() function with is.na() to find the NULL values in each variables of the dataset.

```{r}
# Displaying the count of Null values in each variable of dataset
colSums(is.na(new))
```

According to the above output we can conclude that there are missing values in the Age, Height, Weight, Medal, region and Body_Mass_Index variables. Age, Height, Weight, Body_Mass_Index are numeric variables and Medal and region are categorical that is why they should be handled diffrently. The NULL value count of variables are high therfore they cannot be excluded as this might affect the analysis process therefore the null values must be imputed. I have imputed the numeric NULL values by using mean() function along with is.na()

```{r}
#Imputing numeric variables with the mean of the particular column
new$Age[is.na(new$Age)] <- round(mean(new$Age, na.rm = TRUE))

new$Weight[is.na(new$Weight)] <- mean(new$Weight, na.rm = TRUE)

new$Height[is.na(new$Height)] <- round(mean(new$Height, na.rm = TRUE))

new$Body_Mass_Index[is.na(new$Body_Mass_Index)] <- round(mean(new$Body_Mass_Index, na.rm = TRUE))

head(new)


```

The categorical variables are imputed by forming a new category in the variable. Here I have formed a category "No Medal" in the variable Medal and replace all the null values with that category using is.na() function. The medal variable is factor so we need to convert it to character. Similarly new category named "Unknown" was formed in the variable region and all null values were imputed with this category using is.na() function

```{r}
#Imputing categorical variable by creating a new category
new$Medal<- as.character(new$Medal)
new$Medal[is.na(new$Medal)] <- "No Medal"
new$Medal <- as.factor(new$Medal)
new$Medal <-factor(new$Medal,levels = c("Gold","Silver","Bronze","No Medal"),ordered = TRUE)

new$region[is.na(new$region)] <- "Unknown"

head(new)


```

Lastly all the numeric variables were scanned for missing values using sum() function along with is.infinite() function to check the special values in the dataset. Finally to check whether all the null values were removed I have used colSums() function along with is.na() function

```{r}
#Scanning for special values in numeric variables of the dataset
sum(is.infinite(new$ID))

sum(is.infinite(new$Age))

sum(is.infinite(new$Height))

sum(is.infinite(new$Weight))

sum(is.infinite(new$Body_Mass_Index))

sum(is.infinite(new$Year))

colSums(is.na(new))

```

##	Scan II

An outlier is an observation which stand out from other observations. Outlier can be because of data entry error, Measurement error, Data processing error or many other reasons. To scan the outliers I have plotted the numeric variables using boxplot function from the ggplot2 library. I have not checked the numeric variables like ID and year for outliers because ID is unique and there cant be a outlier for ID. Similar is the case for year. Here I have used Turkeys method of outlier detection using boxplot.

```{r}
# Ploting numeric variables using boxplot() to check the outliers
#Referred from week8 lecture slides and worksheet
new$Age %>% boxplot(main="Boxplot Age",ylab="Age",col = "green")

new$Height %>% boxplot(main="Boxplot Height",ylab="Height",col = "blue")

new$Weight %>% boxplot(main="Boxplot Weight",ylab="Weight",col = "red")

new$Body_Mass_Index %>% boxplot(main="Boxplot Body Mass Index",ylab="Body Mass Index",col = "orange")

```

From the above outputs we can see that there are outliers in each numeric variables that we have plotted using the boxplot.Now to check the number and percentage of outliers I have used the z-score value along with the score() and summary() function. Summary() function gives us the summary of the variable. By using lenght() function along with which() function and total number of rows I have calculated the percentage of outliers in a variable.The percentage of outliers value is stored in objects like age_outliers, height_outliers, weight_outliers, body_mass outliers

```{r}
#Finding the percentage of outliers using scores, summary and which function 
#Referred from week8 lecture slides and worksheet
z.age <- new$Age %>%  scores(type = "z")
z.age %>% summary()
age_ouliers <- (length (which( abs(z.age) >3 )) / 271116) * 100
age_ouliers

z.height <- new$Height %>%  scores(type = "z")
z.height %>% summary()
height_ouliers <- (length (which( abs(z.height) >3 )) / 271116) * 100
height_ouliers

z.weight <- new$Weight %>%  scores(type = "z")
z.weight %>% summary()
weight_ouliers <- (length (which( abs(z.weight) >3 )) / 271116) * 100
weight_ouliers

z.bodymass <- new$Body_Mass_Index %>%  scores(type = "z")
z.bodymass %>% summary()
bodymass_ouliers <- (length (which( abs(z.bodymass) >3 )) / 271116) * 100
bodymass_ouliers


```

From the above data we can see that there are around 1.625% outliers in Age variable, 0.75% outliers in Height variable, 1.144% outliers in Weight and 1.45% outliers in Body Mass Index. I have used capping function along with the cap() method to replace the outliers in each variable with their nearest neighbour that are not outliers. Excluding the outliers might affect the analysis later and imputing it with mean or median will ultimately favour the count of one value thats why I choosed the cap method to impute the outliers.

```{r}
#Capping the outliers using the cap function
#Referred from week8 lecture slides and worksheet
cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

new_olyampic <- new
new_olyampic$Age <- cap(new_olyampic$Age)
new_olyampic$Height <- cap(new_olyampic$Height)
new_olyampic$Weight <- cap(new_olyampic$Weight)
new_olyampic$Body_Mass_Index <- cap(new_olyampic$Body_Mass_Index)

head(new_olyampic)


```

After using the cap function I have plotted the variables using boxplots to check if there are outliers present in the Age, Height, Weight and Body Mass Index.

```{r}
#Plotting the varaibles to check if the outliers are removed or not.
new_olyampic$Age %>% boxplot(main="Boxplot Age",ylab="Age",col = "green")

new_olyampic$Height %>% boxplot(main="Boxplot Height",ylab="Height",col = "blue")

new_olyampic$Weight %>% boxplot(main="Boxplot Weight",ylab="Weight",col = "red")

new_olyampic$Body_Mass_Index %>% boxplot(main="Boxplot Body Mass Index",ylab="Body Mass Index",col = "orange")
```

From the above output we can conclude that there are no outliers present in the dataset. The outliers have been capped successfully.

##	Transform 

Transformation is required to change the scale of the variable or standardise the value of variable for better understanding. It can also be used to change the linearity of a realtionship. I have used the hist() function to plot histogram of the numeric variable. The Age variable was plotted using the hist() function here xlim and ylim are scale limits of the x and y axis

```{r}
#Using hist() to check the distribution of age
#Referred from week9 lecture slides and worksheet
hist(new$Age, main = "Histogram of Age", 
     xlab = "Age",
     ylim = c(0,120000), xlim = c(0,80))


```

From the above output it is clear that the graph for the Age variable is right skewed. Therefore I have used the log10 transformation to Age variable to obtain a normal distribution. I have transformed Age by applying log10() function to it and store it in an object named age_log. Later with the help of hist() function I have plotted the age variable

```{r}
#Using log10 to transform age variable
#Referred from week9 lecture slides and worksheet
age_log <- log10(new$Age)
hist(age_log,main = "Age histogram after transformation",
     xlab = "Age", ylim = c(0,80000))


```

From the output we can conclude that age is normally distributed after tranformation .

Now I have used hist() function to plot the Height variable.

```{r}
#Using hist() to check the distribution of age
#Referred from week9 lecture slides and worksheet
hist(new$Height, main = "Histogram of Height", 
     xlab = "Height",
     ylim = c(0, 120000),
     xlim = c(130, 220))

```

From the output we can conclude that Height is normally distributed and therefore we dont need to tranform it.

Now I have used hist() function to plot the Weight variable.

```{r}
#Using hist() to check the distribution of weight
#Referred from week9 lecture slides and worksheet
hist(new$Weight, main = "Histogram of Weight", 
     xlab = "Weight",
     ylim = c(0, 120000),
     xlim = c(20,150))

```

From the output we can conclude that graph for weight variable is right skewed.
Therefore I have used the log10 transformation to Weight variable to obtain a normal distribution. I have transformed Weight by applying log10() function to it and store it in an object named weight_log.

```{r}
#Using log10 to transform age variable
#Referred from week9 lecture slides and worksheet
weight_log <- log10(new$Weight)
hist(weight_log,main = "Weight histogram after transformation",xlab = "Weight",
     ylim = c(0,120000))
```

From the output we can conclude that weight is normally distributed after tranformation.

After processing and investigating the historical dataset of olyampic games I have transformed the dataset by converting the variables to appropraite data types, handling outliers and missing values, transforming the variables. The dataset is now clean and can be used for further analysis.


<br>
<br>
