---
title: "Semester 1, 2019 - MATH2349(Data Preprocessing)"
author: "Varshini Ravi(s3654272) "
subtitle: Assignment 3
output:
  html_notebook: default
---

## Packages Required

```{r}
library(readr);library(tidyr);library(dplyr)
library(validate);library(lubridate);library(deductive)
library(outliers);library(Hmisc);library(stringr)
library(knitr);library(ggplot2)

```


##  Abstract  

This assignment pre-processes a dataset downloaded from Kaggle. The dataset consists of two files describing Olympic events and athlete demographics. The two files are merged using the function left_join, and the merged data is inspected to understand the variable types.

The variable NOC was renamed as country code in both datasets. The country code was converted into a factor and the height of players into a numeric variable. A date variable was derived from the year and the season of each games. The variable medal was converted into an ordered factor with labels for Gold, Silver and Bronze, and its N/A's were later replaced by not_awarded. The variable games contained both year and season, so it was split into two columns. The column notes was dropped as it wasn't useful and was mostly empty.

The body mass index column was created from the variables weight and height by converting height to metres and using the function mutate to add the new column. Next, the dataset was scanned to check for N/A's. The N/A's in the weight, height and body_mass columns were replaced with mean values based on gender. The N/A's in the age variable were replaced with the mean age of players. The N/A's in the medal variable were replaced by not_awarded.

Then the numeric variables were plotted using boxplots. They had many outliers, which were treated by capping. Finally, the height and body mass index variables were transformed using a log10 transformation, and the transformed variables were plotted as histograms.

## Data 

The data on Olympic events and their regional information is taken from Kaggle. The dataset source:

https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results#athlete_events.csv

The dataset contains information on all events and players from 1896 to 2016. The athletics dataset contains 271,116 observations of 15 variables. The nationality dataset contains 230 observations of 3 variables. The two datasets are merged on the country code (NOC).

The athletics dataset contains the variables ID, Name, Sex, Age, Height, Weight, Team, NOC (country code), Games, Year, Season, City, Sport, Event and Medal. The nationality dataset contains the 3 variables NOC (country code), region (country) and notes.



## Reading the dataset into R 
Both datasets were read into R. The athletics dataset was read using the function read_csv, and its NOC variable was renamed as Country_code and its Sex variable as Gender. The nationality dataset was read using read.csv, and its columns NOC and region were renamed as Country_code and Country respectively. The two datasets were then merged on Country_code, and the first few rows of each dataset are displayed using the head() function.

```{r, eval=FALSE}
athletics <- read_csv("C:/Users/Varshini/Desktop/Data Pre-Processing/athletics.csv")
```

```{r, eval=FALSE, include=FALSE}
names(athletics)[names(athletics) == "NOC"] <- "Country_code"
names(athletics)[names(athletics)=="Sex"] <- "Gender"

athletics$Gender <- factor(athletics$Gender, levels=c('M','F'),
  labels=c('Male','Female'))
```

```{r}
head(athletics)
```

```{r}
nationality <- read.csv("C:/Users/Varshini/Desktop/Data Pre-Processing/nationality.csv")
colnames(nationality)
names(nationality)[names(nationality) == "NOC"] <- "Country_code"
names(nationality)[names(nationality) == "region"] <- "Country"
head(nationality)
```

```{r, eval=FALSE}
# Join on the shared key explicitly instead of relying on the automatic match
sports <- left_join(athletics, nationality, by = "Country_code")
```

```{r}
head(sports)
```
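
The behaviour of the merge can be checked on a small example. The sketch below uses two hypothetical toy tables (the `demo_*` names and their values are illustrative and not part of the Kaggle data) to show that left_join keeps every athlete row and fills unmatched country codes with NA, and that anti_join lists the codes that found no match.

```{r}
library(dplyr)

# Hypothetical toy tables, for illustration only
demo_athletes <- data.frame(
  Name = c("A", "B", "C"),
  Country_code = c("CHN", "DEN", "XXX"),
  stringsAsFactors = FALSE
)
demo_regions <- data.frame(
  Country_code = c("CHN", "DEN"),
  Country = c("China", "Denmark"),
  stringsAsFactors = FALSE
)

# All three athlete rows are kept; the unmatched code gets NA for Country
demo_joined <- left_join(demo_athletes, demo_regions, by = "Country_code")
demo_joined

# Rows of the athlete table whose code has no match in the lookup table
demo_unmatched <- anti_join(demo_athletes, demo_regions, by = "Country_code")
demo_unmatched
```

Listing the unmatched codes before imputing or omitting rows makes it easier to see which observations would later be lost because of a missing country.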

## Understand 

The data is scanned using the str() function to understand the variable types. The variable types were read in correctly. The country code variable is converted into a factor and the height variable into a numeric variable. A date variable season_year is built from the year using an ifelse statement on the season: winter games are given the nominal date 20 July and summer games 20 January of the games year.
The medal variable contained the levels gold, silver and bronze. These levels were ordered from gold to bronze.


```{r}
# Height as numeric, country code as a categorical variable
sports$Height <- as.numeric(sports$Height)
sports$Country_code <- as.factor(sports$Country_code)

# Nominal date per games: 20 July for Winter, 20 January for Summer
year_season <- ifelse(sports$Season == "Winter",
                      paste(sports$Year, 7, 20, sep = "-"),
                      paste(sports$Year, 1, 20, sep = "-"))
year_season <- ymd(year_season)
sports$season_year <- year_season

# Ordered factor from Gold to Bronze; rows without a medal stay NA for now
sports$Medal <- ordered(sports$Medal,
                        levels = c("Gold", "Silver", "Bronze"),
                        labels = c("Gold_Medal", "Silver_Medal", "Bronze_Medal"))

```
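
The same nominal dates can also be built without pasting strings together. The sketch below is an alternative, not the method used in this report: it assumes lubridate's make_date() and a hypothetical two-row table (`demo_games`), and picks the month from the season in the same way as the ifelse statement.

```{r}
library(lubridate)

# Hypothetical toy table, for illustration only
demo_games <- data.frame(
  Year = c(1992, 1994),
  Season = c("Summer", "Winter"),
  stringsAsFactors = FALSE
)

# Month 7 for Winter and month 1 for Summer, mirroring the rule in this section
demo_month <- ifelse(demo_games$Season == "Winter", 7, 1)

# make_date() assembles a Date from numeric year, month and day
demo_games$season_year <- make_date(year = demo_games$Year,
                                    month = demo_month,
                                    day = 20)
demo_games
```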

```{r}
str(sports)
```
## Tidy & Manipulate Data I

The dataset is mostly clean, but the games variable contains both year and season information. It is split using separate() into the new columns Date_Year and Seasons.
The columns games and notes are then dropped, as they are not useful for the analysis.

```{r}
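# separate() removes the original Games column; notes is dropped explicitly
sports_mod <- sports %>%
  separate(Games, into = c("Date_Year", "Seasons"), sep = " ") %>%
  dplyr::select(-notes)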
head(sports_mod)
```
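
A small example shows what separate() does to a single column. The data frame `demo_games_chr` below is hypothetical. The argument convert = TRUE is an optional extra, not used in this report, that asks tidyr to turn the year into an integer instead of leaving it as text.

```{r}
library(tidyr)

# Hypothetical toy column in the same "year season" layout as Games
demo_games_chr <- data.frame(
  Games = c("1992 Summer", "2014 Winter"),
  stringsAsFactors = FALSE
)

# Split on the space; the original Games column is removed by default
demo_split <- separate(demo_games_chr, Games,
                       into = c("Date_Year", "Seasons"),
                       sep = " ", convert = TRUE)
str(demo_split)
```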

## Tidy & Manipulate Data II

The height variable is converted from centimetres to metres using the formula `height / 100`.
A new column body_mass (the body mass index) is created from the weight in kilograms and the height in metres using the formula `Body_mass_Index = weight / height^2`.

```{r, eval=FALSE}
# Height in metres (the source column is in centimetres)
var_height <- as.numeric(sports_mod$Height) / 100

```

```{r}
# Body mass index = weight (kg) / height (m)^2; Height is stored in centimetres
sports_mod <- mutate(sports_mod, body_mass = Weight / (Height / 100)^2)

head(sports_mod)
```
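
As a quick check of the formula, the arithmetic can be worked through for one athlete. The values below (80 kg, 180 cm) and the `demo_*` names are illustrative assumptions.

```{r}
# Illustrative values, not taken from a specific row
demo_weight_kg <- 80
demo_height_cm <- 180

# Convert centimetres to metres before squaring
demo_height_m <- demo_height_cm / 100

# 80 / 1.8^2 = 80 / 3.24
demo_bmi <- demo_weight_kg / demo_height_m^2
round(demo_bmi, 2)
```

Dividing by the height in centimetres instead would give a value 10,000 times smaller, so a body mass index far below 1 is a sign that the unit conversion was skipped.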

## Scan I

The dataset is scanned to check for N/A's in each column. The N/A's in age, height, weight and body mass are replaced by the mean of the respective variable, grouped by the gender of the players. The medal variable also contained N/A's, which are replaced by not_awarded, because at most three competitors in an event win a medal and all others are not awarded one.

Any rows that still contain N/A's are then omitted, and colSums(is.na()) is used to confirm that no column has N/A's left.


```{r}
colSums(is.na(sports_mod))

```
```{r, eval=FALSE, include=FALSE}
names(sports_mod)[names(sports_mod) == "Sex"] <- "Gender"
```
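
The N/A scan can be illustrated on a small table. The data frame `demo_na` below is hypothetical. The sketch counts the N/A's per column, expresses them as a percentage of rows, and shows how many rows na.omit() would keep if nothing were imputed first.

```{r}
# Hypothetical toy table with missing values, for illustration only
demo_na <- data.frame(
  Name = c("A", "B", "C"),
  Age = c(24, NA, 31),
  Height = c(NA, NA, 170),
  stringsAsFactors = FALSE
)

# Number of N/A's in each column
demo_na_counts <- colSums(is.na(demo_na))
demo_na_counts

# Share of rows with an N/A, in percent
round(100 * demo_na_counts / nrow(demo_na), 1)

# Rows that survive na.omit() without any imputation
nrow(na.omit(demo_na))
```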

```{r}
# Replace N/A's with the mean of the same gender
impute_by_gender <- function(x) {
  ave(x, sports_mod$Gender,
      FUN = function(v) ifelse(is.na(v), mean(v, na.rm = TRUE), v))
}

sports_mod$Age <- impute_by_gender(sports_mod$Age)
sports_mod$Height <- impute_by_gender(sports_mod$Height)
sports_mod$Weight <- impute_by_gender(sports_mod$Weight)
sports_mod$body_mass <- impute_by_gender(sports_mod$body_mass)

```
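
Mean replacement by gender can be sketched with dplyr on a small table. The data frame `demo_people` is hypothetical, and the group_by() and mutate() pipeline is one possible way to write the step, shown here under the assumption that the gender column itself has no N/A's.

```{r}
library(dplyr)

# Hypothetical toy table, for illustration only
demo_people <- data.frame(
  Gender = c("Male", "Male", "Male", "Female", "Female"),
  Height = c(180, NA, 190, 160, NA),
  stringsAsFactors = FALSE
)

# Within each gender, fill N/A's with that group's mean height
demo_imputed <- demo_people %>%
  group_by(Gender) %>%
  mutate(Height = ifelse(is.na(Height), mean(Height, na.rm = TRUE), Height)) %>%
  ungroup()
demo_imputed
```

The missing male height becomes 185 (the mean of 180 and 190) and the missing female height becomes 160, so the two groups are not pulled towards a common overall mean.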

```{r}
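# Medal already carries the labels Gold_Medal, Silver_Medal and Bronze_Medal
sports_mod$Medal <- as.character(sports_mod$Medal)
sports_mod$Medal[is.na(sports_mod$Medal)] <- "not_awarded"
sports_mod$Medal <- ordered(sports_mod$Medal,
                            levels = c("Gold_Medal", "Silver_Medal", "Bronze_Medal", "not_awarded"))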
```
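
When an ordered factor that carries labels is rebuilt, the new level set has to use the labels, because as.character() returns the labels and any value missing from the level set silently becomes N/A. A minimal sketch on a hypothetical vector (`demo_raw` and the other `demo_*` names are illustrative):

```{r}
# Hypothetical medal values; NA marks athletes without a medal
demo_raw <- c("Gold", NA, "Bronze", "Silver")
demo_medal <- ordered(demo_raw,
                      levels = c("Gold", "Silver", "Bronze"),
                      labels = c("Gold_Medal", "Silver_Medal", "Bronze_Medal"))

# as.character() returns the labels, not the raw values
demo_chr <- as.character(demo_medal)
demo_chr[is.na(demo_chr)] <- "not_awarded"

# Rebuild the ordered factor with the labels plus the new level
demo_medal_full <- ordered(demo_chr,
                           levels = c("Gold_Medal", "Silver_Medal",
                                      "Bronze_Medal", "not_awarded"))
table(demo_medal_full)
```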

```{r}
sports_mod <- na.omit(sports_mod)
head(sports_mod)
```

```{r}
colSums(is.na(sports_mod))
```

## Scan II

The data is scanned again using the boxplot function to check for outliers. All the numeric variables had outliers. To treat them, values more than 1.5 times the interquartile range below the first quartile or above the third quartile were capped at the 5th and 95th percentiles respectively. The capped variables were then plotted again to check that the outliers had been treated.

```{r}
par(mfrow = c(1,2))
boxplot(var_height, main = "Athletes Height", col="orange")
boxplot(sports_mod$Weight, main= "Athletes Weight", col="blue")
```

```{r}
par(mfrow = c(1,2))
boxplot(sports_mod$body_mass, main = "Body Mass Index of Athletes", col="green")
boxplot(sports_mod$Age, main = "Athletes Age", col="red")
```

```{r}
cap <- function(x){
    # 5th, 25th, 75th and 95th percentiles
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    # Values beyond the 1.5*IQR fences are replaced by the 5th / 95th percentile
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}

height_var <- sports_mod$Height %>% cap()
weight_var <- sports_mod$Weight %>% cap()
body_mass_var <- sports_mod$body_mass %>% cap()
age_var <- sports_mod$Age %>% cap()

```

```{r}
# Plot the capped variables, not the original columns
par(mfrow = c(1,2))
boxplot(height_var, main = "Athletes Height", col="orange")
boxplot(weight_var, main= "Athletes Weight", col="blue")
```

```{r}
par(mfrow = c(1,2))
boxplot(body_mass_var, main = "Body Mass Index of Athletes", col="green")
boxplot(age_var, main = "Age of Athletes", col="red")

```
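
The capping rule can be checked on a small vector with one obvious outlier. The function `demo_cap` below is a hypothetical restatement of the rule described in this section (1.5 times the interquartile range as fences, 5th and 95th percentiles as replacement values), with N/A handling added as an assumption. The vector `demo_x` is made up for the example.

```{r}
# Hypothetical restatement of the capping rule, for illustration only
demo_cap <- function(x) {
  q <- quantile(x, c(0.05, 0.25, 0.75, 0.95), na.rm = TRUE)
  iqr <- q[3] - q[2]
  # Values below the lower fence are set to the 5th percentile
  x[!is.na(x) & x < q[2] - 1.5 * iqr] <- q[1]
  # Values above the upper fence are set to the 95th percentile
  x[!is.na(x) & x > q[3] + 1.5 * iqr] <- q[4]
  x
}

# Nineteen regular values and one extreme value
demo_x <- c(1:19, 100)
demo_capped <- demo_cap(demo_x)

# Points that a boxplot would flag before and after capping
boxplot.stats(demo_x)$out
boxplot.stats(demo_capped)$out
max(demo_capped)
```

Only the extreme value changes: it is pulled back to the 95th percentile of the original vector, while all values inside the fences are left as they were.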

## Transform

The numeric variables body mass index and height were transformed using the log10 function in R. This transformation is applied to make the distribution of the data look closer to a normal distribution. Histograms of the transformed variables are plotted using ggplot2.



```{r}
# Histogram of the log10-transformed body mass index
ggplot(data = sports_mod, mapping = aes(x = log10(body_mass))) + 
  geom_histogram(binwidth = 0.05, color="black", fill=rgb(0.2,0.7,0.1,0.8))

```

```{r}
# Histogram of height before the transformation
ggplot(data = sports_mod, mapping = aes(x = Height)) + 
  geom_histogram(binwidth = 1, color="black", fill=rgb(0.7,0.4,0.7,0.9))

```

```{r}
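# Histogram of the log10-transformed height; no variable named `log` is created,
# so the base function log() is not masked
ggplot(data = sports_mod, mapping = aes(x = log10(Height))) + 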
  geom_histogram(binwidth = 0.01,color="black", fill=rgb(0.8,0.3,0.3,0.8) )

```
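
The effect of a log10 transformation on a right-skewed variable can be seen on a handful of numbers. The vector `demo_values` is hypothetical, and the gap between mean and median is used here only as a rough indicator of skew, not as a formal test of normality.

```{r}
# Hypothetical right-skewed values, for illustration only
demo_values <- c(1, 10, 100, 1000, 10000)
demo_log <- log10(demo_values)
demo_log

# Before: the mean is pulled far above the median by the large values
mean(demo_values) - median(demo_values)

# After: mean and median coincide for these values
mean(demo_log) - median(demo_log)
```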

## Conclusion

The Olympic games dataset was mostly tidy, but some variables were untidy and new variables had to be created to analyse the events more accurately. Various functions from packages such as tidyr, dplyr and lubridate were used to tidy the dataset. The numeric variables were transformed and plotted to understand and analyse the Olympic dataset.

<br>
<br>
