MATH2349 Semester 1, 2020

Required packages

# This is the R chunk for the required packages
library(readr)
library(outliers)
library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:dplyr':
## 
##     intersect, setdiff, union

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(editrules)

## Loading required package: igraph

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:lubridate':
## 
##     %--%, union

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following object is masked from 'package:tidyr':
## 
##     crossing

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

## 
## Attaching package: 'editrules'

## The following objects are masked from 'package:igraph':
## 
##     blocks, normalize

## The following object is masked from 'package:dplyr':
## 
##     contains

## The following objects are masked from 'package:tidyr':
## 
##     contains, separate

Executive Summary

This dataset consists of information of athletes taking part in the Olympic Games in Rio de Janeiro 2016 and the second dataset consist of information of countries with details of GDP per capita. Initially, the column name nationality is changed to code. Then merge both the datasets of athletes and countries as one dataset by using the similar column name code. Dataset has only unique data for further analysis. The “Sex” variable is converted from character to factor for further analysis. Summarising the Rio dataset.Before tidy the dataset Checked if there are any missing values in the dataset for each column and action taken for missing values. Using boxplot outliers are determined for every numeric variable and certain action are taken for the outliers on basis of count of outliers. Used gather function to tidy the data. Calculated the BMI of each player and combined it with the dataset using mutate function to check the fitness for each athlete. . Histogram for BMI is graphed and seen that it was right-skewed so the log is calculated to remove the right skewness in data and transform it into a normal distribution.

Data

Data is taken from kaggle:https://www.kaggle.com/rio2016/olympic-games?select=athletes.csv. The dataset consists of three files athletes, country, and events. Athletes dataset consists of data of players in the country they are playing for and the medals won by the player in the 2016 Olympics. Country data consist of the country name, code, population, and GDP per capita. The events dataset consist of the 2016 Rio Olympics events and venues.

# This is the R chunk for the Data Section

athlete <- read_csv("/Users/shubhamchougule/Downloads/datasets_583_1212_athletes (2).csv")

## Parsed with column specification:
## cols(
##   id = col_double(),
##   name = col_character(),
##   nationality = col_character(),
##   sex = col_character(),
##   dob = col_character(),
##   height = col_double(),
##   weight = col_double(),
##   sport = col_character(),
##   gold = col_double(),
##   silver = col_double(),
##   bronze = col_double()
## )

head(athlete)

countries <- read_csv("/Users/shubhamchougule/Downloads/datasets_583_1212_countries (1).csv")

## Parsed with column specification:
## cols(
##   country = col_character(),
##   code = col_character(),
##   population = col_double(),
##   gdp_per_capita = col_double()
## )

head(countries)

athlete<-athlete %>%rename(code=nationality)

rio<-merge(athlete,countries,by="code",all = TRUE,no.dups = TRUE)
rio<-rio%>%select(-c(population))
rio<-rio%>%filter(gdp_per_capita!="NA" & dob!="NA")
head(rio)

Understand

The dataset is combined using the column code that is common in both the tables and only unique values are considered from the dataset. After merge the dataset, convert sex column from character to factor. Summarising the dataset to see the data types and summary for each column in the combined dataset.

# This is the R chunk for the Understand Section
rio$sex<-as.factor(rio$sex)
rio<-unique(rio)
dim(rio)

## [1] 10721    13

summary(rio)

##      code                 id                name               sex      
##  Length:10721       Min.   :    18347   Length:10721       female:4866  
##  Class :character   1st Qu.:244862813   Class :character   male  :5855  
##  Mode  :character   Median :498918427   Mode  :character                
##                     Mean   :499157601                                   
##                     3rd Qu.:751514676                                   
##                     Max.   :999987786                                   
##                                                                         
##      dob                height          weight          sport          
##  Length:10721       Min.   :1.210   Min.   : 31.00   Length:10721      
##  Class :character   1st Qu.:1.690   1st Qu.: 60.00   Class :character  
##  Mode  :character   Median :1.760   Median : 70.00   Mode  :character  
##                     Mean   :1.766   Mean   : 71.94                     
##                     3rd Qu.:1.840   3rd Qu.: 80.00                     
##                     Max.   :2.210   Max.   :165.00                     
##                     NA's   :292     NA's   :592                        
##       gold             silver            bronze          country         
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Length:10721      
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   Class :character  
##  Median :0.00000   Median :0.00000   Median :0.00000   Mode  :character  
##  Mean   :0.05914   Mean   :0.05783   Mean   :0.06175                     
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000                     
##  Max.   :5.00000   Max.   :2.00000   Max.   :2.00000                     
##                                                                          
##  gdp_per_capita    
##  Min.   :   277.1  
##  1st Qu.:  8027.7  
##  Median : 19222.2  
##  Mean   : 25215.8  
##  3rd Qu.: 42311.0  
##  Max.   :101450.0  
##

Scan I

Scanning every column for missing values and counting the sum of the missing values for each column. The count of NA values is more in height and weight columns so deleting the NA values is not safe so replacing the NA values in height and weight with their mean.

# This is the R chunk for the Scan I

sum(is.na(rio))

## [1] 884

colSums(is.na(rio))

##           code             id           name            sex            dob 
##              0              0              0              0              0 
##         height         weight          sport           gold         silver 
##            292            592              0              0              0 
##         bronze        country gdp_per_capita 
##              0              0              0

rio$height[is.na(rio$height)]<-mean(rio$height,na.rm=TRUE)
rio$weight[is.na(rio$weight)]<-mean(rio$weight,na.rm=TRUE)
#Rechecking the missing values
colSums(is.na(rio))

##           code             id           name            sex            dob 
##              0              0              0              0              0 
##         height         weight          sport           gold         silver 
##              0              0              0              0              0 
##         bronze        country gdp_per_capita 
##              0              0              0

Scan II

Scanning all the numeric data for checking the outliers. Outliers are identified using the boxplot. Then checking the occurrence of the outliers in the column field. For the GDP column, the number of outliers is very less that is less than 2 percent so directly removing can be safe. For height and weight, the number of outliers detected is more than 2 percent so replacing the outliers using the upper fence for the outliers greater than upper fence value and lower fence for the outliers less than lower fence value.

# This is the R chunk for the Scan II
summary(rio$gdp_per_capita)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    277.1   8027.7  19222.2  25215.8  42311.0 101450.0

bench<-42327.8+1.5*IQR(rio$gdp_per_capita)
bench

## [1] 93752.83

rio<-rio%>%filter(gdp_per_capita<93778.03)

boxplot(rio$height,main="height with outliers")

summary(rio$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.210   1.690   1.760   1.766   1.840   2.210

#Calculating upper fence and lower fence values
ht1<-1.840+1.5*IQR(rio$height)
ht2<-1.690-1.5*IQR(rio$height)
#replacing outliers with the upper fence and lower fence values
rio$height[rio$height>ht1]<-ht1
rio$height[rio$height<ht2]<-ht2


boxplot(rio$weight,main="Weight with outliers")

summary(rio$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   31.00   61.00   70.00   71.94   80.00  165.00

#Calculating upper fence and lower fence values
wt1<-80+1.5*IQR(rio$weight)
wt2<-61-1.5*IQR(rio$weight)
#replacing outliers with the upper fence and lower fence values
rio$weight[rio$weight>wt1]<-wt1
rio$weight[rio$weight<wt2]<-wt2

boxplot(rio$height,main="Height without outliers")

boxplot(rio$weight,main="Weight without outliers")

Tidy & Manipulate Data I

The aim behind the conversion of dataser is that in the dataset it is not easy to identify the count of medals for each gender using the three column gold,silver and bronze. So gather function is used to make the data tidy.

# This is the R chunk for the Tidy & Manipulate Data I 
rio<-rio%>%gather(gold,silver,bronze,key = Medal,value =count,na.rm = TRUE)

medalgender<-rio%>%group_by(sex)%>%summarise(medals=sum(count))
medalgender

medaltypegender<-rio%>%group_by(sex,Medal)%>%summarise(medals=sum(count))
medaltypegender

Tidy & Manipulate Data II

In sports, fitness of players is calculated with the help of BMI so calculating BMI for each player using the height and weight column. Through mutate function adding the BMI column to the dataset so we get an idea of the fitness of each player.

# This is the R chunk for the Tidy & Manipulate Data II 
rio<-rio%>%mutate(BMI=weight/(height^2))

Transform

For transformation, we use the BMI column. Before the transformation, we can see that graph is right-skewed. To fix this we can normally distribute the data by taking a log of data. After transformation, we can see that data is normally distributed.

# This is the R chunk for the Transform Section

hist(rio$BMI,xlab = "BMI",main=" Histogram before transformation")

logdata<-log(rio$BMI)

hist(logdata,xlab = "BMI",main=" Histogram after transformation")