# This is the R chunk for the required packages
library(readr)
library(outliers)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:dplyr':
##
## intersect, setdiff, union
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(editrules)
## Loading required package: igraph
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:lubridate':
##
## %--%, union
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following object is masked from 'package:tidyr':
##
## crossing
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
##
## Attaching package: 'editrules'
## The following objects are masked from 'package:igraph':
##
## blocks, normalize
## The following object is masked from 'package:dplyr':
##
## contains
## The following objects are masked from 'package:tidyr':
##
## contains, separate
This dataset consists of information of athletes taking part in the Olympic Games in Rio de Janeiro 2016 and the second dataset consist of information of countries with details of GDP per capita. Initially, the column name nationality is changed to code. Then merge both the datasets of athletes and countries as one dataset by using the similar column name code. Dataset has only unique data for further analysis. The “Sex” variable is converted from character to factor for further analysis. Summarising the Rio dataset.Before tidy the dataset Checked if there are any missing values in the dataset for each column and action taken for missing values. Using boxplot outliers are determined for every numeric variable and certain action are taken for the outliers on basis of count of outliers. Used gather function to tidy the data. Calculated the BMI of each player and combined it with the dataset using mutate function to check the fitness for each athlete. . Histogram for BMI is graphed and seen that it was right-skewed so the log is calculated to remove the right skewness in data and transform it into a normal distribution.
Data is taken from kaggle:https://www.kaggle.com/rio2016/olympic-games?select=athletes.csv. The dataset consists of three files athletes, country, and events. Athletes dataset consists of data of players in the country they are playing for and the medals won by the player in the 2016 Olympics. Country data consist of the country name, code, population, and GDP per capita. The events dataset consist of the 2016 Rio Olympics events and venues.
# This is the R chunk for the Data Section
athlete <- read_csv("/Users/shubhamchougule/Downloads/datasets_583_1212_athletes (2).csv")
## Parsed with column specification:
## cols(
## id = col_double(),
## name = col_character(),
## nationality = col_character(),
## sex = col_character(),
## dob = col_character(),
## height = col_double(),
## weight = col_double(),
## sport = col_character(),
## gold = col_double(),
## silver = col_double(),
## bronze = col_double()
## )
head(athlete)
countries <- read_csv("/Users/shubhamchougule/Downloads/datasets_583_1212_countries (1).csv")
## Parsed with column specification:
## cols(
## country = col_character(),
## code = col_character(),
## population = col_double(),
## gdp_per_capita = col_double()
## )
head(countries)
athlete<-athlete %>%rename(code=nationality)
rio<-merge(athlete,countries,by="code",all = TRUE,no.dups = TRUE)
rio<-rio%>%select(-c(population))
rio<-rio%>%filter(gdp_per_capita!="NA" & dob!="NA")
head(rio)
The dataset is combined using the column code that is common in both the tables and only unique values are considered from the dataset. After merge the dataset, convert sex column from character to factor. Summarising the dataset to see the data types and summary for each column in the combined dataset.
# This is the R chunk for the Understand Section
rio$sex<-as.factor(rio$sex)
rio<-unique(rio)
dim(rio)
## [1] 10721 13
summary(rio)
## code id name sex
## Length:10721 Min. : 18347 Length:10721 female:4866
## Class :character 1st Qu.:244862813 Class :character male :5855
## Mode :character Median :498918427 Mode :character
## Mean :499157601
## 3rd Qu.:751514676
## Max. :999987786
##
## dob height weight sport
## Length:10721 Min. :1.210 Min. : 31.00 Length:10721
## Class :character 1st Qu.:1.690 1st Qu.: 60.00 Class :character
## Mode :character Median :1.760 Median : 70.00 Mode :character
## Mean :1.766 Mean : 71.94
## 3rd Qu.:1.840 3rd Qu.: 80.00
## Max. :2.210 Max. :165.00
## NA's :292 NA's :592
## gold silver bronze country
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Length:10721
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 Class :character
## Median :0.00000 Median :0.00000 Median :0.00000 Mode :character
## Mean :0.05914 Mean :0.05783 Mean :0.06175
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :5.00000 Max. :2.00000 Max. :2.00000
##
## gdp_per_capita
## Min. : 277.1
## 1st Qu.: 8027.7
## Median : 19222.2
## Mean : 25215.8
## 3rd Qu.: 42311.0
## Max. :101450.0
##
Scanning every column for missing values and counting the sum of the missing values for each column. The count of NA values is more in height and weight columns so deleting the NA values is not safe so replacing the NA values in height and weight with their mean.
# This is the R chunk for the Scan I
sum(is.na(rio))
## [1] 884
colSums(is.na(rio))
## code id name sex dob
## 0 0 0 0 0
## height weight sport gold silver
## 292 592 0 0 0
## bronze country gdp_per_capita
## 0 0 0
rio$height[is.na(rio$height)]<-mean(rio$height,na.rm=TRUE)
rio$weight[is.na(rio$weight)]<-mean(rio$weight,na.rm=TRUE)
#Rechecking the missing values
colSums(is.na(rio))
## code id name sex dob
## 0 0 0 0 0
## height weight sport gold silver
## 0 0 0 0 0
## bronze country gdp_per_capita
## 0 0 0
Scanning all the numeric data for checking the outliers. Outliers are identified using the boxplot. Then checking the occurrence of the outliers in the column field. For the GDP column, the number of outliers is very less that is less than 2 percent so directly removing can be safe. For height and weight, the number of outliers detected is more than 2 percent so replacing the outliers using the upper fence for the outliers greater than upper fence value and lower fence for the outliers less than lower fence value.
# This is the R chunk for the Scan II
summary(rio$gdp_per_capita)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 277.1 8027.7 19222.2 25215.8 42311.0 101450.0
bench<-42327.8+1.5*IQR(rio$gdp_per_capita)
bench
## [1] 93752.83
rio<-rio%>%filter(gdp_per_capita<93778.03)
boxplot(rio$height,main="height with outliers")
summary(rio$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.210 1.690 1.760 1.766 1.840 2.210
#Calculating upper fence and lower fence values
ht1<-1.840+1.5*IQR(rio$height)
ht2<-1.690-1.5*IQR(rio$height)
#replacing outliers with the upper fence and lower fence values
rio$height[rio$height>ht1]<-ht1
rio$height[rio$height<ht2]<-ht2
boxplot(rio$weight,main="Weight with outliers")
summary(rio$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 31.00 61.00 70.00 71.94 80.00 165.00
#Calculating upper fence and lower fence values
wt1<-80+1.5*IQR(rio$weight)
wt2<-61-1.5*IQR(rio$weight)
#replacing outliers with the upper fence and lower fence values
rio$weight[rio$weight>wt1]<-wt1
rio$weight[rio$weight<wt2]<-wt2
boxplot(rio$height,main="Height without outliers")
boxplot(rio$weight,main="Weight without outliers")
The aim behind the conversion of dataser is that in the dataset it is not easy to identify the count of medals for each gender using the three column gold,silver and bronze. So gather function is used to make the data tidy.
# This is the R chunk for the Tidy & Manipulate Data I
rio<-rio%>%gather(gold,silver,bronze,key = Medal,value =count,na.rm = TRUE)
medalgender<-rio%>%group_by(sex)%>%summarise(medals=sum(count))
medalgender
medaltypegender<-rio%>%group_by(sex,Medal)%>%summarise(medals=sum(count))
medaltypegender
In sports, fitness of players is calculated with the help of BMI so calculating BMI for each player using the height and weight column. Through mutate function adding the BMI column to the dataset so we get an idea of the fitness of each player.
# This is the R chunk for the Tidy & Manipulate Data II
rio<-rio%>%mutate(BMI=weight/(height^2))
For transformation, we use the BMI column. Before the transformation, we can see that graph is right-skewed. To fix this we can normally distribute the data by taking a log of data. After transformation, we can see that data is normally distributed.
# This is the R chunk for the Transform Section
hist(rio$BMI,xlab = "BMI",main=" Histogram before transformation")
logdata<-log(rio$BMI)
hist(logdata,xlab = "BMI",main=" Histogram after transformation")