The main Purpose of this assignment is to try applying all data pre-processing techniques learnt during the session. In this assignment we will be using data from Olympic Sports and Medals, 1896-2014. The Data set os an public data set available on kaggle(link : https://www.kaggle.com/the-guardian/olympic-games/download).
About the data: More than 35,000 medals have been awarded at the Olympics since 1896. The first two Olympiads awarded silver medals and an olive wreath for the winner, and the IOC retrospectively awarded gold, silver, and bronze to athletes based on their rankings. This dataset includes a row for every Olympic athlete that has won a medal since the first games.
For the Purpose of this assignment we will only be using 2 files one containing information about medals won in summer olympics and other containing information about the countries.
Files and variables which we will be using in this assignment will be as follows. Summer.csv: data about athletes and medals won in different sports in summer olympics during 1896-2014 Year: year in which olympics tooks place City: city in which the games where held Sport: sports in which the athlete performed Discipline: Main Discipline of the sports.
Athlete: name of the athlete Country: Country of the athlete (country code) Gender: gender of the athlete Event: event/Match name or type Medal: type of medal won by the athlete
Dictionary.csv: Country: Country Name Code: Country Code Population: Population of the country in numbers GDP.per.Capita: GDP per capita of the Country.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(scales)
setwd("C:/Users/karan/Downloads")
Load Summer Olympics medal data:
#Importing of summer csv
summer_medals <- read.csv("summer.csv",
stringsAsFactors=FALSE, na.string = "N/A")
#Removal of summer_medals duplicate entries
summer_medals <- distinct(summer_medals)
#Preview of summer_medals dataframe
head(summer_medals)
## Year City Sport Discipline Athlete Country Gender
## 1 1896 Athens Aquatics Swimming HAJOS, Alfred HUN Men
## 2 1896 Athens Aquatics Swimming HERSCHMANN, Otto AUT Men
## 3 1896 Athens Aquatics Swimming DRIVAS, Dimitrios GRE Men
## 4 1896 Athens Aquatics Swimming MALOKINIS, Ioannis GRE Men
## 5 1896 Athens Aquatics Swimming CHASAPIS, Spiridon GRE Men
## 6 1896 Athens Aquatics Swimming CHOROPHAS, Efstathios GRE Men
## Event Medal
## 1 100M Freestyle Gold
## 2 100M Freestyle Silver
## 3 100M Freestyle For Sailors Bronze
## 4 100M Freestyle For Sailors Gold
## 5 100M Freestyle For Sailors Silver
## 6 1200M Freestyle Bronze
Load Countries data:
#Importing of Dictionary csv
countries <- read.csv("dictionary.csv",
stringsAsFactors=FALSE, na.string = "N/A")
#Removal of countries duplicate entries
countries <- distinct(countries)
#Preview of countries dataframe
head(countries)
## Country Code Population GDP.per.Capita
## 1 Afghanistan AFG 32526562 594.3231
## 2 Albania ALB 2889167 3945.2176
## 3 Algeria ALG 39666519 4206.0312
## 4 American Samoa* ASA 55538 NA
## 5 Andorra AND 70473 NA
## 6 Angola ANG 25021974 4101.4722
Merge the 2 Datasets on the basis of Country Code.
#merge summer_medal data and Countries Data on the basis of Country Code.
summer_medals.country <- merge(summer_medals, countries, by.x=c("Country"),
by.y=c("Code"))
#Preview of merged dataframe
head(summer_medals.country)
## Country Year City Sport Discipline Athlete
## 1 AFG 2008 Beijing Taekwondo Taekwondo NIKPAI, Rohullah
## 2 AFG 2012 London Taekwondo Taekwondo NIKPAI, Rohullah
## 3 AHO 1988 Seoul Sailing Sailing BOERSMA, Jan D.
## 4 ALG 2000 Sydney Athletics Athletics HAMMAD, Abderrahmane
## 5 ALG 1992 Barcelona Boxing Boxing SOLTANI, Hocine
## 6 ALG 1984 Los Angeles Boxing Boxing MOUSSA, Mustapha
## Gender Event Medal Country.y
## 1 Men - 58 KG Bronze Afghanistan
## 2 Men 58 - 68 KG Bronze Afghanistan
## 3 Men Board (Division Ii) Silver Netherlands Antilles*
## 4 Men High Jump Bronze Algeria
## 5 Men 54 - 57KG (Featherweight) Bronze Algeria
## 6 Men 75 - 81KG (Light-Heavyweight) Bronze Algeria
## Population GDP.per.Capita
## 1 32526562 594.3231
## 2 32526562 594.3231
## 3 NA NA
## 4 39666519 4206.0312
## 5 39666519 4206.0312
## 6 39666519 4206.0312
Understand the data is one of key requirement of Data Preprocessing. For this purpose we need to to check the type of each variable and convert them to the correct variable type if needed.
#Checking movie structure and variable type
str(summer_medals.country)
## 'data.frame': 25740 obs. of 12 variables:
## $ Country : chr "AFG" "AFG" "AHO" "ALG" ...
## $ Year : int 2008 2012 1988 2000 1992 1984 2008 2000 1992 1996 ...
## $ City : chr "Beijing" "London" "Seoul" "Sydney" ...
## $ Sport : chr "Taekwondo" "Taekwondo" "Sailing" "Athletics" ...
## $ Discipline : chr "Taekwondo" "Taekwondo" "Sailing" "Athletics" ...
## $ Athlete : chr "NIKPAI, Rohullah" "NIKPAI, Rohullah" "BOERSMA, Jan D." "HAMMAD, Abderrahmane" ...
## $ Gender : chr "Men" "Men" "Men" "Men" ...
## $ Event : chr "- 58 KG" "58 - 68 KG" "Board (Division Ii)" "High Jump" ...
## $ Medal : chr "Bronze" "Bronze" "Silver" "Bronze" ...
## $ Country.y : chr "Afghanistan" "Afghanistan" "Netherlands Antilles*" "Algeria" ...
## $ Population : int 32526562 32526562 NA 39666519 39666519 39666519 39666519 39666519 39666519 39666519 ...
## $ GDP.per.Capita: num 594 594 NA 4206 4206 ...
After this step we need to tidy and manipulate data as required and as per the need of further analysis.
The dataset that we constructed using the Summer Olympics Medals and Countries datasets is believed to already be tidy. The reason is because:
Each variable is in a separate column Each observation is in a separate row
Through the use of the mutate() function to convert Gender and Medal variables into Factor. After this step we will check the levels of the factor in these variables and reorder them if required.
summer_medals.country <- summer_medals.country %>%
mutate(Gender=factor(Gender),Medal=factor(Medal))
#Looking at movie rated classifcation levels
levels(summer_medals.country$Gender)
## [1] "Men" "Women"
#Looking at movie rated classifcation levels
levels(summer_medals.country$Medal)
## [1] "Bronze" "Gold" "Silver"
#Rearrange movie rated classification based on appropriate audience age
summer_medals.country$Medal <- factor(summer_medals.country$Medal,
levels = c("Bronze","Silver","Gold"),
ordered = TRUE)
#Rechecking levels of movie rate classification after re-order
levels(summer_medals.country$Medal)
## [1] "Bronze" "Silver" "Gold"
Data Scan function that will detect for missing values, inconsistencies or obvious errors in dataset. We also commenced the scan for numeric data for outliers using boxplot $out function. Final part will be the transform at least one variable in dataset. for this we g=have decided to rescale the population variable in the range of 0 to 1.
To figure out if there was any missing values in the dataset we used the any(is.na()) function. We investigated this problem further using the sum(is.na()) and colSums(is.na()) to determine number of missing values throughout the dataset. To address the issue of the missing data we decided to remove all observations that had missing values using the na.omit() function.
any(is.na(summer_medals.country))
## [1] TRUE
sum(is.na(summer_medals.country))
## [1] 646
colSums(is.na(summer_medals.country))
## Country Year City Sport Discipline
## 0 0 0 0 0
## Athlete Gender Event Medal Country.y
## 0 0 0 0 0
## Population GDP.per.Capita
## 46 600
# Removal of rows with NA
summer_medals.country <- na.omit(summer_medals.country)
# Recheck for NA's
any(is.na(summer_medals.country))
## [1] FALSE
To check for outliers we produced boxplots for all numeric variables in our dataset. Based on first impressions from the boxplots it appears our dataset is believed to contain outliers in population variable. while the GDP variable is belived to have no outliers.
# Boxplot of numeric variable
summer_medals.country$GDP.per.Capita %>% boxplot(main="Box Plot of summer_medals",
ylab="GDP ($)", col = "grey")
summer_medals.country$Population %>% boxplot(main="Box Plot of summer_medals",
ylab="Population", col = "grey")
## Warning in x[floor(d)] + x[ceiling(d)]: NAs produced by integer overflow
By running the above analysis we found that there are some outliers in variable population. Through the box plot it is clear that variable GDP.per.Capita does not have any outliers present.
We have created a function named capping. this will replace the outliers with their corresponding cap values. And in this way we are able to remove univariate outlires in the column population of our data.
capping <- function(x){
quantiles <- quantile( x, c(.05, .95 ),na.rm=TRUE )
x[ x < quantiles[1] ] <- quantiles[1]
x[ x > quantiles[2] ] <- quantiles[2]
x
}
summer_medals.country$Population <- capping( summer_medals.country$Population )
summer_medals.country$Population %>% boxplot(main="Box Plot of summer_medals",
ylab="Population", col = "grey")
As We can see, now all outliers present in the Population have been removed.
For this step we have decided to Rescale the Column named Population in the Range of 0 to 1 for better undersatnding and analysis of the dataset.
summer_medals.country$Population.scaled <- rescale(summer_medals.country$Population)
head(summer_medals.country$Population.scaled)
## [1] 0.08642841 0.08642841 0.10900728 0.10900728 0.10900728 0.10900728
The Dataset is available for Download from Kaggle. Link: https://www.kaggle.com/the-guardian/olympic-games/
This Data was provided by the IOC Research and Reference Service and published by The Guardian’s Datablog.