Matrh 2349 Assignment 3 Karan Singh (S3801462) & Harsh (S3807649) & Gaurav Diwan (S3799691)

Executive Statement

The main Purpose of this assignment is to try applying all data pre-processing techniques learnt during the session. In this assignment we will be using data from Olympic Sports and Medals, 1896-2014. The Data set os an public data set available on kaggle(link : https://www.kaggle.com/the-guardian/olympic-games/download).

About the data: More than 35,000 medals have been awarded at the Olympics since 1896. The first two Olympiads awarded silver medals and an olive wreath for the winner, and the IOC retrospectively awarded gold, silver, and bronze to athletes based on their rankings. This dataset includes a row for every Olympic athlete that has won a medal since the first games.

For the Purpose of this assignment we will only be using 2 files one containing information about medals won in summer olympics and other containing information about the countries.

Files and variables which we will be using in this assignment will be as follows. Summer.csv: data about athletes and medals won in different sports in summer olympics during 1896-2014 Year: year in which olympics tooks place City: city in which the games where held Sport: sports in which the athlete performed Discipline: Main Discipline of the sports.
Athlete: name of the athlete Country: Country of the athlete (country code) Gender: gender of the athlete Event: event/Match name or type Medal: type of medal won by the athlete

Dictionary.csv: Country: Country Name Code: Country Code Population: Population of the country in numbers GDP.per.Capita: GDP per capita of the Country.

Load Packages

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(scales)

Set Working Directory

setwd("C:/Users/karan/Downloads")

Load Data

Load Summer Olympics medal data:

#Importing of summer csv
summer_medals <- read.csv("summer.csv", 
                  stringsAsFactors=FALSE, na.string = "N/A")

#Removal of summer_medals duplicate entries
summer_medals <- distinct(summer_medals)

#Preview of summer_medals dataframe
head(summer_medals)

##   Year   City    Sport Discipline               Athlete Country Gender
## 1 1896 Athens Aquatics   Swimming         HAJOS, Alfred     HUN    Men
## 2 1896 Athens Aquatics   Swimming      HERSCHMANN, Otto     AUT    Men
## 3 1896 Athens Aquatics   Swimming     DRIVAS, Dimitrios     GRE    Men
## 4 1896 Athens Aquatics   Swimming    MALOKINIS, Ioannis     GRE    Men
## 5 1896 Athens Aquatics   Swimming    CHASAPIS, Spiridon     GRE    Men
## 6 1896 Athens Aquatics   Swimming CHOROPHAS, Efstathios     GRE    Men
##                        Event  Medal
## 1             100M Freestyle   Gold
## 2             100M Freestyle Silver
## 3 100M Freestyle For Sailors Bronze
## 4 100M Freestyle For Sailors   Gold
## 5 100M Freestyle For Sailors Silver
## 6            1200M Freestyle Bronze

Load Countries data:

#Importing of Dictionary csv
countries <- read.csv("dictionary.csv", 
                  stringsAsFactors=FALSE, na.string = "N/A")

#Removal of countries duplicate entries
countries <- distinct(countries)

#Preview of countries dataframe
head(countries)

##           Country Code Population GDP.per.Capita
## 1     Afghanistan  AFG   32526562       594.3231
## 2         Albania  ALB    2889167      3945.2176
## 3         Algeria  ALG   39666519      4206.0312
## 4 American Samoa*  ASA      55538             NA
## 5         Andorra  AND      70473             NA
## 6          Angola  ANG   25021974      4101.4722

Merge the 2 Datasets on the basis of Country Code.

#merge summer_medal data and Countries Data on the basis of Country Code. 
summer_medals.country <-  merge(summer_medals, countries, by.x=c("Country"),
    by.y=c("Code"))

#Preview of merged dataframe
head(summer_medals.country)

##   Country Year        City     Sport Discipline              Athlete
## 1     AFG 2008     Beijing Taekwondo  Taekwondo     NIKPAI, Rohullah
## 2     AFG 2012      London Taekwondo  Taekwondo     NIKPAI, Rohullah
## 3     AHO 1988       Seoul   Sailing    Sailing      BOERSMA, Jan D.
## 4     ALG 2000      Sydney Athletics  Athletics HAMMAD, Abderrahmane
## 5     ALG 1992   Barcelona    Boxing     Boxing      SOLTANI, Hocine
## 6     ALG 1984 Los Angeles    Boxing     Boxing     MOUSSA, Mustapha
##   Gender                         Event  Medal             Country.y
## 1    Men                       - 58 KG Bronze           Afghanistan
## 2    Men                    58 - 68 KG Bronze           Afghanistan
## 3    Men           Board (Division Ii) Silver Netherlands Antilles*
## 4    Men                     High Jump Bronze               Algeria
## 5    Men     54 - 57KG (Featherweight) Bronze               Algeria
## 6    Men 75 - 81KG (Light-Heavyweight) Bronze               Algeria
##   Population GDP.per.Capita
## 1   32526562       594.3231
## 2   32526562       594.3231
## 3         NA             NA
## 4   39666519      4206.0312
## 5   39666519      4206.0312
## 6   39666519      4206.0312

Understand Dataset.

Understand the data is one of key requirement of Data Preprocessing. For this purpose we need to to check the type of each variable and convert them to the correct variable type if needed.

#Checking movie structure and variable type
str(summer_medals.country)

## 'data.frame':    25740 obs. of  12 variables:
##  $ Country       : chr  "AFG" "AFG" "AHO" "ALG" ...
##  $ Year          : int  2008 2012 1988 2000 1992 1984 2008 2000 1992 1996 ...
##  $ City          : chr  "Beijing" "London" "Seoul" "Sydney" ...
##  $ Sport         : chr  "Taekwondo" "Taekwondo" "Sailing" "Athletics" ...
##  $ Discipline    : chr  "Taekwondo" "Taekwondo" "Sailing" "Athletics" ...
##  $ Athlete       : chr  "NIKPAI, Rohullah" "NIKPAI, Rohullah" "BOERSMA, Jan D." "HAMMAD, Abderrahmane" ...
##  $ Gender        : chr  "Men" "Men" "Men" "Men" ...
##  $ Event         : chr  "- 58 KG" "58 - 68 KG" "Board (Division Ii)" "High Jump" ...
##  $ Medal         : chr  "Bronze" "Bronze" "Silver" "Bronze" ...
##  $ Country.y     : chr  "Afghanistan" "Afghanistan" "Netherlands Antilles*" "Algeria" ...
##  $ Population    : int  32526562 32526562 NA 39666519 39666519 39666519 39666519 39666519 39666519 39666519 ...
##  $ GDP.per.Capita: num  594 594 NA 4206 4206 ...

After this step we need to tidy and manipulate data as required and as per the need of further analysis.

Tidy & Manipulate Data I

The dataset that we constructed using the Summer Olympics Medals and Countries datasets is believed to already be tidy. The reason is because:

Each variable is in a separate column Each observation is in a separate row

Tidy & Manipulate Data II

Through the use of the mutate() function to convert Gender and Medal variables into Factor. After this step we will check the levels of the factor in these variables and reorder them if required.

summer_medals.country <- summer_medals.country %>% 
  mutate(Gender=factor(Gender),Medal=factor(Medal))
  
#Looking at movie rated classifcation levels
levels(summer_medals.country$Gender)

## [1] "Men"   "Women"

#Looking at movie rated classifcation levels
levels(summer_medals.country$Medal)

## [1] "Bronze" "Gold"   "Silver"

#Rearrange movie rated classification based on appropriate audience age
summer_medals.country$Medal <- factor(summer_medals.country$Medal, 
                    levels = c("Bronze","Silver","Gold"), 
                    ordered = TRUE)

#Rechecking levels of movie rate classification after re-order
levels(summer_medals.country$Medal)

## [1] "Bronze" "Silver" "Gold"

Scan and Transform Dataset.

Data Scan function that will detect for missing values, inconsistencies or obvious errors in dataset. We also commenced the scan for numeric data for outliers using boxplot $out function. Final part will be the transform at least one variable in dataset. for this we g=have decided to rescale the population variable in the range of 0 to 1.

Scan I

To figure out if there was any missing values in the dataset we used the any(is.na()) function. We investigated this problem further using the sum(is.na()) and colSums(is.na()) to determine number of missing values throughout the dataset. To address the issue of the missing data we decided to remove all observations that had missing values using the na.omit() function.

any(is.na(summer_medals.country))

## [1] TRUE

sum(is.na(summer_medals.country))

## [1] 646

colSums(is.na(summer_medals.country))

##        Country           Year           City          Sport     Discipline 
##              0              0              0              0              0 
##        Athlete         Gender          Event          Medal      Country.y 
##              0              0              0              0              0 
##     Population GDP.per.Capita 
##             46            600

# Removal of rows with NA
summer_medals.country <- na.omit(summer_medals.country)

# Recheck for NA's
any(is.na(summer_medals.country))

## [1] FALSE

Scan II

To check for outliers we produced boxplots for all numeric variables in our dataset. Based on first impressions from the boxplots it appears our dataset is believed to contain outliers in population variable. while the GDP variable is belived to have no outliers.

# Boxplot of numeric variable
summer_medals.country$GDP.per.Capita %>%  boxplot(main="Box Plot of summer_medals", 
                               ylab="GDP ($)", col = "grey")

summer_medals.country$Population %>%  boxplot(main="Box Plot of summer_medals", 
                               ylab="Population", col = "grey")

## Warning in x[floor(d)] + x[ceiling(d)]: NAs produced by integer overflow

By running the above analysis we found that there are some outliers in variable population. Through the box plot it is clear that variable GDP.per.Capita does not have any outliers present.

We have created a function named capping. this will replace the outliers with their corresponding cap values. And in this way we are able to remove univariate outlires in the column population of our data.

capping <- function(x){
    quantiles <- quantile( x, c(.05, .95 ),na.rm=TRUE )
    x[ x < quantiles[1] ] <- quantiles[1]
    x[ x > quantiles[2] ] <- quantiles[2]
    x
}

summer_medals.country$Population <- capping( summer_medals.country$Population )

summer_medals.country$Population %>%  boxplot(main="Box Plot of summer_medals", 
                               ylab="Population", col = "grey")

As We can see, now all outliers present in the Population have been removed.

Transform Data

For this step we have decided to Rescale the Column named Population in the Range of 0 to 1 for better undersatnding and analysis of the dataset.

summer_medals.country$Population.scaled <- rescale(summer_medals.country$Population)

head(summer_medals.country$Population.scaled)

## [1] 0.08642841 0.08642841 0.10900728 0.10900728 0.10900728 0.10900728

References

The Dataset is available for Download from Kaggle. Link: https://www.kaggle.com/the-guardian/olympic-games/

This Data was provided by the IOC Research and Reference Service and published by The Guardian’s Datablog.