library(readr)
library(readxl)
library(foreign)
library(gdata)
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
Data preprocessing is a vital aspect of readying any data for further analysis. Here, two data sets with data on the Olympic Games were used to demonstrate the steps of data wrangling. The first step was to locate suitable data and read it into R. Next, the data had to be understood by checking for variable type and structure. Then, it was tidied and manipulated via gathering and separating of columns to ready the two data sets for merging. After merging, the data set was scanned for missing values and outliers, as well manipulated further in the form of mutating columns. Lastly, some variables were transformed to reduce skewness. Now, the data is ready for exploration and analysis.
The first data set looks at various data from all previous Summer Olympic Games. The URL is https://www.sportsvizsunday.com/olympics-winter-olympics, taken from the Sports Viz Sundays website. It includes multiple excel sheets on world record times, host cities, all-time medals by country, and sport history. The particular data set explored in this assignment is about Host Cities of Summer Olympics. There are 12 variables, listed below:
This data set was read into R using readxl package:
games <- read_xlsx("Olympics.xlsx", sheet = "Host Cities of Summer Olympics")
print.data.frame(head(games))
Games Year Host Opened by Dates Nations Competitors: Total Competitors: Men
1 I 1896 Athens, Greece King George I 6–15 April 14 241 241
2 II 1900 Paris, France N/A 14 May – 28 October 24 997 975
3 III 1904 St. Louis, United States Former Governor David R. Francis 1 July – 23 November 12 651 645
4 IV 1908 London, United Kingdom King Edward VII 27 April – 31 October 22 2008 1971
5 V 1912 Stockholm, Sweden King Gustaf V 6–22 July 28 2407 2359
6 VI 1916 Berlin, Germany <NA> <NA> NA NA NA
Competitors: Women Sports Events Notes
1 0 9 43 <NA>
2 22 19 85 <NA>
3 6 16 94 <NA>
4 37 22 110 <NA>
5 48 14 102 <NA>
6 NA NA NA Awarded to Berlin, cancelled due to World War I
Irrelevant variables, Opened by, Dates, and Notes were immediately dropped as they will not be needed for any of the upcoming pre-processing.
#Drop irrelevant variables
games1 <- games[-c(4,5,12)]
The second data set looks at the out-turn of sports-related costs of each Summer and Winter Olympic Games. The data set was found on data.world and is taken from a 2016 Oxford Olympics study. The URL is https://data.world/sports/olympics. The 7 variables are as follows:
This data set was read into R using Readr package:
costs <- read_csv("outturn sports-related costs of the Olympic Games .csv")
Parsed with column specification:
cols(
Games = col_character(),
Year = col_double(),
Country = col_character(),
Type = col_character(),
Events = col_double(),
Athletes = col_double(),
`Cost, Billion USD` = col_character()
)
print.data.frame(head(costs))
Games Year Country Type Events Athletes Cost, Billion USD
1 Rome 1960 Italy Summer 150 5338 n/a
2 Tokyo 1964 Japan Summer 163 5152 0.282
3 Mexico City 1968 Mexico Summer 172 5516 n/a**
4 Munich 1972 Germany Summer 195 7234 1.009
5 Montreal 1976 Canada Summer 198 6048 6.093
6 Moscow 1980 Soviet Union Summer 203 5179 6.331
Names of some columns were changed with dplyr (Kassambara n.d.) to simplify further pre-processing steps:
# changing column names in games data set
games1 <- games1 %>% rename(
Number = Games,
Male = `Competitors: Men`,
Female = `Competitors: Women`,
'Total Competitors' = `Competitors: Total`
)
# changing column names in costs data set
costs <- costs %>% rename(
City = Games,
'Cost in Billions (USD)' = `Cost, Billion USD`
)
The tidy data principles that each observation forms a row, each variable forms a column, and each value has its own cell (Wickham, 2014) are violated in the games data set, making it untidy. Firstly, the variable about number of competitors is split into men and women columns. Secondly, the host column has two variables in it: city and country.
# Tidy competitor columns so gender forms one column and number of competitors forms another
games2 <- games1 %>% gather('Male', 'Female', key = "Gender", value = "No. of Competitors")
# Tidy host column to separate into city and country
games3 <- games2 %>% separate_('Host', into = c("City", "Country"), sep = ",")
# Rearrange columns
colnames(games3)
[1] "Number" "Year" "City" "Country" "Nations" "Total Competitors"
[7] "Sports" "Events" "Gender" "No. of Competitors"
games3 <- games3[,c(1,2,3,4,5,9,10,6,7,8)]
colnames(games3)
[1] "Number" "Year" "City" "Country" "Nations" "Gender"
[7] "No. of Competitors" "Total Competitors" "Sports" "Events"
The costs data set is already in a tidy format, so does not need any reshaping.
The structure of games data set was checked as shown below and data type conversions made:
# Check structure of games3
str(games3)
tibble [68 x 10] (S3: tbl_df/tbl/data.frame)
$ Number : chr [1:68] "I" "II" "III" "IV" ...
$ Year : num [1:68] 1896 1900 1904 1908 1912 ...
$ City : chr [1:68] "Athens" "Paris" "St. Louis" "London" ...
$ Country : chr [1:68] " Greece" " France" " United States" " United Kingdom" ...
$ Nations : num [1:68] 14 24 12 22 28 NA 29 44 46 37 ...
$ Gender : chr [1:68] "Male" "Male" "Male" "Male" ...
$ No. of Competitors: num [1:68] 241 975 645 1971 2359 ...
$ Total Competitors : num [1:68] 241 997 651 2008 2407 ...
$ Sports : num [1:68] 9 19 16 22 14 NA 22 17 14 14 ...
$ Events : num [1:68] 43 85 94 110 102 NA 156 126 109 117 ...
# Convert Gender to factor variable
games3$Gender <- games3$Gender %>% factor(levels = c("Male", "Female"))
class(games3$Gender)
[1] "factor"
levels(games3$Gender)
[1] "Male" "Female"
# Convert Year to character variable
games3$Year<- as.character(games3$Year)
class(games3$Year)
[1] "character"
# Arranged by Year
games3 %>% arrange(Year)
The structure of costs data set was checked as shown below and data type conversions made:
# Check structure of costs
str(costs)
tibble [30 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ City : chr [1:30] "Rome" "Tokyo" "Mexico City" "Munich" ...
$ Year : num [1:30] 1960 1964 1968 1972 1976 ...
$ Country : chr [1:30] "Italy" "Japan" "Mexico" "Germany" ...
$ Type : chr [1:30] "Summer" "Summer" "Summer" "Summer" ...
$ Events : num [1:30] 150 163 172 195 198 203 221 237 257 271 ...
$ Athletes : num [1:30] 5338 5152 5516 7234 6048 ...
$ Cost in Billions (USD): chr [1:30] "n/a" "0.282" "n/a**" "1.009" ...
- attr(*, "spec")=
.. cols(
.. Games = col_character(),
.. Year = col_double(),
.. Country = col_character(),
.. Type = col_character(),
.. Events = col_double(),
.. Athletes = col_double(),
.. `Cost, Billion USD` = col_character()
.. )
# Convert Year to character variable
costs$Year<- as.character(costs$Year)
class(costs$Year)
[1] "character"
# Convert Type to factor variable
costs$Type <- costs$Type %>% factor(levels = c("Summer", "Winter"))
class(costs$Type)
[1] "factor"
levels(costs$Type)
[1] "Summer" "Winter"
# Convert Cost in Billions (USD) to numeric variable
costs$`Cost in Billions (USD)` <- as.double(costs$`Cost in Billions (USD)`)
NAs introduced by coercion
class(costs$`Cost in Billions (USD)`)
[1] "numeric"
str(costs)
tibble [30 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ City : chr [1:30] "Rome" "Tokyo" "Mexico City" "Munich" ...
$ Year : chr [1:30] "1960" "1964" "1968" "1972" ...
$ Country : chr [1:30] "Italy" "Japan" "Mexico" "Germany" ...
$ Type : Factor w/ 2 levels "Summer","Winter": 1 1 1 1 1 1 1 1 1 1 ...
$ Events : num [1:30] 150 163 172 195 198 203 221 237 257 271 ...
$ Athletes : num [1:30] 5338 5152 5516 7234 6048 ...
$ Cost in Billions (USD): num [1:30] NA 0.282 NA 1.009 6.093 ...
- attr(*, "spec")=
.. cols(
.. Games = col_character(),
.. Year = col_double(),
.. Country = col_character(),
.. Type = col_character(),
.. Events = col_double(),
.. Athletes = col_double(),
.. `Cost, Billion USD` = col_character()
.. )
As the games data set only deals with Summer Olympic Games, the costs data set was filtered to only include observations about Summer Olympics:
costs1 <- costs %>% filter(Type == 'Summer')
head(costs1)
Remove Gender and No. of Competitors variables from games data set as they will not be needed
games4 <- games3[-c(6,7)]
games4 <- distinct(games4)
head(games4)
Renamed observation, West Germany, in games data set to match name in costs data set:
games4[20, "Country"] <- "Germany"
The two data sets, games and costs, were merged on the common Year variable using left_join to keep all observations in games and add the matching rows in costs.
olympic <- left_join(games4, costs1, by = "Year")
print.data.frame(head(olympic))
Number Year City.x Country.x Nations Total Competitors Sports Events.x City.y Country.y Type Events.y Athletes
1 I 1896 Athens Greece 14 241 9 43 <NA> <NA> <NA> NA NA
2 II 1900 Paris France 24 997 19 85 <NA> <NA> <NA> NA NA
3 III 1904 St. Louis United States 12 651 16 94 <NA> <NA> <NA> NA NA
4 IV 1908 London United Kingdom 22 2008 22 110 <NA> <NA> <NA> NA NA
5 V 1912 Stockholm Sweden 28 2407 14 102 <NA> <NA> <NA> NA NA
6 VI 1916 Berlin Germany NA NA NA NA <NA> <NA> <NA> NA NA
Cost in Billions (USD)
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
Scanned for missing values in data set per column and row. Then, narrowed search to Cost in Billions (USD) column and removed all observations that are missing a value in this column. This column will be needed for later tasks, so it is important that there is a value here. Lastly, checked that all observation were not missing any values.
# Scanning for missing values per column
colSums(is.na(olympic))
Number Year City.x Country.x Nations Total Competitors
0 0 2 2 6 6
Sports Events.x City.y Country.y Type Events.y
5 5 19 19 19 19
Athletes Cost in Billions (USD)
19 22
# Scanning for missing values per row
rowSums(is.na(olympic))
[1] 6 6 6 6 6 10 6 6 6 6 6 10 10 6 6 6 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 8 12 12
# Searching for observations with a missing cost data
is.na(olympic$`Cost in Billions (USD)`)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
[23] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
# Create vector of all rows that are missing a value in the 'Cost in Billions (USD)' column
nocost <- which(is.na(olympic$`Cost in Billions (USD)`))
nocost
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 19 24 32 33 34
# Deleting all observations without cost data
olympic1 <- olympic[-nocost,]
print.data.frame(head(olympic1))
Number Year City.x Country.x Nations Total Competitors Sports Events.x City.y Country.y Type Events.y Athletes
1 XVIII 1964 Tokyo Japan 93 5151 19 163 Tokyo Japan Summer 163 5152
2 XX 1972 Munich Germany 121 7134 21 195 Munich Germany Summer 195 7234
3 XXI 1976 Montreal Canada 92 6084 21 198 Montreal Canada Summer 198 6048
4 XXII 1980 Moscow Soviet Union 80 5179 21 203 Moscow Soviet Union Summer 203 5179
5 XXIII 1984 Los Angeles United States 140 6829 21 221 Los Angeles United States Summer 221 6829
6 XXV 1992 Barcelona Spain 169 9356 25 257 Barcelona Spain Summer 257 9356
Cost in Billions (USD)
1 0.282
2 1.009
3 6.093
4 6.331
5 0.719
6 9.687
# Checking for any incomplete rows
complete.cases(olympic1)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Of the two merged data sets, one had data for Olympic Games from 1896-2028 and the other had data only from Olympic Games from 1960-2016. Information about costs of running the Olympics is only found in the data set from 1960-2016. Therefore, all observations without values in the ‘Cost in Billions (USD)’ column were deleted as they either don’t have sufficient data on them (1896-1956) or they have not occurred yet (2020-2028). This leaves a data set with no missing values.
Two columns, ‘Athletes’ and ‘Total Competitors’, represent the same variable, but have slightly different values for some observations. To combat this, a new column was created with the average of the two aforementioned columns and the previous two columns dropped.
olympic2 <- mutate(olympic1, "Total Athletes" = (Athletes + `Total Competitors`)/2, Athletes = NULL, `Total Competitors` = NULL)
print.data.frame(head(olympic2))
Number Year City.x Country.x Nations Sports Events.x City.y Country.y Type Events.y Cost in Billions (USD)
1 XVIII 1964 Tokyo Japan 93 19 163 Tokyo Japan Summer 163 0.282
2 XX 1972 Munich Germany 121 21 195 Munich Germany Summer 195 1.009
3 XXI 1976 Montreal Canada 92 21 198 Montreal Canada Summer 198 6.093
4 XXII 1980 Moscow Soviet Union 80 21 203 Moscow Soviet Union Summer 203 6.331
5 XXIII 1984 Los Angeles United States 140 21 221 Los Angeles United States Summer 221 0.719
6 XXV 1992 Barcelona Spain 169 25 257 Barcelona Spain Summer 257 9.687
Total Athletes
1 5151.5
2 7184.0
3 6066.0
4 5179.0
5 6829.0
6 9356.0
The Events, City, and Country columns are all duplicated from when the datasets were merged, so the duplicates were removed.
olympic3 <- olympic2[-c(8,9,11)]
head(olympic3)
# Renaming columns to eliminate '.x' in the name
olympic3 <- olympic3 %>% rename(
City = City.x,
Country = Country.x,
Events = Events.x)
The Athlete and Cost in Billions (USD) columns were mutated to created a new column, Cost per Athlete.
olympic4 <- mutate(olympic3, "Cost per Athlete" = `Cost in Billions (USD)` / `Total Athletes`)
print.data.frame(head(olympic4))
Number Year City Country Nations Sports Events Type Cost in Billions (USD) Total Athletes Cost per Athlete
1 XVIII 1964 Tokyo Japan 93 19 163 Summer 0.282 5151.5 5.474134e-05
2 XX 1972 Munich Germany 121 21 195 Summer 1.009 7184.0 1.404510e-04
3 XXI 1976 Montreal Canada 92 21 198 Summer 6.093 6066.0 1.004451e-03
4 XXII 1980 Moscow Soviet Union 80 21 203 Summer 6.331 5179.0 1.222437e-03
5 XXIII 1984 Los Angeles United States 140 21 221 Summer 0.719 6829.0 1.052863e-04
6 XXV 1992 Barcelona Spain 169 25 257 Summer 9.687 9356.0 1.035378e-03
# Convert units of Cost per Athlete column into millions
olympic4$`Cost per Athlete in Millions (USD)` <- olympic4$`Cost per Athlete`*1000
# Remove original Cost per Athlete column
olympic5 <- olympic4[-11]
print.data.frame(head(olympic5))
Number Year City Country Nations Sports Events Type Cost in Billions (USD) Total Athletes
1 XVIII 1964 Tokyo Japan 93 19 163 Summer 0.282 5151.5
2 XX 1972 Munich Germany 121 21 195 Summer 1.009 7184.0
3 XXI 1976 Montreal Canada 92 21 198 Summer 6.093 6066.0
4 XXII 1980 Moscow Soviet Union 80 21 203 Summer 6.331 5179.0
5 XXIII 1984 Los Angeles United States 140 21 221 Summer 0.719 6829.0
6 XXV 1992 Barcelona Spain 169 25 257 Summer 9.687 9356.0
Cost per Athlete in Millions (USD)
1 0.05474134
2 0.14045100
3 1.00445104
4 1.22243676
5 0.10528628
6 1.03537837
By viewing the data set, the most obvious location for any outliers would be the ‘Cost in Billions (USD)’ column. A summary of the column shows a difference between median and mean, as well as a maximum value (14.957) quite far outside the Q3 (6.451). Other variables were also checked for outliers. To deal with the outlier in the Cost in Billions (USD) column, the capping method was used to replace the outlier with the 95th percentile value (Dolgun, 2020).
# Summary of cost in billions
summary(olympic5$`Cost in Billions (USD)`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.282 2.459 4.792 5.213 6.451 14.957
# Searching for outliers in cost in billions
olympic5$`Cost in Billions (USD)` %>% boxplot(main = "Boxplot of Cost in Billions (USD)", ylab = "Billions (USD)")
# Searching for outliers in cost per athlete
olympic5$`Cost per Athlete in Millions (USD)` %>% boxplot(main = "Boxplot of Cost per Athlete (USD)", ylab = "Millions (USD)")
# Creating function for capping
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
olympic5$`Cost in Billions (USD)` <- olympic5$`Cost in Billions (USD)` %>% cap()
# Checking that outlier is gone
olympic5$`Cost in Billions (USD)` %>% boxplot(main = "Boxplot of Cost in Billions (USD)", ylab = "Billions (USD)")
A transformation was applied to the Cost in Billions (USD) column in attempt to normalise it. Several transformation were trialled and the most relevant one was the square root transformation. A histogram was then used to display the transformation.
The Cost per Athlete in Millions (USD) column produced a right-skewed histogram, so a square root transformation was applied to normalise it.
# histogram of Cost in Billions (USD) column
olympic5$`Cost in Billions (USD)`%>% histogram(breaks = 10, main = "Histogram of Cost in Billions (USD)")
# Log e transformation
log(olympic5$`Cost in Billions (USD)`)%>% histogram(breaks = 10)
# Reciprocal square transformation
olympic5$`Cost in Billions (USD)`^(-2)%>% histogram(breaks = 10)
# Square root transformation
olympic5$`Cost in Billions (USD)`^(1/2)%>% histogram(breaks = 10, main = "Sqaure Root of Cost in Billions (USD)")
# Square transformation
olympic5$`Cost in Billions (USD)`^(2)%>% histogram(breaks = 10)
# Histogram of Cost per Athlete in Millions (USD)
olympic5$`Cost per Athlete in Millions (USD)` %>% histogram(breaks = 10, main = "Histogram of Cost per Athlete in Millions (USD)")
# Square root of Cost per Athlete in Millions (USD)
sqrt(olympic5$`Cost per Athlete in Millions (USD)`) %>% histogram(breaks = 10, main = "Square Root Histogram of Cost per Athlete in Millions (USD)")
Dolgun, A 2020 ‘Scan: Outliers’, module notes, MATH2349, RMIT University, viewed 17 October 2020, http://rare-phoenix-161610.appspot.com/secured/Module_06.html#capping_(aka_winsorising)
Kassambara, A n.d.Rename Data Frame Columns in R, Datanovia, viewed 18 October 2020, https://www.datanovia.com/en/lessons/rename-data-frame-columns-in-r/
Wickham, H 2014 ‘Tidy Data,’ Journal of Statistical Software, vol. 59 issue 10.