Data Wrangling Assessment Task 3: Dataset challenge

Background / Course Overview

As I am intending for this assessment to form part of my eportfolio, I am also providing background information on the learning outcomes of this course for the purposes of publishing this to a broader audience.

“Real-world data are commonly incomplete, noisy, and inconsistent. This course will cover a wide range of topics designed to equip you with the skills needed to prepare all forms of untidy data for analysis. The course will cover the core concepts of data pre-processing, namely tidy data, data integration, data cleaning, data transformation, data standardisation, data discretisation, and data reduction.” RMIT MATH 2405

Assessment Brief (Final assessment)

This assignment requires the sourcing of open data and use the knowledge you’ve gained during the course to preprocess the data.

In this assignment I am demonstrating the learning outcomes (in programming language R) in MATH 2405 (Data Wrangling) specifically:

- articulate how data meets the best practice standards (e.g. tidy data principles)
- select, perform and justify data validation processes for raw datasets.
- use leading open source software (e.g. R) for reproducible, automated data processing.
- accurately, logically and ethically combine data from multiple sources to make suitable for statistical 
  analysis and draw valid interpretations.

Specifically the modules in this course cover:

- using data preprocessing tools on real world datasets
- what are the steps of data wrangling?
- the importance of data wrangling
- how to identify data, get it and export it
- understand how to check the types of the variables, dimensions
  and structure of the data, and levels/values for the variables
- knowledge of packages: tidyr, lubridate, stringr and dplyr (and others)
- tidy data principles
- cleaning data: replacing and excluding the missing values (and how and
  when to decide on appropriate imputation techniques)
- detection of outliers
- data manipulation and format conversion
- data transformation (inclusive of normalisation and standardisation)
- data discretisation
- differentiation between wide- and long-formatted data
- integrating data using join functions in the dplyr package
- creating summary statistics
- manipulating strings
- variable selection, ranking and feature-extraction techniques for data reduction
- importing date and time (in particular conversion of strings to dates)
- extraction and manipulation of dates and parts of dates
- date arithmetic
- string manipulation using the stringr package

Objectives/Learning Outcomes/Capability Development:

This course contributes to the following Program Learning Outcomes for MC004 Master of Statistics and Operations Research and MC242 Master of Analytics:

- personal and professional awareness
- the ability to contextualise outputs where data are drawn from diverse and evolving social, 
  political and cultural dimensions
- the ability to reflect on experience and improve your own future practice
- the ability to apply the principles of lifelong learning to any new challenge
- knowledge and technical competence
- an understanding of appropriate and relevant, fundamental and applied mathematical and statistical knowledge, 
  methodologies and modern computational tools
- problem-solving
- ability to bring together and flexibly apply knowledge to characterise, analyse and solve a wide range 
  of problems
- an understanding of the balance between the complexity / accuracy of the mathematical / statistical models used 
  and the timeliness of the delivery of the solution
- information literacy
- ability to locate and use data and information and evaluate its quality with respect to its authority and relevance

[remove unused at end]

Setup and install packages

# assign packages variable to packages included in this report 

packages <- c("readr", "xlsx", "readxl", "foreign", "gdata", "rvest", "dplyr", "tidyr", "tidyverse", "Hmisc", "lubridate", "outliers", "ggplot2", "magrittr", "rmarkdown", "tidyverse", "knitr", "yaml", "stringr")

## If a package is installed, it will be loaded. If any 
## are not, the missing package(s) will be installed 
## from CRAN and then loaded.

package.check <- lapply(
  packages,
  FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
      install.packages(x, dependencies = TRUE)
      library(x, character.only = TRUE)
    }
  }
)

## Loading required package: readr

## Loading required package: xlsx

## Loading required package: readxl

## Loading required package: foreign

## Loading required package: gdata

## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.

##

## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.

## 
## Attaching package: 'gdata'

## The following object is masked from 'package:stats':
## 
##     nobs

## The following object is masked from 'package:utils':
## 
##     object.size

## The following object is masked from 'package:base':
## 
##     startsWith

## Loading required package: rvest

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:readr':
## 
##     guess_encoding

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:gdata':
## 
##     combine, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: tidyr

## Loading required package: tidyverse

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ stringr 1.4.0
## ✓ tibble  3.1.5     ✓ forcats 0.5.1
## ✓ purrr   0.3.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::combine()        masks gdata::combine()
## x dplyr::filter()         masks stats::filter()
## x dplyr::first()          masks gdata::first()
## x rvest::guess_encoding() masks readr::guess_encoding()
## x purrr::keep()           masks gdata::keep()
## x dplyr::lag()            masks stats::lag()
## x dplyr::last()           masks gdata::last()

## Loading required package: Hmisc

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

## Loading required package: lubridate

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

## Loading required package: outliers

## Loading required package: magrittr

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

## Loading required package: rmarkdown

## Loading required package: knitr

## Loading required package: yaml

knitr::opts_chunk$set(
    fig.align = "center",
    message = FALSE,
    warning = FALSE
)

Introduction

This analysis relates to the recent Tokyo 2020 Summer Olympic Games held this year in August (the name honours the original pre covid intended commencement year)

It was interesting to discover on Wikipedia that the official way the IOC deem the winner of the Olympics is by the Gold Medal Tally for the participating nation and that other nation/s also consider the Total Medal Tally as one worthy of noting see this excerpt from Wikipedia (referenced at the end):

“The gold first ranking system described above is used by most of the world media, as well as the IOC. While the gold first ranking system has been used occasionally by some American media outlets, newspapers in the United States primarily publish medal tables ordered by the total number of medals won.”

Data

There are 2 datasets: medals.csv medals_total.csv (downloaded from Kaggle) that provide the results of this years Tokyo Summer Olympic Games (held this year and named as per the original intended year 2020).

There are two datasets containing different information about the number of medals won.

Both datasets include a country code and will be merged on this common variablve using a right join to show a new dataframe that contains each athlete’s medals won per event and which country they are from. I have cross refenced the medal totals with the IOC website to confirm that my data set was accurate (not included in this report but consistent if checked by the reader)

Read in the data sets

medals <- read.csv("data/medals.csv", stringsAsFactors = TRUE)
totals <- read.csv("data/medals_total.csv", stringsAsFactors = TRUE)

medals <- as_tibble(medals)
totals <- as_tibble(totals)

str(medals)

## tibble [2,401 × 12] (S3: tbl_df/tbl/data.frame)
##  $ medal_type        : Factor w/ 3 levels "Bronze Medal",..: 2 2 3 3 1 1 2 3 1 2 ...
##  $ medal_code        : int [1:2401] 1 1 2 2 3 3 1 2 3 1 ...
##  $ medal_date        : Factor w/ 16 levels "2021-07-24 00:00:00.0",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ athlete_short_name: Factor w/ 2134 levels "A RAHAYU","ABALO L",..: 918 57 1665 2028 51 1917 278 1926 1486 1810 ...
##  $ athlete_name      : Factor w/ 2174 levels "ABALO Luc","ABBOTT Monica",..: 926 59 1686 2061 53 1945 279 1954 1506 1839 ...
##  $ athlete_sex       : Factor w/ 4 levels "M","O","W","X": 4 4 4 4 4 4 1 1 1 1 ...
##  $ athlete_link      : Factor w/ 2176 levels "../../../en/results/3x3-basketball/athlete-profile-n1312756-gray-allisha.htm",..: 47 37 48 49 25 26 674 667 669 834 ...
##  $ country_code      : Factor w/ 93 levels "ARG","ARM","AUS",..: 52 52 65 65 61 61 24 7 77 39 ...
##  $ discipline_code   : Factor w/ 46 levels "ARC","ATH","BDM",..: 1 1 1 1 1 1 11 11 11 18 ...
##  $ event             : Factor w/ 298 levels "10m Air Pistol Men",..: 166 166 166 166 166 166 142 142 142 143 ...
##  $ country           : Factor w/ 93 levels "Argentina","Armenia",..: 71 71 60 60 56 56 24 9 79 38 ...
##  $ discipline        : Factor w/ 46 levels "3x3 Basketball",..: 2 2 2 2 2 2 16 16 16 20 ...

str(totals)

## tibble [93 × 7] (S3: tbl_df/tbl/data.frame)
##  $ Rank        : int [1:93] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Country.Code: Factor w/ 93 levels "ARG","ARM","AUS",..: 91 16 48 32 74 3 65 31 34 45 ...
##  $ Gold.Medal  : int [1:93] 39 38 27 22 20 17 10 10 10 10 ...
##  $ Silver.Medal: int [1:93] 41 32 14 21 28 7 12 12 11 10 ...
##  $ Bronze.Medal: int [1:93] 33 18 17 22 23 22 14 11 16 20 ...
##  $ Total       : int [1:93] 113 88 58 65 71 46 36 33 37 40 ...
##  $ Country     : Factor w/ 93 levels "Argentina","Armenia",..: 91 65 46 34 73 3 60 30 32 44 ...

print(head(totals))

## # A tibble: 6 × 7
##    Rank Country.Code Gold.Medal Silver.Medal Bronze.Medal Total Country         
##   <int> <fct>             <int>        <int>        <int> <int> <fct>           
## 1     1 USA                  39           41           33   113 United States o…
## 2     2 CHN                  38           32           18    88 People's Republ…
## 3     3 JPN                  27           14           17    58 Japan           
## 4     4 GBR                  22           21           22    65 Great Britain   
## 5     5 ROC                  20           28           23    71 ROC             
## 6     6 AUS                  17            7           22    46 Australia

print(head(medals))

## # A tibble: 6 × 12
##   medal_type   medal_code medal_date   athlete_short_n… athlete_name athlete_sex
##   <fct>             <int> <fct>        <fct>            <fct>        <fct>      
## 1 Gold Medal            1 2021-07-24 … KIM JD           KIM Je Deok  X          
## 2 Gold Medal            1 2021-07-24 … AN S             AN San       X          
## 3 Silver Medal          2 2021-07-24 … SCHLOESSER G     SCHLOESSER … X          
## 4 Silver Medal          2 2021-07-24 … WIJLER S         WIJLER Steve X          
## 5 Bronze Medal          3 2021-07-24 … ALVAREZ L        ALVAREZ Luis X          
## 6 Bronze Medal          3 2021-07-24 … VALENCIA A       VALENCIA Al… X          
## # … with 6 more variables: athlete_link <fct>, country_code <fct>,
## #   discipline_code <fct>, event <fct>, country <fct>, discipline <fct>

Country Code is common to both data sets and the below code will merge the two datasets on this variable. The medals data set Country Code column is called ‘country_code’ and the medals_total data set is called Country.Code. I am first going to change the name of the medals_total Country.Code column to match the medals data set.

# replace column names
colnames(totals) <- c("Country.Rank.by.no.Gold", "country.code", "Country.Gold.Count", "Country.Silver.Count", "Country.Bronze.Count", "Country.Total.all.medals", "Country")

colnames(medals) <- c("medal.type",         "medal.code",         "medal.date",         "athlete.short.name", "athlete.name",   "athlete.sex",        "athlete.link",       "country.code",       "discipline.code",    "event",  "country",            "discipline")

#### Right Join using merge function
#the time stamp from the date as it is not useful (and also drop the date further on once the date conversion is applied but thought it useful to carry out the conversion first for demonstration purposes only)
merged <- merge(x = medals, y = totals, all.y = TRUE)
merged$medal.date <- as.Date(merged$medal.date)
print(head(merged))

##   country.code   medal.type medal.code medal.date athlete.short.name
## 1          ARG Silver Medal          2 2021-08-06          MACCARI S
## 2          ARG Silver Medal          2 2021-08-06           MERINO D
## 3          ARG Bronze Medal          3 2021-07-28      BAZAN VELEZ L
## 4          ARG Silver Medal          2 2021-08-06           SUCCI MB
## 5          ARG Bronze Medal          3 2021-08-07         PALACIOS E
## 6          ARG Silver Medal          2 2021-08-06       FORCHERIO ME
##             athlete.name athlete.sex
## 1          MACCARI Sofia           W
## 2         MERINO Delfina           W
## 3    BAZAN VELEZ Lautaro           M
## 4      SUCCI Maria Belen           W
## 5      PALACIOS Ezequiel           M
## 6 FORCHERIO Maria Emilia           W
##                                                                        athlete.link
## 1             ../../../en/results/hockey/athlete-profile-n1466504-maccari-sofia.htm
## 2            ../../../en/results/hockey/athlete-profile-n1361692-merino-delfina.htm
## 3 ../../../en/results/rugby-sevens/athlete-profile-n1362707-bazan-velez-lautaro.htm
## 4         ../../../en/results/hockey/athlete-profile-n1361686-succi-maria-belen.htm
## 5     ../../../en/results/volleyball/athlete-profile-n1366896-palacios-ezequiel.htm
## 6    ../../../en/results/hockey/athlete-profile-n1361803-forcherio-maria-emilia.htm
##   discipline.code event   country   discipline Country.Rank.by.no.Gold
## 1             HOC Women Argentina       Hockey                      72
## 2             HOC Women Argentina       Hockey                      72
## 3             RUG   Men Argentina Rugby Sevens                      72
## 4             HOC Women Argentina       Hockey                      72
## 5             VVO   Men Argentina   Volleyball                      72
## 6             HOC Women Argentina       Hockey                      72
##   Country.Gold.Count Country.Silver.Count Country.Bronze.Count
## 1                  0                    1                    2
## 2                  0                    1                    2
## 3                  0                    1                    2
## 4                  0                    1                    2
## 5                  0                    1                    2
## 6                  0                    1                    2
##   Country.Total.all.medals   Country
## 1                        3 Argentina
## 2                        3 Argentina
## 3                        3 Argentina
## 4                        3 Argentina
## 5                        3 Argentina
## 6                        3 Argentina

The structure of the merged data set is checked using the str() function

#create dataframe of the merged data
merged <- as_tibble(merged)
print(names(merged))

##  [1] "country.code"             "medal.type"              
##  [3] "medal.code"               "medal.date"              
##  [5] "athlete.short.name"       "athlete.name"            
##  [7] "athlete.sex"              "athlete.link"            
##  [9] "discipline.code"          "event"                   
## [11] "country"                  "discipline"              
## [13] "Country.Rank.by.no.Gold"  "Country.Gold.Count"      
## [15] "Country.Silver.Count"     "Country.Bronze.Count"    
## [17] "Country.Total.all.medals" "Country"

print(str(merged))

## tibble [2,401 × 18] (S3: tbl_df/tbl/data.frame)
##  $ country.code            : Factor w/ 93 levels "ARG","ARM","AUS",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ medal.type              : Factor w/ 3 levels "Bronze Medal",..: 3 3 1 3 1 3 1 3 1 1 ...
##  $ medal.code              : int [1:2401] 2 2 3 2 3 2 3 2 3 3 ...
##  $ medal.date              : Date[1:2401], format: "2021-08-06" "2021-08-06" ...
##  $ athlete.short.name      : Factor w/ 2134 levels "A RAHAYU","ABALO L",..: 1122 1219 143 1794 1415 555 52 1539 1757 1642 ...
##  $ athlete.name            : Factor w/ 2174 levels "ABALO Luc","ABBOTT Monica",..: 1136 1235 144 1820 1433 559 54 1560 1784 1663 ...
##  $ athlete.sex             : Factor w/ 4 levels "M","O","W","X": 3 3 1 3 1 3 1 3 1 1 ...
##  $ athlete.link            : Factor w/ 2176 levels "../../../en/results/3x3-basketball/athlete-profile-n1312756-gray-allisha.htm",..: 1177 1130 1524 1127 1970 1140 1523 1178 1963 1962 ...
##  $ discipline.code         : Factor w/ 46 levels "ARC","ATH","BDM",..: 24 24 31 24 43 24 31 24 43 43 ...
##  $ event                   : Factor w/ 298 levels "10m Air Pistol Men",..: 174 174 24 174 24 174 24 174 24 24 ...
##  $ country                 : Factor w/ 93 levels "Argentina","Armenia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ discipline              : Factor w/ 46 levels "3x3 Basketball",..: 24 24 31 24 43 24 31 24 43 43 ...
##  $ Country.Rank.by.no.Gold : int [1:2401] 72 72 72 72 72 72 72 72 72 72 ...
##  $ Country.Gold.Count      : int [1:2401] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Country.Silver.Count    : int [1:2401] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Country.Bronze.Count    : int [1:2401] 2 2 2 2 2 2 2 2 2 2 ...
##  $ Country.Total.all.medals: int [1:2401] 3 3 3 3 3 3 3 3 3 3 ...
##  $ Country                 : Factor w/ 93 levels "Argentina","Armenia",..: 1 1 1 1 1 1 1 1 1 1 ...
## NULL

I am now subetting the merged data to consist of only the columns I wish to evaluate.

merged_sub <- merged %>% dplyr::select(c(medal.type, athlete.name, Country.Gold.Count, 
                  Country.Silver.Count, Country.Bronze.Count, 
                  Country.Total.all.medals, Country))

merged_sub <- data.frame(merged_sub)

print(head(merged_sub))

##     medal.type           athlete.name Country.Gold.Count Country.Silver.Count
## 1 Silver Medal          MACCARI Sofia                  0                    1
## 2 Silver Medal         MERINO Delfina                  0                    1
## 3 Bronze Medal    BAZAN VELEZ Lautaro                  0                    1
## 4 Silver Medal      SUCCI Maria Belen                  0                    1
## 5 Bronze Medal      PALACIOS Ezequiel                  0                    1
## 6 Silver Medal FORCHERIO Maria Emilia                  0                    1
##   Country.Bronze.Count Country.Total.all.medals   Country
## 1                    2                        3 Argentina
## 2                    2                        3 Argentina
## 3                    2                        3 Argentina
## 4                    2                        3 Argentina
## 5                    2                        3 Argentina
## 6                    2                        3 Argentina

Checking for missing values I am using the colSums approach which checks for missing values within every column and produces the number of missing values in each column. In this case as per below there are none.

# check for missing values in the merged dataset
colSums(is.na(merged_sub))

##               medal.type             athlete.name       Country.Gold.Count 
##                        0                        0                        0 
##     Country.Silver.Count     Country.Bronze.Count Country.Total.all.medals 
##                        0                        0                        0 
##                  Country 
##                        0

#print the merged_sub dataframe using print and head functions
print(head(merged_sub))

##     medal.type           athlete.name Country.Gold.Count Country.Silver.Count
## 1 Silver Medal          MACCARI Sofia                  0                    1
## 2 Silver Medal         MERINO Delfina                  0                    1
## 3 Bronze Medal    BAZAN VELEZ Lautaro                  0                    1
## 4 Silver Medal      SUCCI Maria Belen                  0                    1
## 5 Bronze Medal      PALACIOS Ezequiel                  0                    1
## 6 Silver Medal FORCHERIO Maria Emilia                  0                    1
##   Country.Bronze.Count Country.Total.all.medals   Country
## 1                    2                        3 Argentina
## 2                    2                        3 Argentina
## 3                    2                        3 Argentina
## 4                    2                        3 Argentina
## 5                    2                        3 Argentina
## 6                    2                        3 Argentina

Does the merged_sub data frame (the one I will use as my default data for analysis for this report) adhere to tidy data principles?

To ascertain this I will go through each of the 3 components of tidy data: - is every row a unique observation? Yes as each athlete has a row for each medal - does every variable belong to a unique column and is every column a variable? yes - does every cell contain only a single value ie no merged values? yes (aside from athlete name however the analysis does not require the surname to be a separate variable and thus the full name is treated as one variable in the athlete.name column. similarly the medal.type variable is considered to be one variable as ’Medal" is common to all and just descriptive)

The data as it is currently structured includes medal totals by Country along with individaul athlete medal results therefore in order to check for outliers in the data, I need to group by and summarise the Country.Gold.Count and Country.Rank.by.no.Gold to be able to check for outliers in these variables in particular.

gold <- merged_sub %>% 
  group_by(Country) %>% 
  summarise(gold_medal_tally = mean(Country.Gold.Count)) %>%
  arrange(desc(gold_medal_tally))

print(head(gold))

## # A tibble: 6 × 2
##   Country                    gold_medal_tally
##   <fct>                                 <dbl>
## 1 United States of America                 39
## 2 People's Republic of China               38
## 3 Japan                                    27
## 4 Great Britain                            22
## 5 ROC                                      20
## 6 Australia                                17

Create a boxplot for the gold medals tally for each country:

#Apply boxplot() function
goldboxplot <- gold$gold_medal_tally %>%  
  boxplot(main = "Gold Medal Count by Country Tokyo 2020 (2021) Olympic Games", ylab = "count", col = "yellow")

The outliers for the gold medal tally are achieved using the below code which sources the information produced in the above boxplot

goldoutliers <- goldboxplot$out
goldoutliers

##  [1] 39 38 27 22 20 17 10 10 10 10

This is interesting as we see that the 10 medal mark is significant as a cutoff point for outliers

There are outliers in this data however given this data is based on actual results of the most recent Summer Olympic games, I will not removing the outliers for obvious purposes. This exercise was to fulfill part of step 9 which requires all numeric variables to be scanned for outliers.

Continue on for Silver Bronze and Combined Medal Total

Variable names are: - Country.Silver.Count - Country.Bronze.Count - Country.Total.all.medals

NB Rank has been excluded as it is a range from 1:93

Silver boxplot

silver <- merged_sub %>% 
  group_by(Country) %>% 
  summarise(silver_medal_tally = mean(Country.Silver.Count)) %>%
  arrange(desc(silver_medal_tally))

print(head(silver))

## # A tibble: 6 × 2
##   Country                    silver_medal_tally
##   <fct>                                   <dbl>
## 1 United States of America                   41
## 2 People's Republic of China                 32
## 3 ROC                                        28
## 4 Great Britain                              21
## 5 Japan                                      14
## 6 France                                     12

Create a boxplot for the silver medals tally for each country:

#Apply boxplot() function
silverboxplot <- silver$silver_medal_tally %>%  
  boxplot(main = "Silver Medal Count by Country Tokyo 2020 (2021) Olympic Games", ylab = "count", col = "grey")

The outliers for the silver medal tally are achieved using the below code which sources the information produced in the above boxplot

silveroutliers <- silverboxplot$out
silveroutliers

## [1] 41 32 28 21 14 12 12 11

Similarly to the gold medals above, the outliers appear at the 11 mark (close to 10)

Bronze boxplot

bronze <- merged_sub %>% 
  group_by(Country) %>% 
  summarise(bronze_medal_tally = mean(Country.Bronze.Count)) %>%
  arrange(desc(bronze_medal_tally))

print(head(bronze))

## # A tibble: 6 × 2
##   Country                    bronze_medal_tally
##   <fct>                                   <dbl>
## 1 United States of America                   33
## 2 ROC                                        23
## 3 Australia                                  22
## 4 Great Britain                              22
## 5 Italy                                      20
## 6 People's Republic of China                 18

Create a boxplot for the gold medals tally for each country:

#Apply boxplot() function
bronzeboxplot <- bronze$bronze_medal_tally %>%  
  boxplot(main = "Bronze Medal Count by Country Tokyo 2020 (2021) Olympic Games", ylab = "count", col = "brown")

bronzeoutliers <- bronzeboxplot$out
bronzeoutliers

##  [1] 33 23 22 22 20 18 17 16 14 12

Country Total All Medals

totalmedals <- merged_sub %>% 
  group_by(Country) %>% 
  summarise(total_medal_tally = mean(Country.Total.all.medals)) %>%
  arrange(desc(total_medal_tally))

print(head(totalmedals))

## # A tibble: 6 × 2
##   Country                    total_medal_tally
##   <fct>                                  <dbl>
## 1 United States of America                 113
## 2 People's Republic of China                88
## 3 ROC                                       71
## 4 Great Britain                             65
## 5 Japan                                     58
## 6 Australia                                 46

#Apply boxplot() function
totalboxplot <- totalmedals$total_medal_tally %>%  
  boxplot(main = "Total Medal Count by Country Tokyo 2020 (2021) Olympic Games", ylab = "count", col = "cornflowerblue")

totaloutliers <- totalboxplot$out
totaloutliers

##  [1] 113  88  71  65  58  46  40  37  36  33

histogram of gold by country

I built this via an export of the merged data frame in excel and in the interest of time I’m including it as a pasted in image as it is an additional representation of the data that I would like to explore further at some point in time. alt text here

Comparison of range size of outliers Total Medal Count by Country and Gold Medal Count by Country (which is currently the standard for how the IOC provide an offical performance rank by Country)

boxplot(totaloutliers, goldoutliers, main = "outliers total medal count and gold medal count by country",
        ylab = "no of medals", names = c("total", "gold"))

Apply a transforamtion on the new variables to scale to the total medal column

Additional Data Frame Gold Total and Total Medal Count

Athlete_Tally

I was interested also in finding out the athlete with the highest medal count (no further analysis is conducted on this data frame however I will refer to it in the final notes as an option for exploring in future - it looks interesting from the perspective of which discipline yields the most gold medals and to compare this over time)

athlete_tally <- data.frame(merged %>% 
  group_by(medal.type, athlete.name) %>% 
  mutate(athlete.total.medals = n(), medal.type = medal.type, event = event, discipline = discipline, Country.Gold.Count, Country.Silver.Count, Country.Bronze.Count, Country.Total.all.medals) %>% 
  arrange(desc(athlete.total.medals)))

print(head(athlete_tally, n=20))

##    country.code   medal.type medal.code medal.date athlete.short.name
## 1           USA   Gold Medal          1 2021-07-31          DRESSEL C
## 2           USA   Gold Medal          1 2021-07-26          DRESSEL C
## 3           USA   Gold Medal          1 2021-08-01          DRESSEL C
## 4           USA   Gold Medal          1 2021-07-29          DRESSEL C
## 5           USA   Gold Medal          1 2021-08-01          DRESSEL C
## 6           AUS   Gold Medal          1 2021-08-01           McKEON E
## 7           AUS   Gold Medal          1 2021-07-30           McKEON E
## 8           AUS   Gold Medal          1 2021-07-25           McKEON E
## 9           AUS   Gold Medal          1 2021-08-01           McKEON E
## 10          AUS   Gold Medal          1 2021-07-31          McKEOWN K
## 11          AUS Bronze Medal          3 2021-07-31           McKEON E
## 12          AUS   Gold Medal          1 2021-08-01          McKEOWN K
## 13          AUS Bronze Medal          3 2021-07-26           McKEON E
## 14          AUS   Gold Medal          1 2021-07-27          McKEOWN K
## 15          AUS Bronze Medal          3 2021-07-29           McKEON E
## 16          GBR Silver Medal          2 2021-07-30            SCOTT D
## 17          GBR Silver Medal          2 2021-07-27            SCOTT D
## 18          GBR Silver Medal          2 2021-08-01            SCOTT D
## 19          JAM   Gold Medal          1 2021-08-06   THOMPSON-HERAH E
## 20          JAM   Gold Medal          1 2021-08-03   THOMPSON-HERAH E
##             athlete.name athlete.sex
## 1         DRESSEL Caeleb           M
## 2         DRESSEL Caeleb           M
## 3         DRESSEL Caeleb           M
## 4         DRESSEL Caeleb           M
## 5         DRESSEL Caeleb           M
## 6            McKEON Emma           W
## 7            McKEON Emma           W
## 8            McKEON Emma           W
## 9            McKEON Emma           W
## 10        McKEOWN Kaylee           W
## 11           McKEON Emma           X
## 12        McKEOWN Kaylee           W
## 13           McKEON Emma           W
## 14        McKEOWN Kaylee           W
## 15           McKEON Emma           W
## 16          SCOTT Duncan           M
## 17          SCOTT Duncan           M
## 18          SCOTT Duncan           M
## 19 THOMPSON-HERAH Elaine           W
## 20 THOMPSON-HERAH Elaine           W
##                                                                        athlete.link
## 1          ../../../en/results/swimming/athlete-profile-n1302244-dressel-caeleb.htm
## 2          ../../../en/results/swimming/athlete-profile-n1302244-dressel-caeleb.htm
## 3          ../../../en/results/swimming/athlete-profile-n1302244-dressel-caeleb.htm
## 4          ../../../en/results/swimming/athlete-profile-n1302244-dressel-caeleb.htm
## 5          ../../../en/results/swimming/athlete-profile-n1302244-dressel-caeleb.htm
## 6             ../../../en/results/swimming/athlete-profile-n1482869-mckeon-emma.htm
## 7             ../../../en/results/swimming/athlete-profile-n1482869-mckeon-emma.htm
## 8             ../../../en/results/swimming/athlete-profile-n1482869-mckeon-emma.htm
## 9             ../../../en/results/swimming/athlete-profile-n1482869-mckeon-emma.htm
## 10         ../../../en/results/swimming/athlete-profile-n1482870-mckeown-kaylee.htm
## 11            ../../../en/results/swimming/athlete-profile-n1482869-mckeon-emma.htm
## 12         ../../../en/results/swimming/athlete-profile-n1482870-mckeown-kaylee.htm
## 13            ../../../en/results/swimming/athlete-profile-n1482869-mckeon-emma.htm
## 14         ../../../en/results/swimming/athlete-profile-n1482870-mckeown-kaylee.htm
## 15            ../../../en/results/swimming/athlete-profile-n1482869-mckeon-emma.htm
## 16           ../../../en/results/swimming/athlete-profile-n1444193-scott-duncan.htm
## 17           ../../../en/results/swimming/athlete-profile-n1444193-scott-duncan.htm
## 18           ../../../en/results/swimming/athlete-profile-n1444193-scott-duncan.htm
## 19 ../../../en/results/athletics/athlete-profile-n1412290-thompson-herah-elaine.htm
## 20 ../../../en/results/athletics/athlete-profile-n1412290-thompson-herah-elaine.htm
##    discipline.code                            event                  country
## 1              SWM             Men's 100m Butterfly United States of America
## 2              SWM   Men's 4 x 100m Freestyle Relay United States of America
## 3              SWM              Men's 50m Freestyle United States of America
## 4              SWM             Men's 100m Freestyle United States of America
## 5              SWM      Men's 4 x 100m Medley Relay United States of America
## 6              SWM            Women's 50m Freestyle                Australia
## 7              SWM           Women's 100m Freestyle                Australia
## 8              SWM Women's 4 x 100m Freestyle Relay                Australia
## 9              SWM    Women's 4 x 100m Medley Relay                Australia
## 10             SWM          Women's 200m Backstroke                Australia
## 11             SWM      Mixed 4 x 100m Medley Relay                Australia
## 12             SWM    Women's 4 x 100m Medley Relay                Australia
## 13             SWM           Women's 100m Butterfly                Australia
## 14             SWM          Women's 100m Backstroke                Australia
## 15             SWM Women's 4 x 200m Freestyle Relay                Australia
## 16             SWM     Men's 200m Individual Medley            Great Britain
## 17             SWM             Men's 200m Freestyle            Great Britain
## 18             SWM      Men's 4 x 100m Medley Relay            Great Britain
## 19             ATH           Women's 4 x 100m Relay                  Jamaica
## 20             ATH                     Women's 200m                  Jamaica
##    discipline Country.Rank.by.no.Gold Country.Gold.Count Country.Silver.Count
## 1    Swimming                       1                 39                   41
## 2    Swimming                       1                 39                   41
## 3    Swimming                       1                 39                   41
## 4    Swimming                       1                 39                   41
## 5    Swimming                       1                 39                   41
## 6    Swimming                       6                 17                    7
## 7    Swimming                       6                 17                    7
## 8    Swimming                       6                 17                    7
## 9    Swimming                       6                 17                    7
## 10   Swimming                       6                 17                    7
## 11   Swimming                       6                 17                    7
## 12   Swimming                       6                 17                    7
## 13   Swimming                       6                 17                    7
## 14   Swimming                       6                 17                    7
## 15   Swimming                       6                 17                    7
## 16   Swimming                       4                 22                   21
## 17   Swimming                       4                 22                   21
## 18   Swimming                       4                 22                   21
## 19  Athletics                      21                  4                    1
## 20  Athletics                      21                  4                    1
##    Country.Bronze.Count Country.Total.all.medals                  Country
## 1                    33                      113 United States of America
## 2                    33                      113 United States of America
## 3                    33                      113 United States of America
## 4                    33                      113 United States of America
## 5                    33                      113 United States of America
## 6                    22                       46                Australia
## 7                    22                       46                Australia
## 8                    22                       46                Australia
## 9                    22                       46                Australia
## 10                   22                       46                Australia
## 11                   22                       46                Australia
## 12                   22                       46                Australia
## 13                   22                       46                Australia
## 14                   22                       46                Australia
## 15                   22                       46                Australia
## 16                   22                       65            Great Britain
## 17                   22                       65            Great Britain
## 18                   22                       65            Great Britain
## 19                    4                        9                  Jamaica
## 20                    4                        9                  Jamaica
##    athlete.total.medals
## 1                     5
## 2                     5
## 3                     5
## 4                     5
## 5                     5
## 6                     4
## 7                     4
## 8                     4
## 9                     4
## 10                    3
## 11                    3
## 12                    3
## 13                    3
## 14                    3
## 15                    3
## 16                    3
## 17                    3
## 18                    3
## 19                    3
## 20                    3

References

https://en.wikipedia.org/wiki/Olympic_medal_table#Official_reports

https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode

https://www.kaggle.com/piterfm/tokyo-2020-olympics]