As I am intending for this assessment to form part of my eportfolio, I am also providing background information on the learning outcomes of this course for the purposes of publishing this to a broader audience.
“Real-world data are commonly incomplete, noisy, and inconsistent. This course will cover a wide range of topics designed to equip you with the skills needed to prepare all forms of untidy data for analysis. The course will cover the core concepts of data pre-processing, namely tidy data, data integration, data cleaning, data transformation, data standardisation, data discretisation, and data reduction.” RMIT MATH 2405
This assignment requires the sourcing of open data and use the knowledge you’ve gained during the course to preprocess the data.
In this assignment I am demonstrating the learning outcomes (in programming language R) in MATH 2405 (Data Wrangling) specifically:
- articulate how data meets the best practice standards (e.g. tidy data principles)
- select, perform and justify data validation processes for raw datasets.
- use leading open source software (e.g. R) for reproducible, automated data processing.
- accurately, logically and ethically combine data from multiple sources to make suitable for statistical
analysis and draw valid interpretations.
Specifically the modules in this course cover:
- using data preprocessing tools on real world datasets
- what are the steps of data wrangling?
- the importance of data wrangling
- how to identify data, get it and export it
- understand how to check the types of the variables, dimensions
and structure of the data, and levels/values for the variables
- knowledge of packages: tidyr, lubridate, stringr and dplyr (and others)
- tidy data principles
- cleaning data: replacing and excluding the missing values (and how and
when to decide on appropriate imputation techniques)
- detection of outliers
- data manipulation and format conversion
- data transformation (inclusive of normalisation and standardisation)
- data discretisation
- differentiation between wide- and long-formatted data
- integrating data using join functions in the dplyr package
- creating summary statistics
- manipulating strings
- variable selection, ranking and feature-extraction techniques for data reduction
- importing date and time (in particular conversion of strings to dates)
- extraction and manipulation of dates and parts of dates
- date arithmetic
- string manipulation using the stringr package
This course contributes to the following Program Learning Outcomes for MC004 Master of Statistics and Operations Research and MC242 Master of Analytics:
- personal and professional awareness
- the ability to contextualise outputs where data are drawn from diverse and evolving social,
political and cultural dimensions
- the ability to reflect on experience and improve your own future practice
- the ability to apply the principles of lifelong learning to any new challenge
- knowledge and technical competence
- an understanding of appropriate and relevant, fundamental and applied mathematical and statistical knowledge,
methodologies and modern computational tools
- problem-solving
- ability to bring together and flexibly apply knowledge to characterise, analyse and solve a wide range
of problems
- an understanding of the balance between the complexity / accuracy of the mathematical / statistical models used
and the timeliness of the delivery of the solution
- information literacy
- ability to locate and use data and information and evaluate its quality with respect to its authority and relevance
[remove unused at end]
# assign packages variable to packages included in this report
packages <- c("readr", "xlsx", "readxl", "foreign", "gdata", "rvest", "dplyr", "tidyr", "tidyverse", "Hmisc", "lubridate", "outliers", "ggplot2", "magrittr", "rmarkdown", "tidyverse", "knitr", "yaml", "stringr")
## If a package is installed, it will be loaded. If any
## are not, the missing package(s) will be installed
## from CRAN and then loaded.
package.check <- lapply(
packages,
FUN = function(x) {
if (!require(x, character.only = TRUE)) {
install.packages(x, dependencies = TRUE)
library(x, character.only = TRUE)
}
}
)
## Loading required package: readr
## Loading required package: xlsx
## Loading required package: readxl
## Loading required package: foreign
## Loading required package: gdata
## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.
##
## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.
##
## Attaching package: 'gdata'
## The following object is masked from 'package:stats':
##
## nobs
## The following object is masked from 'package:utils':
##
## object.size
## The following object is masked from 'package:base':
##
## startsWith
## Loading required package: rvest
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:gdata':
##
## combine, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: tidyr
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ stringr 1.4.0
## ✓ tibble 3.1.5 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::combine() masks gdata::combine()
## x dplyr::filter() masks stats::filter()
## x dplyr::first() masks gdata::first()
## x rvest::guess_encoding() masks readr::guess_encoding()
## x purrr::keep() masks gdata::keep()
## x dplyr::lag() masks stats::lag()
## x dplyr::last() masks gdata::last()
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
## Loading required package: outliers
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
## Loading required package: rmarkdown
## Loading required package: knitr
## Loading required package: yaml
knitr::opts_chunk$set(
fig.align = "center",
message = FALSE,
warning = FALSE
)
This analysis relates to the recent Tokyo 2020 Summer Olympic Games held this year in August (the name honours the original pre covid intended commencement year)
It was interesting to discover on Wikipedia that the official way the IOC deem the winner of the Olympics is by the Gold Medal Tally for the participating nation and that other nation/s also consider the Total Medal Tally as one worthy of noting see this excerpt from Wikipedia (referenced at the end):
“The gold first ranking system described above is used by most of the world media, as well as the IOC. While the gold first ranking system has been used occasionally by some American media outlets, newspapers in the United States primarily publish medal tables ordered by the total number of medals won.”
There are 2 datasets: medals.csv medals_total.csv (downloaded from Kaggle) that provide the results of this years Tokyo Summer Olympic Games (held this year and named as per the original intended year 2020).
There are two datasets containing different information about the number of medals won.
Both datasets include a country code and will be merged on this common variablve using a right join to show a new dataframe that contains each athlete’s medals won per event and which country they are from. I have cross refenced the medal totals with the IOC website to confirm that my data set was accurate (not included in this report but consistent if checked by the reader)
medals <- read.csv("data/medals.csv", stringsAsFactors = TRUE)
totals <- read.csv("data/medals_total.csv", stringsAsFactors = TRUE)
medals <- as_tibble(medals)
totals <- as_tibble(totals)
str(medals)
## tibble [2,401 × 12] (S3: tbl_df/tbl/data.frame)
## $ medal_type : Factor w/ 3 levels "Bronze Medal",..: 2 2 3 3 1 1 2 3 1 2 ...
## $ medal_code : int [1:2401] 1 1 2 2 3 3 1 2 3 1 ...
## $ medal_date : Factor w/ 16 levels "2021-07-24 00:00:00.0",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ athlete_short_name: Factor w/ 2134 levels "A RAHAYU","ABALO L",..: 918 57 1665 2028 51 1917 278 1926 1486 1810 ...
## $ athlete_name : Factor w/ 2174 levels "ABALO Luc","ABBOTT Monica",..: 926 59 1686 2061 53 1945 279 1954 1506 1839 ...
## $ athlete_sex : Factor w/ 4 levels "M","O","W","X": 4 4 4 4 4 4 1 1 1 1 ...
## $ athlete_link : Factor w/ 2176 levels "../../../en/results/3x3-basketball/athlete-profile-n1312756-gray-allisha.htm",..: 47 37 48 49 25 26 674 667 669 834 ...
## $ country_code : Factor w/ 93 levels "ARG","ARM","AUS",..: 52 52 65 65 61 61 24 7 77 39 ...
## $ discipline_code : Factor w/ 46 levels "ARC","ATH","BDM",..: 1 1 1 1 1 1 11 11 11 18 ...
## $ event : Factor w/ 298 levels "10m Air Pistol Men",..: 166 166 166 166 166 166 142 142 142 143 ...
## $ country : Factor w/ 93 levels "Argentina","Armenia",..: 71 71 60 60 56 56 24 9 79 38 ...
## $ discipline : Factor w/ 46 levels "3x3 Basketball",..: 2 2 2 2 2 2 16 16 16 20 ...
str(totals)
## tibble [93 × 7] (S3: tbl_df/tbl/data.frame)
## $ Rank : int [1:93] 1 2 3 4 5 6 7 8 9 10 ...
## $ Country.Code: Factor w/ 93 levels "ARG","ARM","AUS",..: 91 16 48 32 74 3 65 31 34 45 ...
## $ Gold.Medal : int [1:93] 39 38 27 22 20 17 10 10 10 10 ...
## $ Silver.Medal: int [1:93] 41 32 14 21 28 7 12 12 11 10 ...
## $ Bronze.Medal: int [1:93] 33 18 17 22 23 22 14 11 16 20 ...
## $ Total : int [1:93] 113 88 58 65 71 46 36 33 37 40 ...
## $ Country : Factor w/ 93 levels "Argentina","Armenia",..: 91 65 46 34 73 3 60 30 32 44 ...
print(head(totals))
## # A tibble: 6 × 7
## Rank Country.Code Gold.Medal Silver.Medal Bronze.Medal Total Country
## <int> <fct> <int> <int> <int> <int> <fct>
## 1 1 USA 39 41 33 113 United States o…
## 2 2 CHN 38 32 18 88 People's Republ…
## 3 3 JPN 27 14 17 58 Japan
## 4 4 GBR 22 21 22 65 Great Britain
## 5 5 ROC 20 28 23 71 ROC
## 6 6 AUS 17 7 22 46 Australia
print(head(medals))
## # A tibble: 6 × 12
## medal_type medal_code medal_date athlete_short_n… athlete_name athlete_sex
## <fct> <int> <fct> <fct> <fct> <fct>
## 1 Gold Medal 1 2021-07-24 … KIM JD KIM Je Deok X
## 2 Gold Medal 1 2021-07-24 … AN S AN San X
## 3 Silver Medal 2 2021-07-24 … SCHLOESSER G SCHLOESSER … X
## 4 Silver Medal 2 2021-07-24 … WIJLER S WIJLER Steve X
## 5 Bronze Medal 3 2021-07-24 … ALVAREZ L ALVAREZ Luis X
## 6 Bronze Medal 3 2021-07-24 … VALENCIA A VALENCIA Al… X
## # … with 6 more variables: athlete_link <fct>, country_code <fct>,
## # discipline_code <fct>, event <fct>, country <fct>, discipline <fct>
Country Code is common to both data sets and the below code will merge the two datasets on this variable. The medals data set Country Code column is called ‘country_code’ and the medals_total data set is called Country.Code. I am first going to change the name of the medals_total Country.Code column to match the medals data set.
# replace column names
colnames(totals) <- c("Country.Rank.by.no.Gold", "country.code", "Country.Gold.Count", "Country.Silver.Count", "Country.Bronze.Count", "Country.Total.all.medals", "Country")
colnames(medals) <- c("medal.type", "medal.code", "medal.date", "athlete.short.name", "athlete.name", "athlete.sex", "athlete.link", "country.code", "discipline.code", "event", "country", "discipline")
#### Right Join using merge function
#the time stamp from the date as it is not useful (and also drop the date further on once the date conversion is applied but thought it useful to carry out the conversion first for demonstration purposes only)
merged <- merge(x = medals, y = totals, all.y = TRUE)
merged$medal.date <- as.Date(merged$medal.date)
print(head(merged))
## country.code medal.type medal.code medal.date athlete.short.name
## 1 ARG Silver Medal 2 2021-08-06 MACCARI S
## 2 ARG Silver Medal 2 2021-08-06 MERINO D
## 3 ARG Bronze Medal 3 2021-07-28 BAZAN VELEZ L
## 4 ARG Silver Medal 2 2021-08-06 SUCCI MB
## 5 ARG Bronze Medal 3 2021-08-07 PALACIOS E
## 6 ARG Silver Medal 2 2021-08-06 FORCHERIO ME
## athlete.name athlete.sex
## 1 MACCARI Sofia W
## 2 MERINO Delfina W
## 3 BAZAN VELEZ Lautaro M
## 4 SUCCI Maria Belen W
## 5 PALACIOS Ezequiel M
## 6 FORCHERIO Maria Emilia W
## athlete.link
## 1 ../../../en/results/hockey/athlete-profile-n1466504-maccari-sofia.htm
## 2 ../../../en/results/hockey/athlete-profile-n1361692-merino-delfina.htm
## 3 ../../../en/results/rugby-sevens/athlete-profile-n1362707-bazan-velez-lautaro.htm
## 4 ../../../en/results/hockey/athlete-profile-n1361686-succi-maria-belen.htm
## 5 ../../../en/results/volleyball/athlete-profile-n1366896-palacios-ezequiel.htm
## 6 ../../../en/results/hockey/athlete-profile-n1361803-forcherio-maria-emilia.htm
## discipline.code event country discipline Country.Rank.by.no.Gold
## 1 HOC Women Argentina Hockey 72
## 2 HOC Women Argentina Hockey 72
## 3 RUG Men Argentina Rugby Sevens 72
## 4 HOC Women Argentina Hockey 72
## 5 VVO Men Argentina Volleyball 72
## 6 HOC Women Argentina Hockey 72
## Country.Gold.Count Country.Silver.Count Country.Bronze.Count
## 1 0 1 2
## 2 0 1 2
## 3 0 1 2
## 4 0 1 2
## 5 0 1 2
## 6 0 1 2
## Country.Total.all.medals Country
## 1 3 Argentina
## 2 3 Argentina
## 3 3 Argentina
## 4 3 Argentina
## 5 3 Argentina
## 6 3 Argentina
The structure of the merged data set is checked using the str() function
#create dataframe of the merged data
merged <- as_tibble(merged)
print(names(merged))
## [1] "country.code" "medal.type"
## [3] "medal.code" "medal.date"
## [5] "athlete.short.name" "athlete.name"
## [7] "athlete.sex" "athlete.link"
## [9] "discipline.code" "event"
## [11] "country" "discipline"
## [13] "Country.Rank.by.no.Gold" "Country.Gold.Count"
## [15] "Country.Silver.Count" "Country.Bronze.Count"
## [17] "Country.Total.all.medals" "Country"
print(str(merged))
## tibble [2,401 × 18] (S3: tbl_df/tbl/data.frame)
## $ country.code : Factor w/ 93 levels "ARG","ARM","AUS",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ medal.type : Factor w/ 3 levels "Bronze Medal",..: 3 3 1 3 1 3 1 3 1 1 ...
## $ medal.code : int [1:2401] 2 2 3 2 3 2 3 2 3 3 ...
## $ medal.date : Date[1:2401], format: "2021-08-06" "2021-08-06" ...
## $ athlete.short.name : Factor w/ 2134 levels "A RAHAYU","ABALO L",..: 1122 1219 143 1794 1415 555 52 1539 1757 1642 ...
## $ athlete.name : Factor w/ 2174 levels "ABALO Luc","ABBOTT Monica",..: 1136 1235 144 1820 1433 559 54 1560 1784 1663 ...
## $ athlete.sex : Factor w/ 4 levels "M","O","W","X": 3 3 1 3 1 3 1 3 1 1 ...
## $ athlete.link : Factor w/ 2176 levels "../../../en/results/3x3-basketball/athlete-profile-n1312756-gray-allisha.htm",..: 1177 1130 1524 1127 1970 1140 1523 1178 1963 1962 ...
## $ discipline.code : Factor w/ 46 levels "ARC","ATH","BDM",..: 24 24 31 24 43 24 31 24 43 43 ...
## $ event : Factor w/ 298 levels "10m Air Pistol Men",..: 174 174 24 174 24 174 24 174 24 24 ...
## $ country : Factor w/ 93 levels "Argentina","Armenia",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ discipline : Factor w/ 46 levels "3x3 Basketball",..: 24 24 31 24 43 24 31 24 43 43 ...
## $ Country.Rank.by.no.Gold : int [1:2401] 72 72 72 72 72 72 72 72 72 72 ...
## $ Country.Gold.Count : int [1:2401] 0 0 0 0 0 0 0 0 0 0 ...
## $ Country.Silver.Count : int [1:2401] 1 1 1 1 1 1 1 1 1 1 ...
## $ Country.Bronze.Count : int [1:2401] 2 2 2 2 2 2 2 2 2 2 ...
## $ Country.Total.all.medals: int [1:2401] 3 3 3 3 3 3 3 3 3 3 ...
## $ Country : Factor w/ 93 levels "Argentina","Armenia",..: 1 1 1 1 1 1 1 1 1 1 ...
## NULL
I am now subetting the merged data to consist of only the columns I wish to evaluate.
merged_sub <- merged %>% dplyr::select(c(medal.type, athlete.name, Country.Gold.Count,
Country.Silver.Count, Country.Bronze.Count,
Country.Total.all.medals, Country))
merged_sub <- data.frame(merged_sub)
print(head(merged_sub))
## medal.type athlete.name Country.Gold.Count Country.Silver.Count
## 1 Silver Medal MACCARI Sofia 0 1
## 2 Silver Medal MERINO Delfina 0 1
## 3 Bronze Medal BAZAN VELEZ Lautaro 0 1
## 4 Silver Medal SUCCI Maria Belen 0 1
## 5 Bronze Medal PALACIOS Ezequiel 0 1
## 6 Silver Medal FORCHERIO Maria Emilia 0 1
## Country.Bronze.Count Country.Total.all.medals Country
## 1 2 3 Argentina
## 2 2 3 Argentina
## 3 2 3 Argentina
## 4 2 3 Argentina
## 5 2 3 Argentina
## 6 2 3 Argentina
Checking for missing values I am using the colSums approach which checks for missing values within every column and produces the number of missing values in each column. In this case as per below there are none.
# check for missing values in the merged dataset
colSums(is.na(merged_sub))
## medal.type athlete.name Country.Gold.Count
## 0 0 0
## Country.Silver.Count Country.Bronze.Count Country.Total.all.medals
## 0 0 0
## Country
## 0
#print the merged_sub dataframe using print and head functions
print(head(merged_sub))
## medal.type athlete.name Country.Gold.Count Country.Silver.Count
## 1 Silver Medal MACCARI Sofia 0 1
## 2 Silver Medal MERINO Delfina 0 1
## 3 Bronze Medal BAZAN VELEZ Lautaro 0 1
## 4 Silver Medal SUCCI Maria Belen 0 1
## 5 Bronze Medal PALACIOS Ezequiel 0 1
## 6 Silver Medal FORCHERIO Maria Emilia 0 1
## Country.Bronze.Count Country.Total.all.medals Country
## 1 2 3 Argentina
## 2 2 3 Argentina
## 3 2 3 Argentina
## 4 2 3 Argentina
## 5 2 3 Argentina
## 6 2 3 Argentina
To ascertain this I will go through each of the 3 components of tidy data: - is every row a unique observation? Yes as each athlete has a row for each medal - does every variable belong to a unique column and is every column a variable? yes - does every cell contain only a single value ie no merged values? yes (aside from athlete name however the analysis does not require the surname to be a separate variable and thus the full name is treated as one variable in the athlete.name column. similarly the medal.type variable is considered to be one variable as ’Medal" is common to all and just descriptive)
The data as it is currently structured includes medal totals by Country along with individaul athlete medal results therefore in order to check for outliers in the data, I need to group by and summarise the Country.Gold.Count and Country.Rank.by.no.Gold to be able to check for outliers in these variables in particular.
gold <- merged_sub %>%
group_by(Country) %>%
summarise(gold_medal_tally = mean(Country.Gold.Count)) %>%
arrange(desc(gold_medal_tally))
print(head(gold))
## # A tibble: 6 × 2
## Country gold_medal_tally
## <fct> <dbl>
## 1 United States of America 39
## 2 People's Republic of China 38
## 3 Japan 27
## 4 Great Britain 22
## 5 ROC 20
## 6 Australia 17
Create a boxplot for the gold medals tally for each country:
#Apply boxplot() function
goldboxplot <- gold$gold_medal_tally %>%
boxplot(main = "Gold Medal Count by Country Tokyo 2020 (2021) Olympic Games", ylab = "count", col = "yellow")
The outliers for the gold medal tally are achieved using the below code which sources the information produced in the above boxplot
goldoutliers <- goldboxplot$out
goldoutliers
## [1] 39 38 27 22 20 17 10 10 10 10
This is interesting as we see that the 10 medal mark is significant as a cutoff point for outliers
There are outliers in this data however given this data is based on actual results of the most recent Summer Olympic games, I will not removing the outliers for obvious purposes. This exercise was to fulfill part of step 9 which requires all numeric variables to be scanned for outliers.
Variable names are: - Country.Silver.Count - Country.Bronze.Count - Country.Total.all.medals
NB Rank has been excluded as it is a range from 1:93
silver <- merged_sub %>%
group_by(Country) %>%
summarise(silver_medal_tally = mean(Country.Silver.Count)) %>%
arrange(desc(silver_medal_tally))
print(head(silver))
## # A tibble: 6 × 2
## Country silver_medal_tally
## <fct> <dbl>
## 1 United States of America 41
## 2 People's Republic of China 32
## 3 ROC 28
## 4 Great Britain 21
## 5 Japan 14
## 6 France 12
Create a boxplot for the silver medals tally for each country:
#Apply boxplot() function
silverboxplot <- silver$silver_medal_tally %>%
boxplot(main = "Silver Medal Count by Country Tokyo 2020 (2021) Olympic Games", ylab = "count", col = "grey")
The outliers for the silver medal tally are achieved using the below code which sources the information produced in the above boxplot
silveroutliers <- silverboxplot$out
silveroutliers
## [1] 41 32 28 21 14 12 12 11
Similarly to the gold medals above, the outliers appear at the 11 mark (close to 10)
bronze <- merged_sub %>%
group_by(Country) %>%
summarise(bronze_medal_tally = mean(Country.Bronze.Count)) %>%
arrange(desc(bronze_medal_tally))
print(head(bronze))
## # A tibble: 6 × 2
## Country bronze_medal_tally
## <fct> <dbl>
## 1 United States of America 33
## 2 ROC 23
## 3 Australia 22
## 4 Great Britain 22
## 5 Italy 20
## 6 People's Republic of China 18
Create a boxplot for the gold medals tally for each country:
#Apply boxplot() function
bronzeboxplot <- bronze$bronze_medal_tally %>%
boxplot(main = "Bronze Medal Count by Country Tokyo 2020 (2021) Olympic Games", ylab = "count", col = "brown")
bronzeoutliers <- bronzeboxplot$out
bronzeoutliers
## [1] 33 23 22 22 20 18 17 16 14 12
totalmedals <- merged_sub %>%
group_by(Country) %>%
summarise(total_medal_tally = mean(Country.Total.all.medals)) %>%
arrange(desc(total_medal_tally))
print(head(totalmedals))
## # A tibble: 6 × 2
## Country total_medal_tally
## <fct> <dbl>
## 1 United States of America 113
## 2 People's Republic of China 88
## 3 ROC 71
## 4 Great Britain 65
## 5 Japan 58
## 6 Australia 46
#Apply boxplot() function
totalboxplot <- totalmedals$total_medal_tally %>%
boxplot(main = "Total Medal Count by Country Tokyo 2020 (2021) Olympic Games", ylab = "count", col = "cornflowerblue")
totaloutliers <- totalboxplot$out
totaloutliers
## [1] 113 88 71 65 58 46 40 37 36 33
I built this via an export of the merged data frame in excel and in the interest of time I’m including it as a pasted in image as it is an additional representation of the data that I would like to explore further at some point in time.
boxplot(totaloutliers, goldoutliers, main = "outliers total medal count and gold medal count by country",
ylab = "no of medals", names = c("total", "gold"))
I was interested also in finding out the athlete with the highest medal count (no further analysis is conducted on this data frame however I will refer to it in the final notes as an option for exploring in future - it looks interesting from the perspective of which discipline yields the most gold medals and to compare this over time)
athlete_tally <- data.frame(merged %>%
group_by(medal.type, athlete.name) %>%
mutate(athlete.total.medals = n(), medal.type = medal.type, event = event, discipline = discipline, Country.Gold.Count, Country.Silver.Count, Country.Bronze.Count, Country.Total.all.medals) %>%
arrange(desc(athlete.total.medals)))
print(head(athlete_tally, n=20))
## country.code medal.type medal.code medal.date athlete.short.name
## 1 USA Gold Medal 1 2021-07-31 DRESSEL C
## 2 USA Gold Medal 1 2021-07-26 DRESSEL C
## 3 USA Gold Medal 1 2021-08-01 DRESSEL C
## 4 USA Gold Medal 1 2021-07-29 DRESSEL C
## 5 USA Gold Medal 1 2021-08-01 DRESSEL C
## 6 AUS Gold Medal 1 2021-08-01 McKEON E
## 7 AUS Gold Medal 1 2021-07-30 McKEON E
## 8 AUS Gold Medal 1 2021-07-25 McKEON E
## 9 AUS Gold Medal 1 2021-08-01 McKEON E
## 10 AUS Gold Medal 1 2021-07-31 McKEOWN K
## 11 AUS Bronze Medal 3 2021-07-31 McKEON E
## 12 AUS Gold Medal 1 2021-08-01 McKEOWN K
## 13 AUS Bronze Medal 3 2021-07-26 McKEON E
## 14 AUS Gold Medal 1 2021-07-27 McKEOWN K
## 15 AUS Bronze Medal 3 2021-07-29 McKEON E
## 16 GBR Silver Medal 2 2021-07-30 SCOTT D
## 17 GBR Silver Medal 2 2021-07-27 SCOTT D
## 18 GBR Silver Medal 2 2021-08-01 SCOTT D
## 19 JAM Gold Medal 1 2021-08-06 THOMPSON-HERAH E
## 20 JAM Gold Medal 1 2021-08-03 THOMPSON-HERAH E
## athlete.name athlete.sex
## 1 DRESSEL Caeleb M
## 2 DRESSEL Caeleb M
## 3 DRESSEL Caeleb M
## 4 DRESSEL Caeleb M
## 5 DRESSEL Caeleb M
## 6 McKEON Emma W
## 7 McKEON Emma W
## 8 McKEON Emma W
## 9 McKEON Emma W
## 10 McKEOWN Kaylee W
## 11 McKEON Emma X
## 12 McKEOWN Kaylee W
## 13 McKEON Emma W
## 14 McKEOWN Kaylee W
## 15 McKEON Emma W
## 16 SCOTT Duncan M
## 17 SCOTT Duncan M
## 18 SCOTT Duncan M
## 19 THOMPSON-HERAH Elaine W
## 20 THOMPSON-HERAH Elaine W
## athlete.link
## 1 ../../../en/results/swimming/athlete-profile-n1302244-dressel-caeleb.htm
## 2 ../../../en/results/swimming/athlete-profile-n1302244-dressel-caeleb.htm
## 3 ../../../en/results/swimming/athlete-profile-n1302244-dressel-caeleb.htm
## 4 ../../../en/results/swimming/athlete-profile-n1302244-dressel-caeleb.htm
## 5 ../../../en/results/swimming/athlete-profile-n1302244-dressel-caeleb.htm
## 6 ../../../en/results/swimming/athlete-profile-n1482869-mckeon-emma.htm
## 7 ../../../en/results/swimming/athlete-profile-n1482869-mckeon-emma.htm
## 8 ../../../en/results/swimming/athlete-profile-n1482869-mckeon-emma.htm
## 9 ../../../en/results/swimming/athlete-profile-n1482869-mckeon-emma.htm
## 10 ../../../en/results/swimming/athlete-profile-n1482870-mckeown-kaylee.htm
## 11 ../../../en/results/swimming/athlete-profile-n1482869-mckeon-emma.htm
## 12 ../../../en/results/swimming/athlete-profile-n1482870-mckeown-kaylee.htm
## 13 ../../../en/results/swimming/athlete-profile-n1482869-mckeon-emma.htm
## 14 ../../../en/results/swimming/athlete-profile-n1482870-mckeown-kaylee.htm
## 15 ../../../en/results/swimming/athlete-profile-n1482869-mckeon-emma.htm
## 16 ../../../en/results/swimming/athlete-profile-n1444193-scott-duncan.htm
## 17 ../../../en/results/swimming/athlete-profile-n1444193-scott-duncan.htm
## 18 ../../../en/results/swimming/athlete-profile-n1444193-scott-duncan.htm
## 19 ../../../en/results/athletics/athlete-profile-n1412290-thompson-herah-elaine.htm
## 20 ../../../en/results/athletics/athlete-profile-n1412290-thompson-herah-elaine.htm
## discipline.code event country
## 1 SWM Men's 100m Butterfly United States of America
## 2 SWM Men's 4 x 100m Freestyle Relay United States of America
## 3 SWM Men's 50m Freestyle United States of America
## 4 SWM Men's 100m Freestyle United States of America
## 5 SWM Men's 4 x 100m Medley Relay United States of America
## 6 SWM Women's 50m Freestyle Australia
## 7 SWM Women's 100m Freestyle Australia
## 8 SWM Women's 4 x 100m Freestyle Relay Australia
## 9 SWM Women's 4 x 100m Medley Relay Australia
## 10 SWM Women's 200m Backstroke Australia
## 11 SWM Mixed 4 x 100m Medley Relay Australia
## 12 SWM Women's 4 x 100m Medley Relay Australia
## 13 SWM Women's 100m Butterfly Australia
## 14 SWM Women's 100m Backstroke Australia
## 15 SWM Women's 4 x 200m Freestyle Relay Australia
## 16 SWM Men's 200m Individual Medley Great Britain
## 17 SWM Men's 200m Freestyle Great Britain
## 18 SWM Men's 4 x 100m Medley Relay Great Britain
## 19 ATH Women's 4 x 100m Relay Jamaica
## 20 ATH Women's 200m Jamaica
## discipline Country.Rank.by.no.Gold Country.Gold.Count Country.Silver.Count
## 1 Swimming 1 39 41
## 2 Swimming 1 39 41
## 3 Swimming 1 39 41
## 4 Swimming 1 39 41
## 5 Swimming 1 39 41
## 6 Swimming 6 17 7
## 7 Swimming 6 17 7
## 8 Swimming 6 17 7
## 9 Swimming 6 17 7
## 10 Swimming 6 17 7
## 11 Swimming 6 17 7
## 12 Swimming 6 17 7
## 13 Swimming 6 17 7
## 14 Swimming 6 17 7
## 15 Swimming 6 17 7
## 16 Swimming 4 22 21
## 17 Swimming 4 22 21
## 18 Swimming 4 22 21
## 19 Athletics 21 4 1
## 20 Athletics 21 4 1
## Country.Bronze.Count Country.Total.all.medals Country
## 1 33 113 United States of America
## 2 33 113 United States of America
## 3 33 113 United States of America
## 4 33 113 United States of America
## 5 33 113 United States of America
## 6 22 46 Australia
## 7 22 46 Australia
## 8 22 46 Australia
## 9 22 46 Australia
## 10 22 46 Australia
## 11 22 46 Australia
## 12 22 46 Australia
## 13 22 46 Australia
## 14 22 46 Australia
## 15 22 46 Australia
## 16 22 65 Great Britain
## 17 22 65 Great Britain
## 18 22 65 Great Britain
## 19 4 9 Jamaica
## 20 4 9 Jamaica
## athlete.total.medals
## 1 5
## 2 5
## 3 5
## 4 5
## 5 5
## 6 4
## 7 4
## 8 4
## 9 4
## 10 3
## 11 3
## 12 3
## 13 3
## 14 3
## 15 3
## 16 3
## 17 3
## 18 3
## 19 3
## 20 3