1. Introduction

1.1 What do you remember about the 2016 Rio Olympics?
If you ask someone in the US, chances are they will say “Michael Phelps”. Michael dominated the coverage and became a media sensation, winning 5 gold medals and 1 silver, an extraordinary athlete by every measure.

Aside from this spotlight, Rio hosted over 11,000 athletes from over 200 nations, with over 300 events. Doing the math… 11000 x 200 x 300 .. we get an astronomical array of combinations to explore. We have curated some interesting insights about these games that show how special these games were.   To get started, lets discuss the under-dog choice of Rio itself. - First Olympic Games to be held in South America - First time entrants from Kosova and South Sudan, and the Refugee Olympic Team - First Olympics to be held in South America - First to be held in Portuguese-speaking country - First summer games to be held in the host country’s winter season

We will take you on a journey to explore interesting facets of these games, exploring mins & max’s, highs & lows.   We guarantee you will have a new appreciation for this historic event.

1.2 How will we do this?
First, we will do our own research into these games to learn more about this venue in particular. This provides the context for everything else, which we will not get by looking at the data. A key resource is: Olympic Website

Second, we will explore visualizations that others have done to get inspired, and to figure out what other unique angle we can bring. Tableau Public Rio

1.3 Our approach
We’ll use statistical and visual analysis to bring this data to life. This will appeal to visual learners as well as those that want to “see the numbers”. We will connect data about athletes with information about their home country so we can explore both avenues individually and collectively. We’ll employ the muscle of R (no pun intended) to compare different populations to see if there is truly a significant difference. 

1.4 How will our analysis will help the consumer
We’ll provide clear interpretations and easy to digest information and visuals – we will do the hard work and provide insights to provide an engaging journey that is enteratining as well as informative.

2. Packages Used

## Load Required Packages ##

library(dplyr) ## Manipulating data
library(tidyverse) ## Tidying data
library(ggplot2) ## Visualizing data
library(readxl)  ## import Excel files
library(kableExtra) ## Data Formatting

3. Data Preparation

3.1 Source of data
This dataset comes from Kaggle - Rio Olympics, which also cites the Official Olympics Website as the original source. The dataset was created in Dec 2016.
It consists of the official statistics for all of the athletes and the 306 events at the 2016 Olympic Games in Rio de Janeiro.
3 csv files were available, but we only used two of them. Please see explanation under “Peculiarities”.
Athletes.csv
Countries.csv
Events. csv

3.2 Variables, Missing Values, Peculiarities
3.2.1 Variables
In this step we import athletes and countries and merge together on the nationality & country code. We specified “all.x = TRUE” which does a left join from our athlete file, which will allowus to see how many participants did not get associated with a country.

athletes <- read.csv("athletes.csv" , stringsAsFactors=FALSE)
#dim(athletes)

countries <- read.csv("countries.csv" , stringsAsFactors=FALSE)
# dim(countries)
# colnames(countries)

rio_orig <- merge(x=athletes, y=countries, by.x="nationality", by.y="code" , all.x = TRUE)
# dim(rio_orig)
# colnames(rio_orig)

Athletes data dictionary (manually created)

data_dict <- read_excel("data_dict.xlsx" )
athletes_data_dict <- filter(data_dict, csv == "athletes.csv" )
athletes_data_dict <- select(athletes_data_dict, -csv)


#athletes_data_dict %>%
 # kbl(caption = "Data Dictionary - Athletes") %>%
  #kable_classic(full_width = F, html_font = "Cambria")

Countries data dictionary (manually created)

countries_data_dict <- filter( data_dict, csv == "countries.csv" )
countries_data_dict <- select( countries_data_dict, -csv )
countries_data_dict %>%
kbl( caption = "Data Dictionary - Countries" ) %>%
kable_classic( full_width = F, html_font = "Cambria" )
Data Dictionary - Countries
Variable Units Type Description
country NA char full name of country
code NA char 3 letter code of country
population NA numerical NA
gdp_per_capita NA numerical gross domestic product per person

3.2.2 Missing Values and Imputation
First, let’s look at the number of complete observations.
We have 10,109 complete records out of 11,538 = 87% complete

num_complete <- sum(complete.cases(rio_orig)  ) 
print(paste(cat(num_complete),  " complete"))
## 10109[1] "  complete"
print(paste(cat(nrow(rio_orig)), " total"))
## 11538[1] "  total"
print(paste(cat(sum( complete.cases(rio_orig) ) / nrow(rio_orig) * 100 ), " %" ))
## 87.61484[1] "  %"

We have 5 columns with missing values
Percent Missing by Column

percent_missing <- as.data.frame(colSums(is.na(rio_orig))/nrow(rio_orig)*100)
names(percent_missing) <- c("percent_missing") 
percent_missing  %>% 
       filter(percent_missing  > 0 )
##                percent_missing
## height                2.860114
## weight                5.711562
## country               2.660773
## population            3.380135
## gdp_per_capita        7.072283

What should we do about missing values?
- Height and Weight variables would need to be imputed carefully. We will see later on that athletes with certain physical characterics are more likely to participate in certain events, so we would need to impute by gender as well as the sport. Thus, we decided to omit the NA’s from our analysis when using these variables rather than impute them.
- Population and GDP - this is very specific information about a country so it doesn’t make sense to impute.
- Country Looking at the nationality codes, we can look up the country and add it to the dataset.

rio_orig %>%
  filter( is.na(rio_orig$country) ) %>%
  group_by( country , nationality) %>%    #
  select( country, nationality ) -> missing_country

rio_orig$country[rio_orig$nationality == "KIR" ] <- "Kiribati"
rio_orig$country[rio_orig$nationality == "KOS" ] <- "Kosovo"
rio_orig$country[rio_orig$nationality == "MHL" ] <- "Marshall Islands"
rio_orig$country[rio_orig$nationality == "MNE" ] <- "Montenegro"
rio_orig$country[rio_orig$nationality == "ROU" ] <- "Romania"
rio_orig$country[rio_orig$nationality == "SRB" ] <- "Serbia"
rio_orig$country[rio_orig$nationality == "SSD" ] <- "South Sudan"
rio_orig$country[rio_orig$nationality == "TTO" ] <- "Trinidad"
rio_orig$country[rio_orig$nationality == "TUV" ] <- "Tuvalu"
rio_orig$country[rio_orig$nationality == "IOA" ] <- "Intl Olympic Assoc."
rio_orig$country[rio_orig$nationality == "ROT" ] <- "Refugee Olympic Tm"

#check
percent_missing <- as.data.frame(colSums(is.na(rio_orig))/nrow(rio_orig)*100)
names(percent_missing) <- c("percent_missing") 
percent_missing  %>% 
       filter(percent_missing  > 0 )
##                percent_missing
## height                2.860114
## weight                5.711562
## population            3.380135
## gdp_per_capita        7.072283

3.2.3 Peculiarities File events.csv: This file includes specific events, like “100m Womens Backstroke”, however, the key in this file did not link to the athletes so we did not use it. Athletes that were refugees competed under the country code of “ROT” (Refugee Olympic Team). Since this was not included in our country file, we manually added it so we could easily filter for these athletes.

3.3 Cleaning 3.3.1 gdp_per_capita First, we will rename column “gdp_per_capita” to just “gdp”, since the standard definition of gdp includes per capita, and a shorter name will allow our data to fit neatly onto the screen.
We will round the gdp to 0 decimals, since we are not doing anything in our analysis that needs more precision.

colnames(rio_orig)[15] <- "gdp"
rio_orig$gdp <- round(rio_orig$gdp)

#check: 
head(rio_orig$gdp, n=5)
## [1]  594  594  594 3945 3945

** 3.3.2 change sport = athletics** We will rename “athletics” in the sport variable to be “track and field”. We felt “athletics” was too general for the event category these athletes participated in.

rio_orig$sport[rio_orig$sport  == "athletics"] <- "track and field" 

#check
rio_orig %>% 
  filter(sport == "track and field") %>% 
        summarise( n = n()) %>% 
  kbl(caption = "track and field") %>%
  kable_classic(full_width = F, html_font = "Cambria")
track and field
n
2363

3.3.3 Do we have incorrect values? Everything looks realistic except age

num.cols <- c( 6,7,9,10,11 , 15 )
height_wt_summary<-sapply(rio_orig[,num.cols][sapply(num.cols,is.numeric)], function(num.cols) 
   c( 
                           "Mean"=    round(mean(na.omit(num.cols,na.rm=TRUE)),1),
                           "Median" = round(median(na.omit(num.cols)),1),
                          "Minimum" = round(min(na.omit(num.cols)),1),
                          "Maximun" = round(max(na.omit(num.cols)),1),
                          "Std Dev" = round(sd(na.omit(num.cols)),1),
                          "Count"   = length(na.omit(num.cols))

                     )
 )

 
# transpose 
height_weight_table <- as.data.frame(t(height_wt_summary))

height_weight_table %>% 
  kbl(caption = "track and field") %>%
  kable_classic(full_width = F, html_font = "Cambria")
track and field
Mean Median Minimum Maximun Std Dev Count
height 1.8 1.8 1.2 2.2 0.1 11208
weight 72.1 70.0 31.0 170.0 16.2 10879
gold 0.1 0.0 0.0 5.0 0.3 11538
silver 0.1 0.0 0.0 2.0 0.2 11538
bronze 0.1 0.0 0.0 2.0 0.2 11538
gdp 25214.4 19222.0 277.0 101450.0 20074.2 10722

Impute ages > 75 with the median age

print(paste( "The youngest person was ", min(rio_orig$age), " years old, and the oldest was " ,   max(rio_orig$age)  ) ) 
## [1] "The youngest person was  13  years old, and the oldest was  116"
median_age <- median(rio_orig$age) 
print(paste("The median age is: ", median_age))
## [1] "The median age is:  26"
# calculate the median of this column 
rio_orig$age[rio_orig$age > 75] <- median_age

#check
print(paste( "Check:  The youngest person was ", min(rio_orig$age), " years old, and the oldest was " ,   max(rio_orig$age)  ) )
## [1] "Check:  The youngest person was  13  years old, and the oldest was  62"

3.3.4 Create new variables Add new variables for height, weight using English units for our American audience, and calculate Body Mass Index (BMI)
\[height(inches) = height * 39.3701\] \[weight(lbs) = weight * 2.20462\] \[BMI = weight (kg) / [height (m)]^{2}\]

rio_orig["height_eng"] <- round( rio_orig$height * 39.3701 , 1)

rio_orig["weight_eng"] <- round( rio_orig$weight * 2.20462 , 1)

rio_orig["BMI"] <-        round( rio_orig$weight / (rio_orig$height)^2 , 1)

3.4 Review Cleaned Data Asses dimensions ** Rows and Columns

print( paste( "There are ", dim(rio_orig)[1], " rows" )) # rows of data
## [1] "There are  11538  rows"
print( paste( "There are ", dim(rio_orig)[2]  , " columns"  ))   # columns of data
## [1] "There are  18  columns"

Look at the beginning of the data

  select( rio_orig, -c( id, nationality )) %>%
  head(n = 5)  %>% 
  kbl(caption = "First 5 rows") %>%
  kable_classic(full_width = F, html_font = "Cambria")
First 5 rows
name sex dob height weight sport gold silver bronze age country population gdp height_eng weight_eng BMI
Mohammad Tawfiq Bakhshi male 3/11/1986 1.81 99 judo 0 0 0 30 Afghanistan 32526562 594 71.3 218.3 30.2
Kamia Yousufi female 5/20/1996 1.65 55 track and field 0 0 0 20 Afghanistan 32526562 594 65.0 121.3 20.2
Abdul Wahab Zahiri male 5/27/1992 1.75 68 track and field 0 0 0 24 Afghanistan 32526562 594 68.9 149.9 22.2
Nikol Merizaj female 8/7/1998 1.80 65 aquatics 0 0 0 17 Albania 2889167 3945 70.9 143.3 20.1
Evagjelia Veli female 7/16/1991 1.60 52 weightlifting 0 0 0 25 Albania 2889167 3945 63.0 114.6 20.3

Look at the end of the data I think we should move this to end of this section after all the cleaning is done

  select( rio_orig, -c( id, nationality )) %>%
  tail(n = 5)  %>% 
  kbl(caption = "Last 5 rows") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Last 5 rows
name sex dob height weight sport gold silver bronze age country population gdp height_eng weight_eng BMI
11534 Cuthbert Nyasango male 9/17/1982 1.63 56 track and field 0 0 0 33 Zimbabwe 15602751 924 64.2 123.5 21.1
11535 Sheila Makoto female 1/14/1990 1.58 59 football 0 0 0 26 Zimbabwe 15602751 924 62.2 130.1 23.6
11536 Kirsty Leigh Coventry female 9/16/1983 1.76 64 aquatics 0 0 0 32 Zimbabwe 15602751 924 69.3 141.1 20.7
11537 Pardon Ndhlovu male 8/23/1987 1.58 54 track and field 0 0 0 28 Zimbabwe 15602751 924 62.2 119.0 21.6
11538 Rutendo Makore female 9/30/1992 1.65 68 football 0 0 0 23 Zimbabwe 15602751 924 65.0 149.9 25.0

Look at the Datatypes

knitr::kable( sapply (rio_orig, class), "simple" , caption = "Data Types" , col.names = c("Type"))
Data Types
Type
nationality character
id integer
name character
sex character
dob character
height numeric
weight integer
sport character
gold integer
silver integer
bronze integer
age numeric
country character
population integer
gdp numeric
height_eng numeric
weight_eng numeric
BMI numeric

We have no duplication of data

sum(duplicated(rio_orig))
## [1] 0

4. Proposed Data Analysis

4.1 Discuss how you plan to uncover new information in the data that is not self evident. What are different ways you can look at this data to answer the questions you want to answer? Do you plan to slice and dice the data in different ways, create new variables, or join separate data frame to create new summary information? How could you summarize your data to answer key questions?

  • Proposed plan: Below you will find some initial data discovery visualizations where we found some interesting patterns to take a deeper dive.
  • For example, there are a few sports that dominate participation… do they also dominate medal counts?
  • We see some interesting height, weight patterns for different sports … what can we learn from this?
  • We see that there is one sport where the age “looks” significantly higher than others – is it a statistical difference? What sport lends itself to a more mature athlete?
  • Our country data has shown some surprising insights already (below)
  • Who were the refugee athletes? Who competed under the International Olympic Association flag and not their own country? did they win any medals

4.2 What types of plot and tables will help you to illustrate the findings in your questions? * For data discovery of our numerical variables, box plots are great to see if there is some interesting variation to explore. * We have a lot of categorical data so bar plots will help dimensionalize these variables. * For “top 10” type of things, we use tables so we can display all of the data that went into the calculation.

4.3 What do you not know how to do right nowthat you need to learn to answer your questions? * Better visualization techniques – for example, how can you order bars in a bar chart?
* Better ways to display of information in tables – have used kable but struggle with formatting numbers * Calculations with dates – subtracting dates to get time inbetween, calculate age, things like that.

4.4 Do you plan to incorporate any machine learning techniques? * Possibly… what are the variables that might predict whether an athlete will win a medal? Age? Country? Sport? Height? Weight?

GENERAL ORIENTATION TO THE DATA WHICH SPORTS HAVE THE HIGHEST PARTICIPATION?

par(mfrow=c(1,2))
barplot(table(rio_orig$sex),    main = "Number Males & Females" )
barplot(table(rio_orig$sport), main = "Number of pariticipants by sport\n Two sports dominate\n We'll explore this further" , cex.names=0.5 )

WHICH SPORTS ARE DOMINATED BY MALES , FEMALES?
NOT SURE THIS IS THE BEST VIZ We have uneven participation in certain sports by males and females. We will explore this further

mosaicplot(sport~sex,data=rio_orig, main = "Males and Females by Sport \nSome uneven participartion - We will explore this further", color=c("#fb8072", "#8dd3c7", cex.main =20, cex.lab = 5, cex.axis = 3)
          ) 

ATHLETE CHARACTERISTICS

plot(rio_orig$weight_eng, rio_orig$height_eng, ylab = "Height (in)", xlab = "Weight (lb)", main = "Height vs Weight of Athletes\n We see a general positive relationship")

Is there a height advantage in certain events? We will explore this further

boxplot(height ~ sport, data = rio_orig, main = "Height vs Event")

Is there a weight advantage in certain events? We will explore this further

boxplot(weight ~ sport, data = rio_orig, main = "Weight vs Event")

STATISICAL ANALYSIS COMPARING AGE? Do certain events attract older or younger athletes? We will explore this further

boxplot(age ~ sport, data = rio_orig, main = "Age vs Event")

Country Exploration

top_gdp_n <- rio_orig %>%  
    group_by( country ) %>% 
       summarise(
         gdp=max(gdp) ,
         n=n()
              )

Athletes without a country

  • Independent Olympic Athletes - Kuwaiti athletes who competed under the Olympic flag, as the Kuwait Olympic Committee had been suspended by the International Olympic Committee (IOC) for the second time in five years due to governmental interference

  • Kuwaiti shooter Fehaid Al-Deehani became the first independent athlete to win a gold medal

#[](https://www.olympiandatabase.com/index.php?id=123580&L=1)

#[](https://www.unhcr.org/en-us/rio-2016-refugee-olympic-team.html)

What are the top 5 nations with the lowest gdp that sent athletes? How many did they send?

top_gdp_n %>% 
            arrange(gdp) %>% 
                head(n=5) %>% 
                     kbl(caption = "") %>%
                     kable_classic(full_width = F, html_font = "Cambria")
country gdp n
Burundi 277 9
Central African Republic 323 6
Niger 359 6
Malawi 372 5
Madagascar 402 6

What are the top 5 nations with the highest gdp that sent athletes? How many did they send? USA not in list!

rio_orig %>%  
    group_by( country ) %>% 
       summarise(
        highest_gdp=max(gdp) ,
         n=n()
              ) %>% 
            arrange(desc(highest_gdp)) %>% 
                head(n=5) %>% 
                     kbl(caption = "") %>%
                     kable_classic(full_width = F, html_font = "Cambria")
country highest_gdp n
Luxembourg 101450 10
Switzerland 80945 104
Norway 74400 62
Qatar 73653 39
Ireland 61134 80

Which countries received the most medals?

*Which countries were the most productive?* Jamaica - a tiny nation that had over half of it’s athletes earning a medal.**

\[productivity = #medals / #athletes\]

medal_per_country %>% 
         group_by(country, total_medals, n) %>% 
            summarise(
            
            productivity = total_medals / n
               ) %>% 
              arrange(desc(productivity))  %>% 
                  head(n=10) %>% 
                      kbl(caption = "") %>%
                      kable_classic(full_width = F, html_font = "Cambria")
country total_medals n productivity
Jamaica 30 57 0.5263158
Serbia 53 103 0.5145631
United States 264 567 0.4656085
Russia 115 286 0.4020979
United Kingdom 145 374 0.3877005
Germany 160 441 0.3628118
Azerbaijan 18 56 0.3214286
Denmark 41 128 0.3203125
Norway 19 62 0.3064516
China 113 404 0.2797030