RioData.knit

1. Introduction

1.1 What do you remember about the 2016 Rio Olympics?
If you ask someone in the US, chances are they will say “Michael Phelps”. Michael dominated the coverage and became a media sensation, winning 5 gold medals and 1 silver, an extraordinary athlete by every measure.

Aside from this spotlight, Rio hosted over 11,000 athletes from over 200 nations, with over 300 events. Doing the math… 11000 x 200 x 300 .. we get an astronomical array of combinations to explore. We have curated some interesting insights about these games that show how special these games were. To get started, lets discuss the under-dog choice of Rio itself. - First Olympic Games to be held in South America - First time entrants from Kosova and South Sudan, and the Refugee Olympic Team - First Olympics to be held in South America - First to be held in Portuguese-speaking country - First summer games to be held in the host country’s winter season

We will take you on a journey to explore interesting facets of these games, exploring mins & max’s, highs & lows. We guarantee you will have a new appreciation for this historic event.

1.2 How will we do this?
First, we will do our own research into these games to learn more about this venue in particular. This provides the context for everything else, which we will not get by looking at the data. A key resource is: Olympic Website

Second, we will explore visualizations that others have done to get inspired, and to figure out what other unique angle we can bring. Tableau Public Rio

1.3 Our approach
We’ll use statistical and visual analysis to bring this data to life. This will appeal to visual learners as well as those that want to “see the numbers”. We will connect data about athletes with information about their home country so we can explore both avenues individually and collectively. We’ll employ the muscle of R (no pun intended) to compare different populations to see if there is truly a significant difference.

1.4 How will our analysis will help the consumer
We’ll provide clear interpretations and easy to digest information and visuals – we will do the hard work and provide insights to provide an engaging journey that is enteratining as well as informative.

2. Packages Used

## Load Required Packages ##

library(dplyr) ## Manipulating data
library(tidyverse) ## Tidying data
library(ggplot2) ## Visualizing data
library(readxl)  ## import Excel files
library(kableExtra) ## Data Formatting

3. Data Preparation

3.1 Source of data
This dataset comes from Kaggle - Rio Olympics, which also cites the Official Olympics Website as the original source. The dataset was created in Dec 2016.
It consists of the official statistics for all of the athletes and the 306 events at the 2016 Olympic Games in Rio de Janeiro.
3 csv files were available, but we only used two of them. Please see explanation under “Peculiarities”.
Athletes.csv
Countries.csv
Events. csv

3.2 Variables, Missing Values, Peculiarities
3.2.1 Variables
In this step we import athletes and countries and merge together on the nationality & country code. We specified “all.x = TRUE” which does a left join from our athlete file, which will allowus to see how many participants did not get associated with a country.

athletes <- read.csv("athletes.csv" , stringsAsFactors=FALSE)
#dim(athletes)

countries <- read.csv("countries.csv" , stringsAsFactors=FALSE)
# dim(countries)
# colnames(countries)

rio_orig <- merge(x=athletes, y=countries, by.x="nationality", by.y="code" , all.x = TRUE)
# dim(rio_orig)
# colnames(rio_orig)

Athletes data dictionary (manually created)

data_dict <- read_excel("data_dict.xlsx" )
athletes_data_dict <- filter(data_dict, csv == "athletes.csv" )
athletes_data_dict <- select(athletes_data_dict, -csv)


#athletes_data_dict %>%
 # kbl(caption = "Data Dictionary - Athletes") %>%
  #kable_classic(full_width = F, html_font = "Cambria")

Countries data dictionary (manually created)

countries_data_dict <- filter( data_dict, csv == "countries.csv" )
countries_data_dict <- select( countries_data_dict, -csv )
countries_data_dict %>%
kbl( caption = "Data Dictionary - Countries" ) %>%
kable_classic( full_width = F, html_font = "Cambria" )

Data Dictionary - Countries
Variable	Units	Type	Description
country	NA	char	full name of country
code	NA	char	3 letter code of country
population	NA	numerical	NA
gdp_per_capita	NA	numerical	gross domestic product per person

3.2.2 Missing Values and Imputation
First, let’s look at the number of complete observations.
We have 10,109 complete records out of 11,538 = 87% complete

num_complete <- sum(complete.cases(rio_orig)  ) 
print(paste(cat(num_complete),  " complete"))

## 10109[1] "  complete"

print(paste(cat(nrow(rio_orig)), " total"))

## 11538[1] "  total"

print(paste(cat(sum( complete.cases(rio_orig) ) / nrow(rio_orig) * 100 ), " %" ))

## 87.61484[1] "  %"

We have 5 columns with missing values
Percent Missing by Column

percent_missing <- as.data.frame(colSums(is.na(rio_orig))/nrow(rio_orig)*100)
names(percent_missing) <- c("percent_missing") 
percent_missing  %>% 
       filter(percent_missing  > 0 )

##                percent_missing
## height                2.860114
## weight                5.711562
## country               2.660773
## population            3.380135
## gdp_per_capita        7.072283

What should we do about missing values?
- Height and Weight variables would need to be imputed carefully. We will see later on that athletes with certain physical characterics are more likely to participate in certain events, so we would need to impute by gender as well as the sport. Thus, we decided to omit the NA’s from our analysis when using these variables rather than impute them.
- Population and GDP - this is very specific information about a country so it doesn’t make sense to impute.
- Country Looking at the nationality codes, we can look up the country and add it to the dataset.

rio_orig %>%
  filter( is.na(rio_orig$country) ) %>%
  group_by( country , nationality) %>%    #
  select( country, nationality ) -> missing_country

rio_orig$country[rio_orig$nationality == "KIR" ] <- "Kiribati"
rio_orig$country[rio_orig$nationality == "KOS" ] <- "Kosovo"
rio_orig$country[rio_orig$nationality == "MHL" ] <- "Marshall Islands"
rio_orig$country[rio_orig$nationality == "MNE" ] <- "Montenegro"
rio_orig$country[rio_orig$nationality == "ROU" ] <- "Romania"
rio_orig$country[rio_orig$nationality == "SRB" ] <- "Serbia"
rio_orig$country[rio_orig$nationality == "SSD" ] <- "South Sudan"
rio_orig$country[rio_orig$nationality == "TTO" ] <- "Trinidad"
rio_orig$country[rio_orig$nationality == "TUV" ] <- "Tuvalu"
rio_orig$country[rio_orig$nationality == "IOA" ] <- "Intl Olympic Assoc."
rio_orig$country[rio_orig$nationality == "ROT" ] <- "Refugee Olympic Tm"

#check
percent_missing <- as.data.frame(colSums(is.na(rio_orig))/nrow(rio_orig)*100)
names(percent_missing) <- c("percent_missing") 
percent_missing  %>% 
       filter(percent_missing  > 0 )

##                percent_missing
## height                2.860114
## weight                5.711562
## population            3.380135
## gdp_per_capita        7.072283

3.2.3 Peculiarities File events.csv: This file includes specific events, like “100m Womens Backstroke”, however, the key in this file did not link to the athletes so we did not use it. Athletes that were refugees competed under the country code of “ROT” (Refugee Olympic Team). Since this was not included in our country file, we manually added it so we could easily filter for these athletes.

3.3 Cleaning 3.3.1 gdp_per_capita First, we will rename column “gdp_per_capita” to just “gdp”, since the standard definition of gdp includes per capita, and a shorter name will allow our data to fit neatly onto the screen.
We will round the gdp to 0 decimals, since we are not doing anything in our analysis that needs more precision.

colnames(rio_orig)[15] <- "gdp"
rio_orig$gdp <- round(rio_orig$gdp)

#check: 
head(rio_orig$gdp, n=5)

## [1]  594  594  594 3945 3945

** 3.3.2 change sport = athletics** We will rename “athletics” in the sport variable to be “track and field”. We felt “athletics” was too general for the event category these athletes participated in.

rio_orig$sport[rio_orig$sport  == "athletics"] <- "track and field" 

#check
rio_orig %>% 
  filter(sport == "track and field") %>% 
        summarise( n = n()) %>% 
  kbl(caption = "track and field") %>%
  kable_classic(full_width = F, html_font = "Cambria")

track and field
n
2363

3.3.3 Do we have incorrect values? Everything looks realistic except age

num.cols <- c( 6,7,9,10,11 , 15 )
height_wt_summary<-sapply(rio_orig[,num.cols][sapply(num.cols,is.numeric)], function(num.cols) 
   c( 
                           "Mean"=    round(mean(na.omit(num.cols,na.rm=TRUE)),1),
                           "Median" = round(median(na.omit(num.cols)),1),
                          "Minimum" = round(min(na.omit(num.cols)),1),
                          "Maximun" = round(max(na.omit(num.cols)),1),
                          "Std Dev" = round(sd(na.omit(num.cols)),1),
                          "Count"   = length(na.omit(num.cols))

                     )
 )

 
# transpose 
height_weight_table <- as.data.frame(t(height_wt_summary))

height_weight_table %>% 
  kbl(caption = "track and field") %>%
  kable_classic(full_width = F, html_font = "Cambria")

track and field
	Mean	Median	Minimum	Maximun	Std Dev	Count
height	1.8	1.8	1.2	2.2	0.1	11208
weight	72.1	70.0	31.0	170.0	16.2	10879
gold	0.1	0.0	0.0	5.0	0.3	11538
silver	0.1	0.0	0.0	2.0	0.2	11538
bronze	0.1	0.0	0.0	2.0	0.2	11538
gdp	25214.4	19222.0	277.0	101450.0	20074.2	10722

Impute ages > 75 with the median age

print(paste( "The youngest person was ", min(rio_orig$age), " years old, and the oldest was " ,   max(rio_orig$age)  ) )

## [1] "The youngest person was  13  years old, and the oldest was  116"

median_age <- median(rio_orig$age) 
print(paste("The median age is: ", median_age))

## [1] "The median age is:  26"

# calculate the median of this column 
rio_orig$age[rio_orig$age > 75] <- median_age

#check
print(paste( "Check:  The youngest person was ", min(rio_orig$age), " years old, and the oldest was " ,   max(rio_orig$age)  ) )

## [1] "Check:  The youngest person was  13  years old, and the oldest was  62"

3.3.4 Create new variables Add new variables for height, weight using English units for our American audience, and calculate Body Mass Index (BMI)
\[height(inches) = height * 39.3701\] \[weight(lbs) = weight * 2.20462\] \[BMI = weight (kg) / [height (m)]^{2}\]

rio_orig["height_eng"] <- round( rio_orig$height * 39.3701 , 1)

rio_orig["weight_eng"] <- round( rio_orig$weight * 2.20462 , 1)

rio_orig["BMI"] <-        round( rio_orig$weight / (rio_orig$height)^2 , 1)

3.4 Review Cleaned Data Asses dimensions ** Rows and Columns

print( paste( "There are ", dim(rio_orig)[1], " rows" )) # rows of data

## [1] "There are  11538  rows"

print( paste( "There are ", dim(rio_orig)[2]  , " columns"  ))   # columns of data

## [1] "There are  18  columns"

Look at the beginning of the data

  select( rio_orig, -c( id, nationality )) %>%
  head(n = 5)  %>% 
  kbl(caption = "First 5 rows") %>%
  kable_classic(full_width = F, html_font = "Cambria")

First 5 rows
name	sex	dob	height	weight	sport	age	country	population	gdp	height_eng	weight_eng	BMI
Mohammad Tawfiq Bakhshi	male	3/11/1986	1.81	99	judo	30	Afghanistan	32526562	594	71.3	218.3	30.2
Kamia Yousufi	female	5/20/1996	1.65	55	track and field	20	Afghanistan	32526562	594	65.0	121.3	20.2
Abdul Wahab Zahiri	male	5/27/1992	1.75	68	track and field	24	Afghanistan	32526562	594	68.9	149.9	22.2
Nikol Merizaj	female	8/7/1998	1.80	65	aquatics	17	Albania	2889167	3945	70.9	143.3	20.1
Evagjelia Veli	female	7/16/1991	1.60	52	weightlifting	25	Albania	2889167	3945	63.0	114.6	20.3

Look at the end of the data I think we should move this to end of this section after all the cleaning is done

  select( rio_orig, -c( id, nationality )) %>%
  tail(n = 5)  %>% 
  kbl(caption = "Last 5 rows") %>%
  kable_classic(full_width = F, html_font = "Cambria")

Last 5 rows
	name	sex	dob	height	weight	sport	age	country	population	gdp	height_eng	weight_eng	BMI
11534	Cuthbert Nyasango	male	9/17/1982	1.63	56	track and field	33	Zimbabwe	15602751	924	64.2	123.5	21.1
11535	Sheila Makoto	female	1/14/1990	1.58	59	football	26	Zimbabwe	15602751	924	62.2	130.1	23.6
11536	Kirsty Leigh Coventry	female	9/16/1983	1.76	64	aquatics	32	Zimbabwe	15602751	924	69.3	141.1	20.7
11537	Pardon Ndhlovu	male	8/23/1987	1.58	54	track and field	28	Zimbabwe	15602751	924	62.2	119.0	21.6
11538	Rutendo Makore	female	9/30/1992	1.65	68	football	23	Zimbabwe	15602751	924	65.0	149.9	25.0

Look at the Datatypes

knitr::kable( sapply (rio_orig, class), "simple" , caption = "Data Types" , col.names = c("Type"))

Data Types
	Type
nationality	character
id	integer
name	character
sex	character
dob	character
height	numeric
weight	integer
sport	character
gold	integer
silver	integer
bronze	integer
age	numeric
country	character
population	integer
gdp	numeric
height_eng	numeric
weight_eng	numeric
BMI	numeric

We have no duplication of data

sum(duplicated(rio_orig))

## [1] 0

4. Proposed Data Analysis

4.1 Discuss how you plan to uncover new information in the data that is not self evident. What are different ways you can look at this data to answer the questions you want to answer? Do you plan to slice and dice the data in different ways, create new variables, or join separate data frame to create new summary information? How could you summarize your data to answer key questions?

Proposed plan: Below you will find some initial data discovery visualizations where we found some interesting patterns to take a deeper dive.
For example, there are a few sports that dominate participation… do they also dominate medal counts?
We see some interesting height, weight patterns for different sports … what can we learn from this?
We see that there is one sport where the age “looks” significantly higher than others – is it a statistical difference? What sport lends itself to a more mature athlete?
Our country data has shown some surprising insights already (below)
Who were the refugee athletes? Who competed under the International Olympic Association flag and not their own country? did they win any medals

4.2 What types of plot and tables will help you to illustrate the findings in your questions? * For data discovery of our numerical variables, box plots are great to see if there is some interesting variation to explore. * We have a lot of categorical data so bar plots will help dimensionalize these variables. * For “top 10” type of things, we use tables so we can display all of the data that went into the calculation.

4.3 What do you not know how to do right nowthat you need to learn to answer your questions? * Better visualization techniques – for example, how can you order bars in a bar chart?
* Better ways to display of information in tables – have used kable but struggle with formatting numbers * Calculations with dates – subtracting dates to get time inbetween, calculate age, things like that.

4.4 Do you plan to incorporate any machine learning techniques? * Possibly… what are the variables that might predict whether an athlete will win a medal? Age? Country? Sport? Height? Weight?

GENERAL ORIENTATION TO THE DATA WHICH SPORTS HAVE THE HIGHEST PARTICIPATION?

par(mfrow=c(1,2))
barplot(table(rio_orig$sex),    main = "Number Males & Females" )
barplot(table(rio_orig$sport), main = "Number of pariticipants by sport\n Two sports dominate\n We'll explore this further" , cex.names=0.5 )

WHICH SPORTS ARE DOMINATED BY MALES , FEMALES?
NOT SURE THIS IS THE BEST VIZ We have uneven participation in certain sports by males and females. We will explore this further

mosaicplot(sport~sex,data=rio_orig, main = "Males and Females by Sport \nSome uneven participartion - We will explore this further", color=c("#fb8072", "#8dd3c7", cex.main =20, cex.lab = 5, cex.axis = 3)
          )

ATHLETE CHARACTERISTICS

plot(rio_orig$weight_eng, rio_orig$height_eng, ylab = "Height (in)", xlab = "Weight (lb)", main = "Height vs Weight of Athletes\n We see a general positive relationship")

Is there a height advantage in certain events? We will explore this further

boxplot(height ~ sport, data = rio_orig, main = "Height vs Event")

Is there a weight advantage in certain events? We will explore this further

boxplot(weight ~ sport, data = rio_orig, main = "Weight vs Event")

STATISICAL ANALYSIS COMPARING AGE? Do certain events attract older or younger athletes? We will explore this further

boxplot(age ~ sport, data = rio_orig, main = "Age vs Event")

Country Exploration

top_gdp_n <- rio_orig %>%  
    group_by( country ) %>% 
       summarise(
         gdp=max(gdp) ,
         n=n()
              )

Athletes without a country

Independent Olympic Athletes - Kuwaiti athletes who competed under the Olympic flag, as the Kuwait Olympic Committee had been suspended by the International Olympic Committee (IOC) for the second time in five years due to governmental interference
Kuwaiti shooter Fehaid Al-Deehani became the first independent athlete to win a gold medal

#[](https://www.olympiandatabase.com/index.php?id=123580&L=1)

#[](https://www.unhcr.org/en-us/rio-2016-refugee-olympic-team.html)

What are the top 5 nations with the lowest gdp that sent athletes? How many did they send?

top_gdp_n %>% 
            arrange(gdp) %>% 
                head(n=5) %>% 
                     kbl(caption = "") %>%
                     kable_classic(full_width = F, html_font = "Cambria")


country	gdp	n
Burundi	277	9
Central African Republic	323	6
Niger	359	6
Malawi	372	5
Madagascar	402	6

What are the top 5 nations with the highest gdp that sent athletes? How many did they send? USA not in list!

rio_orig %>%  
    group_by( country ) %>% 
       summarise(
        highest_gdp=max(gdp) ,
         n=n()
              ) %>% 
            arrange(desc(highest_gdp)) %>% 
                head(n=5) %>% 
                     kbl(caption = "") %>%
                     kable_classic(full_width = F, html_font = "Cambria")


country	highest_gdp	n
Luxembourg	101450	10
Switzerland	80945	104
Norway	74400	62
Qatar	73653	39
Ireland	61134	80

Which countries received the most medals?

*Which countries were the most productive?* Jamaica - a tiny nation that had over half of it’s athletes earning a medal.**

\[productivity = #medals / #athletes\]

medal_per_country %>% 
         group_by(country, total_medals, n) %>% 
            summarise(
            
            productivity = total_medals / n
               ) %>% 
              arrange(desc(productivity))  %>% 
                  head(n=10) %>% 
                      kbl(caption = "") %>%
                      kable_classic(full_width = F, html_font = "Cambria")


country	total_medals	n	productivity
Jamaica	30	57	0.5263158
Serbia	53	103	0.5145631
United States	264	567	0.4656085
Russia	115	286	0.4020979
United Kingdom	145	374	0.3877005
Germany	160	441	0.3628118
Azerbaijan	18	56	0.3214286
Denmark	41	128	0.3203125
Norway	19	62	0.3064516
China	113	404	0.2797030

Rio: the Unsung Stories

Rio: the Unsung Stories

Delores Mincarelli, Saloni Goel, Naveen Reddy

November 8th, 2021

1. Introduction

2. Packages Used

3. Data Preparation

4. Proposed Data Analysis