1.1 What do you remember about the 2016 Rio Olympics?
If you ask someone in the US, chances are they will say “Michael Phelps”. Michael dominated the coverage and became a media sensation, winning 5 gold medals and 1 silver, an extraordinary athlete by every measure.
Aside from this spotlight, Rio hosted over 11,000 athletes from over 200 nations, with over 300 events. Doing the math… 11000 x 200 x 300 .. we get an astronomical array of combinations to explore. We have curated some interesting insights about these games that show how special these games were. To get started, lets discuss the under-dog choice of Rio itself. - First Olympic Games to be held in South America - First time entrants from Kosova and South Sudan, and the Refugee Olympic Team - First Olympics to be held in South America - First to be held in Portuguese-speaking country - First summer games to be held in the host country’s winter season
We will take you on a journey to explore interesting facets of these games, exploring mins & max’s, highs & lows. We guarantee you will have a new appreciation for this historic event.
1.2 How will we do this?
First, we will do our own research into these games to learn more about this venue in particular. This provides the context for everything else, which we will not get by looking at the data. A key resource is: Olympic Website
Second, we will explore visualizations that others have done to get inspired, and to figure out what other unique angle we can bring. Tableau Public Rio
1.3 Our approach
We’ll use statistical and visual analysis to bring this data to life. This will appeal to visual learners as well as those that want to “see the numbers”. We will connect data about athletes with information about their home country so we can explore both avenues individually and collectively. We’ll employ the muscle of R (no pun intended) to compare different populations to see if there is truly a significant difference.
1.4 How will our analysis will help the consumer
We’ll provide clear interpretations and easy to digest information and visuals – we will do the hard work and provide insights to provide an engaging journey that is enteratining as well as informative.
3.1 Source of data
This dataset comes from Kaggle - Rio Olympics, which also cites the Official Olympics Website as the original source. The dataset was created in Dec 2016.
It consists of the official statistics for all of the athletes and the 306 events at the 2016 Olympic Games in Rio de Janeiro.
3 csv files were available, but we only used two of them. Please see explanation under “Peculiarities”.
Athletes.csv
Countries.csv
Events. csv
3.2 Variables, Missing Values, Peculiarities
3.2.1 Variables
In this step we import athletes and countries and merge together on the nationality & country code. We specified “all.x = TRUE” which does a left join from our athlete file, which will allowus to see how many participants did not get associated with a country.
athletes <- read.csv("athletes.csv" , stringsAsFactors=FALSE)
#dim(athletes)
countries <- read.csv("countries.csv" , stringsAsFactors=FALSE)
# dim(countries)
# colnames(countries)
rio_orig <- merge(x=athletes, y=countries, by.x="nationality", by.y="code" , all.x = TRUE)
# dim(rio_orig)
# colnames(rio_orig) Athletes data dictionary (manually created)
data_dict <- read_excel("data_dict.xlsx" )
athletes_data_dict <- filter(data_dict, csv == "athletes.csv" )
athletes_data_dict <- select(athletes_data_dict, -csv)
#athletes_data_dict %>%
# kbl(caption = "Data Dictionary - Athletes") %>%
#kable_classic(full_width = F, html_font = "Cambria") Countries data dictionary (manually created)
countries_data_dict <- filter( data_dict, csv == "countries.csv" )
countries_data_dict <- select( countries_data_dict, -csv )
countries_data_dict %>%
kbl( caption = "Data Dictionary - Countries" ) %>%
kable_classic( full_width = F, html_font = "Cambria" )| Variable | Units | Type | Description |
|---|---|---|---|
| country | NA | char | full name of country |
| code | NA | char | 3 letter code of country |
| population | NA | numerical | NA |
| gdp_per_capita | NA | numerical | gross domestic product per person |
3.2.2 Missing Values and Imputation
First, let’s look at the number of complete observations.
We have 10,109 complete records out of 11,538 = 87% complete
## 10109[1] " complete"
## 11538[1] " total"
## 87.61484[1] " %"
We have 5 columns with missing values
Percent Missing by Column
percent_missing <- as.data.frame(colSums(is.na(rio_orig))/nrow(rio_orig)*100)
names(percent_missing) <- c("percent_missing")
percent_missing %>%
filter(percent_missing > 0 )## percent_missing
## height 2.860114
## weight 5.711562
## country 2.660773
## population 3.380135
## gdp_per_capita 7.072283
What should we do about missing values?
- Height and Weight variables would need to be imputed carefully. We will see later on that athletes with certain physical characterics are more likely to participate in certain events, so we would need to impute by gender as well as the sport. Thus, we decided to omit the NA’s from our analysis when using these variables rather than impute them.
- Population and GDP - this is very specific information about a country so it doesn’t make sense to impute.
- Country Looking at the nationality codes, we can look up the country and add it to the dataset.
rio_orig %>%
filter( is.na(rio_orig$country) ) %>%
group_by( country , nationality) %>% #
select( country, nationality ) -> missing_country
rio_orig$country[rio_orig$nationality == "KIR" ] <- "Kiribati"
rio_orig$country[rio_orig$nationality == "KOS" ] <- "Kosovo"
rio_orig$country[rio_orig$nationality == "MHL" ] <- "Marshall Islands"
rio_orig$country[rio_orig$nationality == "MNE" ] <- "Montenegro"
rio_orig$country[rio_orig$nationality == "ROU" ] <- "Romania"
rio_orig$country[rio_orig$nationality == "SRB" ] <- "Serbia"
rio_orig$country[rio_orig$nationality == "SSD" ] <- "South Sudan"
rio_orig$country[rio_orig$nationality == "TTO" ] <- "Trinidad"
rio_orig$country[rio_orig$nationality == "TUV" ] <- "Tuvalu"
rio_orig$country[rio_orig$nationality == "IOA" ] <- "Intl Olympic Assoc."
rio_orig$country[rio_orig$nationality == "ROT" ] <- "Refugee Olympic Tm"
#check
percent_missing <- as.data.frame(colSums(is.na(rio_orig))/nrow(rio_orig)*100)
names(percent_missing) <- c("percent_missing")
percent_missing %>%
filter(percent_missing > 0 )## percent_missing
## height 2.860114
## weight 5.711562
## population 3.380135
## gdp_per_capita 7.072283
3.2.3 Peculiarities File events.csv: This file includes specific events, like “100m Womens Backstroke”, however, the key in this file did not link to the athletes so we did not use it. Athletes that were refugees competed under the country code of “ROT” (Refugee Olympic Team). Since this was not included in our country file, we manually added it so we could easily filter for these athletes.
3.3 Cleaning 3.3.1 gdp_per_capita First, we will rename column “gdp_per_capita” to just “gdp”, since the standard definition of gdp includes per capita, and a shorter name will allow our data to fit neatly onto the screen.
We will round the gdp to 0 decimals, since we are not doing anything in our analysis that needs more precision.
## [1] 594 594 594 3945 3945
** 3.3.2 change sport = athletics** We will rename “athletics” in the sport variable to be “track and field”. We felt “athletics” was too general for the event category these athletes participated in.
rio_orig$sport[rio_orig$sport == "athletics"] <- "track and field"
#check
rio_orig %>%
filter(sport == "track and field") %>%
summarise( n = n()) %>%
kbl(caption = "track and field") %>%
kable_classic(full_width = F, html_font = "Cambria")| n |
|---|
| 2363 |
3.3.3 Do we have incorrect values? Everything looks realistic except age
num.cols <- c( 6,7,9,10,11 , 15 )
height_wt_summary<-sapply(rio_orig[,num.cols][sapply(num.cols,is.numeric)], function(num.cols)
c(
"Mean"= round(mean(na.omit(num.cols,na.rm=TRUE)),1),
"Median" = round(median(na.omit(num.cols)),1),
"Minimum" = round(min(na.omit(num.cols)),1),
"Maximun" = round(max(na.omit(num.cols)),1),
"Std Dev" = round(sd(na.omit(num.cols)),1),
"Count" = length(na.omit(num.cols))
)
)
# transpose
height_weight_table <- as.data.frame(t(height_wt_summary))
height_weight_table %>%
kbl(caption = "track and field") %>%
kable_classic(full_width = F, html_font = "Cambria")| Mean | Median | Minimum | Maximun | Std Dev | Count | |
|---|---|---|---|---|---|---|
| height | 1.8 | 1.8 | 1.2 | 2.2 | 0.1 | 11208 |
| weight | 72.1 | 70.0 | 31.0 | 170.0 | 16.2 | 10879 |
| gold | 0.1 | 0.0 | 0.0 | 5.0 | 0.3 | 11538 |
| silver | 0.1 | 0.0 | 0.0 | 2.0 | 0.2 | 11538 |
| bronze | 0.1 | 0.0 | 0.0 | 2.0 | 0.2 | 11538 |
| gdp | 25214.4 | 19222.0 | 277.0 | 101450.0 | 20074.2 | 10722 |
Impute ages > 75 with the median age
print(paste( "The youngest person was ", min(rio_orig$age), " years old, and the oldest was " , max(rio_orig$age) ) ) ## [1] "The youngest person was 13 years old, and the oldest was 116"
## [1] "The median age is: 26"
# calculate the median of this column
rio_orig$age[rio_orig$age > 75] <- median_age
#check
print(paste( "Check: The youngest person was ", min(rio_orig$age), " years old, and the oldest was " , max(rio_orig$age) ) )## [1] "Check: The youngest person was 13 years old, and the oldest was 62"
3.3.4 Create new variables Add new variables for height, weight using English units for our American audience, and calculate Body Mass Index (BMI)
\[height(inches) = height * 39.3701\] \[weight(lbs) = weight * 2.20462\] \[BMI = weight (kg) / [height (m)]^{2}\]
rio_orig["height_eng"] <- round( rio_orig$height * 39.3701 , 1)
rio_orig["weight_eng"] <- round( rio_orig$weight * 2.20462 , 1)
rio_orig["BMI"] <- round( rio_orig$weight / (rio_orig$height)^2 , 1) 3.4 Review Cleaned Data Asses dimensions ** Rows and Columns
## [1] "There are 11538 rows"
## [1] "There are 18 columns"
Look at the beginning of the data
select( rio_orig, -c( id, nationality )) %>%
head(n = 5) %>%
kbl(caption = "First 5 rows") %>%
kable_classic(full_width = F, html_font = "Cambria")| name | sex | dob | height | weight | sport | gold | silver | bronze | age | country | population | gdp | height_eng | weight_eng | BMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mohammad Tawfiq Bakhshi | male | 3/11/1986 | 1.81 | 99 | judo | 0 | 0 | 0 | 30 | Afghanistan | 32526562 | 594 | 71.3 | 218.3 | 30.2 |
| Kamia Yousufi | female | 5/20/1996 | 1.65 | 55 | track and field | 0 | 0 | 0 | 20 | Afghanistan | 32526562 | 594 | 65.0 | 121.3 | 20.2 |
| Abdul Wahab Zahiri | male | 5/27/1992 | 1.75 | 68 | track and field | 0 | 0 | 0 | 24 | Afghanistan | 32526562 | 594 | 68.9 | 149.9 | 22.2 |
| Nikol Merizaj | female | 8/7/1998 | 1.80 | 65 | aquatics | 0 | 0 | 0 | 17 | Albania | 2889167 | 3945 | 70.9 | 143.3 | 20.1 |
| Evagjelia Veli | female | 7/16/1991 | 1.60 | 52 | weightlifting | 0 | 0 | 0 | 25 | Albania | 2889167 | 3945 | 63.0 | 114.6 | 20.3 |
Look at the end of the data I think we should move this to end of this section after all the cleaning is done
select( rio_orig, -c( id, nationality )) %>%
tail(n = 5) %>%
kbl(caption = "Last 5 rows") %>%
kable_classic(full_width = F, html_font = "Cambria")| name | sex | dob | height | weight | sport | gold | silver | bronze | age | country | population | gdp | height_eng | weight_eng | BMI | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11534 | Cuthbert Nyasango | male | 9/17/1982 | 1.63 | 56 | track and field | 0 | 0 | 0 | 33 | Zimbabwe | 15602751 | 924 | 64.2 | 123.5 | 21.1 |
| 11535 | Sheila Makoto | female | 1/14/1990 | 1.58 | 59 | football | 0 | 0 | 0 | 26 | Zimbabwe | 15602751 | 924 | 62.2 | 130.1 | 23.6 |
| 11536 | Kirsty Leigh Coventry | female | 9/16/1983 | 1.76 | 64 | aquatics | 0 | 0 | 0 | 32 | Zimbabwe | 15602751 | 924 | 69.3 | 141.1 | 20.7 |
| 11537 | Pardon Ndhlovu | male | 8/23/1987 | 1.58 | 54 | track and field | 0 | 0 | 0 | 28 | Zimbabwe | 15602751 | 924 | 62.2 | 119.0 | 21.6 |
| 11538 | Rutendo Makore | female | 9/30/1992 | 1.65 | 68 | football | 0 | 0 | 0 | 23 | Zimbabwe | 15602751 | 924 | 65.0 | 149.9 | 25.0 |
Look at the Datatypes
| Type | |
|---|---|
| nationality | character |
| id | integer |
| name | character |
| sex | character |
| dob | character |
| height | numeric |
| weight | integer |
| sport | character |
| gold | integer |
| silver | integer |
| bronze | integer |
| age | numeric |
| country | character |
| population | integer |
| gdp | numeric |
| height_eng | numeric |
| weight_eng | numeric |
| BMI | numeric |
We have no duplication of data
## [1] 0
4.1 Discuss how you plan to uncover new information in the data that is not self evident. What are different ways you can look at this data to answer the questions you want to answer? Do you plan to slice and dice the data in different ways, create new variables, or join separate data frame to create new summary information? How could you summarize your data to answer key questions?
4.2 What types of plot and tables will help you to illustrate the findings in your questions? * For data discovery of our numerical variables, box plots are great to see if there is some interesting variation to explore. * We have a lot of categorical data so bar plots will help dimensionalize these variables. * For “top 10” type of things, we use tables so we can display all of the data that went into the calculation.
4.3 What do you not know how to do right nowthat you need to learn to answer your questions? * Better visualization techniques – for example, how can you order bars in a bar chart?
* Better ways to display of information in tables – have used kable but struggle with formatting numbers * Calculations with dates – subtracting dates to get time inbetween, calculate age, things like that.
4.4 Do you plan to incorporate any machine learning techniques? * Possibly… what are the variables that might predict whether an athlete will win a medal? Age? Country? Sport? Height? Weight?
GENERAL ORIENTATION TO THE DATA WHICH SPORTS HAVE THE HIGHEST PARTICIPATION?
par(mfrow=c(1,2))
barplot(table(rio_orig$sex), main = "Number Males & Females" )
barplot(table(rio_orig$sport), main = "Number of pariticipants by sport\n Two sports dominate\n We'll explore this further" , cex.names=0.5 ) WHICH SPORTS ARE DOMINATED BY MALES , FEMALES?
NOT SURE THIS IS THE BEST VIZ We have uneven participation in certain sports by males and females. We will explore this further
mosaicplot(sport~sex,data=rio_orig, main = "Males and Females by Sport \nSome uneven participartion - We will explore this further", color=c("#fb8072", "#8dd3c7", cex.main =20, cex.lab = 5, cex.axis = 3)
) ATHLETE CHARACTERISTICS
plot(rio_orig$weight_eng, rio_orig$height_eng, ylab = "Height (in)", xlab = "Weight (lb)", main = "Height vs Weight of Athletes\n We see a general positive relationship")Is there a height advantage in certain events? We will explore this further
Is there a weight advantage in certain events? We will explore this further
STATISICAL ANALYSIS COMPARING AGE? Do certain events attract older or younger athletes? We will explore this further
Country Exploration
Athletes without a country
Independent Olympic Athletes - Kuwaiti athletes who competed under the Olympic flag, as the Kuwait Olympic Committee had been suspended by the International Olympic Committee (IOC) for the second time in five years due to governmental interference
Kuwaiti shooter Fehaid Al-Deehani became the first independent athlete to win a gold medal
#[](https://www.olympiandatabase.com/index.php?id=123580&L=1)
#[](https://www.unhcr.org/en-us/rio-2016-refugee-olympic-team.html)What are the top 5 nations with the lowest gdp that sent athletes? How many did they send?
top_gdp_n %>%
arrange(gdp) %>%
head(n=5) %>%
kbl(caption = "") %>%
kable_classic(full_width = F, html_font = "Cambria")| country | gdp | n |
|---|---|---|
| Burundi | 277 | 9 |
| Central African Republic | 323 | 6 |
| Niger | 359 | 6 |
| Malawi | 372 | 5 |
| Madagascar | 402 | 6 |
What are the top 5 nations with the highest gdp that sent athletes? How many did they send? USA not in list!
rio_orig %>%
group_by( country ) %>%
summarise(
highest_gdp=max(gdp) ,
n=n()
) %>%
arrange(desc(highest_gdp)) %>%
head(n=5) %>%
kbl(caption = "") %>%
kable_classic(full_width = F, html_font = "Cambria")| country | highest_gdp | n |
|---|---|---|
| Luxembourg | 101450 | 10 |
| Switzerland | 80945 | 104 |
| Norway | 74400 | 62 |
| Qatar | 73653 | 39 |
| Ireland | 61134 | 80 |
Which countries received the most medals?
*Which countries were the most productive?* Jamaica - a tiny nation that had over half of it’s athletes earning a medal.**
\[productivity = #medals / #athletes\]
medal_per_country %>%
group_by(country, total_medals, n) %>%
summarise(
productivity = total_medals / n
) %>%
arrange(desc(productivity)) %>%
head(n=10) %>%
kbl(caption = "") %>%
kable_classic(full_width = F, html_font = "Cambria")| country | total_medals | n | productivity |
|---|---|---|---|
| Jamaica | 30 | 57 | 0.5263158 |
| Serbia | 53 | 103 | 0.5145631 |
| United States | 264 | 567 | 0.4656085 |
| Russia | 115 | 286 | 0.4020979 |
| United Kingdom | 145 | 374 | 0.3877005 |
| Germany | 160 | 441 | 0.3628118 |
| Azerbaijan | 18 | 56 | 0.3214286 |
| Denmark | 41 | 128 | 0.3203125 |
| Norway | 19 | 62 | 0.3064516 |
| China | 113 | 404 | 0.2797030 |