Project Topic

This project will include an analysis of the 2016 Rio Olympics data.

Source: https://github.com/flother/rio2016

Data Descriptions

The dataset contains information for all 11,538 athletes across 306 events. The data captures information by athlete, specifing:

The data is formatted in a “tidy” rectangular format: - row for each athlete (observation), - column for each characteristic (variable).

Import Your Data

In the following code hunk, import your data.

athletes <- read.csv("athletes.csv")
ath <- tibble(athletes)

events <- read.csv("events.csv")
eve <- tibble(events)

head(athletes)
##          id           name nationality    sex date_of_birth height weight
## 1 736041664 A Jesus Garcia         ESP   male    1969-10-17   1.72     64
## 2 532037425     A Lam Shin         KOR female    1986-09-23   1.68     56
## 3 435962603    Aaron Brown         CAN   male    1992-05-27   1.98     79
## 4 521041435     Aaron Cook         MDA   male    1991-01-02   1.83     80
## 5  33922579     Aaron Gate         NZL   male    1990-11-26   1.81     71
## 6 173071782    Aaron Royle         AUS   male    1990-01-26   1.80     67
##       sport gold silver bronze info
## 1 athletics    0      0      0     
## 2   fencing    0      0      0     
## 3 athletics    0      0      1     
## 4 taekwondo    0      0      0     
## 5   cycling    0      0      0     
## 6 triathlon    0      0      0
head(events)
##       id    sport   discipline                      name    sex
## 1 701492 aquatics   backstroke   Women's 100m Backstroke female
## 2 305278 aquatics   backstroke   Women's 200m Backstroke female
## 3 708010 aquatics   backstroke     Men's 100m Backstroke   male
## 4 729643 aquatics   backstroke     Men's 200m Backstroke   male
## 5 567019 aquatics breaststroke Women's 100m Breaststroke female
## 6 607924 aquatics breaststroke Women's 200m Breaststroke female
##                     venues
## 1 Olympic Aquatics Stadium
## 2 Olympic Aquatics Stadium
## 3 Olympic Aquatics Stadium
## 4 Olympic Aquatics Stadium
## 5 Olympic Aquatics Stadium
## 6 Olympic Aquatics Stadium

Part 1

Using words, describe the visualization you are going to make using which variables/characteristics in your data:

First, I am going to look at how many female and male events there where. For that, I am going to use the events data:

  1. Rename the sex column to Gender.
  2. Create a df containing 3 rows (male, female, mixed) and the total for each category.

Which gender is most represented in the event?

fig_dat1<-eve %>% 
  select(Gender = sex) %>%
  mutate(n = 1) %>%
  group_by(Gender) %>%
  summarise(total = sum(n))

fig_dat1
## # A tibble: 3 × 2
##   Gender total
##   <chr>  <dbl>
## 1 female   136
## 2 male     161
## 3 mixed      9

Part 2

Using words, describe the second visualization you are going to make using which variables/characteristics in your data:

A stacked bar of: 1. total athletes per country 2. against total medals per country.

Do countries with more athletes win more medals?

In the code chunk below, show your work filtering the data and create the subset of data you will display graphically.

## Coding and Formating of the Data

ath1 <- ath %>%
  mutate(country = case_when(
    nationality == 'IOA' ~ 'Independent Olympians',
    nationality == 'ISV' ~ 'United States Virgin Islands',
    nationality == 'LIB' ~ 'Lebanon',
    nationality == 'ROT' ~ 'Refugee Olympic Team',
    nationality == 'SIN' ~ 'Singapore',
    nationality >0 ~ countrycode(nationality, "ioc", "country.name"))) %>%
  mutate(continent = case_when(
    nationality == 'IOA' ~ 'Independent',
    nationality == 'ISV' ~ 'Americas',
    nationality == 'LIB' ~ 'Asia',
    nationality == 'ROT' ~ 'Independent',
    nationality == 'SIN' ~ 'Asia',
    nationality == 'KOS' ~ 'Europe',
    nationality >0 ~ countrycode(nationality, "ioc", "continent")))  %>%
  mutate(total_medals = gold + silver + bronze) 
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `country = case_when(...)`.
## Caused by warning:
## ! Some values were not matched unambiguously: IOA, LIB, ROT, SIN
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `continent = case_when(...)`.
## Caused by warning:
## ! Some values were not matched unambiguously: IOA, KOS, LIB, ROT, SIN
## Data Selection
fig_dat2<-ath1 %>% 
  select(country)

temp_dat2 <- ath1 %>% select(name,
                             country)%>%
  mutate(number_of_athletes = 1) %>%
  group_by(country)%>%
  summarise(number_of_athletes = sum(number_of_athletes))

temp_dat1 <- ath1 %>% select(country,
                             total_medals) %>%
  group_by(country)%>%
  summarise(total_number_medals = sum(total_medals))


fig_dat2 <- fig_dat2 %>% distinct(country, .keep_all = TRUE)

fig_dat2 <- left_join(temp_dat2,fig_dat2,by="country") %>%
  arrange(desc(number_of_athletes))

fig_dat2 <- left_join(temp_dat1,
                      fig_dat2,
                      by = "country")

fig_dat2
## # A tibble: 207 × 3
##    country           total_number_medals number_of_athletes
##    <chr>                           <int>              <dbl>
##  1 Afghanistan                         0                  3
##  2 Albania                             0                  6
##  3 Algeria                             2                 68
##  4 American Samoa                      0                  4
##  5 Andorra                             0                  5
##  6 Angola                              0                 26
##  7 Antigua & Barbuda                   0                  9
##  8 Argentina                          22                223
##  9 Armenia                             4                 32
## 10 Aruba                               0                  7
## # ℹ 197 more rows

Part 3

Using words, describe the third visualization you are going to make using which variables/characteristics in your data:

A scatter plot of weight against height, with size of observations set by number of medals won.

Do taller and heavier athletes win more medals?

fig_dat3<-ath1 %>% 
  select(country)

fig_dat3 <- fig_dat3 %>% distinct(country, .keep_all = TRUE)

temp_dat3 <- ath1 %>%
  select(name,
         country,
         weight) %>%
  drop_na(weight) %>%
  group_by(country) %>%
  summarise(Mean_weight = mean(weight))

temp_dat4 <- ath1 %>%
  select(name,
         country,
         height) %>%
  drop_na(height) %>%
  group_by(country) %>%
  summarise(Mean_height = mean(height))

fig_dat3 <- right_join(temp_dat1,
                      fig_dat3,
                      by="country") %>%
  filter(total_number_medals > 0)

fig_dat3 <- right_join(temp_dat3, 
                      fig_dat3, 
                      by = "country")
fig_dat3 <- right_join(temp_dat4,
                      fig_dat3,
                      by = "country")


fig_dat3
## # A tibble: 87 × 4
##    country    Mean_height Mean_weight total_number_medals
##    <chr>            <dbl>       <dbl>               <int>
##  1 Algeria           1.76        70.3                   2
##  2 Argentina         1.78        75.7                  22
##  3 Armenia           1.72        78.7                   4
##  4 Australia         1.78        73.4                  82
##  5 Austria           1.77        70.5                   2
##  6 Azerbaijan        1.78        75.7                  18
##  7 Bahamas           1.76        68.9                   6
##  8 Bahrain           1.71        60.5                   2
##  9 Belarus           1.77        72.9                  12
## 10 Belgium           1.78        69.8                  21
## # ℹ 77 more rows