This project will include an analysis of the 2016 Rio Olympics data.
The dataset contains information for all 11,538 athletes across 306 events. The data captures information by athlete, specifing:
The data is formatted in a “tidy” rectangular format: - row for each athlete (observation), - column for each characteristic (variable).
In the following code hunk, import your data.
athletes <- read.csv("athletes.csv")
ath <- tibble(athletes)
events <- read.csv("events.csv")
eve <- tibble(events)
head(athletes)
## id name nationality sex date_of_birth height weight
## 1 736041664 A Jesus Garcia ESP male 1969-10-17 1.72 64
## 2 532037425 A Lam Shin KOR female 1986-09-23 1.68 56
## 3 435962603 Aaron Brown CAN male 1992-05-27 1.98 79
## 4 521041435 Aaron Cook MDA male 1991-01-02 1.83 80
## 5 33922579 Aaron Gate NZL male 1990-11-26 1.81 71
## 6 173071782 Aaron Royle AUS male 1990-01-26 1.80 67
## sport gold silver bronze info
## 1 athletics 0 0 0
## 2 fencing 0 0 0
## 3 athletics 0 0 1
## 4 taekwondo 0 0 0
## 5 cycling 0 0 0
## 6 triathlon 0 0 0
head(events)
## id sport discipline name sex
## 1 701492 aquatics backstroke Women's 100m Backstroke female
## 2 305278 aquatics backstroke Women's 200m Backstroke female
## 3 708010 aquatics backstroke Men's 100m Backstroke male
## 4 729643 aquatics backstroke Men's 200m Backstroke male
## 5 567019 aquatics breaststroke Women's 100m Breaststroke female
## 6 607924 aquatics breaststroke Women's 200m Breaststroke female
## venues
## 1 Olympic Aquatics Stadium
## 2 Olympic Aquatics Stadium
## 3 Olympic Aquatics Stadium
## 4 Olympic Aquatics Stadium
## 5 Olympic Aquatics Stadium
## 6 Olympic Aquatics Stadium
Using words, describe the visualization you are going to make using which variables/characteristics in your data:
First, I am going to look at how many female and male events there where. For that, I am going to use the events data:
Which gender is most represented in the event?
fig_dat1<-eve %>%
select(Gender = sex) %>%
mutate(n = 1) %>%
group_by(Gender) %>%
summarise(total = sum(n))
fig_dat1
## # A tibble: 3 × 2
## Gender total
## <chr> <dbl>
## 1 female 136
## 2 male 161
## 3 mixed 9
Using words, describe the second visualization you are going to make using which variables/characteristics in your data:
A stacked bar of: 1. total athletes per country 2. against total medals per country.
Do countries with more athletes win more medals?
In the code chunk below, show your work filtering the data and create the subset of data you will display graphically.
## Coding and Formating of the Data
ath1 <- ath %>%
mutate(country = case_when(
nationality == 'IOA' ~ 'Independent Olympians',
nationality == 'ISV' ~ 'United States Virgin Islands',
nationality == 'LIB' ~ 'Lebanon',
nationality == 'ROT' ~ 'Refugee Olympic Team',
nationality == 'SIN' ~ 'Singapore',
nationality >0 ~ countrycode(nationality, "ioc", "country.name"))) %>%
mutate(continent = case_when(
nationality == 'IOA' ~ 'Independent',
nationality == 'ISV' ~ 'Americas',
nationality == 'LIB' ~ 'Asia',
nationality == 'ROT' ~ 'Independent',
nationality == 'SIN' ~ 'Asia',
nationality == 'KOS' ~ 'Europe',
nationality >0 ~ countrycode(nationality, "ioc", "continent"))) %>%
mutate(total_medals = gold + silver + bronze)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `country = case_when(...)`.
## Caused by warning:
## ! Some values were not matched unambiguously: IOA, LIB, ROT, SIN
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `continent = case_when(...)`.
## Caused by warning:
## ! Some values were not matched unambiguously: IOA, KOS, LIB, ROT, SIN
## Data Selection
fig_dat2<-ath1 %>%
select(country)
temp_dat2 <- ath1 %>% select(name,
country)%>%
mutate(number_of_athletes = 1) %>%
group_by(country)%>%
summarise(number_of_athletes = sum(number_of_athletes))
temp_dat1 <- ath1 %>% select(country,
total_medals) %>%
group_by(country)%>%
summarise(total_number_medals = sum(total_medals))
fig_dat2 <- fig_dat2 %>% distinct(country, .keep_all = TRUE)
fig_dat2 <- left_join(temp_dat2,fig_dat2,by="country") %>%
arrange(desc(number_of_athletes))
fig_dat2 <- left_join(temp_dat1,
fig_dat2,
by = "country")
fig_dat2
## # A tibble: 207 × 3
## country total_number_medals number_of_athletes
## <chr> <int> <dbl>
## 1 Afghanistan 0 3
## 2 Albania 0 6
## 3 Algeria 2 68
## 4 American Samoa 0 4
## 5 Andorra 0 5
## 6 Angola 0 26
## 7 Antigua & Barbuda 0 9
## 8 Argentina 22 223
## 9 Armenia 4 32
## 10 Aruba 0 7
## # ℹ 197 more rows
Using words, describe the third visualization you are going to make using which variables/characteristics in your data:
A scatter plot of weight against height, with size of observations set by number of medals won.
Do taller and heavier athletes win more medals?
fig_dat3<-ath1 %>%
select(country)
fig_dat3 <- fig_dat3 %>% distinct(country, .keep_all = TRUE)
temp_dat3 <- ath1 %>%
select(name,
country,
weight) %>%
drop_na(weight) %>%
group_by(country) %>%
summarise(Mean_weight = mean(weight))
temp_dat4 <- ath1 %>%
select(name,
country,
height) %>%
drop_na(height) %>%
group_by(country) %>%
summarise(Mean_height = mean(height))
fig_dat3 <- right_join(temp_dat1,
fig_dat3,
by="country") %>%
filter(total_number_medals > 0)
fig_dat3 <- right_join(temp_dat3,
fig_dat3,
by = "country")
fig_dat3 <- right_join(temp_dat4,
fig_dat3,
by = "country")
fig_dat3
## # A tibble: 87 × 4
## country Mean_height Mean_weight total_number_medals
## <chr> <dbl> <dbl> <int>
## 1 Algeria 1.76 70.3 2
## 2 Argentina 1.78 75.7 22
## 3 Armenia 1.72 78.7 4
## 4 Australia 1.78 73.4 82
## 5 Austria 1.77 70.5 2
## 6 Azerbaijan 1.78 75.7 18
## 7 Bahamas 1.76 68.9 6
## 8 Bahrain 1.71 60.5 2
## 9 Belarus 1.77 72.9 12
## 10 Belgium 1.78 69.8 21
## # ℹ 77 more rows