1 At glance

Refer to care point health US, body mass index (BMI) is a guideline used by health professionals and bariatric surgeons to determine the overall health of an individual in terms of weight and body fat. BMI is calculated using a ratio of your height and weight, and can be affected by how often you exercise and by the types of foods you eat. Those who have above-average BMIs that fall into the overweight and obese categories can take steps to become healthier through diet, exercise, and weight-loss surgery so they can improve their overall quality of life and live longer with fewer health complications.

knitr::include_graphics("/Users/pandjibagas/Desktop/Algoritma Course/LBB/Olympic_Athletes/120 Years Olympic History/Body Mass Index.png")

From image above, US medical standard categorize people to 5 category by result the BMI, which is underweight (BMI < 18.5), normal (<18.5 < BMI < 24.6), overweight (25 < BMI < 29.9), obese (30 < BMI < 39.9), and morbidly obese (BMI > 40). For the record, the index only apply to adult aged above 20 years.

In this publication will interprets about the effect of BMI in olympic athlete from historical data on the modern Olympic Games, from Athens 1896 to Rio 2016. Each row corresponds to an individual athlete competing in an individual event, including the athlete’s name, sex, age, height, weight, country, and medal, and the event’s name, sport, games, year, and city. The outputs will exemine the simple distribuiton from all parameters include BMI, categorize athlete by BMI chart shown above, and the correlation BMI with earning medals. Lets gets started!

2 Data Wrangling

The core activites from this step are discovering and input, structuring, cleaning, enriching, validating, and analyzing.

3 Dicovering & Input Data

The historical data olympic acquaried from mavenanalytic (https://www.mavenanalytics.io/data-playground). The next step input the data with code below,

#Load/install library package to read data
library(readxl)
library(readxl)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.0     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(viridis)
## Loading required package: viridisLite
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggiraph)
library(glue)
## 
## Attaching package: 'glue'
## The following object is masked from 'package:dplyr':
## 
##     collapse
#All packages data for observing the overall data and output
olympic_athlete <- read.csv("athlete_events.csv") #main data
olympic_country <- read.csv("country_definitions.csv") #supporting data
athlete_def <- read.csv("athlete_events_data_dictionary.csv") #definition each coloumn from athlete_events.csv
country_def <-  read.csv("country_definitions_data_dictionary.csv") #definition each coloumn from country_definiitons.csv

Show the data

#Our main data olympic_athlete
head(olympic_athlete,10)
#Our supporting data olympic_country
head(olympic_country,10)

4 Structuring Data

As dataframe shown from olympic_athlete & olympic_country we need to combine both of dataframe because, 1. From business question we must know the frequency each country but the variable country occur in olympic_country as region column and the detail of athlete in olympic_athlete. 2. To simplify our wrangling and analyze should be in one dataframe.

Merge the data

#Merge the data with key column, "NOC"
olympic <- merge(olympic_athlete, olympic_country, by.x = "NOC", by.y = "NOC", all.x = T)
head(olympic,10)

Check the data type every column & adjust the type.

#Check data type every column
str(olympic)
## 'data.frame':    271116 obs. of  17 variables:
##  $ NOC   : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ ID    : int  132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
##  $ Name  : chr  "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
##  $ Sex   : chr  "M" "M" "M" "M" ...
##  $ Age   : int  NA NA 28 NA 24 28 28 NA NA NA ...
##  $ Height: int  NA NA 163 NA NA 168 NA NA NA NA ...
##  $ Weight: num  NA NA 57 NA 74 73 NA NA 57 NA ...
##  $ Team  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Games : chr  "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
##  $ Year  : int  1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
##  $ Season: chr  "Summer" "Summer" "Summer" "Summer" ...
##  $ City  : chr  "Melbourne" "London" "Moskva" "Melbourne" ...
##  $ Sport : chr  "Hockey" "Hockey" "Wrestling" "Hockey" ...
##  $ Event : chr  "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
##  $ Medal : chr  NA NA NA NA ...
##  $ region: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ notes : chr  "" "" "" "" ...

From result above there is several variables that need to change to obtain appropiate output. The variables are NOC, Sex, Height, Weight, Team, Games, Season, City, Sport, Event, Medal, Region and change become factor.

olympic[,c("Sex","Team","NOC","Games","Season","City","Sport","Medal","Event","region")] <-
  lapply(olympic[,c("Sex","Team","NOC","Games","Season","City","Sport","Medal","Event","region")], as.factor)

5 Cleaning Data

This step occur activites to check the completness of data, missing value, and validate the connection between variables.

summary(olympic)
##       NOC               ID             Name           Sex       
##  USA    : 18853   Min.   :     1   Length:271116      F: 74522  
##  FRA    : 12758   1st Qu.: 34643   Class :character   M:196594  
##  GBR    : 12256   Median : 68205   Mode  :character             
##  ITA    : 10715   Mean   : 68249                                
##  GER    :  9830   3rd Qu.:102097                                
##  CAN    :  9733   Max.   :135571                                
##  (Other):196971                                                 
##       Age            Height          Weight                 Team       
##  Min.   :10.00   Min.   :127.0   Min.   : 25.0   United States: 17847  
##  1st Qu.:21.00   1st Qu.:168.0   1st Qu.: 60.0   France       : 11988  
##  Median :24.00   Median :175.0   Median : 70.0   Great Britain: 11404  
##  Mean   :25.56   Mean   :175.3   Mean   : 70.7   Italy        : 10260  
##  3rd Qu.:28.00   3rd Qu.:183.0   3rd Qu.: 79.0   Germany      :  9326  
##  Max.   :97.00   Max.   :226.0   Max.   :214.0   Canada       :  9279  
##  NA's   :9474    NA's   :60171   NA's   :62875   (Other)      :201012  
##          Games             Year         Season                   City       
##  2000 Summer: 13821   Min.   :1896   Summer:222552   London        : 22426  
##  1996 Summer: 13780   1st Qu.:1960   Winter: 48564   Athina        : 15556  
##  2016 Summer: 13688   Median :1988                   Sydney        : 13821  
##  2008 Summer: 13602   Mean   :1978                   Atlanta       : 13780  
##  2004 Summer: 13443   3rd Qu.:2002                   Rio de Janeiro: 13688  
##  1992 Summer: 12977   Max.   :2016                   Beijing       : 13602  
##  (Other)    :189805                                  (Other)       :178243  
##         Sport                                        Event       
##  Athletics : 38624   Football Men's Football            :  5733  
##  Gymnastics: 26707   Ice Hockey Men's Ice Hockey        :  4762  
##  Swimming  : 23195   Hockey Men's Hockey                :  3958  
##  Shooting  : 11448   Water Polo Men's Water Polo        :  3358  
##  Cycling   : 10859   Basketball Men's Basketball        :  3280  
##  Fencing   : 10735   Cycling Men's Road Race, Individual:  2947  
##  (Other)   :149548   (Other)                            :247078  
##     Medal            region          notes          
##  Bronze: 13295   USA    : 18853   Length:271116     
##  Gold  : 13372   Germany: 15883   Class :character  
##  Silver: 13116   France : 12758   Mode  :character  
##  NA's  :231333   UK     : 12256                     
##                  Russia : 11692                     
##                  (Other):199304                     
##                  NA's   :   370

From summary code, it generates simple aggregation in all variable in each value depends on data type and to peek missing value in variable. Conclusion from the summary are: 1. From variable ID, amount athlete that participate olympic is 135.571 2. From variable Sex, amount Male (196.594) dominate female athlete (74.522) 3. From variable Age, avarage age athlete olympic all the time is 25.56 and have 9.479 missing values. 4. From variable Height, avarage height athlete olympic all the time is 175.3 and have 60.171 missing values. 5. From variable Weight, avarage weight athlete olympic all the time is 70.7 and have 62.875 missing values. 6. From variable Teams, the top three amount athlete by team’s name are United States, France, and Great Britain. 7. From variable Games, the most participant athlete when 2000 Summer Olympic with amount 13.821 athletes. 8. From variable Year, the first olympic held on 1896 and last on 2016. 9. From variable Season, the most season helds olympic is during summer. 10.From variable Sport, the most participate sport is athletics with amount 38.624 athletes. 11. From variable Event, the most participate event is Men’s Footbal with amount 5.733 athletes. 12. From variable Medal, the NA’s value define the athlete who not earn medal. 13. From variable region, the top three participate athlete by region are USA, Germany, and France. 14. From notes, don’t have any info.

Inspection missing value focus on Age, Height, Weight, and region.

Why it’s matter? As we know from business question, to calculate BMI we need height, weight, age, and for the output expected we need to compare BMI based on country.

#Load/install library package to filter data
library(dplyr)

#Identifying NA
colSums(is.na(olympic))
##    NOC     ID   Name    Sex    Age Height Weight   Team  Games   Year Season 
##      0      0      0      0   9474  60171  62875      0      0      0      0 
##   City  Sport  Event  Medal region  notes 
##      0      0      0 231333    370    349

Delete all row containt NA in variable Age, Height, Weight, and Region

#Identifying amount row where's NA in column Age, Height, Weight, Region

#Delete row containt NA based on Age, Height, Weight, Region
olympic <- olympic[!is.na(olympic$Age),]
olympic <- olympic[!is.na(olympic$Height),]
olympic <- olympic[!is.na(olympic$Weight),]
olympic <- olympic[!is.na(olympic$region),]

colSums(is.na(olympic))
##    NOC     ID   Name    Sex    Age Height Weight   Team  Games   Year Season 
##      0      0      0      0      0      0      0      0      0      0      0 
##   City  Sport  Event  Medal region  notes 
##      0      0      0 175723      0      0

Replace missing value on Medal variable to No Medal

#To replace with string, the data type column should be in character type
no_medal <- olympic$Medal
no_medal <- as.character(no_medal)
no_medal[is.na(no_medal)] <- "No_Medal"
no_medal <- as.factor(no_medal)
olympic$Medal <- no_medal

colSums(is.na(olympic))
##    NOC     ID   Name    Sex    Age Height Weight   Team  Games   Year Season 
##      0      0      0      0      0      0      0      0      0      0      0 
##   City  Sport  Event  Medal region  notes 
##      0      0      0      0      0      0

Remove notes column

olympic$notes <- NULL

Summary all data once again for final check

summary(olympic)
##       NOC               ID             Name           Sex       
##  USA    : 14214   Min.   :     1   Length:205895      F: 66592  
##  FRA    :  7977   1st Qu.: 35178   Class :character   M:139303  
##  CAN    :  7965   Median : 68609   Mode  :character             
##  GBR    :  7766   Mean   : 68600                                
##  ITA    :  7697   3rd Qu.:102290                                
##  JPN    :  7487   Max.   :135571                                
##  (Other):152789                                                 
##       Age            Height          Weight                 Team       
##  Min.   :11.00   Min.   :127.0   Min.   : 25.0   United States: 13714  
##  1st Qu.:21.00   1st Qu.:168.0   1st Qu.: 60.0   France       :  7807  
##  Median :24.00   Median :175.0   Median : 70.0   Canada       :  7668  
##  Mean   :25.06   Mean   :175.4   Mean   : 70.7   Great Britain:  7499  
##  3rd Qu.:28.00   3rd Qu.:183.0   3rd Qu.: 79.0   Italy        :  7433  
##  Max.   :71.00   Max.   :226.0   Max.   :214.0   Japan        :  7339  
##                                                  (Other)      :154435  
##          Games             Year         Season                   City       
##  2000 Summer: 13659   Min.   :1896   Summer:166436   London        : 13766  
##  2016 Summer: 13398   1st Qu.:1976   Winter: 39459   Sydney        : 13659  
##  2004 Summer: 13377   Median :1992                   Athina        : 13618  
##  2008 Summer: 13365   Mean   :1990                   Rio de Janeiro: 13398  
##  2012 Summer: 12486   3rd Qu.:2006                   Beijing       : 13365  
##  1996 Summer: 11816   Max.   :2016                   Atlanta       : 11816  
##  (Other)    :127794                                  (Other)       :126273  
##                   Sport                                        Event       
##  Athletics           : 32348   Ice Hockey Men's Ice Hockey        :  3825  
##  Swimming            : 18658   Football Men's Football            :  3459  
##  Gymnastics          : 18267   Hockey Men's Hockey                :  2874  
##  Rowing              :  7789   Basketball Men's Basketball        :  2461  
##  Cycling             :  7775   Water Polo Men's Water Polo        :  2231  
##  Cross Country Skiing:  7529   Cycling Men's Road Race, Individual:  2099  
##  (Other)             :113529   (Other)                            :188946  
##       Medal            region      
##  Bronze  : 10144   USA    : 14214  
##  Gold    : 10166   Germany: 13183  
##  No_Medal:175723   Russia : 10398  
##  Silver  :  9862   France :  7977  
##                    Canada :  7966  
##                    UK     :  7766  
##                    (Other):144391
nrow(olympic)
## [1] 205895

6 Enriching

As the output expected there is two columns that should be add and show, first column is Body Mass Index (BMI). The formula to calculate BMI :

knitr::include_graphics("BMI formula.png")

Add and calculate BMI

#remember height in m not cm
olympic <- olympic %>%
  mutate(BMI = Weight/(Height/100)^2) %>%
  mutate_at(vars(BMI), funs(round(., 2)))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
head(olympic)

Second column is for to categorize the BMI based on BMI chart.

olympic <- olympic %>%
  mutate(BMI_category = case_when(BMI < 18.5 ~ "Underweight",
                                  BMI <= 24.9 ~ "Normal",
                                  BMI <= 29.9 ~ "Overweight",
                                  BMI <= 39.9 ~ "Obese",
                                  BMI >39.9 ~ "Morbidly Obese"))

olympic[,c("BMI","BMI_category"),]

Transform value in variable medal to column and give total of medal each athlete

library(tidyr)
olympic <- olympic %>%
  pivot_wider(
    names_from = Medal,
    values_from = Medal,
    values_fn = length,
    values_fill = 0) %>%
  mutate(Total_Medal = Bronze + Silver + Gold)

olympic$No_Medal <- NULL

olympic

7 Validating

This step prevent anomaly data before we processing, to do so we will aggregate and subset. As our business question we will focus on the BMI & medal

Check outlier

Groupby by ID because the unique keys

#olympic_check_outlier
library(ggplot2)
olympic %>%
  filter(region == "USA", Sport == "Athletics", Sex == "M") %>%
  group_by(Year, ID) %>%
  summarise(Bronze = sum(Bronze),
            Silver = sum(Silver),
            Gold = sum(Gold),
            All_Medal = sum(Total_Medal),
            BMI = mean(BMI)) %>%
  mutate(BMI_cat = case_when(BMI < 18.5 ~ "Underweight",
                                  BMI <= 24.9 ~ "Normal",
                                  BMI <= 29.9 ~ "Overweight",
                                  BMI <= 39.9 ~ "Obese",
                                  BMI >39.9 ~ "Morbidly Obese")) %>%
            
  ggplot(mapping = aes(x = BMI , y = as.factor(Year))) +
  geom_jitter(aes(col = BMI_cat)) +
  geom_boxplot(fill = "Gray", alpha = 0.5) +
  labs(title = "Olympic Athlete BMI by Year",
       subtitle = "Male Athlete",
       y = "Year", x ="BMI",
       col = "BMI Category")
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.

unique(olympic$Sport[olympic$region == "Indonesia"])
##  [1] Athletics     Shooting      Weightlifting Badminton     Sailing      
##  [6] Boxing        Table Tennis  Archery       Cycling       Fencing      
## [11] Tennis        Swimming      Rowing        Taekwondo     Diving       
## [16] Wrestling     Canoeing      Judo         
## 66 Levels: Aeronautics Alpine Skiing Alpinism Archery ... Wrestling
unique(olympic$Event[olympic$Sport == "Athletics" & olympic$region == "Indonesia"])
##  [1] Athletics Men's 4 x 100 metres Relay Athletics Men's 100 metres          
##  [3] Athletics Men's 5,000 metres         Athletics Men's 10,000 metres       
##  [5] Athletics Women's 200 metres         Athletics Women's Marathon          
##  [7] Athletics Women's 100 metres         Athletics Men's 400 metres          
##  [9] Athletics Men's 200 metres           Athletics Women's 5,000 metres      
## [11] Athletics Women's 100 metres Hurdles Athletics Women's Long Jump         
## [13] Athletics Men's 110 metres Hurdles   Athletics Women's 400 metres        
## [15] Athletics Men's Marathon            
## 765 Levels: Aeronautics Mixed Aeronautics ... Wrestling Women's Middleweight, Freestyle
olympic_boxplot <- olympic %>%
            #filter(region == input$Select_region, Sport == input$Select_Sport, Event == input$Select_Event) %>%
            group_by(Year, ID) %>%
            summarise(Bronze = sum(Bronze),
                      Silver = sum(Silver),
                      Gold = sum(Gold),
                      All_Medal = sum(Total_Medal),
                      BMI = mean(BMI)) %>%
            mutate(BMI_cat = case_when(BMI < 18.5 ~ "Underweight",
                                       BMI <= 24.9 ~ "Normal",
                                       BMI <= 29.9 ~ "Overweight",
                                       BMI <= 39.9 ~ "Obese",
                                       BMI >39.9 ~ "Morbidly Obese"))
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
        olympic_boxplot <- ggplot(olympic_boxplot, mapping = aes(y = BMI , x = Year)) +
            geom_jitter(aes(col = BMI_cat)) +
            geom_boxplot(fill = "Gray", alpha = 0.5) +
            labs(title = "Olympic Athlete BMI by Year",
                 subtitle = "Male Athlete",
                 x = "Year", y ="BMI",
                 col = "BMI Category")
        
        ggplotly(olympic_boxplot)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
scl <- 0.01
library(ggplot2) #package to create plot

#suspect BMI outlier from Sex variable in all athlete by each year
outlier_BMI_M <- olympic %>%
  filter(Sex == "M") %>%
  select(Name, Sex, Sport, Event, Year, Age,Weight, Height, BMI, BMI_category) %>%
  
  ggplot(mapping = aes(x = BMI , y = as.factor(Year))) +
  geom_jitter(aes(col = BMI_category)) +
  geom_boxplot(fill = "white") +
  labs(title = "Olympic Athlete BMI by Year",
       subtitle = "Male Athlete",
       y = "Year", x ="BMI",
       col = "BMI Category")

outlier_BMI_M

outlier_BMI_F <- olympic %>%
  filter(Sex == "F") %>%
  select(Name, Sex, Sport, Event, Year, Age,Weight, Height, BMI, BMI_category) %>%
  
  ggplot(aes(x = BMI , y = as.factor(Year))) +
  geom_jitter(aes(col = BMI_category)) +
  geom_boxplot(aes(fill = "white")) +
  labs(title = "Olympic Athlete BMI by Year",
       subtitle = "Female Athlete",
       y = "Year", x ="BMI",
       col = "BMI Category")

outlier_BMI_F

Outlier cannot be trace just from Gender variable, it’s because the amount participant athlete each sport have different height & weight requirement. Example wrestling avarage BMI higher but have athlete participant fewer than badminton, it’s because rules wrestling sport require weight over 80 kg.

So next step is inspect outlier in sport every year.

  #Example Judo Sport
  olympic %>%
  filter(Sport == "Judo") %>%
  select(Name, Sex, Sport, Event, Year, Age,Weight, Height, BMI, BMI_category) %>%
  
  ggplot(aes(x = as.factor(Year), y = BMI)) +
  geom_jitter(aes(col = BMI_category)) +
  geom_boxplot(fill = "white") +
  labs(title = "Olympic Athlete BMI by Year",
       subtitle = "Judo Sport",
       y = "Year", x ="BMI",
       col = "BMI Category")

Distribution of Data based on BMI

library(tidyverse)
library(hrbrthemes) 

BMI_dist <- olympic %>%
  filter(Sex == "M" | Sex == "F") %>%
  select(Name, Sex, Sport, Event, Year, Age,Weight, Height, BMI, BMI_category) %>%
  
  ggplot(aes(x = BMI)) +
  geom_histogram(aes(bins=10, fill="blue" , color="blue", alpha=0.5)) +
  labs(title = "Olympic Athlete BMI Distribution",
       x ="BMI") +
  scale_fill_brewer(palette="PuBu") +
  theme(legend.position = "NONE") +
  facet_wrap(~Year, scales = "free_y")
## Warning: Ignoring unknown aesthetics: bins
BMI_dist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As histogram plot shown the BMI distribution in athelete each olympic every year at range 20 -30 or normal to overweight. Overall the distribution BMI’s is normal distribution.

8 Analyze

9 Historical Olympic 1986 - 2016

olympic %>%
  group_by(Year) %>%
  summarise(Olympian =n(),
            Medal = sum(Total_Medal),
            Avg_BMI = round(mean(BMI),2)) %>%

  ggplot(aes(x= Year)) +
  geom_line(aes(y = Olympian)) + 
  geom_line(aes(y = Medal)) +
  geom_line(aes(y =Avg_BMI))

10 How many athlete in every country?

olympic_raw <- merge(olympic_athlete, olympic_country, by.x = "NOC", by.y = "NOC", all.x = T)

athlete_participant <- olympic_raw %>%
  filter(Year == 2016) %>%
  group_by(region, Sex) %>%
  summarise(ID = n_distinct(ID)) %>%
  head(30) %>%
  ggplot(aes(x = reorder(region, -ID), y = ID)) +
  geom_col(aes(fill = Sex)) +
  scale_fill_brewer(palette="PuBu")
## `summarise()` has grouped output by 'region'. You can override using the `.groups` argument.
ggplotly(athlete_participant)

11 Participant Athlete by Gender

athlete_gender <- olympic %>%
  filter(region == "Argentina", Year == 2016) %>%
  group_by(Sex) %>%
  summarise(ID = n_distinct(ID)) %>%
  mutate(percent = ID/sum(ID)*100) %>%
  mutate_at(vars(percent), funs(round(., 0))) %>%
  mutate(ypos = cumsum(percent)- 0.5*percent ) %>%
  
  ggplot(aes(x = "", y = percent, fill = Sex)) +
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start= 0) +
  theme_void() +
  geom_text(aes(y = ypos, label = percent), color = "black", size=5) +
  scale_fill_brewer(palette="PuBu")

athlete_gender

# Avarage BMI category by Gender

athlete_gender <- olympic_raw %>%
  filter(Year == 2016) %>%
  group_by(Sex) %>%
  summarise(ID = n_distinct(ID)) %>%
  mutate(percent = ID/sum(ID)*100) %>%
  mutate_at(vars(percent), funs(round(., 0))) %>%
  mutate(ypos = cumsum(percent)- 0.5*percent )

head(athlete_gender,10)

12 How Country Effect the BMI

olympic %>%
  filter() %>%
  group_by(region) %>% 
  summarise(avg_age = mean(Age),
            avg_height = mean(Height),
            avg_weight = mean(Weight),
            avg_BMI = mean(BMI)) %>%
  mutate(BMI_cat_country = case_when(avg_BMI < 18.5 ~ "Underweight",
                                  avg_BMI <= 24.9 ~ "Normal",
                                  avg_BMI <= 29.9 ~ "Overweight",
                                  avg_BMI <= 39.9 ~ "Obese",
                                  avg_BMI >39.9 ~ "Morbidly Obese")) %>%
  arrange(-avg_age) %>%
  head(30) %>%
  ggplot(aes(y= reorder(region,avg_BMI), x = avg_BMI)) +
  geom_col(aes(fill = BMI_cat_country)) +
  scale_fill_brewer(palette= 14) +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_rect(fill = "white"))

As data shown two region have BMI with overweight category. So we must zoom the data to know which sport contribute high BMI

#American Samoa
olympic %>%
  filter(region == "American Samoa") %>%
  select(ID, Name, Sex, Sport, Event, Year, Age,Weight, Height, BMI, BMI_category) %>%
  ggplot(aes(x = reorder(Sport,-BMI), y = BMI)) +
  geom_col(aes(fill = BMI_category), position = "dodge") +
  scale_fill_brewer(palette="PuBu")

In region American Samoa Wrestling, Archery, and Sport are contributor high BMI with Obese category.

olympic %>%
  filter(region == "American Samoa", Sport == "Wrestling") %>%
  select(ID, Name, Sex, Sport, Event, Year, Age,Weight, Height, BMI, BMI_category) %>%
  arrange(BMI)

13 Is it more ideal BMI get more earning medal?

Check distribution

olympic %>%
  group_by(ID) %>%
  summarise(medal_id = sum(Total_Medal), avg_BMI = mean(BMI)) %>%
  mutate(BMI_cat = case_when(avg_BMI < 18.5 ~ "Underweight",
                                  avg_BMI <= 24.9 ~ "Normal",
                                  avg_BMI <= 29.9 ~ "Overweight",
                                  avg_BMI <= 39.9 ~ "Obese",
                                  avg_BMI >39.9 ~ "Morbidly Obese")) %>%
  
  ggplot(aes(avg_BMI, medal_id)) +
  geom_point(aes(col = BMI_cat))

From the scatter plot we see the ideal BMI prefer to get more earning medal. But is it true the two variable have correaltion?

Check correlation

library(ggpubr)

  olympic_corr <- olympic %>%
  group_by(ID) %>%
  summarise(medal_id = sum(Total_Medal), avg_BMI = mean(BMI)) %>%
  mutate(BMI_cat = case_when(avg_BMI < 18.5 ~ "Underweight",
                                  avg_BMI <= 24.9 ~ "Normal",
                                  avg_BMI <= 29.9 ~ "Overweight",
                                  avg_BMI <= 39.9 ~ "Obese",
                                  avg_BMI >39.9 ~ "Morbidly Obese")) %>%
  filter(BMI_cat == "Normal")

  ggplot(olympic_corr, aes(avg_BMI, medal_id)) +
  geom_point(aes(col = BMI_cat)) +
  geom_smooth(method = lm, se = TRUE)
## `geom_smooth()` using formula 'y ~ x'

  cor(olympic_corr$medal_id,olympic_corr$avg_BMI)
## [1] 0.03435621