At glance

Refer to care point health US, body mass index (BMI) is a guideline used by health professionals and bariatric surgeons to determine the overall health of an individual in terms of weight and body fat. BMI is calculated using a ratio of your height and weight, and can be affected by how often you exercise and by the types of foods you eat. Those who have above-average BMIs that fall into the overweight and obese categories can take steps to become healthier through diet, exercise, and weight-loss surgery so they can improve their overall quality of life and live longer with fewer health complications.

knitr::include_graphics("/Users/pandjibagas/Desktop/Algoritma Course/LBB/Olympic_Athletes/120 Years Olympic History/Body Mass Index.png")

From image above, US medical standard categorize people to 5 category by result the BMI, which is underweight (BMI < 18.5), normal (<18.5 < BMI < 24.6), overweight (25 < BMI < 29.9), obese (30 < BMI < 39.9), and morbidly obese (BMI > 40). For the record, the index only apply to adult aged above 20 years.

How about the olympic athletes? Is it all athletes have ideal BMI? or how BMI effect the earning medal?

In this publication will interprets about the effect of BMI in olympic athlete from historical data on the modern Olympic Games, from Athens 1896 to Rio 2016. Each row corresponds to an individual athlete competing in an individual event, including the athlete’s name, sex, age, height, weight, country, and medal, and the event’s name, sport, games, year, and city. The outputs will exemine the simple distribuiton from all parameters include BMI and categorize athlete by BMI chart shown above. Lets gets started!

Data Wrangling

The core activites from this step are discovering and input, structuring, cleaning, enriching, validating, and analyzing.

Dicovering & Input Data

The historical data olympic acquaried from mavenanalytic (https://www.mavenanalytics.io/data-playground). The next step input the data with code below,

#Load/install library package to read data
library(readxl)

#All packages data for observing the overall data and output
olympic_athlete <- read.csv("athlete_events.csv") #main data
olympic_country <- read.csv("country_definitions.csv") #supporting data
athlete_def <- read.csv("athlete_events_data_dictionary.csv") #definition each coloumn from athlete_events.csv
country_def <-  read.csv("country_definitions_data_dictionary.csv") #definition each coloumn from country_definiitons.csv

Show the data

#Our main data olympic_athlete
head(olympic_athlete,10)
##    ID                     Name Sex Age Height Weight           Team NOC
## 1   1                A Dijiang   M  24    180     80          China CHN
## 2   2                 A Lamusi   M  23    170     60          China CHN
## 3   3      Gunnar Nielsen Aaby   M  24     NA     NA        Denmark DEN
## 4   4     Edgar Lindenau Aabye   M  34     NA     NA Denmark/Sweden DEN
## 5   5 Christine Jacoba Aaftink   F  21    185     82    Netherlands NED
## 6   5 Christine Jacoba Aaftink   F  21    185     82    Netherlands NED
## 7   5 Christine Jacoba Aaftink   F  25    185     82    Netherlands NED
## 8   5 Christine Jacoba Aaftink   F  25    185     82    Netherlands NED
## 9   5 Christine Jacoba Aaftink   F  27    185     82    Netherlands NED
## 10  5 Christine Jacoba Aaftink   F  27    185     82    Netherlands NED
##          Games Year Season        City         Sport
## 1  1992 Summer 1992 Summer   Barcelona    Basketball
## 2  2012 Summer 2012 Summer      London          Judo
## 3  1920 Summer 1920 Summer   Antwerpen      Football
## 4  1900 Summer 1900 Summer       Paris    Tug-Of-War
## 5  1988 Winter 1988 Winter     Calgary Speed Skating
## 6  1988 Winter 1988 Winter     Calgary Speed Skating
## 7  1992 Winter 1992 Winter Albertville Speed Skating
## 8  1992 Winter 1992 Winter Albertville Speed Skating
## 9  1994 Winter 1994 Winter Lillehammer Speed Skating
## 10 1994 Winter 1994 Winter Lillehammer Speed Skating
##                                 Event Medal
## 1         Basketball Men's Basketball  <NA>
## 2        Judo Men's Extra-Lightweight  <NA>
## 3             Football Men's Football  <NA>
## 4         Tug-Of-War Men's Tug-Of-War  Gold
## 5    Speed Skating Women's 500 metres  <NA>
## 6  Speed Skating Women's 1,000 metres  <NA>
## 7    Speed Skating Women's 500 metres  <NA>
## 8  Speed Skating Women's 1,000 metres  <NA>
## 9    Speed Skating Women's 500 metres  <NA>
## 10 Speed Skating Women's 1,000 metres  <NA>
#Our supporting data olympic_country
head(olympic_country,10)
##    NOC      region                notes
## 1  AFG Afghanistan                     
## 2  AHO     Curacao Netherlands Antilles
## 3  ALB     Albania                     
## 4  ALG     Algeria                     
## 5  AND     Andorra                     
## 6  ANG      Angola                     
## 7  ANT     Antigua  Antigua and Barbuda
## 8  ANZ   Australia          Australasia
## 9  ARG   Argentina                     
## 10 ARM     Armenia

Structuring Data

As dataframe shown from olympic_athlete & olympic_country we need to combine both of dataframe because, 1. From business question we must know the frequency each country but the variable country occur in olympic_country as region column and the detail of athlete in olympic_athlete. 2. To simplify our wrangling and analyze should be in one dataframe.

Merge the data

#Merge the data with key column, "NOC"
olympic <- merge(olympic_athlete, olympic_country, by.x = "NOC", by.y = "NOC", all.x = T)
head(olympic,10)
##    NOC     ID                     Name Sex Age Height Weight        Team
## 1  AFG 132181              Najam Yahya   M  NA     NA     NA Afghanistan
## 2  AFG  87371    Ahmad Jahan Nuristani   M  NA     NA     NA Afghanistan
## 3  AFG  44977        Mohammad Halilula   M  28    163     57 Afghanistan
## 4  AFG    502        Ahmad Shah Abouwi   M  NA     NA     NA Afghanistan
## 5  AFG 109153       Shakar Khan Shakar   M  24     NA     74 Afghanistan
## 6  AFG  29626     Sultan Mohammad Dost   M  28    168     73 Afghanistan
## 7  AFG   1076    Jammal-ud-Din Affendi   M  28     NA     NA Afghanistan
## 8  AFG 121376 Khan Nasrullah Totakhail   M  NA     NA     NA Afghanistan
## 9  AFG  80210                 Alam Mir   M  NA     NA     57 Afghanistan
## 10 AFG  87374  Mohammad Amin Nuristani   M  NA     NA     NA Afghanistan
##          Games Year Season      City     Sport
## 1  1956 Summer 1956 Summer Melbourne    Hockey
## 2  1948 Summer 1948 Summer    London    Hockey
## 3  1980 Summer 1980 Summer    Moskva Wrestling
## 4  1956 Summer 1956 Summer Melbourne    Hockey
## 5  1964 Summer 1964 Summer     Tokyo Wrestling
## 6  1960 Summer 1960 Summer      Roma Wrestling
## 7  1936 Summer 1936 Summer    Berlin    Hockey
## 8  1956 Summer 1956 Summer Melbourne    Hockey
## 9  1972 Summer 1972 Summer    Munich Wrestling
## 10 1956 Summer 1956 Summer Melbourne    Hockey
##                                        Event Medal      region notes
## 1                        Hockey Men's Hockey  <NA> Afghanistan      
## 2                        Hockey Men's Hockey  <NA> Afghanistan      
## 3    Wrestling Men's Bantamweight, Freestyle  <NA> Afghanistan      
## 4                        Hockey Men's Hockey  <NA> Afghanistan      
## 5    Wrestling Men's Welterweight, Freestyle  <NA> Afghanistan      
## 6    Wrestling Men's Welterweight, Freestyle  <NA> Afghanistan      
## 7                        Hockey Men's Hockey  <NA> Afghanistan      
## 8                        Hockey Men's Hockey  <NA> Afghanistan      
## 9  Wrestling Men's Bantamweight, Greco-Roman  <NA> Afghanistan      
## 10                       Hockey Men's Hockey  <NA> Afghanistan

Check the data type every column & adjust the type.

#Check data type every column
str(olympic)
## 'data.frame':    271116 obs. of  17 variables:
##  $ NOC   : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ ID    : int  132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
##  $ Name  : chr  "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
##  $ Sex   : chr  "M" "M" "M" "M" ...
##  $ Age   : int  NA NA 28 NA 24 28 28 NA NA NA ...
##  $ Height: int  NA NA 163 NA NA 168 NA NA NA NA ...
##  $ Weight: num  NA NA 57 NA 74 73 NA NA 57 NA ...
##  $ Team  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Games : chr  "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
##  $ Year  : int  1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
##  $ Season: chr  "Summer" "Summer" "Summer" "Summer" ...
##  $ City  : chr  "Melbourne" "London" "Moskva" "Melbourne" ...
##  $ Sport : chr  "Hockey" "Hockey" "Wrestling" "Hockey" ...
##  $ Event : chr  "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
##  $ Medal : chr  NA NA NA NA ...
##  $ region: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ notes : chr  "" "" "" "" ...

From result above there is several variables that need to change to obtain appropiate output. The variables are NOC, Sex, Height, Weight, Team, Games, Season, City, Sport, Event, Medal, Region and change become factor.

olympic[,c("Sex","Team","NOC","Games","Season","City","Sport","Medal","Event","region")] <-
  lapply(olympic[,c("Sex","Team","NOC","Games","Season","City","Sport","Medal","Event","region")], as.factor)

Cleaning Data

This step occur activites to check the completness of data, missing value, and validate the connection between variables.

summary(olympic)
##       NOC               ID             Name           Sex       
##  USA    : 18853   Min.   :     1   Length:271116      F: 74522  
##  FRA    : 12758   1st Qu.: 34643   Class :character   M:196594  
##  GBR    : 12256   Median : 68205   Mode  :character             
##  ITA    : 10715   Mean   : 68249                                
##  GER    :  9830   3rd Qu.:102097                                
##  CAN    :  9733   Max.   :135571                                
##  (Other):196971                                                 
##       Age            Height          Weight                 Team       
##  Min.   :10.00   Min.   :127.0   Min.   : 25.0   United States: 17847  
##  1st Qu.:21.00   1st Qu.:168.0   1st Qu.: 60.0   France       : 11988  
##  Median :24.00   Median :175.0   Median : 70.0   Great Britain: 11404  
##  Mean   :25.56   Mean   :175.3   Mean   : 70.7   Italy        : 10260  
##  3rd Qu.:28.00   3rd Qu.:183.0   3rd Qu.: 79.0   Germany      :  9326  
##  Max.   :97.00   Max.   :226.0   Max.   :214.0   Canada       :  9279  
##  NA's   :9474    NA's   :60171   NA's   :62875   (Other)      :201012  
##          Games             Year         Season                   City       
##  2000 Summer: 13821   Min.   :1896   Summer:222552   London        : 22426  
##  1996 Summer: 13780   1st Qu.:1960   Winter: 48564   Athina        : 15556  
##  2016 Summer: 13688   Median :1988                   Sydney        : 13821  
##  2008 Summer: 13602   Mean   :1978                   Atlanta       : 13780  
##  2004 Summer: 13443   3rd Qu.:2002                   Rio de Janeiro: 13688  
##  1992 Summer: 12977   Max.   :2016                   Beijing       : 13602  
##  (Other)    :189805                                  (Other)       :178243  
##         Sport                                        Event       
##  Athletics : 38624   Football Men's Football            :  5733  
##  Gymnastics: 26707   Ice Hockey Men's Ice Hockey        :  4762  
##  Swimming  : 23195   Hockey Men's Hockey                :  3958  
##  Shooting  : 11448   Water Polo Men's Water Polo        :  3358  
##  Cycling   : 10859   Basketball Men's Basketball        :  3280  
##  Fencing   : 10735   Cycling Men's Road Race, Individual:  2947  
##  (Other)   :149548   (Other)                            :247078  
##     Medal            region          notes          
##  Bronze: 13295   USA    : 18853   Length:271116     
##  Gold  : 13372   Germany: 15883   Class :character  
##  Silver: 13116   France : 12758   Mode  :character  
##  NA's  :231333   UK     : 12256                     
##                  Russia : 11692                     
##                  (Other):199304                     
##                  NA's   :   370

From summary code, it generates simple aggregation in all variable in each value depends on data type and to peek missing value in variable. Conclusion from the summary are:
1. From variable ID, amount athlete that participate olympic is 135.571
2. From variable Sex, amount Male (196.594) dominate female athlete (74.522)
3. From variable Age, avarage age athlete olympic all the time is 25.56 and have 9.479 missing values.
4. From variable Height, avarage height athlete olympic all the time is 175.3 and have 60.171 missing values.
5. From variable Weight, avarage weight athlete olympic all the time is 70.7 and have 62.875 missing values.
6. From variable Teams, the top three amount athlete by team’s name are United States, France, and Great Britain.
7. From variable Games, the most participant athlete when 2000 Summer Olympic with amount 13.821 athletes.
8. From variable Year, the first olympic held on 1896 and last on 2016.
9. From variable Season, the most season helds olympic is during summer.
10.From variable Sport, the most participate sport is athletics with amount 38.624 athletes.
11. From variable Event, the most participate event is Men’s Footbal with amount 5.733 athletes.
12. From variable Medal, the NA’s value define the athlete who not earn medal.
13. From variable region, the top three participate athlete by region are USA, Germany, and France.
14. From notes, don’t have any info.

Inspection missing value focus on Age, Height, Weight, and region.

Why it’s matter? As we know from business question, to calculate BMI we need height, weight, age, and for the output expected we need to compare BMI based on country.

#Load/install library package to filter data
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Identifying NA
colSums(is.na(olympic))
##    NOC     ID   Name    Sex    Age Height Weight   Team  Games   Year Season 
##      0      0      0      0   9474  60171  62875      0      0      0      0 
##   City  Sport  Event  Medal region  notes 
##      0      0      0 231333    370    349

Delete all row containt NA in variable Age, Height, Weight, and Region

#Identifying amount row where's NA in column Age, Height, Weight, Region

#Delete row containt NA based on Age, Height, Weight, Region
olympic <- olympic[!is.na(olympic$Age),]
olympic <- olympic[!is.na(olympic$Height),]
olympic <- olympic[!is.na(olympic$Weight),]
olympic <- olympic[!is.na(olympic$region),]

colSums(is.na(olympic))
##    NOC     ID   Name    Sex    Age Height Weight   Team  Games   Year Season 
##      0      0      0      0      0      0      0      0      0      0      0 
##   City  Sport  Event  Medal region  notes 
##      0      0      0 175723      0      0

Replace missing value on Medal variable to No Medal

#To replace with string, the data type column should be in character type
no_medal <- olympic$Medal
no_medal <- as.character(no_medal)
no_medal[is.na(no_medal)] <- "No_Medal"
no_medal <- as.factor(no_medal)
olympic$Medal <- no_medal

colSums(is.na(olympic))
##    NOC     ID   Name    Sex    Age Height Weight   Team  Games   Year Season 
##      0      0      0      0      0      0      0      0      0      0      0 
##   City  Sport  Event  Medal region  notes 
##      0      0      0      0      0      0

Remove notes column

olympic$notes <- NULL

Summary all data once again for final check

summary(olympic)
##       NOC               ID             Name           Sex       
##  USA    : 14214   Min.   :     1   Length:205895      F: 66592  
##  FRA    :  7977   1st Qu.: 35178   Class :character   M:139303  
##  CAN    :  7965   Median : 68609   Mode  :character             
##  GBR    :  7766   Mean   : 68600                                
##  ITA    :  7697   3rd Qu.:102290                                
##  JPN    :  7487   Max.   :135571                                
##  (Other):152789                                                 
##       Age            Height          Weight                 Team       
##  Min.   :11.00   Min.   :127.0   Min.   : 25.0   United States: 13714  
##  1st Qu.:21.00   1st Qu.:168.0   1st Qu.: 60.0   France       :  7807  
##  Median :24.00   Median :175.0   Median : 70.0   Canada       :  7668  
##  Mean   :25.06   Mean   :175.4   Mean   : 70.7   Great Britain:  7499  
##  3rd Qu.:28.00   3rd Qu.:183.0   3rd Qu.: 79.0   Italy        :  7433  
##  Max.   :71.00   Max.   :226.0   Max.   :214.0   Japan        :  7339  
##                                                  (Other)      :154435  
##          Games             Year         Season                   City       
##  2000 Summer: 13659   Min.   :1896   Summer:166436   London        : 13766  
##  2016 Summer: 13398   1st Qu.:1976   Winter: 39459   Sydney        : 13659  
##  2004 Summer: 13377   Median :1992                   Athina        : 13618  
##  2008 Summer: 13365   Mean   :1990                   Rio de Janeiro: 13398  
##  2012 Summer: 12486   3rd Qu.:2006                   Beijing       : 13365  
##  1996 Summer: 11816   Max.   :2016                   Atlanta       : 11816  
##  (Other)    :127794                                  (Other)       :126273  
##                   Sport                                        Event       
##  Athletics           : 32348   Ice Hockey Men's Ice Hockey        :  3825  
##  Swimming            : 18658   Football Men's Football            :  3459  
##  Gymnastics          : 18267   Hockey Men's Hockey                :  2874  
##  Rowing              :  7789   Basketball Men's Basketball        :  2461  
##  Cycling             :  7775   Water Polo Men's Water Polo        :  2231  
##  Cross Country Skiing:  7529   Cycling Men's Road Race, Individual:  2099  
##  (Other)             :113529   (Other)                            :188946  
##       Medal            region      
##  Bronze  : 10144   USA    : 14214  
##  Gold    : 10166   Germany: 13183  
##  No_Medal:175723   Russia : 10398  
##  Silver  :  9862   France :  7977  
##                    Canada :  7966  
##                    UK     :  7766  
##                    (Other):144391
nrow(olympic)
## [1] 205895

Enriching

As the output expected there is two columns that should be add and show, first column is Body Mass Index (BMI). The formula to calculate BMI :

knitr::include_graphics("BMI formula.png")

Add and calculate BMI

#remember height in m not cm
olympic <- olympic %>%
  mutate(BMI = Weight/(Height/100)^2) %>%
  mutate_at(vars(BMI), funs(round(., 2)))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
head(olympic)
##    NOC     ID                 Name Sex Age Height Weight        Team
## 3  AFG  44977    Mohammad Halilula   M  28    163     57 Afghanistan
## 6  AFG  29626 Sultan Mohammad Dost   M  28    168     73 Afghanistan
## 11 AFG 116125 Nizam-ud-din Subhani   M  34    168    111 Afghanistan
## 13 AFG 133692    Khojawahid Zahedi   M  20    178     74 Afghanistan
## 15 AFG 109486   Abdul Hadi Shekaib   M  20    178     68 Afghanistan
## 17 AFG 106372   Habib Zareef Sayed   M  23    170     58 Afghanistan
##          Games Year Season   City     Sport
## 3  1980 Summer 1980 Summer Moskva Wrestling
## 6  1960 Summer 1960 Summer   Roma Wrestling
## 11 1960 Summer 1960 Summer   Roma Wrestling
## 13 1980 Summer 1980 Summer Moskva Wrestling
## 15 1960 Summer 1960 Summer   Roma Athletics
## 17 1960 Summer 1960 Summer   Roma Athletics
##                                      Event    Medal      region   BMI
## 3  Wrestling Men's Bantamweight, Freestyle No_Medal Afghanistan 21.45
## 6  Wrestling Men's Welterweight, Freestyle No_Medal Afghanistan 25.86
## 11  Wrestling Men's Heavyweight, Freestyle No_Medal Afghanistan 39.33
## 13 Wrestling Men's Welterweight, Freestyle No_Medal Afghanistan 23.36
## 15              Athletics Men's 100 metres No_Medal Afghanistan 21.46
## 17              Athletics Men's 400 metres No_Medal Afghanistan 20.07

Second column is for to categorize the BMI based on BMI chart.

olympic <- olympic %>%
  mutate(BMI_category = case_when(BMI < 18.5 ~ "Underweight",
                                  BMI <= 24.9 ~ "Normal",
                                  BMI <= 29.9 ~ "Overweight",
                                  BMI <= 39.9 ~ "Obese",
                                  BMI >39.9 ~ "Morbidly Obese"))

head(olympic[,c("BMI","BMI_category"),])
##      BMI BMI_category
## 3  21.45       Normal
## 6  25.86   Overweight
## 11 39.33        Obese
## 13 23.36       Normal
## 15 21.46       Normal
## 17 20.07       Normal

Transform value in variable medal to column and give total of medal each athlete

library(tidyr)
olympic <- olympic %>%
  pivot_wider(
    names_from = Medal,
    values_from = Medal,
    values_fn = length,
    values_fill = 0) %>%
  mutate(Total_Medal = Bronze + Silver + Gold)

olympic$No_Medal <- NULL

head(olympic)
## # A tibble: 6 x 21
##   NOC       ID Name    Sex     Age Height Weight Team   Games  Year Season City 
##   <fct>  <int> <chr>   <fct> <int>  <int>  <dbl> <fct>  <fct> <int> <fct>  <fct>
## 1 AFG    44977 Mohamm… M        28    163     57 Afgha… 1980…  1980 Summer Mosk…
## 2 AFG    29626 Sultan… M        28    168     73 Afgha… 1960…  1960 Summer Roma 
## 3 AFG   116125 Nizam-… M        34    168    111 Afgha… 1960…  1960 Summer Roma 
## 4 AFG   133692 Khojaw… M        20    178     74 Afgha… 1980…  1980 Summer Mosk…
## 5 AFG   109486 Abdul … M        20    178     68 Afgha… 1960…  1960 Summer Roma 
## 6 AFG   106372 Habib … M        23    170     58 Afgha… 1960…  1960 Summer Roma 
## # … with 9 more variables: Sport <fct>, Event <fct>, region <fct>, BMI <dbl>,
## #   BMI_category <chr>, Bronze <int>, Silver <int>, Gold <int>,
## #   Total_Medal <int>