Refer to care point health US, body mass index (BMI) is a guideline used by health professionals and bariatric surgeons to determine the overall health of an individual in terms of weight and body fat. BMI is calculated using a ratio of your height and weight, and can be affected by how often you exercise and by the types of foods you eat. Those who have above-average BMIs that fall into the overweight and obese categories can take steps to become healthier through diet, exercise, and weight-loss surgery so they can improve their overall quality of life and live longer with fewer health complications.
knitr::include_graphics("/Users/pandjibagas/Desktop/Algoritma Course/LBB/Olympic_Athletes/120 Years Olympic History/Body Mass Index.png")
From image above, US medical standard categorize people to 5 category by result the BMI, which is underweight (BMI < 18.5), normal (<18.5 < BMI < 24.6), overweight (25 < BMI < 29.9), obese (30 < BMI < 39.9), and morbidly obese (BMI > 40). For the record, the index only apply to adult aged above 20 years.
How about the olympic athletes? Is it all athletes have ideal BMI? or how BMI effect the earning medal?
In this publication will interprets about the effect of BMI in olympic athlete from historical data on the modern Olympic Games, from Athens 1896 to Rio 2016. Each row corresponds to an individual athlete competing in an individual event, including the athlete’s name, sex, age, height, weight, country, and medal, and the event’s name, sport, games, year, and city. The outputs will exemine the simple distribuiton from all parameters include BMI and categorize athlete by BMI chart shown above. Lets gets started!
The core activites from this step are discovering and input, structuring, cleaning, enriching, validating, and analyzing.
The historical data olympic acquaried from mavenanalytic (https://www.mavenanalytics.io/data-playground). The next step input the data with code below,
#Load/install library package to read data
library(readxl)
#All packages data for observing the overall data and output
olympic_athlete <- read.csv("athlete_events.csv") #main data
olympic_country <- read.csv("country_definitions.csv") #supporting data
athlete_def <- read.csv("athlete_events_data_dictionary.csv") #definition each coloumn from athlete_events.csv
country_def <- read.csv("country_definitions_data_dictionary.csv") #definition each coloumn from country_definiitons.csv
Show the data
#Our main data olympic_athlete
head(olympic_athlete,10)
## ID Name Sex Age Height Weight Team NOC
## 1 1 A Dijiang M 24 180 80 China CHN
## 2 2 A Lamusi M 23 170 60 China CHN
## 3 3 Gunnar Nielsen Aaby M 24 NA NA Denmark DEN
## 4 4 Edgar Lindenau Aabye M 34 NA NA Denmark/Sweden DEN
## 5 5 Christine Jacoba Aaftink F 21 185 82 Netherlands NED
## 6 5 Christine Jacoba Aaftink F 21 185 82 Netherlands NED
## 7 5 Christine Jacoba Aaftink F 25 185 82 Netherlands NED
## 8 5 Christine Jacoba Aaftink F 25 185 82 Netherlands NED
## 9 5 Christine Jacoba Aaftink F 27 185 82 Netherlands NED
## 10 5 Christine Jacoba Aaftink F 27 185 82 Netherlands NED
## Games Year Season City Sport
## 1 1992 Summer 1992 Summer Barcelona Basketball
## 2 2012 Summer 2012 Summer London Judo
## 3 1920 Summer 1920 Summer Antwerpen Football
## 4 1900 Summer 1900 Summer Paris Tug-Of-War
## 5 1988 Winter 1988 Winter Calgary Speed Skating
## 6 1988 Winter 1988 Winter Calgary Speed Skating
## 7 1992 Winter 1992 Winter Albertville Speed Skating
## 8 1992 Winter 1992 Winter Albertville Speed Skating
## 9 1994 Winter 1994 Winter Lillehammer Speed Skating
## 10 1994 Winter 1994 Winter Lillehammer Speed Skating
## Event Medal
## 1 Basketball Men's Basketball <NA>
## 2 Judo Men's Extra-Lightweight <NA>
## 3 Football Men's Football <NA>
## 4 Tug-Of-War Men's Tug-Of-War Gold
## 5 Speed Skating Women's 500 metres <NA>
## 6 Speed Skating Women's 1,000 metres <NA>
## 7 Speed Skating Women's 500 metres <NA>
## 8 Speed Skating Women's 1,000 metres <NA>
## 9 Speed Skating Women's 500 metres <NA>
## 10 Speed Skating Women's 1,000 metres <NA>
#Our supporting data olympic_country
head(olympic_country,10)
## NOC region notes
## 1 AFG Afghanistan
## 2 AHO Curacao Netherlands Antilles
## 3 ALB Albania
## 4 ALG Algeria
## 5 AND Andorra
## 6 ANG Angola
## 7 ANT Antigua Antigua and Barbuda
## 8 ANZ Australia Australasia
## 9 ARG Argentina
## 10 ARM Armenia
As dataframe shown from olympic_athlete & olympic_country we need to combine both of dataframe because, 1. From business question we must know the frequency each country but the variable country occur in olympic_country as region column and the detail of athlete in olympic_athlete. 2. To simplify our wrangling and analyze should be in one dataframe.
Merge the data
#Merge the data with key column, "NOC"
olympic <- merge(olympic_athlete, olympic_country, by.x = "NOC", by.y = "NOC", all.x = T)
head(olympic,10)
## NOC ID Name Sex Age Height Weight Team
## 1 AFG 132181 Najam Yahya M NA NA NA Afghanistan
## 2 AFG 87371 Ahmad Jahan Nuristani M NA NA NA Afghanistan
## 3 AFG 44977 Mohammad Halilula M 28 163 57 Afghanistan
## 4 AFG 502 Ahmad Shah Abouwi M NA NA NA Afghanistan
## 5 AFG 109153 Shakar Khan Shakar M 24 NA 74 Afghanistan
## 6 AFG 29626 Sultan Mohammad Dost M 28 168 73 Afghanistan
## 7 AFG 1076 Jammal-ud-Din Affendi M 28 NA NA Afghanistan
## 8 AFG 121376 Khan Nasrullah Totakhail M NA NA NA Afghanistan
## 9 AFG 80210 Alam Mir M NA NA 57 Afghanistan
## 10 AFG 87374 Mohammad Amin Nuristani M NA NA NA Afghanistan
## Games Year Season City Sport
## 1 1956 Summer 1956 Summer Melbourne Hockey
## 2 1948 Summer 1948 Summer London Hockey
## 3 1980 Summer 1980 Summer Moskva Wrestling
## 4 1956 Summer 1956 Summer Melbourne Hockey
## 5 1964 Summer 1964 Summer Tokyo Wrestling
## 6 1960 Summer 1960 Summer Roma Wrestling
## 7 1936 Summer 1936 Summer Berlin Hockey
## 8 1956 Summer 1956 Summer Melbourne Hockey
## 9 1972 Summer 1972 Summer Munich Wrestling
## 10 1956 Summer 1956 Summer Melbourne Hockey
## Event Medal region notes
## 1 Hockey Men's Hockey <NA> Afghanistan
## 2 Hockey Men's Hockey <NA> Afghanistan
## 3 Wrestling Men's Bantamweight, Freestyle <NA> Afghanistan
## 4 Hockey Men's Hockey <NA> Afghanistan
## 5 Wrestling Men's Welterweight, Freestyle <NA> Afghanistan
## 6 Wrestling Men's Welterweight, Freestyle <NA> Afghanistan
## 7 Hockey Men's Hockey <NA> Afghanistan
## 8 Hockey Men's Hockey <NA> Afghanistan
## 9 Wrestling Men's Bantamweight, Greco-Roman <NA> Afghanistan
## 10 Hockey Men's Hockey <NA> Afghanistan
Check the data type every column & adjust the type.
#Check data type every column
str(olympic)
## 'data.frame': 271116 obs. of 17 variables:
## $ NOC : chr "AFG" "AFG" "AFG" "AFG" ...
## $ ID : int 132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
## $ Name : chr "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
## $ Sex : chr "M" "M" "M" "M" ...
## $ Age : int NA NA 28 NA 24 28 28 NA NA NA ...
## $ Height: int NA NA 163 NA NA 168 NA NA NA NA ...
## $ Weight: num NA NA 57 NA 74 73 NA NA 57 NA ...
## $ Team : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Games : chr "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
## $ Year : int 1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
## $ Season: chr "Summer" "Summer" "Summer" "Summer" ...
## $ City : chr "Melbourne" "London" "Moskva" "Melbourne" ...
## $ Sport : chr "Hockey" "Hockey" "Wrestling" "Hockey" ...
## $ Event : chr "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
## $ Medal : chr NA NA NA NA ...
## $ region: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ notes : chr "" "" "" "" ...
From result above there is several variables that need to change to obtain appropiate output. The variables are NOC, Sex, Height, Weight, Team, Games, Season, City, Sport, Event, Medal, Region and change become factor.
olympic[,c("Sex","Team","NOC","Games","Season","City","Sport","Medal","Event","region")] <-
lapply(olympic[,c("Sex","Team","NOC","Games","Season","City","Sport","Medal","Event","region")], as.factor)
This step occur activites to check the completness of data, missing value, and validate the connection between variables.
summary(olympic)
## NOC ID Name Sex
## USA : 18853 Min. : 1 Length:271116 F: 74522
## FRA : 12758 1st Qu.: 34643 Class :character M:196594
## GBR : 12256 Median : 68205 Mode :character
## ITA : 10715 Mean : 68249
## GER : 9830 3rd Qu.:102097
## CAN : 9733 Max. :135571
## (Other):196971
## Age Height Weight Team
## Min. :10.00 Min. :127.0 Min. : 25.0 United States: 17847
## 1st Qu.:21.00 1st Qu.:168.0 1st Qu.: 60.0 France : 11988
## Median :24.00 Median :175.0 Median : 70.0 Great Britain: 11404
## Mean :25.56 Mean :175.3 Mean : 70.7 Italy : 10260
## 3rd Qu.:28.00 3rd Qu.:183.0 3rd Qu.: 79.0 Germany : 9326
## Max. :97.00 Max. :226.0 Max. :214.0 Canada : 9279
## NA's :9474 NA's :60171 NA's :62875 (Other) :201012
## Games Year Season City
## 2000 Summer: 13821 Min. :1896 Summer:222552 London : 22426
## 1996 Summer: 13780 1st Qu.:1960 Winter: 48564 Athina : 15556
## 2016 Summer: 13688 Median :1988 Sydney : 13821
## 2008 Summer: 13602 Mean :1978 Atlanta : 13780
## 2004 Summer: 13443 3rd Qu.:2002 Rio de Janeiro: 13688
## 1992 Summer: 12977 Max. :2016 Beijing : 13602
## (Other) :189805 (Other) :178243
## Sport Event
## Athletics : 38624 Football Men's Football : 5733
## Gymnastics: 26707 Ice Hockey Men's Ice Hockey : 4762
## Swimming : 23195 Hockey Men's Hockey : 3958
## Shooting : 11448 Water Polo Men's Water Polo : 3358
## Cycling : 10859 Basketball Men's Basketball : 3280
## Fencing : 10735 Cycling Men's Road Race, Individual: 2947
## (Other) :149548 (Other) :247078
## Medal region notes
## Bronze: 13295 USA : 18853 Length:271116
## Gold : 13372 Germany: 15883 Class :character
## Silver: 13116 France : 12758 Mode :character
## NA's :231333 UK : 12256
## Russia : 11692
## (Other):199304
## NA's : 370
From summary code, it generates simple aggregation in all variable in each value depends on data type and to peek missing value in variable. Conclusion from the summary are:
1. From variable ID, amount athlete that participate olympic is 135.571
2. From variable Sex, amount Male (196.594) dominate female athlete (74.522)
3. From variable Age, avarage age athlete olympic all the time is 25.56 and have 9.479 missing values.
4. From variable Height, avarage height athlete olympic all the time is 175.3 and have 60.171 missing values.
5. From variable Weight, avarage weight athlete olympic all the time is 70.7 and have 62.875 missing values.
6. From variable Teams, the top three amount athlete by team’s name are United States, France, and Great Britain.
7. From variable Games, the most participant athlete when 2000 Summer Olympic with amount 13.821 athletes.
8. From variable Year, the first olympic held on 1896 and last on 2016.
9. From variable Season, the most season helds olympic is during summer.
10.From variable Sport, the most participate sport is athletics with amount 38.624 athletes.
11. From variable Event, the most participate event is Men’s Footbal with amount 5.733 athletes.
12. From variable Medal, the NA’s value define the athlete who not earn medal.
13. From variable region, the top three participate athlete by region are USA, Germany, and France.
14. From notes, don’t have any info.
Inspection missing value focus on Age, Height, Weight, and region.
Why it’s matter? As we know from business question, to calculate BMI we need height, weight, age, and for the output expected we need to compare BMI based on country.
#Load/install library package to filter data
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Identifying NA
colSums(is.na(olympic))
## NOC ID Name Sex Age Height Weight Team Games Year Season
## 0 0 0 0 9474 60171 62875 0 0 0 0
## City Sport Event Medal region notes
## 0 0 0 231333 370 349
Delete all row containt NA in variable Age, Height, Weight, and Region
#Identifying amount row where's NA in column Age, Height, Weight, Region
#Delete row containt NA based on Age, Height, Weight, Region
olympic <- olympic[!is.na(olympic$Age),]
olympic <- olympic[!is.na(olympic$Height),]
olympic <- olympic[!is.na(olympic$Weight),]
olympic <- olympic[!is.na(olympic$region),]
colSums(is.na(olympic))
## NOC ID Name Sex Age Height Weight Team Games Year Season
## 0 0 0 0 0 0 0 0 0 0 0
## City Sport Event Medal region notes
## 0 0 0 175723 0 0
Replace missing value on Medal variable to No Medal
#To replace with string, the data type column should be in character type
no_medal <- olympic$Medal
no_medal <- as.character(no_medal)
no_medal[is.na(no_medal)] <- "No_Medal"
no_medal <- as.factor(no_medal)
olympic$Medal <- no_medal
colSums(is.na(olympic))
## NOC ID Name Sex Age Height Weight Team Games Year Season
## 0 0 0 0 0 0 0 0 0 0 0
## City Sport Event Medal region notes
## 0 0 0 0 0 0
Remove notes column
olympic$notes <- NULL
Summary all data once again for final check
summary(olympic)
## NOC ID Name Sex
## USA : 14214 Min. : 1 Length:205895 F: 66592
## FRA : 7977 1st Qu.: 35178 Class :character M:139303
## CAN : 7965 Median : 68609 Mode :character
## GBR : 7766 Mean : 68600
## ITA : 7697 3rd Qu.:102290
## JPN : 7487 Max. :135571
## (Other):152789
## Age Height Weight Team
## Min. :11.00 Min. :127.0 Min. : 25.0 United States: 13714
## 1st Qu.:21.00 1st Qu.:168.0 1st Qu.: 60.0 France : 7807
## Median :24.00 Median :175.0 Median : 70.0 Canada : 7668
## Mean :25.06 Mean :175.4 Mean : 70.7 Great Britain: 7499
## 3rd Qu.:28.00 3rd Qu.:183.0 3rd Qu.: 79.0 Italy : 7433
## Max. :71.00 Max. :226.0 Max. :214.0 Japan : 7339
## (Other) :154435
## Games Year Season City
## 2000 Summer: 13659 Min. :1896 Summer:166436 London : 13766
## 2016 Summer: 13398 1st Qu.:1976 Winter: 39459 Sydney : 13659
## 2004 Summer: 13377 Median :1992 Athina : 13618
## 2008 Summer: 13365 Mean :1990 Rio de Janeiro: 13398
## 2012 Summer: 12486 3rd Qu.:2006 Beijing : 13365
## 1996 Summer: 11816 Max. :2016 Atlanta : 11816
## (Other) :127794 (Other) :126273
## Sport Event
## Athletics : 32348 Ice Hockey Men's Ice Hockey : 3825
## Swimming : 18658 Football Men's Football : 3459
## Gymnastics : 18267 Hockey Men's Hockey : 2874
## Rowing : 7789 Basketball Men's Basketball : 2461
## Cycling : 7775 Water Polo Men's Water Polo : 2231
## Cross Country Skiing: 7529 Cycling Men's Road Race, Individual: 2099
## (Other) :113529 (Other) :188946
## Medal region
## Bronze : 10144 USA : 14214
## Gold : 10166 Germany: 13183
## No_Medal:175723 Russia : 10398
## Silver : 9862 France : 7977
## Canada : 7966
## UK : 7766
## (Other):144391
nrow(olympic)
## [1] 205895
As the output expected there is two columns that should be add and show, first column is Body Mass Index (BMI). The formula to calculate BMI :
knitr::include_graphics("BMI formula.png")
Add and calculate BMI
#remember height in m not cm
olympic <- olympic %>%
mutate(BMI = Weight/(Height/100)^2) %>%
mutate_at(vars(BMI), funs(round(., 2)))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
head(olympic)
## NOC ID Name Sex Age Height Weight Team
## 3 AFG 44977 Mohammad Halilula M 28 163 57 Afghanistan
## 6 AFG 29626 Sultan Mohammad Dost M 28 168 73 Afghanistan
## 11 AFG 116125 Nizam-ud-din Subhani M 34 168 111 Afghanistan
## 13 AFG 133692 Khojawahid Zahedi M 20 178 74 Afghanistan
## 15 AFG 109486 Abdul Hadi Shekaib M 20 178 68 Afghanistan
## 17 AFG 106372 Habib Zareef Sayed M 23 170 58 Afghanistan
## Games Year Season City Sport
## 3 1980 Summer 1980 Summer Moskva Wrestling
## 6 1960 Summer 1960 Summer Roma Wrestling
## 11 1960 Summer 1960 Summer Roma Wrestling
## 13 1980 Summer 1980 Summer Moskva Wrestling
## 15 1960 Summer 1960 Summer Roma Athletics
## 17 1960 Summer 1960 Summer Roma Athletics
## Event Medal region BMI
## 3 Wrestling Men's Bantamweight, Freestyle No_Medal Afghanistan 21.45
## 6 Wrestling Men's Welterweight, Freestyle No_Medal Afghanistan 25.86
## 11 Wrestling Men's Heavyweight, Freestyle No_Medal Afghanistan 39.33
## 13 Wrestling Men's Welterweight, Freestyle No_Medal Afghanistan 23.36
## 15 Athletics Men's 100 metres No_Medal Afghanistan 21.46
## 17 Athletics Men's 400 metres No_Medal Afghanistan 20.07
Second column is for to categorize the BMI based on BMI chart.
olympic <- olympic %>%
mutate(BMI_category = case_when(BMI < 18.5 ~ "Underweight",
BMI <= 24.9 ~ "Normal",
BMI <= 29.9 ~ "Overweight",
BMI <= 39.9 ~ "Obese",
BMI >39.9 ~ "Morbidly Obese"))
head(olympic[,c("BMI","BMI_category"),])
## BMI BMI_category
## 3 21.45 Normal
## 6 25.86 Overweight
## 11 39.33 Obese
## 13 23.36 Normal
## 15 21.46 Normal
## 17 20.07 Normal
Transform value in variable medal to column and give total of medal each athlete
library(tidyr)
olympic <- olympic %>%
pivot_wider(
names_from = Medal,
values_from = Medal,
values_fn = length,
values_fill = 0) %>%
mutate(Total_Medal = Bronze + Silver + Gold)
olympic$No_Medal <- NULL
head(olympic)
## # A tibble: 6 x 21
## NOC ID Name Sex Age Height Weight Team Games Year Season City
## <fct> <int> <chr> <fct> <int> <int> <dbl> <fct> <fct> <int> <fct> <fct>
## 1 AFG 44977 Mohamm… M 28 163 57 Afgha… 1980… 1980 Summer Mosk…
## 2 AFG 29626 Sultan… M 28 168 73 Afgha… 1960… 1960 Summer Roma
## 3 AFG 116125 Nizam-… M 34 168 111 Afgha… 1960… 1960 Summer Roma
## 4 AFG 133692 Khojaw… M 20 178 74 Afgha… 1980… 1980 Summer Mosk…
## 5 AFG 109486 Abdul … M 20 178 68 Afgha… 1960… 1960 Summer Roma
## 6 AFG 106372 Habib … M 23 170 58 Afgha… 1960… 1960 Summer Roma
## # … with 9 more variables: Sport <fct>, Event <fct>, region <fct>, BMI <dbl>,
## # BMI_category <chr>, Bronze <int>, Silver <int>, Gold <int>,
## # Total_Medal <int>