Refer to care point health US, body mass index (BMI) is a guideline used by health professionals and bariatric surgeons to determine the overall health of an individual in terms of weight and body fat. BMI is calculated using a ratio of your height and weight, and can be affected by how often you exercise and by the types of foods you eat. Those who have above-average BMIs that fall into the overweight and obese categories can take steps to become healthier through diet, exercise, and weight-loss surgery so they can improve their overall quality of life and live longer with fewer health complications.
knitr::include_graphics("/Users/pandjibagas/Desktop/Algoritma Course/LBB/Olympic_Athletes/120 Years Olympic History/Body Mass Index.png")
From image above, US medical standard categorize people to 5 category by result the BMI, which is underweight (BMI < 18.5), normal (<18.5 < BMI < 24.6), overweight (25 < BMI < 29.9), obese (30 < BMI < 39.9), and morbidly obese (BMI > 40). For the record, the index only apply to adult aged above 20 years.
In this publication will interprets about the effect of BMI in olympic athlete from historical data on the modern Olympic Games, from Athens 1896 to Rio 2016. Each row corresponds to an individual athlete competing in an individual event, including the athlete’s name, sex, age, height, weight, country, and medal, and the event’s name, sport, games, year, and city. The outputs will exemine the simple distribuiton from all parameters include BMI, categorize athlete by BMI chart shown above, and the correlation BMI with earning medals. Lets gets started!
The core activites from this step are discovering and input, structuring, cleaning, enriching, validating, and analyzing.
The historical data olympic acquaried from mavenanalytic (https://www.mavenanalytics.io/data-playground). The next step input the data with code below,
#Load/install library package to read data
library(readxl)
library(readxl)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.0 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(viridis)
## Loading required package: viridisLite
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggiraph)
library(glue)
##
## Attaching package: 'glue'
## The following object is masked from 'package:dplyr':
##
## collapse
#All packages data for observing the overall data and output
olympic_athlete <- read.csv("athlete_events.csv") #main data
olympic_country <- read.csv("country_definitions.csv") #supporting data
athlete_def <- read.csv("athlete_events_data_dictionary.csv") #definition each coloumn from athlete_events.csv
country_def <- read.csv("country_definitions_data_dictionary.csv") #definition each coloumn from country_definiitons.csv
Show the data
#Our main data olympic_athlete
head(olympic_athlete,10)
#Our supporting data olympic_country
head(olympic_country,10)
As dataframe shown from olympic_athlete & olympic_country we need to combine both of dataframe because, 1. From business question we must know the frequency each country but the variable country occur in olympic_country as region column and the detail of athlete in olympic_athlete. 2. To simplify our wrangling and analyze should be in one dataframe.
Merge the data
#Merge the data with key column, "NOC"
olympic <- merge(olympic_athlete, olympic_country, by.x = "NOC", by.y = "NOC", all.x = T)
head(olympic,10)
Check the data type every column & adjust the type.
#Check data type every column
str(olympic)
## 'data.frame': 271116 obs. of 17 variables:
## $ NOC : chr "AFG" "AFG" "AFG" "AFG" ...
## $ ID : int 132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
## $ Name : chr "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
## $ Sex : chr "M" "M" "M" "M" ...
## $ Age : int NA NA 28 NA 24 28 28 NA NA NA ...
## $ Height: int NA NA 163 NA NA 168 NA NA NA NA ...
## $ Weight: num NA NA 57 NA 74 73 NA NA 57 NA ...
## $ Team : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Games : chr "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
## $ Year : int 1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
## $ Season: chr "Summer" "Summer" "Summer" "Summer" ...
## $ City : chr "Melbourne" "London" "Moskva" "Melbourne" ...
## $ Sport : chr "Hockey" "Hockey" "Wrestling" "Hockey" ...
## $ Event : chr "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
## $ Medal : chr NA NA NA NA ...
## $ region: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ notes : chr "" "" "" "" ...
From result above there is several variables that need to change to obtain appropiate output. The variables are NOC, Sex, Height, Weight, Team, Games, Season, City, Sport, Event, Medal, Region and change become factor.
olympic[,c("Sex","Team","NOC","Games","Season","City","Sport","Medal","Event","region")] <-
lapply(olympic[,c("Sex","Team","NOC","Games","Season","City","Sport","Medal","Event","region")], as.factor)
This step occur activites to check the completness of data, missing value, and validate the connection between variables.
summary(olympic)
## NOC ID Name Sex
## USA : 18853 Min. : 1 Length:271116 F: 74522
## FRA : 12758 1st Qu.: 34643 Class :character M:196594
## GBR : 12256 Median : 68205 Mode :character
## ITA : 10715 Mean : 68249
## GER : 9830 3rd Qu.:102097
## CAN : 9733 Max. :135571
## (Other):196971
## Age Height Weight Team
## Min. :10.00 Min. :127.0 Min. : 25.0 United States: 17847
## 1st Qu.:21.00 1st Qu.:168.0 1st Qu.: 60.0 France : 11988
## Median :24.00 Median :175.0 Median : 70.0 Great Britain: 11404
## Mean :25.56 Mean :175.3 Mean : 70.7 Italy : 10260
## 3rd Qu.:28.00 3rd Qu.:183.0 3rd Qu.: 79.0 Germany : 9326
## Max. :97.00 Max. :226.0 Max. :214.0 Canada : 9279
## NA's :9474 NA's :60171 NA's :62875 (Other) :201012
## Games Year Season City
## 2000 Summer: 13821 Min. :1896 Summer:222552 London : 22426
## 1996 Summer: 13780 1st Qu.:1960 Winter: 48564 Athina : 15556
## 2016 Summer: 13688 Median :1988 Sydney : 13821
## 2008 Summer: 13602 Mean :1978 Atlanta : 13780
## 2004 Summer: 13443 3rd Qu.:2002 Rio de Janeiro: 13688
## 1992 Summer: 12977 Max. :2016 Beijing : 13602
## (Other) :189805 (Other) :178243
## Sport Event
## Athletics : 38624 Football Men's Football : 5733
## Gymnastics: 26707 Ice Hockey Men's Ice Hockey : 4762
## Swimming : 23195 Hockey Men's Hockey : 3958
## Shooting : 11448 Water Polo Men's Water Polo : 3358
## Cycling : 10859 Basketball Men's Basketball : 3280
## Fencing : 10735 Cycling Men's Road Race, Individual: 2947
## (Other) :149548 (Other) :247078
## Medal region notes
## Bronze: 13295 USA : 18853 Length:271116
## Gold : 13372 Germany: 15883 Class :character
## Silver: 13116 France : 12758 Mode :character
## NA's :231333 UK : 12256
## Russia : 11692
## (Other):199304
## NA's : 370
From summary code, it generates simple aggregation in all variable in each value depends on data type and to peek missing value in variable. Conclusion from the summary are: 1. From variable ID, amount athlete that participate olympic is 135.571 2. From variable Sex, amount Male (196.594) dominate female athlete (74.522) 3. From variable Age, avarage age athlete olympic all the time is 25.56 and have 9.479 missing values. 4. From variable Height, avarage height athlete olympic all the time is 175.3 and have 60.171 missing values. 5. From variable Weight, avarage weight athlete olympic all the time is 70.7 and have 62.875 missing values. 6. From variable Teams, the top three amount athlete by team’s name are United States, France, and Great Britain. 7. From variable Games, the most participant athlete when 2000 Summer Olympic with amount 13.821 athletes. 8. From variable Year, the first olympic held on 1896 and last on 2016. 9. From variable Season, the most season helds olympic is during summer. 10.From variable Sport, the most participate sport is athletics with amount 38.624 athletes. 11. From variable Event, the most participate event is Men’s Footbal with amount 5.733 athletes. 12. From variable Medal, the NA’s value define the athlete who not earn medal. 13. From variable region, the top three participate athlete by region are USA, Germany, and France. 14. From notes, don’t have any info.
Inspection missing value focus on Age, Height, Weight, and region.
Why it’s matter? As we know from business question, to calculate BMI we need height, weight, age, and for the output expected we need to compare BMI based on country.
#Load/install library package to filter data
library(dplyr)
#Identifying NA
colSums(is.na(olympic))
## NOC ID Name Sex Age Height Weight Team Games Year Season
## 0 0 0 0 9474 60171 62875 0 0 0 0
## City Sport Event Medal region notes
## 0 0 0 231333 370 349
Delete all row containt NA in variable Age, Height, Weight, and Region
#Identifying amount row where's NA in column Age, Height, Weight, Region
#Delete row containt NA based on Age, Height, Weight, Region
olympic <- olympic[!is.na(olympic$Age),]
olympic <- olympic[!is.na(olympic$Height),]
olympic <- olympic[!is.na(olympic$Weight),]
olympic <- olympic[!is.na(olympic$region),]
colSums(is.na(olympic))
## NOC ID Name Sex Age Height Weight Team Games Year Season
## 0 0 0 0 0 0 0 0 0 0 0
## City Sport Event Medal region notes
## 0 0 0 175723 0 0
Replace missing value on Medal variable to No Medal
#To replace with string, the data type column should be in character type
no_medal <- olympic$Medal
no_medal <- as.character(no_medal)
no_medal[is.na(no_medal)] <- "No_Medal"
no_medal <- as.factor(no_medal)
olympic$Medal <- no_medal
colSums(is.na(olympic))
## NOC ID Name Sex Age Height Weight Team Games Year Season
## 0 0 0 0 0 0 0 0 0 0 0
## City Sport Event Medal region notes
## 0 0 0 0 0 0
Remove notes column
olympic$notes <- NULL
Summary all data once again for final check
summary(olympic)
## NOC ID Name Sex
## USA : 14214 Min. : 1 Length:205895 F: 66592
## FRA : 7977 1st Qu.: 35178 Class :character M:139303
## CAN : 7965 Median : 68609 Mode :character
## GBR : 7766 Mean : 68600
## ITA : 7697 3rd Qu.:102290
## JPN : 7487 Max. :135571
## (Other):152789
## Age Height Weight Team
## Min. :11.00 Min. :127.0 Min. : 25.0 United States: 13714
## 1st Qu.:21.00 1st Qu.:168.0 1st Qu.: 60.0 France : 7807
## Median :24.00 Median :175.0 Median : 70.0 Canada : 7668
## Mean :25.06 Mean :175.4 Mean : 70.7 Great Britain: 7499
## 3rd Qu.:28.00 3rd Qu.:183.0 3rd Qu.: 79.0 Italy : 7433
## Max. :71.00 Max. :226.0 Max. :214.0 Japan : 7339
## (Other) :154435
## Games Year Season City
## 2000 Summer: 13659 Min. :1896 Summer:166436 London : 13766
## 2016 Summer: 13398 1st Qu.:1976 Winter: 39459 Sydney : 13659
## 2004 Summer: 13377 Median :1992 Athina : 13618
## 2008 Summer: 13365 Mean :1990 Rio de Janeiro: 13398
## 2012 Summer: 12486 3rd Qu.:2006 Beijing : 13365
## 1996 Summer: 11816 Max. :2016 Atlanta : 11816
## (Other) :127794 (Other) :126273
## Sport Event
## Athletics : 32348 Ice Hockey Men's Ice Hockey : 3825
## Swimming : 18658 Football Men's Football : 3459
## Gymnastics : 18267 Hockey Men's Hockey : 2874
## Rowing : 7789 Basketball Men's Basketball : 2461
## Cycling : 7775 Water Polo Men's Water Polo : 2231
## Cross Country Skiing: 7529 Cycling Men's Road Race, Individual: 2099
## (Other) :113529 (Other) :188946
## Medal region
## Bronze : 10144 USA : 14214
## Gold : 10166 Germany: 13183
## No_Medal:175723 Russia : 10398
## Silver : 9862 France : 7977
## Canada : 7966
## UK : 7766
## (Other):144391
nrow(olympic)
## [1] 205895
As the output expected there is two columns that should be add and show, first column is Body Mass Index (BMI). The formula to calculate BMI :
knitr::include_graphics("BMI formula.png")
Add and calculate BMI
#remember height in m not cm
olympic <- olympic %>%
mutate(BMI = Weight/(Height/100)^2) %>%
mutate_at(vars(BMI), funs(round(., 2)))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
head(olympic)
Second column is for to categorize the BMI based on BMI chart.
olympic <- olympic %>%
mutate(BMI_category = case_when(BMI < 18.5 ~ "Underweight",
BMI <= 24.9 ~ "Normal",
BMI <= 29.9 ~ "Overweight",
BMI <= 39.9 ~ "Obese",
BMI >39.9 ~ "Morbidly Obese"))
olympic[,c("BMI","BMI_category"),]
Transform value in variable medal to column and give total of medal each athlete
library(tidyr)
olympic <- olympic %>%
pivot_wider(
names_from = Medal,
values_from = Medal,
values_fn = length,
values_fill = 0) %>%
mutate(Total_Medal = Bronze + Silver + Gold)
olympic$No_Medal <- NULL
olympic
This step prevent anomaly data before we processing, to do so we will aggregate and subset. As our business question we will focus on the BMI & medal
Check outlier
Groupby by ID because the unique keys
#olympic_check_outlier
library(ggplot2)
olympic %>%
filter(region == "USA", Sport == "Athletics", Sex == "M") %>%
group_by(Year, ID) %>%
summarise(Bronze = sum(Bronze),
Silver = sum(Silver),
Gold = sum(Gold),
All_Medal = sum(Total_Medal),
BMI = mean(BMI)) %>%
mutate(BMI_cat = case_when(BMI < 18.5 ~ "Underweight",
BMI <= 24.9 ~ "Normal",
BMI <= 29.9 ~ "Overweight",
BMI <= 39.9 ~ "Obese",
BMI >39.9 ~ "Morbidly Obese")) %>%
ggplot(mapping = aes(x = BMI , y = as.factor(Year))) +
geom_jitter(aes(col = BMI_cat)) +
geom_boxplot(fill = "Gray", alpha = 0.5) +
labs(title = "Olympic Athlete BMI by Year",
subtitle = "Male Athlete",
y = "Year", x ="BMI",
col = "BMI Category")
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
unique(olympic$Sport[olympic$region == "Indonesia"])
## [1] Athletics Shooting Weightlifting Badminton Sailing
## [6] Boxing Table Tennis Archery Cycling Fencing
## [11] Tennis Swimming Rowing Taekwondo Diving
## [16] Wrestling Canoeing Judo
## 66 Levels: Aeronautics Alpine Skiing Alpinism Archery ... Wrestling
unique(olympic$Event[olympic$Sport == "Athletics" & olympic$region == "Indonesia"])
## [1] Athletics Men's 4 x 100 metres Relay Athletics Men's 100 metres
## [3] Athletics Men's 5,000 metres Athletics Men's 10,000 metres
## [5] Athletics Women's 200 metres Athletics Women's Marathon
## [7] Athletics Women's 100 metres Athletics Men's 400 metres
## [9] Athletics Men's 200 metres Athletics Women's 5,000 metres
## [11] Athletics Women's 100 metres Hurdles Athletics Women's Long Jump
## [13] Athletics Men's 110 metres Hurdles Athletics Women's 400 metres
## [15] Athletics Men's Marathon
## 765 Levels: Aeronautics Mixed Aeronautics ... Wrestling Women's Middleweight, Freestyle
olympic_boxplot <- olympic %>%
#filter(region == input$Select_region, Sport == input$Select_Sport, Event == input$Select_Event) %>%
group_by(Year, ID) %>%
summarise(Bronze = sum(Bronze),
Silver = sum(Silver),
Gold = sum(Gold),
All_Medal = sum(Total_Medal),
BMI = mean(BMI)) %>%
mutate(BMI_cat = case_when(BMI < 18.5 ~ "Underweight",
BMI <= 24.9 ~ "Normal",
BMI <= 29.9 ~ "Overweight",
BMI <= 39.9 ~ "Obese",
BMI >39.9 ~ "Morbidly Obese"))
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
olympic_boxplot <- ggplot(olympic_boxplot, mapping = aes(y = BMI , x = Year)) +
geom_jitter(aes(col = BMI_cat)) +
geom_boxplot(fill = "Gray", alpha = 0.5) +
labs(title = "Olympic Athlete BMI by Year",
subtitle = "Male Athlete",
x = "Year", y ="BMI",
col = "BMI Category")
ggplotly(olympic_boxplot)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?