Data Wrangling (Data Preprocessing)

Setup

library(editrules)
library(kableExtra)
library(magrittr)
library(readr)
library(tidyverse)
library(forecast)
library(tidytext)
library(moments)
library(ggplot2)
library(outliers)

Student names, numbers and percentage of contributions

Group information
Student name	Student number	Percentage of contribution
Lai Teng Wong	S3714421	100%

Executive Summary

The report will be looking at the Prevalence of Obesity (%) and Prevalence of Underweight (%) of adults aged 18 and above in 191 countries.

Following the tidy data principles, both data sets loaded into R were in untidy formats, the first column of each data set were renamed to Country and the data sets and were reshaped from a wide format to a long format prior to joining them into a single data set using a full join by the common variable Country.

Data pre-processing steps were carried out, such as checking the number of rows and columns prior and after joining the data sets to ensure that none of the rows and columns were missed out, separating values in the prevalence columns by “space”, selecting the variables to be focused on, checking the structure of the data set and applying appropriate data type conversions to convert character variables to factors and numerics.

The data set was scanned for missing values, special values and obvious inconsistencies prior to mutating two global rank variables for obesity prevalence and underweight prevalence respectively. Missing values were omitted from the data set. Both numeric prevalence of obesity and underweight variables were checked for normality using summary statistics and histograms prior to scanning for outliers with Tukey’s outliers detection method using boxplots as the distributions were found to be skewed. The outliers were not removed because they convey meaningful and important insights which will be discussed in this report.

Lastly, data transformation techniques such as logarithm, BoxCox and square root transformations were performed on both obesity and underweight prevalence distributions to reduce the skewness and to bring both of the distributions closer to a normal distribution. After having approximately normal distributions, z-scores method were used to detect outliers. There were no outliers detected after performing logarithm/square root transformations on underweight prevalence and BoxCox/square root transformations on obesity prevalence.

Load Data, Join Data & Tidy & Manipulate Data I (Reshaping Data)

Two sets of data: underweight.csv and obesity.csv were downloaded from the World Health Organization Data Repository. Both sets of data were filtered down to year: 2016, 2015 and 2014 and sex: both sexes prior to downloading as CSV files. Both data sets have 4 columns and 195 observations each.

In tidy data:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Both of the data sets are in untidy format. The first column for both data sets have been renamed to Country. The Country variable has no issues, as all the country names are listed within in a single column Country.

From columns 2 to 4, the column headers are values: 2016, 2015 and 2014, instead of variable names - the values are spreaded across 3 column headers instead of having its own column under a variable named Year.

Similarly, the values for obesity and underweight prevalence are spreaded across all the rows in columns 2 to 4 instead of having their own respective columns under two variables named Prevalence_of_Underweight and Prevalence_of_Obesity. Also, there are two sets of values within one cell for all the observations in Prevalence_of_Obesity and Prevalence_of_Underweight - Two variables Prevalence (%) and Range of Estimates were combined together by a “space”, for example: “16.4 [10.0-23.7]” and stored within the same cell.

After converting both of the data sets from wide format to long format and separating the Prevalence (%) values and Range of Estimates into two columns by “space”:

The data set underweight.csv consists of 4 variables: * Country: Name of the country * Year: 2014, 2015, 2016 * Prevalence_of_Underweight: Prevalence of underweight among adults, defined as the percentage of adults (Age 18 and above) with a body mass index (BMI) less than 18.5 kg/m2, based on an age-standardized estimate. * Range_1: Age-standardized estimate range for underweight prevalence

The data set obesity.csv consists of 4 variables: * Country: Name of the country * Year: 2014, 2015, 2016 * Prevalence_of_Obesity: Prevalence of obesity among adults, defined as the percentage of adults (Age 18 and above) with a body mass index (BMI) of 30 kg/m2 or higher, based on an age-standardized estimate. * Range_2: Age-standardized estimate range for obesity prevalence

We will only be looking at the first three variables in both data sets: Country, Year and Prevalence of Obesity / Prevalence of Underweight, so Range_1 and Range_2 have been removed from both data sets using the select function. The two data sets were joined together by a full_join to retain all the observations in both data sets.

After the join, the combined data set consists of Country, Year, Prevalence_of_Obesity and Prevalence_of_Underweight. The data set is now in a tidy format, as each country has its own row with underweight and obesity prevalence for each year.

setwd("C:/Users/laite/Desktop/Data Wrangling/Practical Assessment 2")
#retain existing column names, load 1st data set, rename 1st column
col_names <- suppressMessages(names(read_csv("underweight.csv", n_max = 1, show_col_types = FALSE)))
underweight <- read_csv("underweight.csv", col_names = col_names, skip = 4, show_col_types = FALSE) %>% rename(Country = `...1`) 
dim(underweight)

## [1] 195   4

underweight %>% head(5) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

Country	2016	2015	2014
Afghanistan	16.4 [10.0-23.7]	16.8 [10.3-24.0]	17.1 [10.7-24.3]
Albania	1.4 [0.6-2.6]	1.4 [0.6-2.6]	1.4 [0.6-2.7]
Algeria	3.6 [1.9-5.9]	3.7 [2.0-6.0]	3.8 [2.1-6.1]
Andorra	1.0 [0.4-2.2]	1.1 [0.4-2.2]	1.1 [0.4-2.2]
Angola	13.5 [7.2-21.1]	13.7 [7.4-21.1]	13.9 [7.7-21.4]

underweight %<>% pivot_longer(names_to = "Year", values_to = "Prevalence_of_Underweight", cols = 2:4) %>%
  separate(`Prevalence_of_Underweight`, into = c("Prevalence_of_Underweight", "Range_1"), sep = " ") 
underweight %>% head(5) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

Country	Year	Prevalence_of_Underweight	Range_1
Afghanistan	2016	16.4	[10.0-23.7]
Afghanistan	2015	16.8	[10.3-24.0]
Afghanistan	2014	17.1	[10.7-24.3]
Albania	2016	1.4	[0.6-2.6]
Albania	2015	1.4	[0.6-2.6]

underweight %<>% select(Country, Year, `Prevalence_of_Underweight`)
underweight %>% head(5) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

Country	Year	Prevalence_of_Underweight
Afghanistan	2016	16.4
Afghanistan	2015	16.8
Afghanistan	2014	17.1
Albania	2016	1.4
Albania	2015	1.4

#retain existing column names, load 2nd data set, rename 1st column
col_names1 <- suppressMessages(names(read_csv("obesity.csv", n_max = 1, show_col_types = FALSE)))
obesity <- read_csv("obesity.csv", col_names = col_names1, skip = 4, show_col_types = FALSE) %>% rename(Country = `...1`)
dim(obesity)

## [1] 195   4

obesity %>% head(5) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

Country	2016	2015	2014
Afghanistan	5.5 [3.4-8.1]	5.2 [3.3-7.7]	4.9 [3.1-7.3]
Albania	21.7 [17.0-26.7]	21.1 [16.6-26.0]	20.5 [16.2-25.1]
Algeria	27.4 [22.5-32.7]	26.7 [21.9-31.8]	26.0 [21.4-30.9]
Andorra	25.6 [20.1-31.3]	25.4 [20.1-31.0]	25.2 [20.0-30.7]
Angola	8.2 [5.1-12.2]	7.9 [4.9-11.7]	7.5 [4.7-11.2]

obesity %<>% pivot_longer(names_to = "Year", values_to = "Prevalence_of_Obesity", cols = 2:4) %>%
  separate(`Prevalence_of_Obesity`, into = c("Prevalence_of_Obesity", "Range_2"), sep = " ") 
obesity %>% head(5) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

Country	Year	Prevalence_of_Obesity	Range_2
Afghanistan	2016	5.5	[3.4-8.1]
Afghanistan	2015	5.2	[3.3-7.7]
Afghanistan	2014	4.9	[3.1-7.3]
Albania	2016	21.7	[17.0-26.7]
Albania	2015	21.1	[16.6-26.0]

obesity %<>% select(Country, Year, `Prevalence_of_Obesity`)
obesity %>% head(5) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

Country	Year	Prevalence_of_Obesity
Afghanistan	2016	5.5
Afghanistan	2015	5.2
Afghanistan	2014	4.9
Albania	2016	21.7
Albania	2015	21.1

#combine both data sets  
combined <- underweight %>% full_join(obesity, by = c("Country", "Year")) 
combined %>% head(6) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

Country	Year	Prevalence_of_Underweight	Prevalence_of_Obesity
Afghanistan	2016	16.4	5.5
Afghanistan	2015	16.8	5.2
Afghanistan	2014	17.1	4.9
Albania	2016	1.4	21.7
Albania	2015	1.4	21.1
Albania	2014	1.4	20.5

Understand the Data (Data Type Conversions)

The data set now consists of 4 variables (Country, Year, Prevalence_of_Obesity and Prevalence_of_Underweight) and 585 observations.

str(combined) #check the data types of the variables in the data set

## tibble [585 x 4] (S3: tbl_df/tbl/data.frame)
##  $ Country                  : chr [1:585] "Afghanistan" "Afghanistan" "Afghanistan" "Albania" ...
##  $ Year                     : chr [1:585] "2016" "2015" "2014" "2016" ...
##  $ Prevalence_of_Underweight: chr [1:585] "16.4" "16.8" "17.1" "1.4" ...
##  $ Prevalence_of_Obesity    : chr [1:585] "5.5" "5.2" "4.9" "21.7" ...

combined$Country %<>% as.factor()
combined$Year <- factor(combined$Year, ordered=TRUE) #factorise and order the Year variable 
combined$`Prevalence_of_Obesity` %<>% as.numeric() #convert Prevalence to numeric

## Warning in combined$Prevalence_of_Obesity %<>% as.numeric(): NAs introduced by
## coercion

combined$`Prevalence_of_Underweight` %<>% as.numeric()

## Warning in combined$Prevalence_of_Underweight %<>% as.numeric(): NAs introduced
## by coercion

str(combined) #check the data types of the variables after data type conversions

## tibble [585 x 4] (S3: tbl_df/tbl/data.frame)
##  $ Country                  : Factor w/ 195 levels "Afghanistan",..: 1 1 1 2 2 2 3 3 3 4 ...
##  $ Year                     : Ord.factor w/ 3 levels "2014"<"2015"<..: 3 2 1 3 2 1 3 2 1 3 ...
##  $ Prevalence_of_Underweight: num [1:585] 16.4 16.8 17.1 1.4 1.4 1.4 3.6 3.7 3.8 1 ...
##  $ Prevalence_of_Obesity    : num [1:585] 5.5 5.2 4.9 21.7 21.1 20.5 27.4 26.7 26 25.6 ...

The Country variable was read in as character, it consists of the name of the countries. Each country has a record of obesity and underweight prevalence for each year and has been factorized. Year should be considered as a categorical variable as it represents data collected for underweight and obesity prevalence in 2014, 2015 and 2016, and is definitely not numeric as we cannot perform any statistical analysis on the variable, so it has been factorised and ordered. Prevalence_of_Obesity and Prevalence_of_Underweight have been converted from character to numeric because it accounts for the percentage of adults aged 18 and above who are underweight or obese. After the data type conversions, we can see that there are 195 levels for Country as there are 195 countries in the data set, 3 levels for Year and 585 numeric observations for Prevalence of Underweight and Obesity respectively.

Scan I (Missing Values, Special Values, Inconsistencies) & Tidy & Manipulate Data II (Mutate)

any(is.na(combined)) #Identify are there any missing values in the data set

## [1] TRUE

which(is.na(combined)) #Identify location of missing values

##  [1] 1507 1508 1509 1618 1619 1620 1657 1658 1659 1666 1667 1668 2092 2093 2094
## [16] 2203 2204 2205 2242 2243 2244 2251 2252 2253

sum(is.na(combined)) #Total missing values

## [1] 24

combined %>% arrange(`Prevalence_of_Obesity`) %>% tail(12) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed") #Missing data for 4 countries for every year

Country	Year	Prevalence_of_Underweight	Prevalence_of_Obesity
Monaco	2016	NA	NA
Monaco	2015	NA	NA
Monaco	2014	NA	NA
San Marino	2016	NA	NA
San Marino	2015	NA	NA
San Marino	2014	NA	NA
South Sudan	2016	NA	NA
South Sudan	2015	NA	NA
South Sudan	2014	NA	NA
Sudan	2016	NA	NA
Sudan	2015	NA	NA
Sudan	2014	NA	NA

combined %<>% na.omit() #Remove rows with missing values 
any(is.na(combined)) #Missing values have been omitted

## [1] FALSE

nrow(combined) #Check number of rows remaining

## [1] 573

#define function to check for special and NAN values 
is.specialorNAN <- function(x){
if (is.numeric(x)) any((is.infinite(x) | is.nan(x)))
}
sapply(combined, is.specialorNAN)

## $Country
## NULL
## 
## $Year
## NULL
## 
## $Prevalence_of_Underweight
## [1] FALSE
## 
## $Prevalence_of_Obesity
## [1] FALSE

#check for inconsistencies
rule <- editfile("rules.txt", type = "all")
rule

## 
## Data model:
## dat1 : Year %in% c('2014', '2015', '2016') 
## 
## Edit set:
## num1 : 0 <= Prevalence_of_Underweight
## num2 : Prevalence_of_Underweight <= 100
## num3 : 0 <= Prevalence_of_Obesity
## num4 : Prevalence_of_Obesity <= 100

summary(violatedEdits(rule, combined))

## No violations detected, 0 checks evaluated to NA

## NULL

#Mutate Global Rank columns to rank prevalence of Obesity & Prevalence of Underweight from highest to lowest
combined %<>% group_by(`Year`) %>%
  mutate(`Global_Rank_Obesity_Prevalence` = rank(-`Prevalence_of_Obesity`, ties.method = "min")) %>% ungroup()
combined %<>% group_by(`Year`) %>%
  mutate(`Global_Rank_Underweight_Prevalence` = rank(-`Prevalence_of_Underweight`, ties.method = "min")) %>% ungroup()
combined %>% arrange(`Global_Rank_Obesity_Prevalence`) %>% head() %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

Country	Year	Prevalence_of_Underweight	Prevalence_of_Obesity	Global_Rank_Obesity_Prevalence	Global_Rank_Underweight_Prevalence
Nauru	2016	0.2	61.0	1	190
Nauru	2015	0.2	60.7	1	191
Nauru	2014	0.2	60.3	1	191
Cook Islands	2016	0.2	55.9	2	190
Cook Islands	2015	0.3	55.4	2	188
Cook Islands	2014	0.3	54.9	2	188

combined %>% arrange(`Global_Rank_Underweight_Prevalence`) %>% head() %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

Country	Year	Prevalence_of_Underweight	Prevalence_of_Obesity	Global_Rank_Obesity_Prevalence	Global_Rank_Underweight_Prevalence
India	2016	23.6	3.9	187	1
India	2015	24.0	3.7	187	1
India	2014	24.3	3.5	187	1
Bangladesh	2016	21.5	3.6	190	2
Bangladesh	2015	21.9	3.4	190	2
Bangladesh	2014	22.3	3.2	190	2

#Check for inconsistencies
(rule1 <- editset(c("Global_Rank_Obesity_Prevalence >= 1", "Global_Rank_Obesity_Prevalence <= 191", "Global_Rank_Underweight_Prevalence >= 1", "Prevalence_of_Underweight <= 191")))

## 
## Edit set:
## num1 : 1 <= Global_Rank_Obesity_Prevalence
## num2 : Global_Rank_Obesity_Prevalence <= 191
## num3 : 1 <= Global_Rank_Underweight_Prevalence
## num4 : Prevalence_of_Underweight <= 191

summary(violatedEdits(rule1, combined))

## No violations detected, 0 checks evaluated to NA

## NULL

Before mutating a variable from an existing variable in the data set, it is important check if there are any missing values, special values and obvious inconsistencies. When Prevalence_of_Obesity and Prevalence_of_Underweight were converted to numeric, the missing values were coerced into “NA” in the data set. any(is.na(combined)) returned TRUE, which means that there are missing values in the data set. which(is.na(combined)) returned the indices of the rows with missing values and from sum(is.na(combined)), we know that there are 24 missing values in the data set. As the missing values always appear at the end of the data set, I have arranged the data set by prevalence, and the tail(12) shows the last 12 rows of missing data. Since there are no data available for Monaco, San Marino, South Sudan and Sudan for each year, I have used na.omit() to remove these 12 rows from the data set. nrow(combined) returned 573 rows, after 12 rows have been removed from the initial combined data set which had 585 rows.

There are no special values such as -Inf/+Inf and NAN values in the data set, as the defined function is.specialorNAN returned FALSE upon checking for both Prevalence_of_Obesity and Prevalence_of_Underweight columns which are numeric.

There are no obvious inconsistencies or errors in the data set after passing the data set to the violatedEdits function with defined rules using editset. There are no violations detected because Obesity and Underweight Prevalence values fall within 0% and 100%, and there are no values other than 2014, 2015 and 2016 in the Year variable.

After checking for missing values, special values and inconsistencies, I have mutated 2 columns: Global_Rank_Obesity_Prevalence and Global_Rank_Underweight_Prevalence. These two columns show the global ranks of countries in terms of their obesity and underweight prevalence respectively. Countries with highest prevalence ranks first, and countries with lowest prevalence ranks last. It provides the audience with useful information to look up for the global rank of countries they are interested in or if they wanted to find out which country has the highest and lowest obesity and/or underweight prevalence in 2014, 2015 and 2016.

To ensure that there are only 191 ranks for 191 countries, inconsistencies check using ViolatedEdits has been carried out on Global_Rank_Obesity_Prevalence and Global_Rank_Underweight_Prevalence variables.

Scan II (Outliers)

#reshape data set 
combined_pivot <- combined %>% pivot_longer(names_to = "Prevalence Type", values_to = "Prevalence (%)", cols = 3:4)

#get summary statistics grouped by prevalence type and year
options(dplyr.summarise.inform = FALSE)
combined_pivot %>% group_by(`Prevalence Type`, `Year`) %>% summarise(Mean=mean(`Prevalence (%)`), 
                                                                     `Standard Deviation` = sd(`Prevalence (%)`), 
                                                                     Median = median(`Prevalence (%)`), 
                                                                     `First Quartile` = quantile(`Prevalence (%)`,probs=.25), 
                                                                     `Third Quartile` = quantile(`Prevalence (%)`,probs=.75), 
                                                                     `Interquartile Range` = IQR(`Prevalence (%)`), 
                                                                     Min = min(`Prevalence (%)`), 
                                                                     Max = max(`Prevalence (%)`), 
                                                                     Skewness = skewness(`Prevalence (%)`), 
                                                                     Kurtosis = kurtosis(`Prevalence (%)`)) %>% 
  ungroup() %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

Prevalence Type	Year	Mean	Standard Deviation	Median	First Quartile	Third Quartile	Interquartile Range	Min	Max	Skewness	Kurtosis
Prevalence_of_Obesity	2014	19.043979	11.225111	19.8	8.80	24.65	15.85	1.8	60.3	0.9047697	4.371811
Prevalence_of_Obesity	2015	19.508377	11.313017	20.2	9.15	25.20	16.05	2.0	60.7	0.8909103	4.325069
Prevalence_of_Obesity	2016	19.960733	11.409985	20.6	9.55	25.65	16.10	2.1	61.0	0.8684557	4.253928
Prevalence_of_Underweight	2014	5.529319	5.192310	3.1	1.50	9.30	7.80	0.2	24.3	1.1160693	3.418997
Prevalence_of_Underweight	2015	5.419372	5.100474	3.0	1.50	9.15	7.65	0.2	24.0	1.1257166	3.455797
Prevalence_of_Underweight	2016	5.314136	5.010133	3.0	1.40	9.00	7.60	0.2	23.6	1.1270406	3.456184

#filter data set into each year and store into 3 separate dataframes for plotting histograms
comb_2016 <- combined_pivot %>% filter(Year == "2016")
comb_2015 <- combined_pivot %>% filter(Year == "2015")
comb_2014 <- combined_pivot %>% filter(Year == "2014")

#plot histogram 
ggplot(comb_2014, aes(x = `Prevalence (%)`)) + geom_histogram(fill = "skyblue", colour = "black", binwidth = 2) + facet_wrap(`Prevalence Type` ~., scales = "free") + theme_bw() + labs(title = "Histogram of Prevalence of Obesity & Underweight in 2014", ylab = "Frequency", xlab = "Prevalence (%)")

ggplot(comb_2015, aes(x = `Prevalence (%)`)) + geom_histogram(fill = "pink", colour = "black", binwidth = 2) +
facet_wrap(`Prevalence Type` ~., scales = "free") + theme_bw() + labs(title = "Histogram of Prevalence of Obesity & Underweight in 2015", ylab = "Frequency", xlab = "Prevalence (%)")

ggplot(comb_2016, aes(x = `Prevalence (%)`)) + geom_histogram(fill = "orange", colour = "black", binwidth = 2) + facet_wrap(`Prevalence Type` ~., scales = "free") + theme_bw() + labs(title = "Histogram of Prevalence of Obesity & Underweight in 2016", ylab = "Frequency", xlab = "Prevalence (%)")

Before deciding on which outlier detection method to be used, we need to check whether the data is normally distributed. As the distribution for each year varies slightly, I have plotted the obesity and underweight prevalence for each year on 3 separate faceted histograms. Before that, to facilitate the plotting, Prevalence_of_Obesity and Prevalence_of_Underweight columns have been reshaped from a wide format to long format using pivot_longer to Prevalence Type and Prevalence (%).

As we can see, the distributions for 2014, 2015 and 2016 are very similar. The distributions of Prevalence_of_Underweight are skewed to the right, and this is very much in line with the summary statistics that the mean value is higher than the median across three years. A normal distribution has a skewness of 0 and a kurtosis of 3. Prevalence_of_Underweight has skewness which is greater than 1 across the years, indicating a high degree of skewness.

Prevalence_of_Obesity on the other hand appears to have a long right tail at the first glance, but visually it is hard to tell, because the distributions look bimodal, and the second peak is at around 22%. The distributions do indeed have more values concentrated on the left side of the distribution instead of the right, thus the reason why it has a high kurtosis ranging from 4.25 to 4.37. Based on the actual mean and median values from the summary statistics, the obesity prevalence distributions are left-skewed as the mean values are less than the median values across the years, but the degree of skewness is less than Prevalence_of_Underweight.

The distributions above are not normal or approximately normal, therefore Tukey’s method of outlier detection which is mainly used to test outliers in non-symmetric/non-normal data distributions will be used to detect outliers in the data by plotting 3 boxplots for obesity and underweight prevalence for 2014, 2015 and 2016.

par(mfrow = c(1,3))
box_plot1 <- boxplot(comb_2014$`Prevalence (%)`~comb_2014$`Prevalence Type`, main = "Boxplot of Prevalence (%) by Prevalence Type in 2014", ylab = "Prevalence (%)", xlab = "Prevalence Type", col = "skyblue")
box_plot1$out #2014 outliers

## [1] 54.9 51.8 60.3 54.2 50.1 22.3 24.3

box_plot2 <- boxplot(comb_2015$`Prevalence (%)`~comb_2015$`Prevalence Type`, main = "Boxplot of Prevalence (%) by Prevalence Type in 2015", ylab = "Prevalence (%)", xlab = "Prevalence Type", col = "pink")
box_plot2$out #2015 outliers

## [1] 55.4 52.4 60.7 54.8 50.9 21.9 24.0

box_plot3 <- boxplot(comb_2016$`Prevalence (%)`~comb_2016$`Prevalence Type`, main = "Boxplot of Prevalence (%) by Prevalence Type in 2016", ylab = "Prevalence (%)", xlab = "Prevalence Type", col = "orange")

box_plot3$out #2016 outliers

## [1] 55.9 52.9 61.0 50.0 55.3 51.6 21.5 23.6

upper_fence <- function(Q3, IQR){
  Q3 + (1.5*IQR)
}
upper_fence(24.65, 15.85) #2014 obesity

## [1] 48.425

comb_2014 %>% filter(`Prevalence Type` == "Prevalence_of_Obesity" & `Prevalence (%)` > 48.425) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

Country	Year	Global_Rank_Obesity_Prevalence	Global_Rank_Underweight_Prevalence	Prevalence Type	Prevalence (%)
Cook Islands	2014	2	188	Prevalence_of_Obesity	54.9
Marshall Islands	2014	4	184	Prevalence_of_Obesity	51.8
Nauru	2014	1	191	Prevalence_of_Obesity	60.3
Palau	2014	3	186	Prevalence_of_Obesity	54.2
Tuvalu	2014	5	186	Prevalence_of_Obesity	50.1

upper_fence(25.20, 16.05) #2015 obesity

## [1] 49.275

comb_2015 %>% filter(`Prevalence Type` == "Prevalence_of_Obesity" & `Prevalence (%)` > 49.275) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

Country	Year	Global_Rank_Obesity_Prevalence	Global_Rank_Underweight_Prevalence	Prevalence Type	Prevalence (%)
Cook Islands	2015	2	188	Prevalence_of_Obesity	55.4
Marshall Islands	2015	4	184	Prevalence_of_Obesity	52.4
Nauru	2015	1	191	Prevalence_of_Obesity	60.7
Palau	2015	3	186	Prevalence_of_Obesity	54.8
Tuvalu	2015	5	186	Prevalence_of_Obesity	50.9

upper_fence(25.65, 16.10) #2016 obesity

## [1] 49.8

comb_2016 %>% filter(`Prevalence Type` == "Prevalence_of_Obesity" & `Prevalence (%)` > 49.8) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

Country	Year	Global_Rank_Obesity_Prevalence	Global_Rank_Underweight_Prevalence	Prevalence Type	Prevalence (%)
Cook Islands	2016	2	190	Prevalence_of_Obesity	55.9
Marshall Islands	2016	4	184	Prevalence_of_Obesity	52.9
Nauru	2016	1	190	Prevalence_of_Obesity	61.0
Niue	2016	6	184	Prevalence_of_Obesity	50.0
Palau	2016	3	184	Prevalence_of_Obesity	55.3
Tuvalu	2016	5	184	Prevalence_of_Obesity	51.6

upper_fence(9.30, 7.80) #2014 underweight

## [1] 21

comb_2014 %>% filter(`Prevalence Type` == "Prevalence_of_Underweight" & `Prevalence (%)` > 21) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

Country	Year	Global_Rank_Obesity_Prevalence	Global_Rank_Underweight_Prevalence	Prevalence Type	Prevalence (%)
Bangladesh	2014	190	2	Prevalence_of_Underweight	22.3
India	2014	187	1	Prevalence_of_Underweight	24.3

upper_fence(9.15, 7.65) #2015 underweight

## [1] 20.625

comb_2015 %>% filter(`Prevalence Type` == "Prevalence_of_Underweight" & `Prevalence (%)` > 20.625) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

Country	Year	Global_Rank_Obesity_Prevalence	Global_Rank_Underweight_Prevalence	Prevalence Type	Prevalence (%)
Bangladesh	2015	190	2	Prevalence_of_Underweight	21.9
India	2015	187	1	Prevalence_of_Underweight	24.0

upper_fence(9.00,   7.60) #2016 underweight

## [1] 20.4

comb_2016 %>% filter(`Prevalence Type` == "Prevalence_of_Underweight" & `Prevalence (%)` > 20.4) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

Country	Year	Global_Rank_Obesity_Prevalence	Global_Rank_Underweight_Prevalence	Prevalence Type	Prevalence (%)
Bangladesh	2016	190	2	Prevalence_of_Underweight	21.5
India	2016	187	1	Prevalence_of_Underweight	23.6

Looking at the boxplots, we can further confirm that the distributions are skewed and not normal because the median line is cutting the boxplots into two unequal pieces. For Prevalence_of_Obesity, the longer part of the box is below the median, indicating that the data is skewed left; for Prevalence_of_Underweight, the longer part of the box is above the median, indicating that the data is skewed right.

According to Tukey’s method of outlier detection, outliers are the values which fall beyond the lower fences (Q1 – 1.5 * IQR) and upper fences (Q1 – 1.5 * IQR). From the boxplots for the numeric data: Prevalence_of_Obesity and Prevalence_of_Underweight, there are no outliers below the lower fences, but there are outliers above the upper fences, depicted by small circles above the upper fences. In 2014 and 2015, there are 5 outliers for obesity prevalence and 2 outliers for underweight prevalence, whereas in 2016, there are 6 outliers for obesity prevalence and 2 outliers for underweight prevalence. The upper fences for Prevalence_of_Obesity are 48.425, 49.275 and 49.8 for 2014, 2015 and 2016 respectively; and the upper fences for Prevalence_of_Underweight are 21, 20.625, 20.4 for 2014, 2015 and 2016 respectively. The value of the outliers were extracted using boxplot$out. The outliers are 54.9, 51.8, 60.3, 54.2, 50.1, 22.3, 24.3 for 2014, 55.4, 52.4, 60.7, 54.8, 50.9, 21.9, 24.0 for 2015 and 55.9, 52.9, 61.0, 50.0, 55.3, 51.6, 21.5, 23.6 for 2016. The names of the countries to have the stated “outliers” were identified using the filter function.

As we are looking at prevalence (percentages), none of the outliers are impossible values or errors as they fall within 0% to 100%. These “anomalies” give us important information about the data and therefore should not be removed. According to Lancet (2016) and Lancet (2017), countries in the Pacific Islands such as Nauru, Cook Islands, Marshall Islands, Palau, Tuvalu and Niue are indeed the most obese countries in the world and countries in the South Asia such as India and Bangladesh do indeed have the highest underweight prevalence in the world. These “anomalies” should not be removed because they convey the most important information in the entire data set. They serve the purpose of raising global awareness that these countries with high obesity and underweight prevalence are experiencing rising global health crisis which can lead to highly elevated risks of adverse health outcomes, and this piece of information calls for proper intervention and prevention.

Data Transformation

comb2016_uw <- comb_2016 %>% filter(`Prevalence Type` == "Prevalence_of_Underweight")
comb2016_ob <- comb_2016 %>% filter(`Prevalence Type` == "Prevalence_of_Obesity")
par(mfrow = c(2,3))
hist(comb2016_uw$`Prevalence (%)`, main = "Histogram of Underweight Prevalence", xlab = "Prevalence of Underweight (%)")
ln_uw <- hist(log(comb2016_uw$`Prevalence (%)`), main = "Histogram of log(Underweight Prevalence)", xlab = "log(Prevalence_of_Underweight)")
log_uw <- hist(log10(comb2016_uw$`Prevalence (%)`), main = "Histogram of ln(Underweight Prevalence)", xlab = "ln(Prevalence_of_Underweight)")
sqrt_uw <- hist(sqrt(comb2016_uw$`Prevalence (%)`), main = "Histogram of sqrt(Underweight Prevalence)", xlab = "sqrt(Prevalence_of_Underweight)")
rec_uw <- hist(1/(comb2016_uw$`Prevalence (%)`), main = "Histogram of reciprocal(Underweight Prevalence)", xlab = "Reciprocal(Prevalence_of_Underweight)")
boxcox_uw <- hist(BoxCox(comb2016_uw$`Prevalence (%)`, lambda = "auto"), main = "Histogram of BoxCox(Underweight Prevalence)", xlab = "BoxCox(Prevalence_of_Underweight)")

log10(comb2016_uw$`Prevalence (%)`) %>% summary() #log base 10

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.6990  0.1461  0.4771  0.5117  0.9542  1.3729

log(comb2016_uw$`Prevalence (%)`) %>% summary() #natural log

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.6094  0.3365  1.0986  1.1783  2.1972  3.1612

sqrt(comb2016_uw$`Prevalence (%)`) %>% summary() #square root

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4472  1.1832  1.7321  2.0580  3.0000  4.8580

BoxCox(comb2016_uw$`Prevalence (%)`, lambda = "auto") %>% summary() #Boxcox

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.8494  0.3271  1.0027  0.9908  1.8356  2.4496

z.scores_log10 <- log10(comb2016_uw$`Prevalence (%)`) %>% scores(type ="z") #log base 10 z.scores
summary(z.scores_log10)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.64369 -0.79832 -0.07555  0.00000  0.96624  1.88051

which(abs(z.scores_log10) >3)

## integer(0)

z.scores_ln <- log(comb2016_uw$`Prevalence (%)`) %>% scores(type ="z") #natural log z.scores
summary(z.scores_ln)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.64369 -0.79832 -0.07555  0.00000  0.96624  1.88051

which(abs(z.scores_ln) >3)

## integer(0)

z.scores_sqrt <- sqrt(comb2016_uw$`Prevalence (%)`) %>% scores(type ="z") #square root z.scores
summary(z.scores_sqrt)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.5467 -0.8400 -0.3130  0.0000  0.9045  2.6887

which(abs(z.scores_sqrt) >3)

## integer(0)

z.scores_boxcox <- BoxCox(comb2016_uw$`Prevalence (%)`, lambda = "auto") %>% scores(type ="z") #boxcox z.scores
summary(z.scores_boxcox)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.19064 -0.74565  0.01336  0.00000  0.94902  1.63870

which(abs(z.scores_boxcox) >3)

## [1]  40 119

Previously, we found that the distribution of Prevalence_of_Obesity and Prevalence_of_Underweight are skewed, and there were outliers detected on the distributions which were not meant to be removed/imputated/capped. In order to eliminate the existence of these outliers, data transformation is needed to decrease the skewness of the distributions and convert the distributions to normal distributions. Since the distributions across 2014, 2015 and 2016 are very similar, we will only focus on the prevalence distributions for 2016.

The first histogram on both plots is the original histogram of obesity/underweight prevalence. Since the distribution for Prevalence_of_Underweight is highly skewed to the right, I applied base 10 logarithm and the natural logarithm, both of the logarithmic distributions look quite similar and have appeared to be much more symmetrical compared to the initial distribution. The square root transformation has reduced the skewness but the distribution is less symmetrical compared to the logarithm transformations. Reciprocal transformation which usually has a drastic effect on the distribution shape had an adverse effect on the distribution as it did not improve the skewness of the original distribution. Visually, it appears as if the BoxCox transformation which is a type of power transformation to transform non-normal data into a normal distribution has transformed the original distribution to a left-skewed distribution, but based on the summary statistics, the transformed data is approximately normal, as the mean and the median values are very close to each other.

The median and the mean values for the log transformation distributions are also very close to each other, so it is safe to assume approximate normality and use the z-scores approach for outliers detection. There were no outliers detected after applying logarithmic and square root transformations. However, comparing the square root transformation to the logarithm transformations, logarithm transformations have transformed the Prevalence_of_Underweight distribution to a distribution closer to normality, so the logarithm transformations are highly favored to be selected. The BoxCox transformation on the other end resulted in two outliers detected using the z-scores method.

par(mfrow = c(2,3))
hist(comb2016_ob$`Prevalence (%)`, main = "Histogram of Obesity Prevalence", xlab = "Prevalence of Obesity (%)")
ln_ob <- hist(log(comb2016_ob$`Prevalence (%)`), main = "Histogram of log(Obesity Prevalence)", xlab = "log(Prevalence_of_Obesity)", breaks = 15)
log_ob <- hist(log10(comb2016_ob$`Prevalence (%)`), main = "Histogram of ln(Obesity Prevalence)", xlab = "ln(Prevalence_of_Obesity)", breaks = 15)
sqrt_ob <- hist(sqrt(comb2016_ob$`Prevalence (%)`), main = "Histogram of sqrt(Obesity Prevalence)", xlab = "sqrt(Prevalence_of_Obesity)", breaks = 15)
square_ob <- hist((comb2016_ob$`Prevalence (%)`^2), main = "Histogram of (Obesity Prevalence)^2", xlab = "(Prevalence_of_Obesity)^2")
boxcox_ob <- hist(BoxCox(comb2016_ob$`Prevalence (%)`, lambda = "auto"), main = "Histogram of BoxCox(Obesity Prevalence)", xlab = "BoxCox(Prevalence_of_Obesity)")

log10(comb2016_ob$`Prevalence (%)`) %>% summary() #log base 10

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3222  0.9800  1.3139  1.2168  1.4091  1.7853

log(comb2016_ob$`Prevalence (%)`) %>% summary() #natural log

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7419  2.2565  3.0253  2.8019  3.2445  4.1109

sqrt(comb2016_ob$`Prevalence (%)`) %>% summary() #square root

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.449   3.090   4.539   4.275   5.065   7.810

BoxCox(comb2016_ob$`Prevalence (%)`, lambda = "auto") %>% summary() #boxcox

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7668  2.4972  3.4682  3.2042  3.7575  4.9568

z.scores_log <- log10(comb2016_ob$`Prevalence (%)`) %>% scores(type ="z")  #log base10 z.scores
summary(z.scores_log)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.0721 -0.8133  0.3332  0.0000  0.6602  1.9523

which(abs(z.scores_log) >3)

## [1] 188

z.scores_ln1 <- log(comb2016_ob$`Prevalence (%)`) %>% scores(type ="z") #natural log z.scores
summary(z.scores_ln1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.0721 -0.8133  0.3332  0.0000  0.6602  1.9523

which(abs(z.scores_ln1) >3)

## [1] 188

z.scores_srt <- sqrt(comb2016_ob$`Prevalence (%)`) %>% scores(type ="z") #square root z.scores
summary(z.scores_srt)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.1715 -0.9104  0.2026  0.0000  0.6067  2.7165

which(abs(z.scores_srt) >3)

## integer(0)

z.scores_boxcox1 <- BoxCox(comb2016_ob$`Prevalence (%)`, lambda = "auto") %>% scores(type ="z") #boxcox z.scores
summary(z.scores_boxcox1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.8860 -0.8371  0.3126  0.0000  0.6551  2.0751

which(abs(z.scores_boxcox1) >3)

## integer(0)

As for the Prevalence_of_Obesity distribution which was initially slightly left skewed, applying a square transformation did not improve the skewness. Visually, the logarithm, square root and BoxCox transformations have improved the skewness of the original distribution and the square root and BoxCox transformations have made the distribution more symmetrical and approximately normal compared to the other transformations. Based on the summary statistics, although the base 10 logarithm transformation has transformed the original skewed distribution closest to an approximately normal distribution as the difference between the median and the mean is only 0.0971, an outlier was detected using the z-scores outlier detection method, similarly with the natural logarithm transformation. The square root and BoxCox transformation methods are favored to be selected, as no outliers were detected after the transformations.

Reference List

Lancet. (2017). Worldwide trends in body-mass index, underweight, overweight, and obesity from 1975 to 2016: a pooled analysis of 2416 population-based measurement studies in 128·9 million children, adolescents, and adults. Lancet, 390(10113), 2627–2642. https://doi.org/10.1016/S0140-6736(17)32129-3

Lancet. (2016). Trends in adult body-mass index in 200 countries from 1975 to 2014: a pooled analysis of 1698 population-based measurement studies with 19·2 million participants. Lancet, 387(10026), 1377–1396. https://doi.org/10.1016/S0140-6736(16)30054-X

Taheri, S. (2021). Module 4 Tidy and Manipulate: Tidy Data Principles and Manipulating Data [Module Webpage]. Canvas @ RMIT University, http://rare-phoenix-161610.appspot.com/secured/Module_04.html

Taheri, S. (2021). Module 5 Scan: Missing Values [Module Webpage]. Canvas @ RMIT University. http://rare-phoenix-161610.appspot.com/secured/Module_05.html

Taheri, S. (2021). Module 4 Tidy and Manipulate: Tidy Data Principles and Manipulating Data [Module Webpage]. Canvas @ RMIT University, http://rare-phoenix-161610.appspot.com/secured/Module_04.html

Taheri, S. (2021). Module 7 Transform: Data Transformation, Standardisation, and Reduction [Module Webpage]. Canvas @ RMIT University. http://rare-phoenix-161610.appspot.com/secured/Module_07.html

World Health Organization. (2017). Prevalence of obesity among adults, BMI ≥ 30, age-standardized estimates by country. https://apps.who.int/gho/data/view.main.CTRY2450A?lang=en

World Health Organization. (2017). Prevalence of underweight among adults, BMI < 18.5, age-standardized estimates by country. https://apps.who.int/gho/data/node.main.NCDBMILT18A?lang=en