Introduction
Obesity is one of the most prominent health issues worldwide, which makes it important to analyze closely. This dataset contains data for estimating obesity levels in individuals from Mexico, Peru and Colombia, based on their eating habits and physical condition. Using primarily the “ggplot2” library, we visualize the data and look for underlying patterns among people in these countries.
About the Dataset
The data contains 17 attributes and 2111 records. The records are labeled with the class variable NObesity (Obesity Level), which allows classification of the data using the values Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III. 77% of the data was generated synthetically using the Weka tool and the SMOTE filter; the remaining 23% was collected directly from users through a web platform.
The dataset can be found here : https://archive-beta.ics.uci.edu/ml/datasets/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition
- Columns Description :
- Gender : Female/Male
- Age : Numeric value
- Height : Numeric value in meters
- Weight : Numeric value in kilograms
- Family History With Overweight : yes/no
- FAVC : Frequent consumption of high caloric food
- FCVC : Frequency of consumption of vegetables
- NCP : Number of main meals
- CAEC : Consumption of food between meals
- SMOKE : Smoking (yes/no)
- CH2O : Consumption of water daily
- CALC : Consumption of alcohol
- SCC : Calories consumption monitoring
- FAF : Physical activity frequency
- TUE : Time using technology devices
- MTRANS : Transportation used
Data Input and Inspection
Importing the required packages and libraries
#Library List
library(tidyverse)
library(ggpubr)
library(scales)
library(glue)
library(plotly)
library(ggplot2)
library(stringr)
library(GGally)
library(dplyr)
library(viridis)
library(rmdformats)
Data Input:
Read “obesity.csv” as obesity, rename the columns, and round the numeric survey answers to whole numbers
obesity <- read.csv("obesity.csv", stringsAsFactors = T)
names(obesity) <- c("Gender", "Age", "Height", "Weight", "Family_History_with_Overweight",
"Frequent_consumption_of_high_caloric_food", "Frequency_of_consumption_of_vegetables", "Number_of_main_meals", "Consumption_of_food_between_meals", "Smoke", "Consumption_of_water_daily", "Calories_consumption_monitoring", "Physical_activity_frequency", "Time_using_technology_devices",
"Consumption_of_alcohol", "Transportation_used", "Obesity")
# across() replaces the deprecated mutate_at()/funs() pair (requires dplyr >= 1.0.0)
obesity <- obesity %>% mutate(across(c(Frequency_of_consumption_of_vegetables, Number_of_main_meals, Consumption_of_water_daily, Physical_activity_frequency, Time_using_technology_devices), ~ round(.x, 0)))
str(obesity)
## 'data.frame': 2111 obs. of 17 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
## $ Age : num 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
## $ Family_History_with_Overweight : Factor w/ 2 levels "no","yes": 2 2 2 1 1 1 2 1 2 2 ...
## $ Frequent_consumption_of_high_caloric_food: Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 2 2 ...
## $ Frequency_of_consumption_of_vegetables : num 2 3 2 3 2 2 3 2 3 2 ...
## $ Number_of_main_meals : num 3 3 3 3 1 3 3 3 3 3 ...
## $ Consumption_of_food_between_meals : Factor w/ 4 levels "Always","Frequently",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Smoke : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ Consumption_of_water_daily : num 2 3 2 2 2 2 2 2 2 2 ...
## $ Calories_consumption_monitoring : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ Physical_activity_frequency : num 0 3 2 2 0 0 1 3 1 1 ...
## $ Time_using_technology_devices : num 1 0 1 0 0 0 0 0 1 1 ...
## $ Consumption_of_alcohol : Factor w/ 4 levels "Always","Frequently",..: 3 4 2 2 4 4 4 4 2 3 ...
## $ Transportation_used : Factor w/ 5 levels "Automobile","Bike",..: 4 4 4 5 4 1 3 4 4 4 ...
## $ Obesity : Factor w/ 7 levels "Insufficient_Weight",..: 2 2 2 6 7 2 2 2 2 2 ...
Save the edited data in .csv format and load it back as “obesity_new”. The round trip turns the rounded columns from num into int
write_csv(obesity,"obesity_new.csv" )
obesity_new <- read.csv("obesity_new.csv", stringsAsFactors = T)
str(obesity_new)
## 'data.frame': 2111 obs. of 17 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
## $ Age : num 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
## $ Family_History_with_Overweight : Factor w/ 2 levels "no","yes": 2 2 2 1 1 1 2 1 2 2 ...
## $ Frequent_consumption_of_high_caloric_food: Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 2 2 ...
## $ Frequency_of_consumption_of_vegetables : int 2 3 2 3 2 2 3 2 3 2 ...
## $ Number_of_main_meals : int 3 3 3 3 1 3 3 3 3 3 ...
## $ Consumption_of_food_between_meals : Factor w/ 4 levels "Always","Frequently",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Smoke : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ Consumption_of_water_daily : int 2 3 2 2 2 2 2 2 2 2 ...
## $ Calories_consumption_monitoring : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ Physical_activity_frequency : int 0 3 2 2 0 0 1 3 1 1 ...
## $ Time_using_technology_devices : int 1 0 1 0 0 0 0 0 1 1 ...
## $ Consumption_of_alcohol : Factor w/ 4 levels "Always","Frequently",..: 3 4 2 2 4 4 4 4 2 3 ...
## $ Transportation_used : Factor w/ 5 levels "Automobile","Bike",..: 4 4 4 5 4 1 3 4 4 4 ...
## $ Obesity : Factor w/ 7 levels "Insufficient_Weight",..: 2 2 2 6 7 2 2 2 2 2 ...
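The CSV round trip above is what changes the five rounded columns from num to int. As a minimal sketch of an alternative, assuming the same result is wanted without writing a file, the rounding and the integer conversion can be done in one step. The toy data frame and its column names (meals, water) are made up for illustration and are not part of the dataset.

```r
library(dplyr)

# Hypothetical toy data standing in for the rounded survey columns
toy <- data.frame(meals = c(2.6, 3.0, 1.2),
                  water = c(1.9, 2.4, 3.0))

# Round to the nearest whole number, then store the result as integer
toy_int <- toy %>%
  mutate(across(c(meals, water), ~ as.integer(round(.x, 0))))

str(toy_int)
```

Whether this is preferable depends on whether the intermediate file "obesity_new.csv" is needed for anything else.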
Data Inspection :
Change selected columns to factor format
obesity_new <- obesity_new %>% mutate(Frequency_of_consumption_of_vegetables = as.factor(Frequency_of_consumption_of_vegetables),
Number_of_main_meals = as.factor(Number_of_main_meals),
Consumption_of_water_daily = as.factor(Consumption_of_water_daily),
Physical_activity_frequency = as.factor(Physical_activity_frequency),
Time_using_technology_devices = as.factor(Time_using_technology_devices))
List the old factor levels and create the new level names
old_1 <- c("1", "2", "3")
old_2 <- c("1", "2", "3", "4")
old_3 <- c("1", "2", "3")
old_4 <- c("0", "1", "2", "3")
old_5 <- c("0", "1", "2")
new_1 <- c("Never", "Sometimes", "Always")
new_2 <- c("1", "2", "3", "3+")
new_3 <- c("Less than a liter", "Between 1 and 2 L", "More than 2 L")
new_4 <- c("I do not have", "1 - 2 times", "3 - 4 times", "More than 4 times")
new_5 <- c("0–2 hours", "3–5 hours", "More than 5 hours")
Assign the new level names to the data frame
obesity_new$Frequency_of_consumption_of_vegetables <- do.call(
fct_recode,
c(list(obesity_new$Frequency_of_consumption_of_vegetables), setNames(old_1, new_1)))
obesity_new$Number_of_main_meals <- do.call(
fct_recode,
c(list(obesity_new$Number_of_main_meals), setNames(old_2, new_2)))
obesity_new$Consumption_of_water_daily <- do.call(
fct_recode,
c(list(obesity_new$Consumption_of_water_daily), setNames(old_3, new_3)))
obesity_new$Physical_activity_frequency <- do.call(
fct_recode,
c(list(obesity_new$Physical_activity_frequency), setNames(old_4, new_4)))
obesity_new$Time_using_technology_devices <- do.call(
fct_recode,
c(list(obesity_new$Time_using_technology_devices), setNames(old_5, new_5)))
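The do.call() pattern above passes fct_recode a named vector of the form new = old. A minimal sketch of the same relabeling with base R's factor(), assuming every old code has exactly one new label; the vector veg is hypothetical toy data, not taken from the dataset.

```r
# Hypothetical toy answers coded 1-3, as in the vegetable-frequency column
veg <- c(2, 3, 1, 2, 3)

# levels = old codes, labels = new names, matched by position
veg_f <- factor(veg,
                levels = c(1, 2, 3),
                labels = c("Never", "Sometimes", "Always"))

table(veg_f)
```

One difference worth noting: factor() keeps a level for every listed code even when that code does not occur in the data.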
Check the data for missing values
colSums(is.na(obesity_new))
## Gender
## 0
## Age
## 0
## Height
## 0
## Weight
## 0
## Family_History_with_Overweight
## 0
## Frequent_consumption_of_high_caloric_food
## 0
## Frequency_of_consumption_of_vegetables
## 0
## Number_of_main_meals
## 0
## Consumption_of_food_between_meals
## 0
## Smoke
## 0
## Consumption_of_water_daily
## 0
## Calories_consumption_monitoring
## 0
## Physical_activity_frequency
## 0
## Time_using_technology_devices
## 0
## Consumption_of_alcohol
## 0
## Transportation_used
## 0
## Obesity
## 0
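All 17 columns report zero missing values. When only a yes/no answer is needed, or the affected rows have to be located, a shorter check can be sketched as below. The toy data frame is hypothetical and contains one NA on purpose.

```r
# Hypothetical toy data with one missing weight
toy <- data.frame(age = c(21, 23, 27),
                  weight = c(64, NA, 87))

has_na <- anyNA(toy)                    # TRUE if any cell is NA
na_per_col <- colSums(is.na(toy))       # per-column counts, as in the check above
na_rows <- toy[!complete.cases(toy), ]  # rows that contain at least one NA

has_na
na_per_col
na_rows
```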
Round Age and Weight to whole numbers, and convert Height from meters to centimeters by multiplying the values by 100
obesity_new <- obesity_new %>% mutate(across(c(Age, Weight), ~ round(.x, 0)))
obesity_new$Height <- obesity_new$Height * 100
Data Visualization :
Correlation Between Height and Weight by Obesity Type
obesity_cor <- obesity_new %>%
select(c(Obesity, Height, Weight))
plotob_cor <- ggplot(data = obesity_cor, mapping = aes(x = Height, y = Weight, col = Obesity))+
geom_point(aes(col = Obesity))+
geom_smooth(method=lm , color="black", se=FALSE, formula = y~x) +
scale_color_viridis(discrete = T, option = "C") + # points are mapped to col, so a colour scale (not fill) is needed
labs(title = "Correlation of Height and Weight",
x = "Height (cm)",
y = "Weight (Kg)"
) +
theme(legend.title = element_blank(),
plot.title = element_text(face = "bold"),
panel.background = element_rect(fill = "#ffffff"),
axis.line.y = element_line(colour = "grey"),
axis.line.x = element_line())
ggplotly(plotob_cor, tooltip = "text")
ob_corr <- cor(obesity_cor$Height, obesity_cor$Weight)
According to the data, the correlation between height and weight is moderately positive (r ≈ 0.46).
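The value reported above comes from cor(), which computes Pearson's coefficient by default. As a sketch of what that number measures, the coefficient can be reproduced by hand as the covariance divided by the product of the two standard deviations. The heights and weights below are made-up toy values, not rows from the dataset.

```r
# Hypothetical toy heights (cm) and weights (kg)
height <- c(150, 160, 165, 172, 180, 185)
weight <- c(55, 53, 70, 64, 89, 77)

# Pearson's r = cov(x, y) / (sd(x) * sd(y))
r_manual <- cov(height, weight) / (sd(height) * sd(weight))
r_builtin <- cor(height, weight)

all.equal(r_manual, r_builtin)
```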
Height and Weight Distribution based on Gender
height_weight <- obesity_new %>%
select(c(Gender, Height, Weight))
height_weight <- pivot_longer(data = height_weight,
cols = c("Height","Weight"),
names_to = "variabel")
plothw <- ggplot(data = height_weight, mapping = aes(x = Gender, y = value))+
geom_boxplot(aes(fill=Gender), position = "dodge")+
facet_wrap(vars(variabel)) + # split the plot into one panel per variable
labs(title = "Height and Weight Distribution Based on Gender",
x = "Gender",
y = "Height (cm) / Weight (Kg)"
) +
theme(legend.title = element_blank(),
plot.title = element_text(face = "bold"),
panel.background = element_rect(fill = "#ffffff"),
axis.line.y = element_line(colour = "grey"),
axis.line.x = element_line())
ggplotly(plothw, tooltip = "text")
The box plots show the distribution of height and weight by gender. The median height of females in the sample is noticeably lower than that of males, and a few males exceed 1.98 meters (198 cm on the plot), appearing as outliers. The difference in weight is less pronounced, although one individual weighing more than 165 kg appears as an outlier.
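Points drawn individually in a box plot are those lying more than 1.5 times the interquartile range beyond the quartiles, which is the default rule in geom_boxplot. A minimal sketch of that rule on made-up weights follows; the helper name flag_outliers is hypothetical.

```r
# Hypothetical helper: return the values outside the 1.5 * IQR fences
flag_outliers <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  x[x < q[1] - coef * iqr | x > q[2] + coef * iqr]
}

# Made-up weights in kg, with one extreme value
weights <- c(60, 62, 65, 67, 70, 72, 75, 165)
flag_outliers(weights)
```

Applied per gender, the same function would list the individuals that the box plots show as separate points.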
Obesity Type Distribution by Gender
obs_gender <- obesity_new %>%
select(c(Gender, Obesity)) %>%
group_by(Gender, Obesity) %>%
summarise(total = n()) %>%
mutate(label = glue("Total : {total}"))
plotog <- ggplot(data = obs_gender, aes(x = Obesity, y = total, fill = Gender, text = label))+
geom_col(position = "dodge")+
facet_wrap(vars(Gender)) + # split the plot into one panel per gender
scale_fill_viridis(discrete = T, option = "C") +
labs(title = "Obesity Type based on Gender",
x = "Obesity Type",
y = "Total"
) +
theme(legend.title = element_blank(),
axis.text.x = element_text(hjust = 1, angle = 20),
plot.title = element_text(face = "bold"),
panel.background = element_rect(fill = "#ffffff"),
axis.line.y = element_line(colour = "grey"),
axis.line.x = element_line()) +
coord_flip()
ggplotly(plotog, tooltip = "text")
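The group_by() plus summarise(total = n()) step above is a plain count per Gender and Obesity combination. As a sketch, dplyr's count() gives the same table in one call. The toy rows below are hypothetical.

```r
library(dplyr)

# Hypothetical toy rows with the same two columns
toy <- data.frame(
  Gender  = c("Female", "Female", "Male", "Male", "Male"),
  Obesity = c("Normal_Weight", "Obesity_Type_I", "Normal_Weight",
              "Normal_Weight", "Obesity_Type_I")
)

# One row per Gender/Obesity combination, with the count stored in `total`
toy_counts <- toy %>% count(Gender, Obesity, name = "total")
toy_counts
```

count() returns an ungrouped result here, so the label column can be added with mutate() exactly as before.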
BMI Distribution on Age by Gender:
obesity_new$bmi <- obesity_new$Weight / (obesity_new$Height / 100)^2 # Height is in cm here, so convert back to meters
obesity_new$bmi <- round(obesity_new$bmi, 1)
obesity_bmi <- obesity_new %>%
select(c(bmi, Age, Gender)) %>% mutate(label = glue("BMI : {bmi}
Age : {Age}
Gender : {Gender}"))
plotob_bmi <- ggplot(data = obesity_bmi, aes(x = Age, y = bmi, col = Gender, text = label))+
geom_point(alpha = 0.5) + # colour the points by Gender; a fixed col = "black" would override the mapping
labs(title = "BMI on Age Distribution",
x = "Age",
y = "BMI"
) +
theme(legend.title = element_blank(),
plot.title = element_text(face = "bold"),
panel.background = element_rect(fill = "#ffffff"),
axis.line.y = element_line(colour = "grey"),
axis.line.x = element_line())
ggplotly(plotob_bmi, tooltip = "text")
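The bmi column above is weight in kilograms divided by the square of height in meters; because Height was converted to centimeters earlier, it is divided by 100 first. A small worked sketch follows, using the first record shown in the str() output (64 kg, 1.62 m). The function name bmi_from_cm is hypothetical.

```r
# Hypothetical helper: BMI from weight in kg and height in cm, rounded to 1 decimal
bmi_from_cm <- function(weight_kg, height_cm) {
  height_m <- height_cm / 100
  round(weight_kg / height_m^2, 1)
}

# 64 / 1.62^2 = 64 / 2.6244, which rounds to 24.4
bmi_from_cm(64, 162)
```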
Normal BMI on Adults (> 18 y.o)
obesity_bmi_a <- obesity_bmi[obesity_bmi$Age > 18 & obesity_bmi$bmi > 18.5 & obesity_bmi$bmi < 24.9,]
#obesity_bmi_a %>% group_by(Gender) %>% summarise(count = n())
plotob_bmi_a <- ggplot(data = obesity_bmi_a, aes(x = bmi, fill = Gender))+
geom_histogram(position = "dodge", bins = 30) +
scale_y_continuous(breaks = seq(5,15,5), limits=c(0,15)) +
labs(title = "Normal BMI on Adults",
x = "BMI",
y = NULL,
fill = "Gender") +
theme(legend.title = element_blank(),
axis.text.x = element_text(hjust = 1),
plot.title = element_text(face = "bold"),
panel.background = element_rect(fill = "#ffffff"),
axis.line.y = element_line(colour = "grey"),
axis.line.x = element_blank(),
panel.grid = element_line(colour = "black"))
ggplotly(plotob_bmi_a, tooltip = "text")
According to the graph, more females (56) than males (45) above the age of 18 fall in the Normal BMI range. Furthermore, although the plot covers the Normal BMI range of 18.5 to 24.9, most of the sample sits in the upper half of that range, implying weights slightly above the range’s midpoint.
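The filter above keeps only BMI values strictly between 18.5 and 24.9. A sketch of labeling every BMI value instead, assuming the commonly used adult cut-offs (below 18.5 underweight, 18.5 up to 25 normal, 25 up to 30 overweight, 30 and above obese). These boundaries and labels are an assumption for illustration and are not the dataset's own Obesity classes; the bmi values are made up.

```r
# Made-up BMI values
bmi <- c(17.0, 18.5, 24.9, 27.3, 31.8)

# Assumed cut-offs; right = FALSE makes each interval [lower, upper)
bmi_class <- cut(bmi,
                 breaks = c(-Inf, 18.5, 25, 30, Inf),
                 labels = c("Underweight", "Normal", "Overweight", "Obese"),
                 right = FALSE)

table(bmi_class)
```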
Conclusion :
More than half of the population falls outside the Normal BMI range, which only 101 people over the age of 18 meet. Males and females are both well represented across the obesity types in the Gender plot, showing that the distribution is nearly equal. In terms of eating habits, individuals who eat more than three times per day, consume high-calorie foods and do not track their calorie intake are at risk of becoming obese. For the alcohol and smoking variables, individuals who do not smoke and who never or rarely consume alcohol may still be classified as obese.