Nowadays, Obesity is one of the most prominent health-related issues faced by the people across globe. Due to this very reason, it is very crucial to analyze the issue deeply. The dataset used is about the obesity levels of people from 3 specific countries, namely, Mexico, Pery and Colombia. Using the “ggplot” library primarily, we try to visualize and get some insights of the under-lying patterns for the people of these countries.
library(ggplot2)
library(ggmosaic)
library(gridExtra)
library(pillar)
library(DT)
library(plotly)
library(dplyr)
data <- read.csv("ObesityDataSet.csv")
glimpse(data)
## Rows: 2,111
## Columns: 17
## $ Gender <chr> "Female", "Female", "Male", "Male", "Ma~
## $ Age <dbl> 21, 21, 23, 27, 22, 29, 23, 22, 24, 22,~
## $ Height <dbl> 1.62, 1.52, 1.80, 1.80, 1.78, 1.62, 1.5~
## $ Weight <dbl> 64.0, 56.0, 77.0, 87.0, 89.8, 53.0, 55.~
## $ family_history_with_overweight <chr> "yes", "yes", "yes", "no", "no", "no", ~
## $ FAVC <chr> "no", "no", "no", "no", "no", "yes", "y~
## $ FCVC <dbl> 2, 3, 2, 3, 2, 2, 3, 2, 3, 2, 3, 2, 3, ~
## $ NCP <dbl> 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, ~
## $ CAEC <chr> "Sometimes", "Sometimes", "Sometimes", ~
## $ SMOKE <chr> "no", "yes", "no", "no", "no", "no", "n~
## $ CH2O <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, ~
## $ SCC <chr> "no", "yes", "no", "no", "no", "no", "n~
## $ FAF <dbl> 0, 3, 2, 2, 0, 0, 1, 3, 1, 1, 2, 2, 2, ~
## $ TUE <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 2, 1, 0, ~
## $ CALC <chr> "no", "Sometimes", "Frequently", "Frequ~
## $ MTRANS <chr> "Public_Transportation", "Public_Transp~
## $ NObeyesdad <chr> "Normal_Weight", "Normal_Weight", "Norm~
The dataset is open-source and available on the UCI repository and is about the obesity levels in individuals from the countries of Mexico, Peru and Colombia, based on their eating habits and physical condition. The data contains 17 attributes and 2111 records.
The dataset can be found here - https://archive.ics.uci.edu/ml/datasets/Estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition+#
NObeyesdad (Obesity Level)- that allows classification of the data using the values of Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III.
FAVC - Frequent consumption of high caloric food
FCVC- Frequency of consumption of vegetables
NCP - Number of main meals
CAEC- Consumption of food between meals
CH20 - Consumption of water daily
CALC - Consumption of alcohol
data %>%
datatable(class="cell-border" ,
caption='Interactive table view of the dataset',
rowname = F,
filter = 'top',
options = list(pageLength = 20, autoWidth = TRUE))
print(paste0("Total missing values in the dataset : ", sum(is.na(data))))
## [1] "Total missing values in the dataset : 0"
As it can be seen, the dataset has no missing values which means it is good to be analyzed further.
fig1 <- ggplot(data, aes(x = NObeyesdad, fill = NObeyesdad)) + geom_bar(stat = "count") + ggtitle("Distribution across different obesity levels") + xlab("Obesity level") + ylab("Number of records") + theme_bw() + geom_text(aes(label = ..count..), stat = 'count', vjust = -0.4) + labs(fill = "Obesity level") + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
fig1
Interestingly, the dataset contains information about all the obesity levels quite uniformly. Here the most striking thing is that the people belonging to the Normal Weight category only constitutes about 13% of the whole sample. This clearly states that there is an urgent need for us the humans in finding some solutions to stay healthy.
fig2 <- ggplot(data, aes(x = NObeyesdad, fill = Gender)) + geom_bar(position = position_dodge(), stat = "count") + theme_bw() + geom_text(aes(label = ..count..), stat = 'count', vjust = -0.4, position = position_dodge(0.9)) + xlab("Obesity level") + ggtitle("Gender wise distribution") + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
fig2
The above plot clearly suggests that most number of males belong to the “Obesity type 2” category, whereas more females belong to the maximum level of Obesity. Interestingly, the difference in numbers is not considerable in the “Normal Weight” category.
fig3 <- ggplot(data, aes(y = Height, color = Gender)) + geom_boxplot() +
ggtitle("Gender wise distribution of Height") + theme_bw()
fig4 <- ggplot(data, aes(y = Weight, color = Gender)) + geom_boxplot() +
ggtitle("Gender wise distribution of Weight") + theme_bw()
grid.arrange(fig3, fig4, ncol = 2)
The box plots show the distribution of Weight and Height features gender wise. Looking at the plots, it becomes clear that the median height of females in the sample is considerably lower than that of males, with a couple of males as tall as 1.98 meters(outliers). However, the difference is not that huge in case of their weights. Here again, one male is considered as outlier who has a weight of more than 165 Kilograms.
fig5 <- ggplot(data, aes(x=Age)) +
geom_histogram(aes(y=..density..), colour="black", fill="white", bins = 35)+
geom_density(alpha=0.5, fill="blue") + theme_classic() + ylab("Density") +
ggtitle("Age distribution")
fig5
To show the distribution of the Age of the participants of the dataset, a density plot has been used. As it can be seen, major chunk of the sample belongs to the very youth category as a huge proportion lies between the age of 18 and 27 years.
fig6 <- ggplot(data) +
geom_mosaic(aes(x = product(family_history_with_overweight, NObeyesdad),
fill = family_history_with_overweight)) +
xlab("Obesity level") + ylab("Family history with overweight") +
labs(fill = "Family history with overweight?") +
ggtitle("Does family history with overweight leads to obesity level?") + theme(axis.text.x = element_text(angle = 25, vjust = 1, hjust=1))
fig6
Most of the people who were either Overweight or Obese had family history with Overweight, as suggested by the Mosaic plot. Now, this is a very crucial information which suggests that there is some relation between the family history and the current physical condition of a person.
fig7 <- ggplot(data) +
geom_bar(aes(y = MTRANS,
fill = Gender),
position = position_dodge()) + ylab("Mode of Transport") +
scale_color_manual(values = c("blue", "red"), aesthetics = "fill") +
ggtitle("Transport preferences by Gender") + theme_mosaic()
fig7
As seen previously, many people suffer from Overweight is an issue. Now, the above chart might provide an insight on what could be the reason behind the ever-increasing issue. As it can be seen, very few people prefer “Walking” or riding a bike for travelling. Most of the people prefer Public transportation or Automobile modes of transportation.
data$bmi <- data$Weight/(data$Height)**2
data$tooltip <- paste(
"Age: ", data$Age,"<br>",
"BMI: ", data$bmi, "<br>",
"Gender: ", data$Gender,
sep="") %>%
sapply(htmltools::HTML)
fig8 <- plot_ly(data,x = ~Age, y = ~bmi, color = ~Gender,
colors = c("red", "green"), text = ~tooltip, hoverinfo = "text",
type = 'scatter', mode = 'markers')%>%
layout(title = 'Age vs BMI', xaxis = list(title = 'Age'),
yaxis = list(title = 'Body mass index(BMI)'),
legend = list(title=list(text='<b> Gender </b>')))
fig8
To analyze the Age vs BMI relation, an Interactive scatter plot is used. Moreover, the scatter plot is categorized Gender wise to see the trend among male and female participants. The plot clearly suggests that many young females have a Body Mass Index of more than 40. This means it is an issue affecting young females a lot.
data$TUE <- round(data$TUE, digits = 0)
data$FAF <- round(data$FAF, digits = 0)
data$CH2O <- round(data$CH2O, digits = 0)
data$FCVC <- round(data$FCVC, digits = 0)
data$NCP <- round(data$NCP, digits = 0)
data$techuse <- ifelse(data$TUE == 0, "0-2 hours",
ifelse(data$TUE == 1, "3-5 hours",
"More than 5 hours"))
data$water <- ifelse(data$CH2O == 1, "Less than 1 litre",
ifelse(data$CH2O == 2, "Between 1 and 2 litres",
"More than 2 litres"))
data$physical <- ifelse(data$FAF== 0,'No',
ifelse(data$FAF == 1,'1 or 2 days',
ifelse(data$FAF==2,'2 to 4 days',
'4 or 5 days')))
data$freq_veg <- ifelse(data$FCVC==1,'Never',
ifelse(data$FCVC==2,'Sometimes',
'Always'))
data$nmeals <- ifelse(data$NCP==1,'One or two',
ifelse(data$NCP==2,'Three or four',
ifelse(data$NCP==3,'Four or five','More than five')))
fig9 <- ggplot(data, aes(x = NObeyesdad, fill = factor(techuse))) + geom_bar() +
facet_wrap(~Gender) + theme(axis.text.x = element_text(angle = 45, vjust = 1,
hjust=1)) +
xlab("Obesity level") +
ggtitle("Genderwise effects of using technological devices on the obesity levels") +
labs(fill = "Time spent")
fig9
Looking at the plot, one cannot find a straight away linkage between how does the time using such gadgets affect the physical being of a person. However, there are many other useful insights that one can gather. For instance, females with Obesity level 3 tend to spend more time on using gadgets than their counterparts in any level of the physical condition. It might be because of the fact that traditionally, women spend more time at home than men do.
fig10 <- ggplot(data, aes(x = NObeyesdad, fill = factor(physical))) + geom_bar() +
xlab("Obesity level") + ggtitle("Physical activity frequency vs Obesity level")+
labs(fill = "Frequency of physical activity") +
theme(axis.text.x = element_text(angle = 15, vjust = 1, hjust=1))
fig11 <- ggplot(data, aes(x = NObeyesdad, fill = factor(MTRANS))) + geom_bar() +
xlab("Obesity level") + ggtitle("Mode of Transportation vs Obesity level")+
labs(fill = "Mode of Transport") +
theme(axis.text.x = element_text(angle = 15, vjust = 1, hjust=1))
grid.arrange(fig10, fig11, nrow=2)
The first chart clearly suggests that the people who were not involved in any kind of physical activity suffered more in terms of bad physical condition. Whereas, the situation gets better with the increase in physical activity as the people who did some kind of physical activity 4-5 days a week suffered the least.
The bar chart depicting the relation between the mode of transport and the obesity level suggests that very few people preferred the “Walking” or “Bike” as the mode of travelling. But the positive out of this is that, almost all the people who used Bike or Walked to move around were all in the Normal Weight category.
data$tooltip <- paste(
"CALC: ", data$CALC,"<br>",
"Gender: ", data$Gender,
sep="") %>%
sapply(htmltools::HTML)
fig12 <- plot_ly(data, x = ~CALC, color= ~Gender,
colors= c('red', 'green'), text = ~tooltip, hoverinfo = "text")%>%
layout(title = 'Genderwise frequency of alcohol consumption',
xaxis = list(title = 'Frequency of Alcohol consumption'),
yaxis = list(title = 'Count'),
legend = list(title=list(text='<b> Gender </b>')))
fig12
The interactive bar plot clearly shows that most of the population irrespective of the gender consume Alcohol sometimes. While there was a very small proportion of the sample who always consumed Alcohol, there was a significant number of people who never drinks Alcohol.
data2 <-data[data$Age >20 & data$bmi > 18.5 & data$bmi < 24.9,]
fig13 <- ggplot(data2, aes(x = bmi, fill = Gender))+geom_histogram(color = 'black')+
scale_color_manual(values = c('red', 'blue')) +
ggtitle("Normal BMI distribution for adults") + theme_bw()
fig13
As per the graph, it can be observed that there are more adult females than males of age more than 20 years that belong to the Normal BMI range. Furthermore, even though this chart represents the Normal BMI range from 18.5 to 24.9, major proportion of the sample lies in the second half of the range suggesting slightly higher weight than the median of the range.