library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Load the Obesity Dataset
df <- read.csv("~/Downloads/ObesityDataSet_raw_and_data_sinthetic.csv", header=TRUE)
In this dataset, 77% of the data was generated synthetically using the Weka tool and the SMOTE filter; the remaining 23% was collected directly from users through a web platform.
colnames(df)
## [1] "Gender" "Age"
## [3] "Height" "Weight"
## [5] "family_history_with_overweight" "FAVC"
## [7] "FCVC" "NCP"
## [9] "CAEC" "SMOKE"
## [11] "CH2O" "SCC"
## [13] "FAF" "TUE"
## [15] "CALC" "MTRANS"
## [17] "NObeyesdad"
| Obesity Level | Body Mass Index |
|---------------|-----------------|
| Underweight | Less than 18.5 |
| Normal | 18.5 to 24.9 |
| Overweight | 25.0 to 29.9 |
| Obesity I | 30.0 to 34.9 |
| Obesity II | 35.0 to 39.9 |
| Obesity III | Higher than 40 |
Regarding column names:
For example, FAF (physical activity frequency) is measured as the number of days per week on which a person engages in physical activity, not the number of hours they spend on it.
Another example is TUE (time using electronic devices), which is measured in hours per day.
I initially misinterpreted the classification of obesity levels as being based on a linear ratio of weight to height (w/h); however, upon reviewing the documentation, I learned that it is based on the body mass index, which divides weight by the square of height (w/h²).
Left uncorrected, this misunderstanding would have carried into my future assignments and resulted in incorrect or incomplete analyses.
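The w/h² calculation and the cut-offs from the table above can be sketched in a few lines of R. This is only an illustration: `bmi_class` is a hypothetical helper, the toy values are made up, and it assumes weight is in kilograms and height in metres. It also assumes each band includes its lower bound (so 25.0 counts as Overweight and 40.0 as Obesity III), which the table does not state explicitly.

```r
# Hypothetical helper: compute BMI = weight / height^2 and map it to the
# six bands in the table above. Assumes kilograms and metres.
bmi_class <- function(weight_kg, height_m) {
  bmi <- weight_kg / height_m^2
  cut(
    bmi,
    breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
    labels = c("Underweight", "Normal", "Overweight",
               "Obesity I", "Obesity II", "Obesity III"),
    right = FALSE  # intervals are [lower, upper): an assumption about the bounds
  )
}

# Made-up example values, not rows from the dataset
toy <- data.frame(
  Weight = c(50, 64, 80, 110),
  Height = c(1.75, 1.62, 1.70, 1.60)
)
toy$BMI <- toy$Weight / toy$Height^2
toy$bmi_level <- bmi_class(toy$Weight, toy$Height)
toy
```

On the real data, something like `table(bmi_class(df$Weight, df$Height), df$NObeyesdad)` would show how closely the recomputed bands line up with the recorded labels, keeping in mind that the label names in `NObeyesdad` may not match the table's names one-to-one.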
ggplot(data = df[, c('CAEC', 'NObeyesdad')], aes(x = NObeyesdad, fill = CAEC)) +
  geom_bar(position = 'dodge', color = 'black') +
  scale_fill_manual(values = c("white", "grey", "black", "red")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Notice the red bars:
Most of the data entries are "Sometimes", which reflects how subjective a self-reported eating frequency is (CAEC records how often a person eats between meals). For instance, a person of normal weight might describe eating between meals on 3-4 days a week as "frequently", whereas an individual classified as Obesity Type III might reserve "frequently" for 5 days and describe 3-4 days as "sometimes."
Because of this inherent measurement bias, even a thorough analysis of this variable cannot provide a complete understanding. The bias carries no significant risks or consequences for this analysis, but the subjectivity involved in measuring the CAEC variable should be acknowledged when interpreting results.
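To put a number on how dominant a single answer is, the share of each CAEC response within each obesity level can be tabulated. The sketch below is one way to do that; `caec_share` is a hypothetical helper, and the toy rows (including their level names) are made up for illustration rather than taken from the dataset.

```r
# Hypothetical helper: share of each CAEC answer within each obesity level
caec_share <- function(data) {
  counts <- table(data$NObeyesdad, data$CAEC)
  # margin = 1 normalises each row, so every obesity level sums to 1
  round(prop.table(counts, margin = 1), 2)
}

# Made-up rows for illustration only
toy <- data.frame(
  NObeyesdad = c("Normal_Weight", "Normal_Weight",
                 "Obesity_Type_III", "Obesity_Type_III"),
  CAEC = c("Sometimes", "Frequently", "Sometimes", "Sometimes")
)
shares <- caec_share(toy)
shares
```

Calling `caec_share(df)` on the loaded data would give the same kind of table for the real responses. Row-wise shares are used here because the obesity levels may differ in size, which makes the raw counts in the bar chart harder to compare.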