GANESH PATHAN, STUDENT ID : S3847177
Last updated: 16 October, 2022
Obesity is a complex disease involving an excessive amount of body fat. Obesity isn’t just a cosmetic concern.
It is a medical problem that increases the risk of other diseases and health problems.
Obesity is generally caused by eating too much and moving too little.
If you consume high amounts of fat and sugars, but do not burn off the energy through exercise and physical activity, much of the surplus energy will be stored by the body as fat.
This report analyses survey responses from more than 2,000 people aged 14 to 61 about their lifestyle, food habits and physical activity.
It also draws inferences about how BMI (Body Mass Index) helps determine the obesity type.
A further aim is to create awareness about this deadly disease and help people track their BMI.
The survey was conducted in three Latin American countries: Mexico, Peru and Colombia.
For adults, WHO defines overweight and obesity as follows:
For Children aged between 5-19 years,
Height, Weight, BMI and Obesity Type are the key variables from this data set that are considered for statistical analysis.
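To make the link between the Height, Weight and BMI variables concrete, here is a minimal sketch of the standard BMI formula (weight in kilograms divided by height in metres squared) together with the WHO adult cut-offs of 25 (overweight) and 30 (obesity). The function names `bmi` and `who_adult_class` and the example values are hypothetical, and the sketch assumes Weight is recorded in kilograms and Height in metres. It is not the labelling used for the `Obesity_Type` variable in this data set.

```r
# Hypothetical helper: BMI from weight in kilograms and height in metres
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# WHO adult cut-offs: BMI >= 25 counts as overweight, BMI >= 30 as obese.
# The middle label covers the 25 to 30 range only.
who_adult_class <- function(bmi_value) {
  cut(bmi_value,
      breaks = c(-Inf, 25, 30, Inf),
      labels = c("Not overweight", "Overweight", "Obese"),
      right = FALSE)  # intervals are closed on the left, e.g. [25, 30)
}

# Made-up example: three people who are all 1.70 m tall
example_bmi <- bmi(weight_kg = c(55, 80, 96), height_m = c(1.70, 1.70, 1.70))
example_class <- who_adult_class(example_bmi)
data.frame(BMI = round(example_bmi, 2), Class = example_class)
```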
The data set used in this project assignment is open-source data obtained from ScienceDirect under a Creative Commons license.
The dataset is sourced from https://www.kaggle.com/datasets/mandysia/obesity-dataset-cleaned-and-data-sinthetic
Sampling Method Used:
Data Characteristics:
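# Load packages (assumed, not shown in the original): readr and dplyr for data handling, car for qqPlot()
library(readr); library(dplyr); library(car)
# Read the data file from the working directory
ObesityData <- read_csv('ObesityData.csv')
dim(ObesityData)
## [1] 2111 18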
## [1] "character"
# Convert Obesity_Type from character to factor, then check its class again
ObesityData <- mutate_at(ObesityData, vars(Obesity_Type), as.factor)
class(ObesityData$Obesity_Type)
## [1] "factor"
## [1] "numeric"
## [1] "numeric"
## [1] "numeric"
# Order the values of Obesity_Type
ObesityData$Obesity_Type <- factor(ObesityData$Obesity_Type, levels = unique(ObesityData$Obesity_Type), ordered = TRUE)
class(ObesityData$Obesity_Type)
## [1] "ordered" "factor"
## [1] "NormalWeight" "OBESE"
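One detail worth illustrating: with `levels = unique(...)`, the level order follows the order in which values first appear in the data. The sketch below uses a small hypothetical vector (not the survey data) to contrast that with naming the levels explicitly, which fixes the ordering regardless of row order.

```r
# Hypothetical vector standing in for ObesityData$Obesity_Type
obesity_type <- c("OBESE", "NormalWeight", "OBESE", "NormalWeight")

# levels = unique(x): the order follows first appearance in the data
by_appearance <- factor(obesity_type, levels = unique(obesity_type), ordered = TRUE)
levels(by_appearance)

# Explicit levels: the order is fixed no matter how the rows are sorted
explicit <- factor(obesity_type, levels = c("NormalWeight", "OBESE"), ordered = TRUE)
levels(explicit)

# With an ordered factor, comparisons respect the level order
obese_ranks_higher <- explicit[1] > explicit[2]
obese_ranks_higher
```

In the report, the output shows `"NormalWeight"` first, so both approaches would give the same order there as long as a NormalWeight row appears first in the file.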
Before proceeding with descriptive statistics and visualisation, the data is scanned for missing (NA) values and outliers.
A box plot is used to scan for outliers.
# Count the total number of missing (NA) values in the data frame
sum(is.na(ObesityData))
## [1] 0
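The single total above does not say where any missing values sit. As a minimal sketch on a hypothetical toy data frame (the values are made up for illustration), `colSums()` over the same `is.na()` matrix breaks the count down per column:

```r
# Hypothetical toy data frame with a few missing values
toy <- data.frame(
  Height = c(1.62, NA, 1.80),
  Weight = c(64, 56, NA),
  BMI    = c(24.39, NA, NA)
)

# Total number of missing cells in the whole data frame
total_missing <- sum(is.na(toy))

# Number of missing cells in each column
missing_by_column <- colSums(is.na(toy))

total_missing
missing_by_column
```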
boxplot(ObesityData$BMI ~ ObesityData$Obesity_Type, main = "Box Plot of BMI by Obesity_Type", ylab = "BMI", xlab = "Obesity_Type")
# Summary statistics of Weight for each obesity type
ObesityData %>% group_by(Obesity_Type) %>% summarise(Min = min(Weight, na.rm = TRUE),
  Q1 = quantile(Weight, probs = .25, na.rm = TRUE),
  Median = median(Weight, na.rm = TRUE),
  Q3 = quantile(Weight, probs = .75, na.rm = TRUE),
  Max = max(Weight, na.rm = TRUE),
  Mean = mean(Weight, na.rm = TRUE),
  SD = sd(Weight, na.rm = TRUE),
  n = n(),
  # Count missing values within each group rather than across the whole data frame
  missing = sum(is.na(Weight))) -> table1
knitr::kable(table1)

| Obesity_Type | Min | Q1 | Median | Q3 | Max | Mean | SD | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| NormalWeight | 39 | 50 | 55 | 62 | 87 | 56.20930 | 9.968814 | 559 | 0 |
| OBESE | 53 | 80 | 96 | 112 | 173 | 97.53093 | 21.091164 | 1552 | 0 |
# Summary statistics of BMI for each obesity type
ObesityData %>% group_by(Obesity_Type) %>% summarise(Min = min(BMI, na.rm = TRUE),
  Q1 = quantile(BMI, probs = .25, na.rm = TRUE),
  Median = median(BMI, na.rm = TRUE),
  Q3 = quantile(BMI, probs = .75, na.rm = TRUE),
  Max = max(BMI, na.rm = TRUE),
  Mean = mean(BMI, na.rm = TRUE),
  SD = sd(BMI, na.rm = TRUE),
  n = n(),
  # Count missing values within each group rather than across the whole data frame
  missing = sum(is.na(BMI))) -> table1
knitr::kable(table1)

| Obesity_Type | Min | Q1 | Median | Q3 | Max | Mean | SD | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| NormalWeight | 12.99868 | 17.57812 | 18.73049 | 22.22222 | 24.91349 | 19.77105 | 2.712523 | 559 | 0 |
| OBESE | 22.82674 | 27.81278 | 32.33759 | 37.80136 | 50.81175 | 33.27643 | 6.027950 | 1552 | 0 |
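The box plot flags as outliers any points lying more than 1.5 times the interquartile range beyond the hinges, which is the default rule in base R's `boxplot()`. As a minimal sketch, the same rule can be applied per group with `boxplot.stats()` to list the flagged values. The toy data frame below is hypothetical and only stands in for the BMI and Obesity_Type columns.

```r
# Hypothetical toy data: one value (35.0) sits far from the rest of its group
toy <- data.frame(
  Obesity_Type = rep(c("NormalWeight", "OBESE"), each = 6),
  BMI = c(18.5, 19.0, 19.5, 20.0, 20.5, 35.0,
          30.0, 31.0, 32.0, 33.0, 34.0, 35.0)
)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() draws;
# its $out element holds the values flagged as outliers
outliers_by_group <- lapply(
  split(toy$BMI, toy$Obesity_Type),
  function(x) boxplot.stats(x)$out
)
outliers_by_group
```

Whether a flagged point should be removed is a separate judgment; a value can be extreme for its group and still be a genuine observation.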
The two-sample t-test has the following hypotheses:
H0: μ1 = μ2
HA: μ1 ≠ μ2
Here, μ1 and μ2 are the mean BMI of people of normal weight and of obese people, respectively. H0 is the null hypothesis and HA is the alternative hypothesis.
# Q-Q plot of BMI for the NormalWeight group to check normality
BMI_NormalWeight <- ObesityData %>% filter(Obesity_Type == "NormalWeight")
BMI_NormalWeight$BMI %>% qqPlot(dist = "norm")
## [1] 442 186
# Q-Q plot of BMI for the OBESE group to check normality
BMI_OBESE <- ObesityData %>% filter(Obesity_Type == "OBESE")
BMI_OBESE$BMI %>% qqPlot(dist = "norm")
## [1] 1256 1340
Homogeneity of variance, or the assumption of equal variance, is tested using Levene’s test.
Levene’s test has the following hypotheses:
H0: (σ1)^2 = (σ2)^2
HA: (σ1)^2 ≠ (σ2)^2
| | Df | F value | Pr(>F) |
|---|---|---|---|
| group | 1 | 335.99 | 0 |
| | 2109 | NA | NA |
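The call that produced this table is not shown in the report. The layout matches what `leveneTest()` from the car package prints, but that is an assumption. To show what the test computes, here is a minimal base-R sketch of the median-centred form of Levene's test: a one-way ANOVA on the absolute deviations of each value from its group median. The data are simulated and hypothetical, so the numbers will not match the table above.

```r
set.seed(42)  # hypothetical simulated data, not the survey data
toy <- data.frame(
  group = factor(rep(c("A", "B"), times = c(50, 80))),
  value = c(rnorm(50, mean = 20, sd = 2), rnorm(80, mean = 33, sd = 6))
)

# Step 1: absolute deviation of each value from its own group median
group_median <- ave(toy$value, toy$group, FUN = median)
toy$abs_dev <- abs(toy$value - group_median)

# Step 2: one-way ANOVA on those deviations; the F test for 'group'
# is the Levene statistic
levene_table <- anova(lm(abs_dev ~ group, data = toy))
levene_table
```

A small p-value in the `group` row is evidence against equal variances, which is how the F value of 335.99 in the report's table reads.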
##
## Two Sample t-test
##
## data: BMI by Obesity_Type
## t = -51.134, df = 2109, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group NormalWeight and group OBESE is not equal to 0
## 95 percent confidence interval:
## -14.02334 -12.98742
## sample estimates:
## mean in group NormalWeight mean in group OBESE
## 19.77105 33.27643
## [1] -1.961089
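The figures above can be cross-checked from the BMI summary table alone. The output is labelled "Two Sample t-test" with df = 2109 = 559 + 1552 - 2, which is consistent with the pooled-variance form of the test, so the sketch below assumes that form. The variable names are hypothetical and the inputs are copied from the summary table. Under these assumptions the results should land close to the reported t statistic, critical value and confidence interval.

```r
# Summary values copied from the BMI table in this report
n1 <- 559;  mean1 <- 19.77105; sd1 <- 2.712523   # NormalWeight
n2 <- 1552; mean2 <- 33.27643; sd2 <- 6.027950   # OBESE

# Pooled-variance two-sample t-test computed by hand
deg_free   <- n1 + n2 - 2
pooled_var <- ((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / deg_free
std_error  <- sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat     <- (mean1 - mean2) / std_error

# Lower-tail critical value for a two-sided test at the 5% level
t_crit <- qt(0.025, deg_free)

# 95% confidence interval for the difference in means (lower, upper)
ci <- (mean1 - mean2) + c(1, -1) * t_crit * std_error

c(t = t_stat, critical = t_crit, lower = ci[1], upper = ci[2])
```

Because the absolute value of the t statistic is far larger than the absolute critical value, the null hypothesis of equal mean BMI is rejected at the 5% level.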
From the descriptive statistics,