Introduction

Obesity is a complex disease involving an excessive amount of body fat. It is not just a cosmetic concern.
It is a medical problem that increases the risk of other diseases and health problems.
Obesity is generally caused by eating too much and moving too little.
If you consume high amounts of fat and sugar but do not burn off the energy through exercise and physical activity, much of the surplus energy is stored by the body as fat.
This report analyzes survey data from more than 2000 people aged 14 to 61, covering their lifestyle, food habits and physical activity.
It also aims to draw inferences about how BMI (Body Mass Index) helps determine the obesity type,
to create awareness of this deadly disease and to help people track their BMI.

Introduction Continued

The survey was conducted in 3 Latin American countries: Mexico, Peru and Colombia.
BMI is a person's weight in kilograms divided by the square of their height in metres (kg/m²).
For adults, WHO defines overweight and obesity as follows:
- Overweight is a BMI greater than or equal to 25.
- Obesity is a BMI greater than or equal to 30.
For children aged 5-19 years:
- Overweight is a BMI-for-age greater than 1 standard deviation above the WHO Growth Reference median.
- Obesity is a BMI-for-age greater than 2 standard deviations above the WHO Growth Reference median.
Height, Weight, BMI and Obesity_Type are the key variables taken from this data for statistical analysis.
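
The adult WHO cut-offs listed above can be expressed as a small classification step. The following is a minimal sketch, not part of the original analysis: the function names `compute_bmi` and `classify_adult_bmi` and the example heights and weights are hypothetical, and the label "Overweight" is used here for the band from 25 up to (but not including) 30.

```r
# Hypothetical helpers for illustration only.

# BMI = weight (kg) / height (m)^2
compute_bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# WHO adult cut-offs: overweight starts at BMI 25, obesity at BMI 30.
# right = FALSE makes each interval closed on the left, so exactly 25
# and exactly 30 fall into the higher category.
classify_adult_bmi <- function(bmi) {
  cut(bmi,
      breaks = c(-Inf, 25, 30, Inf),
      labels = c("Below 25", "Overweight", "Obese"),
      right = FALSE)
}

# Made-up example values (not taken from the survey data)
example_bmi <- compute_bmi(weight_kg = c(70, 80, 90),
                           height_m  = c(1.75, 1.75, 1.70))

data.frame(BMI = round(example_bmi, 2),
           Category = classify_adult_bmi(example_bmi))
```

These thresholds apply to adults only. For respondents under 20, the BMI-for-age definition above would need the WHO Growth Reference tables, which this sketch does not cover.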

Problem Statement

The objective of this report is to analyze survey data from more than 2000 people aged between 14 and 61. The report aims to:
- Analyze Height, Weight and BMI and determine how these variables are used to classify people as obese.
- Help people track their BMI and create awareness to prevent obesity.
- Create awareness among people who are at risk of obesity and encourage lifestyle changes in people who are already affected by it.
- Find patterns and trends in the variables Height, Weight and BMI by producing graphs and tables.
- Verify the assumption that the height, weight and BMI of an individual relate directly to whether the person is obese.

Data

The data set used in this project is open data published on ScienceDirect under a Creative Commons license.

The dataset is sourced from https://www.kaggle.com/datasets/mandysia/obesity-dataset-cleaned-and-data-sinthetic

Sampling Method Used:

The data comes from an online survey, collected through a web platform, of people in the Latin American countries Mexico, Peru and Colombia.
The respondents are aged between 14 and 61 and have diverse eating habits and physical conditions.

Data Continued :

Data Characteristics:

The data contains responses from 2111 people (2111 observations) across 18 variables.
The variables are of different types: character, numeric (double) and factor.
Gender is a factor variable with the values Male and Female.
The important variables considered for this analysis are Height, Weight, BMI and Obesity_Type.
Obesity_Type is converted into a factor variable; Height, Weight and BMI are numeric variables.
Obesity_Type has 2 values in this data set, NormalWeight and OBESE.
Height has values between 1.45 and 1.98 metres.
Weight has values between 39 and 173 kilograms.
BMI has values between 12.99868489 and 50.81175281.
The data is pre-processed using R functions to prepare it for statistical analysis.

Data Pre-processing :

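Read the dataset (ObesityData.csv) and check the class of the important variables.

# Load the packages used in this report:
# readr (read_csv), dplyr (mutate, filter, group_by, summarise, %>%), car (qqPlot, leveneTest)

library(readr)
library(dplyr)
library(car)

# Read the data file from the working directory:

ObesityData <- read_csv('ObesityData.csv')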
dim(ObesityData)

## [1] 2111   18

# Check class of the Variable Obesity_Type

class(ObesityData$Obesity_Type)

## [1] "character"

# Convert the variable Obesity_Type from character to factor and check its class again

ObesityData <- mutate(ObesityData, Obesity_Type = as.factor(Obesity_Type))
class(ObesityData$Obesity_Type)

## [1] "factor"

# Check the class of the variable Height

class(ObesityData$Height)

## [1] "numeric"

Data Pre-processing Continued :

# Check the class of Variable Weight

class(ObesityData$Weight)

## [1] "numeric"

# Check the class of Variable BMI

class(ObesityData$BMI)

## [1] "numeric"

# Order the values of Obesity_Type (levels are listed explicitly so the order does not depend on row order)

ObesityData$Obesity_Type <- factor(ObesityData$Obesity_Type, levels = c("NormalWeight", "OBESE"), ordered = TRUE)
class(ObesityData$Obesity_Type)

## [1] "ordered" "factor"

# Check the distinct values of the variable Obesity_Type

levels(ObesityData$Obesity_Type)

## [1] "NormalWeight" "OBESE"
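
The order of the levels in an ordered factor is set by the `levels` argument, so it is worth checking how that argument is built. The sketch below uses a made-up toy vector (`toy_type` is hypothetical, not the survey data) to show that `levels = unique(x)` follows the order of first appearance, whereas an explicit vector fixes the order regardless of how the rows are sorted.

```r
# Toy vector in which "OBESE" happens to appear first (illustration only)
toy_type <- c("OBESE", "NormalWeight", "OBESE", "NormalWeight")

# Levels taken from unique(): the order follows first appearance in the data
by_appearance <- factor(toy_type, levels = unique(toy_type), ordered = TRUE)
levels(by_appearance)

# Levels given explicitly: the order is fixed by the analyst
explicit <- factor(toy_type, levels = c("NormalWeight", "OBESE"), ordered = TRUE)
levels(explicit)

# With an ordered factor, comparisons respect the level order
explicit < "OBESE"
```

In the survey data the first row evidently belongs to the NormalWeight group, which is why both approaches give the same level order there.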

Descriptive Statistics and Visualisation :

Before proceeding with descriptive statistics and visualisation, the data is scanned for missing (NA) values and outliers.
A box plot is used to scan for outliers.

## Count the total number of cells in the data frame that hold missing (NA) values.

sum(is.na(ObesityData))

## [1] 0

# The result shows that there are no missing values anywhere in the dataset.
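
When the total is not zero, it helps to know where the missing values are. The following is a small sketch on a made-up data frame (`toy` and its values are hypothetical) showing the overall count used above alongside per-column and per-row counts.

```r
# Toy data frame with a few deliberately missing values (illustration only)
toy <- data.frame(
  Height = c(1.62, NA, 1.80),
  Weight = c(64, 56, NA),
  BMI    = c(24.39, NA, NA)
)

total_missing   <- sum(is.na(toy))            # missing cells in the whole data frame
missing_per_col <- colSums(is.na(toy))        # missing cells in each column
rows_with_na    <- sum(!complete.cases(toy))  # rows with at least one missing value

total_missing
missing_per_col
rows_with_na
```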

Descriptive Statistics and Visualisation - Continued :

Box Plot of BMI by Obesity_Type :

boxplot(BMI ~ Obesity_Type, data = ObesityData, main = "Box-Plot of BMI And Obesity_Type", ylab = "BMI", xlab = "Obesity_Type")

## The box plot shows no major visual outliers in the BMI values of the different Obesity_Types, so we can safely proceed with the data. No cleaning of the data is required.
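
The visual check can be backed by listing the points a box plot would draw as outliers. The sketch below is an illustration on made-up values (the `toy` data frame is hypothetical); it uses base R's `boxplot.stats()`, which flags values lying more than 1.5 times the hinge spread beyond the hinges, the same rule `boxplot()` applies by default.

```r
# Toy data: one extreme BMI value in the first group (illustration only)
toy <- data.frame(
  BMI = c(18, 19, 20, 21, 22, 45,
          28, 30, 32, 34, 36, 38),
  Obesity_Type = rep(c("NormalWeight", "OBESE"), each = 6)
)

# Split BMI by group and collect the values flagged as outliers in each
outliers_by_group <- lapply(
  split(toy$BMI, toy$Obesity_Type),
  function(x) boxplot.stats(x)$out
)

outliers_by_group
```

An empty result for a group means no value in that group falls outside the whiskers.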

Descriptive Statistics and Visualisation - Continued :

Statistics Summary for Weight of People and Obesity_Type :

# Summary of Weight within each Obesity_Type group (missing is counted per group)
ObesityData %>% group_by(Obesity_Type) %>% summarise(Min = min(Weight, na.rm = TRUE),
                                         Q1 = quantile(Weight, probs = .25, na.rm = TRUE),
                                         Median = median(Weight, na.rm = TRUE),
                                         Q3 = quantile(Weight, probs = .75, na.rm = TRUE),
                                         Max = max(Weight, na.rm = TRUE),
                                         Mean = mean(Weight, na.rm = TRUE),
                                         SD = sd(Weight, na.rm = TRUE),
                                         n = n(),
                                         missing = sum(is.na(Weight))) -> table1
knitr::kable(table1)

Obesity_Type	Min	Q1	Median	Q3	Max	Mean	SD	n	missing
NormalWeight	39	50	55	62	87	56.20930	9.968814	559	0
OBESE	53	80	96	112	173	97.53093	21.091164	1552	0

Descriptive Statistics and Visualisation - Continued :

Statistics Summary for BMI of People and Obesity_Type:

# Summary of BMI within each Obesity_Type group (missing is counted per group)
ObesityData %>% group_by(Obesity_Type) %>% summarise(Min = min(BMI, na.rm = TRUE),
                                         Q1 = quantile(BMI, probs = .25, na.rm = TRUE),
                                         Median = median(BMI, na.rm = TRUE),
                                         Q3 = quantile(BMI, probs = .75, na.rm = TRUE),
                                         Max = max(BMI, na.rm = TRUE),
                                         Mean = mean(BMI, na.rm = TRUE),
                                         SD = sd(BMI, na.rm = TRUE),
                                         n = n(),
                                         missing = sum(is.na(BMI))) -> table2
knitr::kable(table2)

Obesity_Type	Min	Q1	Median	Q3	Max	Mean	SD	n	missing
NormalWeight	12.99868	17.57812	18.73049	22.22222	24.91349	19.77105	2.712523	559	0
OBESE	22.82674	27.81278	32.33759	37.80136	50.81175	33.27643	6.027950	1552	0

Hypothesis Testing :

A two-sample t-test is used to compare the means of two populations.
In this case, the two population groups are NormalWeight and OBESE.
The test assumes that the two groups are independent of each other, that both populations have equal variance and, for small samples, that the data in both populations are normally distributed.
The normality and equal-variance assumptions are checked in the following sections before the test is run.

The hypotheses of the two-sample t-test are:
H0 : μ1 = μ2
HA : μ1 ≠ μ2
Here, μ1 and μ2 are the mean BMI of NormalWeight people and OBESE people respectively. H0 is the null hypothesis and HA is the alternative hypothesis.

Hypothesis Testing - Continued - QQ Plot (BMI of Normal Weight People) :

Testing the assumption of Normality :

BMI_NormalWeight <- ObesityData %>% filter(Obesity_Type == "NormalWeight")

BMI_NormalWeight$BMI %>% qqPlot(dist="norm")

## [1] 442 186

## From the graph, the data points deviate from the diagonal reference line, so the BMI of normal weight people does not appear to be normally distributed.

Hypothesis Testing - Continued - QQ Plot (BMI of OBESE People) :

BMI_OBESE <- ObesityData %>% filter(Obesity_Type == "OBESE")

BMI_OBESE$BMI %>% qqPlot(dist="norm")

## [1] 1256 1340

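## From the graph, the data points deviate from the diagonal reference line, so the BMI of OBESE people does not appear to be normally distributed.
## However, both samples are large (n = 559 and n = 1552), so by the central limit theorem the sampling distribution of each mean is approximately normal and the t-test can still be applied.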
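
A QQ plot is read by eye, so a numeric companion can be useful. The following sketch uses simulated data only (the samples, seed and the helper `qq_correlation` are assumptions for illustration, not the survey data). It compares the correlation between theoretical and sample quantiles for a roughly normal and a clearly skewed sample, and runs base R's `shapiro.test()` on the skewed one.

```r
set.seed(1)

# Simulated samples (illustration only, not the survey data)
normal_sample <- rnorm(500, mean = 20, sd = 3)   # roughly bell-shaped
skewed_sample <- rexp(500, rate = 1 / 20)        # strongly right-skewed

# Correlation between theoretical normal quantiles and sample quantiles.
# Values close to 1 correspond to points lying near the QQ reference line.
qq_correlation <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}

qq_correlation(normal_sample)
qq_correlation(skewed_sample)

# Shapiro-Wilk test: a small p-value is evidence against normality
shapiro.test(skewed_sample)
```

With samples as large as those in this report, formal normality tests tend to flag even small departures, so they are best read together with the plot rather than in place of it.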

Hypothesis Testing - Continued - Homogeneity of Variance:

Homogeneity of variance, or the assumption of equal variance, is tested using Levene's test.
The hypotheses of Levene's test are:
```
  H0:(σ1)^2 = (σ2)^2
  HA:(σ1)^2 ≠ (σ2)^2
```
- Here (σ1)^2 and (σ2)^2 are the population variance of groups 1 and 2 respectively.
- Using the leveneTest() function to compare the variances of BMI of NormalWeight and OBESE people, we get

knitr::kable(round(leveneTest(BMI ~ Obesity_Type, data = ObesityData), 3))

	Df	F value	Pr(>F)
group	1	335.99	0
	2109	NA	NA

The p-value from this test of equal variance of the BMI of NormalWeight and OBESE people is reported as 0 after rounding to three decimal places, i.e. p < 0.001.
Here p < 0.05 and hence we reject the null hypothesis H0.
So this report will assume that the variances are not equal.
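
To make the mechanics of Levene's test concrete, the following sketch reproduces its logic in base R on simulated data (the `toy` data frame, seed and group parameters are assumptions for illustration). With its default median centring, `car::leveneTest()` amounts to a one-way ANOVA on the absolute deviations of each observation from its group median.

```r
set.seed(42)

# Simulated BMI values for two groups with clearly different spread
# (illustration only, not the survey data)
toy <- data.frame(
  BMI = c(rnorm(200, mean = 20, sd = 2.7),
          rnorm(200, mean = 33, sd = 6)),
  Obesity_Type = factor(rep(c("NormalWeight", "OBESE"), each = 200))
)

# Step 1: absolute deviation of each value from its own group median
group_median <- ave(toy$BMI, toy$Obesity_Type, FUN = median)
toy$abs_dev  <- abs(toy$BMI - group_median)

# Step 2: one-way ANOVA on those deviations; the F test is Levene's test
levene_table <- anova(lm(abs_dev ~ Obesity_Type, data = toy))
levene_table
```

A small p-value in the first row indicates that the typical deviation from the median differs between the groups, which is how unequal variance shows up in this test.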

Hypothesis Testing - Continued :

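# var.equal = TRUE runs the pooled-variance (Student) t-test that produced the output below.
# Because Levene's test rejected equal variances, Welch's t-test (var.equal = FALSE)
# would be the choice consistent with that finding.
t.test(BMI ~ Obesity_Type,
       data = ObesityData,
       var.equal = TRUE,
       alternative = "two.sided")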

## 
##  Two Sample t-test
## 
## data:  BMI by Obesity_Type
## t = -51.134, df = 2109, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group NormalWeight and group OBESE is not equal to 0
## 95 percent confidence interval:
##  -14.02334 -12.98742
## sample estimates:
## mean in group NormalWeight        mean in group OBESE 
##                   19.77105                   33.27643

qt(p = .025, df = 559 + 1552 -2)

## [1] -1.961089
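
The reported test statistic can be traced back to the group summaries. The sketch below is a hand calculation, not the author's code: it takes the group sizes, means and standard deviations of BMI from the descriptive statistics table and rebuilds the pooled-variance t statistic and confidence interval, which should agree with the t.test() output above up to rounding. It also computes the Welch (unequal-variance) statistic and its Satterthwaite degrees of freedom as an alternative, given that Levene's test rejected equal variances. All variable names are illustrative.

```r
# Group summaries of BMI, copied from the descriptive statistics table
n1 <- 559;  mean1 <- 19.77105; sd1 <- 2.712523   # NormalWeight
n2 <- 1552; mean2 <- 33.27643; sd2 <- 6.027950   # OBESE

# Pooled-variance (Student) two-sample t-test
pooled_var <- ((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2)
se_pooled  <- sqrt(pooled_var * (1 / n1 + 1 / n2))
t_pooled   <- (mean1 - mean2) / se_pooled
df_pooled  <- n1 + n2 - 2
ci_pooled  <- (mean1 - mean2) + c(-1, 1) * qt(0.975, df_pooled) * se_pooled

# Welch two-sample t-test (does not assume equal variances)
se_welch <- sqrt(sd1^2 / n1 + sd2^2 / n2)
t_welch  <- (mean1 - mean2) / se_welch
df_welch <- se_welch^4 /
  ((sd1^2 / n1)^2 / (n1 - 1) + (sd2^2 / n2)^2 / (n2 - 1))

round(c(t_pooled = t_pooled, df_pooled = df_pooled), 3)
round(ci_pooled, 5)
round(c(t_welch = t_welch, df_welch = df_welch), 3)
```

Because the smaller group also has the smaller variance, the Welch standard error is smaller than the pooled one here, so the Welch statistic is even further from zero and the conclusion of the test does not change.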

Discussion:

Since the test statistic from the two-sample t-test, t = -51.134, is more extreme than the critical value -1.961089, we reject the null hypothesis H0.
The two-tailed p-value reported by the two-sample t-test is p < 2.2e-16.
Since p < α = 0.05, this also leads us to reject the null hypothesis H0.
There is a statistically significant difference between the mean BMI of the two groups (NormalWeight and OBESE).
The difference between the means of the two groups is 33.27643 - 19.77105 = 13.50538, with a 95% confidence interval of 12.98742 to 14.02334.

From the descriptive statistics,

The mean BMI of NormalWeight people is 19.77105 for a sample of 559, and the mean BMI of OBESE people is 33.27643 for a sample of 1552.
This shows that the number of OBESE people in the sample of 2111 is nearly 2.8 times the number of NormalWeight people, irrespective of age, gender, height and weight.
This points to an alarming level of obesity among the 2111 respondents from the three Latin American countries (Mexico, Peru and Colombia) and suggests that people need to make effective and immediate lifestyle changes.

MATH-1324-Applied data project - Assignment 2

Introduction to Statistics Assignment 2
