1 Executive Summary

This report uses the chosen dataset to explore two questions: does physical activity frequency directly impact the obesity level of young to middle-aged people, and how does water intake affect obesity? The first question compares the obesity levels of people aged 20 to 40 with their physical activity frequency to better understand the trend between the two. Once the older respondents have been removed from the dataset, the expected trend emerges: higher physical activity corresponds to a lower weight class. The second question explores the connection between water intake and weight class. According to research, water consumption is expected to help with weight loss. However, the results do not reflect this, so the report also examines the data source and the variables used in order to identify potential reasons for the discrepancy.

2 Full Report

2.1 Initial Data Analysis (IDA)

The data comes from kaggle.com, a website that provides free, open datasets across many categories. The dataset chosen for this study reflects the lifestyle of people from Colombia, Peru and Mexico, categorised into different weight classes. 23% of the data was collected directly through a web survey, and the other 77% was simulated using the Weka tool and the SMOTE filter. The process of generating synthetic data includes an initial analysis and ensures that the spread of the weight classes is balanced, which increases the validity of the generated data by minimising error or bias against any one class. However, the dataset uses BMI (Body Mass Index) to assign the weight class, and BMI is widely recognised as a misleading, or at least not completely accurate, measure of body fat content since it does not account for muscle mass (Nordqvist, 2022). This can affect the results, because we cannot be sure whether BMI has correctly identified an instance of obesity or has assigned a person to the wrong category. Potential issues aside, the dataset can be used to identify problems and areas of concern regarding the health and well-being of the public by providing a general idea of common factors that affect weight. Additionally, the survey data can be used in a variety of ways to simulate more data values as a cost- and time-efficient alternative to regular data collection.
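As a rough illustration of how BMI is derived from the Weight and Height columns, the sketch below applies the standard formula (weight in kilograms divided by height in metres squared) to the first four rows of the dataset, with the values copied by hand from the structure output shown later in this section. It is only a check of the formula, not the procedure the dataset authors used to assign classes; the variable names are made up for this example.

```r
# First four rows of Weight (kg) and Height (m), copied from the str() output
weight_kg <- c(64, 56, 77, 87)
height_m  <- c(1.62, 1.52, 1.80, 1.80)

# BMI = weight in kilograms divided by the square of height in metres
bmi <- weight_kg / height_m^2
round(bmi, 1)
## [1] 24.4 24.2 23.8 26.9
```

The first three values fall below 25 and the fourth above it, which is consistent with the Normal_Weight and Overweight_Level_I labels those rows carry in NObeyesdad.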

Variables in the dataset: Gender, Age, Height, Weight, family_history_with_overweight (family member with obesity), FAVC (consumption of high-caloric food), FCVC (consumption of vegetables), NCP (number of main meals), CAEC (consumption of food between meals), SMOKE, CH2O (daily consumption of water), SCC (calorie consumption monitoring), FAF (weekly physical activity frequency), TUE (technology usage), CALC (consumption of alcohol), MTRANS (transportation), NObeyesdad (obesity condition: Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III).

For this report, four variables have been chosen: Age, FAF, NObeyesdad and CH2O. Some have been modified to better suit the graphical summaries.

The NObeyesdad variable has been converted to numerical format ONLY for question 1, as follows: 1: Insufficient Weight, 2: Normal Weight, 3: Overweight Level I, 4: Overweight Level II, 5: Obesity Type I, 6: Obesity Type II, 7: Obesity Type III.

CH2O ranges from 1 to 3, as follows: 1: Less than a litre, 2: Between 1 and 2 L, 3: More than 2 L. All floating-point values produced by the synthetic data generation were rounded and converted to integer format.
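The rounding step for CH2O can be sketched as follows. The input values are hypothetical stand-ins for the synthetic rows, and the category labels are shortened versions of the descriptions above; this is one way the step could be done, not necessarily how the modified file was actually produced.

```r
# Hypothetical CH2O values; synthetic rows can carry non-integer values
ch2o_raw <- c(2, 1.27, 2.76, 3, 1.81)

# Round to the nearest category and store as integer
ch2o_int <- as.integer(round(ch2o_raw))

# Attach readable labels to the three categories
ch2o_label <- factor(ch2o_int, levels = 1:3,
                     labels = c("Less than 1 L", "1 to 2 L", "More than 2 L"))
table(ch2o_label)
```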

## read in data
RawData = read.csv("~/Desktop/ObesityDataSet_raw_and_data_sinthetic.csv", header = TRUE)
## show classification of variables
str(RawData)
## 'data.frame':    2111 obs. of  17 variables:
##  $ Gender                        : chr  "Female" "Female" "Male" "Male" ...
##  $ Age                           : num  21 21 23 27 22 29 23 22 24 22 ...
##  $ Height                        : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight                        : num  64 56 77 87 89.8 53 55 53 64 68 ...
##  $ family_history_with_overweight: chr  "yes" "yes" "yes" "no" ...
##  $ FAVC                          : chr  "no" "no" "no" "no" ...
##  $ FCVC                          : num  2 3 2 3 2 2 3 2 3 2 ...
##  $ NCP                           : num  3 3 3 3 1 3 3 3 3 3 ...
##  $ CAEC                          : chr  "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
##  $ SMOKE                         : chr  "no" "yes" "no" "no" ...
##  $ CH2O                          : num  2 3 2 2 2 2 2 2 2 2 ...
##  $ SCC                           : chr  "no" "yes" "no" "no" ...
##  $ FAF                           : num  0 3 2 2 0 0 1 3 1 1 ...
##  $ TUE                           : num  1 0 1 0 0 0 0 0 1 1 ...
##  $ CALC                          : chr  "no" "Sometimes" "Frequently" "Frequently" ...
##  $ MTRANS                        : chr  "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
##  $ NObeyesdad                    : chr  "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...


2.2 Research Question 1: Does physical activity frequency directly impact obesity level?

data <- read.csv("~/Desktop/ObesityDataSet_raw_and_data_sinthetic.csv")

#Age and physical activity frequency
alter_age = data$Age
alter_phys = data$FAF   #Also for second
alter_model_AgePhy = lm(alter_phys~alter_age)
plot(alter_age, alter_phys, xlab="age", ylab="physical activity")
abline(alter_model_AgePhy, col="red")

cor(alter_age, alter_phys)
## [1] -0.1449383
# Convert obesity state to a number,
# then use data with only young people (20 <= Age < 40)

alt_data <- read.csv("~/Desktop/ObesityDataSet_copy.csv")

young_data = alt_data[alt_data$Age>=20 & alt_data$Age<40, ]

alter_young_phys = young_data$FAF
alter_young_obesity = young_data$NObeyesdad
alter_young_model_PhysOb = lm(alter_young_obesity~alter_young_phys)
plot(alter_young_phys, alter_young_obesity, xlab="Young people physical activity frequency", ylab="young people obesity")
abline(alter_young_model_PhysOb, col="red")

cor(alter_young_phys,alter_young_obesity)
## [1] -0.1916566

Research indicates that more frequent physical activity helps people stay at a healthy weight or lose weight, keeping obesity at bay (Physical Activity, 2016) (Chin et al., 2016). A dataset with information on 2111 people was used to explore this connection.

It is essential to first identify the people whose weight is most directly affected by physical activity frequency. Older people are more likely to be affected by hormonal changes, so age should be treated as a potential confounding variable to improve the accuracy of the later analysis (Obesity - Symptoms and Causes, 2021). The Age and FAF (physical activity frequency) variables were therefore used first to check whether FAF shows a clear decreasing trend with age. The data is represented as a scatter plot with a linear regression line, which helps the reader identify the relationship between the two variables.

The resulting plot and regression line show that people above 40 tend to be less physically active than other age groups, and the negative correlation coefficient (-0.1449383) indicates that physical activity frequency decreases, although only weakly, as people get older. Since most of the data lies within the 20 to 40 age range, the impact of discarding data for people above 40 is negligible.
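One way to check how much data an age filter discards is to compute the share of rows inside the range before subsetting. The sketch below uses hypothetical ages and activity scores in place of data$Age and data$FAF, so its numbers are illustrative only.

```r
# Hypothetical ages and activity scores standing in for data$Age and data$FAF
age <- c(19, 21, 23, 27, 35, 38, 41, 55)
faf <- c(3, 2, 2, 1, 1, 0, 1, 0)

in_range <- age >= 20 & age < 40   # same condition as the young_data filter
share_in_range <- mean(in_range)   # proportion of rows that would be kept

# Pearson correlation before and after applying the filter
r_all   <- cor(age, faf)
r_young <- cor(age[in_range], faf[in_range])
```

Note that this condition drops respondents under 20 as well as those aged 40 and over, so share_in_range reflects both exclusions.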

Based on this result, a second linear regression model was built for the relationship between physical activity frequency and obesity condition among young people (aged 20 to 40). However, the original dataset stores the obesity condition variable (NObeyesdad) as strings, such as 'Normal_Weight' and 'Obesity_Type_II'. As strings cannot be used in a scatter plot, the values were converted to numbers: 'Insufficient_Weight' became 1, 'Normal_Weight' became 2, 'Overweight_Level_I' became 3, and so on. Although this makes the previously unusable variable available, the result is still a discrete variable, since values such as 1.5 or 2.3 cannot exist. This affects the value of the correlation coefficient of the final model.
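The string-to-number conversion described above could also be done directly in R with an ordered set of factor levels. The sketch below is a minimal example on hypothetical rows (the toy data frame and the obesity_code column are made up for illustration); it also shows Spearman's rank correlation as an alternative that does not assume the seven codes are equally spaced, which is a different technique from the Pearson coefficient reported in this section.

```r
# Weight classes in increasing order, spelled as in the NObeyesdad column
levels_ordered <- c("Insufficient_Weight", "Normal_Weight",
                    "Overweight_Level_I", "Overweight_Level_II",
                    "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III")

# Hypothetical rows standing in for young_data
toy <- data.frame(
  FAF = c(3, 2, 1, 0, 0),
  NObeyesdad = c("Normal_Weight", "Insufficient_Weight", "Overweight_Level_I",
                 "Obesity_Type_I", "Obesity_Type_III")
)

# Map each label to its position in levels_ordered (1 to 7)
toy$obesity_code <- as.integer(factor(toy$NObeyesdad, levels = levels_ordered))

# Pearson treats the codes as equally spaced; Spearman uses ranks only
cor(toy$FAF, toy$obesity_code)
cor(toy$FAF, toy$obesity_code, method = "spearman")
```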

In the final scatter plot, the regression line slopes downward with a negative correlation coefficient (-0.1916566), indicating that young people in this dataset tend to fall into a more severe obesity class as their physical activity frequency decreases, although the correlation is weak.


2.3 Research Question 2: How does water intake affect obesity?

# library
library("ggplot2")

q2Data = read.csv("~/Desktop/Data2.csv", header = T)

#assigning variable names
WeightClass = q2Data$WeightClass
WaterConsumption = q2Data$WaterCons
Percentage = q2Data$Freq

#create data frame
data <- data.frame(WeightClass,WaterConsumption,Percentage)

#Create bar chart
ggplot(data, aes(fill = WaterConsumption, y = Percentage, x = WeightClass)) +
  geom_bar(stat = "identity", position = "fill")

There is some debate surrounding the claim that increased water consumption can directly help maintain a healthy weight, and research papers indicate an emerging link between increased water consumption and weight loss. The second question therefore uses the variables water consumption (CH2O) and weight class (NObeyesdad) to determine whether there is a trend between an individual's daily water consumption and their weight class. The data is represented as a stacked bar chart that helps visualise the relative percentages of water consumption levels within each weight class.
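The relative percentages behind such a chart can be computed from row-level data with table() and prop.table(). The sketch below is an assumption about how a summary file with WeightClass, WaterCons and Freq columns might be produced; the toy rows are hypothetical and stand in for the NObeyesdad and rounded CH2O columns.

```r
# Hypothetical rows standing in for the NObeyesdad and rounded CH2O columns
toy <- data.frame(
  WeightClass = c("Normal_Weight", "Normal_Weight", "Normal_Weight",
                  "Normal_Weight", "Obesity_Type_I", "Obesity_Type_I"),
  WaterCons = c(1, 2, 2, 3, 2, 3)
)

# Cross-tabulate, then convert counts to shares within each weight class
counts <- table(toy$WeightClass, toy$WaterCons)
shares <- prop.table(counts, margin = 1)   # each row sums to 1

# Long format with one row per (WeightClass, WaterCons) pair
long <- as.data.frame(shares, responseName = "Freq")
names(long)[1:2] <- c("WeightClass", "WaterCons")
```

Because position = "fill" rescales every bar to a total of 1, passing either the raw counts or these shares to the plotting code gives the same chart.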

The minimum recommended intake of water is 2 litres per day, which sits at the upper end of category 2 (between 1 and 2 L). The graph shows an unexpected trend: people who are overweight tend to drink more than 2 litres of water per day. There appears to be a general trend in which water consumption increases with weight class, with the exception of Obesity Type II, which shows a sudden dip.

This may be due to the aforementioned problem of using BMI to assign weight classes, as it does not account for muscle mass. Muscle is denser than fat, so some of those classed as Overweight Level I and above could be there because of high muscle mass gained through exercise. This is one possible explanation for the data, as the heavy and rigorous exercise used to gain muscle requires a high water intake to compensate for sweat and dehydration (Exercise - the Low-down on Hydration - Better Health Channel, n.d.). Despite the data showing such a trend, it is understood that regular water intake at recommended levels can help reduce weight. One study found that overweight or obese middle-aged and older adults lost 44% more weight when they regularly drank 500 ml of water per meal and more during the day (Dennis et al., 2010). Therefore, given the drawbacks of the data discussed above, we cannot properly conclude that water intake directly contributes to obesity.


3 References

  1. Nordqvist, C. (2022, January 20). Why BMI is inaccurate and misleading. Retrieved September 22, 2022, from https://www.medicalnewstoday.com/articles/265215

  2. Exercise - the low-down on hydration - Better Health Channel. (n.d.). Retrieved September 22, 2022, from https://www.betterhealth.vic.gov.au/health/healthyliving/Exercise-the-low-down-on-water-and-drinks

  3. Dennis, E. A., Dengo, A. L., Comber, D. L., Flack, K. D., Savla, J., Davy, K. P., & Davy, B. M. (2010, February). Water Consumption Increases Weight Loss During a Hypocaloric Diet Intervention in Middle-aged and Older Adults. Obesity, 18(2), 300–307. https://doi.org/10.1038/oby.2009.235

  4. Physical Activity. (2016, April 12). Obesity Prevention Source. Retrieved September 22, 2022, from https://www.hsph.harvard.edu/obesity-prevention-source/obesity-causes/physical-activity-and-obesity/

  5. Chin, S.-H., Kahathuduwa, C. N., & Binks, M. (2016, October 14). Physical activity and obesity: what we know and what we need to know. Wiley Online Library. Retrieved September 22, 2022, from https://onlinelibrary.wiley.com/doi/full/10.1111/obr.12460

  6. Obesity - Symptoms and causes. (2021, September 2). Mayo Clinic. Retrieved September 22, 2022, from https://www.mayoclinic.org/diseases-conditions/obesity/symptoms-causes/syc-20375742