Statistics for Data Science - Final Project Write Up

Introduction

My final project is focused on determining which lifestyle factors, including things like frequency of exercise and counting calories, and genetic factors, like a family history of obesity, are good predictors for whether or not someone will be overweight or obese. This project will explore various variables and their relationship to obesity level using a dataset from UC Irvine Machine Learning Repository, Estimation of Obesity Levels Based on Eating Habits and Physical Condition. This dataset, while technically a machine learning dataset designed with predicting obesity level based on eating habits and physical condition, has 23% of it’s data collected from actual people (although the other 77% is generated) in Mexico, Peru, and Colombia. There are 16 features and 2,111 instances with no missing values.

The main goal with this project is to determine how these individual lifestyle habits and genetic backgrounds influence obesity level, and if any prescriptive advice can be given.

First - load the dataset & necessary libraries

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.3

## Warning: package 'tibble' was built under R version 4.1.3

## Warning: package 'tidyr' was built under R version 4.1.3

## Warning: package 'readr' was built under R version 4.1.3

## Warning: package 'purrr' was built under R version 4.1.3

## Warning: package 'dplyr' was built under R version 4.1.3

## Warning: package 'forcats' was built under R version 4.1.3

## Warning: package 'lubridate' was built under R version 4.1.3

## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.2     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.1
## v ggplot2   3.5.1     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

obesity <- read.csv(file.choose())

The Genetic Component

One of the most interesting things I wanted to explore with this dataset, before I started exploring the influence that lifestyle factors had on obesity, was whether or not the genetic component matters. If there is a shown relationship between a family history of obesity/being overweight and an individual being overweight/obese, then preventative advice could be given to a patient, even if they themselves are currently at a normal weight (based off BMI).

First, I visualized the split of who has a family history of being overweight/obese and who doesn’t, separated by obesity level.

library(ggplot2)

palette1 <- c("deepskyblue2", "deepskyblue4")

ggplot(obesity, aes(x = NObeyesdad, fill = family_history_with_overweight)) +
  geom_bar() +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(title = 'Family History of Obesity by Obesity Level', x = 'Obesity Level', y = 'Number of Individuals') +
  scale_fill_manual(values = palette1)

From this visualization, it appears that there are visibly fewer individuals who are of insufficient or normal weight with overweight or obese family members. Contrasting this, individuals in obesity type I, II, and III almost entirely have family members who are overweight or obese. This visualization suggests that having a family member who was or is overweight or obese increases your chance of being overweight or obese.

To confirm this, it is important to run a statistical test, so the next step would be to run a hypothesis test. Because both of these variables are categorical (family_history_with_overweight being ‘yes’ or ‘no’ and NObeyesdad being one of the 7 obesity levels), I will be running a Chi Squared Test for Independence. I will be using an alpha value of 0.05, or am willing to accept a 5% chance of rejecting my null. I selected this value because it is standard, and because this data is primarily generated data, I want to give it a bit of wiggle room. Additionally, there is not a lot at stake here - if the null is rejected when it really should not be, the worst that happens is someone is given guidance to monitor their weight.

The null hypothesis is that: There is no relationship between having a family history of being overweight/obese and an individual’s susceptibility to being overweight/obese

#Create a contingency table with my two variables 
test1 <- table(obesity$family_history_with_overweight, obesity$NObeyesdad)

#Run the chi-squared test
chisquare1 <-chisq.test(test1)

print(chisquare1)

## 
##  Pearson's Chi-squared test
## 
## data:  test1
## X-squared = 621.98, df = 6, p-value < 2.2e-16

The results of this Chi-Squared test suggest that the null hypothesis should be rejected, and there is a significant relationship between family history of overweight/obesity and current obesity level. The p-value is much smaller than 0.05 at 2.2e-16.

What these steps might be, exactly, is the next part of my analysis: what lifestyle factors contribute to obesity?

Lifestyle Factor 1: Calorie Counting

One of the variables included in my dataset is a categorical variable, SCC, which encodes whether or not an individual monitors the calories they consume in a day. Modern science and research on obesity and calories tells us that at the core of weight loss is calories in versus calories out, or burning more calories than you consume. Research also tells us that the most efficient way to make sure this is actually helping is to track calories.

First, I visualized the division of who does or does not monitor their calories by their obesity level.

palette2 = c("darkslategray3", "darkslategrey")

ggplot(obesity, aes(x = NObeyesdad, fill = SCC)) +
  geom_bar() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = 'Calorie Monitoring by Obesity Level', x = 'Number of Individuals', y = 'Obesity Level') +
  scale_fill_manual(values = palette2)

Based on this visualization, it seems that hardly anyone at any obesity level is monitoring their calories, with obesity type I, II, and III having almost no calorie monitors, and the other sections having an insignificant amount.

To confirm that this is the case, I will run another Chi-Squared test, because both of these variables are categorical again. I am using the same alpha, for the same reasons as in my last hypothesis test.

The null hypothesis is: There is no relationship between monitoring calories and an individual’s obesity level.

#Create a contingency table with my two variables
test2 <- table(obesity$NObeyesdad, obesity$SCC)

#Run the chi-squared test
chisquare2 <- chisq.test(test2)

print(chisquare2)

## 
##  Pearson's Chi-squared test
## 
## data:  test2
## X-squared = 123.02, df = 6, p-value < 2.2e-16

While this p-value suggests that the null hypothesis should be rejected and there is, in fact, a significant relationship between monitoring calories and obesity level, I would not fully come to that conclusion. The chi-square test, and any hypothesis test, does not account for the direction of the relationship between the variables (that is, whether counting calories decreases or increases obesity). My best assumption here would be that, because there are less people that count calories in the higher obesity levels, that not counting calories leads to obesity.

Lifestyle Factor 2: Physical Activity

The second lifestyle factor I wanted to explore was frequency of physical activity. While calorie counting is touted as the best way to burn more calories than you consume, increasing physical activity is another way, although it’s less recommended by professionals in the field.

Note: This variable is encoded in the following way:

0 = No physical activity per week
1 = 1 or 2 days of physical activity per week
2 = 2 or 4 days of physical activity per week
3 = 4 or 5 days of physical activity per week

I started by visualizing the average frequency of physical activity (per week) by obesity level in a box plot.

ggplot(obesity, aes(x = NObeyesdad, y = FAF)) +
  geom_boxplot() +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(title = 'Frequency of Physical Activity versus Obesity Level', x = 'Obesity Level', y = 'Frequency of Physical Activity (per week)')

Based on this visualization, it does not appear that there is a significant difference between obesity level and frequency of physical activity per week. Most obesity levels, excluding insufficient weight and obesity type III, exercise one or two times a week on average. Based on this, I would venture to say that there is not a significant relationship between the frequency of physical activity and obesity level.

To confirm this, I will run another hypothesis test. This time, it will be an ANOVA, as I will be using the mean frequency of physical activity for each obesity level.

The null hypothesis is: There is no relationship between the frequency of physical activity and an individual’s level of obesity.

anova <- aov(FAF ~ NObeyesdad, data = obesity)

summary(anova)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## NObeyesdad     6   72.5  12.084   17.48 <2e-16 ***
## Residuals   2104 1454.1   0.691                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Despite the fact that every group has approximately the same mean value for frequency of physical activity, there is a statistically significant relationship between the frequency of physical activity and obesity level, meaning we reject the null hypothesis. I think I personally was not expecting a statistical significance because of the similarity between means, but neglected to consider that the statistical significance just means the relationship is not random.

Lifestyle Factor 3: Alcohol Consumption

The third and final lifestyle factor I wanted to explore in my analysis was alcohol consumption and how it related to obesity level. As limiting calories is the most commonly prescribed weight loss method, many professionals recommend limiting alcohol consumption, as it’s easy to overindulge in calories when drinking.

As with the other two lifestyle factors, I will start by visualizing this comparison.

palette4 = c("Always" = "darkseagreen1", "Sometimes" = "darkseagreen4", "Frequently" = "darkolivegreen", "no" = "darkslategrey")

ggplot(obesity, aes(x = NObeyesdad, fill = CALC)) +
  geom_bar(position = "stack") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Alcohol Consumption by Obesity Level", x = "Obesity Level", y = "Frequency of Alcohol Consumption") +
  scale_fill_manual(values = palette4)

Based on this visualization, it appears that almost everyone in every obesity level drinks alcohol “sometimes”, which I would assume is true for most of the population of each country included in this dataset, as well as many countries not included. However, it appears that insufficient weight and normal weight categories, as well as the lower overweight categories, have a higher number of people who do not drink alcohol, so stopping alcohol consumption might be a weight management tactic.

To confirm if there is any relationship, I will run a Chi-Squared test.

The null hypothesis is: There is no relationship between frequency of alcohol consumption and an individual’s obesity level.

#Create a contingency table
test3 <- table(obesity$NObeyesdad, obesity$CALC)

#Run the chi-squared test
chisquare3 <-chisq.test(test3)

## Warning in chisq.test(test3): Chi-squared approximation may be incorrect

print(chisquare3)

## 
##  Pearson's Chi-squared test
## 
## data:  test3
## X-squared = 338.58, df = 18, p-value < 2.2e-16

To first address the warning from the chi-squared test: There is a very low number in the “Always” category (less than 1 for each obesity level). However, despite that, the p-value is still much smaller than the alpha value of 0.05, which means that we should reject the null hypothesis and conclude that there is a relationship between the frequency of alcohol consumption and obesity level.

Regression Model

To fully explore the relationship and impact that these three lifestyle factors have on obesity level, the last thing I will do for this analysis is build a linear regression model.

The first thing I need to do is establish a BMI column. I will be using this instead of obesity level since it is continuous instead of categorical.

For reference:

The relationship between BMI and the NObeyesdad category (obesity level) is as follows:

Insufficient Weight = Less than 18.5 BMI
Normal Weight = Between 18.5 and 24.9 BMI
Overweight = Between 25.0 and 29.9 BMI
- While there is a distinction between Overweight Level I and Overweight Level II in the data, there was no distinction made by BMI in the paper
Obesity Type I = Between 30.0 and 34.9 BMI
Obesity Type II = Between 35.0 and 39.9 BMI
Obesity Type III = Higher than 40.0 BMI

Create the BMI column to use as a response variable in the model:

obesity <- obesity |>
  mutate (BMI = Weight / (Height ^ 2))

Load in the necessary libraries for this model:

library(ggthemes)
library(ggrepel)

## Warning: package 'ggrepel' was built under R version 4.1.3

library(lindia)

Create the model and print a summary:

lrmodel <- lm(BMI ~ FAF + SCC + CALC, obesity)

summary(lrmodel)

## 
## Call:
## lm(formula = BMI ~ FAF + SCC + CALC, data = obesity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.4201  -5.5022   0.2173   5.7866  20.0852 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     23.7806     7.5927   3.132  0.00176 ** 
## FAF             -1.2893     0.1966  -6.558 6.83e-11 ***
## SCCyes          -6.3778     0.7966  -8.007 1.93e-15 ***
## CALCFrequently   5.5560     7.6449   0.727  0.46745    
## CALCno           5.1129     7.5963   0.673  0.50097    
## CALCSometimes    8.7095     7.5930   1.147  0.25149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.59 on 2105 degrees of freedom
## Multiple R-squared:  0.1045, Adjusted R-squared:  0.1024 
## F-statistic: 49.13 on 5 and 2105 DF,  p-value: < 2.2e-16

Based on this summary:

If you do not exercise, do not count calories, and always drink alcohol, your theoretical BMI would be 23.7806, which is in the normal weight category
The intercept for SCC (do you monitor your calories?) yes - which implies that “no” is the baseline - is -6.3778, which implies that the BMI of an individual who does count calories would be around 7 points lower than those who don’t
- The p-value for this variable is low, implying that this relationship is not likely to be random, and there is some correlation between monitoring calories and BMI
The intercept for FAF (frequency of physical activity) is -1.2893, implying that as the frequency of physical activity decreases, the BMI increases by that amount, or vice versa
- The p-value for this variable is also low, much lower than the standard alpha of 0.05, which I have been using for all my statistical tests in this analysis, which, like the variable SCC, implies that this is not a random relationship, and there is some correlation here
The three intercepts for the various CALC (how often do you consume alcohol?) variable are 5.5560 for Frequently, 5.1129 for No, and 8.7095 for Sometimes. This implies that as you either frequently, don’t, or sometimes drink alcohol, your BMI would be that many points higher than someone who always drinks alcohol
- However, there is no evidence that any of those CALC frequencies has a strong relationship to BMI, so there is no reason to believe that drinking alcohol frequently, not at all, or sometimes has any significant impact on an individual’s BMI

To verify these result, I will not evaluate the model using 5 graphs to verify if they violate the assumptions of linear regression models:

Residuals versus Fitted Values
```
gg_resfitted(lrmodel) +
  geom_smooth(se=FALSE)
```
```
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
```
This plot suggests that the assumptions about linear models are being violated. The line curves towards the end, which suggests there is not a clear linear trend, so the linearity assumption is violated. Additionally, the values fan out towards the right side of the graph (as the values increase), indicating that the homoscedasticity assumption is violated as well. However, there does not seem to be any unique trends or clusters, so the independence and normality assumptions are not violated.

Residuals versus X-Axis

plots <- gg_resX(lrmodel, plot.all = TRUE)

Looking first at Residual vs. FAF:

The points appear to be approximately evenly scattered around the line, which suggest to me that the linearity and homoscedasticity assumptions are not violated, although there are more points for 0, 1, and 2 than there are for 3, which is to be expected as not many people have the ability or time to work out 4 to 5 days a week consistently. The data also appears to be linear and normal, so those assumptions aren’t violated either.

Looking second at Residual vs. SCC:

The residuals are kind of spread out, but not in any particular pattern, so I don’t think any assumptions are violated in this case. There are more values, or a wider spread, for “no”, but that was made apparent during my first analysis of obesity level and calorie monitoring.

Looking finally at Residual vs. CALC:

The residuals are spread consistently across the 4 categories here, with the exception of the “always” category having almost no, if none at all, values. This does not make me think that any linear model assumptions are violated. There are also not a lot of outliers, which means the model might be better fit and explain this relationship well.

Overall, based on this specific graph, I would say that the assumptions of linear models are held true, or at least there is not enough information in these graphs to say that they have been violated.

Residual Histogram

gg_reshist(lrmodel)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This histogram has a very rough bell-shaped curve, but enough that I would consider it “normal”, so the normality assumption might be violated here. Additionally, the tails are uneven, with the left side being much higher than the right side. I would determine ultimately that this histogram suggests that the assumptions of linear models are violated, and that there might be some outliers in the data due to the spiked of bars in the histogram.

QQ-Plot
```
gg_qqplot(lrmodel)
```

In this Normal-QQ Plot, it is apparent that the mode is decently linear between -1 and 1, which implies that the data is normally distributed for most of this model. However, the data deviates at both of the tails, which suggests that there might be outliers than my linear regression model is not handling.

Cook’s D Observation

gg_cooksd(lrmodel, threshold = 'matlab')

Based on this Cook’s D Plot, there are two influential columns of data in this model: column 27 and column 69, with column 27 being much more influential. These columns could contain outliers, or skew the model, but because there are a very small number of influential columns, I don’t think that this leads me to think that the model violates the assumptions. If anything, I would want to look at those two observations, and potentially remove them from the model to see if they help the model be a better fit.

Conclusions & Final Recommendations

Genetics play an important role in an individual’s obesity level. In making recommendations to physicians, I would suggest that if a patient comes in with a family history of obesity, they should recommend being proactive about their weight management. How, exactly, they should do this could be one of the three lifestyle factors discussed in this analysis: calorie monitoring, physical activity, and alcohol consumption, although advice should always be given in the context of the patient’s history and medical records.
My analysis found calorie monitoring to have an important, non-random relationship to obesity level. However, whether or not calorie monitoring decreased an individual’s obesity level required further analysis, which is shown in the linear regression model. However, as with any medical advice, calorie counting should be prescribed with caution. Even individuals that are overweight or obese could potentially develop eating disorders or other unsafe eating habits, and calorie intake is not a one-size-fits-all approach and needs to be tailored to each individual in a safe manner by an expert.
Physical activity is an important component of weight management and living a healthy life - it has been shown in research to improve mental health, sleep quality, and mood. Additionally, if can have benefits on other aspects of health, like cardiovascular health and flexibility. Recommending sustainable and enjoyable physical activity, and potentially increasing it’s frequency by 1 or 2 days a week depending on the individual’s current exercise habits, is something I would recommend to physicians working with overweight or obese patients. Again, as with anything, the patient’s physical abilities (including any previous injuries) and their fitness level need to be taken into account before recommending this so that the patient does not injure themselves.
Because there was no evidence of any explanatory relationship between obesity level and frequency of alcohol consumption, I would not make any recommendations to physicians about telling patients to lower/limit alcohol consumption as a weight management tactic. That being said, I think that other research shows that limiting alcohol consumption is good for people in a broader sense, and consuming it in moderation is what is mostly recommended for a balanced, healthy lifestyle.
While the linear regression model shows that there is a non-random relationship between obesity level (BMI), monitoring calories and frequency of physical activity, it does have some limitations. Based on my analysis of it, with the five graphs I made, there are some times in which it violates the assumptions of linear models, and might have some outliers. However, it did confirm two of the three variables I looked into were good recommendations for weight management.
Because this analysis is dealing with health and obesity, it is important that recommendations, while based on this analysis, be given to patients in a holistic manner, taking into account their lifestyle genetics, current behaviors, and other things that might complicate introducing these weight management habits into everyday life. Future research and analysis should look more into the other variables presented in this dataset, and the outliers presented in the linear regression model analysis to further determine the way these habits impact weight and can be used in weight management.