Introduction

My final project is focused on determining which lifestyle factors, including things like frequency of exercise and counting calories, and genetic factors, like a family history of obesity, are good predictors for whether or not someone will be overweight or obese. This project will explore various variables and their relationship to obesity level using a dataset from UC Irvine Machine Learning Repository, Estimation of Obesity Levels Based on Eating Habits and Physical Condition. This dataset, while technically a machine learning dataset designed with predicting obesity level based on eating habits and physical condition, has 23% of it’s data collected from actual people (although the other 77% is generated) in Mexico, Peru, and Colombia. There are 16 features and 2,111 instances with no missing values.

The main goal with this project is to determine how these individual lifestyle habits and genetic backgrounds influence obesity level, and if any prescriptive advice can be given.

First - load the dataset & necessary libraries

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## Warning: package 'lubridate' was built under R version 4.1.3
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.2     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.1
## v ggplot2   3.5.1     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
obesity <- read.csv(file.choose())

The Genetic Component

One of the most interesting things I wanted to explore with this dataset, before I started exploring the influence that lifestyle factors had on obesity, was whether or not the genetic component matters. If there is a shown relationship between a family history of obesity/being overweight and an individual being overweight/obese, then preventative advice could be given to a patient, even if they themselves are currently at a normal weight (based off BMI).

First, I visualized the split of who has a family history of being overweight/obese and who doesn’t, separated by obesity level.

library(ggplot2)

palette1 <- c("deepskyblue2", "deepskyblue4")

ggplot(obesity, aes(x = NObeyesdad, fill = family_history_with_overweight)) +
  geom_bar() +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(title = 'Family History of Obesity by Obesity Level', x = 'Obesity Level', y = 'Number of Individuals') +
  scale_fill_manual(values = palette1)

From this visualization, it appears that there are visibly fewer individuals who are of insufficient or normal weight with overweight or obese family members. Contrasting this, individuals in obesity type I, II, and III almost entirely have family members who are overweight or obese. This visualization suggests that having a family member who was or is overweight or obese increases your chance of being overweight or obese.

To confirm this, it is important to run a statistical test, so the next step would be to run a hypothesis test. Because both of these variables are categorical (family_history_with_overweight being ‘yes’ or ‘no’ and NObeyesdad being one of the 7 obesity levels), I will be running a Chi Squared Test for Independence. I will be using an alpha value of 0.05, or am willing to accept a 5% chance of rejecting my null. I selected this value because it is standard, and because this data is primarily generated data, I want to give it a bit of wiggle room. Additionally, there is not a lot at stake here - if the null is rejected when it really should not be, the worst that happens is someone is given guidance to monitor their weight.

The null hypothesis is that: There is no relationship between having a family history of being overweight/obese and an individual’s susceptibility to being overweight/obese

#Create a contingency table with my two variables 
test1 <- table(obesity$family_history_with_overweight, obesity$NObeyesdad)

#Run the chi-squared test
chisquare1 <-chisq.test(test1)

print(chisquare1)
## 
##  Pearson's Chi-squared test
## 
## data:  test1
## X-squared = 621.98, df = 6, p-value < 2.2e-16

The results of this Chi-Squared test suggest that the null hypothesis should be rejected, and there is a significant relationship between family history of overweight/obesity and current obesity level. The p-value is much smaller than 0.05 at 2.2e-16.

What these steps might be, exactly, is the next part of my analysis: what lifestyle factors contribute to obesity?

Lifestyle Factor 1: Calorie Counting

One of the variables included in my dataset is a categorical variable, SCC, which encodes whether or not an individual monitors the calories they consume in a day. Modern science and research on obesity and calories tells us that at the core of weight loss is calories in versus calories out, or burning more calories than you consume. Research also tells us that the most efficient way to make sure this is actually helping is to track calories.

First, I visualized the division of who does or does not monitor their calories by their obesity level.

palette2 = c("darkslategray3", "darkslategrey")

ggplot(obesity, aes(x = NObeyesdad, fill = SCC)) +
  geom_bar() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = 'Calorie Monitoring by Obesity Level', x = 'Number of Individuals', y = 'Obesity Level') +
  scale_fill_manual(values = palette2)

Based on this visualization, it seems that hardly anyone at any obesity level is monitoring their calories, with obesity type I, II, and III having almost no calorie monitors, and the other sections having an insignificant amount.

To confirm that this is the case, I will run another Chi-Squared test, because both of these variables are categorical again. I am using the same alpha, for the same reasons as in my last hypothesis test.

The null hypothesis is: There is no relationship between monitoring calories and an individual’s obesity level.

#Create a contingency table with my two variables
test2 <- table(obesity$NObeyesdad, obesity$SCC)

#Run the chi-squared test
chisquare2 <- chisq.test(test2)

print(chisquare2)
## 
##  Pearson's Chi-squared test
## 
## data:  test2
## X-squared = 123.02, df = 6, p-value < 2.2e-16

While this p-value suggests that the null hypothesis should be rejected and there is, in fact, a significant relationship between monitoring calories and obesity level, I would not fully come to that conclusion. The chi-square test, and any hypothesis test, does not account for the direction of the relationship between the variables (that is, whether counting calories decreases or increases obesity). My best assumption here would be that, because there are less people that count calories in the higher obesity levels, that not counting calories leads to obesity.

Lifestyle Factor 2: Physical Activity

The second lifestyle factor I wanted to explore was frequency of physical activity. While calorie counting is touted as the best way to burn more calories than you consume, increasing physical activity is another way, although it’s less recommended by professionals in the field.

Note: This variable is encoded in the following way:

I started by visualizing the average frequency of physical activity (per week) by obesity level in a box plot.

ggplot(obesity, aes(x = NObeyesdad, y = FAF)) +
  geom_boxplot() +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(title = 'Frequency of Physical Activity versus Obesity Level', x = 'Obesity Level', y = 'Frequency of Physical Activity (per week)') 

Based on this visualization, it does not appear that there is a significant difference between obesity level and frequency of physical activity per week. Most obesity levels, excluding insufficient weight and obesity type III, exercise one or two times a week on average. Based on this, I would venture to say that there is not a significant relationship between the frequency of physical activity and obesity level.

To confirm this, I will run another hypothesis test. This time, it will be an ANOVA, as I will be using the mean frequency of physical activity for each obesity level.

The null hypothesis is: There is no relationship between the frequency of physical activity and an individual’s level of obesity.

anova <- aov(FAF ~ NObeyesdad, data = obesity)

summary(anova)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## NObeyesdad     6   72.5  12.084   17.48 <2e-16 ***
## Residuals   2104 1454.1   0.691                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Despite the fact that every group has approximately the same mean value for frequency of physical activity, there is a statistically significant relationship between the frequency of physical activity and obesity level, meaning we reject the null hypothesis. I think I personally was not expecting a statistical significance because of the similarity between means, but neglected to consider that the statistical significance just means the relationship is not random.

Lifestyle Factor 3: Alcohol Consumption

The third and final lifestyle factor I wanted to explore in my analysis was alcohol consumption and how it related to obesity level. As limiting calories is the most commonly prescribed weight loss method, many professionals recommend limiting alcohol consumption, as it’s easy to overindulge in calories when drinking.

As with the other two lifestyle factors, I will start by visualizing this comparison.

palette4 = c("Always" = "darkseagreen1", "Sometimes" = "darkseagreen4", "Frequently" = "darkolivegreen", "no" = "darkslategrey")

ggplot(obesity, aes(x = NObeyesdad, fill = CALC)) +
  geom_bar(position = "stack") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Alcohol Consumption by Obesity Level", x = "Obesity Level", y = "Frequency of Alcohol Consumption") +
  scale_fill_manual(values = palette4)

Based on this visualization, it appears that almost everyone in every obesity level drinks alcohol “sometimes”, which I would assume is true for most of the population of each country included in this dataset, as well as many countries not included. However, it appears that insufficient weight and normal weight categories, as well as the lower overweight categories, have a higher number of people who do not drink alcohol, so stopping alcohol consumption might be a weight management tactic.

To confirm if there is any relationship, I will run a Chi-Squared test.

The null hypothesis is: There is no relationship between frequency of alcohol consumption and an individual’s obesity level.

#Create a contingency table
test3 <- table(obesity$NObeyesdad, obesity$CALC)

#Run the chi-squared test
chisquare3 <-chisq.test(test3)
## Warning in chisq.test(test3): Chi-squared approximation may be incorrect
print(chisquare3)
## 
##  Pearson's Chi-squared test
## 
## data:  test3
## X-squared = 338.58, df = 18, p-value < 2.2e-16

To first address the warning from the chi-squared test: There is a very low number in the “Always” category (less than 1 for each obesity level). However, despite that, the p-value is still much smaller than the alpha value of 0.05, which means that we should reject the null hypothesis and conclude that there is a relationship between the frequency of alcohol consumption and obesity level.

Regression Model

To fully explore the relationship and impact that these three lifestyle factors have on obesity level, the last thing I will do for this analysis is build a linear regression model.

The first thing I need to do is establish a BMI column. I will be using this instead of obesity level since it is continuous instead of categorical.

For reference:

The relationship between BMI and the NObeyesdad category (obesity level) is as follows:

Create the BMI column to use as a response variable in the model:

obesity <- obesity |>
  mutate (BMI = Weight / (Height ^ 2))

Load in the necessary libraries for this model:

library(ggthemes)
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.1.3
library(lindia)

Create the model and print a summary:

lrmodel <- lm(BMI ~ FAF + SCC + CALC, obesity)

summary(lrmodel)
## 
## Call:
## lm(formula = BMI ~ FAF + SCC + CALC, data = obesity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.4201  -5.5022   0.2173   5.7866  20.0852 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     23.7806     7.5927   3.132  0.00176 ** 
## FAF             -1.2893     0.1966  -6.558 6.83e-11 ***
## SCCyes          -6.3778     0.7966  -8.007 1.93e-15 ***
## CALCFrequently   5.5560     7.6449   0.727  0.46745    
## CALCno           5.1129     7.5963   0.673  0.50097    
## CALCSometimes    8.7095     7.5930   1.147  0.25149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.59 on 2105 degrees of freedom
## Multiple R-squared:  0.1045, Adjusted R-squared:  0.1024 
## F-statistic: 49.13 on 5 and 2105 DF,  p-value: < 2.2e-16

Based on this summary:

To verify these result, I will not evaluate the model using 5 graphs to verify if they violate the assumptions of linear regression models:

  1. Residuals versus Fitted Values

    gg_resfitted(lrmodel) +
      geom_smooth(se=FALSE)
    ## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

    This plot suggests that the assumptions about linear models are being violated. The line curves towards the end, which suggests there is not a clear linear trend, so the linearity assumption is violated. Additionally, the values fan out towards the right side of the graph (as the values increase), indicating that the homoscedasticity assumption is violated as well. However, there does not seem to be any unique trends or clusters, so the independence and normality assumptions are not violated.

  2. Residuals versus X-Axis

    plots <- gg_resX(lrmodel, plot.all = TRUE)

Looking first at Residual vs. FAF:

Looking second at Residual vs. SCC:

Looking finally at Residual vs. CALC:

  1. Residual Histogram

    gg_reshist(lrmodel)
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This histogram has a very rough bell-shaped curve, but enough that I would consider it “normal”, so the normality assumption might be violated here. Additionally, the tails are uneven, with the left side being much higher than the right side. I would determine ultimately that this histogram suggests that the assumptions of linear models are violated, and that there might be some outliers in the data due to the spiked of bars in the histogram.

  1. QQ-Plot

    gg_qqplot(lrmodel)

In this Normal-QQ Plot, it is apparent that the mode is decently linear between -1 and 1, which implies that the data is normally distributed for most of this model. However, the data deviates at both of the tails, which suggests that there might be outliers than my linear regression model is not handling.

  1. Cook’s D Observation

    gg_cooksd(lrmodel, threshold = 'matlab')

Based on this Cook’s D Plot, there are two influential columns of data in this model: column 27 and column 69, with column 27 being much more influential. These columns could contain outliers, or skew the model, but because there are a very small number of influential columns, I don’t think that this leads me to think that the model violates the assumptions. If anything, I would want to look at those two observations, and potentially remove them from the model to see if they help the model be a better fit.

Conclusions & Final Recommendations