Final Project

Author

Emme Gunther

Introduction

Research Question: How does a person’s age, height, weight, and gender contribute to their family history of obesity?

About the dataset: The dataset is titled “Estimation of Obesity Levels Based On Eating Habits and Physical Condition” and comes from the UC Irvine Machine Learning Repository. It contains information about obesity levels in individuals from Mexico, Peru, and Colombia, based on their eating habits and physical condition. I chose this dataset because I was interested in understanding the genetic factors of obesity and how it can be hereditary. The dataset has 2,111 observations and 16 variables; I will use the 5 defined below.

Gender: categorical, binary variable containing the gender of the person
Height: numerical, continuous variable containing the height of the person
Weight: numerical, continuous variable containing the weight of the person
Age: numerical, continuous variable containing the age of the person
family_history_with_overweight: categorical, binary variable containing information on whether the person has a family member who suffered/suffers from obesity

Load Dataset

library(tidyverse)  #attaches dplyr, ggplot2, and readr (for read_csv)
library(pROC)       #roc() and auc() for the ROC curve

setwd("~/Desktop/datasets")

obesity <- read_csv("ObesityDataSet_raw_and_data_sinthetic.csv")

Data Analysis

For my data analysis, I will select the 5 of the 16 variables needed for my logistic regression. Next, I will check the structure of the data and convert variables where needed (e.g., rounding age to whole years and recoding the categorical variables as 0/1 factors) to make the regression process easier.

Cleaning

#select the 5 variables needed
obesity <- obesity |>
  select(c(Age, Weight, Height, family_history_with_overweight, Gender)) |>
  #change Age to discrete by rounding it
  mutate(Age = round(Age)) |>
  #change gender into binary codes (male = 1, female = 0)
  mutate(Gender = if_else(Gender == "Male", 1, 0)) |>
  #change family_history_with_overweight into binary codes (yes = 1, no = 0)
  mutate(family_history_with_overweight = if_else(family_history_with_overweight == "yes", 1, 0)) |>
  #change family history variable to a factor class
  mutate(family_history_with_overweight = as.factor(family_history_with_overweight)) |>
  #change gender variable to a factor class
  mutate(Gender = as.factor(Gender))

#double check work, confirm that age and family history variables are correct
str(obesity)

tibble [2,111 × 5] (S3: tbl_df/tbl/data.frame)
 $ Age                           : num [1:2111] 21 21 23 27 22 29 23 22 24 22 ...
 $ Weight                        : num [1:2111] 64 56 77 87 89.8 53 55 53 64 68 ...
 $ Height                        : num [1:2111] 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
 $ family_history_with_overweight: Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 2 2 ...
 $ Gender                        : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 1 2 2 2 ...

#the dataset source states that there are no missing values. Double check this information
sum(is.na(obesity))

[1] 0

unique(obesity$family_history_with_overweight)

[1] 1 0
Levels: 0 1

Statistical Analysis

For my statistical analysis, I have chosen logistic regression with a binary outcome variable. This analysis requires a model equation with a binary outcome variable and predictor variables, followed by checks of the model's performance (a confusion matrix and performance metrics). I have chosen this because there is a clear outcome variable (family_history_with_overweight), which is binary, and a set of predictors that can help explain it (age, weight, gender, height).

#create final model
logistic <- glm(family_history_with_overweight ~ Age + Weight + Height + Gender, data=obesity, family="binomial")
#calculate model summary
summary(logistic)


Call:
glm(formula = family_history_with_overweight ~ Age + Weight + 
    Height + Gender, family = "binomial", data = obesity)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -10.65812    1.64845  -6.466 1.01e-10 ***
Age           0.01074    0.01302   0.825 0.409612    
Weight        0.08966    0.00565  15.868  < 2e-16 ***
Height        3.46943    0.98718   3.514 0.000441 ***
Gender1      -0.88339    0.18939  -4.664 3.10e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2005.4  on 2110  degrees of freedom
Residual deviance: 1292.3  on 2106  degrees of freedom
AIC: 1302.3

Number of Fisher Scoring iterations: 6

Intercept, Log-odds Interpretation

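The intercept is the log-odds of having a family history of obesity when Age, Height, and Weight are all zero and Gender = 0 (female): -10.65812. Because no person has zero age, height, or weight, this value has no practical interpretation on its own; it only sets the model's baseline.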
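The log-odds scale becomes easier to read when the coefficients are applied to a realistic person. The sketch below does this by hand for the first row of the cleaned data (Age 21, Weight 64, Height 1.62, Gender 0), using the estimates printed in the model summary. The units (kilograms and metres) are an assumption based on the size of the values, and the variable names `b`, `x`, `log_odds`, and `prob` are only illustrative.

```r
# Coefficients copied from the summary(logistic) output above
b <- c(intercept = -10.65812, age = 0.01074, weight = 0.08966,
       height = 3.46943, male = -0.88339)

# First row of the cleaned data: a 21-year-old woman, 64 kg, 1.62 m
# (units are assumed, the report does not state them)
x <- c(intercept = 1, age = 21, weight = 64, height = 1.62, male = 0)

log_odds <- sum(b * x)     # linear predictor on the log-odds scale
prob <- plogis(log_odds)   # inverse logit: exp(z) / (1 + exp(z))

round(c(log_odds = log_odds, probability = prob), 3)
```

For this person the log-odds come out just under 1, which corresponds to a predicted probability of roughly 0.72. With the fitted model object, `predict(logistic, type = "response")` should return the same kind of value for every row.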

Odds Ratio, Strongest Predictors

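Weight, Height, and Gender (male relative to female) are statistically significant predictors of having a family history of obesity (p < 0.05); Age is not (p = 0.41). Heavier and taller people have higher odds of a family history, while men have lower odds than women once age, weight, and height are held constant.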
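The coefficients in the summary are on the log-odds scale, so exponentiating them gives odds ratios. The following is a minimal sketch that works from the printed estimates and standard errors rather than the model object; the 95% intervals use the Wald approximation (estimate ± 1.96 × SE), which is an assumption of this sketch and not something the analysis above reports.

```r
# Estimates and standard errors copied from the summary(logistic) output
est <- c(Age = 0.01074, Weight = 0.08966, Height = 3.46943, Gender1 = -0.88339)
se  <- c(Age = 0.01302, Weight = 0.00565, Height = 0.98718, Gender1 = 0.18939)

# Exponentiate the log-odds coefficients to get odds ratios,
# with approximate 95% Wald intervals (estimate +/- 1.96 * SE)
odds_ratios <- data.frame(
  OR    = exp(est),
  lower = exp(est - 1.96 * se),
  upper = exp(est + 1.96 * se)
)

round(odds_ratios, 3)
```

Read this way, each additional unit of weight multiplies the odds of a family history by about 1.09, and being male multiplies them by about 0.41. The Height odds ratio looks very large only because one unit of Height is a full unit on a variable whose values sit near 1.5 to 1.8. With the fitted model available, `exp(coef(logistic))` and `exp(confint.default(logistic))` should give the same quantities directly.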

Confusion Matrix

#there are no missing values in the dataset, and family history is already coded 0/1 (stored as a factor). no need to convert or filter for complete cases

#predicted probabilities
predicted.probs <- logistic$fitted.values


#predicted classes
predicted.classes <- ifelse(predicted.probs > 0.5, 1, 0)


#confusion matrix
confusion <- table(
  Predicted = factor(predicted.classes, levels = c(0, 1)),
  Actual = factor(obesity$family_history_with_overweight, levels = c(0, 1))
)

confusion

         Actual
Predicted    0    1
        0  187  101
        1  198 1625

Confusion Interpretation

187 people had no family history of obesity, and the model said they had no family history of obesity. (true negative)
101 people had a family history of obesity, but the model said they had no family history of obesity. (false negative)
198 people had no family history of obesity, but the model said that they had a family history of obesity. (false positive)
1625 people had a family history of obesity, and the model said that they had a family history of obesity (true positive)
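To keep the four cells from being mixed up when they are copied by hand, the counts can be read off the table by label. The sketch below rebuilds the confusion matrix from the printed counts so that it runs on its own; in the analysis itself the existing `confusion` object could be indexed the same way.

```r
# Rebuild the confusion matrix from the counts printed above
# (rows = predicted class, columns = actual class)
confusion <- as.table(matrix(
  c(187, 198,     # column Actual = 0: predicted 0, predicted 1
    101, 1625),   # column Actual = 1: predicted 0, predicted 1
  nrow = 2,
  dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1"))
))

# Index as confusion[predicted, actual] so each cell is named explicitly
TN <- confusion["0", "0"]   # predicted no,  actually no
FN <- confusion["0", "1"]   # predicted no,  actually yes
FP <- confusion["1", "0"]   # predicted yes, actually no
TP <- confusion["1", "1"]   # predicted yes, actually yes

c(TN = TN, FN = FN, FP = FP, TP = TP)
```

Indexing by the `"0"` and `"1"` labels ties each count to its row and column, so a false positive is always "predicted 1, actual 0" regardless of how the table is printed.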

Performance Matrix

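#Extract values from the confusion matrix (rows = predicted, columns = actual):
TN <- 187   #predicted 0, actual 0
FP <- 198   #predicted 1, actual 0
FN <- 101   #predicted 0, actual 1
TP <- 1625  #predicted 1, actual 1

#Metrics
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   #share of actual positives the model detects
specificity <- TN / (TN + FP)   #share of actual negatives the model detects
precision <- TP / (TP + FP)     #share of predicted positives that are correct

cat("Accuracy:", round(accuracy, 3), "\nSensitivity:", round(sensitivity, 3), "\nSpecificity:", round(specificity, 3), "\nPrecision:", round(precision, 3))

Accuracy: 0.858 
Sensitivity: 0.941 
Specificity: 0.486 
Precision: 0.891

Model Performance Interpretation

The model has high overall accuracy (85.8%), but it is much worse at identifying people without a family history of obesity (specificity 48.6%) than people with one (sensitivity 94.1%). The gap of roughly 46 percentage points shows that the balance between detecting positives and avoiding false alarms is weak: just over half of the people with no family history are flagged as having one.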
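One way to put the 85.8% accuracy in context is to compare it with a rule that ignores the predictors and assigns everyone to the larger class. The sketch below computes that baseline from the confusion matrix counts; the names `baseline_accuracy` and `model_accuracy` are only illustrative.

```r
# Counts taken from the confusion matrix above
actual_yes <- 101 + 1625   # people who actually have a family history
actual_no  <- 187 + 198    # people who actually do not
n <- actual_yes + actual_no

# Accuracy of a rule that predicts the majority class ("yes") for everyone
baseline_accuracy <- max(actual_yes, actual_no) / n

# Accuracy of the logistic model (correct predictions / all predictions)
model_accuracy <- (1625 + 187) / n

round(c(baseline = baseline_accuracy, model = model_accuracy), 3)
```

Because about 82% of the sample has a family history of obesity, always predicting "yes" would already be right about 82% of the time, so the model's accuracy is only about four percentage points better than that baseline.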

ROC Curve and AUC Value

# ROC curve & AUC on full data
roc_obj <- roc(response = obesity$family_history_with_overweight,
               predictor = logistic$fitted.values,
               levels = c("0", "1"),
               direction = "<")

# Print AUC value
auc_val <- auc(roc_obj); auc_val

Area under the curve: 0.8859

plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
         xlab = "False Positive Rate (1 - Specificity)",
         ylab = "True Positive Rate (Sensitivity)")

An AUC of 0.886 means the model is good at distinguishing between people who have a family history of obesity and people who don’t.

In plain words: if you randomly pick one person in the dataset who has a family history of obesity and one person who doesn’t, there is about an 88.6% chance that the model assigns the higher predicted probability to the person who does have a family history.
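The pairwise reading of the AUC can be checked directly on a small example. The sketch below uses made-up scores (not the obesity data) and computes the AUC two ways: by comparing every "yes" score with every "no" score, and with the equivalent rank-sum formula.

```r
# Hypothetical toy scores, not taken from the obesity data
set.seed(1)
pos_scores <- rnorm(50, mean = 1)   # predicted scores for "yes" cases
neg_scores <- rnorm(40, mean = 0)   # predicted scores for "no" cases

# Compare every (yes, no) pair; ties count as half a win
wins <- outer(pos_scores, neg_scores, ">")
ties <- outer(pos_scores, neg_scores, "==")
auc_by_pairs <- mean(wins + 0.5 * ties)

# Same quantity from the Mann-Whitney rank-sum formula
r <- rank(c(pos_scores, neg_scores))
n_pos <- length(pos_scores)
n_neg <- length(neg_scores)
auc_by_ranks <- (sum(r[seq_len(n_pos)]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

c(pairs = auc_by_pairs, ranks = auc_by_ranks)
```

Both numbers agree, which is why the AUC can be read as the share of yes/no pairs that the model orders correctly. Applied to `logistic$fitted.values` split by the observed outcome, the same calculation should reproduce the value reported by `auc()`.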

Conclusion

According to the logistic regression model, the answer to my research question, “How does a person’s age, height, weight, and gender contribute to their family history of obesity?”, is that height, weight, and gender (specifically being male) are significantly associated with whether a person has a family history of obesity. Age was not a significant predictor. The confusion matrix showed that most people in the dataset were true positives: the model correctly said that they had a family history of obesity. The performance metrics supported this, showing that the model was much better at identifying people with a family history (sensitivity: 1,625 of 1,726 correctly classified) than people without one (specificity: 187 of 385 correctly classified). These results are not strong enough for a binary classification model on medical data. The dataset is heavily imbalanced toward “yes” values in the family history column, so the model is biased toward predicting a family history, and the gap between sensitivity and specificity reflects this. The model over-predicts positives too much to be used in real-world medical research.

If I were to further my research, the first thing I would do is add a BMI variable, combining height and weight into a single measure more appropriate for medical research on obesity. I would also like to try using a sample of the data that gives more representation to the cases with no family history of obesity, to balance the dataset and get less biased results.
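As a starting point for the BMI idea, the sketch below computes BMI for the first five rows shown in the str() output. It assumes Weight is in kilograms and Height is in metres, which matches the size of the values but is not stated in this report; the data frame name `people` is only illustrative.

```r
# First five rows of Weight and Height from the str() output above;
# units are assumed to be kilograms and metres
people <- data.frame(
  Weight = c(64, 56, 77, 87, 89.8),
  Height = c(1.62, 1.52, 1.80, 1.80, 1.78)
)

# BMI = weight (kg) / height (m)^2
people$BMI <- round(people$Weight / people$Height^2, 1)

people
```

In the cleaning pipeline this would amount to one extra step, `mutate(BMI = Weight / Height^2)`, after which BMI could be tried in the model formula in place of Weight and Height.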

References

Mendoza Palechor, Fabio, and Alexis De la Hoz Manotas. Estimation of Obesity Levels Based On Eating Habits and Physical Condition. 2019, UCI Machine Learning Repository, https://doi.org/10.24432/C5H31Z.