Load Dataset

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## Warning: package 'lubridate' was built under R version 4.1.3
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.2     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.1
## v ggplot2   3.5.1     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
obesity <- read.csv(file.choose())
obesity <- obesity |>
  mutate(BMI = Weight / Height^2)  # body mass index from weight and height
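
As a quick sanity check on the BMI formula, the sketch below applies the same calculation to a single made-up person. It assumes weight is recorded in kilograms and height in meters, which the formula needs but which is not stated above; the values 70 and 1.75 are hypothetical.

```r
# Hypothetical person; assumes weight in kilograms and height in meters
weight_kg <- 70
height_m  <- 1.75

# Same formula as the mutate() call: weight divided by height squared
bmi <- weight_kg / height_m^2
round(bmi, 2)
```

If the result lands far outside the usual adult range (roughly the teens to the forties), that would hint that the columns use different units than assumed here.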

Select an interesting binary column of data

# The column I'm selecting is SCC. Because its values are 'yes' and 'no', I am recoding yes = 1 and no = 0

obesity$SCC_binary <- ifelse(obesity$SCC == "yes", 1, 0)
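
One way to confirm the recoding worked is to cross-tabulate the original values against the new ones. The sketch below does this on a small stand-in vector (the values are hypothetical, not the real SCC column), since the real data are loaded interactively.

```r
# Toy stand-in for the SCC column (hypothetical values, not the real data)
scc <- c("yes", "no", "no", "yes", "no")

# Same recoding as above: "yes" becomes 1, any other non-missing value becomes 0
scc_binary <- ifelse(scc == "yes", 1, 0)

# Cross-tabulate original against recoded values; off-diagonal cells should be 0
table(scc, scc_binary)
```

On the real data the equivalent call would be `table(obesity$SCC, obesity$SCC_binary)`. Note that `ifelse()` keeps missing values as `NA` rather than turning them into 0.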

I selected SCC (do you monitor the calories you eat daily?) as the binary column to build my logistic regression model around. It was easy to encode as a binary variable (all I had to do was convert no to 0 and yes to 1), and it is plausibly related to many of the factors that are generally accepted in medicine as contributing to obesity. For one, monitoring calories is often part of a health-conscious lifestyle, so individuals who answer ‘yes’ to SCC might be making more balanced diet and exercise choices. Additionally, whether or not someone monitors their calories might indicate whether they fall into other groups, such as a certain BMI/obesity level or age group, or whether they partake in other healthy lifestyle habits (not smoking, walking for transportation, more frequent physical activity).

Build a logistic regression model for this variable

The explanatory variables I chose to use for this logistic regression model are:

  1. BMI: It makes sense to me that people with a lower BMI might be more likely to monitor their calories
  2. Age: Based on what I’ve seen in my day-to-day life, I feel like individuals in older age groups tend to be more likely to track their calories than individuals in younger demographics
  3. FAF (frequency of physical activity): As I mentioned when explaining my choice of binary variable, a higher frequency of physical activity might predict whether or not a person monitors their calories
model <- glm(SCC_binary ~ BMI + Age + FAF, data = obesity,
             family = binomial(link = 'logit'))

model$coefficients
## (Intercept)         BMI         Age         FAF 
##  1.18952349 -0.10932906 -0.06947382  0.17499101
summary(model)
## 
## Call:
## glm(formula = SCC_binary ~ BMI + Age + FAF, family = binomial(link = "logit"), 
##     data = obesity)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8436  -0.3371  -0.2170  -0.1334   3.1660  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.18952    0.62649   1.899   0.0576 .  
## BMI         -0.10933    0.01845  -5.926  3.1e-09 ***
## Age         -0.06947    0.02793  -2.487   0.0129 *  
## FAF          0.17499    0.11941   1.465   0.1428    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 780.96  on 2110  degrees of freedom
## Residual deviance: 690.38  on 2107  degrees of freedom
## AIC: 698.38
## 
## Number of Fisher Scoring iterations: 7
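
To make the fitted model more concrete, the sketch below plugs the coefficients from the output above into the logistic formula for one made-up individual. The values chosen for BMI, age and FAF are hypothetical and only illustrate the arithmetic.

```r
# Coefficients copied from the model output above
b <- c(intercept = 1.18952349, BMI = -0.10932906, Age = -0.06947382, FAF = 0.17499101)

# Hypothetical individual (values chosen only for illustration)
bmi <- 25
age <- 30
faf <- 1

# Linear predictor on the log-odds scale
log_odds <- b[["intercept"]] + b[["BMI"]] * bmi + b[["Age"]] * age + b[["FAF"]] * faf

# Inverse logit converts log-odds to a predicted probability of SCC = yes
prob <- plogis(log_odds)
prob
```

With the fitted object available, `predict(model, newdata = data.frame(BMI = 25, Age = 30, FAF = 1), type = "response")` should give the same probability up to rounding of the copied coefficients.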

Interpret the coefficients & explain what they mean

BMI: The BMI coefficient is -0.10932906 (-0.11), which means that for every one-point increase on the BMI scale, holding age and frequency of physical activity (FAF) constant, the log-odds of the patient monitoring their calories decrease by 0.11. Equivalently, the odds are multiplied by exp(-0.11) ≈ 0.90, a decrease of roughly 10% per BMI point. The size of a coefficient on its own does not say how reliable it is, but the p-value here (3.1e-09) is far below 0.05, so I am fairly confident that a higher BMI is associated with lower odds of monitoring calories, which fits my initial guess.

Age: The age coefficient is -0.06947382, which means that for every one-year increase in age, holding BMI and FAF constant, the log-odds of a patient monitoring their calories decrease by 0.07, i.e. the odds are multiplied by exp(-0.07) ≈ 0.93 (about a 7% decrease). The size of the coefficient on its own does not measure the strength of the evidence; the p-value (0.0129) is below 0.05, so this relationship is statistically significant at the 5% level. It suggests that the older someone is, the less likely they are to monitor their calories, which is the opposite of what I thought to be true.

FAF: The FAF coefficient is 0.17499101, which means that for every one-unit increase in FAF (frequency of physical activity), holding BMI and age constant, the log-odds of a patient monitoring their calories increase by 0.17, i.e. the odds are multiplied by exp(0.17) ≈ 1.19 (about a 19% increase). The direction aligns with my guess at the beginning of this assignment, that people who exercise more often are more likely to monitor their calories. However, the p-value (0.1428) is above 0.05, so this estimate is not statistically significant, and even if the relationship is real, the cause may not be an overall healthier lifestyle.
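
Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for the three slope coefficients, copied by hand from the model output above.

```r
# Coefficients copied from the model output above (log-odds scale)
b <- c(BMI = -0.10932906, Age = -0.06947382, FAF = 0.17499101)

# Exponentiate to get odds ratios: the multiplicative change in the odds
# of SCC = yes for a one-unit increase in each predictor
odds_ratio <- exp(b)

# Percent change in the odds per one-unit increase
pct_change <- (odds_ratio - 1) * 100

round(cbind(odds_ratio, pct_change), 3)
```

With the fitted object available, `exp(coef(model))` gives the same odds ratios without retyping the numbers. An odds ratio below 1 means the odds go down as the predictor goes up, and above 1 means they go up.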

Building a CI for FAF

faf_coef <- 0.17499101  # FAF estimate from the model output
faf_se <- 0.11941       # FAF standard error from summary(model)
faf_z <- 1.96           # z critical value for a 95% CI

faf_lb <- faf_coef - (faf_z * faf_se)
faf_ub <- faf_coef + (faf_z * faf_se)

faf_ci <- c(faf_lb, faf_ub)

faf_ci
## [1] -0.05905259  0.40903461

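Based on this confidence interval, I would conclude that there is not a statistically significant relationship between monitoring calories and frequency of physical activity, even though the positive FAF coefficient in my GLM suggested that more physical activity would increase the likelihood of individuals monitoring their calories. I came to this conclusion because my 95% CI, which is on the log-odds scale, includes 0, meaning the data are consistent with no effect. This agrees with the FAF p-value of 0.1428 in the model summary.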
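
The same interval can also be expressed on the odds-ratio scale, where "no effect" corresponds to 1 instead of 0. The sketch below rebuilds the Wald interval from the estimate and standard error reported above, uses `qnorm(0.975)` in place of the rounded 1.96, and recomputes the z statistic and two-sided p-value as a cross-check against the summary table.

```r
# Estimate and standard error for FAF, copied from the summary output above
faf_coef <- 0.17499101
faf_se   <- 0.11941

# Exact 97.5th percentile of the standard normal (about 1.96)
z_crit <- qnorm(0.975)

# Wald 95% CI on the log-odds scale, then exponentiated to the odds-ratio scale
ci_log_odds   <- faf_coef + c(-1, 1) * z_crit * faf_se
ci_odds_ratio <- exp(ci_log_odds)

# Wald z statistic and two-sided p-value for the FAF coefficient
z_stat  <- faf_coef / faf_se
p_value <- 2 * pnorm(-abs(z_stat))

ci_odds_ratio
p_value
```

An odds-ratio interval that spans 1 tells the same story as a log-odds interval that spans 0. With the fitted object available, `confint.default(model)` should return Wald intervals like this one for every coefficient.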

Because this is inconclusive, I think it would be good to know why individuals choose to monitor or not monitor their calories. For some individuals, it may be part of weight loss or just a lifestyle choice, but people monitoring calories could also be attempting to gain weight, or at the very least not lose weight, and this could affect the amount of physical activity that they are doing as well. Additionally, when discussing obesity, weight loss, calorie counting, and exercise, there are many more nuances than what this dataset touches on, so I’m not entirely surprised that some of these relationships, like the one with FAF, turned out not to be statistically significant.