library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## Warning: package 'lubridate' was built under R version 4.1.3
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr 1.1.2 v readr 2.1.4
## v forcats 1.0.0 v stringr 1.5.1
## v ggplot2 3.5.1 v tibble 3.2.1
## v lubridate 1.9.2 v tidyr 1.3.0
## v purrr 1.0.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
obesity <- read.csv(file.choose())
obesity <- obesity |>
  mutate(BMI = Weight / (Height^2))
# The column I'm selecting is SCC. Because it is recorded as 'yes'/'no', I am converting it to yes = 1 and no = 0
obesity$SCC_binary <- ifelse(obesity$SCC == "yes", 1, 0)
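A quick sanity check on this kind of encoding can be sketched as follows. The toy data frame below is made up for illustration (it is not the obesity data) and only assumes that SCC holds the strings "yes" and "no".

```r
# Hypothetical toy data standing in for the SCC column (not the real dataset)
toy <- data.frame(SCC = c("yes", "no", "no", "yes", "no"))

# Same encoding as above: "yes" becomes 1, any other non-missing value becomes 0
toy$SCC_binary <- ifelse(toy$SCC == "yes", 1, 0)

# Cross-tabulate the original and encoded columns; every "yes" should
# fall under 1 and every "no" under 0, with zeros in the other cells
check <- table(toy$SCC, toy$SCC_binary)
print(check)
```

Running the same `table()` call on the real `obesity` data frame would also reveal unexpected values such as "Yes" or blanks, which `ifelse()` would otherwise silently encode as 0.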
I selected SCC (do you monitor the calories you eat daily?) as the binary column to build my logistic regression model around. This variable was easy to encode as binary (all I had to do was convert no to 0 and yes to 1), and it relates to many behaviors that are generally accepted in medicine as affecting obesity. For one, monitoring calories is often part of a health-conscious lifestyle, so individuals who answer ‘yes’ to SCC might be making more balanced diet and exercise choices. Additionally, whether or not someone monitors their calories might indicate whether they fall into other groups, such as a certain BMI/obesity level or age group, or whether they make other healthy lifestyle choices (not smoking, walking for transportation, more frequent physical activity).
The explanatory variables I chose to use for this logistic regression model are BMI, Age, and FAF (frequency of physical activity):
model <- glm(SCC_binary ~ BMI + Age + FAF, data = obesity,
family = binomial(link = 'logit'))
model$coefficients
## (Intercept) BMI Age FAF
## 1.18952349 -0.10932906 -0.06947382 0.17499101
summary(model)
##
## Call:
## glm(formula = SCC_binary ~ BMI + Age + FAF, family = binomial(link = "logit"),
## data = obesity)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8436 -0.3371 -0.2170 -0.1334 3.1660
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.18952 0.62649 1.899 0.0576 .
## BMI -0.10933 0.01845 -5.926 3.1e-09 ***
## Age -0.06947 0.02793 -2.487 0.0129 *
## FAF 0.17499 0.11941 1.465 0.1428
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 780.96 on 2110 degrees of freedom
## Residual deviance: 690.38 on 2107 degrees of freedom
## AIC: 698.38
##
## Number of Fisher Scoring iterations: 7
BMI: The BMI coefficient is -0.10932906 (-0.11), which means that for every one-point increase in BMI, holding age and frequency of physical activity (FAF) constant, the log-odds of the patient monitoring their calories decrease by 0.11. Equivalently, the odds are multiplied by exp(-0.11) ≈ 0.90, a decrease of about 10%. The coefficient looks small, but its size alone does not determine how confident I can be in it; the p-value in the summary above (3.1e-09) indicates that the relationship is statistically significant. This suggests that the higher a person’s BMI, the less likely they are to monitor their calories.
Age: The age coefficient is -0.06947382, which means that for every one-year increase in age, holding BMI and FAF constant, the log-odds of a patient monitoring their calories decrease by 0.07 (the odds are multiplied by exp(-0.07) ≈ 0.93, a decrease of about 7%). Like BMI, this looks like a very small number, but the p-value in the summary above (0.0129) is below 0.05, so the relationship is statistically significant at the 5% level. This suggests that the older someone is, the less likely they are to monitor their calories, which is the opposite of what I thought to be true.
FAF: The FAF coefficient is 0.17499101, which means that for every one-unit increase in FAF (frequency of physical activity), holding BMI and age constant, the log-odds of a patient monitoring their calories increase by 0.17 (the odds are multiplied by exp(0.17) ≈ 1.19, an increase of about 19%). Although this is the largest of the three coefficients in absolute value, its p-value in the summary above (0.1428) is above 0.05, so the relationship is not statistically significant. The direction of the estimate suggests that the more often a patient exercises, the more likely they are to monitor their calories. This aligns with my guess at the beginning of this assignment, although the actual cause of this may not be an overall healthier lifestyle.
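Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for the three coefficients printed above; the values are copied by hand from the model output, and the names `coefs`, `odds_ratios`, and `pct_change` are my own.

```r
# Coefficients copied from the model output above (log-odds scale)
coefs <- c(BMI = -0.10932906, Age = -0.06947382, FAF = 0.17499101)

# Exponentiating a log-odds coefficient gives an odds ratio:
# the factor by which the odds are multiplied per one-unit increase
odds_ratios <- exp(coefs)

# Express each odds ratio as a percent change in the odds
pct_change <- (odds_ratios - 1) * 100

print(round(odds_ratios, 3))
print(round(pct_change, 1))
```

With the fitted model still in the session, `exp(coef(model))` should give the same odds ratios without retyping the numbers.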
faf_coef <- 0.17499101
faf_se <- 0.11941
faf_z <- 1.96 #will give me a 95% CI
faf_lb <- faf_coef - (faf_z * faf_se)
faf_ub <- faf_coef + (faf_z * faf_se)
faf_ci <- c(faf_lb, faf_ub)
faf_ci
## [1] -0.05905259 0.40903461
Based on this confidence interval, I would conclude that there is not a statistically significant relationship between monitoring calories and frequency of physical activity, even though the positive FAF coefficient in my GLM suggested that more physical activity would increase the likelihood of individuals monitoring their calories. I came to this conclusion because my 95% CI for the coefficient includes 0, the value that corresponds to no effect on the log-odds.
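The same Wald interval can be computed for all three coefficients at once and then moved to the odds-ratio scale. This is a minimal sketch that assumes the rounded estimates and standard errors from the summary() table above, typed in by hand; the variable names are my own.

```r
# Estimates and standard errors copied from the summary() table above
est <- c(BMI = -0.10933, Age = -0.06947, FAF = 0.17499)
se  <- c(BMI = 0.01845, Age = 0.02793, FAF = 0.11941)

z <- qnorm(0.975)  # about 1.96, the critical value for a 95% interval

# Wald interval on the log-odds scale: estimate +/- z * standard error
ci_logodds <- cbind(lower = est - z * se, upper = est + z * se)

# Exponentiate to get the interval for the odds ratio
ci_or <- exp(ci_logodds)

# An interval excludes "no effect" if it does not contain 0 (log-odds)
# or, equivalently, 1 (odds ratio)
excludes_null <- ci_logodds[, "lower"] > 0 | ci_logodds[, "upper"] < 0

print(round(ci_logodds, 3))
print(round(ci_or, 3))
print(excludes_null)
```

With the fitted model in the session, `confint.default(model)` should give the same kind of Wald intervals directly from the unrounded estimates, so the hand-typed numbers here are only for illustration.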
Because this is inconclusive, I think something else that would be good to know is the reasoning behind why individuals choose to monitor or not monitor their calories. For some individuals, it may be part of weight loss or just a lifestyle choice, but this doesn’t account for the fact that people monitoring calories could be attempting to gain weight, or at the very least not lose weight, and this could also affect the amount of physical activity they are doing. Additionally, when discussing obesity, weight loss, calorie counting, and exercise, there are many more nuances than this dataset touches on, so I’m not entirely surprised that some of these tests have been inconclusive or have found relationships that are not statistically significant.