library(tidyverse)
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr 1.1.2 v readr 2.1.4
## v forcats 1.0.0 v stringr 1.5.1
## v ggplot2 3.5.1 v tibble 3.2.1
## v lubridate 1.9.2 v tidyr 1.3.0
## v purrr 1.0.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
obesity <- read.csv(file.choose())
obesity <- obesity |>
  mutate(BMI = Weight / (Height^2))
response <- obesity$BMI
I chose BMI as my response variable because it is the measure that most healthcare professionals seem to ask about when assessing health. Additionally, when diagnosing obesity or deciding who should be prescribed preventative measures, most practitioners base their decision on BMI alone, or potentially on BMI in conjunction with other health markers that are not represented in this dataset.
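To make the BMI formula concrete, here is a small worked example. It is only a sketch: it assumes Weight is recorded in kilograms and Height in meters (which the formula requires), and the two people below are made-up values, not rows from the dataset.

```r
# Hypothetical people, NOT taken from the obesity dataset.
# Assumes weight in kilograms and height in meters.
example <- data.frame(
  Weight = c(70, 95),
  Height = c(1.75, 1.60)
)

# Same formula as the mutate() call above: weight divided by height squared.
example$BMI <- example$Weight / (example$Height^2)
print(example)

# If Height were stored in centimeters instead, every BMI would be
# 10,000 times too small, so a range check is a cheap safeguard.
stopifnot(all(example$Height > 0.5 & example$Height < 2.5))
```

If the range check fails on real data, the heights are probably in centimeters and need to be divided by 100 before computing BMI.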
explanatory <- obesity$FAVC
I chose FAVC (do you eat high calorie foods frequently?) as my explanatory column. Because a gain in weight is caused by an excess of calories, it makes sense to me that frequently eating high calorie foods might influence BMI to be higher.
Null hypothesis: There is no significant relationship between BMI and whether or not a person frequently consumes high calorie foods (that is, mean BMI is the same in both FAVC groups).
anova <- aov(response ~ explanatory, data = obesity)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## explanatory 1 8202 8202 136 <2e-16 ***
## Residuals 2109 127221 60
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on this one-way ANOVA (there is only one explanatory factor, FAVC), I would reject the null hypothesis, as the very small p-value suggests there is a significant relationship between BMI and whether or not an individual frequently eats high calorie foods. Because this is an association in observational data, it does not prove that high calorie foods cause a higher BMI, but it gives healthcare professionals good reason to ask individuals who have a higher BMI about how often they eat high calorie foods, and to consider prescribing some type of dietary management if lowering BMI would improve their patients' quality of life or length of life.
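As a rough check on how strong this relationship is, the sums of squares in the summary(anova) output above can be turned into an effect size. This is a sketch I am adding rather than part of the original analysis: eta squared is the between-group sum of squares divided by the total sum of squares, and the variable names below are only for illustration. The numbers are copied from the rounded ANOVA table, so the results are approximate.

```r
# Values copied from the summary(anova) table above (rounded there).
ss_between <- 8202     # Sum Sq for explanatory (FAVC)
ss_residual <- 127221  # Sum Sq for Residuals
df_between <- 1
df_residual <- 2109

# Eta squared: share of the total variation in BMI attributed to FAVC.
eta_squared <- ss_between / (ss_between + ss_residual)

# Recompute the F statistic as a check that the table was read correctly.
f_value <- (ss_between / df_between) / (ss_residual / df_residual)

cat("eta squared:", round(eta_squared, 3), "\n")
cat("F value:", round(f_value, 1), "\n")
```

An eta squared of roughly 0.06 would mean FAVC accounts for about 6% of the variation in BMI, so the result is statistically significant but most of the variation is left unexplained.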
influence <- obesity$FAF
I chose FAF (frequency of physical activity) as my continuous column that might influence the response variable (BMI) as it makes sense to me that the addition of some form of physical activity at a frequency of multiple times a week would potentially help with weight management, which might result in a lower BMI.
library(ggplot2)
ggplot(obesity, aes(x = FAF, y = BMI)) + geom_point()
The relationship between BMI and FAF is not linear at all. I ran the ggpairs() function on this dataset and no pair of response and explanatory variables was linear, so I am going to use my initial response variable (BMI) and a continuous explanatory variable (FAF) for my linear regression, keeping in mind that they do not have a linear relationship.
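One way to back up a visual judgment of non-linearity with a test is to compare a straight-line model against one with a squared term. The sketch below uses simulated data, not the obesity dataset, and the curved shape is an assumption built into the simulation purely to show how the comparison works.

```r
# Simulated data for illustration only; x mimics a 0-3 activity score.
set.seed(1)
x <- runif(200, min = 0, max = 3)
y <- 30 - 4 * (x - 1.5)^2 + rnorm(200, sd = 2)  # deliberately curved

# Fit a straight line and a model that also allows curvature.
linear_fit <- lm(y ~ x)
quadratic_fit <- lm(y ~ x + I(x^2))

# Nested-model F test: a small p-value means the squared term
# explains variation that the straight line misses.
comparison <- anova(linear_fit, quadratic_fit)
print(comparison)
```

On the real data, the same comparison would use lm(BMI ~ FAF, obesity) and lm(BMI ~ FAF + I(FAF^2), obesity). I have not run that here, so I am not claiming what it would show.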
model <- lm(BMI ~ FAF, obesity)
summary(model)
##
## Call:
## lm(formula = BMI ~ FAF, data = obesity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5674 -5.4086 -0.8229 6.2045 21.9465
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.3895 0.2665 117.771 <2e-16 ***
## FAF -1.6721 0.2018 -8.285 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.886 on 2109 degrees of freedom
## Multiple R-squared: 0.03152, Adjusted R-squared: 0.03106
## F-statistic: 68.64 on 1 and 2109 DF, p-value: < 2.2e-16
Looking at the coefficients for this linear regression model, I would conclude that there is a weak relationship between BMI and FAF (frequency of physical activity). I could have predicted this from the fact that the two variables are nowhere near linearly related when graphed, but it can also be seen in the R squared value, which is very small: the linear model explains only about 3% of the variance in BMI. Despite this, the p-value for this relationship is very small, meaning the relationship is statistically significant. Based on my research into what this might mean, this low p-value likely has only limited practical significance because the R squared value is so low.
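The small R squared and the tiny p-value are consistent with each other, which can be checked from the numbers in the summary(model) output above. In a simple regression the F statistic can be rebuilt from R squared and the residual degrees of freedom, and it also equals the square of the slope's t value. The variable names below are mine, and the inputs are the rounded values from the printout.

```r
# Values copied from the summary(model) output above.
r_squared <- 0.03152
df_residual <- 2109
t_value <- -8.285  # t value for the FAF coefficient

# With one predictor: F = (R^2 / 1) / ((1 - R^2) / residual df).
f_from_r2 <- (r_squared / 1) / ((1 - r_squared) / df_residual)

# With one predictor the F statistic is also the squared t value.
f_from_t <- t_value^2

cat("F from R squared:", round(f_from_r2, 2), "\n")
cat("F from t value:  ", round(f_from_t, 2), "\n")
```

Both land on about 68.6, matching the reported F-statistic. The formula shows why: with 2109 residual degrees of freedom, even an R squared of 0.03 is multiplied into a large F, so a large sample can make a weak relationship highly significant.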
Based on this linear regression model between BMI and FAF, I would conclude that while there is a statistically significant relationship between BMI and FAF, it is a weak one. Increasing physical activity should not be prescribed as the main weight management tactic, because FAF explains so little of the variation in BMI that it cannot be concluded that more exercise reliably equals a lower BMI. While I think that health practitioners should still encourage an increase in physical activity in order to maintain a healthy lifestyle and better other health markers that are not included in this dataset, I think they would be better off prescribing limiting high calorie foods to lower BMI (the other variable I looked at in this dataset).
One thing I am interested in following this analysis: because none of the variables appeared to have a linear relationship to BMI, is there another way we can define obesity/being overweight that might have a linear relationship to healthy lifestyle factors (like not smoking, more physical activity, walking or biking over taking public transportation, eating vegetables frequently, etc.) so that better prescriptive advice can be given?