Data Dive 8 - Regression Model

library(tidyverse)

## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.2     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.1
## v ggplot2   3.5.1     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

obesity <- read.csv(file.choose())

obesity <- obesity |>
  mutate(BMI = Weight / (Height^2))
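
As a quick sanity check on the formula, here is a minimal sketch of the same calculation on two made-up rows. It assumes Weight is in kilograms and Height is in meters, which is what the BMI formula requires; the values and the `toy` data frame are invented for illustration and are not from the dataset.

```r
# Hypothetical rows for illustration only (not from the obesity data).
# Assumes Weight is in kilograms and Height is in meters.
toy <- data.frame(Weight = c(64, 90), Height = c(1.60, 1.80))

# Same formula as above, written in base R
toy$BMI <- toy$Weight / toy$Height^2

print(toy)
```

If the heights were recorded in centimeters instead, the resulting BMI values would be far too small, which makes this an easy check to run before modeling.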

Select a continuous column of data that seems the most “valuable” as the response variable

response <- obesity$BMI

I chose BMI as my response variable because it is the measure that most healthcare professionals seem to ask about when assessing health. Additionally, when diagnosing obesity or deciding who should receive preventative measures, most practitioners will base their decision on BMI alone, or potentially in conjunction with other health markers that are not represented in this dataset.

Select a categorical column that you expect might influence the response variable

explanatory <- obesity$FAVC

I chose FAVC (do you eat high-calorie foods frequently?) as my explanatory column. Because weight gain is caused by an excess of calories, it makes sense to me that frequently eating high-calorie foods might be associated with a higher BMI.

Devise a null hypothesis for an ANOVA test given this situation

Null hypothesis: Mean BMI is the same for people who frequently consume high-calorie foods and people who do not (that is, there is no relationship between BMI and FAVC).

anova <- aov(response ~ explanatory, data = obesity)

summary(anova)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## explanatory    1   8202    8202     136 <2e-16 ***
## Residuals   2109 127221      60                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on this one-way ANOVA (F = 136 on 1 and 2109 degrees of freedom, p < 2e-16), I would reject the null hypothesis: mean BMI differs significantly between individuals who frequently eat high-calorie foods and those who do not. The ANOVA table itself does not show which group has the higher mean, and the result is an association rather than proof of cause. Still, if the difference runs in the expected direction, it would be reasonable for healthcare professionals to consider some type of dietary management for patients whose quality of life or length of life would improve with a lower BMI.
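
Because FAVC has only two levels, an ANOVA with this single factor is equivalent to a pooled-variance two-sample t-test, and the F statistic equals the squared t statistic. The sketch below illustrates that on simulated data; the group sizes, means, and the names `sim`, `bmi`, and `favc` are made up for the example and do not come from the obesity dataset. Printing the group means alongside the test also shows the direction of the difference.

```r
set.seed(1)

# Simulated stand-in data (assumed values, not the real dataset)
sim <- data.frame(
  favc = rep(c("no", "yes"), times = c(50, 150)),
  bmi  = c(rnorm(50, mean = 26, sd = 6), rnorm(150, mean = 30, sd = 6))
)

# ANOVA with a single two-level factor
fit <- aov(bmi ~ favc, data = sim)
f_value <- summary(fit)[[1]][["F value"]][1]

# Pooled-variance t-test on the same two groups
t_value <- unname(t.test(bmi ~ favc, data = sim, var.equal = TRUE)$statistic)

# Group means show which group is higher
group_means <- tapply(sim$bmi, sim$favc, mean)

print(group_means)
print(c(F = f_value, t_squared = t_value^2))
```

On the real data, the analogous group means could be obtained with `tapply(obesity$BMI, obesity$FAVC, mean)`.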

Find a single continuous column of data that might influence the response variable

influence <- obesity$FAF

I chose FAF (frequency of physical activity) as the continuous column that might influence the response variable (BMI). It makes sense to me that physical activity several times a week would help with weight management, which might result in a lower BMI.

Make sure the relationship between BMI and FAF is (roughly) linear

library(ggplot2)

ggplot(obesity, aes(x = FAF, y = BMI)) + geom_point()

The relationship between BMI and FAF is not linear at all. I ran the ggpairs() function (from the GGally package) on this dataset, and no pair of response and explanatory variables looked linear. I am therefore going to use my initial response variable (BMI) and a continuous explanatory variable (FAF) for my linear regression, keeping in mind that they do not have a linear relationship.
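
A scatterplot alone can make linearity hard to judge when points overlap heavily. One option is to compare the mean response across bins of the explanatory variable: under a roughly linear relationship, the bin means should change by similar amounts from one bin to the next. The sketch below does this on simulated data; the names `x` and `y`, the 0 to 3 range, and the bin edges are assumptions for illustration, not values taken from the obesity dataset.

```r
set.seed(42)

# Simulated stand-in data (not the obesity dataset)
x <- runif(500, min = 0, max = 3)
y <- 31 - 1.7 * x + rnorm(500, sd = 8)

# Cut x into three equal-width bins and average y within each bin
bins <- cut(x, breaks = c(0, 1, 2, 3), include.lowest = TRUE)
bin_means <- tapply(y, bins, mean)

# Step-to-step changes; similar values suggest a roughly linear trend
print(bin_means)
print(diff(bin_means))
```

The same idea could be applied to `obesity$FAF` and `obesity$BMI`, with bin edges chosen to match the observed range of FAF.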

Build a linear regression model of the response and evaluate its fit

model <- lm(BMI ~ FAF, obesity)

summary(model)

## 
## Call:
## lm(formula = BMI ~ FAF, data = obesity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5674  -5.4086  -0.8229   6.2045  21.9465 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  31.3895     0.2665 117.771   <2e-16 ***
## FAF          -1.6721     0.2018  -8.285   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.886 on 2109 degrees of freedom
## Multiple R-squared:  0.03152,    Adjusted R-squared:  0.03106 
## F-statistic: 68.64 on 1 and 2109 DF,  p-value: < 2.2e-16

Looking at the coefficients for this linear regression model, I would conclude that there is a weak negative relationship between BMI and FAF (frequency of physical activity): each one-unit increase in FAF is associated with a BMI about 1.67 lower on average. I could have predicted the weakness from the scatterplot, which shows nothing close to a linear pattern, and it is confirmed by the R squared value of 0.03152, meaning the linear model explains only about 3% of the variance in BMI. Despite this, the p-value for the slope is very small (< 2e-16), so the relationship is statistically significant. Based on my research into what this combination means, the low p-value likely has only limited practical significance: with 2,111 observations even a weak trend is detectable, and FAF alone is a poor predictor of an individual's BMI.
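
For reference, R squared is the share of variance in the response that the fitted line accounts for, computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on simulated data and compares it with the value reported by `summary()`. The data are made up to mimic a small slope with a lot of noise in a large sample; none of the numbers come from the obesity dataset.

```r
set.seed(7)

# Simulated data: small slope, lots of noise, large sample (assumed values)
n <- 2000
x <- runif(n, 0, 3)
y <- 31 - 1.7 * x + rnorm(n, sd = 8)

fit <- lm(y ~ x)

# R squared by hand: 1 - SS_residual / SS_total
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

# Values reported by summary() for comparison
r2_reported <- summary(fit)$r.squared
slope_p <- summary(fit)$coefficients["x", "Pr(>|t|)"]

print(c(manual = r2_manual, reported = r2_reported, slope_p = slope_p))
```

Setups like this are a useful reminder that the p-value and R squared answer different questions: the first is about whether a slope is detectable, the second about how much of the spread in the response the line accounts for.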

Based on this linear regression model between BMI and FAF, I would conclude that while there is a statistically significant relationship between BMI and FAF, it is weak. Increasing physical activity should not be prescribed as a weight management tactic on this basis, because there is little evidence in this dataset that the two are related in a linear way (it cannot be concluded that more exercise means a lower BMI). While I think that health practitioners should still encourage an increase in physical activity in order to maintain a healthy lifestyle and improve other health markers that are not included in this dataset, I think they would be better off recommending that patients limit high-calorie foods to lower BMI (FAVC, the other variable I looked at in this dataset).
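
To put the fitted slope in practical terms, the coefficients reported in the model summary can be plugged into the regression equation directly. The sketch below uses the rounded estimates from that output (intercept 31.3895, slope -1.6721) and compares predicted BMI at FAF = 0 and FAF = 3; those two FAF values are chosen here only as illustrative endpoints.

```r
# Rounded coefficients copied from the summary(model) output
intercept <- 31.3895
slope     <- -1.6721

# Illustrative FAF values (assumed endpoints for the example)
faf <- c(0, 3)

# Predicted BMI from the fitted line: intercept + slope * FAF
predicted_bmi <- intercept + slope * faf

print(predicted_bmi)
print(diff(predicted_bmi))  # change in predicted BMI from FAF = 0 to FAF = 3
```

The predicted difference of about 5 BMI units is smaller than the residual standard error of 7.886, so individual BMI values vary widely around the fitted line.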

One thing I am interested in following this analysis: because none of the variables appeared to have a linear relationship with BMI, is there another way to define obesity or being overweight that might have a linear relationship with healthy lifestyle factors (like not smoking, more physical activity, walking or biking over taking public transportation, eating vegetables frequently, etc.) so that better prescriptive advice can be given?