Data Dive 7 - Hypothesis Testing

Load in Data set

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.3

## Warning: package 'tibble' was built under R version 4.1.3

## Warning: package 'tidyr' was built under R version 4.1.3

## Warning: package 'readr' was built under R version 4.1.3

## Warning: package 'purrr' was built under R version 4.1.3

## Warning: package 'dplyr' was built under R version 4.1.3

## Warning: package 'forcats' was built under R version 4.1.3

## Warning: package 'lubridate' was built under R version 4.1.3

## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.2     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.1
## v ggplot2   3.3.5     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

obesity <- read.csv(file.choose())

Create the BMI column

# Add BMI as weight divided by height squared
newobesity <- obesity |>
  mutate(BMI = Weight / (Height ^ 2))
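As a quick sanity check on the formula, the sketch below applies the same BMI calculation to a small made-up data frame. It assumes that Weight is in kilograms and Height is in meters, which is what the formula requires; the three rows are hypothetical values, not rows from the obesity data set.

```r
# Hypothetical example rows (not taken from the obesity data set)
toy <- data.frame(
  Weight = c(60, 80, 100),       # kilograms (assumed unit)
  Height = c(1.60, 1.75, 1.80)   # meters (assumed unit)
)

# Same formula as above: weight divided by height squared
toy$BMI <- toy$Weight / (toy$Height ^ 2)

# Values far outside a plausible BMI range usually signal a unit problem,
# e.g. Height recorded in centimeters instead of meters
print(toy)
range(toy$BMI)
```

Running the same `range()` check on the real BMI column is a cheap way to confirm the units before testing anything.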

Null Hypothesis 1:

There is no difference in average BMI between individuals who do and do not have a family history of obesity. The group means and standard deviations are calculated here for use in the sample size calculation later.

alpha <- 0.05
power <- 0.80
effect <- 1

yes <- subset(newobesity, family_history_with_overweight == "yes")$BMI

mean_yes <- mean(yes)

sd_yes <- sd(yes)

no <- subset(newobesity, family_history_with_overweight == "no")$BMI

mean_no <- mean(no)

sd_no <- sd(no)

I chose my alpha, power and minimum effect size for the following reasons:

Alpha: If I were more familiar with statistics, this data set, and this field, and were collaborating with experts in this area, I would put much more consideration into the alpha. Because I am not, I chose the conventional alpha of 0.05: it is widely accepted and strikes a reasonable balance between the risk of a false positive (Type I error) and, for a given sample size, the risk of a false negative (Type II error). Additionally, BMI is widely considered an imperfect indicator of obesity and health, so I believe 0.05 provides some “wiggle room,” for lack of a better term, to account for this.

Power: Following the same reasoning as for alpha, I set the power at 80% (0.8) because that is the widely accepted convention. Based on a Google search and our weekly lecture, I think this is a good starting place: if a true difference of at least the minimum effect size exists, the test should detect it about 80% of the time. Because this analysis has no real impact on patients or the world, there is minimal risk if an effect goes undetected.

Minimum effect size: I chose a minimum effect size of 1 BMI unit for this hypothesis test because, in my opinion, given that BMI already has limited diagnostic and predictive capability, the difference between 21.1 and 21.5, for example, is negligible compared to the difference between 21 and 22.

Calculate sample size

library(pwrss)

## Warning: package 'pwrss' was built under R version 4.1.3

## 
## Attaching package: 'pwrss'

## The following object is masked from 'package:stats':
## 
##     power.t.test

test <- pwrss.t.2means(mu1 = mean_yes, 
                       mu2 = mean_no,
                        sd1 = sd_yes,
                        sd2 = sd_no,
                        kappa = 1,
                        power = power, 
                        alpha = alpha, 
                        alternative = "not equal")

##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n1 = 7 
##   n2 = 7 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 12 
##  Non-centrality parameter = 3.086 
##  Type I error rate = 0.05 
##  Type II error rate = 0.2

From this sample size calculation, which uses the observed group means and standard deviations, I see that I need at least 7 observations in each group to reach the specified power of 0.80. Because both categories have more than this, I can go on to perform a t-test of my hypothesis.
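The `pwrss.t.2means` call plugs in the observed group means, so it answers how many observations are needed to detect a difference as large as the one already seen in the data. A sketch of the alternative, sizing the test for the minimum effect of 1 BMI unit chosen earlier, could look like the following. It uses base R's `stats::power.t.test` (called with the namespace prefix because `pwrss` masks it), and `sd_assumed` is a hypothetical placeholder for the common standard deviation, not a value computed from this data set.

```r
alpha  <- 0.05   # Type I error rate, as chosen above
power  <- 0.80   # desired power, as chosen above
effect <- 1      # minimum difference of interest, in BMI units

# ASSUMPTION: hypothetical common standard deviation of BMI;
# in the real analysis this would be informed by sd_yes and sd_no
sd_assumed <- 8

# Base R two-sample power calculation; leaving n unspecified solves for n
size_min_effect <- stats::power.t.test(
  delta = effect,
  sd = sd_assumed,
  sig.level = alpha,
  power = power,
  type = "two.sample",
  alternative = "two.sided"
)

# Round up, since a fractional observation is not possible
n_per_group <- ceiling(size_min_effect$n)
n_per_group
```

Because a 1-unit difference is much smaller than the observed difference of about 10 BMI units, the required sample size per group should come out far larger than 7.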

Perform t-test

ttest1 <- t.test(yes, no, var.equal = TRUE)
ttest1

## 
##  Two Sample t-test
## 
## data:  yes and no
## t = 25.367, df = 2109, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   9.253366 10.803985
## sample estimates:
## mean of x mean of y 
##  31.52917  21.50049

Based on the outcome of this two-sample t-test, I reject the null hypothesis and conclude that individuals with a family history of being overweight have a higher average BMI than individuals without one. The main reason is the very small p-value (< 2.2e-16) compared to my alpha of 0.05; in addition, the 95% confidence interval for the difference in means (about 9.25 to 10.80 BMI units) lies well above both zero and my minimum effect size of 1. This matters because the main focus of this project is to determine whether certain lifestyle factors are predictors of obesity. Knowing that there is evidence of a relationship between a family history of obesity and BMI, this could theoretically be used to put preventative measures in place for individuals whose families have a history of obesity.
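The decision rule described here (compare the p-value to alpha, then look at the confidence interval) can be made explicit in code. The sketch below is illustrative only: it runs on simulated values, with group means loosely echoing the reported ones and a made-up standard deviation, and shows how the relevant pieces can be pulled out of the object that `t.test()` returns.

```r
set.seed(42)  # make the simulated example reproducible

alpha <- 0.05

# ASSUMPTION: simulated BMI values standing in for the "yes" and "no" groups;
# the sample sizes and the standard deviation of 7 are made up
sim_yes <- rnorm(200, mean = 31.5, sd = 7)
sim_no  <- rnorm(200, mean = 21.5, sd = 7)

sim_test <- t.test(sim_yes, sim_no, var.equal = TRUE)

# t.test() returns an "htest" object; extract the pieces used in the decision
p_value   <- sim_test$p.value
conf_int  <- sim_test$conf.int   # 95% CI for the difference in means
mean_diff <- unname(sim_test$estimate[1] - sim_test$estimate[2])

# Reject the null hypothesis when the p-value falls below alpha
reject_null <- p_value < alpha
reject_null
```

The same extraction applies to `ttest1`, for example `ttest1$p.value < alpha`.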

Visualization

library(ggplot2)

ggplot(newobesity, aes(x = family_history_with_overweight, y = BMI, fill = family_history_with_overweight)) +
  geom_boxplot() +
  labs(title = "BMI and Family History of Obesity", x = "Family History of Overweight", y = "BMI") +
  theme_minimal() +
  scale_fill_manual(values = c("yes" = "cornflowerblue", "no" = "aquamarine4"))

This graph visualizes the comparison behind my first null hypothesis. Individuals with a family history of overweight/obesity clearly have a higher BMI than individuals without one: the t-test above reports a mean of about 31.5 for the “yes” category versus about 21.5 for the “no” category. As I discussed previously, this knowledge could help decide who should be offered preventative measures in a healthcare setting.

Null hypothesis 2:

There is no difference in the average number of meals eaten per day between people who do and do not track their calories.

yes_track <- subset(obesity, SCC == "yes")$NCP

no_track <- subset(obesity, SCC == "no")$NCP

Perform t-test

test2 <- t.test(yes_track, no_track)

test2

## 
##  Welch Two Sample t-test
## 
## data:  yes_track and no_track
## t = -0.66765, df = 102.87, p-value = 0.5059
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2316103  0.1149470
## sample estimates:
## mean of x mean of y 
##  2.629949  2.688281

I am using an alpha of 0.05 again for this test. As I reasoned for the previous null hypothesis, I am generally unfamiliar with this dataset and am not an expert, so 0.05 is a good standard that provides me with enough certainty. In this case especially, since recommending calorie counting as a weight management tool to prevent obesity is not life-threatening, I feel comfortable using this alpha again because it balances the risks of both types of error.

Based on that, I fail to reject the null hypothesis: the p-value of 0.5059 is well above 0.05, and the 95% confidence interval for the difference in means (about -0.23 to 0.11 meals) includes zero. There is little in this data set to suggest that individuals who count their calories eat a different number of meals per day than those who do not. This makes sense to me; as I’ve discussed in previous data dives, the typical American eating pattern is 3 meals a day and potentially snacks. Another interesting hypothesis I could look into is whether individuals who count their calories eat snacks between meals, as that data point is represented in this data set.

As far as recommendations go, I would say that healthcare professionals should explore other prescriptive advice before discussing calorie counting, such as providing nutrition information first or encouraging other healthy lifestyle habits like exercise and water consumption.

Visualization

ggplot(obesity, aes(x = SCC, y = NCP)) +
  geom_boxplot() +
  labs(title = "Meals Eaten vs. Calorie Counting", x = "Do you count calories?", y = "Number of meals eaten daily") +
  theme_minimal()

As seen in the graph, the two groups' distributions are centered at about the same value, and there are a number of outliers. This limits me from drawing conclusions about this relationship beyond the fact that the difference is not statistically significant.
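Since a boxplot marks each group's median while the t-test compares means, it can help to tabulate both side by side, along with the group sizes. The sketch below does this on simulated data; only the column names `SCC` and `NCP` match the real data set, and the group sizes and meal-count probabilities are made up for illustration.

```r
set.seed(1)  # reproducible simulated data

# ASSUMPTION: simulated stand-in for the obesity data (made-up sizes and values)
sim <- data.frame(
  SCC = rep(c("no", "yes"), times = c(300, 30)),
  NCP = sample(1:4, size = 330, replace = TRUE, prob = c(0.1, 0.1, 0.7, 0.1))
)

# Per-group size, mean, and median of meals per day
groups <- sort(unique(sim$SCC))
group_summary <- data.frame(
  SCC    = groups,
  n      = sapply(groups, function(g) sum(sim$SCC == g)),
  mean   = sapply(groups, function(g) mean(sim$NCP[sim$SCC == g])),
  median = sapply(groups, function(g) median(sim$NCP[sim$SCC == g])),
  row.names = NULL
)
group_summary
```

Applied to the real data, a table like this would also show how unbalanced the two groups are, which is useful context when reading the Welch test output.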

Kylie Heagy

2024-10-13
