Background

Clinicians studying weight loss are trialing three diets. Diet A is the diet usually recommended; diets B and C have been newly developed. For the purpose of this analysis, Diet A is therefore treated as the control group. The goal of this study is to determine statistically whether the newly formulated diets affect the weight loss of the trial participants.

Based on the statistical analysis, the clinicians will adjust their recommendations to offer a new and improved diet that delivers better weight loss results for customers. The statistical analysis is carried out using the frequentist statistical framework.

Exploratory Data Analysis

Data Structure

The data consists of 42 rows with 0 missing values. 14 trials were carried out for each diet.

The data provided consists of 3 variables:

  • Diet: A categorical variable that indicates the diet followed in each trial/observation
  • Weight: The starting weight of the trial participant (kg)
  • Weight after 6 weeks: The weight of the trial participant, measured after 6 weeks on the diet (kg)

Preview of Diet Trial Data

Diet  weight  weight6weeks
A     58      54.2
A     60      54.0
A     64      63.3
A     64      61.1
A     65      62.2
A     66      64.0

Summary Statistics of Diet Trial Data

Diet  Trials  Mean weight (kg)  S.D. weight (kg)  Mean weight at 6 weeks (kg)  S.D. weight at 6 weeks (kg)
A     14      67.9              6.0               64.9                         6.9
B     14      64.8              5.9               62.2                         6.3
C     14      68.0              4.4               62.1                         5.0

The summary statistics show that, for all three diets, the mean weight after 6 weeks is lower than the mean starting weight. Participants on diets A and C both have a mean starting weight of about 68 kg, while participants on diet B have a mean starting weight of about 65 kg. To evaluate the impact of the diets on participants' weight, we calculate each participant's weight loss (starting weight minus weight after 6 weeks) and add it to the data frame. A preview of the new data frame is shown below:

Preview of Diet Trial Data with weight loss column

Diet  weight  weight6weeks  weight_loss
A     58      54.2          3.8
A     60      54.0          6.0
A     64      63.3          0.7
A     64      61.1          2.9
A     65      62.2          2.8
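
The weight loss column is simply the starting weight minus the weight after 6 weeks. As a minimal sketch, the R snippet below rebuilds the five previewed rows by hand and derives the column. The data frame name `preview` is an assumption made for illustration; it is not the full data frame used in this report.

```r
# Rebuild the five previewed Diet A rows by hand (illustration only;
# `preview` is a hypothetical name, not the report's full data frame).
preview <- data.frame(
  Diet         = rep("A", 5),
  weight       = c(58, 60, 64, 64, 65),
  weight6weeks = c(54.2, 54.0, 63.3, 61.1, 62.2)
)

# Weight loss = starting weight minus weight after 6 weeks (kg).
preview$weight_loss <- preview$weight - preview$weight6weeks
print(preview)
```

The same one-line subtraction would apply unchanged to the full 42-row data frame.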

A box plot of weight loss by diet shows the distribution of weight loss for each diet, and how these distributions and their key summary statistics differ.

Distribution of weight loss for different diet types

In the box plot, Diet C shows the highest typical weight loss in our data. There is no clear difference in typical weight loss between diet A and diet B, although the weight loss recorded for diet B is more variable than for diet A. The difference in mean weight loss between the diets is thus worth investigating statistically.

Statistical Analysis

Statistical Model

We model our problem statistically as a one-way ANOVA (Analysis of Variance) model given that we are investigating the difference between the mean weight loss for more than two groups.

Let \(y_{i,j}\) be the weight lost by the \(j\)-th participant in the group following the \(i\)-th diet, with \(i = 1, 2, 3\) and \(j = 1, \ldots, 14\). The one-way ANOVA model is as follows:

\[y_{i,j} \sim N(\mu_{i,j}, \sigma^2), \quad i = 1,2,3, \quad j = 1,\ldots,14\]

where

\[\mu_{1,j} = \mu_1, \quad j = 1,\ldots,14\] \[\mu_{2,j} = \mu_2 = \mu_1 + \alpha_2, \quad j = 1,\ldots,14\] \[\mu_{3,j} = \mu_3 = \mu_1 + \alpha_3, \quad j = 1,\ldots,14\]

Description of model parameters \(\mu_1\), \(\alpha_2\) and \(\alpha_3\): This one-way ANOVA model uses \(\mu_1\) (the mean weight loss for diet A) as a “base” group. The mean weight losses for diets B and C (\(\mu_2\) and \(\mu_3\) respectively) are described as the mean for diet A plus the offsets \(\alpha_2\) and \(\alpha_3\), each of which may be positive or negative.

\(\alpha_2\) and \(\alpha_3\) are thus constants that parametrize the mean weight loss for diet B and diet C relative to diet A. Diet A is selected as the base group because it is the benchmark against which we measure whether the newly developed diets B and C offer improvements. More generally, \(\mu_1\) represents the mean weight loss of whichever diet is chosen as the control group.
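
To make this parametrization concrete, the sketch below shows the design matrix that R builds for a three-level factor under its default treatment coding. The toy factor (two participants per diet) and the coefficient values in `beta` are made up for illustration and are not taken from the trial data.

```r
# Toy factor with two participants per diet (hypothetical, for illustration only).
Diet <- factor(rep(c("A", "B", "C"), each = 2))

# Treatment coding: an intercept column plus indicator columns for diets B and C.
X <- model.matrix(~ Diet)
print(X)

# With coefficients (mu_1, alpha_2, alpha_3), each entry of X %*% beta is the
# group mean for that participant's diet.
beta <- c(mu_1 = 1, alpha_2 = 0.5, alpha_3 = 2)  # made-up values, not estimates
group_means <- drop(X %*% beta)
print(group_means)
```

Rows for diet A pick up only the intercept \(\mu_1\), while rows for diets B and C add \(\alpha_2\) or \(\alpha_3\) respectively, which is exactly the structure of the model above.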

Fitting the Linear Model:

Fitting this linear model gives estimates of \(\mu_1\), \(\alpha_2\) and \(\alpha_3\). The estimated coefficients are shown below:

##               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)  3.0500000  0.5624035  5.4231521 3.271047e-06
## DietB       -0.4428571  0.7953587 -0.5568018 5.808443e-01
## DietC        2.8928571  0.7953587  3.6371728 7.964651e-04

  • \(\mu_1\) is the mean weight loss of the base group on Diet A, estimated at 3.050 kg.
  • \(\alpha_2\) is the difference in underlying mean weight loss between Diet B and Diet A (\(\mu_2 - \mu_1\)), estimated at -0.443 kg. This small difference is consistent with our earlier visualization, and it is not statistically significant (p = 0.581, which is greater than 0.05).
  • \(\alpha_3\) is the difference in underlying mean weight loss between Diet C and Diet A (\(\mu_3 - \mu_1\)), estimated at 2.893 kg. This difference is also consistent with our earlier visualization, and it is statistically significant (p = 0.0008).
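
As a cross-check on the coefficient table, the t values and p-values can be recomputed from the reported estimates and standard errors alone. The sketch below assumes 39 residual degrees of freedom (42 observations minus 3 estimated means); the variable names are illustrative.

```r
# Estimates and standard errors copied from the coefficient table above.
est <- c(mu_1 = 3.0500000, alpha_2 = -0.4428571, alpha_3 = 2.8928571)
se  <- c(0.5624035, 0.7953587, 0.7953587)

# Residual degrees of freedom: 42 observations minus 3 estimated means.
df_resid <- 42 - 3

# t statistic and two-sided p-value for H0: coefficient = 0.
t_val <- est / se
p_val <- 2 * pt(-abs(t_val), df = df_resid)

# Implied group means for diets A, B and C.
group_means <- c(A = est[["mu_1"]],
                 B = est[["mu_1"]] + est[["alpha_2"]],
                 C = est[["mu_1"]] + est[["alpha_3"]])

print(round(t_val, 3))
print(signif(p_val, 3))
print(round(group_means, 3))
```

The implied group means also show that diet B's estimated mean weight loss is slightly below diet A's, while diet C's is well above both.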

Analysis of Variance (ANOVA)

To test for an underlying difference in mean weight loss between the diets, I applied the anova (analysis of variance) function to the fitted linear model. This carries out an F-test, which I use at size 0.05, of whether the group means differ. The ANOVA test is chosen because the goal is to make statistical inferences about the difference in mean weight loss across more than two groups (a t-test would suffice for two groups).

Let \(\mu_1\), \(\mu_2\) and \(\mu_3\) be the underlying mean weight loss for Diet A, Diet B and Diet C respectively. I formulated the hypothesis test as follows:

\[H_0: \mu_1 = \mu_2 = \mu_3\] \[H_1: \mu_i \neq \mu_k \text{ for at least one pair } i \neq k\]

anova(m)
## Analysis of Variance Table
## 
## Response: weight_loss
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## Diet       2  91.895  45.947  10.376 0.0002437 ***
## Residuals 39 172.699   4.428                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p_value <- anova(m)$`Pr(>F)`[1]
p_value
## [1] 0.0002436924
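
The F value and p-value in the table above follow directly from the sums of squares and degrees of freedom. A minimal sketch of that arithmetic, using the rounded figures printed in the table (so the last digits may differ slightly):

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above.
ss_diet  <- 91.895
df_diet  <- 2
ss_resid <- 172.699
df_resid <- 39

# Mean squares: sum of squares divided by degrees of freedom.
ms_diet  <- ss_diet / df_diet
ms_resid <- ss_resid / df_resid

# F statistic and its upper-tail p-value under H0.
f_stat  <- ms_diet / ms_resid
p_value <- pf(f_stat, df_diet, df_resid, lower.tail = FALSE)

print(round(f_stat, 3))
print(signif(p_value, 4))
```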

Since the p-value (0.00024) is below 0.05, our size \(\alpha\) = 0.05 test rejects \(H_0\): we have enough evidence to assert that the mean weight loss is not the same for all three groups. Although a difference between the means is confirmed statistically, it is not yet clear which pairs of groups differ significantly.

To identify these pairs, I perform a follow-up analysis using Tukey's Honest Significant Differences (HSD). The Tukey test allows me to test three pairs of hypotheses simultaneously while controlling the family-wise error rate. I formulated the tests as follows:

\[ \begin{aligned} H_0:& \mu_2 - \mu_1 = 0 \quad \text{(or, } \mu_1 = \mu_2 \text{)} \\ H_1:& \mu_2 - \mu_1 \neq 0 \quad \text{(or, } \mu_1 \neq \mu_2 \text{)} \\ H_0:& \mu_3 - \mu_1 = 0 \quad \text{(or, } \mu_1 = \mu_3 \text{)} \\ H_1:& \mu_3 - \mu_1 \neq 0 \quad \text{(or, } \mu_1 \neq \mu_3 \text{)} \\ H_0:& \mu_3 - \mu_2 = 0 \quad \text{(or, } \mu_2 = \mu_3 \text{)} \\ H_1:& \mu_3 - \mu_2 \neq 0 \quad \text{(or, } \mu_2 \neq \mu_3 \text{)} \\ \end{aligned} \]

a <- aov(weight_loss ~ Diet, data = diet_df)
coef(a)
## (Intercept)       DietB       DietC 
##   3.0500000  -0.4428571   2.8928571
summary(a)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Diet         2  91.89   45.95   10.38 0.000244 ***
## Residuals   39 172.70    4.43                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the output of the Tukey HSD test, we can conclude the following:

  • There is no statistically significant difference in weight loss between diets A and B (p = 0.84361).
  • There is a statistically significant difference in weight loss between diets A and C (p = 0.00225).
  • There is a statistically significant difference in weight loss between diets B and C (p = 0.00044).
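
The Tukey HSD call itself is not shown in this report; with the fitted `aov` object it would typically be `TukeyHSD(a)`. As a hedged cross-check that does not need the raw data, the sketch below approximates the Tukey-adjusted p-values from the reported coefficients and ANOVA table, assuming a balanced design with 14 trials per diet. The variable names are illustrative, and the results should be close to, but are not guaranteed to match exactly, the values quoted above.

```r
# Pairwise differences in mean weight loss implied by the fitted coefficients.
diffs <- c("B-A" = -0.4428571,
           "C-A" =  2.8928571,
           "C-B" =  2.8928571 - (-0.4428571))

ms_resid    <- 172.699 / 39  # residual mean square from the ANOVA table
n_per_group <- 14            # trials per diet (balanced design)
df_resid    <- 39

# Studentized range statistic for each pair.
q_stat <- abs(diffs) / sqrt(ms_resid / n_per_group)

# Tukey-adjusted p-values for a family of 3 group means.
p_adj <- ptukey(q_stat, nmeans = 3, df = df_resid, lower.tail = FALSE)
names(p_adj) <- names(diffs)
print(round(p_adj, 5))
```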

Therefore, we can conclude that diet C has a different effect on weight loss: its mean weight loss is significantly higher than that of both diet A (the control group) and diet B. Additionally, the difference in weight loss between diets A and B is not statistically significant. This result aligns with our initial analysis and visualization of the data.

Further Statistical Inference

Management has set a benchmark: a new diet formulation will only be implemented if it increases weight loss by more than 5 kg. To further aid the decision on a possible implementation of a new diet, the clinicians would like to confirm whether the underlying mean weight loss obtained following diet C is more than 5 kg higher than the underlying mean weight loss obtained following diet B.

The hypothesis to test if the mean weight loss \(\mu_3\) of diet C is more than 5kg higher than the mean weight loss \(\mu_2\) of diet B is formulated as follows:

\[ \begin{aligned} H_0:& \mu_3 - \mu_2 \leq 5 \quad \text{: } \mu_3 \text{ is not more than 5kg higher than } \mu_2 \\ H_1:& \mu_3 - \mu_2 > 5 \quad \text{: } \mu_3 \text{ is more than 5kg higher than } \mu_2 \\ \end{aligned} \]

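library(multcomp)  # provides glht()
# Cell-means parametrization (no intercept), matching the "Fit:" line in the output below
m_mu <- lm(weight_loss ~ Diet - 1, data = diet_df)
ght <- glht(m_mu, 
            # State the hypothesis to be tested (null hypothesis):
            linfct = c("DietC - DietB <= 5"))
summary(ght)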
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = weight_loss ~ Diet - 1, data = diet_df)
## 
## Linear Hypotheses:
##                    Estimate Std. Error t value Pr(>t)
## DietC - DietB <= 5   3.3357     0.7954  -2.092  0.979
## (Adjusted p values reported -- single-step method)
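
The t value and p-value in this output can be reproduced by hand from the estimate and standard error. The sketch below assumes 39 residual degrees of freedom (as in the ANOVA table) and uses the rounded figures printed above; the variable names are illustrative.

```r
# Values copied from the glht output above.
estimate  <- 3.3357  # estimated mu_3 - mu_2 (kg)
std_error <- 0.7954
threshold <- 5       # boundary of H0: mu_3 - mu_2 <= 5
df_resid  <- 39      # residual df: 42 observations minus 3 group means

# One-sided test: large positive t values are evidence against H0.
t_stat  <- (estimate - threshold) / std_error
p_value <- pt(t_stat, df = df_resid, lower.tail = FALSE)

print(round(t_stat, 3))
print(round(p_value, 3))
```

Because the estimated difference (about 3.3 kg) lies below the 5 kg benchmark, the t statistic is negative and the upper-tail p-value is close to 1.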

Conclusion:

The p-value from the hypothesis test (0.979) is greater than 0.05, so we cannot reject the null hypothesis. There is not enough evidence to conclude that the underlying weight loss obtained following diet C is more than 5 kg higher than the underlying weight loss obtained following diet B.

Unfortunately, the threshold needed to effect a diet change was not met in this instance, and the clinicians may need to run more trials or improve their diet formulations.
