Instructions

  1. Total Credit: 175.
  2. Make sure you include author information (your name).
  3. All the questions should use a significance level \(\alpha = 0.05\).



Question 1¹ (135 Points)


The following data describe amino acids in the hemolymph of millipedes. The data cover four male (gender = M) and four female (gender = F) subjects from each of three species (species = A, B, C). The response variable, amino_acid, is the alanine concentration of each subject. Look at the data below:

hemolymph_data = data.frame(
  species = rep( c("A", "B", "C"), 8),
  gender = rep(c("M", "F"), each = 12), 
  amino_acid = c(21.5, 14.5, 16.0,
                 19.6, 17.4, 20.3,
                 20.9, 15.0, 18.5,
                 22.8, 17.8, 19.3,
                 14.8, 12.1, 14.4,
                 15.6, 11.4, 14.7,
                 13.5, 12.7, 13.8,
                 16.4, 14.5, 12.0)
)
hemolymph_data
##    species gender amino_acid
## 1        A      M       21.5
## 2        B      M       14.5
## 3        C      M       16.0
## 4        A      M       19.6
## 5        B      M       17.4
## 6        C      M       20.3
## 7        A      M       20.9
## 8        B      M       15.0
## 9        C      M       18.5
## 10       A      M       22.8
## 11       B      M       17.8
## 12       C      M       19.3
## 13       A      F       14.8
## 14       B      F       12.1
## 15       C      F       14.4
## 16       A      F       15.6
## 17       B      F       11.4
## 18       C      F       14.7
## 19       A      F       13.5
## 20       B      F       12.7
## 21       C      F       13.8
## 22       A      F       16.4
## 23       B      F       14.5
## 24       C      F       12.0


a. We want to visualize the difference in the response variable amino_acid among the different species. Create a boxplot with amino_acid on the y axis as the response variable and species as the predictor/group variable. What would be your guess regarding how the response variable behaves across the different species? (10 points)

boxplot(amino_acid ~ species, data = hemolymph_data,
        main = "Amino Acid Concentrations Across Species",
        xlab = "Species",
        ylab = "Amino Acid Concentration",
        col = c("lightblue", "lightgreen", "lightcoral"),
        border = "darkblue")


stripchart(amino_acid ~ species, data = hemolymph_data,
           method = "jitter", 
           pch = 20, 
           col = "black", 
           vertical = TRUE, 
           add = TRUE)

# Species A has the highest median amino acid concentration, while Species B shows the lowest, indicating less alanine overall. Species C falls in between with moderate levels. Additionally, Species A may exhibit greater variability compared to B and C. Overall, this suggests that amino acid concentration varies across species, with Species A generally showing higher values.


b. Show the one-way ANOVA table with species as the predictor and amino_acid as the response. (5 points)

Answer:

anova_result <- aov(amino_acid ~ species, data = hemolymph_data)


summary(anova_result)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## species      2  55.26  27.630    3.16 0.0631 .
## Residuals   21 183.63   8.744                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


c. Test the hypothesis that there is no difference in the mean hemolymph alanine concentration among the three species. Write down the null and alternate hypotheses, the test statistic, and the rationale behind your conclusion. (10 points)

Answer: Null Hypothesis (\(H_0\)): The mean alanine concentrations are the same across the three species, \(H_0: \mu_A = \mu_B = \mu_C\).

Alternative Hypothesis (H₁): At least one species has a different mean alanine concentration.

The p-value (0.0631) is greater than the significance level (α = 0.05), so we fail to reject the null hypothesis. The test statistic (F value) is 3.16. This indicates no statistically significant difference in mean hemolymph alanine concentrations among the three species at α = 0.05. However, the p-value is close to 0.05, suggesting a larger sample size may provide more clarity.
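The same decision can be reached by comparing the observed F statistic with the critical value of the F distribution. The sketch below rebuilds the data frame so that it runs on its own; the names `fit`, `tab`, `f_obs`, `f_crit`, and `p_val` are illustrative and not part of the assignment.

```r
# Rebuild the data so this block is self-contained
hemolymph_data <- data.frame(
  species = rep(c("A", "B", "C"), 8),
  gender = rep(c("M", "F"), each = 12),
  amino_acid = c(21.5, 14.5, 16.0, 19.6, 17.4, 20.3, 20.9, 15.0, 18.5,
                 22.8, 17.8, 19.3, 14.8, 12.1, 14.4, 15.6, 11.4, 14.7,
                 13.5, 12.7, 13.8, 16.4, 14.5, 12.0)
)

fit <- aov(amino_acid ~ species, data = hemolymph_data)
tab <- summary(fit)[[1]]          # ANOVA table as a data frame

f_obs <- tab[["F value"]][1]      # observed F statistic for species
df1 <- tab[["Df"]][1]             # numerator degrees of freedom
df2 <- tab[["Df"]][2]             # residual degrees of freedom

f_crit <- qf(0.95, df1, df2)      # critical value at alpha = 0.05
p_val <- pf(f_obs, df1, df2, lower.tail = FALSE)

# Reject H0 only if the observed F exceeds the critical value
c(F_observed = f_obs, F_critical = f_crit, p_value = p_val)
f_obs > f_crit
```

Both rules are equivalent: the observed F exceeds the critical value exactly when the p-value falls below 0.05.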


d. Is there a necessity for a posthoc analysis (e.g., Tukey’s HSD test) for the ANOVA table in part (b)? Explain why or why not clearly. Irrespective of your answer, please perform the Tukey’s HSD test and comment on your findings, i.e., which pair of values of the predictor has a significant difference, if any. (15 points)

Answer:

Because the one-way ANOVA did not reject the null hypothesis (p = 0.0631), a post-hoc test is not strictly necessary: post-hoc comparisons are meant to locate differences after the overall F test has found one. Tukey’s HSD agrees with the ANOVA. The adjusted p-values for all pairwise comparisons (B-A, C-A, and C-B) are greater than 0.05, so no pair of species differs significantly. The B-A comparison comes closest, with the largest mean difference (-3.71) and an adjusted p-value of 0.051; its 95% confidence interval (-7.44, 0.01) just barely includes zero.

anova_model <- aov(amino_acid ~ species, data = hemolymph_data)


tukey_results <- TukeyHSD(anova_model)


tukey_results
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = amino_acid ~ species, data = hemolymph_data)
## 
## $species
##        diff       lwr        upr     p adj
## B-A -3.7125 -7.439243 0.01424349 0.0509979
## C-A -2.0125 -5.739243 1.71424349 0.3786169
## C-B  1.7000 -2.026743 5.42674349 0.4952137


e. Now, test the hypothesis that there is no difference in the mean hemolymph alanine concentration between males and females. First, identify which variable you need as your predictor here. Then draw the boxplot like question (a) and build the ANOVA table like question (b). Finally, use the ANOVA table to do the F test. Clearly write down the null and alternate hypotheses, the test statistic, and the rationale behind your conclusion. (25 points)

Answer: The predictor is gender. Null Hypothesis (\(H_0\)): There is no difference in the mean hemolymph alanine concentration between males and females, \(H_0: \mu_M = \mu_F\). Alternative Hypothesis (\(H_a\)): The mean hemolymph alanine concentration differs between males and females, \(H_a: \mu_M \neq \mu_F\).

The \(F\)-value from the ANOVA table is 30.47 on 1 and 22 degrees of freedom, with a corresponding p-value of 1.51e-05. Since the p-value is far below the significance threshold of 0.05, we reject the null hypothesis. Mean alanine concentration differs significantly between males and females, so gender has a significant effect on hemolymph alanine concentration in this dataset.

boxplot(amino_acid ~ gender, data = hemolymph_data, 
        main = "Hemolymph Alanine Concentration by Gender",
        xlab = "Gender", ylab = "Amino Acid Concentration (Alanine)", 
        col = c("lightblue", "pink"))

anova_gender <- aov(amino_acid ~ gender, data = hemolymph_data)
summary(anova_gender)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## gender       1  138.7  138.72   30.47 1.51e-05 ***
## Residuals   22  100.2    4.55                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


f. Is there a necessity for a posthoc analysis (e.g., Tukey’s HSD test) for the ANOVA table in part (e)? Explain why or why not clearly. Irrespective of your answer, please perform the Tukey’s HSD test and comment on your findings like part (d). (20 points)

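Answer: A post-hoc test is not necessary here. The ANOVA is significant (\(F = 30.47\), \(p\) = 1.51e-05), but gender has only two levels, so the F test is already the single pairwise comparison (M vs. F) and there is nothing further to locate. Tukey’s HSD confirms this: the M-F difference is 4.81 with a 95% confidence interval of (3.00, 6.61) and an adjusted p-value of 1.51e-05, identical to the ANOVA p-value. Males have significantly higher mean alanine concentrations than females.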

anova_gender <- aov(amino_acid ~ gender, data = hemolymph_data)


tukey_gender <- TukeyHSD(anova_gender)
tukey_gender
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = amino_acid ~ gender, data = hemolymph_data)
## 
## $gender
##         diff      lwr      upr    p adj
## M-F 4.808333 3.001732 6.614934 1.51e-05
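
With only two groups, the one-way ANOVA F test is equivalent to a pooled-variance two-sample t test, which is one way to see why a post-hoc test adds nothing in this part. A minimal sketch of that check follows; it rebuilds the data so it can run alone, and `f_obs`, `tt`, and `t_obs` are illustrative names.

```r
# Rebuild the data so this block is self-contained
hemolymph_data <- data.frame(
  species = rep(c("A", "B", "C"), 8),
  gender = rep(c("M", "F"), each = 12),
  amino_acid = c(21.5, 14.5, 16.0, 19.6, 17.4, 20.3, 20.9, 15.0, 18.5,
                 22.8, 17.8, 19.3, 14.8, 12.1, 14.4, 15.6, 11.4, 14.7,
                 13.5, 12.7, 13.8, 16.4, 14.5, 12.0)
)

# F statistic from the one-way ANOVA on gender
f_obs <- summary(aov(amino_acid ~ gender, data = hemolymph_data))[[1]][["F value"]][1]

# Two-sample t test with a pooled variance (var.equal = TRUE)
tt <- t.test(amino_acid ~ gender, data = hemolymph_data, var.equal = TRUE)
t_obs <- unname(tt$statistic)

# The squared t statistic should match the ANOVA F statistic
c(F = f_obs, t_squared = t_obs^2, p_value = tt$p.value)
```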


g. Now, we want to explore how the response variable amino_acid varies based on both species and gender. Create an interaction plot to visualize the effects of these two factors on the response variable. What do you expect the relationship between species and gender to be in terms of their effect on amino_acid levels? (10 points)

Answer: Males show higher mean alanine concentrations than females in every species, which points to a clear main effect of gender. The effect of species appears moderate. The gap between males and females is somewhat larger for species A than for species B, so a slight interaction is possible, but the overall pattern of gender differences is similar across species. We therefore expect the two factors to act roughly additively, with at most a weak interaction.

library(ggplot2)


interaction_plot <- ggplot(hemolymph_data, aes(x = species, y = amino_acid, color = gender, group = gender)) +
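  # Plot the cell means rather than raw observations, so each gender traces one line across species
  stat_summary(fun = mean, geom = "point", size = 3) +
  stat_summary(fun = mean, geom = "line") +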
  labs(
    title = "Interaction Plot: Amino Acid Levels by Species and Gender",
    x = "Species",
    y = "Alanine Concentration",
    color = "Gender"
  ) +
  theme_minimal()


print(interaction_plot)
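
An interaction plot is read from the cell means, so it can help to look at those numbers directly. The sketch below computes the mean for each species-by-gender cell and the male-female gap within each species; `cell_means` and `gap` are illustrative names, and the data frame is rebuilt so the block runs alone.

```r
# Rebuild the data so this block is self-contained
hemolymph_data <- data.frame(
  species = rep(c("A", "B", "C"), 8),
  gender = rep(c("M", "F"), each = 12),
  amino_acid = c(21.5, 14.5, 16.0, 19.6, 17.4, 20.3, 20.9, 15.0, 18.5,
                 22.8, 17.8, 19.3, 14.8, 12.1, 14.4, 15.6, 11.4, 14.7,
                 13.5, 12.7, 13.8, 16.4, 14.5, 12.0)
)

# Mean alanine concentration in each species x gender cell
# (rows are species, columns are gender)
cell_means <- with(hemolymph_data,
                   tapply(amino_acid, list(species, gender), mean))
cell_means

# Male minus female gap within each species; similar gaps across
# species correspond to roughly parallel lines in the plot
gap <- cell_means[, "M"] - cell_means[, "F"]
gap
```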


h. Build the above two-way additive ANOVA model without interaction. (5 points)

Answer:

additive_model <- aov(amino_acid ~ species + gender, data = hemolymph_data)


summary(additive_model)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## species      2  55.26   27.63   12.30 0.000328 ***
## gender       1 138.72  138.72   61.78 1.53e-07 ***
## Residuals   20  44.91    2.25                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


i. From the above ANOVA table in part (h), are gender and species significant? Clearly state the hypotheses, the test statistic, your conclusion, and provide the reasoning behind your conclusion. (10 points)

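Answer: The analysis tested whether alanine concentrations in millipede hemolymph vary by species and by gender. For species, \(H_0\): the mean alanine concentration is the same for all three species after accounting for gender, against \(H_a\): at least one species mean differs. For gender, \(H_0\): the mean alanine concentration is the same for males and females after accounting for species, against \(H_a\): the two means differ. The ANOVA table shows significant effects for both species (\(F = 12.30\), \(p = 0.0003\)) and gender (\(F = 61.78\), \(p = 1.53 \times 10^{-7}\)). Both p-values are below 0.05, so we reject both null hypotheses: species and gender each significantly influence alanine levels. Males consistently had higher alanine levels than females, and among species the largest difference in means is between A and B (-3.71). Species was not significant in the one-way model of part (b) because the variation due to gender was left in the residuals; including gender lowers the residual mean square from 8.74 to 2.25, which raises the species \(F\)-value from 3.16 to 12.30.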

anova_model <- aov(amino_acid ~ species + gender, data = hemolymph_data)


summary(anova_model)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## species      2  55.26   27.63   12.30 0.000328 ***
## gender       1 138.72  138.72   61.78 1.53e-07 ***
## Residuals   20  44.91    2.25                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


j. Now, build the two-way ANOVA model with interaction term. (5 points)

Answer:

anova_interaction <- aov(amino_acid ~ species * gender, data = hemolymph_data)


summary(anova_interaction)
##                Df Sum Sq Mean Sq F value   Pr(>F)    
## species         2  55.26   27.63  13.082  0.00031 ***
## gender          1 138.72  138.72  65.679 2.04e-07 ***
## species:gender  2   6.89    3.45   1.631  0.22331    
## Residuals      18  38.02    2.11                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


k. Is the interaction term needed in the regression model? Clearly state the hypotheses, the test statistic, your conclusion, and provide the reasoning behind your conclusion. (10 points)

Answer: The analysis tested whether the interaction between species and gender significantly affects amino acid concentration. The null hypothesis states that the interaction effect is zero (\(\beta_{\text{species:gender}} = 0\)), while the alternative hypothesis states that it is not (\(\beta_{\text{species:gender}} \neq 0\)). The ANOVA results show an F-statistic of 1.631 with a p-value of 0.22331. Since \(p > 0.05\), we fail to reject the null hypothesis, indicating the interaction does not significantly influence amino acid concentration. Including the interaction term would unnecessarily complicate the model without improving its explanatory power. Therefore, the main effects of species and gender are sufficient to explain the data.


l. Which model between part (h) and part (j) would you prefer? You have already compared models while doing regression. Try to apply any one of those strategies here. (10 points)

Answer: The model from part (h) is preferable because the interaction term in part (j) is statistically insignificant, making its added complexity unjustified. This simpler model effectively captures the significant main effects of species and gender while adhering to the principle of parsimony, which favors simplicity without compromising explanatory power. The high p-value for the interaction term in part (j) further confirms it does not meaningfully improve the model.

# Code if needed 
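
One strategy carried over from regression is the partial F test for nested models, which `anova()` performs when given both fits. The sketch below is one possible way to run that comparison; `additive_model`, `interaction_model`, and `cmp` are illustrative names, and the data frame is rebuilt so the block runs alone.

```r
# Rebuild the data so this block is self-contained
hemolymph_data <- data.frame(
  species = rep(c("A", "B", "C"), 8),
  gender = rep(c("M", "F"), each = 12),
  amino_acid = c(21.5, 14.5, 16.0, 19.6, 17.4, 20.3, 20.9, 15.0, 18.5,
                 22.8, 17.8, 19.3, 14.8, 12.1, 14.4, 15.6, 11.4, 14.7,
                 13.5, 12.7, 13.8, 16.4, 14.5, 12.0)
)

# Nested models: the additive model is the interaction model
# with the species:gender term removed
additive_model <- aov(amino_acid ~ species + gender, data = hemolymph_data)
interaction_model <- aov(amino_acid ~ species * gender, data = hemolymph_data)

# Partial F test: does adding the interaction reduce the
# residual sum of squares by more than chance would explain?
cmp <- anova(additive_model, interaction_model)
cmp

f_partial <- cmp[["F"]][2]
p_partial <- cmp[["Pr(>F)"]][2]
```

The partial F statistic here tests the same two interaction degrees of freedom as the species:gender row in part (j), so a p-value above 0.05 supports keeping the simpler additive model.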




Question 2 (15 Points)


Tukey’s HSD test is not the only option for posthoc analysis. There are other available options such as Scheffé’s test, Dunnett’s test, Fisher’s LSD test, pairwise t tests with \(p\) value corrections, and so on.


Here is a nice resource for Scheffé’s test in R. Use the two-way ANOVA model in part (h) from question 1 and perform a Scheffé’s test. Explain all the findings, i.e., which pairwise differences are significant. Careful! You may need to install a new package.

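Answer: Under the additive model, Scheffé’s test finds a significant difference between species B and A (difference = -3.71, p = 0.00095). The C-A (p = 0.098) and C-B (p = 0.196) differences are not significant at \(\alpha = 0.05\), since both confidence intervals include zero. The difference between males and females is also significant (difference = 4.81, p = 2.5e-06).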

if (!require("DescTools")) install.packages("DescTools")
## Loading required package: DescTools
library(DescTools)


anova_model <- aov(amino_acid ~ species + gender, data = hemolymph_data)


scheffe_results <- ScheffeTest(anova_model)


scheffe_results
## 
##   Posthoc multiple comparisons of means: Scheffe Test 
##     95% family-wise confidence level
## 
## $species
##        diff     lwr.ci     upr.ci    pval    
## B-A -3.7125 -5.9967689 -1.4282311 0.00095 ***
## C-A -2.0125 -4.2967689  0.2717689 0.09757 .  
## C-B  1.7000 -0.5842689  3.9842689 0.19586    
## 
## $gender
##         diff   lwr.ci   upr.ci    pval    
## M-F 4.808333 2.943236 6.673431 2.5e-06 ***
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
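
The question also names pairwise t tests with \(p\) value corrections as an option. The sketch below uses `pairwise.t.test()` from base R with a Bonferroni correction on species alone. It is an illustration under a different assumption than the Scheffé output above: it does not adjust for gender, so gender variation stays in the pooled error term, as in the one-way analysis of Question 1 part (d). The name `pw` is illustrative.

```r
# Rebuild the data so this block is self-contained
hemolymph_data <- data.frame(
  species = rep(c("A", "B", "C"), 8),
  gender = rep(c("M", "F"), each = 12),
  amino_acid = c(21.5, 14.5, 16.0, 19.6, 17.4, 20.3, 20.9, 15.0, 18.5,
                 22.8, 17.8, 19.3, 14.8, 12.1, 14.4, 15.6, 11.4, 14.7,
                 13.5, 12.7, 13.8, 16.4, 14.5, 12.0)
)

# Pairwise pooled-SD t tests between species, with Bonferroni-adjusted p-values
pw <- pairwise.t.test(hemolymph_data$amino_acid, hemolymph_data$species,
                      p.adjust.method = "bonferroni")

# Lower-triangular matrix of adjusted p-values (NA where no comparison applies)
pw$p.value
```

Because gender is ignored here, these comparisons are less sensitive than the Scheffé test on the additive model, which is why the two approaches need not agree about B versus A.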




Question 3 (25 Points)

Researchers conducted a study to investigate the effects of two factors, Diet and Exercise, on weight loss. Diet can take the values “regular”, “low-carb”, and “high-fat”. Exercise can take the values “none”, “aerobics-only”, “weights-only”, and “weights-followed-by-aerobics”. They measured weight loss (in kg) for 60 subjects assigned to different levels of Diet and Exercise. Using a two-way ANOVA, they examined the main effects of Diet and Exercise as well as their interaction on weight loss. Below is an incomplete ANOVA table summarizing the results. What are the values of A – L?

Source          Df   Sum Sq   Mean Sq   F value
Diet            A    F        200       J
Exercise        B    G        150       K
Diet:Exercise   C    300      I         L
Residuals       D    H        15
Total           E    1800

Note: You may not have seen the last row in typical R outputs.

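Answer: A: 2, B: 3, C: 6, D: 48, E: 59, F: 400.00, G: 450.00, H: 720.00, I: 50.00, J: 13.33, K: 10.00, L: 3.33. The degrees of freedom follow from the 3 Diet levels, 4 Exercise levels, and 60 subjects. Each remaining value follows from Mean Sq = Sum Sq / Df and F value = Mean Sq / residual Mean Sq. Note that these sums of squares add to 1870 rather than the stated total of 1800, so the table as given is not fully consistent.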
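
The table entries can be derived mechanically from the definitions Mean Sq = Sum Sq / Df and F value = Mean Sq / residual Mean Sq. The sketch below works through that arithmetic, assuming the given mean squares (200, 150, 15) and the interaction sum of squares (300) are taken at face value; all variable names are illustrative.

```r
# Design: 3 Diet levels, 4 Exercise levels, 60 subjects
n <- 60
diet_levels <- 3
exercise_levels <- 4

# Degrees of freedom
df_diet <- diet_levels - 1                                  # A
df_exercise <- exercise_levels - 1                          # B
df_inter <- df_diet * df_exercise                           # C
df_total <- n - 1                                           # E
df_resid <- df_total - df_diet - df_exercise - df_inter     # D

# Values given in the table
ms_diet <- 200
ms_exercise <- 150
ss_inter <- 300
ms_resid <- 15

# Sum Sq = Mean Sq * Df, and Mean Sq = Sum Sq / Df
ss_diet <- ms_diet * df_diet                                # F
ss_exercise <- ms_exercise * df_exercise                    # G
ss_resid <- ms_resid * df_resid                             # H
ms_inter <- ss_inter / df_inter                             # I

# F value = Mean Sq of the term / residual Mean Sq
f_diet <- ms_diet / ms_resid                                # J
f_exercise <- ms_exercise / ms_resid                        # K
f_inter <- ms_inter / ms_resid                              # L

# Sum of the four component sums of squares, for comparison
# with the Total row of the table
ss_sum <- ss_diet + ss_exercise + ss_inter + ss_resid

round(c(A = df_diet, B = df_exercise, C = df_inter, D = df_resid,
        E = df_total, F = ss_diet, G = ss_exercise, H = ss_resid,
        I = ms_inter, J = f_diet, K = f_exercise, L = f_inter), 2)
ss_sum
```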





  1. Courtesy: Biostatistical Analysis, Jerrold H. Zar