Lab 3: ANOVA using data from Godshalx et al 2016

What this lab does (read me first):
- You will complete a fully guided one-way ANOVA for HCNp values for each Organ studied in this paper (with plots, assumptions, transformations).
- Then you will replicate the workflow for Protein ~ Organ - After each task, answer the numbered interpretation questions in your own words.

As always work through the lab one chunk of code at a time and answer the student questions as they occur. This lab is different as once you finish the guided work for the HCNp values you’ll need to copy and paste the code to do the same plots and analyses using the protein data. You will use the same procedures as last time - create a new folder for this lab and save the .Rmd file in that folder. Within that folder create a new folder called “data” which is where you will put the data files. I’m giving you all three data files from the paper you discussed on Wednesday but this lab will only use the Additional file 1 data to analyze.

1 0) Setup & data loading — why this matters

Load packages, locate the CSV and coerce Organ to a factor so ANOVA treats it as a categorical group variable.

suppressPackageStartupMessages({
  library(readr)    # fast, tidy CSV reading
  library(dplyr)    # data wrangling verbs
  library(ggplot2)  # plotting
  library(car)      # Levene's test & power transforms
  library(broom)    # tidy model outputs
  library(knitr)    # kable tables
})

knitr::opts_chunk$set(fig.width = 6, fig.asp = 1)


# Load and prepare the dataset for ANOVA

# Read in the CSV file
# - readr::read_csv() reads a CSV (comma-separated values) file into R as a tibble (a tidyverse-friendly version of a data frame)
hc <- readr::read_csv(here::here("data", "Additional file 1.csv"))
glimpse(hc) #look at the data to make sure everything looks right

## Rows: 80
## Columns: 3
## $ Organ   <chr> "young leaf", "young leaf", "young leaf", "young leaf", "young…
## $ Protein <dbl> 12.71, 13.42, 12.19, 12.18, 14.94, 13.11, 12.94, 13.77, 10.21,…
## $ HCNp    <dbl> 68.300000, 68.700000, 65.200000, 64.500000, 64.214383, 69.9803…

hc1 <- hc %>% #creates a new dataframe with Organ (categorical) and HCNp 
  select(Organ, HCNp) %>%
  mutate(Organ = factor(Organ))  # assures that Organ is categorical

Student questions (Setup):

Why do we explicitly coerce Organ to a factor before ANOVA?
We explicitly coerce ‘Organ’ to a factor before ANOVA so that R knows that we will be checking for variance relative to this variable. S
What are the variable names shown after the glimpse() command and what types of variables are they (e.g., numeric, character)? Young leaf, young leaf, young leaf, young leaf over and over. They are categorical. Organ = category. Category is young leaf.

2 Part A (Guided lab portion): HCNp ~ Organ

2.1 A1) Always visualize the data first — why start with plots?

Plots help us see patterns, spread, and outliers before modeling. Boxplots summarize the distribution per group; jittered points show raw data.

3 1.1 Boxplot

#tells knitr how big to make any figures (6 inches by 4 inches) produced #in this chunk
ggplot(
  data = hc1,
  aes(x = Organ, y = HCNp, fill = Organ)
) +
  geom_boxplot(
    alpha = 0.75,
    outlier.shape = NA
  ) +
  # Draws boxplots
  # alpha = 0.75 makes boxes semi-transparent so points are visible on top.
  # outlier.shape = NA hides the built-in outlier dots (we’ll show raw #points ourselves). Add raw observations as jittered points so they don’t #stack vertically.
  
  geom_jitter(
    aes(color = Organ),
    width  = 0.15,
    height = 0,
    alpha  = 0.6,
    size   = 1
  ) +
  # Map Organ to 'color' so point outlines match the box color.
  # width controls horizontal jitter; height = 0 keeps true HCNp values.
  # alpha < 1 and small size help with overplotting.
  # Title and axis labels.
  labs(
    title = "HCNp across organs",
    y = "HCNp",
    x = "Organ"
  ) +
  # Theme tweaks below:
  # - rotate x labels vertically for long organ names
  # - drop legend (x axis already identifies groups)
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
    legend.position = "none"
  )

Student questions:

Which organs appear to have the highest and lowest median HCNp? What features of the boxplots support your answer?
Young leaf appears to have the highest median HCNp, by a very large amount. Mature fruit seems to have the lowest median HCNp. Both the medians and variation within the boxplots indicate this difference.
Do any organs show substantially larger spread (IQR or Interquartile Range-this is the range of the middle 50% of the values for each variable and is represented by the size of the box ). Do any organs look like they have potential outliers? What might that imply for ANOVA assumptions?
Small flower bud and young fruit have a substantially longer IQR. Young fruit also seems to have a potential outlier. This could violate the ANOVA assumption of normality and equal variance, and could lead to increased potential for error as well as less of a chance to see a difference between groups.
Based on the plot which groups do you think might be “significantly” different than others?
I think medium leaf, small flower bud, young fruit and young leaf will be significantly different from the others, with young leaf being significantly different from all of the groups.

3.1 A2) Fit one-way ANOVA

ANOVA tests whether all group means are equal vs at least one differs. It partitions total variability into between-group and within-group components and compares them via an F-ratio.

# Safety: ensure required columns exist and drop NAs (if there is missing
#data in an excel file, make sure the excel file has NA in that cell)
stopifnot(all(c("HCNp","Organ") %in% names(hc1)))

# Fit the ANOVA model (HCNp as response; Organ as categorical predictor)
fit1 <- aov(HCNp ~ Organ, data = hc1)

# Prints the ANOVA table so it appears in the knitted output
cat("\nANOVA table: HCNp ~ Organ\n")

## 
## ANOVA table: HCNp ~ Organ

print(summary(fit1))

##             Df Sum Sq Mean Sq F value Pr(>F)    
## Organ        9  30900    3433   381.6 <2e-16 ***
## Residuals   70    630       9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Context: group sizes & means (helps when interpreting differences)
means_hcn <- hc1 %>%
  dplyr::group_by(Organ) %>%
  dplyr::summarise(
    n    = dplyr::n(),
    mean = mean(HCNp),
    sd   = sd(HCNp),
    .groups = "drop"
  )
knitr::kable(means_hcn, digits = 3, caption = "Group sizes and means for HCNp by Organ")# kable (from knitr) turns a data frame/tibble/matrix into #a formatted table using the means_hcn when you knit an R Markdown. It allows you to  #add captions, rounding, alignment, and custom column names.

Group sizes and means for HCNp by Organ
Organ	n	mean	sd
Fully developed flower (white)	8	8.662	2.192
Fully developed flower (yellow)	8	6.235	1.253
Intermediate fruit	8	2.331	0.753
large flower bud	8	10.030	1.728
Mature fruit	8	0.093	0.081
mature leaf	8	6.464	1.708
medium leaf	8	32.111	1.572
small flower bud	8	25.255	3.823
Young fruit	8	32.055	7.080
young leaf	8	67.357	3.147

# --- Extract reporting numbers  ---

# Build the ANOVA-like summary table from the fitted model object `fit1`.
# For aov objects, summary(fit1)[[1]] returns a matrix/data.frame with rows for model terms (e.g., "Organ") and "Residuals", and columns like "Df", "F value", "Pr(>F)".
aov_tab <- summary(fit1)[[1]]

# Pull the F statistic for the factor named exactly "Organ" from the #table.Assumes the row name is "Organ" (i.e., the model was fit with Organ #as a factor term).
F_value <- aov_tab["Organ","F value"]

# Pull the p-value for the Organ effect (column name "Pr(>F)").
p_value <- aov_tab["Organ","Pr(>F)"]

# Extract degrees of freedom:
#   df1 = numerator df for Organ (its row's "Df")
#   df2 = denominator df from the Residuals row
df1 <- aov_tab["Organ","Df"]; df2 <- aov_tab["Residuals","Df"]

# Print a compact report line:
#   F(df1, df2) = F_value, p = p_value
# Note: %.4g prints p in up to 4 significant digits (switching to scientific notation if tiny).
cat(sprintf("\nReport: F(%d, %d) = %.3f, p = %.4g\n", df1, df2, F_value, p_value))

## 
## Report: F(9, 70) = 381.636, p = 8.289e-56

Student questions (A2):

Write a one-sentence result with the F statistic, df, and p-value. Is there evidence that mean HCNp differs among organs?
The results indicate a vary large F-value (381.636) with 9 degrees of freedom (8 + 1) between groups and 70 degrees of freedom within groups. This indicates the variance between groups is much greater than the variance within groups. These values, along with the p-value (which is very close to 0) indicate that variance is significantly different between at least 1 of the groups.
Looking at the means table, which organs appear most different in average HCNp? (This is just descriptive; we will do post hoc testing to confirm pairwise differences.)
Young leaf had the highest average HCNp (much larger than the rest) and mature fruit had the lowest average HCNp (smaller than the rest)

3.2 A3) Check assumptions

ANOVA assumes independent observations, roughly normal residuals, and homogeneous variances. We check residual plots and use tests as guides (not absolute rules).

# Residuals vs Fitted: look for no pattern (looking for a random cloud of #points), and roughly constant vertical spread
ggplot(data.frame(fit = fitted(fit1), res = resid(fit1)), aes(fit, res)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted (HCNp ~ Organ)", x = "Fitted values", y = "Residuals")

# QQ plot: residuals should follow the reference line if approximately normal
ggplot(data.frame(res = resid(fit1)), aes(sample = res)) +
  stat_qq() + stat_qq_line() +
  labs(title = "QQ plot of residuals (HCNp ~ Organ)", x = "Theoretical Quantiles", y = "Sample Quantiles")

# 1) Normality of residuals (ANOVA assumes normally distributed errors)
#    We test the residuals from the fitted model, not each raw group,
#    because ANOVA is about model errors, not the raw data per se.
cat("\nShapiro–Wilk normality test (residuals): HCNp ~ Organ\n")

## 
## Shapiro–Wilk normality test (residuals): HCNp ~ Organ

res <- residuals(fit1)
print(shapiro.test(resid(fit1)))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit1)
## W = 0.87512, p-value = 1.254e-06

# Interpretation tip:
#   p < 0.05 → evidence residuals deviate from normality
#   p ≥ 0.05 → no strong evidence against normality
# Remember: with moderate n, ANOVA is fairly robust to mild non-normality.

# 2) test for Homogeneity of variances (equal variances across groups)

cat("\nLevene's test for equal variances (median-centered): HCNp ~ Organ\n")

## 
## Levene's test for equal variances (median-centered): HCNp ~ Organ

print(leveneTest(HCNp ~ Organ, data = hc1, center = median))

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value    Pr(>F)    
## group  9  4.9223 4.049e-05 ***
##       70                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# 3) Welch's ANOVA is a statistical test used as an alternative to the traditional one-way ANOVA when the assumption of equal variances across groups is violated. It is particularly useful for comparing the means of three or more groups when their variances are unequal. Welch's ANOVA  still assumes ~normality)
#    Use when Levene indicates heteroscedasticity.
cat("\nWelch's ANOVA (robust to unequal variances): HCNp ~ Organ\n")

## 
## Welch's ANOVA (robust to unequal variances): HCNp ~ Organ

print(oneway.test(HCNp ~ Organ, data = hc1, var.equal = FALSE))

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  HCNp and Organ
## F = 752.54, num df = 9.000, denom df = 25.805, p-value < 2.2e-16

# Interpretation tip:
#   If Welch is significant while classic ANOVA disagrees, trust Welch when variances are unequal.

# Practical guidance:
# - These tests inform, they don’t automatically veto ANOVA.
# - If assumptions look shaky: consider data transformations, Welch’s ANOVA,
#   or nonparametric alternatives (e.g., Kruskal–Wallis) depending on the issue.

Student questions (A3):

Does the Residuals vs Fitted plot suggest constant variance (homoscedasticity)? Describe any pattern (e.g., funnel shape).
No, the residuals vs fitted plot does not suggest constant variance because there is a slight funnel shape from left to right.
Is the QQ plot approximately linear? If not, what does the deviation suggest?
The line of best fit is linear, but the data points themselves vary a lot at the tails, suggesting the data is not normal.
Based on the graphs would you trust the ANOvA output using the original data? Based on the Levene’s p-value, would you trust the classical ANOVA or prefer Welch’s? Why?
I would trust Welch’s ANOVA in this case because the data has unequal variance and is not normal. Since the Levene’s value is less than 0.001, there is a significant difference in the variances between groups. —

3.3 A4) Optional transformation — when and why?

If variance increases with the mean (fan shape) or residuals are skewed, a log (or power) transform can stabilize variance and improve normality.

.safe_log <- function(x) if (all(x > 0, na.rm = TRUE)) log(x) else log1p(x)

hc_hcn <- hc1 %>% dplyr::mutate(HCNp_log = .safe_log(HCNp))

fit1_log <- aov(HCNp_log ~ Organ, data = hc_hcn)
cat("\nANOVA (log scale): HCNp_log ~ Organ\n")

## 
## ANOVA (log scale): HCNp_log ~ Organ

print(summary(fit1_log))

##             Df Sum Sq Mean Sq F value Pr(>F)    
## Organ        9  109.2  12.129   425.2 <2e-16 ***
## Residuals   70    2.0   0.029                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

cat("\nLevene on log scale (median-centered):\n")

## 
## Levene on log scale (median-centered):

print(leveneTest(HCNp_log ~ Organ, data = hc_hcn, center = median))

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  9  2.4378 0.01788 *
##       70                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ggplot(data.frame(fit = fitted(fit1_log), res = resid(fit1_log)), aes(fit, res)) +
  geom_point(alpha = 0.6) + geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted (log HCNp ~ Organ)", x = "Fitted (log HCNp)", y = "Residuals")

ggplot(data.frame(res = resid(fit1_log)), aes(sample = res)) +
  stat_qq() + stat_qq_line() +
  labs(title = "QQ plot (log HCNp ~ Organ)", x = "Theoretical Quantiles", y = "Sample Quantiles")

Student questions (A4):

Did the log transform improve the diagnostics? Cite plots and Levene’s p-value. Yes, the log transformation improved diagnostics by a little bit. There is still some variation in the residual vs fitted plot, but it makes less of a shape. There is still some variation in the QQ plot at the tails. Levene’s p-value is greater than 0.001, meaning that the data is homogeneus and the data has equal variance.
Which model will you report (original vs log)? Justify briefly (1–2 sentences).
I will report the log transformed analysis because the transformation resulted in equal variances of the groups, which qualifies the data for a classic ANOVA. The p-value from the ANOVA is also smaller than in the transformed data than in the original data. —

3.4 A5) Which organs differ? (Tukey HSD)

If your chosen ANOVA is significant, Tukey HSD controls the family-wise error while comparing all pairs. The code below is there for the log transformed values. You would use the fit1 if you are doing the Tukey test with the non-transformed data (the original dataset) - I don’t care which one you choose, but whichever you choose justify the choice

tuk <- TukeyHSD(if (exists("fit1_log")) fit1_log else fit1)  # use transformed fit if you chose it
tuk_df <- broom::tidy(tuk) %>%
  dplyr::mutate(sig = ifelse(adj.p.value < 0.05, "Significant", "Not significant")) %>%
  dplyr::arrange(adj.p.value)
knitr::kable(tuk_df, digits = 3, caption = "Tukey HSD pairwise comparisons (HCNp)")

Tukey HSD pairwise comparisons (HCNp)
term	contrast	estimate	conf.low	conf.high	adj.p.value	sig
Organ	Intermediate fruit-Fully developed flower (white)	-1.066	-1.342	-0.790	0.000	Significant
Organ	Mature fruit-Fully developed flower (white)	-2.160	-2.436	-1.884	0.000	Significant
Organ	medium leaf-Fully developed flower (white)	1.252	0.976	1.528	0.000	Significant
Organ	small flower bud-Fully developed flower (white)	1.012	0.736	1.288	0.000	Significant
Organ	Young fruit-Fully developed flower (white)	1.232	0.956	1.508	0.000	Significant
Organ	young leaf-Fully developed flower (white)	1.977	1.701	2.253	0.000	Significant
Organ	Intermediate fruit-Fully developed flower (yellow)	-0.786	-1.062	-0.510	0.000	Significant
Organ	Mature fruit-Fully developed flower (yellow)	-1.881	-2.157	-1.605	0.000	Significant
Organ	medium leaf-Fully developed flower (yellow)	1.532	1.256	1.808	0.000	Significant
Organ	small flower bud-Fully developed flower (yellow)	1.291	1.015	1.567	0.000	Significant
Organ	Young fruit-Fully developed flower (yellow)	1.511	1.235	1.787	0.000	Significant
Organ	young leaf-Fully developed flower (yellow)	2.257	1.981	2.533	0.000	Significant
Organ	large flower bud-Intermediate fruit	1.209	0.933	1.485	0.000	Significant
Organ	Mature fruit-Intermediate fruit	-1.095	-1.371	-0.819	0.000	Significant
Organ	mature leaf-Intermediate fruit	0.804	0.528	1.080	0.000	Significant
Organ	medium leaf-Intermediate fruit	2.318	2.042	2.594	0.000	Significant
Organ	small flower bud-Intermediate fruit	2.077	1.801	2.353	0.000	Significant
Organ	Young fruit-Intermediate fruit	2.298	2.022	2.574	0.000	Significant
Organ	young leaf-Intermediate fruit	3.043	2.767	3.319	0.000	Significant
Organ	Mature fruit-large flower bud	-2.304	-2.580	-2.028	0.000	Significant
Organ	medium leaf-large flower bud	1.109	0.833	1.385	0.000	Significant
Organ	small flower bud-large flower bud	0.868	0.592	1.144	0.000	Significant
Organ	Young fruit-large flower bud	1.088	0.812	1.364	0.000	Significant
Organ	young leaf-large flower bud	1.834	1.558	2.110	0.000	Significant
Organ	mature leaf-Mature fruit	1.899	1.623	2.175	0.000	Significant
Organ	medium leaf-Mature fruit	3.412	3.136	3.688	0.000	Significant
Organ	small flower bud-Mature fruit	3.172	2.896	3.448	0.000	Significant
Organ	Young fruit-Mature fruit	3.392	3.116	3.668	0.000	Significant
Organ	young leaf-Mature fruit	4.137	3.861	4.413	0.000	Significant
Organ	medium leaf-mature leaf	1.514	1.238	1.790	0.000	Significant
Organ	small flower bud-mature leaf	1.273	0.997	1.549	0.000	Significant
Organ	Young fruit-mature leaf	1.493	1.217	1.769	0.000	Significant
Organ	young leaf-mature leaf	2.239	1.963	2.515	0.000	Significant
Organ	young leaf-small flower bud	0.966	0.690	1.241	0.000	Significant
Organ	young leaf-Young fruit	0.745	0.469	1.021	0.000	Significant
Organ	young leaf-medium leaf	0.725	0.449	1.001	0.000	Significant
Organ	large flower bud-Fully developed flower (yellow)	0.423	0.147	0.699	0.000	Significant
Organ	mature leaf-large flower bud	-0.405	-0.681	-0.129	0.000	Significant
Organ	Fully developed flower (yellow)-Fully developed flower (white)	-0.279	-0.555	-0.003	0.045	Significant
Organ	mature leaf-Fully developed flower (white)	-0.262	-0.537	0.014	0.078	Not significant
Organ	small flower bud-medium leaf	-0.241	-0.516	0.035	0.141	Not significant
Organ	Young fruit-small flower bud	0.220	-0.056	0.496	0.233	Not significant
Organ	large flower bud-Fully developed flower (white)	0.144	-0.132	0.420	0.792	Not significant
Organ	Young fruit-medium leaf	-0.020	-0.296	0.256	1.000	Not significant
Organ	mature leaf-Fully developed flower (yellow)	0.018	-0.258	0.294	1.000	Not significant

Student questions (A5):

List two organ pairs with the strongest evidence of different means. On which scale did you assess this (original or log), and why?
Organ young leaf-Mature fruit with a difference of means of 4.137. Organ medium leaf-Mature fruit with a difference of means of 3.412. I used the transformed data because I felt that the transformed data suited the analysis better since it met the assumptions of the ANOVA. —

4 Part B (Replication): Protein ~ Organ (write the code)

Now repeat the workflow for analysis of HCNp for Protein. The steps are listed; write/copy and the code yourself. This tests your ability to execute the analysis and interpret results independently.

General tip when copying/pasting then editing code: make sure to change the names of the variables you create - for example, make sure to change hc1 to hc2, and fit1 to fit2, etc.

4.1 B1) Plot Protein across organs — boxplot + jitter

Why: See relative medians, spread, and outliers before modeling.

Do: Create a boxplot with jittered points (geom_boxplot(...) + geom_jitter(...)). Make the panel square using aspect.ratio = 1 and rotate x-axis labels for readability.

suppressPackageStartupMessages({
  library(readr)    # fast, tidy CSV reading
  library(dplyr)    # data wrangling verbs
  library(ggplot2)  # plotting
  library(car)      # Levene's test & power transforms
  library(broom)    # tidy model outputs
  library(knitr)    # kable tables
})

knitr::opts_chunk$set(fig.width = 6, fig.asp = 1)


# Load and prepare the dataset for ANOVA

# Read in the CSV file
# - readr::read_csv() reads a CSV (comma-separated values) file into R as a tibble (a tidyverse-friendly version of a data frame)
hc <- readr::read_csv(here::here("data", "Additional file 1.csv"))

## Rows: 80 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Organ
## dbl (2): Protein, HCNp
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(hc) #look at the data to make sure everything looks right

## Rows: 80
## Columns: 3
## $ Organ   <chr> "young leaf", "young leaf", "young leaf", "young leaf", "young…
## $ Protein <dbl> 12.71, 13.42, 12.19, 12.18, 14.94, 13.11, 12.94, 13.77, 10.21,…
## $ HCNp    <dbl> 68.300000, 68.700000, 65.200000, 64.500000, 64.214383, 69.9803…

hc2 <- hc %>% #creates a new dataframe with Organ (categorical) and protein 
  select(Organ, Protein) %>%
  mutate(Organ = factor(Organ))  # assures that Organ is categorical

5 1.1 Boxplot

#tells knitr how big to make any figures (6 inches by 4 inches) produced #in this chunk
ggplot(
  data = hc2,
  aes(x = Organ, y = Protein, fill = Organ)
) +
  geom_boxplot(
    alpha = 0.75,
    outlier.shape = NA
  ) +
  # Draws boxplots
  # alpha = 0.75 makes boxes semi-transparent so points are visible on top.
  # outlier.shape = NA hides the built-in outlier dots (we’ll show raw #points ourselves). Add raw observations as jittered points so they don’t #stack vertically.
  
  geom_jitter(
    aes(color = Organ),
    width  = 0.15,
    height = 0,
    alpha  = 0.6,
    size   = 1
  ) +
  # Map Organ to 'color' so point outlines match the box color.
  # width controls horizontal jitter; height = 0 keeps true protein values.
  # alpha < 1 and small size help with overplotting.
  # Title and axis labels.
  labs(
    title = "Protein across organs",
    y = "Protein",
    x = "Organ"
  ) +
  # Theme tweaks below:
  # - rotate x labels vertically for long organ names
  # - drop legend (x axis already identifies groups)
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
    legend.position = "none"
  )

5.1 A2) Fit one-way ANOVA

ANOVA tests whether all group means are equal vs at least one differs. It partitions total variability into between-group and within-group components and compares them via an F-ratio.

# Safety: ensure required columns exist and drop NAs (if there is missing
#data in an excel file, make sure the excel file has NA in that cell)
stopifnot(all(c("Protein","Organ") %in% names(hc2)))

# Fit the ANOVA model (HCNp as response; Organ as categorical predictor)
fit2 <- aov(Protein ~ Organ, data = hc2)

# Prints the ANOVA table so it appears in the knitted output
cat("\nANOVA table: Protein ~ Organ\n")

## 
## ANOVA table: Protein ~ Organ

print(summary(fit2))

##             Df Sum Sq Mean Sq F value Pr(>F)    
## Organ        9  657.5   73.06   21.68 <2e-16 ***
## Residuals   70  235.9    3.37                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Context: group sizes & means (helps when interpreting differences)
means_protein <- hc2 %>%
  dplyr::group_by(Organ) %>%
  dplyr::summarise(
    n    = dplyr::n(),
    mean = mean(Protein),
    sd   = sd(Protein),
    .groups = "drop"
  )
knitr::kable(means_protein, digits = 3, caption = "Group sizes and means for Protein by Organ")# kable (from knitr) turns a data frame/tibble/matrix into #a formatted table using the means_hcn when you knit an R Markdown. It allows you to  #add captions, rounding, alignment, and custom column names.

Group sizes and means for Protein by Organ
Organ	n	mean	sd
Fully developed flower (white)	8	12.137	1.604
Fully developed flower (yellow)	8	11.470	2.366
Intermediate fruit	8	7.862	1.357
large flower bud	8	13.598	2.460
Mature fruit	8	7.338	1.271
mature leaf	8	8.717	0.617
medium leaf	8	10.308	1.128
small flower bud	8	15.543	1.620
Young fruit	8	15.820	3.304
young leaf	8	13.157	0.907

# --- Extract reporting numbers  ---

# Build the ANOVA-like summary table from the fitted model object `fit2`.
# For aov objects, summary(fit2)[[1]] returns a matrix/data.frame with rows for model terms (e.g., "Organ") and "Residuals", and columns like "Df", "F value", "Pr(>F)".
aov_tab <- summary(fit2)[[1]]

# Pull the F statistic for the factor named exactly "Organ" from the #table.Assumes the row name is "Organ" (i.e., the model was fit with Organ #as a factor term).
F_value <- aov_tab["Organ","F value"]

# Pull the p-value for the Organ effect (column name "Pr(>F)").
p_value <- aov_tab["Organ","Pr(>F)"]

# Extract degrees of freedom:
#   df1 = numerator df for Organ (its row's "Df")
#   df2 = denominator df from the Residuals row
df1 <- aov_tab["Organ","Df"]; df2 <- aov_tab["Residuals","Df"]

# Print a compact report line:
#   F(df1, df2) = F_value, p = p_value
# Note: %.4g prints p in up to 4 significant digits (switching to scientific notation if tiny).
cat(sprintf("\nReport: F(%d, %d) = %.3f, p = %.4g\n", df1, df2, F_value, p_value))

## 
## Report: F(9, 70) = 21.680, p = 5.497e-17

# Residuals vs Fitted: look for no pattern (looking for a random cloud of #points), and roughly constant vertical spread
ggplot(data.frame(fit = fitted(fit2), res = resid(fit2)), aes(fit, res)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted (Protein ~ Organ)", x = "Fitted values", y = "Residuals")

# QQ plot: residuals should follow the reference line if approximately normal
ggplot(data.frame(res = resid(fit2)), aes(sample = res)) +
  stat_qq() + stat_qq_line() +
  labs(title = "QQ plot of residuals (Protein ~ Organ)", x = "Theoretical Quantiles", y = "Sample Quantiles")

# 1) Normality of residuals (ANOVA assumes normally distributed errors)
#    We test the residuals from the fitted model, not each raw group,
#    because ANOVA is about model errors, not the raw data per se.
cat("\nShapiro–Wilk normality test (residuals): Protein ~ Organ\n")

## 
## Shapiro–Wilk normality test (residuals): Protein ~ Organ

res <- residuals(fit2)
print(shapiro.test(resid(fit2)))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit2)
## W = 0.96671, p-value = 0.03509

# Interpretation tip:
#   p < 0.05 → evidence residuals deviate from normality
#   p ≥ 0.05 → no strong evidence against normality
# Remember: with moderate n, ANOVA is fairly robust to mild non-normality.

# 2) test for Homogeneity of variances (equal variances across groups)

cat("\nLevene's test for equal variances (median-centered): Protein ~ Organ\n")

## 
## Levene's test for equal variances (median-centered): Protein ~ Organ

print(leveneTest(Protein ~ Organ, data = hc2, center = median))

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  9  2.3298 0.02333 *
##       70                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# 3) Welch's ANOVA is a statistical test used as an alternative to the traditional one-way ANOVA when the assumption of equal variances across groups is violated. It is particularly useful for comparing the means of three or more groups when their variances are unequal. Welch's ANOVA  still assumes ~normality)
#    Use when Levene indicates heteroscedasticity.
cat("\nWelch's ANOVA (robust to unequal variances): Protein ~ Organ\n")

## 
## Welch's ANOVA (robust to unequal variances): Protein ~ Organ

print(oneway.test(Protein ~ Organ, data = hc2, var.equal = FALSE))

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  Protein and Organ
## F = 31.22, num df = 9.000, denom df = 28.153, p-value = 2.474e-12

# Interpretation tip:
#   If Welch is significant while classic ANOVA disagrees, trust Welch when variances are unequal.

# Practical guidance:
# - These tests inform, they don’t automatically veto ANOVA.
# - If assumptions look shaky: consider data transformations, Welch’s ANOVA,
#   or nonparametric alternatives (e.g., Kruskal–Wallis) depending on the issue.

5.2 A4) Optional transformation — when and why?

If variance increases with the mean (fan shape) or residuals are skewed, a log (or power) transform can stabilize variance and improve normality.

.safe_log <- function(x) if (all(x > 0, na.rm = TRUE)) log(x) else log1p(x)

hc_protein <- hc2 %>% dplyr::mutate(Protein_log = .safe_log(Protein))

fit2_log <- aov(Protein_log ~ Organ, data = hc_protein)
cat("\nANOVA (log scale): Protein_log ~ Organ\n")

## 
## ANOVA (log scale): Protein_log ~ Organ

print(summary(fit2_log))

##             Df Sum Sq Mean Sq F value Pr(>F)    
## Organ        9  5.350  0.5945   25.64 <2e-16 ***
## Residuals   70  1.623  0.0232                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

cat("\nLevene on log scale (median-centered):\n")

## 
## Levene on log scale (median-centered):

print(leveneTest(Protein_log ~ Organ, data = hc_protein, center = median))

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  9  1.6619 0.1149
##       70

ggplot(data.frame(fit = fitted(fit2_log), res = resid(fit2_log)), aes(fit, res)) +
  geom_point(alpha = 0.6) + geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted (log Protein ~ Organ)", x = "Fitted (log Protein)", y = "Residuals")

ggplot(data.frame(res = resid(fit2_log)), aes(sample = res)) +
  stat_qq() + stat_qq_line() +
  labs(title = "QQ plot (log Protein ~ Organ)", x = "Theoretical Quantiles", y = "Sample Quantiles")

tuk <- TukeyHSD(if (exists("fit2_log")) fit2_log else fit2)  # use transformed fit if you chose it
tuk_df <- broom::tidy(tuk) %>%
  dplyr::mutate(sig = ifelse(adj.p.value < 0.05, "Significant", "Not significant")) %>%
  dplyr::arrange(adj.p.value)
knitr::kable(tuk_df, digits = 3, caption = "Tukey HSD pairwise comparisons (HCNp)")

Tukey HSD pairwise comparisons (HCNp)
term	contrast	estimate	conf.low	conf.high	adj.p.value	sig
Organ	small flower bud-Intermediate fruit	0.688	0.440	0.937	0.000	Significant
Organ	Young fruit-Intermediate fruit	0.694	0.445	0.942	0.000	Significant
Organ	small flower bud-Mature fruit	0.758	0.509	1.007	0.000	Significant
Organ	Young fruit-Mature fruit	0.763	0.515	1.012	0.000	Significant
Organ	Mature fruit-large flower bud	-0.615	-0.864	-0.366	0.000	Significant
Organ	young leaf-Mature fruit	0.595	0.346	0.844	0.000	Significant
Organ	Young fruit-mature leaf	0.580	0.332	0.829	0.000	Significant
Organ	small flower bud-mature leaf	0.575	0.326	0.824	0.000	Significant
Organ	large flower bud-Intermediate fruit	0.545	0.296	0.794	0.000	Significant
Organ	young leaf-Intermediate fruit	0.525	0.276	0.774	0.000	Significant
Organ	Mature fruit-Fully developed flower (white)	-0.508	-0.757	-0.260	0.000	Significant
Organ	Mature fruit-Fully developed flower (yellow)	-0.440	-0.689	-0.191	0.000	Significant
Organ	Intermediate fruit-Fully developed flower (white)	-0.438	-0.687	-0.190	0.000	Significant
Organ	mature leaf-large flower bud	-0.432	-0.681	-0.183	0.000	Significant
Organ	Young fruit-medium leaf	0.416	0.167	0.665	0.000	Significant
Organ	young leaf-mature leaf	0.412	0.163	0.661	0.000	Significant
Organ	small flower bud-medium leaf	0.411	0.162	0.660	0.000	Significant
Organ	Intermediate fruit-Fully developed flower (yellow)	-0.370	-0.619	-0.121	0.000	Significant
Organ	medium leaf-Mature fruit	0.347	0.099	0.596	0.001	Significant
Organ	mature leaf-Fully developed flower (white)	-0.325	-0.574	-0.076	0.002	Significant
Organ	Young fruit-Fully developed flower (yellow)	0.324	0.075	0.573	0.002	Significant
Organ	small flower bud-Fully developed flower (yellow)	0.319	0.070	0.567	0.003	Significant
Organ	medium leaf-Intermediate fruit	0.278	0.029	0.526	0.017	Significant
Organ	medium leaf-large flower bud	-0.267	-0.516	-0.019	0.025	Significant
Organ	mature leaf-Fully developed flower (yellow)	-0.257	-0.505	-0.008	0.038	Significant
Organ	Young fruit-Fully developed flower (white)	0.255	0.006	0.504	0.040	Significant
Organ	small flower bud-Fully developed flower (white)	0.250	0.001	0.499	0.048	Significant
Organ	young leaf-medium leaf	0.247	-0.001	0.496	0.052	Not significant
Organ	mature leaf-Mature fruit	0.183	-0.066	0.432	0.339	Not significant
Organ	large flower bud-Fully developed flower (yellow)	0.175	-0.074	0.424	0.402	Not significant
Organ	young leaf-Young fruit	-0.168	-0.417	0.080	0.458	Not significant
Organ	medium leaf-mature leaf	0.164	-0.084	0.413	0.494	Not significant
Organ	young leaf-small flower bud	-0.163	-0.412	0.085	0.503	Not significant
Organ	medium leaf-Fully developed flower (white)	-0.161	-0.410	0.088	0.525	Not significant
Organ	young leaf-Fully developed flower (yellow)	0.155	-0.094	0.404	0.576	Not significant
Organ	Young fruit-large flower bud	0.149	-0.100	0.397	0.634	Not significant
Organ	small flower bud-large flower bud	0.143	-0.105	0.392	0.680	Not significant
Organ	mature leaf-Intermediate fruit	0.113	-0.136	0.362	0.893	Not significant
Organ	large flower bud-Fully developed flower (white)	0.106	-0.142	0.355	0.924	Not significant
Organ	medium leaf-Fully developed flower (yellow)	-0.092	-0.341	0.157	0.968	Not significant
Organ	young leaf-Fully developed flower (white)	0.087	-0.162	0.335	0.979	Not significant
Organ	Mature fruit-Intermediate fruit	-0.070	-0.319	0.179	0.995	Not significant
Organ	Fully developed flower (yellow)-Fully developed flower (white)	-0.069	-0.317	0.180	0.996	Not significant
Organ	young leaf-large flower bud	-0.020	-0.269	0.229	1.000	Not significant
Organ	Young fruit-small flower bud	0.005	-0.244	0.254	1.000	Not significant

Student questions (B1):

Which organs look most different in median Protein? Which show the largest spread?
Mature fruit had the smallest median protein and small flower bud had the largest median protein, so they looked the most different. Young fruit, large flower and fully developed flower had the largest spread.

5.3 B2) Fit one-way ANOVA: Protein ~ Organ

Why: Test if mean Protein differs among organs.

Do: Drop NAs, ensure Organ is a factor, fit aov(Protein ~ Organ, data = ...), print summary(...), and show a small table of group sizes and means.

Student questions (B2):

Report F, df, and p. Is there evidence that mean Protein differs among organs?
F = 21. 68

df = 9 (within), 70 (between)

p-value = <2e-16

There is evidence that mean Protein differs among organs.

5.4 B3) Check assumptions (residuals vs fitted, QQ, Levene, Welch)

Why: Validate model assumptions and pick a robust alternative if needed.

Do: Plot residuals vs fitted and a QQ plot; run shapiro.test(resid(...)); run car::leveneTest(...); if Levene’s p < 0.05, report Welch’s ANOVA (oneway.test(..., var.equal = FALSE)).

Student questions (B3):

Do the diagnostics support classical ANOVA? If not, which method (Welch or transform) will you prefer and why?
No because the residual vs fitted plot shows a funnel shape from left to right. The QQ plot has outliers at both tails, indicating the data is not normal, so I would prefer to transform the data. Also, the levene’s p-value grows with the transformation (from 0.02 to 0.11). This means the transformed data has more equal variance than the original data.

5.5 B4) Optional transformation (log) and re-check

Why: Stabilize variance and improve normality when spread increases with the mean.

Do: Create Protein_log via log(...) or log1p(...), fit aov(Protein_log ~ Organ, ...), and repeat diagnostics.

Student questions (B4):

Did diagnostics improve after transforming? Which model will you report?
Yes. Diagnostics did improve after transforming. I would prefer to report the transformed data since it meets classical ANOVA assumptions.

5.6 B5) (Optional) Tukey HSD for Protein

Why: Identify which specific organ means differ, controlling family-wise error.

Do: Run TukeyHSD(fit_protein_chosen) where fit_protein_chosen is the model you decided to report.

Student questions (B5):

Name two organ pairs with significant differences in mean Protein. Briefly interpret their direction and magnitude. Small flower bud-mature fruit had a significant difference in mean protein with a magnitude of 0.758, indicating mature fruit mean protein is smaller than small flower bud. Young fruit-mature fruit had a significant difference in mean protein with a magnitude of 0.763, indicating mature fruit protein is smaller than young fruit.

6 Submission checklist

HCNp (guided): plot, ANOVA table, diagnostics, transform + diagnostics, and a 2–3 sentence interpretation.
Protein (replication): the same set using your own code.
If you used transformations, say how you interpret differences

File names: Lastname_Firstname_Lab_ANOVA.html and Lastname_Firstname_Lab_ANOVA.Rmd.

7 Optional BONUS POINT OPPORTUNITY - Kruskal–Wallis test (nonparametric alternative for HCNp ~ Organ)

Do the if you have time just to see how it works, but not required since the parametric tests are probably appropriate for these data. It is good to have the code for future reference however.

suppressPackageStartupMessages({
  library(dplyr)
  library(knitr)
  # install.packages("rstatix") # <- run once if needed
  library(rstatix)  # for dunn_test()
})

# Use the existing 'hc' if present; require necessary columns
stopifnot(exists("hc"))
stopifnot(all(c("Organ","HCNp") %in% names(hc)))

# Analysis frame: drop NAs
hc_kw <- hc %>%
  dplyr::filter(!is.na(Organ), !is.na(HCNp))

# Descriptive medians & IQRs (rank-based summary)
hc_kw %>%
  dplyr::group_by(Organ) %>%
  dplyr::summarise(
    n      = dplyr::n(),
    median = median(HCNp),
    IQR    = IQR(HCNp),
    .groups = "drop"
  ) %>%
  knitr::kable(digits = 3, caption = "HCNp by Organ: medians and IQRs")

HCNp by Organ: medians and IQRs
Organ	n	median	IQR
Fully developed flower (white)	8	8.420	3.219
Fully developed flower (yellow)	8	6.167	0.624
Intermediate fruit	8	2.357	1.048
Mature fruit	8	0.083	0.138
Young fruit	8	32.533	6.834
large flower bud	8	9.939	2.294
mature leaf	8	6.350	1.032
medium leaf	8	32.550	0.399
small flower bud	8	25.408	6.229
young leaf	8	66.750	4.151

# Omnibus Kruskal–Wallis test
kw <- kruskal.test(HCNp ~ Organ, data = hc_kw)
kw

## 
##  Kruskal-Wallis rank sum test
## 
## data:  HCNp by Organ
## Kruskal-Wallis chi-squared = 75.138, df = 9, p-value = 1.484e-12

# Effect size: epsilon-squared
k <- nlevels(hc_kw$Organ); n <- nrow(hc_kw); H <- unname(kw$statistic)
epsilon2 <- (H - k + 1) / (n - k)
cat(sprintf("\nEpsilon-squared effect size: %.3f\n", epsilon2))

## 
## Epsilon-squared effect size: 0.952

# ---- Dunn post-hoc pairwise comparisons (Holm-adjusted) --------------
dunn_tbl <- dunn_test(hc_kw, HCNp ~ Organ, p.adjust.method = "holm") %>%
  arrange(p.adj) %>%
  transmute(
    group1, group2,
    z = statistic,
    p = p,
    p_adj = p.adj,
    significant = ifelse(p_adj < 0.05, "Yes", "No")
  )

knitr::kable(dunn_tbl, digits = 3,
             caption = "Dunn post-hoc (Holm-adjusted) for HCNp by Organ")

Dunn post-hoc (Holm-adjusted) for HCNp by Organ
group1	group2	z	p	p_adj	significant
Mature fruit	young leaf	6.197	0.000	0.000	Yes
Intermediate fruit	young leaf	5.498	0.000	0.000	Yes
Mature fruit	medium leaf	5.132	0.000	0.000	Yes
Mature fruit	Young fruit	5.035	0.000	0.000	Yes
Fully developed flower (yellow)	young leaf	4.465	0.000	0.000	Yes
Intermediate fruit	medium leaf	4.432	0.000	0.000	Yes
Intermediate fruit	Young fruit	4.336	0.000	0.001	Yes
Mature fruit	small flower bud	4.293	0.000	0.001	Yes
mature leaf	young leaf	4.250	0.000	0.001	Yes
Intermediate fruit	small flower bud	3.593	0.000	0.012	Yes
Fully developed flower (white)	young leaf	3.421	0.001	0.022	Yes
Fully developed flower (yellow)	medium leaf	3.400	0.001	0.023	Yes
Fully developed flower (yellow)	Young fruit	3.303	0.001	0.032	Yes
mature leaf	medium leaf	3.185	0.001	0.046	Yes
large flower bud	Mature fruit	-3.174	0.002	0.047	Yes
mature leaf	Young fruit	3.088	0.002	0.061	No
large flower bud	young leaf	3.023	0.003	0.073	No
Fully developed flower (white)	Mature fruit	-2.776	0.006	0.154	No
Fully developed flower (yellow)	small flower bud	2.561	0.010	0.282	No
Intermediate fruit	large flower bud	2.474	0.013	0.347	No
Fully developed flower (white)	medium leaf	2.356	0.018	0.462	No
mature leaf	small flower bud	2.345	0.019	0.462	No
Fully developed flower (white)	Young fruit	2.259	0.024	0.549	No
Fully developed flower (white)	Intermediate fruit	-2.076	0.038	0.833	No
Fully developed flower (white)	Fully developed flower (yellow)	-1.044	0.297	1.000	No
Fully developed flower (white)	large flower bud	0.398	0.691	1.000	No
Fully developed flower (white)	mature leaf	-0.828	0.407	1.000	No
Fully developed flower (white)	small flower bud	1.517	0.129	1.000	No
Fully developed flower (yellow)	Intermediate fruit	-1.033	0.302	1.000	No
Fully developed flower (yellow)	large flower bud	1.442	0.149	1.000	No
Fully developed flower (yellow)	Mature fruit	-1.732	0.083	1.000	No
Fully developed flower (yellow)	mature leaf	0.215	0.830	1.000	No
Intermediate fruit	Mature fruit	-0.699	0.484	1.000	No
Intermediate fruit	mature leaf	1.248	0.212	1.000	No
large flower bud	mature leaf	-1.226	0.220	1.000	No
large flower bud	medium leaf	1.958	0.050	1.000	No
large flower bud	small flower bud	1.119	0.263	1.000	No
large flower bud	Young fruit	1.861	0.063	1.000	No
Mature fruit	mature leaf	1.947	0.052	1.000	No
medium leaf	small flower bud	-0.839	0.401	1.000	No
medium leaf	Young fruit	-0.097	0.923	1.000	No
medium leaf	young leaf	1.065	0.287	1.000	No
small flower bud	Young fruit	0.742	0.458	1.000	No
small flower bud	young leaf	1.904	0.057	1.000	No
Young fruit	young leaf	1.162	0.245	1.000	No

Student questions (Kruskal–Wallis): Bonus Question

Compare the Kruskal–Wallis p-value to your ANOVA/Welch result. Do they lead to the same conclusion about differences among organs? Explain briefly. THe Kruskal-Wallis p-value was 1.484e-12 and the transformed ANOVA result was <2e-16.Both results reveal that organ identity explains almost all of the variability in HCNp values, and that these results are robust to assumptions of normality and equal variance. However, ANOVA is more sensitive to unequal variances, but with the transformation, the variance seems stabilized. The Kruskal–Wallis test confirms the robustness of the finding with a large non-parametric effect size (ε² = 0.952).