1 ✍️ Introduction

This worksheet will guide you through basic statistical analysis and data visualization techniques in R using simulated crop treatment data. You will learn about ANOVA, LSD, t-tests, and how to make boxplots with informative labels and color themes.

1.1 First load required packages

# Required for reading files (not used in this example)
library(readxl)       # Not essential here, but useful if reading Excel input
library(readr)       # Not essential here, but useful if reading CSV input
library(data.table)  # Not essential here, but useful for speed + large files

# Core plotting and data handling
library(ggplot2)      # Essential: used for all visualizations
library(dplyr)        # Essential: used for summarizing and transforming data

# Optional enhancements to plots
library(ggbeeswarm)   # Optional: jittered swarm plots
library(viridisLite)  # Optional: for viridis color scales
library(viridis)      # Optional: color scale options
library(RColorBrewer) # Optional: for using ColorBrewer palettes

# Plot annotation and combination
library(ggpubr)       # Essential: used for adding statistical comparisons
library(cowplot)      # Optional: combines multiple plots into one

# Image support
library(magick)       # Optional: advanced image processing
library(plotly)
# Document support
library(knitr)        # Essential: powers R Markdown knitting

# Post-hoc testing
library(agricolae)    # Essential: needed for LSD.test after ANOVA

2 πŸ§ͺ Load and Inspect Data

πŸ‘‰ Instructions: Simulate or load a dataset with a Treatment column and a numeric outcome (e.g., Yield).

# Simulate data with distinct mean for Fungicide_B
set.seed(123) #exactly the same numbers every time we run the code, for reproducibility.
  data <- data.frame(
  Treatment = rep(c("Control", "Fungicide_A", "Fungicide_B", "Fungicide_C"), each = 50),
  Yield = c(
    rnorm(50, mean = 50, sd = 10),   # Control
    rnorm(50, mean = 50, sd = 10),   # Fungicide_A
    rnorm(50, mean = 60, sd = 10),   # Fungicide_B --> deliberately higher
    rnorm(50, mean = 50, sd = 10)    # Fungicide_C
  )
)

# View the first few rows
head(data)

πŸ“ Question: What type of variable is Treatment? What type is Yield?


3 πŸ“Š Create a Boxplot

πŸ‘‰ Instructions: Use ggplot2 to create a boxplot of Yield by Treatment.

library(ggplot2)
ggplot(data, aes(x = Treatment, y = Yield)) +
  geom_boxplot()

# This visual helps us to quickly see distribution, central tendency, spread, and outliers in the data.

πŸ“ Question: What do the horizontal lines in the boxplot represent?


4 🎨 Add Color and Labels

πŸ‘‰ Instructions: Add axis labels and fill colors based on Treatment.

ggplot(data, aes(x = Treatment, y = Yield, fill = Treatment)) +
  geom_boxplot() +
  xlab("Treatment Applied") +
  ylab("Yield (bu/acre)") +
  scale_fill_brewer(palette = "Set2")

πŸ“ Question: Why do we map fill = Treatment inside aes()?


5 πŸ“ Customize the Theme

πŸ‘‰ Instructions: Clean up the plot using a white background and rotated axis labels.

ggplot(data, aes(x = Treatment, y = Yield, fill = Treatment)) +
  geom_boxplot() +
  xlab("Treatment Applied") +
  ylab("Yield (bu/acre)") +
  scale_fill_brewer(palette = "Set2") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

πŸ“ Question: What does theme_bw() do?


6 πŸ“Š Statistical Methods

6.1 Check Normality (Shapiro-Wilk Test)

shapiro_results <- data %>%
  group_by(Treatment) %>%
  summarise(p_value = shapiro.test(Yield)$p.value)
shapiro_results

πŸ“ Question: If any group’s p-value < 0.05, data may not be normally distributed. Should we use ANOVA or Kruskal-Wallis? In case: kruskal.test(Yield ~ Treatment, data = data) Follow it with Dunn’s test for pairwise comparisons library(FSA) dunnTest(Yield ~ Treatment, data = data, method = β€œbonferroni”)

6.2 ANOVA (Analysis of Variance) β€” if data is normally distributed

  • Purpose: To test whether the means of three or more groups are significantly different. It compares.

    Null Hypothesis (Hβ‚€): All group means are equal Alternative (H₁): At least one group mean is different

    How it works: It compares the variation between groups to the variation within groups. If variation between is much larger than we conclude: significant difference

aov(Yield ~ Treatment, data = data)
## Call:
##    aov(formula = Yield ~ Treatment, data = data)
## 
## Terms:
##                 Treatment Residuals
## Sum of Squares   1738.183 17260.573
## Deg. of Freedom         3       196
## 
## Residual standard error: 9.38425
## Estimated effects may be unbalanced
  • Interpretation: A significant p-value (e.g., < 0.05) means at least one group mean differs.

6.3 LSD (Least Significant Difference), (Post-hoc for ANOVA)

  • Purpose: Post-hoc test after ANOVA to find which pairs of groups differ.
library(agricolae)
model <- aov(Yield ~ Treatment, data = data)
LSD.test(model, "Treatment", p.adj = "none")

6.4 t-Test (Pairwise Comparisons), (If data is normal)

  • Purpose: Compares means between two specific groups.
  • In R (with ggpubr):
ggplot(data, aes(x = Treatment, y = Yield, fill = Treatment)) +
  geom_boxplot() +
  stat_compare_means(comparisons = list(c("Control", "Fungicide_A")), method = "t.test")

  • Result Interpretation:
    • ns = not significant
    • * = p < 0.05
    • ** = p < 0.01
    • *** = p < 0.001

7 🧠 Perform ANOVA

πŸ‘‰ Instructions: Test whether yields are significantly different among treatments.

model <- aov(Yield ~ Treatment, data = data)
summary(model)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## Treatment     3   1738   579.4   6.579 0.000293 ***
## Residuals   196  17261    88.1                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

πŸ“ Question: What is the null hypothesis of ANOVA? Is it rejected?


8 πŸ”¬ Add ANOVA Result to Plot

πŸ‘‰ Instructions: Add the p-value to your plot.

library(ggpubr)

ggplot(data, aes(x = Treatment, y = Yield, fill = Treatment)) +
  geom_boxplot() +
  stat_compare_means(method = "anova", label.y = 70) +
  scale_fill_brewer(palette = "Set2") +
  theme_bw()

πŸ“ Question: What does a p-value < 0.05 indicate?


9 πŸ” Pairwise Comparisons (t-tests)

πŸ‘‰ Instructions: Compare individual treatment pairs using t-tests.

pairs <- list(c("Control", "Fungicide_A"),
              c("Control", "Fungicide_B"),
              c("Control", "Fungicide_C"))

ggplot(data, aes(x = Treatment, y = Yield, fill = Treatment)) +
  geom_boxplot() +
  stat_compare_means(comparisons = pairs, method = "t.test", label = "p.signif") +
  scale_fill_brewer(palette = "Set2") +
  theme_bw()

πŸ“ Question: Which treatment(s) show statistically significant differences from the Control?


10 βœ… Summary Questions

  1. What is the difference between ANOVA and a t-test?
  2. When should you use LSD or pairwise tests?
  3. What does the fill argument in aes() do?
  4. What are the benefits of using theme_bw()?

11 πŸ“ Submit Your Plots

Save your final plots as PNG or PDF and submit them along with your written answers.

ggsave("your_final_plot.png", width = 8, height = 6, dpi = 300)

Happy plotting! πŸŽ‰

11.1 Key Components

11.1.1 aes() β€” Aesthetic Mappings

aes(x = Treatment, y = Yield, fill = Treatment)
## Aesthetic mapping: 
## * `x`    -> `Treatment`
## * `y`    -> `Yield`
## * `fill` -> `Treatment`

Maps variables to plot features: x-axis, y-axis, color, size, etc.

11.1.2 geom_boxplot()

Draws boxplots to show distributions and differences.

11.1.3 xlab() and ylab()

Set axis labels.

ggplot(data, aes(x = Treatment, y = Yield)) +
  geom_boxplot() +
  xlab("Treatment") +
  ylab("Yield (bu/acre)")

11.1.4 fill and Color Scales

Customize plot colors:

11.1.5 theme() and theme_bw()

Customize the overall look:

ggplot(data, aes(x = Treatment, y = Yield, fill = Treatment)) +
  geom_boxplot() +
  scale_fill_viridis_d(option = "plasma") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 1, face = "bold"))

11.1.6 aspect.ratio

Adjust the plot height vs.Β width:

theme(aspect.ratio = 1.25)
## List of 1
##  $ aspect.ratio: num 1.25
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE

11.1.7 stat_compare_means()

Adds statistical test results on plots:

stat_compare_means(method = "anova")
## geom_text: na.rm = FALSE
## stat_compare_means: label.x.npc = left, label.y.npc = top, label.x = NULL, label.y = NULL, label.sep = , , method = anova, method.args = list(), paired = FALSE, ref.group = NULL, symnum.args = list(), hide.ns = FALSE, na.rm = FALSE
## position_identity

12 πŸ“˜ Summary Table

Concept Description
ANOVA Tests for overall mean differences among 3+ groups
LSD Identifies which groups are different after ANOVA
t-Test Compares two groups directly
p-value Probability that observed difference is due to chance
aes() Maps data to plot elements (e.g., x, y, fill)
geom_boxplot() Visualizes distribution and outliers for each group
fill Colors plot elements based on grouping
xlab(), ylab() Labels for axes
theme() Customize fonts, colors, spacing
stat_compare_means() Add test results to plots visually

13 πŸ§ͺ Optional: Example Code for Classroom

# Simulated Example Data
set.seed(123)
  data_PS433 <- data.frame(
  Treatment = rep(c("Control", "Fungicide_A", "Fungicide_B", "Fungicide_C"), each = 50),
  Yield = c(
    rnorm(50, mean = 50, sd = 10),   # Control
    rnorm(50, mean = 50, sd = 10),   # Fungicide_A
    rnorm(50, mean = 60, sd = 10),   # Fungicide_B --> deliberately higher
    rnorm(50, mean = 50, sd = 10)    # Fungicide_C
  )
)
# Boxplot with ANOVA and Pairwise
library(ggplot2)
library(ggpubr)

p <- ggplot(data=data_PS433, aes(x = Treatment, y = Yield, fill = Treatment)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Set2") +
  stat_compare_means(method = "anova", label.y = 70) +
  stat_compare_means(comparisons = list(c("Control", "Fungicide_A"), c("Control", "Fungicide_B"),
                                        c("Control", "Fungicide_C")), label = "p.signif") +
  xlab("Treatment") + ylab("Yield (bu/acre)") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 0, face = "bold"))

p


14 πŸ§ͺ Optional: Higher Order-> β€˜smplot2’ package for Visualization

14.1 Required packages

if (!requireNamespace(β€œsmplot2”, quietly = TRUE)) remotes::install_github(β€œsmin95/smplot2”) if (!requireNamespace(β€œsdamr”, quietly = TRUE)) remotes::install_github(β€œsmin95/sdamr”)

# Load required libraries
library(ggplot2)      # Data visualization package
library(smplot2)      # For polished smplots
library(sdamr)        # For sm_raincloud and jitter-enhanced raincloud plots
library(dplyr)        # For data wrangling (e.g., summarizing by group)
library(agricolae)    # For performing LSD (Least Significant Difference) test
library(ggpubr)       # For adding significance stars to ggplot

# Set random seed to make results reproducible
set.seed(123)

# Simulate example dataset
# Each treatment has 50 observations (like 50 plots or plants per treatment)
data_PS533_SDSU_Solanki <- data.frame(
  Treatment = rep(c("Control", "Fungicide_A", "Fungicide_B", "Fungicide_C"), each = 50),
  Yield = c(
    rnorm(50, mean = 50, sd = 10),   # Control group
    rnorm(50, mean = 50, sd = 10),   # Fungicide A group
    rnorm(50, mean = 60, sd = 10),   # Fungicide B group (higher mean yield on purpose)
    rnorm(50, mean = 50, sd = 10)    # Fungicide C group
  )
)

# Calculate group means and sample sizes for labeling the plot
means <- data_PS533_SDSU_Solanki %>%
  group_by(Treatment) %>%
  summarise(mean_yield = round(mean(Yield), 2),
            n = n())

# Perform ANOVA to check for overall treatment differences
model <- aov(Yield ~ Treatment, data = data_PS533_SDSU_Solanki)

# Perform LSD test to see which specific groups are significantly different
lsd_result <- LSD.test(model, "Treatment", p.adj = "none")

# Extract the LSD value (how big a difference needs to be to count as "significant")
lsd_value <- round(lsd_result$statistics$LSD, 2)

# Get group comparison letters from LSD (e.g., a, b, ab)
group_letters <- as.data.frame(lsd_result$groups)
group_letters$Treatment <- rownames(group_letters)

# Combine group letters with mean values
label_data <- merge(means, group_letters, by = "Treatment")

# Create raincloud plot with statistical annotations
Solanki <- ggplot(data = data_PS533_SDSU_Solanki, aes(x = Treatment, y = Yield, fill = Treatment)) +
  # Add raincloud layers (half-violin, boxplot, and points)
  sm_raincloud(
    position = position_nudge(x = 0.15),
    which_side = "right",
    show.legend = TRUE,
    point.params = list(
      size = 2, shape = 22, color = "black", alpha = 0.4,
      position = sdamr::position_jitternudge(nudge.x = 0.06, seed = 123, jitter.width = 0.06)
    )
  ) +
  # Add t-test comparison between selected treatment pairs
  stat_compare_means(
    comparisons = list(
      c("Control", "Fungicide_A"),
      c("Control", "Fungicide_B"),
      c("Control", "Fungicide_C")
    ),
    method = "t.test",
    label = "p.signif",      # Use *, **, *** based on p-value
    tip.length = 0.01
  ) +
  # Label each group with its mean and sample size
  geom_text(data = means,
            aes(x = Treatment, y = mean_yield + 6, 
                label = paste0("Mean = ", mean_yield, "\nn = ", n)),
            inherit.aes = FALSE,
            size = 3, fontface = "bold", vjust = 0, hjust = 1) +
  # Label with group letters from LSD test (e.g., a, b, ab)
  geom_text(data = label_data,
            aes(x = Treatment, y = mean_yield + 12, label = groups),
            inherit.aes = FALSE,
            size = 5, fontface = "bold", vjust = -1, hjust = 1.5) +
  # Annotate the LSD value on the plot
  annotate("text", 
           x = 2.5, 
           y = max(data_PS533_SDSU_Solanki$Yield) + 15, 
           label = paste0("LSD = ", lsd_value),
           size = 5, fontface = "bold", color = "darkred") +
  # Define color scheme for treatments
  scale_fill_manual(values = sm_color("blue", "orange", "purple", "darkred")) +
  # Set general visual theme
  theme_minimal(base_size = 14) +
  labs(
    x = "Treatment",                         # X-axis label
    y = "Soybean Yield (bu/acre)",           # Y-axis label
    title = "Brookings: Effect of Fungicide Treatment on Soybean Yield"  # Title
  ) +
  # Customize theme appearance: bold axis lines and text
  theme(
    axis.line.x = element_line(size = 1, color = "black"),
    axis.line.y = element_line(size = 1, color = "black"),
    axis.text.x = element_text(angle = 20, hjust = 1, face = "bold"),
    axis.title = element_text(face = "bold"),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    legend.position = c(.95, 0.9),                   # Top-right corner
    legend.title = element_blank(),
    legend.text = element_text(size = 8),
    legend.key.size = unit(0.2, "cm")                # Shrink legend keys
  )

# Print the plot
Solanki

ggsave("raincloud_plot.png", plot = Solanki, width = 10, height = 6, dpi = 900)
# When you will create picture (png) in R and save, it will look much better and polished.

15 πŸ“š Selected References

  1. R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/

  2. RStudio Team (2019). RStudio: Integrated Development for R. RStudio, Inc., Boston, MA. URL http://www.rstudio.com/.

  3. Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. URL: https://ggplot2.tidyverse.org/

  4. Min, S. (2023). smplot2: Simplified Plotting Functions for ggplot2. GitHub repository: https://github.com/smin95/smplot2

  5. R Markdown Cookbook Yihui Xie, Christophe Dervieux, Emily Riederer.2025-04-03 https://bookdown.org/yihui/rmarkdown-cookbook/

16 πŸŽ‰ Hurrah! we did it! πŸŽ‰

16.1 πŸ‘ Congratulations PS 433/533-2025 batch on completing the real-world data visualization and analysis exercise! You have just applied tools like ANOVA, LSD tests, and ggplot visualizations in R statistical language to explore crop yield responce due to fungicide application, and that’s no small feat! Keep asking questions, stay curious.