ANOVA (Analysis of Variance)

One-way: To analyze the differences in < SOMETHING > (example: % T cells) between your experimental groups (example: clean, dirty, co-housed) we can use a one-way ANOVA.
Two-way: A two-way ANOVA will allow us to test the main effects of each factor group as well as their interaction effect, such a group:sex, to determine if there’s a difference between the groups and the sexes.

Assumptions

The assumptions of ANOVA are as follows: Normality: The data within each group should be normally distributed.
Homogeneity of variance: The variance of the data within each group should be equal.
Independence: The observations within each group should be independent.

Read in the data

First, the data was added to the cluster and put in the appropriate project folder. You can use scp in the terminal.
In a terminal that is NOT logged into the mayo cluster run the following:

scp /PathTo/Spleen_General_males_1_Results.xls m####@mforgehn4:/research/labs/neurology/fryer/m####/ProjectFolder/

# read in the excel spreed sheet
df <- read_excel("Spleen_General_males_1_Results.xls")

## New names:
## • `` -> `...1`

# We need a group column so we know which samples belong to which group! 
# First, rename the first column to sampleID
df <- df %>% rename(sampleID = 1)

# oh my, there's two rows that are not needed and will mess up our dataframe, let's remove mean and SD
df <- df %>% filter(!sampleID %in% c("Mean", "SD"))

# Create the new column group based on the conditions
df <- df %>% mutate(group = case_when(
  startsWith(sampleID, "B") ~ "bedding",
  startsWith(sampleID, "CH") ~ "co_housed",
  startsWith(sampleID, "C") ~ "clean",
  startsWith(sampleID, "D") ~ "dirty",
  TRUE ~ NA_character_  # Assign NA if none of the conditions are met
))

# Display the resulting data frame
print(df) # Inspect

## # A tibble: 25 × 49
##    sampleID         `% live Cells` % Granulocytes of Li…¹ %B Cells of Live Cel…²
##    <chr>            <chr>          <chr>                  <chr>                 
##  1 B_1_M general.f… 79.3 %         3.22 %                 52.2 %                
##  2 B_2_M general.f… 73.2 %         4.30 %                 46.0 %                
##  3 B_3_M general.f… 13.1 %         13.0 %                 4.62 %                
##  4 B_4_M general.f… 60.7 %         7.96 %                 13.2 %                
##  5 B_5_M general.f… 71.9 %         29.2 %                 33.3 %                
##  6 B_6_M general.f… 67.4 %         20.7 %                 35.7 %                
##  7 B_10_M general.… 64.3 %         9.26 %                 22.2 %                
##  8 CH_1_M general.… 57.7 %         16.1 %                 35.2 %                
##  9 CH_2_M general.… 60.8 %         9.87 %                 49.9 %                
## 10 CH_3_M general.… 69.5 %         18.8 %                 31.1 %                
## # ℹ 15 more rows
## # ℹ abbreviated names: ¹`% Granulocytes of Live Cells`,
## #   ²`%B Cells of Live Cells`
## # ℹ 45 more variables: `MFI CD19 of B Cells` <dbl>,
## #   `MFI CD45R of B Cells` <dbl>, `% NK Cells of Live Cells` <chr>,
## #   `MFI of CD161 of NK Cells` <dbl>, `% NKT Cells of Live Cells` <chr>,
## #   `MFI CD3 of NKT Cells` <dbl>, `MFI CD161 of NKT Cells` <dbl>, …

Update the dataframe

# R will not recognize numbers as numeric if that string contains characters. 
is.numeric(df$`% live Cells`) # FALSE

## [1] FALSE

typeof(df$`% live Cells`) # Check the typeof data is % live Cells, see how it reports back as character.

## [1] "character"

# We will need to remove all % signs and convert from character to numeric. 
# Remove all "%" characters in the data frame except for in the column names. 
df <- df %>% mutate_all(~ gsub("%", "", .))
typeof(df$`% live Cells`) # Recheck the % live Cells. It still shows as character. Next convert to numeric

## [1] "character"

is.numeric(df$`% live Cells`) # FALSE

## [1] FALSE

# Convert all columns to numeric except for columns sampleID and group. 
df <- df %>% mutate_at(vars(-sampleID, -group), as.numeric)

## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `MFI CD62L of Naive CD4 = .Primitive("as.double")(`MFI CD62L of
##   Naive CD4`)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.

# The warning messages are due to the fact there there are NAs in some of the columns.
# You can't convert NA to numeric, hence the warning. 
is.numeric(df$`% live Cells`) # Recheck the % live Cells. It now shows TRUE

## [1] TRUE

# Define the order of the groups
df$group <- factor(df$group, levels = c("clean", "bedding", "co_housed", "dirty"))
df$group # Inspect

##  [1] bedding   bedding   bedding   bedding   bedding   bedding   bedding  
##  [8] co_housed co_housed co_housed co_housed co_housed co_housed clean    
## [15] clean     clean     clean     clean     clean     dirty     dirty    
## [22] dirty     dirty     dirty     dirty    
## Levels: clean bedding co_housed dirty

# To avoid some of these above pre-set up sets, think carefully about how R will handle column names, strings, and numbers when setting up your dataframe.

Check the assumptions of the ANOVA

Shapiro test - testing for normality.
Levene Test - homogeneity of variance across groups.

clean_values <- df[df$group == "clean",]
bedding_values <- df[df$group == "bedding",]
co_housed_values <- df[df$group == "co_housed",]
dirty_values <- df[df$group == "dirty",]

# Shapiro-Wilk test, normality test for the % live cells 
clean_shapiro <- shapiro.test(clean_values$`% live Cells`)
bedding_shapiro <- shapiro.test(bedding_values$`% live Cells`) # We reject the NULL 
co_housed_shapiro <- shapiro.test(co_housed_values$`% live Cells`)
dirty_shapiro <- shapiro.test(dirty_values$`% live Cells`)

# Let us inspect the bedding values distribution 
# Q-Q plots (or quantile-quantile plot) draws the correlation between a given sample and the normal distribution. 
ggqqplot(bedding_values$`% live Cells`)

# There's an outlier sample which explains why we reject the null of normality 

# Levene's test, performed using all samples from all the groups. 
levTest <- leveneTest(df$`% live Cells` ~ df$group)

ANOVA

# Fit the model
model <- aov(`% live Cells` ~ group, data = df)
model # inspect

## Call:
##    aov(formula = `% live Cells` ~ group, data = df)
## 
## Terms:
##                    group Residuals
## Sum of Squares  2698.595  4621.305
## Deg. of Freedom        3        21
## 
## Residual standard error: 14.83449
## Estimated effects may be unbalanced

summary(model) # summary

##             Df Sum Sq Mean Sq F value Pr(>F)  
## group        3   2699   899.5   4.088 0.0197 *
## Residuals   21   4621   220.1                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Plot

ggplot(df,aes(x=group,y=`% live Cells`,colour=group)) + 
  geom_violin() +
  geom_point(position = position_dodge(0.3)) +
   scale_shape_manual(values=seq(0,15)) +
  scale_color_manual(values=c("blue", "purple", "red", "black")) +
  theme_bw() +
  labs(title = "% live Cells", y = "% live Cells", x = "Group") +
  stat_compare_means(method = "anova", paired = FALSE, size = 4, label.x = 1.25)

Post hoc

TukeyHSD(model)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = `% live Cells` ~ group, data = df)
## 
## $group
##                        diff          lwr      upr     p adj
## bedding-clean     18.630952  -4.37328653 41.63519 0.1405537
## co_housed-clean   27.616667   3.74403974 51.48929 0.0196100
## dirty-clean       23.800000  -0.07262692 47.67263 0.0508840
## co_housed-bedding  8.985714 -14.01852463 31.98995 0.7000276
## dirty-bedding      5.169048 -17.83519129 28.17329 0.9224471
## dirty-co_housed   -3.816667 -27.68929359 20.05596 0.9697457