After a great discussion started by Jesse Maegan (@kiersi) on Twitter, I decided to post a workthrough of some (altered) mouse behavior data. These data correspond to motor function testing in mice via the rotarod, a semi-automated apparatus that tests the ability of the mouse to balance and maintain position on spinning cylinder. The time to fall from the cylinder onto a well-padded surface is the measure of interest, where a longer time to fall indicates better balance and better performance! The mice will be tested across at least 8 timepoints, but we are primarily interested in their peak performance at their LAST
session.
We will be working through loading, plotting, analyzing, and saving the outputs through the tidyverse, an “opinionated collection of R packages” designed for data analysis. The rest of our functions will be performed in base R to limit package dependence.
Using the library function we will load the tidyverse
. If you have never installed it before you can also use the install.packages("tidyverse")
call to install it for the first time. This package includes ggplot2
(graphs), dplyr
/tidyr
(sumamry statistics, data manipulation), and readxl
(reading excel files) as well as the pipe %>%
which will make our code much more readable! We will also load the broom
package to tidy up some of our statistical outputs. Lastly we will load knitr
for making nice tables for the html output, but not necessary for simply saving the outputs to Excel.
# Load for ggplot, dplyr, tidyr, readxl
library(tidyverse)
library(broom)
library(knitr)
While I am calling readxl::read_xlsx
you could also simply use read_xlsx
, but in the interest of transparency, I will be using the full call to begin. By using the glimpse
function from dplyr
we can see how the variables were imported, as well as the first few rows. We can already see some NAs, which we will need to address later!
# read excel file
raw_df <- readxl::read_xlsx("rotorod_study.xlsx")
dplyr::glimpse(raw_df)
## Observations: 198
## Variables: 11
## $ Sex <dbl> 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,...
## $ Age <dbl> 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,...
## $ Genotype <dbl> 2, 2, 2, 2, 0, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ S1 <dbl> 10.9025, 11.4275, 12.1775, 6.6500, 13.9775, 15.0450, ...
## $ S2 <dbl> 15.7250, 22.2475, 17.5900, 7.7350, 18.2075, 24.3075, ...
## $ S3 <dbl> 17.9100, 28.4525, 22.4350, 22.9175, 31.8125, 19.9050,...
## $ S4 <dbl> 19.9400, 26.1250, 3.9875, 29.0450, 28.3950, 20.8750, ...
## $ S5 <dbl> 33.7150, 34.3150, 17.2275, 25.5925, 24.1325, 38.2100,...
## $ S6 <dbl> 36.2925, 36.3100, 17.1800, 30.2475, 26.4300, 37.6000,...
## $ S7 <dbl> 35.3425, 32.5975, 21.8025, 23.1625, 30.2675, 39.5600,...
## $ LAST <dbl> 35.3425, 32.5975, 21.8000, 23.1625, 30.2675, 39.5600,...
By calling ggplot
we can take a look at the distribution. It looks like we have a slight right-skewed dataset for our variable of interest (LAST). There appear to be a few mice that performed very well.
# density plot
ggplot(raw_df, aes(x = LAST)) +
geom_density()
## Warning: Removed 2 rows containing non-finite values (stat_density).
Since we had a few NAs and we noticed that the dataset is in a wide-format, instead of a tall format we need to make some adjustments. A wide-dataset means that a variable of interest (session) is listed as a column header. We can convert it to a tall dataset by transposing each of the sessions (S1, S2, S3, etc) into a new column named session
and the corresponding time to fall data into a parallel column titled fall_time
. We will use the tidyr::gather
function to switch from wide to tall format.
# check for NAs
raw_df %>%
summarize(na_count = sum(is.na(LAST)))
## # A tibble: 1 x 1
## na_count
## <int>
## 1 2
# create a tidy/tall data frame
clean_df <- raw_df %>% # begin by piping the df
na.omit() %>% # remove the NAs from the dataset
gather(session, fall_time, S1, S2, S3, S4, S5, S6, S7, LAST) #%>%
# use gather to shift S1 --> LAST into one column named session
# and corresponding fall time into column named fall_time
# now we can check for NAs and take a glimpse at the data
clean_df %>%
summarize(na_count = sum(is.na(fall_time)))
## # A tibble: 1 x 1
## na_count
## <int>
## 1 0
glimpse(clean_df)
## Observations: 1,568
## Variables: 5
## $ Sex <dbl> 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0...
## $ Age <dbl> 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0...
## $ Genotype <dbl> 2, 2, 2, 2, 0, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0...
## $ session <chr> "S1", "S1", "S1", "S1", "S1", "S1", "S1", "S1", "S1"...
## $ fall_time <dbl> 10.9025, 11.4275, 12.1775, 6.6500, 13.9775, 15.0450,...
Now that we have a clean data set, we can get our summary stats. We are looking to generate the mean and standard error for fall time. We have our categorical variables of Sex, Age, Genotype, and session. Importantly Sex, Age, Genotype are coded as numerical variables so let’s convert them into factors AND change the labels to something more meaningful.
# create a summary data frame
sum_df <- clean_df %>% # start with data
group_by(Sex, Age, Genotype, session) %>% # group by categorical variables
summarize(fall_mean = mean(fall_time), # calc mean
fall_se = sd(fall_time)/sqrt(n())) %>% # calc standard error
ungroup() %>% # really important to drop the grouping variable for next stage
mutate(Age = factor(Age, labels = c("Young", "Adult", "Old")),
session = factor(session, levels = c("S1", "S2", "S3", 'S4', "S5", 'S6', "S7", "LAST")),
Genotype = factor(Genotype, labels = c("WT", "KO")),
Sex = factor(Sex, labels = c("Male", "Female")))
glimpse(sum_df)
## Observations: 96
## Variables: 6
## $ Sex <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male...
## $ Age <fct> Young, Young, Young, Young, Young, Young, Young, You...
## $ Genotype <fct> WT, WT, WT, WT, WT, WT, WT, WT, KO, KO, KO, KO, KO, ...
## $ session <fct> LAST, S1, S2, S3, S4, S5, S6, S7, LAST, S1, S2, S3, ...
## $ fall_mean <dbl> 32.017500, 15.250667, 20.137667, 25.021333, 25.43983...
## $ fall_se <dbl> 2.680055, 1.574577, 1.778810, 1.760546, 1.456819, 1....
We can see that our categorical variables have been converted to factors, with the appropriate levels/labels! We also have the mean and se calculated now!
There is an important distinction between personal use graphs and publication graphs. We don’t want to necessarily spend a large chunk of time making an exploratory graph JUST PERFECT, if we might end up discarding it. The first graph below is fine for just taking a look at the data. We will work on making a publication grade graph later on.
ggplot(data = sum_df, # add the data
aes(x = session, y = fall_mean, # aesthetics for x, y coordinates
group = Genotype, # group by genotype for line graphs
color = Genotype)) + # color by genotype
geom_line() +
geom_point() +
facet_grid(Sex~Age) # this breaks the graph into panes based on the sex and the 3 ages
We will be prepping a dataframe for analysis via ANOVA. We need to exclude the NAs and add labels to make the post-hocs and the ANOVA outputs easier to read.
# set up the statistics df
stats_df <- raw_df %>% # data
select(Sex, Age, Genotype, LAST) %>% # select only the column of interest
na.omit() %>% # remove the 2 NAs
mutate(Sex = factor(Sex, labels = c("Male", "Female")), # add the labels!
Age = factor(Age, labels = c("Young", "Adult", "Old")),
Genotype = factor(Genotype, labels = c("WT", "KO")))
glimpse(stats_df)
## Observations: 196
## Variables: 4
## $ Sex <fct> Male, Male, Female, Female, Female, Male, Male, Male,...
## $ Age <fct> Old, Old, Old, Old, Old, Adult, Adult, Adult, Adult, ...
## $ Genotype <fct> KO, KO, KO, KO, WT, KO, WT, WT, WT, KO, KO, KO, KO, K...
## $ LAST <dbl> 35.3425, 32.5975, 21.8000, 23.1625, 30.2675, 39.5600,...
Now that the data frame is ready to go we can move on to calling the ANOVA via aov
and inputting our formula. By using *
in between our variables we can do the main effects and interactions of those terms. Importantly the variables need to be factors for them to work in both the aov
call and the subsequent post-hoc pairwise comparisons.
# rotorod ANOVA, posthocs
rr_aov <- aov(LAST ~ Sex * Age * Genotype, data = stats_df)
# look at effects and interactions
summary(rr_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Sex 1 1777 1777.1 13.932 0.000253 ***
## Age 2 4575 2287.7 17.934 7.66e-08 ***
## Genotype 1 934 934.4 7.326 0.007438 **
## Sex:Age 2 103 51.7 0.406 0.667221
## Sex:Genotype 1 0 0.3 0.002 0.961824
## Age:Genotype 2 1580 790.2 6.195 0.002491 **
## Sex:Age:Genotype 2 439 219.6 1.722 0.181606
## Residuals 184 23470 127.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see significant main effects of Sex, Age, Genotype, and an interaction of Age:Genotype. Exciting! We can use the pairwise comparisons to find where any specific differences are in the comparisons.
For the pairwise, we need to use the $
to select columns from each of the dataframe and look at the interaction via :
. Our first pairwise has NO correction for multiple comparisons, and is comparable to a unprotected Fisher’s-LSD post-hoc. This is not stringent at all, and given the amount of comparisons we have it is advisable to move forward with the Tukey post-hoc test seen below.
# pairwise t.tests
rr_last <- pairwise.t.test(stats_df$LAST,
stats_df$Sex:stats_df$Age:stats_df$Genotype,
p.adj = "none")
We need to pull out the matrix of P values and can save it to an Excel file for long-term storage.
# pull out the posthoc p.values into a df
rr_last_pairwise <- data.frame(rr_last$p.value)
kable(head(rr_last_pairwise))
Male.Young.WT | Male.Young.KO | Male.Adult.WT | Male.Adult.KO | Male.Old.WT | Male.Old.KO | Female.Young.WT | Female.Young.KO | Female.Adult.WT | Female.Adult.KO | Female.Old.WT | |
---|---|---|---|---|---|---|---|---|---|---|---|
Male:Young:KO | 0.2819991 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Male:Adult:WT | 0.0159042 | 0.0005583 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Male:Adult:KO | 0.8092087 | 0.1882095 | 0.0296472 | NA | NA | NA | NA | NA | NA | NA | NA |
Male:Old:WT | 0.0076084 | 0.0001785 | 0.8754982 | 0.0153778 | NA | NA | NA | NA | NA | NA | NA |
Male:Old:KO | 0.0123509 | 0.0003368 | 0.9880718 | 0.0241009 | 0.8571498 | NA | NA | NA | NA | NA | NA |
Female:Young:WT | 0.7932903 | 0.4054488 | 0.0068485 | 0.6120026 | 0.0028831 | 0.0049361 | NA | NA | NA | NA | NA |
A more stringent option would be to perform a TukeyHSD test which accounts for multiple comparisons. It is a simpler call, and takes the ANOVA results into consideration for multiple comparisons. We will wrap our Tukey call with the broom::tidy
function to coerce the list output into a dataframe.
rr_last_tukey <- tidy(TukeyHSD(rr_aov, which = 'Sex:Age:Genotype'))
kable(head(rr_last_tukey))
term | comparison | estimate | conf.low | conf.high | adj.p.value |
---|---|---|---|---|---|
Sex:Age:Genotype | Female:Young:WT-Male:Young:WT | 1.0651875 | -12.373056 | 14.503431 | 1.0000000 |
Sex:Age:Genotype | Male:Adult:WT-Male:Young:WT | -10.0363333 | -23.689603 | 3.616936 | 0.3891134 |
Sex:Age:Genotype | Female:Adult:WT-Male:Young:WT | -2.1110294 | -15.356646 | 11.134587 | 0.9999953 |
Sex:Age:Genotype | Male:Old:WT-Male:Young:WT | -10.6558333 | -23.727843 | 2.416177 | 0.2353913 |
Sex:Age:Genotype | Female:Old:WT-Male:Young:WT | -0.5873438 | -14.025588 | 12.850900 | 1.0000000 |
Sex:Age:Genotype | Male:Young:KO-Male:Young:WT | 4.4498333 | -9.203436 | 18.103103 | 0.9952287 |
# write the posthocs to excel
write.csv(rr_last_pairwise, "rr_last_post-hoc.csv")
write.csv(rr_last_tukey, "rr_last_tukey_post-hoc.csv")
Now that we have our ANOVA results, we can work on a publication-grade plot. We will spend more time with custom adjustments to get it JUST RIGHT.
# plot of RR vs session
g1 <- ggplot(sum_df,
aes(x = session, y = fall_mean, fill = Genotype,
group = Genotype), color = "black") +
geom_line() +
geom_errorbar(aes(ymin = fall_mean - fall_se, ymax = fall_mean + fall_se), width = 0.2) +
geom_point(shape = 21, size = 3) +
facet_grid(Sex~Age) +
theme_bw() +
scale_fill_manual(values = c("white", "black")) +
theme(legend.position = c(0.04, .92),
legend.title = element_blank(),
axis.title = element_text(size = 20),
legend.background = element_blank(),
panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
labs(x = "\nSession",
y = "Latency to Fall (s)\n")
g1
Lastly we can save our publication graph as a .png file!
# save the RR graph yo
ggsave("RR.png", g1, height = 6, width = 10, units = "in")