A Gentle Introduction to Statistics in R

Introduction

After a great discussion started by Jesse Maegan (@kiersi) on Twitter, I decided to post a workthrough of some (altered) mouse behavior data. These data correspond to motor function testing in mice via the rotarod, a semi-automated apparatus that tests the ability of the mouse to balance and maintain position on spinning cylinder. The time to fall from the cylinder onto a well-padded surface is the measure of interest, where a longer time to fall indicates better balance and better performance! The mice will be tested across at least 8 timepoints, but we are primarily interested in their peak performance at their LAST session.

We will be working through loading, plotting, analyzing, and saving the outputs through the tidyverse, an “opinionated collection of R packages” designed for data analysis. The rest of our functions will be performed in base R to limit package dependence.

Load the tidyverse, broom, knirt

Using the library function we will load the tidyverse. If you have never installed it before you can also use the install.packages("tidyverse") call to install it for the first time. This package includes ggplot2 (graphs), dplyr/tidyr (sumamry statistics, data manipulation), and readxl (reading excel files) as well as the pipe %>% which will make our code much more readable! We will also load the broom package to tidy up some of our statistical outputs. Lastly we will load knitr for making nice tables for the html output, but not necessary for simply saving the outputs to Excel.

# Load for ggplot, dplyr, tidyr, readxl
library(tidyverse)
library(broom)
library(knitr)

Read Excel file

While I am calling readxl::read_xlsx you could also simply use read_xlsx, but in the interest of transparency, I will be using the full call to begin. By using the glimpse function from dplyr we can see how the variables were imported, as well as the first few rows. We can already see some NAs, which we will need to address later!

# read excel file
raw_df <- readxl::read_xlsx("rotorod_study.xlsx")

dplyr::glimpse(raw_df)

## Observations: 198
## Variables: 11
## $ Sex      <dbl> 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,...
## $ Age      <dbl> 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,...
## $ Genotype <dbl> 2, 2, 2, 2, 0, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ S1       <dbl> 10.9025, 11.4275, 12.1775, 6.6500, 13.9775, 15.0450, ...
## $ S2       <dbl> 15.7250, 22.2475, 17.5900, 7.7350, 18.2075, 24.3075, ...
## $ S3       <dbl> 17.9100, 28.4525, 22.4350, 22.9175, 31.8125, 19.9050,...
## $ S4       <dbl> 19.9400, 26.1250, 3.9875, 29.0450, 28.3950, 20.8750, ...
## $ S5       <dbl> 33.7150, 34.3150, 17.2275, 25.5925, 24.1325, 38.2100,...
## $ S6       <dbl> 36.2925, 36.3100, 17.1800, 30.2475, 26.4300, 37.6000,...
## $ S7       <dbl> 35.3425, 32.5975, 21.8025, 23.1625, 30.2675, 39.5600,...
## $ LAST     <dbl> 35.3425, 32.5975, 21.8000, 23.1625, 30.2675, 39.5600,...

Check distribution

By calling ggplot we can take a look at the distribution. It looks like we have a slight right-skewed dataset for our variable of interest (LAST). There appear to be a few mice that performed very well.

# density plot

ggplot(raw_df, aes(x = LAST)) +
  geom_density()

## Warning: Removed 2 rows containing non-finite values (stat_density).

Data cleaning

Since we had a few NAs and we noticed that the dataset is in a wide-format, instead of a tall format we need to make some adjustments. A wide-dataset means that a variable of interest (session) is listed as a column header. We can convert it to a tall dataset by transposing each of the sessions (S1, S2, S3, etc) into a new column named session and the corresponding time to fall data into a parallel column titled fall_time. We will use the tidyr::gather function to switch from wide to tall format.

# check for NAs
raw_df %>% 
  summarize(na_count = sum(is.na(LAST)))

## # A tibble: 1 x 1
##   na_count
##      <int>
## 1        2

# create a tidy/tall data frame

clean_df <- raw_df %>% # begin by piping the df 
  na.omit() %>% # remove the NAs from the dataset 
  gather(session, fall_time, S1, S2, S3, S4, S5, S6, S7, LAST) #%>% 
# use gather to shift S1 --> LAST into one column named session
# and corresponding fall time into column named fall_time

# now we can check for NAs and take a glimpse at the data
clean_df %>% 
  summarize(na_count = sum(is.na(fall_time)))

## # A tibble: 1 x 1
##   na_count
##      <int>
## 1        0

glimpse(clean_df)

## Observations: 1,568
## Variables: 5
## $ Sex       <dbl> 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0...
## $ Age       <dbl> 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0...
## $ Genotype  <dbl> 2, 2, 2, 2, 0, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0...
## $ session   <chr> "S1", "S1", "S1", "S1", "S1", "S1", "S1", "S1", "S1"...
## $ fall_time <dbl> 10.9025, 11.4275, 12.1775, 6.6500, 13.9775, 15.0450,...

Summary Statistics

Now that we have a clean data set, we can get our summary stats. We are looking to generate the mean and standard error for fall time. We have our categorical variables of Sex, Age, Genotype, and session. Importantly Sex, Age, Genotype are coded as numerical variables so let’s convert them into factors AND change the labels to something more meaningful.

# create a summary data frame
sum_df <- clean_df %>% # start with data
  group_by(Sex, Age, Genotype, session) %>%  # group by categorical variables
  summarize(fall_mean = mean(fall_time),    # calc mean
            fall_se = sd(fall_time)/sqrt(n())) %>%  # calc standard error
  ungroup() %>% # really important to drop the grouping variable for next stage
  
  mutate(Age = factor(Age, labels = c("Young", "Adult", "Old")),
         session = factor(session, levels = c("S1", "S2", "S3", 'S4', "S5", 'S6', "S7", "LAST")),
         Genotype = factor(Genotype, labels = c("WT", "KO")),
         Sex = factor(Sex, labels = c("Male", "Female")))

glimpse(sum_df)

## Observations: 96
## Variables: 6
## $ Sex       <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male...
## $ Age       <fct> Young, Young, Young, Young, Young, Young, Young, You...
## $ Genotype  <fct> WT, WT, WT, WT, WT, WT, WT, WT, KO, KO, KO, KO, KO, ...
## $ session   <fct> LAST, S1, S2, S3, S4, S5, S6, S7, LAST, S1, S2, S3, ...
## $ fall_mean <dbl> 32.017500, 15.250667, 20.137667, 25.021333, 25.43983...
## $ fall_se   <dbl> 2.680055, 1.574577, 1.778810, 1.760546, 1.456819, 1....

We can see that our categorical variables have been converted to factors, with the appropriate levels/labels! We also have the mean and se calculated now!

Plotting summary statistics

There is an important distinction between personal use graphs and publication graphs. We don’t want to necessarily spend a large chunk of time making an exploratory graph JUST PERFECT, if we might end up discarding it. The first graph below is fine for just taking a look at the data. We will work on making a publication grade graph later on.

ggplot(data = sum_df, # add the data
       aes(x = session, y = fall_mean, # aesthetics for x, y coordinates 
           group = Genotype,  # group by genotype for line graphs
           color = Genotype)) +    # color by genotype
  geom_line() +
  geom_point() +
  facet_grid(Sex~Age) # this breaks the graph into panes based on the sex and the 3 ages

Statistics

We will be prepping a dataframe for analysis via ANOVA. We need to exclude the NAs and add labels to make the post-hocs and the ANOVA outputs easier to read.

# set up the statistics df
stats_df <- raw_df %>% # data
  select(Sex, Age, Genotype, LAST) %>% # select only the column of interest
  na.omit() %>% # remove the 2 NAs
  mutate(Sex = factor(Sex, labels = c("Male", "Female")), # add the labels!
         Age = factor(Age, labels = c("Young", "Adult", "Old")),
         Genotype = factor(Genotype, labels = c("WT", "KO")))
glimpse(stats_df)

## Observations: 196
## Variables: 4
## $ Sex      <fct> Male, Male, Female, Female, Female, Male, Male, Male,...
## $ Age      <fct> Old, Old, Old, Old, Old, Adult, Adult, Adult, Adult, ...
## $ Genotype <fct> KO, KO, KO, KO, WT, KO, WT, WT, WT, KO, KO, KO, KO, K...
## $ LAST     <dbl> 35.3425, 32.5975, 21.8000, 23.1625, 30.2675, 39.5600,...

Now that the data frame is ready to go we can move on to calling the ANOVA via aov and inputting our formula. By using * in between our variables we can do the main effects and interactions of those terms. Importantly the variables need to be factors for them to work in both the aov call and the subsequent post-hoc pairwise comparisons.

# rotorod ANOVA, posthocs
rr_aov <- aov(LAST ~ Sex * Age * Genotype, data = stats_df)

# look at effects and interactions
summary(rr_aov)

##                   Df Sum Sq Mean Sq F value   Pr(>F)    
## Sex                1   1777  1777.1  13.932 0.000253 ***
## Age                2   4575  2287.7  17.934 7.66e-08 ***
## Genotype           1    934   934.4   7.326 0.007438 ** 
## Sex:Age            2    103    51.7   0.406 0.667221    
## Sex:Genotype       1      0     0.3   0.002 0.961824    
## Age:Genotype       2   1580   790.2   6.195 0.002491 ** 
## Sex:Age:Genotype   2    439   219.6   1.722 0.181606    
## Residuals        184  23470   127.6                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see significant main effects of Sex, Age, Genotype, and an interaction of Age:Genotype. Exciting! We can use the pairwise comparisons to find where any specific differences are in the comparisons.

For the pairwise, we need to use the $ to select columns from each of the dataframe and look at the interaction via :. Our first pairwise has NO correction for multiple comparisons, and is comparable to a unprotected Fisher’s-LSD post-hoc. This is not stringent at all, and given the amount of comparisons we have it is advisable to move forward with the Tukey post-hoc test seen below.

# pairwise t.tests
rr_last <- pairwise.t.test(stats_df$LAST, 
                stats_df$Sex:stats_df$Age:stats_df$Genotype, 
                p.adj = "none")

We need to pull out the matrix of P values and can save it to an Excel file for long-term storage.

# pull out the posthoc p.values into a df
rr_last_pairwise <- data.frame(rr_last$p.value)
kable(head(rr_last_pairwise))

	Male.Young.WT	Male.Young.KO	Male.Adult.WT	Male.Adult.KO	Male.Old.WT	Male.Old.KO	Female.Young.WT	Female.Young.KO	Female.Adult.WT	Female.Adult.KO	Female.Old.WT
Male:Young:KO	0.2819991	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Male:Adult:WT	0.0159042	0.0005583	NA	NA	NA	NA	NA	NA	NA	NA	NA
Male:Adult:KO	0.8092087	0.1882095	0.0296472	NA	NA	NA	NA	NA	NA	NA	NA
Male:Old:WT	0.0076084	0.0001785	0.8754982	0.0153778	NA	NA	NA	NA	NA	NA	NA
Male:Old:KO	0.0123509	0.0003368	0.9880718	0.0241009	0.8571498	NA	NA	NA	NA	NA	NA
Female:Young:WT	0.7932903	0.4054488	0.0068485	0.6120026	0.0028831	0.0049361	NA	NA	NA	NA	NA

A more stringent option would be to perform a TukeyHSD test which accounts for multiple comparisons. It is a simpler call, and takes the ANOVA results into consideration for multiple comparisons. We will wrap our Tukey call with the broom::tidy function to coerce the list output into a dataframe.

rr_last_tukey <- tidy(TukeyHSD(rr_aov, which = 'Sex:Age:Genotype'))


kable(head(rr_last_tukey))

term	comparison	estimate	conf.low	conf.high	adj.p.value
Sex:Age:Genotype	Female:Young:WT-Male:Young:WT	1.0651875	-12.373056	14.503431	1.0000000
Sex:Age:Genotype	Male:Adult:WT-Male:Young:WT	-10.0363333	-23.689603	3.616936	0.3891134
Sex:Age:Genotype	Female:Adult:WT-Male:Young:WT	-2.1110294	-15.356646	11.134587	0.9999953
Sex:Age:Genotype	Male:Old:WT-Male:Young:WT	-10.6558333	-23.727843	2.416177	0.2353913
Sex:Age:Genotype	Female:Old:WT-Male:Young:WT	-0.5873438	-14.025588	12.850900	1.0000000
Sex:Age:Genotype	Male:Young:KO-Male:Young:WT	4.4498333	-9.203436	18.103103	0.9952287

# write the posthocs to excel
write.csv(rr_last_pairwise, "rr_last_post-hoc.csv")
write.csv(rr_last_tukey, "rr_last_tukey_post-hoc.csv")

Publication plot

Now that we have our ANOVA results, we can work on a publication-grade plot. We will spend more time with custom adjustments to get it JUST RIGHT.

# plot of RR vs session
g1 <- ggplot(sum_df, 
       aes(x = session, y = fall_mean, fill = Genotype,  
          group = Genotype), color = "black") +
  geom_line() +
  geom_errorbar(aes(ymin = fall_mean - fall_se, ymax = fall_mean + fall_se), width = 0.2) +
  geom_point(shape = 21, size = 3) +
  facet_grid(Sex~Age) +
  theme_bw() +
  scale_fill_manual(values = c("white", "black")) +
  theme(legend.position = c(0.04, .92),
        legend.title = element_blank(),
        axis.title = element_text(size = 20),
        legend.background = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  labs(x = "\nSession", 
       y = "Latency to Fall (s)\n")
g1

Lastly we can save our publication graph as a .png file!

# save the RR graph yo
ggsave("RR.png", g1, height = 6, width = 10, units = "in")