Introduction

This is an example R Markdown file for my students in BIO 320 (Spring 2025) at La Salle University. I admit that I’m still very new to producing RMarkdown files so it will be a learning process for us all! Increasingly, R Markdown files are used over raw scripts because of the capacity to add a lot more commentary and context to your code outside of traditional comments (made with the # that we are all familiar with).

Well-explained code is useful for tutorials for anyone who is learning R, but it’s also really helpful in terms of explaining your thought process in analyzing data, and in terms of increasing the reproducibility of your data wrangling and statistical analysis.

For the aforementioned reasons, I am requiring your final project* to be turned in as an RMarkdown file that produces a HTML file, which I can successfully open via Firefox.

*You will be doing these projects as partners.

Note 1: I would like to add a special note that this R Markdown file tutorial is entirely based off of Whitlock and Schluter’s R Lab 11 associated with their classic biostatistics textbook called Analysis of Biological Data (3rd edition). Note 2: The title of this file is punchy and flippant. Brain size does not equal intelligence! Plus, who are we to comment on the cleverness of squirrels?

Load Libraries & Data

library(ggplot2)
library(dplyr)
library(magrittr)

mammals.df <- read.csv("DataForLabs/mammals.csv", header = TRUE)

Explore Data

hist(mammals.df$body_mass_kg, main = "Body Mass Histogram", xlab = "Body Mass (kg)")

hist(mammals.df$brain_mass_g, main = "Brain Mass Histogram", xlab = "Brain Mass (g)")

Notice that both traits are super skewed! Let’s go ahead and try log-transforming these two variables to improve normality.

mammals.df$logbrain <- log(mammals.df$brain_mass_g)
mammals.df$logbody <- log(mammals.df$body_mass_kg)
hist(mammals.df$logbody)

hist(mammals.df$logbrain)

ggplot(mammals.df, aes(x = logbody, y = logbrain)) +
  geom_point() +
  theme_minimal() +
  xlab("Log Body Mass (kg)") +
  ylab("Log Brain Mass (g)")

Based on this scatterplot, we can take an educated guess that there is going to be a positive relationship between body mass and brain size. As a biologist, this doesn’t really surprise me, although I’ve been outsmarted by a number of sneaky street cats in my life.

Let’s add a regression line to this plot, but notice that I’m going to hold off on adding the linear regression equation. Why? Because I don’t want to let myself get too excited about the data without first assessing the residuals!

ggplot(mammals.df, aes(x = logbody, y = logbrain)) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = lm, se = FALSE, na.rm = TRUE) +
  theme_minimal() +
  xlab("Log Body Mass (kg)") +
  ylab("Log Brain Mass (g)")
## `geom_smooth()` using formula = 'y ~ x'

Nice, this looks good! Let’s build a linear regression model and assess whether or not a standard linear regression or an alternative approach needs to be taken. ## Make a Linear Model

mammals.lm <- lm(logbrain ~ logbody, data = mammals.df)

Assess the Residuals of the Model

hist(mammals.lm$residuals, main = "Histogram of Residuals")

qqnorm(mammals.lm$residuals)
qqline(mammals.lm$residuals)

plot(mammals.df$logbrain, residuals(mammals.lm),
     main = "Residuals vs Log Brain Mass",
     xlab = "Log Brain Mass", ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)

The residuals look fairly normal to me! I see a nice cloud of residuals around the horizontal “zero” line, the histogram looks okay*, and the QQ plot looks solid.

*One thing to think about is whether the residuals are “good enough”. Biology is messy, biological data is messy, and it’s rare to have perfect residuals from real-life data.

Let’s look at the results of the model.

Look at Summary of Linear Regression Model

summary(mammals.lm)
## 
## Call:
## lm(formula = logbrain ~ logbody, data = mammals.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.71143 -0.50667 -0.05606  0.43833  1.94425 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.12719    0.09682   21.97   <2e-16 ***
## logbody      0.75451    0.02878   26.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.699 on 60 degrees of freedom
## Multiple R-squared:  0.9197, Adjusted R-squared:  0.9184 
## F-statistic: 687.3 on 1 and 60 DF,  p-value: < 2.2e-16

Identifying Over- and Underperformers in the Dataset

This data makes me curious–there was a decent spread (“scatter”) in the data around the regression line (the “trend”). This makes me wonder if there are mammals have brain masses that are much greater than predicted by the linear regression, and if there are mammals with brain masses that are much smaller than predicted. Let’s see if we can identify the mammal that has the biggest brain mass relative to their predicted brain mass (overperformer), and the mammal that has the smallest brain mass relative to their predicted brain mass (underperformer).

Let’s look at the residuals again, but this time we’ll label them.

brains_with_residuals <- mammals.df %>%
  mutate(residual = resid(mammals.lm))
library(ggrepel)

ggplot(brains_with_residuals, aes(x = logbody, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray") +
  geom_text_repel(aes(label = name), size = 3) +
  labs(title = "Residuals of Log Brain Size ~ Log Body Size",
       x = "Log Body Size",
       y = "Residual Log Brain Size (Observed - Predicted)") +
  theme_minimal()

# Find species with highest and lowest residuals
max_brain_species <- brains_with_residuals %>%
  filter(residual == max(residual)) %>%
  select(name, logbody, logbrain, residual)

min_brain_species <- brains_with_residuals %>%
  filter(residual == min(residual)) %>%
  select(name, logbody, logbrain, residual)

max_brain_species
##    name  logbody logbrain residual
## 1 Human 4.127134 7.185387 1.944249
min_brain_species
##            name  logbody logbrain  residual
## 1 Water opossum 1.252763 1.360977 -1.711431

This tells us which species has a larger brain than expected (overperformer) and which one has a smaller brain than expected (underperformer). Remember, this isn’t about the smallest brain — it’s about the deviation from the expected value.

max_brain_species
##    name  logbody logbrain residual
## 1 Human 4.127134 7.185387 1.944249
min_brain_species
##            name  logbody logbrain  residual
## 1 Water opossum 1.252763 1.360977 -1.711431

Are you surprised by these results?

Time to Make a Poster-Worthy Scatterplot

library(ggrepel)
library(ggpmisc)
## Loading required package: ggpp
## Registered S3 methods overwritten by 'ggpp':
##   method                  from   
##   heightDetails.titleGrob ggplot2
##   widthDetails.titleGrob  ggplot2
## 
## Attaching package: 'ggpp'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(ggpubr)
## 
## Attaching package: 'ggpubr'
## The following objects are masked from 'package:ggpp':
## 
##     as_npc, as_npcx, as_npcy
ggplot(mammals.df, aes(x = logbody, y = logbrain)) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = lm, se = TRUE, na.rm = TRUE) +
  theme_minimal() +
  xlab("Log Body Mass (kg)") +
  ylab("Log Brain Mass (g)") +
  geom_text_repel(aes(label = name), size = 3) + 
  stat_cor(label.y = 6) +
  stat_regline_equation(label.y = 6.5)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: ggrepel: 13 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Notice that you got a warning that 15 of the dots are not labeled because of overlap. You could fix this with a bit of code called max.overlaps. I’ll show you what that looks like below.

library(ggrepel)
library(ggpmisc)
library(ggpubr)
ggplot(mammals.df, aes(x = logbody, y = logbrain)) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = lm, se = TRUE, na.rm = TRUE) +
  theme_minimal() +
  xlab("Log Body Mass (kg)") +
  ylab("Log Brain Mass (g)") +
  geom_text_repel(aes(label = name), size = 3, max.overlaps = 20) + 
  stat_cor(label.y = 6) +
  stat_regline_equation(label.y = 6.5)
## `geom_smooth()` using formula = 'y ~ x'

To me, this is too much and too busy! I’d rather drop some labels.