Describing data: R code for Chapter 3 examples

Note: This document was converted to R-Markdown from this page, with modifications, by M. Drew LaMar. You can download the R-Markdown here.

Download the original R code on this page as a single file here

New methods

New methods on this page

Hover over a function argument for a short description of its meaning. The variable names are plucked from the examples further below.

Sample mean, assuming that there are no missing (NA) entries:

mean(snakeData$undulationRate)

Standard deviation, assuming that there are no missing (NA) entries:

sd(snakeData$undulationRate)

Calculate a statistic by group, assuming that there are no missing (NA) entries:

tapply(sticklebackData$$$plates, sticklebackData$genotype, FUN = median)

Obtain a subset of data from a data frame:

subset(spiderData, treatment == “before”)

Other new methods:

Variance and coefficient of variation.
Median and interquartile range.
Round to a preferred number of decimals.
Cumulative frequency distribution.
Table of descriptive statistics by group.
Mean and standard deviation from a frequency table.

Example 3.1. Gliding snakes

Sample mean, standard deviation, variance and coefficient of variation of undulation rate of 8 gliding paradise tree snakes, in Hz.

Read and inspect the data.

snakeData <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03e1GlidingSnakes.csv")
head(snakeData)

##   snake undulationRateHz
## 1     1              0.9
## 2     2              1.2
## 3     3              1.2
## 4     4              1.3
## 5     5              1.4
## 6     6              1.4

Draw a histogram of the data.

hist(snakeData$undulationRateHz, right = FALSE)

Commands for a fancier histogram are shown here.

hist(snakeData$undulationRateHz, 
     right = FALSE, 
     las = 1, 
     col = "firebrick",
     breaks = seq(0.8,2.2,by=0.2), 
     xlab = "Undulation rate (Hz)", 
     ylab = "Frequency", main = "")

Calculate the mean, standard deviation and variance of the undulation rates.

m <- mean(snakeData$undulationRate)
s <- sd(snakeData$undulationRate)
var(snakeData$undulationRate)

## [1] 0.105

Calculate the coefficient of variation.

100 * s/m

## [1] 23.56633

Table 3.1-2. Number sof convictions

Mean and standard deviation from a frequency table. The data are from Chapter 2.

Read and inspect the data. It is a frequency table of the number of convictions.

convictionsFreq <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03t1_2ConvictionsFreq.csv")
head(convictionsFreq)

##   convictions frequency
## 1           0       265
## 2           1        49
## 3           2        21
## 4           3        19
## 5           4        10
## 6           5        10

Calculate the mean and standard deviation from the frequency table. First, use the rep command to “repeat” each value in the table, the number of times according to its frequency. Here, we store the result in convictions.

convictions <- rep(convictionsFreq$convictions, convictionsFreq$frequency)

Then, calculate mean and standard deviation on the result.

mean(convictions)

## [1] 1.126582

sd(convictions)

## [1] 2.456562

Example 3.2. Spider running speed

Median, interquartile range, and box plot of running speed (cm/s) of male Tidarren spiders. We also include below the cumulative frequency distribution of running speed before amputation.

Read and inspect the data. The data are in “long” format. One variable indicates running speed, and a second variable gives treatment (before vs after amputation). Therefore, every individual spider is on two rows, once for its before-amputation measurement and one for its after-amputation measurement.

spiderData <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03e2SpiderAmputation.csv")
head(spiderData)

##   spider speed treatment
## 1      1  1.25    before
## 2      2  2.94    before
## 3      3  2.38    before
## 4      4  3.09    before
## 5      5  3.41    before
## 6      6  3.00    before

Box plot of the data. Begin by ordering the treatment levels so that the “before” amputation measurements come before the “after” measurements in the plot.

spiderData$treatment <- factor(spiderData$treatment, levels = c("before", "after"))
boxplot(speed ~ treatment, data = spiderData)

Instructions for a fancier box plot, made with additional options, is shown here.

par(bty = "l") # Makes the axes L-shaped
boxplot(speed ~ treatment, 
        data = spiderData, 
        ylim = c(0,max(spiderData$speed)),
        col = "goldenrod1", 
        boxwex = 0.5, # Scale width of boxplots by 0.5
        whisklty = 1, # Make whiskers linetype 1 (i.e. solid)
        xlab = "Amputation treatment", 
        ylab = "Running speed (cm/s)")

Extract the before-amputation data using subset. Note the double equal sign “==” needed in the logical statement, which indicates “is equal to” in R. Save the result in a new data frame named speedBefore.

speedBefore <- subset(spiderData, treatment == "before") 
speedBefore

##    spider speed treatment
## 1       1  1.25    before
## 2       2  2.94    before
## 3       3  2.38    before
## 4       4  3.09    before
## 5       5  3.41    before
## 6       6  3.00    before
## 7       7  2.31    before
## 8       8  2.93    before
## 9       9  2.98    before
## 10     10  3.55    before
## 11     11  2.84    before
## 12     12  1.64    before
## 13     13  3.22    before
## 14     14  2.87    before
## 15     15  2.37    before
## 16     16  1.91    before

Calculate the sample median of before-amputation running speed.

median(speedBefore$speed)

## [1] 2.9

Calculate the first and third quartiles of before-amputation running speed (0.25 and 0.75 quantiles). Type 5 reproduces the method we use in the book to calculate quartiles.

quantile(speedBefore$speed, 
         probs = c(0.25, 0.75), 
         type = 5)

##   25%   75% 
## 2.340 3.045

Determine the interquartile range of before-amputation running speed. Type 5 reproduces the method we use in the book to calculate quartiles.

IQR(speedBefore$speed, type = 5)

## [1] 0.705

Draw a cumulative frequency distribution of running speed before amputation (Figure 3.4-1).

plot(ecdf(speedBefore$speed), 
     verticals = TRUE,
     ylab = "Cumulative relative frequency", 
     xlab = "Running speed before amputation (cm/s)")

Instructions for a slightly more polished cumulative frequency distribution are here.

par(bty = "l")
plot(ecdf(speedBefore$speed), 
     verticals = TRUE,  
     las = 1, 
     main = "", 
     do.points = FALSE,
     ylab = "Cumulative relative frequency", 
     xlab = "Running speed before amputation (cm/s)" )

Example 3.3. Stickleback lateral plates

*Draw multiple histograms and produce a table of descriptive statistics by group for plate numbers of three stickleback genotypes. We also include a table of frequencies and proportions of stickleback genotypes.

Download and inspect the data.

sticklebackData <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03e3SticklebackPlates.csv")
head(sticklebackData)

##     id plates genotype
## 1  4-1     11       mm
## 2  4-2     63       Mm
## 3  4-4     22       Mm
## 4  4-5     10       Mm
## 5 4-10     14       mm
## 6 4-12     11       mm

Draw multiple histograms of number of plates, one histogram per genotype. Begin by setting the preferred order of the three genotypes in the figure. Here, we use the lattice package to draw the histograms, so it must be loaded first.

sticklebackData$genotype <- factor(sticklebackData$genotype, levels = c("MM","Mm","mm"))
library(lattice)
histogram(~ plates | genotype, 
          data = sticklebackData, 
          breaks = seq(0,70,by=2), 
          layout = c(1, 3), 
          col = "firebrick")

Commands to draw multiple histograms in base R, without using the lattice package, are here. This method is more tedious, but allows for easier addition of options.

oldpar = par(no.readonly = TRUE) # make backup of default graph settings
par(mfrow = c(3,1), 
    las = 1, 
    oma = c(4, 6, 2, 6), 
    mar = c(2, 5, 4, 2)) # adjust margins
hist(sticklebackData$plates[sticklebackData$genotype == "MM"], 
     right = FALSE, 
     breaks = seq(0,70,by=2), 
     main = "MM", 
     col = "firebrick",
     las = 1, 
     ylab = "Frequency")
hist(sticklebackData$plates[sticklebackData$genotype == "Mm"], 
     right = FALSE, 
     breaks = seq(0,70,by=2), 
     main = "Mm", 
     col = "firebrick",
     las = 1, 
     ylab = "Frequency")
hist(sticklebackData$plates[sticklebackData$genotype == "mm"], 
     right = FALSE, 
     breaks = seq(0,70,by=2), 
     main = "mm", 
     col = "firebrick",
     las = 1, 
     ylab = "Frequency")
mtext("Number of lateral plates", 
      side = 1, 
      outer = TRUE, 
      padj = 1.5)

par(oldpar) # revert to default graph settings

Make a table of descriptive statistics by group using tapply. The commands below assume that the variables contain no missing (NA) elements. We need to calculate all the statistics, one at a time. Save them so that we can put them all together in a table afterward.

To begin, get the sample sizes by group (genotype):

n <- tapply(sticklebackData$plates, 
            INDEX = sticklebackData$genotype, 
            FUN = length)
n

##  MM  Mm  mm 
##  82 174  88

Next, calculate the mean number of plates by group. Let’s round the results to 1 decimal place to make it easier to read them when we place into a table. To accomplish this, set digits = 1 in the round function.

meanPlates <- tapply(sticklebackData$plates, 
                     INDEX = sticklebackData$genotype, 
                     FUN = mean)
meanPlates <- round(meanPlates, digits = 1)
meanPlates

##   MM   Mm   mm 
## 62.8 50.4 11.7

Repeat for medians.

medianPlates <- tapply(sticklebackData$plates, 
                       INDEX = sticklebackData$genotype, 
                       FUN = median)
medianPlates

## MM Mm mm 
## 63 59 11

Repeat for standard deviations, including rounding.

sdPlates <- tapply(sticklebackData$plates, 
                   INDEX = sticklebackData$genotype, 
                   FUN = sd)
sdPlates <- round(sdPlates, 1) 
sdPlates

##   MM   Mm   mm 
##  3.4 15.1  3.6

Finally, the interquartile range by group.

iqrPlates <- tapply(sticklebackData$plates, 
                    INDEX = sticklebackData$genotype, 
                    FUN = IQR, type = 5)
iqrPlates

## MM Mm mm 
##  2 21  3

Make the table by assembling all the results in a new data frame (data frames can be used as tables for display purposes).

sticklebackTable <- data.frame(genotype = names(n), 
                               n = n, 
                               mean = meanPlates, 
                               median = medianPlates, 
                               sd = sdPlates, 
                               iqrange = iqrPlates)
sticklebackTable

##    genotype   n mean median   sd iqrange
## MM       MM  82 62.8     63  3.4       2
## Mm       Mm 174 50.4     59 15.1      21
## mm       mm  88 11.7     11  3.6       3

We can make a table of frequencies and proportions of the stickleback genotypes (Table 3.5-1). Below, we first generate a frequency table of genotypes (the dnn argument names the variable in the table).

sticklebackFreq <- table(sticklebackData$genotype, dnn = "genotype")
sticklebackFreq

## genotype
##  MM  Mm  mm 
##  82 174  88

We then convert the table to a data frame so that we can include the proportions in the table too.

sticklebackFreq <- data.frame(sticklebackFreq)
sticklebackFreq

##   genotype Freq
## 1       MM   82
## 2       Mm  174
## 3       mm   88

Finally, we calculate the proportions and put them into the data frame. To convert frequencies to proportions, divide the frequencies by the sum of the frequencies.

sticklebackFreq$proportion <- sticklebackFreq$Freq / sum(sticklebackFreq$Freq)
sticklebackFreq

##   genotype Freq proportion
## 1       MM   82  0.2383721
## 2       Mm  174  0.5058140
## 3       mm   88  0.2558140

The table would look even nicer if you round the proportions before including them in the table (give this a try).