Basic R tools

Introduction

In this script I take a very simple dataset and walk through, step by step, how to

Enter the data as vectors
Summarize the data (mean, variance, etc)
Make histograms and boxplots
Outline a script for making future plots
Conduct a t-test

I plan on adding

Calculate an effect size
Calculate the t-test by hand

Functions used in the script include: mean, var, sd, sqrt, length, sum, hist, boxplot, boxplot2, t.test

I have written this at a very basic level and I have built in a lot of redundancy, showing step by step how things are done, providing alternate ways of accomplishing the same thing, and building up elements in plots step by step.

Data Entry

There two mains ways to get data into R. First, you can enter it by hand, or second, you can load it from an external file, usually generated from Excel. In this example I'll enter a very simple dataset from Sokal and Rohlf's 4th edition of Biometry, published in 2012, on page 120. The data is originally from a study by Thomson and Randall-Maciever (1905). These particualr data are measurements of the skulls of Egyptians from two time periods, 3300 BC and 150 AD. They represent two small, independent samples. They question is whether there is any difference between the means of the data.

First, I'll enter the data by hand into two R objects. This is done using the assignment function, <- and the “combine values into a vector” function c(). Each value is separated by a comma. For sample problems and very small data sets this is fast way to enter day, and it keeps your data within your R script file.

# Data from pg 120 of Sokal and Rohlf (2012) The 3300 BC data, n = 8
BC <- c(131, 126, 131, 132, 135, 137, 130, 132)

# The 120 AD data, n = 10
AD <- c(147, 130, 137, 129, 135, 128, 138, 136, 145, 134)

Basic Data Exploration

R has many basic functions for examining data. They are very similar to those found in Excel, except you don't have to do any clicking and dragging. Let's take a look at sume of the basic characteristics of these two vectors of skull data

First, the mean of each one.

mean(BC)

## [1] 131.8

mean(AD)

## [1] 135.9

We can see that the means are fairly similar. We will later use a t-test to determine if this difference is “significant.”

How about the variance?

var(BC)

## [1] 10.79

var(AD)

## [1] 40.1

The variance of the two samples are pretty different. However, because the sample sizes are small they might not be significantly different. The standard t-test, as well as its big brother ANOVA, assume that the variances of the two groups are not statistically different. Violations of this assumption can sometimes cause problems with ANOVA, though you will often here the phrase “ANOVA is robust to violation of assumptions.” Often this is true, but it can, and its good to get into the practice of paying attention to such things. Also, for t-tests its easy to run in R a slightly modified test that does not assumse equal variances.

We rarely thing in terms of the raw variance. We can get the more standard deviation easily by using the sqrt() function like this

sqrt(var(BC))

## [1] 3.284

sqrt(var(AD))

## [1] 6.332

Even simpler is to just use R's sd() function

sd(BC)

## [1] 3.284

sd(AD)

## [1] 6.332

Base R does not have a function for the standard error, but you can do this yourself using the sd() and the sqrt() function. Recall that the sample size for the BC sample was n = 8

sd(BC)/sqrt(8)

## [1] 1.161

Given the flexibility of R, we could code this other ways too, such using the var() function instead of sd() and changing what is within the sqrt() function.

sqrt(var(BC)/8)

## [1] 1.161

Like functions and formulas in Excel, keeping track of your parentheses can be a pain. Good R script editors, like RStudio (Mac & PC) and TextWrangler (Mac) can be set up to automatically highlight matching parentheses, something the the editor that comes with R does not do.

In the calculations of the the standard error above I “hard coded” the sample size of n = 8 into the formula. An alternative would be to save this number into its own object and do the calculation with the object.

n.BC <- 8
sqrt(var(BC)/n.BC)

## [1] 1.161

What if we forgot how much data there is? THe easiest way would simply be to call up the BC object and count the number of elements.

BC

## [1] 131 126 131 132 135 137 130 132

If we're lazy, or want to write clean code, we can use R's length() function to do that

length(BC)

## [1] 8

We can clean things up then and make our code easier to reuse like this?

# First, determine our two sample sizes
n.BC <- length(BC)
n.AD <- length(AD)

# 2nd, calcualte the standard error (SE)
sqrt(var(BC)/n.BC)

## [1] 1.161

sqrt(var(BC)/n.AD)

## [1] 1.039

If we have a good eye for parenetheses, we can skip the first step and put everything into a single line.

# 2nd, calcualte the standard error (SE)
sqrt(var(BC)/length(BC))

## [1] 1.161

sqrt(var(BC)/length(AD))

## [1] 1.039

R pros love writing compact code like this. However, I find that its often easier when I'm looking back over an old analysis of to do things step by step to remind me what I did.

Introduction to Plotting: Histograms

What if we want to plot the data? For example, maybe we want to look at histograms of the distributions of each sample. R has a built-in histogram function, hist() for this.

hist(BC)

plot of chunk unnamed-chunk-13

hist(AD)

plot of chunk unnamed-chunk-13

Unfortunately, because the dataset is so small these graphs aren't very information.

Right now we have two separate histograms. We can put these two plots right next to each other by changing the graphics set up in R. Unfortunately, some of R's graphics stuff is not initially very intuitive and can have a fairly steep learning curve. For example, if we want to histograms next to each other we first use the par() command, which I think means something like “graphical parameters”, but I'm not sure. Within the par() command we put “mfrow”; not sure what the “mf” means either. It will be easier if I show you.

# First, set the graphical parameters.  c(1,2) means 'one column, two rows'
par(mfrow = c(1, 2))

# Then create the hisotrams.
hist(BC)
hist(AD)

plot of chunk unnamed-chunk-14

If we wanted the two histograms stacked, we'd the mfrow= stuff.

# First, set the graphical parameters.  c(2,1) means '2 columns, one row'
par(mfrow = c(2, 1))

# Then create the hisotrams.
hist(BC)
hist(AD)

plot of chunk unnamed-chunk-15

If happened to have four sets of data we could tell mfrow we wanted 2 columns and 2 rows

par(mfrow = c(2, 2))
hist(BC)
hist(AD)
hist(BC)
hist(AD)

plot of chunk unnamed-chunk-16

If we want to go back to showing just a single graph in the plotting area, we re-set part

par(mfrow = c(1, 1))

Introduction to Plotting: Barplots with barplot()

Histograms should be part of the normal practice of getting to know your data, but you'll rarely show them to other people. For that, barplots are frequently used. This requires a little set up of the data, though.

# First, we need the mean for each sample.  We'll save them to their own
# object
mean.BC <- mean(BC)
mean.AD <- mean(AD)

# 2nd, use the barplot() function.  Note that the two means have to be
# contained within a set of parentheses, like this: 'c(mean.BC,mean.AD)'

barplot(c(mean.AD, mean.BC))

plot of chunk unnamed-chunk-18

This, however, is pretty blah, and unfortunately the stock barplot() function is pretty limited in what it can do. For starters, it doen't have any easy way to plot error bars We'll spice it up using a similar function, barplot2()

Introduction to Plotting: Barplots with error bars, via barplot2()

The barplot2() function allows you to plot error bars. Firt, though, we need error bars to plot! The standard thing most people plot is the standard error. LEt's re-do those calculations and this time save them to some objects. For completeness we'll re-do the means and sample sizes

# The means
mean.BC <- mean(BC)
mean.AD <- mean(AD)

# Sample sizes using length()
n.BC <- length(BC)
n.AD <- length(AD)

# The standard errors
se.BC <- sd(BC)/sqrt(n.BC)
se.AD <- sd(AD)/sqrt(n.AD)

# We can see what we've made by calling up the objects.  Using a semi colon
# lets you write everyting on a single line.
mean.BC

## [1] 131.8

n.BC

## [1] 8

se.BC

## [1] 1.161

Now we have the parts for building out barplot. The barplot2() function is not in base R so you'll have to first download the gplots library where it is found. Then use the library() function to load it. I then have to explicitly define where the upper and lower error bars reach to using “ci.u” and “ci.l”, where “ci” means “confidence interval. Finally, because for some reason actually plotting the error bars is not the default, you need to tell barplot2() the "plot.ci = TRUE”.

#Load the library gtools
library(gtools)

#Make the plot with the objects we've made.
barplot2(height = c(mean.BC, mean.AD),           #The height of each bar
         ci.u = c(mean.BC+se.BC, mean.AD+se.AD),  #The height of the upper ci
         ci.l = c(mean.BC-se.BC, mean.AD-se.AD), #Height of the lower ci         
         plot.ci = TRUE)                         #Command to actually plot the cis

plot of chunk unnamed-chunk-20

We have error bars, but its still not that impressive. You're also probably thinking that it would be way easier to do this in Excel, which is probably true. I use R for about 2 years before I completely stopped plotting any of may data in Excel. For regression-type analyses the plotting functions are a bit more straight forward, For the sake of completeness, I'll continue walking through how to make this plot better. Also, once great advantage about R is that once you write code to produce a figure, you can save it in a script and recycle it. This makes re-making figures very easy or adapating them to new data fast. In contrast, for Excel you basically have to build each graph from scratch. Also with R, as I'll show you can easily change things like color and line width with very simple code (if you can remember it!); changes such as these can require lots of double-clicking on elements of graphs in Excel.

Building a better plot, part 1

First, let's clean up the code for this plot a bit. Right now I'm doing the math for where to extend the error bars within the plotting function. I can do the math outside the function by making objects for the upper and lower limits.


# Make two objects for the four positions.
CI.u <- c(mean.BC + se.BC, mean.AD + se.AD)
CI.l <- c(mean.BC - se.BC, mean.AD - se.AD)

In R and programmer speak these two objects would be considered “vectors” because they each contain two values. Alternatively, I could have made four separate objects.

Note: To prevent confusion, I capitalize C.

Now I'll build a new plot

barplot2(height = c(mean.BC, mean.AD), ci.u = CI.u, ci.l = CI.l, plot.ci = TRUE)

plot of chunk unnamed-chunk-22

I can clean it up even further by saving both of my means to a single vector object

means.BC.AD <- c(mean.BC, mean.AD)
# Load the library gtools
barplot2(height = means.BC.AD, ci.u = CI.u, ci.l = CI.l, plot.ci = TRUE)

plot of chunk unnamed-chunk-23

There are lots of ways to add information to a plot, such as categorcy title under the x-axis. This can be done adding the cryptically named “names.arg” to the code. You have to put what you want to call them within c() thingy, and put the names in qutation marks.

barplot2(height = means.BC.AD, ci.u = CI.u, ci.l = CI.l, plot.ci = TRUE, names.arg = c("BC", 
    "AD"))  # the column names

plot of chunk unnamed-chunk-24

Speaking of x-axes, for some reason the default doesn't include them. This is include by adding axes=TRUE.

barplot2(height = means.BC.AD, ci.u = CI.u, ci.l = CI.l, plot.ci = TRUE, names.arg = c("BC", 
    "AD"), axes = TRUE)

Hmmmm… that didn't do anything. That's because barplot2 unfortunately needs you to be very explicit about what you want. In this case, now that you told it you want an x-axis, you need to tell it what type of line (“lty”) you want you using the “axis.lty = 1” command. This is definately where using R can be a drag…

barplot2(height = means.BC.AD, ci.u = CI.u, ci.l = CI.l, plot.ci = TRUE, names.arg = c("BC", 
    "AD"), axis.lty = 1, axes = TRUE)

plot of chunk unnamed-chunk-26

…and it only plots a partial line! I haven't figured out how to get it all the way across. However, one thing: “lty” is a general R prefix/suffix for line type that shows up a lot in plotting.

Now, how about some more lables. LEt's assume what we're plotting is the diameter of these Egyptian skulls. We can add a label with this information to the y axis, as well as some information to the x axis and a descriptive title.

barplot2(height = means.BC.AD,           
         ci.u = CI.u,  
         ci.l = CI.l,          
         plot.ci = TRUE,
         names.arg = c("BC", "AD"),
         axis.lty = 1,
         axes = TRUE,
         ylab = "Cranial diamter (cm",                      #y axis label
         xlab = "Archeological Period" ,                    #x axis label
         main = "Comparison of BC and AD Egyptian Skulls")  #main title title.

plot of chunk unnamed-chunk-27

There are many other things we can change - the width of the lines within the plot (lwd), the width of the lines that make up the error beors (“ci.lwd”), the color of the bars. Font sizes can be a bit tricky because they are set in plots using an “expansion factor”, abbreviate “cex”; not sure what that means! In this plot, I increase the size of the column names a plot (cex.names = 2), and the size of the numbers on the axes and size of the axes lables a bit (cex.axis = 1.35). Though its heresy to R pros, it can be easier to do labels and change font size by pasting an R plot (or, heaven forbid, an Excel Plot!) into Power Point. Don't Tell anyone I said that.

barplot2(height = means.BC.AD,           
         ci.u = CI.u,  
         ci.l = CI.l,          
         plot.ci = TRUE,
         names.arg = c("BC", "AD"),
         axis.lty = 1,
         axes = TRUE,
         ylab = "Cranial diamter (cm)",    #y axis label
         xlab = "Archeological Period" ,   #x axis label
         main = "Comparison of BC & AD Egyptian Skulls",      
         lwd = 4,                          #line width
         ci.lwd = 2,                       #line width of ci
         col = c(2,3),                      #colors of bars
         cex.names = 2,                     #Increase font size.
         cex.axis = 1.35,
         cex.lab = 1.35)

plot of chunk unnamed-chunk-28

Speeding up Plotting: making and annotating a script

I if wanted to make plots like this in the future I could put all the code in the same place and make a handy boxplot script file along these lines. Including data makes it self-contained and sufficiently annotating helps one remember what's going on with all the code! You can make a directory full of these. What I'd do is save this as a text file titled “SxBoxplotCode.txt”, where “Sx” is my abbreviation for “Script.” I find putting a common prefix on the file names helps keep things organized.

#File name:             SxBoxplotCode.txt
#Script description:    bar plot for two means and standard errors using barplot2() from gtools libary.

#load data
data.1 <- c(131, 126, 131, 132, 135, 137, 130, 132)
data.2 <- c(147, 130, 137, 129, 135, 128, 138, 136, 145, 134)

#means
mean.1 <- mean(data.1)
mean.2 <- mean(data.2)
means.1.2 <- c(mean.1, mean.2)

#sample size
n.1 <- length(data.1)
n.2 <- length(data.2)

#standard error (SE)
se.1 <- sd(data.1)/sqrt(n.1)
se.2 <- sd(data.2)/sqrt(n.2)

CI.u <- c(mean.1+se.1, mean.2+se.2)
CI.l <- c(mean.1-se.1, mean.2-se.2) 

#load gplots library
library(gplots)

#Make plot
barplot2(height = means.1.2,           
         ci.u = CI.u,  
         ci.l = CI.l,          
         plot.ci = TRUE,                     #plot the ci for real
         names.arg = c("data.1", "data.2"),  #generic names
         axis.lty = 1,                       #axis line type
         axes = TRUE,                        #plot a horitzonal bar
         ylab = "y axis (units)",            #y axis label
         xlab = "x axis" ,                   #x axis label
         main = "Title",      
         lwd = 4,                           #line width for all plot elements
         ci.lwd = 2,                        #line width of ci
         col = c(2,3),                      #colors of bars. 1 = black.
         cex.names = 2,                     #Increase font size of column names
         cex.axis = 1.35,                   #Increase font size of axis numbers
         cex.lab = 1.35)                    #Increase size of x and y axis lables

plot of chunk unnamed-chunk-29

Statistical Inference with a t-test

The difference between the two means is small

abs(mean(BC) - mean(AD))

## [1] 4.15

A t-test, assuming variance are equal

t.test(BC, AD, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  BC and AD
## t = -1.675, df = 16, p-value = 0.1133
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.402  1.102
## sample estimates:
## mean of x mean of y 
##     131.8     135.9

I can print just the t-statistic by tagging the call to t.test with “$statistic”

t.test(BC, AD, var.equal = TRUE)$statistic

##      t 
## -1.675

Note that if you do not assume that the variance are equal you get a slightly different t

t.test(BC, AD, var.equal = FALSE)$statistic

##      t 
## -1.793

The standardized effect size is (mean(BC) - mean(AD))/(sd(c(BC,AD)))