Summary

The goal of this document is to increase your R skills in a meaningful way. This tutorial gives you two very valuable tools for summarizing and plotting your data. ddply() is a function in the plyr package that lets you group and summarize your data easily. ggplot2 will improve your data display and wow your friends and employers!

Let's begin with the basics…

Step 1: Load your packages and data

We will use the CO2 dataset that is built into R. Make note of the column headers. We will use this information to make our graphs.

library(plyr)
library(ggplot2)

# Carbon Dioxide Uptake in Grass Plants ####
    # Type: Dataframe
    # Continuous Variables: 2
    # Qualitative Variables: 3

# The CO2 data frame has 84 rows and 5 columns of data from an experiment on the 
# cold tolerance of the grass species Echinochloa crus-galli. The CO2 uptake of 
# six plants from Quebec and six plants from Mississippi was measured at 
# several levels of ambient CO2 concentration. Half the plants of each type
# were chilled overnight before the experiment was conducted.

data("CO2") # Load the built-in CO2 data
head(CO2)   # First six rows of data

##   Plant   Type  Treatment conc uptake
## 1   Qn1 Quebec nonchilled   95   16.0
## 2   Qn1 Quebec nonchilled  175   30.4
## 3   Qn1 Quebec nonchilled  250   34.8
## 4   Qn1 Quebec nonchilled  350   37.2
## 5   Qn1 Quebec nonchilled  500   35.3
## 6   Qn1 Quebec nonchilled  675   39.2

Step 2: Investigate the data types

When making graphs, R often requires certain types of data. Common data types include numeric, integer, logical, and factor. Use the str() function to see what R 'thinks' your data type is.

str(CO2)

## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   84 obs. of  5 variables:
##  $ Plant    : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Type     : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
##  $ conc     : num  95 175 250 350 500 675 1000 95 175 250 ...
##  $ uptake   : num  16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
##  - attr(*, "formula")=Class 'formula'  language uptake ~ conc | Plant
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "outer")=Class 'formula'  language ~Treatment * Type
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Ambient carbon dioxide concentration"
##   ..$ y: chr "CO2 uptake rate"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(uL/L)"
##   ..$ y: chr "(umol/m^2 s)"
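
If the full str() output is more than you need, a compact alternative is to ask for the class of each column. This is a minimal sketch using base R only. The names co2Fresh and colClasses are illustrative, and datasets::CO2 is used so the example always starts from the untouched built-in data.

```r
# Fetch a fresh copy of the built-in data, independent of any edits made to CO2
co2Fresh <- datasets::CO2

# First class of every column (Plant is an ordered factor, so it has two classes)
colClasses <- vapply(co2Fresh, function(col) class(col)[1], character(1))
print(colClasses)
```

Columns reported as "factor" or "ordered" are treated as categories by ggplot(), while "numeric" columns are treated as continuous.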

Notice that Type and Treatment are factors, while conc and uptake are numeric. Problem: conc is stored as a number, but according to the experimental design it should be a factor, because only a fixed set of concentrations was tested. We can change the data type easily.

CO2$conc <- as.factor(CO2$conc)
  
str(CO2)

## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   84 obs. of  5 variables:
##  $ Plant    : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Type     : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
##  $ conc     : Factor w/ 7 levels "95","175","250",..: 1 2 3 4 5 6 7 1 2 3 ...
##  $ uptake   : num  16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
##  - attr(*, "formula")=Class 'formula'  language uptake ~ conc | Plant
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "outer")=Class 'formula'  language ~Treatment * Type
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Ambient carbon dioxide concentration"
##   ..$ y: chr "CO2 uptake rate"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(uL/L)"
##   ..$ y: chr "(umol/m^2 s)"
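
One caveat is worth knowing once conc is a factor: converting it back to numbers with as.numeric() alone returns the internal level codes (1, 2, 3, ...), not the concentrations. Here is a small sketch of the safe round trip, using a fresh copy of the data (concFactor, levelCodes, and concValues are illustrative names).

```r
concFactor <- as.factor(datasets::CO2$conc)

# Pitfall: as.numeric() on a factor gives the level codes, not the labels
levelCodes <- as.numeric(concFactor)

# Safe: go through the character labels first
concValues <- as.numeric(as.character(concFactor))

head(levelCodes)  # 1 2 3 4 5 6
head(concValues)  # 95 175 250 350 500 675
```

Keep this in mind if you later want conc back on a continuous axis.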

Step 3: Summarize the data with ddply

ddply() is an excellent tool. I use this extensively when preparing ggplot() graphs. Below I will give you a simple example of how ddply works. Then, I will give you more useful code to prep our ggplot() visualization.

Note: Copy and paste this code. You do not have to type this out every time.

Easy ddply example

Say we want the sum of the CO2 uptake data for each level of the Treatment variable. We can accomplish this task easily with the following code.

Notice the order: ddply(dataset, c("column"), summarise, newColumnName = function(column))

example1 <- ddply(CO2, c("Treatment"), summarise,
                  sumUptake = sum(uptake))

head(example1) # summary statistics used for ggplot()

##    Treatment sumUptake
## 1 nonchilled    1287.0
## 2    chilled     998.9
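
As a cross-check, the same grouped sum can be computed without plyr. This is a sketch using base R's aggregate() on a plain data frame copy of the data. It is not part of the ddply workflow, only a way to confirm the numbers above (co2Plain and sumByTreatment are illustrative names).

```r
# Plain data frame copy of the built-in data (drops the groupedData classes)
co2Plain <- as.data.frame(datasets::CO2)

# Sum uptake within each level of Treatment
sumByTreatment <- aggregate(uptake ~ Treatment, data = co2Plain, FUN = sum)
print(sumByTreatment)
```

The sums should match the sumUptake column that ddply() produced.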

ddply for ggplots

Here is the code that gives you the number of samples (i.e., replication), means, standard deviations, and standard errors. You can summarize by any number of categorical variables using this code.

So, let's create an object with the descriptive statistics for each combination of the Treatment, Type, and conc categorical variables.

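example2 <- ddply(CO2, c("Treatment", "Type", "conc"), summarise,
                  dataRep  = sum(!is.na(uptake)),        # number of non-missing samples
                  dataMean = mean(uptake, na.rm = TRUE), # group mean
                  dataSD   = sd(uptake, na.rm = TRUE),   # group standard deviation
                  dataSE   = dataSD / sqrt(dataRep))     # standard error of the mean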

head(example2)

##    Treatment   Type conc dataRep dataMean   dataSD    dataSE
## 1 nonchilled Quebec   95       3 15.26667 1.446836 0.8353309
## 2 nonchilled Quebec  175       3 30.03333 2.569695 1.4836142
## 3 nonchilled Quebec  250       3 37.40000 2.762245 1.5947832
## 4 nonchilled Quebec  350       3 40.36667 2.746513 1.5857000
## 5 nonchilled Quebec  500       3 39.60000 3.897435 2.2501852
## 6 nonchilled Quebec  675       3 41.50000 2.351595 1.3576941
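
The standard error is the standard deviation divided by the square root of the number of samples. To see where a single row of the table comes from, here is a sketch that recomputes the first row (nonchilled Quebec plants at conc = 95) by hand with base R. The names co2Plain, grp, n, s, and se are illustrative.

```r
co2Plain <- as.data.frame(datasets::CO2)

# Uptake values for one group: nonchilled Quebec plants at conc = 95
grp <- subset(co2Plain,
              Treatment == "nonchilled" & Type == "Quebec" & conc == 95)$uptake

n  <- sum(!is.na(grp))        # replication
s  <- sd(grp, na.rm = TRUE)   # standard deviation
se <- s / sqrt(n)             # standard error of the mean

c(n = n, mean = mean(grp), sd = s, se = se)
```

The values should agree with dataRep, dataMean, dataSD, and dataSE in row 1 of the output above.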

You did it! Great job! You can easily change which categorical variables you summarize by. Let's use ddply() to prepare some sample plots with ggplot().

Step 4: Graphing Options

Now that we have learned how ddply() works, I am going to show you how to code common graphs in R. When working with your own data, you can copy and paste this code to make your graphs. Be sure to change the code to match your variables.

Boxplots

Boxplots require a continuous dependent variable and at least one categorical variable.

Here are a couple of examples.

Notice that the first step creates a ggplot object from your data. Be sure to choose x and y variables that exist in your data frame. In this case, we are using the CO2 dataset with Treatment on the x-axis and uptake on the y-axis.

In the second step, we add style layers to our plot object! To make a boxplot, we use the geom_boxplot() function. Note that each additional layer requires a + sign.

### Load the graphing data for the x and y axis
graphBox1 <- ggplot(data = CO2, aes(x = Treatment, y = uptake))

#### Run your graph
graphBox1 + 
  theme_classic() +
  geom_boxplot()

Boxplot with a single factor.
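
Axis labels are one more layer you may want to add. The sketch below uses labs() with the label and unit text that str(CO2) reported in the "labels" and "units" attributes. The object name graphBoxLabelled is illustrative, and the exact label wording is a choice, not something the dataset enforces.

```r
library(ggplot2)

# Same single-factor boxplot, with descriptive axis labels added as a layer
graphBoxLabelled <- ggplot(data = datasets::CO2,
                           aes(x = Treatment, y = uptake)) +
  theme_classic() +
  geom_boxplot() +
  labs(x = "Treatment", y = "CO2 uptake rate (umol/m^2 s)")

graphBoxLabelled
```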

To make a boxplot with two categorical factors, the code changes slightly. Notice the use of the fill argument when we load our data. This is the categorical factor that we want in our legend.

## GGPLOT Box Plots - Two Independent Variables #####

### Load the graphing data for the x and y axis
graphBox2 <- ggplot(data = CO2, aes(x = Treatment, y = uptake, fill = Type))

#### Run your graph
graphBox2 + 
  theme_classic() +
  geom_boxplot()

It is often very useful to visualize the density and the raw data. This can give us a sense of the normality and spread of the data. Notice also that we use position_dodge(). This lines the points up with their violins.

### Load the graphing data for the x and y axis
graphBox2 <- ggplot(data = CO2, aes(x = Treatment, y = uptake, fill = Type))

dodge <- position_dodge(width = 0.9)

#### Run your graph
graphBox2 + 
  theme_classic() +
  geom_violin(trim = FALSE) +
  geom_point(position = dodge)
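
With a plain dodge, points that share a similar uptake value are drawn on top of each other. One option is ggplot2's position_jitterdodge(), which dodges the points by group and also spreads them slightly sideways. This is a sketch, not the only way to do it. The jitter width of 0.1 and the seed are arbitrary choices, and graphViolin is an illustrative name.

```r
library(ggplot2)

graphViolin <- ggplot(data = datasets::CO2,
                      aes(x = Treatment, y = uptake, fill = Type)) +
  theme_classic() +
  geom_violin(trim = FALSE) +
  # Dodge points by Type, then jitter them sideways; seed makes the jitter repeatable
  geom_point(position = position_jitterdodge(jitter.width = 0.1,
                                             dodge.width = 0.9,
                                             seed = 1))

graphViolin
```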

Bar Graphs

Here is some bar graph code. I will give examples with one or two categorical variables using the CO2 dataset. In each case I will summarize the data with ddply() first and then run the ggplot() code.

Single factor bar graph

singleVar <- ddply(CO2, c("Treatment"), summarise,
                   dataRep  = sum(!is.na(uptake)),
                   dataMean = mean(uptake, na.rm = TRUE),
                   dataSD   = sd(uptake, na.rm = TRUE),
                   dataSE   = dataSD / sqrt(dataRep))

head(singleVar)

##    Treatment dataRep dataMean    dataSD   dataSE
## 1 nonchilled      42 30.64286  9.704994 1.497513
## 2    chilled      42 23.78333 10.884312 1.679486

### Load the graphing data for the x and y axis
graphBar1 <- ggplot(data = singleVar, aes(x = Treatment, y = dataMean))

#### Set error bar limits using SD or SE around the mean
limits <- aes(ymax = dataMean + dataSD, ymin = dataMean - dataSD) 

#### Run your graph
graphBar1 + 
  theme_classic() +       
  geom_bar(stat="identity", fill = "royalblue1", colour = "royalblue4") +
  geom_errorbar(limits, width = 0.2)  # Error bars: mean +/- SD

Notice I added some color to the graph. For the names of colors, go to: http://sape.inf.usi.ch/quick-reference/ggplot2/colour
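
The error bars above show one standard deviation. If you would rather show the standard error, swap dataSD for dataSE in the limits. Here is a self-contained sketch of that variation. The names singleVarSE and graphBarSE are illustrative, and the choice between SD and SE depends on what you want the bars to communicate.

```r
library(plyr)
library(ggplot2)

singleVarSE <- ddply(datasets::CO2, c("Treatment"), summarise,
                     dataRep  = sum(!is.na(uptake)),
                     dataMean = mean(uptake, na.rm = TRUE),
                     dataSD   = sd(uptake, na.rm = TRUE),
                     dataSE   = dataSD / sqrt(dataRep))

# Same bar graph, but the error bars now span mean +/- one standard error
graphBarSE <- ggplot(data = singleVarSE, aes(x = Treatment, y = dataMean)) +
  theme_classic() +
  geom_bar(stat = "identity", fill = "royalblue1", colour = "royalblue4") +
  geom_errorbar(aes(ymin = dataMean - dataSE, ymax = dataMean + dataSE),
                width = 0.2)

graphBarSE
```

Whichever you choose, say so in the figure caption, because SD and SE bars look alike but mean different things.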

Two factor bar graph

twoVar <- ddply(CO2, c("Treatment", "conc"), summarise,
                dataRep  = sum(!is.na(uptake)),
                dataMean = mean(uptake, na.rm = TRUE),
                dataSD   = sd(uptake, na.rm = TRUE),
                dataSE   = dataSD / sqrt(dataRep))

head(twoVar)

##    Treatment conc dataRep dataMean   dataSD    dataSE
## 1 nonchilled   95       6 13.28333 2.398680 0.9792571
## 2 nonchilled  175       6 25.11667 5.711888 2.3318686
## 3 nonchilled  250       6 32.46667 5.924075 2.4184936
## 4 nonchilled  350       6 35.13333 6.116099 2.4968870
## 5 nonchilled  500       6 35.10000 5.650133 2.3066570
## 6 nonchilled  675       6 36.01667 6.343317 2.5896482

### Load the graphing data for the x and y axis
graphBar2 <- ggplot(data = twoVar, aes(x = Treatment, y = dataMean, fill = conc))

#### Set error bar limits using SD or SE around the mean
limits <- aes(ymax = dataMean + dataSD, ymin = dataMean - dataSD) 

dodge <- position_dodge(width = 0.9) # Dodge overlapping objects side-to-side

#### Run your graph
graphBar2 + 
  theme_classic() +                   
  geom_bar(stat = "identity", position = dodge, colour = "black") +
  geom_errorbar(limits, width = 0.2, position = dodge) + # Error bars: mean +/- SD
  scale_fill_brewer(palette = "Greens")                  # ColorBrewer fill palette

Notice that I now use ColorBrewer to create my color scales. To see the options, install the RColorBrewer package and use this code:

library(RColorBrewer)
display.brewer.all(type="div")

display.brewer.all(type="seq")

display.brewer.all(type="qual")
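
If you want the actual colour codes rather than a picture of the palettes, RColorBrewer's brewer.pal() returns them as hex strings. This sketch pulls seven shades of "Greens", one for each conc level, and looks up the palette's metadata. The name greens7 is illustrative.

```r
library(RColorBrewer)

# Seven shades from the sequential "Greens" palette, one per conc level
greens7 <- brewer.pal(n = 7, name = "Greens")
greens7

# Palette metadata: maximum number of colours, category, colour-blind friendliness
brewer.pal.info["Greens", ]
```

These hex codes can be passed to scale_fill_manual(values = greens7) if you need to control the colours yourself.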

Line graph with two variables (X-Y Plots)

When plotting two continuous variables, you do not need to use ddply(). We will load a new built-in dataset that has several continuous variables.

# Motor Trend Car Road Tests ####
    # Type: Dataframe
    # Continuous Variables: many
    # Qualitative Variables: many
# The data was extracted from the 1974 Motor Trend US magazine, and 
# comprises fuel consumption and 10 aspects of automobile design and 
# performance for 32 automobiles (1973-74 models)

data("mtcars")

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

To make a line graph, you can use the raw data directly. Load your data:

### Load the graphing data for the x and y axis
graphLine1 <- ggplot(data = mtcars, aes(x = wt, y = mpg))

#### Run your graph
graphLine1 + 
  theme_classic() +                   
  geom_point(colour = "blue")+
  geom_smooth(method = "lm", se = FALSE)

Notice that the geom_smooth() function adds a best-fit line using a linear model (method = "lm").
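
The graph shows the fitted line but not its equation. To get the intercept and slope, you can fit the same model yourself with lm(). This is a minimal sketch using base R, with fit as an illustrative name.

```r
# Fit the same straight line that geom_smooth(method = "lm") draws
fit <- lm(mpg ~ wt, data = datasets::mtcars)

coef(fit)               # intercept and slope of the best-fit line
summary(fit)$r.squared  # proportion of variance in mpg explained by wt
```

A negative slope for wt means that heavier cars tend to get fewer miles per gallon, which is what the downward line in the plot shows.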

Line graph with a categorical factor

# Make the number of cylinders a factor
mtcars$cyl <- as.factor(mtcars$cyl)

### Load the graphing data for the x and y axis
graphLine2 <- ggplot(data = mtcars, aes(x = wt, y = mpg, colour = cyl))

#### Run your graph
graphLine2 + 
  theme_classic() +                   
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE)
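
Because colour = cyl is set in aes(), geom_smooth() fits a separate line for each cylinder group. To see the numbers behind those lines, here is a sketch that fits one linear model per group with base R. The names carData and coefByCyl are illustrative.

```r
carData <- datasets::mtcars

# One linear model per cylinder group; each column holds that group's
# intercept and slope
coefByCyl <- sapply(split(carData, carData$cyl),
                    function(d) coef(lm(mpg ~ wt, data = d)))
coefByCyl
```

Comparing the wt row across columns shows how the weight effect differs between 4-, 6-, and 8-cylinder cars.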

Line graph with a categorical factor as facet

# Make the number of cylinders a factor (harmless if already done above)
mtcars$cyl <- as.factor(mtcars$cyl)

### Load the graphing data for the x and y axis
graphLine3 <- ggplot(data = mtcars, aes(x = wt, y = mpg, colour = cyl))

#### Run your graph
graphLine3 + 
  theme_classic() +                   
  geom_point(size = 3) +
  facet_grid(. ~ cyl) +
  geom_smooth(method = "lm", se = FALSE)
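
By default, every facet shares the same axis ranges, which can leave a lot of empty space when the groups cover different weights. One variation, sketched below, is facet_wrap() with scales = "free_x" so that each panel gets its own x-axis range. The names carData and graphFacetFree are illustrative, and whether free scales help or mislead depends on the comparison you want readers to make.

```r
library(ggplot2)

carData <- datasets::mtcars
carData$cyl <- as.factor(carData$cyl)  # number of cylinders as a factor

graphFacetFree <- ggplot(data = carData,
                         aes(x = wt, y = mpg, colour = cyl)) +
  theme_classic() +
  geom_point(size = 3) +
  facet_wrap(~ cyl, scales = "free_x") +  # each panel gets its own x-axis range
  geom_smooth(method = "lm", se = FALSE)

graphFacetFree
```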
