The goal of this document is to increase your R skills in a meaningful way. This tutorial will give you two very valuable tools for summarizing and plotting your data. ddply() is a function from the plyr package that lets you group and summarize your data easily. ggplot2 will improve your data display and wow your friends and employers!
Let's begin with the basics…
We will use the built-in CO2 dataset in R as our example data. Make note of the column headers. We will use these names to make our graphs.
library(plyr)
library(ggplot2)
# Carbon Dioxide Uptake in Grass Plants ####
# Type: Dataframe
# Continuous Variables: 2
# Qualitative Variables: 3
# The CO2 data frame has 84 rows and 5 columns of data from an experiment on the
# cold tolerance of the grass species Echinochloa crus-galli. The CO2 uptake of
# six plants from Quebec and six plants from Mississippi was measured at
# several levels of ambient CO2 concentration. Half the plants of each type
# were chilled overnight before the experiment was conducted.
data("CO2") # Load the built-in CO2 data
head(CO2) # First six rows of data
## Plant Type Treatment conc uptake
## 1 Qn1 Quebec nonchilled 95 16.0
## 2 Qn1 Quebec nonchilled 175 30.4
## 3 Qn1 Quebec nonchilled 250 34.8
## 4 Qn1 Quebec nonchilled 350 37.2
## 5 Qn1 Quebec nonchilled 500 35.3
## 6 Qn1 Quebec nonchilled 675 39.2
When making graphs, R often requires data of a particular type. Common data types include numeric, integer, logical, and factor. Use the str() function to see what R ‘thinks’ the type of each column is.
str(CO2)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 84 obs. of 5 variables:
## $ Plant : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
## $ Type : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
## $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
## $ conc : num 95 175 250 350 500 675 1000 95 175 250 ...
## $ uptake : num 16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
## - attr(*, "formula")=Class 'formula' language uptake ~ conc | Plant
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "outer")=Class 'formula' language ~Treatment * Type
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "labels")=List of 2
## ..$ x: chr "Ambient carbon dioxide concentration"
## ..$ y: chr "CO2 uptake rate"
## - attr(*, "units")=List of 2
## ..$ x: chr "(uL/L)"
## ..$ y: chr "(umol/m^2 s)"
Notice that Type and Treatment are factors, while conc and uptake are numeric. Problem: conc is stored as a number, but according to the experimental design it should be a factor with seven fixed levels. We can change the data type easily.
CO2$conc <- as.factor(CO2$conc)
str(CO2)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 84 obs. of 5 variables:
## $ Plant : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
## $ Type : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
## $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
## $ conc : Factor w/ 7 levels "95","175","250",..: 1 2 3 4 5 6 7 1 2 3 ...
## $ uptake : num 16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
## - attr(*, "formula")=Class 'formula' language uptake ~ conc | Plant
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "outer")=Class 'formula' language ~Treatment * Type
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "labels")=List of 2
## ..$ x: chr "Ambient carbon dioxide concentration"
## ..$ y: chr "CO2 uptake rate"
## - attr(*, "units")=List of 2
## ..$ x: chr "(uL/L)"
## ..$ y: chr "(umol/m^2 s)"
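One thing worth knowing once conc is a factor: calling as.numeric() on it returns the internal level codes, not the original concentrations. The sketch below works on a copy named co2 (a name chosen here only for illustration) and shows how to get the real values back by converting through character first.

```r
# Work on a copy so the CO2 object used in the tutorial is left untouched
co2 <- datasets::CO2
co2$conc <- as.factor(co2$conc)

# as.numeric() on a factor returns the level codes (1, 2, 3, ...)
codes <- as.numeric(co2$conc)

# Converting through character recovers the original concentrations
values <- as.numeric(as.character(co2$conc))

head(codes, 7)
head(values, 7)
```

If you later need conc on a continuous axis again, the second form is the one to use.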
ddply() is an excellent tool. I use this extensively when preparing ggplot() graphs. Below I will give you a simple example of how ddply works. Then, I will give you more useful code to prep our ggplot() visualization.
Note: Copy and paste this code. You do not have to type this out every time.
Say we want to take the sum of the CO2 uptake data for each level of the Treatment variable. We can accomplish this task easily with the following code.
Notice the order: ddply(dataset, c("column"), summarise, newColumnName = function). plyr accepts both spellings, summarise and summarize.
example1 <- ddply(CO2, c("Treatment"), summarise,
                  sumUptake = sum(uptake))
head(example1) # summary statistics used for ggplot()
## Treatment sumUptake
## 1 nonchilled 1287.0
## 2 chilled 998.9
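If you want to convince yourself that ddply() is doing what you expect, the same totals can be cross-checked with base R. This is only a sanity check, not a replacement for the ddply() workflow; the object names co2 and totals are chosen here for illustration.

```r
# Base R cross-check of the per-treatment totals (no plyr required)
co2 <- datasets::CO2

# Split uptake by Treatment and sum each group
totals <- tapply(co2$uptake, co2$Treatment, sum)
totals
```

The two numbers should match the sumUptake column in the ddply() output above.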
Here is the code that gives you the number of samples (i.e. replication), mean values, standard deviations, and standard errors. You can summarize by any number of categorical variables using this code.
So, let's create an object with the descriptive statistics for each combination of the Treatment, Type, and conc categorical variables.
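example2 <- ddply(CO2, c("Treatment", "Type", "conc"), summarise,
                  dataRep  = sum(!is.na(uptake)),        # number of non-missing samples
                  dataMean = mean(uptake, na.rm = TRUE), # group mean
                  dataSD   = sd(uptake, na.rm = TRUE),   # group standard deviation
                  dataSE   = dataSD / sqrt(dataRep))     # standard error of the mean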
head(example2)
## Treatment Type conc dataRep dataMean dataSD dataSE
## 1 nonchilled Quebec 95 3 15.26667 1.446836 0.8353309
## 2 nonchilled Quebec 175 3 30.03333 2.569695 1.4836142
## 3 nonchilled Quebec 250 3 37.40000 2.762245 1.5947832
## 4 nonchilled Quebec 350 3 40.36667 2.746513 1.5857000
## 5 nonchilled Quebec 500 3 39.60000 3.897435 2.2501852
## 6 nonchilled Quebec 675 3 41.50000 2.351595 1.3576941
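To see where one row of this table comes from, here is a small worked check in base R for the first group (nonchilled Quebec plants at a concentration of 95). The helper std_error() is a hypothetical name defined here for illustration; it is not part of plyr or base R.

```r
# Hypothetical helper: standard error of the mean
std_error <- function(x) {
  x <- x[!is.na(x)]        # drop missing values, like na.rm = TRUE
  sd(x) / sqrt(length(x))  # SD divided by the square root of n
}

# The three nonchilled Quebec plants measured at a concentration of 95
co2 <- datasets::CO2
grp <- co2$uptake[co2$Type == "Quebec" &
                  co2$Treatment == "nonchilled" &
                  co2$conc == 95]

c(n = length(grp), mean = mean(grp), sd = sd(grp), se = std_error(grp))
```

The result should agree with the dataRep, dataMean, dataSD, and dataSE values in the first row of example2.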
You did it! Great job! You can easily change which categorical variables you summarize by. Let's use ddply() to make some sample plots with ggplot().
Now that we have learned how ddply() works, I am going to show you how to code common graphs in R. When working with your own data, you can copy and paste this code to make your graphs. Be sure to change the code to match your own column names.
Boxplots require a continuous dependent variable and at least one categorical variable.
Here are a couple of examples.
Notice that in the first object, you need to load your data. Be sure to map columns that actually exist in your data frame. In this case, we are using the CO2 dataset with Treatment on the x-axis and uptake on the y-axis.
In the second part, we are adding style layers to our data object! To make a boxplot, we use the geom_boxplot() function. Note that each additional layer requires a + sign.
### Load the graphing data for the x and y axis
graphBox1 <- ggplot(data = CO2, aes(x = Treatment, y = uptake))
#### Run your graph
graphBox1 +
  theme_classic() +
  geom_boxplot()
Boxplot with a single factor.
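Layers are not limited to geoms and themes. As a small sketch, the version below adds readable axis titles with labs(); the y-axis text is taken from the "labels" and "units" attributes that str(CO2) printed earlier. The object name graphBoxLabelled is chosen here for illustration.

```r
library(ggplot2)

# Same boxplot as graphBox1, with axis titles added as one more layer
graphBoxLabelled <- ggplot(data = CO2, aes(x = Treatment, y = uptake)) +
  theme_classic() +
  geom_boxplot() +
  labs(x = "Treatment", y = "CO2 uptake rate (umol/m^2 s)")

graphBoxLabelled
```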
To make a boxplot with two categorical factors, the code changes slightly. Notice the use of the fill argument inside aes() when we load our data. This is the categorical factor that we want in our legend.
## GGPLOT Box Plots - Two Independent Variables #####
### Load the graphing data for the x and y axis
graphBox2 <- ggplot(data = CO2, aes(x = Treatment, y = uptake, fill = Type))
#### Run your graph
graphBox2 +
  theme_classic() +
  geom_boxplot()
It is often very useful to visualize the density together with the raw data. This can give us a sense of the normality and spread of the data. Notice that we also define a dodge position. It shifts the points sideways so that they line up with their violins.
### Load the graphing data for the x and y axis
graphBox2 <- ggplot(data = CO2, aes(x = Treatment, y = uptake, fill = Type))
dodge <- position_dodge(width = 0.9)
#### Run your graph
graphBox2 +
  theme_classic() +
  geom_violin(trim = FALSE) +
  geom_point(position = dodge)
Here is some bar graph code. I will give examples with one and with two categorical variables using the CO2 dataset. In each case I will run ddply() first and then our ggplot() code.
singleVar <- ddply(CO2, c("Treatment"), summarise,
                   dataRep  = sum(!is.na(uptake)),
                   dataMean = mean(uptake, na.rm = TRUE),
                   dataSD   = sd(uptake, na.rm = TRUE),
                   dataSE   = dataSD / sqrt(dataRep))
head(singleVar)
## Treatment dataRep dataMean dataSD dataSE
## 1 nonchilled 42 30.64286 9.704994 1.497513
## 2 chilled 42 23.78333 10.884312 1.679486
### Load the graphing data for the x and y axis
graphBar1 <- ggplot(data = singleVar, aes(x = Treatment, y = dataMean))
#### Set error bar limits using SD or SE around the mean
limits <- aes(ymax = dataMean + dataSD, ymin = dataMean - dataSD)
#### Run your graph
graphBar1 +
  theme_classic() +
  geom_bar(stat = "identity", fill = "royalblue1", colour = "royalblue4") +
  geom_errorbar(limits, width = 0.2) # Error bars: mean plus or minus one SD
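The comment above mentions SD or SE, but the code only shows SD. Here is a sketch of the SE version, rebuilt from scratch so it runs on its own. The names seSummary, limitsSE, and graphBarSE are chosen here for illustration.

```r
library(plyr)
library(ggplot2)

# Same kind of summary as singleVar, keeping only what the SE bars need
seSummary <- ddply(CO2, c("Treatment"), summarise,
                   dataRep  = sum(!is.na(uptake)),
                   dataMean = mean(uptake, na.rm = TRUE),
                   dataSE   = sd(uptake, na.rm = TRUE) / sqrt(dataRep))

# Error bars span one standard error either side of the mean
limitsSE <- aes(ymax = dataMean + dataSE, ymin = dataMean - dataSE)

graphBarSE <- ggplot(data = seSummary, aes(x = Treatment, y = dataMean)) +
  theme_classic() +
  geom_bar(stat = "identity", fill = "royalblue1", colour = "royalblue4") +
  geom_errorbar(limitsSE, width = 0.2)

graphBarSE
```

SE bars are narrower than SD bars because the SD is divided by the square root of the sample size, so always say in your figure caption which one you plotted.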
Notice I added some color to the graph. For the names of colors, go to: http://sape.inf.usi.ch/quick-reference/ggplot2/colour
twoVar <- ddply(CO2, c("Treatment", "conc"), summarise,
                dataRep  = sum(!is.na(uptake)),
                dataMean = mean(uptake, na.rm = TRUE),
                dataSD   = sd(uptake, na.rm = TRUE),
                dataSE   = dataSD / sqrt(dataRep))
head(twoVar)
## Treatment conc dataRep dataMean dataSD dataSE
## 1 nonchilled 95 6 13.28333 2.398680 0.9792571
## 2 nonchilled 175 6 25.11667 5.711888 2.3318686
## 3 nonchilled 250 6 32.46667 5.924075 2.4184936
## 4 nonchilled 350 6 35.13333 6.116099 2.4968870
## 5 nonchilled 500 6 35.10000 5.650133 2.3066570
## 6 nonchilled 675 6 36.01667 6.343317 2.5896482
### Load the graphing data for the x and y axis
graphBar2 <- ggplot(data = twoVar, aes(x = Treatment, y = dataMean, fill = conc))
#### Set error bar limits using SD or SE around the mean
limits <- aes(ymax = dataMean + dataSD, ymin = dataMean - dataSD)
dodge <- position_dodge(width = 0.9) # Dodge overlapping objects side-to-side
#### Run your graph
graphBar2 +
  theme_classic() +
  geom_bar(stat = "identity", position = dodge, colour = "black") +
  geom_errorbar(limits, width = 0.2, position = dodge) + # Error bars: mean plus or minus one SD
  scale_fill_brewer(palette = "Greens") # Sequential green fill, one shade per conc level
Notice that I now use ColorBrewer palettes to create my color scales. To see the options, install the RColorBrewer package and use this code:
library(RColorBrewer)
display.brewer.all(type="div")
display.brewer.all(type="seq")
display.brewer.all(type="qual")
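If you would rather see the actual colour codes than a chart, brewer.pal() from the same package returns them as hex strings. The sketch below asks for seven shades of the "Greens" palette, one for each conc level in the bar graph; the name greens7 is chosen here for illustration.

```r
library(RColorBrewer)

# Seven shades from the sequential "Greens" palette, lightest to darkest
greens7 <- brewer.pal(n = 7, name = "Greens")
greens7
```

These strings can be passed to scale_fill_manual(values = greens7) if you want to control the colours by hand.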
When plotting two continuous variables, you do not need to use ddply(). We will load another built-in dataset that has two continuous variables. Nothing needs to be installed.
# Motor Trend Car Road Tests ####
# Type: Dataframe
# Continuous Variables: many
# Qualitative Variables: many
# The data was extracted from the 1974 Motor Trend US magazine, and
# comprises fuel consumption and 10 aspects of automobile design and
# performance for 32 automobiles (1973-74 models)
data("mtcars")
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
To make a line graph (here, a scatterplot with a fitted line), you can use the raw data. Load your data and map the two continuous variables to the x and y axes.
### Load the graphing data for the x and y axis
graphLine1 <- ggplot(data = mtcars, aes(x = wt, y = mpg))
#### Run your graph
graphLine1 +
  theme_classic() +
  geom_point(colour = "blue") +
  geom_smooth(method = "lm", se = FALSE)
Notice that the geom_smooth() function adds a best-fit line using a linear model function, lm().
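The plot shows the line but not its numbers. A minimal sketch of fitting the same straight-line model yourself with lm(), in case you want to report the intercept and slope (the name fit is chosen here for illustration):

```r
# Fit the straight line that geom_smooth(method = "lm") draws for mpg against wt
fit <- lm(mpg ~ wt, data = datasets::mtcars)

# Intercept and slope; wt is recorded in units of 1000 lbs
coef(fit)
```

A negative slope means heavier cars get fewer miles per gallon, which matches the downward line in the graph.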
# Make the number of cylinders a factor so each cylinder count gets its own colour
mtcars$cyl <- as.factor(mtcars$cyl)
### Load the graphing data for the x and y axis
graphLine2 <- ggplot(data = mtcars, aes(x = wt, y = mpg, colour = cyl))
#### Run your graph
graphLine2 +
  theme_classic() +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE)
# Make the number of cylinders a factor (harmless to repeat if cyl is already a factor)
mtcars$cyl <- as.factor(mtcars$cyl)
### Load the graphing data for the x and y axis
graphLine3 <- ggplot(data = mtcars, aes(x = wt, y = mpg, colour = cyl))
#### Run your graph
graphLine3 +
  theme_classic() +
  geom_point(size = 3) +
  facet_grid(. ~ cyl) +
  geom_smooth(method = "lm", se = FALSE)
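Because colour = cyl groups the data, geom_smooth() fits one line per cylinder count, and facet_grid(. ~ cyl) gives each group its own panel. As a rough sketch, the per-group slopes behind those lines can be computed in base R; the names cars and slopes are chosen here for illustration.

```r
# One linear model per cylinder group, keeping only the slope for wt
cars <- datasets::mtcars
slopes <- sapply(split(cars, cars$cyl),
                 function(d) coef(lm(mpg ~ wt, data = d))[["wt"]])
slopes
```

Each value is the change in mpg per extra 1000 lbs of weight within that cylinder group.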