Data are the basis for all scientific insights, but a table of data is no more science than a pile of bricks is a house. In order to gain knowledge, we have to work with our data, teasing meaning from the mass of often noisy information. Scientific evidence consists of patterns in our data, and to discover those patterns we need to develop tools for summarizing and visualizing data, so that we can confront our expectations and predictions with reality.
In this document, we’ll review many of the basic summary and visualization tools that we can use to explore data in R. At this point, R has three common packages for visualization: graphics is the “base” package, lattice is a somewhat more flexible alternative, and ggplot2 is a newer implementation based on the “grammar of graphics.” All of them have their pros and cons, but we will focus primarily on the lattice package. We will also need some of the summary tools from the mosaic package and data from the datasets package.
require(mosaic) # be sure to install and load the usual packages if you have not already
require(datasets)
require(lattice)
For each of the examples below, we will start by specifying a data set using the data() function. To learn more specifics about each data set, just use the help(<data set name>) function.
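For example, to load the CO2 data set used below and read its documentation:

data(CO2) # load the data set into the workspace
help(CO2) # open the help page describing its variables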
The most appropriate summary and visualization methods depend on the types of variables under consideration. There are fundamentally two different types of data:
Quantitative variables are naturally represented by a number (e.g., diameter, temperature, number of eggs in a clutch). These may be either continuous (on a decimal scale) or discrete (on an integer scale), but the numbers provide genuine, quantitative information, not just an arbitrary label.
Categorical variables are naturally described by simple categories, e.g., red vs. white flowers, or larva, pupa, and adult butterflies. The value is always selected from a fixed set of possibilities or “levels.” Ordinal data are categorical data that have a natural order to them, e.g., cold, cool, warm, hot.
With quantitative data, we can use numerical summaries like the mean, standard deviation, or quantiles, while with categorical data, we generally think about counts or proportions of different values.
Things can get tricky because sometimes we code categorical data using numbers, even though the numbers convey no quantitative information. For example, one of the data sets we will examine has a variable called female coded as 0 for males and 1 for females. It is important to note that whenever R encounters a number, it will treat it as a quantitative variable by default. I mention this only to emphasize the importance of knowing your data, so you know how to treat it.
If we want to know how each of the variables in a data frame is coded, we can combine the sapply() and class() functions.
data(CO2)
sapply(CO2, class)
## $Plant
## [1] "ordered" "factor"
##
## $Type
## [1] "factor"
##
## $Treatment
## [1] "factor"
##
## $conc
## [1] "numeric"
##
## $uptake
## [1] "numeric"
Categorical variables should be of class factor or ordered, while quantitative variables should be of class numeric (decimals) or integer (whole numbers). However, if we have categorical variables coded as numeric, we can make R behave by using functions like as.factor() or as.ordered(), which I’ll describe below.
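As a quick preview, here is a minimal sketch using the HELPrct data introduced below; it checks the coding without altering the original data frame:

data(HELPrct)
class(HELPrct$female) # integer (or numeric): R treats the 0/1 coding as quantitative by default
class(as.factor(HELPrct$female)) # "factor": now R knows it is categorical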
Univariate simply means we are interested in summarizing or displaying information on a single variable. Below we will consider bivariate situations in which we are interested in the relationship between two variables. We can go even a bit further and think (just a little…) about multivariate relationships among three (or more) variables.
In general, we are interested in the distribution of values across individuals in the sample, as the sample is in some sense representative of an underlying population.
When summarizing a quantitative variable, we are generally interested in knowing about at least two aspects of the distribution: the central tendency of the data, e.g., the mean or median, and the variability of the data, e.g., the range, the standard deviation (the average deviation from the mean), or the inter-quartile range (IQR, the range between the 25th and 75th percentiles). R has functions to compute each of these individually, but here are a couple of very useful functions that give us a lot of information in a single command.
data(HELPrct) # HELPrct = Health Evaluation and Linkage to Primary Care study
fivenum(mcs, data=HELPrct) # mcs = "mental component score"; lower is worse
## [1] 6.762923 21.675755 28.602417 40.941338 62.175503
favstats(mcs, data=HELPrct) # favstats depends on mosaic
## min Q1 median Q3 max mean sd n missing
## 6.762923 21.67575 28.60242 40.94134 62.1755 31.67668 12.83934 453 0
As you can see, fivenum() returns the minimum, Q1 (first quartile, 25th percentile), median, Q3 (third quartile, 75th percentile), and maximum. The function favstats() from mosaic adds the mean, standard deviation, sample size, and number of missing values: pretty much a complete basic summary.
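If we only need one of these quantities, base R also provides each as its own function, for example:

mean(HELPrct$mcs) # central tendency: mean
median(HELPrct$mcs) # central tendency: median
sd(HELPrct$mcs) # variability: standard deviation
IQR(HELPrct$mcs) # variability: inter-quartile range
range(HELPrct$mcs) # minimum and maximum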
To visualize a single quantitative variable, we can use both histograms and boxplots, as well as some additional options in particular situations. For some examples, we will stick with the HELPrct data.
data(HELPrct) # HELPrct = Health Evaluation and Linkage to Primary Care study
bwplot(~mcs, data=HELPrct, xlab="Mental Component Score") # bwplot is for box-and-whiskers plot
Note the leading ~ before mcs. Functions from the lattice package generally expect formula notation, i.e., y ~ x, where y designates what is on the y-axis and x designates the x-axis. Since we just want mcs on the x-axis, we have to put the ~ in front of it.
The boxplot is basically a graph of the five-number summary. The dot is the median, the box spans the IQR, and the whiskers extend out toward the minimum and maximum (by default, points more than 1.5 times the IQR beyond the box would be drawn individually as outliers). From this figure, we can also see that the distribution has a slight positive skew, a pattern that is even more evident in the histogram.
histogram(~mcs, data=HELPrct, xlab="Mental Component Score")
Histograms always have continuous intervals of the same width. They are not the same as a bar graph, which should only be used on categorical data (see below). We can change the number of intervals with nint=.
histogram(~mcs, data=HELPrct, xlab="Mental Component Score", nint=15)
Both histograms and boxplots run into problems when sample sizes are small. For example, the sleep data set records the extra sleep gained by ten subjects on each of two drugs. If we want to just look at one of the drugs, we can subset the data, but the histogram is pretty uninformative.
data(sleep)
sleep2 = subset(sleep, group==2, drop=TRUE) # "group" is a terrible name for a variable describing the drug given.
histogram(~extra, data=sleep2, xlab="Extra sleep on drug 2 (h)") # "extra" is extra sleep under the influence of the drug
The problem is that we only have 10 data points arrayed into four intervals, so it is hard to see what is going on. Similarly, with the boxplot, we run into problems because we are using only 10 values to estimate the five necessary numbers.
bwplot(~extra, data=sleep2, xlab="Extra sleep on drug 2 (h)")
The boxplot may look okay, but would you trust the College Board to estimate the percentile rank of your SAT score if only nine other students took the test?
In the case of small sample sizes, say less than about 12 to 15, we have few enough points that we can just put them all on the graph using what is called a strip plot or dot plot.
stripplot(~extra, data=sleep2, xlab="Extra sleep on drug 2 (h)")
This plot may not be very exciting, but it gives us a clearer picture of the distribution than the histogram, and even the boxplot. Strip plots will come in handy when we look at bivariate relationships as well.
Categorical variables require a different approach. If we have a sample of apples that are either Honeycrisp, Fuji, or Macintosh, we can’t calculate a mean variety for the apples. But we can still look at the distribution of varieties in the sample. For example, the HELPrct data are based on a study of a drug rehabilitation program, and the variable substance notes each subject’s drug of addiction, whether cocaine, alcohol, or heroin. We can use tally() (from mosaic) to get counts of the number of subjects using each drug.
data(HELPrct)
tally(~substance, data=HELPrct)
##
## alcohol cocaine heroin
## 177 152 124
We can use format= to get proportions or percents as well, and remember that we can always assign the output to a variable.
tally(~substance, data=HELPrct, format="proportion")
##
## alcohol cocaine heroin
## 0.3907285 0.3355408 0.2737307
substance.percs = tally(~substance, data=HELPrct, format="percent")
substance.percs
##
## alcohol cocaine heroin
## 39.07285 33.55408 27.37307
A table like substance.percs can easily be turned into a bar plot to visualize the distribution.
barchart(substance.percs, horizontal=FALSE, xlab="Drug of addiction", ylab="Percentage of subjects")
Note that here we are not using a formula and a data= argument. The barchart() function expects numeric values, so we have to give it our output from tally() for a categorical variable. In fact, we could put the tally() command right into the plotting command.
barchart(tally(~substance, data=HELPrct, format="proportion"), horizontal=FALSE, xlab="Drug of addiction", ylab="Proportion of subjects")
The mosaic package has a nifty function called bargraph() that tallies data and makes a bar plot for us, but it will only tally counts (no proportions or percents).
bargraph(~substance, data=HELPrct, xlab="Drug of addiction")
Again, remember that despite their visual similarities, bar plots and histograms are different! Histogram bars are on equal intervals for a continuous quantitative variable. Bar plots have one bar for each level of a categorical factor.
Univariate distributions are important descriptions, but in science we generally want to know about relationships between variables, so that we can test whether we understand the processes that link them. In fact, for almost any (relatively simple) experiment you might want to perform, you ought to be able to envision what the graph would look like if your underlying hypothesis is true, even before you go out and collect data. And what type of graph you might envision depends in part on the types of variables under consideration.
For most (but not all) bivariate relationships in science, we can separate out independent vs. dependent variables. Independent variables are what we use as the basis of our scientific prediction; they are also called predictor variables in statistics. The independent variable generally goes on the horizontal axis (or x-axis or abscissa). The dependent variable, or response variable, on the other hand, typically is placed on the vertical axis (or y-axis, or ordinate). Often we interpret the relationship between the two variables as causative, but these sorts of scientific claims must be viewed with caution and skepticism.
There are many things to consider in the relationship between two quantitative variables, including the strength of the relationship, the direction (positive or negative) of the relationship, and the magnitude of the relationship (how much does y change with x?). Additional considerations include whether the relationship is linear or non-linear, and whether it is monotonic (always in the same direction).
Obviously, there is not one numerical summary that is going to capture all of these aspects of a relationship, but a great place to start is with the correlation coefficient (r) which measures the strength and direction of the linear relationship between two variables. The value of r ranges from -1 (all of the data exactly on a straight line with a negative slope) to 1 (all on a line with a positive slope). As the points deviate from a straight line, r goes towards zero, which indicates no linear relationship between the variables.
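To see the endpoints of this scale concretely, here is a minimal sketch with made-up numbers (any perfectly linear data would do):

x = 1:10
cor(x, 2*x + 3) # exactly 1: all points fall on a line with positive slope
cor(x, -x) # exactly -1: all points fall on a line with negative slope
cor(x, 2*x + rnorm(10, sd=5)) # somewhere in between: scatter pulls r toward zero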
As an example, let’s consider some variables from HELPrct. We’ve already looked at the “mental component score” (mcs) above. Let’s see how well it is correlated with the “physical component score” (pcs). In both cases, lower scores correspond to worse condition.
cor(pcs~mcs, data=HELPrct)
## [1] 0.1104571
Generally, a correlation coefficient less than about 0.2 (positive or negative) is not considered too remarkable. Since r=0.11 in this case, it appears that the mental and physical component scores for the subjects do not share a strong relationship. They are not well correlated with one another.
Alternatively, consider the “Center for Epidemiologic Studies Depression measure” (cesd), which was measured on the same subjects. If someone can be described as being in poor mental condition, they might also be prone to depression.
cor(cesd~mcs, data=HELPrct)
## [1] -0.6819239
In this case, with r=-0.68, we observe a negative correlation between the two variables. This makes sense: subjects with “lower” mental well-being tend to have higher “depression” scores.
Visualizing quantitative relationships can help us to get a feel for what the correlation coefficient means. Typically, we use a scatterplot to look at the relationship between two quantitative variables. Let’s look at the two relationships we summarized above. First, the very weak relationship between physical and mental component scores.
xyplot(pcs~mcs, data=HELPrct, xlab="Mental component score", ylab="Physical component score")
Here, the rather indistinct “blast” of points corresponds to the small value of the correlation coefficient. These points are nowhere near any line we could draw through the data.
What about the depression scores?
xyplot(cesd~mcs, data=HELPrct, xlab="Mental component score", ylab="Depression score")
Although there is still substantial scatter, the negative relationship is much clearer here, and it even seems to be pretty linear. If we want to model the relationship using a linear regression, we can add a line to the plot using the type=c("p","r") argument in xyplot(). The “p” means “plot the points” and the “r” means “plot the regression, too”.
xyplot(cesd~mcs, data=HELPrct, type=c("p","r"), xlab="Mental component score", ylab="Depression score")
Sometimes the relationship between two variables exists but is nonlinear. For example, notice the saturating relationship between CO2 concentration and CO2 uptake by some grass plants from Quebec.
data(CO2)
CO2Q = subset(CO2, Type=="Quebec" & Treatment=="nonchilled")
xyplot(uptake~conc, data=CO2Q, type=c("p","r"), xlab="CO2 concentration (mL/L)", ylab=expression(paste(CO[2], " uptake (", mu,"mol ", m^-2, " ", s^-1,")")))
While the relationship looks positive, the regression line is a pretty poor model of the relationship; the points start out well below the line, then go above the line, then below it again as CO2 uptake saturates at high CO2 concentrations.
We can highlight the nonlinear relationship using a “scatterplot smoother,” which basically smooths out a series of relationships over narrow ranges of the data. All we have to do is use type=c("p","smooth").
xyplot(uptake~conc, data=CO2Q, type=c("p","smooth"), xlab="CO2 concentration (mL/L)", ylab=expression(paste(CO[2], " uptake (", mu,"mol ", m^-2, " ", s^-1,")")))
Here it becomes much easier to visualize the nonlinear relationship between CO2 uptake and CO2 concentration.
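If the smoother looks too wiggly or too stiff, its window width can be adjusted. As far as I know, extra arguments such as span= are passed through to the underlying loess smoother, so something like the following sketch should give a smoother curve (smaller values give a wigglier one); the span value here is just an illustration:

xyplot(uptake~conc, data=CO2Q, type=c("p","smooth"), span=1, xlab="CO2 concentration (mL/L)", ylab="CO2 uptake") # larger span = smoother curve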
We can also examine relationships between two categorical variables. For example, our apples may be either bruised or unbruised, and whether or not an apple is bruised could depend on the apple variety. We can summarize this relationship using a contingency table, which tallies the occurrences of each possible combination of categorical outcomes for the two variables.
For example, in the HELPrct data, let’s consider the relationship between the sex of the subject (coded in the variable female as “0” for male and “1” for female) and their drug of addiction (substance). We can use the tally() function from mosaic.
tally(substance~female, data=HELPrct)
## female
## substance 0 1
## alcohol 141 36
## cocaine 111 41
## heroin 94 30
One pattern that immediately leaps out is that there are many more men than women in the study, which makes it somewhat difficult to see patterns. However, we can also see that among the men, alcohol addiction is the most prevalent problem, while the women seem to be pretty evenly distributed among the three addictions. So the distribution of drug problems may be contingent upon the sex of the subjects under consideration.
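Given the unequal numbers of men and women, proportions may be easier to compare than raw counts. If I recall mosaic’s conventions correctly, tally() conditions on the right-hand side of the formula, so the following should give the distribution of substances within each sex:

tally(substance~female, data=HELPrct, format="proportion") # proportions within each column (sex)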
We can visualize the relationship between sex and addiction using bar plots like we did above, but by adding information on the sex of the subjects. We have some choices about how to display the information by varying the color and/or the placement of the bars, or by making separate graphs for the two sexes. To make separate colored bars for the two sexes, we can use groups=female, and adding auto.key=TRUE provides a legend.
bargraph(~substance, groups=female, data=HELPrct, xlab="Drug of addiction", auto.key=TRUE)
This graph illustrates the patterns in the data quite well, but the coding of the female variable makes it hard to know what the “0” and “1” refer to. By messing around with auto.key, we can make something a little clearer. The auto.key= argument can take a list() of attributes, the most important of which are the title and the text. For the latter we can use the c() (concatenate) function to give the text strings in order, remembering that “0” is male and “1” is female. We can also add the attribute corner to the list to place the key in one of the corners of the plot, rather than outside it.
bargraph(~substance, groups=female, data=HELPrct, xlab="Drug of addiction", auto.key=list(title="Sex", text=c("male","female"), corner=c(1,1)))
Alternatively, we could make one more slight change and stack the bars on top of each other for each substance, by using stack=TRUE.
bargraph(~substance, groups=female, data=HELPrct, xlab="Drug of addiction", stack=TRUE, auto.key=list(title="Sex", text=c("male","female"), corner=c(1,1)))
A final option is to give each of the sexes a separate panel in the graph. In this case, rather than using groups (which codes categories by color), we can use the | (pipe) operator in the formula statement, which creates separate panels in all lattice plotting routines.
bargraph(~substance | female, data=HELPrct, xlab="Drug of addiction", stack=TRUE, strip=strip.custom(factor.levels=c("male","female")))
The code strip=strip.custom(factor.levels=c("male","female")) changes the text on the identifying strips at the top of each panel, much as we did with the key in the previous example.
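An alternative to fixing the strip labels in every plot is to build a clearly labeled factor once and plot with that instead; the sex column below is my own addition, not part of the original data:

HELPrct$sex = factor(HELPrct$female, levels=c(0,1), labels=c("male","female")) # recode 0/1 as a labeled factor
bargraph(~substance | sex, data=HELPrct, xlab="Drug of addiction") # strips are now labeled automatically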
All of the plots above communicate the same information, but as creators of scientific visualizations, we have to choose the one that we feel communicates most effectively.
Frequently, we are interested in the relationship between a categorical independent variable, e.g., fertilizer type, and a quantitative dependent variable, e.g., plant growth rate. Here we can adapt many of the tools that we developed above. If the sample size within each categorical group is sufficiently large (more than about 12 to 15), we can use side-by-side boxplots. For example, let’s look at the physical component scores for subjects addicted to different drugs.
bwplot(pcs~substance, data=HELPrct, xlab="Drug of addiction", ylab="Physical component score")
Here we can see that the median physical component scores are lower for alcohol and heroin, compared to cocaine, but also that the distributions are broadly overlapping (remember the box includes the middle 50% of the distribution).
If we want more details on the shape of each distribution we can construct histograms of the component scores with each drug in its own panel.
histogram(~pcs|substance, data=HELPrct, xlab="Physical component score")
We can change the arrangement of the panels using the layout=c(<ncols>,<nrows>) argument, where the first number sets the number of columns and the second the number of rows.
histogram(~pcs|substance, data=HELPrct, xlab="Physical component score", layout=c(1,3))
In some cases, we may want to just see all of the individual observations for each category, usually when sample sizes are relatively small. In this case, we can use either stripplot() or xyplot(). Let’s go back to the CO2 data set and compare Quebec plants to Mississippi plants.
data(CO2)
stripplot(uptake~Type, data=CO2, xlab="Plant type", ylab=expression(paste(CO[2], " uptake (", mu,"mol ", m^-2, " ", s^-1,")")))
Here there are enough data points that they overlap a lot. If we want to distinguish the different uptake values a little more clearly, we can jitter the data with jitter.data=TRUE (note that stripplot() uses jitter.data, not jitter). By adding type=c("p","a"), we can also draw a line connecting the mean values of the two groups (“a” stands for average).
stripplot(uptake~Type, data=CO2, jitter.data=TRUE, type=c("p","a"), xlab="Plant type", ylab=expression(paste(CO[2], " uptake (", mu,"mol ", m^-2, " ", s^-1,")")))
As mentioned above, xyplot() can give really similar results, but here we have to use jitter.x rather than jitter.data.
xyplot(uptake~Type, data=CO2, jitter.x=TRUE, type=c("p","a"), xlab="Plant type", ylab=expression(paste(CO[2], " uptake (", mu,"mol ", m^-2, " ", s^-1,")")))
Note that the jittering is random, so it looks a bit different each time.
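Because jittering uses R’s random number generator, we can make a jittered plot reproducible by fixing the seed first (a simplified ylab is used here for brevity):

set.seed(42) # any fixed seed gives the same jitter on every run
xyplot(uptake~Type, data=CO2, jitter.x=TRUE, type=c("p","a"), xlab="Plant type", ylab="CO2 uptake")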
We often are interested in more than one predictor variable, and we want to understand how multiple influences interact to affect a particular response variable. Graphically, we can add additional dimensions to a two-dimensional graph by using color to designate an additional categorical variable, or by including multiple panels. For example, in the relationship between depression and mental status above, we could compare females and males by using groups=female.
xyplot(cesd~mcs, groups=female, data=HELPrct, type=c("p","r"), xlab="Mental component score", ylab="Depression score", auto.key=list(title="Sex", text=c("male","female"), corner=c(1,1)))
We could even separate out these relationships by substance if we want, using |substance in the formula statement.
xyplot(cesd~mcs|substance, groups=female, data=HELPrct, type=c("p","r"), layout=c(1,3), xlab="Mental component score", ylab="Depression score", auto.key=list(title="Sex", text=c("male","female"), corner=c(0,0)))
Depending on the research question, it might be better to swap substance into the colors and use sex to make two panels. Note here that we need to explicitly treat the female variable as a factor using as.factor().
xyplot(cesd~mcs|as.factor(female), groups=substance, data=HELPrct, type=c("p","r"), xlab="Mental component score", ylab="Depression score", strip=strip.custom(factor.levels=c("male","female")), auto.key=list(title="Drug", corner=c(1,0.95)))
We can use color to add dimensions to strip plots as well.
data(CO2)
stripplot(uptake~Type, groups=Treatment, data=CO2, jitter.data=TRUE, type=c("p","a"), xlab="Plant type", ylab=expression(paste(CO[2], " uptake (", mu,"mol ", m^-2, " ", s^-1,")")), auto.key=list(title="Treatment", corner=c(1,1)))
Here we can see that chilling has less of an effect on CO2 uptake in the Quebec plants than in the Mississippi plants, as we would expect if the Quebec plants were adapted to cold conditions.
One of the great features of the lattice graphing functions is the ability to “layer” different components onto the same graph using the panel=function(x,y,...){} argument, where a number of different components can all be added within the curly braces {}. There are a tremendous number of features to explore, but just as one example, let’s add means and 95% confidence intervals to our graph of physical component scores for subjects addicted to different drugs. Also note that we can change colors using the col= argument for different components of the graph.
To make this easier, we need a function (group.CI()) from the Rmisc package.
require(Rmisc) #install if you do not have it
## Loading required package: Rmisc
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
##
## The following object is masked from 'package:mosaic':
##
## count
##
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
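Before we use group.CI() inside a panel function, it helps to look at what it returns. As far as I can tell, it gives a data frame with the grouping variable followed by the upper bound, mean, and lower bound of a 95% confidence interval for each group, which is why the code below indexes columns 2, 3, and 4 of ci:

ci = group.CI(pcs~substance, data=HELPrct)
ci # one row per drug; columns: substance, pcs.upper, pcs.mean, pcs.lower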
xyplot(pcs~substance, data=HELPrct, xlab="Drug of addiction", ylab="Physical component score", panel=function(x,y,...){
  panel.xyplot(x, y, jitter.x=TRUE, col="light green") # draws the raw points in light green
  ci = group.CI(pcs~substance, data=HELPrct) # calculates confidence intervals for each drug
  panel.points(x=1:nrow(ci), y=ci[,3], col="dark green", pch=19, cex=1.5) # draws the means; pch selects the point type (filled circle), cex sets the point size (>1 is larger than default)
  panel.arrows(x0=1:nrow(ci), x1=1:nrow(ci), y0=ci[,2], y1=ci[,4], angle=90, ends="both", length=0.1, col="dark green", lwd=1) # draws the CI error bars
}
)
If we just want the mean estimates and the confidence intervals, without the raw data points, we can simply leave out the command that draws them. This is a common way of presenting results, but to me it is much less informative, as we learn very little about the variability of the data.
xyplot(pcs~substance, data=HELPrct, xlab="Drug of addiction", ylab="Physical component score", panel=function(x,y,...){
  ci = group.CI(pcs~substance, data=HELPrct) # calculates confidence intervals for each drug
  panel.points(x=1:nrow(ci), y=ci[,3], col="dark green", pch=19, cex=1.5) # draws the means
  panel.arrows(x0=1:nrow(ci), x1=1:nrow(ci), y0=ci[,2], y1=ci[,4], angle=90, ends="both", length=0.1, col="dark green", lwd=1) # draws the CI error bars
}
)
Despite the length of this document, we have only scratched the surface of the data visualization capabilities of R. More important than sophisticated aesthetic features is the ability to select the right graph to highlight relevant patterns of variation and communicate the “scientific story” that we think the data are telling us.