KickstartR: Data Management and Manipulation

Data are at the core of science. They have the power to inspire the most elegant of scientific theories, and to destroy them. As such, they must be treated with respect, as precious clues to the workings of nature. This tutorial provides some basic principles of data management and manipulation in R. As usual, we want to start by loading some useful packages.

require(mosaic) # be sure to install and load the usual packages if you have not already
require(datasets)

Data Entry and Formatting

R is a data visualization and analysis program, not a data entry platform, so we need to use a spreadsheet like Excel, Numbers, or Google Sheets to construct our data file. R can read a number of file formats, but we will always be using comma-separated values (.csv) format, which you can save from any of the above programs.

Below is a picture of a spreadsheet containing data on CO2 uptake of plants. The CO2 dataset is included in the datasets package, but here we are pretending that we have entered the data ourselves.

There are a few things to note about spreadsheet data formatting:

The data are in Data Frame format: Each row is an individual observation or “case”, and each column is a single “variable” - a measured quantity, attribute, or treatment describing the observed cases. Thus, each cell in the spreadsheet represents a consistently formatted observation of a single case of a single variable. Data Frames are a standard way of formatting data, and they are a native format for R and other statistical analysis packages. There are many other ways of storing data, but we will almost always start here. Data frames should be constructed so that the cases are the natural units of analysis, i.e., the replicate observations of the study. These represent the smallest grain at which we can resolve our data.
Column headings or variable names should be clear but concise. For example, in this file CO2 uptake in micromoles per square meter per second is simply called “uptake.” You will use the variable names in writing code for visualization and analysis, and complex variable names like “CO2 uptake (umol m-2 sec)” will be misinterpreted in R. They also clutter up your code and make it hard to read. Variable names should always be a single string with no spaces. You can include a “.” to represent a space (e.g., “CO2.uptake”) or you can write in “CamelCase” to make words stand out (e.g., “UptakeCO2”). Developing a consistent style (e.g., always capitalizing variable names) will make writing code easier too. Most quantitative variables have units (e.g., milligrams), but it is best to track these in your lab notebook and document them in a “code book” describing the data set, rather than incorporating them into the column headings.
File names should also be brief but descriptive. “MyAwesomeData9-8.csv” may be very clear to you on September 8th, but by November, you will be hard-pressed to recall what you were doing at the time. “BFECButterflyWings.csv” might be better. File names should also be recorded in your lab notebook (along with the variable “code book” mentioned above), to facilitate the repeatability and veracity of your science.
Missing data can be included simply by leaving cells blank. There is no need for “NA” or “-999” codes, which could be misinterpreted as actual values during analysis.
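
The two formatting points about headings and blank cells can be checked with a tiny, made-up file. The sketch below is only an illustration: the file contents and the name mini are hypothetical, and it uses read.csv(), which is covered in the next section, to read a temporary file.

```r
# Write a hypothetical three-row .csv file to a temporary location.
# The second heading deliberately breaks the naming advice above,
# and the second row has a blank (missing) uptake value.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Plant,CO2 uptake (umol m-2 sec)",
  "Qn1,16.0",
  "Qn2,",
  "Qc1,14.2"
), csv_path)

mini <- read.csv(csv_path)

names(mini)       # the long heading is rewritten into an awkward R name
is.na(mini[[2]])  # the blank numeric cell is read as a missing value (NA)
```

One caveat worth knowing: blank cells in numeric columns are read as NA, but blank cells in text columns are read as empty strings unless you add na.strings = "" to the read.csv() call.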

Reading Tabular Data into R

Once we have saved our data frame as a .csv file, we can easily load it into R. In RStudio, we can use the “Import Dataset” tab in the “Environment” pane, then navigate to the file. In the case of .csv files, this procedure simply calls the read.csv() function. Given the location of my “CO2.csv” file, it looks like this.

CO2 = read.csv("~/Google Drive/Courses/Biol109/IntroLabR/CO2.csv")

But since the CO2 data are in the datasets package, we can also load them simply by using

data(CO2)

You can find out more about the data using help(CO2), and once the data are loaded you can click on the data frame in the “Environment” pane to view the actual dataset. To see the variable names in a data frame, we can use

names(CO2)

## [1] "Plant"     "Type"      "Treatment" "conc"      "uptake"
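
If you prefer to inspect the data frame from the console rather than the “Environment” pane, a few base R functions give a quick overview. This is a minimal sketch, not a required step; it uses the CO2 data that ship with the datasets package.

```r
# CO2 ships with the datasets package, which R attaches by default.
str(CO2)      # one line per variable: its name, type, and first few values
head(CO2, 3)  # the first three cases (rows)
dim(CO2)      # number of cases (rows) and variables (columns)
```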

Variable Types

There are fundamentally two different types of data:

Quantitative data are naturally represented by a number (e.g., diameter, temperature, number of eggs in a clutch). These may be either continuous (on a decimal scale), or discrete (on an integer scale), but the numbers provide genuine, quantitative information, not just an arbitrary label.
Categorical data are naturally described by simple categories, e.g. red vs. white flowers, or larva, pupa, adult butterflies. The value is always selected from a fixed set of possibilities or “levels”. Ordinal data are categorical data that have a natural order to them, e.g., cold, cool, warm, hot.
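
To see how R itself has classified each variable, we can ask directly. The sketch below checks the CO2 variables and then builds a small ordinal variable by hand; the name feel and its values are hypothetical, chosen to match the cold/cool/warm/hot example.

```r
# Which CO2 variables does R store as numbers (quantitative)?
sapply(CO2, is.numeric)

# A hypothetical ordinal variable: categories with a natural order.
feel <- factor(c("warm", "cold", "hot", "cool"),
               levels = c("cold", "cool", "warm", "hot"),
               ordered = TRUE)

# Because the levels are ordered, comparisons between them are meaningful.
feel < "warm"
```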

These different forms of data lead to differences in how we summarize and examine data. With quantitative data, we can use numerical summaries like the mean, standard deviation, or quantiles, while with categorical data, we generally think about counts or proportions of different values. An easy way to gain such useful information about a data frame is to use summary()

summary(CO2)

##      Plant             Type         Treatment       conc     
##  Qn1    : 7   Quebec     :42   nonchilled:42   Min.   :  95  
##  Qn2    : 7   Mississippi:42   chilled   :42   1st Qu.: 175  
##  Qn3    : 7                                    Median : 350  
##  Qc1    : 7                                    Mean   : 435  
##  Qc3    : 7                                    3rd Qu.: 675  
##  Qc2    : 7                                    Max.   :1000  
##  (Other):42                                                  
##      uptake     
##  Min.   : 7.70  
##  1st Qu.:17.90  
##  Median :28.30  
##  Mean   :27.21  
##  3rd Qu.:37.12  
##  Max.   :45.50  
##

Here it is not surprising that the counts for the Type and Treatment variables are the same (42 in each level), because these factors were chosen by the researchers. However, notice that even though conc is a treatment factor imposed by the researchers, summary() treats it like any other quantitative variable.
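
If we would rather see conc summarized as a set of treatment levels, one option is to count how many observations fall at each concentration. This is a small sketch using base R; treating conc as categorical here is an illustration, not a change to the data frame.

```r
# Count the observations at each CO2 concentration.
table(CO2$conc)

# The same counts, obtained by treating conc as a factor just for the summary.
summary(factor(CO2$conc))
```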

Subsetting Data Frames

Sometimes we only want to work with a subset of our data, to get at a more narrowly focused question. For example, if we only wanted to look at the data from the “Quebec” plants and not the “Mississippi” plants, we can use subset() to make a new data frame called CO2Q.

CO2Q = subset(CO2, Type=="Quebec", drop=TRUE) #drop=TRUE gets rid of "extra" factor levels like "Mississippi."
summary(CO2Q)

##  Plant       Type         Treatment       conc          uptake     
##  Qn1:7   Quebec:42   nonchilled:21   Min.   :  95   Min.   : 9.30  
##  Qn2:7               chilled   :21   1st Qu.: 175   1st Qu.:30.32  
##  Qn3:7                               Median : 350   Median :37.15  
##  Qc1:7                               Mean   : 435   Mean   :33.54  
##  Qc3:7                               3rd Qu.: 675   3rd Qu.:40.15  
##  Qc2:7                               Max.   :1000   Max.   :45.50
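
Whether unused factor levels disappear after subsetting can depend on the class of the data frame and on which packages are loaded, so it is worth checking rather than assuming. The sketch below uses a small, hypothetical data frame called plants and shows droplevels(), a base R function that removes unused levels explicitly.

```r
# A hypothetical three-row data frame with a two-level factor.
plants <- data.frame(
  Type   = factor(c("Quebec", "Quebec", "Mississippi")),
  uptake = c(16.0, 30.4, 10.6)
)

quebec <- subset(plants, Type == "Quebec")
levels(quebec$Type)          # check which levels are still listed

quebec <- droplevels(quebec) # explicitly remove levels with no observations
levels(quebec$Type)
```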

Notice the == sign. A single = is used to assign a value to a variable, while the double == is used to compare a variable to a value to get a logical (true or false) outcome. Similarly, we can subset data frames based on quantitative variables, using logical operators like <, <=, or >. For example, the rate of CO2 uptake approximately saturates at concentrations above 200, and we may want to look at the effect of chilling only when CO2 is saturating. We can use subset() to make a data frame called CO2sat that collects those values.

CO2sat = subset(CO2, conc>200, drop=TRUE)
summary(CO2sat)

##      Plant             Type         Treatment       conc     
##  Qn1    : 5   Quebec     :30   nonchilled:30   Min.   : 250  
##  Qn2    : 5   Mississippi:30   chilled   :30   1st Qu.: 350  
##  Qn3    : 5                                    Median : 500  
##  Qc1    : 5                                    Mean   : 555  
##  Qc3    : 5                                    3rd Qu.: 675  
##  Qc2    : 5                                    Max.   :1000  
##  (Other):30                                                  
##      uptake     
##  Min.   :12.30  
##  1st Qu.:24.90  
##  Median :32.45  
##  Mean   :31.19  
##  3rd Qu.:38.83  
##  Max.   :45.50  
##
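
The two kinds of conditions can also be combined in a single call. As a brief sketch, the & operator ("and") keeps only the cases that satisfy both conditions; the data frame name CO2Qsat is made up for this example.

```r
# Keep only Quebec plants measured at saturating CO2 concentrations.
CO2Qsat <- subset(CO2, Type == "Quebec" & conc > 200)

nrow(CO2Qsat)          # number of cases that satisfy both conditions
unique(CO2Qsat$conc)   # the concentrations that remain
```

Using | ("or") in place of & would instead keep the cases that satisfy either condition.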

Variables in Data Frames

Often, we want to refer to the values of a single variable in a data frame. There are many ways to do so in R, but here we will focus on only two: functions with a data= argument and the $ notation.

Most (but not all) of the statistical and visualization functions we use in R are designed to work with data frames, and they allow us to work with particular variables within them. In particular, the mosaic package adapts many basic functions to work easily with data frames. For example, if we want to know the mean CO2 uptake of chilled vs. unchilled plants, we can use the mean() function with a data= argument.

mean(uptake~Treatment, data=CO2)

## nonchilled    chilled 
##   30.64286   23.78333

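We can do the same thing, even with a single variable, by writing a one-sided formula with the variable after the ~. (Recent versions of mosaic may not accept a bare variable name together with data=, so the formula form is the safer habit.) Say we want to know the minimum CO2 concentration at which the plants were tested.

min(~conc, data=CO2)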

## [1] 95

Alternatively, we can use the $ operator to reference a single variable in a data frame and treat it as a single vector of data. For example, we can find the minimum CO2 concentration using the $ rather than a data= argument.

min(CO2$conc)

## [1] 95

The data frame name comes first, followed by the $ then the variable name.
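
Because $ hands back an ordinary vector, it works with any base R function, whether or not mosaic is loaded. The sketch below shows an equivalent way to extract a variable and one base R route to the group means computed earlier; tapply() is a swapped-in base function here, not the mosaic approach used above.

```r
# Two equivalent ways to pull one variable out as a vector.
identical(CO2$conc, CO2[["conc"]])

# Group means without a data= argument: split uptake by Treatment, then average.
tapply(CO2$uptake, CO2$Treatment, mean)
```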

The data= approach is more flexible, as when we found the mean uptake by treatment, but some functions are built to work with vectors rather than data frames, and here the $ approach is more useful. For example, notice that when we used summary(CO2) above, the Plant variable listed only some of the values, leaving 42 observations in the category “(Other)”. If we want to know all the possible values for a categorical variable, we can use levels(), but levels() will not accept a data= argument.

levels(CO2$Plant)

##  [1] "Qn1" "Qn2" "Qn3" "Qc1" "Qc3" "Qc2" "Mn3" "Mn2" "Mn1" "Mc2" "Mc3"
## [12] "Mc1"

Another way to avoid the $ when a function will not accept a data= argument is to use the with() function to specify a data frame. To get all 12 possible values for the Plant variable, we can do this.

with(data=CO2, levels(Plant))

##  [1] "Qn1" "Qn2" "Qn3" "Qc1" "Qc3" "Qc2" "Mn3" "Mn2" "Mn1" "Mc2" "Mc3"
## [12] "Mc1"
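
with() becomes more convenient when an expression mentions several variables, or the same variable more than once, since the data frame name is typed only once. A short sketch:

```r
# Range of observed uptake values, without repeating "CO2$".
with(CO2, max(uptake) - min(uptake))

# Number of distinct plants in the study.
with(CO2, nlevels(Plant))
```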

As usual, there are many paths to the same destination in R.

Calculating Additional Variables

Often, the data we are interested in analyzing are not our raw measurements, but some calculation based on them. As a simple example, suppose we wanted to examine our uptake rate per minute, rather than per second. Because there are 60 seconds in a minute, the per-minute rate is the per-second rate multiplied by 60. Let’s add a new variable to the CO2 data frame called uptake.min, which is simply uptake times 60. We can use the $ to make new variables as well.

CO2$uptake.min = CO2$uptake*60

Notice that this command creates no output to the console. However, if we now look at the variables present in the data frame, we find a new one.

names(CO2)

## [1] "Plant"      "Type"       "Treatment"  "conc"       "uptake"    
## [6] "uptake.min"

Another way to calculate new variables is to use transform().

CO2 = transform(CO2, uptake.min=uptake*60)

This command gives the same result as above.
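
One convenience of transform() is that several new variables can be created in a single call. The sketch below is illustrative only: the variable names log.conc and saturating, and the copy CO2x, are hypothetical and are not used elsewhere in this tutorial.

```r
# Add two derived variables at once, storing the result in a separate copy.
CO2x <- transform(CO2,
                  log.conc   = log10(conc),  # concentration on a log scale
                  saturating = conc > 200)   # TRUE when CO2 is saturating

names(CO2x)
table(CO2x$saturating)  # how many cases fall on each side of the cutoff
```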

We could have done that calculation in the spreadsheet before we loaded the data into R, but it is usually much more efficient (and scientifically transparent) to put such calculations into the code for your analysis. Doing repetitive calculations in a spreadsheet, or worse yet on a hand calculator, is slower and invites many more errors. What’s more, if we were to collect more data, we would have to go back and do more calculations, rather than just running our analysis code on the new data set. As a general rule, always code your calculations!
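
One way to make a coded calculation easy to rerun is to wrap it in a small function and apply it to whichever data frame is current. This is only a sketch under stated assumptions: the function name mean_uptake_by_treatment is hypothetical, and the Quebec subset merely stands in for a newly collected data set with the same columns.

```r
# Hypothetical helper: mean uptake for each treatment in any data frame
# that has "uptake" and "Treatment" columns.
mean_uptake_by_treatment <- function(df) {
  tapply(df$uptake, df$Treatment, mean)
}

# Run it on the full data set...
mean_uptake_by_treatment(CO2)

# ...and rerun the identical calculation on another data set
# (here a subset stands in for newly collected data).
new_data <- subset(CO2, Type == "Quebec")
mean_uptake_by_treatment(new_data)
```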

In summary, we’ve outlined some simple principles for data management and analysis.

Use properly formatted data frames for managing and manipulating your data. Keep variables in columns and cases in rows, with consistently formatted values in all cells, leaving missing observations blank. Variable names and file names should be simple and concise, but descriptive and useful. Always save data in .csv format.
Know your variables and be able to distinguish quantitative and categorical variables, since they are handled differently in visualization, summary, and analysis. We can subset() data frames based on both types of variables, and we can access and operate on variables from data frames using data=, $, and with() approaches.
Code your calculations to keep data management and analysis efficient and accurate. We can use both $ and transform() approaches to calculate new variables and incorporate those calculations into analysis and visualization scripts.

Adhering to these very general (and relatively easy) principles when managing and manipulating data facilitates transparent, repeatable, and verifiable science.