Data Handling in R

Vectors

Creating a Vector

The most basic type of data in R is called a vector. In fact, most other data structures in R (including the data frames introduced below) are built from vectors. A vector is a series of values. All of the values in a vector must be of the same mode (e.g., numeric, character, or logical).

Vectors can be created using the c() function. For example, the command below creates a new numeric vector and uses the assignment operator <- to store it as an object called “myVector”. Wrapping the whole assignment in parentheses also prints myVector to the screen.

# Note: you can insert comments like this into an R script by placing them after the "#"
# character. Comments are to be read by humans; they are not executed by the computer.
(myVector <- c(5, 11, 8, 1, -2, NA, 40))  # NA indicates missing data.
## [1]  5 11  8  1 -2 NA 40
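
The same-mode rule described above is easy to trip over, so here is a small sketch of what happens when modes are mixed. The values and the name mixedVector are made up for illustration; the point is only that c() does not raise an error but quietly converts everything to a single mode.

```r
# Mixing modes in c() does not cause an error. R silently converts ("coerces")
# every value to the most flexible mode present, here character.
mixedVector <- c(5, "eleven", TRUE)  # hypothetical example values
mode(mixedVector)
## [1] "character"
# Logical values mixed with numbers become 1 (TRUE) and 0 (FALSE):
numericFromLogical <- c(TRUE, FALSE, 3)
numericFromLogical
## [1] 1 0 3
```

If a column that should be numeric shows up as character, a stray text value somewhere in the data is a likely cause.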

The length() function can be used to determine the number of elements in the vector. The length includes any elements that are NA. Beware: NAs are a constant source of errors and confusion in R.

length(myVector)
## [1] 7
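
As a brief illustration of why NAs cause trouble, the sketch below shows how a single missing value affects common summary functions. The name exampleVector is ours; it simply holds the same values as myVector. The na.rm argument is a standard argument of sum() and mean().

```r
exampleVector <- c(5, 11, 8, 1, -2, NA, 40)  # same values as myVector above
# One missing value makes the total unknown, so R returns NA:
sum(exampleVector)
## [1] NA
# na.rm = TRUE tells the function to drop NAs before calculating:
sum(exampleVector, na.rm = TRUE)
## [1] 63
mean(exampleVector, na.rm = TRUE)
## [1] 10.5
```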

Accessing a Vector

There are two ways to access the values inside a vector. The first is using vector indexing. For example:

# To print the second element of myVector:
myVector[2]
## [1] 11
# To print the second, third, and fourth elements of myVector:
myVector[2:4]
## [1] 11  8  1
# To print the fourth, fifth, sixth, and fourth (again) elements of myVector:
myVector[c(4, 5, 6, 4)]
## [1]  1 -2 NA  1
# To print all elements except the sixth element:
myVector[-6]
## [1]  5 11  8  1 -2 40

The other way to access elements of a vector is by using a logical (i.e., TRUE/FALSE) vector. For example:

# This command prints a vector of TRUE or FALSE values corresponding to whether each
# element in myVector is greater than zero (TRUE) or not (FALSE). myVector[5] is -2,
# so the fifth element is FALSE, but all others are TRUE, except for myVector[6]
# because it is NA.
myVector > 0
## [1]  TRUE  TRUE  TRUE  TRUE FALSE    NA  TRUE
# When we use a logical condition in square brackets, we extract all vector elements for
# which the condition is TRUE. Where the condition is NA, R returns NA. Therefore, this
# command prints all elements of myVector that are greater than zero, plus an NA:
myVector[myVector > 0]
## [1]  5 11  8  1 NA 40
# Notice that the -2 was not printed, but the NA was. If we wish to remove the NA, we have
# to use the is.na() function. This returns TRUE if an element is NA, but FALSE
# otherwise:
is.na(myVector)
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
# If we want to find elements that are *not* NA, we can use the logical NOT operator (!):
!is.na(myVector)
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
# This prints all elements that are *not* NA:
myVector[!is.na(myVector)]
## [1]  5 11  8  1 -2 40
# Finally, we can combine our conditions using the logical AND (&). This command will
# print all elements of myVector that are greater than zero and that are not NA.
myVector[myVector > 0 & !is.na(myVector)]
## [1]  5 11  8  1 40
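
An alternative that some people find easier to read is the which() function, which returns the positions where a condition is TRUE and skips positions where it is NA. This is only a sketch of one possible approach; exampleVector is a made-up name holding the same values as myVector.

```r
exampleVector <- c(5, 11, 8, 1, -2, NA, 40)
# which() returns the index numbers where the condition is TRUE.
# Positions where the condition is NA are left out.
positiveIndex <- which(exampleVector > 0)
positiveIndex
## [1] 1 2 3 4 7
# Those index numbers can then be used for ordinary vector indexing:
exampleVector[positiveIndex]
## [1]  5 11  8  1 40
```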

Operating on Vectors

In R, we can perform mathematical operations on entire vectors all at one time. For example:

# To add a constant to each element of a vector:
c(1, 4, 5) + 1
## [1] 2 5 6
# To perform element-wise multiplication (or addition, subtraction, division, etc.) of two vectors:
c(1, 4, 5) * c(3, 2, 6)
## [1]  3  8 30
# To take the base-10 logarithm of all elements of a vector:
log10(c(10, 100, 1000))
## [1] 1 2 3
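
One detail worth knowing about element-wise operations is what R does when the two vectors have different lengths. The sketch below, with made-up values, shows R's "recycling" behaviour: the shorter vector is repeated until it matches the longer one.

```r
# The shorter vector (10, 20) is recycled to (10, 20, 10, 20):
recycled <- c(1, 2, 3, 4) + c(10, 20)
recycled
## [1] 11 22 13 24
# If the longer length is not a multiple of the shorter one, R still returns a
# result (11 22 13 here) but also issues a warning. suppressWarnings() is used
# only to keep this example quiet.
unevenResult <- suppressWarnings(c(1, 2, 3) + c(10, 20))
unevenResult
## [1] 11 22 13
```

Adding a constant to a vector, as in the first example above, is just recycling of a vector of length 1.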

Adding New Elements to Vectors

To add a new element to the end of a vector, just use an index number that hasn’t been assigned yet.

# myVector contains 7 elements
print(myVector)
## [1]  5 11  8  1 -2 NA 40
# To add an eighth element:
myVector[8] <- 100
print(myVector)
## [1]   5  11   8   1  -2  NA  40 100
# To add ninth and tenth elements:
myVector[9:10] <- c(1000, 10000)
print(myVector)
##  [1]     5    11     8     1    -2    NA    40   100  1000 10000
# To add a 15th element (and NAs for elements 11-14):
myVector[15] <- 1000000
print(myVector)
##  [1]       5      11       8       1      -2      NA      40     100    1000
## [10]   10000      NA      NA      NA      NA 1000000
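
Assigning to a new index is not the only option. As a minimal sketch (the name exampleVector and its values are hypothetical), c() and append() can also be used to grow a vector, and they avoid accidentally creating NA gaps.

```r
exampleVector <- c(5, 11, 8)
# c() can combine an existing vector with new values at either end:
exampleVector <- c(exampleVector, 100)
exampleVector <- c(0, exampleVector)
exampleVector
## [1]   0   5  11   8 100
# append() inserts new values after a given position (here, after element 2):
inserted <- append(exampleVector, 99, after = 2)
inserted
## [1]   0   5  99  11   8 100
```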

Data Frames

Creating a Data Frame

In R, a data frame is the equivalent of an Excel worksheet. A data frame is a list of vectors of the same length. Each vector is represented as a column, and each vector may be of a different mode. Data frames can be created using the data.frame() function. They are also created automatically when you read data from a file using read.csv(), read.delim(), etc.

# This creates a data frame with six columns and then prints it. Each column is a vector of
# length 5.
(myDataFrame <- data.frame(Name = c("Rosie", "John", "Kate", "Sandra", "Hayden"),
                           Age = c(19, 20, 18, 26, 70),
                           Sex = c("F", "M", "F", "F", NA),
                           Infected = c(TRUE, TRUE, FALSE, TRUE, FALSE),
                           Height.m = c(1.60, 1.80, 1.54, 1.67, 1.90),
                           Mass.kg = c(71, 105, 88, 84, 200)))
##     Name Age  Sex Infected Height.m Mass.kg
## 1  Rosie  19    F     TRUE     1.60      71
## 2   John  20    M     TRUE     1.80     105
## 3   Kate  18    F    FALSE     1.54      88
## 4 Sandra  26    F     TRUE     1.67      84
## 5 Hayden  70 <NA>    FALSE     1.90     200

Note that in the command above, character values (e.g., in the Name and Sex vectors) needed to be surrounded by quotation marks. Numeric values (e.g., Age) and logical values (e.g., Infected) do not take quotation marks. NA also does not take quotation marks. Incorrect use of quotation marks is a very common error in R.

We can confirm the dimensions of our data frame using nrow() and ncol():

nrow(myDataFrame)  # Number of rows
## [1] 5
ncol(myDataFrame)  # Number of columns
## [1] 6
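
A few other base R functions are handy for a quick look at a data frame. The sketch below uses a small made-up data frame called exampleFrame rather than myDataFrame, so it can be run on its own.

```r
exampleFrame <- data.frame(Name = c("Rosie", "John"),
                           Age = c(19, 20),
                           Infected = c(TRUE, TRUE))
# dim() returns the number of rows and the number of columns together:
dim(exampleFrame)
## [1] 2 3
# names() returns the column names:
names(exampleFrame)
## [1] "Name"     "Age"      "Infected"
# sapply() applies class() to every column, showing each column's data type.
# (In R 4.0 or later, Name is "character"; older versions made it a factor by default.)
columnClasses <- sapply(exampleFrame, class)
columnClasses
```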

Accessing a Data Frame

A single column of a data frame can be accessed using $:

myDataFrame$Name  # Print the Name column
## [1] "Rosie"  "John"   "Kate"   "Sandra" "Hayden"

Columns are vectors, so the regular rules for accessing vectors apply. This prints the names in the second and third rows of the data frame:

myDataFrame$Name[2:3]
## [1] "John" "Kate"

To access more than one column, you have to use square brackets. You can specify the column names or the column numbers:

# Note the comma. Because there is nothing before it, it means "print all rows"
myDataFrame[, c("Infected", "Name")]  # Print the Infected column and the Name column
##   Infected   Name
## 1     TRUE  Rosie
## 2     TRUE   John
## 3    FALSE   Kate
## 4     TRUE Sandra
## 5    FALSE Hayden
myDataFrame[, c(2, 3, 1)]             # Print columns #2, #3, and #1
##   Age  Sex   Name
## 1  19    F  Rosie
## 2  20    M   John
## 3  18    F   Kate
## 4  26    F Sandra
## 5  70 <NA> Hayden
# This will print the third and fourth rows of the Name and Infected columns.
myDataFrame[3:4, c("Name", "Infected")]
##     Name Infected
## 3   Kate    FALSE
## 4 Sandra     TRUE
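
One thing that can be surprising with square brackets is that selecting a single column returns a plain vector rather than a data frame. The sketch below, using a made-up exampleFrame, shows this and how the drop = FALSE argument keeps the result as a data frame.

```r
exampleFrame <- data.frame(Name = c("Rosie", "John", "Kate"),
                           Age = c(19, 20, 18))
# Selecting one column with [ , ] gives back a vector, just like $:
ageVector <- exampleFrame[, "Age"]
ageVector
## [1] 19 20 18
# Adding drop = FALSE keeps the result as a one-column data frame:
ageFrame <- exampleFrame[, "Age", drop = FALSE]
ageFrame
##   Age
## 1  19
## 2  20
## 3  18
```

This matters when later code expects a data frame (for example, code that calls nrow() or refers to columns by name).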

Filtering a Data Frame

You can use the values in a column to filter your data. For example, this will print all rows where the individual’s age is greater than or equal to 20:

# Note the comma. Because there is nothing after it, it means "print all columns"
myDataFrame[myDataFrame$Age >= 20, ]  
##     Name Age  Sex Infected Height.m Mass.kg
## 2   John  20    M     TRUE     1.80     105
## 4 Sandra  26    F     TRUE     1.67      84
## 5 Hayden  70 <NA>    FALSE     1.90     200

As before, we can specify multiple conditions using &. This prints the Name column for all individuals whose ages are between 20 and 65:

myDataFrame[myDataFrame$Age >= 20 & myDataFrame$Age <= 65, "Name"]
## [1] "John"   "Sandra"
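
Missing values need some care when filtering, for the same reason they did with vectors. In the sketch below (exampleFrame is a made-up, cut-down version of the data above), filtering on a column that contains NA produces an extra row of NAs. Two common ways around this are which(), which returns only the positions where a condition is TRUE, and subset(), which drops rows where the condition is NA.

```r
exampleFrame <- data.frame(Name = c("Rosie", "John", "Kate", "Sandra", "Hayden"),
                           Sex = c("F", "M", "F", "F", NA))
# The condition is NA for Hayden, so an extra row filled with NAs is returned:
withNaRow <- exampleFrame[exampleFrame$Sex == "F", ]
nrow(withNaRow)
## [1] 4
# which() keeps only the rows where the condition is TRUE:
usingWhich <- exampleFrame[which(exampleFrame$Sex == "F"), ]
nrow(usingWhich)
## [1] 3
# subset() also drops rows where the condition is NA:
usingSubset <- subset(exampleFrame, Sex == "F")
usingSubset$Name
## [1] "Rosie"  "Kate"   "Sandra"
```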

Adding New Columns to a Data Frame

To add a new column to an existing data frame, just use $, name a new column, and assign values to it:

# This creates a new column called "BMI" that is calculated as mass/(height^2).
myDataFrame$BMI <- myDataFrame$Mass.kg/(myDataFrame$Height.m^2)

# This creates a new column called "Obese" that is TRUE if BMI is greater than or equal to
# 30 and FALSE otherwise.
myDataFrame$Obese <- ifelse(myDataFrame$BMI >= 30, TRUE, FALSE)
print(myDataFrame)
##     Name Age  Sex Infected Height.m Mass.kg      BMI Obese
## 1  Rosie  19    F     TRUE     1.60      71 27.73437 FALSE
## 2   John  20    M     TRUE     1.80     105 32.40741  TRUE
## 3   Kate  18    F    FALSE     1.54      88 37.10575  TRUE
## 4 Sandra  26    F     TRUE     1.67      84 30.11940  TRUE
## 5 Hayden  70 <NA>    FALSE     1.90     200 55.40166  TRUE

Sorting a Data Frame

To sort a data frame by one or more columns, use the order() function. For example, to sort myDataFrame from shortest to tallest:

rowOrder <- order(myDataFrame$Height.m)
myDataFrame <- myDataFrame[rowOrder, ]
print(myDataFrame)
##     Name Age  Sex Infected Height.m Mass.kg      BMI Obese
## 3   Kate  18    F    FALSE     1.54      88 37.10575  TRUE
## 1  Rosie  19    F     TRUE     1.60      71 27.73437 FALSE
## 4 Sandra  26    F     TRUE     1.67      84 30.11940  TRUE
## 2   John  20    M     TRUE     1.80     105 32.40741  TRUE
## 5 Hayden  70 <NA>    FALSE     1.90     200 55.40166  TRUE
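
The order() function can also sort in decreasing order or by more than one column. The following is a small sketch on a made-up exampleFrame; note that order() returns row positions, not the sorted values themselves.

```r
exampleFrame <- data.frame(Name = c("Rosie", "John", "Kate", "Sandra", "Hayden"),
                           Age = c(19, 20, 18, 26, 70),
                           Sex = c("F", "M", "F", "F", NA))
# Oldest to youngest:
oldestFirst <- order(exampleFrame$Age, decreasing = TRUE)
oldestFirst
## [1] 5 4 2 1 3
# Sort by Sex first, then by Age within each Sex. NAs are placed last by default.
bySexThenAge <- order(exampleFrame$Sex, exampleFrame$Age)
bySexThenAge
## [1] 3 1 4 2 5
exampleFrame[bySexThenAge, "Name"]
## [1] "Kate"   "Rosie"  "Sandra" "John"   "Hayden"
```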

Saving a Data Frame to Disk

Finally, we can save a copy of our myDataFrame data frame to disk. If we wanted to save myDataFrame as a CSV file we could use:

write.csv(myDataFrame, file = "myDataFrame.csv", quote = FALSE, row.names = FALSE)

To learn more about the arguments for write.csv(), look it up in R’s documentation using ?write.csv. You can always use the question mark in front of a function name to learn more about it.
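
It can be reassuring to check what was actually written. The sketch below writes a small made-up data frame to a temporary file (so nothing is left in your working directory), looks at the raw text, and reads it back in. The names exampleFrame, exampleFile, and checkFrame are ours.

```r
exampleFrame <- data.frame(Name = c("Rosie", "John"), Age = c(19, 20))
exampleFile <- tempfile(fileext = ".csv")  # a throwaway file path
write.csv(exampleFrame, file = exampleFile, quote = FALSE, row.names = FALSE)
# readLines() shows the file as plain text, one element per line:
readLines(exampleFile)
## [1] "Name,Age" "Rosie,19" "John,20"
# Reading the file back in should give the same rows and columns:
checkFrame <- read.csv(exampleFile)
dim(checkFrame)
## [1] 2 2
```

One caution about quote = FALSE: if any character value contains a comma, the file will not read back in correctly, so leaving quote at its default of TRUE is the safer choice for such data.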

Before continuing, make sure you feel comfortable accessing vectors and data frames. You may want to make some vectors and data frames of your own to play with them.

This Week’s Task - Preliminaries…

Clearing the Environment

This command will remove all of the objects that you currently have in your R environment. This gives you a “fresh start”. The same thing can be accomplished by clicking the broom icon in RStudio’s “Environment” tab.

rm(list = ls())

Some R programmers recommend that you put that command at the beginning of every script.
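
If you only want to remove particular objects rather than everything, rm() also accepts object names directly. A minimal sketch, using two throwaway objects with made-up names:

```r
scratchA <- 1:3
scratchB <- "temporary"
# Remove one named object and leave the rest of the environment alone:
rm(scratchA)
scratchAExists <- exists("scratchA")
scratchAExists
## [1] FALSE
scratchBExists <- exists("scratchB")
scratchBExists
## [1] TRUE
rm(scratchB)  # tidy up the second example object as well
```

Keep in mind that rm(list = ls()) only removes objects. It does not unload packages or reset options, so restarting R is the more thorough way to get a truly fresh session.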

Setting the Working Directory

Before you can start working on real data, you need to set your working directory to the folder where your data is stored. The working directory is the folder on your computer where R tries to read and write files by default. If you do not know your current working directory, type getwd() and R will print the working directory for you.

The easiest way to change your working directory to a different location is using the “Files” tab of the RStudio interface. (I will show you during tutorial.) Otherwise, you can use the setwd() function. For example, if I wanted to set my working directory to ~/Documents/BIOL5473/Tutorial1/Data:

setwd("~/Documents/BIOL5473/Tutorial1/Data")  # Note: "~/" is a shortcut for the user's home directory
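
After changing the working directory, it is worth confirming that R can actually see your file before trying to read it. The sketch below uses R's temporary folder and a made-up file name (exampleData.csv) so that it runs anywhere; in practice you would use your own folder and file.

```r
originalDir <- getwd()        # remember where we started
setwd(tempdir())              # tempdir() stands in for your data folder here
writeLines("x,y", "exampleData.csv")  # create a tiny placeholder file
# file.exists() returns TRUE if the file is in the working directory:
fileFound <- file.exists("exampleData.csv")
fileFound
## [1] TRUE
# list.files() lists the folder contents; the pattern keeps only .csv files:
csvFiles <- list.files(pattern = "\\.csv$")
setwd(originalDir)            # go back to the original working directory
```

If file.exists() returns FALSE for your data file, the working directory or the file name is probably wrong.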

Convert the Data File

The data file we will be using is jeb13353-sup-0002-appendixs1.xlsx. This file contains data on brain size and body size of many mammal species. It also includes information about whether each species hibernates, the latitude of the species range, whether the species is diurnal or nocturnal, etc. For more information on the file, see the README.txt file on D2L.

The data come from the following paper:

  • Heldstab SA, Isler K, van Schaik CP (2018) Hibernation constrains brain size evolution in mammals. Journal of Evolutionary Biology 31(10):1582-1588. https://doi.org/10.1111/jeb.13353

The data are saved as an Excel file (.xlsx). It is best to store your data as a comma-separated values (.csv) or a tab-delimited text (.txt) file. Use Microsoft Excel to convert the file to CSV or tab-delimited format. If you are unsure how to do this, please ask.

Loading the Data

Your next step is usually to get data to work with. Often this will be stored in a file. To read in a CSV file, you can use read.csv(); to read in a tab-delimited file, you can use read.delim(). For example:

# This reads a csv file called jeb13353-sup-0002-appendixs1.csv, which should be in the
# current working directory
MammalData <- read.csv("jeb13353-sup-0002-appendixs1.csv")
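
To get a feel for what read.csv() does without needing the real file, the sketch below reads a tiny made-up CSV supplied as text. The species names and values are invented for illustration; only the column names brain.mass and hibernation echo the real dataset. The text argument is a standard argument of read.csv().

```r
exampleText <- "Species,brain.mass,hibernation
Species A,1.5,0
Species B,,1
Species C,20.25,0"
# text = lets read.csv() read from a character string instead of a file:
exampleData <- read.csv(text = exampleText)
# head() prints the first few rows (here, the first two):
head(exampleData, n = 2)
# The empty brain.mass field for Species B is read in as NA:
exampleData$brain.mass
## [1]  1.50    NA 20.25
```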

Examining the Data and Converting to Factors

After you load your data, you should always examine it to make sure that the file has been read in correctly and to ensure that you understand the structure of the dataset. The str() function shows us the structure of the object.

Running str(MammalData) shows us that our MammalData object has 1104 rows and 16 columns. Some columns like “Group” are stored as vectors of character values (chr). Other columns like “brain.mass” are stored as numbers with decimal places (num). Other columns like “hibernation” are stored as whole numbers with no decimal places (int). Our next step after reading in the data is to make sure that all of these data types are correct.
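
To see these three storage types on a small scale, here is a sketch using a made-up two-row data frame. The values are invented; the column names simply mirror the ones mentioned above. This assumes R 4.0 or later, where text columns stay as character rather than being converted to factors.

```r
exampleData <- data.frame(Group = c("Rodentia", "Primates"),
                          brain.mass = c(1.5, 20.5),
                          hibernation = c(1L, 0L))  # the L suffix makes whole numbers integers
# str() prints one line per column: its name, its type (chr, num, int),
# and the first few values.
str(exampleData)
# sapply() with class() gives the same type information in full words:
columnTypes <- sapply(exampleData, class)
columnTypes
```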

Note that “hibernation” is stored as a number: either 0 or 1. But hibernation is not really a number; it is a code. In the README.txt file, you can see that 0 means “does not hibernate” and 1 means “does hibernate”. In R, a variable that can only take a finite number of discrete values is called a “factor”. Each of the possible values of the factor is called a “level”. Let’s change the hibernation column from an integer to a factor using the factor() function.

Here we will tell R that there are two possible values for hibernation (0 and 1), and we will relabel those values as 0 --> “No” and 1 --> “Yes” so that we don’t get confused. (The labels must be specified in the same order as the levels.) Note: the order in which you specify levels is extremely important. The first level that you specify is called the reference level, and it is the level that is the basis of comparison. For example, your reference level may represent a control group to which all of your treatments are compared.

# Convert the hibernation column and save it over the original column (thereby replacing it)
# Here, we are using 0 ("does not hibernate") as the reference level:
MammalData$hibernation <- factor(MammalData$hibernation,
                                 levels = c(0, 1),
                                 labels = c("No", "Yes"))

Let’s see how our data has changed by running str(MammalData) again.

Note that hibernation is now a factor with two levels (the original code 0 is now labelled “No” and the original code 1 is labelled “Yes”). The first six entries of the column look like this:

head(MammalData$hibernation)
## [1] No  No  Yes No  No  No 
## Levels: No Yes
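
Here is a short sketch of a few functions that are useful once a column is a factor. The vector hibernationCodes is made up; levels(), as.integer(), and relevel() are standard base R functions.

```r
hibernationCodes <- c(0, 1, 1, 0, 0)  # hypothetical codes
hibernationFactor <- factor(hibernationCodes,
                            levels = c(0, 1),
                            labels = c("No", "Yes"))
# levels() lists the levels in order; the first one is the reference level:
levels(hibernationFactor)
## [1] "No"  "Yes"
# table() counts how many times each level occurs (here, 3 "No" and 2 "Yes"):
levelCounts <- table(hibernationFactor)
# Internally, levels are numbered from 1, so "No" is stored as 1, not 0:
as.integer(hibernationFactor)
## [1] 1 2 2 1 1
# relevel() changes the reference level without changing the data:
releveled <- relevel(hibernationFactor, ref = "Yes")
levels(releveled)
## [1] "Yes" "No"
```

The as.integer() result is a common pitfall: converting a factor back to numbers gives the internal level numbers, not the original 0/1 codes.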

Are there any other columns that should be converted to factors?

Printing Session Info

It is always a good idea to record the version of R and the versions of the various packages you are using. This allows others to reproduce your work exactly, because they know exactly which versions of which programs you were using. You can obtain this information with the sessionInfo() function.
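
As a minimal sketch, these are the kinds of commands you might put at the end of a script. packageVersion() is shown with the built-in stats package only as an example; substitute whichever packages you actually use.

```r
# The version of R as a single line of text:
R.version.string
# Full details: R version, operating system, and loaded packages:
sessionDetails <- sessionInfo()
print(sessionDetails)
# The version of one particular package:
statsVersion <- packageVersion("stats")
statsVersion
```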

This Week’s Task

Your task for this week is to become comfortable manipulating data and doing basic plotting. Create an R script that does the following.

  1. Read in the mammal dataset from a CSV or tab-delimited file.
  2. Remove “non-data” columns. For example, we do not need the columns that contain references/citations. You should remove those from the dataset using R.
  3. Convert columns to factors wherever it is appropriate.
  4. Create two histograms: one for brain mass and one for body mass. Do brain mass and body mass look normally distributed (i.e., shaped like bell curves)?
    • We will talk more about graphics in next week’s lab, but for now hist(MammalData$brain.mass, breaks = "scott", right = FALSE) will create a decent histogram of brain mass, for example.
  5. Log transformations are commonly used in biology when data is right-skewed (i.e., there is a long tail on the right). Create a new column in the dataset for log-transformed brain mass and a new column for log-transformed body mass. The logarithm functions in R are log() for natural logarithms and log10() for common logarithms.
  6. Make histograms for log-transformed brain mass and log-transformed body mass. How did log transformation affect the distributions?
  7. Create a scatterplot of brain mass vs. body mass: plot(x = MammalData$body.mass, y = MammalData$brain.mass). Does it look like a straight line would fit this data well?
  8. Create a scatterplot of log-transformed brain mass vs. log-transformed body mass. Does it look like a straight line would fit this data well? What did the log transformation do? Overall, how would you describe the relationship between body mass and brain mass?
  9. Create a new column for the “encephalization quotient” (EQ). EQ estimates how much larger or smaller a mammal’s brain is than expected from its body size. A high EQ represents a larger brain than expected, which suggests high intelligence. The formula is: \(EQ=\frac{mass_{brain}}{0.059 \times {mass_{body}}^{0.76}}\). Which group of mammals seems to have the highest EQ?
  10. Write your final dataset to a CSV file.
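
To show the mechanics behind a few of these steps without using the real data, here is a sketch on a tiny made-up data frame. Everything in toyData is invented, including the Reference column name; only body.mass and brain.mass match column names used above. Treat it as one possible approach, not the required solution.

```r
toyData <- data.frame(Species = c("Species A", "Species B", "Species C"),
                      body.mass = c(1, 100, 10000),
                      brain.mass = c(0.059, 2, 60),
                      Reference = c("ref 1", "ref 2", "ref 3"))
# Removing a column by name: keep every column whose name is NOT in the list.
toyData <- toyData[, !(names(toyData) %in% c("Reference"))]
# Log transformation: log10() works on the whole column at once.
toyData$log.body.mass <- log10(toyData$body.mass)
toyData$log.body.mass
## [1] 0 2 4
# EQ formula from step 9, written element-wise. Note the parentheses around
# the denominator so that the whole expression is divided into brain mass.
toyData$EQ <- toyData$brain.mass / (0.059 * toyData$body.mass^0.76)
print(toyData)
```

The first toy species was chosen so that its EQ works out to 1 (a brain exactly as large as expected for its body size), which is a handy check that the formula was typed correctly.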

After you have done all this, make sure you take time to play around with the data in R. You will not get good at R if you do not play with it and fight through frustration. For example, here are some of the data wrangling tasks you should be able to do:

Taking It to the Next Level

The data manipulation steps that we have covered here using base R can also be performed using the dplyr package, which is part of the “tidyverse”. dplyr is great for “big data”, and some people find dplyr easier to use than base R. For more information about dplyr, see Chapter 5 of R for Data Science here: https://r4ds.had.co.nz/
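
For a taste of what dplyr looks like, here is a hedged sketch that repeats some of the earlier data frame operations (filtering, adding a column, sorting, and selecting columns). It assumes the dplyr package is installed; the requireNamespace() check simply skips the example if it is not. The data frame is a made-up, cut-down copy of the one used earlier.

```r
if (requireNamespace("dplyr", quietly = TRUE)) {
  exampleFrame <- data.frame(Name = c("Rosie", "John", "Kate", "Sandra", "Hayden"),
                             Age = c(19, 20, 18, 26, 70),
                             Height.m = c(1.60, 1.80, 1.54, 1.67, 1.90),
                             Mass.kg = c(71, 105, 88, 84, 200))
  # filter() keeps rows that meet all of the conditions:
  adults <- dplyr::filter(exampleFrame, Age >= 20, Age <= 65)
  # mutate() adds a new column calculated from existing ones:
  adults <- dplyr::mutate(adults, BMI = Mass.kg / Height.m^2)
  # arrange() sorts the rows (here, shortest to tallest):
  adults <- dplyr::arrange(adults, Height.m)
  # select() keeps only the named columns:
  adults <- dplyr::select(adults, Name, BMI)
  print(adults)
}
```

Notice that column names are written without quotation marks or the $ operator inside dplyr functions, which is one reason some people find it easier to read.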