The most basic type of data in R is called a vector. In fact, nearly every data object in R is built from vectors. A vector is an ordered series of values. All of the values in a vector must be of the same mode (e.g., numeric, character, or logical).
Vectors can be created using the c() function. For
example, this creates a new numeric vector, and uses the
assignment operator <- to store it as
an object called “myVector”. It then prints myVector to the screen.
# Note: you can insert comments like this into an R script by placing them after the "#"
# character. Comments are to be read by humans; they are not executed by the computer.
(myVector <- c(5, 11, 8, 1, -2, NA, 40)) # NA indicates missing data.
## [1] 5 11 8 1 -2 NA 40
The length() function can be used to determine the
number of elements in the vector. The length includes any elements that
are NA. Beware: NAs are a constant source of error/confusion
in R.
length(myVector)
## [1] 7
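One reason NAs cause confusion is that many summary functions return NA as soon as a single element is missing. The following is a minimal sketch of that behaviour; it re-creates the example vector so that it runs on its own, and the names nMissing, meanAll, and meanObserved are only illustrative.

```r
# Re-create the example vector so this block is self-contained
myVector <- c(5, 11, 8, 1, -2, NA, 40)

# Count the missing values (TRUE counts as 1 when summed)
nMissing <- sum(is.na(myVector))
print(nMissing)

# By default, a single NA makes the mean NA
meanAll <- mean(myVector)
print(meanAll)

# na.rm = TRUE drops the NAs before the mean is calculated
meanObserved <- mean(myVector, na.rm = TRUE)
print(meanObserved)
```

Many other summary functions, such as sum(), min(), and max(), accept the same na.rm argument.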
There are two ways to access the values inside a vector. The first is indexing by position, using one or more element numbers in square brackets. For example:
# To print the second element of myVector:
myVector[2]
## [1] 11
# To print the second, third, and fourth elements of myVector:
myVector[2:4]
## [1] 11 8 1
# To print the fourth, fifth, sixth, and fourth (again) elements of myVector:
myVector[c(4, 5, 6, 4)]
## [1] 1 -2 NA 1
# To print all elements except the sixth element:
myVector[-6]
## [1] 5 11 8 1 -2 40
The other way to access elements of a vector is by using a logical (i.e., TRUE/FALSE) vector. For example:
# This command prints a vector of TRUE or FALSE values corresponding to whether each
# element in myVector is greater than zero (TRUE) or not (FALSE). myVector[5] is -2,
# so the fifth element is FALSE, but all others are TRUE, except for myVector[6]
# because it is NA.
myVector > 0
## [1] TRUE TRUE TRUE TRUE FALSE NA TRUE
# When we use a logical condition in square brackets, we extract all vector elements for
# which the condition is TRUE. Where the condition is NA, R returns an NA. Therefore, this
# command prints all elements of myVector that are greater than zero, along with an NA:
myVector[myVector > 0]
## [1] 5 11 8 1 NA 40
# Notice that the -2 was not printed, but the NA was. If we wish to remove the NA, we have
# to use the is.na() function. This returns TRUE if an element is NA, but FALSE
# otherwise:
is.na(myVector)
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE
# If we want to find elements that are *not* NA, we can use the logical NOT operator (!):
!is.na(myVector)
## [1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE
# This prints all elements that are *not* NA:
myVector[!is.na(myVector)]
## [1] 5 11 8 1 -2 40
# Finally, we can combine our conditions using the logical AND (&). This command will
# print all elements of myVector that are greater than zero and that are not NA.
myVector[myVector > 0 & !is.na(myVector)]
## [1] 5 11 8 1 40
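An alternative to combining a condition with !is.na() is the which() function, which returns the positions where a condition is TRUE and silently drops both FALSE and NA. This is a small sketch rather than the only way to do it; positiveIndex and positiveValues are illustrative names.

```r
# Re-create the example vector so this block is self-contained
myVector <- c(5, 11, 8, 1, -2, NA, 40)

# which() returns the positions where the condition is TRUE; FALSE and NA are dropped
positiveIndex <- which(myVector > 0)
print(positiveIndex)

# Indexing by those positions extracts the positive, non-missing values
positiveValues <- myVector[positiveIndex]
print(positiveValues)
```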
In R, we can perform mathematical operations on entire vectors all at one time. For example:
# To add a constant to each element of a vector:
c(1, 4, 5) + 1
## [1] 2 5 6
# To perform element-wise multiplication (or addition, subtraction, division, etc.) of two vectors:
c(1, 4, 5) * c(3, 2, 6)
## [1] 3 8 30
# To take the base-10 logarithm of all elements of a vector:
log10(c(10, 100, 1000))
## [1] 1 2 3
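Adding a constant to a vector works because R "recycles" the shorter vector until it matches the length of the longer one. The sketch below shows the same rule with a two-element vector, and also shows that NA propagates through arithmetic. The names recycled and withMissing are illustrative.

```r
# The shorter vector is recycled: c(10, 100) is used as c(10, 100, 10, 100)
recycled <- c(1, 2, 3, 4) * c(10, 100)
print(recycled)

# Any calculation involving NA gives NA for that element
withMissing <- c(1, NA, 3) + 1
print(withMissing)
```

If the length of the longer vector is not a multiple of the length of the shorter one, R still recycles but issues a warning, which usually signals a mistake.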
To add a new element to the end of a vector, just use an index number that hasn’t been assigned yet.
# myVector contains 7 elements
print(myVector)
## [1] 5 11 8 1 -2 NA 40
# To add an eighth element:
myVector[8] <- 100
print(myVector)
## [1] 5 11 8 1 -2 NA 40 100
# To add ninth and tenth elements:
myVector[9:10] <- c(1000, 10000)
print(myVector)
## [1] 5 11 8 1 -2 NA 40 100 1000 10000
# To add a 15th element (and NAs for elements 11-14):
myVector[15] <- 1000000
print(myVector)
## [1] 5 11 8 1 -2 NA 40 100 1000
## [10] 10000 NA NA NA NA 1000000
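Assigning to an unused index is not the only option. As a sketch of two alternatives, c() can combine an existing vector with new values, and append() can insert values at a chosen position. The vector and the names used here are illustrative.

```r
# A short example vector
shortVector <- c(5, 11, 8)

# c() combines the existing vector with new values at the end
longerVector <- c(shortVector, 100, 1000)
print(longerVector)

# append() can insert values at a chosen position (here, after the first element)
insertedVector <- append(shortVector, 99, after = 1)
print(insertedVector)
```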
In R, a data frame is the equivalent of an Excel
worksheet. A data frame is a list of vectors of the same length. Each
vector is represented as a column, and each vector may be of a different
mode. Data frames can be created using the data.frame()
function. They are also created automatically when you read data from a
file using read.csv(), read.delim(), etc.
# This creates a data frame with six columns and then prints it. Each column is a vector of
# length 5.
(myDataFrame <- data.frame(Name = c("Rosie", "John", "Kate", "Sandra", "Hayden"),
Age = c(19, 20, 18, 26, 70),
Sex = c("F", "M", "F", "F", NA),
Infected = c(TRUE, TRUE, FALSE, TRUE, FALSE),
Height.m = c(1.60, 1.80, 1.54, 1.67, 1.90),
Mass.kg = c(71, 105, 88, 84, 200)))
## Name Age Sex Infected Height.m Mass.kg
## 1 Rosie 19 F TRUE 1.60 71
## 2 John 20 M TRUE 1.80 105
## 3 Kate 18 F FALSE 1.54 88
## 4 Sandra 26 F TRUE 1.67 84
## 5 Hayden 70 <NA> FALSE 1.90 200
Note that in the command above, character values (e.g., in the Name and Sex vectors) needed to be surrounded by quotation marks. Numeric values (e.g., Age) and logical values (e.g., Infected) do not take quotation marks. NA also does not take quotation marks. Incorrect use of quotation marks is a very common error in R.
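To see why quotation marks matter, it can help to check what R thinks a vector contains. The sketch below uses made-up values (ageNumeric, ageQuoted, and ageConverted are illustrative names) to show that quoted numbers are stored as text and that a quoted "NA" is not treated as missing.

```r
# The same values entered without and with quotation marks
ageNumeric <- c(19, 20, 18)
ageQuoted <- c("19", "20", "18")

# class() reports how R has stored each vector
print(class(ageNumeric))
print(class(ageQuoted))

# A quoted "NA" is an ordinary character string, not a missing value
print(is.na(c("F", "M", "NA")))
print(is.na(c("F", "M", NA)))

# as.numeric() converts character strings that look like numbers
ageConverted <- as.numeric(ageQuoted)
print(mean(ageConverted))
```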
We can confirm the dimensions of our data frame using
nrow() and ncol():
nrow(myDataFrame) # Number of rows
## [1] 5
ncol(myDataFrame) # Number of columns
## [1] 6
A single column of a data frame can be accessed using
$:
myDataFrame$Name # Print the Name column
## [1] "Rosie" "John" "Kate" "Sandra" "Hayden"
Columns are vectors, so the regular rules for accessing vectors apply. This prints the names in the second and third rows of the data frame:
myDataFrame$Name[2:3]
## [1] "John" "Kate"
To access more than one column, you have to use square brackets. You can specify the column names or the column numbers:
# Note the comma. Because there is nothing before it, it means "print all rows"
myDataFrame[, c("Infected", "Name")] # Print the Infected column and the Name column
## Infected Name
## 1 TRUE Rosie
## 2 TRUE John
## 3 FALSE Kate
## 4 TRUE Sandra
## 5 FALSE Hayden
myDataFrame[, c(2, 3, 1)] # Print columns #2, #3, and #1
## Age Sex Name
## 1 19 F Rosie
## 2 20 M John
## 3 18 F Kate
## 4 26 F Sandra
## 5 70 <NA> Hayden
# This will print the third and fourth rows of the Name and Infected columns.
myDataFrame[3:4, c("Name", "Infected")]
## Name Infected
## 3 Kate FALSE
## 4 Sandra TRUE
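One detail worth knowing about square brackets is that selecting a single column returns a plain vector rather than a data frame. The following sketch, built on a small made-up data frame (smallFrame and the other names are illustrative), shows how drop = FALSE keeps the data frame structure and how double square brackets work with a column name stored in a variable.

```r
# A small example data frame
smallFrame <- data.frame(Name = c("Rosie", "John", "Kate"),
                         Age = c(19, 20, 18))

# Selecting one column with [ , ] returns a plain vector by default
nameVector <- smallFrame[, "Name"]
print(nameVector)

# drop = FALSE keeps the result as a one-column data frame
nameFrame <- smallFrame[, "Name", drop = FALSE]
print(nameFrame)

# [[ ]] works like $ but accepts a column name stored in a variable
columnWanted <- "Age"
ageVector <- smallFrame[[columnWanted]]
print(ageVector)
```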
You can use the values in a column to filter your data. For example, this will print all rows where the individual’s age is greater than or equal to 20:
# Note the comma. Because there is nothing after it, it means "print all columns"
myDataFrame[myDataFrame$Age >= 20, ]
## Name Age Sex Infected Height.m Mass.kg
## 2 John 20 M TRUE 1.80 105
## 4 Sandra 26 F TRUE 1.67 84
## 5 Hayden 70 <NA> FALSE 1.90 200
As before, we can specify multiple conditions using
&. This prints the Name column for all individuals
whose ages are between 20 and 65:
myDataFrame[myDataFrame$Age >= 20 & myDataFrame$Age <= 65, "Name"]
## [1] "John" "Sandra"
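Missing values need the same care when filtering rows as they do when indexing vectors. In the sketch below (people, withNaRow, and females are illustrative names), one individual has a missing Sex, so the plain comparison produces a row made entirely of NAs. Wrapping the condition in which() is one way to avoid that.

```r
# A small example data frame with one missing value in Sex
people <- data.frame(Name = c("Rosie", "John", "Hayden"),
                     Sex = c("F", "M", NA))

# The comparison is NA for Hayden, so the result includes a row of NAs
withNaRow <- people[people$Sex == "F", ]
print(withNaRow)

# which() keeps only the rows where the condition is TRUE
females <- people[which(people$Sex == "F"), ]
print(females)
```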
To add a new column to an existing data frame, just use
$, name a new column, and assign values to it:
# This creates a new column called "BMI" that is calculated as mass/(height^2).
myDataFrame$BMI <- myDataFrame$Mass.kg/(myDataFrame$Height.m^2)
# This creates a new column called "Obese" that is TRUE if BMI is greater than or equal to
# 30 and FALSE otherwise.
myDataFrame$Obese <- myDataFrame$BMI >= 30 # The comparison already returns TRUE or FALSE
print(myDataFrame)
## Name Age Sex Infected Height.m Mass.kg BMI Obese
## 1 Rosie 19 F TRUE 1.60 71 27.73437 FALSE
## 2 John 20 M TRUE 1.80 105 32.40741 TRUE
## 3 Kate 18 F FALSE 1.54 88 37.10575 TRUE
## 4 Sandra 26 F TRUE 1.67 84 30.11940 TRUE
## 5 Hayden 70 <NA> FALSE 1.90 200 55.40166 TRUE
To sort a data frame by one or more columns, use the
order() function. For example, to sort myDataFrame from
shortest to tallest:
rowOrder <- order(myDataFrame$Height.m)
myDataFrame <- myDataFrame[rowOrder, ]
print(myDataFrame)
## Name Age Sex Infected Height.m Mass.kg BMI Obese
## 3 Kate 18 F FALSE 1.54 88 37.10575 TRUE
## 1 Rosie 19 F TRUE 1.60 71 27.73437 FALSE
## 4 Sandra 26 F TRUE 1.67 84 30.11940 TRUE
## 2 John 20 M TRUE 1.80 105 32.40741 TRUE
## 5 Hayden 70 <NA> FALSE 1.90 200 55.40166 TRUE
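The order() function can also sort in decreasing order and break ties using additional columns. A brief sketch on a made-up data frame follows; people, oldestFirst, and bySexThenAge are illustrative names.

```r
# A small example data frame
people <- data.frame(Name = c("Rosie", "John", "Kate", "Sandra"),
                     Sex = c("F", "M", "F", "F"),
                     Age = c(19, 20, 18, 26))

# Sort from oldest to youngest
oldestFirst <- people[order(people$Age, decreasing = TRUE), ]
print(oldestFirst)

# Sort by Sex first, then by Age within each Sex
bySexThenAge <- people[order(people$Sex, people$Age), ]
print(bySexThenAge)
```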
Finally, we can save a copy of our myDataFrame data frame to disk. If we wanted to save myDataFrame as a CSV file we could use:
write.csv(myDataFrame, file = "myDataFrame.csv", quote = FALSE, row.names = FALSE)
To learn more about the arguments for write.csv(), look
it up in R’s documentation using ?write.csv. You can always
use the question mark in front of a function name to learn more about
it.
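A quick way to check that a file was written as intended is to read it straight back in. The sketch below writes a made-up data frame to a temporary file, so nothing in your working directory is touched; people, csvPath, and peopleFromDisk are illustrative names.

```r
# A small example data frame
people <- data.frame(Name = c("Rosie", "John"),
                     Age = c(19, 20))

# tempfile() returns a throwaway file path
csvPath <- tempfile(fileext = ".csv")
write.csv(people, file = csvPath, row.names = FALSE)

# Read the file back in and inspect it
peopleFromDisk <- read.csv(csvPath)
print(peopleFromDisk)
```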
Before continuing, make sure you feel comfortable accessing vectors and data frames. You may want to make some vectors and data frames of your own to play with them.
The following command removes all of the objects that you currently have in your R environment, giving you a “fresh start”. The same thing can be accomplished by clicking the broom icon in RStudio’s “Environment” tab.
rm(list = ls())
Some R programmers recommend that you put that command at the beginning of every script.
Before you can start working on real data, you need to set your
working directory to the folder where your data is
stored. The working directory is the folder on your computer where R
tries to read and write files by default. If you do not know your
current working directory, type getwd() and R will print
the working directory for you.
The easiest way to change your working directory to a different
location is using the “Files” tab of the RStudio interface. (I will show
you during the tutorial.) Otherwise, you can use the setwd()
function. For example, if I wanted to set my working directory to
~/Documents/BIOL5473/Tutorial1/Data:
setwd("~/Documents/BIOL5473/Tutorial1/Data") # Note: "~/" is a shortcut for the user's home directory
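A cautious pattern when changing directories in a script is to remember the old directory and to check that the expected file is visible before trying to read it. In this sketch, tempdir() stands in for a real data folder (an assumption made so that the block runs on any computer), and originalDir and fileIsThere are illustrative names.

```r
# Remember the current working directory so it can be restored afterwards
originalDir <- getwd()

# tempdir() stands in for a real data folder in this sketch
setwd(tempdir())
print(getwd())

# list.files() shows the files R can see in the working directory
print(list.files())

# file.exists() checks for a specific file before we try to read it
fileIsThere <- file.exists("jeb13353-sup-0002-appendixs1.csv")
print(fileIsThere)

# Restore the original working directory
setwd(originalDir)
```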
The data file we will be using is jeb13353-sup-0002-appendixs1.xlsx. This file contains data on brain size and body size of many mammal species. It also includes information about whether each species hibernates, the latitude of the species range, whether the species is diurnal or nocturnal, etc. For more information on the file, see the README.txt file on D2L.
The data come from the following paper:
The data are saved as an Excel file (.xlsx). It is best to store your data as a comma-separated values (.csv) or a tab-delimited text (.txt) file. Use Microsoft Excel to convert the file to CSV or tab-delimited format. If you are unsure how to do this, please ask.
Your next step is usually to get data to work with. Often this will
be stored in a file. To read in a CSV file, you can use
read.csv(); to read in a tab-delimited file, you can use
read.delim(). For example:
# This reads a csv file called jeb13353-sup-0002-appendixs1.csv, which should be in the
# current working directory
MammalData <- read.csv("jeb13353-sup-0002-appendixs1.csv")
After you load your data, you should always examine it to make sure
that the file has been read in correctly and to ensure that you
understand the structure of the dataset. The str() function
shows us the structure of the object.
str(MammalData)
This shows us that our MammalData object has 1104 rows
and 16 columns. Some columns like “Group” are stored as vectors of
character values (chr). Other columns like “brain.mass” are stored as
numbers with decimal places (num). Other columns like “hibernation” are
stored as whole numbers with no decimal places (int). Our next step
after reading in the data is to make sure that all of these data types
are correct.
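Besides str(), a few other functions are handy for a first look at a dataset. The sketch below uses a tiny made-up data frame rather than the real file, so the values are purely hypothetical; toyData and missingPerColumn are illustrative names.

```r
# A tiny made-up data frame (hypothetical values, not taken from the real dataset)
toyData <- data.frame(Group = c("A", "B", "C"),
                      brain.mass = c(2.1, 85.0, NA),
                      hibernation = c(1L, 0L, 0L))

# Number of rows and columns
print(dim(toyData))

# The first few rows
print(head(toyData))

# A summary of each column
print(summary(toyData))

# The number of missing values in each column
missingPerColumn <- colSums(is.na(toyData))
print(missingPerColumn)
```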
Note that “hibernation” is stored as a number: either 0 or 1. But
hibernation is not really a number; it is a code. In
the README.txt file, you can see that 0 means “does not
hibernate” and 1 means “does hibernate”. In R, a variable that can only
take a finite number of discrete values is called a
“factor”. Each of the possible values of the factor is
called a “level”. Let’s change the hibernation column
from an integer to a factor using the factor()
function.
Here we will tell R that there are two possible values for hibernation (0 and 1), and we will relabel those values as 0 → “No” and 1 → “Yes” so that we don’t get confused. (The labels must be specified in the same order as the levels.) Note: the order in which you specify levels is extremely important. The first level that you specify is called the reference level, and it is the level against which the other levels are compared. For example, your reference level may represent a control group to which all of your treatments are compared.
# Convert the hibernation column and save it over the original column (thereby replacing it)
# Here, we are using 0 ("does not hibernate") as the reference level:
MammalData$hibernation <- factor(MammalData$hibernation,
levels = c(0, 1),
labels = c("No", "Yes"))
Let’s see how our data has changed with str().
str(MammalData)
Note that hibernation is now a factor with two levels: “No” (formerly coded as 0) and “Yes” (formerly coded as 1). The first six entries of the column look like this:
head(MammalData$hibernation)
## [1] No No Yes No No No
## Levels: No Yes
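The same conversion can be tried on a short made-up vector, which makes it easier to see what each step does. In this sketch the codes are hypothetical, and hibernationCodes, hibernationFactor, and hibernationYesFirst are illustrative names. It also shows relevel(), which is one way to choose a different reference level.

```r
# Hypothetical hibernation codes for five species
hibernationCodes <- c(0, 0, 1, 0, 1)

# Convert to a factor, with 0 ("No") as the reference level
hibernationFactor <- factor(hibernationCodes,
                            levels = c(0, 1),
                            labels = c("No", "Yes"))

# levels() lists the levels in order; the first one is the reference level
print(levels(hibernationFactor))

# table() counts the number of observations in each level
print(table(hibernationFactor))

# relevel() returns a copy of the factor with a different reference level
hibernationYesFirst <- relevel(hibernationFactor, ref = "Yes")
print(levels(hibernationYesFirst))
```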
Are there any other columns that should be converted to factors?
It is always a good idea to record the version of R and the versions
of the various packages you are using. This allows others to reproduce
your work exactly, because they know exactly which versions of which
programs you were using. You can obtain this information with the
sessionInfo() function.
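As a short sketch, the commands below print the full session information, followed by two narrower pieces of the same information. The "stats" package is used as the example only because it is installed with R.

```r
# Print the R version, operating system, and attached packages
print(sessionInfo())

# The R version on its own is stored as a character string
print(R.version.string)

# packageVersion() reports the version of a single installed package
print(packageVersion("stats"))
```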
Your task for this week is to become comfortable manipulating data and doing basic plotting. Create an R script that does the following.
hist(MammalData$brain.mass, breaks = "scott", right = FALSE)
will create a decent histogram of brain mass, for example.
Use log() for natural logarithms and log10() for common logarithms.
plot(x = MammalData$body.mass, y = MammalData$brain.mass)
Does it look like a straight line would fit this data well?
After you have done all this, make sure you take time to play around with the data in R. You will not get good at R if you do not play with it and fight through frustration. For example, here are some of the data wrangling tasks you should be able to do:
The data manipulation steps that we have covered here using base R
can also be performed using the dplyr package, which is
part of the “tidyverse”.
dplyr is great for “big data”, and some people find
dplyr easier to use than base R. For more information about
dplyr, see Chapter 5 of R for Data Science here:
https://r4ds.had.co.nz/
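To give a flavour of the difference, here is a sketch that filters and sorts a made-up data frame in base R and then does the same with dplyr's filter() and arrange() functions. The dplyr part only runs if the package is installed, and people, baseResult, and dplyrResult are illustrative names.

```r
# A small example data frame
people <- data.frame(Name = c("Rosie", "John", "Kate", "Sandra", "Hayden"),
                     Age = c(19, 20, 18, 26, 70))

# Base R: keep rows with Age >= 20, then sort by Age
baseResult <- people[people$Age >= 20, ]
baseResult <- baseResult[order(baseResult$Age), ]
print(baseResult)

# dplyr equivalent; this part is skipped if dplyr is not installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyrResult <- dplyr::arrange(dplyr::filter(people, Age >= 20), Age)
  print(dplyrResult)
}
```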