R is a computer programming language that is increasingly used for ‘data science’, that is, the manipulation, summarization, and visualization of large data sets. In this section we will get familiar with the RStudio interface, the different R objects, the use of R scripts and RProjects, and the help function in R.
RStudio is an integrated development environment (IDE) for R. You will see 3 panes, each of which has a different function.
On the left is the console. You can enter commands and receive a response to your commands. The console is useful for trying out code, but will not save code. It is best to work in and write code in R scripts.
The environment/history pane is at the top right. Any R objects that you create or input from other places will be listed here. You can access these data at any time while your RStudio session is open. The history pane contains a list of commands you have previously entered.
The output or plotting pane, located at the bottom right, has many tabs. The first is the files tab, which shows the files available in your working directory. Plots shows you the results of R code you use to make graphs/charts. In the packages tab you can view all installed and loaded packages, and also load and install additional packages. The help tab brings up R help files.
It is best to work and write code in R scripts because you can return to and build on saved code and it gives you more room to work. Open it up either by clicking the File menu, and selecting New File, then R script, using the keyboard shortcut Ctrl + Shift + N, or by clicking the New File drop down
The script is also a great place to build up complex code, annotate the code, and execute the commands. There are many time-saving keyboard shortcuts that will allow you to use the script editor effectively:
The working directory is where your analysis lives. This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save. RStudio shows the working directory at the top of the console. You can also print this out by using getwd():
getwd()
## [1] "C:/Users/akeever/Documents/Classes/AdvancedSpatialEcology/RBasics"
You can also set your working directory within R:
setwd("path/to/my/directory")
You should never use absolute paths (i.e., paths that point to the same place regardless of your working directory) because it hinders sharing. Instead, use relative paths to your working directory: filepath = "~/Results" will save or search for files in the Results folder within your working directory.
Arguably the best way to organize an analysis is to use R Projects. It keeps all files associated with a project together: input data, R scripts, results, and figures. It also makes it easy to share because R Projects are associated with R working directories. You can create an RStudio Project:
Make a project to use for this class. Click File > New Project > New Directory > Empty Project. Name your project and decide which sub directory you want to put the project in.
Check that the working directory of your project is the current working directory. Whenever you refer to a file with a relative path it will look for it here.
Your project will now be displayed in the Projects toolbar, with some additional project options:
R objects are essentially a symbol that is used in place of another value. That value can be a single number or a larger form of data. We can tell R to store an entire data frame in an objected named x:
x <- data.frame(y = rnorm(n = 5, mean = 1.5, sd = 0.5),
age = rep(x = c(0, 1), length.out = 5),
sex = sample(x = c("M", "F"), size = 5, replace = TRUE,
prob = c(0.4, 0.6)))
x
## y age sex
## 1 1.4313948 0 M
## 2 0.5681779 1 M
## 3 1.5986439 0 M
## 4 1.2290741 1 M
## 5 1.8274965 0 M
There are different types of objects, and we will cover the most common objects in R. First, some coding best practices according to Hadley Wickham & Garrett Grolemund in "R for Data Science":
<- and not =. To make it quicker, use the shortcut Alt + -snake_caseA vector is a sequence of values of the same data type (e.g, numeric, character, logical). You can create vectors in a few different ways:
# Make a vector of numbers from 1 to 10
x <- 1:10
# Display the object named x
x
## [1] 1 2 3 4 5 6 7 8 9 10
# This action can be shortened by surrounding the assignment with parentheses, which
# causes assignment and "print screen" to happen.
(x <- c(1, 2, 3, 4, 5))
## [1] 1 2 3 4 5
# What is the length of our new vector?
length(x)
## [1] 5
# Make a sequence of numbers from 1 to 10 that increases in increments of 0.1
(x <- seq(1, 10, by = 0.1))
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4
## [16] 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
## [31] 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1 5.2 5.3 5.4
## [46] 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9
## [61] 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4
## [76] 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9
## [91] 10.0
# Make a character vector by combining c() the letters a, b and c
(x <- c("a", "b", "c"))
## [1] "a" "b" "c"
# Make a character vector using the first 10 letters of the alphebet
(x <- letters[1:10])
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
# Make a logical vector
(x <- c(T, F, T, T))
## [1] TRUE FALSE TRUE TRUE
An array can be considered a multi-dimensional collection of values. A matrix is a 2 dimensional array (think rows and columns). It is important to note that although matrices and data frames are both 2 dimensional, they are quite different. All values of a matrix must be the same data type (i.e., all numeric or all character). Data frames can have different data types in different columns.
# Try to make a matrix with numbers and letters. Notice how it converts all to a
# character
matrix(c(1,2,3, "a", "b", "c"), nrow = 3)
## [,1] [,2]
## [1,] "1" "a"
## [2,] "2" "b"
## [3,] "3" "c"
# Now try with a dataframe. Notice how the numbers are still numeric
str(data.frame(num = 1:3, chrt = c("a", "b", "c")))
## 'data.frame': 3 obs. of 2 variables:
## $ num : int 1 2 3
## $ chrt: chr "a" "b" "c"
Another difference is that R contains many operators and functions that are available only for matrices.
# First create 2 matrices of the same size
(A <- matrix(1:9, nrow = 3))
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
(B <- matrix(10:18, nrow = 3, byrow = TRUE))
## [,1] [,2] [,3]
## [1,] 10 11 12
## [2,] 13 14 15
## [3,] 16 17 18
# Now we can transpose the matrix
t(A)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
# We can calculate element by element products
A * B
## [,1] [,2] [,3]
## [1,] 10 44 84
## [2,] 26 70 120
## [3,] 48 102 162
# We can calculate the matrix product
A %*% B
## [,1] [,2] [,3]
## [1,] 174 186 198
## [2,] 213 228 243
## [3,] 252 270 288
# We can extract the number on the diagonal
diag(A)
## [1] 1 5 9
# You can easily calculate the eigenvalues and eigenvectors
eigen(A)
## eigen() decomposition
## $values
## [1] 1.611684e+01 -1.116844e+00 -5.700691e-16
##
## $vectors
## [,1] [,2] [,3]
## [1,] -0.4645473 -0.8829060 0.4082483
## [2,] -0.5707955 -0.2395204 -0.8164966
## [3,] -0.6770438 0.4038651 0.4082483
# You can add rows or columns
cbind(A, B)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 4 7 10 11 12
## [2,] 2 5 8 13 14 15
## [3,] 3 6 9 16 17 18
rbind(A, B)
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
## [4,] 10 11 12
## [5,] 13 14 15
## [6,] 16 17 18
Arrays function similarly to matrices, except there can be more than 2 dimensions.
# Create a 3 dimensional array 3 x 3 x 2
(A <- array(1:18, dim = c(3, 3, 2)))
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 10 13 16
## [2,] 11 14 17
## [3,] 12 15 18
# Extract the first element of the 3rd dimension
A[,,1]
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
# Extract the first row
A[1,,]
## [,1] [,2]
## [1,] 1 10
## [2,] 4 13
## [3,] 7 16
# Extract the first column for the second element of the 3rd dimension
A[,1,2]
## [1] 10 11 12
# To get a vector of all the values in a matrix or array, use the concatenate function
c(A)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
A data frame is essentially a 2D (rows and columns) collection of variables (in the columns) and observations (in the rows). Each variable is in fact a vector, containing a single type of data. The variables in a data frame can be of different data types, but they must have the same number of observations. To make a data frame it is best to name each variable:
# Make a data frame with 3 variables: y, age, and sex. Each variable will have 20
# observations.
(x <- data.frame(y = rnorm(n = 20, mean = 1.5, sd = 0.5),
age = rep(x = c(0, 1), length.out = 20),
sex = sample(x = c("M", "F"), size = 20, replace = TRUE,
prob = c(0.4, 0.6))))
## y age sex
## 1 2.0776650 0 F
## 2 0.9198927 1 F
## 3 1.3173901 0 F
## 4 1.2253713 1 F
## 5 2.5078769 0 M
## 6 2.0132959 1 F
## 7 1.6354645 0 M
## 8 1.6593715 1 F
## 9 0.8676595 0 F
## 10 1.4309889 1 M
## 11 2.0832852 0 F
## 12 2.2964562 1 F
## 13 1.6902128 0 F
## 14 0.9660068 1 M
## 15 1.4374551 0 M
## 16 1.1502599 1 M
## 17 1.2103359 0 M
## 18 0.7299933 1 F
## 19 1.6616790 0 M
## 20 2.4410713 1 F
# You can also coerce other R objects into a data frame
(y <- 1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
as.data.frame(y)
## y
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
(A <- matrix(1:9, nrow = 3))
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
as.data.frame(A)
## V1 V2 V3
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9
# You can view only the first few observations of a data frame
head(x)
## y age sex
## 1 2.0776650 0 F
## 2 0.9198927 1 F
## 3 1.3173901 0 F
## 4 1.2253713 1 F
## 5 2.5078769 0 M
## 6 2.0132959 1 F
head(x, n = 10)
## y age sex
## 1 2.0776650 0 F
## 2 0.9198927 1 F
## 3 1.3173901 0 F
## 4 1.2253713 1 F
## 5 2.5078769 0 M
## 6 2.0132959 1 F
## 7 1.6354645 0 M
## 8 1.6593715 1 F
## 9 0.8676595 0 F
## 10 1.4309889 1 M
head(x, n = -2)
## y age sex
## 1 2.0776650 0 F
## 2 0.9198927 1 F
## 3 1.3173901 0 F
## 4 1.2253713 1 F
## 5 2.5078769 0 M
## 6 2.0132959 1 F
## 7 1.6354645 0 M
## 8 1.6593715 1 F
## 9 0.8676595 0 F
## 10 1.4309889 1 M
## 11 2.0832852 0 F
## 12 2.2964562 1 F
## 13 1.6902128 0 F
## 14 0.9660068 1 M
## 15 1.4374551 0 M
## 16 1.1502599 1 M
## 17 1.2103359 0 M
## 18 0.7299933 1 F
# You can do the same for the last few observations of a data frame
tail(x)
## y age sex
## 15 1.4374551 0 M
## 16 1.1502599 1 M
## 17 1.2103359 0 M
## 18 0.7299933 1 F
## 19 1.6616790 0 M
## 20 2.4410713 1 F
tail(x, n = 5)
## y age sex
## 16 1.1502599 1 M
## 17 1.2103359 0 M
## 18 0.7299933 1 F
## 19 1.6616790 0 M
## 20 2.4410713 1 F
tail(x, n = -10)
## y age sex
## 11 2.0832852 0 F
## 12 2.2964562 1 F
## 13 1.6902128 0 F
## 14 0.9660068 1 M
## 15 1.4374551 0 M
## 16 1.1502599 1 M
## 17 1.2103359 0 M
## 18 0.7299933 1 F
## 19 1.6616790 0 M
## 20 2.4410713 1 F
# You can subset a data frame similar to subsetting a matrix
x[,2]
## [1] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
x[1:5,]
## y age sex
## 1 2.0776650 0 F
## 2 0.9198927 1 F
## 3 1.3173901 0 F
## 4 1.2253713 1 F
## 5 2.5078769 0 M
# But you can also subset by the name of the variables
x$age
## [1] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
We will go into more detail about managing and manipulating data frames later. For now, it is enough to know what a data frame is and how to extract variables and observations from it.
A list is an ordered collection of different R objects, known as components. The components do not need to be of the same data type or dimensions.
The components of a list are always numbered and may be referred to as such. Thus, if my_list is the name of list with 4 components, these may be called by my_list[[1]], my_list[[2]], my_list[[3]], and my_list[[4]]. Let's look more closely at lists:
# First, we will make a list of all our saved elements
(my_list <- list(A, B, x, y))
## [[1]]
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 10 11 12
## [2,] 13 14 15
## [3,] 16 17 18
##
## [[3]]
## y age sex
## 1 2.0776650 0 F
## 2 0.9198927 1 F
## 3 1.3173901 0 F
## 4 1.2253713 1 F
## 5 2.5078769 0 M
## 6 2.0132959 1 F
## 7 1.6354645 0 M
## 8 1.6593715 1 F
## 9 0.8676595 0 F
## 10 1.4309889 1 M
## 11 2.0832852 0 F
## 12 2.2964562 1 F
## 13 1.6902128 0 F
## 14 0.9660068 1 M
## 15 1.4374551 0 M
## 16 1.1502599 1 M
## 17 1.2103359 0 M
## 18 0.7299933 1 F
## 19 1.6616790 0 M
## 20 2.4410713 1 F
##
## [[4]]
## [1] 1 2 3 4 5 6 7 8 9 10
# Notice how they are numbered. We can extract matrix B, and the first row like this
my_list[[2]]
## [,1] [,2] [,3]
## [1,] 10 11 12
## [2,] 13 14 15
## [3,] 16 17 18
my_list[[2]][1,]
## [1] 10 11 12
# Lets get the age variable of data frame x
my_list[[3]]$age
## [1] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
# Get the number of top-level components
length(my_list)
## [1] 4
# It is better practice to name the components of your list. It makes it easier to
# get the right component. We can make a named list:
(my_list <- list(A = A, B = B, x = x, y = y))
## $A
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## $B
## [,1] [,2] [,3]
## [1,] 10 11 12
## [2,] 13 14 15
## [3,] 16 17 18
##
## $x
## y age sex
## 1 2.0776650 0 F
## 2 0.9198927 1 F
## 3 1.3173901 0 F
## 4 1.2253713 1 F
## 5 2.5078769 0 M
## 6 2.0132959 1 F
## 7 1.6354645 0 M
## 8 1.6593715 1 F
## 9 0.8676595 0 F
## 10 1.4309889 1 M
## 11 2.0832852 0 F
## 12 2.2964562 1 F
## 13 1.6902128 0 F
## 14 0.9660068 1 M
## 15 1.4374551 0 M
## 16 1.1502599 1 M
## 17 1.2103359 0 M
## 18 0.7299933 1 F
## 19 1.6616790 0 M
## 20 2.4410713 1 F
##
## $y
## [1] 1 2 3 4 5 6 7 8 9 10
# Now we can extract components like this:
my_list$A
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
my_list$y
## [1] 1 2 3 4 5 6 7 8 9 10
my_list$y[4]
## [1] 4
# It is important to distinguish between [[]] and []. [[]] is used to select a
# single element, whereas [] is a general subscripting operator.
my_list[[2]]
## [,1] [,2] [,3]
## [1,] 10 11 12
## [2,] 13 14 15
## [3,] 16 17 18
my_list[2]
## $B
## [,1] [,2] [,3]
## [1,] 10 11 12
## [2,] 13 14 15
## [3,] 16 17 18
We are going to learn more about R and the general workflow for running analyses in R by performing some simple linear regressions. We assume you have a basic understanding of linear regression and statistics.
If you haven't already, create a project for this class. Within that working directory (i.e., R project folder) create a new folder called RBasics. You can do so in the FILE Pane of R Studio. Start a new R script and save it in your new folder.
We want to know what factors influence the amount (in kilograms) of treasure a niffler can hold. We think that body size, sex, and age all influence the amount of treasure a niffler can hold. First, we will create a data frame with 3 variables (body size, sex, age), and then we will simulate data (kg of treasure).
Sex is the easiest, so we will start there. We will assume we have a sample size of 75 nifflers with slightly more males than females. There are a few different ways we can do this:
# Sex of the niffler
c(rep("Male", 40), rep("Female", 35))
## [1] "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male"
## [9] "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male"
## [17] "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male"
## [25] "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male"
## [33] "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male"
## [41] "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female"
## [49] "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female"
## [57] "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female"
## [65] "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female"
## [73] "Female" "Female" "Female"
rep(c("Male", "Female"), each = 40, length.out = 75)
## [1] "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male"
## [9] "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male"
## [17] "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male"
## [25] "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male"
## [33] "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male"
## [41] "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female"
## [49] "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female"
## [57] "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female"
## [65] "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female"
## [73] "Female" "Female" "Female"
# Or, a more flexible option is to take a sample with different probabilities of
# getting a male or female
sample(c("Male", "Female"), 75, replace = TRUE, prob = c(0.6, 0.4))
## [1] "Male" "Male" "Male" "Female" "Female" "Female" "Female" "Male"
## [9] "Male" "Male" "Male" "Female" "Female" "Male" "Female" "Female"
## [17] "Male" "Male" "Male" "Male" "Female" "Male" "Male" "Female"
## [25] "Female" "Male" "Female" "Female" "Male" "Male" "Male" "Male"
## [33] "Female" "Female" "Male" "Male" "Female" "Male" "Male" "Male"
## [41] "Male" "Female" "Male" "Male" "Female" "Female" "Male" "Female"
## [49] "Female" "Male" "Female" "Female" "Female" "Female" "Male" "Female"
## [57] "Male" "Female" "Male" "Female" "Female" "Female" "Female" "Male"
## [65] "Male" "Male" "Male" "Male" "Female" "Female" "Male" "Female"
## [73] "Male" "Male" "Female"
Now for age. Nifflers can live up to 10 years old, but most don't make it past the age of 8. How can we simulate age? You could use the rep(0:10, each = 7, length.out = 75) function, or try taking a sample of different ages with different probabilities like we did with sex sample(0:10, 75, replace = TRUE, prob = c(0.15, 0.12, 0.12, 0.11, 0.11, 0.11, 0.10, 0.08, 0.06, 0.03, 0.01). This would be a pain, and it is easier to rely on the statistical distributions in R to create this:
# Using the uniform distribution to create age, but we have to round the
# values to the nearest whole number:
(age <- round(runif(n = 75, min = 0, max = 10), 0))
## [1] 2 7 7 5 5 8 9 3 7 10 2 2 6 3 4 2 7 9 3 9 3 2 3 9 3
## [26] 1 10 2 9 9 5 7 0 6 5 10 3 9 6 3 0 2 1 1 1 8 7 9 4 5
## [51] 6 7 5 6 9 4 4 10 2 4 10 8 8 2 2 3 1 5 1 6 9 1 6 6 1
# But this doesn't capture the idea that there will be fewer old nifflers
plot(age, dunif(age, min = 0, max = 10), ylab = "Frequency", xlab = "Age")
# Fortunately, R has a great many built-in distributions to play with. Our best
# option is the Poisson distribution, negative-binomial distribtion, geometric
# distribution or a similar one that is discrete (i.e., whole numbers).
# Poisson
(age <- rpois(n = 75, lambda = 2.5))
## [1] 2 2 2 4 2 5 10 2 3 3 3 3 3 5 1 0 3 1 5 1 3 3 3 0 2
## [26] 2 2 7 5 3 0 5 4 4 2 4 2 1 2 3 4 0 5 0 5 5 2 2 1 2
## [51] 4 0 0 2 1 6 3 5 7 2 6 1 4 5 4 1 3 5 2 1 0 4 1 1 2
plot(age, dpois(age, lambda = 2.5), ylab = "Frequency", xlab = "Age")
# The problem with the Poisson is that the variance equals the mean, so the negative
# binomial would be better in this instance because we could include greater variance
# in the age of animals. Also, you will notice that we have fewer 0-1 year olds than
# 2-3 year olds. Normally we would expect the largest age class to be age
# 0. We can also improve that with the negative binomial.
(age <- rnbinom(n = 75, mu = 2.5, size = 2.5))
## [1] 3 0 2 6 4 10 2 1 3 2 2 1 2 1 3 0 2 3 6 3 2 2 0 2 4
## [26] 1 5 3 1 3 0 2 2 2 5 7 2 3 2 1 2 3 3 0 3 3 2 2 2 0
## [51] 2 4 3 2 0 1 5 1 3 2 0 2 2 10 1 0 1 0 1 3 1 0 2 1 7
plot(age, dnbinom(age, mu = 2.5, size = 2.5), ylab = "Frequency", xlab = "Age")
# The negative binomial looks a little better, but lets also check out the geometric
# distribution just for fun.
(age <- rgeom(n = 75, prob = 0.3))
## [1] 2 0 0 3 4 9 11 5 0 0 9 0 0 2 0 4 5 2 1 1 3 0 1 2 5
## [26] 1 0 1 3 0 2 0 0 2 2 0 1 3 6 1 7 6 4 5 3 1 0 6 1 3
## [51] 0 0 1 1 2 0 5 2 4 0 3 0 2 0 0 9 1 4 1 0 3 5 7 1 1
plot(age, dgeom(age, prob = 0.3), ylab = "Frequency", xlab = "Age")
Last, we need to simulate body size. Sure, we could just use rnorm() to simulate body size randomly for each niffler, but we know better. We know that nifflers grow as they get older and don't reach maturity until the age of 3-4. We also know that the males tend to be a little bigger, therefore we cannot simply assign a random body size to each niffler. Body size is a function of age and sex with some random noise. It is just like a linear model. In fact, we could run a linear model to predict body size if we wanted. For now, however, lets just simulate the data. The linear model will look something like this: body size ~ b0 + b1 * sex + b2 * age. This will allow for males to be larger than females, and for older animals to be larger than young ones. What if males and females grow at a different rate? Well that would be an INTERACTION, and we aren't going to mess with that, yet.
# The 2 sexes as a binary variable (0:females and 1:males)
sex <- 0:1
# Age ranges
age <- 0:10
# Make a data frame by expanding sex and age to include all combinations
(datum <- expand.grid(sex = sex, age = age))
## sex age
## 1 0 0
## 2 1 0
## 3 0 1
## 4 1 1
## 5 0 2
## 6 1 2
## 7 0 3
## 8 1 3
## 9 0 4
## 10 1 4
## 11 0 5
## 12 1 5
## 13 0 6
## 14 1 6
## 15 0 7
## 16 1 7
## 17 0 8
## 18 1 8
## 19 0 9
## 20 1 9
## 21 0 10
## 22 1 10
# Beta coefficient values - b0: the intercept, or the average weight of females
# at age 0
b0 <- 1.25
# Beta coefficient values - b.sex: the difference in weight between females and males
b.sex <- 1.9
# Beta coefficient values - b.age: slope of age-size relationship
b.age <- 0.75
# Body size prediction
datum$pred <- b0 + b.sex * datum$sex + b.age * datum$age
# Let's visualize it
library(ggplot2)
(p <- ggplot(data = datum, aes(x = age, y = pred, color = factor(sex))) +
geom_line() +
scale_x_continuous(name = "Age", breaks = seq(0, 10, by = 1)) +
scale_y_continuous(name = "Body size", breaks = seq(0, 15, by = 0.5)) +
theme_classic() +
theme(legend.position = "none"))
# As you know, this is only our prediction, and does not account for the random
# noise. Remember, tacked on the end of every linear model is the random error
# term: body size ~ b0 + b1 * sex + b2 * age + E; E ~ Normal(0, sd)
datum$body_size <- rnorm(n = nrow(datum), mean = datum$pred, sd = 1.2)
# Add the actual body size to our plot
p + geom_point(data = datum, aes(x = age, y = body_size, color = factor(sex)))
# You will see that some weigh nothing, this is not realistic. So we will truncate
# our data at 0.75, the smallest weight a niffler can be.
datum$body_size[datum$body_size < 0] <- 0.75
Now we can put it all together to create a data frame of our simulated niffler data set. Save your version of the R script to simulate niffler data. Feel free to use your own values for probabilities and beta coefficients. Note that your version could look very different from mine.
I am also going to use set.seed(1) to make sure all my random values will be the same every time. This makes it easier to recreate things when you use R for random number generation. Set it before you create your data frame.
head(niffler_datum)
## sex age body_size
## 1 1 1 2.175610
## 2 1 0 1.810351
## 3 1 3 4.182304
## 4 0 6 4.475047
## 5 1 2 5.687787
## 6 0 2 2.845916
Now that we have our covariate data, we can simulate our response data. We will do this similarly to how we created our body size data.
# Set our coefficient values
b0 <- - 0.10
b.age <- 0.25
b.sex <- 0.10
b.bs <- 0.75
# Now to make the data. We will use dplyr because it is the best.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Here we are telling R to take our original data frame and add a new variable called
# treasure_kg. For the new variable we want R to use the normal distribution to
# generate 75 random numbers with a mean of our linear model prediction and a
# standard deviation of 1.15.
niffler_datum <- niffler_datum %>%
mutate(treasure_kg = rnorm(n = 75,
mean = b0 + b.age * age + b.sex * sex + b.bs * body_size,
sd = 1.15))
# Get rid of any negative values. If any of your body size values are negative,
# fix those as well. Really, we should have used a different distribution that
# only generated positive values, but let's just keep things simple.
niffler_datum$treasure_kg[niffler_datum$treasure_kg < 0] <- 0
# Visualize it
ggplot(data = niffler_datum,
aes(x = body_size, y = treasure_kg, color = factor(age))) +
geom_point() +
scale_x_continuous(name = "Body size", breaks = seq(0, 15, by = 1)) +
scale_y_continuous(name = "Treasure", breaks = seq(0, 15, by = 0.5)) +
facet_wrap(~ sex)+
theme_classic()
Now that we simulated our niffler data set, let's explore it. Before running any models it is important to look at your data. First we want to look at the structure of the data frame. This will be very important when you load data into R, particularly for date/time data. We will talk more about that later in the semester.
# Use str() to look at the structure of your data frame
str(niffler_datum)
## 'data.frame': 75 obs. of 4 variables:
## $ sex : num 1 1 1 0 1 0 0 0 0 1 ...
## $ age : int 1 0 3 6 2 2 2 3 0 6 ...
## $ body_size : num 2.18 1.81 4.18 4.48 5.69 ...
## $ treasure_kg: num 1.924 0.132 3.613 6.476 6.114 ...
# You will see that niffler_datum is a data frame, and that all the variables are
# numeric (age is an interger, which of course is numeric). The thing is, we don't
# want sex to be numeric. Technically, it is a binary variable representing 2
# categories. We can easily change this to a factor.
factor(niffler_datum$sex)
## [1] 1 1 1 0 1 0 0 0 0 1 1 1 0 1 0 1 0 0 1 0 0 1 0 1 1 1 1 1 0 1 1 1 1 1 0 0 0 1
## [39] 0 1 0 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 1 0 1 1 0 1 0 1 0 1 1 1
## Levels: 0 1
# Notice that levels (i.e., the values that the factor can take) are 0 and 1. Now
# we will change our data frame
niffler_datum$sex <- factor(niffler_datum$sex)
str(niffler_datum)
## 'data.frame': 75 obs. of 4 variables:
## $ sex : Factor w/ 2 levels "0","1": 2 2 2 1 2 1 1 1 1 2 ...
## $ age : int 1 0 3 6 2 2 2 3 0 6 ...
## $ body_size : num 2.18 1.81 4.18 4.48 5.69 ...
## $ treasure_kg: num 1.924 0.132 3.613 6.476 6.114 ...
head(niffler_datum$sex)
## [1] 1 1 1 0 1 0
## Levels: 0 1
Another thing you should always do before running models is looking at the correlation between your covariates. Highly correlated covariates should not be included in the same model. This can be done easily in R.
# The cor() function comes in the stats package that is included in base R,
# but the problem is it can only use numeric data. We will be using the dplyr and
# tidyr package to help us work around this in 2 ways.
# First, we will just exclude sex, the factor, using select()
library(dplyr)
library(tidyr)
# We are telling the cor function to use the niffler data frame and select all the
# variables EXCEPT sex and our response variable.
cor(niffler_datum %>%
select(-sex, -treasure_kg), use = "complete.obs")
## age body_size
## age 1.0000000 0.5995262
## body_size 0.5995262 1.0000000
# Age and body size are correlated, but not perfectly. The correlation coefficient
# is under 0.65 (at least for mine), so we will include them both in a model. If
# yours is not, do you know why that is? Try playing with different seed values
# until you get a correlation coefficient under 0.65. I used set.seed(1).
# Now, what about sex?
# We will use dplry and tidyr to temporarily change sex back to numeric without
# actually changing the data
cor(niffler_datum %>%
mutate(sex = as.numeric(sex)) %>%
select(-treasure_kg), use = "complete.obs")
## sex age body_size
## sex 1.0000000 -0.1119856 0.3857117
## age -0.1119856 1.0000000 0.5995262
## body_size 0.3857117 0.5995262 1.0000000
A great package for looking at correlation is library(PerformanceAnalytics). I like this one because it also allows you to visualize your data, another important step.
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
chart.Correlation(niffler_datum %>%
mutate(sex = as.numeric(sex)) %>%
select(-treasure_kg),
histogram = TRUE, pch = 19)
Time to run some models! Again, this class is not focused on stats but on R and how to use it effectively.
# Let's just run the global model
m <- lm(treasure_kg ~ sex + body_size + age, data = niffler_datum)
summary(m)
##
## Call:
## lm(formula = treasure_kg ~ sex + body_size + age, data = niffler_datum)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.47114 -0.73207 -0.05099 0.66538 3.00350
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1859 0.3062 -0.607 0.545644
## sex1 0.1961 0.3440 0.570 0.570488
## body_size 0.7083 0.1058 6.694 4.26e-09 ***
## age 0.4241 0.1080 3.926 0.000198 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.192 on 71 degrees of freedom
## Multiple R-squared: 0.7445, Adjusted R-squared: 0.7337
## F-statistic: 68.95 on 3 and 71 DF, p-value: < 2.2e-16
So how did we do? Our estimates should be close to the beta coefficient values we used to generate the data. We will make a table we can export with our estimates and the true values.
beta_coeffs <- as.data.frame(summary(m)$coefficients) %>%
mutate(truth = c(b0, b.sex, b.bs, b.age),
param = c("Intercept", "Sex", "Body size", "Age"))
beta_coeffs
## Estimate Std. Error t value Pr(>|t|) truth param
## 1 -0.1859093 0.3061665 -0.6072164 5.456444e-01 -0.10 Intercept
## 2 0.1960928 0.3440320 0.5699842 5.704876e-01 0.10 Sex
## 3 0.7082604 0.1058119 6.6935824 4.260765e-09 0.75 Body size
## 4 0.4240912 0.1080293 3.9257065 1.979156e-04 0.25 Age
# T values and P values are not important, so I am going to drop them and reorder
(beta_coeffs <- beta_coeffs %>%
select(param, truth, Estimate, 'Std. Error') %>%
rename(estimate = Estimate, se = 'Std. Error'))
## param truth estimate se
## 1 Intercept -0.10 -0.1859093 0.3061665
## 2 Sex 0.10 0.1960928 0.3440320
## 3 Body size 0.75 0.7082604 0.1058119
## 4 Age 0.25 0.4240912 0.1080293
# Not bad. Not perfect, but nothing ever is. Try running models with some of the
# covariates missing and see what that does to your estimates.
Now we can save this output as a CSV file so we don't have to manually write values into a Word Table when we want to share our results. This is a very simple step. We will save it in your Rbasics folder in your R project for this class.
# You can save it as a CSV
write.csv(beta_coeffs, file = "Rbasics/BetaEstimates.csv")
# For things that are not in table or data frame format, you can save them as R data.
# This is convienent if you run a model that takes a long time and you want to save
# the output, or if you want to save multiple things.
save(niffler_datum, m, beta_coeffs, file = "Rbasics/NifflerModelResults.RData")
# Now try clearing your workspace and loading the R data you saved.
rm(niffler_datum, m, beta_coeffs)
load("Rbasics/NifflerModelResults.RData")
When writing code you should never copy-and-paste code chunks multiple times. Instead, you should try to reduce duplication in your code by using iteration or functions. Iteration helps you when you need to do the same thing to multiple inputs (e.g., repeating the same operation on different columns or different data sets). Functions help to reduce duplication by identifying repeated patterns of code and extracting them into independent pieces that can be updated and reused.
Reducing duplication in your code makes it easier to read, makes it easier to change or alter requirements because you only need to make the change in one place rather than changing it every where you copied-and-pasted the code, and it makes it easier to avoid and to fix human-error in writing code.
The for loop is a staple for R, and the function makes it easy to perform an operation over a loop of values. Every loop has 3 components:
c(). The best way to do this is to create an empty vector of the right length.i in seq_along(datum). i is used like a pronoun, and for each iteration i will be assigned a different value from seq_along(datum).Let's make a simple for loop to simulate abundance of a population over time. We will use geometric growth equation (\(N_t = N_{t-1} * \lambda\)) with a starting population of 10, a per capita growth rate of 1.03, and simulate the population for 50 years.
First, simulate the population for only 1 year.
# Simulate the population for 1 year
(Nt <- 10 * 1.03)
## [1] 10.3
# Now if we want to simulate the 2nd year, we need to use the abundance estimate
# from year 1, Nt
(Nt2 <- Nt * 1.03)
## [1] 10.609
We can easily make a for loop to iterate this process for 50 years
# First, set up an empty vector for our output. Remember, it is best to allocate
# the space needed at the beginning
N <- vector(mode = "numeric", length = 50)
# Set the first year abundance to our initial population size of 10
N[1] <- 10
# Now we need to loop over N to calculate abundance each year, but we already have
# year 1 abundnace, so we want to go from year 2 to 50.
# This says, for each i (year) in 2 to 50 do the following chunk of code
for(i in 2:50){
# This says, N for year i = N for year i - 1 times lambda, i.e., our equation
N[i] <- N[i-1] * 1.03
}
# Lets look at N
N
## [1] 10.00000 10.30000 10.60900 10.92727 11.25509 11.59274 11.94052 12.29874
## [9] 12.66770 13.04773 13.43916 13.84234 14.25761 14.68534 15.12590 15.57967
## [17] 16.04706 16.52848 17.02433 17.53506 18.06111 18.60295 19.16103 19.73587
## [25] 20.32794 20.93778 21.56591 22.21289 22.87928 23.56566 24.27262 25.00080
## [33] 25.75083 26.52335 27.31905 28.13862 28.98278 29.85227 30.74783 31.67027
## [41] 32.62038 33.59899 34.60696 35.64517 36.71452 37.81596 38.95044 40.11895
## [49] 41.32252 42.56219
Sometimes you don't know how long the input sequence should be iterated for. For example, say we wanted to simulate the population above until we got an abundance of 100 individuals. You can't do that sort of iteration with a for loop and need to use a while loop. A while loop is more general, and until the condition is met, the body of code will continue to iterate. We will turn our for loop above into a while loop to get a population of 100 individuals.
# The while loop has 2 components, a condtion and the body of the code
n <- 10
N <- NULL
# The condition is N < 100, while abundance is under 100, iteratre the following
# chunk of code
while(n < 100){
N <- c(N, n)
n <- n * 1.03
}
# Look at final population size, and the vector that held all of our abundances
n
## [1] 100.3006
N
## [1] 10.00000 10.30000 10.60900 10.92727 11.25509 11.59274 11.94052 12.29874
## [9] 12.66770 13.04773 13.43916 13.84234 14.25761 14.68534 15.12590 15.57967
## [17] 16.04706 16.52848 17.02433 17.53506 18.06111 18.60295 19.16103 19.73587
## [25] 20.32794 20.93778 21.56591 22.21289 22.87928 23.56566 24.27262 25.00080
## [33] 25.75083 26.52335 27.31905 28.13862 28.98278 29.85227 30.74783 31.67027
## [41] 32.62038 33.59899 34.60696 35.64517 36.71452 37.81596 38.95044 40.11895
## [49] 41.32252 42.56219 43.83906 45.15423 46.50886 47.90412 49.34125 50.82149
## [57] 52.34613 53.91651 55.53401 57.20003 58.91603 60.68351 62.50402 64.37914
## [65] 66.31051 68.29983 70.34882 72.45929 74.63307 76.87206 79.17822 81.55357
## [73] 84.00017 86.52018 89.11578 91.78926 94.54293 97.37922
Apply functions (and the map functions in the purr package) can basically replace most for loops. The idea is that you pass a function to the apply or map functions to iterate code.
Different apply functions work on different R objects. apply() takes a data frame as input and can be applied to rows or columns, lapply() uses lists as input and returns a list, sapply() takes a list and returns a simple vector, vapply() takes a list and lets you specify the output desired, and tapply() is for vector inputs. Let's look at apply() more closely.
# We want to get the sum of rows for a data frame
# Make the dataframe
(df <- data.frame(x1 = 1:5, x2 = 2:6, x3 = 3))
## x1 x2 x3
## 1 1 2 3
## 2 2 3 3
## 3 3 4 3
## 4 4 5 3
## 5 5 6 3
# Get the sum of the rows
apply(df, 1, sum)
## [1] 6 8 10 12 14
# The arguments are the data, whether it is applied to the rows(1) or the columns(2),
# and the function to be applied. If we want to get the sum of the columns...
apply(df, 2, sum)
## x1 x2 x3
## 15 20 15
# We can also specify our own function
apply(df, 1, function(x) sum(x)^2)
## [1] 36 64 100 144 196
# And apply our own function to every element
apply(df, 1, function(x) x^2)
## [,1] [,2] [,3] [,4] [,5]
## x1 1 4 9 16 25
## x2 4 9 16 25 36
## x3 9 9 9 9 9
apply(df, 1, function(x) paste0("a", x))
## [,1] [,2] [,3] [,4] [,5]
## [1,] "a1" "a2" "a3" "a4" "a5"
## [2,] "a2" "a3" "a4" "a5" "a6"
## [3,] "a3" "a3" "a3" "a3" "a3"
apply(df, 1, function(x) paste(letters[x], x))
## [,1] [,2] [,3] [,4] [,5]
## [1,] "a 1" "b 2" "c 3" "d 4" "e 5"
## [2,] "b 2" "c 3" "d 4" "e 5" "f 6"
## [3,] "c 3" "c 3" "c 3" "c 3" "c 3"
There are 3 key steps to creating a new function:
It looks something like this: my_function_name(argument1, argument2){body of code}. Lets say we have this chunk of code we made to calculate the shape parameters for a beta distribution to use for survival rates.
# Calculate alpha and beta shape parameters
alpha_adult_f <- 0.8 * ((0.8 * (1 - 0.8) / 0.001) - 1)
beta_adult_f <-(1 - 0.8) * ((0.8 * (1 - 0.8) / 0.001) - 1)
alpha_adult_m <-0.75 * ((0.75 * (1 - 0.75) / 0.005) - 1)
beta_adult_m <-(1 - 0.75) * ((0.75 * (1 - 0.75) / 0.005) - 1)
alpha_juv_f <-0.65 * ((0.65 * (1 - 0.65) / 0.009) - 1)
beta_juv_f <-(1 - 0.65) * ((0.65 * (1 - 0.65) / 0.009) - 1)
alpha_juv_m <-0.60 * ((0.60 * (1 - 0.60) / 0.009) - 1)
beta_juv_m <-(1 - 0.60) * ((0.60 * (1 - 0.60) / 0.009) - 1)
# Now visualize adult female survival with plot
y <-rbeta(1:1000, alpha_adult_f, beta_adult_f)
plot(y, dbeta(y, alpha_adult_f, beta_adult_f), ylab = "Frequency",
xlab = "Adult female survival rate")
We repeated a lot of code, which leaves plenty of room for mistakes. We can simplify all of this by writing a function. First we need to figure out how many inputs it has. If you look at just the adult female survival, it has 2 numbers 0.8 (mean survival) and 0.001 (the variance). All other stages have a different mean and different variance. We can replace our inputs with an argument.
# Replace inputs with an argument
mean_rate <- 0.08
var_rate <- 0.001
alpha_adult_f <- mean_rate * ((mean_rate * (1 - mean_rate) / var_rate) - 1)
beta_adult_f <-(1 - mean_rate) * ((mean_rate * (1 - mean_rate) / var_rate) - 1)
Now we can generalize to make a function that will do this for males and juveniles females as well.
# Function name and the arguments included
method_of_moments_beta <- function(mean_rate, var_rate) {
# Body of the function
alpha <- mean_rate * ((mean_rate * (1 - mean_rate) / var_rate) - 1)
beta <- (1 - mean_rate) * ((mean_rate * (1 - mean_rate) / var_rate) - 1)
# We will also make a plot
y <-rbeta(1:1000, alpha, beta)
beta_plot <- plot(y, dbeta(y, alpha, beta), ylab = "Frequency",
xlab = "Adult female survival rate")
# We need to tell the function what to return
return(list(alpha = alpha, beta = beta, figure = beta_plot))
}
Now let's run it for adult and juvenile females
# Adult females
method_of_moments_beta(0.8, 0.001)
## $alpha
## [1] 127.2
##
## $beta
## [1] 31.8
##
## $figure
## NULL
# Juvenile females
method_of_moments_beta(0.65, 0.009)
## $alpha
## [1] 15.78056
##
## $beta
## [1] 8.497222
##
## $figure
## NULL
You will notice that the x axis title is incorrect. We will need to modify our function to make it more general.
method_of_moments_beta <- function(mean_rate, var_rate, stage, sex) {
alpha <- mean_rate * ((mean_rate * (1 - mean_rate) / var_rate) - 1)
beta <- (1 - mean_rate) * ((mean_rate * (1 - mean_rate) / var_rate) - 1)
# We will also make a plot
y <-rbeta(1:1000, alpha, beta)
beta_plot <- plot(y, dbeta(y, alpha, beta), ylab = "Frequency",
xlab = paste(stage, sex, "survival rate", sep = " "))
# We need to tell the function what to return
return(list(alpha = alpha, beta = beta, figure = beta_plot))
}
# Now let's try it out on juvenile males
method_of_moments_beta(0.60, 0.009, stage = "Juvenile", sex = "male")
## $alpha
## [1] 15.4
##
## $beta
## [1] 10.26667
##
## $figure
## NULL
An if statement allows you to conditionally execute code based on a condition being met. It looks like this:
if (condition) {
# code executed when condition is TRUE
} else {
# code executed when condition is FALSE
}
The condition must evaluate to either TRUE or FALSE, and not a vector of logical values. If you do have a logical vector, you can use any() or all() to collapse to a single value.
# A simple if statement
nums <- -5:5
# This produces a warning because it is more than 1 value
if(nums >= 0) {
print("This number is positive")
}
## Warning in if (nums >= 0) {: the condition has length > 1 and only the first
## element will be used
# Make it only 1 number
if(nums[1] >= 0) {
print("This number is positive")
}
if(nums[10] >= 0) {
print("This number is positive")
}
## [1] "This number is positive"
We can make this an if else statement easily.
# Make it only 1 number
if(nums[1] >= 0) {
print("This number is positive")
} else {
print("This number is negative")
}
## [1] "This number is negative"
if(nums[10] >= 0) {
print("This number is positive")
} else {
print("This number is negative")
}
## [1] "This number is positive"
# We can also add more than one else with nesting
if(nums[6] > 0) {
print("This number is positive")
} else if(nums[6] == 0) {
print("This number is zero")
} else {
print("This number is negative")
}
## [1] "This number is zero"
One of the reasons I love working in R projects is because it makes it easy to source code to keep your working R script clean. Sure, you can write a bunch of functions to clean up code and simplify, but if you R script is covered in a bunch of functions that doesn't really help with readability.
I like to create and save an R script to manipulate my data, and save another R script with all my functions and then simply call them in my working R script with source(). For example, if I had saved an R script called DataManipulation.R that took my raw data and returned a usable data frame ready for analyses and an R script called SurvivalFunctions.R that contained all my named functions I would call on them in my working R script (the one I use to run analyses):
# This runs all the code in this source file - i.e., the R script in my folder
# Rscripts. Any objects that were created in this code will be recreated with 1 line
source("Rscripts/DataManipulation.R")
# This again runs all the code, and because all my functions are specified here
# they will all be saved in my working environment.
source("Rscripts/SurvivalFunctions.R")
There are many different plotting packages to make figures. We will focus on ggplot2 package because it is widely used and valued for its simple, consistent approach to making plots. It also works well with tidy data and the dplyr and tidyr packages.
We will load ggplot2 and all the core tidyverse packages and work with the mpg data frame that comes in ggplot2 package:
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble 3.0.4 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x xts::first() masks dplyr::first()
## x dplyr::lag() masks stats::lag()
## x xts::last() masks dplyr::last()
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l~ f 18 29 p comp~
## 2 audi a4 1.8 1999 4 manual~ f 21 29 p comp~
## 3 audi a4 2 2008 4 manual~ f 20 31 p comp~
## 4 audi a4 2 2008 4 auto(a~ f 21 30 p comp~
## 5 audi a4 2.8 1999 6 auto(l~ f 16 26 p comp~
## 6 audi a4 2.8 1999 6 manual~ f 18 26 p comp~
## 7 audi a4 3.1 2008 6 auto(a~ f 18 27 p comp~
## 8 audi a4 quat~ 1.8 1999 4 manual~ 4 18 26 p comp~
## 9 audi a4 quat~ 1.8 1999 4 auto(l~ 4 16 25 p comp~
## 10 audi a4 quat~ 2 2008 4 manual~ 4 20 28 p comp~
## # ... with 224 more rows
With ggplot2, you begin each plot with the function ggplot(), which creates a coordinate system that you can add layers to:
# First create the coordinate system and specify the data
ggplot(data = mpg)
# You complete your graph by adding the geometric representation you want to use
# for your data. Let's make a scatter plot, so we want to use points. Each geom
# function takes a mapping argument which defines how variables are mapped.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
Each layer is added by including the + operator. Each layer will be added over the plot according to its position in the code. You can include the mapping in the ggplot(data = mpg, mapping = aes()) instead.
An aesthetic is a visual property of the objects in your plot. It includes things like size, shape, and color. Here are some common aesthetics you can map:
# Map the colors of points to the class of the vehicle
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
# Map class to the size aesthetic
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
## Warning: Using size for a discrete variable is not advised.
# Map class to the alpha aesthetic, which controls transparency
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.
# Or map class to the shape aesthetic
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
If you want to simply change the appearance of the plot, and not the aesthetic of how data are mapped, you do so outside of aes(). For example, if you wanted to make a plot with blue points:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy),
color = "blue")
If you wanted to add a trend line, you can simply add another layer to the plot. The geom_smooth() function draws a trend line through the data. The default behavior is to draw a local regression line (curve) through the points, however these can be hard to interpret. We want to add a straight line based on a linear model lm of the relationship between x and y.
# Adding a trend line
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
geom_smooth(mapping = aes(x = displ, y = hwy), method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Another way to add additional variables to a plot (besides mapping to different aesthetics) is using facets. Facets are subplots that display a subset of the data. To facet by a single variable, you use facet_wrap(), and to facet by 2 variables use facet_grid().
# The first argument of facet_wrap is a function, denoted with ~
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
facet_wrap(~ class, nrow = 2)
# For 2 variables, the function separates the two variables with ~
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
facet_grid(drv ~ class)
# You can also facet with more complex plots (i.e., more layers)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~ class)
## `geom_smooth()` using formula 'y ~ x'
The overall appearance of the plot can be modified using theme() functions. There are some built in themes, and you can also customize your own.
# We will first save our base plot as an R object so we can easily change our theme
p <- ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
# Black and white theme
p + theme_bw()
# Classic theme
p + theme_classic()
# Dark theme
p + theme_dark()
# Minimal theme
p + theme_minimal()
Personal preference: I like to start with a classic theme, then alter the theme from there. I will not go over all the theme elements you can change, but you should really look into them here.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
facet_wrap(~ class, nrow = 2) +
theme_classic() +
theme(axis.title = element_text(size = 14),
axis.title.y = element_text(face = "italic"),
legend.position = "bottom",
legend.key = element_rect(fill = "white"),
strip.background = element_rect(color = "gray"),
strip.text = element_text(face = "bold"),
panel.spacing = unit(1, "lines"))
The best way to figure out what all those lines of code do is to remove a single theme element and render the plot to see what changed. An important thing to note is that all elements that deal in text, such as axis.title are altered by changing the element_text. Lines, such as axis and tick marks, and changed with element_line whereas all rectangular elements, such as legend keys and the facet strip background, are changed with element_rect.
Many plots, like scatter plots, just plot the raw data contained in the data frame. Other graphs, like bar charts, calculate new values to plot. Bar charts, histograms, and frequency plots bin your data and then plot bin counts (i.e., the number of points that fall within each bin). Smoothers fit a model to your data and then plot predictions from the model. Box plots compute a summary of the distribution and then display a specially formatted box.
# We are going to use the diamonds data frame in ggplot2
# Make a simple boxplot
ggplot(data = diamonds) +
geom_bar(aes(x = cut))
# Map the clarity to different colors. Notice how for rectangles it is fill, not
# color that we set
ggplot(data = diamonds) +
geom_bar(aes(x = cut, fill = clarity))
# We can rearrange the clarity groups into adjacent bars by changing the position
ggplot(data = diamonds) +
geom_bar(aes(x = cut, fill = clarity),
position = "dodge")
If we want to look at price by cut, we can use points.
# Scatter plot of price of diamonds by cut
ggplot(data = diamonds) +
geom_point(aes(x = cut, y = price))
This plot is essentially useless because all the points are stacked on top of each other. We can improve this by altering the position of the points. There are 2 ways to do this. We can set the position within the geom function: geom_point(position = "jitter") or we can use the shorthand of geom_jitter().
ggplot(data = diamonds) +
geom_point(aes(x = cut, y = price), position = "jitter")
# This will produce the same thing, only more concise
ggplot(data = diamonds) +
geom_jitter(aes(x = cut, y = price),
height = 0, # this will prevent the point from being altered vertically
alpha = 0.25, # this changes the transparency so we can see clusters
size = 0.5)
We can easily add a box plot over our raw data. This will provide a useful summary while still showing the raw data.
ggplot(data = diamonds) +
geom_jitter(aes(x = cut, y = price), height = 0, alpha = 0.25, size = 0.5) +
geom_boxplot(aes(x = cut, y = price, color = cut), alpha = 0.1)
The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are some useful ways to control the coordinate space.
# First, we can switch the x and y axes
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
coord_quickmap() sets the aspect ratio correctly for maps, which is important for spatial data.
nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap()
coord_polar() uses polar coordinates which can be useful for a Coxcomb chart.
bar <- ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE, width = 1) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_flip()
bar + coord_polar()
There are so many resources available to learn how to code and use R, how to fix trouble-shoot code and fix problems when they pop up, and learn more about the functions included in the packages.
First, to learn more about what a function does and what arguments it takes, use ? before the function. So to get help with rnorm and learn what arguments it takes, use ?rnorm(). This is the first place I look when I am trying to figure out how to use a function.
There are also a lot of great books and online resources:
These are just a few of the many online books you can use. Another great resource are the RStudio Cheatsheets. These are such a great resource. I have three of them sitting in front of me right now! I highly recommend printing off the ones that will be most helpful to you. They have ones for Python, Dates and Times, Work with Strings, Data Transformations (i.e., dplyr and tidyr), Data Visualization (i.e., ggplot), Package Development, and many more!
Your next best friend for R help is going to be Google. I Google "how to do x in R" all the time. Most often, you will get directed to a site like stack overflow. Not only is this a great place to find answers to questions that have already been asked, but you can also ask your own questions and get help. A word of warning, these people can be BRUTAL. Make sure you take the time to create a minimal, reproducible example, otherwise you will get torn to shreds. Trust me. Here was a question on stack overflow about how to make a great, reproducible R example. It gives some good pointers. Check it out.