Matthew Dixon
Email: mdixon7@stuart.iit.edu
TA: Bo Wang
Email: bwang54@hawk.iit.edu
We saw in Session 1 a small sample of the rich set of functionality in R and the thousands of CRAN packages which make it a very powerful tool for data science. Today, we shall focus on how R represents data, through “data structures”, and how we write programs consisting of control flow and functions to interact with these data structures. These basic programming concepts are essential for data analysis in R.
A new programmer will spend 10% of their time coding and thinking about program design; the remaining time is spent debugging. It is the other way around for an experienced programmer.
By far the biggest frustration that new R users face is experiencing unintended program behavior (perhaps a ‘bug’) and not knowing how to solve it or, more importantly, how to safeguard against it in the first place. We will devote time to ensuring that we are equipped with the basic know-how for good program design.
To make the best of the R language, you’ll need a strong understanding of the basic data types and data structures and how to operate on those.
These are very important to understand because they are the objects you will manipulate on a day-to-day basis in R. Dealing with object conversions is one of the most common sources of frustration for beginners.
Everything in R is an object.
R has 6 atomic classes (although we will not discuss the raw class in this workshop).
| Example | Type |
|---|---|
| “a”, “swc” | character |
| 2, 15.5 | numeric |
| 2L (must add an L at the end to denote integer) | integer |
| TRUE, FALSE | logical |
| 1+4i | complex |
typeof() # what is it?
length() # how long is it? What about two dimensional objects?
attributes() # does it have any metadata?
# Example
x <- "dataset"
typeof(x)
attributes(x)
y <- 1:10
typeof(y)
length(y)
attributes(y)
z <- c(1L, 2L, 3L)
typeof(z)
R has many data structures. These include atomic vectors, matrices, lists, factors, and data frames.
A vector is the most common and basic data structure in R and is pretty much the workhorse of R. Technically, vectors can be one of two types, atomic vectors or lists, although the term “vector” most commonly refers to the atomic type, not lists.
Atomic Vectors
A vector can be a vector of elements that are most commonly character, logical, integer or numeric.
You can create an empty vector with vector() (By default the mode is logical. You can be more explicit as shown in the examples below.) It is more common to use direct constructors such as character(), numeric(), etc.
x <- vector()
# with a length and type
vector("character", length = 10)
## [1] "" "" "" "" "" "" "" "" "" ""
character(5) ## character vector of length 5
## [1] "" "" "" "" ""
numeric(5)
## [1] 0 0 0 0 0
logical(5)
## [1] FALSE FALSE FALSE FALSE FALSE
Various examples:
x <- c(1, 2, 3)
x
## [1] 1 2 3
length(x)
## [1] 3
x is a numeric vector. These are the most common kind. They are numeric objects and are treated as double precision real numbers. To explicitly create integers, add an L at the end.
x1 <- c(1L, 2L, 3L)
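As a quick check of the difference between the default double storage and explicit integers, here is a minimal sketch (the names `dbl` and `int` are illustrative only):

```r
dbl <- c(1, 2, 3)     # plain numbers are stored as doubles
int <- c(1L, 2L, 3L)  # the L suffix requests integers

typeof(dbl)           # "double"
typeof(int)           # "integer"

# Both count as numeric, and the values compare equal element-wise
is.numeric(dbl)       # TRUE
is.numeric(int)       # TRUE
all(dbl == int)       # TRUE

# identical() is stricter: it also compares the storage type
identical(dbl, int)   # FALSE
```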
You can also have logical vectors.
y <- c(TRUE, TRUE, FALSE, FALSE)
Finally you can have character vectors:
z <- c("Alec", "Dan", "Rob", "Karthik")
Examine your vector
typeof(z)
## [1] "character"
length(z)
## [1] 4
class(z)
## [1] "character"
str(z)
## chr [1:4] "Alec" "Dan" "Rob" "Karthik"
Question: Do you see a property that’s common to all these vectors above?
Add elements
z <- c(z, "Annette")
z
## [1] "Alec" "Dan" "Rob" "Karthik" "Annette"
More examples of vectors
x <- c(0.5, 0.7)
x <- c(TRUE, FALSE)
x <- c("a", "b", "c", "d", "e")
x <- 9:100
x <- c(1+0i, 2+4i)
You can also create vectors as a sequence of numbers
series <- 1:10
seq(10)
## [1] 1 2 3 4 5 6 7 8 9 10
seq(1, 10, by = 0.1)
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3
## [15] 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7
## [29] 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1
## [43] 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5
## [57] 6.6 6.7 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9
## [71] 8.0 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3
## [85] 9.4 9.5 9.6 9.7 9.8 9.9 10.0
Other objects
Inf is infinity. You can have either positive or negative infinity.
1/0
## [1] Inf
1/Inf
## [1] 0
NaN means Not a number. It’s an undefined value.
0/0
## [1] NaN
Each object can have attributes. Attributes can be part of any R object. These include names, dimension names (dimnames), dimensions (dim), and class.
You can also glean other attribute-like information such as length (works on vectors and lists) or number of characters (for character strings).
length(1:10)
## [1] 10
nchar("Stuart Business School")
## [1] 22
What happens when you mix types?
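R will create a resulting vector of the most general type among its elements (the “least common denominator”). Coercion moves towards the type that is easiest to coerce to, in the order logical, integer, double (numeric), complex, character.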
Guess what the following do without running them first
xx <- c(1.7, "a")
xx <- c(TRUE, 2)
xx <- c("a", TRUE)
This is called implicit coercion. You can also coerce vectors explicitly using the as.<class_name>() functions. Examples:
as.numeric()
as.character()
When you coerce an existing integer vector with as.numeric(), it converts the vector to a double.
x <- 0:6
identical(x, as.numeric(x))
## [1] FALSE
typeof(x)
## [1] "integer"
typeof(as.numeric(x))
## [1] "double"
x <- 0:6
as.numeric(x)
## [1] 0 1 2 3 4 5 6
as.logical(x)
## [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
as.character(x)
## [1] "0" "1" "2" "3" "4" "5" "6"
as.complex(x)
## [1] 0+0i 1+0i 2+0i 3+0i 4+0i 5+0i 6+0i
Sometimes coercions, especially nonsensical ones, won’t work.
x <- c("a", "b", "c")
as.numeric(x)
## Warning: NAs introduced by coercion
## [1] NA NA NA
as.logical(x)
## [1] NA NA NA
# both don't work
Sometimes there is implicit conversion
1 < "2"
## [1] TRUE
"1" > 2
## [1] FALSE
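The comparisons above work because R converts the number to a character string and then compares the two strings as text. A small sketch of why this can surprise (the values are chosen only for illustration):

```r
# Mixed comparison: 9 is coerced to "9", then the strings are compared
"10" < 9               # TRUE, because "1" sorts before "9"
10 < 9                 # FALSE, a genuine numeric comparison

# Converting explicitly avoids the surprise
as.numeric("10") < 9   # FALSE
```

When data read from a file may contain numbers stored as text, converting with as.numeric() before comparing is usually the safer choice.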
Matrices are a special kind of vector in R. They are not a separate type of object but simply an atomic vector with dimensions added on to it. Matrices have rows and columns.
m <- matrix(nrow = 2, ncol = 2)
m
## [,1] [,2]
## [1,] NA NA
## [2,] NA NA
dim(m)
## [1] 2 2
Matrices are filled column-wise.
m <- matrix(1:6, nrow = 2, ncol = 3)
Other ways to construct a matrix
m <- 1:10
dim(m) <- c(2, 5)
This takes a vector and transforms it into a matrix with 2 rows and 5 columns.
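To make the point that a matrix is just a vector with a dim attribute concrete, the following sketch inspects the object before and after the dimensions are set:

```r
m <- 1:10
is.matrix(m)          # FALSE: just an integer vector
dim(m) <- c(2, 5)
is.matrix(m)          # TRUE: same data, now with a dim attribute
attributes(m)         # the only attribute is dim (2 5)

# Elements are laid out column by column
m[2, 3]               # 6, the 6th element of the original vector
m[6]                  # 6, vector-style indexing still works

# Dropping the dim attribute gives back a plain vector
dim(m) <- NULL
identical(m, 1:10)    # TRUE
```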
Another way is to bind columns or rows using cbind() and rbind().
x <- 1:3
y <- 10:12
cbind(x, y)
## x y
## [1,] 1 10
## [2,] 2 11
## [3,] 3 12
# or
rbind(x, y)
## [,1] [,2] [,3]
## x 1 2 3
## y 10 11 12
You can also use the byrow argument to specify how the matrix is filled. From R’s own documentation:
mdat <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3, byrow = TRUE,
dimnames = list(c("row1", "row2"),
c("C.1", "C.2", "C.3")))
mdat
## C.1 C.2 C.3
## row1 1 2 3
## row2 11 12 13
In R lists act as containers. Unlike atomic vectors, the contents of a list are not restricted to a single mode and can encompass any mixture of data types. Lists are sometimes called recursive vectors, because a list can contain other lists. This makes them fundamentally different from atomic vectors.
A list is a special type of vector. Each element can be a different type.
Create lists using list() or coerce other objects using as.list()
x <- list(1, "a", TRUE, 1+4i)
x
## [[1]]
## [1] 1
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] 1+4i
x <- 1:10
x <- as.list(x)
length(x)
## [1] 10
What is the class of x[1]?
How about x[[1]]?
xlist <- list(a = "Karthik Ram", b = 1:10, data = head(iris))
xlist
## $a
## [1] "Karthik Ram"
##
## $b
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $data
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
What is the length of this object? What about its structure?
A list can contain many lists nested inside.
temp <- list(list(list(list())))
temp
## [[1]]
## [[1]][[1]]
## [[1]][[1]][[1]]
## list()
is.recursive(temp)
## [1] TRUE
Lists are extremely useful inside functions. You can “staple” together lots of different kinds of results into a single object that a function can return.
A list does not print to the console like a vector. Instead, each element of the list starts on a new line.
Elements are indexed by double brackets. Single brackets will still return a(nother) list.
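A minimal sketch of both points, returning several results from a function and then indexing them. The helper `summarise_sample` is hypothetical and written only for this illustration:

```r
# Hypothetical helper that returns several results of different types
summarise_sample <- function(v) {
  list(n = length(v), mean = mean(v), values = v)
}

res <- summarise_sample(c(2, 4, 6))

res[["mean"]]          # double brackets extract the element itself: 4
res$n                  # $ is shorthand for [[ ]] with a name: 3
class(res["mean"])     # single brackets keep the list wrapper: "list"
class(res[["mean"]])   # "numeric"
```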
Factors are special vectors that represent categorical data. Factors can be ordered or unordered and are important for modelling functions such as lm() and glm() and also in plot methods.
Factors can only contain pre-defined values.
Factors are pretty much integers that have labels on them. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings. Some string methods will coerce factors to strings, while others will throw an error.
Sometimes factors can be left unordered. Example: male, female.
Other times you might want factors to be ordered (or ranked). Example: low, medium, high.
Under the hood, these are represented by the numbers 1, 2, 3.
They are better than simple integer labels because factors are self-describing: male and female is more descriptive than 1s and 2s. This is helpful when there is no additional metadata.
Which is male? 1 or 2? You wouldn’t be able to tell with just integer data. Factors have this information built in.
Factors can be created with factor(). Input is generally a character vector.
x <- factor(c("yes", "no", "no", "yes", "yes"))
x
## [1] yes no no yes yes
## Levels: no yes
table(x) will return a frequency table.
If you need to convert a factor to a character vector, simply use
as.character(x)
## [1] "yes" "no" "no" "yes" "yes"
In modeling functions, it is important to know what the baseline level is. This is the first level, and by default the levels are ordered alphabetically. You can change this by specifying the levels (another option is to use the function relevel).
x <- factor(c("yes", "no", "yes"), levels = c("yes", "no"))
x
## [1] yes no yes
## Levels: yes no
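The sketch below looks at the integer codes behind a factor, uses relevel() to change the baseline, and shows a common pitfall with factors built from numbers. The variable names `x2` and `f` are illustrative:

```r
x <- factor(c("yes", "no", "no", "yes", "yes"))
levels(x)          # "no" "yes" (alphabetical by default)
as.integer(x)      # 2 1 1 2 2: the integer codes behind the labels

# relevel() moves one level to the front, making it the baseline
x2 <- relevel(x, ref = "yes")
levels(x2)         # "yes" "no"
as.integer(x2)     # 1 2 2 1 1

# Pitfall: for factors built from numbers, as.numeric() returns the codes
f <- factor(c("10", "20", "10"))
as.numeric(f)                  # 1 2 1
as.numeric(as.character(f))    # 10 20 10
```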
A data frame is a very important data type in R. It’s pretty much the de facto data structure for most tabular data and what we use for statistics.
Data frames can have additional attributes such as rownames(), which can be useful for annotating data, like subject_id or sample_id. But most of the time they are not used.
Some additional information on data frames:
Usually created by read.csv() and read.table().
Can convert to matrix with data.matrix()
Coercion will be forced and not always what you expect.
Can also create with data.frame() function.
Find the number of rows and columns with nrow(df) and ncol(df), respectively.
Rownames are usually 1..n.
Combining data frames
df <- data.frame(id = letters[1:10], x = 1:10, y = rnorm(10))
df
## id x y
## 1 a 1 -2.1638237
## 2 b 2 -0.7698114
## 3 c 3 0.3853157
## 4 d 4 -0.9379378
## 5 e 5 -0.1890014
## 6 f 6 -1.6957013
## 7 g 7 -1.7340270
## 8 h 8 0.4321022
## 9 i 9 -1.1770204
## 10 j 10 1.0208223
cbind(df, data.frame(z = 4))
## id x y z
## 1 a 1 -2.1638237 4
## 2 b 2 -0.7698114 4
## 3 c 3 0.3853157 4
## 4 d 4 -0.9379378 4
## 5 e 5 -0.1890014 4
## 6 f 6 -1.6957013 4
## 7 g 7 -1.7340270 4
## 8 h 8 0.4321022 4
## 9 i 9 -1.1770204 4
## 10 j 10 1.0208223 4
When you combine column-wise, only the number of rows needs to match. If you add a shorter vector, it is repeated (recycled) to fill the column.
Useful functions
head() - see first 6 rows
tail() - see last 6 rows
dim() - see dimensions
nrow() - number of rows
ncol() - number of columns
str() - structure of each column
names() - will list the names attribute for a data frame (or any object really), which gives the column names.
A data frame is a special type of list where every element of the list has the same length.
See that it is actually a special list:
is.list(iris)
## [1] TRUE
class(iris)
## [1] "data.frame"
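Because a data frame is a list of equal-length columns, the usual list tools work on it. A short sketch with a small illustrative data frame:

```r
df <- data.frame(id = letters[1:3], x = 1:3)

length(df)            # number of list elements = number of columns: 2
names(df)             # "id" "x"
df[["x"]]             # list-style extraction of one column: 1 2 3
sapply(df, class)     # class of each column
```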
Naming objects
Other R objects can also have names. Adding names is helpful since it’s useful for readable code and self describing objects.
x <- 1:3
names(x) <- c("karthik", "ram", "rocks")
x
## karthik ram rocks
## 1 2 3
Lists can also have names.
x <- as.list(1:10)
names(x) <- letters[seq_along(x)]
x
## $a
## [1] 1
##
## $b
## [1] 2
##
## $c
## [1] 3
##
## $d
## [1] 4
##
## $e
## [1] 5
##
## $f
## [1] 6
##
## $g
## [1] 7
##
## $h
## [1] 8
##
## $i
## [1] 9
##
## $j
## [1] 10
Finally, matrices can have names, and these are called dimnames.
m <- matrix(1:4, nrow = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))
# first element = rownames
# second element = colnames
m
## c d
## a 1 3
## b 2 4
dimnames(m)
## [[1]]
## [1] "a" "b"
##
## [[2]]
## [1] "c" "d"
colnames(m) ## or rownames(m)
## [1] "c" "d"
Missing values are denoted by NA, and the results of undefined mathematical operations by NaN. Check for them with:
is.na()
is.nan()
NA values have a class. So you can have both an integer NA (NA_integer_) and a character NA (NA_character_).
NaN is also NA. But not the other way around.
x <- c(1,2, NA, 4, 5)
x
## [1] 1 2 NA 4 5
is.na(x) # returns logical
## [1] FALSE FALSE TRUE FALSE FALSE
# shows third
is.nan(x)
## [1] FALSE FALSE FALSE FALSE FALSE
# none are NaN
x <- c(1,2, NA, NaN, 4, 5)
is.na(x)
## [1] FALSE FALSE TRUE TRUE FALSE FALSE
# shows 2 TRUE
is.nan(x)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE
# shows 1 TRUE
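In practice, missing values matter most when they flow into a computation. The following sketch shows how NA propagates, how to exclude it, and that NA comes in typed variants:

```r
x <- c(1, 2, NA, 4, 5)

mean(x)                 # NA: missing values propagate
mean(x, na.rm = TRUE)   # 3: drop NAs before computing
sum(is.na(x))           # 1: count the missing values
x[!is.na(x)]            # keep only the observed values

# NA comes in typed variants
typeof(NA)              # "logical"
typeof(NA_integer_)     # "integer"
typeof(NA_character_)   # "character"
```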
Control structures allow you to control the flow of execution of a script, typically inside of a function. Common ones include:
if, else
for
while
repeat
break
next
return
We don’t use these while working with R interactively but rather inside functions.
if(condition) {
  # do something
} else {
  # do something else
}
e.g.
x <- 1:15
if(sample(x, 1) <= 10) {
  print("x is less than or equal to 10")
} else {
  print("x is greater than 10")
}
Vectorization with ifelse
ifelse(x <= 10, "x less than or equal to 10", "x greater than 10")
Other valid ways of writing if/else
if(sample(x, 1) < 10) {
  y <- 5
} else {
  y <- 0
}
y <- if(sample(x, 1) < 10) {
  5
} else {
  0
}
A for loop works on an iterable variable and assigns successive values till the end of a sequence.
for(i in 1:10) {
  print(i)
}
x <- c("apples", "oranges", "bananas", "strawberries")
for(i in x) {
  print(i)
}
for(i in 1:4) {
  print(x[i])
}
for(i in seq_along(x)) {
  print(x[i])
}
for(i in 1:4) print(x[i])
m <- matrix(1:10, 2)
for(i in seq_len(nrow(m))) {
  for(j in seq_len(ncol(m))) {
    print(m[i, j])
  }
}
i <- 1
while(i < 10) {
  print(i)
  i <- i + 1
}
Be sure there is a way to exit out of a while loop.
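One common way to guarantee an exit is to cap the number of iterations. This is only a sketch: `max_iter` and `n_iter` are illustrative names, and the cap of 100 is arbitrary.

```r
i <- 1
max_iter <- 100          # illustrative safety cap
n_iter <- 0

while (i < 1000) {
  i <- i * 2             # the update that moves towards the exit condition
  n_iter <- n_iter + 1
  if (n_iter >= max_iter) {
    warning("maximum number of iterations reached")
    break
  }
}

i        # 1024
n_iter   # 10
```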
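expectation <- 0   # illustrative target value
threshold <- 0.1   # illustrative tolerance
repeat {
  # simulations; generate some value
  value <- rnorm(1)
  # have an expectation
  # if within some range, then exit the loop
  if(abs(value - expectation) <= threshold) {
    break
  }
}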
for(i in 1:20) {
  if(i %% 2 == 1) {
    next
  } else {
    print(i)
  }
}
This loop will only print even numbers and skip over odd numbers. Later we’ll learn other functions that will help us avoid these types of slow control flows as much as possible (mostly the while and for loops).
If you have to repeat the same few lines of code more than once, then you really need to write a function. Functions are a fundamental building block of R. You use them all the time in R and it’s not that much harder to string functions together (or write entirely new ones from scratch) to do more.
body(), the code inside the function.
formals(), the “formal” argument list, which controls how you can call the function.
args() to list arguments.
f <- function(x) x
f
formals(f)
environment(f)
Question: How do we delete this function from our environment?
Variables defined inside functions exist in a different environment than the global environment. However, if a variable is not defined inside a function, R will look one level above.
Example:
x <- 2
g <- function() {
  y <- 1
  c(x, y)
}
g()
## [1] 2 1
rm(x, g)
Same rule applies for nested functions.
A first useful function.
first <- function(x, y) {
  z <- x + y
  return(z)
}
add <- function(a, b) {
  return(a + b)
}
vector <- c(3, 4, 5, 6)
sapply(vector, add, 1)
What does this function return?
x <- 5
f <- function() {
  y <- 10
  c(x = x, y = y)
}
f()
What does this function return?
x <- 5
g <- function() {
  x <- 20
  y <- 10
  c(x = x, y = y)
}
g()
What does this function return?
x <- 5
h <- function() {
  y <- 10
  i <- function() {
    z <- 20
    c(x = x, y = y, z = z)
  }
  i()
}
h()
Functions with predefined (default) argument values
temp <- function(a = 1, b = 2) {
  return(a + b)
}
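As a brief illustration of how default values behave when such a function is called, the sketch below repeats the definition so that it runs on its own:

```r
temp <- function(a = 1, b = 2) {
  return(a + b)
}

temp()          # both defaults used: 3
temp(10)        # a = 10, b keeps its default: 12
temp(b = 5)     # arguments can be matched by name: 6
temp(10, 20)    # both defaults overridden: 30
```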
A function returns the last value it computed, unless return() is called earlier
f <- function(x) {
  if (x < 10) {
    0
  } else {
    10
  }
}
f(5)
f(15)
Being aware of potential conflicts of variable names is a significant step towards avoiding unexpected execution. Like all programming languages, R has some rules about where variables live depending on where they are defined. There are also rules for where R looks for user definitions of variables and functions. If known, these can assist in troubleshooting problems.
Let’s assume we run the following code:
c <- 100
(c + 1)
Can we still use c() to concatenate vectors?
(x1 <- c(1:4))
How does R know which value of c to use when? R treats functions and non-functions separately when looking up a name: when c is used as a function call, R skips any non-function object called c and keeps searching until it finds a function. That’s why this is possible.
When R tries to “bind” a value to a symbol (in this case c), it follows a very specific search path, looking first at the Global environment, then the namespaces of each package.
What is this order?
> search()
[1] ".GlobalEnv" "package:graphics" "package:grDevices"
[4] "package:datasets" "package:devtools" "package:knitr"
[7] "package:plyr" "package:reshape2" "package:ggplot2"
[10] "package:stats" "package:coyote" "package:utils"
[13] "package:methods" "Autoloads" "package:base"
Newly loaded packages end up in position 2 and everything else gets bumped down the list. base is always at the very end.
.GlobalEnv is just your workspace. If there’s a symbol matching your request there, R will use its value.
If nothing is found, it will search the namespace of each of the packages you’ve loaded (your list will look different).
Package loading order matters.
Example:
install.packages("Hmisc")
library(plyr)
library(Hmisc)
is.discrete
library(Hmisc)
library(plyr)
is.discrete
Reference functions inside a package’s namespace using the :: operator.
Hmisc::is.discrete
plyr::is.discrete
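The same masking effect can be reproduced without installing anything, using only packages that ship with R. In this sketch a workspace function deliberately hides stats::sd; the replacement function is purely illustrative:

```r
# A user-defined function in the workspace masks stats::sd
sd <- function(x) "my own sd"

sd(c(1, 2, 3))          # finds the workspace version first
stats::sd(c(1, 2, 3))   # :: reaches the package version directly: 1

rm(sd)                  # remove the mask
sd(c(1, 2, 3))          # back to stats::sd: 1
```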
R uses scoping rules called Lexical scoping (otherwise known as static scoping).
It determines how a value is associated with a free variable in a function.
add <- function(a, b) {
  (a + b) / n
}
n here is the free variable.
Rules of scoping
R first searches in the environment where the function was defined. An environment is a collection of symbols and values. Environments have parents.
> parent.env(globalenv())
<environment: package:graphics>
attr(,"name")
[1] "package:graphics"
attr(,"path")
[1] "/Library/Frameworks/R.framework/Versions/3.0/Resources/library/graphics"
> search()
[1] ".GlobalEnv" "package:graphics" "package:grDevices"
[4] "package:datasets" "package:devtools" "package:knitr"
[7] "package:plyr" "package:reshape2" "package:ggplot2"
[10] "package:stats" "package:coyote" "package:utils"
[13] "package:methods" "Autoloads" "package:base"
Since we defined add in the global env, R looks for n in that environment. You can confirm that the function add was defined in the global env using the function environment.
environment(add)
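A small sketch of this lookup, assuming the code is run at the top level so that add is defined in the global environment. The function `caller` is a hypothetical addition for illustration:

```r
add <- function(a, b) {
  (a + b) / n
}

n <- 2
add(1, 3)        # n is found where add was defined: 2

n <- 4
add(1, 3)        # the value is looked up at call time: 1

# A variable named n inside another function does not affect add()
caller <- function() {
  n <- 100       # local to caller(), invisible to add()
  add(1, 3)
}
caller()         # still 1
```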
These rules matter because you can define nested functions.
Example:
make.power <- function(n) {
  pow <- function(x) {
    x^n
  }
  pow
}
This is a constructor function, i.e. a function that creates another one.
cube <- make.power(3)
square <- make.power(2)
cube(3)
square(3)
ls(environment(cube))
get("n", environment(cube))
ls(environment(square))
get("n", environment(square))
You can see that R is searching for n first within each environment before looking elsewhere.
Why does scoping matter?
y <- 10
f1 <- function(x) {
  y <- 2
  y^2 + f2(x)
}
f2 <- function(x) {
  x * y
}
What does f1(10) return?
Possible answers: 104 or 24.
The answer is 104. This is a consequence of lexical (static) scoping: the free variable y in f2 is looked up in the environment where f2 was defined (the global environment, where y is 10), not where it was called. The alternative, 24, would result if R used dynamic scoping. One downside (as you’ll see with larger tasks) is that R has to carry everything in memory.
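Both candidate answers can be reproduced in one sketch. R has no dynamic scoping, so the second value is obtained by a different route: defining the inner function inside f1, which changes its enclosing environment. The names `f1_nested` and `f2_inner` are hypothetical.

```r
y <- 10
f2 <- function(x) {
  x * y          # y is free: looked up where f2 was defined
}
f1 <- function(x) {
  y <- 2
  y^2 + f2(x)
}
f1(10)           # 4 + 10 * 10 = 104

# If the inner function is defined inside f1, its enclosing
# environment is f1's, so the local y = 2 is found first
f1_nested <- function(x) {
  y <- 2
  f2_inner <- function(x) x * y
  y^2 + f2_inner(x)
}
f1_nested(10)    # 4 + 10 * 2 = 24
```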
It’s very easy to write code in R which is slow, inefficient and unreadable. This can be a source of much frustration for you and others that may use your code. Fortunately there are some tricks to help us out.
One approach is to use ‘vectorization’ operations, another is to use the family of apply functions.
Many operations in R are ‘vectorized’ which means that writing code is MUCH more efficient, concise and easy to read.
The basic idea with vectorized operations is to execute identical operations in parallel without needing to act on one element at a time.
x <- 1:4
y <- 6:9
Mathematical operations are performed element wise:
x + y
## [1] 7 9 11 13
x - y
## [1] -5 -5 -5 -5
x * y
## [1] 6 14 24 36
x / y
## [1] 0.1666667 0.2857143 0.3750000 0.4444444
Boolean comparisons return logical vectors:
y == 8
## [1] FALSE FALSE TRUE FALSE
x > 2
## [1] FALSE FALSE TRUE TRUE
x >= 2
## [1] FALSE TRUE TRUE TRUE
A shorter vector is recycled:
z <- c(1, 2)
x * z
## [1] 1 4 3 8
Matrix operations are also vectorized:
x <- matrix(1:4, 2, 2)
y <- matrix(rep(10, 4), 2, 2)
x * y # is not matrix multiplication. It's element wise
## [,1] [,2]
## [1,] 10 30
## [2,] 20 40
x / y # is elementwise division.
## [,1] [,2]
## [1,] 0.1 0.3
## [2,] 0.2 0.4
Element-wise matrix operations are performed by column. In the example below, y[1, 1] is multiplied by z[1], followed by y[2, 1] multiplied by z[2], and then the vector z is recycled for multiplying the second column of y.
y * z
## [,1] [,2]
## [1,] 10 10
## [2,] 20 20
True matrix multiplication is:
x %*% y
## [,1] [,2]
## [1,] 40 40
## [2,] 60 60
Vectorized operations make code a lot simpler.
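To see the difference in code size, here is a sketch of the same element-wise product written first as a loop and then as a single vectorized expression (`out` and `out_vec` are illustrative names):

```r
x <- 1:4
y <- 6:9

# Loop version: one element at a time
out <- numeric(length(x))
for (i in seq_along(x)) {
  out[i] <- x[i] * y[i]
}

# Vectorized version: one expression
out_vec <- x * y

all(out == out_vec)     # TRUE: both give 6 14 24 36
```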
There is a family of ‘apply’ functions which can be used to efficiently evaluate a function over a data frame or matrix.
apply
by
lapply
tapply
sapply
apply applies a function to each row or column of a matrix.
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
m
## [,1] [,2]
## [1,] 1 11
## [2,] 2 12
## [3,] 3 13
## [4,] 4 14
## [5,] 5 15
## [6,] 6 16
## [7,] 7 17
## [8,] 8 18
## [9,] 9 19
## [10,] 10 20
# 1 is the row index
# 2 is the column index
apply(m, 1, sum)
## [1] 12 14 16 18 20 22 24 26 28 30
apply(m, 2, sum)
## [1] 55 155
apply(m, 1, mean)
## [1] 6 7 8 9 10 11 12 13 14 15
apply(m, 2, mean)
## [1] 5.5 15.5
by applies a function to subsets of a data frame.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
by(iris[, 1:2], iris[,"Species"], summary)
## iris[, "Species"]: setosa
## Sepal.Length Sepal.Width
## Min. :4.300 Min. :2.300
## 1st Qu.:4.800 1st Qu.:3.200
## Median :5.000 Median :3.400
## Mean :5.006 Mean :3.428
## 3rd Qu.:5.200 3rd Qu.:3.675
## Max. :5.800 Max. :4.400
## --------------------------------------------------------
## iris[, "Species"]: versicolor
## Sepal.Length Sepal.Width
## Min. :4.900 Min. :2.000
## 1st Qu.:5.600 1st Qu.:2.525
## Median :5.900 Median :2.800
## Mean :5.936 Mean :2.770
## 3rd Qu.:6.300 3rd Qu.:3.000
## Max. :7.000 Max. :3.400
## --------------------------------------------------------
## iris[, "Species"]: virginica
## Sepal.Length Sepal.Width
## Min. :4.900 Min. :2.200
## 1st Qu.:6.225 1st Qu.:2.800
## Median :6.500 Median :3.000
## Mean :6.588 Mean :2.974
## 3rd Qu.:6.900 3rd Qu.:3.175
## Max. :7.900 Max. :3.800
by(iris[, 1:2], iris[,"Species"], sum)
## iris[, "Species"]: setosa
## [1] 421.7
## --------------------------------------------------------
## iris[, "Species"]: versicolor
## [1] 435.3
## --------------------------------------------------------
## iris[, "Species"]: virginica
## [1] 478.1
tapply applies a function to subsets of a vector.
df <- data.frame(names = sample(c("A","B","C"), 10, replace = TRUE), length = rnorm(10))
df
## names length
## 1 B -0.15564185
## 2 C -2.01990840
## 3 A -1.05965372
## 4 A 0.09074220
## 5 C 0.07270563
## 6 A 1.33217365
## 7 B 0.56650964
## 8 B 0.79729735
## 9 B 0.59426159
## 10 B -0.15828292
tapply(df$length, df$names, mean)
## A B C
## 0.1210874 0.3288288 -0.9736014
Now with a more familiar dataset.
tapply(iris$Petal.Length, iris$Species, mean)
## setosa versicolor virginica
## 1.462 4.260 5.552
lapply returns a list of the same length as the input. Each element of the output is the result of applying a function to the corresponding element of the input.
my_list <- list(a = 1:10, b = 2:20)
my_list
## $a
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $b
## [1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
lapply(my_list, mean)
## $a
## [1] 5.5
##
## $b
## [1] 11
sapply is a more user-friendly version of lapply and will return a vector or matrix where appropriate.
Let’s work with the same list we just created.
my_list
## $a
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $b
## [1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
x <- sapply(my_list, mean)
x
## a b
## 5.5 11.0
class(x)
## [1] "numeric"
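When each result has more than one element, sapply simplifies to a matrix instead of a vector. The sketch below also shows vapply, a base R relative that checks each result against a declared template; the choice of range and mean here is only for illustration:

```r
my_list <- list(a = 1:10, b = 2:20)

# Every result has length 2, so sapply simplifies to a 2 x 2 matrix
# with one column per list element (min in row 1, max in row 2)
r <- sapply(my_list, range)
dim(r)          # 2 2
r[, "b"]        # 2 20

# vapply does the same job but requires the expected shape of each result
v <- vapply(my_list, mean, FUN.VALUE = numeric(1))
v               # a = 5.5, b = 11
```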
replicate is an extremely useful function for generating datasets for simulation purposes.
replicate(10, rnorm(10))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.26421373 -0.10159728 -1.9375789 0.6193420 -0.37341041
## [2,] -0.30863414 0.50976733 0.2055021 -0.4598711 -0.89605471
## [3,] -0.06765376 0.52928225 0.1580237 1.3963495 -0.81323233
## [4,] 0.17311238 0.22412245 0.8983873 1.9073195 -0.37334082
## [5,] 1.46754668 -0.80187641 -0.7102795 1.9133833 1.15881163
## [6,] -1.41534907 1.28553724 -0.2017006 -1.5268909 1.14028436
## [7,] -0.82721664 -2.18201060 0.2037915 -0.2475745 -0.26601050
## [8,] 0.73795417 -1.56953094 -0.4912095 1.1445191 0.17755596
## [9,] 0.91559526 1.25281136 -1.2630158 0.2805345 0.09110074
## [10,] -0.63911177 -0.03310822 1.5041693 0.2057096 -1.71796162
## [,6] [,7] [,8] [,9] [,10]
## [1,] -0.30087259 1.00063954 -0.504888500 -0.2648303 1.3302251
## [2,] 0.65033846 0.63341144 -2.136281521 0.8952265 -2.1804741
## [3,] -0.80649729 0.13914735 -0.731222701 -0.7382418 1.3266580
## [4,] -0.25434165 0.08408901 0.005663705 0.1393083 -0.7212685
## [5,] 1.85881674 1.11605037 1.336734467 1.8564083 -1.2622886
## [6,] 0.46422573 -0.57456118 -0.249801668 -2.1058102 -0.6862764
## [7,] 0.06724379 1.05281759 -0.088825598 -0.5552824 0.5025837
## [8,] -0.93850496 0.52840918 -0.371982637 -0.8857639 -0.9486380
## [9,] 0.56307305 -1.34094760 -0.946183435 -1.4913859 0.5943181
## [10,] 1.38833429 -0.42663614 0.956575116 -0.7419571 0.8020775
replicate(10, rnorm(10), simplify = TRUE)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.20960402 -0.4158126 0.02510334 1.158939833 0.71317936
## [2,] -0.27429215 -1.8490742 0.72829752 -1.251474297 0.63496378
## [3,] 0.51196608 0.2789386 -1.21910489 -0.305100265 -1.31396758
## [4,] 1.63004468 -1.0069109 1.30153988 -1.292111005 -0.89604490
## [5,] 2.25532261 0.4742752 -0.28319620 -0.441805702 1.34853333
## [6,] -0.50389760 -0.6409965 -0.89472811 -1.207065136 -0.09317015
## [7,] 0.05006981 -0.1776505 1.47746350 -0.728781404 -0.83565480
## [8,] 1.27836776 2.7134612 -1.32250365 1.868500614 -0.82169765
## [9,] -0.47332446 -0.1848393 0.90136592 0.009677263 -0.48432410
## [10,] -0.72155071 -1.3830132 0.30951871 -1.542600789 -0.48148185
## [,6] [,7] [,8] [,9] [,10]
## [1,] 0.8843542 0.022220671 -0.79730862 0.51982447 0.25013255
## [2,] -1.2180504 -1.909112828 0.21654130 -0.39123723 -0.63367137
## [3,] 0.8889140 0.007988984 1.74494642 0.71555944 -0.02367925
## [4,] 1.3713941 0.764123382 -1.44089197 0.88094830 0.89915134
## [5,] -1.7088335 -0.816713311 0.30085003 -0.67295293 -1.92298810
## [6,] -0.3128146 -0.457666717 -1.12377527 1.87636709 0.49752297
## [7,] 0.5325853 -0.287594995 -1.70361001 -0.98239953 -2.57322846
## [8,] -1.2103010 0.080100320 -1.23877166 -0.08359462 1.22175106
## [9,] 0.0428119 0.762190614 -0.08887393 1.32409417 -0.46036287
## [10,] 0.7869221 -1.642560278 -0.30738330 -0.31920379 -0.59422701
The final argument, simplify, turns the result into a vector or matrix if possible.
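The effect of this argument is easiest to see by switching it off. A small sketch, with an arbitrary seed and smaller sizes than above so the results are easy to inspect:

```r
set.seed(1)   # arbitrary seed so the draws are reproducible

m <- replicate(3, rnorm(5))                    # simplified to a matrix
dim(m)                                         # 5 3

l <- replicate(3, rnorm(5), simplify = FALSE)  # kept as a list
class(l)                                       # "list"
length(l)                                      # 3

# Each column of m is one simulated sample
colMeans(m)
```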
mapply is more or less a multivariate version of sapply. It applies a function to all corresponding elements of each argument.
Example:
list_1 <- list(a = c(1:10), b = c(11:20))
list_1
## $a
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $b
## [1] 11 12 13 14 15 16 17 18 19 20
list_2 <- list(c = c(21:30), d = c(31:40))
list_2
## $c
## [1] 21 22 23 24 25 26 27 28 29 30
##
## $d
## [1] 31 32 33 34 35 36 37 38 39 40
mapply(sum, list_1$a, list_1$b, list_2$c, list_2$d)
## [1] 64 68 72 76 80 84 88 92 96 100