Instructor Details

Matthew Dixon

TA: Bo Wang

The R Language, in some detail

We saw in Session 1 a small sample of the rich set of functionality in R and the thousands of CRAN packages which make it a very powerful tool for data science. Today, we shall focus on how R represents data, through “data structures”, and how we write programs consisting of control flow and functions to interact with these data structures. These basic programming concepts are essential for data analysis in R.

The 90/10 Rule

A new programmer spends 10% of their time coding and thinking about program design; the remaining 90% is spent debugging. For an experienced programmer, the proportions are reversed.

By far the biggest frustration that new R users face is experiencing unintended program behavior (perhaps a ‘bug’) and not knowing how to solve it or, more importantly, how to safeguard against it in the first place. We will devote time to ensuring that we are equipped with the basic know-how for good program design.

Objectives for this session

Learn the basic data types and how to use them in R.
Learn the basic data structures in R and how to convert between them.
Learn how to implement basic logic for controlling the output of the program.
Learn how to write functions to perform certain modular tasks.
Learn the rules for variable scoping in R.
Learn how to write more efficient code making use of vectorization.

Understanding basic data types in R

To make the best of the R language, you’ll need a strong understanding of the basic data types and data structures and how to operate on them.
This is very important to understand because these are the objects you will manipulate on a day-to-day basis in R. Dealing with object conversions is one of the most common sources of frustration for beginners.
Everything in R is an object.

R has six atomic classes. We list five here; the sixth, raw, is not discussed in this workshop.

character
numeric (real or decimal)
integer
logical
complex

Example	Type
"a", "swc"	character
2, 15.5	numeric
2L (the `L` suffix denotes an integer)	integer
`TRUE`, `FALSE`	logical
1+4i	complex

typeof() # what is it?
length() # how long is it? What about two dimensional objects?
attributes() # does it have any metadata?

# Example

x <- "dataset"
typeof(x)
attributes(x)

y <- 1:10
typeof(y)
length(y)
attributes(y)

z <- c(1L, 2L, 3L)
typeof(z)
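
The comment next to length() asks about two-dimensional objects. A minimal sketch of the answer, using two small hypothetical objects (matrices and data frames are introduced later in this session):

```r
# Hypothetical example objects, for illustration only
m <- matrix(1:6, nrow = 2, ncol = 3)
d <- data.frame(a = 1:4, b = letters[1:4])

length(m)      # 6: the number of elements in the matrix
dim(m)         # 2 3: rows and columns
length(d)      # 2: the number of columns in the data frame
dim(d)         # 4 2: rows and columns
attributes(m)  # the "dim" attribute is the metadata that makes m a matrix
```

For two-dimensional objects, dim(), nrow() and ncol() are usually more informative than length().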

R has many data structures. These include

atomic vector
list
matrix
data frame
factors
tables

Vectors

A vector is the most common and basic data structure in R and is pretty much the workhorse of R. Technically, vectors can be one of two types:

atomic vectors
lists

although the term “vector” most commonly refers to the atomic type not lists.

Atomic Vectors

An atomic vector holds elements of a single type, most commonly character, logical, integer or numeric.

You can create an empty vector with vector(). By default the mode is logical, but you can be more explicit, as shown in the examples below. It is more common to use direct constructors such as character(), numeric(), etc.

x <- vector()
# with a length and type
vector("character", length = 10)

##  [1] "" "" "" "" "" "" "" "" "" ""

character(5) ## character vector of length 5

## [1] "" "" "" "" ""

numeric(5)

## [1] 0 0 0 0 0

logical(5)

## [1] FALSE FALSE FALSE FALSE FALSE

Various examples:

x <- c(1, 2, 3)
x

## [1] 1 2 3

length(x)

## [1] 3

x is a numeric vector, the most common kind. Its elements are stored as double-precision real numbers. To create integers explicitly, add an L suffix.

x1 <- c(1L, 2L, 3L)

You can also have logical vectors.

y <- c(TRUE, TRUE, FALSE, FALSE)

Finally you can have character vectors:

z <- c("Alec", "Dan", "Rob", "Karthik")

Examine your vector

typeof(z)

## [1] "character"

length(z)

## [1] 4

class(z)

## [1] "character"

str(z)

##  chr [1:4] "Alec" "Dan" "Rob" "Karthik"

Question: Do you see a property that’s common to all these vectors above?

Add elements

z <- c(z, "Annette")
z

## [1] "Alec"    "Dan"     "Rob"     "Karthik" "Annette"

More examples of vectors

x <- c(0.5, 0.7)
x <- c(TRUE, FALSE)
x <- c("a", "b", "c", "d", "e")
x <- 9:100
x <- c(1+0i, 2+4i)

You can also create vectors as a sequence of numbers

series <- 1:10
seq(10)

##  [1]  1  2  3  4  5  6  7  8  9 10

seq(1, 10, by = 0.1)

##  [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3
## [15]  2.4  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7
## [29]  3.8  3.9  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1
## [43]  5.2  5.3  5.4  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5
## [57]  6.6  6.7  6.8  6.9  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9
## [71]  8.0  8.1  8.2  8.3  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3
## [85]  9.4  9.5  9.6  9.7  9.8  9.9 10.0

Other objects

Inf is infinity. You can have either positive or negative infinity.

1/0

## [1] Inf

1/Inf

## [1] 0

NaN means Not a number. It’s an undefined value.

0/0

## [1] NaN

Each object can have attributes, which are metadata attached to the object. These include:

names
dimnames
dim
class
attributes (contain metadata)

You can also glean other attribute-like information such as length (works on vectors and lists) or number of characters (for character strings).

length(1:10)

## [1] 10

nchar("Stuart Business School")

## [1] 22

What happens when you mix types?

R will coerce every element to a single common type, the most flexible one present. The order, from least to most flexible, is logical, integer, double (numeric), complex, character.

Guess what the following do without running them first

xx <- c(1.7, "a") 
xx <- c(TRUE, 2) 
xx <- c("a", TRUE)

This is called implicit coercion. You can also coerce vectors explicitly using the as.<class_name>() functions. Examples:

as.numeric()
as.character()

When you coerce an integer vector with as.numeric(), it converts the vector to a double.

x <- 0:6
identical(x, as.numeric(x))

## [1] FALSE

typeof(x)

## [1] "integer"

typeof(as.numeric(x))

## [1] "double"

x <- 0:6
as.numeric(x)

## [1] 0 1 2 3 4 5 6

as.logical(x)

## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

as.character(x)

## [1] "0" "1" "2" "3" "4" "5" "6"

as.complex(x)

## [1] 0+0i 1+0i 2+0i 3+0i 4+0i 5+0i 6+0i

Sometimes coercions, especially nonsensical ones, won’t work.

x <- c("a", "b", "c")
as.numeric(x)

## Warning: NAs introduced by coercion

## [1] NA NA NA

as.logical(x)

## [1] NA NA NA

# both don't work

Implicit conversion also happens in comparisons. Here the number is converted to a character string and the two strings are compared:

1 < "2"

## [1] TRUE

"1" > 2

## [1] FALSE
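
Because such comparisons are made between strings, they can give surprising answers for multi-digit numbers. A small sketch with illustrative values:

```r
"10" < 9               # TRUE: compared as the strings "10" and "9"
as.numeric("10") < 9   # FALSE: compared as the numbers 10 and 9
```

Converting explicitly before comparing avoids this kind of surprise.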

Matrix

Matrices are a special vector in R. They are not a separate type of object but simply an atomic vector with dimensions added on to it. Matrices have rows and columns.

m <- matrix(nrow = 2, ncol = 2)
m

##      [,1] [,2]
## [1,]   NA   NA
## [2,]   NA   NA

dim(m)

## [1] 2 2

Matrices are filled column-wise.

m <- matrix(1:6, nrow = 2, ncol = 3)

Other ways to construct a matrix

m <- 1:10
dim(m) <- c(2, 5)

This takes a vector and transforms it into a matrix with 2 rows and 5 columns.
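
The following sketch checks the claim that a matrix is just an atomic vector with a dim attribute, and that it is filled column by column. The name v is introduced here for illustration.

```r
v <- 1:10
m <- v
dim(m) <- c(2, 5)

attributes(m)                # only a "dim" attribute has been added
m[2, 1]                      # 2: the second element of v (column-wise filling)
m[1, 2]                      # 3: the third element of v
identical(as.vector(m), v)   # TRUE: dropping dim gives back the original vector
```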

Another way is to bind columns or rows using cbind() and rbind().

x <- 1:3
y <- 10:12
cbind(x, y)

##      x  y
## [1,] 1 10
## [2,] 2 11
## [3,] 3 12

# or 
rbind(x, y)

##   [,1] [,2] [,3]
## x    1    2    3
## y   10   11   12

You can also use the byrow argument to specify how the matrix is filled. From R’s own documentation:

mdat <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3, byrow = TRUE,
               dimnames = list(c("row1", "row2"),
                               c("C.1", "C.2", "C.3")))
mdat

##      C.1 C.2 C.3
## row1   1   2   3
## row2  11  12  13

List

In R lists act as containers. Unlike atomic vectors, the contents of a list are not restricted to a single mode and can encompass any mixture of data types. Lists are sometimes called recursive vectors, because a list can contain other lists. This makes them fundamentally different from atomic vectors.

A list is a special type of vector. Each element can be a different type.

Create lists using list() or coerce other objects using as.list()

x <- list(1, "a", TRUE, 1+4i)
x

## [[1]]
## [1] 1
## 
## [[2]]
## [1] "a"
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] 1+4i

x <- 1:10
x <- as.list(x)
length(x)

## [1] 10

What is the class of x[1]?
How about x[[1]]?

xlist <- list(a = "Karthik Ram", b = 1:10, data = head(iris))
xlist

## $a
## [1] "Karthik Ram"
## 
## $b
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $data
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

What is the length of this object? What about its structure?

A list can contain many lists nested inside.

temp <- list(list(list(list())))
temp

## [[1]]
## [[1]][[1]]
## [[1]][[1]][[1]]
## list()

is.recursive(temp)

## [1] TRUE

Lists are extremely useful inside functions. You can “staple” together lots of different kinds of results into a single object that a function can return.

A list does not print to the console like a vector. Instead, each element of the list starts on a new line.

Elements are indexed by double brackets. Single brackets will still return a(nother) list.
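
As a sketch of both points, the hypothetical helper summarise_vec() below (not part of R or of this session's code) returns several results in one list, and the lines after it contrast single and double brackets.

```r
# Hypothetical helper: bundles several summaries of a numeric vector in a list
summarise_vec <- function(v) {
  list(n = length(v), mean = mean(v), range = range(v))
}

res <- summarise_vec(c(2, 4, 6, 8))

class(res["mean"])     # "list": single brackets keep the list wrapper
class(res[["mean"]])   # "numeric": double brackets extract the element itself
res$range              # 2 8: $ is shorthand for [[ ]] with a name
```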

Factors

Factors are special vectors that represent categorical data. Factors can be ordered or unordered and are important for modelling functions such as lm() and glm() and also in plot methods.

Factors can only contain pre-defined values.

Factors are pretty much integers that have labels on them. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings. Some string methods will coerce factors to strings, while others will throw an error.

Sometimes factors can be left unordered. Example: male, female.

Other times you might want factors to be ordered (or ranked). Example: low, medium, high.

Underneath, the levels are represented by the integers 1, 2, 3.

Factors are better than simple integer labels because they are self-describing: male and female are more descriptive than 1s and 2s. This is helpful when there is no additional metadata.

Which is male? 1 or 2? You wouldn’t be able to tell with just integer data. Factors have this information built in.

Factors can be created with factor(). Input is generally a character vector.

x <- factor(c("yes", "no", "no", "yes", "yes"))
x

## [1] yes no  no  yes yes
## Levels: no yes

table(x) will return a frequency table.

If you need to convert a factor to a character vector, simply use

as.character(x)

## [1] "yes" "no"  "no"  "yes" "yes"
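
The integer representation matters most when the labels themselves look like numbers. A short sketch with made-up values:

```r
f <- factor(c("10", "20", "10", "30"))

levels(f)                     # "10" "20" "30"
as.integer(f)                 # 1 2 1 3: the underlying codes, not the values
as.numeric(as.character(f))   # 10 20 10 30: go via character to recover the values
```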

In modeling functions, it is important to know what the baseline level is. This is the first level, and by default the levels are ordered alphabetically. You can change this by specifying the levels (another option is to use the function relevel).

x <- factor(c("yes", "no", "yes"), levels = c("yes", "no"))
x

## [1] yes no  yes
## Levels: yes no
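
A brief sketch of the two alternatives mentioned in this section: relevel() to change the baseline, and an ordered factor for ranked categories. The variable size and its values are illustrative.

```r
x <- factor(c("yes", "no", "yes"))
levels(x)                          # "no" "yes": alphabetical by default
levels(relevel(x, ref = "yes"))    # "yes" "no": "yes" is now the baseline

size <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"), ordered = TRUE)
size < "high"                      # TRUE FALSE TRUE TRUE: uses the level order
table(size)                        # counts are reported in level order
```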

Data frame

A data frame is a very important data type in R. It’s pretty much the de facto data structure for most tabular data and what we use for statistics.

Data frames can have additional attributes such as rownames(), which can be useful for annotating data, like subject_id or sample_id. But most of the time they are not used.

Some additional information on data frames:

Usually created by read.csv() and read.table().
Can convert to matrix with data.matrix()
Coercion will be forced and not always what you expect.
Can also create with data.frame() function.
Find the number of rows and columns with nrow(df) and ncol(df), respectively.
Rownames are usually 1..n.
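
A minimal sketch of a few of these points, using a small made-up data frame called people (the name and values are assumptions for illustration):

```r
# Hypothetical measurements, for illustration only
people <- data.frame(id = c("s1", "s2", "s3"),
                     height = c(1.62, 1.75, 1.80),
                     smoker = c(FALSE, TRUE, FALSE))

nrow(people)   # 3
ncol(people)   # 3
str(people)    # one line per column, showing each column's type

# data.matrix() forces every selected column to numeric
mat <- data.matrix(people[, c("height", "smoker")])
mat
```

Note how the logical column becomes 0s and 1s in the matrix: this is the forced coercion mentioned in the list above.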

Combining data frames

df <- data.frame(id = letters[1:10], x = 1:10, y = rnorm(10))
df

##    id  x          y
## 1   a  1 -2.1638237
## 2   b  2 -0.7698114
## 3   c  3  0.3853157
## 4   d  4 -0.9379378
## 5   e  5 -0.1890014
## 6   f  6 -1.6957013
## 7   g  7 -1.7340270
## 8   h  8  0.4321022
## 9   i  9 -1.1770204
## 10  j 10  1.0208223

cbind(df, data.frame(z = 4))

##    id  x          y z
## 1   a  1 -2.1638237 4
## 2   b  2 -0.7698114 4
## 3   c  3  0.3853157 4
## 4   d  4 -0.9379378 4
## 5   e  5 -0.1890014 4
## 6   f  6 -1.6957013 4
## 7   g  7 -1.7340270 4
## 8   h  8  0.4321022 4
## 9   i  9 -1.1770204 4
## 10  j 10  1.0208223 4

When you combine column-wise, the number of rows must match. If you add a single value, as with z above, or a vector whose length divides the number of rows, it is recycled to fill every row.
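
For comparison, here is a sketch of column-wise and row-wise combination on a tiny made-up data frame a. When stacking rows with rbind(), it is the column names that have to match.

```r
a <- data.frame(id = c("a", "b"), x = 1:2)

wide <- cbind(a, flag = TRUE)                   # the single value is repeated per row
long <- rbind(a, data.frame(id = "c", x = 3L))  # column names must match

wide
long
```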

Useful functions

head() - see first 6 rows
tail() - see last 6 rows
dim() - see dimensions
nrow() - number of rows
ncol() - number of columns
str() - structure of each column
names() - will list the names attribute for a data frame (or any object really), which gives the column names.

A data frame is a special type of list where every element of the list has same length.

See that it is actually a special list:

is.list(iris)

## [1] TRUE

class(iris)

## [1] "data.frame"
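
Because a data frame is a list of columns, list tools work on it directly. A short sketch using the built-in iris data (sapply() is covered later in this session):

```r
length(iris)                          # 5: the number of columns, i.e. list elements
sapply(iris, class)                   # the class of each column
is.numeric(iris[["Sepal.Length"]])    # TRUE: [[ ]] extracts one column as a vector
is.data.frame(iris["Sepal.Length"])   # TRUE: [ ] keeps a one-column data frame
```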

Naming objects

Other R objects can also have names. Names make code more readable and objects self-describing.

x <- 1:3
names(x) <- c("karthik", "ram", "rocks")
x

## karthik     ram   rocks 
##       1       2       3

Lists can also have names.

x <- as.list(1:10)
names(x) <- letters[seq_along(x)]
x

## $a
## [1] 1
## 
## $b
## [1] 2
## 
## $c
## [1] 3
## 
## $d
## [1] 4
## 
## $e
## [1] 5
## 
## $f
## [1] 6
## 
## $g
## [1] 7
## 
## $h
## [1] 8
## 
## $i
## [1] 9
## 
## $j
## [1] 10

Finally matrices can have names and these are called dimnames

m <- matrix(1:4, nrow = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))
# first element = rownames
# second element = colnames
m

##   c d
## a 1 3
## b 2 4

dimnames(m)

## [[1]]
## [1] "a" "b"
## 
## [[2]]
## [1] "c" "d"

colnames(m) ## or rownames(m)

## [1] "c" "d"

Missing values

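Missing values are denoted by NA. NaN (“not a number”) marks the result of an undefined mathematical operation.

is.na()
is.nan()

is.na() returns TRUE for both NA and NaN, whereas is.nan() returns TRUE only for NaN.

NA values have a type, so you can have both an integer NA (NA_integer_) and a character NA (NA_character_).

In other words, NaN counts as NA, but not the other way around.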

x <- c(1,2, NA, 4, 5)
x

## [1]  1  2 NA  4  5

is.na(x) # returns logical

## [1] FALSE FALSE  TRUE FALSE FALSE

# shows third
is.nan(x)

## [1] FALSE FALSE FALSE FALSE FALSE

# none are NaN

x <- c(1,2, NA, NaN, 4, 5)
is.na(x)

## [1] FALSE FALSE  TRUE  TRUE FALSE FALSE

# shows 2 TRUE
is.nan(x)

## [1] FALSE FALSE FALSE  TRUE FALSE FALSE

# shows 1 TRUE
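
In practice, the main consequence of missing values is that they propagate through calculations. A minimal sketch of the usual ways to handle them:

```r
x <- c(1, 2, NA, 4, 5)

mean(x)                 # NA: missing values propagate
mean(x, na.rm = TRUE)   # 3: drop the NA before computing
sum(is.na(x))           # 1: count the missing values
x[!is.na(x)]            # 1 2 4 5: keep only the observed values

typeof(NA)              # "logical"
typeof(NA_integer_)     # "integer"
typeof(NA_character_)   # "character"
```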

Control structures in R

These allow you to control the flow of execution of a script typically inside of a function. Common ones include:

if, else
for
while
repeat
break
next
return

We rarely need these when working with R interactively; they are used mostly inside functions.

If

if(condition) {
    # do something 
} else { 
    # do something else
}

e.g.

x <- 1:15
s <- sample(x, 1)    # draw one value from x at random
if(s <= 10) {
    print("s is at most 10")
} else {
    print("s is greater than 10")
}

Vectorization with ifelse

ifelse(x <= 10, "x at most 10", "x greater than 10")

Other valid ways of writing if/else

if(sample(x,1) < 10) {
    y <- 5
} else {
    y <- 0
}

y <- if(sample(x,1) < 10) {
    5
} else {
    0
}

for

A for loop iterates over the elements of a vector or list, assigning each value in turn to the loop variable until the sequence is exhausted.

for(i in 1:10) {
    print(i)
}

x <- c("apples", "oranges", "bananas", "strawberries")

for(i in x) {
    print(i)
}

for(i in 1:4) {
    print(x[i])
}

for(i in seq_along(x)) {
    print(x[i])
}

for(i in 1:4) print(x[i])

Nested loops

m <- matrix(1:10, 2)
for(i in seq(nrow(m))) {
    for(j in seq(ncol(m))) {
        print(m[i,j])
    }
}

While

i <- 1
while(i < 10) {
    print(i)
    i <- i + 1
}

Be sure there is a way to exit out of a while loop.
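
One common safeguard is to cap the number of iterations in addition to the real stopping condition. This is a sketch; max_iter and its value are arbitrary assumptions.

```r
i <- 1
iter <- 0
max_iter <- 1000   # assumption: an arbitrary cap on the number of iterations

while (i < 100 && iter < max_iter) {
  i <- i * 2
  iter <- iter + 1
}

i      # 128
iter   # 7
```

If the main condition can never become FALSE because of a bug, the cap still ends the loop.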

Repeat and break

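set.seed(1)            # assumption: fixed seed so the run is reproducible
expectation <- 0       # assumption: illustrative target value
threshold <- 0.1       # assumption: illustrative tolerance
repeat {
    # simulations; generate some value
    value <- rnorm(1)
    # if within some range of the expectation, then exit the loop
    if(abs(value - expectation) <= threshold) {
        break
    }
}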

for(i in 1:20) {
    if(i %% 2 == 1) {
        next
    } else {
        print(i)
    }
}

This loop will only print even numbers and skip over odd numbers. Later we’ll learn other functions that will help us avoid these types of slow control flows as much as possible (mostly the while and for loops).
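
As a preview, the even numbers can be obtained without any loop. A minimal sketch of two vectorized alternatives:

```r
i <- 1:20
evens <- i[i %% 2 == 0]   # keep the elements whose remainder modulo 2 is zero
evens

seq(2, 20, by = 2)        # or generate the sequence directly
```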

Writing functions in R

If you have to repeat the same few lines of code more than once, then you really need to write a function. Functions are a fundamental building block of R. You use them all the time in R and it’s not that much harder to string functions together (or write entirely new ones from scratch) to do more.

R functions are objects just like anything else.
By default, R function arguments are lazy: they’re only evaluated if they’re actually used.
Almost every operation on an R object is a function call.
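
Lazy evaluation of arguments can be seen in a small sketch. The functions f and g are illustrative; force() is the base R function that evaluates an argument immediately.

```r
f <- function(x, y) {
  x * 2          # y is never used, so it is never evaluated
}
lazy_result <- f(2, stop("this argument is never evaluated"))
lazy_result      # 4, and no error is raised

g <- function(x, y) {
  force(y)       # evaluate y now, whether or not it is used later
  x * 2
}
forced_result <- tryCatch(g(2, stop("evaluated")),
                          error = function(e) "error raised")
forced_result    # "error raised"
```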

Basic components of a function

The body(), the code inside the function.
The formals(), the “formal” argument list, which controls how you can call the function.
The environment(), which determines how variables referred to inside the function are found.
args() to list arguments.

f <- function(x) x
f

formals(f)

environment(f)

Question: How do we delete this function from our environment?

More on environments

Variables defined inside a function live in that function’s own environment, separate from the global environment. However, if a variable is not defined inside a function, R looks for it one level up, in the environment where the function was defined.

Example:

x <- 2
g <- function() { 
  y <- 1
  c(x, y)
}  
g()

## [1] 2 1

rm(x, g)

Same rule applies for nested functions.

A first useful function.

first <- function(x, y) {
    z <- x + y
    return(z)
}

add <- function(a, b) {
  return (a + b)
}
values <- c(3, 4, 5, 6)   # named "values" so it is not confused with the function vector()

sapply(values, add, 1)

What does this function return?

x <- 5
f <- function() {
  y <- 10
  c(x = x, y = y)
}
f()

What does this function return?

x <- 5
g <- function() {
  x <- 20
  y <- 10
  c(x = x, y = y)
} 
g()

What does this function return?

x <- 5
h <- function() {
  y <- 10
  i <- function() {
    z <- 20
    c(x = x, y = y, z = z)
  }
  i() 
}
h()

Functions with default argument values

temp <- function(a = 1, b = 2) {
    return(a + b)
}
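
A short sketch of how default values behave when the function is called. The function is repeated so that the block runs on its own.

```r
temp <- function(a = 1, b = 2) {
    return(a + b)
}

temp()          # 3: both defaults are used
temp(b = 10)    # 11: a keeps its default, b is overridden by name
temp(5, 5)      # 10: both arguments supplied by position
formals(temp)   # lists the arguments together with their defaults
```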

A function returns the value of the last expression it evaluates, unless return() is called earlier.

f <- function(x) {
  if (x < 10) {
    0
  } else {
    10
  }
}
f(5)
f(15)

Advanced Topics: Scoping in R

Being aware of potential conflicts of variable names is a significant step towards avoiding unexpected behavior. Like most programming languages, R has rules about where variables live depending on where they are defined. There are also rules for where R looks for user definitions of variables and functions. Knowing these rules helps in troubleshooting problems.

Let’s assume we run the following code:

c <- 100
(c  + 1)

Can we still use c() to concatenate vectors?

(x1 <- c(1:4))

How does R know which c to use? When a name is used as a function, as in c(1:4), R skips over any non-function objects with that name while searching for it. That’s why both uses can coexist.
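
A compact sketch that uses both meanings of c in a single expression and then cleans up. The name combined is introduced here for illustration.

```r
c <- 100
combined <- c(c, 1)   # the call finds the function c(); the argument is our variable
combined              # 100 1

is.function(c)        # FALSE: as a plain symbol, c refers to the number 100
rm(c)                 # remove our variable again
is.function(c)        # TRUE: only the function from the base package remains
```

Even though this works, reusing the names of common functions for variables makes code harder to read and is best avoided.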

When R tries to “bind” a value to a symbol (in this case c), it follows a very specific search path, looking first in the global environment and then in each attached package in turn.

What is this order?

> search()
 [1] ".GlobalEnv"        "package:graphics"  "package:grDevices"
 [4] "package:datasets"  "package:devtools"  "package:knitr"
 [7] "package:plyr"      "package:reshape2"  "package:ggplot2"
[10] "package:stats"     "package:coyote"    "package:utils"
[13] "package:methods"   "Autoloads"         "package:base"

Newly loaded packages end up in position 2 and everything else gets bumped down the list. base is always at the very end.

.GlobalEnv is just your workspace. If it contains a symbol matching your request, R uses that value.

If nothing is found, it will search the namespace of each of the packages you’ve loaded (your list will look different).

Package loading order matters.

Example:

install.packages("Hmisc")
library(plyr)
library(Hmisc)
is.discrete

library(Hmisc)
library(plyr)
is.discrete

Reference functions inside a package’s namespace using the :: operator.

Hmisc::is.discrete
plyr::is.discrete
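
The same operator can be tried without installing anything. In this sketch we deliberately mask base::mean with a made-up function of our own and then reach past it with ::.

```r
mean <- function(x) "my own mean"      # hypothetical function that masks base::mean

masked <- mean(c(1, 2, 3))             # finds the workspace version first
explicit <- base::mean(c(1, 2, 3))     # asks for the one in the base namespace
masked                                 # "my own mean"
explicit                               # 2

rm(mean)                               # remove the mask
restored <- mean(c(1, 2, 3))
restored                               # 2
```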

R uses scoping rules called Lexical scoping (otherwise known as static scoping).

It determines how a value is associated with a free variable in a function.

add <- function(a, b) {
    (a + b)/n
}

n here is the free variable.

Rules of scoping

R first searches in the environment where the function was defined. An environment is a collection of symbols and values. Environments have parents.

> parent.env(globalenv())
<environment: package:graphics>
attr(,"name")
[1] "package:graphics"
attr(,"path")
[1] "/Library/Frameworks/R.framework/Versions/3.0/Resources/library/graphics"
> search()
 [1] ".GlobalEnv"        "package:graphics"  "package:grDevices"
 [4] "package:datasets"  "package:devtools"  "package:knitr"
 [7] "package:plyr"      "package:reshape2"  "package:ggplot2"
[10] "package:stats"     "package:coyote"    "package:utils"
[13] "package:methods"   "Autoloads"         "package:base"

Since we defined add in the global env, R looks for n in that environment. You can confirm that the function add was defined in the global env using the function environment.

environment(add)
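
A sketch of how the free variable is resolved, assuming add is defined at the top level as above. The values of n are illustrative.

```r
add <- function(a, b) {
    (a + b)/n
}

n <- 2
first_call <- add(1, 3)
first_call     # 2

n <- 4
second_call <- add(1, 3)
second_call    # 1: n is looked up each time the function is called
```

The environment in which R searches is fixed when the function is defined, but the value found there is read at call time.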

These rules matter because you can define nested functions.

Example:

make.power <- function(n) {
    pow <- function(x) {
        x^n
    }
    pow
}

This is a constructor function, i.e. a function that creates another one.

cube <- make.power(3)
square <- make.power(2)

cube(3)
square(3)

ls(environment(cube))
get("n", environment(cube))

ls(environment(square))
get("n", environment(square))

You can see that R is searching for n first within each environment before looking elsewhere.

Why scoping matters

y <- 10

f1 <- function(x) {
    y <- 2
    y^2 + f2(x)
}


f2 <- function(x) {
    x * y
}

What does f1(10) return?

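Possible answers:

104
24

The answer is 104. f2 was defined in the global environment, so its free variable y is looked up there (y = 10), giving 2^2 + 10 * 10 = 104. This is a consequence of lexical (static) scoping. Under dynamic scoping, y would be looked up in the calling function f1 (y = 2), giving 2^2 + 10 * 2 = 24. One downside of lexical scoping (as you’ll see with larger tasks) is that every function carries a reference to its defining environment, so the objects in that environment must be kept in memory.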
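
The two candidate answers can be checked in code. The first part repeats f1 and f2; the second part is only an imitation of dynamic scoping, written with hypothetical helpers f1_dynamic and f2_dynamic that look y up in the caller's frame.

```r
y <- 10
f1 <- function(x) {
    y <- 2
    y^2 + f2(x)
}
f2 <- function(x) {
    x * y                      # y is found where f2 was defined
}
f1(10)                         # 104

# Imitating dynamic scoping: look y up in the calling function instead
f2_dynamic <- function(x) {
    caller <- parent.frame()   # the environment of whoever called us
    x * get("y", envir = caller)
}
f1_dynamic <- function(x) {
    y <- 2
    y^2 + f2_dynamic(x)
}
f1_dynamic(10)                 # 24
```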

Advanced topics: Making your code more efficient and scalable

It’s very easy to write R code that is slow, inefficient and unreadable, which can be a source of much frustration for you and for others who use your code. Fortunately there are some tricks to help us out.

One approach is to use ‘vectorization’ operations, another is to use the family of apply functions.

Vectorization

Many operations in R are ‘vectorized’, which makes code much more efficient, concise and easy to read.

The basic idea of a vectorized operation is to apply the identical operation to every element of a vector at once, without writing code that acts on one element at a time.
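
To make the contrast concrete, here is a sketch that squares the numbers 1 to 10 twice, once with an explicit loop and once with a vectorized expression. The variable names are illustrative.

```r
v <- 1:10

# Element-by-element version
squares_loop <- numeric(length(v))
for (i in seq_along(v)) {
  squares_loop[i] <- v[i]^2
}

# Vectorized version: one expression, no explicit loop
squares_vec <- v^2

identical(squares_loop, squares_vec)   # TRUE
```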

x <- 1:4
y <- 6:9

Mathematical operations are performed element wise:

x + y

## [1]  7  9 11 13

x - y

## [1] -5 -5 -5 -5

x * y

## [1]  6 14 24 36

x / y

## [1] 0.1666667 0.2857143 0.3750000 0.4444444

Boolean comparisons return logical vectors:

y == 8

## [1] FALSE FALSE  TRUE FALSE

x > 2

## [1] FALSE FALSE  TRUE  TRUE

x >= 2

## [1] FALSE  TRUE  TRUE  TRUE

A shorter vector is recycled:

z <- c(1, 2)
x * z

## [1] 1 4 3 8
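
Recycling is silent when the longer length is a multiple of the shorter one. Otherwise R still recycles but issues a warning, which this sketch captures with tryCatch() so that the message can be inspected:

```r
even_fit <- 1:4 * c(1, 2)        # lengths 4 and 2: recycled silently
even_fit                         # 1 4 3 8

# lengths 4 and 3: R warns that the lengths do not fit
msg <- tryCatch(1:4 * c(1, 2, 3),
                warning = function(w) conditionMessage(w))
msg
```

Treat this warning as a likely bug: it usually means that two vectors that should have matched in length did not.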

Matrix operations are also vectorized:

x <- matrix(1:4, 2, 2)
y <- matrix(rep(10, 4), 2, 2)
x * y # is not matrix multiplication. It's element wise

##      [,1] [,2]
## [1,]   10   30
## [2,]   20   40

x / y # is elementwise division.

##      [,1] [,2]
## [1,]  0.1  0.3
## [2,]  0.2  0.4

Element-wise operations on a matrix proceed column by column. In the example below, y[1, 1] is multiplied by z[1], followed by y[2, 1] multiplied by z[2], and then the vector z is recycled for multiplying the second column of y.

y * z

##      [,1] [,2]
## [1,]   10   10
## [2,]   20   20

True matrix multiplication is:

x %*% y

##      [,1] [,2]
## [1,]   40   40
## [2,]   60   60

Vectorized operations make code a lot simpler.

Using apply functions.

There is a family of ‘apply’ functions which can be used to efficiently evaluate a function over a data frame, matrix, list or vector.

apply
by
lapply
tapply
sapply

apply

apply applies a function to each row or column of a matrix.

m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
m

##       [,1] [,2]
##  [1,]    1   11
##  [2,]    2   12
##  [3,]    3   13
##  [4,]    4   14
##  [5,]    5   15
##  [6,]    6   16
##  [7,]    7   17
##  [8,]    8   18
##  [9,]    9   19
## [10,]   10   20

# 1 is the row index
# 2 is the column index
apply(m, 1, sum)

##  [1] 12 14 16 18 20 22 24 26 28 30

apply(m, 2, sum)

## [1]  55 155

apply(m, 1, mean)

##  [1]  6  7  8  9 10 11 12 13 14 15

apply(m, 2, mean)

## [1]  5.5 15.5
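
For sums and means, base R also provides the dedicated functions rowSums(), colSums(), rowMeans() and colMeans(). apply() remains the general tool for any other function, as in this sketch (the anonymous function is illustrative):

```r
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)

all(rowSums(m) == apply(m, 1, sum))   # TRUE: same result as apply over rows
colMeans(m)                           # 5.5 15.5

# Any function can be supplied, e.g. the spread of each column
spread <- apply(m, 2, function(col) max(col) - min(col))
spread                                # 9 9
```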

by

by applies a function to subsets of a data frame.

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

by(iris[, 1:2], iris[,"Species"], summary)

## iris[, "Species"]: setosa
##   Sepal.Length    Sepal.Width   
##  Min.   :4.300   Min.   :2.300  
##  1st Qu.:4.800   1st Qu.:3.200  
##  Median :5.000   Median :3.400  
##  Mean   :5.006   Mean   :3.428  
##  3rd Qu.:5.200   3rd Qu.:3.675  
##  Max.   :5.800   Max.   :4.400  
## -------------------------------------------------------- 
## iris[, "Species"]: versicolor
##   Sepal.Length    Sepal.Width   
##  Min.   :4.900   Min.   :2.000  
##  1st Qu.:5.600   1st Qu.:2.525  
##  Median :5.900   Median :2.800  
##  Mean   :5.936   Mean   :2.770  
##  3rd Qu.:6.300   3rd Qu.:3.000  
##  Max.   :7.000   Max.   :3.400  
## -------------------------------------------------------- 
## iris[, "Species"]: virginica
##   Sepal.Length    Sepal.Width   
##  Min.   :4.900   Min.   :2.200  
##  1st Qu.:6.225   1st Qu.:2.800  
##  Median :6.500   Median :3.000  
##  Mean   :6.588   Mean   :2.974  
##  3rd Qu.:6.900   3rd Qu.:3.175  
##  Max.   :7.900   Max.   :3.800

by(iris[, 1:2], iris[,"Species"], sum)

## iris[, "Species"]: setosa
## [1] 421.7
## -------------------------------------------------------- 
## iris[, "Species"]: versicolor
## [1] 435.3
## -------------------------------------------------------- 
## iris[, "Species"]: virginica
## [1] 478.1

tapply

tapply applies a function to subsets of a vector.

df <- data.frame(names = sample(c("A","B","C"), 10, replace = TRUE), length = rnorm(10))
df

##    names      length
## 1      B -0.15564185
## 2      C -2.01990840
## 3      A -1.05965372
## 4      A  0.09074220
## 5      C  0.07270563
## 6      A  1.33217365
## 7      B  0.56650964
## 8      B  0.79729735
## 9      B  0.59426159
## 10     B -0.15828292

tapply(df$length, df$names, mean)

##          A          B          C 
##  0.1210874  0.3288288 -0.9736014

Now with a more familiar dataset.

tapply(iris$Petal.Length, iris$Species, mean)

##     setosa versicolor  virginica 
##      1.462      4.260      5.552

lapply (and llply)

What it does: Returns a list of same length as the input. Each element of the output is a result of applying a function to the corresponding element.

my_list <- list(a = 1:10, b = 2:20)
my_list

## $a
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $b
##  [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

lapply(my_list, mean)

## $a
## [1] 5.5
## 
## $b
## [1] 11

sapply

sapply is a more user-friendly version of lapply and will return a vector or matrix instead of a list where appropriate.

Let’s work with the same list we just created.

my_list

## $a
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $b
##  [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

x <- sapply(my_list, mean)
x

##    a    b 
##  5.5 11.0

class(x)

## [1] "numeric"
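
When each result has more than one element, sapply returns a matrix. The sketch below also shows vapply(), a stricter base R relative in which you declare the expected type and length of each result.

```r
my_list <- list(a = 1:10, b = 2:20)

r <- sapply(my_list, range)   # each result has length 2, so we get a 2 x 2 matrix
r

sapply(my_list, length)       # a named vector: a = 10, b = 19

means <- vapply(my_list, mean, numeric(1))   # errors if a result is not one number
means
```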

replicate

An extremely useful function to generate datasets for simulation purposes.

replicate(10, rnorm(10))

##              [,1]        [,2]       [,3]       [,4]        [,5]
##  [1,]  0.26421373 -0.10159728 -1.9375789  0.6193420 -0.37341041
##  [2,] -0.30863414  0.50976733  0.2055021 -0.4598711 -0.89605471
##  [3,] -0.06765376  0.52928225  0.1580237  1.3963495 -0.81323233
##  [4,]  0.17311238  0.22412245  0.8983873  1.9073195 -0.37334082
##  [5,]  1.46754668 -0.80187641 -0.7102795  1.9133833  1.15881163
##  [6,] -1.41534907  1.28553724 -0.2017006 -1.5268909  1.14028436
##  [7,] -0.82721664 -2.18201060  0.2037915 -0.2475745 -0.26601050
##  [8,]  0.73795417 -1.56953094 -0.4912095  1.1445191  0.17755596
##  [9,]  0.91559526  1.25281136 -1.2630158  0.2805345  0.09110074
## [10,] -0.63911177 -0.03310822  1.5041693  0.2057096 -1.71796162
##              [,6]        [,7]         [,8]       [,9]      [,10]
##  [1,] -0.30087259  1.00063954 -0.504888500 -0.2648303  1.3302251
##  [2,]  0.65033846  0.63341144 -2.136281521  0.8952265 -2.1804741
##  [3,] -0.80649729  0.13914735 -0.731222701 -0.7382418  1.3266580
##  [4,] -0.25434165  0.08408901  0.005663705  0.1393083 -0.7212685
##  [5,]  1.85881674  1.11605037  1.336734467  1.8564083 -1.2622886
##  [6,]  0.46422573 -0.57456118 -0.249801668 -2.1058102 -0.6862764
##  [7,]  0.06724379  1.05281759 -0.088825598 -0.5552824  0.5025837
##  [8,] -0.93850496  0.52840918 -0.371982637 -0.8857639 -0.9486380
##  [9,]  0.56307305 -1.34094760 -0.946183435 -1.4913859  0.5943181
## [10,]  1.38833429 -0.42663614  0.956575116 -0.7419571  0.8020775

replicate(10, rnorm(10), simplify = TRUE)

##              [,1]       [,2]        [,3]         [,4]        [,5]
##  [1,]  1.20960402 -0.4158126  0.02510334  1.158939833  0.71317936
##  [2,] -0.27429215 -1.8490742  0.72829752 -1.251474297  0.63496378
##  [3,]  0.51196608  0.2789386 -1.21910489 -0.305100265 -1.31396758
##  [4,]  1.63004468 -1.0069109  1.30153988 -1.292111005 -0.89604490
##  [5,]  2.25532261  0.4742752 -0.28319620 -0.441805702  1.34853333
##  [6,] -0.50389760 -0.6409965 -0.89472811 -1.207065136 -0.09317015
##  [7,]  0.05006981 -0.1776505  1.47746350 -0.728781404 -0.83565480
##  [8,]  1.27836776  2.7134612 -1.32250365  1.868500614 -0.82169765
##  [9,] -0.47332446 -0.1848393  0.90136592  0.009677263 -0.48432410
## [10,] -0.72155071 -1.3830132  0.30951871 -1.542600789 -0.48148185
##             [,6]         [,7]        [,8]        [,9]       [,10]
##  [1,]  0.8843542  0.022220671 -0.79730862  0.51982447  0.25013255
##  [2,] -1.2180504 -1.909112828  0.21654130 -0.39123723 -0.63367137
##  [3,]  0.8889140  0.007988984  1.74494642  0.71555944 -0.02367925
##  [4,]  1.3713941  0.764123382 -1.44089197  0.88094830  0.89915134
##  [5,] -1.7088335 -0.816713311  0.30085003 -0.67295293 -1.92298810
##  [6,] -0.3128146 -0.457666717 -1.12377527  1.87636709  0.49752297
##  [7,]  0.5325853 -0.287594995 -1.70361001 -0.98239953 -2.57322846
##  [8,] -1.2103010  0.080100320 -1.23877166 -0.08359462  1.22175106
##  [9,]  0.0428119  0.762190614 -0.08887393  1.32409417 -0.46036287
## [10,]  0.7869221 -1.642560278 -0.30738330 -0.31920379 -0.59422701

The final argument, simplify, turns the result into a vector or matrix if possible. Simplification is on by default, which is why both calls above return a 10 x 10 matrix.
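
Two practical points for simulations, sketched below: fixing the random seed makes the draws reproducible, and simplify = FALSE keeps the results as a list. The seed value 42 is an arbitrary assumption.

```r
set.seed(42)                  # assumption: any fixed seed works; 42 is arbitrary
a <- replicate(3, rnorm(5))
set.seed(42)
b <- replicate(3, rnorm(5))

identical(a, b)               # TRUE: same seed, same draws
dim(a)                        # 5 3: one column per replication

sims <- replicate(3, rnorm(5), simplify = FALSE)
class(sims)                   # "list"
length(sims)                  # 3
```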

mapply

It’s more or less a multivariate version of sapply. It applies a function to all corresponding elements of each argument.

example:

list_1 <- list(a = c(1:10), b = c(11:20))
list_1

## $a
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $b
##  [1] 11 12 13 14 15 16 17 18 19 20

list_2 <- list(c = c(21:30), d = c(31:40))
list_2

## $c
##  [1] 21 22 23 24 25 26 27 28 29 30
## 
## $d
##  [1] 31 32 33 34 35 36 37 38 39 40

mapply(sum, list_1$a, list_1$b, list_2$c, list_2$d)

##  [1]  64  68  72  76  80  84  88  92  96 100
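
Two smaller sketches may make the pairing of arguments easier to see. The inputs are illustrative.

```r
# First elements are paired, then second elements, and so on
sums <- mapply(function(x, y) x + y, 1:3, 4:6)
sums            # 5 7 9

# When the results differ in length, mapply returns a list
reps <- mapply(rep, 1:3, 3:1)
reps            # 1 1 1, then 2 2, then 3
```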

Additional Resources

References

Chapman, Christopher N. and McDonnell Feit, Elea, R for Marketing Research and Analytics, Springer. http://www.springer.com/us/book/9783319144351
Ihaka, R., and R. Gentleman. 1996. R: A language for data analysis and graphics, Journal of Computational and Graphical Statistics 5(3):299–314. https://www.stat.auckland.ac.nz/~ihaka/downloads/R-paper.pdf
Norm Matloff, The Art of R Programming: A Tour of Statistical Software Design, 1st Edition (provided)
Phil Spector, An Introduction to R, Department of Statistics, UC Berkeley (provided)
Garrett Grolemund and Hadley Wickham, R for Data Science, O’Reilly.

Introduction to R