tapply & *apply()

The tapply() function takes three main arguments:

tapply(X, INDEX, FUN, ...)

where:

X: the vector to apply the function to
INDEX: a factor, or a list of one or more factors, that defines the groups
FUN: the function to apply to each group

Here is an example of how to use it:

# First, create a data frame

df <- data.frame(
  class  = c("A", "A", "A", "B", "B", "B"),
  result = c("1", "0", "1", "1", "0", "1"),
  income = c(14, 15, 13, 13, 20, 6)
)

# Take a look at this data frame

df

##   class result income
## 1     A      1     14
## 2     A      0     15
## 3     A      1     13
## 4     B      1     13
## 5     B      0     20
## 6     B      1      6

This data frame describes 6 made-up people:

Each person is in either class A or class B.
The result column records whether they succeeded in a competition: 1 means succeeded, 0 means failed.
The third column gives their income in thousands of dollars.

Apply Function to One Variable, Grouped by Another Variable

For example, we want to know the average income for each class.

# Find out the average income for each class:

tapply(df$income, df$class, mean)

##  A  B 
## 14 13

We can see that:

The average income of people in class A is $14,000.
The average income of people in class B is $13,000.

We can also pass the na.rm argument to tapply(), which forwards it to mean(), so that the mean is calculated while ignoring NA values in the data frame:

# Create a new data frame with one NA in the income column

df_new <- data.frame(
  class  = c("A", "A", "A", "B", "B", "B"),
  result = c("1", "0", "1", "1", "0", "1"),
  income = c(14, 15, 13, 13, NA, 6)
)

# Take a look!
df_new

##   class result income
## 1     A      1     14
## 2     A      0     15
## 3     A      1     13
## 4     B      1     13
## 5     B      0     NA
## 6     B      1      6

Then I calculate the average income for each class.

# Without na.rm = TRUE, any group that contains an NA returns NA
tapply(df_new$income, df_new$class, mean)

##  A  B 
## 14 NA

# With na.rm = TRUE, the NA is dropped before the mean is taken
tapply(df_new$income, df_new$class, mean,
       na.rm = TRUE)

##    A    B 
## 14.0  9.5
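
The na.rm = TRUE above reaches mean() through the ... argument of tapply(). As a small sketch of the same idea, an anonymous function can make that hand-off explicit and can also count the missing values per class. The names mean_by_class and na_by_class are only illustrative and are not part of the original example.

```r
# Same data as df_new above, rebuilt here so the snippet runs on its own
df_new <- data.frame(
  class  = c("A", "A", "A", "B", "B", "B"),
  result = c("1", "0", "1", "1", "0", "1"),
  income = c(14, 15, 13, 13, NA, 6)
)

# Equivalent to tapply(..., mean, na.rm = TRUE), written with an anonymous function
mean_by_class <- tapply(df_new$income, df_new$class,
                        function(x) mean(x, na.rm = TRUE))
mean_by_class

# Count how many incomes are missing in each class
na_by_class <- tapply(df_new$income, df_new$class,
                      function(x) sum(is.na(x)))
na_by_class
```

Counting the NA values first is a cheap way to see how many observations each group mean is really based on.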

Furthermore, I wish to know the average income for each result:

tapply(df_new$income, df_new$result, mean, 
       na.rm = TRUE)

##    0    1 
## 15.0 11.5

Apply Function to One Variable, Grouped by Multiple Variables

For example, we want to find the average income for each combination of class and result.

# Find the average income, grouped by class and result

tapply(df$income, list(df$class, df$result), mean)

##    0    1
## A 15 13.5
## B 20  9.5

We can interpret the result as follows:

Cell (1,1): people in class A who failed have an average income of 15 (that is, $15,000).

You can think of this as a conditional mean, the sample analogue of a conditional expectation:

\[ E(\text{income} \mid \text{result} = 0, \text{class} = A) = 15 \]

Note: In this example we grouped by two variables, so we have to wrap them in list().
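
Because the two-variable result is an ordinary matrix, individual cells can be picked out by name. The sketch below assumes the same df as above; naming the list elements (class, result) and the variable avg are illustrative choices, not part of the original example.

```r
# Same data as df above, rebuilt here so the snippet runs on its own
df <- data.frame(
  class  = c("A", "A", "A", "B", "B", "B"),
  result = c("1", "0", "1", "1", "0", "1"),
  income = c(14, 15, 13, 13, 20, 6)
)

# Naming the list elements labels the two dimensions of the output
avg <- tapply(df$income, list(class = df$class, result = df$result), mean)
avg

# Look up single cells by their row and column names
avg["A", "0"]   # class A, failed
avg["B", "1"]   # class B, succeeded
```

Indexing by name rather than by position keeps the lookup readable and avoids having to remember which factor ended up on the rows.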

apply() function

This function enables us to apply a function to the rows or columns of a matrix or data frame.

Basic syntax:

apply(X, MARGIN, FUN)

X: the matrix or data frame you wish to use
MARGIN: 1 applies the function to each row; 2 applies it to each column
FUN: the function you wish to apply to the rows or columns of X

Let’s look at an example!

# Let's create a matrix

m <- matrix(1:9, nrow = 3, byrow = TRUE)

# Take a look at it:
m

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

For example, I want to compute the mean of each row:

# calculate the mean of each row
apply(m, 1, mean)

## [1] 2 5 8

We can check this by using:

rowMeans(m)

## [1] 2 5 8

Furthermore, I wish to find the sum of each column:

# find the sum of each column of m
apply(m, 2, sum)

## [1] 12 15 18

We can check this by using:

colSums(m)

## [1] 12 15 18
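
The function passed to apply() does not have to return a single number. As a rough sketch using the same matrix m, the example below returns two values per row; the variable name row_ranges is only illustrative.

```r
# Same matrix as m above, rebuilt here so the snippet runs on its own
m <- matrix(1:9, nrow = 3, byrow = TRUE)

# range() returns c(min, max), so each row of m yields two values
row_ranges <- apply(m, 1, range)
row_ranges

# The results are stored as columns: column i holds the range of row i
row_ranges[, 1]   # min and max of the first row of m
```

Note the orientation: even with MARGIN = 1, each row's result becomes a column of the output, so t(row_ranges) is needed if you want one output row per input row.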

lapply() function

lapply() applies a function to every element of a list or vector and always returns a list of the same length as its input. A data frame is treated as a list of its columns. Because it works element by element, it does not need a MARGIN argument.

Basic syntax:

lapply(X, FUN)

X: the input (a list, vector, or data frame; a matrix is treated as a vector of its cells)
FUN: the function you wish to apply to each element

# I'll keep using the matrix m created before
# Now, I define a function of my own:

stretch <- function(x) {
  # Return the value directly instead of assigning it to x
  2 * x + 3
}

t(lapply(m, stretch))

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 5    11   17   7    13   19   9    15   21

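lapply() returns a list, which prints one element per block and takes up a lot of space. To save space, I wrapped the call in t(), which lays the list out as a single row.

As you can see, my function stretch multiplies each cell of the matrix by 2 and then adds 3. Note that lapply() visits the cells column by column (1, 4, 7, 2, ...), which is why the output starts with 5, 11, 17.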
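
Since the arithmetic inside stretch is already vectorised, lapply() is not strictly needed for this particular function. The sketch below, which redefines m and stretch so that it runs on its own, compares the two approaches; the names flat and stretched are only illustrative.

```r
m <- matrix(1:9, nrow = 3, byrow = TRUE)
stretch <- function(x) 2 * x + 3

# lapply() walks the cells column by column and returns a list;
# unlist() flattens that list into a plain vector
flat <- unlist(lapply(m, stretch))
flat

# Calling the function on the matrix directly keeps the 3 x 3 shape
stretched <- stretch(m)
stretched
```

lapply() becomes useful when the function cannot handle a whole vector at once, or when the input is a list whose elements differ in type or length.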

Another example:

# Here, I have a character vector of names
names <- c("abe", "bush", "charlie", "daisy")

lapply(names, toupper)

## [[1]]
## [1] "ABE"
## 
## [[2]]
## [1] "BUSH"
## 
## [[3]]
## [1] "CHARLIE"
## 
## [[4]]
## [1] "DAISY"
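
lapply() is most natural on a genuine list, where the elements may have different lengths. A minimal sketch, using a made-up list called scores that is not part of the original example:

```r
# Hypothetical data: two subjects with a different number of scores each
scores <- list(math = c(90, 80, 70), art = c(60, 100))

# One mean per list element; the element names are carried over
avg_scores <- lapply(scores, mean)
avg_scores

# The output is a list with the same length as the input
length(avg_scores) == length(scores)
```

Because the result is a named list, a single value can be pulled out with avg_scores$math or avg_scores[["art"]].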

sapply() function

This function applies a function to each element of a list, vector, or data frame and, where possible, simplifies the result to a vector or matrix instead of returning a list.

Like lapply(), sapply() works element by element, so it does not need a MARGIN argument.

Basic syntax:

sapply(X, FUN)

X: the input list, vector, or data frame
FUN: the function you wish to apply to each element

Let's see how it works with an example:

# Remember the 3 x 3 matrix m and the stretch() function?

sapply(m, stretch)

## [1]  5 11 17  7 13 19  9 15 21

Differences between sapply() and lapply()

lapply() always returns a list.
sapply() returns a simplified version of the result (vector or matrix) if possible; otherwise, returns a list.

For example:

df_diff <- data.frame(
  a = 0:5,
  b = 5:10,
  c = 20:25
)

df_diff

##   a  b  c
## 1 0  5 20
## 2 1  6 21
## 3 2  7 22
## 4 3  8 23
## 5 4  9 24
## 6 5 10 25

# using lapply()
lapply(df_diff, mean)

## $a
## [1] 2.5
## 
## $b
## [1] 7.5
## 
## $c
## [1] 22.5

# using sapply()
sapply(df_diff, mean)

##    a    b    c 
##  2.5  7.5 22.5
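
The "if possible" in the rule above depends on the shape of the individual results. The sketch below illustrates three cases with the same df_diff; the variable names are illustrative, and vapply() is a base R function that goes slightly beyond what this post covers.

```r
# Same data as df_diff above, rebuilt here so the snippet runs on its own
df_diff <- data.frame(a = 0:5, b = 5:10, c = 20:25)

# Every call returns two values, so sapply() simplifies to a 2 x 3 matrix
col_ranges <- sapply(df_diff, range)
col_ranges

# Results of different lengths cannot be simplified, so a list comes back
ragged <- sapply(list(x = 1:3, y = 1:5), function(v) v[v > 2])
ragged

# vapply() works like sapply() but checks each result against a declared
# template (here: exactly one number) and fails if a result does not match
col_means <- vapply(df_diff, mean, FUN.VALUE = numeric(1))
col_means
```

Since the return type of sapply() can change with the data, vapply() is often the safer choice inside functions, while sapply() stays convenient for interactive work.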