The tapply() function takes three main arguments:
tapply(X, INDEX, FUN = NULL, ...)
where:
X: a vector to apply a function to
INDEX: a factor, or a list of one or more factors, used to group X
FUN: the function to apply
I’ll show how to use it with an example:
# We'll first have a data frame
df <- data.frame(
class = c ("A","A","A","B","B","B"),
result = c ("1","0","1","1","0","1"),
income = c (14, 15, 13, 13, 20, 6)
)
# Take a look at this data frame
df
## class result income
## 1 A 1 14
## 2 A 0 15
## 3 A 1 13
## 4 B 1 13
## 5 B 0 20
## 6 B 1 6
This data frame describes 6 made-up people:
Each person is in either class A or class B.
The result column records whether they succeeded in a competition: 1 means succeeded, 0 means failed.
The third column gives their income in thousands of dollars.
For example, we want to know the average income for each class.
# Find out the average income for each class:
tapply(df$income,df$class, mean)
## A B
## 14 13
We can see that:
The average income of people in class A is $14,000.
The average income of people in class B is $13,000.
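As a cross-check, the same group means can be computed by hand with logical subsetting. This is only a minimal sketch: it rebuilds the class and income columns of df from above, and the names mean_A, mean_B, and by_tapply are introduced here for illustration.

```r
# Rebuild the relevant columns of the example data frame
df <- data.frame(
  class  = c("A", "A", "A", "B", "B", "B"),
  income = c(14, 15, 13, 13, 20, 6)
)

# Group means computed by hand, one class at a time
mean_A <- mean(df$income[df$class == "A"])
mean_B <- mean(df$income[df$class == "B"])

# The same computation in a single tapply() call
by_tapply <- tapply(df$income, df$class, mean)

c(A = mean_A, B = mean_B)   # A = 14, B = 13
by_tapply                   # A = 14, B = 13
```

With only two groups the manual version is manageable, but tapply() scales to any number of groups without extra code.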
We can also pass the na.rm argument through tapply() to calculate the mean while ignoring NA values in the data frame:
# Update the data frame so that one income is NA
df_new <- data.frame(
class = c ("A","A","A","B","B","B"),
result = c ("1","0","1","1","0","1"),
income = c (14, 15, 13, 13, NA, 6)
)
# Take a look!
df_new
## class result income
## 1 A 1 14
## 2 A 0 15
## 3 A 1 13
## 4 B 1 13
## 5 B 0 NA
## 6 B 1 6
Then I calculate the average income for each class.
# Without na.rm = TRUE, the mean for class B is NA
tapply(df_new$income, df_new$class, mean)
## A B
## 14 NA
tapply(df_new$income, df_new$class, mean,
na.rm = TRUE)
## A B
## 14.0 9.5
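The na.rm = TRUE argument works because tapply() forwards any extra arguments (the ... in its signature) to FUN. The sketch below, which rebuilds the class and income columns of df_new, compares that form with an explicit wrapper function; the names via_dots and via_wrapper are made up for this illustration.

```r
# Rebuild the relevant columns of the data frame that contains an NA
df_new <- data.frame(
  class  = c("A", "A", "A", "B", "B", "B"),
  income = c(14, 15, 13, 13, NA, 6)
)

# Extra arguments after FUN are passed on to mean()
via_dots <- tapply(df_new$income, df_new$class, mean, na.rm = TRUE)

# Equivalent form: wrap mean() in an anonymous function
via_wrapper <- tapply(df_new$income, df_new$class,
                      function(v) mean(v, na.rm = TRUE))

via_dots      # A = 14, B = 9.5
via_wrapper   # A = 14, B = 9.5
```

The wrapper form is handy when the function needs more than a simple extra argument, for example when several steps must be combined.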
Furthermore, I wish to know the average income for each result:
tapply(df_new$income, df_new$result, mean,
na.rm = TRUE)
## 0 1
## 15.0 11.5
We can also group by more than one variable. For example, we want to find the average income for each combination of class and result:
# find out the average income, grouped by class and result
tapply (df$income, list(df$class, df$result), mean)
## 0 1
## A 15 13.5
## B 20 9.5
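We can interpret each cell as the average income for one combination of class and result. For example, the people in class A who failed (result = 0) have an average income of $15,000.
You can think of each cell as a conditional expectation:
\[ E(\text{income} \mid \text{result} = 0, \text{class} = A) = 15 \]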
Note: In this example we grouped by two variables, so we need to wrap them in list().
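When INDEX is a list of two factors, tapply() returns a matrix whose row and column names are the factor levels, so a single cell can be looked up by name. The following is a small sketch that rebuilds df from above; the name means is chosen here only for illustration.

```r
# Rebuild the example data frame
df <- data.frame(
  class  = c("A", "A", "A", "B", "B", "B"),
  result = c("1", "0", "1", "1", "0", "1"),
  income = c(14, 15, 13, 13, 20, 6)
)

# Average income for every class/result combination
means <- tapply(df$income, list(df$class, df$result), mean)

dim(means)        # 2 2: one row per class, one column per result
means["A", "0"]   # 15: class A, result 0
means["B", "1"]   # 9.5: class B, result 1
```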
The apply() function enables us to apply a function to the rows or columns of a matrix or data frame.
Basic syntax:
apply(X, MARGIN, FUN)
X: the data frame or matrix you wish to use
MARGIN: 1 applies the function to each row; 2 applies it to each column
FUN: the function you wish to apply to the rows or columns of X
Let’s show it by an example!
# Let's create a matrix
m <- matrix(1:9, nrow = 3, byrow = TRUE)
# Take a look at it:
m
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
For example, I want to compute the mean of each row:
# calculate the mean of each row
apply (m, 1, mean)
## [1] 2 5 8
We can check this by using:
rowMeans(m)
## [1] 2 5 8
Furthermore, I wish to find the sum of each column:
# find the sum of each column of m
apply(m, 2, sum)
## [1] 12 15 18
We can check this by using:
colSums(m)
## [1] 12 15 18
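Row means and column sums have dedicated shortcuts, so apply() is most useful when no shortcut exists. As a hedged sketch, the example below passes an anonymous function that computes the spread (maximum minus minimum) of each row and each column of m; the names row_spread and col_spread are made up for this illustration.

```r
# The same 3 x 3 matrix as above
m <- matrix(1:9, nrow = 3, byrow = TRUE)

# Spread of each row: MARGIN = 1
row_spread <- apply(m, 1, function(r) max(r) - min(r))

# Spread of each column: MARGIN = 2
col_spread <- apply(m, 2, function(cl) max(cl) - min(cl))

row_spread   # 2 2 2
col_spread   # 6 6 6
```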
The lapply() function applies a function to each element of a list and returns a list of the same length. It also accepts a vector or a data frame as input (a data frame is treated as a list of its columns), and the output is always a list. Because it works element by element, it doesn’t need a MARGIN argument.
Basic syntax:
lapply(X, FUN)
X: the input (a vector, list, or data frame; a matrix is treated as a vector of its cells)
FUN: the function you wish to apply to each element of X
# I'll keep using the matrix m created before
# Now, I define a function of my own:
stretch <- function(x) {
  2 * x + 3
}
t(lapply(m, stretch))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 5 11 17 7 13 19 9 15 21
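lapply() returns a list, which normally prints one element at a time; to save space, I transpose the result with t() so that it prints as a single row.
As you can see, my function stretch multiplies each cell of the matrix by 2 and then adds 3. The cells are visited column by column, which is why the output starts with 5, 11, 17.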
Another example:
# Here, I have a character vector of names
names <- c ("abe", "bush", "charlie", "daisy")
lapply(names, toupper)
## [[1]]
## [1] "ABE"
##
## [[2]]
## [1] "BUSH"
##
## [[3]]
## [1] "CHARLIE"
##
## [[4]]
## [1] "DAISY"
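lapply() is most natural on an actual list, where the elements may have different lengths. The sketch below uses a made-up list called scores (not part of the examples above) to show that the output is a list with one entry per input element.

```r
# A hypothetical list whose elements have different lengths
scores <- list(
  math  = c(90, 80),
  art   = c(70, 75, 95),
  music = 100
)

# One mean per element; the names are carried over
avg <- lapply(scores, mean)

length(avg) == length(scores)   # TRUE
avg$math                        # 85
avg$art                         # 80
avg$music                       # 100
```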
The sapply() function applies a function to each element of a list, vector, or data frame and returns a vector or matrix when the results can be simplified.
Like lapply(), sapply() works element by element, so it doesn’t need a MARGIN argument.
Basic syntax: sapply(X, FUN)
X: the input vector, list, or data frame
FUN: the function you wish to apply
Let’s see how to use it with an example:
# remember the 3*3 matrix m, and the *stretch* function?
sapply(m, stretch)
## [1] 5 11 17 7 13 19 9 15 21
The difference between lapply() and sapply() is the type of the result: lapply() always returns a list.
sapply() returns a simplified version of the result (vector or matrix) if possible; otherwise, returns a list.
For example:
df_diff <- data.frame(
a = c (0:5),
b = c (5:10),
c = c (20:25)
)
df_diff
## a b c
## 1 0 5 20
## 2 1 6 21
## 3 2 7 22
## 4 3 8 23
## 5 4 9 24
## 6 5 10 25
# using lapply()
lapply (df_diff, mean)
## $a
## [1] 2.5
##
## $b
## [1] 7.5
##
## $c
## [1] 22.5
# using sapply()
sapply(df_diff, mean)
## a b c
## 2.5 7.5 22.5
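The "if possible" part of sapply() depends on the shape of the individual results. As a rough sketch: when every result has the same length greater than one, sapply() returns a matrix, and when the lengths differ it falls back to a list. The names ranges and ragged, and the small list passed in the second call, are made up for this illustration.

```r
# The same data frame as above
df_diff <- data.frame(
  a = 0:5,
  b = 5:10,
  c = 20:25
)

# range() returns two values per column, so the result is a 2 x 3 matrix
ranges <- sapply(df_diff, range)
ranges[, "a"]   # 0 5

# Results of different lengths cannot be simplified, so a list is returned
ragged <- sapply(list(a = 1:3, b = 1:5), function(v) v[v > 2])
is.list(ragged)   # TRUE
ragged$b          # 3 4 5
```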