apply() Function
apply(X, MARGIN, FUN, ...)
Where,
X - The matrix or data frame to which we want to apply the function.
MARGIN - A vector that defines which dimension of the matrix/data frame the function is applied over.
MARGIN = 1 - The function is applied to each row.
MARGIN = 2 - The function is applied to each column.
MARGIN = c(1, 2) - The function is applied to each individual cell (every row and column combination).
FUN - The function to apply over the chosen MARGIN.
… - Any further optional arguments, passed on to FUN.
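The three MARGIN settings can be compared side by side on a small matrix. This is a minimal sketch; the matrix `m` and the result names are made up for illustration.

```r
# A small 2 x 3 matrix, made up for illustration
m <- matrix(1:6, nrow = 2)

row_totals <- apply(m, 1, sum)                      # one value per row
col_totals <- apply(m, 2, sum)                      # one value per column
doubled    <- apply(m, c(1, 2), function(v) v * 2)  # FUN sees one cell at a time

row_totals
col_totals
doubled
```

With MARGIN = c(1, 2) the result keeps the shape of the input, because FUN returns one value per cell.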
Example
The mtcars data comes from the 1974 Motor Trend magazine.
The data includes fuel consumption data, and ten aspects of car design for then-current car models.
# show first few rows of mtcars
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Calculating the column mean, row sum and column quantiles.
# get the mean of each column
apply(mtcars, 2, mean)
## mpg cyl disp hp drat wt qsec
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
## vs am gear carb
## 0.437500 0.406250 3.687500 2.812500
# get the sum of each row
apply(mtcars, 1, sum)
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## 328.980 329.795 259.580 426.135
## Hornet Sportabout Valiant Duster 360 Merc 240D
## 590.310 385.540 656.920 270.980
## Merc 230 Merc 280 Merc 280C Merc 450SE
## 299.570 350.460 349.660 510.740
## Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
## 511.500 509.850 728.560 726.644
## Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
## 725.695 213.850 195.165 206.955
## Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
## 273.775 519.650 506.085 646.280
## Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
## 631.175 208.215 272.570 273.683
## Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
## 670.690 379.590 694.710 288.890
# get column quantiles (notice the quantile percents as row names)
apply(mtcars, 2, quantile, probs = c(0.10, 0.25, 0.50, 0.75, 0.90))
## mpg cyl disp hp drat wt qsec vs am gear carb
## 10% 14.340 4 80.610 66.0 3.007 1.95550 15.5340 0 0 3 1
## 25% 15.425 4 120.825 96.5 3.080 2.58125 16.8925 0 0 3 2
## 50% 19.200 6 196.300 123.0 3.695 3.32500 17.7100 0 0 4 2
## 75% 22.800 8 326.000 180.0 3.920 3.61000 18.9000 1 1 4 4
## 90% 30.090 8 396.000 243.5 4.209 4.04750 19.9900 1 1 5 4
Let’s construct a 4 x 4 matrix and calculate the sum of the values in each row (MARGIN = 1).
# using apply() on a matrix
matrix_1 <- matrix(1:16, nrow = 4)
apply(matrix_1, 1, sum)
## [1] 28 32 36 40
But let’s see what happens when we use a vector instead of matrix.
# create a vector first
vector_1 <- c(1:15)
vector_1
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# use the apply() function
apply(vector_1, 1, sum)
## Error in apply(vector_1, 1, sum): dim(X) must have a positive length
As you can see, it didn’t work: apply() requires an object with dimensions, such as a matrix, array, or data frame, and a plain vector has none.
If the data is a vector, we need to use other functions such as lapply(), sapply(), or vapply() instead.
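For a plain vector there are two simple routes, sketched below: use sapply() to visit each element, or give the vector dimensions first so that apply() accepts it. The names `squares`, `as_matrix`, and `total` are illustrative.

```r
vector_1 <- 1:15

# Option 1: sapply() visits each element and returns a vector
squares <- sapply(vector_1, function(v) v^2)

# Option 2: reshape the vector into a one-row matrix, then use apply()
as_matrix <- matrix(vector_1, nrow = 1)
total <- apply(as_matrix, 1, sum)

squares
total
```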
lapply() Function
lapply(X, FUN, ...)
Where,
X - A list or vector to whose elements the function is applied (a data frame is treated as a list of columns).
FUN - The function to apply to each element of X.
… - Any further optional arguments, passed on to FUN.
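Before turning to a data frame, a short sketch on a plain list shows the basic behaviour: one input element in, one list element out. The list `measurements` is made up for illustration.

```r
# A made-up list whose elements have different lengths
measurements <- list(a = 1:5, b = c(2.5, 3.5), c = 10)

# lapply() always returns a list with one element per input element
ranges <- lapply(measurements, function(v) max(v) - min(v))

ranges
```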
Example
data("iris")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We are going to remove the non-numerical variables.
iris$Species <- NULL
str(iris)
## 'data.frame': 150 obs. of 4 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
Calculating the column mean.
# Let's calculate the mean of each column
# of the iris dataset with lapply,
# which returns each column mean as an element of a list
lapply(iris[, 1:ncol(iris)], mean)
## $Sepal.Length
## [1] 5.843333
##
## $Sepal.Width
## [1] 3.057333
##
## $Petal.Length
## [1] 3.758
##
## $Petal.Width
## [1] 1.199333
# We can add extra parameters to define the configuration of the mean function
lapply(iris[, 1:ncol(iris)], mean, na.rm = TRUE)
## $Sepal.Length
## [1] 5.843333
##
## $Sepal.Width
## [1] 3.057333
##
## $Petal.Length
## [1] 3.758
##
## $Petal.Width
## [1] 1.199333
If we want the result as a vector, we can wrap the lapply() call in the unlist() function.
unlist(lapply(iris,mean))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
sapply() Function
sapply() is a user-friendly wrapper around lapply() that tries to simplify the result.
sapply() takes a list, vector, or data frame as input and returns a vector or matrix where possible; otherwise it returns a list.
sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
Where,
X - The list, vector, or data frame to which we want to apply the function.
FUN - The function to be applied.
… - Further optional arguments, passed on to FUN.
simplify - Whether the result should be simplified to a vector or matrix where possible.
USE.NAMES - Whether the result should be named; if X is a character vector without names, its values are used as the names.
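The effect of simplify and USE.NAMES is easiest to see on a character vector. The following is a small sketch; the vector `words` and the result names are made up for illustration.

```r
words <- c("apple", "kiwi", "banana")

# Default: the result is simplified to a vector, named after the input strings
named <- sapply(words, nchar)

# USE.NAMES = FALSE drops those names
unnamed <- sapply(words, nchar, USE.NAMES = FALSE)

# simplify = FALSE keeps the list structure, like lapply()
as_list <- sapply(words, nchar, simplify = FALSE)

named
unnamed
as_list
```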
Example
To illustrate the difference, we can build a list from R's built-in beaver data and compare the lapply() and sapply() outputs:
# list of R's built in beaver data
beaver_data <- list(beaver1 = beaver1,
                    beaver2 = beaver2)
Comparing the mean of each list item using lapply() and sapply().
# get the mean of each list item and return as a list
lapply(beaver_data, function(x) round(apply(x, 2, mean), 2))
## $beaver1
## day time temp activ
## 346.20 1312.02 36.86 0.05
##
## $beaver2
## day time temp activ
## 307.13 1446.20 37.60 0.62
# get the mean of each list item and simplify the output
sapply(beaver_data, function(x) round(apply(x, 2, mean), 2))
## beaver1 beaver2
## day 346.20 307.13
## time 1312.02 1446.20
## temp 36.86 37.60
## activ 0.05 0.62
The vapply() Function
The vapply() function is similar to sapply(), but it requires users to declare the type and length of the value that FUN returns for each element. If a result does not match the declaration, vapply() stops with an error, which makes it safer than sapply().
vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
Where,
X - The list or vector to which we want to apply the function.
FUN - The function to be applied.
FUN.VALUE - A template for the return value of FUN, for example numeric(1) or character(1).
… - Further optional arguments, passed on to FUN.
USE.NAMES - Whether the result should be named, as in sapply().
Example
Creating a list.
A = matrix(1:16, nrow = 4)
B = 1:10
C = 15:20
my_list <- list(A,B,C)
my_list
## [[1]]
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
##
## [[2]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[3]]
## [1] 15 16 17 18 19 20
Using vapply()
We expect each item in the list to return a single integer value, so we set FUN.VALUE = integer(1).
vapply(my_list, sum, FUN.VALUE = integer(1))
## [1] 136 55 105
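The type check is what separates vapply() from sapply(). As a sketch, the code below repeats the call with mean(), which returns a double rather than an integer, and catches the resulting error with tryCatch(). The message string is a made-up placeholder.

```r
my_list <- list(matrix(1:16, nrow = 4), 1:10, 15:20)

# Works: every sum() here is a single integer
totals <- vapply(my_list, sum, FUN.VALUE = integer(1))

# Fails: mean() returns a double, not an integer, so vapply() stops
result <- tryCatch(
  vapply(my_list, mean, FUN.VALUE = integer(1)),
  error = function(e) "vapply() rejected the result type"
)

totals
result
```

Declaring FUN.VALUE = numeric(1) instead would accept the means.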
mapply() Function
mapply() is a multivariate version of the sapply() function in R: it calls FUN with the first elements of each argument, then the second elements, and so on.
The syntax for the mapply() function is as shown below:
mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
Where,
FUN - The function to be applied over the objects.
… - The R objects whose elements are passed to FUN in parallel.
MoreArgs - A list of other arguments to FUN that stay the same for every call.
SIMPLIFY - Specifies whether we want simplified results or not.
USE.NAMES - Whether the result should be named, using the names (or, for a character vector, the values) of the first … argument.
Example
Suppose we want to repeat 4 once, 3 twice, 2 three times, and 1 four times.
mapply(rep, times = 1:4, x = 4:1)
## [[1]]
## [1] 4
##
## [[2]]
## [1] 3 3
##
## [[3]]
## [1] 2 2 2
##
## [[4]]
## [1] 1 1 1 1
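The MoreArgs argument supplies values that stay the same for every call. Here the fixed value x = 25 is repeated 1, 2, 3, and 4 times.
mapply(rep, times = 1:4, MoreArgs = list(x = 25))
## [[1]]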
## [1] 25
##
## [[2]]
## [1] 25 25
##
## [[3]]
## [1] 25 25 25
##
## [[4]]
## [1] 25 25 25 25
mapply() returns a list here because the results have different lengths. To get the output as a vector, wrap the call in unlist().
unlist(mapply(rep, times = 1:4, x = 4:1))
## [1] 4 3 3 2 2 2 1 1 1 1
Suppose we have two vectors and want to add them element by element and then multiply each sum by 2. We create the function first and then pass the vectors to it through mapply().
x <- c(A = 10, B = 20, C = 30)
y <- c(J = 40, K = 50, L = 60)
addition <- function(u,v){
  (u+v)*2
}
mapply(addition, x, y)
## A B C
## 100 140 180
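MoreArgs and SIMPLIFY can be combined with a user-defined function in the same way. The sketch below uses a hypothetical helper, `weighted_sum`, that is not part of the original example.

```r
# Hypothetical helper: weighted average of two values
weighted_sum <- function(u, v, weight) {
  u * weight + v * (1 - weight)
}

x <- c(A = 10, B = 20, C = 30)
y <- c(J = 40, K = 50, L = 60)

# u and v vary per call; weight stays fixed through MoreArgs
blended <- mapply(weighted_sum, x, y, MoreArgs = list(weight = 0.5))

# SIMPLIFY = FALSE keeps one list element per call
as_list <- mapply(weighted_sum, x, y,
                  MoreArgs = list(weight = 0.5), SIMPLIFY = FALSE)

blended
as_list
```

The result takes its names from the first vector passed through `...`, which is why the earlier output is labelled A, B, C rather than J, K, L.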
tapply() Function
The tapply() function applies a function to subsets of a vector, where the subsets are defined by the levels of one or more factors.
The syntax for the tapply() function is as shown below:
tapply(X, INDEX, FUN, ..., simplify = TRUE)
Where,
X - The vector to which the function is applied.
INDEX - A factor, or a list of factors, each the same length as X, that defines the groups.
FUN - The function to be applied to each subgroup.
simplify - Specifies whether we want a simplified result. Use TRUE (the default) for a simplified result, otherwise FALSE.
Example
The mtcars dataset is used.
# show first few rows of mtcars
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Mean mpg for each number of cylinders.
# get the mean of the mpg column grouped by cylinders
tapply(mtcars$mpg, mtcars$cyl, mean)
## 4 6 8
## 26.66364 19.74286 15.10000
Now let’s say you want to calculate the mean for each column in the mtcars dataset grouped by the cylinder categorical variable. To do this you can embed the tapply function within the apply function.
# get the mean of all columns grouped by cylinders
apply(mtcars, 2, function(x) tapply(x, mtcars$cyl, mean))
## mpg cyl disp hp drat wt qsec vs
## 4 26.66364 4 105.1364 82.63636 4.070909 2.285727 19.13727 0.9090909
## 6 19.74286 6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286
## 8 15.10000 8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000
## am gear carb
## 4 0.7272727 4.090909 1.545455
## 6 0.4285714 3.857143 3.428571
## 8 0.1428571 3.285714 3.500000
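Because a data frame is a list of columns, the same grouped means can also be sketched with sapply() in place of apply(), which skips the conversion of the data frame to a matrix. This is an alternative shown for comparison, not the approach used above; the names `by_apply` and `by_sapply` are illustrative.

```r
# Grouped column means with apply(), as in the example above
by_apply <- apply(mtcars, 2, function(x) tapply(x, mtcars$cyl, mean))

# The same calculation with sapply(), looping over the columns directly
by_sapply <- sapply(mtcars, function(col) tapply(col, mtcars$cyl, mean))

# Both should hold the same numbers: 3 cylinder groups x 11 columns
dim(by_sapply)
all.equal(by_sapply, by_apply, check.attributes = FALSE)
```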
The replicate() Function
This function is often used alongside the apply() family of functions.
replicate() evaluates an expression a specified number of times and collects the results, which is useful for repeated random sampling.
The syntax of the function is as follows:
replicate(n, expr, simplify = "array")
Where,
n - An integer giving the number of replications.
expr - The expression to evaluate repeatedly.
simplify - Specifies whether we want a simplified result. With the default "array" (or TRUE), the results are combined into a vector, matrix, or array where possible; with FALSE, a list is returned.
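The difference between the simplified and unsimplified forms can be sketched as follows. The seed value is arbitrary and only there to make the random draws repeatable.

```r
set.seed(1)  # arbitrary seed so the draws are reproducible

# Each evaluation returns 2 numbers; by default the results form a 2 x 3 matrix
as_array <- replicate(3, rnorm(2))

# simplify = FALSE keeps the 3 results as separate list elements
as_list <- replicate(3, rnorm(2), simplify = FALSE)

dim(as_array)
length(as_list)
```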
Example
Creating a histogram of 100 sample means, each computed from 10 exponential random values, using replicate().
hist(replicate(100, mean(rexp(10))),main="Histogram")
a) Mean price and maximum stock price of each stock.
b) Plot the total market value and market trend.
Read in the “StockExample.csv” data.
StockData <- readr::read_csv('StockExample.csv')
# check the data
StockData
## # A tibble: 10 x 5
## ...1 Infosys Reliance TCS Microsoft
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Day1 186. 1.47 1605 95.0
## 2 Day2 184. 1.56 1580 97.5
## 3 Day3 162. 1.39 1490 88.6
## 4 Day4 159. 1.43 1520 85.6
## 5 Day5 165. 1.42 1550 92.0
## 6 Day6 163. 1.36 1525 91.7
## 7 Day7 158. NA 1495 89.9
## 8 Day8 159. 1.43 1485 93.2
## 9 Day9 150. 1.57 1470 90.1
## 10 Day10 151. 1.54 1510 92.1
Summary of the data.
summary(StockData)
## ...1 Infosys Reliance TCS
## Length:10 Min. :150.2 Min. :1.360 Min. :1470
## Class :character 1st Qu.:158.2 1st Qu.:1.420 1st Qu.:1491
## Mode :character Median :160.8 Median :1.430 Median :1515
## Mean :163.7 Mean :1.463 Mean :1523
## 3rd Qu.:164.3 3rd Qu.:1.540 3rd Qu.:1544
## Max. :185.7 Max. :1.570 Max. :1605
## NA's :1
## Microsoft
## Min. :85.55
## 1st Qu.:89.94
## Median :91.87
## Mean :91.57
## 3rd Qu.:92.91
## Max. :97.49
##
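apply() converts its input to a matrix, and a matrix can hold only one data type. A character column would turn every value into text, so we remove the day column by setting it to NULL.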
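The effect of a character column can be checked on a small data frame. This is a sketch; the data frame `prices` and its values are made up for illustration.

```r
# Made-up data with one character column and two numeric columns
prices <- data.frame(
  day     = c("Day1", "Day2", "Day3"),
  stock_a = c(10, 12, 11),
  stock_b = c(5, 7, 6)
)

# As a matrix, the character column turns every cell into text
cell_class <- class(as.matrix(prices)[1, "stock_a"])

# Dropping the character column leaves only numbers, so mean() works
prices$day <- NULL
col_means <- apply(prices, 2, mean)

cell_class
col_means
```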
StockData$...1 <- NULL
Calculate the mean price of each stock.
apply(X=StockData, MARGIN=2, FUN=mean)
## Infosys Reliance TCS Microsoft
## 163.746 NA 1523.000 91.571
Calculate the mean price of each stock, removing any NAs.
apply(X=StockData, MARGIN=2, FUN=mean, na.rm=TRUE)
## Infosys Reliance TCS Microsoft
## 163.746000 1.463333 1523.000000 91.571000
Store the mean in an object called AVG.
AVG <- apply(X=StockData, MARGIN=2, FUN=mean, na.rm=TRUE)
AVG
## Infosys Reliance TCS Microsoft
## 163.746000 1.463333 1523.000000 91.571000
Finding the column mean, maximum stock price, and row sum.
# notice that we don't need to include "MARGIN", etc., as long
# as we enter the arguments in the specified order
apply(StockData, 2, mean, na.rm=TRUE)
## Infosys Reliance TCS Microsoft
## 163.746000 1.463333 1523.000000 91.571000
# do the same, but using the colMeans() function
colMeans(StockData, na.rm=TRUE)
## Infosys Reliance TCS Microsoft
## 163.746000 1.463333 1523.000000 91.571000
# find the MAXIMUM stock price, for each stock
apply(X=StockData, MARGIN=2, FUN=max, na.rm=TRUE)
## Infosys Reliance TCS Microsoft
## 185.74 1.57 1605.00 97.49
# find the 20th and 80th PERCENTILE, for each stock
apply(X=StockData, MARGIN=2, FUN=quantile, probs=c(0.2, .80),
      na.rm=TRUE)
## Infosys Reliance TCS Microsoft
## 20% 156.516 1.408 1489 89.618
## 80% 168.748 1.548 1556 93.546
# now let's calculate the SUM of each row (MARGIN=1)
apply(X=StockData, MARGIN=1, FUN=sum, na.rm=TRUE)
## [1] 1887.26 1863.31 1742.17 1766.02 1808.33 1780.78 1742.77 1739.09 1711.91
## [10] 1754.70
# do the same, but with the rowSums() function
rowSums(StockData, na.rm=TRUE)
## [1] 1887.26 1863.31 1742.17 1766.02 1808.33 1780.78 1742.77 1739.09 1711.91
## [10] 1754.70
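The agreement between apply() and the dedicated rowSums() and colMeans() functions can be checked on a small example. This is a sketch with made-up data; the data frame `scores` is not part of the stock data.

```r
# Made-up numeric data with one missing value
scores <- data.frame(a = c(1, 2, NA), b = c(4, 5, 6))

# Row sums two ways
row_apply <- apply(scores, 1, sum, na.rm = TRUE)
row_fast  <- rowSums(scores, na.rm = TRUE)

# Column means two ways
col_apply <- apply(scores, 2, mean, na.rm = TRUE)
col_fast  <- colMeans(scores, na.rm = TRUE)

# Compare the values only, ignoring any names attached to the results
all.equal(unname(row_apply), unname(row_fast))
all.equal(unname(col_apply), unname(col_fast))
```

For sums and means the dedicated functions are generally the faster choice; apply() is the more general tool when FUN is something else, such as max or quantile.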
Plotting the market trend.
# make a nice plot of these...
plot(apply(X=StockData, MARGIN=1, FUN=sum, na.rm=TRUE), type="l",
     ylab="Total Market Value", xlab="Day", main="Market Trend")
Adding points to the market trend.
plot(apply(X=StockData, MARGIN=1, FUN=sum, na.rm=TRUE), type="l",
     ylab="Total Market Value", xlab="Day", main="Market Trend")
points(apply(X=StockData, MARGIN=1, FUN=sum, na.rm=TRUE),
       pch=16, col="blue")
a) Mean age for smokers and non-smokers.
Reading and attaching the data.
# read in the "LungCapData.csv" data, and attach it
LungCapData <- readr::read_csv('Lungcapdata.csv')
# check the data
summary(LungCapData)
## LungCap Age Height Smoke
## Min. : 0.507 Min. : 3.00 Min. :45.30 Length:725
## 1st Qu.: 6.150 1st Qu.: 9.00 1st Qu.:59.90 Class :character
## Median : 8.000 Median :13.00 Median :65.40 Mode :character
## Mean : 7.863 Mean :12.33 Mean :64.84
## 3rd Qu.: 9.800 3rd Qu.:15.00 3rd Qu.:70.30
## Max. :14.675 Max. :19.00 Max. :81.80
## Gender Caesarean
## Length:725 Length:725
## Class :character Class :character
## Mode :character Mode :character
##
##
##
# and attach it
attach(LungCapData)
Calculating the mean age for smokers and non-smokers.
# calculate the mean Age for Smoker/NonSmoker
tapply(X=Age, INDEX=Smoke, FUN=mean, na.rm=T)
## no yes
## 12.03549 14.77922
# you don't need to include "X", "INDEX",... as long as you
# enter them in that order...
# we also don't need to include "na.rm=T" as there are no missing values
tapply(Age, Smoke, mean)
## no yes
## 12.03549 14.77922
# we can save the output in a new "object"
m <- tapply(Age, Smoke, mean)
m
## no yes
## 12.03549 14.77922
# also worth discussing is the use of the "simplify" argument
# this is set to TRUE by default...if we set it to FALSE, we get a list...
tapply(Age, Smoke, mean, simplify=FALSE)
## $no
## [1] 12.03549
##
## $yes
## [1] 14.77922
# note that we could get the same using [ ],
# although using "tapply" is more efficient
mean(Age[Smoke=="no"])
## [1] 12.03549
mean(Age[Smoke=="yes"])
## [1] 14.77922
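What tapply() does internally can be thought of as "split, then apply". The sketch below spells out those two steps on made-up data; the vectors `age` and `smoke` are illustrative and are not the LungCapData columns.

```r
# Made-up ages and smoking status, for illustration only
age   <- c(10, 12, 15, 17, 14, 9)
smoke <- c("no", "no", "yes", "yes", "yes", "no")

# tapply() groups and applies in one call
by_tapply <- tapply(age, smoke, mean)

# The same steps written out: split into groups, then apply to each group
groups   <- split(age, smoke)
by_split <- sapply(groups, mean)

by_tapply
by_split
```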
Summary of age and smoking habits.
# let's look at applying the "summary" function to groups
tapply(Age, Smoke, summary)
## $no
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 12.00 12.04 15.00 19.00
##
## $yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 13.00 15.00 14.78 17.00 19.00
Applying quantile() to each group, then calculating the mean Age for Smoker/NonSmoker and male/female.
# or, applying the "quantile" function to the groups
tapply(Age, Smoke, quantile, probs=c(0.2, 0.8))
## $no
## 20% 80%
## 8 16
##
## $yes
## 20% 80%
## 12 17
# we can "subset" based on multiple variables/vectors
# calculate the mean Age for Smoker/NonSmoker and male/female
tapply(X=Age, INDEX=list(Smoke, Gender), FUN=mean, na.rm=T)
## female male
## no 12.12739 11.94910
## yes 14.75000 14.81818
# a less efficient way to get this done...
mean(Age[Smoke=="no" & Gender=="female"])
## [1] 12.12739
mean(Age[Smoke=="no" & Gender=="male"])
## [1] 11.9491
mean(Age[Smoke=="yes" & Gender=="female"])
## [1] 14.75
mean(Age[Smoke=="yes" & Gender=="male"])
## [1] 14.81818
# a reminder of using 2 grouping variables
tapply(Age, list(Smoke, Gender), mean, na.rm=T)
## female male
## no 12.12739 11.94910
## yes 14.75000 14.81818
# also note that the "by" function is a wrapper around tapply;
# it returns an object of class "by" and prints each group's result separately
by(Age, list(Smoke, Gender), mean, na.rm=T)
## : no
## : female
## [1] 12.12739
## ------------------------------------------------------------
## : yes
## : female
## [1] 14.75
## ------------------------------------------------------------
## : no
## : male
## [1] 11.9491
## ------------------------------------------------------------
## : yes
## : male
## [1] 14.81818
# and we can subset the elements in the usual way
temp <- by(Age, list(Smoke, Gender), mean, na.rm=T)
temp
## : no
## : female
## [1] 12.12739
## ------------------------------------------------------------
## : yes
## : female
## [1] 14.75
## ------------------------------------------------------------
## : no
## : male
## [1] 11.9491
## ------------------------------------------------------------
## : yes
## : male
## [1] 14.81818
temp[4]
## [1] 14.81818
# and see the "class" of temp
class(temp)
## [1] "by"
# we can also convert it to a vector if we prefer
c(temp)
## [1] 12.12739 14.75000 11.94910 14.81818
temp2 <- c(temp)
temp2
## [1] 12.12739 14.75000 11.94910 14.81818
# and check its class
class(temp2)
## [1] "numeric"
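by() is most useful when the first argument is a data frame, because FUN then receives all the rows of each group at once. The following is a minimal sketch on made-up data; the data frame `people` is illustrative and is not LungCapData.

```r
# Made-up data frame, for illustration only
people <- data.frame(
  age    = c(10, 12, 15, 17),
  height = c(50, 55, 60, 65),
  smoke  = c("no", "no", "yes", "yes")
)

# FUN receives the rows of each group as a data frame
group_means <- by(people[, c("age", "height")], people$smoke, colMeans)

# Individual groups can be pulled out by name
group_means[["no"]]
group_means[["yes"]]
```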