The apply() Function

  • The apply() function applies a function to the rows or columns (margins) of a matrix or a data frame. Its syntax is shown below:

apply(X, MARGIN, FUN, ...) Where,

X - Matrix or the data frame on which we want to apply the function.

MARGIN - A vector that defines which part of the matrix/data frame the function should be applied on.

MARGIN = 1-Function will be applied on rows.

MARGIN = 2-Function will be applied on columns.

MARGIN = c(1, 2)-Function will be applied on each individual element, indexed by both row and column.

FUN - Specifies the function that will be applied on the MARGIN.

... - Any further optional arguments, which are passed on to FUN.
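
The three MARGIN settings can be compared on a tiny matrix. This is only a sketch: the matrix m and the anonymous function are made up for illustration and are not part of the mtcars example that follows.

```r
# a small 2 x 3 matrix, invented for illustration
m <- matrix(1:6, nrow = 2)

# MARGIN = 1: FUN receives one row at a time
row_totals <- apply(m, 1, sum)

# MARGIN = 2: FUN receives one column at a time
col_totals <- apply(m, 2, sum)

# MARGIN = c(1, 2): FUN receives one cell at a time,
# so the result keeps the shape of the input
scaled <- apply(m, c(1, 2), function(v) v * 10)

row_totals
col_totals
scaled
```

With MARGIN = c(1, 2) the result has the same dimensions as m, which is why it is mostly useful for element-wise functions.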

Example

  • The mtcars data comes from the 1974 Motor Trend magazine.

  • The data includes fuel consumption data, and ten aspects of car design for then-current car models.

# show first few rows of mtcars
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Calculating the column mean, row sum and column quantiles.

# get the mean of each column 
apply(mtcars, 2, mean)
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
##         vs         am       gear       carb 
##   0.437500   0.406250   3.687500   2.812500
# get the sum of each row 
apply(mtcars, 1, sum)
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##             328.980             329.795             259.580             426.135 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##             590.310             385.540             656.920             270.980 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##             299.570             350.460             349.660             510.740 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##             511.500             509.850             728.560             726.644 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##             725.695             213.850             195.165             206.955 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##             273.775             519.650             506.085             646.280 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##             631.175             208.215             272.570             273.683 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##             670.690             379.590             694.710             288.890
# get column quantiles (notice the quantile percents as row names)
apply(mtcars, 2, quantile, probs = c(0.10, 0.25, 0.50, 0.75, 0.90))
##        mpg cyl    disp    hp  drat      wt    qsec vs am gear carb
## 10% 14.340   4  80.610  66.0 3.007 1.95550 15.5340  0  0    3    1
## 25% 15.425   4 120.825  96.5 3.080 2.58125 16.8925  0  0    3    2
## 50% 19.200   6 196.300 123.0 3.695 3.32500 17.7100  0  0    4    2
## 75% 22.800   8 326.000 180.0 3.920 3.61000 18.9000  1  1    4    4
## 90% 30.090   8 396.000 243.5 4.209 4.04750 19.9900  1  1    5    4

Let’s construct a 4 x 4 matrix and calculate the sum of the values of each row.

#using apply() on a matrix
matrix_1 <- matrix(1:16, nrow = 4)
apply(matrix_1, 1, sum)
## [1] 28 32 36 40
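
For comparison, MARGIN = 2 walks over the columns of the same matrix instead of its rows. The sketch below rebuilds matrix_1 so that it runs on its own; the name col_sums is introduced here only for illustration.

```r
# same matrix as above, rebuilt so the snippet is self-contained
matrix_1 <- matrix(1:16, nrow = 4)

# MARGIN = 2 sums each column
col_sums <- apply(matrix_1, 2, sum)
col_sums

# colSums() is a base R shortcut for the same calculation
colSums(matrix_1)
```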

But let’s see what happens when we use a vector instead of a matrix.

#create a vector first
vector_1 <- c(1:15)
vector_1
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
#use the apply() function
apply(vector_1, 1, sum)
## Error in apply(vector_1, 1, sum): dim(X) must have a positive length

As you can see, it didn’t work because the apply() function needs an object with a dim attribute, such as a matrix, an array, or a data frame, and a plain vector has none.

If the data used is in the vector format, then we need to use the other functions such as lapply(), sapply(), or vapply() instead.
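
A minimal sketch of both routes, assuming we only want simple summaries of the vector: sapply() visits each element, while reshaping the vector into a one-row matrix gives apply() the dim attribute it needs. The names squares and total are illustrative.

```r
vector_1 <- c(1:15)

# sapply() calls the function once per element of the vector
squares <- sapply(vector_1, function(v) v^2)
squares

# a 1 x 15 matrix has a dim attribute, so apply() accepts it
total <- apply(matrix(vector_1, nrow = 1), 1, sum)
total
```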

The lapply() Function

  • lapply() applies a function to each element of a list and returns a list of the same length.

lapply(X, FUN, ...) Where,

X - Specifies a list (or vector) to whose elements the function is applied.

FUN - Is the function to be applied to each element of the list.

... - Any further optional arguments, which are passed on to FUN.
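
Before the larger example, here is a small sketch on a hand-made list. The list scores and its element names are hypothetical and serve only to show that the result is a list with one entry per input element.

```r
# a hypothetical named list of integer vectors
scores <- list(a = 1:3, b = 4:6)

# sum() is called once per list element; names are carried over
totals <- lapply(scores, sum)
str(totals)
```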

Example

  • In this example we will use iris, the plant classification dataset that ships with R.
data("iris")
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

We are going to remove the non-numerical variables.

iris$Species <- NULL
str(iris)
## 'data.frame':    150 obs. of  4 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

Calculating the column mean.

# Let's calculate the mean of each column 
#of the iris dataset with lapply 
#that returns each column as an element of a list

lapply(iris, mean)
## $Sepal.Length
## [1] 5.843333
## 
## $Sepal.Width
## [1] 3.057333
## 
## $Petal.Length
## [1] 3.758
## 
## $Petal.Width
## [1] 1.199333
# We can add extra parameters to define the configuration of the mean function
lapply(iris, mean, na.rm = TRUE)
## $Sepal.Length
## [1] 5.843333
## 
## $Sepal.Width
## [1] 3.057333
## 
## $Petal.Length
## [1] 3.758
## 
## $Petal.Width
## [1] 1.199333

If we want the result in vector form, we can wrap the lapply() call in the unlist() function.

unlist(lapply(iris,mean))
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333

The sapply() Function

  • sapply() is a user-friendly wrapper around lapply() that simplifies the result where possible.

  • sapply() takes a list, vector, or data frame as input and returns the output in vector or matrix form.

sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) Where,

X - specifies the list on which we want to apply the function.

FUN - specifies the function to be applied.

... - arguments that can be added and are passed on to FUN.

simplify - argument that specifies if we want to simplify the results or not.

USE.NAMES - specifies whether the result should be named after the elements of X (this matters when X is a character vector).
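
The effect of simplify and USE.NAMES is easiest to see on a character vector. The sketch below uses a made-up vector fruits and the base function nchar(); the variable names are illustrative.

```r
fruits <- c("apple", "kiwi")

# defaults: simplified to a vector, named after the input strings
named <- sapply(fruits, nchar)
named

# USE.NAMES = FALSE drops the names
unnamed <- sapply(fruits, nchar, USE.NAMES = FALSE)
unnamed

# simplify = FALSE keeps the list that lapply() would return
as_list <- sapply(fruits, nchar, simplify = FALSE)
str(as_list)
```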

Example

To illustrate the difference, we can use a list built from R’s beaver data and compare the sapply and lapply outputs:

# list of R's built in beaver data
beaver_data <- list(beaver1 = beaver1, 
                    beaver2 = beaver2)

Comparing mean of each list item using lapply() and sapply()

# get the mean of each list item and return as a list
lapply(beaver_data, function(x) round(apply(x, 2, mean), 2))
## $beaver1
##     day    time    temp   activ 
##  346.20 1312.02   36.86    0.05 
## 
## $beaver2
##     day    time    temp   activ 
##  307.13 1446.20   37.60    0.62
# get the mean of each list item and simplify the output
sapply(beaver_data, function(x) round(apply(x, 2, mean), 2))
##       beaver1 beaver2
## day    346.20  307.13
## time  1312.02 1446.20
## temp    36.86   37.60
## activ    0.05    0.62

The vapply() Function

  • The vapply() function is similar to the sapply() function, but it requires users to specify the type and length of the value that FUN returns for each element, which makes the result predictable.

vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE) Where,

FUN.VALUE - a template for the value returned by FUN, for example numeric(1) for a single number.

X - specifies the list on which we want to apply the function.

FUN - specifies the function to be applied.

... - arguments that can be added and are passed on to FUN.

USE.NAMES - specifies whether the result should carry names taken from X.

Example

Creating a list.

A = matrix(1:16, nrow = 4)
B = 1:10
C = 15:20
my_list <- list(A,B,C)
my_list
## [[1]]
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
## 
## [[2]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[3]]
## [1] 15 16 17 18 19 20

Using vapply()

We want each item in the list to return a single integer value, so we set FUN.VALUE = integer(1).

vapply(my_list, sum, FUN.VALUE = integer(1))
## [1] 136  55 105
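
The FUN.VALUE template acts as a check on what FUN returns. In the sketch below, which rebuilds the same list so it runs on its own, mean() returns a double, so numeric(1) is accepted while integer(1) makes vapply() stop with an error. The tryCatch() wrapper and the names means and mismatch are only there to capture the message.

```r
# same list as above, rebuilt so the snippet is self-contained
my_list <- list(matrix(1:16, nrow = 4), 1:10, 15:20)

# mean() returns a double, which matches the numeric(1) template
means <- vapply(my_list, mean, FUN.VALUE = numeric(1))
means

# an integer(1) template does not match a double result,
# so vapply() raises an error instead of silently converting
mismatch <- tryCatch(
  vapply(my_list, mean, FUN.VALUE = integer(1)),
  error = function(e) conditionMessage(e)
)
mismatch
```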

The mapply() function

  • mapply() is a multivariate version of the sapply() function in R.

  • The syntax for the mapply() function is as shown below:

mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE) Where,

FUN - Function to be applied over the objects.

... - Specifies R objects on which the function should be applied.

MoreArgs - A list of other arguments for FUN that stay the same in every call.

SIMPLIFY - Specifies whether we want simplified results or not.

USE.NAMES - Specifies whether the result takes its names from the first ... argument.

Example

Suppose we want to repeat 4 once, 3 twice, 2 three times, and 1 four times.

mapply(rep, times = 1:4, x = 4:1)
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 3 3
## 
## [[3]]
## [1] 2 2 2
## 
## [[4]]
## [1] 1 1 1 1

MoreArgs supplies arguments that stay the same in every call. Here the fixed value x = 25 is repeated 1 to 4 times.

mapply(rep, times = 1:4, MoreArgs = list(x = 25))
## [[1]]
## [1] 25
## 
## [[2]]
## [1] 25 25
## 
## [[3]]
## [1] 25 25 25
## 
## [[4]]
## [1] 25 25 25 25

We will get the output in the form of a list. To get the output in the vector form, use the following code.

unlist(mapply(rep, times = 1:4, x = 4:1))
##  [1] 4 3 3 2 2 2 1 1 1 1

Suppose we have two vectors and want to add them element by element and then multiply each sum by 2. We create the function first and then pass the vectors to it.

x <- c(A = 10, B = 20, C = 30)
y <- c(J = 40, K = 50, L = 60)
addition <- function(u,v){
  (u+v)*2
}
mapply(addition, x, y)
##   A   B   C 
## 100 140 180
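
The multiplier can also be passed in through MoreArgs rather than hard-coded. This is a sketch with a hypothetical helper scaled_sum() and a hypothetical argument k; the vectors are the ones used above, repeated so the snippet runs on its own.

```r
x <- c(A = 10, B = 20, C = 30)
y <- c(J = 40, K = 50, L = 60)

# hypothetical helper: k is the fixed multiplier
scaled_sum <- function(u, v, k) {
  (u + v) * k
}

# u and v vary element by element, k comes from MoreArgs
res <- mapply(scaled_sum, x, y, MoreArgs = list(k = 2))
res
```

The result is named after the first vector, x. For plain arithmetic like this, (x + y) * 2 gives the same numbers directly, so mapply() pays off mainly when the function is not already vectorized.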

The tapply() function

  • The tapply() function applies a function to subsets of a vector, where the subsets are defined by the levels of one or more factors.

  • The syntax for tapply() function is as shown below:

tapply(X, INDEX, FUN, ..., simplify = TRUE) Where,

X - is a vector on which the function is to be applied.

INDEX - is a factor, or a list of factors, of the same length as X that defines the groups.

FUN - is a function to be applied to each subgroup.

simplify - is an argument which specifies if we want a simplified result or not. If we want a simplified result, we should use TRUE otherwise FALSE.
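
A toy sketch of the grouping idea, using made-up values: scores holds the data and group says which subgroup each value belongs to. Both names are hypothetical.

```r
# invented data: four values in two groups
scores <- c(10, 20, 30, 40)
group <- factor(c("a", "b", "a", "b"))

# mean() is applied separately to the values of each group
group_means <- tapply(scores, group, mean)
group_means
```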

Example

The mtcars dataset is used.

# show first few rows of mtcars
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Mean of the mpg column grouped by number of cylinders.

# get the mean of the mpg column grouped by cylinders 
tapply(mtcars$mpg, mtcars$cyl, mean)
##        4        6        8 
## 26.66364 19.74286 15.10000

Now let’s say you want to calculate the mean for each column in the mtcars dataset grouped by the cylinder categorical variable. To do this you can embed the tapply function within the apply function.

# get the mean of all columns grouped by cylinders 
apply(mtcars, 2, function(x) tapply(x, mtcars$cyl, mean))
##        mpg cyl     disp        hp     drat       wt     qsec        vs
## 4 26.66364   4 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909
## 6 19.74286   6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286
## 8 15.10000   8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000
##          am     gear     carb
## 4 0.7272727 4.090909 1.545455
## 6 0.4285714 3.857143 3.428571
## 8 0.1428571 3.285714 3.500000
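
As one possible alternative, base R's aggregate() with a formula produces the same group means as a data frame, with cyl as an ordinary column. This is a sketch rather than a replacement for the approach above; the name by_cyl is illustrative.

```r
# mean of every other column, grouped by cyl
by_cyl <- aggregate(. ~ cyl, data = mtcars, FUN = mean)

# show a few of the columns
by_cyl[, c("cyl", "mpg", "hp")]
```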

The replicate() Function

  • This function is often used with the apply() function family.

  • replicate() evaluates an expression a specified number of times and collects the results, which makes it handy for repeated random sampling.

  • The syntax of the function is as follows:

replicate(n, expr, simplify = "array") Where,

n - An integer that gives the number of replications.

expr - The expression to evaluate repeatedly.

simplify - Specifies whether the result should be simplified. The default, "array", simplifies the result to a vector, matrix, or array where possible; FALSE returns a list.

Example

Creating a histogram using replicate().

hist(replicate(100, mean(rexp(10))),main="Histogram")
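
To see what replicate() returns before it is plotted, the sketch below repeats a two-value random draw three times. The seed value is arbitrary and the names draws and draws_list are illustrative; no specific random values are assumed.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# each evaluation returns 2 numbers, so 3 replications
# are simplified into a 2 x 3 matrix (one column per run)
draws <- replicate(3, rnorm(2))
dim(draws)

# simplify = FALSE keeps one list element per run
draws_list <- replicate(3, rnorm(2), simplify = FALSE)
length(draws_list)
```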

Exercises

1. Stock prices of 4 companies over 10 days are given as the data. Compute the following:

a) Mean price and maximum price of each stock.

b) Plot the total market value and market trend.

Solution

Read in the “StockExample.csv” data and check it.

StockData <- readr::read_csv('StockExample.csv')
# check the data
StockData
## # A tibble: 10 x 5
##    ...1  Infosys Reliance   TCS Microsoft
##    <chr>   <dbl>    <dbl> <dbl>     <dbl>
##  1 Day1     186.     1.47  1605      95.0
##  2 Day2     184.     1.56  1580      97.5
##  3 Day3     162.     1.39  1490      88.6
##  4 Day4     159.     1.43  1520      85.6
##  5 Day5     165.     1.42  1550      92.0
##  6 Day6     163.     1.36  1525      91.7
##  7 Day7     158.    NA     1495      89.9
##  8 Day8     159.     1.43  1485      93.2
##  9 Day9     150.     1.57  1470      90.1
## 10 Day10    151.     1.54  1510      92.1

Summary of the data.

summary(StockData)
##      ...1              Infosys         Reliance          TCS      
##  Length:10          Min.   :150.2   Min.   :1.360   Min.   :1470  
##  Class :character   1st Qu.:158.2   1st Qu.:1.420   1st Qu.:1491  
##  Mode  :character   Median :160.8   Median :1.430   Median :1515  
##                     Mean   :163.7   Mean   :1.463   Mean   :1523  
##                     3rd Qu.:164.3   3rd Qu.:1.540   3rd Qu.:1544  
##                     Max.   :185.7   Max.   :1.570   Max.   :1605  
##                                     NA's   :1                     
##    Microsoft    
##  Min.   :85.55  
##  1st Qu.:89.94  
##  Median :91.87  
##  Mean   :91.57  
##  3rd Qu.:92.91  
##  Max.   :97.49  
## 

apply() converts a data frame to a matrix, and a single character column would turn every value into text. So we remove the day column by setting it to NULL.

StockData$...1 <- NULL
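
The reason for dropping the column can be checked on a toy data frame. This sketch uses invented data (toy, with columns day and price), not the stock file; it shows that the matrix built from a mixed data frame is of type character, and becomes numeric once the text column is removed.

```r
# hypothetical two-row data frame with one text column
toy <- data.frame(day = c("Day1", "Day2"), price = c(1.5, 2.5))

# with the text column present, every cell is coerced to character
type_before <- typeof(as.matrix(toy))
type_before

# after removing it, the matrix is numeric and mean() works
toy$day <- NULL
type_after <- typeof(as.matrix(toy))
type_after

price_mean <- apply(toy, 2, mean)
price_mean
```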

Calculate the mean price of each stock.

apply(X=StockData, MARGIN=2, FUN=mean)
##   Infosys  Reliance       TCS Microsoft 
##   163.746        NA  1523.000    91.571

Calculate the mean price of each stock, removing any NAs.

apply(X=StockData, MARGIN=2, FUN=mean, na.rm=TRUE)
##     Infosys    Reliance         TCS   Microsoft 
##  163.746000    1.463333 1523.000000   91.571000

Store the mean in an object called AVG.

AVG <- apply(X=StockData, MARGIN=2, FUN=mean, na.rm=TRUE)
AVG
##     Infosys    Reliance         TCS   Microsoft 
##  163.746000    1.463333 1523.000000   91.571000

Finding the column mean, maximum stock price, and row sum.

# notice that we don't need to include "MARGIN", etc, as long
# as we enter info in the specified order
apply(StockData, 2, mean, na.rm=TRUE)
##     Infosys    Reliance         TCS   Microsoft 
##  163.746000    1.463333 1523.000000   91.571000
# do the same, but using the colMeans function
colMeans(StockData, na.rm=TRUE)
##     Infosys    Reliance         TCS   Microsoft 
##  163.746000    1.463333 1523.000000   91.571000
# find the MAXIMUM stock price, for each stock
apply(X=StockData, MARGIN=2, FUN=max, na.rm=TRUE)
##   Infosys  Reliance       TCS Microsoft 
##    185.74      1.57   1605.00     97.49
# find the 20th and 80th PERCENTILE, for each stock
apply(X=StockData, MARGIN=2, FUN=quantile, probs=c(0.2, .80),
      na.rm=TRUE)
##     Infosys Reliance  TCS Microsoft
## 20% 156.516    1.408 1489    89.618
## 80% 168.748    1.548 1556    93.546
# now let's calculate the SUM of each row (MARGIN=1)
apply(X=StockData, MARGIN=1, FUN=sum, na.rm=TRUE)
##  [1] 1887.26 1863.31 1742.17 1766.02 1808.33 1780.78 1742.77 1739.09 1711.91
## [10] 1754.70
# do the same, but with the rowSums command
rowSums(StockData, na.rm=TRUE)
##  [1] 1887.26 1863.31 1742.17 1766.02 1808.33 1780.78 1742.77 1739.09 1711.91
## [10] 1754.70

Plotting the market trend.

# make a nice plot of these...
plot(apply(X=StockData, MARGIN=1, FUN=sum, na.rm=TRUE), type="l"
     ,ylab="Total Market Value", xlab="Day", main="Market Trend")

Adding points to the market trend.

plot(apply(X=StockData, MARGIN=1, FUN=sum, na.rm=TRUE), type="l"
     ,ylab="Total Market Value", xlab="Day", main="Market Trend")
points(apply(X=StockData, MARGIN=1, FUN=sum, na.rm=TRUE), 
       pch=16, col="blue")

2. Lung capacity and smoking habits of 725 individuals are given as the data. Compute the following:

a) Mean age for smokers and non-smokers

Solution

Reading and attaching the data.

# read in the "LungCapData.csv" data, and attach it
LungCapData <- readr::read_csv('Lungcapdata.csv')
# check the data
summary(LungCapData)
##     LungCap            Age            Height         Smoke          
##  Min.   : 0.507   Min.   : 3.00   Min.   :45.30   Length:725        
##  1st Qu.: 6.150   1st Qu.: 9.00   1st Qu.:59.90   Class :character  
##  Median : 8.000   Median :13.00   Median :65.40   Mode  :character  
##  Mean   : 7.863   Mean   :12.33   Mean   :64.84                     
##  3rd Qu.: 9.800   3rd Qu.:15.00   3rd Qu.:70.30                     
##  Max.   :14.675   Max.   :19.00   Max.   :81.80                     
##     Gender           Caesarean        
##  Length:725         Length:725        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
# and attach it
attach(LungCapData)
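
If you would rather not attach the data, with() evaluates an expression inside a data frame without changing the search path. The sketch below uses a small invented data frame, people, with the same column names Age and Smoke; it is not the lung capacity data.

```r
# hypothetical stand-in for the lung capacity data
people <- data.frame(
  Age = c(10, 12, 15, 17),
  Smoke = c("no", "no", "yes", "yes")
)

# columns are looked up inside people, no attach() needed
mean_age <- with(people, tapply(Age, Smoke, mean))
mean_age
```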

Calculating the mean age for smokers and non-smokers.

# calculate the mean Age for Smoker/NonSmoker
tapply(X=Age, INDEX=Smoke, FUN=mean, na.rm=T)
##       no      yes 
## 12.03549 14.77922
# you don't need to include "X", "INDEX",... as long as you
# enter them in that order...
# we also don't need to include "na.rm=T" as there are no missing values
tapply(Age, Smoke, mean)
##       no      yes 
## 12.03549 14.77922
# we can save the output in a new "object"
m <- tapply(Age, Smoke, mean)
m
##       no      yes 
## 12.03549 14.77922
# also worth discussing is the use of the "simplify" argument
# this is set to TRUE by default...if we set it to "FALSE"...
tapply(Age, Smoke, mean, simplify=FALSE)
## $no
## [1] 12.03549
## 
## $yes
## [1] 14.77922
# note that we could get the same using [ ], 
# although using "tapply" is more efficient
mean(Age[Smoke=="no"])
## [1] 12.03549
mean(Age[Smoke=="yes"])
## [1] 14.77922

Summary of age and smoking habits.

# let's look at applying the "summary" function to groups
tapply(Age, Smoke, summary)
## $no
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    9.00   12.00   12.04   15.00   19.00 
## 
## $yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   13.00   15.00   14.78   17.00   19.00

Applying the quantile function, then calculating the mean Age for Smoker/NonSmoker and male/female.

# or, applying the "quantile" function to the groups
tapply(Age, Smoke, quantile, probs=c(0.2, 0.8))
## $no
## 20% 80% 
##   8  16 
## 
## $yes
## 20% 80% 
##  12  17
# we can "subset" based on multiple variables/vectors
# calculate the mean Age for Smoker/NonSmoker and male/female
tapply(X=Age, INDEX=list(Smoke, Gender), FUN=mean, na.rm=T)
##       female     male
## no  12.12739 11.94910
## yes 14.75000 14.81818
# a less efficient way to get this done...
mean(Age[Smoke=="no" & Gender=="female"])
## [1] 12.12739
mean(Age[Smoke=="no" & Gender=="male"])
## [1] 11.9491
mean(Age[Smoke=="yes" & Gender=="female"])
## [1] 14.75
mean(Age[Smoke=="yes" & Gender=="male"])
## [1] 14.81818
# a reminder of using 2 grouping variables
tapply(Age, list(Smoke, Gender), mean, na.rm=T)
##       female     male
## no  12.12739 11.94910
## yes 14.75000 14.81818
# and a note that the "by" function is similar to tapply,
# except it returns an object of class "by" that prints group by group
by(Age, list(Smoke, Gender), mean, na.rm=T)
## : no
## : female
## [1] 12.12739
## ------------------------------------------------------------ 
## : yes
## : female
## [1] 14.75
## ------------------------------------------------------------ 
## : no
## : male
## [1] 11.9491
## ------------------------------------------------------------ 
## : yes
## : male
## [1] 14.81818
# and we can subset the elements in the usual way
temp <- by(Age, list(Smoke, Gender), mean, na.rm=T)
temp
## : no
## : female
## [1] 12.12739
## ------------------------------------------------------------ 
## : yes
## : female
## [1] 14.75
## ------------------------------------------------------------ 
## : no
## : male
## [1] 11.9491
## ------------------------------------------------------------ 
## : yes
## : male
## [1] 14.81818
temp[4]
## [1] 14.81818
# and see the "class" of temp
class(temp)
## [1] "by"
# we can also convert it to a vector if we prefer
c(temp)
## [1] 12.12739 14.75000 11.94910 14.81818
temp2 <- c(temp)
temp2
## [1] 12.12739 14.75000 11.94910 14.81818
# and check its class
class(temp2)
## [1] "numeric"