Benchmarking

There are several packages for benchmarking your machine and your code.

How fast is my machine?

To maximize computational speed, first we have to know how long our code takes to run.

Let’s get an idea of local specs and compute speed.

What version of R are you using?

# get major and minor version and paste together
paste0(version$major, ".", version$minor)
[1] "4.1.0"

We are running R 4.1.0, the newest release at the time of writing. Keeping R up to date is one of the easiest ways to optimize code: new versions often provide speed boosts for commonly used functions.
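
Equivalently, base R ships a preformatted version string, which saves the paste0() call above:

R.version.string  #e.g., "R version 4.1.0 (2021-05-18)"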

Having a faster computer might also make you more productive, but to justify the upgrade you will need some numbers. We can benchmark the performance of our computer using the benchmarkme package. First, let’s see what machine we’re working with.

# benchmarkme and disk.frame should already be loaded
paste("RAM:", disk.frame::df_ram_size(), "GB")  #how much RAM are you working with?
[1] "RAM: 16 GB"
paste("CPUs:", benchmarkme::get_cpu()[3])  #how many logical processors do you have? 
[1] "CPUs: 8"

You’ll notice that we have more logical cores than physical cores at our disposal. This is due to hyperthreading.
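
If you want to see the physical/logical split explicitly, the parallel package can report both (a quick check; note that logical = FALSE is not supported on every platform and may return NA):

parallel::detectCores(logical = FALSE)  #physical cores
parallel::detectCores(logical = TRUE)   #logical cores; 2x physical with hyperthreading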

You can run a set of standardized benchmarks and compare your results to users who have uploaded their performance results. The benchmark_io() function reports the time it takes to write and read a file of a given size (in megabytes). Let’s get performance for writing and reading a 5 MB file and plot the results.

write_read_bench <- benchmark_io(size = 5)  #read and write benchmark
plot(write_read_bench)  #plot results

upload_results(write_read_bench)  #upload results for benchmark tracking
[1] "2021-07-27-16095561"

For general benchmarking, use benchmark_std() to see how your machine handles standard programming operations on data.

std_bench <- benchmark_std()  #benchmark matrix manipulation, calculation
plot(std_bench)  #plot results

upload_results(std_bench)  #upload results for benchmark tracking
[1] "2021-07-27-34263246"

Benchmarking different read and write functions

There are different ways to write and read data in R. Let’s create some data and test out read/write functions using:

  • saveRDS() and readRDS() functions in R
  • write.csv() and read.csv() functions in the utils package
  • write_csv() and read_csv() functions in the readr package
  • fwrite() and fread() functions in the data.table package
  • vroom_write() and vroom() functions in the vroom package

Let’s write a function that creates some random data and returns it as a data frame.

get_random_df <- function(n_elements, ncol, min = 0, max = 1, seed = 42) {
    set.seed(seed)  #set the seed so the data are reproducible
    x <- runif(n_elements, min = min, max = max)  #random vector of real numbers
    x <- matrix(x, ncol = ncol)  #reshape into ncol columns
    as.data.frame(x)  #return as a data frame
}

n <- 1e+05
col <- 100
data <- get_random_df(n_elements = n, ncol = col)

Let’s benchmark various functions for saving data frames using rbenchmark::benchmark. We will be saving as RDS and .csv files.

pacman::p_load(rbenchmark, readr, data.table, vroom)  #load the packages used below

benchmark(
  rds = saveRDS(data, "../data/data.rds"),
  utils = write.csv(data, "../data/data_utils.csv"),
  readr = write_csv(data, "../data/data_readr.csv"),
  data.table = fwrite(data,"../data/data_dt.csv"), 
  vroom = vroom_write(data, "../data/data_vroom.csv"),
  
  replications = 1,
  columns = c('test', 'elapsed'),
  order = 'elapsed'
)
        test elapsed
4 data.table    0.02
1        rds    0.11
5      vroom    0.12
3      readr    0.13
2      utils    0.26
benchmark(
  rds = readRDS("../data/data.rds"),
  utils = read.csv("../data/data_utils.csv"),
  readr = read_csv("../data/data_readr.csv"),
  data.table = fread("../data/data_dt.csv"), 
  vroom = vroom("../data/data_vroom.csv"),
  
  replications = 1,
  columns = c('test', 'elapsed'),
  order = 'elapsed'
)
        test elapsed
1        rds    0.00
4 data.table    0.02
2      utils    0.08
5      vroom    0.20
3      readr    1.54

Overall, if we want to use a .csv file format, the data.table::fread() and data.table::fwrite() functions are the way to go. If we want a flexible and fast format for storing any R object (as opposed to just data frames), we should use saveRDS().
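
To illustrate the “any R object” point: RDS round-trips arbitrary objects, such as fitted models, which the .csv-based functions cannot. A small sketch (the file path is hypothetical, mirroring the paths above):

fit <- lm(dist ~ speed, data = cars)  #an arbitrary, non-tabular R object
saveRDS(fit, "../data/fit.rds")  #hypothetical path
identical(class(readRDS("../data/fit.rds")), class(fit))  #the object round-trips intact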



Efficient programming

Memory allocation

Since R holds objects in RAM, we have to be careful with memory allocation. It’s important to vectorize whenever possible. When a loop is unavoidable, an efficient method is to instantiate an empty vector of the final length and then fill it with values, rather than growing it (see the pre-allocation sketch after the benchmark below). A good rule: never grow a vector.

Below, we see two different ways of creating a vector. (This example was taken from section 3.2.1 of Efficient R Programming.) More information on for loops is in R’s help pages (?`for`).

# this example was taken from section 3.2.1 of Efficient R Programming (2021)

n <- 1000

method1 <- function(n) {
    vec <- NULL  #start with an empty vector (or vec <- c())
    for (i in seq_len(n)) vec <- c(vec, i)  #grow the vector one element at a time
    vec
}

method2 <- function(n) seq_len(n)  #fully vectorized
benchmark(
  method1(n), method2(n),
  
  replications = 100,
  columns = c('test', 'elapsed'),
  order = 'elapsed'
)
        test elapsed
2 method2(n)    0.00
1 method1(n)    0.13
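
The memory-allocation paragraph above also mentioned pre-allocating a vector and filling it in. A minimal sketch of that middle-ground approach (a hypothetical method_prealloc, not part of the benchmark above):

method_prealloc <- function(n) {
    vec <- numeric(n)  #allocate the full length up front
    for (i in seq_len(n)) vec[i] <- i  #fill in place; no copy on each iteration
    vec
}
identical(method_prealloc(1000), as.numeric(method2(1000)))  #same result as the vectorized version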

The *apply family

  • apply() is generic: it applies a function to the rows or columns of a matrix (i.e., to the dimensions of an array)
  • it should not be used on data frames, since they are coerced to a matrix first
  • lapply() is a list apply: it acts on a list or vector and returns a list
  • sapply() is a simplified lapply() (it defaults to returning a vector or matrix when possible)

apply() can be used on 2D arrays (i.e., matrices) and on 3D arrays.

sum_row <- apply(data, 1, sum)  #applies sum to rows
sum_col <- apply(data, 2, sum)  #applies sum to columns
identical(sum_row, rowSums(data))
[1] TRUE
array_3D <- array(seq(10000), dim = c(4, 4, 2))
apply(array_3D, 1, sum)  #sum across the 2nd and 3rd dimensions (keeps dimension 1)
[1] 120 128 136 144
apply(array_3D, c(1, 2), sum)  #sum across the 3rd dimension (keeps dimensions 1 and 2)
     [,1] [,2] [,3] [,4]
[1,]   18   26   34   42
[2,]   20   28   36   44
[3,]   22   30   38   46
[4,]   24   32   40   48

lapply() is useful for lists. Say we want to apply a function per level of a factor variable. We can split the data by factor level, store the data as a list object, run the function, and combine the results again using dplyr::bind_rows().

data$education <- 
  c("high_school", "college", "masters", "phd") %>% #vector of education levels
  rep(., times = nrow(data)/4) %>% #repeat to fill the data: 4 levels, so nrow/4 copies of each
  sample(.) #take a random sample of the vector, thereby randomizing its order

data_ls <- split(data, data$education) #split the data into a list by education level
#lapply(data_ls, sum) would warn: the character education column is still in the data

data_ls <- lapply(data_ls, function(x) x[-length(x)]) #education was appended last, so drop the final column from each list element

Let’s use lapply() to keep, within each education level, the top 25% of cases based on a composite score.

top_group <- function(x) {
    composite <- rowSums(x)
    bin <- quantile(composite)
    keep <- which(composite > bin["75%"])
    x <- x[keep, ]
    return(x)
}

data_ls_top <- lapply(data_ls, top_group)
data_top <- bind_rows(data_ls_top, .id = "education")
data_top %>%
    as_tibble
# A tibble: 252 x 101
   education      V1    V2     V3      V4     V5     V6     V7    V8     V9
   <chr>       <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl>
 1 college   0.940   0.835 0.801  0.281   0.687  0.0237 0.339  0.705 0.957 
 2 college   0.978   0.111 0.198  0.847   0.217  0.752  0.800  0.348 0.837 
 3 college   0.685   0.143 0.222  0.640   0.916  0.276  0.102  0.791 0.883 
 4 college   0.00395 0.361 0.0471 0.0919  0.932  0.397  0.0107 0.781 0.830 
 5 college   0.958   0.198 0.777  0.945   0.531  0.282  0.0540 0.674 0.0715
 6 college   0.677   0.644 0.140  0.00926 0.785  0.526  0.369  0.227 0.289 
 7 college   0.983   0.898 0.151  0.948   0.939  0.116  0.788  0.787 0.325 
 8 college   0.828   0.244 0.246  0.181   0.0980 0.825  0.345  0.498 0.120 
 9 college   0.719   0.678 0.497  0.626   0.972  0.764  0.565  0.720 0.298 
10 college   0.0856  0.696 0.748  0.476   0.286  0.120  0.668  0.455 0.997 
# ... with 242 more rows, and 91 more variables: V10 <dbl>, V11 <dbl>,
#   V12 <dbl>, V13 <dbl>, V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>,
#   V18 <dbl>, V19 <dbl>, V20 <dbl>, V21 <dbl>, V22 <dbl>, V23 <dbl>,
#   V24 <dbl>, V25 <dbl>, V26 <dbl>, V27 <dbl>, V28 <dbl>, V29 <dbl>,
#   V30 <dbl>, V31 <dbl>, V32 <dbl>, V33 <dbl>, V34 <dbl>, V35 <dbl>,
#   V36 <dbl>, V37 <dbl>, V38 <dbl>, V39 <dbl>, V40 <dbl>, V41 <dbl>,
#   V42 <dbl>, V43 <dbl>, V44 <dbl>, V45 <dbl>, V46 <dbl>, V47 <dbl>,
#   V48 <dbl>, V49 <dbl>, V50 <dbl>, V51 <dbl>, V52 <dbl>, V53 <dbl>,
#   V54 <dbl>, V55 <dbl>, V56 <dbl>, V57 <dbl>, V58 <dbl>, V59 <dbl>,
#   V60 <dbl>, V61 <dbl>, V62 <dbl>, V63 <dbl>, V64 <dbl>, V65 <dbl>,
#   V66 <dbl>, V67 <dbl>, V68 <dbl>, V69 <dbl>, V70 <dbl>, V71 <dbl>,
#   V72 <dbl>, V73 <dbl>, V74 <dbl>, V75 <dbl>, V76 <dbl>, V77 <dbl>,
#   V78 <dbl>, V79 <dbl>, V80 <dbl>, V81 <dbl>, V82 <dbl>, V83 <dbl>,
#   V84 <dbl>, V85 <dbl>, V86 <dbl>, V87 <dbl>, V88 <dbl>, V89 <dbl>,
#   V90 <dbl>, V91 <dbl>, V92 <dbl>, V93 <dbl>, V94 <dbl>, V95 <dbl>,
#   V96 <dbl>, V97 <dbl>, V98 <dbl>, V99 <dbl>, V100 <dbl>

sapply() is a ‘simplified’ lapply() that attempts to simplify the result to a vector, matrix, or array when possible. However, the function is not type consistent (the return type depends on the input), which can cause problems, as shown below.

df2 = data.frame(x = 1:5, y = letters[1:5])
df0 = data.frame()
sapply(df2, class)  #character vector
          x           y 
  "integer" "character" 
sapply(df0, class)  #list
named list()
df2[, 1:2]  #data frame
  x y
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
df2[, 1]  #numeric vector
[1] 1 2 3 4 5
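
If type consistency matters (e.g., inside your own functions), vapply() is a safer alternative: you declare the expected type and length of each result, and R throws an error instead of silently changing the return type. A quick sketch:

vapply(df2, class, FUN.VALUE = character(1))  #same named character vector as sapply(df2, class) above
#if a result did not match the character(1) template, vapply() would error rather than change its return type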

Learn more about the apply family in R’s help pages (e.g., ?apply and ?lapply).



Tidyverse and data.table

Time spent processing your data at the beginning of a project can save time in the long run. We will walk through various ways to process data efficiently, both programmatically and algorithmically. For example, packages in the tidyverse provide ways to efficiently print results and work with data. Specifically, the dplyr package provides a sufficiently quick yet easy way to process data, and the raw speed of data.table is useful for wrangling large data sets. Also, the essential %>% (pronounced “pipe”) operator can clarify complex wrangling pipelines, as the sketch below shows.
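
As a quick illustration of how the pipe clarifies wrangling code, compare a nested call (read inside-out) with its piped equivalent (read top-to-bottom); a toy sketch using base functions:

# nested: the first operation (rnorm) is buried in the middle
round(mean(sqrt(abs(rnorm(100)))), 2)
# piped: one step per line, in the order the steps happen
rnorm(100) %>%
  abs() %>%
  sqrt() %>%
  mean() %>%
  round(2)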

Tidyverse

One of the many benefits of tidyverse is that although this ecosystem of packages might be slower than base R at times, it makes programming more intuitive/functional.

tibble()

The tidyverse version of a data.frame object is a tbl_df from the tibble package. It is a class of data frame that makes printing results, subsetting data, and working with variables more accessible.

When printing a tbl_df, we can see the class of each variable; we do not when printing a data.frame. Also, a tbl_df prints only the first ten rows to the console, whereas the full data.frame object is printed. That is, printing a tbl is “tidier” than printing a df.

tbl <- tibble(num = 1:50, char = rep(c("x", "y"), each = 25))
tbl
# A tibble: 50 x 2
     num char 
   <int> <chr>
 1     1 x    
 2     2 x    
 3     3 x    
 4     4 x    
 5     5 x    
 6     6 x    
 7     7 x    
 8     8 x    
 9     9 x    
10    10 x    
# ... with 40 more rows
df <- data.frame(num = 1:50, char = rep(c("x", "y"), each = 25))
df
   num char
1    1    x
2    2    x
3    3    x
4    4    x
5    5    x
6    6    x
7    7    x
8    8    x
9    9    x
10  10    x
11  11    x
12  12    x
13  13    x
14  14    x
15  15    x
16  16    x
17  17    x
18  18    x
19  19    x
20  20    x
21  21    x
22  22    x
23  23    x
24  24    x
25  25    x
26  26    y
27  27    y
28  28    y
29  29    y
30  30    y
31  31    y
32  32    y
33  33    y
34  34    y
35  35    y
36  36    y
37  37    y
38  38    y
39  39    y
40  40    y
41  41    y
42  42    y
43  43    y
44  44    y
45  45    y
46  46    y
47  47    y
48  48    y
49  49    y
50  50    y

Tibbles are strict about subsetting. If you try to access a variable that does not exist, you’ll get a warning along with the NULL (a data.frame would return NULL silently).

tbl$letter
NULL

Tibbles also distinguish between [ and [[: [ always returns another tibble, while [[ always returns a vector.

identical(tbl$char, tbl[["char"]])  #two vectors
[1] TRUE
identical(tbl$char, tbl["char"])  #one vector, one tbl_df
[1] FALSE

pivot_longer()

Another efficient way to wrangle data is tidyr::pivot_longer(). The function makes “wide” tables “long”. We’re going to make education variables binary and then gather the binaries into a single column.

df <- get_random_df(2000, 2, 1, 5)  #random data frame: 2000 values in 2 columns (1000 rows), values between 1 and 5
make_binary <- function() sample(rep(c(1, 0), each = nrow(df)/2))  #randomly ordered vector of 0s and 1s, one per row of df
df <- df %>% #assigning manipulated df to object "df" (same object)
  rename(score = V1, #renaming to "score"
         performance = V2) %>% #renaming to "performance"
  mutate(   #mutating df to include new vars 
    phd = make_binary(),  
    masters = make_binary(),
    college = make_binary(), 
    high_school = make_binary()
  )
df %>% as_tibble #printing df as a tbl_df
# A tibble: 1,000 x 6
   score performance   phd masters college high_school
   <dbl>       <dbl> <dbl>   <dbl>   <dbl>       <dbl>
 1  4.66        4.39     1       0       0           0
 2  4.75        1.25     1       1       0           0
 3  2.14        4.28     0       0       0           0
 4  4.32        3.16     0       1       1           0
 5  3.57        3.00     1       0       1           0
 6  3.08        1.09     1       1       1           0
 7  3.95        3.22     1       1       1           0
 8  1.54        3.88     0       0       0           1
 9  3.63        1.94     1       1       0           0
10  3.82        4.25     1       1       0           0
# ... with 990 more rows
df %>% #grab df
  pivot_longer(
    phd:high_school, #variables to gather  
    names_to = "highest_edu", #name of gathered column 
    values_to = "edu" #name of values column 
  ) -> df_long #assign to a new tbl_df object
df_long #print the tbl_df
# A tibble: 4,000 x 4
   score performance highest_edu   edu
   <dbl>       <dbl> <chr>       <dbl>
 1  4.66        4.39 phd             1
 2  4.66        4.39 masters         0
 3  4.66        4.39 college         0
 4  4.66        4.39 high_school     0
 5  4.75        1.25 phd             1
 6  4.75        1.25 masters         1
 7  4.75        1.25 college         0
 8  4.75        1.25 high_school     0
 9  2.14        4.28 phd             0
10  2.14        4.28 masters         0
# ... with 3,990 more rows

Once we have a long table, we can easily group data for further analysis. Let’s group by the highest education attained and run a simple linear regression per group, predicting performance from score. (Note that pivot_longer() copied every original row once per education column, so each highest_edu group contains the same 1,000 score–performance pairs and the four fitted models below come out identical.)

df_long %>% #grab df_long
  group_by(highest_edu) %>% #group by highest education attained
  do(model = lm(performance ~ score, data = .)) %>% #model simple linear regressions by grouping variable -- do() applies expression to each group that can range in complexity
  summarize(broom::tidy(model)) #summarize models -- broom is a package that converts statistical objects into tibbles
# A tibble: 8 x 5
  term         estimate std.error statistic   p.value
  <chr>           <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  2.99        0.102    29.3    1.68e-136
2 score       -0.000440    0.0322   -0.0137 9.89e-  1
3 (Intercept)  2.99        0.102    29.3    1.68e-136
4 score       -0.000440    0.0322   -0.0137 9.89e-  1
5 (Intercept)  2.99        0.102    29.3    1.68e-136
6 score       -0.000440    0.0322   -0.0137 9.89e-  1
7 (Intercept)  2.99        0.102    29.3    1.68e-136
8 score       -0.000440    0.0322   -0.0137 9.89e-  1

separate()

Another useful function is tidyr::separate(), which splits a single column into multiple columns based on a common delimiter. For example, if a variable holds multiple pieces of information, we can separate the pieces based on some pattern.

sep <- tibble(college_grad = "University of Florida - University of Minnesota, Twin Cities")
sep
# A tibble: 1 x 1
  college_grad                                                
  <chr>                                                       
1 University of Florida - University of Minnesota, Twin Cities
separate(sep, col = college_grad, into = c("college", "grad"), sep = "-")
# A tibble: 1 x 2
  college                  grad                                   
  <chr>                    <chr>                                  
1 "University of Florida " " University of Minnesota, Twin Cities"
separate(sep, col = college_grad, into = c("college", "grad"), sep = " - ")
# A tibble: 1 x 2
  college               grad                                
  <chr>                 <chr>                               
1 University of Florida University of Minnesota, Twin Cities

dplyr

The almighty dplyr is essential for working with data. dplyr is fast (C++ backend) and intuitive to type. It works well with tidy data, databases, and large data structures. In fact, we have used some dplyr functions already in this section.

dplyr was designed with the %>% operator in mind: typically, each data processing step is written on its own line, as we have seen before. Below are the main dplyr functions with short descriptions.

  • filter(), slice(): subsets rows by attribute (filter) or position (slice)
  • arrange(): orders data by variables
  • select(): subsets columns
  • rename(): renames columns (assigning new_name = old_name)
  • distinct(): returns unique, non-duplicated rows
  • mutate(): creates new variables from the existing data set
  • summarize(): summarizes variables by a function and returns a new tbl_df
  • group_by(): groups data by variables so that subsequent functions are applied per group
  • and many more!

df %>% #grab df
  group_by(phd, masters) %>%  #group by phd, then within phd group by masters
  summarize(
    mean_score = mean(score), #get mean score for groups
    mean_performance = mean(performance) #get mean performance for groups
  ) %>% 
  arrange(phd, -masters) #arrange results in ascending  order for phd (the default) and descending order for masters within the levels of phd
# A tibble: 4 x 4
# Groups:   phd [2]
    phd masters mean_score mean_performance
  <dbl>   <dbl>      <dbl>            <dbl>
1     0       1       2.93             2.93
2     0       0       2.91             2.98
3     1       1       2.97             3.04
4     1       0       2.99             3.01

across()

across() is a relatively new and long-awaited dplyr function. It applies another function across multiple columns; the base-R equivalent is apply() with margin 2 (i.e., columns).

Let’s coerce all numeric columns to character using dplyr::across() and apply(), and compare the performance of each.

benchmark(
  dplyr = df %>% mutate(across(everything(), as.character)), 
  apply = apply(df, 2, as.character),
  
  replications = 100, 
  columns = c("test","elapsed"),
  order = "elapsed"
)
   test elapsed
1 dplyr    0.31
2 apply    0.56

data.table

data.table is typically thought of as an alternative to dplyr. It values raw speed over intuitive, accessible syntax. You can learn more in the package vignettes datatable-intro, datatable-reshape, and datatable-reference-semantics, and get a taste of the syntax in the sketch below.
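
Here is a sketch of the grouped summary we computed with dplyr earlier, rewritten in data.table (assuming df still holds the simulated score/performance data):

pacman::p_load(data.table)
dt <- as.data.table(df)  #convert the existing data frame
#j computes the summaries, by does the grouping, and chaining [] orders the result
dt[, .(mean_score = mean(score), mean_performance = mean(performance)),
   by = .(phd, masters)][order(phd, -masters)]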



Example: Big-Five Inventory data

Simulating more BFI data

If we want to simulate data with a multivariate normal distribution, we’re going to need means, SDs, and the covariance matrix. Let’s use the BFI data set from the psych package. To learn more about the BFI data set, run ?psych::bfi.

rm(list = ls())  #clear the global environment

data(bfi)  #get data
glimpse(bfi)  #take a look at the data structure using glimpse from the tibble package
Rows: 2,800
Columns: 28
$ A1        <int> 2, 2, 5, 4, 2, 6, 2, 4, 4, 2, 4, 2, 5, 5, 4, 4, 4, 5, 4, 4, ~
$ A2        <int> 4, 4, 4, 4, 3, 6, 5, 3, 3, 5, 4, 5, 5, 5, 5, 3, 6, 5, 4, 4, ~
$ A3        <int> 3, 5, 5, 6, 3, 5, 5, 1, 6, 6, 5, 5, 5, 5, 2, 6, 6, 5, 5, 6, ~
$ A4        <int> 4, 2, 4, 5, 4, 6, 3, 5, 3, 6, 6, 5, 6, 6, 2, 6, 2, 4, 4, 5, ~
$ A5        <int> 4, 5, 4, 5, 5, 5, 5, 1, 3, 5, 5, 5, 4, 6, 1, 3, 5, 5, 3, 5, ~
$ C1        <int> 2, 5, 4, 4, 4, 6, 5, 3, 6, 6, 4, 5, 5, 4, 5, 5, 4, 5, 5, 1, ~
$ C2        <int> 3, 4, 5, 4, 4, 6, 4, 2, 6, 5, 3, 4, 4, 4, 5, 5, 4, 5, 4, 1, ~
$ C3        <int> 3, 4, 4, 3, 5, 6, 4, 4, 3, 6, 5, 5, 3, 4, 5, 5, 4, 5, 5, 1, ~
$ C4        <int> 4, 3, 2, 5, 3, 1, 2, 2, 4, 2, 3, 4, 2, 2, 2, 3, 4, 4, 4, 5, ~
$ C5        <int> 4, 4, 5, 5, 2, 3, 3, 4, 5, 1, 2, 5, 2, 1, 2, 5, 4, 3, 6, 6, ~
$ E1        <int> 3, 1, 2, 5, 2, 2, 4, 3, 5, 2, 1, 3, 3, 2, 3, 1, 1, 2, 1, 1, ~
$ E2        <int> 3, 1, 4, 3, 2, 1, 3, 6, 3, 2, 3, 3, 3, 2, 4, 1, 2, 2, 2, 1, ~
$ E3        <int> 3, 6, 4, 4, 5, 6, 4, 4, NA, 4, 2, 4, 3, 4, 3, 6, 5, 4, 4, 4,~
$ E4        <int> 4, 4, 4, 4, 4, 5, 5, 2, 4, 5, 5, 5, 2, 6, 6, 6, 5, 6, 5, 5, ~
$ E5        <int> 4, 3, 5, 4, 5, 6, 5, 1, 3, 5, 4, 4, 4, 5, 5, 4, 5, 6, 5, 6, ~
$ N1        <int> 3, 3, 4, 2, 2, 3, 1, 6, 5, 5, 3, 4, 1, 1, 2, 4, 4, 6, 5, 5, ~
$ N2        <int> 4, 3, 5, 5, 3, 5, 2, 3, 5, 5, 3, 5, 2, 1, 4, 5, 4, 5, 6, 5, ~
$ N3        <int> 2, 3, 4, 2, 4, 2, 2, 2, 2, 5, 4, 3, 2, 1, 2, 4, 4, 5, 5, 5, ~
$ N4        <int> 2, 5, 2, 4, 4, 2, 1, 6, 3, 2, 2, 2, 2, 2, 2, 5, 4, 4, 5, 1, ~
$ N5        <int> 3, 5, 3, 1, 3, 3, 1, 4, 3, 4, 3, NA, 2, 1, 3, 5, 5, 4, 2, 1,~
$ O1        <int> 3, 4, 4, 3, 3, 4, 5, 3, 6, 5, 5, 4, 4, 5, 5, 6, 5, 5, 4, 4, ~
$ O2        <int> 6, 2, 2, 3, 3, 3, 2, 2, 6, 1, 3, 6, 2, 3, 2, 6, 1, 1, 2, 1, ~
$ O3        <int> 3, 4, 5, 4, 4, 5, 5, 4, 6, 5, 5, 4, 4, 4, 5, 6, 5, 4, 2, 5, ~
$ O4        <int> 4, 3, 5, 3, 3, 6, 6, 5, 6, 5, 6, 5, 5, 4, 5, 3, 6, 5, 4, 3, ~
$ O5        <int> 3, 3, 2, 5, 3, 1, 1, 3, 1, 2, 3, 4, 2, 4, 5, 2, 3, 4, 2, 2, ~
$ gender    <int> 1, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, ~
$ education <int> NA, NA, NA, NA, NA, 3, NA, 2, 1, NA, 1, NA, NA, NA, 1, NA, N~
$ age       <int> 16, 18, 17, 17, 17, 21, 18, 19, 19, 17, 21, 16, 16, 16, 17, ~
bfi <- bfi[complete.cases(bfi), ]  #drop cases with at least 1 missing value 
bfi_trim <- bfi %>%
    select(-gender, -age, -education)  #or subset(bfi, select= -c(gender, age, education)) to drop gender, age, edu columns 
glimpse(bfi_trim)  #check the structure again to compare
Rows: 2,236
Columns: 25
$ A1 <int> 6, 4, 4, 4, 1, 2, 4, 1, 2, 2, 4, 2, 1, 5, 1, 1, 1, 1, 1, 5, 2, 1, 5~
$ A2 <int> 6, 3, 4, 5, 5, 6, 5, 6, 4, 5, 5, 5, 5, 3, 6, 4, 5, 5, 5, 6, 6, 6, 5~
$ A3 <int> 5, 1, 5, 2, 6, 5, 5, 6, 4, 1, 6, 6, 6, 5, 4, 4, 4, 5, 4, 4, 6, 6, 3~
$ A4 <int> 6, 5, 6, 2, 5, 6, 6, 1, 4, 3, 5, 6, 5, 4, 6, 2, 3, 6, 4, 3, 6, 6, 4~
$ A5 <int> 5, 1, 5, 1, 6, 5, 5, 6, 3, 5, 5, 6, 4, 2, 4, 3, 5, 5, 5, 5, 6, 6, 3~
$ C1 <int> 6, 3, 4, 5, 4, 3, 5, 5, 6, 5, 5, 5, 1, 2, 5, 6, 6, 4, 4, 5, 5, 5, 4~
$ C2 <int> 6, 2, 3, 5, 3, 5, 5, 2, 5, 4, 5, 5, 5, 2, 6, 5, 5, 4, 5, 3, 4, 2, 4~
$ C3 <int> 6, 4, 5, 5, 2, 6, 4, 5, 6, 5, 3, 5, 6, 4, 3, 6, 5, 4, 4, 3, 5, 1, 3~
$ C4 <int> 1, 2, 3, 2, 4, 3, 1, 1, 1, 2, 5, 2, 4, 3, 1, 3, 2, 3, 3, 4, 3, 2, 4~
$ C5 <int> 3, 4, 2, 2, 5, 6, 1, 1, 1, 5, 4, 4, 6, 4, 5, 4, 2, 4, 3, 5, 4, 1, 5~
$ E1 <int> 2, 3, 1, 3, 2, 2, 3, 1, 2, 1, 1, 1, 6, 3, 6, 3, 3, 4, 3, 6, 2, 6, 4~
$ E2 <int> 1, 6, 3, 4, 1, 2, 2, 1, 4, 2, 2, 2, 6, 4, 6, 4, 2, 3, 3, 4, 2, 5, 4~
$ E3 <int> 6, 4, 2, 3, 2, 4, 5, 6, 4, 6, 6, 4, 2, 3, 3, 3, 3, 4, 2, 4, 4, 6, 5~
$ E4 <int> 5, 2, 5, 6, 5, 6, 5, 6, 2, 5, 5, 5, 1, 2, 2, 3, 6, 4, 5, 4, 5, 6, 2~
$ E5 <int> 6, 1, 4, 5, 2, 6, 6, 6, 6, 4, 5, 5, 1, 3, 2, 5, 5, 4, 4, 3, 5, 5, 4~
$ N1 <int> 3, 6, 3, 2, 2, 4, 2, 2, 3, 1, 5, 3, 1, 5, 2, 5, 1, 2, 2, 2, 2, 2, 4~
$ N2 <int> 5, 3, 3, 4, 2, 4, 3, 3, 3, 4, 4, 2, 2, 3, 2, 6, 2, 2, 3, 2, 2, 1, 5~
$ N3 <int> 2, 2, 4, 2, 2, 4, 3, 1, 5, 2, 4, 4, 1, 4, 2, 5, 1, 3, 1, 3, 2, 4, 3~
$ N4 <int> 2, 6, 2, 2, 2, 6, 1, 2, 3, 2, 3, 1, 3, 4, 4, 5, 2, 3, 3, 4, 2, 6, 5~
$ N5 <int> 3, 4, 3, 3, 2, 6, 1, 1, 2, 5, 1, 2, 6, 3, 1, 4, 1, 3, 2, 5, 3, 5, 2~
$ O1 <int> 4, 3, 5, 5, 6, 6, 6, 6, 5, 2, 4, 5, 6, 4, 5, 5, 5, 4, 4, 3, 5, 6, 3~
$ O2 <int> 3, 2, 3, 2, 1, 1, 2, 4, 2, 4, 4, 2, 6, 5, 5, 5, 1, 3, 3, 5, 2, 5, 5~
$ O3 <int> 5, 4, 5, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 4, 4, 4, 6, 2, 5, 4, 5, 6, 4~
$ O4 <int> 6, 5, 6, 5, 5, 6, 6, 5, 6, 4, 5, 5, 6, 4, 5, 5, 6, 5, 4, 4, 5, 6, 4~
$ O5 <int> 1, 3, 3, 5, 2, 1, 2, 3, 1, 1, 1, 2, 1, 3, 3, 2, 1, 2, 3, 4, 1, 1, 2~
mu <- colMeans(bfi_trim)  #get variable means
sd <- apply(bfi_trim, 2, sd)  #get variable SDs
cor_m <- cor(bfi_trim)  #get correlation matrix
cov_m <- sd %*% t(sd) * cor_m  #covariance matrix: cov_ij = sd_i * sd_j * cor_ij
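
As a sanity check, the reconstructed matrix should match the sample covariance matrix computed directly (unname() just sidesteps dimname differences):

all.equal(unname(cov_m), unname(cov(bfi_trim)))  #should return TRUE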

Simulating data can take a while, especially at larger sample sizes. We can simulate multivariate normal data using MASS::mvrnorm(). Let’s also check out another function, anMC::mvrnormArma(), to compare speed.

set.seed(9201994)
benchmark(
  MASS::mvrnorm(1e6, mu, cov_m, empirical=T),
  anMC::mvrnormArma(1e6, mu, cov_m, 0),
  
  replications = 1,
  columns = c('test', 'elapsed'),
  order = 'elapsed'
) #benchmarking simulating 1mil cases of the sampled data set 
                                            test elapsed
2         anMC::mvrnormArma(1e+06, mu, cov_m, 0)    1.76
1 MASS::mvrnorm(1e+06, mu, cov_m, empirical = T)   13.96

We can clearly see from the output above that mvrnormArma() is roughly 8x faster (13.96 / 1.76 ≈ 8). Why? Because it calls C++ on the backend to create the simulated data.

Let’s assign the new data to a data frame object to compare differences in saving and loading data. Matrices are typically faster to work with, but I-O psychologists typically work with tabular data, so we will be using data frames, tibbles, or data tables. Let’s start with the most common: data frames. We should also add back our categorical variables, one vector at a time (in random order).

BFI data

set.seed(9201994)
bfi_df <-anMC::mvrnormArma(1e6, mu, cov_m, 0) %>% 
  t() %>% #transpose
  as.data.frame() #get data.frame
names(bfi_df) <- names(bfi_trim)

Gender

# let's keep things proportional 
gender <- round(table(bfi$gender)/nrow(bfi), digits = 5) #gender proportion
gender #view gender

      1       2 
0.32871 0.67129 
n <- length(gender)
genders <- as.numeric(names(gender))
gender_vec <- c()

pacman::p_load(microbenchmark)  #for microbenchmark()
microbenchmark(times = 10, unit = "s",
               # growing a vector (gender_vec persists across the 10 replications, so it keeps growing and later runs get slower)
               gender_vec_grow =
                 for(i in 1:n) {
                   gender_vec <- c(gender_vec, rep(genders[i], 1e6 * gender[i]))
                 }, 
               # vectorized
               gender_vec = 
                 rep(genders[1:n], 1e6 * gender[1:n]) 
)
Unit: seconds
            expr       min        lq       mean    median        uq       max
 gender_vec_grow 0.0082509 0.0240679 0.04573494 0.0381871 0.0507344 0.1348271
      gender_vec 0.0023907 0.0024279 0.00250867 0.0024953 0.0025869 0.0026732
 neval cld
    10   b
    10  a 
gender_vec <- rep(genders[1:n], 1e6 * gender[1:n])
bfi_df$gender <- sample(gender_vec) #assigning proportions to gender vector on bfi_df. adding sample randomizes the order of the vector

Age

# let's keep things proportional
age <- round(table(bfi$age)/nrow(bfi), digits = 5)
age  #view age proportions

      3      11      14      15      16      17      18      19      20      21 
0.00045 0.00045 0.00045 0.00134 0.00626 0.01386 0.04785 0.07335 0.07961 0.05769 
     22      23      24      25      26      27      28      29      30      31 
0.04562 0.05725 0.04114 0.04472 0.03623 0.03712 0.03309 0.03220 0.02460 0.02773 
     32      33      34      35      36      37      38      39      40      41 
0.02549 0.01968 0.02057 0.02013 0.01834 0.01476 0.02102 0.01789 0.02057 0.01252 
     42      43      44      45      46      47      48      49      50      51 
0.01029 0.01431 0.01073 0.01073 0.00894 0.00671 0.01163 0.00447 0.01386 0.00716 
     52      53      54      55      56      57      58      59      60      61 
0.00850 0.00671 0.00626 0.00492 0.00581 0.00268 0.00268 0.00089 0.00268 0.00134 
     62      63      64      65      66      67      68      74      86 
0.00134 0.00134 0.00045 0.00045 0.00045 0.00134 0.00045 0.00045 0.00045 
n <- length(age)
ages <- as.numeric(names(age))
age_vec <- rep(ages[1:n], 1e+06 * age[1:n])  #vectorized, per the benchmark above
bfi_df$age <- sample(age_vec)

Education

# let's keep things proportional
education <- round(table(bfi$education)/nrow(bfi), digits = 5)
education  #view education proportions 

      1       2       3       4       5 
0.08855 0.11181 0.48211 0.15474 0.16279 
n <- length(education)
educations <- as.numeric(names(education))
education_vec <- rep(educations[1:n], 1e+06 * education[1:n])
bfi_df$education <- sample(education_vec)

Every time you manipulate the data frame, check the data before proceeding so that you catch any mistakes early. Let’s see what the final data set looks like.

glimpse(bfi_df)  #always check your data after manipulating the data frame before proceeding
Rows: 1,000,000
Columns: 28
$ A1        <dbl> 4.98512917, 1.91873727, -0.25217646, 2.78180897, 1.17837980,~
$ A2        <dbl> 4.272613, 4.101842, 5.787534, 5.205831, 3.896131, 5.020339, ~
$ A3        <dbl> 4.865218, 4.329908, 5.557391, 4.855733, 6.117381, 5.567631, ~
$ A4        <dbl> 5.476493, 3.837986, 5.306861, 5.647894, 5.668854, 7.055909, ~
$ A5        <dbl> 5.555879, 4.596857, 2.988709, 5.405770, 5.262791, 4.769970, ~
$ C1        <dbl> 5.457633, 4.587901, 4.882483, 3.567453, 6.646876, 2.802143, ~
$ C2        <dbl> 6.742743, 2.810552, 3.876032, 4.715007, 4.642838, 4.488382, ~
$ C3        <dbl> 4.441823, 4.297566, 1.928377, 3.030105, 2.647775, 2.193794, ~
$ C4        <dbl> 3.6383434, 3.0339294, 1.6553426, 3.7270366, 2.2621397, 3.139~
$ C5        <dbl> 4.21871642, 2.64556664, 3.27568178, 5.65775552, 5.11814967, ~
$ E1        <dbl> 8.8836196, 0.8327891, 2.8785604, 5.6763096, 3.6623101, 4.834~
$ E2        <dbl> 6.8169252, 1.5661497, 4.4259251, 4.4116572, 2.8763598, 3.461~
$ E3        <dbl> 3.933498, 2.368665, 2.012880, 5.328615, 6.263056, 4.248638, ~
$ E4        <dbl> 2.675041, 5.383466, 4.049645, 4.032899, 4.457834, 4.260493, ~
$ E5        <dbl> 2.552034, 3.980497, 2.087986, 4.408435, 4.855411, 3.703792, ~
$ N1        <dbl> 2.12655182, 2.40700929, 2.78458338, 2.25950355, 3.29858002, ~
$ N2        <dbl> 4.8134651, 2.6945459, 4.1183656, 0.7878628, 3.6108563, 2.215~
$ N3        <dbl> 5.934404, 1.692800, 5.134269, 4.328980, 4.947094, 2.158376, ~
$ N4        <dbl> 6.0312045, 3.2625564, 3.5378901, 3.2924299, 3.7651757, 2.847~
$ N5        <dbl> 5.5440854, 2.0387079, 4.8161490, 1.8064135, 2.5844743, 2.517~
$ O1        <dbl> 5.692757, 4.385927, 2.181182, 5.078008, 4.478001, 3.166005, ~
$ O2        <dbl> 3.4021118, 1.8822306, 2.8189047, 2.4951893, 4.6048038, 2.032~
$ O3        <dbl> 3.503269, 4.352442, 1.817887, 5.628011, 5.104225, 5.852676, ~
$ O4        <dbl> 6.640027, 5.929866, 2.724339, 6.709802, 5.953817, 3.990266, ~
$ O5        <dbl> 4.8215096, 3.4208458, 5.2356782, 3.5194146, 2.0110209, 2.325~
$ gender    <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, ~
$ age       <dbl> 26, 22, 49, 18, 31, 23, 50, 28, 22, 19, 48, 23, 33, 20, 26, ~
$ education <dbl> 1, 3, 1, 1, 3, 4, 2, 1, 5, 3, 4, 4, 3, 5, 4, 4, 3, 3, 1, 3, ~

Earlier, we discovered that the RDS and data.table functions are the fastest ways to save and load data. Saving RDS objects is the most flexible way to write out and read in data without compromising much performance, whereas the fwrite() and fread() functions were the fastest in terms of raw speed. Let’s compare these two approaches using the simulated BFI data.

benchmark(
  saveRDS(bfi_df, "../data/bfi_df.rds"),
  fwrite(bfi_df,"../data/bfi_df_dt.csv"), 
  
  replications = 1,
  columns = c('test', 'elapsed'),
  order = 'elapsed'
)
                                     test elapsed
2 fwrite(bfi_df, "../data/bfi_df_dt.csv")    1.04
1   saveRDS(bfi_df, "../data/bfi_df.rds")   10.19
benchmark(
  readRDS("../data/bfi_df.rds"),
  fread("../data/bfi_df_dt.csv"), 
  
  replications = 1,
  columns = c('test', 'elapsed'),
  order = 'elapsed'
)
                            test elapsed
2 fread("../data/bfi_df_dt.csv")    0.38
1  readRDS("../data/bfi_df.rds")    1.36

Scoring items

We can make a list of keys specifying how to score individual items/variables. Once we have the item keys, we can score the items under a variety of specifications (e.g., reverse-scoring items based on the minimum and maximum item values). Run ?psych::scoreItems for more information.

keys.list <- list(
    agree = c("-A1", paste0("A", 2:5)),
    conscientious = c(paste0("C", 1:3), "-C4", "-C5"),
    extraversion = c("-E1", "-E2", paste0("E", 3:5)),
    neuroticism = paste0("N", 1:5),
    openness = c("O1", "-O2", "O3", "O4", "-O5")
)
bfi_scored <- scoreItems(keys.list, bfi_df[1:25])
bfi_scored
Call: scoreItems(keys = keys.list, items = bfi_df[1:25])

(Unstandardized) Alpha:
      agree conscientious extraversion neuroticism openness
alpha  0.71          0.73         0.77        0.82     0.61

Standard errors of unstandardized Alpha:
        agree conscientious extraversion neuroticism openness
ASE   0.00075       0.00071      0.00066     0.00059  0.00087

Average item correlation:
          agree conscientious extraversion neuroticism openness
average.r  0.32          0.35          0.4        0.47     0.24

Median item correlation:
        agree conscientious  extraversion   neuroticism      openness 
         0.34          0.35          0.40          0.42          0.24 

 Guttman 6* reliability: 
         agree conscientious extraversion neuroticism openness
Lambda.6  0.71          0.73         0.77        0.82     0.62

Signal/Noise based upon av.r : 
             agree conscientious extraversion neuroticism openness
Signal/Noise   2.4           2.7          3.3         4.4      1.6

Scale intercorrelations corrected for attenuation 
 raw correlations below the diagonal, alpha on the diagonal 
 corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.71          0.34         0.64      -0.231     0.18
conscientious  0.24          0.73         0.36      -0.298     0.27
extraversion   0.47          0.27         0.77      -0.281     0.32
neuroticism   -0.17         -0.23        -0.22       0.816    -0.12
openness       0.12          0.18         0.22      -0.084     0.61

 In order to see the item by scale loadings and frequency counts of the data
 print with the short option = FALSE

Fast descriptives

Let’s get summary stats and compare the performance of common descriptives functions.

pacman::p_load(Hmisc, pastecs, rstatix)

benchmark(
  hmisc = Hmisc::describe(bfi_df),
  psych = psych::describe(bfi_df),
  fpsych = psych::describe(bfi_df, fast=T),
  pastecs = pastecs::stat.desc(bfi_df),
  rstatix = rstatix::get_summary_stats(bfi_df),
  
  replications = 1,
  columns = c('test', 'elapsed'),
  order = 'elapsed'
)
     test elapsed
2   psych    3.72
3  fpsych    3.89
5 rstatix   14.36
4 pastecs   23.01
1   hmisc   57.52
bfi_df_summary <- psych::describe(bfi_df, fast=T)
bfi_df_summary
          vars     n  mean    sd   min   max range   se
A1           1 1e+06  2.36  1.39 -4.18  9.22 13.40 0.00
A2           2 1e+06  4.83  1.16 -0.51 10.44 10.95 0.00
A3           3 1e+06  4.63  1.29 -1.41 10.91 12.32 0.00
A4           4 1e+06  4.75  1.45 -1.72 13.31 15.03 0.00
A5           5 1e+06  4.58  1.26 -1.55 10.99 12.54 0.00
C1           6 1e+06  4.57  1.22 -1.02 10.36 11.38 0.00
C2           7 1e+06  4.40  1.31 -1.44 10.47 11.91 0.00
C3           8 1e+06  4.32  1.29 -1.89 10.55 12.44 0.00
C4           9 1e+06  2.50  1.36 -4.04  9.20 13.24 0.00
C5          10 1e+06  3.26  1.63 -5.24 11.26 16.51 0.00
E1          11 1e+06  2.97  1.62 -4.37 10.53 14.90 0.00
E2          12 1e+06  3.12  1.61 -5.51 11.00 16.52 0.00
E3          13 1e+06  4.01  1.34 -2.53 10.72 13.25 0.00
E4          14 1e+06  4.43  1.46 -2.77 11.39 14.16 0.00
E5          15 1e+06  4.42  1.33 -1.59 10.91 12.50 0.00
N1          16 1e+06  2.91  1.56 -4.63 10.78 15.41 0.00
N2          17 1e+06  3.49  1.54 -3.81 10.83 14.64 0.00
N3          18 1e+06  3.20  1.60 -4.69 10.27 14.96 0.00
N4          19 1e+06  3.18  1.56 -4.53 10.65 15.19 0.00
N5          20 1e+06  2.95  1.62 -4.77 10.76 15.53 0.00
O1          21 1e+06  4.82  1.12 -0.61 10.47 11.08 0.00
O2          22 1e+06  2.69  1.55 -4.66 10.30 14.95 0.00
O3          23 1e+06  4.48  1.19 -1.24 10.15 11.39 0.00
O4          24 1e+06  4.95  1.17 -0.92 10.77 11.68 0.00
O5          25 1e+06  2.46  1.33 -4.81  9.14 13.95 0.00
gender      26 1e+06  1.67  0.47  1.00  2.00  1.00 0.00
age         27 1e+06 29.51 10.66  3.00 86.00 83.00 0.01
education   28 1e+06  3.19  1.11  1.00  5.00  4.00 0.00

We can see that the psych package performs the best.

Let’s get summary stats by group and compare performance across functions. We will use the pipeable rstatix package, which implements commonly used statistical tests with great efficiency. It’s easy to use and intuitive if you are familiar with the tidyverse framework.

First, let’s drop the age column; we only want to group by gender and education for now. Let’s compare the performance of subsetting in base R against dplyr::select().

microbenchmark(unit = "s",
    subset = subset(bfi_df, select = -age),
    base = bfi_df[-which(names(bfi_df) == "age")],
    dplyr = bfi_df %>% select(-age)
)
Unit: seconds
   expr       min        lq        mean     median         uq       max neval
 subset 0.1817567 0.2470196 0.274318339 0.25980475 0.27465780 0.5020816   100
   base 0.0000157 0.0000255 0.000048281 0.00003040 0.00006945 0.0001296   100
  dplyr 0.0010276 0.0011293 0.001467883 0.00146475 0.00159710 0.0032317   100
 cld
   b
  a 
  a 

Indexing with brackets is typically faster than the tidyverse equivalent, but the code is less accessible. This is a common trade-off between the tidyverse approach and other approaches. For now, let’s use bracket notation to get the data set. Then let’s compare the performance of different functions for group comparisons by gender and education.

bfi_group <- bfi_df[-which(names(bfi_df)=="age")] #or bfi_df %>% select(-age)

benchmark(
  psych_describeBy = psych::describeBy(bfi_group[1:25], list(bfi_group$gender, bfi_group$education), mat=T),
  rstatix_groupby = bfi_group %>% group_by(gender, education) %>% get_summary_stats(), 
  
  replications = 1,
  columns = c('test', 'elapsed'),
  order = 'elapsed'
)
              test elapsed
2  rstatix_groupby   10.58
1 psych_describeBy   33.72

Although overall summary statistics are faster with the psych package than with rstatix, the opposite is true once we compute summary statistics on grouped data.

bfi_summary_stats_group <- bfi_group %>%
    group_by(gender, education) %>%
    get_summary_stats()
bfi_summary_stats_group
# A tibble: 250 x 15
   gender education variable     n    min   max median    q1    q3   iqr   mad
    <dbl>     <dbl> <chr>    <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
 1      1         1 A1       29076 -3.41   7.64   2.37  1.43  3.31  1.88  1.39
 2      1         1 A2       29076 -0.478  9.32   4.84  4.05  5.62  1.58  1.17
 3      1         1 A3       29076 -0.307  9.88   4.64  3.76  5.51  1.74  1.29
 4      1         1 A4       29076 -0.807 11.4    4.77  3.78  5.74  1.96  1.46
 5      1         1 A5       29076 -1.29   9      4.59  3.74  5.44  1.70  1.26
 6      1         1 C1       29076 -0.106 10.4    4.57  3.76  5.39  1.63  1.21
 7      1         1 C2       29076 -0.98   9.58   4.39  3.49  5.27  1.78  1.32
 8      1         1 C3       29076 -1.17   9.24   4.34  3.47  5.18  1.71  1.27
 9      1         1 C4       29076 -3.08   8.07   2.51  1.59  3.42  1.83  1.36
10      1         1 C5       29076 -3.30  10.2    3.27  2.16  4.36  2.20  1.64
# ... with 240 more rows, and 4 more variables: mean <dbl>, sd <dbl>, se <dbl>,
#   ci <dbl>

Group differences

An easy way to get standardized mean differences by group is rstatix::cohens_d(). We can use pivot_longer() to gather a column of items and a column of item scores, and then calculate d between groups by item to understand how the items are functioning by group.

bfi_tbl <- as_tibble(bfi_df)  #converting to tbl_df (age and education are dropped in the next step)

bfi_tbl_group <- bfi_tbl %>%  #grab tbl
  pivot_longer(               #pivot BFI items 
    A1:O5, 
    names_to = "item",        #name gathered column
    values_to = "score"       #name values columns
  ) %>%  
  select(-age, -education) %>% #drop age, edu columns
  group_by(item)              #group by item

glimpse(bfi_tbl_group)
Rows: 25,000,000
Columns: 3
Groups: item [25]
$ gender <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
$ item   <chr> "A1", "A2", "A3", "A4", "A5", "C1", "C2", "C3", "C4", "C5", "E1~
$ score  <dbl> 4.985129, 4.272613, 4.865218, 5.476493, 5.555879, 5.457633, 6.7~

This is a large data set, with 25 million rows and 3 variables. Since the data are large, we should make a tinkering tibble to test out different functions for calculating d. Let’s take the first 1000 rows of our data using dplyr::slice(). Since the data are grouped, using dplyr::slice(1:1000) would take the first 1000 rows per item (i.e., the grouping variable). We need to ungroup, slice, and regroup.

bfi_tbl_group_slice <- bfi_tbl_group %>%
    ungroup(item) %>%
    slice(1:1000) %>%
    group_by(item)

bfi_tbl_group_slice
# A tibble: 1,000 x 3
# Groups:   item [25]
   gender item  score
    <dbl> <chr> <dbl>
 1      2 A1     4.99
 2      2 A2     4.27
 3      2 A3     4.87
 4      2 A4     5.48
 5      2 A5     5.56
 6      2 C1     5.46
 7      2 C2     6.74
 8      2 C3     4.44
 9      2 C4     3.64
10      2 C5     4.22
# ... with 990 more rows

The rstatix way is easy to read and respects the grouping already set on the data, whereas with psych we have to add the group to the formula. The outputs also differ considerably, and the psych output needs further manipulation before it is usable downstream.

bfi_tbl_group_slice %>%
    cohens_d(score ~ gender)
# A tibble: 25 x 8
   .y.   group1 group2 effsize item     n1    n2 magnitude 
 * <chr> <chr>  <chr>    <dbl> <chr> <int> <int> <ord>     
 1 score 1      2       0.124  A1       13    27 negligible
 2 score 1      2       0.207  A2       13    27 small     
 3 score 1      2       0.0638 A3       13    27 negligible
 4 score 1      2       0.245  A4       13    27 small     
 5 score 1      2       0.0739 A5       13    27 negligible
 6 score 1      2      -0.142  C1       13    27 negligible
 7 score 1      2      -0.0283 C2       13    27 negligible
 8 score 1      2      -0.0729 C3       13    27 negligible
 9 score 1      2       0.240  C4       13    27 small     
10 score 1      2      -0.348  C5       13    27 small     
# ... with 15 more rows
bfi_tbl_group_slice %>%
    cohen.d(score ~ gender + item, data = .) %>%
    summary
Extract effect sizes by groups from cohen.d.by
                 score    Md
genderforitemA1 -0.123 0.123
genderforitemA2 -0.230 0.230
genderforitemA3 -0.064 0.064
genderforitemA4 -0.240 0.240
genderforitemA5 -0.079 0.079
genderforitemC1  0.149 0.149
genderforitemC2  0.029 0.029
genderforitemC3  0.079 0.079
genderforitemC4 -0.255 0.255
genderforitemC5  0.356 0.356
genderforitemE1  0.564 0.564
genderforitemE2  0.195 0.195
genderforitemE3 -0.490 0.490
genderforitemE4 -0.167 0.167
genderforitemE5 -0.121 0.121
genderforitemN1  0.385 0.385
genderforitemN2  0.288 0.288
genderforitemN3  0.453 0.453
genderforitemN4  1.131 1.131
genderforitemN5 -0.278 0.278
genderforitemO1 -0.280 0.280
genderforitemO2 -0.088 0.088
genderforitemO3 -0.049 0.049
genderforitemO4  0.756 0.756
genderforitemO5  0.172 0.172

Although psych might be ~2x faster in raw speed, rstatix gives us practically useful output that we would otherwise spend valuable time reconstructing from the psych output. However, if you only want the effect size without any further data manipulation, use psych. Also, double-check the direction in which groups are compared (note the flipped signs between the two outputs above).

benchmark( 
  rstatix = bfi_tbl_group_slice %>% 
    cohens_d(score ~ gender),
  
  psych =   bfi_tbl_group_slice %>% 
    cohen.d(score ~ gender + item, data = .) %>% 
    summary ,
  
  replications = 1, 
  columns = c("test", "elapsed"),
  order = "elapsed"
)
Extract effect sizes by groups from cohen.d.by
                 score    Md
genderforitemA1 -0.123 0.123
genderforitemA2 -0.230 0.230
genderforitemA3 -0.064 0.064
genderforitemA4 -0.240 0.240
genderforitemA5 -0.079 0.079
genderforitemC1  0.149 0.149
genderforitemC2  0.029 0.029
genderforitemC3  0.079 0.079
genderforitemC4 -0.255 0.255
genderforitemC5  0.356 0.356
genderforitemE1  0.564 0.564
genderforitemE2  0.195 0.195
genderforitemE3 -0.490 0.490
genderforitemE4 -0.167 0.167
genderforitemE5 -0.121 0.121
genderforitemN1  0.385 0.385
genderforitemN2  0.288 0.288
genderforitemN3  0.453 0.453
genderforitemN4  1.131 1.131
genderforitemN5 -0.278 0.278
genderforitemO1 -0.280 0.280
genderforitemO2 -0.088 0.088
genderforitemO3 -0.049 0.049
genderforitemO4  0.756 0.756
genderforitemO5  0.172 0.172
Extract effect sizes by groups from cohen.d.by
                 score    Md
genderforitemA1 -0.123 0.123
genderforitemA2 -0.230 0.230
genderforitemA3 -0.064 0.064
genderforitemA4 -0.240 0.240
genderforitemA5 -0.079 0.079
genderforitemC1  0.149 0.149
genderforitemC2  0.029 0.029
genderforitemC3  0.079 0.079
genderforitemC4 -0.255 0.255
genderforitemC5  0.356 0.356
genderforitemE1  0.564 0.564
genderforitemE2  0.195 0.195
genderforitemE3 -0.490 0.490
genderforitemE4 -0.167 0.167
genderforitemE5 -0.121 0.121
genderforitemN1  0.385 0.385
genderforitemN2  0.288 0.288
genderforitemN3  0.453 0.453
genderforitemN4  1.131 1.131
genderforitemN5 -0.278 0.278
genderforitemO1 -0.280 0.280
genderforitemO2 -0.088 0.088
genderforitemO3 -0.049 0.049
genderforitemO4  0.756 0.756
genderforitemO5  0.172 0.172
     test elapsed
2   psych    0.22
1 rstatix    0.40

Yet, given 25 million rows, we would need to calculate 25 effect sizes on a sample of 1 million each. Calculating d on just 1,000 rows took ~0.5 seconds using rstatix; scaling up to 25 million rows would take roughly 208 minutes. This is a prime example of how inefficiencies compound when scaling up.
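
The back-of-envelope arithmetic, assuming the cost scales linearly with the number of rows:

(25e6 / 1e3) * 0.5 / 60  #(total rows / rows per run) * seconds per run, in minutes: ~208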

Parallelization

A better way to approach the problem is to parallelize across all cores. Parallelization is the act of designing a computer program or system to process data in parallel. (You can loosely think of it as spreading work across physical cores rather than within them.) Normally, computer programs run serially, computing one thing at a time. If we parallelize, we can speed up computation by breaking the problem into smaller pieces that are solved independently and simultaneously.

For this example, we will split the data set into a list, where every element of that list is an item. Then we can loop through the list using foreach and the %dopar% operator to run a for loop in parallel. For smaller data sets we don’t see performance gains, because the fixed overhead of setting up workers dominates; as the data grow, that overhead is amortized and the parallel version catches up.

bfi_tbl_long <-  bfi_tbl_group %>% slice(1:1000) #25k

bfi_ls <- split(
  bfi_tbl_long[, !names(bfi_tbl_long) == "item" ],  #split into list but remove the item variable
  bfi_tbl_long$item #split on item variable
) 

system.time({
  rstatix::cohens_d(bfi_tbl_long, score ~ gender)
})
   user  system elapsed 
   0.53    0.00    0.54 
pacman::p_load(doParallel)  #provides registerDoParallel() and attaches foreach
registerDoParallel(detectCores() - 1) #register the cluster to run in parallel
system.time({
  d_gender <- 
    foreach(i=1:25) %dopar% {
      rstatix::cohens_d(bfi_ls[[i]], score ~ gender) 
    }
})
   user  system elapsed 
   0.06    0.02    3.21 
registerDoSEQ() #register cores to compute sequentially 
bfi_tbl_long <-  bfi_tbl_group %>% slice(1:1e5)  #2.5 mil (2500k)

bfi_ls <- split(
  bfi_tbl_long[, !names(bfi_tbl_long) == "item" ],  #split into list but remove the item variable
  bfi_tbl_long$item #split on item variable
) 

system.time({
  rstatix::cohens_d(bfi_tbl_long, score ~ gender)
})
   user  system elapsed 
  10.55    0.02   10.57 
registerDoParallel(detectCores() - 1) #register the cluster to run in parallel
system.time({
  d_gender <- 
    foreach(i=1:25) %dopar% {
      rstatix::cohens_d(bfi_ls[[i]], score ~ gender) 
    }
})
   user  system elapsed 
   3.09    2.36   10.42 
registerDoSEQ() #register cores to compute sequentially 

To view the results, we can bind the rows across the list, since all elements hold the same columns. There are two ways to do this: dplyr::bind_rows() or data.table::rbindlist(). Both yield the same output, but rbindlist() is faster. However, it coerces to a data.table object; if we want a tibble, we have to convert manually, which is still faster than bind_rows().

benchmark(
  dplyr::bind_rows(d_gender),
  data.table::rbindlist(d_gender) %>% as_tibble, 
  
  replications = 10, 
  columns = c("test","elapsed"),
  order = "elapsed"
)
                                           test elapsed
2 data.table::rbindlist(d_gender) %>% as_tibble    0.00
1                    dplyr::bind_rows(d_gender)    0.33
rbindlist(d_gender) %>%
    as_tibble()
# A tibble: 25 x 7
   .y.   group1 group2   effsize    n1    n2 magnitude 
   <chr> <chr>  <chr>      <dbl> <int> <int> <ord>     
 1 score 1      2      -0.000244 32692 67308 negligible
 2 score 1      2       0.00620  32692 67308 negligible
 3 score 1      2      -0.00613  32692 67308 negligible
 4 score 1      2       0.00461  32692 67308 negligible
 5 score 1      2       0.000188 32692 67308 negligible
 6 score 1      2      -0.0135   32692 67308 negligible
 7 score 1      2      -0.00459  32692 67308 negligible
 8 score 1      2       0.00647  32692 67308 negligible
 9 score 1      2       0.00323  32692 67308 negligible
10 score 1      2      -0.00134  32692 67308 negligible
# ... with 15 more rows

Factor Analysis

Sometimes factor analysis can take a while. With parallelization, we can run several factor analyses in parallel if we have the proper input data.

rm(list = ls())  #clear the global environment
glimpse(bfi)
Rows: 2,800
Columns: 28
$ A1        <int> 2, 2, 5, 4, 2, 6, 2, 4, 4, 2, 4, 2, 5, 5, 4, 4, 4, 5, 4, 4, ~
$ A2        <int> 4, 4, 4, 4, 3, 6, 5, 3, 3, 5, 4, 5, 5, 5, 5, 3, 6, 5, 4, 4, ~
$ A3        <int> 3, 5, 5, 6, 3, 5, 5, 1, 6, 6, 5, 5, 5, 5, 2, 6, 6, 5, 5, 6, ~
$ A4        <int> 4, 2, 4, 5, 4, 6, 3, 5, 3, 6, 6, 5, 6, 6, 2, 6, 2, 4, 4, 5, ~
$ A5        <int> 4, 5, 4, 5, 5, 5, 5, 1, 3, 5, 5, 5, 4, 6, 1, 3, 5, 5, 3, 5, ~
$ C1        <int> 2, 5, 4, 4, 4, 6, 5, 3, 6, 6, 4, 5, 5, 4, 5, 5, 4, 5, 5, 1, ~
$ C2        <int> 3, 4, 5, 4, 4, 6, 4, 2, 6, 5, 3, 4, 4, 4, 5, 5, 4, 5, 4, 1, ~
$ C3        <int> 3, 4, 4, 3, 5, 6, 4, 4, 3, 6, 5, 5, 3, 4, 5, 5, 4, 5, 5, 1, ~
$ C4        <int> 4, 3, 2, 5, 3, 1, 2, 2, 4, 2, 3, 4, 2, 2, 2, 3, 4, 4, 4, 5, ~
$ C5        <int> 4, 4, 5, 5, 2, 3, 3, 4, 5, 1, 2, 5, 2, 1, 2, 5, 4, 3, 6, 6, ~
$ E1        <int> 3, 1, 2, 5, 2, 2, 4, 3, 5, 2, 1, 3, 3, 2, 3, 1, 1, 2, 1, 1, ~
$ E2        <int> 3, 1, 4, 3, 2, 1, 3, 6, 3, 2, 3, 3, 3, 2, 4, 1, 2, 2, 2, 1, ~
$ E3        <int> 3, 6, 4, 4, 5, 6, 4, 4, NA, 4, 2, 4, 3, 4, 3, 6, 5, 4, 4, 4,~
$ E4        <int> 4, 4, 4, 4, 4, 5, 5, 2, 4, 5, 5, 5, 2, 6, 6, 6, 5, 6, 5, 5, ~
$ E5        <int> 4, 3, 5, 4, 5, 6, 5, 1, 3, 5, 4, 4, 4, 5, 5, 4, 5, 6, 5, 6, ~
$ N1        <int> 3, 3, 4, 2, 2, 3, 1, 6, 5, 5, 3, 4, 1, 1, 2, 4, 4, 6, 5, 5, ~
$ N2        <int> 4, 3, 5, 5, 3, 5, 2, 3, 5, 5, 3, 5, 2, 1, 4, 5, 4, 5, 6, 5, ~
$ N3        <int> 2, 3, 4, 2, 4, 2, 2, 2, 2, 5, 4, 3, 2, 1, 2, 4, 4, 5, 5, 5, ~
$ N4        <int> 2, 5, 2, 4, 4, 2, 1, 6, 3, 2, 2, 2, 2, 2, 2, 5, 4, 4, 5, 1, ~
$ N5        <int> 3, 5, 3, 1, 3, 3, 1, 4, 3, 4, 3, NA, 2, 1, 3, 5, 5, 4, 2, 1,~
$ O1        <int> 3, 4, 4, 3, 3, 4, 5, 3, 6, 5, 5, 4, 4, 5, 5, 6, 5, 5, 4, 4, ~
$ O2        <int> 6, 2, 2, 3, 3, 3, 2, 2, 6, 1, 3, 6, 2, 3, 2, 6, 1, 1, 2, 1, ~
$ O3        <int> 3, 4, 5, 4, 4, 5, 5, 4, 6, 5, 5, 4, 4, 4, 5, 6, 5, 4, 2, 5, ~
$ O4        <int> 4, 3, 5, 3, 3, 6, 6, 5, 6, 5, 6, 5, 5, 4, 5, 3, 6, 5, 4, 3, ~
$ O5        <int> 3, 3, 2, 5, 3, 1, 1, 3, 1, 2, 3, 4, 2, 4, 5, 2, 3, 4, 2, 2, ~
$ gender    <int> 1, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, ~
$ education <int> NA, NA, NA, NA, NA, 3, NA, 2, 1, NA, 1, NA, NA, NA, 1, NA, N~
$ age       <int> 16, 18, 17, 17, 17, 21, 18, 19, 19, 17, 21, 16, 16, 16, 17, ~
bfi_ls <- list(
    agree = bfi %>% select(starts_with("A"), -age),
    conscientious = bfi %>% select(starts_with("C")),
    extraversion = bfi %>% select(starts_with("E"), -education),
    neuroticism = bfi %>% select(starts_with("N")),
    openness = bfi %>% select(starts_with("O"))
)

for (i in seq_along(bfi_ls)) {
    colnames(bfi_ls[[i]]) <- paste0("item", 1:5)
}
pacman::p_load(lavaan)

registerDoParallel(detectCores() - 1)
cfa <- foreach(i = 1:5, .multicombine = T, .errorhandling = "pass") %dopar% {
    lavaan::cfa(model = "trait =~ item1 + item2 + item3 + item4 + item5", data = bfi_ls[[i]])
}
models <- foreach(i = 1:length(cfa), .multicombine = T, .errorhandling = "pass") %dopar% {
    pacman::p_load(lavaan)  #make sure lavaan is loaded on each worker

    nobs <- inspect(cfa[[i]], what = "nobs")
    fm <- fitMeasures(cfa[[i]], c("chisq", "df", "pvalue", "cfi", "rmsea", "srmr"))
    maxLoading <- max(lavInspect(cfa[[i]], what = "est")$lambda)
    minLoading <- min(lavInspect(cfa[[i]], what = "est")$lambda)

    as.list(c(names(bfi_ls[i]), nobs, fm, maxLoading, minLoading))
}

registerDoSEQ()
models_output <- data.table::rbindlist(models)

colnames(models_output) <- c("trait", "N", "chisq", "df", "pvalue", "cfi", "rmsea",
    "srmr", "max_loading", "min_loading")

models_output %>%
    mutate(across(N:min_loading, as.numeric)) %>%
    mutate(across(where(is.numeric), ~round(., digits = 3)))
           trait    N   chisq df pvalue   cfi rmsea  srmr max_loading
1:         agree 2709  86.696  5      0 0.968 0.078 0.032       1.000
2: conscientious 2707 164.893  5      0 0.937 0.109 0.042       1.184
3:  extraversion 2713  86.601  5      0 0.973 0.078 0.030       1.182
4:   neuroticism 2694 360.932  5      0 0.925 0.163 0.056       1.000
5:      openness 2726  80.583  5      0 0.945 0.074 0.034       1.286
   min_loading
1:      -1.880
2:      -1.398
3:      -1.020
4:       0.632
5:      -1.139