There are several packages for benchmarking your machine and your code.
system.time()
To maximize computational speed, first we have to know how long our code takes to run.
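As a quick illustration (a sketch; the exact timing will vary by machine), system.time() reports how long any expression takes to evaluate:
system.time({
  x <- rnorm(1e7) #ten million random draws
  mean(x)
})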
Let’s get an idea of local specs and compute speed.
What version of R are you using?
# get major and minor version and paste together
paste0(version$major, ".", version$minor)
[1] "4.1.0"
At the time of writing, the newest version of R is 4.1.0. Keeping R up to date is one of the easiest ways to optimize code, since new versions often provide speed boosts for commonly used functions.
Having a faster computer might also make you more productive. To justify upgrading to a faster machine, you will need some numbers. We can benchmark the performance of our computer using the benchmarkme package. First, let’s see what machine we’re working with.
# benchmarkme should already be loaded in your global environment
paste("RAM:", disk.frame::df_ram_size(), "GB") #how much RAM are you working with?
[1] "RAM: 16 GB"
paste("CPUs:", benchmarkme::get_cpu()[3]) #how many logical processors do you have?
[1] "CPUs: 8"
You’ll likely notice that you have more logical cores than physical cores at your disposal. This is due to hyperthreading.
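To check this on your own machine, base R’s parallel package can report both counts; a minimal sketch:
parallel::detectCores(logical = FALSE) #physical cores
parallel::detectCores(logical = TRUE)  #logical cores (doubled on many hyperthreaded CPUs)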
You can run a set of standardized benchmarks and compare your results to those of users who have uploaded their performance results. The benchmark_io() function reports the time it takes to write and read a file of a given size (units = megabytes). Let’s benchmark writing and reading a 5 MB file and plot the results.
write_read_bench <- benchmark_io(size = 5) #read and write benchmark
plot(write_read_bench) #plot results
upload_results(write_read_bench) #upload results for benchmark tracking
[1] "2021-07-27-16095561"
For a general benchmarking test, use benchmark_std() to see how your machine handles standard programming and matrix operations.
std_bench <- benchmark_std() #benchmark matrix manipulation, calculation
plot(std_bench) #plot results
upload_results(std_bench) #upload results for benchmark tracking
[1] "2021-07-27-34263246"
There are different ways to write and read data in R. Let’s create some data and test out read/write functions using:
- saveRDS() and readRDS() functions in base R
- write.csv() and read.csv() functions in the utils package
- write_csv() and read_csv() functions in the readr package
- fwrite() and fread() functions in the data.table package
- vroom_write() and vroom() functions in the vroom package

Let’s write a function that creates some random data and assigns it to a data frame object.
get_random_df <- function(n_elements, ncol, min = 0, max = 1, seed = 42) {
  set.seed(seed) #set.seed() is called for its side effect; no assignment needed
  x <- runif(n_elements, min = min, max = max) #random vector of real numbers
  x <- matrix(x, ncol = ncol) #reshape into a matrix
  x <- as.data.frame(x) #coerce to a data frame
  return(x)
}
n <- 1e+05
col <- 100
data <- get_random_df(n_elements = n, ncol = col)
Let’s benchmark various functions for saving data frames using rbenchmark::benchmark() (assuming readr, data.table, and vroom are loaded). We will be saving as .rds and .csv files.
benchmark(
rds = saveRDS(data, "../data/data.rds"),
utils = write.csv(data, "../data/data_utils.csv"),
readr = write_csv(data, "../data/data_readr.csv"),
data.table = fwrite(data,"../data/data_dt.csv"),
vroom = vroom_write(data, "../data/data_vroom.csv"),
replications = 1,
columns = c('test', 'elapsed'),
order = 'elapsed'
)
test elapsed
4 data.table 0.02
1 rds 0.11
5 vroom 0.12
3 readr 0.13
2 utils 0.26
benchmark(
rds = readRDS("../data/data.rds"),
utils = read.csv("../data/data_utils.csv"),
readr = read_csv("../data/data_readr.csv"),
data.table = fread("../data/data_dt.csv"),
vroom = vroom("../data/data_vroom.csv"),
replications = 1,
columns = c('test', 'elapsed'),
order = 'elapsed'
)
test elapsed
1 rds 0.00
4 data.table 0.02
2 utils 0.08
5 vroom 0.20
3 readr 1.54
Overall, if we want to use a .csv file format, data.table::fread() and data.table::fwrite() functions are the way to go. If we want to use a flexible and fast framework to store any R object (as opposed to just data frames), we should use saveRDS().
Since R holds objects in RAM, we have to be careful with memory allocation. It’s important to vectorize whenever possible. When a loop is unavoidable, an efficient method is to pre-allocate a vector of the final length and then fill it in by position (see the sketch after the benchmark below). A good rule of thumb is to never grow a vector.
Below, we see different ways of creating a vector. (This example was taken from section 3.2.1 of Efficient R Programming.)
# this example was taken from section 3.2.1 of Efficient R Programming (2021)
n <- 1000
method1 = function(n) {
vec = NULL #or vec = c()
for (i in seq_len(n)) vec = c(vec, i)
vec
}
method2 = function(n) seq_len(n)
benchmark(
method1(n), method2(n),
replications = 100,
columns = c('test', 'elapsed'),
order = 'elapsed'
)
test elapsed
2 method2(n) 0.00
1 method1(n) 0.13
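For completeness, here is a sketch of the pre-allocation approach mentioned above (a hypothetical method3, not part of the book’s comparison). It avoids the repeated copying that makes method1 slow, although a fully vectorized call like method2 is still fastest:
method3 = function(n) {
  vec = numeric(n) #pre-allocate the full length up front
  for (i in seq_len(n)) vec[i] = i #fill by position instead of growing
  vec
}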
- apply() is generic: it applies a function to the rows or columns of a matrix (i.e., to the dimensions of an array). It works on 2D arrays (i.e., matrices) and 3D arrays.
- lapply() is a list apply, which acts on a list or vector and returns a list.
- sapply() is a simple lapply (it defaults to returning a vector or matrix when possible).
sum_row <- apply(data, 1, sum) #applies sum to rows
sum_col <- apply(data, 2, sum) #applies sum to columns
identical(sum_row, rowSums(data))
[1] TRUE
identical(sum_col, colSums(data))
[1] TRUE
array_3D <- array(seq(10000), dim = c(4, 4, 2))
apply(array_3D, 1, sum) #sum across the 2nd and 3rd dimensions
[1] 120 128 136 144
apply(array_3D, c(1, 2), sum) #sum across the 3rd dimension
[,1] [,2] [,3] [,4]
[1,] 18 26 34 42
[2,] 20 28 36 44
[3,] 22 30 38 46
[4,] 24 32 40 48
lapply() is useful for lists. Say we want to apply a function per level of a factor variable. We can split the data by factor level, store the data as a list object, run the function, and combine the results again using dplyr::bind_rows().
data$education <-
c("high_school","college", "masters", "phd") %>% #get a vector of edu level
rep(., times = nrow(data)/4 ) %>% #repeat the vector for number of rows in the data divided by 4 since there's 4 levels
sample(.) #takes a random sample of the vector, thereby randomizing it
data_ls <- split(data, data$education) #split the data into a list by edu level
#lapply(data_ls, sum) #warning if we do this because there's a character vector edu still in the data
data_ls <- lapply(data_ls, function(x) x[-length(x)]) #education is the last column, so length(x) (the number of columns) indexes it and the negative index drops it
Let’s see how lapply works in keeping the top 25% of cases based on a composite score.
top_group <- function(x) {
composite <- rowSums(x)
bin <- quantile(composite)
keep <- which(composite > bin["75%"])
x <- x[keep, ]
return(x)
}
data_ls_top <- lapply(data_ls, top_group)
data_top <- bind_rows(data_ls_top, .id = "education")
data_top %>%
as_tibble
# A tibble: 252 x 101
education V1 V2 V3 V4 V5 V6 V7 V8 V9
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 college 0.940 0.835 0.801 0.281 0.687 0.0237 0.339 0.705 0.957
2 college 0.978 0.111 0.198 0.847 0.217 0.752 0.800 0.348 0.837
3 college 0.685 0.143 0.222 0.640 0.916 0.276 0.102 0.791 0.883
4 college 0.00395 0.361 0.0471 0.0919 0.932 0.397 0.0107 0.781 0.830
5 college 0.958 0.198 0.777 0.945 0.531 0.282 0.0540 0.674 0.0715
6 college 0.677 0.644 0.140 0.00926 0.785 0.526 0.369 0.227 0.289
7 college 0.983 0.898 0.151 0.948 0.939 0.116 0.788 0.787 0.325
8 college 0.828 0.244 0.246 0.181 0.0980 0.825 0.345 0.498 0.120
9 college 0.719 0.678 0.497 0.626 0.972 0.764 0.565 0.720 0.298
10 college 0.0856 0.696 0.748 0.476 0.286 0.120 0.668 0.455 0.997
# ... with 242 more rows, and 91 more variables: V10 <dbl>, V11 <dbl>,
# V12 <dbl>, V13 <dbl>, V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>,
# V18 <dbl>, V19 <dbl>, V20 <dbl>, V21 <dbl>, V22 <dbl>, V23 <dbl>,
# V24 <dbl>, V25 <dbl>, V26 <dbl>, V27 <dbl>, V28 <dbl>, V29 <dbl>,
# V30 <dbl>, V31 <dbl>, V32 <dbl>, V33 <dbl>, V34 <dbl>, V35 <dbl>,
# V36 <dbl>, V37 <dbl>, V38 <dbl>, V39 <dbl>, V40 <dbl>, V41 <dbl>,
# V42 <dbl>, V43 <dbl>, V44 <dbl>, V45 <dbl>, V46 <dbl>, V47 <dbl>,
# V48 <dbl>, V49 <dbl>, V50 <dbl>, V51 <dbl>, V52 <dbl>, V53 <dbl>,
# V54 <dbl>, V55 <dbl>, V56 <dbl>, V57 <dbl>, V58 <dbl>, V59 <dbl>,
# V60 <dbl>, V61 <dbl>, V62 <dbl>, V63 <dbl>, V64 <dbl>, V65 <dbl>,
# V66 <dbl>, V67 <dbl>, V68 <dbl>, V69 <dbl>, V70 <dbl>, V71 <dbl>,
# V72 <dbl>, V73 <dbl>, V74 <dbl>, V75 <dbl>, V76 <dbl>, V77 <dbl>,
# V78 <dbl>, V79 <dbl>, V80 <dbl>, V81 <dbl>, V82 <dbl>, V83 <dbl>,
# V84 <dbl>, V85 <dbl>, V86 <dbl>, V87 <dbl>, V88 <dbl>, V89 <dbl>,
# V90 <dbl>, V91 <dbl>, V92 <dbl>, V93 <dbl>, V94 <dbl>, V95 <dbl>,
# V96 <dbl>, V97 <dbl>, V98 <dbl>, V99 <dbl>, V100 <dbl>
sapply() is a ‘simplified’ lapply that attempts to simplify the result to a vector or matrix when appropriate. However, the function is not type consistent, which can cause problems.
df2 = data.frame(x = 1:5, y = letters[1:5])
df0 = data.frame()
sapply(df2, class) #character vector
x y
"integer" "character"
sapply(df0, class) #list
named list()
df2[, 1:2] #data frame
x y
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
df2[, 1] #numeric vector
[1] 1 2 3 4 5
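If type consistency matters, base R’s vapply() is a safer alternative: you declare the expected return type, and it throws an error rather than silently changing shape. A minimal sketch:
vapply(df2, class, FUN.VALUE = character(1)) #errors if any result is not a length-1 character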
Learn more about the apply family in the R documentation (?apply).
Time spent processing your data at the beginning of a project can save time in the long run. We will walk through various ways to process data efficiently, both programmatically and algorithmically. For example, packages in the tidyverse provide ways to efficiently print results and work with data. Specifically, the dplyr package provides a sufficiently quick yet easy way to process data, and the raw speed of data.table is useful for wrangling large data sets. Also, the essential %>% (pronounced “pipe”) operator can clarify complex wrangling processes.
One of the many benefits of tidyverse is that although this ecosystem of packages might be slower than base R at times, it makes programming more intuitive/functional.
tibble()
The tidyverse version of a data.frame object is a tbl_df from the tibble package. It is a class of data frame that makes printing results, subsetting data, and factoring more accessible.
When printing a tbl_df, we can see the class of each variable; we do not when printing a data.frame. Also, a tbl_df prints a tidy preview (the first ten rows) to the console, whereas the full data.frame object is printed. That is, printing a tbl is “tidier” than printing a df.
tbl <- tibble(num = 1:50, char = rep(c("x", "y"), each = 25))
tbl
# A tibble: 50 x 2
num char
<int> <chr>
1 1 x
2 2 x
3 3 x
4 4 x
5 5 x
6 6 x
7 7 x
8 8 x
9 9 x
10 10 x
# ... with 40 more rows
df <- data.frame(num = 1:50, char = rep(c("x", "y"), each = 25))
df
num char
1 1 x
2 2 x
3 3 x
4 4 x
5 5 x
6 6 x
7 7 x
8 8 x
9 9 x
10 10 x
11 11 x
12 12 x
13 13 x
14 14 x
15 15 x
16 16 x
17 17 x
18 18 x
19 19 x
20 20 x
21 21 x
22 22 x
23 23 x
24 24 x
25 25 x
26 26 y
27 27 y
28 28 y
29 29 y
30 30 y
31 31 y
32 32 y
33 33 y
34 34 y
35 35 y
36 36 y
37 37 y
38 38 y
39 39 y
40 40 y
41 41 y
42 42 y
43 43 y
44 44 y
45 45 y
46 46 y
47 47 y
48 48 y
49 49 y
50 50 y
Tibbles are strict about subsetting. If you try to access a variable that does not exist with $, you get NULL along with a warning rather than silent partial matching.
tbl$letter
NULL
Tibbles also distinguish between [ and [[: [ always returns another tibble, whereas [[ always returns a vector.
identical(tbl$char, tbl[["char"]]) #two vectors
[1] TRUE
identical(tbl$char, tbl["char"]) #one vector, one tbl_df
[1] FALSE
pivot_longer()
Another efficient way to wrangle data is tidyr::pivot_longer(). The function makes “wide” tables “long”. We’re going to make education variables binary and then gather the binaries into a single column.
df <- get_random_df(2000, 2, 1, 5) #get random data frame with 2000 elements and 2 columns with values from 1-5
make_binary <- function() sample(rep(c(1, 0), each = 500)) #returns a shuffled vector of 500 ones and 500 zeros, matching the number of rows in df
df <- df %>% #assigning manipulated df to object "df" (same object)
rename(score = V1, #renaming to "score"
performance = V2) %>% #renaming to "performance"
mutate( #mutating df to include new vars
phd = make_binary(),
masters = make_binary(),
college = make_binary(),
high_school = make_binary()
)
df %>% as_tibble #printing df as a tbl_df
# A tibble: 1,000 x 6
score performance phd masters college high_school
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4.66 4.39 1 0 0 0
2 4.75 1.25 1 1 0 0
3 2.14 4.28 0 0 0 0
4 4.32 3.16 0 1 1 0
5 3.57 3.00 1 0 1 0
6 3.08 1.09 1 1 1 0
7 3.95 3.22 1 1 1 0
8 1.54 3.88 0 0 0 1
9 3.63 1.94 1 1 0 0
10 3.82 4.25 1 1 0 0
# ... with 990 more rows
df %>% #grab df
pivot_longer(
phd:high_school, #variables to gather
names_to = "highest_edu", #name of gathered column
values_to = "edu" #name of values column
) -> df_long #assign to a new tbl_df object
df_long #print the tbl_df
# A tibble: 4,000 x 4
score performance highest_edu edu
<dbl> <dbl> <chr> <dbl>
1 4.66 4.39 phd 1
2 4.66 4.39 masters 0
3 4.66 4.39 college 0
4 4.66 4.39 high_school 0
5 4.75 1.25 phd 1
6 4.75 1.25 masters 1
7 4.75 1.25 college 0
8 4.75 1.25 high_school 0
9 2.14 4.28 phd 0
10 2.14 4.28 masters 0
# ... with 3,990 more rows
Once we have a long table, we can easily group data for further analysis. Let’s group by the highest education attained and run a simple linear regression by the group variable. We are predicting performance based on some score.
df_long %>% #grab df_long
group_by(highest_edu) %>% #group by highest education attained
do(model = lm(performance ~ score, data = .)) %>% #model simple linear regressions by grouping variable -- do() applies expression to each group that can range in complexity
summarize(broom::tidy(model)) #summarize models -- broom is a package that converts statistical objects into tibbles
# A tibble: 8 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.99 0.102 29.3 1.68e-136
2 score -0.000440 0.0322 -0.0137 9.89e- 1
3 (Intercept) 2.99 0.102 29.3 1.68e-136
4 score -0.000440 0.0322 -0.0137 9.89e- 1
5 (Intercept) 2.99 0.102 29.3 1.68e-136
6 score -0.000440 0.0322 -0.0137 9.89e- 1
7 (Intercept) 2.99 0.102 29.3 1.68e-136
8 score -0.000440 0.0322 -0.0137 9.89e- 1
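Note that do() has been superseded in recent versions of dplyr. A sketch of an equivalent approach using tidyr::nest() and purrr::map() (assuming both packages are available):
df_long %>%
  tidyr::nest(data = -highest_edu) %>% #one row per group, data in a list-column
  dplyr::mutate(model = purrr::map(data, ~ lm(performance ~ score, data = .x)), #fit per group
                tidied = purrr::map(model, broom::tidy)) %>% #tidy each fit
  tidyr::unnest(tidied) #expand back to one row per coefficient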
separate()
Another useful function is separate(), which splits joint variables based on a common delimiter. For example, if a variable holds multiple pieces of information, we can separate the information based on some pattern.
sep <- tibble(college_grad = "University of Florida - University of Minnesota, Twin Cities")
sep
# A tibble: 1 x 1
college_grad
<chr>
1 University of Florida - University of Minnesota, Twin Cities
separate(sep, col = college_grad, into = c("college", "grad"), sep = "-")
# A tibble: 1 x 2
college grad
<chr> <chr>
1 "University of Florida " " University of Minnesota, Twin Cities"
separate(sep, col = college_grad, into = c("college", "grad"), sep = " - ")
# A tibble: 1 x 2
college grad
<chr> <chr>
1 University of Florida University of Minnesota, Twin Cities
dplyr
The almighty dplyr is essential for working with data. dplyr is fast (C++ backend) and intuitive to type. It works well with tidy data, databases, and large data structures. In fact, we have used some dplyr functions already in this section.
dplyr was designed to be used with the %>% operator in mind. Typically, each data processing step is represented on a new line, as we have seen before. Below are the main dplyr functions with descriptions, followed by a grouped-summary example and a short sketch of the other verbs.
- filter(), slice(): subset rows by attribute (filter) or position (slice)
- arrange(): orders data by variables
- select(): subsets columns
- rename(): renames columns (assigning new name = old name)
- distinct(): returns unique, non-duplicated rows
- mutate(): creates new variables on the existing data set
- summarize(): summarizes variables by function and returns a new tbl_df
- group_by(): groups data defined by variables, where functions are performed by group

df %>% #grab df
group_by(phd, masters) %>% #group by phd, then within phd group by masters
summarize(
mean_score = mean(score), #get mean score for groups
mean_performance = mean(performance) #get mean performance for groups
) %>%
arrange(phd, -masters) #arrange results in ascending order for phd (the default) and descending order for masters within the levels of phd
# A tibble: 4 x 4
# Groups: phd [2]
phd masters mean_score mean_performance
<dbl> <dbl> <dbl> <dbl>
1 0 1 2.93 2.93
2 0 0 2.91 2.98
3 1 1 2.97 3.04
4 1 0 2.99 3.01
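The remaining verbs compose the same way; a minimal sketch (not benchmarked) chaining filter(), select(), and mutate() on the same data:
df %>%
  filter(phd == 1) %>% #subset rows: PhD holders only
  select(score, performance) %>% #subset columns
  mutate(composite = score + performance) %>% #create a new variable
  head()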
across()
across() is a relatively new and long-awaited function in dplyr. The function applies some other function across columns. The equivalent would be using apply() on margin 2 (i.e., columns).
Let’s coerce all numeric columns to characters using dplyr’s across() and base apply(), and compare the performance of each.
benchmark(
dplyr = df %>% mutate(across(everything(), as.character)),
apply = apply(df, 2, as.character),
replications = 100,
columns = c("test","elapsed"),
order = "elapsed"
)
test elapsed
1 dplyr 0.31
2 apply 0.56
data.table is typically thought of as an alternative to dplyr. It values speed over intuitive and accessible programming. You can learn more in the datatable-intro, datatable-reshape, and datatable-reference-semantics vignettes.
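As a minimal taste of the syntax (a sketch using the df object from above), data.table queries follow the DT[i, j, by] form: i subsets rows, j computes, and by groups:
library(data.table)
dt <- as.data.table(df) #df from above: score, performance, and the education binaries
dt[phd == 1, .(mean_score = mean(score)), by = masters] #filter, aggregate, and group in one call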
If we want to simulate data with a multivariate normal distribution, we’re going to need means, SDs, and the covariance matrix. Let’s use the BFI data set from the psych package. To learn more about the BFI data set, run ?psych::bfi.
rm(list = ls()) #clear the global environment
data(bfi) #get data
glimpse(bfi) #take a look at the data structure using glimpse from the tibble package
Rows: 2,800
Columns: 28
$ A1 <int> 2, 2, 5, 4, 2, 6, 2, 4, 4, 2, 4, 2, 5, 5, 4, 4, 4, 5, 4, 4, ~
$ A2 <int> 4, 4, 4, 4, 3, 6, 5, 3, 3, 5, 4, 5, 5, 5, 5, 3, 6, 5, 4, 4, ~
$ A3 <int> 3, 5, 5, 6, 3, 5, 5, 1, 6, 6, 5, 5, 5, 5, 2, 6, 6, 5, 5, 6, ~
$ A4 <int> 4, 2, 4, 5, 4, 6, 3, 5, 3, 6, 6, 5, 6, 6, 2, 6, 2, 4, 4, 5, ~
$ A5 <int> 4, 5, 4, 5, 5, 5, 5, 1, 3, 5, 5, 5, 4, 6, 1, 3, 5, 5, 3, 5, ~
$ C1 <int> 2, 5, 4, 4, 4, 6, 5, 3, 6, 6, 4, 5, 5, 4, 5, 5, 4, 5, 5, 1, ~
$ C2 <int> 3, 4, 5, 4, 4, 6, 4, 2, 6, 5, 3, 4, 4, 4, 5, 5, 4, 5, 4, 1, ~
$ C3 <int> 3, 4, 4, 3, 5, 6, 4, 4, 3, 6, 5, 5, 3, 4, 5, 5, 4, 5, 5, 1, ~
$ C4 <int> 4, 3, 2, 5, 3, 1, 2, 2, 4, 2, 3, 4, 2, 2, 2, 3, 4, 4, 4, 5, ~
$ C5 <int> 4, 4, 5, 5, 2, 3, 3, 4, 5, 1, 2, 5, 2, 1, 2, 5, 4, 3, 6, 6, ~
$ E1 <int> 3, 1, 2, 5, 2, 2, 4, 3, 5, 2, 1, 3, 3, 2, 3, 1, 1, 2, 1, 1, ~
$ E2 <int> 3, 1, 4, 3, 2, 1, 3, 6, 3, 2, 3, 3, 3, 2, 4, 1, 2, 2, 2, 1, ~
$ E3 <int> 3, 6, 4, 4, 5, 6, 4, 4, NA, 4, 2, 4, 3, 4, 3, 6, 5, 4, 4, 4,~
$ E4 <int> 4, 4, 4, 4, 4, 5, 5, 2, 4, 5, 5, 5, 2, 6, 6, 6, 5, 6, 5, 5, ~
$ E5 <int> 4, 3, 5, 4, 5, 6, 5, 1, 3, 5, 4, 4, 4, 5, 5, 4, 5, 6, 5, 6, ~
$ N1 <int> 3, 3, 4, 2, 2, 3, 1, 6, 5, 5, 3, 4, 1, 1, 2, 4, 4, 6, 5, 5, ~
$ N2 <int> 4, 3, 5, 5, 3, 5, 2, 3, 5, 5, 3, 5, 2, 1, 4, 5, 4, 5, 6, 5, ~
$ N3 <int> 2, 3, 4, 2, 4, 2, 2, 2, 2, 5, 4, 3, 2, 1, 2, 4, 4, 5, 5, 5, ~
$ N4 <int> 2, 5, 2, 4, 4, 2, 1, 6, 3, 2, 2, 2, 2, 2, 2, 5, 4, 4, 5, 1, ~
$ N5 <int> 3, 5, 3, 1, 3, 3, 1, 4, 3, 4, 3, NA, 2, 1, 3, 5, 5, 4, 2, 1,~
$ O1 <int> 3, 4, 4, 3, 3, 4, 5, 3, 6, 5, 5, 4, 4, 5, 5, 6, 5, 5, 4, 4, ~
$ O2 <int> 6, 2, 2, 3, 3, 3, 2, 2, 6, 1, 3, 6, 2, 3, 2, 6, 1, 1, 2, 1, ~
$ O3 <int> 3, 4, 5, 4, 4, 5, 5, 4, 6, 5, 5, 4, 4, 4, 5, 6, 5, 4, 2, 5, ~
$ O4 <int> 4, 3, 5, 3, 3, 6, 6, 5, 6, 5, 6, 5, 5, 4, 5, 3, 6, 5, 4, 3, ~
$ O5 <int> 3, 3, 2, 5, 3, 1, 1, 3, 1, 2, 3, 4, 2, 4, 5, 2, 3, 4, 2, 2, ~
$ gender <int> 1, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, ~
$ education <int> NA, NA, NA, NA, NA, 3, NA, 2, 1, NA, 1, NA, NA, NA, 1, NA, N~
$ age <int> 16, 18, 17, 17, 17, 21, 18, 19, 19, 17, 21, 16, 16, 16, 17, ~
bfi <- bfi[complete.cases(bfi), ] #drop cases with at least 1 missing value
bfi_trim <- bfi %>%
select(-gender, -age, -education) #or subset(bfi, select= -c(gender, age, education)) to drop gender, age, edu columns
glimpse(bfi_trim) #check the structure again to compare
Rows: 2,236
Columns: 25
$ A1 <int> 6, 4, 4, 4, 1, 2, 4, 1, 2, 2, 4, 2, 1, 5, 1, 1, 1, 1, 1, 5, 2, 1, 5~
$ A2 <int> 6, 3, 4, 5, 5, 6, 5, 6, 4, 5, 5, 5, 5, 3, 6, 4, 5, 5, 5, 6, 6, 6, 5~
$ A3 <int> 5, 1, 5, 2, 6, 5, 5, 6, 4, 1, 6, 6, 6, 5, 4, 4, 4, 5, 4, 4, 6, 6, 3~
$ A4 <int> 6, 5, 6, 2, 5, 6, 6, 1, 4, 3, 5, 6, 5, 4, 6, 2, 3, 6, 4, 3, 6, 6, 4~
$ A5 <int> 5, 1, 5, 1, 6, 5, 5, 6, 3, 5, 5, 6, 4, 2, 4, 3, 5, 5, 5, 5, 6, 6, 3~
$ C1 <int> 6, 3, 4, 5, 4, 3, 5, 5, 6, 5, 5, 5, 1, 2, 5, 6, 6, 4, 4, 5, 5, 5, 4~
$ C2 <int> 6, 2, 3, 5, 3, 5, 5, 2, 5, 4, 5, 5, 5, 2, 6, 5, 5, 4, 5, 3, 4, 2, 4~
$ C3 <int> 6, 4, 5, 5, 2, 6, 4, 5, 6, 5, 3, 5, 6, 4, 3, 6, 5, 4, 4, 3, 5, 1, 3~
$ C4 <int> 1, 2, 3, 2, 4, 3, 1, 1, 1, 2, 5, 2, 4, 3, 1, 3, 2, 3, 3, 4, 3, 2, 4~
$ C5 <int> 3, 4, 2, 2, 5, 6, 1, 1, 1, 5, 4, 4, 6, 4, 5, 4, 2, 4, 3, 5, 4, 1, 5~
$ E1 <int> 2, 3, 1, 3, 2, 2, 3, 1, 2, 1, 1, 1, 6, 3, 6, 3, 3, 4, 3, 6, 2, 6, 4~
$ E2 <int> 1, 6, 3, 4, 1, 2, 2, 1, 4, 2, 2, 2, 6, 4, 6, 4, 2, 3, 3, 4, 2, 5, 4~
$ E3 <int> 6, 4, 2, 3, 2, 4, 5, 6, 4, 6, 6, 4, 2, 3, 3, 3, 3, 4, 2, 4, 4, 6, 5~
$ E4 <int> 5, 2, 5, 6, 5, 6, 5, 6, 2, 5, 5, 5, 1, 2, 2, 3, 6, 4, 5, 4, 5, 6, 2~
$ E5 <int> 6, 1, 4, 5, 2, 6, 6, 6, 6, 4, 5, 5, 1, 3, 2, 5, 5, 4, 4, 3, 5, 5, 4~
$ N1 <int> 3, 6, 3, 2, 2, 4, 2, 2, 3, 1, 5, 3, 1, 5, 2, 5, 1, 2, 2, 2, 2, 2, 4~
$ N2 <int> 5, 3, 3, 4, 2, 4, 3, 3, 3, 4, 4, 2, 2, 3, 2, 6, 2, 2, 3, 2, 2, 1, 5~
$ N3 <int> 2, 2, 4, 2, 2, 4, 3, 1, 5, 2, 4, 4, 1, 4, 2, 5, 1, 3, 1, 3, 2, 4, 3~
$ N4 <int> 2, 6, 2, 2, 2, 6, 1, 2, 3, 2, 3, 1, 3, 4, 4, 5, 2, 3, 3, 4, 2, 6, 5~
$ N5 <int> 3, 4, 3, 3, 2, 6, 1, 1, 2, 5, 1, 2, 6, 3, 1, 4, 1, 3, 2, 5, 3, 5, 2~
$ O1 <int> 4, 3, 5, 5, 6, 6, 6, 6, 5, 2, 4, 5, 6, 4, 5, 5, 5, 4, 4, 3, 5, 6, 3~
$ O2 <int> 3, 2, 3, 2, 1, 1, 2, 4, 2, 4, 4, 2, 6, 5, 5, 5, 1, 3, 3, 5, 2, 5, 5~
$ O3 <int> 5, 4, 5, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 4, 4, 4, 6, 2, 5, 4, 5, 6, 4~
$ O4 <int> 6, 5, 6, 5, 5, 6, 6, 5, 6, 4, 5, 5, 6, 4, 5, 5, 6, 5, 4, 4, 5, 6, 4~
$ O5 <int> 1, 3, 3, 5, 2, 1, 2, 3, 1, 1, 1, 2, 1, 3, 3, 2, 1, 2, 3, 4, 1, 1, 2~
mu <- colMeans(bfi_trim) #get variable means
sd <- apply(bfi_trim, 2, sd) #get variable SDs
cor_m <- cor(bfi_trim) #get correlation matrix
cov_m <- sd %*% t(sd) * cor_m #get covariance matrix
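As a quick sanity check (a sketch), the reconstructed matrix should match the sample covariance matrix computed directly from the data:
all.equal(cov_m, cov(bfi_trim), check.attributes = FALSE) #should be TRUE (up to floating-point error)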
Simulating data can take a while, especially at larger sample sizes. We can simulate multivariate normal data using MASS::mvrnorm(). Let’s also check out another function, anMC::mvrnormArma(), to compare speed.
set.seed(9201994)
benchmark(
MASS::mvrnorm(1e6, mu, cov_m, empirical=T),
anMC::mvrnormArma(1e6, mu, cov_m, 0),
replications = 1,
columns = c('test', 'elapsed'),
order = 'elapsed'
) #benchmarking simulating 1mil cases of the sampled data set
test elapsed
2 anMC::mvrnormArma(1e+06, mu, cov_m, 0) 1.76
1 MASS::mvrnorm(1e+06, mu, cov_m, empirical = T) 13.96
We can clearly see from the output above that mvrnormArma() is much (~8x) faster. Why? Because it calls C++ on the backend to create the simulated data.
Let’s assign the new data to a data frame object to compare differences in saving and loading data. Matrices are typically faster to work with, but I-O psychologists typically work with tabular data, so we will be using data frames, tibbles, or data tables. Let’s start with the most common: data frames. We should also add back our categorical variables, one vector at a time (in random order).
set.seed(9201994)
bfi_df <- anMC::mvrnormArma(1e6, mu, cov_m, 0) %>%
t() %>% #transpose
as.data.frame() #get data.frame
names(bfi_df) <- names(bfi_trim)
# let's keep things proportional
gender <- round(table(bfi$gender)/nrow(bfi), digits = 5) #gender proportion
gender #view gender
1 2
0.32871 0.67129
n <- length(gender)
genders <- as.numeric(names(gender))
gender_vec <- c()
microbenchmark(times = 10, unit = "s",
# growing a vector
gender_vec_grow =
for(i in 1:n) {
gender_vec <- c(gender_vec, rep(genders[i], 1e6 * gender[i]))
},
# vectorized
gender_vec =
rep(genders[1:n], 1e6 * gender[1:n])
)
Unit: seconds
            expr       min        lq       mean    median        uq       max neval cld
 gender_vec_grow 0.0082509 0.0240679 0.04573494 0.0381871 0.0507344 0.1348271    10   b
      gender_vec 0.0023907 0.0024279 0.00250867 0.0024953 0.0025869 0.0026732    10  a
gender_vec <- rep(genders[1:n], 1e6 * gender[1:n])
bfi_df$gender <- sample(gender_vec) #assigning proportions to gender vector on bfi_df. adding sample randomizes the order of the vector
# let's keep things proportional
age <- round(table(bfi$age)/nrow(bfi), digits = 5)
age #view age proportions
3 11 14 15 16 17 18 19 20 21
0.00045 0.00045 0.00045 0.00134 0.00626 0.01386 0.04785 0.07335 0.07961 0.05769
22 23 24 25 26 27 28 29 30 31
0.04562 0.05725 0.04114 0.04472 0.03623 0.03712 0.03309 0.03220 0.02460 0.02773
32 33 34 35 36 37 38 39 40 41
0.02549 0.01968 0.02057 0.02013 0.01834 0.01476 0.02102 0.01789 0.02057 0.01252
42 43 44 45 46 47 48 49 50 51
0.01029 0.01431 0.01073 0.01073 0.00894 0.00671 0.01163 0.00447 0.01386 0.00716
52 53 54 55 56 57 58 59 60 61
0.00850 0.00671 0.00626 0.00492 0.00581 0.00268 0.00268 0.00089 0.00268 0.00134
62 63 64 65 66 67 68 74 86
0.00134 0.00134 0.00045 0.00045 0.00045 0.00134 0.00045 0.00045 0.00045
n <- length(age)
ages <- as.numeric(names(age))
age_vec <- rep(ages[1:n], 1e+06 * age[1:n])
bfi_df$age <- sample(age_vec)
# let's keep things proportional
education <- round(table(bfi$education)/nrow(bfi), digits = 5)
education #view education proportions
1 2 3 4 5
0.08855 0.11181 0.48211 0.15474 0.16279
education_vec <- c()
n <- length(education)
educations <- as.numeric(names(education))
education_vec <- rep(educations[1:n], 1e+06 * education[1:n])
bfi_df$education <- sample(education_vec)
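Since we repeated the same three steps for gender, age, and education, a hypothetical helper (a sketch, not in the original code) could wrap the pattern:
# hypothetical helper: build a randomized vector matching observed proportions
sample_proportional <- function(x, size = 1e6) {
  prop <- round(table(x) / length(x), digits = 5) #observed proportions
  vals <- as.numeric(names(prop)) #the levels as numbers
  sample(rep(vals, times = round(size * as.vector(prop)))) #replicate proportionally, then shuffle
}
# e.g., bfi_df$education <- sample_proportional(bfi$education)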
Every time you manipulate the data frame, check the data before proceeding so that you can see if you’ve made any mistakes. Let’s see what the final data set looks like.
glimpse(bfi_df) #always check your data after manipulating the data frame before proceeding
Rows: 1,000,000
Columns: 28
$ A1 <dbl> 4.98512917, 1.91873727, -0.25217646, 2.78180897, 1.17837980,~
$ A2 <dbl> 4.272613, 4.101842, 5.787534, 5.205831, 3.896131, 5.020339, ~
$ A3 <dbl> 4.865218, 4.329908, 5.557391, 4.855733, 6.117381, 5.567631, ~
$ A4 <dbl> 5.476493, 3.837986, 5.306861, 5.647894, 5.668854, 7.055909, ~
$ A5 <dbl> 5.555879, 4.596857, 2.988709, 5.405770, 5.262791, 4.769970, ~
$ C1 <dbl> 5.457633, 4.587901, 4.882483, 3.567453, 6.646876, 2.802143, ~
$ C2 <dbl> 6.742743, 2.810552, 3.876032, 4.715007, 4.642838, 4.488382, ~
$ C3 <dbl> 4.441823, 4.297566, 1.928377, 3.030105, 2.647775, 2.193794, ~
$ C4 <dbl> 3.6383434, 3.0339294, 1.6553426, 3.7270366, 2.2621397, 3.139~
$ C5 <dbl> 4.21871642, 2.64556664, 3.27568178, 5.65775552, 5.11814967, ~
$ E1 <dbl> 8.8836196, 0.8327891, 2.8785604, 5.6763096, 3.6623101, 4.834~
$ E2 <dbl> 6.8169252, 1.5661497, 4.4259251, 4.4116572, 2.8763598, 3.461~
$ E3 <dbl> 3.933498, 2.368665, 2.012880, 5.328615, 6.263056, 4.248638, ~
$ E4 <dbl> 2.675041, 5.383466, 4.049645, 4.032899, 4.457834, 4.260493, ~
$ E5 <dbl> 2.552034, 3.980497, 2.087986, 4.408435, 4.855411, 3.703792, ~
$ N1 <dbl> 2.12655182, 2.40700929, 2.78458338, 2.25950355, 3.29858002, ~
$ N2 <dbl> 4.8134651, 2.6945459, 4.1183656, 0.7878628, 3.6108563, 2.215~
$ N3 <dbl> 5.934404, 1.692800, 5.134269, 4.328980, 4.947094, 2.158376, ~
$ N4 <dbl> 6.0312045, 3.2625564, 3.5378901, 3.2924299, 3.7651757, 2.847~
$ N5 <dbl> 5.5440854, 2.0387079, 4.8161490, 1.8064135, 2.5844743, 2.517~
$ O1 <dbl> 5.692757, 4.385927, 2.181182, 5.078008, 4.478001, 3.166005, ~
$ O2 <dbl> 3.4021118, 1.8822306, 2.8189047, 2.4951893, 4.6048038, 2.032~
$ O3 <dbl> 3.503269, 4.352442, 1.817887, 5.628011, 5.104225, 5.852676, ~
$ O4 <dbl> 6.640027, 5.929866, 2.724339, 6.709802, 5.953817, 3.990266, ~
$ O5 <dbl> 4.8215096, 3.4208458, 5.2356782, 3.5194146, 2.0110209, 2.325~
$ gender <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, ~
$ age <dbl> 26, 22, 49, 18, 31, 23, 50, 28, 22, 19, 48, 23, 33, 20, 26, ~
$ education <dbl> 1, 3, 1, 1, 3, 4, 2, 1, 5, 3, 4, 4, 3, 5, 4, 4, 3, 3, 1, 3, ~
Earlier, we discovered that RDS and data.table functions are the fastest ways to save and load data. Saving RDS objects is the most flexible way to write out and read in data without compromising performance, whereas the fwrite() and fread() functions were the fastest in terms of raw speed. Let’s compare these two approaches using the simulated BFI data.
benchmark(
saveRDS(bfi_df, "../data/bfi_df.rds"),
fwrite(bfi_df,"../data/bfi_df_dt.csv"),
replications = 1,
columns = c('test', 'elapsed'),
order = 'elapsed'
)
test elapsed
2 fwrite(bfi_df, "../data/bfi_df_dt.csv") 1.04
1 saveRDS(bfi_df, "../data/bfi_df.rds") 10.19
benchmark(
readRDS("../data/bfi_df.rds"),
fread("../data/bfi_df_dt.csv"),
replications = 1,
columns = c('test', 'elapsed'),
order = 'elapsed'
)
test elapsed
2 fread("../data/bfi_df_dt.csv") 0.38
1 readRDS("../data/bfi_df.rds") 1.36
We can make a list of keys specifying how to score individual items/variables. Once we have the item keys, we can score the items under a variety of specifications (e.g., reverse-scoring items based on the minimum and maximum item values). Run ?psych::scoreItems for more information.
keys.list <- list(
  agree = c("-A1", paste0("A", 2:5)),
  conscientious = c(paste0("C", 1:3), "-C4", "-C5"),
  extraversion = c("-E1", "-E2", paste0("E", 3:5)),
  neuroticism = paste0("N", 1:5),
  openness = c("O1", "-O2", "O3", "O4", "-O5")
)
bfi_scored <- scoreItems(keys.list, bfi_df[1:25])
bfi_scored
Call: scoreItems(keys = keys.list, items = bfi_df[1:25])
(Unstandardized) Alpha:
agree conscientious extraversion neuroticism openness
alpha 0.71 0.73 0.77 0.82 0.61
Standard errors of unstandardized Alpha:
agree conscientious extraversion neuroticism openness
ASE 0.00075 0.00071 0.00066 0.00059 0.00087
Average item correlation:
agree conscientious extraversion neuroticism openness
average.r 0.32 0.35 0.4 0.47 0.24
Median item correlation:
agree conscientious extraversion neuroticism openness
0.34 0.35 0.40 0.42 0.24
Guttman 6* reliability:
agree conscientious extraversion neuroticism openness
Lambda.6 0.71 0.73 0.77 0.82 0.62
Signal/Noise based upon av.r :
agree conscientious extraversion neuroticism openness
Signal/Noise 2.4 2.7 3.3 4.4 1.6
Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, alpha on the diagonal
corrected correlations above the diagonal:
agree conscientious extraversion neuroticism openness
agree 0.71 0.34 0.64 -0.231 0.18
conscientious 0.24 0.73 0.36 -0.298 0.27
extraversion 0.47 0.27 0.77 -0.281 0.32
neuroticism -0.17 -0.23 -0.22 0.816 -0.12
openness 0.12 0.18 0.22 -0.084 0.61
In order to see the item by scale loadings and frequency counts of the data
print with the short option = FALSE
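The scale scores themselves live in the returned object’s scores element; a minimal sketch of extracting them:
bfi_scores <- as.data.frame(bfi_scored$scores) #one column per scale, one row per case
head(bfi_scores)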
Let’s get summary stats and compare the performance of common descriptives functions.
pacman::p_load(Hmisc, pastecs, rstatix)
benchmark(
hmisc = Hmisc::describe(bfi_df),
psych = psych::describe(bfi_df),
fpsych = psych::describe(bfi_df, fast=T),
pastecs = pastecs::stat.desc(bfi_df),
rstatix = rstatix::get_summary_stats(bfi_df),
replications = 1,
columns = c('test', 'elapsed'),
order = 'elapsed'
)
test elapsed
2 psych 3.72
3 fpsych 3.89
5 rstatix 14.36
4 pastecs 23.01
1 hmisc 57.52
bfi_df_summary <- psych::describe(bfi_df, fast=T)
bfi_df_summary
vars n mean sd min max range se
A1 1 1e+06 2.36 1.39 -4.18 9.22 13.40 0.00
A2 2 1e+06 4.83 1.16 -0.51 10.44 10.95 0.00
A3 3 1e+06 4.63 1.29 -1.41 10.91 12.32 0.00
A4 4 1e+06 4.75 1.45 -1.72 13.31 15.03 0.00
A5 5 1e+06 4.58 1.26 -1.55 10.99 12.54 0.00
C1 6 1e+06 4.57 1.22 -1.02 10.36 11.38 0.00
C2 7 1e+06 4.40 1.31 -1.44 10.47 11.91 0.00
C3 8 1e+06 4.32 1.29 -1.89 10.55 12.44 0.00
C4 9 1e+06 2.50 1.36 -4.04 9.20 13.24 0.00
C5 10 1e+06 3.26 1.63 -5.24 11.26 16.51 0.00
E1 11 1e+06 2.97 1.62 -4.37 10.53 14.90 0.00
E2 12 1e+06 3.12 1.61 -5.51 11.00 16.52 0.00
E3 13 1e+06 4.01 1.34 -2.53 10.72 13.25 0.00
E4 14 1e+06 4.43 1.46 -2.77 11.39 14.16 0.00
E5 15 1e+06 4.42 1.33 -1.59 10.91 12.50 0.00
N1 16 1e+06 2.91 1.56 -4.63 10.78 15.41 0.00
N2 17 1e+06 3.49 1.54 -3.81 10.83 14.64 0.00
N3 18 1e+06 3.20 1.60 -4.69 10.27 14.96 0.00
N4 19 1e+06 3.18 1.56 -4.53 10.65 15.19 0.00
N5 20 1e+06 2.95 1.62 -4.77 10.76 15.53 0.00
O1 21 1e+06 4.82 1.12 -0.61 10.47 11.08 0.00
O2 22 1e+06 2.69 1.55 -4.66 10.30 14.95 0.00
O3 23 1e+06 4.48 1.19 -1.24 10.15 11.39 0.00
O4 24 1e+06 4.95 1.17 -0.92 10.77 11.68 0.00
O5 25 1e+06 2.46 1.33 -4.81 9.14 13.95 0.00
gender 26 1e+06 1.67 0.47 1.00 2.00 1.00 0.00
age 27 1e+06 29.51 10.66 3.00 86.00 83.00 0.01
education 28 1e+06 3.19 1.11 1.00 5.00 4.00 0.00
We can see that the psych package performs the best.
Let’s get summary stats by group and compare performance by function. We will use the pipeable rstatix package that implements commonly used statistical tests with great efficiency. It’s easy to use and intuitive if you are familiar with the tidyverse framework.
First, let’s drop the age column; we only want to work with gender and education for now. Let’s compare the performance of subsetting in base R against using dplyr::select().
microbenchmark(unit = "s",
  subset = subset(bfi_df, select = -age),
  base = bfi_df[-which(names(bfi_df) == "age")],
  dplyr = bfi_df %>% select(-age)
)
Unit: seconds
   expr       min        lq        mean     median         uq       max neval cld
 subset 0.1817567 0.2470196 0.274318339 0.25980475 0.27465780 0.5020816   100   b
   base 0.0000157 0.0000255 0.000048281 0.00003040 0.00006945 0.0001296   100  a
  dplyr 0.0010276 0.0011293 0.001467883 0.00146475 0.00159710 0.0032317   100  a
Indexing using brackets is typically faster than the tidyverse, but the code is not as accessible. This is a common trade-off between the tidyverse approach and other approaches. For now, let’s use bracket notation to get the data set. Then, let’s compare the performance of different functions for group comparisons by gender and education.
bfi_group <- bfi_df[-which(names(bfi_df)=="age")] #or bfi_df %>% select(-age)
benchmark(
psych_describeBy = psych::describeBy(bfi_group[1:25], list(bfi_group$gender, bfi_group$education), mat=T),
rstatix_groupby = bfi_group %>% group_by(gender, education) %>% get_summary_stats(),
replications = 1,
columns = c('test', 'elapsed'),
order = 'elapsed'
)
test elapsed
2 rstatix_groupby 10.58
1 psych_describeBy 33.72
Although obtaining overall summary statistics is faster with the psych package than with rstatix, the opposite is true when obtaining summary statistics after the data are grouped.
bfi_summary_stats_group <- bfi_group %>%
group_by(gender, education) %>%
get_summary_stats()
bfi_summary_stats_group
# A tibble: 250 x 15
gender education variable n min max median q1 q3 iqr mad
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 A1 29076 -3.41 7.64 2.37 1.43 3.31 1.88 1.39
2 1 1 A2 29076 -0.478 9.32 4.84 4.05 5.62 1.58 1.17
3 1 1 A3 29076 -0.307 9.88 4.64 3.76 5.51 1.74 1.29
4 1 1 A4 29076 -0.807 11.4 4.77 3.78 5.74 1.96 1.46
5 1 1 A5 29076 -1.29 9 4.59 3.74 5.44 1.70 1.26
6 1 1 C1 29076 -0.106 10.4 4.57 3.76 5.39 1.63 1.21
7 1 1 C2 29076 -0.98 9.58 4.39 3.49 5.27 1.78 1.32
8 1 1 C3 29076 -1.17 9.24 4.34 3.47 5.18 1.71 1.27
9 1 1 C4 29076 -3.08 8.07 2.51 1.59 3.42 1.83 1.36
10 1 1 C5 29076 -3.30 10.2 3.27 2.16 4.36 2.20 1.64
# ... with 240 more rows, and 4 more variables: mean <dbl>, sd <dbl>, se <dbl>,
# ci <dbl>
An easy way to get standardized mean differences by group is rstatix::cohens_d(). We can use pivot_longer() to gather a column of items and a column of item scores. We can then calculate d between groups by item to understand how the items function by group.
bfi_tbl <- as_tibble(bfi_df) #converting to tbl_df (age and education are dropped below)
bfi_tbl_group <- bfi_tbl %>% #grab tbl
pivot_longer( #pivot BFI items
A1:O5,
names_to = "item", #name gathered column
values_to = "score" #name values columns
) %>%
select(-age, -education) %>% #drop age, edu columns
group_by(item) #group by item
glimpse(bfi_tbl_group)
Rows: 25,000,000
Columns: 3
Groups: item [25]
$ gender <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
$ item <chr> "A1", "A2", "A3", "A4", "A5", "C1", "C2", "C3", "C4", "C5", "E1~
$ score <dbl> 4.985129, 4.272613, 4.865218, 5.476493, 5.555879, 5.457633, 6.7~
This is a large data set: 25 million rows and 3 variables. Since the data are large, we should make a small test tibble to try out different functions for calculating d. Let’s take the first 1000 rows of our data using dplyr::slice(). Since the data are grouped, dplyr::slice(1:1000) would take the first 1000 rows per item (i.e., per level of the grouping variable). We need to ungroup, slice, and regroup.
bfi_tbl_group_slice <- bfi_tbl_group %>%
ungroup(item) %>%
slice(1:1000) %>%
group_by(item)
bfi_tbl_group_slice
# A tibble: 1,000 x 3
# Groups: item [25]
gender item score
<dbl> <chr> <dbl>
1 2 A1 4.99
2 2 A2 4.27
3 2 A3 4.87
4 2 A4 5.48
5 2 A5 5.56
6 2 C1 5.46
7 2 C2 6.74
8 2 C3 4.44
9 2 C4 3.64
10 2 C5 4.22
# ... with 990 more rows
The rstatix way is easy to read and follows the grouping command, whereas with psych we have to add the group to the formula. Also, the output of each is quite different; using psych, we would need to code further manipulations.
bfi_tbl_group_slice %>%
cohens_d(score ~ gender)
# A tibble: 25 x 8
.y. group1 group2 effsize item n1 n2 magnitude
* <chr> <chr> <chr> <dbl> <chr> <int> <int> <ord>
1 score 1 2 0.124 A1 13 27 negligible
2 score 1 2 0.207 A2 13 27 small
3 score 1 2 0.0638 A3 13 27 negligible
4 score 1 2 0.245 A4 13 27 small
5 score 1 2 0.0739 A5 13 27 negligible
6 score 1 2 -0.142 C1 13 27 negligible
7 score 1 2 -0.0283 C2 13 27 negligible
8 score 1 2 -0.0729 C3 13 27 negligible
9 score 1 2 0.240 C4 13 27 small
10 score 1 2 -0.348 C5 13 27 small
# ... with 15 more rows
bfi_tbl_group_slice %>%
cohen.d(score ~ gender + item, data = .) %>%
summary
Extract effect sizes by groups from cohen.d.by
score Md
genderforitemA1 -0.123 0.123
genderforitemA2 -0.230 0.230
genderforitemA3 -0.064 0.064
genderforitemA4 -0.240 0.240
genderforitemA5 -0.079 0.079
genderforitemC1 0.149 0.149
genderforitemC2 0.029 0.029
genderforitemC3 0.079 0.079
genderforitemC4 -0.255 0.255
genderforitemC5 0.356 0.356
genderforitemE1 0.564 0.564
genderforitemE2 0.195 0.195
genderforitemE3 -0.490 0.490
genderforitemE4 -0.167 0.167
genderforitemE5 -0.121 0.121
genderforitemN1 0.385 0.385
genderforitemN2 0.288 0.288
genderforitemN3 0.453 0.453
genderforitemN4 1.131 1.131
genderforitemN5 -0.278 0.278
genderforitemO1 -0.280 0.280
genderforitemO2 -0.088 0.088
genderforitemO3 -0.049 0.049
genderforitemO4 0.756 0.756
genderforitemO5 0.172 0.172
Although psych might be ~2x faster in terms of raw speed, rstatix gives us practically useful output that we would otherwise have to spend valuable time coding ourselves with psych. However, if you only want the effect size without programming further data manipulations for a more comprehensive summary, use psych. Make sure to double-check the direction in which groups are compared as well.
benchmark(
rstatix = bfi_tbl_group_slice %>%
cohens_d(score ~ gender),
psych = bfi_tbl_group_slice %>%
cohen.d(score ~ gender + item, data = .) %>%
summary ,
replications = 1,
columns = c("test", "elapsed"),
order = "elapsed"
)
Extract effect sizes by groups from cohen.d.by
score Md
genderforitemA1 -0.123 0.123
genderforitemA2 -0.230 0.230
genderforitemA3 -0.064 0.064
genderforitemA4 -0.240 0.240
genderforitemA5 -0.079 0.079
genderforitemC1 0.149 0.149
genderforitemC2 0.029 0.029
genderforitemC3 0.079 0.079
genderforitemC4 -0.255 0.255
genderforitemC5 0.356 0.356
genderforitemE1 0.564 0.564
genderforitemE2 0.195 0.195
genderforitemE3 -0.490 0.490
genderforitemE4 -0.167 0.167
genderforitemE5 -0.121 0.121
genderforitemN1 0.385 0.385
genderforitemN2 0.288 0.288
genderforitemN3 0.453 0.453
genderforitemN4 1.131 1.131
genderforitemN5 -0.278 0.278
genderforitemO1 -0.280 0.280
genderforitemO2 -0.088 0.088
genderforitemO3 -0.049 0.049
genderforitemO4 0.756 0.756
genderforitemO5 0.172 0.172
test elapsed
2 psych 0.22
1 rstatix 0.40
Yet, given 25 million rows, we would need to calculate 25 effect sizes at a sample of 1 million each. Calculating d for just 1000 cases took ~0.5 seconds using rstatix, so scaling up to 25 million would take ~208 minutes. This is a prime example of how inefficiencies compound when scaling up.
A better way to approach the problem is to parallelize across all cores. Parallelization is the act of designing a computer program or system to process data in parallel. (You can think of it as hyperthreading across physical cores instead of within them.) Normally, computer programs process data serially, computing one thing at a time. If we parallelize, we can speed up computation by breaking the problem into smaller pieces that are solved independently and simultaneously.
For this example, we will split the data set into a list, where every element of that list is an item. Then, we can loop through the list using foreach and the %dopar% operator to run a for loop in parallel. For data sets with smaller cases, we don’t see the performance gains. For larger data sets, we do.
bfi_tbl_long <- bfi_tbl_group %>% slice(1:1000) #25k
bfi_ls <- split(
bfi_tbl_long[, !names(bfi_tbl_long) == "item" ], #split into list but remove the item variable
bfi_tbl_long$item #split on item variable
)
system.time({
rstatix::cohens_d(bfi_tbl_long, score ~ gender)
})
user system elapsed
0.53 0.00 0.54
registerDoParallel(detectCores() - 1) #register the cluster to run in parallel
system.time({
d_gender <-
foreach(i=1:25) %dopar% {
rstatix::cohens_d(bfi_ls[[i]], score ~ gender)
}
})
user system elapsed
0.06 0.02 3.21
registerDoSEQ() #register cores to compute sequentially
bfi_tbl_long <- bfi_tbl_group %>% slice(1:1e5) #2.5 mil (2500k)
bfi_ls <- split(
bfi_tbl_long[, !names(bfi_tbl_long) == "item" ], #split into list but remove the item variable
bfi_tbl_long$item #split on item variable
)
system.time({
rstatix::cohens_d(bfi_tbl_long, score ~ gender)
})
user system elapsed
10.55 0.02 10.57
registerDoParallel(detectCores() - 1) #register the cluster to run in parallel
system.time({
d_gender <-
foreach(i=1:25) %dopar% {
rstatix::cohens_d(bfi_ls[[i]], score ~ gender)
}
})
user system elapsed
3.09 2.36 10.42
registerDoSEQ() #register cores to compute sequentially
To view the results, we can bind the rows across the list if they hold the same column values. There are two ways we can do this: dplyr::bind_rows() or data.table::rbindlist(). Both yield the same output, but rbindlist() is faster. However, it coerces to a data.table object; if we want a tibble, we have to convert manually, which is still faster than bind_rows().
benchmark(
dplyr::bind_rows(d_gender),
data.table::rbindlist(d_gender) %>% as_tibble,
replications = 10,
columns = c("test","elapsed"),
order = "elapsed"
)
test elapsed
2 data.table::rbindlist(d_gender) %>% as_tibble 0.00
1 dplyr::bind_rows(d_gender) 0.33
rbindlist(d_gender) %>%
as_tibble()
# A tibble: 25 x 7
.y. group1 group2 effsize n1 n2 magnitude
<chr> <chr> <chr> <dbl> <int> <int> <ord>
1 score 1 2 -0.000244 32692 67308 negligible
2 score 1 2 0.00620 32692 67308 negligible
3 score 1 2 -0.00613 32692 67308 negligible
4 score 1 2 0.00461 32692 67308 negligible
5 score 1 2 0.000188 32692 67308 negligible
6 score 1 2 -0.0135 32692 67308 negligible
7 score 1 2 -0.00459 32692 67308 negligible
8 score 1 2 0.00647 32692 67308 negligible
9 score 1 2 0.00323 32692 67308 negligible
10 score 1 2 -0.00134 32692 67308 negligible
# ... with 15 more rows
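One caveat (a suggested fix, not in the original output): the item identifier was lost when we split the data, so the combined rows are unlabeled. Naming the list first lets rbindlist() carry the identifier along via its idcol argument:
names(d_gender) <- names(bfi_ls) #foreach returns an unnamed list; reuse the split's names
rbindlist(d_gender, idcol = "item") %>% as_tibble() #adds an item column from the list names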
Sometimes factor analysis can take a while. With parallelization, we can run several factor analyses in parallel if we have the proper input data.
rm(list = ls()) #clear the global environment
glimpse(bfi)
Rows: 2,800
Columns: 28
$ A1 <int> 2, 2, 5, 4, 2, 6, 2, 4, 4, 2, 4, 2, 5, 5, 4, 4, 4, 5, 4, 4, ~
$ A2 <int> 4, 4, 4, 4, 3, 6, 5, 3, 3, 5, 4, 5, 5, 5, 5, 3, 6, 5, 4, 4, ~
$ A3 <int> 3, 5, 5, 6, 3, 5, 5, 1, 6, 6, 5, 5, 5, 5, 2, 6, 6, 5, 5, 6, ~
$ A4 <int> 4, 2, 4, 5, 4, 6, 3, 5, 3, 6, 6, 5, 6, 6, 2, 6, 2, 4, 4, 5, ~
$ A5 <int> 4, 5, 4, 5, 5, 5, 5, 1, 3, 5, 5, 5, 4, 6, 1, 3, 5, 5, 3, 5, ~
$ C1 <int> 2, 5, 4, 4, 4, 6, 5, 3, 6, 6, 4, 5, 5, 4, 5, 5, 4, 5, 5, 1, ~
$ C2 <int> 3, 4, 5, 4, 4, 6, 4, 2, 6, 5, 3, 4, 4, 4, 5, 5, 4, 5, 4, 1, ~
$ C3 <int> 3, 4, 4, 3, 5, 6, 4, 4, 3, 6, 5, 5, 3, 4, 5, 5, 4, 5, 5, 1, ~
$ C4 <int> 4, 3, 2, 5, 3, 1, 2, 2, 4, 2, 3, 4, 2, 2, 2, 3, 4, 4, 4, 5, ~
$ C5 <int> 4, 4, 5, 5, 2, 3, 3, 4, 5, 1, 2, 5, 2, 1, 2, 5, 4, 3, 6, 6, ~
$ E1 <int> 3, 1, 2, 5, 2, 2, 4, 3, 5, 2, 1, 3, 3, 2, 3, 1, 1, 2, 1, 1, ~
$ E2 <int> 3, 1, 4, 3, 2, 1, 3, 6, 3, 2, 3, 3, 3, 2, 4, 1, 2, 2, 2, 1, ~
$ E3 <int> 3, 6, 4, 4, 5, 6, 4, 4, NA, 4, 2, 4, 3, 4, 3, 6, 5, 4, 4, 4,~
$ E4 <int> 4, 4, 4, 4, 4, 5, 5, 2, 4, 5, 5, 5, 2, 6, 6, 6, 5, 6, 5, 5, ~
$ E5 <int> 4, 3, 5, 4, 5, 6, 5, 1, 3, 5, 4, 4, 4, 5, 5, 4, 5, 6, 5, 6, ~
$ N1 <int> 3, 3, 4, 2, 2, 3, 1, 6, 5, 5, 3, 4, 1, 1, 2, 4, 4, 6, 5, 5, ~
$ N2 <int> 4, 3, 5, 5, 3, 5, 2, 3, 5, 5, 3, 5, 2, 1, 4, 5, 4, 5, 6, 5, ~
$ N3 <int> 2, 3, 4, 2, 4, 2, 2, 2, 2, 5, 4, 3, 2, 1, 2, 4, 4, 5, 5, 5, ~
$ N4 <int> 2, 5, 2, 4, 4, 2, 1, 6, 3, 2, 2, 2, 2, 2, 2, 5, 4, 4, 5, 1, ~
$ N5 <int> 3, 5, 3, 1, 3, 3, 1, 4, 3, 4, 3, NA, 2, 1, 3, 5, 5, 4, 2, 1,~
$ O1 <int> 3, 4, 4, 3, 3, 4, 5, 3, 6, 5, 5, 4, 4, 5, 5, 6, 5, 5, 4, 4, ~
$ O2 <int> 6, 2, 2, 3, 3, 3, 2, 2, 6, 1, 3, 6, 2, 3, 2, 6, 1, 1, 2, 1, ~
$ O3 <int> 3, 4, 5, 4, 4, 5, 5, 4, 6, 5, 5, 4, 4, 4, 5, 6, 5, 4, 2, 5, ~
$ O4 <int> 4, 3, 5, 3, 3, 6, 6, 5, 6, 5, 6, 5, 5, 4, 5, 3, 6, 5, 4, 3, ~
$ O5 <int> 3, 3, 2, 5, 3, 1, 1, 3, 1, 2, 3, 4, 2, 4, 5, 2, 3, 4, 2, 2, ~
$ gender <int> 1, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, ~
$ education <int> NA, NA, NA, NA, NA, 3, NA, 2, 1, NA, 1, NA, NA, NA, 1, NA, N~
$ age <int> 16, 18, 17, 17, 17, 21, 18, 19, 19, 17, 21, 16, 16, 16, 17, ~
bfi_ls <- list(
  agree = bfi %>% select(starts_with("A"), -age),
  conscientious = bfi %>% select(starts_with("C")),
  extraversion = bfi %>% select(starts_with("E"), -education),
  neuroticism = bfi %>% select(starts_with("N")),
  openness = bfi %>% select(starts_with("O"))
)
for (i in seq_along(bfi_ls)) {
colnames(bfi_ls[[i]]) <- paste0("item", 1:5)
}
pacman::p_load(lavaan)
registerDoParallel(detectCores() - 1)
cfa <- foreach(i = 1:5, .multicombine = T, .errorhandling = "pass") %dopar% {
  lavaan::cfa(model = "trait =~ item1 + item2 + item3 + item4 + item5", data = bfi_ls[[i]])
}
models <- foreach(i = 1:length(cfa), .multicombine = T, .errorhandling = "pass") %dopar% {
  pacman::p_load(lavaan) #each parallel worker needs lavaan attached
  nobs <- inspect(cfa[[i]], what = "nobs")
  fm <- fitMeasures(cfa[[i]], c("chisq", "df", "pvalue", "cfi", "rmsea", "srmr"))
  maxLoading <- max(lavInspect(cfa[[i]], what = "est")$lambda)
  minLoading <- min(lavInspect(cfa[[i]], what = "est")$lambda)
  as.list(c(names(bfi_ls[i]), nobs, fm, maxLoading, minLoading))
}
registerDoSEQ()
models_output <- data.table::rbindlist(models)
colnames(models_output) <- c("trait", "N", "chisq", "df", "pvalue", "cfi", "rmsea",
"srmr", "max_loading", "min_loading")
models_output %>%
mutate(across(N:min_loading, as.numeric)) %>%
mutate(across(where(is.numeric), ~round(., digits = 3)))
           trait    N   chisq df pvalue   cfi rmsea  srmr max_loading min_loading
1:         agree 2709  86.696  5      0 0.968 0.078 0.032       1.000      -1.880
2: conscientious 2707 164.893  5      0 0.937 0.109 0.042       1.184      -1.398
3:  extraversion 2713  86.601  5      0 0.973 0.078 0.030       1.182      -1.020
4:   neuroticism 2694 360.932  5      0 0.925 0.163 0.056       1.000       0.632
5:      openness 2726  80.583  5      0 0.945 0.074 0.034       1.286      -1.139