Introduction.

Sometimes I want to split a data frame into smaller pieces. These are my notes to myself, some of them borrowed and adapted.

Create a dummy data frame.

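Build a 10,000-row data frame with one numeric column (V1) and one character column (name), then take a glimpse. No seed is set, so the values and the row counts shown below will differ from run to run.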

df <- as.data.frame(matrix(ceiling(runif(10000, 100, 10000))))
df$name <- sample(LETTERS[1:4], 10000, replace = TRUE)
tibble::glimpse(df)

## Rows: 10,000
## Columns: 2
## $ V1   <dbl> 6142, 537, 1091, 6778, 7614, 2501, 5434, 1190, 9687, 389, 3557, …
## $ name <chr> "D", "B", "C", "B", "B", "A", "A", "C", "D", "A", "C", "B", "B",…

Method 1 - Split using a numeric property of one of the variables.

This method takes one of the numeric variables, extracts its first digit with substr(), and applies the modulus operator %% to that digit to assign each row to a group.

split() returns a list of n data frames, which can then be pulled out of the list one by one. The groups are not necessarily the same size, so if that matters then this method isn’t for you.

Increase the number of splits by increasing n in %% n.

# Group rows by whether the first digit of V1 is even (0) or odd (1)
list1 <- split(df, as.numeric(substr(df$V1, 1, 1)) %% 2)
df1_1 <- list1[[1]] ; nrow(df1_1)

## [1] 4413

df1_2 <- list1[[2]] ; nrow(df1_2)

## [1] 5587
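
To make the modulus step a little more concrete, here is a minimal sketch that wraps it in a helper and checks the group sizes. The function name split_by_first_digit, the small demo data frame, and the seed are my own assumptions for illustration, not part of the original notes.

```r
# Hypothetical helper (not from the original notes): split a data frame
# on the first digit of a numeric column, modulo n.
split_by_first_digit <- function(data, column, n) {
  first_digit <- as.numeric(substr(as.character(data[[column]]), 1, 1))
  split(data, first_digit %% n)
}

set.seed(42)  # assumed seed, only so the sketch is repeatable
demo <- data.frame(V1 = ceiling(runif(1000, 100, 10000)))

parts <- split_by_first_digit(demo, "V1", 3)
names(parts)         # group labels are the remainders
sapply(parts, nrow)  # group sizes; generally unequal
```

Because the first digit only takes the values 1 to 9, this approach cannot produce more than nine groups, however large n is.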

Method 2 - Set the upper range of a replicate function.

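This method builds a vector of group labels with rep(), shuffles it with sample(), and splits on it, giving random groups of near-equal size. The label vector needs to be as long as the data frame; if it is shorter, split() recycles it and warns that the data length is not a multiple of the split variable.

number_of_splits <- 3

# One label per row, shuffled so that rows land in groups at random
list2 <- split(df, sample(rep(1:number_of_splits, length.out = nrow(df))))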

df2_1 <- list2[[1]] ; nrow(df2_1)

## [1] 3334

df2_2 <- list2[[2]] ; nrow(df2_2)

## [1] 3333

df2_3 <- list2[[3]] ; nrow(df2_3)

## [1] 3333
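
The same label vector can give either random groups or contiguous chunks, depending on whether it is shuffled or sorted. The following is a rough sketch on a tiny data frame; the names demo, n_groups, random_parts and chunk_parts, and the seed, are assumptions made up for this illustration.

```r
set.seed(1)  # assumed seed, only so the sketch is repeatable
demo <- data.frame(id = 1:10, value = runif(10))
n_groups <- 3  # hypothetical number of pieces

# One label per row; length.out avoids recycling warnings.
labels <- rep(seq_len(n_groups), length.out = nrow(demo))

# Shuffled labels: rows are assigned to groups at random.
random_parts <- split(demo, sample(labels))

# Sorted labels: rows stay in their original order, cut into chunks.
chunk_parts <- split(demo, sort(labels))

sapply(random_parts, nrow)
sapply(chunk_parts, nrow)
```

Either way the group sizes differ by at most one row, since rep() hands out the labels in turn.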

Method 3 - Split on an Existing Column.

Here we split on the name column, which gives one data frame per distinct value (A to D), then pull each one out of the list.

list3 <- split(df, df$name)
df3_1 <- list3[[1]] ; nrow(df3_1)

## [1] 2528

df3_2 <- list3[[2]] ; nrow(df3_2)

## [1] 2490

df3_3 <- list3[[3]] ; nrow(df3_3)

## [1] 2475

df3_4 <- list3[[4]] ; nrow(df3_4)

## [1] 2507
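
When splitting on an existing column, the list elements are named after the column values, so they can be picked out by name rather than by position. A small sketch under assumed data follows (the demo data frame is invented for illustration); it also uses base R's unsplit() to put the pieces back together in the original row order.

```r
# Made-up example data: 12 rows, names cycling through A to D.
demo <- data.frame(V1 = 1:12, name = rep(LETTERS[1:4], times = 3))

parts <- split(demo, demo$name)

names(parts)             # one element per distinct value of name
group_a <- parts[["A"]]  # pick a group by name instead of position
group_a$V1

# unsplit() reverses split(), given the same grouping vector.
restored <- unsplit(parts, demo$name)
nrow(restored)
```

Selecting by name is a little safer than list3[[1]], since it does not depend on the order in which the groups happen to be sorted.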

References, Sources.

https://stackoverflow.com/questions/3302356/how-to-split-a-data-frame

30_splitting_R_dataframes

Superboreen

2020-04-22