Introduction.

Sometimes I want to split a data frame into smaller pieces. These are my notes to myself, some borrowed and adapted from other sources.

Create a dummy data frame.

Create the dummy dataframe and have a glimpse.

df <- as.data.frame(matrix(ceiling(runif(10000, 100, 10000))))
df$name <- sample( LETTERS[1:4], 10000, replace=TRUE)
tibble::glimpse(df)
## Rows: 10,000
## Columns: 2
## $ V1   <dbl> 6142, 537, 1091, 6778, 7614, 2501, 5434, 1190, 9687, 389, 3557, …
## $ name <chr> "D", "B", "C", "B", "B", "A", "A", "C", "D", "A", "C", "B", "B",…
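Because runif() and sample() draw random numbers, the row counts shown in these notes will differ from run to run. A minimal sketch, assuming repeatable output is wanted: call set.seed() before generating the data. The helper name make_dummy and the seed value 42 are arbitrary choices for illustration and are not part of the original notes.

```r
# Hypothetical helper: build the same kind of dummy data frame, reproducibly.
make_dummy <- function(n = 10000, seed = 42) {
  set.seed(seed)  # fix the random number stream so results repeat
  out <- as.data.frame(matrix(ceiling(runif(n, 100, 10000))))
  out$name <- sample(LETTERS[1:4], n, replace = TRUE)
  out
}

a <- make_dummy()
b <- make_dummy()
identical(a, b)  # TRUE: same seed, same data
```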

Method 1 - Split using a numeric property of one of the variables.

This method takes one of the numeric variables, extracts its first digit with substr(), and applies the modulus operator %% to that digit. The remainder becomes the grouping key that split() uses to divide the data frame into a list of n data frames.

The individual data frames can then be pulled out of the list. They are not necessarily the same size, so if that matters then this method isn’t for you.

Increase the number of splits by increasing n in %% n.

list1 <- split(df, (as.numeric(substr(df$V1,1,1))) %% 2)
df1_1 <- list1[[1]] ; nrow(df1_1)
## [1] 4413
df1_2 <- list1[[2]] ; nrow(df1_2)
## [1] 5587
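The same idea can be wrapped up for any n. This is a minimal sketch on a tiny made-up data frame. The helper name split_by_first_digit and the toy values are assumptions for illustration only. Note that split() names each list element after its remainder ("0", "1", ...), so the pieces can be picked out by name rather than by position.

```r
# Hypothetical helper: split on (first digit of a numeric column) %% n.
split_by_first_digit <- function(data, column, n) {
  first_digit <- as.numeric(substr(data[[column]], 1, 1))
  split(data, first_digit %% n)
}

toy <- data.frame(V1 = c(123, 245, 367, 489, 501, 612))
pieces <- split_by_first_digit(toy, "V1", 3)

names(pieces)         # "0" "1" "2": one element per remainder that occurs
pieces[["0"]]$V1      # 367 612: first digits 3 and 6 are divisible by 3
sapply(pieces, nrow)  # piece sizes; in general these need not be equal
```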

Method 2 - Set the upper range of a replicate function.

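This method builds a vector of group labels from 1 to number_of_splits, one label per row, shuffles it with sample(), and splits on it. The pieces differ in size by at most one row.

number_of_splits <- 3

# length.out = nrow(df) gives one label per row. Without it the short label
# vector is recycled down the rows and split() warns that the data length is
# not a multiple of the split variable.
list2 <- split(df, sample(rep(1:number_of_splits, length.out = nrow(df))))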
df2_1 <- list2[[1]] ; nrow(df2_1)
## [1] 3334
df2_2 <- list2[[2]] ; nrow(df2_2)
## [1] 3333
df2_3 <- list2[[3]] ; nrow(df2_3)
## [1] 3333
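A random, near-equal split like this can be sketched as a small reusable function. The name split_random and the toy data are hypothetical. The sketch assumes one group label per row, built with rep(..., length.out = nrow(data)), so no recycling takes place.

```r
# Hypothetical helper: randomly assign rows to n groups of near-equal size.
split_random <- function(data, n) {
  groups <- rep(seq_len(n), length.out = nrow(data))  # 1, 2, ..., n, 1, 2, ...
  split(data, sample(groups))                         # shuffle labels, then split
}

toy <- data.frame(id = 1:10)
parts <- split_random(toy, 3)

sapply(parts, nrow)  # 4 3 3: sizes differ by at most one row

# Every row lands in exactly one piece.
sort(unlist(lapply(parts, `[[`, "id"), use.names = FALSE))  # 1 2 ... 10
```

Which rows end up in which piece changes on every call unless set.seed() is called first. The piece sizes do not change.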

Method 3 - Split on an existing column.

Here we split the data frame on the name variable, then pull each group out of the list as its own data frame.

list3 <- split(df, df$name)
df3_1 <- list3[[1]] ; nrow(df3_1)
## [1] 2528
df3_2 <- list3[[2]] ; nrow(df3_2)
## [1] 2490
df3_3 <- list3[[3]] ; nrow(df3_3)
## [1] 2475
df3_4 <- list3[[4]] ; nrow(df3_4)
## [1] 2507
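When splitting on an existing column, split() names each list element after the value it holds, so the pieces can be fetched by name instead of by position. A minimal sketch on a made-up five-row data frame (the toy values are assumptions for illustration only):

```r
toy <- data.frame(
  V1   = c(10, 20, 30, 40, 50),
  name = c("B", "A", "B", "C", "A")
)
by_name <- split(toy, toy$name)

names(by_name)                 # "A" "B" "C": sorted unique values of the column
by_name[["B"]]$V1              # 10 30
sapply(by_name, nrow)          # A = 2, B = 2, C = 1

# Stack the pieces back into one data frame (rows now ordered by group).
nrow(do.call(rbind, by_name))  # 5
```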

References, Sources.

  1. https://stackoverflow.com/questions/3302356/how-to-split-a-data-frame

Ends.