Sometimes I want to split a dataframe into smaller pieces. These are my notes to myself, some borrowed and adapted.
Create the dummy dataframe and have a glimpse.
df <- as.data.frame(matrix(ceiling(runif(10000, 100, 10000))))
df$name <- sample( LETTERS[1:4], 10000, replace=TRUE)
tibble::glimpse(df)
## Rows: 10,000
## Columns: 2
## $ V1 <dbl> 6142, 537, 1091, 6778, 7614, 2501, 5434, 1190, 9687, 389, 3557, …
## $ name <chr> "D", "B", "C", "B", "B", "A", "A", "C", "D", "A", "C", "B", "B",…
This method takes one of the numeric variables, extracts its first digit as a substring, and then uses the modulus operator %% to divide the dataframe into a list of n dataframes.
Split into n dataframes within a list. Then recover the list elements as dataframes. The pieces are not necessarily the same length, so if that matters then this method isn’t for you.
Increase the number of splits by increasing n in %% n.
list1 <- split(df, (as.numeric(substr(df$V1,1,1))) %% 2)
df1_1 <- list1[[1]] ; nrow(df1_1)
## [1] 4413
df1_2 <- list1[[2]] ; nrow(df1_2)
## [1] 5587
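As a rough sketch of raising the modulus, the same first-digit trick with %% 3 gives three pieces. The names n_splits, first_digit and list1b are made up for this illustration, and the smaller seeded df is a stand-in for the one created above.

```r
# Stand-in data, built the same way as df above but smaller (an assumption for this sketch)
set.seed(1)
df <- as.data.frame(matrix(ceiling(runif(1000, 100, 10000))))

n_splits <- 3  # hypothetical name: the modulus, i.e. the number of pieces

# The first digit of V1 is 1 to 9, so %% 3 yields the group labels 0, 1 and 2
first_digit <- as.numeric(substr(df$V1, 1, 1))
list1b <- split(df, first_digit %% n_splits)

names(list1b)         # labels of the pieces
sapply(list1b, nrow)  # rows per piece; the sizes are generally unequal
```

Because a first digit can only be 1 to 9, this approach tops out at nine pieces.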
Here we give each row a random group label, so the pieces come out as close to equal in length as possible. The labels are repeated out to nrow(df) so that split() does not have to recycle a short vector.
number_of_splits <- 3
list2 <- split(df, sample(rep(1:number_of_splits, length.out = nrow(df))))
df2_1 <- list2[[1]] ; nrow(df2_1)
## [1] 3334
df2_2 <- list2[[2]] ; nrow(df2_2)
## [1] 3333
df2_3 <- list2[[3]] ; nrow(df2_3)
## [1] 3333
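If the original row order should be kept inside each piece, one possible variant (not part of the notes above) is to sort the group labels instead of shuffling them, which gives consecutive blocks of rows. The name list2b and the smaller seeded df are assumptions made for this sketch.

```r
# Stand-in data, built the same way as df above but smaller (an assumption for this sketch)
set.seed(1)
df <- as.data.frame(matrix(ceiling(runif(1000, 100, 10000))))
df$name <- sample(LETTERS[1:4], 1000, replace = TRUE)

number_of_splits <- 3

# One label per row, sorted so that each piece is a consecutive block of rows
block <- sort(rep(seq_len(number_of_splits), length.out = nrow(df)))
list2b <- split(df, block)

sapply(list2b, nrow)  # piece sizes differ by at most one row
```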
Here we use the name variable to split the dataframe, then recover the pieces as dataframes.
list3 <- split(df, df$name)
df3_1 <- list3[[1]] ; nrow(df3_1)
## [1] 2528
df3_2 <- list3[[2]] ; nrow(df3_2)
## [1] 2490
df3_3 <- list3[[3]] ; nrow(df3_3)
## [1] 2475
df3_4 <- list3[[4]] ; nrow(df3_4)
## [1] 2507
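A small sketch of working with the list directly, rather than pulling out each piece by position. It assumes the pieces keep the labels "A" to "D" that split() takes from name; df3_A, stacked and restored are hypothetical names, and the smaller seeded df is a stand-in for the one above.

```r
# Stand-in data, built the same way as df above but smaller (an assumption for this sketch)
set.seed(1)
df <- as.data.frame(matrix(ceiling(runif(1000, 100, 10000))))
df$name <- sample(LETTERS[1:4], 1000, replace = TRUE)

list3 <- split(df, df$name)

sapply(list3, nrow)    # rows in every piece at once
df3_A <- list3[["A"]]  # pick a piece by its label rather than its position

# Stack the pieces into one dataframe; rows come back grouped by name
stacked <- do.call(rbind, list3)

# Or reverse the split with the same grouping vector, restoring the row order
restored <- unsplit(list3, df$name)
```

Picking pieces by label avoids relying on the order in which split() happens to return them.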