Reading Data from Multiple Files

Author

Jamal Rogers

Published

May 16, 2023

We load the tidyverse, which includes the readr package, so that we can keep using read_csv().

    library(tidyverse)

Sometimes your data are split across multiple files instead of being contained in a single file. For example, you might have sales data for multiple months, with each month’s data in a separate file: 01-sales.csv for January, 02-sales.csv for February, and 03-sales.csv for March. With read_csv() you can read all of these files at once and stack them on top of each other in a single data frame.

    sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")

    read_csv(sales_files, id = "file")

Once again, the code above will work only if you have the CSV files in a data folder in your project. You can download these files from https://pos.it/r4ds-01-sales, https://pos.it/r4ds-02-sales, and https://pos.it/r4ds-03-sales, or you can read them directly from those URLs with:

    sales_files <- c(
      "https://pos.it/r4ds-01-sales",
      "https://pos.it/r4ds-02-sales",
      "https://pos.it/r4ds-03-sales"
    )
    read_csv(sales_files, id = "file")

    # A tibble: 19 × 6
       file                         month     year brand  item     n
       <chr>                        <chr>    <dbl> <dbl> <dbl> <dbl>
     1 https://pos.it/r4ds-01-sales January   2019     1  1234     3
     2 https://pos.it/r4ds-01-sales January   2019     1  8721     9
     3 https://pos.it/r4ds-01-sales January   2019     1  1822     2
     4 https://pos.it/r4ds-01-sales January   2019     2  3333     1
     5 https://pos.it/r4ds-01-sales January   2019     2  2156     9
     6 https://pos.it/r4ds-01-sales January   2019     2  3987     6
     7 https://pos.it/r4ds-01-sales January   2019     2  3827     6
     8 https://pos.it/r4ds-02-sales February  2019     1  1234     8
     9 https://pos.it/r4ds-02-sales February  2019     1  8721     2
    10 https://pos.it/r4ds-02-sales February  2019     1  1822     3
    11 https://pos.it/r4ds-02-sales February  2019     2  3333     1
    12 https://pos.it/r4ds-02-sales February  2019     2  2156     3
    13 https://pos.it/r4ds-02-sales February  2019     2  3987     6
    14 https://pos.it/r4ds-03-sales March     2019     1  1234     3
    15 https://pos.it/r4ds-03-sales March     2019     1  3627     1
    16 https://pos.it/r4ds-03-sales March     2019     1  8820     3
    17 https://pos.it/r4ds-03-sales March     2019     2  7253     1
    18 https://pos.it/r4ds-03-sales March     2019     2  8766     3
    19 https://pos.it/r4ds-03-sales March     2019     2  8288     6

The id argument adds a new column called file to the resulting data frame that identifies the file the data come from. This is especially helpful in circumstances where the files you’re reading in do not have an identifying column that can help you trace the observations back to their original sources.
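Because the file column stores each path exactly as it was passed to read_csv(), it can be long. The following is a minimal sketch of one way to shorten it, not part of the original example: it assumes two tiny made-up CSV files written to a temporary folder (the folder name sales-id-demo and the data values are hypothetical) and uses base R's basename() to keep only the file name.

```r
library(readr)

# Hypothetical setup: write two tiny CSV files to a temporary folder
tmp <- file.path(tempdir(), "sales-id-demo")
dir.create(tmp, showWarnings = FALSE)
write_csv(data.frame(brand = c(1, 2), n = c(3, 9)), file.path(tmp, "01-sales.csv"))
write_csv(data.frame(brand = c(1, 2), n = c(8, 2)), file.path(tmp, "02-sales.csv"))

# Read both files at once; id = "file" records the source path of every row
paths <- file.path(tmp, c("01-sales.csv", "02-sales.csv"))
sales <- read_csv(paths, id = "file", show_col_types = FALSE)

# Keep only the file name, dropping the directory part of the path
sales$file <- basename(sales$file)
sales
```

The shortened file column still distinguishes the two sources, which is usually all that is needed for grouping or filtering by file.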

If you have many files you want to read in, it can get cumbersome to write out their names one by one in a character vector. Instead, you can use the base list.files() function to find the files for you by matching a pattern in the file names.

    sales_files <- list.files("data", pattern = "sales\\.csv$", full.names = TRUE)
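The result of list.files() is a character vector of paths, so it can be passed straight to read_csv(). Here is a minimal sketch of the whole round trip, assuming a temporary folder stands in for data/ (the folder name sales-list-demo, the file readme.csv, and the data values are made up for illustration):

```r
library(readr)

# Hypothetical setup: a temporary folder standing in for data/
data_dir <- file.path(tempdir(), "sales-list-demo")
dir.create(data_dir, showWarnings = FALSE)
write_csv(data.frame(month = "January", n = 3), file.path(data_dir, "01-sales.csv"))
write_csv(data.frame(month = "February", n = 8), file.path(data_dir, "02-sales.csv"))
write_csv(data.frame(note = "not sales data"), file.path(data_dir, "readme.csv"))

# Match only names ending in "sales.csv"; full.names = TRUE keeps the folder in each path
found <- list.files(data_dir, pattern = "sales\\.csv$", full.names = TRUE)
basename(found)

# Read every matched file and stack the rows, recording the source of each row
all_sales <- read_csv(found, id = "file", show_col_types = FALSE)
all_sales
```

The pattern is a regular expression: the doubled backslash escapes the dot so that it matches a literal ".", and the trailing $ anchors the match to the end of the file name, which is why readme.csv is skipped.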