Read All Files from a Directory

Introduction

It is often useful to be able to read all files in a directory into R. This simple loop allows you to read all of the excel files and stack them together to create one final dataset. All files must be arranged in the same column format and be of the same file type. We will cover several different versions of the loop I wrote.

While I am sure similar loops similar to this have been writen in the past, this is an easily customizable template that can be used with a variety of functions. We will cover several versions of the loop below.

#read in required packages
require(readxl)
require(tidyverse)

#set the working directory from which the files will be read from
setwd("C:/Users/Luke/Documents/NBA")

#create a list of the files from your target directory
file_list <- list.files(path="C:/Users/Luke/Documents/NBA")

#initiate a blank data frame, each iteration of the loop will append the data from the given file to this variable
dataset <- data.frame()

#had to specify columns to get rid of the total column
for (i in 1:length(file_list)){
  temp_data <- read_excel(file_list[i], range = cell_cols("A:H")) #each file will be read in, specify which columns you need read in to avoid any errors
  temp_data$Class <- sapply(strsplit(gsub(".xlsx", "", file_list[i]), "_"), function(x){x[2]}) #clean the data as needed, in this case I am creating a new column that indicates which file each row of data came from
  dataset <- rbind(dataset, temp_data) #for each iteration, bind the new data to the building dataset
}

If you have a particularly large number of files, a loop can take some time. In that case, it may be advantageous to use the fread() function from the data.table package. This function is highly optimized and is much quicker than other functions for reading in large data files.

The function boasts several other advantages in the final step of the loop you can stack the data by variable name rather than the typical column index. So if the columns in the raw data files appear in a different order in separate files, they will be stacked according to the column name rather than index. For example if your first file has a the column name “Reach” in the second column, but your second file has it in the third column, the function will bind the two files by looking at the column name “Reach.” This is accopmplished by seting the use.names arguement to “TRUE” See the rbindlist function documentation for further reference.

Another benefit of using fread is that you can monitor the progress of the loop. If an error message is thrown, you will know exactly what file to look at because fread confirms once a completion has been officially read in to R properly.

#read in required packages
require(data.table)

setwd("C:/Users/Luke/Documents/NBA_Leaders")

#create a list of the files from your target directory
file_list <- list.files(path="C:/Users/Luke/Documents/NBA_Leaders")

#initiate a blank data frame, each iteration of the loop will append the data from the given file to this variable
dataset <- data.frame()

#had to specify columns to get rid of the total column
for (i in 1:length(file_list)){
  temp_data <- fread(file_list[i], stringsAsFactors = F) #read in files using the fread function from the data.table package
  dataset <- rbindlist(list(dataset, temp_data), use.names = T) #for each iteration, bind the new data to the building dataset
}

Read All Files from a Directory

Luke Munyan

February 21, 2018

Introduction