Importing multiple files

Scenario: You have many files on your hard drive and you want to process them all in R in one way or another.

In the simplest case, we just import all files into a list of data frames. Each entry in the list is one data frame that represents one file.

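# First, get a handle on the files that you want to import
# here, selecting all files with a ".txt" suffix from the current working directory.
# "pattern" is a regular expression: escape the dot and anchor the match at the end of the name
my.files <- list.files(pattern = "\\.txt$")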

my.files

## [1] "loPcday200185.asd.txt" "loPcday200186.asd.txt" "loPcday200187.asd.txt"
## [4] "loPcday200188.asd.txt" "loPcday200189.asd.txt" "loPcday200190.asd.txt"
## [7] "loPcday200191.asd.txt" "loPcday200192.asd.txt" "loPcday200193.asd.txt"
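
If the files are not in the current working directory, `list.files` can be pointed at another folder. The following is a minimal sketch, assuming a hypothetical demo folder (`demo.dir`) that is created inside R's temporary directory just for illustration; the file names and contents are made up.

```r
# Hypothetical demo folder with two small tab-separated files and one
# file that should NOT be matched by the pattern.
demo.dir <- file.path(tempdir(), "import_demo")
dir.create(demo.dir, showWarnings = FALSE)

demo.df <- data.frame(Wavelength = 350:352, Value = c(0.1, 0.2, 0.3))
write.table(demo.df, file.path(demo.dir, "a.asd.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(demo.df, file.path(demo.dir, "b.asd.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
writeLines("not data", file.path(demo.dir, "notes.txt.bak"))

# full.names = TRUE returns paths that can be passed directly to read.csv,
# no matter what the current working directory is.
demo.files <- list.files(path = demo.dir, pattern = "\\.txt$",
                         full.names = TRUE)

# basename() strips the folder again, e.g. for use as a label
basename(demo.files)
```

Because the pattern is anchored with `$`, the backup file `notes.txt.bak` is not selected.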


# now importing them all at once as entries in a list
# for my files, I have to provide some options to read.csv
my.data <- lapply(my.files, 
                  read.csv, 
                  header=TRUE, sep="\t", skip = 40)

## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul

# the warning about an embedded NUL character can be ignored here - it is expected for the files from our spectrometer
# my.data is a list of data frames
class(my.data)

## [1] "list"

summary(my.data)

##       Length Class      Mode
##  [1,] 2      data.frame list
##  [2,] 2      data.frame list
##  [3,] 2      data.frame list
##  [4,] 2      data.frame list
##  [5,] 2      data.frame list
##  [6,] 2      data.frame list
##  [7,] 2      data.frame list
##  [8,] 2      data.frame list
##  [9,] 2      data.frame list

# this shows that we have a list with 9 entries
# each entry in the list represents one imported file stored as a data frame.
# Another way of looking at the structure of an object
str(my.data)

## List of 9
##  $ :'data.frame':    2151 obs. of  2 variables:
##   ..$ Wavelength       : int [1:2151] 350 351 352 353 354 355 356 357 358 359 ...
##   ..$ loPcday200185.asd: num [1:2151] 0.0148 0.0211 0.0212 0.0205 0.0181 ...
##  $ :'data.frame':    2151 obs. of  2 variables:
##   ..$ Wavelength       : int [1:2151] 350 351 352 353 354 355 356 357 358 359 ...
##   ..$ loPcday200186.asd: num [1:2151] 0.0142 0.0194 0.0219 0.0195 0.017 ...
##  $ :'data.frame':    2151 obs. of  2 variables:
##   ..$ Wavelength       : int [1:2151] 350 351 352 353 354 355 356 357 358 359 ...
##   ..$ loPcday200187.asd: num [1:2151] 0.0149 0.0209 0.0198 0.0158 0.0137 ...
##  $ :'data.frame':    2151 obs. of  2 variables:
##   ..$ Wavelength       : int [1:2151] 350 351 352 353 354 355 356 357 358 359 ...
##   ..$ loPcday200188.asd: num [1:2151] 1.05 1.04 1.04 1.04 1.04 ...
##  $ :'data.frame':    2151 obs. of  2 variables:
##   ..$ Wavelength       : int [1:2151] 350 351 352 353 354 355 356 357 358 359 ...
##   ..$ loPcday200189.asd: num [1:2151] 1.04 1.04 1.05 1.05 1.05 ...
##  $ :'data.frame':    2151 obs. of  2 variables:
##   ..$ Wavelength       : int [1:2151] 350 351 352 353 354 355 356 357 358 359 ...
##   ..$ loPcday200190.asd: num [1:2151] 0.0137 0.0158 0.015 0.0168 0.0172 ...
##  $ :'data.frame':    2151 obs. of  2 variables:
##   ..$ Wavelength       : int [1:2151] 350 351 352 353 354 355 356 357 358 359 ...
##   ..$ loPcday200191.asd: num [1:2151] 0.0146 0.016 0.013 0.0143 0.0167 ...
##  $ :'data.frame':    2151 obs. of  2 variables:
##   ..$ Wavelength       : int [1:2151] 350 351 352 353 354 355 356 357 358 359 ...
##   ..$ loPcday200192.asd: num [1:2151] 0.0103 0.014 0.0148 0.0142 0.0134 ...
##  $ :'data.frame':    2151 obs. of  2 variables:
##   ..$ Wavelength       : int [1:2151] 350 351 352 353 354 355 356 357 358 359 ...
##   ..$ loPcday200193.asd: num [1:2151] 0.0144 0.0184 0.0195 0.0181 0.0153 ...

# This tells us again that "my.data" is a list with 9 entries
# To extract one "file" (one data frame) from the list, use double square brackets
third_file <- my.data[[3]]

# looking at the top of the third file
head(third_file)

##   Wavelength loPcday200187.asd
## 1        350           0.01490
## 2        351           0.02090
## 3        352           0.01980
## 4        353           0.01580
## 5        354           0.01372
## 6        355           0.01598
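
Picking entries by position works, but it is easy to lose track of which entry belongs to which file. One option is to name the list entries after the files. The sketch below uses hypothetical demo files written to R's temporary directory (`demo.dir`, `s1.txt`, `s2.txt` are assumptions for illustration) and `read.delim`, which is `read.csv` with tab as the default separator.

```r
# Create two small hypothetical tab-separated files to import
demo.dir <- file.path(tempdir(), "named_list_demo")
dir.create(demo.dir, showWarnings = FALSE)
for (id in c("s1", "s2")) {
  write.table(data.frame(Wavelength = 350:352, Value = c(0.1, 0.2, 0.3)),
              file.path(demo.dir, paste0(id, ".txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

demo.files <- list.files(demo.dir, pattern = "\\.txt$", full.names = TRUE)
demo.data  <- lapply(demo.files, read.delim)

# name each list entry after the file it was read from
names(demo.data) <- basename(demo.files)

# double brackets return the data frame itself ...
one.file <- demo.data[["s2.txt"]]
class(one.file)

# ... single brackets return a list of length one that contains it
class(demo.data["s2.txt"])
```

With named entries, `demo.data[["s2.txt"]]` keeps working even if more files are added to the folder later and the positions shift.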


# get all the data frames from the list into a single data frame
my.data <- do.call("rbind", my.data)

## Error: names do not match previous names


# this gives an error about mismatched column names (in my case, the name of the second column differs from file to file)
# To get around that, the column names have to be made identical during import
# second try:
my.import <- function(data) {
              my.df <- read.csv(data, header=TRUE, sep="\t", skip = 40)
              names(my.df)[2] <- "Extinction" # changing the name of the offending column
              # add an identifier to the data frame that tells us which file the data came from
              my.df$Name <- as.factor(data)
              return(my.df)
}

# apply the my.import function to each file name in my.files
my.data <- lapply(my.files, my.import)

## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul


# to get from the list of data frames to a single data frame
my.data <- do.call("rbind", my.data)

# see what we got
class(my.data)

## [1] "data.frame"

summary(my.data)

##    Wavelength     Extinction                        Name     
##  Min.   : 350   Min.   :0.0103   loPcday200185.asd.txt:2151  
##  1st Qu.: 887   1st Qu.:0.0845   loPcday200186.asd.txt:2151  
##  Median :1425   Median :0.2284   loPcday200187.asd.txt:2151  
##  Mean   :1425   Mean   :0.3677   loPcday200188.asd.txt:2151  
##  3rd Qu.:1963   3rd Qu.:0.4148   loPcday200189.asd.txt:2151  
##  Max.   :2500   Max.   :1.0835   loPcday200190.asd.txt:2151  
##                                  (Other)              :6453

head(my.data)

##   Wavelength Extinction                  Name
## 1        350    0.01483 loPcday200185.asd.txt
## 2        351    0.02110 loPcday200185.asd.txt
## 3        352    0.02117 loPcday200185.asd.txt
## 4        353    0.02049 loPcday200185.asd.txt
## 5        354    0.01808 loPcday200185.asd.txt
## 6        355    0.01368 loPcday200185.asd.txt

Import multiple files into individual data frames using a “for” loop.

# seq_along() is safer than 1:length(): it does not iterate if my.files is empty
for (i in seq_along(my.files)) {
  # import the file
  cur.file <- read.csv(file = my.files[i], 
                       skip = 40, sep = "\t")
  # use the name of the second column as the name for the object.
  # This is just an example for my files; you have to find your own way
  # to identify and name your data objects
  my.name <- names(cur.file)[2]

  # create an object with that name that holds the imported data
  assign(my.name, cur.file)
}

## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
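
Objects created with `assign` are addressed by a name that only exists as a character string, so they are usually retrieved with `get` (one object) or `mget` (several objects, returned as a named list). A minimal sketch, using made-up object names (`sampleA`, `sampleB`) and made-up values instead of imported files:

```r
# hypothetical object names standing in for the column headers of real files
demo.names <- c("sampleA", "sampleB")

for (nm in demo.names) {
  # create one data frame per name, as the loop above does for each file
  assign(nm, data.frame(Wavelength = 350:352, Value = c(0.1, 0.2, 0.3)))
}

# retrieve a single object by its name
one.sample <- get("sampleA")
nrow(one.sample)

# collect several objects back into a named list
demo.list <- mget(demo.names)
names(demo.list)
```

Collecting the objects with `mget` leads back to the list-of-data-frames layout from the first example, which is often easier to process further.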

Import multiple files into one data frame via a for loop

for (i in seq_along(my.files))
{
  if (i == 1) {
    # during the first run, I'm creating an empty dummy data frame. Later, I append the content of each file to this data frame
    my.data <- data.frame(NULL)
  }

  cur.file <- read.csv(file = my.files[i], 
                       skip = 40, sep = "\t")

  # I'm using the column header of the second column as a new variable. This indicator is unique for each file
  cur.file$Name <- names(cur.file)[2]

  # before I can append the imported file to my existing data frame, I have to make sure the column names match up.
  names(cur.file)[2] <- "Refl"

  # using rbind to append the data
  my.data <- rbind(my.data, cur.file)
}

## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul

summary(my.data)

##    Wavelength        Refl            Name          
##  Min.   : 350   Min.   :0.0103   Length:19359      
##  1st Qu.: 887   1st Qu.:0.0845   Class :character  
##  Median :1425   Median :0.2284   Mode  :character  
##  Mean   :1425   Mean   :0.3677                     
##  3rd Qu.:1963   3rd Qu.:0.4148                     
##  Max.   :2500   Max.   :1.0835

head(my.data)

##   Wavelength    Refl              Name
## 1        350 0.01483 loPcday200185.asd
## 2        351 0.02110 loPcday200185.asd
## 3        352 0.02117 loPcday200185.asd
## 4        353 0.02049 loPcday200185.asd
## 5        354 0.01808 loPcday200185.asd
## 6        355 0.01368 loPcday200185.asd

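Compared to the “for” loop, the “apply” version is more concise and, once you are used to it, simpler to read. It is also often faster here, mainly because it does not grow the data frame with rbind in every iteration. But it's your choice.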
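
One reason the loop above can be slow with many files is that it calls `rbind` on every iteration, which copies the growing data frame each time. A “for” loop can avoid this by storing each file in a pre-allocated list and combining everything once at the end. The sketch below is an illustration under assumptions: the demo files are made up and written to R's temporary directory, and `read.delim` is used without the `skip = 40` that the spectrometer files need.

```r
# three small hypothetical tab-separated files
demo.dir <- file.path(tempdir(), "loop_demo")
dir.create(demo.dir, showWarnings = FALSE)
for (id in c("f1", "f2", "f3")) {
  write.table(data.frame(Wavelength = 350:352, Refl = c(0.1, 0.2, 0.3)),
              file.path(demo.dir, paste0(id, ".txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}
demo.files <- list.files(demo.dir, pattern = "\\.txt$", full.names = TRUE)

# "for" loop version: fill a pre-allocated list, then rbind once
pieces <- vector("list", length(demo.files))
for (i in seq_along(demo.files)) {
  cur.file <- read.delim(demo.files[i])
  cur.file$Name <- basename(demo.files[i])
  pieces[[i]] <- cur.file
}
loop.result <- do.call(rbind, pieces)

# "apply" version of the same import
apply.result <- do.call(rbind, lapply(demo.files, function(f) {
  d <- read.delim(f)
  d$Name <- basename(f)
  d
}))

# both approaches should produce the same data frame
identical(loop.result, apply.result)
```

Written this way, the two versions do the same work, and the choice between them is mostly a matter of readability.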