Scenario: You have many files on your hard drive and you want to process them all in R in one way or another.
In the simplest case, we just import all files into a list of data frames. Each entry in the list is one data frame that represents one file.
# First, get a handle on the files that you want to import
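# here, selecting all files with a ".txt" suffix from the current working directory.
# note: "pattern" is a regular expression, so escape the dot and anchor the match at the end of the name
my.files <- list.files(pattern = "\\.txt$")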
my.files
## [1] "loPcday200185.asd.txt" "loPcday200186.asd.txt" "loPcday200187.asd.txt"
## [4] "loPcday200188.asd.txt" "loPcday200189.asd.txt" "loPcday200190.asd.txt"
## [7] "loPcday200191.asd.txt" "loPcday200192.asd.txt" "loPcday200193.asd.txt"
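The "pattern" argument of list.files is a regular expression, not a shell wildcard. The following is a minimal sketch of what that means in practice; the directory and the file names are made up for illustration and are not part of the data set used here.

```r
# Create a throw-away directory with a few hypothetical file names
demo.dir <- file.path(tempdir(), "list_files_demo")
dir.create(demo.dir, showWarnings = FALSE)
invisible(file.create(file.path(demo.dir,
                                c("a.txt", "b.txt", "atxt.csv", "notes.txt.bak"))))

# ".txt" as a regular expression: the dot matches any character,
# and the match may occur anywhere in the file name
loose <- list.files(demo.dir, pattern = ".txt")

# "\\.txt$": a literal dot, anchored at the end of the file name
strict <- list.files(demo.dir, pattern = "\\.txt$")

loose
strict
```

If the files are not in the working directory, full.names = TRUE makes list.files return paths that can be passed to read.csv directly.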
# now importing them all at once as entries in a list
# for my files, I have to provide some options to read.csv
my.data <- lapply(my.files,
                  read.csv,
                  header = TRUE, sep = "\t", skip = 40)
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
# the warnings about an embedded NUL character can be ignored here - that is expected in the files from our spectrometer
# my.data is a list of data frames
class(my.data)
## [1] "list"
summary(my.data)
## Length Class Mode
## [1,] 2 data.frame list
## [2,] 2 data.frame list
## [3,] 2 data.frame list
## [4,] 2 data.frame list
## [5,] 2 data.frame list
## [6,] 2 data.frame list
## [7,] 2 data.frame list
## [8,] 2 data.frame list
## [9,] 2 data.frame list
# this shows that we have a list with 9 entries
# each entry in the list represents one imported file stored as a data frame.
# Another way of looking at the structure of an object
str(my.data)
## List of 9
## $ :'data.frame': 2151 obs. of 2 variables:
## ..$ Wavelength : int [1:2151] 350 351 352 353 354 355 356 357 358 359 ...
## ..$ loPcday200185.asd: num [1:2151] 0.0148 0.0211 0.0212 0.0205 0.0181 ...
## $ :'data.frame': 2151 obs. of 2 variables:
## ..$ Wavelength : int [1:2151] 350 351 352 353 354 355 356 357 358 359 ...
## ..$ loPcday200186.asd: num [1:2151] 0.0142 0.0194 0.0219 0.0195 0.017 ...
## $ :'data.frame': 2151 obs. of 2 variables:
## ..$ Wavelength : int [1:2151] 350 351 352 353 354 355 356 357 358 359 ...
## ..$ loPcday200187.asd: num [1:2151] 0.0149 0.0209 0.0198 0.0158 0.0137 ...
## $ :'data.frame': 2151 obs. of 2 variables:
## ..$ Wavelength : int [1:2151] 350 351 352 353 354 355 356 357 358 359 ...
## ..$ loPcday200188.asd: num [1:2151] 1.05 1.04 1.04 1.04 1.04 ...
## $ :'data.frame': 2151 obs. of 2 variables:
## ..$ Wavelength : int [1:2151] 350 351 352 353 354 355 356 357 358 359 ...
## ..$ loPcday200189.asd: num [1:2151] 1.04 1.04 1.05 1.05 1.05 ...
## $ :'data.frame': 2151 obs. of 2 variables:
## ..$ Wavelength : int [1:2151] 350 351 352 353 354 355 356 357 358 359 ...
## ..$ loPcday200190.asd: num [1:2151] 0.0137 0.0158 0.015 0.0168 0.0172 ...
## $ :'data.frame': 2151 obs. of 2 variables:
## ..$ Wavelength : int [1:2151] 350 351 352 353 354 355 356 357 358 359 ...
## ..$ loPcday200191.asd: num [1:2151] 0.0146 0.016 0.013 0.0143 0.0167 ...
## $ :'data.frame': 2151 obs. of 2 variables:
## ..$ Wavelength : int [1:2151] 350 351 352 353 354 355 356 357 358 359 ...
## ..$ loPcday200192.asd: num [1:2151] 0.0103 0.014 0.0148 0.0142 0.0134 ...
## $ :'data.frame': 2151 obs. of 2 variables:
## ..$ Wavelength : int [1:2151] 350 351 352 353 354 355 356 357 358 359 ...
## ..$ loPcday200193.asd: num [1:2151] 0.0144 0.0184 0.0195 0.0181 0.0153 ...
# This tells us again, that "my.data" is a list with 9 entries
# To use one "file" out of the list, use double square brackets
# (single brackets, my.data[3], would return a list of length one instead of the data frame)
third_file <- my.data[[3]]
# looking at the top of the third file
head(third_file)
## Wavelength loPcday200187.asd
## 1 350 0.01490
## 2 351 0.02090
## 3 352 0.01980
## 4 353 0.01580
## 5 354 0.01372
## 6 355 0.01598
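The difference between single and double square brackets on a list is easy to check on a small example. This is only a sketch; the two data frames and their values are invented stand-ins for imported files.

```r
# A small stand-in list of two data frames (made-up values)
demo.list <- list(data.frame(Wavelength = 350:352, Refl = c(0.1, 0.2, 0.3)),
                  data.frame(Wavelength = 350:352, Refl = c(0.4, 0.5, 0.6)))

one.bracket  <- demo.list[2]    # still a list, now of length one
two.brackets <- demo.list[[2]]  # the data frame itself

class(one.bracket)
class(two.brackets)
```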
# get all the data frames from the list into a single data frame
my.data <- do.call("rbind", my.data)
## Error: names do not match previous names
# this gives an error about incompatible names (in my case, the name of the second column differs from file to file)
# To get around that, the column names have to be made identical during import
# second try:
my.import <- function(data) {
  my.df <- read.csv(data, header = TRUE, sep = "\t", skip = 40)
  names(my.df)[2] <- "Extinction" # changing the name of the offending column
  # add an identifier to the data frame that tells us which file the data came from
  my.df$Name <- as.factor(data)
  return(my.df)
}
# apply the my.import function to each file name
my.data <- lapply(my.files, my.import)
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
# to get from the list of data frames to a single data frame
my.data <- do.call("rbind", my.data)
# see what we got
class(my.data)
## [1] "data.frame"
summary(my.data)
## Wavelength Extinction Name
## Min. : 350 Min. :0.0103 loPcday200185.asd.txt:2151
## 1st Qu.: 887 1st Qu.:0.0845 loPcday200186.asd.txt:2151
## Median :1425 Median :0.2284 loPcday200187.asd.txt:2151
## Mean :1425 Mean :0.3677 loPcday200188.asd.txt:2151
## 3rd Qu.:1963 3rd Qu.:0.4148 loPcday200189.asd.txt:2151
## Max. :2500 Max. :1.0835 loPcday200190.asd.txt:2151
## (Other) :6453
head(my.data)
## Wavelength Extinction Name
## 1 350 0.01483 loPcday200185.asd.txt
## 2 351 0.02110 loPcday200185.asd.txt
## 3 352 0.02117 loPcday200185.asd.txt
## 4 353 0.02049 loPcday200185.asd.txt
## 5 354 0.01808 loPcday200185.asd.txt
## 6 355 0.01368 loPcday200185.asd.txt
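To try the import-function approach without spectrometer files, the sketch below writes two tiny tab-separated files to a temporary directory and combines them. The names demo.dir, demo.import, "s1" and "s2" are hypothetical, and the files have no 40-line header, so no skip is needed.

```r
# Write two small tab-separated files whose second column name differs
demo.dir <- file.path(tempdir(), "import_demo")
dir.create(demo.dir, showWarnings = FALSE)
for (id in c("s1", "s2")) {
  df <- data.frame(Wavelength = 350:352, value = c(0.1, 0.2, 0.3))
  names(df)[2] <- id
  write.table(df, file.path(demo.dir, paste0(id, ".txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}
demo.files <- list.files(demo.dir, pattern = "\\.txt$", full.names = TRUE)

# Import one file, harmonise the column name, and record the file name
demo.import <- function(path) {
  out <- read.csv(path, header = TRUE, sep = "\t")
  names(out)[2] <- "Extinction"
  out$Name <- basename(path)
  out
}

# Import all files and stack them into one data frame
demo.data <- do.call("rbind", lapply(demo.files, demo.import))
str(demo.data)
```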
# Alternative with a "for" loop: import each file into its own, separately named object
for (i in seq_along(my.files)) {
  # import the file
  cur.file <- read.csv(file = my.files[i],
                       skip = 40, sep = "\t")
  # use the column name of the second column as name for the object.
  # This is just an example for my files, you have to find your own way to identify and name your data objects
  my.name <- names(cur.file)[2]
  # assign the name to the object
  assign(my.name, cur.file)
}
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
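Instead of creating one object per file with assign(), the data frames can also stay in a list whose entries carry the same names. This is a sketch with invented data, not the approach used in this post; "s1" and "s2" stand in for the column names of real files.

```r
# Two stand-in data frames whose second column name identifies the sample
demo.frames <- list(data.frame(Wavelength = 350:351, s1 = c(0.1, 0.2)),
                    data.frame(Wavelength = 350:351, s2 = c(0.3, 0.4)))

# Name each list entry after its second column
names(demo.frames) <- vapply(demo.frames, function(d) names(d)[2], character(1))

# Entries can now be retrieved by name
demo.frames[["s2"]]
```

A named list keeps the workspace tidy and can still be passed to lapply as a whole.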
# A "for" loop that appends every file to one single data frame
for (i in seq_along(my.files)) {
  if (i == 1) {
    # during the first run, I'm creating a dummy data frame without any values. Later, I append the content of each file to this data frame
    my.data <- data.frame(NULL)
  }
  cur.file <- read.csv(file = my.files[i],
                       skip = 40, sep = "\t")
  # I'm using the column header of the second column as a new variable. This indicator is unique for each file
  cur.file$Name <- names(cur.file)[2]
  # before I can append the imported file to my existing dummy data frame, I have to make sure the column names match up.
  names(cur.file)[2] <- "Refl"
  # using rbind to append the data
  my.data <- rbind(my.data, cur.file)
}
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
## Warning: line 4 appears to contain an embedded nul
summary(my.data)
## Wavelength Refl Name
## Min. : 350 Min. :0.0103 Length:19359
## 1st Qu.: 887 1st Qu.:0.0845 Class :character
## Median :1425 Median :0.2284 Mode :character
## Mean :1425 Mean :0.3677
## 3rd Qu.:1963 3rd Qu.:0.4148
## Max. :2500 Max. :1.0835
head(my.data)
## Wavelength Refl Name
## 1 350 0.01483 loPcday200185.asd
## 2 351 0.02110 loPcday200185.asd
## 3 352 0.02117 loPcday200185.asd
## 4 353 0.02049 loPcday200185.asd
## 5 354 0.01808 loPcday200185.asd
## 6 355 0.01368 loPcday200185.asd
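The loop above calls rbind once per file, so the growing data frame is copied on every iteration. A common variant, sketched here with invented in-memory data instead of real files, stores each piece in a pre-allocated list and calls rbind only once at the end. The names pieces, collected and loop.data are hypothetical.

```r
# Stand-in for the imported files: three small data frames (made-up values)
pieces <- lapply(1:3, function(i) data.frame(Wavelength = 350:352, Refl = i / 10))

# Pre-allocate a list with one slot per piece
collected <- vector("list", length(pieces))
for (i in seq_along(pieces)) {
  cur <- pieces[[i]]
  cur$Name <- paste0("file", i)  # identifier for the source of each row
  collected[[i]] <- cur
}

# Combine everything in a single rbind call
loop.data <- do.call("rbind", collected)
summary(loop.data)
```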
Compared to a “for” loop that grows a data frame with rbind on every iteration, the “apply” version is more concise, usually faster (the data is combined only once), and simpler to understand. But it's your choice.