The “readxl” package is one of a handful that helps get data from Excel workbooks (.xls, .xlsx) into R. We will use it for our example, but all of the following should have equivalent functions for (1) returning sheet names and (2) reading data from workseets:
We will load the package and use one of the example .xlsxs - present in the package directory, subdirectory “\extdata” - to demonstrate use of the read_excel() and excel_sheets() functions.
library(readxl) # install.packages("readxl") or install.packages("tidyverse")
library(plyr)
library(tibble)
Note that there is a readxl_example() function that eases loading of the example files - i.e doesn’t require specification of a file path or additional arguments - but for this exercise we will use read_excel and assume that our working directory is not one of library subdirectories. The “datasets.xlsx” has four (4) tabs, with the iris, mtcars, chickwts and quakes datasets.
# You would need to use readxl_example() or specify the path relevant to your
# terminal.
xl_data <- "C:/Users/toddfpeterson/Documents/R/win-library/3.4/readxl/extdata/datasets.xlsx"
# Before reading data, we will return the names of the sheets for later use:
excel_sheets(path = xl_data)
## [1] "iris" "mtcars" "chickwts" "quakes"
# We will now read in just the quakes data. First, specifying by sheet name, then by number
df_quakes_name <- read_excel(path = xl_data, sheet = "quakes")
df_quakes_number <- read_excel(path = xl_data, sheet = 4)
identical(df_quakes_name, df_quakes_number)
## [1] TRUE
We may want to import all sheets from a workbook. We will do this via lapply(), iterating over the names (or range) of our sheets; passing read_excel() as our function. The resulting object should be a list of four (4) data frames; one (1) per tab.
tab_names <- excel_sheets(path = xl_data)
list_all <- lapply(tab_names, function(x) read_excel(path = xl_data, sheet = x))
str(list_all)
## List of 4
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 150 obs. of 5 variables:
## ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## ..$ Species : chr [1:150] "setosa" "setosa" "setosa" "setosa" ...
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 32 obs. of 11 variables:
## ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
## ..$ disp: num [1:32] 160 160 108 258 360 ...
## ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
## ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
## ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
## ..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
## ..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
## ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
## ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 71 obs. of 2 variables:
## ..$ weight: num [1:71] 179 160 136 227 217 168 108 124 143 140 ...
## ..$ feed : chr [1:71] "horsebean" "horsebean" "horsebean" "horsebean" ...
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 5 variables:
## ..$ lat : num [1:1000] -20.4 -20.6 -26 -18 -20.4 ...
## ..$ long : num [1:1000] 182 181 184 182 182 ...
## ..$ depth : num [1:1000] 562 650 42 626 649 195 82 194 211 622 ...
## ..$ mag : num [1:1000] 4.8 4.2 5.4 4.1 4 4 4.8 4.4 4.7 4.3 ...
## ..$ stations: num [1:1000] 41 15 43 19 11 12 43 15 35 19 ...
read_xl() accepts additional arguments for specifying column types (“col_types”), column names (“col_names”), ranges of cells to read (as opposed to an entire sheet), number of lines to skip when reading in and others.
Lastly, we may have workbooks with sheets identically formatted, but with novel data; e.g. tabs per month, year, location. The following code reads in the worksheets as above, but the list is then collapsed into a single data frame. Note that there are multiple ways to accomplish this task. The example workbook has three (3) identical sheets: “Sheet1”, “…2” and “…3”. Each sheet has the same three (3) columns - “column1”, “…2” and “…3” - each with twenty (20) elements:
sheets <- excel_sheets("readxl_example.xlsx")
list_all_example <- lapply(sheets, function(x) read_excel("readxl_example.xlsx", sheet = x))
str(list_all_example)
## List of 3
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 20 obs. of 3 variables:
## ..$ column1: chr [1:20] "A" "B" "C" "D" ...
## ..$ column2: num [1:20] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ column3: chr [1:20] "A" "B" "A" "B" ...
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 20 obs. of 3 variables:
## ..$ column1: chr [1:20] "A" "B" "C" "D" ...
## ..$ column2: num [1:20] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ column3: chr [1:20] "A" "B" "A" "B" ...
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 20 obs. of 3 variables:
## ..$ column1: chr [1:20] "A" "B" "C" "D" ...
## ..$ column2: num [1:20] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ column3: chr [1:20] "A" "B" "A" "B" ...
# Then, to create single data frame:
df <- rbind.fill(list_all_example)
str(df)
## 'data.frame': 60 obs. of 3 variables:
## $ column1: chr "A" "B" "C" "D" ...
## $ column2: num 1 2 3 4 5 6 7 8 9 10 ...
## $ column3: chr "A" "B" "A" "B" ...
# Or,
df_doCall <- do.call("rbind", list_all_example)
str(df_doCall)
## Classes 'tbl_df', 'tbl' and 'data.frame': 60 obs. of 3 variables:
## $ column1: chr "A" "B" "C" "D" ...
## $ column2: num 1 2 3 4 5 6 7 8 9 10 ...
## $ column3: chr "A" "B" "A" "B" ...
# Or,
df_plyr <- plyr::ldply(list_all_example, data.frame)
str(df_plyr)
## 'data.frame': 60 obs. of 3 variables:
## $ column1: chr "A" "B" "C" "D" ...
## $ column2: num 1 2 3 4 5 6 7 8 9 10 ...
## $ column3: chr "A" "B" "A" "B" ...
In the above, “df_doCall” is of classes “tbl_df” (tibble data frame), “tbl” (tibble) and “data.frame”. “tibbles” are a “modern reimagining of the data frame…” The additional classes are a product of using the read_excel() function. The rbind.fill() and ldply() approaches coerce the resulting object to (only) data frame. However, we could coerce them back (or coerce “df_doCall” to a data frame):
df <- as.tibble(df)
df_plyr <- as.tibble(df)
identical(df, df_doCall)
## [1] TRUE
identical(df, df_plyr)
## [1] TRUE