Background

In this project, I demonstrate how to extract data from multiple excel files nested in different folders and sub-folders. Ordinarily, this data would require a person to open each folder and sub-folder. Next, open each Ms Excel file inside the folders or sub-folders, and then copy-paste the content of each of the Excel files to a master Excel file.

Copy-pasting is viable when dealing with a few data files. However, even for a very efficient Ms Excel user, cutting and pasting data from 200 Ms Excel files to create one data set is a tall order. Fortunately, the R programming language makes such tasks easy and fast.

I start by loading the required packages. Note that this step requires an internet connection.

## Load required packages ----
if(!require(pacman)){
        
        install.packages("pacman")
}

## Loading required package: pacman

pacman::p_load(tidyverse, readxl, data.table, doParallel,
               
               tufte, tint, stevetemplates, rticles)

The code chunk below hastens code execution.

## Allow for parallel computing ----
doParallel::registerDoParallel()

Raw Data

The data is as follows: One master folder named Data contains 16 sub-folders for the years 2006 to 2021. Each of the 16 sub-folders has 12 sub-folders, one for each month of a given year. Each of the 12 sub-folders contains one Ms Excel workbook with one sheet. Thus, we have 192 Ms Excel files. Note that all the data in Excel are in the same format. A sample excel file, in this case that of August 2006, has data in the following format.

head(readxl::read_xlsx("Data/2006/Aug/Aug.xlsx", col_names = FALSE), 10)

## # A tibble: 10 × 15
##    ...1  ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10 ...11 ...12 ...13
##    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 ####… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  2 #     Query made  at    04/1… 3.13… UTC   <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  3 #     Coun… like  Zamb* <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  4 #     Time  inte… from  38725 0.22… to    08/3… 0.22… UTC   <NA>  <NA>  <NA> 
##  5 ####… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  6 <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  7 ####… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  8 #     SYNO… from  6740… Kawa… (Zam… |     09-4… |     029-… |     1323  m    
##  9 ####… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 10 2006… AAXX  25064 67403 31980 904   10212 20096 48548 70500 <NA>  <NA>  <NA> 
## # … with 2 more variables: ...14 <chr>, ...15 <chr>

Expected Output

The first task is to create a single data table from the 192 excel files. The second task is to clean the data in column F. Specifically; I extract each row in column F (or column 6) that contains a 5 number digit. Put another way, I drop all rows in column F that do not have a number with exactly five digits.

Next, I break down the 5 number digit in column F into three components or variables.

Digit 1: Cloud cover.
Digits 2 and 3: Wind direction.
Digit 4 and 5: Wind speed in knots.

For example, a five digit number like 13786 will be as follows.

tribble(~ cloud, ~ wind_direction, ~ wind_speed,
        
        1, 37, 86) %>% 
        
        knitr::kable()

cloud	wind_direction	wind_speed
1	37	86

Next, I convert wind speed from knots to metres per second by multiplying the variable by 0.514. Finally, I extract the wind speed in metres per second and summarise (average) the variable by month and year.

You can download the code used in this analysis from my GitHub account, www.github.com/Karuitha.

Extracting the Data

This section details the steps I follow in reading and extracting the data.

Step 1: List all the Ms Excel Files in the Folders.

The code chunk below will list the names of all excel (.xlsx) files inside the Data folder, including sub-folders inside the Data folder.

## List files recursively, that is, including files in sub-folders
my_files <- list.files(path = "Data", 
                       
                       pattern = "*.xlsx", 
                       
                       recursive = TRUE, 
                       
                      full.names = TRUE)

path = "Data": Indicates that our data is in the Data folder.
pattern = "*.xlsx": This notation instructs R to extract all files that end in .xlsx.
recursive = TRUE: This instruction tells R to check for excel files in the Data folder and the sub-folders inside the Data folder. If set to FALSE, R will only check for excel files inside the Data folder and miss the files inside the sub-folders.
full.names = TRUE: This instruction tells R to extract the full file path of each excel file. The entire file path allows R to read files contained in the sub-folders.

Below is a listing of all the excel files detected.

## Print all files
my_files

##   [1] "Data/2006/April/April.xlsx"       "Data/2006/Aug/Aug.xlsx"          
##   [3] "Data/2006/Dec/Dec.xlsx"           "Data/2006/Feb/Feb.xlsx"          
##   [5] "Data/2006/Jan/Jan.xlsx"           "Data/2006/July/July.xlsx"        
##   [7] "Data/2006/Jun/June.xlsx"          "Data/2006/March/March.xlsx"      
##   [9] "Data/2006/May/May.xlsx"           "Data/2006/Nov/Nov.xlsx"          
##  [11] "Data/2006/Oct/Oct.xlsx"           "Data/2006/Sept/Sept.xlsx"        
##  [13] "Data/2007/Apr/Apr.xlsx"           "Data/2007/Aug/Aug.xlsx"          
##  [15] "Data/2007/Dec/Dec.xlsx"           "Data/2007/Feb/Feb.xlsx"          
##  [17] "Data/2007/Jan/Jan.xlsx"           "Data/2007/July/July.xlsx"        
##  [19] "Data/2007/June/June.xlsx"         "Data/2007/March/March.xlsx"      
##  [21] "Data/2007/May/May.xlsx"           "Data/2007/Nov/Nov.xlsx"          
##  [23] "Data/2007/Oct/Oct.xlsx"           "Data/2007/Sept/September.xlsx"   
##  [25] "Data/2008/Apr/Apr.xlsx"           "Data/2008/Aug/Aug.xlsx"          
##  [27] "Data/2008/Dec/Dec.xlsx"           "Data/2008/Feb/Feb.xlsx"          
##  [29] "Data/2008/Jan/Jan.xlsx"           "Data/2008/July/July.xlsx"        
##  [31] "Data/2008/June/June.xlsx"         "Data/2008/March/March.xlsx"      
##  [33] "Data/2008/May/May.xlsx"           "Data/2008/Nov/Nov.xlsx"          
##  [35] "Data/2008/Oct/Oct.xlsx"           "Data/2008/Sept/Sept.xlsx"        
##  [37] "Data/2009/Apr/Apr.xlsx"           "Data/2009/Aug/Aug.xlsx"          
##  [39] "Data/2009/Dec/Dec.xlsx"           "Data/2009/Feb/Feb.xlsx"          
##  [41] "Data/2009/Jan/Jan.xlsx"           "Data/2009/July/July.xlsx"        
##  [43] "Data/2009/Jun/June.xlsx"          "Data/2009/Mar/Mar.xlsx"          
##  [45] "Data/2009/May/May.xlsx"           "Data/2009/Nov/Nov.xlsx"          
##  [47] "Data/2009/Oct/Oct.xlsx"           "Data/2009/Sept/Sept.xlsx"        
##  [49] "Data/2010/Apr/Apr.xlsx"           "Data/2010/Aug/Aug.xlsx"          
##  [51] "Data/2010/Dec/Dec.xlsx"           "Data/2010/Feb/Feb.xlsx"          
##  [53] "Data/2010/Jan/Jan.xlsx"           "Data/2010/July/July.xlsx"        
##  [55] "Data/2010/June/June.xlsx"         "Data/2010/Mar/Mar.xlsx"          
##  [57] "Data/2010/May/May.xlsx"           "Data/2010/Nov/Nov.xlsx"          
##  [59] "Data/2010/Oct/Oct.xlsx"           "Data/2010/Sept/Sept.xlsx"        
##  [61] "Data/2011/Apr/Apr.xlsx"           "Data/2011/Aug/Aug.xlsx"          
##  [63] "Data/2011/Dec/Dec.xlsx"           "Data/2011/Feb/Feb/Feb.xlsx"      
##  [65] "Data/2011/Jan/Jan.xlsx"           "Data/2011/Jul/Jul.xlsx"          
##  [67] "Data/2011/June/June.xlsx"         "Data/2011/March/March.xlsx"      
##  [69] "Data/2011/May/May.xlsx"           "Data/2011/Nov/Nov.xlsx"          
##  [71] "Data/2011/Oct/Oct.xlsx"           "Data/2011/Sept/Sept.xlsx"        
##  [73] "Data/2012/Apr/Apr.xlsx"           "Data/2012/Aug/Aug.xlsx"          
##  [75] "Data/2012/Dec/Dec.xlsx"           "Data/2012/Feb/Feb.xlsx"          
##  [77] "Data/2012/Jan/Jan.xlsx"           "Data/2012/July/July.xlsx"        
##  [79] "Data/2012/June/June.xlsx"         "Data/2012/March/Mar.xlsx"        
##  [81] "Data/2012/May/May.xlsx"           "Data/2012/Nov/Nov.xlsx"          
##  [83] "Data/2012/Oct/Oct.xlsx"           "Data/2012/Sep/Sept.xlsx"         
##  [85] "Data/2013/Apr/apr.xlsx"           "Data/2013/aug/aug.xlsx"          
##  [87] "Data/2013/dec/dec.xlsx"           "Data/2013/Feb/Feb.xlsx"          
##  [89] "Data/2013/Jan/jan.xlsx"           "Data/2013/jul/jul.xlsx"          
##  [91] "Data/2013/Jun/jun.xlsx"           "Data/2013/March/march.xlsx"      
##  [93] "Data/2013/may/may.xlsx"           "Data/2013/nov/nov.xlsx"          
##  [95] "Data/2013/oct/oct.xlsx"           "Data/2013/sep/sep.xlsx"          
##  [97] "Data/2014/Apr/Apr.xlsx"           "Data/2014/Aug/Aug.xlsx"          
##  [99] "Data/2014/Dec/Dec.xlsx"           "Data/2014/Feb/Feb.xlsx"          
## [101] "Data/2014/jan/jan.xlsx"           "Data/2014/July/July.xlsx"        
## [103] "Data/2014/June/June.xlsx"         "Data/2014/March/March.xlsx"      
## [105] "Data/2014/May/May.xlsx"           "Data/2014/Nov/Nov.xlsx"          
## [107] "Data/2014/Oct/Oct.xlsx"           "Data/2014/Sept/Sept.xlsx"        
## [109] "Data/2015/Apr/Apr.xlsx"           "Data/2015/Aug/Aug.xlsx"          
## [111] "Data/2015/Dec/Dec.xlsx"           "Data/2015/Feb/Feb.xlsx"          
## [113] "Data/2015/Jan/Jan.xlsx"           "Data/2015/July/July.xlsx"        
## [115] "Data/2015/June/June.xlsx"         "Data/2015/Mar/March.xlsx"        
## [117] "Data/2015/May/May.xlsx"           "Data/2015/Nov/Nov.xlsx"          
## [119] "Data/2015/Oct/Oct.xlsx"           "Data/2015/Sept/Sept.xlsx"        
## [121] "Data/2016/Apr/Apr.xlsx"           "Data/2016/Aug/Aug.xlsx"          
## [123] "Data/2016/Dec/Dec.xlsx"           "Data/2016/Feb/Feb.xlsx"          
## [125] "Data/2016/Jan/Jan.xlsx"           "Data/2016/July/July.xlsx"        
## [127] "Data/2016/June/June.xlsx"         "Data/2016/Mar/Mar.xlsx"          
## [129] "Data/2016/May/May.xlsx"           "Data/2016/Nov/Nov.xlsx"          
## [131] "Data/2016/Oct/Oct.xlsx"           "Data/2016/Sep/Sep.xlsx"          
## [133] "Data/2017/Apr/Apr.xlsx"           "Data/2017/Aug/Aug.xlsx"          
## [135] "Data/2017/Dec/Dec.xlsx"           "Data/2017/Feb/feb.xlsx"          
## [137] "Data/2017/Jan/Jan.xlsx"           "Data/2017/July/July.xlsx"        
## [139] "Data/2017/June/June.xlsx"         "Data/2017/Marc/mar.xlsx"         
## [141] "Data/2017/May/May.xlsx"           "Data/2017/Nov/Nov.xlsx"          
## [143] "Data/2017/Oct/Oct.xlsx"           "Data/2017/Sept/Sept.xlsx"        
## [145] "Data/2018/Apr/Apr.xlsx"           "Data/2018/Aug/Aug.xlsx"          
## [147] "Data/2018/Dec/dec.xlsx"           "Data/2018/Feb/Feb.xlsx"          
## [149] "Data/2018/Jan/Jan.xlsx"           "Data/2018/July/July.xlsx"        
## [151] "Data/2018/Jun/June.xlsx"          "Data/2018/Mar/Mar.xlsx"          
## [153] "Data/2018/May/May.xlsx"           "Data/2018/Nov/Nov.xlsx"          
## [155] "Data/2018/Oct/Oct.xlsx"           "Data/2018/sep/sep.xlsx"          
## [157] "Data/2019/April/April.xlsx"       "Data/2019/Aug/Aug.xlsx"          
## [159] "Data/2019/Dec/Dec.xlsx"           "Data/2019/Feb/Feb.xlsx"          
## [161] "Data/2019/Jan/Jan.xlsx"           "Data/2019/July/July.xlsx"        
## [163] "Data/2019/Jun/June.xlsx"          "Data/2019/March/March.xlsx"      
## [165] "Data/2019/May/May.xlsx"           "Data/2019/Nov/Nov.xlsx"          
## [167] "Data/2019/Oct/Oct.xlsx"           "Data/2019/Sep/Sept.xlsx"         
## [169] "Data/2020/Apr/April.xlsx"         "Data/2020/Aug/Aug.xlsx"          
## [171] "Data/2020/Dec/Dec.xlsx"           "Data/2020/Feb/Feb.xlsx"          
## [173] "Data/2020/Jan/Jan.xlsx"           "Data/2020/July/July.xlsx"        
## [175] "Data/2020/June/June.xlsx"         "Data/2020/March/March.xlsx"      
## [177] "Data/2020/May/May.xlsx"           "Data/2020/Nov/Nov.xlsx"          
## [179] "Data/2020/Oct/Oct.xlsx"           "Data/2020/Sept/Sept.xlsx"        
## [181] "Data/2021/April/April.xlsx"       "Data/2021/Aug/Aug.xlsx"          
## [183] "Data/2021/Dec/Dec.xlsx"           "Data/2021/February/Feb.xlsx"     
## [185] "Data/2021/January/Jan.xlsx"       "Data/2021/July/July.xlsx"        
## [187] "Data/2021/June/June.xlsx"         "Data/2021/March/March.xlsx"      
## [189] "Data/2021/May/May.xlsx"           "Data/2021/November/November.xlsx"
## [191] "Data/2021/Oct/Oct.xlsx"           "Data/2021/Sept/Sept.xlsx"

Step 2: Create a Function to Read and Clean Excel Files

As noted earlier, the data in the Ms Excel files have the same format. My approach to creating a function to read and clean data is to take one Excel file and write a piece of code that reads the data and does data cleaning to my satisfaction. I then use this code to construct a function to read and clean data from any other file in the dataset.

The function below reads in the data and does the cleaning. Notable is the extract function that takes a grouped regular expression (regex) and extracts the data specified. The function will extract the first, second and third, and fourth and fifth digits and allocate them to their respective variable names. It is also critical to convert all data into characters to avoid problems when combining the files into one dataset. You can convert all columns to characters using the code;

mutate(across(.cols = everything(), as.character))

## Write function to read in the data ----
## The function takes in file name (x) and country as inputs

get_data <- function(x, country){
        
        readxl::read_xlsx(x, col_names = FALSE) %>% 
                
                ## Convert all columns to character to allow binding rows
                mutate(across(.cols = everything(), as.character)) %>% 
                
                ## Get rows for date from folder and country
                mutate(date = x, country = country) %>% 
                
                ## Select relevant columns, including F
                select(date, country, `...6`) %>% 
                
                ## Rename the column F with the values to a proper name
                rename(values = `...6`) %>% 
                
                ## Extract only columns with 5 letters
                mutate(figs = str_extract(values, "^[0-9]{5}$")) %>% 
                
                ## Drop columns with NAs in column F
                drop_na(figs) %>% 
                
                select(-figs) %>% 
                
                ## Convert values to numeric
                mutate(values = parse_number(values)) %>% 
                
                ## Extract month and year from file names
                mutate(year = str_extract(date, '[0-9]{4}'),
                       
                       month = str_extract(date, '[a-zA-Z]*\\.xlsx$'),
                       
                       month = str_remove(month, '\\.xlsx$')
                       
                       ) %>% 
                
                ## Remove the redundant column containing file name
                ## Given we have extracted year and month from it.
                select(-date) %>% 
                
                relocate(month, year) %>% 
                
                ## Extract clouds(digit 1), wind direction(digit 2 & 3)
                ## And wind speed (digit 4 and 5)
                ## retain the original variable
                extract(values, into = c("clouds", "wind_direction", 
                                         
                                         "speed"), 
                        
                        regex = "(\\d{1})(\\d{2})(\\d{2})",
                        
                        remove = FALSE) %>% 
                
                ## Convert the extracted variables to numerics
                mutate(
                       clouds = parse_number(clouds),
                       
                       speed = parse_number(speed),
                       
                       wind_direction = parse_number(wind_direction),
                       
                       ## Convert knots to m/s
                       wind_speed = speed * 0.514) %>% 
                
                ## Arrange data well
                relocate(wind_direction, .after = clouds)
                                              
                                              
}

Step 3: Loop the Function Over the Excel Files

Next, I loop the function over each Ms Excel file. We would have to write a for loop in other programming languages like Python. However, in R, we loop mainly using the apply or map family of functions, although for loops can also work in R.

## Loop the function above over all files in the dataset
final_joe_data <- map_dfr(my_files, get_data, country = "Zambia")

Long Form of the Data

The table below shows the first six rows of the data in the long format.

head(final_joe_data)

## # A tibble: 6 × 8
##   month year  country values clouds wind_direction speed wind_speed
##   <chr> <chr> <chr>    <dbl>  <dbl>          <dbl> <dbl>      <dbl>
## 1 April 2006  Zambia   60000      6              0     0       0   
## 2 April 2006  Zambia   40000      4              0     0       0   
## 3 April 2006  Zambia   11213      1             12    13       6.68
## 4 April 2006  Zambia   53302      5             33     2       1.03
## 5 April 2006  Zambia   50905      5              9     5       2.57
## 6 April 2006  Zambia   30905      3              9     5       2.57

Long Form of the Wind Speed data

The last two digits of the five digit number in column F represent the wind speed in knots. We then convert the wind speed to meters per second. In this section, I summarise the wind speed in Zambia by month of each year.

## I am writing the excel file to avoid running the
## files reading process as it takes a lot of time
fwrite(final_joe_data, "final_joe_data.csv")

## I repeat the same exercise to get long data ----
final_joe_long_data <- fread("final_joe_data.csv") %>% 
        
        ## Select relevant variables
        select(month, year, wind_speed) %>% 
        
        ## Group data by month and year for making summaries
        group_by(month, year) %>% 
        
        ## Take average of wind speed
        summarise(avg_wind_speed = mean(wind_speed, na.rm = TRUE)) %>% 
        
        ## Convert months to a standard format
        mutate(month = str_to_sentence(month),
               
               month = case_when(
                       
                       str_detect(month, '^[Aa]p.*') ~ "April",
                       
                       str_detect(month, '^[Jj]ul.*') ~ "July",
                       
                       str_detect(month, '^[Jj]un.*') ~ "June",
                       
                       str_detect(month, '^[Mm]ar.*') ~ "Mar",
                       
                       str_detect(month, '^[Nn]ov.*') ~ "Nov",
                       
                       str_detect(month, '^[Ss]ep.*') ~ "Sep",
                       
                       TRUE ~ month
               )
               
        )

## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.

head(final_joe_long_data)

## # A tibble: 6 × 3
## # Groups:   month [1]
##   month  year avg_wind_speed
##   <chr> <int>          <dbl>
## 1 April  2013           9.07
## 2 April  2007           5.53
## 3 April  2008           5.09
## 4 April  2009           5.97
## 5 April  2010           5.48
## 6 April  2011           4.75

Wide Form of the Wind Speed data

For storage efficiency, we usually store data in a wide format. Here, the columns represent months, while the rows are the years.

## Finally I average the wind speed data by year, month 
## Them make it wide 
## I read in the data
final_joe_wide_data <- fread("final_joe_data.csv") %>% 
        
        ## Select relevant variables
        select(month, year, wind_speed) %>% 
        
        ## Group data by month and year for making summaries
        group_by(month, year) %>% 
        
        ## Take average of wind speed
        summarise(avg_wind_speed = mean(wind_speed, na.rm = TRUE)) %>% 
        
        ## Convert months to a standard format
        mutate(month = str_to_sentence(month),
               
               month = case_when(
                       
                       str_detect(month, '^[Aa]p.*') ~ "April",
                       
                       str_detect(month, '^[Jj]ul.*') ~ "July",
                       
                       str_detect(month, '^[Jj]un.*') ~ "June",
                       
                       str_detect(month, '^[Mm]ar.*') ~ "Mar",
                       
                       str_detect(month, '^[Nn]ov.*') ~ "Nov",
                       
                       str_detect(month, '^[Ss]ep.*') ~ "Sep",
                       
                       TRUE ~ month
               )
               
               ) %>% 
        
        ## make the data wide
        pivot_wider(names_from = month, 
                    
                    values_from = avg_wind_speed) %>% 
        
        ## Arrange data by year
        arrange(year)

## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.

head(final_joe_wide_data)

## # A tibble: 6 × 13
##    year April   Aug   Dec   Feb   Jan  July  June   Mar   May   Nov   Oct   Sep
##   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2006  5.20  4.42  6.80  8.36  7.77  4.72  4.10  6.68  4.70  6.78  7.54  4.86
## 2  2007  5.53  4.96  5.62  4.29  6.94  3.26  4.10  6.66  4.74  6.41  5.53  3.67
## 3  2008  5.09  4.08  8.64  5.76  5.23  4.10  3.90  5.88  4.94  7.27  4.19  5.33
## 4  2009  5.97  6.23  7.63  6.56  7.42  5.38  5.18  6.07  5.57  6.77  6.52  4.90
## 5  2010  5.48  6.62  5.63  5.57  5.38  4.35  4.59  5.23  4.91  7.65  6.31  6.88
## 6  2011  4.75  5.25  5.29  5.03  7.07  3.97  4.26  6.60  3.68  6.12  5.17  4.10

Exploratory Data Analysis

I summarise the cloud, wind_direction and wind_speed in Zambia.

final_joe_data %>% 
        
        select(where(is.numeric), -year, -values, -speed) %>% 
        
        GGally::ggpairs()

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

What Could cause Code to Break?

##################################################
## IMPORTANT
## Challenges that make code to break
# Inconsistent formats xls, xlsx(see year 2017). Convert to xlsx
# Some broken excel files in the data folder. Delete them
# Inconsistent file naming. eg march.xlsx, march2021.xlsx (see year 2021 relative to other years)
# Data stacked up in one column.

Conclusion

In this analysis, I have demonstrated how to read multiple Ms Excel files from a folder and techniques for cleaning such data.

Replication files are available on my Github account (https://github.com/Karuitha/data_projects/tree/master/Send). **Corresponding author**jkaruitha@karu.ac.ke.

Extracting Data from 200 Nested Excel Files Using the R Programming Language

John Karuitha