In this project, I demonstrate how to extract data from multiple excel files nested in different folders and sub-folders. Ordinarily, this data would require a person to open each folder and sub-folder. Next, open each Ms Excel file inside the folders or sub-folders, and then copy-paste the content of each of the Excel files to a master Excel file.
Copy-pasting is viable when dealing with a few data files. However,
even for a very efficient Ms Excel user, cutting and pasting data from
200 Ms Excel files to create one data set is a tall order. Fortunately,
the R
programming language makes such tasks easy and
fast.
I start by loading the required packages. Note that this step requires an internet connection.
## Load required packages ----
if(!require(pacman)){
install.packages("pacman")
}
## Loading required package: pacman
::p_load(tidyverse, readxl, data.table, doParallel,
pacman
tufte, tint, stevetemplates, rticles)
The code chunk below hastens code execution.
## Allow for parallel computing ----
::registerDoParallel() doParallel
The data is as follows: One master folder named Data
contains 16 sub-folders for the years 2006 to 2021. Each of the 16
sub-folders has 12 sub-folders, one for each month of a given year. Each
of the 12 sub-folders contains one Ms Excel workbook with one sheet.
Thus, we have 192 Ms Excel files. Note that all the data in Excel are in
the same format. A sample excel file, in this case that of August 2006,
has data in the following format.
head(readxl::read_xlsx("Data/2006/Aug/Aug.xlsx", col_names = FALSE), 10)
## # A tibble: 10 × 15
## ...1 ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ...11 ...12 ...13
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ####… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 # Query made at 04/1… 3.13… UTC <NA> <NA> <NA> <NA> <NA> <NA>
## 3 # Coun… like Zamb* <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 # Time inte… from 38725 0.22… to 08/3… 0.22… UTC <NA> <NA> <NA>
## 5 ####… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 7 ####… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 8 # SYNO… from 6740… Kawa… (Zam… | 09-4… | 029-… | 1323 m
## 9 ####… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 10 2006… AAXX 25064 67403 31980 904 10212 20096 48548 70500 <NA> <NA> <NA>
## # … with 2 more variables: ...14 <chr>, ...15 <chr>
The first task is to create a single data table from the 192 excel files. The second task is to clean the data in column F. Specifically; I extract each row in column F (or column 6) that contains a 5 number digit. Put another way, I drop all rows in column F that do not have a number with exactly five digits.
Next, I break down the 5 number digit in column F into three components or variables.
Cloud
cover.Wind direction
.Wind speed
in knots.For example, a five digit number like 13786 will be as follows.
tribble(~ cloud, ~ wind_direction, ~ wind_speed,
1, 37, 86) %>%
::kable() knitr
cloud | wind_direction | wind_speed |
---|---|---|
1 | 37 | 86 |
Next, I convert wind speed from knots to metres per second by multiplying the variable by 0.514. Finally, I extract the wind speed in metres per second and summarise (average) the variable by month and year.
You can download the code used in this analysis from my GitHub account, www.github.com/Karuitha.
This section details the steps I follow in reading and extracting the data.
The code chunk below will list the names of all excel (.xlsx) files
inside the Data
folder, including sub-folders inside the
Data
folder.
## List files recursively, that is, including files in sub-folders
<- list.files(path = "Data",
my_files
pattern = "*.xlsx",
recursive = TRUE,
full.names = TRUE)
path = "Data"
: Indicates that our data is in the
Data
folder.pattern = "*.xlsx"
: This notation instructs R to
extract all files that end in .xlsx.recursive = TRUE
: This instruction tells R to check for
excel files in the Data
folder and the sub-folders inside
the Data
folder. If set to FALSE, R will only check for
excel files inside the Data
folder and miss the files
inside the sub-folders.full.names = TRUE
: This instruction tells R to extract
the full file path of each excel file. The entire file path allows R to
read files contained in the sub-folders.Below is a listing of all the excel files detected.
## Print all files
my_files
## [1] "Data/2006/April/April.xlsx" "Data/2006/Aug/Aug.xlsx"
## [3] "Data/2006/Dec/Dec.xlsx" "Data/2006/Feb/Feb.xlsx"
## [5] "Data/2006/Jan/Jan.xlsx" "Data/2006/July/July.xlsx"
## [7] "Data/2006/Jun/June.xlsx" "Data/2006/March/March.xlsx"
## [9] "Data/2006/May/May.xlsx" "Data/2006/Nov/Nov.xlsx"
## [11] "Data/2006/Oct/Oct.xlsx" "Data/2006/Sept/Sept.xlsx"
## [13] "Data/2007/Apr/Apr.xlsx" "Data/2007/Aug/Aug.xlsx"
## [15] "Data/2007/Dec/Dec.xlsx" "Data/2007/Feb/Feb.xlsx"
## [17] "Data/2007/Jan/Jan.xlsx" "Data/2007/July/July.xlsx"
## [19] "Data/2007/June/June.xlsx" "Data/2007/March/March.xlsx"
## [21] "Data/2007/May/May.xlsx" "Data/2007/Nov/Nov.xlsx"
## [23] "Data/2007/Oct/Oct.xlsx" "Data/2007/Sept/September.xlsx"
## [25] "Data/2008/Apr/Apr.xlsx" "Data/2008/Aug/Aug.xlsx"
## [27] "Data/2008/Dec/Dec.xlsx" "Data/2008/Feb/Feb.xlsx"
## [29] "Data/2008/Jan/Jan.xlsx" "Data/2008/July/July.xlsx"
## [31] "Data/2008/June/June.xlsx" "Data/2008/March/March.xlsx"
## [33] "Data/2008/May/May.xlsx" "Data/2008/Nov/Nov.xlsx"
## [35] "Data/2008/Oct/Oct.xlsx" "Data/2008/Sept/Sept.xlsx"
## [37] "Data/2009/Apr/Apr.xlsx" "Data/2009/Aug/Aug.xlsx"
## [39] "Data/2009/Dec/Dec.xlsx" "Data/2009/Feb/Feb.xlsx"
## [41] "Data/2009/Jan/Jan.xlsx" "Data/2009/July/July.xlsx"
## [43] "Data/2009/Jun/June.xlsx" "Data/2009/Mar/Mar.xlsx"
## [45] "Data/2009/May/May.xlsx" "Data/2009/Nov/Nov.xlsx"
## [47] "Data/2009/Oct/Oct.xlsx" "Data/2009/Sept/Sept.xlsx"
## [49] "Data/2010/Apr/Apr.xlsx" "Data/2010/Aug/Aug.xlsx"
## [51] "Data/2010/Dec/Dec.xlsx" "Data/2010/Feb/Feb.xlsx"
## [53] "Data/2010/Jan/Jan.xlsx" "Data/2010/July/July.xlsx"
## [55] "Data/2010/June/June.xlsx" "Data/2010/Mar/Mar.xlsx"
## [57] "Data/2010/May/May.xlsx" "Data/2010/Nov/Nov.xlsx"
## [59] "Data/2010/Oct/Oct.xlsx" "Data/2010/Sept/Sept.xlsx"
## [61] "Data/2011/Apr/Apr.xlsx" "Data/2011/Aug/Aug.xlsx"
## [63] "Data/2011/Dec/Dec.xlsx" "Data/2011/Feb/Feb/Feb.xlsx"
## [65] "Data/2011/Jan/Jan.xlsx" "Data/2011/Jul/Jul.xlsx"
## [67] "Data/2011/June/June.xlsx" "Data/2011/March/March.xlsx"
## [69] "Data/2011/May/May.xlsx" "Data/2011/Nov/Nov.xlsx"
## [71] "Data/2011/Oct/Oct.xlsx" "Data/2011/Sept/Sept.xlsx"
## [73] "Data/2012/Apr/Apr.xlsx" "Data/2012/Aug/Aug.xlsx"
## [75] "Data/2012/Dec/Dec.xlsx" "Data/2012/Feb/Feb.xlsx"
## [77] "Data/2012/Jan/Jan.xlsx" "Data/2012/July/July.xlsx"
## [79] "Data/2012/June/June.xlsx" "Data/2012/March/Mar.xlsx"
## [81] "Data/2012/May/May.xlsx" "Data/2012/Nov/Nov.xlsx"
## [83] "Data/2012/Oct/Oct.xlsx" "Data/2012/Sep/Sept.xlsx"
## [85] "Data/2013/Apr/apr.xlsx" "Data/2013/aug/aug.xlsx"
## [87] "Data/2013/dec/dec.xlsx" "Data/2013/Feb/Feb.xlsx"
## [89] "Data/2013/Jan/jan.xlsx" "Data/2013/jul/jul.xlsx"
## [91] "Data/2013/Jun/jun.xlsx" "Data/2013/March/march.xlsx"
## [93] "Data/2013/may/may.xlsx" "Data/2013/nov/nov.xlsx"
## [95] "Data/2013/oct/oct.xlsx" "Data/2013/sep/sep.xlsx"
## [97] "Data/2014/Apr/Apr.xlsx" "Data/2014/Aug/Aug.xlsx"
## [99] "Data/2014/Dec/Dec.xlsx" "Data/2014/Feb/Feb.xlsx"
## [101] "Data/2014/jan/jan.xlsx" "Data/2014/July/July.xlsx"
## [103] "Data/2014/June/June.xlsx" "Data/2014/March/March.xlsx"
## [105] "Data/2014/May/May.xlsx" "Data/2014/Nov/Nov.xlsx"
## [107] "Data/2014/Oct/Oct.xlsx" "Data/2014/Sept/Sept.xlsx"
## [109] "Data/2015/Apr/Apr.xlsx" "Data/2015/Aug/Aug.xlsx"
## [111] "Data/2015/Dec/Dec.xlsx" "Data/2015/Feb/Feb.xlsx"
## [113] "Data/2015/Jan/Jan.xlsx" "Data/2015/July/July.xlsx"
## [115] "Data/2015/June/June.xlsx" "Data/2015/Mar/March.xlsx"
## [117] "Data/2015/May/May.xlsx" "Data/2015/Nov/Nov.xlsx"
## [119] "Data/2015/Oct/Oct.xlsx" "Data/2015/Sept/Sept.xlsx"
## [121] "Data/2016/Apr/Apr.xlsx" "Data/2016/Aug/Aug.xlsx"
## [123] "Data/2016/Dec/Dec.xlsx" "Data/2016/Feb/Feb.xlsx"
## [125] "Data/2016/Jan/Jan.xlsx" "Data/2016/July/July.xlsx"
## [127] "Data/2016/June/June.xlsx" "Data/2016/Mar/Mar.xlsx"
## [129] "Data/2016/May/May.xlsx" "Data/2016/Nov/Nov.xlsx"
## [131] "Data/2016/Oct/Oct.xlsx" "Data/2016/Sep/Sep.xlsx"
## [133] "Data/2017/Apr/Apr.xlsx" "Data/2017/Aug/Aug.xlsx"
## [135] "Data/2017/Dec/Dec.xlsx" "Data/2017/Feb/feb.xlsx"
## [137] "Data/2017/Jan/Jan.xlsx" "Data/2017/July/July.xlsx"
## [139] "Data/2017/June/June.xlsx" "Data/2017/Marc/mar.xlsx"
## [141] "Data/2017/May/May.xlsx" "Data/2017/Nov/Nov.xlsx"
## [143] "Data/2017/Oct/Oct.xlsx" "Data/2017/Sept/Sept.xlsx"
## [145] "Data/2018/Apr/Apr.xlsx" "Data/2018/Aug/Aug.xlsx"
## [147] "Data/2018/Dec/dec.xlsx" "Data/2018/Feb/Feb.xlsx"
## [149] "Data/2018/Jan/Jan.xlsx" "Data/2018/July/July.xlsx"
## [151] "Data/2018/Jun/June.xlsx" "Data/2018/Mar/Mar.xlsx"
## [153] "Data/2018/May/May.xlsx" "Data/2018/Nov/Nov.xlsx"
## [155] "Data/2018/Oct/Oct.xlsx" "Data/2018/sep/sep.xlsx"
## [157] "Data/2019/April/April.xlsx" "Data/2019/Aug/Aug.xlsx"
## [159] "Data/2019/Dec/Dec.xlsx" "Data/2019/Feb/Feb.xlsx"
## [161] "Data/2019/Jan/Jan.xlsx" "Data/2019/July/July.xlsx"
## [163] "Data/2019/Jun/June.xlsx" "Data/2019/March/March.xlsx"
## [165] "Data/2019/May/May.xlsx" "Data/2019/Nov/Nov.xlsx"
## [167] "Data/2019/Oct/Oct.xlsx" "Data/2019/Sep/Sept.xlsx"
## [169] "Data/2020/Apr/April.xlsx" "Data/2020/Aug/Aug.xlsx"
## [171] "Data/2020/Dec/Dec.xlsx" "Data/2020/Feb/Feb.xlsx"
## [173] "Data/2020/Jan/Jan.xlsx" "Data/2020/July/July.xlsx"
## [175] "Data/2020/June/June.xlsx" "Data/2020/March/March.xlsx"
## [177] "Data/2020/May/May.xlsx" "Data/2020/Nov/Nov.xlsx"
## [179] "Data/2020/Oct/Oct.xlsx" "Data/2020/Sept/Sept.xlsx"
## [181] "Data/2021/April/April.xlsx" "Data/2021/Aug/Aug.xlsx"
## [183] "Data/2021/Dec/Dec.xlsx" "Data/2021/February/Feb.xlsx"
## [185] "Data/2021/January/Jan.xlsx" "Data/2021/July/July.xlsx"
## [187] "Data/2021/June/June.xlsx" "Data/2021/March/March.xlsx"
## [189] "Data/2021/May/May.xlsx" "Data/2021/November/November.xlsx"
## [191] "Data/2021/Oct/Oct.xlsx" "Data/2021/Sept/Sept.xlsx"
As noted earlier, the data in the Ms Excel files have the same format. My approach to creating a function to read and clean data is to take one Excel file and write a piece of code that reads the data and does data cleaning to my satisfaction. I then use this code to construct a function to read and clean data from any other file in the dataset.
The function below reads in the data and does the cleaning. Notable
is the extract
function that takes a grouped regular
expression (regex
) and extracts the data specified. The
function will extract the first, second and third, and fourth and fifth
digits and allocate them to their respective variable names. It is also
critical to convert all data into characters to avoid problems when
combining the files into one dataset. You can convert all columns to
characters using the code;
mutate(across(.cols = everything(), as.character))
## Write function to read in the data ----
## The function takes in file name (x) and country as inputs
<- function(x, country){
get_data
::read_xlsx(x, col_names = FALSE) %>%
readxl
## Convert all columns to character to allow binding rows
mutate(across(.cols = everything(), as.character)) %>%
## Get rows for date from folder and country
mutate(date = x, country = country) %>%
## Select relevant columns, including F
select(date, country, `...6`) %>%
## Rename the column F with the values to a proper name
rename(values = `...6`) %>%
## Extract only columns with 5 letters
mutate(figs = str_extract(values, "^[0-9]{5}$")) %>%
## Drop columns with NAs in column F
drop_na(figs) %>%
select(-figs) %>%
## Convert values to numeric
mutate(values = parse_number(values)) %>%
## Extract month and year from file names
mutate(year = str_extract(date, '[0-9]{4}'),
month = str_extract(date, '[a-zA-Z]*\\.xlsx$'),
month = str_remove(month, '\\.xlsx$')
%>%
)
## Remove the redundant column containing file name
## Given we have extracted year and month from it.
select(-date) %>%
relocate(month, year) %>%
## Extract clouds(digit 1), wind direction(digit 2 & 3)
## And wind speed (digit 4 and 5)
## retain the original variable
extract(values, into = c("clouds", "wind_direction",
"speed"),
regex = "(\\d{1})(\\d{2})(\\d{2})",
remove = FALSE) %>%
## Convert the extracted variables to numerics
mutate(
clouds = parse_number(clouds),
speed = parse_number(speed),
wind_direction = parse_number(wind_direction),
## Convert knots to m/s
wind_speed = speed * 0.514) %>%
## Arrange data well
relocate(wind_direction, .after = clouds)
}
Next, I loop the function over each Ms Excel file. We would have to
write a for
loop in other programming languages like
Python. However, in R, we loop mainly using the apply
or
map
family of functions, although for
loops
can also work in R.
## Loop the function above over all files in the dataset
<- map_dfr(my_files, get_data, country = "Zambia") final_joe_data
The table below shows the first six rows of the data in the long format.
head(final_joe_data)
## # A tibble: 6 × 8
## month year country values clouds wind_direction speed wind_speed
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 April 2006 Zambia 60000 6 0 0 0
## 2 April 2006 Zambia 40000 4 0 0 0
## 3 April 2006 Zambia 11213 1 12 13 6.68
## 4 April 2006 Zambia 53302 5 33 2 1.03
## 5 April 2006 Zambia 50905 5 9 5 2.57
## 6 April 2006 Zambia 30905 3 9 5 2.57
The last two digits of the five digit number in column F represent the wind speed in knots. We then convert the wind speed to meters per second. In this section, I summarise the wind speed in Zambia by month of each year.
## I am writing the excel file to avoid running the
## files reading process as it takes a lot of time
fwrite(final_joe_data, "final_joe_data.csv")
## I repeat the same exercise to get long data ----
<- fread("final_joe_data.csv") %>%
final_joe_long_data
## Select relevant variables
select(month, year, wind_speed) %>%
## Group data by month and year for making summaries
group_by(month, year) %>%
## Take average of wind speed
summarise(avg_wind_speed = mean(wind_speed, na.rm = TRUE)) %>%
## Convert months to a standard format
mutate(month = str_to_sentence(month),
month = case_when(
str_detect(month, '^[Aa]p.*') ~ "April",
str_detect(month, '^[Jj]ul.*') ~ "July",
str_detect(month, '^[Jj]un.*') ~ "June",
str_detect(month, '^[Mm]ar.*') ~ "Mar",
str_detect(month, '^[Nn]ov.*') ~ "Nov",
str_detect(month, '^[Ss]ep.*') ~ "Sep",
TRUE ~ month
)
)
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
head(final_joe_long_data)
## # A tibble: 6 × 3
## # Groups: month [1]
## month year avg_wind_speed
## <chr> <int> <dbl>
## 1 April 2013 9.07
## 2 April 2007 5.53
## 3 April 2008 5.09
## 4 April 2009 5.97
## 5 April 2010 5.48
## 6 April 2011 4.75
For storage efficiency, we usually store data in a wide format. Here, the columns represent months, while the rows are the years.
## Finally I average the wind speed data by year, month
## Them make it wide
## I read in the data
<- fread("final_joe_data.csv") %>%
final_joe_wide_data
## Select relevant variables
select(month, year, wind_speed) %>%
## Group data by month and year for making summaries
group_by(month, year) %>%
## Take average of wind speed
summarise(avg_wind_speed = mean(wind_speed, na.rm = TRUE)) %>%
## Convert months to a standard format
mutate(month = str_to_sentence(month),
month = case_when(
str_detect(month, '^[Aa]p.*') ~ "April",
str_detect(month, '^[Jj]ul.*') ~ "July",
str_detect(month, '^[Jj]un.*') ~ "June",
str_detect(month, '^[Mm]ar.*') ~ "Mar",
str_detect(month, '^[Nn]ov.*') ~ "Nov",
str_detect(month, '^[Ss]ep.*') ~ "Sep",
TRUE ~ month
)
%>%
)
## make the data wide
pivot_wider(names_from = month,
values_from = avg_wind_speed) %>%
## Arrange data by year
arrange(year)
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
head(final_joe_wide_data)
## # A tibble: 6 × 13
## year April Aug Dec Feb Jan July June Mar May Nov Oct Sep
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2006 5.20 4.42 6.80 8.36 7.77 4.72 4.10 6.68 4.70 6.78 7.54 4.86
## 2 2007 5.53 4.96 5.62 4.29 6.94 3.26 4.10 6.66 4.74 6.41 5.53 3.67
## 3 2008 5.09 4.08 8.64 5.76 5.23 4.10 3.90 5.88 4.94 7.27 4.19 5.33
## 4 2009 5.97 6.23 7.63 6.56 7.42 5.38 5.18 6.07 5.57 6.77 6.52 4.90
## 5 2010 5.48 6.62 5.63 5.57 5.38 4.35 4.59 5.23 4.91 7.65 6.31 6.88
## 6 2011 4.75 5.25 5.29 5.03 7.07 3.97 4.26 6.60 3.68 6.12 5.17 4.10
I summarise the cloud, wind_direction and wind_speed in Zambia.
%>%
final_joe_data
select(where(is.numeric), -year, -values, -speed) %>%
::ggpairs() GGally
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##################################################
## IMPORTANT
## Challenges that make code to break
# Inconsistent formats xls, xlsx(see year 2017). Convert to xlsx
# Some broken excel files in the data folder. Delete them
# Inconsistent file naming. eg march.xlsx, march2021.xlsx (see year 2021 relative to other years)
# Data stacked up in one column.
In this analysis, I have demonstrated how to read multiple Ms Excel files from a folder and techniques for cleaning such data.
Replication files are available on my Github account (https://github.com/Karuitha/data_projects/tree/master/Send). **Corresponding author**jkaruitha@karu.ac.ke.