We recently ran into a problem I'm sure many people have run into before. To get the data we need to run our models, we have to combine hundreds of daily status check reports into a single location. The question came down to how to do this effectively for the initial dataframe: once the data is gathered in one place, exporting it to another format and maintaining it are both easy. I was tasked with developing a reproducible process for all of our various daily status reports.
There were a few requirements I had before starting this project:
#list.files(path = "C:/temp", pattern = "*.xlsx",full.names = TRUE)
(Screenshot: the data folder containing the daily report workbooks)
My team ensured all the Excel files were in the same location. As the report had not changed in two years, the columns all had the same names. The only issue with the report is that the data I need to extract is on a separate tab. The system prints the report with an Info tab and a Data tab. I don't need the Info tab; however, as it is the first tab, the normal Excel load procedure will pull the information from this tab into the dataframe, requiring additional clean-up.
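Before writing the loader, it can help to confirm the sheet layout of one of the workbooks. Here is a minimal sketch using readxl::excel_sheets(); the file name is a hypothetical example of one of the reports in C:/temp:
library(readxl)

# Hypothetical example file -- the real reports all live in C:/temp
fname <- "C:/temp/daily_status_report.xlsx"

# List the sheet names; the Info tab comes first and the data tab second
excel_sheets(fname)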
The read_excel() function in the readxl package lets us specify two very important arguments: sheet and skip.
(Screenshot: the Excel report, with the Info tab first and the header on row 4 of the data tab)
As we can see, the report not only has a tab we need to skip, but the header row also sits on row 4. To get around this, we can use read_excel(fname, sheet = 2, skip = 3), where fname is the file name. The sheet argument selects which sheet to read, and skip denotes how many rows at the top to skip before reading the data. I set up a function to apply read_excel to any file name I pass into it.
# Read the data tab (sheet 2) of a report, skipping the three rows above the header
read_filename <- function(fname) {
  read_excel(fname, sheet = 2, skip = 3)
}
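Before mapping over every report, it can be worth sanity-checking the function on a single file; the path below is a hypothetical example:
# Check that one report loads with the expected columns
single_report <- read_filename("C:/temp/daily_status_report.xlsx")
head(single_report)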
With this function, I can now use list.files() (from base R) to build a vector of file paths and iterate the function over each file with purrr::map_df(), combining the results into a single dataframe.
library(tidyverse)   # provides the %>% pipe, purrr::map_df() and stringr::str_match()

# Find every .xlsx report in the folder and row-bind them into one dataframe
excel_tbl <-
  list.files(path = "C:/temp",
             pattern = "*.xlsx",
             full.names = TRUE) %>%
  map_df(~ read_filename(.))
## New names:
## * Status -> Status...1
## * Status -> Status...5
colnames(excel_tbl)
## [1] "Status...1" "Date / time"
## [3] "Total time (ms)" "Status...5"
## [5] "Description" "Checkpoint"
## [7] "Resolved IP address" "Resolve time (ms)"
## [9] "Connection time (ms)" "Download time (ms)"
## [11] "Download size (bytes)" "Error level"
## [13] "Status code"
Since this is now a dataframe, all the normal operations can take place: name changes, number extraction, and so on.
# Give the two auto-repaired Status columns meaningful names
names(excel_tbl)[names(excel_tbl) == "Status...1"] <- "Status"
names(excel_tbl)[names(excel_tbl) == "Status...5"] <- "Error Code"

# Pull the one- or two-digit step number out of the Description field
excel_tbl$step_number <- str_match(excel_tbl$Description, "[[:digit:]]{1,2}")
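One note on the extraction: str_match() returns a character matrix, which is why the printed tibble below labels the column step_number[,1]. If a plain character or numeric column is preferred, a small sketch using stringr::str_extract() instead:
# str_extract() returns a character vector rather than a matrix
excel_tbl$step_number <- str_extract(excel_tbl$Description, "[[:digit:]]{1,2}")

# Optionally convert it to a number for sorting or filtering
excel_tbl$step_number <- as.numeric(excel_tbl$step_number)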
The final dataframe is now usable for any activity we require. It can be exported to .csv for Python or saved as .RDS for safekeeping and future work in R; a minimal export sketch follows the output below.
head(excel_tbl,20)
## # A tibble: 20 x 14
## Status `Date / time` `Total time (ms~ `Error Code` Description
## <chr> <chr> <dbl> <dbl> <chr>
## 1 No er~ 27/09/2019 2~ 138 0 OK
## 2 No er~ 27/09/2019 2~ 44690 0 OK
## 3 No er~ 27/09/2019 2~ 39761 0 OK
## 4 No er~ 27/09/2019 2~ 226 0 OK
## 5 No er~ 27/09/2019 2~ 31357 0 OK
## 6 No er~ 27/09/2019 2~ 57978 0 OK
## 7 No er~ 27/09/2019 2~ 36247 0 OK
## 8 No er~ 27/09/2019 2~ 35859 0 OK
## 9 No er~ 27/09/2019 2~ 47449 0 OK
## 10 No er~ 27/09/2019 2~ 30390 0 OK
## 11 Uncon~ 27/09/2019 2~ 11349 7004 5: Element~
## 12 No er~ 27/09/2019 2~ 33026 0 OK
## 13 No er~ 27/09/2019 2~ 38169 0 OK
## 14 No er~ 27/09/2019 2~ 38815 0 OK
## 15 No er~ 27/09/2019 2~ 34350 0 OK
## 16 No er~ 27/09/2019 2~ 103 0 OK
## 17 No er~ 27/09/2019 2~ 87 0 OK
## 18 Uncon~ 27/09/2019 2~ 45401 7004 11: 'JE_Te~
## 19 No er~ 27/09/2019 2~ 43236 0 OK
## 20 No er~ 27/09/2019 2~ 52983 0 OK
## # ... with 9 more variables: Checkpoint <chr>, `Resolved IP
## # address` <chr>, `Resolve time (ms)` <dbl>, `Connection time
## # (ms)` <dbl>, `Download time (ms)` <dbl>, `Download size
## # (bytes)` <dbl>, `Error level` <chr>, `Status code` <dbl>,
## # step_number[,1] <chr>
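For completeness, here is a minimal export sketch; the output paths are hypothetical:
# Export to .csv for the Python side of the workflow
write.csv(excel_tbl, "C:/temp/status_checks_combined.csv", row.names = FALSE)

# Save as .RDS for safekeeping and future work in R
saveRDS(excel_tbl, "C:/temp/status_checks_combined.rds")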
I hope this helps other people in the same situation.