This document demonstrates a few ways to generate excel files files using R code. The main purpose is to show how it is possible to manipulate a large data set to automatically generate XLSX files based on selected criteria and how to recombine XLSX files in a folder into a single workbook.
The code provided subdivides the data set by the education field and creates an an XLSX file for each value. The code can easily be modified to subdivide the data set by another field such as occupation.
The code also demonstrates how to: - Combine multiple XLSX files into individual worksheets in a single workbook - Combine multiple XLSX files onto a single worksheet in a workbook - Split a data source into separate worksheets in a workbook - Recombine excel tabs into a single worksheet
The first code chunk imports data from the US Census Data data set, obtained from
https://archive.ics.uci.edu/ml/machine-learning-databases/adult/
https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
The code also generates summary stats and charts on the data set based on the education field of the data set.
shell('cls')
library(tidyverse)
library(plotly)
library(readxl)
library(openxlsx)
library(formattable)
library(knitr)
#Set filepath. Create before running script.
path <- "C:/education/"
options(pillar.sigfig = 5)
# Adult Census Income:
# https://archive.ics.uci.edu/ml/machine-learning-databases/adult/
# https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
# 1. Read data from internet source
# Download US census data
census_uci <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
sep = ',', fill = F)
# Create a dataframe of column names
colnames(census_uci) <- c('age', 'workclass', 'fnlwgt', 'education',
'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex',
'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income')
# 1.1 Inspect the Data
paste("Data set dimensions:")
## [1] "Data set dimensions:"
paste(ncol(census_uci), "fields") #no of fields
## [1] "15 fields"
paste(nrow(census_uci),"records") #no of records
## [1] "32561 records"
knitr::kable(head(census_uci, n=10), caption = "Sample of the Census Data")
| age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |
| 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
| 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
| 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |
# Summarise the data
paste("Summary data for the fields of the data set:")
## [1] "Summary data for the fields of the data set:"
summary(census_uci)
## age workclass fnlwgt
## Min. :17.00 Private :22696 Min. : 12285
## 1st Qu.:28.00 Self-emp-not-inc: 2541 1st Qu.: 117827
## Median :37.00 Local-gov : 2093 Median : 178356
## Mean :38.58 ? : 1836 Mean : 189778
## 3rd Qu.:48.00 State-gov : 1298 3rd Qu.: 237051
## Max. :90.00 Self-emp-inc : 1116 Max. :1484705
## (Other) : 981
## education education_num marital_status
## HS-grad :10501 Min. : 1.00 Divorced : 4443
## Some-college: 7291 1st Qu.: 9.00 Married-AF-spouse : 23
## Bachelors : 5355 Median :10.00 Married-civ-spouse :14976
## Masters : 1723 Mean :10.08 Married-spouse-absent: 418
## Assoc-voc : 1382 3rd Qu.:12.00 Never-married :10683
## 11th : 1175 Max. :16.00 Separated : 1025
## (Other) : 5134 Widowed : 993
## occupation relationship
## Prof-specialty :4140 Husband :13193
## Craft-repair :4099 Not-in-family : 8305
## Exec-managerial:4066 Other-relative: 981
## Adm-clerical :3770 Own-child : 5068
## Sales :3650 Unmarried : 3446
## Other-service :3295 Wife : 1568
## (Other) :9541
## race sex capital_gain
## Amer-Indian-Eskimo: 311 Female:10771 Min. : 0
## Asian-Pac-Islander: 1039 Male :21790 1st Qu.: 0
## Black : 3124 Median : 0
## Other : 271 Mean : 1078
## White :27816 3rd Qu.: 0
## Max. :99999
##
## capital_loss hours_per_week native_country income
## Min. : 0.0 Min. : 1.00 United-States:29170 <=50K:24720
## 1st Qu.: 0.0 1st Qu.:40.00 Mexico : 643 >50K : 7841
## Median : 0.0 Median :40.00 ? : 583
## Mean : 87.3 Mean :40.44 Philippines : 198
## 3rd Qu.: 0.0 3rd Qu.:45.00 Germany : 137
## Max. :4356.0 Max. :99.00 Canada : 121
## (Other) : 1709
# Record count by education level
count_ed<-census_uci %>%
group_by(education) %>%
summarise(n())
# Barplot of record count by Education
p1<-census_uci %>%
ggplot(aes(x=education))+geom_bar(color="grey",fill="blue")+ggtitle("Count by Education Level")+ theme(axis.text.x = element_text(angle = 90))
ggplotly(p1)
# Summary of education count
# paste("Summary of education count")
knitr::kable(count_ed, caption = "Summary of education count" )
| education | n() |
|---|---|
| 10th | 933 |
| 11th | 1175 |
| 12th | 433 |
| 1st-4th | 168 |
| 5th-6th | 333 |
| 7th-8th | 646 |
| 9th | 514 |
| Assoc-acdm | 1067 |
| Assoc-voc | 1382 |
| Bachelors | 5355 |
| Doctorate | 413 |
| HS-grad | 10501 |
| Masters | 1723 |
| Preschool | 51 |
| Prof-school | 576 |
| Some-college | 7291 |
The following code chunk subdivides the initial data set by the education field and creates a single tab XLSX file for each value.
Be sure to set the correct folder path in the previous code chunk, which must be run in advance of this one. The XLSX files will load to C:/education/, unless adjusted.
# 2. Split data and export to separate Excel files
sp1 <- split(census_uci,census_uci$education)
# Get unique education values
names <- census_uci$education %>% unique()
# Loop through the the different education levels
for (i in names){
# write.xlsx(sp1[i],paste(i,".xlsx"))
write.xlsx(sp1[i],file.path(path,paste(i,".xlsx")))
}
Upon running this chunk, 16 files should appear in the target folder each named for a value of the education field
This code chunk loops through all the files in a folder and exports the data into single worksheet in a new Excel file. If executed successfully, the record set in the destination file will be the same as the initial data set.
In order to work properly, all files in the source folder must have the same data structure. (Be sure to remove unnecessary files from the source folder). This code chunk should work on any folder with a set of similarly structured files. It is not limited to the files created previously. Simply adjust the path variable in the first code chunk.
# 3. Combine Excel files in a folder into a single workbook tab
# Store file path and destination file
merge_file_name <- paste(path,"single_tab_merged_file.xlsx")
# Generate a list of all the files in path folder
filenames_list <- list.files(path= path, full.names=TRUE)
# Read the files in the folder
paste("A List of the merged files")
## [1] "A List of the merged files"
All <- lapply(filenames_list,function(filename){
print(paste("Merging",filename,sep = " "))
read.xlsx(filename)
})
## [1] "Merging C:/education/ 10th .xlsx"
## [1] "Merging C:/education/ 11th .xlsx"
## [1] "Merging C:/education/ 12th .xlsx"
## [1] "Merging C:/education/ 1st-4th .xlsx"
## [1] "Merging C:/education/ 5th-6th .xlsx"
## [1] "Merging C:/education/ 7th-8th .xlsx"
## [1] "Merging C:/education/ 9th .xlsx"
## [1] "Merging C:/education/ Assoc-acdm .xlsx"
## [1] "Merging C:/education/ Assoc-voc .xlsx"
## [1] "Merging C:/education/ Bachelors .xlsx"
## [1] "Merging C:/education/ Doctorate .xlsx"
## [1] "Merging C:/education/ education_tabs.xlsx"
## [1] "Merging C:/education/ HS-grad .xlsx"
## [1] "Merging C:/education/ Masters .xlsx"
## [1] "Merging C:/education/ Preschool .xlsx"
## [1] "Merging C:/education/ Prof-school .xlsx"
## [1] "Merging C:/education/ single_tab_merged_file.xlsx"
## [1] "Merging C:/education/ Some-college .xlsx"
## [1] "Merging C:/education/Occupations.xlsx"
## [1] "Merging C:/education/Single_sheet_recombined_tabs.xlsx"
# Create a dataframe combining data from all files
df <- do.call(rbind.data.frame, All)
write.xlsx(df,merge_file_name)
If executed successfully, single_tab_merged_file.XLSX should appear in your source folder. You may open it to verify the number of fields (15) and records (32,561).
The following code chunk works the same as the previous one except that it creates separate destination worksheet tabs from each file. You may use this code chunk as an alternative to the previous one.
# 4. Combine Excel files in a folder into separate tabs
# Only for files with a similar data structure.
# Remove unwanted files from folder
# Store file path and destination file
tabbed_file_name <- paste(path,"education_tabs.xlsx")
# Store file names in a dataframe
filenames_list <- list.files(path= path, full.names=TRUE)
# Read the files in folder
paste("A List of the transferred files")
## [1] "A List of the transferred files"
All <- lapply(filenames_list,function(filename){
print(paste("Merging",filename,sep = " "))
read.xlsx(filename)
})
## [1] "Merging C:/education/ 10th .xlsx"
## [1] "Merging C:/education/ 11th .xlsx"
## [1] "Merging C:/education/ 12th .xlsx"
## [1] "Merging C:/education/ 1st-4th .xlsx"
## [1] "Merging C:/education/ 5th-6th .xlsx"
## [1] "Merging C:/education/ 7th-8th .xlsx"
## [1] "Merging C:/education/ 9th .xlsx"
## [1] "Merging C:/education/ Assoc-acdm .xlsx"
## [1] "Merging C:/education/ Assoc-voc .xlsx"
## [1] "Merging C:/education/ Bachelors .xlsx"
## [1] "Merging C:/education/ Doctorate .xlsx"
## [1] "Merging C:/education/ education_tabs.xlsx"
## [1] "Merging C:/education/ HS-grad .xlsx"
## [1] "Merging C:/education/ Masters .xlsx"
## [1] "Merging C:/education/ Preschool .xlsx"
## [1] "Merging C:/education/ Prof-school .xlsx"
## [1] "Merging C:/education/ single_tab_merged_file.xlsx"
## [1] "Merging C:/education/ Some-college .xlsx"
## [1] "Merging C:/education/Occupations.xlsx"
## [1] "Merging C:/education/Single_sheet_recombined_tabs.xlsx"
# Row bind the data and split across Education field
df <- do.call(rbind.data.frame, All)
df<-split(df,df$education)
# Create destination file
write.xlsx(df,tabbed_file_name)
The file education_tabs.XLSX should appear in the source folder having 17 separate tabs including a tab with the entire record set created by the previous code.
This code demonstrates how to take data directly from a data set and subdivide it into separate tabs by occupation. This can be executed immediately after executing the first code chunk.
# 5. Split original data source into separate Excel tabs
# Separate data frame by marital status
sp2 <- split(census_uci,census_uci$occupation)
#Export to Excel tabs. Missing values generate warnings in Excel
tab_file <- paste("Occupations.xlsx")
# Write data to destination file
xl <- write.xlsx(sp2,file.path(path,tab_file))
# List tab names and show worksheet order
xl
## A Workbook object.
##
## Worksheets:
## Sheet 1: " ?"
##
##
## Sheet 2: " Adm-clerical"
##
##
## Sheet 3: " Armed-Forces"
##
##
## Sheet 4: " Craft-repair"
##
##
## Sheet 5: " Exec-managerial"
##
##
## Sheet 6: " Farming-fishing"
##
##
## Sheet 7: " Handlers-cleaners"
##
##
## Sheet 8: " Machine-op-inspct"
##
##
## Sheet 9: " Other-service"
##
##
## Sheet 10: " Priv-house-serv"
##
##
## Sheet 11: " Prof-specialty"
##
##
## Sheet 12: " Protective-serv"
##
##
## Sheet 13: " Sales"
##
##
## Sheet 14: " Tech-support"
##
##
## Sheet 15: " Transport-moving"
##
##
##
## Worksheet write order: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
# worksheetOrder(xl)
The file Occupations.XLSX should appear with tabs for each of the 15 unique values of the occupation field.
This last code chunk simply recombines the occupation tabs created in the previous code chunk into a single worksheet tab. Once again the data should have 15 fields with 32,561 records.
# 6. Recombine Excel tabs into a single sheet
# Create name of destination file
sheet_file <- paste("Single_sheet_recombined_tabs.xlsx")
#Concatenate file path and the source file
xl2<- paste(path,tab_file,sep= '')
#write to data frame with with bound rows
df<-xl2 %>%
excel_sheets() %>%
set_names() %>%
map(read_excel,path=xl2) %>%
bind_rows()
#Create destination File
write.xlsx(df,file.path(path,sheet_file))
Was this helpful? I welcome any feedback.
N Ray