Excel File Manipulation Using R

0. Introduction

This document demonstrates a few ways to generate excel files files using R code. The main purpose is to show how it is possible to manipulate a large data set to automatically generate XLSX files based on selected criteria and how to recombine XLSX files in a folder into a single workbook.

The code provided subdivides the data set by the education field and creates an an XLSX file for each value. The code can easily be modified to subdivide the data set by another field such as occupation.

The code also demonstrates how to: - Combine multiple XLSX files into individual worksheets in a single workbook - Combine multiple XLSX files onto a single worksheet in a workbook - Split a data source into separate worksheets in a workbook - Recombine excel tabs into a single worksheet

1. Importing and Inspecting a Data Set

The first code chunk imports data from the US Census Data data set, obtained from

https://archive.ics.uci.edu/ml/machine-learning-databases/adult/

https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

The code also generates summary stats and charts on the data set based on the education field of the data set.

shell('cls')
library(tidyverse)
library(plotly)
library(readxl)
library(openxlsx)
library(formattable)
library(knitr)


#Set filepath. Create before running script.
path <- "C:/education/"

options(pillar.sigfig = 5)

# Adult Census Income:
# https://archive.ics.uci.edu/ml/machine-learning-databases/adult/
# https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data


# 1. Read data from internet source

# Download US census data
census_uci <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', 
                         sep = ',', fill = F)
# Create a dataframe of column names
colnames(census_uci) <- c('age', 'workclass', 'fnlwgt', 'education', 
                          'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 
                          'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income')


# 1.1 Inspect the Data
paste("Data set dimensions:")

## [1] "Data set dimensions:"

paste(ncol(census_uci), "fields") #no of fields

## [1] "15 fields"

paste(nrow(census_uci),"records") #no of records

## [1] "32561 records"

knitr::kable(head(census_uci, n=10), caption = "Sample of the Census Data")

Sample of the Census Data
age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	income
39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K
37	Private	284582	Masters	14	Married-civ-spouse	Exec-managerial	Wife	White	Female	0	40	United-States	<=50K
49	Private	160187	9th	5	Married-spouse-absent	Other-service	Not-in-family	Black	Female	0	16	Jamaica	<=50K
52	Self-emp-not-inc	209642	HS-grad	9	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	45	United-States	>50K
31	Private	45781	Masters	14	Never-married	Prof-specialty	Not-in-family	White	Female	14084	50	United-States	>50K
42	Private	159449	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	5178	40	United-States	>50K

# Summarise the data
paste("Summary data for the fields of the data set:")

## [1] "Summary data for the fields of the data set:"

summary(census_uci)

##       age                    workclass         fnlwgt       
##  Min.   :17.00    Private         :22696   Min.   :  12285  
##  1st Qu.:28.00    Self-emp-not-inc: 2541   1st Qu.: 117827  
##  Median :37.00    Local-gov       : 2093   Median : 178356  
##  Mean   :38.58    ?               : 1836   Mean   : 189778  
##  3rd Qu.:48.00    State-gov       : 1298   3rd Qu.: 237051  
##  Max.   :90.00    Self-emp-inc    : 1116   Max.   :1484705  
##                  (Other)          :  981                    
##          education     education_num                  marital_status 
##   HS-grad     :10501   Min.   : 1.00    Divorced             : 4443  
##   Some-college: 7291   1st Qu.: 9.00    Married-AF-spouse    :   23  
##   Bachelors   : 5355   Median :10.00    Married-civ-spouse   :14976  
##   Masters     : 1723   Mean   :10.08    Married-spouse-absent:  418  
##   Assoc-voc   : 1382   3rd Qu.:12.00    Never-married        :10683  
##   11th        : 1175   Max.   :16.00    Separated            : 1025  
##  (Other)      : 5134                    Widowed              :  993  
##             occupation            relationship  
##   Prof-specialty :4140    Husband       :13193  
##   Craft-repair   :4099    Not-in-family : 8305  
##   Exec-managerial:4066    Other-relative:  981  
##   Adm-clerical   :3770    Own-child     : 5068  
##   Sales          :3650    Unmarried     : 3446  
##   Other-service  :3295    Wife          : 1568  
##  (Other)         :9541                          
##                   race            sex         capital_gain  
##   Amer-Indian-Eskimo:  311    Female:10771   Min.   :    0  
##   Asian-Pac-Islander: 1039    Male  :21790   1st Qu.:    0  
##   Black             : 3124                   Median :    0  
##   Other             :  271                   Mean   : 1078  
##   White             :27816                   3rd Qu.:    0  
##                                              Max.   :99999  
##                                                             
##   capital_loss    hours_per_week         native_country     income     
##  Min.   :   0.0   Min.   : 1.00    United-States:29170    <=50K:24720  
##  1st Qu.:   0.0   1st Qu.:40.00    Mexico       :  643    >50K : 7841  
##  Median :   0.0   Median :40.00    ?            :  583                 
##  Mean   :  87.3   Mean   :40.44    Philippines  :  198                 
##  3rd Qu.:   0.0   3rd Qu.:45.00    Germany      :  137                 
##  Max.   :4356.0   Max.   :99.00    Canada       :  121                 
##                                   (Other)       : 1709

# Record count by education level
count_ed<-census_uci %>% 
  group_by(education) %>% 
             summarise(n())

# Barplot of record count by Education
p1<-census_uci %>% 
  ggplot(aes(x=education))+geom_bar(color="grey",fill="blue")+ggtitle("Count by Education Level")+ theme(axis.text.x = element_text(angle = 90))
ggplotly(p1)

# Summary of education count
# paste("Summary of education count")
knitr::kable(count_ed, caption = "Summary of education count"  )

Summary of education count
education	n()
10th	933
11th	1175
12th	433
1st-4th	168
5th-6th	333
7th-8th	646
9th	514
Assoc-acdm	1067
Assoc-voc	1382
Bachelors	5355
Doctorate	413
HS-grad	10501
Masters	1723
Preschool	51
Prof-school	576
Some-college	7291

2. Export Data Into Separate XLSX Files

The following code chunk subdivides the initial data set by the education field and creates a single tab XLSX file for each value.

Be sure to set the correct folder path in the previous code chunk, which must be run in advance of this one. The XLSX files will load to C:/education/, unless adjusted.

# 2. Split data and export to separate Excel files

sp1 <- split(census_uci,census_uci$education)

# Get unique education values
names <- census_uci$education %>% unique()

# Loop through the the different education levels
for (i in names){
  # write.xlsx(sp1[i],paste(i,".xlsx"))
  write.xlsx(sp1[i],file.path(path,paste(i,".xlsx")))
  }

Upon running this chunk, 16 files should appear in the target folder each named for a value of the education field

3. Combine All Files in a folder to a Single Worksheet

This code chunk loops through all the files in a folder and exports the data into single worksheet in a new Excel file. If executed successfully, the record set in the destination file will be the same as the initial data set.

In order to work properly, all files in the source folder must have the same data structure. (Be sure to remove unnecessary files from the source folder). This code chunk should work on any folder with a set of similarly structured files. It is not limited to the files created previously. Simply adjust the path variable in the first code chunk.

# 3. Combine Excel files in a folder into a single workbook tab

# Store file path and destination file
merge_file_name <- paste(path,"single_tab_merged_file.xlsx")

# Generate a list of all the files in path folder
filenames_list <- list.files(path= path, full.names=TRUE)

# Read the files in the folder
paste("A List of the merged files")

## [1] "A List of the merged files"

All <- lapply(filenames_list,function(filename){
  print(paste("Merging",filename,sep = " "))
  read.xlsx(filename)
})

## [1] "Merging C:/education/ 10th .xlsx"
## [1] "Merging C:/education/ 11th .xlsx"
## [1] "Merging C:/education/ 12th .xlsx"
## [1] "Merging C:/education/ 1st-4th .xlsx"
## [1] "Merging C:/education/ 5th-6th .xlsx"
## [1] "Merging C:/education/ 7th-8th .xlsx"
## [1] "Merging C:/education/ 9th .xlsx"
## [1] "Merging C:/education/ Assoc-acdm .xlsx"
## [1] "Merging C:/education/ Assoc-voc .xlsx"
## [1] "Merging C:/education/ Bachelors .xlsx"
## [1] "Merging C:/education/ Doctorate .xlsx"
## [1] "Merging C:/education/ education_tabs.xlsx"
## [1] "Merging C:/education/ HS-grad .xlsx"
## [1] "Merging C:/education/ Masters .xlsx"
## [1] "Merging C:/education/ Preschool .xlsx"
## [1] "Merging C:/education/ Prof-school .xlsx"
## [1] "Merging C:/education/ single_tab_merged_file.xlsx"
## [1] "Merging C:/education/ Some-college .xlsx"
## [1] "Merging C:/education/Occupations.xlsx"
## [1] "Merging C:/education/Single_sheet_recombined_tabs.xlsx"

# Create a dataframe combining data from all files
df <- do.call(rbind.data.frame, All)
write.xlsx(df,merge_file_name)

If executed successfully, single_tab_merged_file.XLSX should appear in your source folder. You may open it to verify the number of fields (15) and records (32,561).

4. Combine All Files in a folder to Separate Worksheet Tabs

The following code chunk works the same as the previous one except that it creates separate destination worksheet tabs from each file. You may use this code chunk as an alternative to the previous one.

# 4. Combine Excel files in a folder into separate tabs

# Only for files with a similar data structure. 
# Remove unwanted files from folder

# Store file path and destination file
tabbed_file_name <- paste(path,"education_tabs.xlsx")

# Store file names in a dataframe
filenames_list <- list.files(path= path, full.names=TRUE)

# Read the files in folder

paste("A List of the transferred files")

## [1] "A List of the transferred files"

All <- lapply(filenames_list,function(filename){
  print(paste("Merging",filename,sep = " "))
  read.xlsx(filename)
})

## [1] "Merging C:/education/ 10th .xlsx"
## [1] "Merging C:/education/ 11th .xlsx"
## [1] "Merging C:/education/ 12th .xlsx"
## [1] "Merging C:/education/ 1st-4th .xlsx"
## [1] "Merging C:/education/ 5th-6th .xlsx"
## [1] "Merging C:/education/ 7th-8th .xlsx"
## [1] "Merging C:/education/ 9th .xlsx"
## [1] "Merging C:/education/ Assoc-acdm .xlsx"
## [1] "Merging C:/education/ Assoc-voc .xlsx"
## [1] "Merging C:/education/ Bachelors .xlsx"
## [1] "Merging C:/education/ Doctorate .xlsx"
## [1] "Merging C:/education/ education_tabs.xlsx"
## [1] "Merging C:/education/ HS-grad .xlsx"
## [1] "Merging C:/education/ Masters .xlsx"
## [1] "Merging C:/education/ Preschool .xlsx"
## [1] "Merging C:/education/ Prof-school .xlsx"
## [1] "Merging C:/education/ single_tab_merged_file.xlsx"
## [1] "Merging C:/education/ Some-college .xlsx"
## [1] "Merging C:/education/Occupations.xlsx"
## [1] "Merging C:/education/Single_sheet_recombined_tabs.xlsx"

# Row bind the data and split across Education field
df <- do.call(rbind.data.frame, All)
df<-split(df,df$education)

# Create destination file
write.xlsx(df,tabbed_file_name)

The file education_tabs.XLSX should appear in the source folder having 17 separate tabs including a tab with the entire record set created by the previous code.

5. Split Initail Data Source into Separate Worksheet Tabs

This code demonstrates how to take data directly from a data set and subdivide it into separate tabs by occupation. This can be executed immediately after executing the first code chunk.

# 5. Split original data source into separate Excel tabs

# Separate data frame by marital status
sp2 <- split(census_uci,census_uci$occupation)

#Export to Excel tabs. Missing values generate warnings in Excel
tab_file <- paste("Occupations.xlsx")

# Write data to destination file
xl <- write.xlsx(sp2,file.path(path,tab_file))

# List tab names and show worksheet order
xl

## A Workbook object.
##  
## Worksheets:
##  Sheet 1: " ?"
##  
## 
##  Sheet 2: " Adm-clerical"
##  
## 
##  Sheet 3: " Armed-Forces"
##  
## 
##  Sheet 4: " Craft-repair"
##  
## 
##  Sheet 5: " Exec-managerial"
##  
## 
##  Sheet 6: " Farming-fishing"
##  
## 
##  Sheet 7: " Handlers-cleaners"
##  
## 
##  Sheet 8: " Machine-op-inspct"
##  
## 
##  Sheet 9: " Other-service"
##  
## 
##  Sheet 10: " Priv-house-serv"
##  
## 
##  Sheet 11: " Prof-specialty"
##  
## 
##  Sheet 12: " Protective-serv"
##  
## 
##  Sheet 13: " Sales"
##  
## 
##  Sheet 14: " Tech-support"
##  
## 
##  Sheet 15: " Transport-moving"
##  
## 
##  
##  Worksheet write order: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

# worksheetOrder(xl)

The file Occupations.XLSX should appear with tabs for each of the 15 unique values of the occupation field.

6. Recombine Multiple Worksheets into a Single Worksheet

This last code chunk simply recombines the occupation tabs created in the previous code chunk into a single worksheet tab. Once again the data should have 15 fields with 32,561 records.

# 6. Recombine Excel tabs into a single sheet

# Create name of destination file
sheet_file <- paste("Single_sheet_recombined_tabs.xlsx")


#Concatenate file path and the source file
xl2<- paste(path,tab_file,sep= '')

#write to data frame with with bound rows
df<-xl2 %>% 
  excel_sheets() %>% 
   set_names() %>% 
  map(read_excel,path=xl2) %>% 
  bind_rows()

#Create destination File
write.xlsx(df,file.path(path,sheet_file))

Was this helpful? I welcome any feedback.

N Ray