Background
R and Python are widely used tools by data analysts. A recent survey reported that each is used by approximately 50% of those questions and usage of both is increasing. It is generally recognised both are complimetary and there are many cases where we can benefit from using both languages side-by-side in an analysis in order to captialise upon applications where one outperforms the other. A series of blogs reported on how to integrate Python and R.
Problem
We have recently been working on an analysis wherein we are required to tidy data which is provided in the form of an Excel spreadsheet. We have experience in using the dplyr
pacakge in R which is excellent for simplicity of writing code (although other packages outperform it for very large data sets). Hence, there is a benefit to using R.
Several packages exist in R for reading Excel files. However, to date, we have not been able to find one which can read cell formats of ‘xlsx’ files. We are required to identify ‘struck out’ cells which we are not able to do in R. Previously, we performed a manual step to record the strike outs using an Excel macro. However, this impairs the reproducibility of the analysis since this step is not explicitly coded in the literate program.
Python does provide a more advanced Excel reader, called openpyxl
, which is capable of reading any cell formatting, including strikethroughs. The objective of this example is to show how we pass data between R and Python using feather-format
Read, Interpret and Write Data with Python
The libraries used are:
openpyxl
python library to read and interepret the Excel spreadsheet.
pandas
python library to convert the data to a dataframe.
feather
python library to the write the dataframe in a fast read/write format which is interchanageable between R and Python.
An aside: we handle carriage returns in headers by removing them since carriage returns in variable names is not a good idea.
You need access to the example.xlsx
file to run the code below.
# import libraries ------------------------------------------------------------
from openpyxl import load_workbook
from openpyxl.styles import Font
import pandas as pd
import feather as ft
# load workbook ---------------------------------------------------------------
wb = load_workbook('./example.xlsx', data_only = True, read_only = True)
ws = wb.get_sheet_by_name('Sheet1')
# get header ------------------------------------------------------------------
header_row = 4
hdr = [elem.value for elem in ws[header_row]]
hdr = [' '.join(elem.splitlines()) if elem else elem for elem in hdr]
# read values and strike-out status -------------------------------------------
val = [(cell.value if cell.value else "" for cell in row) for \
row in ws.iter_rows(min_row = header_row + 1)]
stk = [(bool(cell.font.strike) if cell.font else False for cell in row) for \
row in ws.iter_rows(min_row = header_row + 1)]
# write feather files ---------------------------------------------------------
for data in (("val", val), ("stk", stk)):
df = pd.DataFrame(data[1], columns = hdr)
print(data[0], "data\n", df, "\n")
ft.write_dataframe(df, "exp_%s.feather" % data[0])
val data
A B C D -some muppet put a carriage return NaN E
0 1 2 3 A 3
1 4 5 6 1 3
2 8 9 10 1.4 4
stk data
A B C D -some muppet put a carriage return NaN E
0 False False True False True False
1 False True True False True False
2 True True True False True False
Read Feather with R
Now, we are able to read the data into R. Note that we could have used a CSV file to record the data. However, for large data sets, this would have been much slower.
library(feather)
val <- read_feather("exp_val.feather")
Warning: Coercing int64 to double
Warning: Coercing int64 to double
Warning: Coercing int64 to double
Warning: Coercing int64 to double
stk <- read_feather("exp_stk.feather")
print(val); print(stk)
LS0tCnRpdGxlOiAiRXhhbXBsZSBvZiBFeGNoYW5naW5nIGRhdGEgYmV0d2VlbiBQeXRob24gYW5kIFIgd2l0aCBGZWF0aGVyIgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgojIyBCYWNrZ3JvdW5kCgpSIGFuZCBQeXRob24gYXJlIHdpZGVseSB1c2VkIHRvb2xzIGJ5IGRhdGEgYW5hbHlzdHMuICBBIFtyZWNlbnQgc3VydmV5XVtrZG51Z2dldHMtMjAxNl0gcmVwb3J0ZWQgdGhhdCBlYWNoIGlzIHVzZWQgYnkgYXBwcm94aW1hdGVseSA1MCUgb2YgdGhvc2UgcXVlc3Rpb25zIGFuZCB1c2FnZSBvZiBib3RoIGlzIGluY3JlYXNpbmcuICBJdCBpcyBnZW5lcmFsbHkgcmVjb2duaXNlZCBib3RoIGFyZSBjb21wbGltZXRhcnkgYW5kIHRoZXJlIGFyZSBtYW55IGNhc2VzIHdoZXJlIHdlIGNhbiBiZW5lZml0IGZyb20gdXNpbmcgYm90aCBsYW5ndWFnZXMgc2lkZS1ieS1zaWRlIGluIGFuIGFuYWx5c2lzIGluIG9yZGVyIHRvIGNhcHRpYWxpc2UgdXBvbiBhcHBsaWNhdGlvbnMgd2hlcmUgb25lIG91dHBlcmZvcm1zIHRoZSBvdGhlci4gIEEgW3NlcmllcyBvZiBibG9nc11bTWFuZ28tMjAxNV0gcmVwb3J0ZWQgb24gaG93IHRvIGludGVncmF0ZSBQeXRob24gYW5kIFIuCgojIyBQcm9ibGVtCgpXZSBoYXZlIHJlY2VudGx5IGJlZW4gd29ya2luZyBvbiBhbiBhbmFseXNpcyB3aGVyZWluIHdlIGFyZSByZXF1aXJlZCB0byB0aWR5IGRhdGEgd2hpY2ggaXMgcHJvdmlkZWQgaW4gdGhlIGZvcm0gb2YgYW4gRXhjZWwgc3ByZWFkc2hlZXQuICBXZSBoYXZlIGV4cGVyaWVuY2UgaW4gdXNpbmcgdGhlIGBkcGx5cmAgcGFjYWtnZSBpbiBSIHdoaWNoIGlzIGV4Y2VsbGVudCBmb3Igc2ltcGxpY2l0eSBvZiB3cml0aW5nIGNvZGUgKGFsdGhvdWdoIG90aGVyIHBhY2thZ2VzIG91dHBlcmZvcm0gaXQgZm9yIHZlcnkgbGFyZ2UgZGF0YSBzZXRzKS4gIEhlbmNlLCB0aGVyZSBpcyBhIGJlbmVmaXQgdG8gdXNpbmcgUi4KClNldmVyYWwgcGFja2FnZXMgZXhpc3QgaW4gUiBmb3IgcmVhZGluZyBFeGNlbCBmaWxlcy4gIEhvd2V2ZXIsIHRvIGRhdGUsIHdlIGhhdmUgbm90IGJlZW4gYWJsZSB0byBmaW5kIG9uZSB3aGljaCBjYW4gcmVhZCBjZWxsIGZvcm1hdHMgb2YgJ3hsc3gnIGZpbGVzLiAgV2UgYXJlIHJlcXVpcmVkIHRvIGlkZW50aWZ5ICdzdHJ1Y2sgb3V0JyBjZWxscyB3aGljaCB3ZSBhcmUgbm90IGFibGUgdG8gZG8gaW4gUi4gIFByZXZpb3VzbHksIHdlIHBlcmZvcm1lZCBhIG1hbnVhbCBzdGVwIHRvIHJlY29yZCB0aGUgc3RyaWtlIG91dHMgdXNpbmcgYW4gRXhjZWwgbWFjcm8uICBIb3dldmVyLCB0aGlzIGltcGFpcnMgdGhlIHJlcHJvZHVjaWJpbGl0eSBvZiB0aGUgYW5hbHlzaXMgc2luY2UgdGhpcyBzdGVwIGlzIG5vdCBleHBsaWNpdGx5IGNvZGVkIGluIHRoZSBsaXRlcmF0ZSBwcm9ncmFtLgoKUHl0aG9uIGRvZXMgcHJvdmlkZSBhIG1vcmUgYWR2YW5jZWQgRXhjZWwgcmVhZGVyLCBjYWxsZWQgYG9wZW5weXhsYCwgd2hpY2ggaXMgY2FwYWJsZSBvZiByZWFkaW5nIGFueSBjZWxsIGZvcm1hdHRpbmcsIGluY2x1ZGluZyBzdHJpa2V0aHJvdWdocy4gIFRoZSBvYmplY3RpdmUgb2YgdGhpcyBleGFtcGxlIGlzIHRvIHNob3cgaG93IHdlIHBhc3MgZGF0YSBiZXR3ZWVuIFIgYW5kIFB5dGhvbiB1c2luZyBbZmVhdGhlci1mb3JtYXRdW0ZlYXRoZXItMjAxNV0gCgpba2RudWdnZXRzLTIwMTZdOiBodHRwOi8vd3d3LmtkbnVnZ2V0cy5jb20vMjAxNi8wNi9yLXB5dGhvbi10b3AtYW5hbHl0aWNzLWRhdGEtbWluaW5nLWRhdGEtc2NpZW5jZS1zb2Z0d2FyZS5odG1sCltNYW5nby0yMDE1XTogaHR0cDovL3d3dy5tYW5nby1zb2x1dGlvbnMuY29tL3dwLzIwMTUvMTAvaW50ZWdyYXRpbmctcHl0aG9uLWFuZC1yLWludG8tYS1kYXRhLWFuYWx5c2lzLXBpcGVsaW5lLXBhcnQtMS8KW0ZlYXRoZXItMjAxNV06IGh0dHBzOi8vYmxvZy5yc3R1ZGlvLm9yZy8yMDE2LzAzLzI5L2ZlYXRoZXIvCgojIyBSZWFkLCBJbnRlcnByZXQgYW5kIFdyaXRlIERhdGEgd2l0aCBQeXRob24KClRoZSBsaWJyYXJpZXMgdXNlZCBhcmU6CgoqIGBvcGVucHl4bGAgcHl0aG9uIGxpYnJhcnkgdG8gcmVhZCBhbmQgaW50ZXJlcHJldCB0aGUgRXhjZWwgc3ByZWFkc2hlZXQuCiogYHBhbmRhc2AgcHl0aG9uIGxpYnJhcnkgdG8gY29udmVydCB0aGUgZGF0YSB0byBhIGRhdGFmcmFtZS4KKiBgZmVhdGhlcmAgcHl0aG9uIGxpYnJhcnkgdG8gdGhlIHdyaXRlIHRoZSBkYXRhZnJhbWUgaW4gYSBmYXN0IHJlYWQvd3JpdGUgZm9ybWF0IHdoaWNoIGlzIGludGVyY2hhbmFnZWFibGUgYmV0d2VlbiBSIGFuZCBQeXRob24uCgpBbiBhc2lkZTogd2UgaGFuZGxlIGNhcnJpYWdlIHJldHVybnMgaW4gaGVhZGVycyBieSByZW1vdmluZyB0aGVtIHNpbmNlIGNhcnJpYWdlIHJldHVybnMgaW4gdmFyaWFibGUgbmFtZXMgaXMgbm90IGEgZ29vZCBpZGVhLgoKWW91IG5lZWQgYWNjZXNzIHRvIHRoZSBgZXhhbXBsZS54bHN4YCBmaWxlIHRvIHJ1biB0aGUgY29kZSBiZWxvdy4KCmBgYCB7cHl0aG9uLCBlbmdpbmUgPSAncHl0aG9uMyd9CiMgaW1wb3J0IGxpYnJhcmllcyAtLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0KZnJvbSBvcGVucHl4bCBpbXBvcnQgbG9hZF93b3JrYm9vawpmcm9tIG9wZW5weXhsLnN0eWxlcyBpbXBvcnQgRm9udAppbXBvcnQgcGFuZGFzIGFzIHBkIAppbXBvcnQgZmVhdGhlciBhcyBmdAoKIyBsb2FkIHdvcmtib29rIC0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQoKd2IgPSBsb2FkX3dvcmtib29rKCcuL2V4YW1wbGUueGxzeCcsIGRhdGFfb25seSA9IFRydWUsIHJlYWRfb25seSA9IFRydWUpCndzID0gd2IuZ2V0X3NoZWV0X2J5X25hbWUoJ1NoZWV0MScpCgojIGdldCBoZWFkZXIgLyByZW1vdmUgY2FycmlhZ2UgcmV0dXJucyAtLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tCmhlYWRlcl9yb3cgPSA0CmhkciA9IFtlbGVtLnZhbHVlIGZvciBlbGVtIGluIHdzW2hlYWRlcl9yb3ddXQpoZHIgPSBbJyAnLmpvaW4oZWxlbS5zcGxpdGxpbmVzKCkpIGlmIGVsZW0gZWxzZSBlbGVtIGZvciBlbGVtIGluIGhkcl0KCiMgcmVhZCB2YWx1ZXMgYW5kIHN0cmlrZS1vdXQgc3RhdHVzIC0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0KdmFsID0gWyhjZWxsLnZhbHVlIGlmIGNlbGwudmFsdWUgZWxzZSAiIiBmb3IgY2VsbCBpbiByb3cpIGZvciBcCiAgICAgICByb3cgaW4gd3MuaXRlcl9yb3dzKG1pbl9yb3cgPSBoZWFkZXJfcm93ICsgMSldCnN0ayA9IFsoYm9vbChjZWxsLmZvbnQuc3RyaWtlKSBpZiBjZWxsLmZvbnQgZWxzZSBGYWxzZSBmb3IgY2VsbCBpbiByb3cpIGZvciBcCiAgICAgICAgcm93IGluIHdzLml0ZXJfcm93cyhtaW5fcm93ID0gaGVhZGVyX3JvdyArIDEpXQoKIyB3cml0ZSBmZWF0aGVyIGZpbGVzIC0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQpmb3IgZGF0YSBpbiAoKCJ2YWwiLCB2YWwpLCAoInN0ayIsIHN0aykpOgogICAgZGYgPSBwZC5EYXRhRnJhbWUoZGF0YVsxXSwgY29sdW1ucyA9IGhkcikKICAgIHByaW50KGRhdGFbMF0sICJkYXRhXG4iLCBkZiwgIlxuIikKICAgIGZ0LndyaXRlX2RhdGFmcmFtZShkZiwgImV4cF8lcy5mZWF0aGVyIiAlIGRhdGFbMF0pCmBgYAojIFJlYWQgRmVhdGhlciB3aXRoIFIKCk5vdywgd2UgYXJlIGFibGUgdG8gcmVhZCB0aGUgZGF0YSBpbnRvIFIuICBOb3RlIHRoYXQgd2UgY291bGQgaGF2ZSB1c2VkIGEgQ1NWIGZpbGUgdG8gcmVjb3JkIHRoZSBkYXRhLiAgSG93ZXZlciwgZm9yIGxhcmdlIGRhdGEgc2V0cywgdGhpcyB3b3VsZCBoYXZlIGJlZW4gbXVjaCBzbG93ZXIuICAKCmBgYHtyfQpsaWJyYXJ5KGZlYXRoZXIpCnZhbCA8LSByZWFkX2ZlYXRoZXIoImV4cF92YWwuZmVhdGhlciIpCnN0ayA8LSByZWFkX2ZlYXRoZXIoImV4cF9zdGsuZmVhdGhlciIpCnByaW50KHZhbCk7IHByaW50KHN0aykKYGBg