Background

R and Python are widely used tools by data analysts. A recent survey reported that each is used by approximately 50% of those questions and usage of both is increasing. It is generally recognised both are complimetary and there are many cases where we can benefit from using both languages side-by-side in an analysis in order to captialise upon applications where one outperforms the other. A series of blogs reported on how to integrate Python and R.

Problem

We have recently been working on an analysis wherein we are required to tidy data which is provided in the form of an Excel spreadsheet. We have experience in using the dplyr pacakge in R which is excellent for simplicity of writing code (although other packages outperform it for very large data sets). Hence, there is a benefit to using R.

Several packages exist in R for reading Excel files. However, to date, we have not been able to find one which can read cell formats of ‘xlsx’ files. We are required to identify ‘struck out’ cells which we are not able to do in R. Previously, we performed a manual step to record the strike outs using an Excel macro. However, this impairs the reproducibility of the analysis since this step is not explicitly coded in the literate program.

Python does provide a more advanced Excel reader, called openpyxl, which is capable of reading any cell formatting, including strikethroughs. The objective of this example is to show how we pass data between R and Python using feather-format

Read, Interpret and Write Data with Python

The libraries used are:

An aside: we handle carriage returns in headers by removing them since carriage returns in variable names is not a good idea.

You need access to the example.xlsx file to run the code below.

# import libraries ------------------------------------------------------------
from openpyxl import load_workbook
from openpyxl.styles import Font
import pandas as pd 
import feather as ft

# load workbook ---------------------------------------------------------------

wb = load_workbook('./example.xlsx', data_only = True, read_only = True)
ws = wb.get_sheet_by_name('Sheet1')

# get header ------------------------------------------------------------------
header_row = 4
hdr = [elem.value for elem in ws[header_row]]
hdr = [' '.join(elem.splitlines()) if elem else elem for elem in hdr]

# read values and strike-out status -------------------------------------------
val = [(cell.value if cell.value else "" for cell in row) for \
       row in ws.iter_rows(min_row = header_row + 1)]
stk = [(bool(cell.font.strike) if cell.font else False for cell in row) for \
        row in ws.iter_rows(min_row = header_row + 1)]

# write feather files ---------------------------------------------------------
for data in (("val", val), ("stk", stk)):
    df = pd.DataFrame(data[1], columns = hdr)
    print(data[0], "data\n", df, "\n")
    ft.write_dataframe(df, "exp_%s.feather" % data[0])
val data
    A  B   C D -some muppet put a carriage return  NaN E
0  1  2   3                                    A    3  
1  4  5   6                                    1    3  
2  8  9  10                                  1.4    4   

stk data
        A      B     C D -some muppet put a carriage return   NaN      E
0  False  False  True                                False  True  False
1  False   True  True                                False  True  False
2   True   True  True                                False  True  False 

Read Feather with R

Now, we are able to read the data into R. Note that we could have used a CSV file to record the data. However, for large data sets, this would have been much slower.

library(feather)
val <- read_feather("exp_val.feather")
Warning: Coercing int64 to double
Warning: Coercing int64 to double
Warning: Coercing int64 to double
Warning: Coercing int64 to double
stk <- read_feather("exp_stk.feather")
print(val); print(stk)
LS0tCnRpdGxlOiAiRXhhbXBsZSBvZiBFeGNoYW5naW5nIGRhdGEgYmV0d2VlbiBQeXRob24gYW5kIFIgd2l0aCBGZWF0aGVyIgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgojIyBCYWNrZ3JvdW5kCgpSIGFuZCBQeXRob24gYXJlIHdpZGVseSB1c2VkIHRvb2xzIGJ5IGRhdGEgYW5hbHlzdHMuICBBIFtyZWNlbnQgc3VydmV5XVtrZG51Z2dldHMtMjAxNl0gcmVwb3J0ZWQgdGhhdCBlYWNoIGlzIHVzZWQgYnkgYXBwcm94aW1hdGVseSA1MCUgb2YgdGhvc2UgcXVlc3Rpb25zIGFuZCB1c2FnZSBvZiBib3RoIGlzIGluY3JlYXNpbmcuICBJdCBpcyBnZW5lcmFsbHkgcmVjb2duaXNlZCBib3RoIGFyZSBjb21wbGltZXRhcnkgYW5kIHRoZXJlIGFyZSBtYW55IGNhc2VzIHdoZXJlIHdlIGNhbiBiZW5lZml0IGZyb20gdXNpbmcgYm90aCBsYW5ndWFnZXMgc2lkZS1ieS1zaWRlIGluIGFuIGFuYWx5c2lzIGluIG9yZGVyIHRvIGNhcHRpYWxpc2UgdXBvbiBhcHBsaWNhdGlvbnMgd2hlcmUgb25lIG91dHBlcmZvcm1zIHRoZSBvdGhlci4gIEEgW3NlcmllcyBvZiBibG9nc11bTWFuZ28tMjAxNV0gcmVwb3J0ZWQgb24gaG93IHRvIGludGVncmF0ZSBQeXRob24gYW5kIFIuCgojIyBQcm9ibGVtCgpXZSBoYXZlIHJlY2VudGx5IGJlZW4gd29ya2luZyBvbiBhbiBhbmFseXNpcyB3aGVyZWluIHdlIGFyZSByZXF1aXJlZCB0byB0aWR5IGRhdGEgd2hpY2ggaXMgcHJvdmlkZWQgaW4gdGhlIGZvcm0gb2YgYW4gRXhjZWwgc3ByZWFkc2hlZXQuICBXZSBoYXZlIGV4cGVyaWVuY2UgaW4gdXNpbmcgdGhlIGBkcGx5cmAgcGFjYWtnZSBpbiBSIHdoaWNoIGlzIGV4Y2VsbGVudCBmb3Igc2ltcGxpY2l0eSBvZiB3cml0aW5nIGNvZGUgKGFsdGhvdWdoIG90aGVyIHBhY2thZ2VzIG91dHBlcmZvcm0gaXQgZm9yIHZlcnkgbGFyZ2UgZGF0YSBzZXRzKS4gIEhlbmNlLCB0aGVyZSBpcyBhIGJlbmVmaXQgdG8gdXNpbmcgUi4KClNldmVyYWwgcGFja2FnZXMgZXhpc3QgaW4gUiBmb3IgcmVhZGluZyBFeGNlbCBmaWxlcy4gIEhvd2V2ZXIsIHRvIGRhdGUsIHdlIGhhdmUgbm90IGJlZW4gYWJsZSB0byBmaW5kIG9uZSB3aGljaCBjYW4gcmVhZCBjZWxsIGZvcm1hdHMgb2YgJ3hsc3gnIGZpbGVzLiAgV2UgYXJlIHJlcXVpcmVkIHRvIGlkZW50aWZ5ICdzdHJ1Y2sgb3V0JyBjZWxscyB3aGljaCB3ZSBhcmUgbm90IGFibGUgdG8gZG8gaW4gUi4gIFByZXZpb3VzbHksIHdlIHBlcmZvcm1lZCBhIG1hbnVhbCBzdGVwIHRvIHJlY29yZCB0aGUgc3RyaWtlIG91dHMgdXNpbmcgYW4gRXhjZWwgbWFjcm8uICBIb3dldmVyLCB0aGlzIGltcGFpcnMgdGhlIHJlcHJvZHVjaWJpbGl0eSBvZiB0aGUgYW5hbHlzaXMgc2luY2UgdGhpcyBzdGVwIGlzIG5vdCBleHBsaWNpdGx5IGNvZGVkIGluIHRoZSBsaXRlcmF0ZSBwcm9ncmFtLgoKUHl0aG9uIGRvZXMgcHJvdmlkZSBhIG1vcmUgYWR2YW5jZWQgRXhjZWwgcmVhZGVyLCBjYWxsZWQgYG9wZW5weXhsYCwgd2hpY2ggaXMgY2FwYWJsZSBvZiByZWFkaW5nIGFueSBjZWxsIGZvcm1hdHRpbmcsIGluY2x1ZGluZyBzdHJpa2V0aHJvdWdocy4gIFRoZSBvYmplY3RpdmUgb2YgdGhpcyBleGFtcGxlIGlzIHRvIHNob3cgaG93IHdlIHBhc3MgZGF0YSBiZXR3ZWVuIFIgYW5kIFB5dGhvbiB1c2luZyBbZmVhdGhlci1mb3JtYXRdW0ZlYXRoZXItMjAxNV0gCgpba2RudWdnZXRzLTIwMTZdOiBodHRwOi8vd3d3LmtkbnVnZ2V0cy5jb20vMjAxNi8wNi9yLXB5dGhvbi10b3AtYW5hbHl0aWNzLWRhdGEtbWluaW5nLWRhdGEtc2NpZW5jZS1zb2Z0d2FyZS5odG1sCltNYW5nby0yMDE1XTogaHR0cDovL3d3dy5tYW5nby1zb2x1dGlvbnMuY29tL3dwLzIwMTUvMTAvaW50ZWdyYXRpbmctcHl0aG9uLWFuZC1yLWludG8tYS1kYXRhLWFuYWx5c2lzLXBpcGVsaW5lLXBhcnQtMS8KW0ZlYXRoZXItMjAxNV06IGh0dHBzOi8vYmxvZy5yc3R1ZGlvLm9yZy8yMDE2LzAzLzI5L2ZlYXRoZXIvCgojIyBSZWFkLCBJbnRlcnByZXQgYW5kIFdyaXRlIERhdGEgd2l0aCBQeXRob24KClRoZSBsaWJyYXJpZXMgdXNlZCBhcmU6CgoqIGBvcGVucHl4bGAgcHl0aG9uIGxpYnJhcnkgdG8gcmVhZCBhbmQgaW50ZXJlcHJldCB0aGUgRXhjZWwgc3ByZWFkc2hlZXQuCiogYHBhbmRhc2AgcHl0aG9uIGxpYnJhcnkgdG8gY29udmVydCB0aGUgZGF0YSB0byBhIGRhdGFmcmFtZS4KKiBgZmVhdGhlcmAgcHl0aG9uIGxpYnJhcnkgdG8gdGhlIHdyaXRlIHRoZSBkYXRhZnJhbWUgaW4gYSBmYXN0IHJlYWQvd3JpdGUgZm9ybWF0IHdoaWNoIGlzIGludGVyY2hhbmFnZWFibGUgYmV0d2VlbiBSIGFuZCBQeXRob24uCgpBbiBhc2lkZTogd2UgaGFuZGxlIGNhcnJpYWdlIHJldHVybnMgaW4gaGVhZGVycyBieSByZW1vdmluZyB0aGVtIHNpbmNlIGNhcnJpYWdlIHJldHVybnMgaW4gdmFyaWFibGUgbmFtZXMgaXMgbm90IGEgZ29vZCBpZGVhLgoKWW91IG5lZWQgYWNjZXNzIHRvIHRoZSBgZXhhbXBsZS54bHN4YCBmaWxlIHRvIHJ1biB0aGUgY29kZSBiZWxvdy4KCmBgYCB7cHl0aG9uLCBlbmdpbmUgPSAncHl0aG9uMyd9CiMgaW1wb3J0IGxpYnJhcmllcyAtLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0KZnJvbSBvcGVucHl4bCBpbXBvcnQgbG9hZF93b3JrYm9vawpmcm9tIG9wZW5weXhsLnN0eWxlcyBpbXBvcnQgRm9udAppbXBvcnQgcGFuZGFzIGFzIHBkIAppbXBvcnQgZmVhdGhlciBhcyBmdAoKIyBsb2FkIHdvcmtib29rIC0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQoKd2IgPSBsb2FkX3dvcmtib29rKCcuL2V4YW1wbGUueGxzeCcsIGRhdGFfb25seSA9IFRydWUsIHJlYWRfb25seSA9IFRydWUpCndzID0gd2IuZ2V0X3NoZWV0X2J5X25hbWUoJ1NoZWV0MScpCgojIGdldCBoZWFkZXIgLyByZW1vdmUgY2FycmlhZ2UgcmV0dXJucyAtLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tCmhlYWRlcl9yb3cgPSA0CmhkciA9IFtlbGVtLnZhbHVlIGZvciBlbGVtIGluIHdzW2hlYWRlcl9yb3ddXQpoZHIgPSBbJyAnLmpvaW4oZWxlbS5zcGxpdGxpbmVzKCkpIGlmIGVsZW0gZWxzZSBlbGVtIGZvciBlbGVtIGluIGhkcl0KCiMgcmVhZCB2YWx1ZXMgYW5kIHN0cmlrZS1vdXQgc3RhdHVzIC0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0KdmFsID0gWyhjZWxsLnZhbHVlIGlmIGNlbGwudmFsdWUgZWxzZSAiIiBmb3IgY2VsbCBpbiByb3cpIGZvciBcCiAgICAgICByb3cgaW4gd3MuaXRlcl9yb3dzKG1pbl9yb3cgPSBoZWFkZXJfcm93ICsgMSldCnN0ayA9IFsoYm9vbChjZWxsLmZvbnQuc3RyaWtlKSBpZiBjZWxsLmZvbnQgZWxzZSBGYWxzZSBmb3IgY2VsbCBpbiByb3cpIGZvciBcCiAgICAgICAgcm93IGluIHdzLml0ZXJfcm93cyhtaW5fcm93ID0gaGVhZGVyX3JvdyArIDEpXQoKIyB3cml0ZSBmZWF0aGVyIGZpbGVzIC0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQpmb3IgZGF0YSBpbiAoKCJ2YWwiLCB2YWwpLCAoInN0ayIsIHN0aykpOgogICAgZGYgPSBwZC5EYXRhRnJhbWUoZGF0YVsxXSwgY29sdW1ucyA9IGhkcikKICAgIHByaW50KGRhdGFbMF0sICJkYXRhXG4iLCBkZiwgIlxuIikKICAgIGZ0LndyaXRlX2RhdGFmcmFtZShkZiwgImV4cF8lcy5mZWF0aGVyIiAlIGRhdGFbMF0pCmBgYAojIFJlYWQgRmVhdGhlciB3aXRoIFIKCk5vdywgd2UgYXJlIGFibGUgdG8gcmVhZCB0aGUgZGF0YSBpbnRvIFIuICBOb3RlIHRoYXQgd2UgY291bGQgaGF2ZSB1c2VkIGEgQ1NWIGZpbGUgdG8gcmVjb3JkIHRoZSBkYXRhLiAgSG93ZXZlciwgZm9yIGxhcmdlIGRhdGEgc2V0cywgdGhpcyB3b3VsZCBoYXZlIGJlZW4gbXVjaCBzbG93ZXIuICAKCmBgYHtyfQpsaWJyYXJ5KGZlYXRoZXIpCnZhbCA8LSByZWFkX2ZlYXRoZXIoImV4cF92YWwuZmVhdGhlciIpCnN0ayA8LSByZWFkX2ZlYXRoZXIoImV4cF9zdGsuZmVhdGhlciIpCnByaW50KHZhbCk7IHByaW50KHN0aykKYGBg