Extract content of ‘Word.docx’

A quick try-out. A .docx file is just a zip archive, so rename the document to data.zip, unzip it and navigate to data/word/document.xml. Here ‘data’ stands for whatever name you gave the file.
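The rename-and-unzip step can also be done from R itself. This is a minimal sketch, assuming the document sits in the working directory; extract_document_xml is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: pull word/document.xml out of a .docx without renaming it.
# unzip() ships with base R (utils) and reads .docx files because they are zip archives.
extract_document_xml <- function(docx_path, exdir = tempfile("docx_")) {
  stopifnot(file.exists(docx_path))
  # Extract only the main document part; unzip() returns the extracted file path(s)
  unzip(docx_path, files = "word/document.xml", exdir = exdir)
}

# Usage, assuming Word.docx is in the working directory:
# xml_path <- extract_document_xml("Word.docx")
# doc <- xml2::read_xml(xml_path)
```

Extracting into a temporary directory keeps the working directory clean, at the cost of having to keep the returned path around.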

Open document.xml and you will see an ordinary XML document.

We can easily make out a table structure with rows and columns. In the simplest cases (which is all I’ll cover in this post), where the rows and columns are uniform, it’s pretty easy to grab the data:

setwd("~/Documents/Workspace/R-project/Word")


library(xml2) #read xml text
library(textreadr) # read docx text
## 
## Attaching package: 'textreadr'
## The following object is masked from 'package:xml2':
## 
##     read_html
library(knitr)

## FIRST, look at the xml2 package
# read in the XML file
doc <- read_xml("document.xml")

# there is an egregious use of namespaces in these files
ns <- xml_ns(doc)

# extract all the table cells (this is assuming one table in the document)
cells <- xml_find_all(doc, ".//w:tbl/w:tr/w:tc", ns=ns)

# convert the cells to a matrix, then to a data.frame
dat <- data.frame(matrix(xml_text(cells), ncol=4, byrow=TRUE), 
                  stringsAsFactors=FALSE)

# if there are column headers, make them the column name and remove that line
colnames(dat) <- dat[1,]
dat <- dat[-1,]
rownames(dat) <- NULL
A quick grab
Item Definition/clarification
1. Date of transplantation dd/mm/yyyy
2. Country where transplantation took place
3. City and hospital where transplantation took place
4. Recipient age at transplantation Years
5. Recipient gender FORMCHECKBOX Male FORMCHECKBOX Female
6. Recipient blood group FORMCHECKBOX A FORMCHECKBOX B FORMCHECKBOX AB FORMCHECKBOX 0
7. Status of the recipient on your centres waiting list when he/she travelled for transplantation abroad FORMCHECKBOX Active on the waiting list FORMCHECKBOX Not active on the waiting list, but treated at your Centre FORMCHECKBOX Not active on the waiting list, and not treated at your Centre FORMCHECKBOX Other (please specify)_______________
8. Referral of the recipient by your Centre or a Centre in your country for transplantation abroad Specify if the patient was referred by your Centre or a Centre in your country for transplantation in another country. Referral should be understood as the establishment of a direct contact between the Centre of origin and the Centre where the transplantation procedure would take place in order to ensure transfer of medical records and continuity of care. Referral should NOT be understood as a simple recommendation to travel for transplantation abroad without any further engagement or contact between the Centre of origin and the Centre where the transplantation procedure would take place. FORMCHECKBOX Yes FORMCHECKBOX No
9. Reason for referring the recipient for transplantation abroad Complete only if the answer to question #8 is Yes. FORMCHECKBOX Lack of transplant programme in home country and established official bilateral agreement with the country where the transplantation procedure would take place FORMCHECKBOX Lack of transplant programme in home country in the absence of an established official bilateral agreement with the country where the transplantation procedure would take place FORMCHECKBOX Double citizenship of recipient FORMCHECKBOX Other reason (please specify) _______________
10. Country(ies) of legal citizenship/residency of the recipient Citizenship: _______________Residency (if different): _______________
11. Donor type According to the World Health Organization classification, a living donor has one of the following relationships with the recipient: RelatedA1. Genetically related: 1st degree genetically relative: parent, sibling, offspring2nd degree genetically related: e.g. grandparent, grandchild, aunt, uncle, niece, nephew.Other than 1st or 2nd degree genetically related: e.g cousinA2. Emotionally related: spouse (if not genetically related; in-law; adopted; friend.Unrelated: not genetically or emotionally related. FORMCHECKBOX Deceased FORMCHECKBOX Living - genetically related 1st degree FORMCHECKBOX Living - genetically related 2nd degree FORMCHECKBOX Living - genetically related other than 1st or 2nd degree FORMCHECKBOX Living - emotionally related FORMCHECKBOX Living - unrelated FORMCHECKBOX Not available
12. Donor age Years. Please specify if not available.
13. Donor gender FORMCHECKBOX Male FORMCHECKBOX Female FORMCHECKBOX Not available
14. Country(ies) of legal citizenship/residency of the donor Citizenship: _______________ Residency (if different): _______________ FORMCHECKBOX Not available
15. Donor blood group FORMCHECKBOX A FORMCHECKBOX B FORMCHECKBOX AB FORMCHECKBOX 0 FORMCHECKBOX Not available
16. Information on the Transplant Team Specify if your Centre has information available on the Transplant Team (contact details) who performed the transplant procedure abroad FORMCHECKBOX Yes FORMCHECKBOX No
17. Quality of medical report on transplant procedure provided to the patient at hospital discharge after transplantation A complete report should contain at a minimum information on: hospital where transplantation took place (with contact details of the transplant team), date of transplantationdonor characteristicsrecipients post-operative complications recipients treatmentIf any of this information is missing, please describe the report as incomplete. FORMCHECKBOX Complete report FORMCHECKBOX Incomplete report FORMCHECKBOX No report available
18. Date of last available follow-up The evolution of the transplant recipient is assessed at 1 year 1 month after transplantation. dd/mm/yyyy
19. Functioning graft (censored for death) Please specify if the patient had a functioning graft at 1 year 1 month after transplantation. If the patient died with a functioning graft, please respond Yes.Skip to question # 22 if the answer is Yes. FORMCHECKBOX Yes FORMCHECKBOX No
20. Date of graft loss Complete only if answer to question #19 is No. dd/mm/yyyy
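
The ncol=4 in the code above is specific to this questionnaire. Here is a sketch of deriving the column count from the table itself, under the same assumption of a single uniform table. The inline XML is a made-up stand-in for document.xml, and cells_per_row is an illustrative name.

```r
library(xml2)

# Tiny stand-in for word/document.xml (illustrative content, not the real file)
xml_src <- paste0(
  '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">',
  '<w:body><w:tbl>',
  '<w:tr><w:tc><w:p><w:r><w:t>Item</w:t></w:r></w:p></w:tc>',
  '<w:tc><w:p><w:r><w:t>Definition</w:t></w:r></w:p></w:tc></w:tr>',
  '<w:tr><w:tc><w:p><w:r><w:t>Donor age</w:t></w:r></w:p></w:tc>',
  '<w:tc><w:p><w:r><w:t>Years</w:t></w:r></w:p></w:tc></w:tr>',
  '</w:tbl></w:body></w:document>'
)
doc <- read_xml(xml_src)
ns <- xml_ns(doc)

# Count the cells in each row instead of hard-coding the number of columns
rows <- xml_find_all(doc, ".//w:tbl/w:tr", ns = ns)
cells_per_row <- vapply(
  rows,
  function(r) length(xml_find_all(r, "./w:tc", ns = ns)),
  integer(1)
)

# Only proceed when every row has the same number of cells (uniform table)
stopifnot(length(unique(cells_per_row)) == 1)

cells <- xml_find_all(doc, ".//w:tbl/w:tr/w:tc", ns = ns)
dat <- data.frame(matrix(xml_text(cells), ncol = cells_per_row[1], byrow = TRUE),
                  stringsAsFactors = FALSE)
```

The stopifnot() check fails early on tables with merged or missing cells, which is where the simple matrix reshaping would silently misalign the data.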

SECOND, look at the textreadr package

doc2 <- system.file("Questionnaire data collection for hospitals.docx")
read_document(doc2)
## character(0)

doc2

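Hmm, it doesn’t seem to recognize any data… The likely culprit is system.file(): it looks for files inside installed packages and returns an empty string when it finds nothing, so read_document() was probably handed an empty path rather than the questionnaire.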
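
To see what the reader function was actually given, the path can be inspected before reading. This is a sketch only, assuming the questionnaire sits in the working directory and textreadr is installed; pkg_path and docx_path are illustrative names.

```r
# system.file() searches installed packages (package = "base" by default),
# so a document in the working directory is not found and "" comes back.
pkg_path <- system.file("Questionnaire data collection for hospitals.docx")
print(nzchar(pkg_path))  # is the returned path non-empty?

# A plain relative (or absolute) path is what a reader function needs instead.
docx_path <- "Questionnaire data collection for hospitals.docx"
if (file.exists(docx_path)) {
  # Assumption: textreadr is installed; read_docx() returns the text as a character vector
  txt <- textreadr::read_docx(docx_path)
  print(head(txt))
} else {
  message("File not found in ", getwd())
}
```

Whether the table cells come through in a usable shape this way is a separate question; the check only rules out the empty path as the cause.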