Introduction

IBM offers its Watson capabilities as a service via Bluemix. One of these services is document conversion: it accepts a Word, HTML or PDF file as input and returns the content as structured data.

The following example takes a Word file and transforms it into a data frame with titles and text content. To use the service you need a Bluemix account and authentication keys for the selected service. Signing up is free (as in ‘trial’) and rather straightforward, and plenty of instructions are available online.
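Once you have the keys, it can be handy to keep them out of the script itself. The snippet below is only a sketch of one way to do that with environment variables. The helper `get_credentials` and the variable names `WATSON_DC_USERNAME` and `WATSON_DC_PASSWORD` are made up for this illustration and are not part of Bluemix or httr.

```r
# Hypothetical helper: read the service credentials from environment
# variables (set them in ~/.Renviron or in the shell) instead of
# hard-coding them in the script.
get_credentials <- function(user_var = "WATSON_DC_USERNAME",
                            pass_var = "WATSON_DC_PASSWORD") {
  user <- Sys.getenv(user_var)  # returns "" when the variable is not set
  pass <- Sys.getenv(pass_var)
  if (!nzchar(user) || !nzchar(pass)) {
    stop("Set ", user_var, " and ", pass_var, " before calling the service")
  }
  list(username = user, password = pass)
}

# Demo with placeholder values; use your own keys in practice.
Sys.setenv(WATSON_DC_USERNAME = "XXXYourUserNameXXX",
           WATSON_DC_PASSWORD = "XXXYourPasswordXXX")
creds <- get_credentials()
```

The two values can then be passed on as `authenticate(creds$username, creds$password)`.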

Step 1

Post a file to the document conversion service and parse the response. The result is a list.

library(httr)

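# Upload the file as a multipart form; the MIME type is set on the file part
rawData <- POST('https://gateway.watsonplatform.net/document-conversion/api/v1/convert_document?version=2015-12-15',
         authenticate("XXXYourUserNameXXX", "XXXYourPasswordXXX"),
         body=list(file=upload_file("a_word_file.docx",
                                    # MIME type for .docx; other file types are possible
                                    type="application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
                   # Ask the service to return the document as answer units
                   config = "{\"conversion_target\":\"answer_units\"}"))

# Stop with an informative error if the service returned an HTTP error status
stop_for_status(rawData)

# Parse the response body into a nested R list
docData <- content(rawData)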
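To get a feel for the list without calling the service, here is a small hand-made stand-in. It is only a sketch: `mockData` is hypothetical, its values are invented, and it contains just the fields that the rest of this post reads (`answer_units` with `title` and `content[[1]]$text`, and `metadata` with `content`). A real response will likely carry more fields than this.

```r
# Hypothetical stand-in for content(rawData); the values are invented and
# only the fields used in this post are included.
mockData <- list(
  metadata = list(
    list(content = "first metadata value"),
    list(content = "a_word_file")
  ),
  answer_units = list(
    list(title = "Introduction",
         content = list(list(text = "Text of the first section."))),
    list(title = "Method",
         content = list(list(text = "Text of the second section.")))
  )
)

# Inspect the nesting without printing every value
str(mockData, max.level = 2)

# Each answer unit holds a title and a list of content blocks
firstTitle <- mockData$answer_units[[1]]$title
firstText  <- mockData$answer_units[[1]]$content[[1]]$text
```

Running `str()` on the real `docData` in the same way is a quick check that the response has the shape you expect before moving on.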

Step 2

Extract the title and content vectors and combine them into a data frame.

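# Pull the title and the text of the first content block out of each answer unit.
# Iterating over the list itself also works when there are no answer units.
units <- docData$answer_units
titels <- sapply(units, function(u) u$title)              # titles ("titels" is Dutch)
inhoud <- sapply(units, function(u) u$content[[1]]$text)  # text content ("inhoud" is Dutch)
docDF <- data.frame(titels, inhoud, stringsAsFactors = FALSE)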
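One thing to watch: if an answer unit has no title or no content, `sapply` returns a list instead of a character vector and the data frame comes out wrong. A possible way around that is sketched below. The helpers `or_na` and `units_to_df` and the column names `title` and `text` are hypothetical, chosen for this example only.

```r
# Hypothetical helper: first element of x as a string, or NA when absent
or_na <- function(x) {
  if (is.null(x) || length(x) == 0) NA_character_ else as.character(x)[1]
}

# Hypothetical helper: turn a list of answer units into a data frame,
# filling in NA where a title or a content block is missing
units_to_df <- function(units) {
  titles <- vapply(units, function(u) or_na(u$title), character(1))
  texts <- vapply(units, function(u) {
    if (length(u$content) == 0) NA_character_ else or_na(u$content[[1]]$text)
  }, character(1))
  data.frame(title = titles, text = texts, stringsAsFactors = FALSE)
}

# Invented example data: one complete unit, one without a title and with
# empty content, one without a content field at all
exampleUnits <- list(
  list(title = "A", content = list(list(text = "x"))),
  list(content = list()),
  list(title = "C")
)
exampleDF <- units_to_df(exampleUnits)
```

Using `vapply` with `character(1)` makes the function fail loudly if something other than a single string slips through, which is usually easier to debug than a silently malformed data frame.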

Step 3

Write the data frame to a CSV file.

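# Name the file after the content of the second metadata entry of the response
csvName <- paste0(docData$metadata[[2]]$content,".csv")
# row.names = FALSE keeps R's automatic row numbers out of the file
write.csv(docDF, csvName, row.names = FALSE)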
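Taking the file name straight from the metadata works as long as that entry exists and contains only characters that are valid in a file name. The sketch below shows one way to guard against that and to check the result by reading the file back. The helper `safe_csv_name` and its fallback name are hypothetical, and the small data frame is invented for the demo.

```r
# Hypothetical helper: build a CSV file name from a metadata value,
# falling back to a default when the value is missing or empty
safe_csv_name <- function(name, fallback = "converted_document") {
  if (is.null(name) || length(name) == 0 || is.na(name[1]) || !nzchar(name[1])) {
    name <- fallback
  }
  # Replace every run of characters outside letters, digits, dot,
  # underscore and hyphen with a single underscore
  name <- gsub("[^A-Za-z0-9._-]+", "_", name[1])
  paste0(name, ".csv")
}

# Invented data with the same column names as docDF
demoDF <- data.frame(titels = c("A", "B"),
                     inhoud = c("x", "y"),
                     stringsAsFactors = FALSE)

# Write to a temporary directory and read the file back as a check
demoPath <- file.path(tempdir(), safe_csv_name("My report: v2"))
write.csv(demoDF, demoPath, row.names = FALSE)
readBack <- read.csv(demoPath, stringsAsFactors = FALSE)
```

Here `safe_csv_name("My report: v2")` gives `"My_report_v2.csv"`, and `safe_csv_name(NULL)` gives `"converted_document.csv"`.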