Translate strings using the googleLanguageR package

Intro

The googleLanguageR package provides functions for translating strings within R through the Google Cloud Translation API. The translation quality is very good, although it cannot yet replace professional translation services. Where the package shines is in cutting the time needed to translate, for example, survey data or any other data whose character variables contain short text segments in an unwanted language. One disadvantage is that the service is paid, though it is quite affordable for tasks like the ones described in this article. To take full advantage of the package, you need to sign up for Google Cloud services and obtain API authorisation.

For more information on the process read here.

Make sure to load the following packages and to store your .json authorisation file in your working directory so that R can find it. We will use it later in the gl_auth() function.

packages <- c("googleLanguageR", "cld2", "tidyverse", "datasets")
invisible(lapply(packages, library, character.only = TRUE))  # attach each package by name

Load and prepare the dataset `sentences`

Here we will make use of the sentences dataset, which ships with the stringr package (attached as part of the tidyverse). Inspecting it shows a single character vector that is already in tidy form.

str(sentences)

##  chr [1:720] "The birch canoe slid on the smooth planks." ...

head(sentences)

## [1] "The birch canoe slid on the smooth planks." 
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."     
## [4] "These days a chicken leg is a rare dish."   
## [5] "Rice is often served in round bowls."       
## [6] "The juice of lemons makes fine punch."

In a real-life scenario, we would need to tidy up the variable and bring it into the desired form. As part of the current process I use the function first_letter_upper to capitalise the first letter of each sentence (and lowercase the rest), but before that we remove any leading, trailing or repeated whitespace with str_squish. These two steps cover part of what would be needed if the data were untidy.

first_letter_upper <- function(x) ifelse(is.na(x), NA, paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2))))

sentences_new <- head(sentences) %>% 
  # head() above limits the example to six sentences so that we do not translate the whole dataset; remove it for actual translations.
  sapply(., str_squish) %>%
  sapply(., first_letter_upper) %>%
  na.omit %>%
  data.frame(stringsAsFactors = FALSE)

(If your dataset has multiple columns you may use lapply() instead of sapply())
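As a rough illustration of the multi-column case, the sketch below applies the same clean-up to every column with lapply(). The data frame `survey` and the helper `squish()` are hypothetical: `squish()` is a base R stand-in for str_squish so that the example runs without extra packages.

```r
# Hypothetical two-column survey data with untidy strings and a missing value
survey <- data.frame(
  q1 = c("  the   birch canoe ", "glue THE sheet"),
  q2 = c(NA, "rice is   often served"),
  stringsAsFactors = FALSE
)

# Base R stand-in for stringr::str_squish(): trim and collapse repeated whitespace
squish <- function(x) gsub("\\s+", " ", trimws(x))

# Same helper as in the article
first_letter_upper <- function(x) {
  ifelse(is.na(x), NA, paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2))))
}

# lapply() walks over the columns; assigning into survey_clean[] keeps the data frame shape
survey_clean <- survey
survey_clean[] <- lapply(survey, function(col) first_letter_upper(squish(col)))

survey_clean
```

Assigning into `survey_clean[]` rather than `survey_clean` keeps the result a data frame with the original column names, which is convenient when the language tags are stored in an object of the same dimensions.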

Detect the current language of `sentences`

The gl_translate function of the googleLanguageR package only requires the target= language to be defined. However, since the Google Cloud credit is reduced with every character sent for translation, we can apply a few tricks to reduce our cost:

Avoid translating sentences that are already in your target language by first detecting the existing language with detect_language from the cld2 package, which runs locally, rather than with googleLanguageR.
Select only the rows that are in a language you wish to translate, rather than sending all of your elements.
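The two tricks above can be sketched as a logical mask over the language tags. The strings and tags below are hypothetical and hard-coded so that the example runs without cld2; in practice the tags would come from detect_language.

```r
# Hypothetical strings with language tags as a detector might return them
texts <- c(
  "The juice of lemons makes fine punch.",
  "Il riso viene spesso servito in ciotole rotonde.",
  "Rice is often served in round bowls."
)
lang <- c("en", "it", "en")

target <- "it"

# Keep only elements with a known language that differs from the target
to_translate <- !is.na(lang) & lang != target

# Number of characters that would actually be sent for translation
billable_chars <- sum(nchar(texts[to_translate]))

texts[to_translate]
billable_chars
```

Only the two English sentences are selected here, so the Italian one never reaches the API.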

In practice, the following code chunk returns an object of identical dimensions with the detected language annotations:

detected_language <- sentences_new %>% 
  sapply(., map_chr, detect_language) %>% 
  data.frame(check.names = FALSE)

Using the indices of the language-tagged elements, we can use logical operators either to exclude a language that we do not intend to translate or to select a single language to translate. As a first step, we add a (Translated) annotation to the elements of sentences_new that are going to be translated, so that we can always recognise which ones were translated by us:

(Here every element is tagged as English, so we expect all of them to be annotated.)

for (k in seq_len(nrow(sentences_new))) {
  # prefix every element of row k that was detected as English
  sentences_new[k, ][detected_language[k, ] == "en"] <-
    paste("(Translated)", sentences_new[k, ][detected_language[k, ] == "en"], sep = " ")
}

(The logical index already spans every column of row k, so the same loop also works when sentences_new has multiple columns; no outer loop over columns is needed.)
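The same annotation can also be written without a loop by indexing the data frame with a logical matrix. The sketch below rebuilds the six example sentences by hand (the column name `text` is an assumption) and then counts the characters that would be sent; the total agrees with the 314 characters reported in the translation log further down.

```r
# The six example sentences, rebuilt by hand; the column name `text` is hypothetical
sentences_new <- data.frame(
  text = c(
    "The birch canoe slid on the smooth planks.",
    "Glue the sheet to the dark blue background.",
    "It's easy to tell the depth of a well.",
    "These days a chicken leg is a rare dish.",
    "Rice is often served in round bowls.",
    "The juice of lemons makes fine punch."
  ),
  stringsAsFactors = FALSE
)

# Language tags of identical dimensions (all English in this example)
detected_language <- data.frame(text = rep("en", 6), stringsAsFactors = FALSE)

# Logical matrix marking the elements detected as English; NA tags are skipped
is_en <- !is.na(detected_language) & detected_language == "en"

# Annotate all marked elements in one step, across every column
sentences_new[is_en] <- paste("(Translated)", sentences_new[is_en])

# Characters that would be sent to the API
sum(nchar(sentences_new$text))  # 314
```

Counting characters like this before calling gl_translate gives a quick estimate of how much credit a run will use.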

Translate `sentences_new` to Italian

All sentences are in English, so we choose to translate them to Italian:

gl_auth("google_api_auth.json")

for (i in seq_len(ncol(sentences_new))) {
  # rows of column i that were detected as English (missing detections are skipped)
  is_en <- detected_language[, i] == "en" & !is.na(detected_language[, i])
  # replace them with their translation to target = "it"; the first column of the gl_translate() result holds the translated text
  sentences_new[, i][is_en] <- data.frame(gl_translate(sentences_new[, i][is_en], target = "it"))[, 1]

  # make sure textual placeholders for missing values become proper NA
  sentences_new[, i][sentences_new[, i] %in% c("NA", "N/a", "N/A", "Na", "na", "n/a", "not applicable")] <- NA
}

## 2020-09-28 16:30:21 -- Translating text: 314 characters -
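The placeholder clean-up inside the loop matches a fixed list of spellings. A slightly more forgiving variant, sketched here on hypothetical answers, compares a lowercased and trimmed copy of each value against a short list instead; this is an assumption about what the data may contain, not part of the original workflow.

```r
# Hypothetical translated answers containing several spellings of "not available"
answers <- c("Il riso", "N/A", "na", " Not Applicable ", NA, "Nazione")

# Placeholders to treat as missing, written once in lower case
placeholders <- c("na", "n/a", "not applicable")

# Compare a normalised copy; the values that are kept stay untouched
answers[tolower(trimws(answers)) %in% placeholders] <- NA

answers
```

Because the comparison is on whole values, a real answer that merely starts with "Na", such as "Nazione", is left alone.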

head(sentences_new)

##                                                                                                       .
## The birch canoe slid on the smooth planks.     (Tradotto) La canoa di betulla scivolò sulle assi lisce.
## Glue the sheet to the dark blue background.        (Tradotto) Incolla il foglio sullo sfondo blu scuro.
## It's easy to tell the depth of a well.            (Tradotto) È facile capire la profondità di un pozzo.
## These days a chicken leg is a rare dish.    (Tradotto) Oggigiorno una coscia di pollo è un piatto raro.
## Rice is often served in round bowls.        (Tradotto) Il riso viene spesso servito in ciotole rotonde.
## The juice of lemons makes fine punch.                   (Tradotto) Il succo dei limoni fa un bel pugno.
