3 ways of formatting information about dictionaries (books)

XML

#xmlParse('https://github.com/ebhtra/msds-607/blob/main/wk7_formats/dictionaries.xml')  # can't use https
(xml2df <- xmlToDataFrame('dictionaries.xml'))
##                                         title
## 1 diccionario Salamanca de la lengua española
## 2                American Heritage Dictionary
## 3                          Diccionari Escolar
##                                                               editors year
## 1 PilarPeña PérezMaríadel Rosario Calderón SotoMercedesEsteban García 2002
## 2                                 MarkBoyerPamelaDeVinneDoloresHarris 1991
## 3       MontserratMartín EnrileFerranBallester MateosLaiaCabal Guarro 2007
##                            langs edition
## 1                spanish-spanish       6
## 2                english-english       2
## 3 catalan-spanishspanish-catalan       2

Nothing like a batch of long Spanish names (and that’s only 3 editors per row) to show that this method is easy to call but needs cleanup afterwards: the text of sibling child elements is run together with no separator, as in “MarkBoyerPamelaDeVinneDoloresHarris”.
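
One way to keep the sibling elements apart is to walk the nodes directly instead of calling xmlToDataFrame. The following is only a minimal sketch: the inline XML and its element names (`dictionary`, `editors`, `editor`, `first`, `last`) are assumptions standing in for the real dictionaries.xml, whose structure is not shown here.

```r
library(XML)

# Hypothetical stand-in for dictionaries.xml; element names are assumed.
xml_text <- '<dictionaries>
  <dictionary>
    <title>American Heritage Dictionary</title>
    <editors>
      <editor><first>Mark</first><last>Boyer</last></editor>
      <editor><first>Pamela</first><last>DeVinne</last></editor>
      <editor><first>Dolores</first><last>Harris</last></editor>
    </editors>
    <year>1991</year>
  </dictionary>
</dictionaries>'

doc <- xmlParse(xml_text, asText = TRUE)

# Build one row per <dictionary>, joining the sibling <editor> nodes with ", "
rows <- lapply(getNodeSet(doc, "//dictionary"), function(d) {
  editors <- xpathSApply(d, "./editors/editor", function(e) {
    # Join the name parts of a single editor with a space
    paste(xpathSApply(e, "./*", xmlValue), collapse = " ")
  })
  data.frame(
    title   = xpathSApply(d, "./title", xmlValue),
    editors = paste(editors, collapse = ", "),
    year    = as.integer(xpathSApply(d, "./year", xmlValue)),
    stringsAsFactors = FALSE
  )
})
xml_df <- do.call(rbind, rows)
xml_df
```

The same pattern would apply to the languages element, with the XPath adjusted to whatever the file actually calls its child nodes.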

HTML

Url <- 'dictionaries.html'  # Again, https URL not working for this
readHTMLTable(Url, encoding = "UTF-8")[[1]][,]
##                                         title
## 1 diccionario Salamanca de la lengua española
## 2                American Heritage Dictionary
## 3                          Diccionari Escolar
##                                                                      editors
## 1 Pilar Peña Pérez, María del Rosario Calderón Soto, Mercedes Esteban García
## 2                                 Mark Boyer, Pamela DeVinne, Dolores Harris
## 3       Montserrat Martín Enrile, Ferran Ballester Mateos, Laia Cabal Guarro
##                          languages year edition
## 1                  spanish-spanish 2002       6
## 2                  english-english 1991       2
## 3 catalan-spanish, spanish-catalan 2007       2

This one looks a lot nicer, but only because I merged the editors, and likewise the languages, into single comma-separated cells. HTML seemed to force that on me when I was constructing the table.
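
The https limitation noted in the code comment can probably be sidestepped by letting base R fetch the page and handing the text to the parser. This is a sketch under assumptions, not something tested against the original files: `read_table_from_url` is a hypothetical helper name, and it relies on `readLines()` being able to open https URLs, which is the case for recent R builds with libcurl support.

```r
library(XML)

# Hypothetical helper: fetch a page as one string, then parse it from text
# so that the XML package never has to open the https connection itself.
read_table_from_url <- function(url, which = 1) {
  page <- paste(readLines(url, warn = FALSE, encoding = "UTF-8"),
                collapse = "\n")
  doc <- htmlParse(page, asText = TRUE, encoding = "UTF-8")
  readHTMLTable(doc, which = which, stringsAsFactors = FALSE)
}
```

The helper accepts a local path as well as a URL. For files hosted on GitHub, the URL would need to be the raw-content form (as used for the JSON file below), since a `github.com/.../blob/...` address returns GitHub's own page rather than the file.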

JSON

jd <- fromJSON("https://raw.githubusercontent.com/ebhtra/msds-607/main/wk7_formats/dictionaries.json")
jd <- data.frame(jd)
jd
##                            Dictionaries.title
## 1 diccionario Salamanca de la lengua española
## 2                American Heritage Dictionary
## 3                          Diccionari Escolar
##                                                         Dictionaries.editors
## 1 Pilar Peña Pérez, María del Rosario Calderón Soto, Mercedes Esteban García
## 2                                 Mark Boyer, Pamela DeVinne, Dolores Harris
## 3       Montserrat Martín Enrile, Ferran Ballester Mateos, Laia Cabal Guarro
##             Dictionaries.languages Dictionaries.year Dictionaries.edition
## 1                  spanish-spanish              2002                    6
## 2                  english-english              1991                    2
## 3 catalan-spanish, spanish-catalan              2007                    2

Remove the column prefixes (‘Dictionaries’ was the outer key in the JSON file)

names(jd) <- sub("^Dictionaries\\.", "", names(jd))  # strip the 'Dictionaries.' prefix
kbl(jd)  # View doesn't knit so use kable to show lists
| title | editors | languages | year | edition |
|---|---|---|---|---|
| diccionario Salamanca de la lengua española | Pilar Peña Pérez , María del Rosario Calderón Soto, Mercedes Esteban García | spanish-spanish | 2002 | 6 |
| American Heritage Dictionary | Mark Boyer , Pamela DeVinne, Dolores Harris | english-english | 1991 | 2 |
| Diccionari Escolar | Montserrat Martín Enrile, Ferran Ballester Mateos , Laia Cabal Guarro | catalan-spanish, spanish-catalan | 2007 | 2 |

The JSON version is a little less clean, because the lists don’t unpack automatically into the data frame. The XML library was convenient, but it made it difficult to read from an “https” URL. HTML had the same problem, at least with the methods I used, though its output was cleaner, since the data was tabular to begin with, by the problem’s definition. It seemed fairly random which of the methods for each format needed “UTF-8” encoding specified to handle the Spanish accents, and some methods didn’t allow it at all.
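
The list columns that JSON leaves behind can be collapsed into plain strings in one pass. The sketch below assumes the jsonlite package (the text does not say which `fromJSON` was used) and works on an inline sample mirroring two of the records, not on the real file. `simplifyMatrix = FALSE` is set so that equal-length arrays stay as list columns instead of becoming a matrix.

```r
library(jsonlite)

# Inline sample mirroring two records; the exact layout of the real
# dictionaries.json is an assumption.
json_text <- '{"Dictionaries": [
  {"title": "American Heritage Dictionary",
   "editors": ["Mark Boyer", "Pamela DeVinne", "Dolores Harris"],
   "languages": ["english-english"], "year": 1991, "edition": 2},
  {"title": "Diccionari Escolar",
   "editors": ["Montserrat Martín Enrile", "Ferran Ballester Mateos",
               "Laia Cabal Guarro"],
   "languages": ["catalan-spanish", "spanish-catalan"],
   "year": 2007, "edition": 2}
]}'

jd2 <- fromJSON(json_text, simplifyMatrix = FALSE)$Dictionaries

# Collapse every list column into one comma-separated string per row
is_list_col <- vapply(jd2, is.list, logical(1))
jd2[is_list_col] <- lapply(jd2[is_list_col], function(col) {
  vapply(col, paste, character(1), collapse = ", ")
})
jd2
```

Pulling out `$Dictionaries` before building the frame also avoids the column prefixes, so no renaming step is needed in this variant.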