
library(ggplot2)
library(tidyverse)
library(dplyr)
library(tidyr)
library(gridExtra)
library(grid)
library(DT)
library(data.table)
library(httr)
library(tsibble)
library(tibble)
library(xml2)
library(RColorBrewer)


# gridExtra for grid.arrange()
# grid for grid graphics, such as table grobs
# DT for datatable()
# httr to retrieve data from the internet
# tibble for as_tibble(); as.tibble() is deprecated

Dee Chiluiza
30 April, 2021
Working with XML data sets
Practice file

• Note 1: This is a draft file in progress.
• Note 2: Click the “Code” buttons to show the code.
• Note 3: I am creating this practice file as part of my learning process, following the LinkedIn Learning course Master R for Data Science.
• Note 4: This is not a final version; new changes will be included.

XML: Extensible Markup Language.
XML was developed in the 1990s to address a limitation of HTML: HTML did not allow authors to define new text elements. XML does, and in this sense it is extensible (Hemmendinger, 2000).
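A minimal sketch of what "extensible" means in practice, assuming the xml2 package loaded at the top of this file. The tag names in the snippet are invented for illustration; they are not claimed to match any real data set.

```r
library(xml2)

# Hypothetical document: the tag names below are chosen by whoever writes the
# XML, which is what HTML's fixed set of tags does not allow.
snippet <- "<taxes><row><county>Adair County</county><salestax>0.056</salestax></row></taxes>"

doc <- read_xml(snippet)   # read_xml() also accepts literal XML text

root_name   <- xml_name(doc)                               # name of the root element
child_names <- xml_name(xml_children(xml_children(doc)))   # elements inside <row>
county_text <- xml_text(xml_find_first(doc, ".//county"))  # text of the first <county>

print(root_name)
print(child_names)
print(county_text)
```

Because the element names carry the meaning of the data, a reader of the file can tell which value is a county and which is a tax rate without any external documentation.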
Information in progress…

First data set: From LinkedIn course and Missouri data portal.
The R chunk below stores the link to the data set, reads it with read_xml(), and converts the result to a list with as_list().

# R Chunk #1 
URL = "https://data.mo.gov/api/views/vpge-tj3s/rows.xml" 

URLdata = 
  URL %>%
  read_xml() %>%
  as_list()

The data set is very long and, as the image below shows, it is structured as a list. Each panel shows only a very small portion of the long lists obtained by the different methods. (A) The raw data as observed by direct printing. (B) The object URLdata (R chunk #1) now appears in the Environment pane; clicking its name opens a new view in the source panel that shows the different hierarchical levels. (C) The str() function gives additional information, but its output is also very long and difficult to follow.
We will break up those hierarchies to construct a table (rectangular format).
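One way to keep str() readable on a deeply nested list is to limit how many levels it prints. The sketch below uses a small hypothetical stand-in for URLdata, not the real download. The top-level name "response" and the two records are assumptions chosen to mimic the nesting seen in the outputs of this file.

```r
# Hypothetical stand-in for URLdata: same nesting idea, only two records.
# The top-level name "response" is an assumption.
toy <- list(response = list(row = list(
  row = list(county = list("Adair County"),  salestax = list("0.056")),
  row = list(county = list("Andrew County"), salestax = list("0.0592"))
)))

# Print only the first two levels of the hierarchy instead of everything
str(toy, max.level = 2)

# Targeted questions are often easier to read than a full str() dump
n_records     <- length(toy$response$row)      # how many records there are
record_fields <- names(toy$response$row[[1]])  # fields inside the first record

print(n_records)
print(record_fields)
```

On the real object, the same calls (with the actual top-level name) give a quick overview before any reshaping is attempted.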

To break up the hierarchies, we will convert the data set into a tibble and then apply unnest_wider() and unnest_longer() several times. Follow the order of the steps below.
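Before running the steps on the full data set, the whole sequence can be tried on a tiny example. This is a sketch under stated assumptions: the list toy is hypothetical and only imitates the nesting of the real data, and unlist() is used for the final conversion as a simplification, which is not necessarily the approach used on the real data.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical nested list with two records; names mimic the real data
toy <- list(response = list(row = list(
  row = list(county = list("Adair County"),  salestax = list("0.056")),
  row = list(county = list("Andrew County"), salestax = list("0.0592"))
)))

toy_table <-
  tibble(taxes = toy) %>%   # one row, one list column
  unnest_wider(taxes) %>%   # wider: one column per child (here: row)
  unnest_longer(row) %>%    # longer: one row per record, adds row_id
  unnest_wider(row) %>%     # wider: county and salestax as separate columns
  select(-row_id) %>%
  mutate(
    county   = unlist(county, use.names = FALSE),               # list -> character
    salestax = as.numeric(unlist(salestax, use.names = FALSE))  # list -> numeric
  )

print(toy_table)
```

The pattern is to widen when an element holds differently named fields and to lengthen when it holds many records of the same kind.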

Observe the outcome of each object:
In the R chunks below, several objects are created in sequence. Printing each one shows the steps the data went through to reach the desired table.

URLdata1: Convert data into a tibble.

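# Convert to a tibble
# Note: piping URLdata into tibble() also passes it as the first, unnamed
# argument, which is why an extra column named `.` appears next to taxes.
# That column is dropped in the next step by select(row).
URLdata1 = 
  URLdata %>%
  tibble(taxes = URLdata) %>% 
  print()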

## # A tibble: 1 x 2
##   .                taxes           
##   <named list>     <named list>    
## 1 <named list [1]> <named list [1]>

URLdata2: First unnest_wider() level, to obtain the list of counties and taxes.

# Use the tibble for the first unnest level: get the list of counties and taxes.
# Keep only the column named row.
URLdata2 =
  URLdata1 %>%
  unnest_wider(taxes) %>%
  select(row) %>% 
  print()

## # A tibble: 1 x 1
##   row                 
##   <list>              
## 1 <named list [2,129]>

URLdata3: unnest_longer() to put each county and tax pair in its own row.

# Unnest longer to put each county and tax pair in its own row
URLdata3 = 
  URLdata2 %>%
  unnest_longer(row) %>% 
  print()

## # A tibble: 2,129 x 2
##    row              row_id
##    <named list>     <chr> 
##  1 <named list [2]> row   
##  2 <named list [2]> row   
##  3 <named list [2]> row   
##  4 <named list [2]> row   
##  5 <named list [2]> row   
##  6 <named list [2]> row   
##  7 <named list [2]> row   
##  8 <named list [2]> row   
##  9 <named list [2]> row   
## 10 <named list [2]> row   
## # ... with 2,119 more rows

URLdata4: unnest_wider() again to separate county and tax into their own columns, then remove the column row_id. The values in each column still appear as lists.

# Unnest wider again to separate county and tax, then remove the column row_id.
URLdata4 =
  URLdata3 %>%
  unnest_wider(row) %>%
  select(-row_id) %>% 
  print()

## # A tibble: 2,129 x 2
##    county     salestax  
##    <list>     <list>    
##  1 <list [1]> <list [1]>
##  2 <list [1]> <list [1]>
##  3 <list [1]> <list [1]>
##  4 <list [1]> <list [1]>
##  5 <list [1]> <list [1]>
##  6 <list [1]> <list [1]>
##  7 <list [1]> <list [1]>
##  8 <list [1]> <list [1]>
##  9 <list [1]> <list [1]>
## 10 <list [1]> <list [1]>
## # ... with 2,119 more rows

URLdata5: Convert the list columns to character and numeric. Note the use of str_squish() to remove repeated white space in the county column (Wickham, n.d.).

# Convert from lists to character and numeric.
# str_squish() (from stringr, loaded with the tidyverse) removes repeated
# white space in the county column.

URLdata5 = 
  URLdata4 %>%
  unnest(county) %>%
  unnest(salestax) %>%
  mutate(salestax = as.numeric(salestax)) %>%
  mutate(county = str_squish(county)) %>% 
  print()

## # A tibble: 2,129 x 2
##    county                                         salestax
##    <chr>                                             <dbl>
##  1 Adair County                                     0.056 
##  2 Andrew County                                    0.0592
##  3 Andrew County Andrew County Ambulance District   0.0642
##  4 Atchison County                                  0.0648
##  5 Audrain County                                   0.0635
##  6 Audrain County Audrain Ambulance District        0.0685
##  7 Audrain County Van Far Ambulance District        0.0685
##  8 Barry County                                     0.0572
##  9 Barton County                                    0.0572
## 10 Bates County                                     0.0522
## # ... with 2,119 more rows
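To see what str_squish() does on its own, here is a small sketch. The input string is hypothetical and only imitates the kind of stray white space that can come out of an XML text node. str_trim() is shown for comparison.

```r
library(stringr)

# Hypothetical raw value with leading, trailing, and repeated internal white space
raw_county <- "  Andrew County \n   Andrew County Ambulance District  "

trimmed  <- str_trim(raw_county)    # removes white space at both ends only
squished <- str_squish(raw_county)  # also collapses internal runs to one space

print(trimmed)
print(squished)
```

str_trim() leaves the line break and extra spaces in the middle of the string, while str_squish() returns a single clean line, which is why it suits the county names here.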

Plot the data using a histogram, a density plot, and a box plot

# {r histogram, dev="png", fig.show="hide"}

URLdata5 %>%
  ggplot(aes(salestax))+
  geom_histogram(binwidth = 0.002,
                 color="white",
                 fill="#A11515")+
  theme_minimal()+
  labs(x="Sales tax per county (bin width = 0.002)")

Histogram of sales tax.

# {r density, dev="png", fig.show="hide"}
URLdata5 %>%
  ggplot(aes(salestax))+
  geom_density()+
  theme_minimal()+
  labs(x="Sales tax per county")

Density plot of sales tax.

# {r boxplot, dev="png", fig.show="hide"}
URLdata5 %>%
  ggplot(aes(salestax))+
  geom_boxplot(color="#2C32C3",
               fill="yellow")+
  theme_minimal()+
  labs(x="Sales tax per county")

Box plot of sales tax.

Summarize data

URLdata5 %>% select(salestax) %>% summary()

##     salestax      
##  Min.   :0.04725  
##  1st Qu.:0.06225  
##  Median :0.07225  
##  Mean   :0.07257  
##  3rd Qu.:0.08350  
##  Max.   :0.10679
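The summary values can be connected back to the plots with a little arithmetic. The sketch below copies the rounded numbers from the summary output above, so the results are approximate. The 1.5 x IQR rule is the usual box plot convention for whiskers, and the bin count is only a rough estimate, since the exact number depends on how the bins are aligned.

```r
# Values copied from the summary() output above (rounded)
tax_min <- 0.04725
tax_q1  <- 0.06225
tax_q3  <- 0.08350
tax_max <- 0.10679

# Interquartile range and the 1.5 x IQR fences used for box plot whiskers
iqr         <- tax_q3 - tax_q1       # about 0.02125
lower_fence <- tax_q1 - 1.5 * iqr    # about 0.030375
upper_fence <- tax_q3 + 1.5 * iqr    # about 0.115375

# If both extremes fall inside the fences, no points are flagged as outliers
no_outliers <- tax_min >= lower_fence && tax_max <= upper_fence

# Rough number of histogram bins for a bin width of 0.002
approx_bins <- ceiling((tax_max - tax_min) / 0.002)

print(c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence))
print(no_outliers)
print(approx_bins)
```

With these rounded values, the minimum and maximum both lie inside the fences, so the whiskers would extend to the extremes, and the histogram spans roughly 30 bins.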

References:

Poulson, B. 2020. R Essential Training: Wrangling and Visualizing Data: Working with XML data. LinkedIn Learning.
https://www.linkedin.com/learning/r-essential-training-wrangling-and-visualizing-data/working-with-xml-data?contextUrn=urn%3Ali%3AlyndaLearningPath%3A5a7dfaf2498ef27b4beaf666&u=74653650
Holtz, Y. 2018. Pimp my RMD: a few tips for R Markdown. GitHub.
https://holtzy.github.io/Pimp-my-rmd
Hemmendinger, D. 2000. XML computer language. Britannica.
https://www.britannica.com/technology/XML
Averick, M. 2018. stringr 1.3.0. Tidyverse.
https://www.tidyverse.org/blog/2018/02/stringr-1-3-0/
Wickham, H. n.d. Rectangling. tidyr.
https://tidyr.tidyverse.org/articles/rectangle.html