Source file ⇒ 2017-lec22.Rmd

Today

  1. Introduction to XML
  2. Tree structure of an XML document
  3. Writing an XML document
  4. Make a KML document for geographic data (Google Earth)

1. Introduction to Extensible Markup Language (XML)

You may wish to download a text editor for your computer so you can view XML files. There are many good choices.

One possibility for Mac, Windows and Linux users is Sublime Text.

Extensible: Easy to expand (add new nodes)
Markup Language: a system of annotating a document with tags

Here is an example of an XML file on apricot producers from the UN (you can view it in Sublime Text if you like):

http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=xml&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc

Here are the first lines of the CSV file on apricots from last time:

Here is an element of the XML file representing the first row of the above CSV file after the header:

XML is a standard for hierarchical representation of data. For example, your data might consist of a number of records, where each record consists of a number of fields (country, element code, year, value).

XML was designed to carry data, with a focus on what the data is.

HTML was designed to display data, with a focus on how the data looks.

Some positive aspects of XML are

Some negative aspects are

XML has become quite popular in many scientific fields, and it is standard in many web applications for the exchange and visualization of data. We’ll learn how to

  1. create it from within R, and
  2. read/process it from within R

Anatomy of an XML document

The basic unit of XML code is called an element or node. It is made up of both markup and content. Markup consists of tags, attributes, and comments.

Here <field name="Country or Area">Afghanistan</field> is an element (or node)

<field name="Country or Area"> </field> is a tag

An empty tag would be <field></field> or <field/>

name="Country or Area" is an attribute. An attribute’s value is always in quotes.

<!-- A comment looks like this --> A comment can appear almost anywhere in the document, as long as it is not inside a tag.

Content is Afghanistan.
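
To make the parts of an element concrete, here is a small sketch that parses the example element from a string and pulls out its tag name, attribute, and content. It uses xmlParse, xmlRoot, xmlName, xmlGetAttr, and xmlValue from the XML package; reading XML is covered later, so treat this as a preview rather than the method used in this lecture.

```r
library(XML)

# Parse the one-element example from a string rather than from a file.
txt <- '<field name="Country or Area">Afghanistan</field>'
doc <- xmlParse(txt, asText = TRUE)
node <- xmlRoot(doc)

tag <- xmlName(node)                    # the tag name
attr_value <- xmlGetAttr(node, "name")  # the value of the name attribute
content <- xmlValue(node)               # the text content

print(c(tag = tag, attribute = attr_value, content = content))
```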

XML is well-formed if it obeys certain syntax rules. The rules for tags are

1. Tag names are case-sensitive; start and end tags must match exactly.
2. No spaces are allowed between the < and the tag name.
3. Tag names must begin with a letter or underscore and may contain only letters, digits, hyphens, underscores, and periods.
4. An element must have both an open and closing tag unless it is empty.
5. An empty element that does not have a closing tag must be of the form <tagname/>.

<element></element>

or

<element/>
6. Tags must nest properly. (Inner tags must close before outer ones.)
<root>
  <child>
    <subchild>.....</subchild>
  </child>
</root>
7. All attribute values must appear in quotes (single or double), in name="value" format.
gender='female'
8. There are 5 predefined entity references in XML:
&lt;    <   less than
&gt;    >   greater than
&amp;   &   ampersand 
&apos;  '   apostrophe
&quot;  "   quotation mark
9. All XML documents must contain a root node (it doesn’t need to be called root) that contains all the other nodes.
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Do not forget me this weekend!</body>
</note>

10. (Optional) The line

<?xml version="1.0"?>

is called the XML prolog. It is like the shebang in scripts, announcing that the data is XML.
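
One way to check the rules above is to hand a string to an XML parser and see whether it complains. The sketch below assumes that xmlParse from the XML package raises an R error on XML that is not well-formed; is_well_formed is a hypothetical helper name, not part of the package.

```r
library(XML)

# Hypothetical helper: TRUE if the string parses as XML, FALSE otherwise.
# Assumes xmlParse signals an R error for XML that is not well-formed.
is_well_formed <- function(txt) {
  tryCatch({
    xmlParse(txt, asText = TRUE)
    TRUE
  }, error = function(e) FALSE)
}

good        <- "<note><to>Tove</to></note>"
bad_case    <- "<note><To>Tove</to></note>"  # rule 1: start and end tags differ in case
bad_nesting <- "<a><b>text</a></b>"          # rule 6: inner tag closes after outer
bad_quotes  <- "<note date=2008></note>"     # rule 7: attribute value not quoted
bad_amp     <- "<a>Q & A</a>"                # rule 8: & should be written &amp;

results <- sapply(
  c(good = good, case = bad_case, nesting = bad_nesting,
    quotes = bad_quotes, ampersand = bad_amp),
  is_well_formed
)
print(results)
```

The parser may also print its own error messages for the four bad strings; those messages usually name the rule that was broken.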

Attributes versus Elements

If you start using attributes as containers for XML data, you might end up with documents that are both difficult to maintain and to manipulate. You should use elements to describe your data. Use attributes only to provide information that is not relevant to the reader.

The following three XML documents contain exactly the same information.

A date attribute is used in the first example (BAD):

<note date="2008-01-10">
  <to>Tove</to>
  <from>Jani</from>
</note>

A date element is used in the second example (BETTER):

<note>
  <date>2008-01-10</date>
  <to>Tove</to>
  <from>Jani</from>
</note>

An expanded date element is used in the third example (BEST):

<note>
  <date>
    <year>2008</year>
    <month>01</month>
    <day>10</day>
  </date>
  <to>Tove</to>
  <from>Jani</from>
</note>

Some things to consider when using attributes are:

* Attributes shouldn’t contain multiple values.

Don’t do this (unreadable):

<note day="10" month="01" year="2008"
to="Tove" from="Jani" heading="Reminder"
body="Don't forget me this weekend!">
</note>

* Attributes cannot contain tree structures (elements can).
* Attributes are not easily expandable for future changes (elements are).
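
To see why the expanded date element is convenient, here is a sketch that parses the third (BEST) note and reads each part of the date by its path. It uses xpathSApply from the XML package, which belongs to the reading/processing material covered later, so this is only an illustration of the idea.

```r
library(XML)

txt <- '<note>
  <date>
    <year>2008</year>
    <month>01</month>
    <day>10</day>
  </date>
  <to>Tove</to>
  <from>Jani</from>
</note>'
doc <- xmlParse(txt, asText = TRUE)

# Each part of the date is its own node, so it can be reached by its path
# without splitting a string.
year  <- xpathSApply(doc, "/note/date/year", xmlValue)
month <- xpathSApply(doc, "/note/date/month", xmlValue)
day   <- xpathSApply(doc, "/note/date/day", xmlValue)

note_date <- as.Date(paste(year, month, day, sep = "-"))
print(note_date)
```

With the attribute version, the whole date arrives as one string and has to be split by hand before the year, month, or day can be used.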

2. Tree structure of an XML document

In-class exercise

Do example 1a

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-22-collection/

3. Creating an XML document from within R

First we will see how to create an XML document from within R. Later we will see how to read/process XML from within R.

It will be helpful to have in your mind the structure of the XML document (i.e., the XML tree) before you do anything in R, especially when you’re creating a new XML document.

You will need the XML package (so type install.packages("XML") in the console) and the functions newXMLDoc, newXMLNode, and saveXML.

A simplified version of the syntax for newXMLNode is:

newXMLNode(name, ..., attrs = NULL, doc = NULL, parent = NULL)

where

name is the tag name of the node
... is the content of the node (for example, a text string)
attrs is a named character vector of attributes
doc is the newXMLDoc object that the node belongs to
parent is the newXMLNode object that is the parent of this node
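
As a small illustration of the name, content, and attrs arguments, the sketch below rebuilds the apricot element from section 1. The node is created on its own, without a document or parent, just to keep the example short.

```r
library(XML)

# attrs takes a named character vector: each name becomes an attribute name
# and each value becomes that attribute's value.
node <- newXMLNode("field", "Afghanistan",
                   attrs = c(name = "Country or Area"))

# saveXML on a single node returns its XML text.
xml_text <- saveXML(node)
cat(xml_text, "\n")
```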

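Simple example: a document whose root node toplevel has a child level1, which in turn has a child level2 holding some text content.

Here are the R commands to create it: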

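library(XML)

# Create an empty document, then build the tree from the root down.
doc <- newXMLDoc()
root <- newXMLNode("toplevel", doc = doc)
child1 <- newXMLNode("level1", parent = root)
# Leaf node: the second argument is the text content.
newXMLNode("level2", "This is the content", parent = child1)
# Write the document to a file.
saveXML(doc, file = "~/Desktop/simple.xml")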

Note: I only need to store (assign to a variable) the nodes that I later need to refer to as parents. For the leaf nodes, I just use newXMLNode without assigning. The names of the nodes in R (e.g., root, child1) are not part of the resulting XML file.
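
A quick way to check your work is to print the document in the console and to read the saved file back in. The sketch below writes to a temporary file (an assumption, chosen so that it runs on any machine) instead of the Desktop, and uses xmlParse and getNodeSet, which are covered later.

```r
library(XML)

doc <- newXMLDoc()
root <- newXMLNode("toplevel", doc = doc)
child1 <- newXMLNode("level1", parent = root)
newXMLNode("level2", "This is the content", parent = child1)

# With no file argument, saveXML returns the document as a string.
cat(saveXML(doc))

# Save to a temporary file, then parse the file back in.
path <- tempfile(fileext = ".xml")
saveXML(doc, file = path)
doc2 <- xmlParse(path)

# Follow the tree from the root down to the leaf and read its content.
root_name <- xmlName(xmlRoot(doc2))
leaf <- xmlValue(getNodeSet(doc2, "/toplevel/level1/level2")[[1]])
print(c(root = root_name, leaf = leaf))
```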

In-class exercise

Do example 1b

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-22-collection/

4. Keyhole Markup Language (KML)

KML is a type of XML file specifically for geographic information / visualization (i.e. for use in Google Earth, Google Maps, etc). XML is a broader category of markup that includes KML.

KML, like HTML, has predefined tags (for example, <Document> or <Placemark>). Here is the definitive introduction to KML: https://developers.google.com/kml/documentation/kml_tut#for-more-information To find out about KML tags, refer to the KML reference link on that website.

Here is a simple example of a KML file.

<?xml version="1.0"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
 <Document>
  <Placemark>
   <name>New York City</name>
   <description>New York City</description>
   <Point>
     <coordinates>-74.006393,40.714172,0</coordinates>
   </Point>
  </Placemark>
 </Document>
</kml>
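
The coordinates element holds longitude, latitude, and altitude, in that order, separated by commas. As a sketch of how that reads from R, the code below parses the KML above from a string and splits the coordinates. Because the kml node declares a default namespace, the XPath query needs a prefix bound to it; the prefix k is an arbitrary choice.

```r
library(XML)

kml_text <- '<?xml version="1.0"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
 <Document>
  <Placemark>
   <name>New York City</name>
   <description>New York City</description>
   <Point>
     <coordinates>-74.006393,40.714172,0</coordinates>
   </Point>
  </Placemark>
 </Document>
</kml>'
doc <- xmlParse(kml_text, asText = TRUE)

# Bind the prefix "k" to the KML namespace so XPath can find the nodes.
ns <- c(k = "http://www.opengis.net/kml/2.2")
coords <- xpathSApply(doc, "//k:coordinates", xmlValue, namespaces = ns)

# Split "lon,lat,alt" into three numbers.
parts <- as.numeric(strsplit(coords, ",")[[1]])
names(parts) <- c("longitude", "latitude", "altitude")
print(parts)
```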

Steps to view a KML file in Google Earth:

  1. Copy the KML file above into Sublime Text and save it as simple.kml.
  2. Download Google Earth: https://www.google.com/earth/download/ge/agree.html
  3. Open simple.kml with Google Earth.

Some special features of KML:

The root node is called <kml> and has a special attribute called an XML namespace (xmlns), as in <kml xmlns="http://www.opengis.net/kml/2.2">, to indicate that it is a KML file.

The child of the <kml> node is called <Document> and its child is called <Placemark>.

A Placemark is one of the most commonly used features in Google Earth. It marks a position on the Earth’s surface, using a yellow pushpin as the icon. The simplest Placemark includes only a <Point> element, which specifies the location of the Placemark. You can specify a name and a custom icon for the Placemark, and you can also add other geometry elements to it.

Placemark has children including <Point>, <TimeStamp>, <description>, etc. For example, if you want to put a <TimeStamp> at each Placemark, you need to follow the syntax described here: https://developers.google.com/kml/documentation/kmlreference#timestamp
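
As a sketch of what that looks like from R: in the KML reference, a TimeStamp holds a single child called when that contains the date. The code below builds one Placemark with a TimeStamp. The date is made up for illustration, and the kml and Document nodes are left out to keep the example short.

```r
library(XML)

doc <- newXMLDoc()
pm <- newXMLNode("Placemark", doc = doc)
newXMLNode("name", "New York City", parent = pm)

# TimeStamp has one child, when, that holds the date.
ts <- newXMLNode("TimeStamp", parent = pm)
newXMLNode("when", "2008-01-10", parent = ts)  # made-up example date

pt <- newXMLNode("Point", parent = pm)
newXMLNode("coordinates", "-74.006393,40.714172,0", parent = pt)

cat(saveXML(doc))

# Read the date back to confirm where it sits in the tree.
when_value <- xpathSApply(doc, "/Placemark/TimeStamp/when", xmlValue)
```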

Making a KML file

Steps:

  1. Load the XML package.
library(XML)
  2. Every KML file starts with
doc <- newXMLDoc()
root <- newXMLNode("kml", namespaceDefinitions = "http://www.opengis.net/kml/2.2", doc = doc)
  3. Diagram the rest of the tree structure.
#KML tree

                    kml
                      |
                    Document
                      |
                    Placemark
                      |
            name - description  -  Point
                                    |
                                coordinates

  4. Use newXMLNode to create the Document node and its descendants.

docmt <- newXMLNode("Document", parent = root)
pm <- newXMLNode("Placemark", parent = docmt)
name <- newXMLNode("name", "New York City", parent = pm)
description <- newXMLNode("description", "New York City", parent = pm)
pt <- newXMLNode("Point", parent = pm)
newXMLNode("coordinates", "-74.006393,40.714172,0", parent = pt)
  5. Save the KML document.

saveXML(doc, file = "/Users/Adam/Desktop/simple.kml")
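
The same steps extend to several Placemarks by looping over the rows of a data frame. The sketch below is one possible way to do it, not the only one. The cities data frame and its approximate coordinates are made up for illustration, and the file is written to a temporary location rather than the Desktop.

```r
library(XML)

# Hypothetical data: two cities with approximate longitude and latitude.
cities <- data.frame(
  name = c("New York City", "Berkeley"),
  lon  = c(-74.006393, -122.272747),
  lat  = c(40.714172, 37.871593),
  stringsAsFactors = FALSE
)

kml_ns <- "http://www.opengis.net/kml/2.2"
doc <- newXMLDoc()
root <- newXMLNode("kml", namespaceDefinitions = kml_ns, doc = doc)
docmt <- newXMLNode("Document", parent = root)

# One Placemark per row; only pm and pt are stored because they are parents.
for (i in seq_len(nrow(cities))) {
  pm <- newXMLNode("Placemark", parent = docmt)
  newXMLNode("name", cities$name[i], parent = pm)
  pt <- newXMLNode("Point", parent = pm)
  # KML expects longitude,latitude,altitude with no spaces.
  newXMLNode("coordinates",
             paste(cities$lon[i], cities$lat[i], 0, sep = ","),
             parent = pt)
}

path <- tempfile(fileext = ".kml")
saveXML(doc, file = path)

# Parse the saved file and count the Placemarks as a check.
doc2 <- xmlParse(path)
n_placemarks <- length(getNodeSet(doc2, "//k:Placemark",
                                  namespaces = c(k = kml_ns)))
print(n_placemarks)
```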

In-class exercise

Do example 2a

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-22-collection/