Source file ⇒ 2017-lec22.Rmd
You may wish to download a text editor for your computer so you can view XML files. There are many good choices.
One possibility for Mac, Windows, and Linux users is Sublime Text.
Extensible: Easy to expand (add new nodes)
Markup Language: a system of annotating a document with tags
Here is an example of an XML file on apricot producers from the UN (you can view it in Sublime Text if you like)
Here are the first lines of the CSV file on apricots from last time:
Here is an element of the XML file representing the first row of the above CSV file after the header:
XML is a standard for the hierarchical representation of data. For example, your data might consist of a number of records, where each record consists of a number of fields (country, element code, year, value).
XML was designed to carry data - with focus on what data is
HTML was designed to display data - with focus on how data looks
Some positive aspects of XML are
Some negative aspects are
XML has become quite popular in many scientific fields, and it is standard in many web applications for the exchange and visualization of data. We’ll learn how to
The basic unit of XML code is called an element or node. It is made up of both markup and content. Markup consists of tags, attributes, and comments.
Here <field name="Country or Area">Afghanistan</field> is an element (or node).
<field name="Country or Area"> </field> is a tag (an opening tag and its matching closing tag).
An empty tag would be <field></field> or <field/>.
name="Country or Area" is an attribute. An attribute’s value is always in quotes.
<!-- A comment looks like this --> Comments can appear anywhere outside of other markup.
The content is Afghanistan.
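As a quick check of the vocabulary above, the same element can be parsed in R and taken apart into its tag name, attribute, and content. This is a minimal sketch, assuming the XML package is installed; the variable names are only for illustration.

```r
library(XML)

# Parse the single element from the text above (asText = TRUE: the input is a string, not a file)
doc <- xmlParse('<field name="Country or Area">Afghanistan</field>', asText = TRUE)
node <- xmlRoot(doc)

tag_name  <- xmlName(node)             # the tag name: "field"
attr_name <- xmlGetAttr(node, "name")  # the attribute value: "Country or Area"
content   <- xmlValue(node)            # the content: "Afghanistan"
```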
XML is well-formed if it obeys certain syntax rules. The rules for tags are
1. Tag names are case-sensitive; start and end tags must match exactly.
2. No spaces are allowed between the < and the tag name.
3. Tag names must begin with a letter or underscore and may contain only letters, digits, hyphens, underscores, and periods.
4. An element must have both an opening and a closing tag unless it is empty.
5. An empty element that does not have a closing tag must be of the form <tagname/>.
<element></element>
or
<element/>
<root>
<child>
<subchild>.....</subchild>
</child>
</root>
gender='female'
&lt; < less than
&gt; > greater than
&amp; & ampersand
&apos; ' apostrophe
&quot; " quotation mark
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Do not forget me this weekend!</body>
</note>
10. (Optional) The line <?xml version="1.0"?> is called the XML prolog. It is like the shebang in scripts, announcing that the data is XML.
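One way to get a feel for these rules is to hand small strings to the parser and see which ones it accepts. The sketch below assumes the XML package is installed; is_well_formed is a hypothetical helper written for this illustration, not a function from the package. It relies on xmlParse signalling an R error when the input is not well-formed.

```r
library(XML)

# Hypothetical helper: TRUE if the string parses as XML, FALSE if the parser reports an error
is_well_formed <- function(xml_string) {
  tryCatch({
    xmlParse(xml_string, asText = TRUE)
    TRUE
  }, error = function(e) FALSE)
}

ok_note     <- is_well_formed("<note><to>Tove</to><from>Jani</from></note>")
empty_ok    <- is_well_formed("<note><to/></note>")          # empty element in <tagname/> form
bad_case    <- is_well_formed("<Note><to>Tove</to></note>")  # start and end tags differ in case
bad_nesting <- is_well_formed("<note><to>Tove</note></to>")  # elements overlap instead of nesting
bad_char    <- is_well_formed("<msg>a < b & c</msg>")        # bare < and & in the content
escaped_ok  <- is_well_formed("<msg>a &lt; b &amp; c</msg>") # same content with entity references

# After parsing, the entity references are ordinary characters again
msg_text <- xmlValue(xmlRoot(xmlParse("<msg>a &lt; b &amp; c</msg>", asText = TRUE)))
```

The parser may print its error messages to the console for the malformed strings; the helper only records whether parsing succeeded.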
If you start using attributes as containers for XML data, you might end up with documents that are both difficult to maintain and to manipulate. You should use elements to describe your data. Use attributes only to provide information that is not relevant to the reader.
The following three XML documents contain exactly the same information.
A date attribute is used in the first example (BAD):
<note date="2008-01-10">
<to>Tove</to>
<from>Jani</from>
</note>
A date element is used in the second example (BETTER):
<note>
<date>2008-01-10</date>
<to>Tove</to>
<from>Jani</from>
</note>
An expanded date element is used in the third example:
<note>
<date>
<year>2008</year>
<month>01</month>
<day>10</day>
</date>
<to>Tove</to>
<from>Jani</from>
</note>
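To see why the element versions are easier to work with, compare how the year is pulled out of the first and third documents in R. This is a sketch, assuming the XML package is installed; the variable names are illustrative only.

```r
library(XML)

attr_doc <- xmlParse(
  '<note date="2008-01-10"><to>Tove</to><from>Jani</from></note>',
  asText = TRUE)
expanded_doc <- xmlParse(
  '<note><date><year>2008</year><month>01</month><day>10</day></date><to>Tove</to><from>Jani</from></note>',
  asText = TRUE)

# Attribute version: the year has to be cut out of the date string by hand
date_string <- xmlGetAttr(xmlRoot(attr_doc), "date")
year_from_attr <- strsplit(date_string, "-")[[1]][1]

# Expanded element version: the year is its own node and can be addressed directly
year_from_element <- xpathSApply(expanded_doc, "/note/date/year", xmlValue)
```

Both approaches give the same year, but the first depends on knowing how the date string is formatted, while the second only depends on the tree structure.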
Some things to consider when using attributes are:
* Attributes shouldn’t contain multiple values.
Don’t do this (unreadable):
<note day="10" month="01" year="2008"
to="Tove" from="Jani" heading="Reminder"
body="Don't forget me this weekend!">
</note>
* Attributes cannot contain tree structures (elements can).
* Attributes are not easily expandable for future changes (elements are).
First we will see how to create an XML document from within R. Later we will see how to read/process XML from within R.
It will be helpful to have in your mind the structure of the XML document (i.e., the XML tree) before you do anything in R, especially when you’re creating a new XML document.
You will need the XML package (so type install.packages("XML") in the console) and the functions newXMLDoc, newXMLNode, and saveXML.
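The syntax for newXMLNode is (showing only the arguments used here):
newXMLNode(name, ..., attrs = NULL, doc = NULL, parent = NULL)
where
name is the tag name of the node,
... is the content of the node (for example, a text string),
attrs is a named character vector of attributes,
doc is the newXMLDoc object that the node belongs to, and
parent is the newXMLNode object that is the parent of this node.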
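The attrs argument is easiest to understand with a small example. The sketch below rebuilds the Afghanistan field element from earlier; the enclosing tag name "record" is an assumption made for this illustration.

```r
library(XML)

doc <- newXMLDoc()
# "record" is a made-up root tag for this sketch
record <- newXMLNode("record", doc = doc)

# attrs is a named character vector: the name becomes the attribute name,
# the value becomes the (quoted) attribute value
field <- newXMLNode("field", "Afghanistan",
                    attrs = c(name = "Country or Area"),
                    parent = record)

# With no file argument, saveXML returns the XML as a string
xml_text <- saveXML(record)
cat(xml_text)
```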
Here are the R commands to create a small XML document with three nested levels:
library(XML)
doc <- newXMLDoc()
root <- newXMLNode("toplevel", doc = doc)
child1 <- newXMLNode("level1", parent = root)
newXMLNode("level2", "This is the content", parent = child1)
saveXML(doc, file = "~/Desktop/simple.xml")
Note: I only need to store (assign to a variable) nodes that I later need to refer to as parents. For the leaf nodes, I just use newXMLNode without assigning. The names of the nodes in R (e.g., root, child1) are not part of the resulting XML file.
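The same pattern extends to documents with several leaf nodes. As a sketch, the note document shown earlier can be built by looping over a named character vector; the vector name fields is an assumption for this illustration, and only the parent node is stored, as described in the note above.

```r
library(XML)

# Tag names and contents taken from the note example
fields <- c(to = "Tove", from = "Jani", heading = "Reminder",
            body = "Do not forget me this weekend!")

doc <- newXMLDoc()
note <- newXMLNode("note", doc = doc)

# Leaf nodes are created without being assigned to variables
for (tag in names(fields)) {
  newXMLNode(tag, fields[[tag]], parent = note)
}

cat(saveXML(doc))
```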
KML is a type of XML file specifically for geographic information / visualization (i.e. for use in Google Earth, Google Maps, etc). XML is a broader category of markup that includes KML.
KML, like HTML, has predefined tags (for example, <Document> or <Placemark>). Here is the definitive introduction to KML: https://developers.google.com/kml/documentation/kml_tut#for-more-information To find out about KML tags, refer to the KML reference link on that page.
Here is a simple example of a KML file.
<?xml version="1.0"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
<name>New York City</name>
<description>New York City</description>
<Point>
<coordinates>-74.006393,40.714172,0</coordinates>
</Point>
</Placemark>
</Document>
</kml>
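Because KML is XML, the file above can be read back into R with the same tools. The sketch below assumes the XML package is installed. Since the file declares a default namespace, the XPath queries need a prefix bound to that namespace; the prefix "k" is an arbitrary choice made here.

```r
library(XML)

kml_text <- '<?xml version="1.0"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
<name>New York City</name>
<description>New York City</description>
<Point>
<coordinates>-74.006393,40.714172,0</coordinates>
</Point>
</Placemark>
</Document>
</kml>'
doc <- xmlParse(kml_text, asText = TRUE)

# Bind the prefix "k" to the KML namespace so XPath can find the nodes
ns <- c(k = "http://www.opengis.net/kml/2.2")
place_name <- xpathSApply(doc, "//k:Placemark/k:name", xmlValue, namespaces = ns)
coords <- xpathSApply(doc, "//k:coordinates", xmlValue, namespaces = ns)

# KML stores coordinates as longitude,latitude,altitude
lon_lat_alt <- as.numeric(strsplit(coords, ",")[[1]])
```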
Some special features of KML:
The root node is called <kml> and has a special attribute called an XML namespace (xmlns), as in <kml xmlns="http://www.opengis.net/kml/2.2">, to indicate that it is a KML file.
The child of the <kml> node is called <Document> and its child is called <Placemark>.
A Placemark is one of the most commonly used features in Google Earth. It marks a position on the Earth’s surface, using a yellow pushpin as the icon. The simplest Placemark includes only a <Point> element, which specifies the location of the Placemark.
A Placemark has children including <Point>, <TimeStamp>, <description>, etc. For example if you want to put a
Steps:
KML tree:
kml
|
Document
|
Placemark
|
name - description - Point
|
coordinates
1. Load the XML package.
library(XML)
2. Create the document and the root <kml> node with its namespace.
doc <- newXMLDoc()
root <- newXMLNode("kml", namespaceDefinitions = "http://www.opengis.net/kml/2.2", doc = doc)
3. Use newXMLNode to create the Document node and its children.
docmt <- newXMLNode("Document", parent = root)
pm <- newXMLNode("Placemark", parent = docmt)
name <- newXMLNode("name", "New York City", parent = pm)
description <- newXMLNode("description", "New York City", parent = pm)
pt <- newXMLNode("Point", parent = pm)
newXMLNode("coordinates", "-74.006393,40.714172,0", parent = pt)
4. Save the document to a .kml file.
saveXML(doc, "/Users/Adam/Desktop/simple.kml")
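If a file needs more than one Placemark, the commands for a single Placemark can be wrapped in a function. The sketch below is one possible way to do that; add_placemark is a hypothetical helper written for this illustration (it is not part of the XML package), and it reuses the name as the description, as in the example above.

```r
library(XML)

# Hypothetical helper: adds one Placemark with name, description and Point under a Document node
add_placemark <- function(document_node, name, longitude, latitude, altitude = 0) {
  pm <- newXMLNode("Placemark", parent = document_node)
  newXMLNode("name", name, parent = pm)
  newXMLNode("description", name, parent = pm)
  pt <- newXMLNode("Point", parent = pm)
  # KML expects longitude,latitude,altitude separated by commas
  newXMLNode("coordinates", paste(longitude, latitude, altitude, sep = ","), parent = pt)
  pm
}

doc <- newXMLDoc()
root <- newXMLNode("kml", namespaceDefinitions = "http://www.opengis.net/kml/2.2", doc = doc)
docmt <- newXMLNode("Document", parent = root)

pm <- add_placemark(docmt, "New York City", -74.006393, 40.714172)

# Returns the KML as a string; pass a file name as the second argument to write it to disk
kml_string <- saveXML(doc)
```

Calling add_placemark once per location (for example, inside a loop over the rows of a data frame) would add further Placemark nodes under the same Document.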