Source file ⇒ Lab_11.Rmd

In this lab, we will use the XML package to read exchange rates (against the euro) from the European Central Bank and create a time series plot showing how the rates for four different currencies-the British pound (GBP), the US dollar (USD), the Canadian dollar (CAD), and the Japanese yen (JPY)-have changed over time. Turn in this lab as a .rmd file.

Before jumping to the code portions, open a browser and visit this URL:

http://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist-90d.xml

Examine the structure of the XML. What is the exchange rates for the British pound (GBP) for 4/8 and 4/7?

The exchange rate for the GBP on 4/8 is 0.8073 and 0.80728 on 4/7.

library(XML)

# Read the data into R
er = xmlTreeParse("http://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist-90d.xml")

# Get the root node of the XML tree
erRoot = xmlRoot(er)

xmlSize(erRoot)
## [1] 3
names(erRoot)
##   subject    Sender      Cube 
## "subject"  "Sender"    "Cube"

The root node is <gesmes:Envelope>.

We’ll extract the exchange rates for the British pound (GBP) as an example. First, we need to understand more about the structure of this XML tree. It looks like all of the data are descendents of a “Cube” node.

Here, we examine the first child of the Cube node. If you were to draw a diagram for this tree, with the following expression, we would be looking at the following node:

Envelope > Cube > Cube

erRoot[['Cube']][1]
## $Cube
## <Cube time="2016-04-18">
##  <Cube currency="USD" rate="1.1306"/>
##  <Cube currency="JPY" rate="122.32"/>
##  <Cube currency="BGN" rate="1.9558"/>
##  <Cube currency="CZK" rate="27.022"/>
##  <Cube currency="DKK" rate="7.4417"/>
##  <Cube currency="GBP" rate="0.7978"/>
##  <Cube currency="HUF" rate="310.33"/>
##  <Cube currency="PLN" rate="4.3059"/>
##  <Cube currency="RON" rate="4.4749"/>
##  <Cube currency="SEK" rate="9.1948"/>
##  <Cube currency="CHF" rate="1.0916"/>
##  <Cube currency="NOK" rate="9.3325"/>
##  <Cube currency="HRK" rate="7.505"/>
##  <Cube currency="RUB" rate="76.7591"/>
##  <Cube currency="TRY" rate="3.2251"/>
##  <Cube currency="AUD" rate="1.4685"/>
##  <Cube currency="BRL" rate="3.9606"/>
##  <Cube currency="CAD" rate="1.4612"/>
##  <Cube currency="CNY" rate="7.3241"/>
##  <Cube currency="HKD" rate="8.7688"/>
##  <Cube currency="IDR" rate="14890.95"/>
##  <Cube currency="ILS" rate="4.2702"/>
##  <Cube currency="INR" rate="75.274"/>
##  <Cube currency="KRW" rate="1299.62"/>
##  <Cube currency="MXN" rate="19.9043"/>
##  <Cube currency="MYR" rate="4.4404"/>
##  <Cube currency="NZD" rate="1.6303"/>
##  <Cube currency="PHP" rate="52.253"/>
##  <Cube currency="SGD" rate="1.5328"/>
##  <Cube currency="THB" rate="39.605"/>
##  <Cube currency="ZAR" rate="16.5061"/>
## </Cube>
## 
## attr(,"class")
## [1] "XMLNodeList"

As we can see, this node is the parent of another node named Cube that has a “time” attribute. In turn, the Cube node with the “time” attribute node is the parent of several other Cube nodes with “currency” and “rate” attributes.

We’ll show you two ways to pull out the GBP exchange rate.

First way: Split into two steps.

1. The first is to obtain all of the nodes that have the name “Cube” and an attribute “currency” set to GBP.

We provide getNodeSet() the root node and the XPath to the nodes we’re interested in. Provide the predicate in the XPath expression to locate those Cube nodes with a currency attribute value of “GBP”.

gbp_nodes = getNodeSet(erRoot, '//x:Cube[@currency = "GBP"]', namespaces = "x")

Note we haven’t covered namespaces, so don’t worry about the “x:” part in the specification of the path above.

Now for the second step.

2. Use sapply() to get the value of the “rate” attribute on those nodes.

gbp1 = sapply(gbp_nodes, xmlGetAttr, "rate")

head(gbp1, 8)
## [1] "0.7978"  "0.79575" "0.7956"  "0.79388" "0.7984"  "0.8006"  "0.8073" 
## [8] "0.80728"
length(gbp1)
## [1] 62
class(gbp1)
## [1] "character"

ANSWER THE QUESTIONS:

Do the values match what you found earlier?

Entries 7 and 8 match the values that we mentioned earlier

How many dates are there?

There are 62 dates

What’s the correct data type for this vector?

This vector consists of character-type elements

The second approach does this all in one step with the xPathSApply() function.

Note that this function is similarly spelled, but xmlSApply() and xpathSApply() are not the same. The latter provides an XPath expression to specify the nodes to which the function is applied.

er = xmlTreeParse("http://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist-90d.xml", useInternalNodes = TRUE)
erRoot = xmlRoot(er)
gbp2 = xpathSApply(erRoot, '//x:Cube[@currency="GBP"]', 
                  xmlGetAttr, "rate", namespaces = "x")

It is easy to check that these two methods produce the same result!

identical(gbp1, gbp2)
## [1] TRUE

Now that we’ve extracted the GBP exchange rate, let’s get the dates associated with each of those values.

Your Turn:

Fill in the XPath expression to locate all Cube nodes with a time attribute. Remeber to use x: in front of the Cube tag name.

days = xpathSApply(erRoot, '//x:Cube[@time]', 
                   xmlGetAttr, "time", namespaces = "x")
head(days)
## [1] "2016-04-18" "2016-04-15" "2016-04-14" "2016-04-13" "2016-04-12"
## [6] "2016-04-11"
dayz = as.Date(days)

# check for same number of dates
length(dayz) == length(gbp1)
## [1] TRUE

Let’s plot:

# Recall gbp1 is a character vector; let's fix that now
gbp = as.numeric(gbp1)

df <- data.frame(dayz,gbp)
# Let's plot the GBP exchange rate vs time. 
df %>% ggplot(aes(x=dayz,y=gbp)) + geom_line() + labs(title="Exchange Rates Over Time",x="Date",y="Exchange Rate (against Euro)")

Extract rates for three other currencies

getExchangeRates = function(abbrev,root, numDays = NULL ){
  # get rates
  currency_path = sprintf('//x:Cube[@currency="%s"]', 
                          abbrev)
  rates = xpathSApply(root, currency_path,
                      xmlGetAttr, "rate", namespaces = "x")
  
  # format as numeric vector:
  rates = as.numeric(rates)
  
  # Check to see if there are any missing rates. 
  # print a warning
  if (!is.null(numDays)) {
    if(length(rates) != numDays) {
      warning("Incorrect number of dates!")
      }
  }  
  return(rates)
}
currencies = mapply(getExchangeRates, 
                    abbrev = c("GBP", "USD", "CAD", "AUD"),
                    root = erRoot, 
                    numDays = length(dayz),
                    SIMPLIFY = FALSE)
# Find the range of these exchange rates
rateMin = min(sapply(currencies, min))
rateMax = max(sapply(currencies, max))
rateRange = c(rateMin, rateMax)

df_wide <- data.frame(dayz,GBP=currencies[[1]],USD=currencies[[2]],CAD=currencies[[3]],AUD=currencies[[4]])
df <- df_wide %>% gather(key=countries, value=rate, GBP,USD, CAD, AUD)
df %>% ggplot(aes(x=dayz,y=rate, col=countries)) + geom_line() + labs(title="Exchange Rates Over Time",x="Date",y="Exchange Rate (against Euro)")