lec29

Today

Example of data scraping using command line tools
Introduction to XML

1. Examples of Data Scraping using command line tools and a shell sript

Lets figure out which countries are the top 5 producers of apricots (or other fruits). We’ll use United Nations Food and Agriculture Organization (FAO) data on agricultural production.

Go to http://data.un.org/Explorer.aspx?d=FAO

click on “Crops” (you will see a bunch of agricultural products with “View data” links)

click on “apricots” as an example and you will see a “Download” button (circled in picture below) that allows you to download a CSV of the data. This is one way to download the data.

To download this file via URL (better) you will want to inspect the HTTP requests that the site handles. In Chrome go to View, Developer, Developer Tools. Click on Network (circled in picture below). Click on Download then CSV and then click on DownloadHandler (circled in the picture below).

Next click on Download and Headers (circled in picture below)

This shows us the Request URL is:

http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc

That downloads the data for Item 526 (apricots). Note that you can see the item ID for other products by hovering over “View Data” link for the relevant product.

Steps to find top 5 producers of apricots:

Download the data for apricots.

Solution:

lets make a new directory called apricots

mkdir apricots
cd apricots

Download the data from the URL. Note that you may need to put the http address inside double quotes when using wget to download it since there is the metacharacter ? in the URL.

wget -O temp.zip "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc"

rm UN* (remove any preexisting files starting with UN –just in case)

unzip -o temp.zip (-o means overwrite existing file)

mv UN* file.csv

Extract the data for individual countries into a separate file.

View unzipped file with less

less file.csv

egrep -v "\+" file.csv > apricotCountries.csv

Here -v is invert match.

If you use egrep then + is treated as a metacharacter and you need to escape it. In case you are curious, this is equivalent to grep -v + file.csv since grep treats + as the literal character +, not a metacharacter.

Subset the country-level data to the year 2005. Based on the “area harvested” determine the five countries using the most land to produce apricots.

First we clean up the data.

Notice that some countries and regions have commas in the country name (ex Iran, Islamic Republic of). Here is a fix.

cat apricotCountries.csv | sed "s/, / /g" > apricotCountries1.csv

Notice we need to remove " so that we can sort numerically, and we only care about cases with Element equal to Area Harvested.

cat apricotCountries1.csv | sed "s/\"//g" | egrep Harvested > apricotCountries_clean.csv

Note: We needed to escape the " character with a forward slash

Task for you

Copy the commands to find the top 5 countries (given below) to a script called fruit.sh using nano.

wget -O temp.zip "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc"
rm UN*
unzip -o temp.zip
mv UN* file.csv
egrep -v "\+" file.csv > apricotCountries.csv
cat apricotCountries.csv | sed "s/, / /g" > apricotCountries1.csv
cat apricotCountries1.csv | sed "s/\"//g"  | egrep Harvested  > apricotCountries_clean.csv
cat apricotCountries_clean.csv | grep 2005 | sort -t ","  -n -k 6 -r | cut -d "," -f 1,6 | sed "s/,/   /g" | head -n5

Recall here are the steps how to make a script

Steps:

type nano fruit.sh
copy and paste the above commands
define a hashbang (#!/usr/bin/env bash)
add permission to execute (chmod u+x fruit.sh)
parameterize (./fruit.sh 572) —here 572 is the code for avocadoes

Solution

You can type for example ./fruit 572 to get top 5 countries for Avocados.

#!/usr/bin/env bash
code="$1"

wget -O temp.zip "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:$code&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc"
rm UN*
unzip -o temp.zip
mv UN* file.csv
egrep -v "\+" file.csv > apricotCountries.csv
cat apricotCountries.csv | sed "s/, / /g" > apricotCountries1.csv
cat apricotCountries1.csv | sed "s/\"//g"  | egrep Harvested  > apricotCountries_clean.csv
cat apricotCountries_clean.csv | grep 2005 | sort -t ,  -n -k 6 -r | cut -d , -f 1,6 | sed "s/,/   /g" | head -n5

2. Introduction to Extensible Markup Language (XML)

You may wish to download a text editor for your computer so you can view XML files. There are many good choices.

One possibility for Mac, Windows and Linux users is sublime Text

Extensible: Easy to expand (add new nodes)
Markup Language: a system of annotating a document with tags

Here is an example of an XML file on apricot producers from UN (you can view it in sublime if you like)

http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=xml&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc

XML is a standard for semantic, hierarchical representation of data. In otherwords relationships between pieces of data reflect relationships in the real world. For example you data consists of a number of records and each record consists of a number of fields (country, element code, year, value). Compare that to our viewing of the appricot data as a flat CSV file.

XML was designed to carry data - with focus on what data is

HTML was designed to display data - with focus on how data looks

Some positive aspects of XML are

data is self-describing
format separates content from structure
data can be easily merged and exchanged
file is human-readable
file is also easily machine-generated
standards are widely adopted

Some negative aspects are

XML documents can be very verbose and hard to read
it’s so general that it’s hard to develop tools for all cases
files can be quite large due to high amount of redundancy

XML is has become quite popular in many scientific fields, and it is standard in many web applications for the exchange and visualization of data. We’ll learn how to

create it from within R, and
read/process it from within R

Anatomy of an XML document

The basic unit of XML code is called an element or node. It is made up of both markup and content. Markup consists of tags, attributes, and comments.

Here <field name="Country or Area">Afghanistan</field> is an element (or node)

<field name="Country or Area"> </field> is a tag

an empty tag would be <field></field> or <field/>

field name="Country or Area" is a attribute. An attribute’s value is always in quotes.

 It can be anywhere.

Content is Afghanistan.

XML is well-formed if it obeys certain syntax rules. The rules for tags are

1.Tag names are case-sensitive; start and end tags must match exactly.
2. No spaces are allowed between the < and the tag name.
3. Tag names must begin with a letter and contain only alphanumeric characters.
4. An element must have both an open and closing tag unless it is empty.
5. An empty element that does not have a closing tag must be of the form <tagname/>.

<element></element>

<element />

Tags must nest properly. (Inner tags must close before outer ones.)

<root>
  <child>
    <subchild>.....</subchild>
  </child>
</root>

All attributes must appear in quotes in a name = “value” format

<person gender='female'>

There are 5 pre-defined entity references in XML:

&lt;    <   less than
&gt;    >   greater than
&amp;   &   ampersand 
&apos;  '   apostrophe
&quot;  "   quotation mark

All XML documents must contain a root node (doesn’t need to be called root) containing all the other nodes.

<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Do not forget me this weekend!</body>
</note>

10.(Optional) The line

<?xml version="1.0" encoding="UTF-8"?>

is called the XML prolog. It is like the hashbang in scripts anouncing that the data is XML and that the character encoding is UTF-8.

Attributes versus Elements

The following three XML documents contain exactly the same information. Which do you prefer?

A date attribute is used in the first example:

<note date="2008-01-10">
  <to>Tove</to>
  <from>Jani</from>
</note>

A element is used in the second example:

<note>
  <date>2008-01-10</date>
  <to>Tove</to>
  <from>Jani</from>
</note>

An expanded element is used in the third example: (THIS IS MY FAVORITE):

<note>
  <date>
    <year>2008</year>
    <month>01</month>
    <day>10</day>
  </date>
  <to>Tove</to>
  <from>Jani</from>
</note>

Some things to consider when using attributes are:

attributes cannot contain multiple values (elements can)
attributes cannot contain tree structures (elements can)
*attributes are not easily expandable (for future changes)

Don’t end up like this:

<note day="10" month="01" year="2008"
to="Tove" from="Jani" heading="Reminder"
body="Don't forget me this weekend!">
</note>