Source file ⇒ lec29.Rmd
Lets figure out which countries are the top 5 producers of apricots (or other fruits). We’ll use United Nations Food and Agriculture Organization (FAO) data on agricultural production.
Go to http://data.un.org/Explorer.aspx?d=FAO
click on “Crops” (you will see a bunch of agricultural products with “View data” links)
click on “apricots” as an example and you will see a “Download” button (circled in picture below) that allows you to download a CSV of the data. This is one way to download the data.
To download this file via URL (better) you will want to inspect the HTTP requests that the site handles. In Chrome go to View, Developer, Developer Tools. Click on Network (circled in picture below). Click on Download then CSV and then click on DownloadHandler (circled in the picture below).
Next click on Download and Headers (circled in picture below)
This shows us the Request URL is:
http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc
That downloads the data for Item 526 (apricots). Note that you can see the item ID for other products by hovering over “View Data” link for the relevant product.
Solution:
lets make a new directory called apricots
mkdir apricots
cd apricots
Download the data from the URL. Note that you may need to put the http address inside double quotes when using wget
to download it since there is the metacharacter ?
in the URL.
wget -O temp.zip "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc"
rm UN*
(remove any preexisting files starting with UN –just in case)
unzip -o temp.zip
(-o
means overwrite existing file)
mv UN* file.csv
View unzipped file with less
less file.csv
egrep -v "\+" file.csv > apricotCountries.csv
Here -v is invert match.
If you use egrep then + is treated as a metacharacter and you need to escape it. In case you are curious, this is equivalent to grep -v + file.csv
since grep treats +
as the literal character +
, not a metacharacter.
First we clean up the data.
Notice that some countries and regions have commas in the country name (ex Iran, Islamic Republic of). Here is a fix.
cat apricotCountries.csv | sed "s/, / /g" > apricotCountries1.csv
Notice we need to remove " so that we can sort numerically, and we only care about cases with Element
equal to Area Harvested
.
cat apricotCountries1.csv | sed "s/\"//g" | egrep Harvested > apricotCountries_clean.csv
Note: We needed to escape the " character with a forward slash
cat apricotCountries_clean.csv | grep 2005 | sort -t "," -n -k 6 -r | cut -d "," -f 1,6 | sed "s/,/ /g" | head -n5
Note: expression after sed in quotes because space is a metacharacter
Copy the commands to find the top 5 countries (given below) to a script called fruit.sh using nano.
wget -O temp.zip "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc"
rm UN*
unzip -o temp.zip
mv UN* file.csv
egrep -v "\+" file.csv > apricotCountries.csv
cat apricotCountries.csv | sed "s/, / /g" > apricotCountries1.csv
cat apricotCountries1.csv | sed "s/\"//g" | egrep Harvested > apricotCountries_clean.csv
cat apricotCountries_clean.csv | grep 2005 | sort -t "," -n -k 6 -r | cut -d "," -f 1,6 | sed "s/,/ /g" | head -n5
Recall here are the steps how to make a script
Steps:
You can type for example ./fruit 572 to get top 5 countries for Avocados.
#!/usr/bin/env bash
code="$1"
wget -O temp.zip "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:$code&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc"
rm UN*
unzip -o temp.zip
mv UN* file.csv
egrep -v "\+" file.csv > apricotCountries.csv
cat apricotCountries.csv | sed "s/, / /g" > apricotCountries1.csv
cat apricotCountries1.csv | sed "s/\"//g" | egrep Harvested > apricotCountries_clean.csv
cat apricotCountries_clean.csv | grep 2005 | sort -t , -n -k 6 -r | cut -d , -f 1,6 | sed "s/,/ /g" | head -n5
You may wish to download a text editor for your computer so you can view XML files. There are many good choices.
One possibility for Mac, Windows and Linux users is sublime Text
Extensible: Easy to expand (add new nodes)
Markup Language: a system of annotating a document with tags
Here is an example of an XML file on apricot producers from UN (you can view it in sublime if you like)
http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=xml&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc
XML is a standard for semantic, hierarchical representation of data. In otherwords relationships between pieces of data reflect relationships in the real world. For example you data consists of a number of records and each record consists of a number of fields (country, element code, year, value). Compare that to our viewing of the appricot data as a flat CSV file.
XML was designed to carry data - with focus on what data is
HTML was designed to display data - with focus on how data looks
Some positive aspects of XML are
Some negative aspects are
XML is has become quite popular in many scientific fields, and it is standard in many web applications for the exchange and visualization of data. We’ll learn how to
The basic unit of XML code is called an element or node. It is made up of both markup and content. Markup consists of tags, attributes, and comments.
Here <field name="Country or Area">Afghanistan</field>
is an element (or node)
<field name="Country or Area"> </field>
is a tag
an empty tag would be <field></field>
or <field/>
field name="Country or Area"
is a attribute. An attribute’s value is always in quotes.
<!--A comment would looks like this -->
It can be anywhere.
Content is Afghanistan
.
XML is well-formed if it obeys certain syntax rules. The rules for tags are
1.Tag names are case-sensitive; start and end tags must match exactly.
2. No spaces are allowed between the < and the tag name.
3. Tag names must begin with a letter and contain only alphanumeric characters.
4. An element must have both an open and closing tag unless it is empty.
5. An empty element that does not have a closing tag must be of the form <tagname/>
.
<element></element>
or
<element />
<root>
<child>
<subchild>.....</subchild>
</child>
</root>
<person gender='female'>
< < less than
> > greater than
& & ampersand
' ' apostrophe
" " quotation mark
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Do not forget me this weekend!</body>
</note>
10.(Optional) The line
<?xml version="1.0" encoding="UTF-8"?>
is called the XML prolog. It is like the hashbang in scripts anouncing that the data is XML and that the character encoding is UTF-8.
The following three XML documents contain exactly the same information. Which do you prefer?
A date attribute is used in the first example:
<note date="2008-01-10">
<to>Tove</to>
<from>Jani</from>
</note>
A
<note>
<date>2008-01-10</date>
<to>Tove</to>
<from>Jani</from>
</note>
An expanded
<note>
<date>
<year>2008</year>
<month>01</month>
<day>10</day>
</date>
<to>Tove</to>
<from>Jani</from>
</note>
Some things to consider when using attributes are:
attributes cannot contain multiple values (elements can)
attributes cannot contain tree structures (elements can)
*attributes are not easily expandable (for future changes)
Don’t end up like this:
<note day="10" month="01" year="2008"
to="Tove" from="Jani" heading="Reminder"
body="Don't forget me this weekend!">
</note>