Source file ⇒ lec28.Rmd
HTTP (Hypertext Transfer Protocol) allows for communication between a client and a server via request/response messages. There are two important parts to HTTP: the request, the data sent to the server, and the response, the data sent back from the server.
At the heart of web communication is the request message, which is sent via a Uniform Resource Locator (URL).
The protocol is usually http, but it can be https for secure communications. www.domain.com represents the Domain Name System (DNS) name of the web server, which listens for http requests on port 80 by default, but a port can be set explicitly, as illustrated above. The resource path is the local path to the resource on the server. A query, following ?, is a set of characters used to retrieve specific information from a database. The query in this example consists of a field or variable, technically called a key in this context (here, the letter “a”), followed by an equals sign (=), followed by the value for that key (here, the letter “b”). Each key and its corresponding value, written as an equation, is called a key-value pair. A query may contain several key-value pairs. This example has two. When there is more than one key-value pair, they are typically separated by ampersands (&).
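As a rough illustration of these parts, the sketch below splits a made-up URL into its pieces using shell parameter expansion. The URL itself, the port 1234, and the second key-value pair x=y are assumptions chosen for this example only.

```shell
# Hypothetical URL, for illustration only.
url="http://www.domain.com:1234/path/to/resource?a=b&x=y"

protocol="${url%%://*}"   # text before ://
rest="${url#*://}"        # everything after ://
hostport="${rest%%/*}"    # host name and port, up to the first /
path="/${rest#*/}"        # from the first / onward
path="${path%%\?*}"       # drop the query, leaving the resource path
query="${url#*\?}"        # everything after the first ?

echo "protocol:  $protocol"
echo "host:port: $hostport"
echo "path:      $path"
echo "query key-value pairs:"
echo "$query" | tr '&' '\n'   # one key=value pair per line
```

Each key-value pair ends up on its own line because tr replaces every & with a newline.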
Sed has many uses but we will focus on sed for substitution
syntax: sed s/regex/replacement/FLAG file
OR
cat file | sed s/regex/replacement/FLAG
FLAGS can be any of the following:
EXAMPLE: echo one two three, three two one, one one hundred > file
cat file | sed s/one/ONE/g
EXAMPLE: echo day sunday | sed s/day/night/
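To see what the g flag changes, the two examples can be run with and without it. This is a small sketch; the file name sed_demo.txt is an arbitrary choice for the illustration.

```shell
# Sample text from the example above, written to a scratch file.
echo "one two three, three two one, one one hundred" > sed_demo.txt

# Without a flag, only the first match on each line is replaced.
sed 's/one/ONE/' sed_demo.txt
# ONE two three, three two one, one one hundred

# With g, every match on the line is replaced.
sed 's/one/ONE/g' sed_demo.txt
# ONE two three, three two ONE, ONE ONE hundred

# The same difference on the second example.
echo day sunday | sed 's/day/night/'    # night sunday
echo day sunday | sed 's/day/night/g'   # night sunnight
```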
echo Adam is great > test
Use sed to change the contents of test to Yourname is great
In lecture 27 we saw the following commands to make a file called small_potatoes
wget -O potatoes.txt http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/potatoes.txt
cat potatoes.txt | cut -f 1-2 > small_potatoes
head small_potatoes
Suppose we would like to actually make this into a script that we can reuse.
Steps:
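One way this could look, as a sketch rather than the definitive version: put the commands in a file, let the first command-line argument ($1) choose the columns, and make the file executable. The script name small_potatoes.sh, the choice of the column range as the parameter, and the check that skips the download when potatoes.txt already exists are all assumptions made for this example. In class the file would be typed into nano; here it is written with a here-document so the example is self-contained.

```shell
# Create the script file.
cat > small_potatoes.sh <<'EOF'
#!/bin/bash
# Usage: ./small_potatoes.sh FIELDS     (for example: ./small_potatoes.sh 1-2)
# $1 holds the first argument given on the command line.

# Download the data only if it is not already in this directory.
[ -f potatoes.txt ] || wget -O potatoes.txt http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/potatoes.txt

# Keep the requested tab-separated columns and show the first lines.
cut -f "$1" potatoes.txt > small_potatoes
head small_potatoes
EOF

# Make the script executable so it can be run as ./small_potatoes.sh
chmod +x small_potatoes.sh
```

Running ./small_potatoes.sh 1-2 then reproduces the commands from lecture 27, and ./small_potatoes.sh 1-3 reuses the same script for a different column range.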
Make a script in nano called myhouse.sh containing the following commands. Parameterize fireplaces (ex. ./myhouse.sh Y)
wget -O houses.csv http://www.mosaic-web.org/go/datasets/SaratogaHouses.csv
cat houses.csv | cut -d ',' -f 2-5 | egrep Y | head
Let's figure out which countries are the top 5 producers of apricots (or other fruits). We’ll use United Nations Food and Agriculture Organization (FAO) data on agricultural production.
Go to http://data.un.org/Explorer.aspx?d=FAO
click on “Crops” (you will see a bunch of agricultural products with “View data” links)
click on “apricots” as an example and you will see a “Download” button (circled in picture below) that allows you to download a CSV of the data. This is one way to download the data.
To download this file via URL (better) you will want to inspect the HTTP requests that the site handles. In Chrome go to View, Developer, Developer Tools. Click on Network (circled in picture below). Click on Download then CSV and then click on DownloadHandler (circled in the picture below).
Next click on Download and Headers (circled in picture below)
This shows us the Request URL is:
http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc
That downloads the data for Item 526 (apricots). Note that you can see the item ID for other products by hovering over “View Data” link for the relevant product.
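Since only the item code changes from one product to another, the request URL can be built from a shell variable. A minimal sketch, assuming the rest of the URL stays the same for other items; the variable names item and url are chosen for this example.

```shell
# Item code for the product of interest (526 is apricots).
item=526

# Same request URL as above, with the item code substituted in.
# Double quotes keep ? and & from being interpreted by the shell.
url="http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:${item}&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc"

echo "$url"
```

Changing item to another code would then produce the URL for that product.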
Solution:
Let's make a new directory called apricots and move into it
mkdir apricots
cd apricots
Download the data from the URL. Note that you should put the URL inside double quotes when using wget to download it, since the URL contains the shell metacharacters ? and &.
wget -O temp.zip "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc"
rm UN*
unzip -o temp.zip
-o means overwrite existing file
mv UN* file.csv
View unzipped file with less
less file.csv
grep -v + file.csv > apricotCountries.csv
Here -v inverts the match, so only lines that do not contain the pattern are kept. grep treats + as the literal character +. If you use egrep, then + is treated as a metacharacter and you need to escape it (i.e. egrep -v "\+" file.csv)
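The difference can be checked on a small made-up file in which one row contains a +. The file name and its rows are invented for this sketch; grep -E is used here as the equivalent of egrep.

```shell
# Made-up sample: the second row contains a literal +.
printf 'Albania,2005\nWorld +,2005\n' > plus_demo.csv

# Basic regular expressions: + is an ordinary character.
grep -v + plus_demo.csv
# Albania,2005

# Extended regular expressions: + must be escaped to be literal.
grep -E -v '\+' plus_demo.csv
# Albania,2005
```

Both commands drop the row containing + and keep the other one.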
We need to clean up the data first.
Notice that some countries and regions have a comma in their name (e.g. Iran, Islamic Republic of), which would be mistaken for a field separator. Here is a fix.
cat apricotCountries.csv | sed "s/, / /g" > apricotCountries1.csv
Notice we need to remove " so that we can sort numerically, and we only care about “Area Harvested”.
cat apricotCountries1.csv | sed "s/\"//g" | grep Harvested > apricotCountries_clean.csv
Note: We need to escape the " character with a backslash (\), because the sed expression is itself inside double quotes
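To see what the two sed substitutions and the grep do, they can be tried on a couple of made-up rows. The file name, the column layout, and every value below are invented for this sketch and only assumed to resemble the real download.

```shell
# Made-up rows (invented values), one of them "Area Harvested".
cat > sample_raw.csv <<'EOF'
"Iran, Islamic Republic of","31","Area Harvested","2005","Ha","100"
"Iran, Islamic Republic of","51","Production Quantity","2005","tonnes","900"
EOF

# 1. Replace comma-plus-space (inside the country name) with a space.
# 2. Delete every double quote; \" is an escaped quote inside "...".
# 3. Keep only the rows that mention Harvested.
sed "s/, / /g" sample_raw.csv | sed "s/\"//g" | grep Harvested
# Iran Islamic Republic of,31,Area Harvested,2005,Ha,100
```

The commas that separate fields are untouched because they are followed by a quote, not a space.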
cat apricotCountries_clean.csv | grep 2005 | sort -t ',' -n -k 6 -r | cut -d ',' -f 1,6 | sed "s/,/ /g" | head -n5
Note: the expression after sed is in quotes because the space in it is a shell metacharacter
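The sort options in this pipeline are: -t ',' to split fields on commas, -k 6 to sort on the sixth field, -n to compare numerically, and -r to reverse the order so the largest value comes first. A small sketch on made-up data follows; the country names and numbers are fictional and the file name is arbitrary.

```shell
# Fictional, already-cleaned rows: value is in field 6.
cat > sample_clean.csv <<'EOF'
Freedonia,31,Area Harvested,2005,Ha,300
Borduria,31,Area Harvested,2005,Ha,1200
Borduria,31,Area Harvested,2004,Ha,5000
Syldavia,31,Area Harvested,2005,Ha,45
EOF

# Keep 2005 rows, sort by field 6 (largest first),
# keep country and value, replace the comma with a space.
grep 2005 sample_clean.csv | sort -t ',' -n -k 6 -r | cut -d ',' -f 1,6 | sed "s/,/ /g" | head -n5
# Borduria 1200
# Freedonia 300
# Syldavia 45
```

One caveat worth keeping in mind: grep 2005 matches the text 2005 anywhere on the line, so a row from another year whose value happened to contain 2005 would also be kept.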
Pick your favorite fruit and find its code (for example, apricot is 526 and avocado is 572).
Copy the commands to find the top 5 countries to a script called fruit.sh
Extensible MarkUp Language (XML)