Source file ⇒ lec28.Rmd

Today

  1. Anatomy of a URL
  2. The unix command sed
  3. Making shell scripts
  4. Example of datascraping with command line tools and shell script

1. How a browser initiates a request: anatomy of a url

HTTP (Hypertext Transfer Protocol) allows for communication between a client and a server via request/response messages. There are two important parts to HTTP: the request, which is the data sent to the server, and the response, which is the data sent back from the server.

At the heart of web communication is the request message, which is sent via a Uniform Resource Locator (URL).

The protocol is usually http, but it can be https for secure communications. www.domain.com represents the Domain Name System (DNS) name of the web server, which listens for HTTP requests on port 80 by default; a different port can be set explicitly, as illustrated above. The resource path is the local path to the resource on the server. A query, following the ?, is a string of characters used to retrieve specific information from a database. The query in this example consists of a field or variable, technically called a key in this context (here, the letter “a”), followed by an equals sign (=), followed by the value for that key (here, the letter “b”). Each key together with its corresponding value is called a key-value pair. A query may contain several key-value pairs; this example has two. When there is more than one key-value pair, they are typically separated by ampersands (&).
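The parts described above can be pulled out of a URL with bash parameter expansion. This is only a sketch: the URL below is a made-up example (an assumption, not necessarily the one shown in lecture), and the code assumes the URL contains an explicit port and a query.

```shell
#!/usr/bin/env bash
# Hypothetical URL, invented for illustration only.
url="http://www.domain.com:1234/path/to/resource?a=b&x=y"

protocol="${url%%://*}"       # everything before "://"
rest="${url#*://}"            # drop the protocol and "://"
hostport="${rest%%/*}"        # everything up to the first "/"
host="${hostport%%:*}"        # part before ":" (the DNS name)
port="${hostport##*:}"        # part after ":" (assumes a port is given)
pathquery="/${rest#*/}"       # resource path plus query
path="${pathquery%%[?]*}"     # part before "?"
query="${pathquery#*[?]}"     # part after "?"

echo "protocol: $protocol"    # protocol: http
echo "host:     $host"        # host:     www.domain.com
echo "port:     $port"        # port:     1234
echo "path:     $path"        # path:     /path/to/resource
echo "query:    $query"       # query:    a=b&x=y

# Split the query into key-value pairs, one per line, at each ampersand.
pairs=$(echo "$query" | tr '&' '\n')
echo "$pairs"
```

The last command prints a=b and x=y on separate lines, one key-value pair each.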

2. The unix command sed

sed has many uses, but we will focus on using sed for substitution.

syntax: sed s/regex/replacement/FLAG file OR

cat file | sed s/regex/replacement/FLAG

FLAGS can be any of the following:

  • nothing Replace only the first instance of regex with replacement
  • g Replace all instances of regex with replacement
  • n Can be any number; replace the nth instance of regex with replacement
  • i Match regex in a case-insensitive manner

EXAMPLE:
echo one two three, three two one, one one hundred > file
cat file | sed s/one/ONE/g

EXAMPLE:
echo day sunday | sed s/day/night/
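The two examples above cover the default behavior and the g flag. A small sketch of the remaining flags follows. The i flag is a GNU sed extension, so it may not behave the same way in other versions of sed (this is an assumption about your system, not something stated in lecture).

```shell
#!/usr/bin/env bash
# No flag: only the first match on the line is replaced.
first=$(echo "day sunday" | sed 's/day/night/')
echo "$first"     # night sunday

# g: every match on the line is replaced.
all=$(echo "day sunday" | sed 's/day/night/g')
echo "$all"       # night sunnight

# A number: only that occurrence (here the 2nd) is replaced.
second=$(echo "day sunday" | sed 's/day/night/2')
echo "$second"    # day sunnight

# i: ignore case when matching (GNU sed extension).
nocase=$(echo "Day sunday" | sed 's/day/night/i')
echo "$nocase"    # night sunday
```

Putting the sed expression in single quotes is optional here, but it becomes necessary as soon as the expression contains spaces or other shell metacharacters.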

Task for you:

echo Adam is great > test

Use sed to change the contents of test to Yourname is great

3. Converting one-liners at the command line into a shell script

In lecture 27 we saw the following commands to make a file called small_potatoes

wget -O potatoes.txt http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/potatoes.txt
cat potatoes.txt | cut -f 1-2 > small_potatoes
head small_potatoes

Suppose we would like to actually make this into a script that we can reuse.

Steps:

  1. type nano potatoes.sh
  2. copy and paste the above commands
  3. define a hashbang (#!/usr/bin/env bash)
  4. add permission to execute (chmod u+x potatoes.sh)
  5. parameterize (./potatoes.sh 3)
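The steps above might end up looking like the sketch below. Two things here are assumptions and not from lecture: the argument is taken to mean the number of lines that head should print, and the script skips the download when potatoes.txt already exists. So that the sketch runs without a network connection, it first creates a tiny made-up stand-in for potatoes.txt (placeholder values, not the real data).

```shell
#!/usr/bin/env bash
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up, tab-separated stand-in for potatoes.txt (placeholder values only).
printf 'colA\tcolB\tcolC\n1\t2\t3\n4\t5\t6\n7\t8\t9\n10\t11\t12\n' > potatoes.txt

# Steps 1-3: write the script (here with a heredoc instead of nano).
cat > potatoes.sh <<'EOF'
#!/usr/bin/env bash
# Usage: ./potatoes.sh N
# N is assumed to be the number of lines to display (default 10).
n="${1:-10}"
# Assumption: download only if the file is not already present.
[ -f potatoes.txt ] || wget -O potatoes.txt http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/potatoes.txt
cut -f 1-2 potatoes.txt > small_potatoes
head -n "$n" small_potatoes
EOF

# Step 4: add permission to execute.
chmod u+x potatoes.sh

# Step 5: run it with a parameter.
./potatoes.sh 3
```

Inside the script, $1 holds the first command-line argument, and ${1:-10} falls back to 10 when no argument is given.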

Task for you

Make a script in nano called myhouse.sh from the following commands. Parameterize fireplaces (ex. ./myhouse.sh Y)

wget -O houses.csv http://www.mosaic-web.org/go/datasets/SaratogaHouses.csv
cat houses.csv | cut -d ',' -f 2-5 | egrep Y | head

4. Examples of Data Scraping using command line tools and a shell script

Let's figure out which countries are the top 5 producers of apricots (or other fruits). We’ll use United Nations Food and Agriculture Organization (FAO) data on agricultural production.

Go to http://data.un.org/Explorer.aspx?d=FAO

Click on “Crops” (you will see a list of agricultural products with “View data” links).

Click on “apricots” as an example and you will see a “Download” button (circled in picture below) that allows you to download a CSV of the data. This is one way to download the data.

To download this file via URL (better) you will want to inspect the HTTP requests that the site handles. In Chrome go to View, Developer, Developer Tools. Click on Network (circled in picture below). Click on Download then CSV and then click on DownloadHandler (circled in the picture below).

Next click on Download and Headers (circled in picture below)

This shows us the Request URL is:

http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc

That downloads the data for Item 526 (apricots). Note that you can see the item ID for other products by hovering over the “View data” link for the relevant product.
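Since only the item code changes from one product to the next, the request URL can be assembled from a variable. A minimal sketch (the variable names are made up for illustration, and the download line is commented out so the sketch runs offline):

```shell
#!/usr/bin/env bash
# Item code: first argument if given, otherwise 526 (apricots).
item="${1:-526}"

base="http://data.un.org/Handlers/DownloadHandler.ashx"
query="DataFilter=itemCode:${item}&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc"
url="${base}?${query}"

echo "$url"

# Uncomment to actually download; keep the quotes around $url.
# wget -O temp.zip "$url"
```

With no argument this prints the same request URL as above.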

Steps to find top 5 producers of apricots:

  1. Download the data for apricots.

Solution:

Let's make a new directory called apricots.

mkdir apricots
cd apricots

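Download the data from the URL. Note that you should put the URL inside double quotes when using wget, since it contains the shell metacharacters ? and &. Without the quotes, the shell would treat each & as the end of a command to run in the background, and wget would receive only part of the URL.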

wget -O temp.zip "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc"

rm UN*

unzip -o temp.zip

Note: -o means overwrite existing files

mv UN* file.csv

  2. Extract the data for individual countries into a separate file.

View unzipped file with less

less file.csv

grep -v + file.csv > apricotCountries.csv

Here -v inverts the match, so lines containing + are excluded. grep treats + as the literal character +. If you use egrep, then + is treated as a metacharacter and you need to escape it with a backslash (i.e. egrep -v "\+" file.csv).
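To see the two forms side by side, here is a sketch on a three-line sample file. The rows are invented placeholders, not real FAO data. It uses grep -E, which is the same as egrep and is the preferred spelling in recent versions of GNU grep.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Invented sample rows; the middle one contains a +.
printf '%s\n' 'Country A,10' 'Region B +,25' 'Country C,7' > sample.csv

# Basic regular expressions (plain grep): + is a literal character.
basic=$(grep -v + sample.csv)
echo "$basic"

# Extended regular expressions (egrep or grep -E): + must be escaped.
extended=$(grep -E -v '\+' sample.csv)
echo "$extended"
```

Both commands print the Country A and Country C lines and drop the line containing +.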

  3. Then subset the country-level data to the year 2005. Based on the “area harvested” determine the five countries using the most land to produce apricots.

We need to clean up the data first.

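Notice that some countries and regions have commas in the country name (ex. Iran, Islamic Republic of), which would be mistaken for field separators. Here is a fix that replaces each comma followed by a space with a single space.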

cat apricotCountries.csv | sed "s/, / /g" > apricotCountries1.csv

Notice we need to remove " so that we can sort numerically, and we only care about “Area Harvested”.

cat apricotCountries1.csv | sed "s/\"//g" | grep Harvested > apricotCountries_clean.csv

Note: We need to escape the " character with a backslash

cat apricotCountries_clean.csv | grep 2005 | sort -t ',' -n -k 6 -r | cut -d ',' -f 1,6 | sed "s/,/ /g" | head -n5

Note: the expression after sed is in quotes because space is a shell metacharacter
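The cleaning and ranking steps can be tried end to end on a small sample before running them on the real file. The rows below are invented placeholders, not FAO data. The layout is an assumption: only the country in field 1 and the value in field 6 follow from the commands above, and the middle columns and element codes are guesses.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Invented sample in the assumed layout:
# country, element code, element, year, unit, value
cat > sample.csv <<'EOF'
"Country A","31","Area Harvested","2005","Ha","1200"
"Country B, Republic of","31","Area Harvested","2005","Ha","56000"
"Country C","31","Area Harvested","2004","Ha","99999"
"Country C","31","Area Harvested","2005","Ha","800"
"Country D","51","Production","2005","tonnes","70000"
"Region E +","31","Area Harvested","2005","Ha","500000"
EOF

top=$(grep -v + sample.csv \
  | sed "s/, / /g" \
  | sed "s/\"//g" \
  | grep Harvested \
  | grep 2005 \
  | sort -t ',' -n -k 6 -r \
  | cut -d ',' -f 1,6 \
  | sed "s/,/ /g" \
  | head -n5)

echo "$top"
# Country B Republic of 56000
# Country A 1200
# Country C 800
```

One caveat: grep 2005 matches 2005 anywhere on the line, so a row from another year whose value happens to contain 2005 would also get through. It is worth checking the output by eye.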

Task for you

Pick your favorite fruit and find its code (for example, apricot is 526 and avocado is 572).

Task for you

Copy the commands to find the top 5 countries to a script called fruit.sh

Next time

Extensible MarkUp Language (XML)