Source file ⇒ 2017-lec21.Rmd

Today

  1. escaping metacharacters in bash
  2. Review sed
  3. Anatomy of a URL
  4. Example of datascraping with command line tools and shell script

1. Escaping metacharacters in Bash

Bash is a shell that we use to communicate with the operating system, here Ubuntu (a Linux distribution, which is Unix-like). Everything we type in the terminal is interpreted by Bash. Some characters are special to Bash. For example, Bash uses whitespace to determine where words begin and end. The first word is the command name and additional words become arguments to that command.

There are different ways to escape special characters so that they keep their literal meaning. A backslash \ escapes the single character that follows it. If there are many special characters, you can put them inside quotes instead: single quotes keep every character literal, while double quotes keep whitespace literal but still allow expansions such as $variable. For example:

echo hello     my friend

echo hello \ \ \ \ my friend


echo "hello     my friend"

echo hello"     "my friend
## hello my friend
## hello     my friend
## hello     my friend
## hello     my friend
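Single and double quotes are not interchangeable. The following is a minimal sketch of the difference, using a made-up variable called `name` (it is not part of the lecture examples): double quotes keep the spaces but still let Bash expand `$name`, while single quotes and a backslash keep the `$` literal.

```shell
name=friend              # hypothetical variable, only for illustration

echo "hello    $name"    # double quotes: spaces kept, $name expanded   -> hello    friend
echo 'hello    $name'    # single quotes: every character is literal    -> hello    $name
echo hello \$name        # backslash: escapes only the next character   -> hello $name
```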

Here are some other special characters in Bash:

[Image: special characters]

2. The unix command sed

sed has many uses, but we will focus on using it for substitution.

syntax: sed s/regex/replacement/FLAG file OR

cat file | sed s/regex/replacement/FLAG

FLAGS can be any of the following:

  • nothing Replace only the first instance of regex with replacement
  • g Replace all instances of regex with replacement
  • n Could be any number; replace the nth instance of regex with replacement
  • i Match regex in a case-insensitive manner

EXAMPLE:

echo one two three, three two one, one one hundred | sed s/one/ONE/2  

echo how ya doing | sed s/\ //g  

echo 1A2B3C | sed s/[^A-Z]//g

echo 1A2B3C | sed s/[[:digit:]]//g
## one two three, three two ONE, one one hundred
## howyadoing
## ABC
## ABC
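The expressions above are left unquoted, which works here but leaves characters such as spaces and brackets open to interpretation by Bash. Below is a minimal sketch of the same flags with the expression in single quotes. The input string is made up for illustration, and the `I` flag is a GNU sed extension that other sed implementations may not support.

```shell
text="Day day day DAY"                # made-up input for illustration

echo "$text" | sed 's/day/night/'     # no flag: first match only   -> Day night day DAY
echo "$text" | sed 's/day/night/g'    # g: every match              -> Day night night DAY
echo "$text" | sed 's/day/night/2'    # 2: second match only        -> Day day night DAY
echo "$text" | sed 's/day/night/I'    # I (GNU sed): ignore case    -> night day day DAY
```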

EXAMPLE:

echo day sunday | sed s/day/night/
## night sunday

We can put our unix command in a script called sedscript.sh with a parameter for night as follows:

Steps:

  1. type nano sedscript.sh
  2. define a shebang (#!/usr/bin/env bash) at the top of the script
  3. copy and paste the above command, with the parameter $1 in place of night
  4. in the terminal, add permission to execute (chmod u+x sedscript.sh)
  5. run the script with an argument (./sedscript.sh gloomy)
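Put together, the steps above might look like the following sketch. It writes the script with a here-document instead of nano so that the example can be pasted as-is, and it works in a temporary directory (`workdir`, a name chosen here) so that nothing existing is overwritten.

```shell
workdir=$(mktemp -d)                 # scratch directory; the name is an assumption

# The quoted 'EOF' keeps $1 literal while the script file is being written.
cat > "$workdir/sedscript.sh" <<'EOF'
#!/usr/bin/env bash
# Replace the first "day" in the sample text with the first argument ($1).
echo day sunday | sed "s/day/$1/"
EOF

chmod u+x "$workdir/sedscript.sh"    # add execute permission for the owner
"$workdir/sedscript.sh" gloomy       # prints: gloomy sunday
```

Inside the script the sed expression is in double quotes; inside single quotes `$1` would not be expanded.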

3. How a browser initiates a request: anatomy of a url

HTTP (Hypertext Transfer Protocol) allows for communication between a client and a server via request/response messages. There are two important parts to HTTP: the request, which is the data sent to the server, and the response, which is the data sent back from the server.

At the heart of web communication is the request message, which is sent via a Uniform Resource Locator (URL).

The protocol is usually http, but it can be https for secure communication. www.domain.com is the Domain Name System (DNS) name of the web server, which listens for HTTP requests on port 80 by default; a different port can be set explicitly, as illustrated above. The resource path is the local path to the resource on the server. A query, which follows the ?, is a set of characters used to retrieve specific information from a database. The query in this example consists of a field or variable, technically called a key in this context (here, the letter “a”), followed by an equals sign (=), followed by the value for that key (here, the letter “b”). Each key and its corresponding value, written as key=value, is called a key-value pair. A query may contain several key-value pairs; this example has two. When there is more than one key-value pair, they are typically separated by ampersands (&).

For example:

Here is one with a default port and no query:

For example here is a URL we will use later.

http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc

Identify the protocol, host, port, resource path, and queries.
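One way to check your answer is to split the URL with Bash parameter expansion. The sketch below is only an illustration: `parse_url` is a helper name made up here, and it handles just the simple URL shape described in this section (no user names, fragments, or encoded characters).

```shell
# Split a URL into protocol, host, port, resource path, and query pairs.
# parse_url is a hypothetical helper written for this sketch.
parse_url() {
  local url="$1"
  local protocol="${url%%://*}"          # text before "://"
  local rest="${url#*://}"               # text after "://"
  local query=""
  if [[ "$rest" == *\?* ]]; then
    query="${rest#*\?}"                  # text after the first "?"
    rest="${rest%%\?*}"                  # host, port, and path only
  fi
  local hostport="${rest%%/*}"           # text before the first "/"
  local path="/"
  if [[ "$rest" == */* ]]; then
    path="/${rest#*/}"
  fi
  local host="${hostport%%:*}"
  local port="80"                        # default port for http
  if [[ "$protocol" == "https" ]]; then port="443"; fi
  if [[ "$hostport" == *:* ]]; then port="${hostport##*:}"; fi

  echo "protocol: $protocol"
  echo "host: $host"
  echo "port: $port"
  echo "path: $path"

  local pair
  local -a pairs=()
  if [[ -n "$query" ]]; then
    IFS='&' read -ra pairs <<< "$query"  # key-value pairs are separated by &
  fi
  for pair in "${pairs[@]}"; do
    echo "query: ${pair%%=*} = ${pair#*=}"
  done
}

parse_url "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc"
```

Note that the URL is passed in double quotes, because ? and & are special to Bash.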

4. Examples of Data Scraping using command line tools and a shell script

Let's figure out which countries are the top 5 producers of apricots (or other fruits). We’ll use United Nations Food and Agriculture Organization (FAO) data on agricultural production.

Go to http://data.un.org/Explorer.aspx?d=FAO

click on FAO Data

click on “Crops” (you will see a bunch of agricultural products with “Preview” and “View Data” links)

click on “View Data” for apricots as an example and you will see a “Download” button (circled in picture below) that allows you to download a CSV of the data. This is one way to download the data.

To download this file via URL (better) you will want to inspect the HTTP requests that the site handles. In Chrome go to View, Developer, Developer Tools. Click on Network (circled in picture below). Click on Download then CSV and then click on DownloadHandler (circled in the picture below).

Next click on Download and Headers (circled in picture below)

This shows us the Request URL is:

http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc

That downloads the data for Item 526 (apricots). Note that you can see the item ID for other products by hovering over the “View Data” link for the relevant product.
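Since only the item code changes between products, the Request URL can be built from a parameter. This is a small sketch under that assumption; `fao_url` is a hypothetical helper name, and 526 is the only item code confirmed in these notes.

```shell
# Build the download URL for a given FAO item code.
# fao_url is a made-up helper; the rest of the URL is copied from the Request URL.
fao_url() {
  local item_code="$1"
  echo "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:${item_code}&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc"
}

fao_url 526    # prints the Request URL for apricots
```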

Steps to find top 5 producers (countries) of apricots in the year 2005:

  1. Open your SCF terminal:

https://scf.berkeley.edu:9022/wetty/ssh/username

For example my special webpage is: https://scf.berkeley.edu:9022/wetty/ssh/alucas

  2. Download the data for apricots.

Let's make a new directory called apricots.

mkdir apricots
cd apricots

Download the data from the URL. Note that you need to put the http address inside quotes when using wget, since the URL contains the metacharacters ? and & (an unquoted & would end the command there and run it in the background).

wget -O temp.zip "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc"

unzip -o temp.zip       #-o means overwrite existing file

mv UN* file.csv

  3. Extract the data for individual countries into a separate file.

View unzipped file with less


less file.csv  #(type q to quit)

grep -v + file.csv > apricotCountries.csv  

Here -v inverts the match, so grep prints only the lines that do not contain a + (in this file, the + marks rows for regional aggregates rather than individual countries). grep treats + as the literal character +. If you use egrep, then + is treated as a metacharacter and you need to escape it (i.e. egrep -v "\+" file.csv).

  4. Subset the country-level data to the year 2005. Based on the “area harvested”, determine the five countries using the most land to produce apricots.
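To try the filter without the real download, here is a sketch on a few made-up rows. The names and numbers are invented, the column layout is an assumption, and `sample` is a scratch file created only for this example.

```shell
sample=$(mktemp)     # scratch file holding invented rows
cat > "$sample" <<'EOF'
"Regionland +","31","Area Harvested","2005","Ha","1500",""
"Alphaland","31","Area Harvested","2005","Ha","300",""
"Betaland","31","Area Harvested","2005","Ha","1200",""
EOF

grep -v + "$sample"          # basic regex: + is literal, so the first row is dropped
grep -E -v '\+' "$sample"    # extended regex (what egrep uses): + must be escaped
grep -F -v '+' "$sample"     # fixed-string match: no metacharacters at all
```

All three commands print the same two rows, Alphaland and Betaland.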

We need to clean up the data first.

Notice that some countries and regions have commas in the country name (e.g. Iran, Islamic Republic of). These extra commas would shift the comma-separated fields, so here is a fix.

cat apricotCountries.csv | sed "s/, / /g" > apricotCountries1.csv

Notice we need to remove " so that we can sort numerically, and we only care about “Area Harvested”.

cat apricotCountries1.csv | sed "s/\"//g" | grep Harvested > apricotCountries_clean.csv

Note: We need to escape the " character with a backslash, because the sed expression is itself inside double quotes.

cat apricotCountries_clean.csv | egrep 2005 | sort -t ',' -n -k 6 -r | cut -d ',' -f 1,6 | sed "s/,/ /g" | head -n5

Note: The expression after sed is in quotes because a space is a metacharacter.
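The cleaning and ranking steps can be collected into one function and tried on toy input. This is a sketch, not the lecture's script: `top_area_2005` is a made-up name, the rows below are invented, and the layout (country in field 1, value in field 6) is assumed from the cut and sort options used above.

```shell
raw=$(mktemp)        # scratch file holding invented rows
cat > "$raw" <<'EOF'
"Alphaland","31","Area Harvested","2005","Ha","300",""
"Beta, Republic of","31","Area Harvested","2005","Ha","1200",""
"Gammaland","31","Area Harvested","2004","Ha","5000",""
"Deltaland","51","Production","2005","tonnes","9999",""
EOF

# Print up to five "country value" lines for 2005, largest area harvested first.
top_area_2005() {
  sed 's/, / /g' "$1" |          # drop commas inside country names
    sed 's/"//g' |               # strip the double quotes
    grep Harvested |             # keep only the "Area Harvested" rows
    grep 2005 |                  # keep rows that mention 2005
    sort -t ',' -k 6,6 -n -r |   # sort on field 6 only, numeric, descending
    cut -d ',' -f 1,6 |          # keep country and value
    sed 's/,/ /g' |              # replace the remaining comma with a space
    head -n 5
}

top_area_2005 "$raw"
# prints:
# Beta Republic of 1200
# Alphaland 300
```

Matching 2005 anywhere on the line, as the original pipeline does, would also keep a row whose value happens to contain 2005, so it is worth checking the output by eye.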