Source file ⇒ 2017-lec21.Rmd
Bash is a shell that we use to communicate with the operating system, here Ubuntu (a Linux distribution, which is Unix-like). Everything we type in the terminal is interpreted by Bash. Some characters have a special meaning in Bash. For example, Bash uses whitespace to determine where words begin and end. The first word is the command name and additional words become arguments to that command.
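There are different ways to escape special characters. One is to put a backslash \ in front of the character. If there are many special characters that you wish to have their literal meaning, you can enclose them in single quotes (double quotes also protect whitespace, as in the examples here). For example: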
echo hello my friend
echo hello \ \ \ \ my friend
echo "hello my friend"
echo hello" "my friend
## hello my friend
## hello     my friend
## hello my friend
## hello my friend
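The effect of quoting on word splitting can also be seen by counting arguments. The following is a small sketch, not part of the original lecture: count_args is a hypothetical helper and name is an illustrative variable.

```shell
#!/usr/bin/env bash
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() { echo "$#"; }

name=friend   # illustrative variable

n_plain=$(count_args hello my friend)       # three words, so 3 arguments
n_quoted=$(count_args "hello my friend")    # quotes make a single word, so 1
n_escaped=$(count_args hello\ my friend)    # the escaped space joins two words, so 2

double=$(echo "hello my $name")   # double quotes still expand $name
single=$(echo 'hello my $name')   # single quotes keep $ literal
escaped=$(echo hello my \$name)   # the backslash escapes only the $

echo "$n_plain $n_quoted $n_escaped"
printf '%s\n' "$double" "$single" "$escaped"
```

Only the double-quoted version expands the variable; the single-quoted and backslash versions both print the text $name unchanged.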
Here are some other special characters in BASH:
Sed has many uses, but we will focus on using sed for substitution.
syntax: sed s/regex/replacement/FLAG file
OR
cat file | sed s/regex/replacement/FLAG
FLAGS can be any of the following:
EXAMPLE:
echo one two three, three two one, one one hundred | sed s/one/ONE/2
echo how ya doing | sed s/\ //g
echo 1A2B3C | sed s/[^A-Z]//g
echo 1A2B3C | sed s/[[:digit:]]//g
## one two three, three two ONE, one one hundred
## howyadoing
## ABC
## ABC
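To see how the flag changes the result, here is a short sketch that runs the same substitution with no flag, with g, with a number, and with p. It uses only flags that should be available in common sed implementations; the input line is made up for illustration.

```shell
#!/usr/bin/env bash
line='one two one two one'   # made-up input

no_flag=$(echo "$line" | sed 's/one/ONE/')    # no flag: first match only
g_flag=$(echo "$line" | sed 's/one/ONE/g')    # g: every match on the line
n_flag=$(echo "$line" | sed 's/one/ONE/3')    # 3: only the third match

# -n suppresses the default output; p prints only lines where a substitution happened.
p_flag=$(printf 'one\ntwo\n' | sed -n 's/one/ONE/p')

printf '%s\n' "$no_flag" "$g_flag" "$n_flag" "$p_flag"
```

The expressions are single-quoted here so that Bash passes them to sed unchanged.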
EXAMPLE:
echo day sunday | sed s/day/night/
## night sunday
We can put our Unix command in a script called sedscript.sh, with a parameter for night, as follows:
Steps:
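One possible version of such a script is sketched below. The contents of sedscript.sh here are an assumption, not the script from the lecture: the replacement word is read from the first positional parameter, $1.

```shell
#!/usr/bin/env bash
# Write a hypothetical sedscript.sh; the quoted 'EOF' keeps $1 literal in the file.
cat > sedscript.sh <<'EOF'
#!/bin/bash
# Usage: bash sedscript.sh REPLACEMENT
# Double quotes let bash expand $1 before sed sees the expression.
echo day sunday | sed "s/day/$1/"
EOF

# Run the script with the replacement word as its argument.
bash sedscript.sh night
bash sedscript.sh evening
```

After chmod +x sedscript.sh the script could also be run as ./sedscript.sh night.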
HTTP (Hypertext Transfer Protocol) allows for communication between a client and a server via request/response messages. There are two important parts to HTTP: the request, which is the data sent to the server, and the response, which is the data sent back from the server.
At the heart of web communication is the request message, which is sent via a Uniform Resource Locator (URL).
The protocol is usually http, but it can be https for secure communications. www.domain.com represents the Domain Name System (DNS) name of the web server, which listens for HTTP requests on port 80 by default, although a different port can be set explicitly, as illustrated above. The resource path is the local path to the resource on the server. A query, following ?, is a string of characters used to retrieve specific information, typically from a database. The query in this example consists of a field or variable, technically called a key in this context (here, the letter “a”), followed by an equals sign (=), followed by the value for that key (here, the letter “b”). Each key and its corresponding value, written as an equation, is called a key-value pair. A query may contain several key-value pairs. This example has two. When there is more than one key-value pair, they are typically separated by ampersands (&).
For example:
Here is one with a default port and no query:
For example, here is a URL we will use later:
http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc
Identify the protocol, host, port, resource path, and queries.
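One way to check your answer is to split the URL with Bash parameter expansion. This is only a sketch: the variable names are illustrative, and the fallback to port 80 assumes the http protocol with no explicit port.

```shell
#!/usr/bin/env bash
url='http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc'

protocol=${url%%://*}      # everything before ://
rest=${url#*://}           # everything after ://
hostport=${rest%%/*}       # up to the first /
pathquery=/${rest#*/}      # the first / and everything after it
path=${pathquery%%\?*}     # up to the first ?
query=${pathquery#*\?}     # after the first ?

host=${hostport%%:*}       # drop :port if one is given
port=${hostport#"$host"}   # what is left is :port or nothing
port=${port#:}
port=${port:-80}           # assumption: default port for http

echo "protocol: $protocol"
echo "host:     $host"
echo "port:     $port"
echo "path:     $path"
echo "queries:"
printf '%s\n' "$query" | tr '&' '\n'   # one key-value pair per line
```

The URL is single-quoted so that Bash does not interpret ? or & while assigning it.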
Let's figure out which countries are the top 5 producers of apricots (or other fruits). We’ll use United Nations Food and Agriculture Organization (FAO) data on agricultural production.
Go to http://data.un.org/Explorer.aspx?d=FAO
Click on FAO Data.
Click on “Crops” (you will see a bunch of agricultural products with “Preview” and “View Data” links).
Click on “View Data” for apricots as an example and you will see a “Download” button (circled in picture below) that allows you to download a CSV of the data. This is one way to download the data.
To download this file via URL (better) you will want to inspect the HTTP requests that the site handles. In Chrome go to View, Developer, Developer Tools. Click on Network (circled in picture below). Click on Download then CSV and then click on DownloadHandler (circled in the picture below).
Next, click on Download and Headers (circled in picture below).
This shows us the Request URL is:
That downloads the data for Item 526 (apricots). Note that you can see the item ID for other products by hovering over “View Data” link for the relevant product.
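Since the item code is the only part of the query that names the product, a small helper can build the URL for another product. This is a sketch: fao_url is a hypothetical name, and it assumes that the other query parameters stay the same for every item, which these notes do not confirm.

```shell
#!/usr/bin/env bash
# fao_url is a hypothetical helper: it only swaps the item code into the request URL.
fao_url() {
  local item_code=$1
  echo "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:${item_code}&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc"
}

fao_url 526   # 526 is the item code for apricots
```

The item code for another product would be passed in the same way, using the ID shown when hovering over its “View Data” link.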
https://scf.berkeley.edu:9022/wetty/ssh/username
For example my special webpage is: https://scf.berkeley.edu:9022/wetty/ssh/alucas
Let's make a new directory called apricots:
mkdir apricots
cd apricots
Download the data from the URL. Note that you need to put the HTTP address inside double quotes when using wget to download it, since the URL contains the shell metacharacters ? and &.
wget -O temp.zip "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=itemCode:526&DataMartId=FAO&Format=csv&c=2,3,4,5,6,7&s=countryName:asc,elementCode:asc,year:desc"
unzip -o temp.zip #-o means overwrite existing file
mv UN* file.csv
View unzipped file with less
less file.csv #(type q to quit)
grep -v + file.csv > apricotCountries.csv
Here -v inverts the match, so grep prints only the lines that do not contain +. grep treats + as the literal character +. If you use egrep, then + is treated as a metacharacter and you need to escape it (e.g. egrep -v "\+" file.csv).
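The difference between the two can be tried on a couple of made-up rows, without the downloaded file. In this sketch grep -E is used in place of egrep (the two are equivalent), and grep -F is shown as a third option that treats the pattern as a fixed string.

```shell
#!/usr/bin/env bash
# Made-up rows; the second one contains a + like the rows that grep -v + removes.
rows='Country A,2005,10
Region B +,2005,99'

basic=$(echo "$rows" | grep -v +)          # basic regex: + is a literal character
extended=$(echo "$rows" | grep -vE '\+')   # extended regex: + must be escaped
fixed=$(echo "$rows" | grep -vF '+')       # -F: the pattern is a fixed string, no regex

printf '%s\n' "$basic" "$extended" "$fixed"
```

All three commands keep only the row without a +.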
We need to clean up the data first.
Notice that some countries and regions have commas in the country name (e.g. Iran, Islamic Republic of), which would be mistaken for field separators. Here is a fix.
cat apricotCountries.csv | sed "s/, / /g" > apricotCountries1.csv
Notice we need to remove the " characters so that we can sort numerically, and we only care about “Area Harvested”.
cat apricotCountries1.csv | sed "s/\"//g" | grep Harvested > apricotCountries_clean.csv
Note: We need to escape the " character with a backslash, because the sed expression is itself inside double quotes.
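The two clean-up substitutions can be checked on a single row. This is a sketch on a made-up row: the column layout and the value are assumptions for illustration, not taken from the downloaded file.

```shell
#!/usr/bin/env bash
# Made-up row: quoted fields, with a comma inside the country name.
row='"Iran, Islamic Republic of","31","Area Harvested","2005","Ha","100"'

# Step 1: drop the comma inside the country name (a comma followed by a space).
step1=$(echo "$row" | sed "s/, / /g")

# Step 2: delete every double quote; \" is needed because the expression is double-quoted.
step2=$(echo "$step1" | sed "s/\"//g")

# Count comma-separated fields with and without step 1.
fields_before=$(echo "$row" | sed "s/\"//g" | awk -F, '{print NF}')
fields_after=$(echo "$step2" | awk -F, '{print NF}')

echo "$step2"
echo "$fields_before fields without the fix, $fields_after with it"
```

Without the first substitution the country name would be split over two fields, which would shift every later field number by one.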
cat apricotCountries_clean.csv | egrep 2005 | sort -t ',' -n -k 6 -r | cut -d ',' -f 1,6 | sed "s/,/ /g" | head -n5
Note: The expression after sed is in quotes because the space is a metacharacter in Bash; without the quotes, the expression would be split into two words.
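To follow what each stage of the pipeline does, it can be run on a few made-up rows. The rows below are toy data, not FAO figures, and assume the cleaned layout with the country in field 1, the year in field 4, and the value in field 6.

```shell
#!/usr/bin/env bash
# Made-up, already-cleaned rows (toy values, not real data).
clean='Country A,31,Area Harvested,2005,Ha,300
Country B Republic of,31,Area Harvested,2005,Ha,1200
Country C,31,Area Harvested,2005,Ha,50
Country A,31,Area Harvested,2004,Ha,9999'

top2=$(echo "$clean" |
  grep -E 2005 |              # keep only rows that mention 2005
  sort -t ',' -n -k 6 -r |    # numeric sort on field 6, largest first
  cut -d ',' -f 1,6 |         # keep the country and the value
  sed "s/,/ /g" |             # replace the remaining comma with a space
  head -n 2)                  # top 2 here; the pipeline above uses head -n5

echo "$top2"
```

Note that matching on 2005 alone would also keep a row whose value happens to contain 2005, so matching on ,2005, is a stricter alternative.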