Source file ⇒ lec26.Rmd

Today

  1. Setting up Data Science Toolbox

  2. Unix Commands

Data cleaning isn’t always best done in R, since R holds its data in memory. For big data sets, it is often faster to do data cleaning with command-line tools. There is a bit of overhead in teaching you how to do data science at the command line, however: we need to set up the Data Science Toolbox so that we all have the same command-line environment.

1. Setting up Data Science Toolbox

Data Science Toolbox is a virtual environment based on Ubuntu Linux that is specifically suited for doing data science. Its purpose is to get you started in a matter of minutes (this is a little optimistic). You can run the Data Science Toolbox either locally (using VirtualBox and Vagrant) or in the cloud (using Amazon Web Services) but I suggest you run it locally.

Please install Data Science Toolbox. Here is how to do it (steps 1-4):

data science toolbox

Here are some of my personal comments about the steps for installing it locally on your computer:

step 1: Simply download VirtualBox as instructed (around 177 MB of storage). Don’t do anything further.

step 2: Simply download Vagrant as instructed (around 250 MB of storage). Don’t do anything further.

step 3: Mac users: To find Terminal, type terminal in the search window at the top right of the screen.

PC users: To find PowerShell, type powershell in the search window in the Start menu. This step may involve a long download.

step 4: Easy for Mac users. PC users: follow the directions and click on putty.exe. In the Saved Sessions box, type “toolbox” so that you can save your settings for another time. Click on Save.

(no need to do step 5 at this time–setting up ipython notebook)

To end a Vagrant session, type exit.

Mac users: To restart there are 4 steps

  1. open terminal
  2. cd MyDataScienceToolbox
  3. vagrant up
  4. vagrant ssh

PC users: To restart

  1. Go to datasciencetoolbox.org and click on the link for putty.exe, or click on this link: putty.exe
  2. Click on “toolbox”, which has your settings. Click on Load and then Open.
  3. login: vagrant / password: vagrant

Once you have everything running, I suggest you type sudo apt-get update to refresh the list of available software packages inside your Vagrant machine.

The Environment

So you’ve just logged into a brand new environment. Before we do anything, it’s worthwhile to get a high-level understanding of this environment. The environment is roughly defined by four layers, which we briefly discuss from the top down:

Command-line tools

Command-line tools, the first layer, are the programs whose commands we use to work with our data. They are like the DataComputing package in R, which contains functions such as select(), mutate(), etc. Command-line tools include cd, ls, cat and many more.

Terminal

The Terminal is an application where we type our commands in. It is like the console in RStudio. Here is a picture of the console in Data Science Toolbox:

Shell

The shell is the command interpreter. Once we have typed in our command and pressed <Enter>, the terminal sends that command to the shell. The Data Science Toolbox uses Bash as the shell, but there are many others available.

Operating System

The fourth layer is the operating system, which is GNU/Linux in our case. Linux is the name of the kernel, which is the heart of the operating system. The kernel is in direct contact with the CPU, disks, and other hardware. The kernel also executes our command-line tools. GNU, which is a recursive acronym for “GNU’s Not Unix”, refers to a set of basic tools. The Data Science Toolbox is based on a particular Linux distribution called Ubuntu. Unix is the older operating system that GNU/Linux is modeled on, and I will refer to GNU/Linux or Ubuntu as Unix.

The most popular desktop OS today, Microsoft Windows, uses a graphical user interface (GUI) for you to interact with the OS. This is easy to learn but not very powerful.

UNIX, on the other hand, is hard to learn at first, but it gives you vastly more control over what your computer can do, kind of like driving stick shift compared to automatic (race car drivers always drive stick). It can be much faster to complete some tasks using command-line tools at the terminal than with graphical applications and menus. It also makes your workflow reproducible.

2. Unix Commands

Unix Commands are the command line tools that we use to communicate with the OS.

Here is a useful list of keyboard shortcuts to get around in Unix: shortcuts

Directories and Files

The first thing you need to know about UNIX is how to work with directories and files. Technically, everything in UNIX is a file, but it’s easier to think of directories as you would folders on Windows or Mac OS.

Directories are organized in an inverted tree structure:

To see the directory you’re currently in, type the command pwd (“print working directory”).

There are two “special” directories: The top level directory, “/”, is called the root directory.

Your home directory, “~”, contains all your files. For user mary on a Mac, “~” and “/Users/mary” mean the same thing (on Linux it would be “/home/mary”).

To create a new directory, use the command mkdir. Then to move into it, use cd.

EXAMPLES:
pwd
mkdir unixexamples
cd unixexamples
ls
ls -a
ls -l

ls -a means to show all files, including the hidden files starting with a dot (“.”).

The two hidden files here are special and exist in every directory. “.” refers to the current directory, and “..” refers to the directory above it.

ls -l shows the size of the files which can be useful.
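The commands above can be tried safely in a throwaway directory. This is only a sketch: the scratch directory and the file names .hidden and visible are made up for illustration.

```shell
# Work in a temporary scratch directory so nothing in your home directory changes.
scratch=$(mktemp -d)
cd "$scratch"

mkdir unixexamples       # create a new directory
cd unixexamples          # move into it
pwd                      # print the absolute path of where we are now

touch .hidden visible    # one hidden file (leading dot) and one ordinary file
ls                       # lists only the ordinary file
ls -a                    # also lists ., .. and .hidden
ls -l                    # long listing, including file sizes
```

Comparing the output of ls and ls -a makes it clear which entries are hidden.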

This brings us to the distinction between relative and absolute path names. (Think of a path like an address in UNIX, telling you where you are in the directory tree.)

You may have noticed that I typed cd unixexamples, rather than cd /Users/Adam/unixexamples.

The first is the relative path; the second is the absolute path.

To refer to a file, you need to either be in the directory where the file is located, or you need to refer to it using a relative or absolute path name.

EXAMPLES:
pwd
echo "Testing 1 2 3" > test.txt
ls
cat test.txt
cd ..
cat test.txt
cat unixexamples/test.txt

Note that file names must be unique within a particular directory, but having, say, both /Users/Adam/test.txt and /Users/Adam/unixexamples/test.txt is OK.
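Here is a small sketch of relative versus absolute paths. It assumes a temporary scratch directory (the variable name scratch is my own) so that it can be run anywhere.

```shell
scratch=$(mktemp -d)                   # mktemp -d prints the path of a new empty directory
cd "$scratch"
mkdir unixexamples
cd unixexamples
echo "Testing 1 2 3" > test.txt

cat test.txt                           # relative path: works, we are in the file's directory
cd ..                                  # move up to the parent directory
cat test.txt 2>/dev/null || echo "no test.txt in $(pwd)"   # fails: the file is not here
cat unixexamples/test.txt              # relative path, starting from the parent directory
cat "$scratch/unixexamples/test.txt"   # absolute path: works from any directory
```

The same file is reached three different ways; only the starting point of the path changes.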

You can refer to multiple files at once using wildcards. The most common one is the asterisk (*). It stands in for anything (including nothing at all).

EXAMPLES:
touch test1 test2
ls t*
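A quick sketch of how the asterisk matches, using made-up file names in a scratch directory. The file named t is there to show that * can also match nothing at all.

```shell
scratch=$(mktemp -d)
cd "$scratch"

touch t test1 test2 notes.txt   # four empty files to match against

ls t*        # t, test1 and test2: * matches any run of characters, even an empty one
ls *.txt     # only notes.txt
echo t*      # the shell expands the pattern before echo ever sees it
```

The last line is a reminder that the shell, not the command, does the expansion.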

Commands, arguments, and options

We’ve already started using these; now let’s define them more precisely.

The general syntax for a UNIX command looks like this:

$ command -options argument1 argument2

The number of arguments may vary. An argument comes at the end of the command line; it’s usually the name of a file or some text.

Options come between the name of the command and the arguments, and they tell the command to do something other than its default. They’re usually prefaced with one or two hyphens.

EXAMPLES:
mv test.txt newname.txt
rmdir unixexamples
rm -r unixexamples
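The sketch below walks through these examples in a scratch directory, with comments marking which part is the command, the option and the argument. The directory layout is assumed for illustration only.

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir unixexamples
touch unixexamples/test.txt

# command: mv, no options, two arguments (old name, new name)
mv unixexamples/test.txt unixexamples/newname.txt

# command: ls, option: -l, one argument
ls -l unixexamples

# rmdir only removes empty directories, so this one is refused
rmdir unixexamples 2>/dev/null || echo "rmdir refused: directory is not empty"

# the -r option tells rm to remove the directory and everything inside it
rm -r unixexamples
```

Be careful with rm -r: there is no trash can to recover files from.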

To look at the syntax of any particular UNIX command, type man (for “manual”) and then the name of the command.

The two most important parts of the man page are labeled SYNOPSIS and DESCRIPTION. These are very much like the “Usage” and “Arguments” in R’s help pages.

SYNOPSIS shows you the syntax for a particular command. Bracketed arguments are optional.

DESCRIPTION tells you what all the options do.

Press the space bar to scroll forward through the man page, b to go backward, and q to exit.

Typing &regex while viewing a man page (in the less pager) displays only the lines matching the regular expression. This can be very useful when a man page is long.

Recap of commands so far and a few more:

pwd print working directory
ls list contents of current directory
ls -a list contents, including hidden files
ls -l list contents, including size of files
mkdir create a new directory
cd dname change directory to dname
cd .. change to parent directory
cd ~ change to home directory
mv move or rename a file
touch create a new empty file
rm remove a file
rm -r remove a directory and everything inside it
wc -l count the number of lines in a file
head -5 look at the first 5 lines of a file
cat print contents of a file
tail look at end of file
cp copy a file
echo display a line of text written in the terminal
wget download a file from the web

EXAMPLES:
cat ~/book/ch02/data/movies.txt
wc -l ~/book/ch02/data/movies.txt
head -2 ~/book/ch02/data/movies.txt
echo hello world!
touch test
wget -O houses.csv http://www.mosaic-web.org/go/datasets/SaratogaHouses.csv
head houses.csv
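If you don’t have the book files or a network connection handy, the same commands can be tried on a small made-up file. This is just a sketch; sample.txt and its contents are invented for the illustration.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# printf repeats its format for each argument, giving a five-line file
printf 'line %s\n' 1 2 3 4 5 > sample.txt

wc -l sample.txt       # count the lines in the file
head -2 sample.txt     # first two lines
tail -2 sample.txt     # last two lines

cp sample.txt copy.txt       # copy a file
mv copy.txt renamed.txt      # rename the copy
rm renamed.txt               # remove it again
ls                           # only sample.txt is left
```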

The real power of Unix comes from stringing these together.

Redirection and Pipes

Redirection and pipes are really at the heart of the UNIX philosophy, which is to have many small tools, each one suited for a particular job.

Redirection refers to changing where the input and output of individual commands/programs go.

The “standard output” or STDOUT is usually your terminal (monitor).

The form of a command with standard input and output redirection is:

command -[options] [arguments] input file > output file (create new or overwrite output file) or
command -[options] [arguments] input file >> output file (create new or append at end of output file)

EXAMPLES:
ls -a book > text
cat text
sort -r text > sorted
cat sorted
cat
echo hello friend >> text
cat
cat >> text
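Here is a minimal sketch of the difference between > and >>, using made-up file names in a scratch directory. It also shows a common pitfall: redirecting a command’s output into the very file it is reading.

```shell
scratch=$(mktemp -d)
cd "$scratch"

echo "first"  > log.txt     # > creates log.txt
echo "second" > log.txt     # > again overwrites it, so "first" is gone
echo "third" >> log.txt     # >> appends to the end of the file
cat log.txt                 # second, then third

sort -r log.txt > sorted.txt   # safe: the output goes to a different file
cat sorted.txt                 # third, then second

cp log.txt same.txt
sort -r same.txt > same.txt    # pitfall: the shell empties same.txt before sort reads it
wc -c same.txt                 # the file now holds 0 bytes
```

The shell sets up the redirection before it starts the command, which is why the last sort finds nothing to read.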

The idea behind pipes is that rather than redirecting output to a file, we redirect it into another command. This is analogous to the pipe operator %>% in the dplyr package.

Another way to say this is that output of one command is used as input to another command.

EXAMPLES:
cat > somenumbers.txt
cat somenumbers.txt | uniq | sort
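One thing worth knowing when chaining these two filters: uniq only collapses duplicate lines that are next to each other, so the order of the pipeline matters. The sketch below uses made-up numbers written with printf instead of typing them in with cat.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# five made-up numbers with repeats, none of them adjacent
printf '%s\n' 3 1 3 2 1 > somenumbers.txt

cat somenumbers.txt | uniq | sort           # uniq finds no adjacent repeats: all 5 lines survive
cat somenumbers.txt | sort | uniq           # sorting first puts repeats together: 3 lines survive
cat somenumbers.txt | sort | uniq | wc -l   # pipes can keep going: count the distinct values
```

To drop all duplicates, sort first and then pipe into uniq.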

Here are two more useful filters:

egrep - print lines matching a pattern (regex)

EXAMPLE:
egrep e *.txt will print all lines in any file ending with .txt which contain the regex e.
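A small sketch of pattern matching on made-up files. It uses grep -E, which is the same thing as egrep (newer versions of GNU grep print a warning when you call egrep). The file names and their contents are invented for the illustration.

```shell
scratch=$(mktemp -d)
cd "$scratch"

printf '%s\n' apple kiwi pear > fruit.txt
printf '%s\n' fig melon > more.txt

grep -E e *.txt               # every line containing an e; with several files, the file name is shown too
grep -E '^(a|p)' fruit.txt    # lines that start with a or p
grep -E -c e fruit.txt        # -c counts the matching lines instead of printing them
```

Quoting the pattern, as in the second command, keeps the shell from interpreting characters like | and ( itself.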

cut - select portions of each line of a file

EXAMPLE:
cat houses.csv | cut -d ',' -f 2-4 | head
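To see exactly what cut keeps, here is a sketch on a tiny made-up CSV. The column names and values are hypothetical and are not the real SaratogaHouses data.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# a made-up file with four comma-separated columns
cat > tiny.csv <<'EOF'
id,price,rooms,age
1,100,3,10
2,200,4,20
EOF

cat tiny.csv | cut -d ',' -f 2-3 | head -2   # -d sets the delimiter, -f 2-3 keeps fields 2 through 3
cut -d ',' -f 1,4 tiny.csv                   # a comma-separated list picks individual fields
```

Note that cut splits on every comma, so it is only a rough tool for CSV files whose values contain quoted commas.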

Next time:

  1. Using Unix commands for R Batch jobs from the Unix command line
  2. Data cleaning with command line tools

i-clicker questions