Source file ⇒ lec26.Rmd
Setting up Data Science Toolbox
Unix Commands
Data cleaning isn’t always best done in R, since R holds data sets in memory. For big data sets, it can be faster to do data cleaning with command-line tools. There is a bit of overhead in learning to do data science at the command line, however: we need to set up the Data Science Toolbox so that we all have the same command-line environment.
Data Science Toolbox is a virtual environment based on Ubuntu Linux that is specifically suited for doing data science. Its purpose is to get you started in a matter of minutes (this is a little optimistic). You can run the Data Science Toolbox either locally (using VirtualBox and Vagrant) or in the cloud (using Amazon Web Services) but I suggest you run it locally.
Please install Data Science Toolbox. Here is how to do it (steps 1-4):
Here are some of my personal comments about the steps for installing it locally on your computer:
step 1: Simply download VirtualBox as instructed (around 177 MB of storage). Don’t do anything further.
step 2: Simply download Vagrant as instructed (around 250 MB of storage). Don’t do anything further.
step 3: Mac users: To find Terminal, type terminal in the search window at the top right of the screen.
PC users: To find PowerShell, type powershell in the search window in the Start menu. The download may take a long time.
step 4: Easy for Mac users. PC users: follow the directions and click on putty.exe. In the Saved Sessions box, type “toolbox” so that you can save your settings for another time. Click on Save.
(There is no need to do step 5, setting up the IPython notebook, at this time.)
To end the Vagrant session, type exit.
Mac users: To restart there are 4 steps
PC users: To restart
Once you have everything running, I suggest you type sudo apt-get update to refresh the package lists inside your Vagrant box.
So you’ve just logged into a brand new environment. Before we do anything, it’s worthwhile to get a high-level understanding of this environment. The environment is roughly defined by four layers, which we briefly discuss from the top down:
Command-line tools
Command-line tools, the first layer, are the programs whose commands you use to work with your data. They are like the DataComputing package in R, which contains functions such as select(), mutate(), etc. Command-line tools include cd, ls, cat, and many more.
Terminal
The Terminal is an application where we type our commands in. It is like the console in RStudio. Here is a picture of the console in Data Science Toolbox:
Shell
The shell is the command interpreter. Once we have typed in our command and pressed <Enter>, the terminal sends that command to the shell. The Data Science Toolbox uses Bash as the shell, but there are many others available.
Operating System
The fourth layer is the operating system, which is GNU/Linux in our case. Linux is the name of the kernel, which is the heart of the operating system. The kernel is in direct contact with the CPU, disks, and other hardware. The kernel also executes our command-line tools. GNU, which is an acronym for “GNU’s Not Unix”, refers to a set of basic tools. The Data Science Toolbox is based on a particular Linux distribution called Ubuntu. Unix is the older operating system that GNU/Linux is modeled on, and I will refer to GNU/Linux or Ubuntu as Unix.
The most popular OS today, Microsoft Windows, uses a graphical user interface (GUI) for you to interact with the OS. This is easy to learn but not very powerful.
UNIX, on the other hand, is hard to learn at first, but it allows you vastly more control over what your computer can do, kind of like driving stick shift compared to automatic (race car drivers always drive stick). It can be much faster to complete some tasks using command-line tools at the terminal than with graphical applications and menus. It also makes your workflow reproducible.
Unix Commands are the command line tools that we use to communicate with the OS.
Here is a useful list of keyboard short cuts to get around in Unix: short cuts
The first thing you need to know about UNIX is how to work with directories and files. Technically, everything in UNIX is a file, but it’s easier to think of directories as you would folders on Windows or Mac OS.
Directories are organized in an inverted tree structure:
To see the directory you’re currently in, type the command pwd (“print working directory”).
There are two “special” directories: The top level directory, “/”, is called the root directory.
Your home directory, “~”, contains all your files. For user mary on a Mac, “~” and “/Users/mary” mean the same thing (on Ubuntu it would be “/home/mary”).
To create a new directory, use the command mkdir. Then to move into it, use cd.
EXAMPLES:
pwd
mkdir unixexamples
cd unixexamples
ls
ls -a
ls -l
ls -a means to show all files, including the hidden files starting with a dot (“.”).
The two hidden files here are special and exist in every directory. “.” refers to the current directory, and “..” refers to the directory above it.
ls -l shows the size of the files, which can be useful.
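The directory commands above can be tried safely in a scratch directory. This is only a sketch: the temporary directory from mktemp and the file names visible.txt and .hidden.txt are made up for illustration and are not part of the lecture materials.

```shell
# Work inside a throwaway directory so nothing in your home directory changes.
workdir=$(mktemp -d)
cd "$workdir"

mkdir unixexamples        # create a new directory
cd unixexamples           # move into it
pwd                       # print the absolute path of the current directory

touch visible.txt .hidden.txt   # hypothetical file names, one of them hidden
plain=$(ls)               # hidden files are left out
all=$(ls -a)              # also lists ., .., and .hidden.txt
echo "$plain"
echo "$all"
ls -l                     # long listing: permissions, owner, size, date
```

Comparing the two listings shows what -a adds: the dot entries and any file whose name starts with a dot.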
This brings us to the distinction between relative and absolute path names. (Think of a path like an address in UNIX, telling you where you are in the directory tree.)
You may have noticed that I typed cd unixexamples, rather than cd /Users/Adam/unixexamples. The first is the relative path; the second is the absolute path.
To refer to a file, you need to either be in the directory where the file is located, or you need to refer to it using a relative or absolute path name.
EXAMPLES:
pwd
echo "Testing 1 2 3" > test.txt
ls
cat test.txt
cd ..
cat test.txt
cat unixexamples/test.txt
Note that file names must be unique within a particular directory, but having, say, both /Users/Adam/test.txt and /Users/Adam/unixexamples/test.txt is OK.
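Here is one way to see relative and absolute paths side by side. It is a sketch under assumptions: the variable base (a temporary directory) stands in for a home directory such as /Users/Adam, and the other variable names are chosen only for this example.

```shell
base=$(mktemp -d)                  # stand-in for a home directory
mkdir "$base/unixexamples"
cd "$base/unixexamples"
echo "Testing 1 2 3" > test.txt

from_inside=$(cat test.txt)                # relative path: works because we are in unixexamples
cd ..
from_parent=$(cat unixexamples/test.txt)   # relative path, starting from the parent directory
from_anywhere=$(cat "$base/unixexamples/test.txt")   # absolute path: works from any directory

# From the parent directory, the bare file name no longer resolves.
if [ -e test.txt ]; then found=yes; else found=no; fi
echo "$from_inside | $from_parent | $from_anywhere | test.txt visible here: $found"
```

A relative path is interpreted from wherever you currently are, which is why the same file needs a different relative path after cd .. while the absolute path stays the same.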
You can refer to multiple files at once using wildcards. The most common one is the asterisk (*). It stands in for anything (including nothing at all).
EXAMPLES:
touch test1 test2
ls t*
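A small sketch of the asterisk wildcard, run in a scratch directory. The file names here (including the one-letter file t) are made up to show that * also matches nothing at all.

```shell
cd "$(mktemp -d)"
touch test1 test2 t notes.txt     # hypothetical file names

# The shell expands the pattern into matching file names
# before the command (here echo) ever runs.
starts_with_t=$(echo t*)
text_files=$(echo *.txt)
echo "$starts_with_t"
echo "$text_files"
```

Because the shell does the expansion, the same patterns work with any command that takes file names, not only ls.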
Commands, arguments, and options
We’ve already started using these; now let’s define them more precisely.
The general syntax for a UNIX command looks like this:
$ command -options argument1 argument2
(The number of arguments may vary.) An argument comes at the end of the command line. It’s usually the name of a file or some text.
Options come between the name of the command and the arguments, and they tell the command to do something other than its default. They’re usually prefaced with one or two hyphens.
EXAMPLES:
mv test.txt newname.txt
rmdir unixexamples
rm -r unixexamples
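To make the command/option/argument split concrete, here is a sketch that runs the three commands above in a scratch directory. The directory emptydir is an extra, hypothetical name added so that rmdir has something it is able to remove.

```shell
cd "$(mktemp -d)"
mkdir unixexamples emptydir
echo "Testing 1 2 3" > unixexamples/test.txt

# command: mv, no options, two arguments (old name, new name)
mv unixexamples/test.txt unixexamples/newname.txt

# command: rmdir, one argument; it only removes directories that are empty
rmdir emptydir

# command: rm, option -r (recursive), one argument;
# removes the directory and everything inside it, with no undo
rm -r unixexamples

remaining=$(ls -A)    # -A lists everything except . and ..
echo "left over: [$remaining]"
```

Running rmdir on unixexamples while it still contains a file would fail, which is why the option -r on rm is needed for non-empty directories.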
To look at the syntax of any particular UNIX command, type man (for “manual”) and then the name of the command.
The two most important parts of the man page are labeled SYNOPSIS and DESCRIPTION. These are very much like the “Usage” and “Arguments” in R’s help pages.
SYNOPSIS shows you the syntax for a particular command. Bracketed arguments are optional.
DESCRIPTION tells you what all the options do.
Press the space bar to scroll forward through the man page, b to go backward, and q to exit.
&regex in man pages displays only lines matching the regular expression. This can be very useful when the man pages are long.
pwd: print working directory
ls: list contents of current directory
ls -a: list contents, including hidden files
ls -l: list contents, including size of files
mkdir: create a new directory
cd dname: change directory to dname
cd ..: change to parent directory
cd ~: change to home directory
mv: move or rename a file
touch: create a new empty file
rm: remove a file
rm -r: remove a directory and everything inside it
wc -l: count the number of lines in a file
head -5: look at the first 5 lines of a file
cat: print contents of a file
tail: look at end of file
cp: copy a file
echo: display a line of text written in the terminal
wget: download a file from the web
EXAMPLES:
cat ~/book/ch02/data/movies.txt
wc -l ~/book/ch02/data/movies.txt
head -2 ~/book/ch02/data/movies.txt
echo hello world!
touch test
wget -O houses.csv http://www.mosaic-web.org/go/datasets/SaratogaHouses.csv
head houses.csv
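The file-inspection commands can also be tried without the book's data or a network connection. The sketch below builds a tiny stand-in file; films.csv, backup.csv, and their contents are made up for illustration. It uses head -n 2, the portable spelling of head -2.

```shell
cd "$(mktemp -d)"
# Build a small stand-in data file; the rows are invented.
printf 'title,year\nAlpha,2001\nBeta,2002\nGamma,2003\n' > films.csv

count=$(wc -l films.csv)          # prints the line count followed by the file name
first_two=$(head -n 2 films.csv)  # first two lines
last_one=$(tail -n 1 films.csv)   # last line
cp films.csv backup.csv           # copy; the original stays in place

echo "$count"
echo "$first_two"
echo "$last_one"
```

Checking a file with wc -l and head before opening it in R is a quick way to learn how big it is and what its columns look like.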
The real power of Unix comes from stringing these together.
Redirection and pipes are really at the heart of the UNIX philosophy, which is to have many small tools, each one suited for a particular job.
Redirection refers to changing the output of individual commands/programs.
The “standard output” or STDOUT is usually your terminal (monitor).
The form of a command with standard input and output redirection is:
command -[options] [arguments] input file > output file (create new or overwrite output file)
or
command -[options] [arguments] input file >> output file (create new or append at end of output file)
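EXAMPLES:
ls -a book > text
cat text
sort -r text > text
cat text
cat
echo hello friend >> text
cat
cat >> text
Note that sort -r text > text leaves text empty, because the shell clears the output file before sort gets to read it; redirect to a different file name to keep the sorted lines. cat with no file name reads from the keyboard until you press Ctrl-D.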
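The difference between > and >> can be sketched in a scratch directory as follows. The file name sorted and the file contents are assumptions made for this illustration.

```shell
cd "$(mktemp -d)"
printf 'b\na\nc\n' > text        # > creates the file, or overwrites it if it exists
echo "hello friend" >> text      # >> appends to the end instead

sort -r text > sorted            # send the result to a different file
sorted_copy=$(cat sorted)
echo "$sorted_copy"

# Pitfall: the shell empties the output file before the command starts,
# so reading from and writing to the same file loses the data.
sort -r text > text
if [ -s text ]; then state="not empty"; else state="empty"; fi
echo "text is now $state"
```

Writing to a new file name and renaming it afterwards with mv is one common way around the pitfall.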
The idea behind pipes is that rather than redirecting output to a file, we redirect it into another command. This is analogous to the pipe %>% in the dplyr package.
Another way to say this is that output of one command is used as input to another command.
EXAMPLES:
cat > somenumbers.txt
cat somenumbers.txt | uniq | sort
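The order of commands in a pipeline matters. The sketch below fills somenumbers.txt with made-up numbers (instead of typing them into cat) and compares two orderings; the variable names are chosen only for this example.

```shell
cd "$(mktemp -d)"
printf '3\n1\n3\n2\n1\n' > somenumbers.txt    # made-up numbers

# uniq only collapses duplicates that sit on adjacent lines,
# so sorting first is what brings the duplicates together.
sort_then_uniq=$(cat somenumbers.txt | sort | uniq)
uniq_then_sort=$(cat somenumbers.txt | uniq | sort)

echo "$sort_then_uniq"    # each number once
echo "$uniq_then_sort"    # duplicates survive because they were not adjacent
```

If the goal is a list of distinct values, sort | uniq is the usual ordering.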
Here are two more useful filters:
egrep - print lines matching a pattern (regex)
EXAMPLE: egrep e *.txt will print all lines in any file ending with .txt which contain the regex e.
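A sketch of the same search on two made-up files. It is written with grep -E, which is equivalent to egrep; some newer versions of GNU grep print a warning when egrep is called directly. The file names and contents are hypothetical.

```shell
cd "$(mktemp -d)"
printf 'apple\nfig\nkiwi\n' > fruit.txt     # made-up contents
printf 'plum\npear\n' > more.txt

# With more than one file, each matching line is prefixed with its file name.
hits=$(grep -E e *.txt)
echo "$hits"
```

Only the lines that contain the letter e are printed, one per line, tagged with the file they came from.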
cut - select portions of each line of a file
EXAMPLE: cat houses.csv | cut -d ',' -f 2-4 | head
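The cut pipeline can be tried on a small stand-in file without downloading anything. In this sketch sample.csv and its column names are invented; they are not the columns of houses.csv.

```shell
cd "$(mktemp -d)"
# Made-up rows standing in for a real CSV file.
printf 'id,price,rooms,age,fuel\n1,100,5,30,gas\n2,150,7,12,oil\n' > sample.csv

# -d sets the field delimiter; -f 2-4 keeps fields 2 through 4, counting from 1
middle=$(cat sample.csv | cut -d ',' -f 2-4 | head -n 2)
echo "$middle"
```

One caveat: cut splits on every comma, so it is not a safe choice for CSV files that contain commas inside quoted fields.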