Source file ⇒ 2017-lec19.Rmd

Announcements

  1. HW #7 deadline extended until Tuesday after break
  2. Nice job on midterm. Here is the distribution:

Very roughly, in keeping with the 40/30/20/10 curve, I would say >85 is some kind of A, 75-84 some kind of B, and 65-74 some kind of C. More accurately, your score is given a weight of 15% of your total grade, so the letter grade is just a guide.

Today

  1. Virtual Private Network (VPN) used for online exercises done off campus

  2. Secure Shell (SSH) access to Berkeley Statistical Computing Facility (SCF) Linux server

  3. Secure Copy (SCP) from your computer to Berkeley SCF server

  4. Unix commands

0. Virtual Private Network (VPN) used for online exercises done off campus

The Remote Access VPN (Virtual Private Network) service is designed to allow CalNetID authenticated users to connect to the UC Berkeley network from outside of campus, as if they were on campus, and encrypts the information sent to the network.

There is no difference in how you access the online exercises if you are connecting on campus, but if you are connecting off campus you will need to download and install the VPN software first.

Instructions

(https://berkeley.service-now.com/kb_view.do?sysparm_article=KB0010294)

Data cleaning isn’t always best done in R, since R loads data into memory. For big data sets, it is often faster to do data cleaning with command-line tools. There is a bit of setup before we get started.

1. Secure Shell (SSH) access to Berkeley Statistical Computing Facility (SCF) Linux server

With the ssh command you can log into a remote computer.

Grab a SCF account form and log into the SCF server at the special webpage set up for this class:

https://scf.berkeley.edu:9022/wetty/ssh/username

For example my special webpage is: https://scf.berkeley.edu:9022/wetty/ssh/alucas

To change your password, type passwd user, where user is your user name. For example, I would type passwd alucas to change my password.

To get a new prompt press the Ctrl key and type dq

The Environment

So you’ve just logged into a brand new environment. Before we do anything, it’s worthwhile to get a high-level understanding of this environment. The environment is roughly defined by four layers, which we briefly discuss from the top down:

Command-line tools

Command-line tools, the first layer, are the programs whose commands you use to modify your data. They are like the DataComputing package in R, which contains functions such as select(), mutate() etc. Command-line tools include cd, ls, cat and many more.

Terminal

The Terminal is an application where we type our commands in. It is like the console in RStudio. Here is a picture of your SCF console.

Shell

The shell is the command interpreter. Once we have typed in our command and pressed <Enter>, the terminal sends that command to the shell. SCF uses Bash as the shell, but there are many others available.

Operating System

The fourth layer is the operating system, which is GNU/Linux in our case. Linux is the name of the kernel, which is the heart of the operating system. The kernel is in direct contact with the CPU, disks, and other hardware. The kernel also executes our command-line tools. GNU, which is a recursive acronym for “GNU’s Not Unix”, refers to a set of basic tools. The Data Science Toolbox is based on a particular Linux distribution called Ubuntu. Unix is the older operating system that GNU/Linux is modeled on, so I will refer to GNU/Linux or Ubuntu as Unix.
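As a rough illustration of these layers, the following sketch asks the system about itself from the terminal. The variable names kernel, ls_path and shell_name are made up for this example, and the values printed will depend on the machine you are logged into.

```shell
# Operating system layer: name of the kernel (Linux on a GNU/Linux system).
kernel=$(uname -s)
echo "kernel: $kernel"

# Command-line tool layer: where the ls program is found
# (or its alias, if your shell defines one).
ls_path=$(command -v ls)
echo "ls is: $ls_path"

# Shell layer: the login shell recorded in the SHELL variable, if it is set.
shell_name=${SHELL:-unknown}
echo "shell: $shell_name"
```

The terminal layer does not appear here because it is simply the window these commands are typed into.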

The most popular desktop OS today, Microsoft Windows, uses a graphical user interface (GUI) for you to interact with the OS. This is easy to learn but not very powerful.

UNIX, on the other hand, is hard to learn at first, but it allows you vastly more control over what your computer can do, kind of like driving stick shift compared to automatic (race car drivers always drive stick). It can be much faster to complete some tasks using command-line tools at the terminal than with graphical applications and menus. It also makes your workflow reproducible.

2. Secure Copy (SCP) from your local computer to the remote Berkeley SCF server

We can’t upload files to the server directly. Instead, we need to use something called scp. Here’s how it works—you will have an opportunity to practice this in lab this week.

The instructions here are different for Mac and PC users:

MAC USERS

  1. Open a terminal window by typing terminal in the spotlight search in the top right of your screen.

  2. In the terminal, use the following command to upload your file to the radagast Berkeley statistics server.

scp PathTolab7/lab7.R YourAccountName@radagast.berkeley.edu:/accounts/class/s133/YourAccountName/Documents/.

Note the general form for scp is scp pathToFileOnYourComputer/file.extension username@server:/PathToCopyFileInto
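To make the general form concrete, here is a small sketch that only builds and prints the scp command line instead of running it, so it can be tried without a server connection. The variable names are hypothetical, and the values are the placeholders from the example above; substitute your own path and account name.

```shell
# Placeholders taken from the example above; replace with your own values.
local_file="PathTolab7/lab7.R"
user="YourAccountName"
server="radagast.berkeley.edu"
remote_dir="/accounts/class/s133/$user/Documents/."

# Assemble the command in the general form: scp file username@server:/path
upload_cmd="scp $local_file $user@$server:$remote_dir"

# Print the command for inspection; paste it at the prompt to actually run it.
echo "$upload_cmd"
```

Printing the command first is a cheap way to catch a mistyped path or user name before any file is transferred.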

PC USERS

You will need to download and install WinSCP:

Here is a link to download the latest version of WinSCP:

WinSCP

Here is a YouTube video on how to use WinSCP:

YouTube video

Find your file in the left side of WinSCP and drag it to the Documents directory on the right side of WinSCP.

3. Unix Commands

Unix Commands are the command line tools that we use to communicate with the OS.

Shortcuts

You can scroll through previous commands using the up and down arrow.

To go to the beginning of the line type Ctrl+a, end of line Ctrl+e

You can clear the screen with Ctrl+l

Here is a useful list of keyboard shortcuts to get around in Unix: shortcuts

Directories and Files

The first thing you need to know about UNIX is how to work with directories and files. Technically, everything in UNIX is a file, but it’s easier to think of directories as you would folders on Windows or Mac OS.

Directories are organized in an inverted tree structure:

To see the directory you’re currently in, type the command pwd (“print working directory”).

There are two “special” directories: The top level directory, “/”, is called the root directory.

Your home directory, “~”, contains all your files. For user mary, “~” and “/users/mary” mean the same thing.

To create a new directory, use the command mkdir. Then to move into it, use cd.

EXAMPLES:
pwd
mkdir unixexamples
cd unixexamples
ls
ls -a
ls -l

ls -a means to show all files, including the hidden files starting with a dot (“.”).

The two hidden files here are special and exist in every directory. “.” refers to the current directory, and “..” refers to the directory above it.
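The behaviour of hidden files and of “.” and “..” can be tried out safely in a scratch directory. This is a minimal sketch: the scratch directory comes from mktemp -d, and the names work, .secret, here, parent and back are invented for the illustration.

```shell
# Work in a throwaway directory so nothing in your home directory is touched.
work=$(mktemp -d)
cd "$work"
mkdir unixexamples
cd unixexamples

# A file whose name starts with a dot is hidden from a plain ls.
touch .secret
plain=$(ls)       # shows nothing
all=$(ls -a)      # shows . and .. and .secret
echo "$all"

# ".." is the directory above, "." is the current directory.
here=$(pwd)
cd ..
parent=$(pwd)
cd ./unixexamples
back=$(pwd)
echo "$here is inside $parent"
```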

ls -l shows a long listing that includes the size of each file, which can be useful.

This brings us to the distinction between relative and absolute path names. (Think of a path like an address in UNIX, telling you where you are in the directory tree.)

You may have noticed that I typed cd unixexamples, rather than cd /Users/Adam/unixexamples.

The first is the relative path; the second is the absolute path.

To refer to a file, you need to either be in the directory where the file is located, or you need to refer to it using a relative or absolute path name.

EXAMPLES:
pwd
echo "Testing 1 2 3" > test.txt
ls
cat test.txt
cd ..
cat test.txt
cat unixexamples/test.txt

Note that file names must be unique within a particular directory, but having, say, both /Users/Adam/test.txt and /Users/Adam/unixexamples/test.txt is OK.
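As a small sketch of this point, the commands below create two files that share the name test.txt in different directories, then read the nested one through a relative and an absolute path. The scratch directory and the variable names top, rel and abs are assumptions made for the example.

```shell
# Build a tiny directory tree in a throwaway location.
work=$(mktemp -d)
cd "$work"
mkdir unixexamples

# Same file name, different directories, different contents.
echo "top level" > test.txt
echo "nested" > unixexamples/test.txt

top=$(cat test.txt)                        # file in the current directory
rel=$(cat unixexamples/test.txt)           # relative path, from the current directory
abs=$(cat "$work/unixexamples/test.txt")   # absolute path, starts at the root /
echo "$top / $rel / $abs"
```

The relative and absolute paths reach the same nested file, while the bare name test.txt refers to the copy in the current directory.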

You can refer to multiple files at once using wildcards. The most common one is the asterisk (*). It stands in for anything (including nothing at all).

EXAMPLES:
touch test1 test2
ls t*
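To see that the asterisk also matches nothing at all, the sketch below adds a file named just test next to test1 and test2. The file notes and the variable names are made up for the illustration; echo is used so that the names the shell substitutes for the pattern are printed on one line.

```shell
work=$(mktemp -d)
cd "$work"
touch test test1 test2 notes

# The shell replaces the pattern with every matching file name
# before the command runs.
starts_with_test=$(echo test*)   # matches test, test1 and test2
starts_with_n=$(echo n*)         # matches notes only
echo "$starts_with_test"
echo "$starts_with_n"

# ls receives the same expanded list of names.
ls t*
```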

The Syntax of UNIX Command Lines

UNIX command lines can be simple, one-word entities like the date command or pwd command. A UNIX command may or may not have options or arguments. An argument can be a filename or some other kind of input such as a string.

The general syntax for a UNIX command looks like this:

$ command -[option(s)] [argument(s)]

Here are some general rules about UNIX commands:

  • Enter commands in lowercase

  • Options modify the way in which a command works. Options are often single letters prefixed with a dash (-) and set off by any number of spaces or tabs. Multiple options in one command line can be set off individually (like -a -b), or in some cases you can combine them after a single dash (like -ab). Some options are made from complete words or phrases like --delete. See for example man sort.

  • You must type spaces between commands, options and arguments.

  • Options come before arguments.

  • Two or more commands can be written on the same command line, each separated by a semicolon (;).

EXAMPLES:
mv test.txt newname.txt
rmdir unixexamples
rm -rf unixexamples
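Two of the rules above, combining options and separating commands with a semicolon, can be checked with a short sketch. It runs in a scratch directory, and the file and variable names are invented for the example.

```shell
work=$(mktemp -d)
cd "$work"
touch .hidden visible

# Options given one by one and options combined after a single dash
# produce the same listing.
separate=$(ls -a -l)
combined=$(ls -al)

# Two commands on one line, separated by a semicolon, run one after the other.
both=$(echo first; echo second)
echo "$both"
```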

To look at the syntax of any particular UNIX command, type man (for “manual”) and then the name of the command.

For example: man mv

The two most important parts of the man page are labeled SYNOPSIS and DESCRIPTION. These are very much like the “Usage” and “Arguments” in R’s help pages.

SYNOPSIS shows you the syntax for a particular command. Bracketed arguments are optional.

DESCRIPTION tells you what all the options do.

Press the space bar to scroll forward through the man page, b to go backward, and q to exit.

Type slash / and then type the string to search for. Then keep pressing n to get to the next item. This can be very useful in case the man pages are long.

Recap of commands so far and a few more:

pwd        print working directory
ls         list contents of current directory
ls -a      list contents, including hidden files
ls -l      list contents, including size of files
mkdir      create a new directory
cd dname   change directory to dname
cd ..      change to parent directory
cd ~       change to home directory
mv         move or rename a file
touch      create a new empty file
rm         remove a file
rm -r      remove a directory and everything inside it
wc -l      count the number of lines in a file
head -5    look at the first 5 lines of a file
cat        print contents of a file
tail       look at end of file
cp         copy a file
echo       display a line of text written in the terminal
wget       download a file from the web

EXAMPLES

echo hello world!

touch test

mkdir mydocs

wget -O mydocs/text https://www.w3.org/TR/PNG/iso_8859-1.txt

head mydocs/text

wc -l mydocs/text
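The file-handling commands from the recap (cp, mv, wc, head, tail, rm) can also be practiced without a download. The following is a sketch on a six-line file created on the spot; the file names numbers, copy and renamed are hypothetical.

```shell
work=$(mktemp -d)
cd "$work"

# Create a small file with six lines to experiment on.
printf 'one\ntwo\nthree\nfour\nfive\nsix\n' > numbers

cp numbers copy      # copy a file
mv copy renamed      # rename the copy

lines=$(wc -l renamed)        # line count followed by the file name
first=$(head -n 2 renamed)    # first two lines (head -2 is the short form)
last=$(tail -n 1 renamed)     # last line
echo "$lines"
echo "$first"
echo "$last"

rm renamed           # remove the copy; the original numbers file remains
```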

In Class exercise

Do example 1a,b in basic unix. Solve in SCF terminal but look at solutions in online exercises.

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-19-collection/

Redirection and Pipes

The real power of UNIX comes from stringing commands together.

Redirection and pipes are really at the heart of the UNIX philosophy, which is to have many small tools, each one suited for a particular job.

Redirection refers to changing where the output of an individual command is sent.

The “standard output” or STDOUT is usually your terminal (monitor).

The form of a command with standard output redirection is:

command -[options] [arguments] > output file (create new or overwrite output file) or
command -[options] [arguments] >> output file (create new or append at end of output file)

Example:
echo how ya doing > slang
cat slang

More examples:

mkdir book

echo "this is some text" > ~/book/words

wget -q -O - https://www.w3.org/TR/PNG/iso_8859-1.txt > ~/book/words

ls -a book > text

cat text

sort -r text > text1

cat text1

echo hello friend >> text1

cat text1

cat text1 >> text1
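The difference between > and >> is easiest to see side by side. This is a minimal sketch in a scratch directory; the file name log and the variable names are made up for the illustration.

```shell
work=$(mktemp -d)
cd "$work"

echo "first" > log       # creates log
echo "second" > log      # > overwrites: "first" is gone
after_overwrite=$(cat log)

echo "third" >> log      # >> appends: "second" is kept
after_append=$(cat log)

echo "$after_append"
```

Because > silently replaces the old contents, it is worth pausing before redirecting into a file you care about.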

The idea behind pipes, written |, is that rather than redirecting output to a file, we redirect it into another command. This is analogous to pipes %>% in the dplyr package.

Another way to say this is that output of one command is used as input to another command.

EXAMPLES:
cat > somenumbers (type some numbers, one per line, then press Ctrl+d to finish)

cat somenumbers | sort | uniq (sort comes first because uniq only removes duplicate lines that are adjacent)
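The order of the commands in a pipeline matters, because uniq only drops a line when it repeats the line directly before it. The sketch below compares the two orders on a small file written with printf instead of typed by hand; the variable names are assumptions for the example.

```shell
work=$(mktemp -d)
cd "$work"

# Five numbers with repeats that are not next to each other.
printf '3\n1\n3\n2\n1\n' > somenumbers

# uniq first: no adjacent repeats, so nothing is removed before sorting.
uniq_first=$(cat somenumbers | uniq | sort)

# sort first: repeats become adjacent, so uniq can remove them.
sort_first=$(cat somenumbers | sort | uniq)

echo "$uniq_first"
echo "$sort_first"
```

With uniq first all five lines survive, while sorting first leaves one copy of each number.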

Here are two useful filters:

egrep - print lines matching a pattern (regex)

The syntax for egrep is egrep -[options] regex file

EXAMPLE:
egrep e *.txt

This will print all lines that contain the regex e in any file ending with .txt. Here the regex is e and the file is *.txt.

wget -O houses.csv http://www.mosaic-web.org/go/datasets/SaratogaHouses.csv
## --2017-04-10 11:13:18--  http://www.mosaic-web.org/go/datasets/SaratogaHouses.csv
## Resolving www.mosaic-web.org... 108.168.213.78
## Connecting to www.mosaic-web.org|108.168.213.78|:80... connected.
## HTTP request sent, awaiting response... 200 OK
## Length: 31483 (31K) [text/csv]
## Saving to: 'houses.csv'
## 
##      0K .......... .......... ..........                      100%  233M=0s
## 
## 2017-04-10 11:13:18 (233 MB/s) - 'houses.csv' saved [31483/31483]
## 

cat houses.csv | head -5
## "Price","Living.Area","Baths","Bedrooms","Fireplace","Acres","Age"
## 142212,1982,1,3,"N",2,133
## 134865,1676,1.5,3,"Y",0.38,14
## 118007,1694,2,3,"Y",0.96,15
## 138297,1800,1,2,"Y",0.48,49
egrep N houses.csv | head -5
## 142212,1982,1,3,"N",2,133
## 206512,1456,2,3,"N",0.98,10
## 50709,960,1.5,2,"N",0.01,12
## 108794,1464,1,2,"N",0.11,87
## 68353,1216,1,2,"N",0.61,101
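The same kind of filtering can be tried without downloading anything. This sketch writes a four-line sample in the style of houses.csv and filters it with grep -E, which is equivalent to egrep. The file sample.csv and its two columns are a made-up subset, not the real data set.

```shell
work=$(mktemp -d)
cd "$work"

# A tiny stand-in for houses.csv: a header line and three houses.
printf '"Price","Fireplace"\n142212,"N"\n134865,"Y"\n206512,"N"\n' > sample.csv

# Keep only the lines that contain a capital N (houses without a fireplace).
no_fireplace=$(grep -E N sample.csv)
echo "$no_fireplace"

# The -c option counts the matching lines instead of printing them.
count=$(grep -E -c N sample.csv)
echo "$count"
```

The header line is dropped as well, since it contains no capital N.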

cut - select portions of each line of a file

EXAMPLE:

cat houses.csv | cut -d ',' -f 2-4 | head -5
## "Living.Area","Baths","Bedrooms"
## 1982,1,3
## 1676,1.5,3
## 1694,2,3
## 1800,1,2

The option -d ',' sets the field delimiter to a comma.

The option -f 2-4 cuts out fields 2 through 4.
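Fields do not have to form a range. As a sketch, the commands below run cut on a three-line sample with the first four columns of houses.csv, once with a range and once with a comma-separated list of field numbers. The file sample.csv and the variable names are invented for the example.

```shell
work=$(mktemp -d)
cd "$work"

# Header plus two rows, using the first four columns of houses.csv.
printf 'Price,Living.Area,Baths,Bedrooms\n142212,1982,1,3\n134865,1676,1.5,3\n' > sample.csv

# A range of fields: 2 through 4.
middle=$(cut -d ',' -f 2-4 sample.csv)
echo "$middle"

# A list of fields: 1 and 4 only.
price_and_beds=$(cut -d ',' -f 1,4 sample.csv)
echo "$price_and_beds"
```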

In Class exercise

Do example 2a in redirection and pipes.

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-19-collection/