Source file ⇒ 2017-lec19.Rmd
Very roughly, in keeping with the 40/30/20/10 curve, I would say >85 is some kind of A, 75-84 some kind of B, and 65-74 some kind of C. More accurately, your score is given a weight of 15% of your total grade, so the letter grade is just a guide.
Virtual Private Network (VPN) used for online exercises done off campus
Secure Shell (SSH) access to the Berkeley Statistical Computing Facility (SCF) Linux server
Secure copy (SCP) from your computer to the Berkeley SCF server
Unix commands
The Remote Access VPN (Virtual Private Network) service is designed to allow CalNetID authenticated users to connect to the UC Berkeley network from outside of campus, as if they were on campus, and encrypts the information sent to the network.
Accessing the online exercises works the same as before if you are connecting on campus, but if you are connecting off campus you will need to download and install the VPN software.
(https://berkeley.service-now.com/kb_view.do?sysparm_article=KB0010294)
Data cleaning isn’t always best done in R, since R holds the data it works on in memory. For big data sets, it is often faster to do data cleaning with command-line tools. There is a bit of setup before we get started.
With the ssh command you can log into a remote computer.
Grab an SCF account form and log into the SCF server at the special webpage set up for this class:
https://scf.berkeley.edu:9022/wetty/ssh/username
For example my special webpage is: https://scf.berkeley.edu:9022/wetty/ssh/alucas
To change your password, type passwd user, where user is your user name. For example, I would type passwd alucas to change my password.
To get a new prompt, press the Ctrl key and type d.
q
So you’ve just logged into a brand new environment. Before we do anything, it’s worthwhile to get a high-level understanding of this environment. The environment is roughly defined by four layers, which we briefly discuss from the top down:
Command-line tools
Command-line tools, the first layer, are the programs that provide the commands for working with your data. This is like the DataComputing package in R, which contains functions such as select(), mutate(), etc. Command-line tools include cd, ls, cat and many more.
Terminal
The Terminal is an application where we type our commands in. It is like the console in RStudio. Here is a picture of your SCF console.
Shell
The shell is the command interpreter. Once we have typed in our command and pressed <Enter>, the terminal sends that command to the shell. SCF uses Bash as the shell, but there are many others available.
Operating System
The fourth layer is the operating system, which is GNU/Linux in our case. Linux is the name of the kernel, which is the heart of the operating system. The kernel is in direct contact with the CPU, disks, and other hardware. The kernel also executes our command-line tools. GNU, which is an acronym for “GNU’s Not Unix”, refers to a set of basic tools. The Data Science Toolbox is based on a particular Linux distribution called Ubuntu. Unix is the older operating system that GNU/Linux is modeled on, and I will refer to GNU/Linux or Ubuntu as Unix.
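As a rough way to see these layers from the prompt, the sketch below asks the operating system for its kernel name and asks the shell where two commands come from. The variable name kernel is made up for illustration, and the exact wording that type prints differs between shells.

```shell
# Operating system layer: uname -s prints the kernel name (Linux on a GNU/Linux server).
kernel=$(uname -s)
echo "kernel: $kernel"

# Shell layer: type reports how the shell would run a command.
type cd    # cd is built into the shell itself
type ls    # ls is a separate program found on disk
```

On a GNU/Linux machine the first line of output names Linux as the kernel; on a Mac it would name Darwin instead.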
The most popular OS today, Microsoft Windows, uses a graphical user interface (GUI) for you to interact with the OS. This is easy to learn but not very powerful.
UNIX, on the other hand, is hard to learn at first, but it allows you vastly more control over what your computer can do, kind of like driving stick shift compared to automatic (race car drivers always drive stick). It can be much faster to complete some tasks using command-line tools at the terminal than with graphical applications and menus. It also makes your workflow reproducible.
We can’t upload files to the server directly. Instead, we need to use something called scp. Here’s how it works—you will have an opportunity to practice this in lab this week.
The instructions here are different for Mac and PC users:
MAC USERS
Open a terminal window by typing terminal in the spotlight search in the top right of your screen.
In the terminal, use the following command to upload your file to the ‘radagast’ Berkeley statistics server.
scp PathTolab7/lab7.R YourAccountName@radagast.berkeley.edu:/accounts/class/s133/YourAccountName/Documents/.
Note the general form for scp is: scp pathToFileOnYourComputer/file.extension username@server:/PathToCopyFileInto
PC USERS
You will need to download and install WinSCP:
Here is a link to download the latest version of WinSCP:
Here is a YouTube video on how to use WinSCP:
Find your file in the left side of WinSCP and drag it to the Documents directory on the right side of WinSCP.
Unix Commands are the command line tools that we use to communicate with the OS.
You can scroll through previous commands using the up and down arrow.
To go to the beginning of the line type Ctrl+a, and to go to the end of the line type Ctrl+e.
You can clear the screen with Ctrl+l.
Here is a useful list of keyboard short cuts to get around in Unix: short cuts
The first thing you need to know about UNIX is how to work with directories and files. Technically, everything in UNIX is a file, but it’s easier to think of directories as you would folders on Windows or Mac OS.
Directories are organized in an inverted tree structure:
To see the directory you’re currently in, type the command pwd (“print working directory”).
There are two “special” directories: The top level directory, “/”, is called the root directory.
Your home directory, “~”, contains all your files. For user mary, “~” and “/users/mary” mean the same thing.
To create a new directory, use the command mkdir. Then to move into it, use cd.
EXAMPLES:
pwd
mkdir unixexamples
cd unixexamples
ls
ls -a
ls -l
ls -a means to show all files, including the hidden files starting with a dot (“.”).
The two hidden files here are special and exist in every directory. “.” refers to the current directory, and “..” refers to the directory above it.
ls -l shows a long listing that includes the size of each file, which can be useful.
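A minimal sketch of the difference between ls and ls -a, run in a throwaway directory so that nothing in your account is touched. The scratch directory from mktemp -d and the file names notes.txt and .secret are made up for illustration.

```shell
# Work in a scratch directory so no real files are affected.
scratch=$(mktemp -d)
cd "$scratch"

touch notes.txt .secret     # one ordinary file, one hidden file (name starts with a dot)

plain=$(ls)                 # hidden files are left out
all=$(ls -a)                # ".", "..", and .secret appear as well

echo "ls    shows: $plain"
echo "ls -a shows:" $all
ls -l                       # long listing: permissions, owner, size in bytes, date, name
```

Only ls -a reveals .secret along with the two special entries “.” and “..”.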
This brings us to the distinction between relative and absolute path names. (Think of a path like an address in UNIX, telling you where you are in the directory tree.)
You may have noticed that I typed cd unixexamples, rather than cd /Users/Adam/unixexamples.
The first is the relative path; the second is the absolute path.
To refer to a file, you need to either be in the directory where the file is located, or you need to refer to it using a relative or absolute path name.
EXAMPLES:
pwd
echo "Testing 1 2 3" > test.txt
ls
cat test.txt
cd ..
cat test.txt
cat unixexamples/test.txt
Note that file names must be unique within a particular directory, but having, say, both /Users/Adam/test.txt and /Users/Adam/unixexamples/test.txt is OK.
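The sketch below puts a file called test.txt in two different directories and reads each one through a relative and an absolute path. It runs in a scratch directory created with mktemp -d; the variable names are made up for illustration.

```shell
# Scratch directory so the example is self-contained.
top=$(mktemp -d)
cd "$top"
mkdir unixexamples

echo "top level"      > test.txt                # relative path, resolved from the current directory
echo "one level down" > unixexamples/test.txt   # same file name, different directory

rel=$(cat unixexamples/test.txt)                # relative path, valid only while we are in $top
abs=$(cat "$top/unixexamples/test.txt")         # absolute path, valid from any directory

cd unixexamples
here=$(cat test.txt)                            # now test.txt means the copy in unixexamples
up=$(cat ../test.txt)                           # ".." climbs back to the parent directory

echo "$rel | $abs | $here | $up"
```

The same relative name, test.txt, refers to different files depending on the current directory, while the absolute path always refers to the same file.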
You can refer to multiple files at once using wildcards. The most common one is the asterisk (*). It stands in for anything (including nothing at all).
EXAMPLES:
touch test1 test2
ls t*
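To illustrate that the asterisk can stand in for nothing at all, here is a small sketch in a scratch directory. The file names t, test1, test2 and other.txt are made up for illustration.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch t test1 test2 other.txt

matches=$(ls t*)        # the shell expands t* before ls runs
echo "$matches"

numbered=$(ls test*)    # narrower pattern: only names that begin with "test"
echo "$numbered"
```

The file named t is matched by t* because the asterisk is allowed to match zero characters; other.txt is not matched because it does not start with t.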
UNIX command lines can be simple, one-word entities like the date command or pwd command. A UNIX command may or may not have options or arguments. An argument can be a filename or some other kind of input such as a string.
The general syntax for a UNIX command looks like this:
$ command -[option(s)] [argument(s)]
Here are some general rules about UNIX commands:
Enter commands in lowercase.
Options modify the way in which a command works. Options are often single letters prefixed with a dash (-) and set off by any number of spaces or tabs. Multiple options in one command line can be set off individually (like -a -b), or in some cases you can combine them after a single dash (like -ab). Some options are made from complete words or phrases like --delete. See for example man sort.
You must type spaces between commands, options and arguments.
Options come before arguments.
Two or more commands can be written on the same command line, each separated by a semicolon (;).
EXAMPLES:
mv test.txt newname.txt
rmdir unixexamples
rm -rf unixexamples
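Two of the rules above can be checked directly: options given separately or combined behave the same, and a semicolon lets two commands share a line. This is a sketch in a scratch directory, with made-up file and directory names.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch .hidden visible

separate=$(ls -a -l)    # two options given individually
combined=$(ls -al)      # the same two options after a single dash

# Two commands on one line, separated by a semicolon; they run left to right.
mkdir sub; cd sub
where=$(basename "$(pwd)")
echo "now in: $where"
```

Here ls -a -l and ls -al produce the same listing, and after the mkdir sub; cd sub line the working directory is sub.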
To look at the syntax of any particular UNIX command, type man (for “manual”) and then the name of the command.
For example: man mv
The two most important parts of the man page are labeled SYNOPSIS and DESCRIPTION. These are very much like the “Usage” and “Arguments” in R’s help pages.
SYNOPSIS shows you the syntax for a particular command. Bracketed arguments are optional.
DESCRIPTION tells you what all the options do.
Press the space bar to scroll forward through the man page, b to go backward, and q to exit.
Type slash / and then type the string to search for. Then keep pressing n to get to the next item. This can be very useful in case the man pages are long.
pwd - print working directory
ls - list contents of current directory
ls -a - list contents, including hidden files
ls -l - list contents, including size of files
mkdir - create a new directory
cd dname - change directory to dname
cd .. - change to parent directory
cd ~ - change to home directory
mv - move or rename a file
touch - create a new empty file
rm - remove a file
rm -r - remove a directory and everything inside it
wc -l - count the number of lines in a file
head -5 - look at the first 5 lines of a file
cat - print contents of a file
tail - look at end of file
cp - copy a file
echo - display a line of text written in the terminal
wget - download a file from the web
EXAMPLES
echo hello world!
touch test
mkdir mydocs
wget -O mydocs/text https://www.w3.org/TR/PNG/iso_8859-1.txt
head mydocs/text
wc -l mydocs/text
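If you want to try several of these commands without downloading anything, the following sketch builds a six-line file locally and then inspects, copies, renames and removes it. The file names letters, backup and renamed are made up for illustration, and printf is used only as a quick way to write one word per line.

```shell
scratch=$(mktemp -d)
cd "$scratch"

printf '%s\n' alpha beta gamma delta epsilon zeta > letters   # six lines, one word each

count=$(wc -l letters)       # prints the line count followed by the file name
first=$(head -n 2 letters)   # first two lines (same as head -2)
last=$(tail -n 1 letters)    # last line

cp letters backup            # copy: both files now exist
mv backup renamed            # rename: "backup" is gone, "renamed" remains
rm letters                   # remove the original

echo "$count"
echo "$first"
echo "$last"
ls                           # only "renamed" is left
```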
Do example 1a,b in basic unix. Solve in SCF terminal but look at solutions in online exercises.
https://scf.berkeley.edu:3838/shiny/alucas/Lecture-19-collection/
The real power of UNIX comes from stringing commands together.
Redirection and pipes are really at the heart of the UNIX philosophy, which is to have many small tools, each one suited for a particular job.
Redirection refers to changing where the output of an individual command goes.
The “standard output” or STDOUT is usually your terminal (monitor).
The form of a command with standard output redirection is:
command -[options] [arguments] > output file (create new or overwrite output file) or
command -[options] [arguments] >> output file (create new or append at end of output file)
Example: echo how ya doing > slang
cat slang
More examples:
mkdir book
echo "this is some text" > book/words
wget -q -O - https://www.w3.org/TR/PNG/iso_8859-1.txt > book/words
ls -a book > text
cat text
sort -r text > text1
cat text1
echo hello friend >> text1
cat text1
cat text1 >> text1
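The difference between > and >> is easiest to see on a file with known contents. This is a minimal sketch in a scratch directory; the file name log is made up for illustration.

```shell
scratch=$(mktemp -d)
cd "$scratch"

echo "first"  > log      # log did not exist, so > creates it
echo "second" > log      # > again: the old contents are replaced
after_overwrite=$(cat log)

echo "third" >> log      # >> keeps what is there and adds to the end
after_append=$(cat log)

echo "$after_append"
```

After the two > commands only "second" remains in log; after the >> command log holds "second" followed by "third".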
The idea behind pipes, written |, is that rather than redirecting output to a file, we redirect it into another command. This is analogous to pipes %>% in the dplyr package.
Another way to say this is that output of one command is used as input to another command.
EXAMPLES:
cat > somenumbers (type some numbers, one per line, then press Ctrl+d on an empty line to finish)
cat somenumbers | uniq | sort
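The order of the commands in a pipeline matters, because uniq only drops a line when it repeats the line directly before it. The sketch below compares both orders on a small made-up file of numbers whose duplicates are not adjacent.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf '%s\n' 3 1 3 2 1 > somenumbers        # duplicates that are not next to each other

uniq_first=$(cat somenumbers | uniq | sort)  # uniq sees no adjacent repeats, so nothing is dropped
sort_first=$(cat somenumbers | sort | uniq)  # sorting groups equal lines, then uniq collapses them

echo "uniq | sort:" $uniq_first
echo "sort | uniq:" $sort_first
```

With this input, uniq followed by sort still contains the repeated 1 and 3, while sort followed by uniq leaves one copy of each number.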
Here are two useful filters:
egrep - print lines matching a pattern (regex)
The syntax for egrep is egrep -[options] regex file
EXAMPLE:
egrep e *.txt
will print all lines in any file ending with .txt which contain the regex e. Here the regex is e and the file is *.txt.
wget -O houses.csv http://www.mosaic-web.org/go/datasets/SaratogaHouses.csv
cat houses.csv | head -5
## --2017-04-10 11:13:18-- http://www.mosaic-web.org/go/datasets/SaratogaHouses.csv
## Resolving www.mosaic-web.org... 108.168.213.78
## Connecting to www.mosaic-web.org|108.168.213.78|:80... connected.
## HTTP request sent, awaiting response... 200 OK
## Length: 31483 (31K) [text/csv]
## Saving to: 'houses.csv'
##
## 0K .......... .......... .......... 100% 233M=0s
##
## 2017-04-10 11:13:18 (233 MB/s) - 'houses.csv' saved [31483/31483]
##
## "Price","Living.Area","Baths","Bedrooms","Fireplace","Acres","Age"
## 142212,1982,1,3,"N",2,133
## 134865,1676,1.5,3,"Y",0.38,14
## 118007,1694,2,3,"Y",0.96,15
## 138297,1800,1,2,"Y",0.48,49
egrep N houses.csv | head -5
## 142212,1982,1,3,"N",2,133
## 206512,1456,2,3,"N",0.98,10
## 50709,960,1.5,2,"N",0.01,12
## 108794,1464,1,2,"N",0.11,87
## 68353,1216,1,2,"N",0.61,101
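A bare pattern such as N matches a capital N anywhere on the line, not only in the Fireplace column. The sketch below shows the difference on a tiny made-up file, mini.csv, which only imitates the layout of houses.csv and is not the real data. It uses grep -E, which is equivalent to egrep (recent versions of GNU grep print a warning that egrep is obsolescent), and the -c option, which counts matching lines instead of printing them.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# A tiny made-up file in a layout similar to houses.csv.
cat > mini.csv <<'EOF'
"Price","Living.Area","Fireplace","Name"
100000,1500,"N","Oak"
120000,1700,"Y","Nash"
90000,1200,"N","Elm"
EOF

loose=$(grep -E -c 'N' mini.csv)      # every line with a capital N anywhere
strict=$(grep -E -c '"N"' mini.csv)   # only lines containing the quoted field "N"
echo "loose: $loose, strict: $strict"
```

The loose pattern also counts the header (because of "Name") and the "Nash" row, so including the quotes in the pattern is one way to target the field itself.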
cut
- select portions of each line of a file
EXAMPLE:
cat houses.csv | cut -d ',' -f 2-4 | head -5
## "Living.Area","Baths","Bedrooms"
## 1982,1,3
## 1676,1.5,3
## 1694,2,3
## 1800,1,2
The option -d ',' sets the field delimiter to a comma, and -f 2-4 selects fields 2 through 4 of each line.
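cut combines naturally with the other filters. As a sketch, the pipeline below pulls out one column, skips the header, and counts how often each value occurs. The file mini.csv is made up for illustration and only imitates the layout of houses.csv.

```shell
scratch=$(mktemp -d)
cd "$scratch"

cat > mini.csv <<'EOF'
"Price","Bedrooms","Fireplace"
100000,3,"N"
120000,3,"Y"
90000,2,"N"
EOF

# Field 2 only; tail -n +2 starts at line 2 to skip the header; uniq -c counts each value.
bedrooms=$(cut -d ',' -f 2 mini.csv | tail -n +2 | sort | uniq -c)
echo "$bedrooms"

# -f also accepts a comma-separated list of fields.
pairs=$(cut -d ',' -f 1,3 mini.csv | head -n 2)
echo "$pairs"
```

The first pipeline reports one house with 2 bedrooms and two houses with 3; the second prints the Price and Fireplace columns for the header and the first data row.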
Do example 2a in redirection and pipes.
https://scf.berkeley.edu:3838/shiny/alucas/Lecture-19-collection/