Source file ⇒ lec27.Rmd
Here is a link to the installation instructions: data science toolbox
For those with a PC:
You may have to enable VT-x and VT-d in the BIOS (thanks Jason).
Here is an example of how to do it for an HP Compaq 8200 or similar PC:
Also do the following in Oracle VM VirtualBox Manager (Thanks Gao):
To verify, start the Virtual device from Oracle VM VirtualBox. If all has gone well, the device boots up.
For more details here is a link
Together with VirtualBox and Vagrant, the DataScienceToolbox is a virtual computer with a virtual hard drive living on your computer. When you installed it you had to set aside space on your hard drive for it. The DataScienceToolbox isn’t a cloud. Your hard drive and the virtual hard drive are separate, except that there is a way to transfer files between them, as we will discuss later.
wget - download a file from the web
egrep - print lines matching a pattern (regex)
cut - extract columns of data from a field-delimited file
EXAMPLE: Here is a tab delimited data set about potatoes
wget -O potatoes.txt http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/potatoes.txt
head potatoes.txt
Let's cut out the first and second columns and save them to a file called small_potatoes.
cat potatoes.txt | cut -f 1-2 > small_potatoes
head small_potatoes
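To try the same idea without downloading anything, here is a minimal sketch on a made-up tab-delimited file. The file names (toy.txt, small_toy, first_and_third) and the values are assumptions for illustration only, not the real potatoes data.

```shell
# Work in a scratch directory so nothing in your home directory is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a tiny tab-delimited file (made-up values, hypothetical column names).
printf 'area\ttemp\tsize\n' >  toy.txt
printf '1\t1\t1\n'          >> toy.txt
printf '1\t2\t3\n'          >> toy.txt

# cut splits on tabs by default, so no -d option is needed here.
cut -f 1-2 toy.txt > small_toy

# The columns do not have to be contiguous: keep only the first and third.
cut -f 1,3 toy.txt > first_and_third

cat small_toy
```

Note that cut can read the file directly, so the leading cat in the pipeline above is optional.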
EXAMPLE:
Recall the Saratoga Houses csv.
wget -O houses.csv http://www.mosaic-web.org/go/datasets/SaratogaHouses.csv
head houses.csv
Suppose I want all Saratoga Houses that have a fireplace and am only interested in columns 2 through 5.
EXAMPLE:
cat houses.csv | cut -d ',' -f 2-5 | egrep Y | head
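One thing to watch for: a bare pattern like Y matches a Y anywhere in the line, not only in the fireplace column. The sketch below shows the difference on a made-up csv (the file toy_houses.csv, its column names, and its rows are all hypothetical). It uses grep -E, which is the same as egrep; recent versions of GNU grep print a warning when egrep is used.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

# Made-up data: the last column plays the role of the fireplace indicator.
cat > toy_houses.csv <<'EOF'
id,city,beds,fireplace
1,York,3,N
2,Albany,2,Y
3,Troy,4,Y
EOF

# Unanchored pattern: also keeps the York row, because "York" contains a Y.
cut -d ',' -f 2-4 toy_houses.csv | grep -E 'Y' > loose.txt

# Anchored pattern: keep only rows whose last kept field is exactly Y.
cut -d ',' -f 2-4 toy_houses.csv | grep -E ',Y$' > strict.txt

cat strict.txt
```

Here loose.txt ends up with three rows and strict.txt with two. When the other columns are numeric the bare pattern is usually fine, but anchoring the pattern is the safer habit.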
What if we have a file on our desktop and we want to modify it using command line tools on MyDataScienceToolbox?
Answer: the directory ~/MyDataScienceToolbox on your hard drive and the directory /vagrant on your virtual hard drive are the same, so you can pass files back and forth.
EXAMPLE:
download the following csv file to your computer
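Steps:
On your computer, move the file into the shared directory:
mv ~/Desktop/swimming_pools.csv ~/MyDataScienceToolbox/
On the virtual machine, check that the file shows up in /vagrant and copy it to the current directory:
ls /vagrant/
cp /vagrant/swimming_pools.csv .
Then apply a unix command to it:
cat swimming_pools.csv | egrep Centre | cut -d "," -f 1-2 > center_pools.csv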
Suppose you don’t want to run your R code on your computer. You can, for example, get an account on the Statistical Computing Facility (SCF) here at Cal.
For your next homework you will run a batch job on your Virtual DataScienceToolbox.
BATCH jobs are useful whenever
You have a long job and you want to be able to use the computer for other things in the meantime.
You want to log out of the machine while the job is running and come back to it later.
You’re running the job on a remote machine, and again you want to log out.
You want to be courteous to other users of the machine by decreasing the priority of the job (so it doesn’t slow down the machine when someone else is using it).
To start a BATCH job, use
nice R CMD BATCH scriptfile.R outfile.Rout &
nice gives your job lower priority. Actually, on the lab computers all jobs are “niced” by default, so this isn’t strictly necessary.
& at the end of the BATCH command indicates that you want to run this job in the background.
outfile.Rout is the name of the file where the output will be sent.
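To see how the three pieces fit together without needing R installed, here is a minimal sketch that uses a short shell script as a stand-in for scriptfile.R. The names longjob.sh, longjob.out, and job_pid are made up for illustration.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for a long-running script (hypothetical; plays the role of scriptfile.R).
cat > longjob.sh <<'EOF'
#!/bin/sh
sleep 1
echo "job finished"
EOF

# Same shape as the R command: low priority, output sent to a file, run in background.
nice sh longjob.sh > longjob.out 2>&1 &
job_pid=$!   # $! holds the process ID of the most recent background job

echo "started job $job_pid; the shell is free for other work"

# Block until the background job ends, then look at its output file.
wait "$job_pid"
cat longjob.out
```

With R CMD BATCH you do not need the redirection, because the output file is given as the second argument. If you plan to log out while the job runs, a common approach is to put nohup in front of the command so the job is not stopped when the terminal closes.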
A few other things to keep in mind:
scriptfile.R should require no input from the user.
Graphics should be created by surrounding the relevant code with pdf(file = "filename.pdf") and dev.off(). Graphics files are saved as pdf files.
To see information about currently running processes, just type top. There are arguments to top that allow you to sort by CPU usage, memory, etc. See man top for more details.
EXAMPLE:
download makemaps.R from b-courses assignment 9 into the home directory of your virtual DataScienceToolbox
view the file with cat. Notice the following things:
1. There are two packages needed. We will need to install them on the virtual machine.
2. system is used to run UNIX commands from within R.
3. The creation of plots using the pdf and dev.off functions. We need these because we are going to run R in BATCH mode and we won’t be able to manually save the plots.