Source file ⇒ lec27.Rmd

Today

  1. Troubleshooting the DataScience Toolbox
  2. More command line tools
  3. R BATCH jobs from the UNIX command line

0. Work together for 10-15 minutes and help get your neighbor’s DataScience toolbox working

Here is a link to the installation instructions: data science toolbox

For those with a PC:

If vagrant up in PowerShell gives an error:

You may have to enable VTx and VTd in the BIOS (thanks Jason).

Here is an example of how to do it for an HP Compaq 8200 or similar PC:

  1. Start the machine.
  2. Press F10 to enter BIOS.
  3. Go to Security -> System Security.
  4. Enable Virtualization Technology (VTx) and Virtualization Technology Directed I/O (VTd).
  5. Save and restart the machine.

Also do the following in Oracle VM VirtualBox Manager (Thanks Gao):

  1. Select the Virtual device and choose Settings
  2. Navigate to System and click the Processor tab
  3. Tick the check-box, Enable PAE/NX
  4. Click OK and you are done

To verify, start the Virtual device from Oracle VM VirtualBox. If all has gone well, the device boots up.

For more details here is a link

What is the DataScience Toolbox?

Together with VirtualBox and Vagrant, the DataScienceToolbox is a virtual computer with a virtual hard drive living in your computer. When you installed it you had to set aside space on your hard drive for it. The DataScienceToolbox isn’t a cloud. Your hard drive and the virtual hard drive are separate, except that there is a way to transfer files between them, as we will discuss later.

1. Command line tools useful for data cleaning

wget - download a file from the web
egrep - print lines matching a pattern (regex)
cut - extract columns of data from a field-delimited file

EXAMPLE: Here is a tab delimited data set about potatoes

wget -O potatoes.txt http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/potatoes.txt
head potatoes.txt

Let's cut out the first and second columns and save them to a file called small_potatoes:

cat potatoes.txt | cut -f 1-2 > small_potatoes
head small_potatoes
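As a minimal sketch of what cut -f does, the same extraction can be tried on a tiny tab-delimited file built locally. The file names toy_potatoes.txt and toy_small_potatoes and the values in them are made up for illustration and are not the real potatoes data.

```shell
# Build a tiny tab-delimited file (made-up values, not the real potatoes data).
printf 'area\ttemp\tsize\n1\t1\t1\n1\t2\t3\n2\t1\t2\n' > toy_potatoes.txt

# cut splits on tabs by default, so no -d flag is needed here.
cut -f 1-2 toy_potatoes.txt > toy_small_potatoes

# Show the two columns that were kept.
cat toy_small_potatoes
```

Because cut can read a file directly, the leading cat is optional; both forms should give the same result.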

EXAMPLE:

Recall the Saratoga Houses csv.

wget -O houses.csv http://www.mosaic-web.org/go/datasets/SaratogaHouses.csv
head houses.csv

Suppose I want all Saratoga Houses that have a fireplace and am only interested in columns 2 through 5.

EXAMPLE:
cat houses.csv | cut -d ',' -f 2-5 | egrep Y | head
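Note that egrep Y keeps any line that contains a capital Y anywhere in the selected columns. A slightly stricter variant, sketched below on a toy file, anchors the match to the last selected field. The rows are made up, and the layout (Fireplace as column 5 with values Y/N) is an assumption about the real file that you should check with head houses.csv. grep -E is the equivalent modern spelling of egrep.

```shell
# Toy CSV with made-up rows; Fireplace in column 5 is an assumption.
cat > toy_houses.csv <<'EOF'
Price,Living.Area,Baths,Bedrooms,Fireplace,Acres
142212,1982,1.0,3,N,2.00
134865,1676,1.5,3,Y,0.38
118007,1694,2.0,3,Y,0.96
EOF

# Keep columns 2-5, then keep only rows whose last kept field is exactly Y.
cut -d ',' -f 2-5 toy_houses.csv | grep -E ',Y$' > toy_fireplaces.csv

# Show the matching rows.
cat toy_fireplaces.csv
```

The header row is dropped by this filter because it does not end in ,Y.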

Transferring files between your computer’s hard drive and the virtual hard drive

What if we have a file on our desktop and we want to modify it using command line tools on MyDataScienceToolbox?

Answer: the directory ~/MyDataScienceToolbox on your hard drive and the directory /vagrant on your virtual hard drive are the same, so you can pass files back and forth.

EXAMPLE:

Download the following csv file to your computer:

swimming_pools.csv

Steps:

  1. On your computer, using the terminal or command prompt, move swimming_pools.csv to the directory ~/MyDataScienceToolbox with the command

mv ~/Desktop/swimming_pools.csv ~/MyDataScienceToolbox/

  2. In your virtual MyDataScienceToolbox window, you can verify that swimming_pools.csv is now present in the directory /vagrant

ls /vagrant/

  3. In your virtual MyDataScienceToolbox window, copy swimming_pools.csv to your home directory /home/vagrant

cp /vagrant/swimming_pools.csv .

Task for you

  1. Find all the swimming pools that have Centre in the name
  2. Cut out the name and address of those swimming pools
  3. Save the results in a file called center_pools.csv
  4. Transfer center_pools.csv to your desktop
# unix command
cat swimming_pools.csv | egrep Centre | cut -d "," -f 1-2 > center_pools.csv
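The last step of the task (getting the result back to your desktop) reverses the transfer described earlier. The sketch below uses three stand-in directories so that it runs anywhere: demo_vm_home plays the VM home directory, demo_shared plays the shared folder (/vagrant in the VM, ~/MyDataScienceToolbox on your computer), and demo_desktop plays ~/Desktop. These directory names and the one-row file contents are hypothetical.

```shell
# Stand-in directories (hypothetical names) so the sketch runs anywhere.
mkdir -p demo_vm_home demo_shared demo_desktop

# Pretend this is the result file produced inside the VM (made-up row).
printf 'Name,Address\nExample Centre Pool,1 Example St\n' > demo_vm_home/center_pools.csv

# Inside the VM: copy the result into the shared folder (really /vagrant).
cp demo_vm_home/center_pools.csv demo_shared/

# On your computer: move it from the shared folder to the desktop.
mv demo_shared/center_pools.csv demo_desktop/

# Confirm the file arrived.
ls demo_desktop
```

On the real setup the two commands would presumably be cp center_pools.csv /vagrant/ in the VM window, followed by mv ~/MyDataScienceToolbox/center_pools.csv ~/Desktop/ on your own computer.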

2. R BATCH jobs from the UNIX command line

Suppose you don’t want to run your R code on your computer. You can, for example, get an account on the Statistical Computing Facility (SCF) here at Cal.

SCF

For your next homework you will run a batch job on your Virtual DataScienceToolbox.

BATCH jobs are useful whenever

  1. You have a long job and you want to be able to use the computer for other things in the meantime.

  2. You want to log out of the machine while the job is running and come back to it later.

  3. You’re running the job on a remote machine, and again you want to log out.

  4. You want to be courteous to other users of the machine by decreasing the priority of the job (so it doesn’t slow down the machine when someone else is using it).

To start a BATCH job, use

nice R CMD BATCH scriptfile.R outfile.Rout &

nice gives your job lower priority. Actually on the lab computers all jobs are “niced” by default, so this isn’t strictly necessary.

& at the end of the BATCH command indicates that you want to run this job in the background.

outfile.Rout is the name of the file where the output will be sent.
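To see what nice and & do without needing R installed, here is a minimal sketch that uses a short sleep as a stand-in for the long-running job. The stand-in command and the file name outfile.txt are assumptions for illustration; with R available, the background line would be the R CMD BATCH command shown above.

```shell
# Stand-in for a long job; with R installed this line would instead be
#   nice R CMD BATCH scriptfile.R outfile.Rout &
nice sh -c 'sleep 1; echo "job finished"' > outfile.txt 2>&1 &

# $! holds the process ID of the most recently started background job.
job_pid=$!
echo "started background job with PID $job_pid"

# The shell is free for other work here; wait blocks until the job ends.
wait "$job_pid"
status=$?

# Report how the job ended and show what it wrote.
echo "exit status: $status"
cat outfile.txt
```

In an interactive session you would normally skip the wait and simply keep working; it is included here only so the sketch can show the output file once the job is done.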

A few other things to keep in mind:

  1. scriptfile.R should require no input from the user.

  2. Graphics should be created by surrounding the relevant code with pdf(file = "filename.pdf") and dev.off(). Graphics files are saved as pdf files.

  3. To see information about currently running processes, just type top. There are arguments to top that allow you to sort by CPU usage, memory, etc. See man top for more details.
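The points above can be sketched with a tiny script that needs no user input and wraps its plot in pdf() and dev.off(). The names scriptfile.R, outfile.Rout, and filename.pdf follow the placeholders used in these notes, and the plot itself is made up. The batch run is guarded so that the sketch still works on a machine without R.

```shell
# Write a small non-interactive R script (file names are placeholders).
cat > scriptfile.R <<'EOF'
# No user input is needed: everything is set inside the script.
x <- 1:10

# Open a pdf device, draw the plot, then close the device to write the file.
pdf(file = "filename.pdf")
plot(x, x^2)
dev.off()
EOF

# Run in batch mode only if R is installed on this machine.
if command -v R >/dev/null 2>&1; then
  R CMD BATCH scriptfile.R outfile.Rout
  ls filename.pdf outfile.Rout
else
  echo "R not found; scriptfile.R was written but not run"
fi
```

After a successful run, outfile.Rout should contain the echoed commands and any printed output, and filename.pdf should contain the plot.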

EXAMPLE:

  1. Download makemaps.R from bCourses assignment 9 into the home directory of your virtual DataScienceToolbox.

  2. View the file with cat. Notice the following things:

  • There are two packages needed. We will need to install them on the virtual machine.

  • system is used to run UNIX commands from within R.

  • Plots are created using the pdf and dev.off functions. We need these because we are going to run R in BATCH mode and we won’t be able to manually save the plots.

  3. In your homework you will run makemaps.R in BATCH mode.