Helpful Computational Infrastructure Resources

Setting up how you access software on Darwin

As a lab, and as a user, you have access to many different computational resources.

In general, as a lab, we have space on /nfs3/Sharpton_Lab/. This is where you will store most of the things you produce during your time here.
As a user, you also have your own home directory at /home/micro/username. Your home directory is limited in space, so it's best not to store things there, including software installs (we will talk about that shortly). Because /home/micro/username is your home directory, it is where we modify the configuration files that control how you interact with other programs on the compute infrastructure.
You can also interact with Darwin through RStudio in the browser (the R cloud): https://rstudio-darwin.cqls.oregonstate.edu/auth-sign-in?appUri=%2F
You should learn to version control the software that you develop using the OSU GitLab instance (similar to GitHub): https://gitlab.cgrb.oregonstate.edu/users/sign_in

The bash profile

Every time you log into Darwin, you start out in your home directory. You can return to your home directory at any time by simply typing cd. Your home directory is located at /home/micro/username/, also written ~/. Within your home directory, you have a file that sets your bash profile. The bash profile is a great tool for making the terminal easier and quicker to use. It can also make the terminal look prettier than the default.

The .bash_profile is a configuration file for the bash shell, which you access via the terminal. Before you make any changes to your bash profile, make a backup copy first, for example .bash_profile_back. Note that the bash profile is a hidden file, which is why the file name begins with a “.”. To see it, type ls -a to show hidden files. The .bash_profile should be one of them.

# Look at the current .bash_profile
cat .bash_profile

# Save a copy
cp .bash_profile .bash_profile_back

# Look at the base .bash_profile
cat .bash_profile
# Jan 20 2022
# .bash_profile

Here is what is stored in the default .bash_profile created for a user on Darwin.

# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

# User specific environment and startup programs

PATH=$PATH:$HOME/bin

export PATH
unset USERNAME

The bashrc file

As you can see, the .bash_profile sources another file called .bashrc. For most things, you should make changes in the .bashrc file rather than in the .bash_profile. This includes things like setting aliases, modifying the prompt, and setting your path variable. The .bashrc file lives in the home directory as well. On the CQLS infrastructure, each user starts with this standard .bashrc file.

# .bashrc base file
#       Source the standard .bashrc file
source /local/cluster/etc/std.bashrc
#       Add your own personal changes following this line:

This script sources the standard bash file for any user. After this, we should have access to any programs that are stored in /local/cluster/bin/. These programs are installed and kept up to date by the CQLS. The benefit is that they can install one instance of the software and multiple users can access it. The downside is that the software might not be as customizable because it's installed for everyone.

We get it, things get messy, and you just want that program to run! Pretty soon, you have modifications to your $PATH variable in both the .bash_profile and .bashrc files, with multiple aliases calling different programs (maybe I am the only guilty party here!). At that point, it is no surprise that the default behavior of the compute infrastructure isn’t working for you. If you ever need to “reset” your home directory to the default settings, you now have the default .bash_profile and .bashrc files listed here. Don’t forget to back up your current files before making changes to the .bash_profile and .bashrc.
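As a minimal sketch of the backup step, the function below copies a file to a timestamped backup next to the original, so repeated backups do not overwrite each other. The name backup_dotfile and the .bak naming scheme are illustrative assumptions, not part of the CQLS setup.

```shell
# Sketch: copy a dotfile to a timestamped backup next to the original.
# `backup_dotfile` is a hypothetical helper name, not a CQLS tool.
backup_dotfile() {
    local file="$1"
    local stamp
    stamp="$(date +%Y%m%d_%H%M%S)"

    if [ ! -f "$file" ]; then
        echo "backup_dotfile: $file not found" >&2
        return 1
    fi

    # -p preserves the original modification time and permissions.
    # On success, print the name of the backup that was written.
    cp -p "$file" "${file}.${stamp}.bak" && echo "${file}.${stamp}.bak"
}

# Example usage:
#   backup_dotfile ~/.bash_profile
#   backup_dotfile ~/.bashrc
```

To restore, copy the backup over the current file, for example cp ~/.bashrc.<timestamp>.bak ~/.bashrc, and then log in again.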

Change prompt

In general, you should try to organize your .bashrc file into discrete groups of settings. First, let’s change what our prompt looks like. For example, this is what my prompt looks like by default.

-bash-4.2$

That’s kind of ugly. Why don’t we change it to show our user name, what machine (host) we are on, and what directory we are currently in? To do that, we add two lines under the default code in the .bashrc file:

## Change bash prompt 
export PS1="Arnold@\h [\W]$ "

This change results in the prompt looking like this. We have our user name, what machine we are on, and then, within the brackets, the directory that we are currently in. Because we are in our home directory, it's just a tilde.

Arnold@darwin [~]$
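The prompt above hard-codes the name "Arnold". As a hedged alternative, bash also provides the \u escape, which expands to the name of the logged-in user, so the same line works unchanged for anyone in the lab. This is a sketch rather than the author's setting:

```shell
## Change bash prompt: \u = user name, \h = host name, \W = current directory
# Single quotes keep the backslash escapes intact so that bash expands them
# each time it draws the prompt; \$ prints '$' for normal users and '#' for root.
export PS1='\u@\h [\W]\$ '
```

For a user whose login name is arnoldhk, this should give a prompt along the lines of arnoldhk@darwin [~]$ when sitting in the home directory.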

There are a billion modifications that you can make in your .bashrc file, and you probably don’t want to learn all the shorthand for them. Here is a .bashrc generator that lets you design the prompt you would like and then generates the bash code to add to the .bashrc file: https://bashrcgenerator.com/

Alias

The next lines in your .bashrc file should contain a list of your shortcuts, called aliases. In an effort to keep things tidy, I’ve decided to create another file called .alias, and then have the .bashrc source this file. So I’ve added a line of code at the end of the .bashrc file to source the .alias file.

## Get shortcuts
source ~/.alias
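If the .alias file is ever missing, the plain source line prints an error at every login. A slightly more defensive sketch checks that the file exists first, the same pattern the default .bash_profile uses for ~/.bashrc. The helper name source_if_exists is an assumption for illustration.

```shell
# Sketch: source a file only if it exists.
# `source_if_exists` is a hypothetical helper name.
source_if_exists() {
    if [ -f "$1" ]; then
        . "$1"
    fi
}

## Get shortcuts
source_if_exists "$HOME/.alias"
```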

Now, we have to make the .alias file! In general, you should use shortcuts to make your life easier. An alias is a keyword that stands for any command you could run in bash. Good examples include changing directories to a specific location, or running a command that you use all the time. For example, if we work on a particular project all the time, we could add this line to the .alias file:

alias mouse='cd /nfs3/Sharpton_Lab/prod/projects/mouse_behavior_metaanalysis/projects/'

Then in the command line, we can just type “mouse”, and voilà! We are suddenly in the directory we want to be in. This way we don’t have to navigate to the project folder by hand in each new terminal. That is exhausting! Another thing we might add are commands for software we use all the time. For example, if you hate remembering the git commands, then why not just make aliases for them here?
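A few git shortcuts might look like the sketch below. The alias names (gs, ga, gc, gl) are only illustrative; pick whatever you will remember.

```shell
# Example ~/.alias entries for git (alias names are illustrative)
alias gs='git status'                            # what has changed?
alias ga='git add'                               # usage: ga file1 file2
alias gc='git commit -m'                         # usage: gc "commit message"
alias gl='git log --oneline --graph --decorate'  # compact history view
```

After adding these lines, open a new terminal or run source ~/.alias so the current shell picks them up.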

Bad examples of aliases include ones that call programs installed within conda environments. Such an alias can start software that is not configured correctly for the environment you are currently running in. To use that software, you should activate its conda environment instead.

R

We use R all the time, so it's important to know how to start R, how to switch to a new R version, and how to install packages. By default, packages are installed into your /home/micro/ directory, but in time you will likely run out of space there, so you should install software somewhere else. In the Sharpton Lab, everything under prod/ is backed up. Because software can be downloaded again, there is no reason to back it up on prod, so it should be installed in a directory under your name within the /nfs3/Sharpton_Lab/tmp/src/ folder.

The CQLS installs many software packages, including R, but they do not maintain R packages. So, in our .bashrc file, we need to put the desired R version in our path, then set a destination for any libraries we download:

## Set up R path in .bashrc
export PATH=/local/cluster/R-4.1.0/bin:${PATH} # Tell which is the default R to use
export R_LIBS=/nfs3/Sharpton_Lab/tmp/src/arnoldhk/R/library/4.1.0 # where are user libraries stored
unset R_LIBS_USER # unset this - it's a relic from when everyone in the lab was using the same R version

Then, within the src/ folder, you should create an R/library/4.1.0 folder where all packages for R version 4.1.0 will be installed. When the next version of R comes out, you can create a new folder, such as R/library/4.1.1, to hold the libraries for that version, and then update your path to point to R version 4.1.1. Ed has made a great resource describing this here: https://software.cqls.oregonstate.edu/tips/posts/using-system-r-with-user-installed-packages/.
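The per-version layout described above can be sketched as a small shell function. The name use_r_version is hypothetical, and the function assumes that every R install follows the /local/cluster/R-<version>/bin pattern seen for 4.1.0, which you should confirm before relying on it.

```shell
# Sketch: point the shell at one R version and a matching package library.
# `use_r_version` is a hypothetical helper, not a CQLS tool.
#   $1 = R version, e.g. 4.1.0
#   $2 = your personal src directory, e.g. /nfs3/Sharpton_Lab/tmp/src/<username>
use_r_version() {
    local version="$1"
    local src_dir="$2"
    # Assumption: R installs live at /local/cluster/R-<version>/bin
    local r_bin="/local/cluster/R-${version}/bin"
    local r_lib="${src_dir}/R/library/${version}"

    # Create the version-specific library folder if it does not exist yet
    mkdir -p "$r_lib" || return 1

    # Prepend the R bin directory only if it is not already on the PATH,
    # so sourcing .bashrc repeatedly does not keep growing PATH
    case ":${PATH}:" in
        *":${r_bin}:"*) ;;
        *) PATH="${r_bin}:${PATH}" ;;
    esac

    export PATH
    export R_LIBS="$r_lib"
    unset R_LIBS_USER
}

# Example usage in .bashrc:
#   use_r_version 4.1.0 /nfs3/Sharpton_Lab/tmp/src/<username>
```

Switching to a newer R version then means changing one argument rather than editing two paths by hand.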

igraph

Many R libraries that run phylogenetic analyses depend on igraph. Unfortunately, igraph had an issue compiling on Darwin because some C++ files were missing. The following solves the problem.

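# Using /local/cluster/R-4.1.0/bin
install.packages("remotes")
remotes::install_cran("igraph")  # package names are case-sensitive; the CRAN name is lowercase

# remotes is also handy because you don't have to remember separate Bioconductor (biocLite/BiocManager) and git commands
# Fill in the package name or repository URL between the quotes
remotes::install_bioc("")
remotes::install_git("")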

R cloud

The R cloud is available here: https://rstudio-darwin.cqls.oregonstate.edu/auth-sign-in?appUri=%2F

Conda

The next software that you will likely use is conda. Within your src/ folder, you should also create a conda/ folder to hold conda files. The conda/ folder should contain an envs folder and a pkgs folder to store conda environments and packages. Within your .bashrc file, you should add these lines. These should be located last in the .bashrc file.

## Conda - keep last in .bashrc
# >>> conda initialize >>> 
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/local/cluster/miniconda3_base/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/local/cluster/miniconda3_base/etc/profile.d/conda.sh" ]; then
        . "/local/cluster/miniconda3_base/etc/profile.d/conda.sh"
    else
        export PATH="/local/cluster/miniconda3_base/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

Last, you should add the paths of these src folders to the .condarc file that is located in your home directory:

auto_activate_base: false
channels:
  - conda-forge
  - bioconda
  - defaults
envs_dirs:
  - /nfs3/Sharpton_Lab/tmp/src/arnoldhk/conda/envs
pkgs_dirs:
  - /nfs3/Sharpton_Lab/tmp/src/arnoldhk/conda/pkgs

Within the ~/.conda/ folder (that is, /home/micro/username/.conda/), there is a file called environments.txt. It lists your environments and the paths where they are stored. If you move where conda environments or conda packages are stored, you will have to update this file as well as the .condarc file. Here is a tutorial on how to get started with conda: https://software.cqls.oregonstate.edu/tips/posts/conda-tutorial/.
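As a rough sketch, the folder creation and the .condarc contents shown above can be generated together so that the paths always agree. The function name write_condarc is hypothetical. Note that it overwrites the output file, so back up any existing .condarc first.

```shell
# Sketch: create the conda folders and write a matching .condarc.
# `write_condarc` is a hypothetical helper, not a conda command.
#   $1 = your personal src directory, e.g. /nfs3/Sharpton_Lab/tmp/src/<username>
#   $2 = output file (defaults to ~/.condarc); it is OVERWRITTEN
write_condarc() {
    local src_dir="$1"
    local out="${2:-$HOME/.condarc}"

    # Make sure the envs and pkgs folders exist before conda needs them
    mkdir -p "${src_dir}/conda/envs" "${src_dir}/conda/pkgs" || return 1

    # Same settings as the .condarc listed above, with the paths filled in
    cat > "$out" <<EOF
auto_activate_base: false
channels:
  - conda-forge
  - bioconda
  - defaults
envs_dirs:
  - ${src_dir}/conda/envs
pkgs_dirs:
  - ${src_dir}/conda/pkgs
EOF
}

# Example usage:
#   write_condarc /nfs3/Sharpton_Lab/tmp/src/<username>
```

Afterwards, running conda config --show envs_dirs pkgs_dirs should report the new locations if conda picked up the file.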

If you feel that a conda package would be used by more than just yourself, you can have it installed by CQLS for everyone. For example, humann3 was installed for everyone. It is activated by running:

source /local/cluster/humann3/activate.sh

If you want to install your own conda environment, you can just call conda, and it will now install to the appropriate location under src/:

conda create -n test_env plotly=4.4.1 notebook=6.0.1 ipywidgets=7.5.1
conda activate test_env

Mamba has been installed in the conda base and can be used to solve environments more quickly. For example, you could create the test_env above more quickly by running:

mamba create -n test_env plotly=4.4.1 notebook=6.0.1 ipywidgets=7.5.1

Conda and running a specific version of R

If you want to run a specific version of R from within conda, then you can do the following.

# See what R we are calling in base
which R

# See what R_LIBS we are pointing to
echo $R_LIBS

# Activate the conda environment that we want to use R in
source activate my_program

# Navigate to the directory of that conda environment and run the CQLS R setup script
cd /nfs3/Sharpton_Lab/tmp/src/arnoldhk/conda/envs/my_program
/local/cluster/conda/conda_R_setup.sh

# Now, the version of R should be what we want
which R

# And R_LIBS is now pointing somewhere else
echo $R_LIBS

Holly Arnold

1/21/2022
