HollyArnold 2022-10-30
You’re here because you are looking for raw data from a sequencing
project produced by the Sharpton lab core. See the
Raw Data Files Section.
You’re here because you are looking for analysis outputs or
descriptions of methods for a manuscript from Arnold. See the
Data Analysis Files section.
You’re here because you need to set up a bash profile, learn some
basic UNIX commands, set up conda, or learn where you can access the
RStudio cloud. See the Helpful Computational Resources
section.
You’re here because you are looking for details on how data files
were transferred from one place to another. See
Project Download Record. You probably aren’t looking for
this.
This section is a comprehensive list of raw data files that
I have transferred after sequencing. Below, each project is described
with a key value. Use that key to look for the corresponding sub-directories
within raw_data_mapping/. Read through this gitlab page to
find where you are mentioned.
Raw data is deposited, by raw sequence run ID, into sequencing project
folders that are identified by internal Sharpton Lab run identifiers.
Currently, projects are located here:
/nfs3/Sharpton_Lab/prod/prod_restructure/projects/.
Within that folder, each project lives in a project folder identified
by a three-digit “TS” number unless otherwise specified.
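To see which project folders currently exist, a small helper along these lines may be useful. This is only a sketch: it assumes folder names start with TS followed by three digits (optionally with a suffix such as A or B), and the function name list_ts_projects is made up for illustration.

```shell
# Hypothetical helper: list project folders named TS + three digits
# (for example TS032 or TS032A) inside a projects directory.
list_ts_projects() {
    # Default to the lab projects folder; pass another directory to override.
    local projects_dir="${1:-/nfs3/Sharpton_Lab/prod/prod_restructure/projects}"
    local d
    for d in "$projects_dir"/TS[0-9][0-9][0-9]*/; do
        # Skip the unexpanded pattern when nothing matches.
        [ -d "$d" ] || continue
        basename "$d"
    done
}

# Usage on darwin (not run here):
#   list_ts_projects
```

Folders that do not follow the naming pattern are skipped, so anything "otherwise specified" still has to be looked up by hand.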
raw_data_mapping/bighorn_project_2022_run_1/: (1)
bighorn_sheep_2022_raw_fastq_mapping_file.numbers: Map fastQ raw
data file well identifier to sheep ID and to project ID, (2)
raw_illumina_links_email_confirmation.txt: Email with
initial CQLS raw data location, (3)
TS032A-18S_barcodes.xlsx: Barcode mapping file wells 4, 1,
3 and (4) TS032A-18S_barcodes.xlsx: Barcode mapping file
wells 5, 2, 3.
phyloseq_sheep_2020_16S_sample_integrity.rds
phyloseq_sheep_2020_16S_field_samples.rds
This section compiles some helpful resources and tips to help you interact most successfully with computing infrastructure during your time in the Sharpton Lab. It shares several links that may be generally helpful for you to learn the command line (if you need to), describes the Sharpton Lab infrastructure set up, and finally describes how you can set up your .bashrc and .bash_profile in your home directory on bash so that you best access all the features of the CQLS infrastructure.
CQLS finds it easier to provide support if we follow some guidelines that are then similar between users. I have made every effort to point readers to different how to files that often are not easy or intuitive to find. Let me know if I should include something else here.
There are a variety of guides out there that can get you started with the command line in general.
How to get started with the CQLS infrastructure: https://tips.cgrb.oregonstate.edu/posts/the-cgrb-infrastructure-and-you/ .
Getting started with the command line: https://open.oregonstate.education/computationalbiology/chapter/the-command-line-and-filesystem/.
Working with some basic unix commands: https://astrobiomike.github.io/unix/
Command line best practices: https://github.com/jlevy/the-art-of-command-line
CQLS user portal for gitlab: https://gitlab.cgrb.oregonstate.edu/users/sign_in
Send more suggestions on helpful links for the next new lab member that joins and I will add them here.
As a lab, and as a user, you have access to many different computational resources.
Every time you log into darwin, you start out in your home directory.
You can return to your home directory at any time by simply typing
cd. Your home directory is located at
/home/micro/username/, or ~/. Within your home
directory, you have a file that sets your bash profile. The bash profile
is a great tool to make the terminal easier and quicker to use. It can
also make the terminal look prettier than the default mode.
The .bash_profile is a configuration file for the bash shell, which
you access via the terminal. Before you make any changes to your
bash profile, you should make a backup copy first, for example
.bash_profile_back. Note that the bash profile is a hidden
file, which is why the file name begins with a “.”. To see this file,
you will have to type ls -a to show hidden files. The
.bash_profile should be one of them.
# Look at the current .bash_profile
cat .bash_profile
# Save a copy
cp .bash_profile .bash_profile_back
# Look at the base .bash_profile
cat .bash_profile
# Jan 20 2022
# .bash_profile
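If you back up your dotfiles more than once, a dated copy avoids overwriting an earlier backup. The following is a minimal sketch of that idea; the function name backup_dotfile and the .bak-YYYYMMDD suffix are my own choices, not a lab or CQLS convention.

```shell
# Hypothetical helper: copy a dotfile to <name>.bak-YYYYMMDD before editing it.
backup_dotfile() {
    local file="$1"
    local stamp
    stamp=$(date +%Y%m%d)
    if [ -f "$file" ]; then
        # -p keeps the original timestamps and permissions on the copy.
        cp -p "$file" "$file.bak-$stamp"
        echo "Backed up $file to $file.bak-$stamp"
    else
        echo "No file at $file, nothing to back up" >&2
        return 1
    fi
}

# Usage (not run here):
#   backup_dotfile ~/.bash_profile
#   backup_dotfile ~/.bashrc
```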
Here is what is stored in a default .bash_profile created for a user on darwin.
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi
# User specific environment and startup programs
PATH=$PATH:$HOME/bin
export PATH
unset USERNAME
As you can see, the .bash_profile sources another file called the
.bashrc file. For most things, you should make changes within the
.bashrc file, rather than to the .bash_profile. This
includes things like setting alias names, prompt modification, and
setting your path variable. The .bashrc file lives in the
home directory as well. On the CQLS infrastructure, here is a standard
.bashrc profile that they begin with for each user.
# .bashrc base file
# Source the standard .bashrc file
source /local/cluster/etc/std.bashrc
# Add your own personal changes following this line:
This script sources the standard bash file for any user. After this,
we should have access to any programs that are stored in
/local/cluster/bin/. These programs are installed and kept updated
by the CQLS. The benefit is that they can install one instance
of the software and multiple users can access it. The downside is that the
software might not be as customizable because it’s installed for
everyone.
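One quick way to confirm that the standard file was picked up is to check whether the shared program folder is on your PATH. The sketch below assumes that sourcing std.bashrc puts /local/cluster/bin on PATH, which is my reading of the description above rather than something I have confirmed; path_contains is a made-up helper name.

```shell
# Hypothetical helper: succeed if a directory is listed in $PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

if path_contains /local/cluster/bin; then
    echo "/local/cluster/bin is on PATH"
else
    echo "/local/cluster/bin is not on PATH; check that std.bashrc was sourced"
fi
```

To see which copy of a particular program you would actually run, `type -a program_name` lists every match on PATH in order.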
We get it, things get messy, you just want that program to run! Pretty soon, you have modifications to your $PATH variable in both the .bash_profile and .bashrc files, with multiple aliases calling different programs (maybe I am the only guilty party here!). No wonder the default behavior of the compute infrastructure isn’t working for you! If you ever need to “reset” your home directory to the default settings, you now have the default .bash_profile and .bashrc files listed here. Don’t forget to back up your current profile before making changes to the .bash_profile and .bashrc files.
In general, you should try to store things in discrete groups within your .bashrc file. First, let’s change what our prompt looks like. For example, right now, my prompt looks like this by default.
-bash-4.2$
That’s kind of ugly. Why don’t we change it to show our user name, what machine (host) we are on, and what directory we are currently in? To do that, we add two lines of code under the default code in the .bashrc file:
## Change bash prompt
export PS1="Arnold@\h [\W]$ "
This change results in the prompt looking like this. We have our user name, what machine we are on, and then within the brackets is the current directory that we are in. Because we are in our home directory, it’s just a tilde sign.
Arnold@darwin [~]$
There are a billion modifications that you can do with your .bashrc file. And you probably don’t want to learn all the short hand for things. Here is a .bashrc generator that allows you to figure out the prompt you would like and then generates the bash code to be added to the .bashrc file https://bashrcgenerator.com/ .
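If you would rather not hard-code a name in the prompt, bash can fill it in for you. The variant below is a sketch using standard bash prompt escapes: \u is the login name, \h the host name, \W the last part of the current directory, and \$ shows # for root and $ otherwise. Single quotes keep the backslashes intact until bash draws the prompt.

```shell
## Change bash prompt (variant that picks up the login name automatically)
export PS1='\u@\h [\W]\$ '
```

Note that \u shows your login name (for example arnoldhk), so the result will look slightly different from a prompt with a hand-typed name.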
The next lines in your .bashrc file should contain a list of your shortcuts, called aliases. In an effort to keep things tidy, I’ve decided to create another file called .alias, and then just have the .bashrc source this file. So I’ve added a line of code at the end of the .bashrc file to source the .alias file.
## Get shortcuts
source ~/.alias
Now, we have to make the .alias file! In general, you should use shortcuts to make your life easier. An alias is a keyword that stands for any command that you could run in bash. Good examples include a command to change directories to a specified location, or commands for a program that you use all the time. For example, if we work on a particular project all the time, we could add this line to the .alias file:
alias mouse='cd /nfs3/Sharpton_Lab/prod/projects/mouse_behavior_metaanalysis/projects/'
Then in the command line, we can just type “mouse”, and voilà! We are suddenly in the directory we want to be in. This way we don’t have to remember to navigate to the project folder in each new terminal. That is exhausting! Another thing we might add are commands for software we use all the time. For example, if you hate remembering the git commands, then why not just make aliases for them here?
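As a sketch of what git shortcuts in the .alias file could look like, here are a few entries. The short names (gst, ga, gcm, glog) are made up for illustration; pick whatever you will remember, and check with `type name` first that the name is not already a program.

```shell
# Example entries for ~/.alias (alias names are hypothetical).
alias gst='git status'
alias ga='git add'
alias gcm='git commit -m'
alias glog='git log --oneline -n 10'
```

After editing the file, run `source ~/.alias` or open a new terminal, then type `alias` with no arguments to list everything that is currently defined.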
Bad examples of aliases include ones that call special programs installed within conda environments. This can start software that is not configured correctly for the environment you are currently running in. To use such software, you should source activate the conda environment.
We use R all the time, so it’s important to know how to start R, how
to manage changing to a new R version, and how to install packages.
A default install will put software in your /home/micro/ directory,
but in time you will likely reach the space capacity for this folder, so
you should install software somewhere else. In the Sharpton Lab,
everything under prod/ is backed up, but because software
can be downloaded again, there is no reason to back it up on prod.
Software should therefore be installed in a directory under your name within the
/nfs3/Sharpton_Lab/tmp/src/ folder.
The CQLS installs many software packages, including R, but they do not maintain R packages. So, in our .bashrc profile, we need to put the desired R version in our path, then set a destination for any libraries we download:
## Set up R path in .bashrc
export PATH=/local/cluster/R-4.1.0/bin:${PATH} # Tell which is the default R to use
export R_LIBS=/nfs3/Sharpton_Lab/tmp/src/arnoldhk/R/library/4.1.0 # where are user libraries stored
unset R_LIBS_USER # unset this - it's a relic from when everyone in the lab was using the same R version
Then, within your src/ folder, you should create an
R/library/4.1.0 folder where you will install all packages
for R version 4.1.0. When the next version of R comes out,
you can just create a new folder for the next set of R libraries,
for example R/library/4.1.1, and then update your path to point to
R version 4.1.1. Ed has made a great resource describing this here: https://software.cqls.oregonstate.edu/tips/posts/using-system-r-with-user-installed-packages/.
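The per-version folder can be created with a couple of lines of bash. The helper below is a sketch: setup_r_library is a made-up name, and it assumes that each R version installed by CQLS lives under /local/cluster/R-<version>/bin, which I have only inferred from the 4.1.0 path shown above.

```shell
# Hypothetical helper: create a per-version R library folder and print
# the matching lines to paste into ~/.bashrc.
setup_r_library() {
    local src_dir="$1"      # e.g. /nfs3/Sharpton_Lab/tmp/src/<your_username>
    local r_version="$2"    # e.g. 4.1.0
    local lib_dir="$src_dir/R/library/$r_version"
    mkdir -p "$lib_dir" || return 1
    # Assumed install location pattern for the system R (not confirmed).
    echo "export PATH=/local/cluster/R-$r_version/bin:\${PATH}"
    echo "export R_LIBS=$lib_dir"
}

# Usage (not run here):
#   setup_r_library /nfs3/Sharpton_Lab/tmp/src/<your_username> 4.1.0
```

The function only prints the export lines; copying them into .bashrc by hand keeps you in control of what changes there.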
Many R libraries that run phylogenetic analyses use igraph. Unfortunately, igraph had an issue compiling on darwin because some C++ files were missing. This problem has been solved by the following.
# Using /local/cluster/R-4.1.0/bin
install.packages("remotes")
remotes::install_cran("igraph")
# remotes is also handy because you don't have to remember biocLite commands and git commands.
# Replace the empty strings with a Bioconductor package name or a git URL.
remotes::install_bioc("")
remotes::install_git("")
The R cloud is available here: https://rstudio-darwin.cqls.oregonstate.edu
The next software that you will likely use is conda. Within your
src/ folder, you should also create a conda/ folder
for conda files. The conda/ folder
should contain an envs and a pkgs folder to store
conda environments and packages. Within your .bashrc file, you should add
these lines. They should come last in the .bashrc file.
## Conda - keep last in .bashrc
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/local/cluster/miniconda3_base/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/local/cluster/miniconda3_base/etc/profile.d/conda.sh" ]; then
        . "/local/cluster/miniconda3_base/etc/profile.d/conda.sh"
    else
        export PATH="/local/cluster/miniconda3_base/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
Last, you should add the paths of your conda folders under src/ to the
.condarc file that is located in your home directory:
auto_activate_base: false
channels:
- conda-forge
- bioconda
- defaults
envs_dirs:
- /nfs3/Sharpton_Lab/tmp/src/arnoldhk/conda/envs
pkgs_dirs:
- /nfs3/Sharpton_Lab/tmp/src/arnoldhk/conda/pkgs
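The two folders named in .condarc have to exist before conda can use them. Here is a small sketch for creating them and then asking conda which directories it will use; make_conda_dirs is a made-up helper name, and the conda check is skipped when conda is not on PATH.

```shell
# Hypothetical helper: create the envs/ and pkgs/ folders that .condarc points to.
make_conda_dirs() {
    local src_dir="$1"   # e.g. /nfs3/Sharpton_Lab/tmp/src/<your_username>
    mkdir -p "$src_dir/conda/envs" "$src_dir/conda/pkgs"
}

# If conda is available, print the directories it will actually use.
if command -v conda >/dev/null 2>&1; then
    conda config --show envs_dirs pkgs_dirs
fi

# Usage (not run here):
#   make_conda_dirs /nfs3/Sharpton_Lab/tmp/src/<your_username>
```

If the printed directories do not match what you put in .condarc, the file is probably not in your home directory or has an indentation problem.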
Within the .conda/ folder in your home directory, there is a file
called environments.txt. This should have a list of environments and
where they are stored. If you move where conda environments or
conda packages are stored, you will have to update this file as well as
the .condarc file. Here is a tutorial on how to get started with conda:
https://software.cqls.oregonstate.edu/tips/posts/conda-tutorial/.
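After moving environments, it can help to check that every path recorded in environments.txt still exists. The sketch below assumes the file holds one environment path per line; check_conda_env_list is a made-up name.

```shell
# Hypothetical helper: report paths listed in environments.txt that no longer exist.
# Returns 0 if all paths exist, 1 if any are missing, 2 if the list file is absent.
check_conda_env_list() {
    local list_file="${1:-$HOME/.conda/environments.txt}"
    local missing=0
    local env_path
    [ -f "$list_file" ] || { echo "No file at $list_file" >&2; return 2; }
    while IFS= read -r env_path || [ -n "$env_path" ]; do
        [ -n "$env_path" ] || continue          # skip blank lines
        if [ ! -d "$env_path" ]; then
            echo "Missing: $env_path"
            missing=1
        fi
    done < "$list_file"
    return "$missing"
}

# Usage (not run here):
#   check_conda_env_list
```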
If you feel that the conda package would be used by more than just yourself, you can have conda programs installed by CQLS. For example, humann3 was installed for everyone. This is then activated by writing:
source /local/cluster/humann3/activate.sh
If you want to install your own conda environment, you can just call conda, and it will now install to the appropriate location in src/:
conda create -n test_env plotly=4.4.1 notebook=6.0.1 ipywidgets=7.5.1
conda activate test_env
Mamba has been installed on the conda base and can be used to solve
environments more quickly. For example, you could install the
test_env above more quickly by running
mamba create -n test_env plotly=4.4.1 notebook=6.0.1 ipywidgets=7.5.1
If you want to run a specific version of R from within conda, then you can do the following.
# See what R we are calling in base
which R
# See what R_LIBS we are pointing to
echo $R_LIBS
# Activate the conda environment that we want to use R in
source activate my_program
# Navigate to the base conda directory environment.
cd /nfs3/Sharpton_Lab/tmp/src/arnoldhk/conda/envs/my_program
/local/cluster/conda/conda_R_setup.sh
# Now, the version of R should be what we want
which R
# And R_LIBS is now pointing to somewhere else
echo $R_LIBS
This serves as a record of how Arnold transferred files for each of the projects listed above.
####################
# Sheep 2020 18S Data
#
####################
rsync -avz --dry-run /nfs2/hts/external/illumina/miseq/221007_M04034_0030_000000000-KDV5D/L1/ /nfs3/Sharpton_Lab/prod/prod_restructure/projects/TSOA33/FastQ/
rsync -avz --dry-run /nfs2/hts/external/illumina/miseq/221007_M70296_0001_000000000-KD3GT/L1/ /nfs3/Sharpton_Lab/prod/prod_restructure/projects/TSOB1/FastQ/
rsync -avz /nfs2/hts/external/illumina/miseq/221007_M04034_0030_000000000-KDV5D/L1/ /nfs3/Sharpton_Lab/prod/prod_restructure/projects/TSOA33/FastQ/
rsync -avz /nfs2/hts/external/illumina/miseq/221007_M70296_0001_000000000-KD3GT/L1/ /nfs3/Sharpton_Lab/prod/prod_restructure/projects/TSOB1/FastQ/
# Update the internal CQLS identifiers to Internal Sharpton Lab identifiers for the project.
mv TSOA33/ TS032A/
mv TSOB1/ TSO32B/
####################
# Scalebrain Data
# Jan 5th, 2022
####################
rsync -avz --dry-run /nfs2/hts/miseq/2022/221224_M01498_1005_000000000-KM76M/L1/ /nfs3/Sharpton_Lab/prod/prod_restructure/projects/TS027/
rsync -avz /nfs2/hts/miseq/2022/221224_M01498_1005_000000000-KM76M/L1/ /nfs3/Sharpton_Lab/prod/prod_restructure/projects/TS027/
#sent 4,700,564,569 bytes received 45,385 bytes 4,823,612.06 bytes/sec
#total size is 4,843,480,736 speedup is 1.03
rsync -avz /nfs2/hts/miseq/2022/221224_M01498_1005_000000000-KM76M/L1/ /nfs3/Sharpton_Lab/prod/prod_restructure/projects/TS027/
#sending incremental file list
#sent 171,335 bytes received 21 bytes 31,155.64 bytes/sec
#total size is 4,843,480,736 speedup is 28,265.60
####################
# Sheep 16S data (TS032)
# Jan 5th, 2022
####################
#CWD: /nfs3/Sharpton_Lab/prod/prod_restructure/projects
mkdir TS032
rsync -avz --dry-run /nfs2/hts/miseq/2022/221209_M01498_0998_000000000-KL5LW/L1/ /nfs3/Sharpton_Lab/prod/prod_restructure/projects/TS032/
rsync -avz /nfs2/hts/miseq/2022/221209_M01498_0998_000000000-KL5LW/L1/ /nfs3/Sharpton_Lab/prod/prod_restructure/projects/TS032/
rsync -avz /nfs2/hts/miseq/2022/221209_M01498_0998_000000000-KL5LW/L1/ /nfs3/Sharpton_Lab/prod/prod_restructure/projects/TS032/
#sending incremental file list
#sent 218,464 bytes received 21 bytes 48,552.22 bytes/sec
#total size is 7,991,109,808 speedup is 36,575.10
####################
# EMC2, HBH, BEE21, Microgreens, PIB2 data
# Feb 7th, 2023
####################
rsync -avz /nfs2/hts/miseq/230203_M01498_1021_000000000-KPN3Y/L1/ /nfs3/Sharpton_Lab/prod/prod_restructure/projects/TS035/
#sending incremental file list
#sent 83036 bytes received 17 bytes 166106.00 bytes/sec
#total size is 4923200099 speedup is 59277.81
####################
# Sample Integrity Project for Leigh
# May 6th, 2023
####################
rsync -avz /nfs3/Sharpton_Lab/prod/prod_restructure/projects/arnoldhk/2022_Bighorn_Sheep/dada2.out/16S/bighorn_sheep_2020_2023-05-05_output/phyloseq_sheep_2020_16S_sample_integrity.rds /nfs3/Sharpton_Lab/WEB_DOWNLOADS/arnold/
####################
# Field Samples for Sakshi and Arnold
# May 6th, 2023
####################
rsync -avz /nfs3/Sharpton_Lab/prod/prod_restructure/projects/arnoldhk/2022_Bighorn_Sheep/dada2.out/16S/bighorn_sheep_2020_2023-05-05_output/phyloseq_sheep_2020_16S_field_samples.rds /nfs3/Sharpton_Lab/WEB_DOWNLOADS/arnold/
####################
# Sheep Data 2021 and 2022 (TS039) 16S
# June 4th, 2023
####################
rsync -avz /nfs2/hts/miseq/230508_M01498_1050_000000000-KWWNR/L1/ /nfs3/Sharpton_Lab/prod/prod_restructure/projects/TS039/
#sent 181,383 bytes received 21 bytes 120,936.00 bytes/sec
#total size is 9,647,518,063 speedup is 53,182.50
####################
# Sheep Data 2021 and 2022 (TS039) 18S
# July 10th, 2023
####################
# Run 2
rsync -avz /nfs2/hts/miseq/230706_M01498_1064_000000000-L5NCH/L1/ /nfs3/Sharpton_Lab/prod/prod_restructure/projects/TS039/MiSeq_Run_1064/
#sent 190,710 bytes received 21 bytes 127,154.00 bytes/sec
#total size is 3,302,511,100 speedup is 17,315.02
# Run 1
rsync -avz /nfs2/hts/miseq/230705_M01498_1063_000000000-L5N6B/L1/ /nfs3/Sharpton_Lab/prod/prod_restructure/projects/TS039/MiSeq_Run_1063/
#sent 188,606 bytes received 21 bytes 377,254.00 bytes/sec
#total size is 2,706,575,210 speedup is 14,348.82
# Run 3
rsync -avz /nfs2/hts/miseq/230707_M01498_1065_000000000-L523W/L1/ /nfs3/Sharpton_Lab/prod/prod_restructure/projects/TS039/MiSeq_Run_1065/
#sent 190,912 bytes received 21 bytes 381,866.00 bytes/sec
#total size is 3,185,530,667 speedup is 16,684.02
#############
## PFAS
## October 5th, 2023
#############
rsync -avz /nfs2/hts/nextseq/230708_VH00571_285_AACJLKMM5/L1-0mm-rename/ TS040Rename/
```
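Several of the records above repeat the same rsync command until it reports nothing new to send. A complementary check, sketched here on the assumption that both folders are readable from the same machine, is to compare every source file against its copy byte for byte. verify_transfer is a made-up helper name, and the comparison can be slow on multi-gigabyte FastQ folders.

```shell
# Hypothetical helper: succeed if every file under the source folder exists
# in the destination folder with identical contents.
verify_transfer() {
    local src="$1" dest="$2"
    local rel status=0
    # Walk every file under the source and compare it byte for byte.
    while IFS= read -r rel; do
        [ -n "$rel" ] || continue
        if ! cmp -s "$src/$rel" "$dest/$rel"; then
            echo "Differs or missing: $rel"
            status=1
        fi
    done <<EOF
$(cd "$src" && find . -type f | sort)
EOF
    return "$status"
}

# Usage (not run here; paths are placeholders):
#   verify_transfer /nfs2/hts/miseq/<run_folder>/L1 /nfs3/Sharpton_Lab/prod/prod_restructure/projects/<project>
```

The check only looks at files present in the source, so extra files in the destination are not reported.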