Module 0: Getting Started with Galaxy and awk

Introduction:

The goal of this class is to familiarize you with modern genomic analysis techniques and their application to the study of plant genomics. Carrying out genomic analysis often involves working with large data sets that are difficult to make sense of using standard spreadsheet and word processing software. Instead, an understanding of bioinformatics and bioinformatic tools is necessary. Doing bioinformatics often requires the use of command line tools and familiarity with at least one scripting language, such as Perl or Python, and at least one statistical programming language, usually R. We will not have time to learn all of these skills this semester, but luckily many bioinformatics tools necessary for genomic analysis have been rendered in a more user friendly form as part of the Galaxy Project. Almost all common analyses, from raw sequence processing to read mapping and variant calling, can be carried out through the Galaxy platform. In addition to gaining familiarity with common command line tools you will also learn to harness the awesome power of AWK, a compact easy to use programming language, to manipulate large text files.

Learning bioinformatics is best done by rolling up your sleeves and doing bioinformatics. To this end you will be given real data sets, which are messy and large, and trained to use the latest bioinformatics tools to carry out relevant data analysis pipelines. In some cases data will be subsetted to reduce size and make assignments more tractable; however, the analysis remains the same as it would with the full sized set. It is our hope that working through the modules in this course will provide you with the necessary activation energy to tackle genomic analysis in your own work.

Throughout the modules you will analyze data from a QTL-Seq study aimed at identifying the genetic basis of flowering time in tomato. You can read the study here. This will involve running quality control on raw sequence data, calling variants, and running statistical analysis to identify SNPs associated with flowering time. However, before you get started on this you will need to set up an account on Galaxy and gain some experience using awk.

Register with Galaxy:

First things first, go to the Galaxy website and set up a free account.

After you register and activate your account click on the “start here” link and work through the Galaxy 101-1 exercise to learn how to work in Galaxy.

**Download and save your BED file with the top five regions. You will need to turn it in.

Locating awk on your system:

This is not a programming class, so class assignments will not require you to do much programming. Instead you will often be provided with awk code to run on data files. The main goal is to show you what tasks can be accomplished with awk and common awk routines. Occasionally, we may provide you with nearly complete code and ask you to fill in the missing code to complete the task. Before we get started you will need to either install or locate awk. Unfortunately, this is operating system dependent.

Linux PC

Awesome!

Just locate the terminal on your system, open it and type “awk” and you should see information about awk printed to your terminal.

Mac

Locate your terminal,open it and type “awk” and you should see information about awk printed to your terminal.

Windows PC

I’m sorry.

You don’t have access to the same terminal as your Mac or Linux counterparts. It’s okay though, you can remedy that by installing Cygwin.

Directions for installation can be found here

Great, now you have access to the same terminal with the same tools as the Mac and Linux people, including awk. You can launch the Cygwin terminal by clicking the Cygwin shortcut that was installed on your desktop. The terminal should look something like this:

Terminal Navigation

You only need to know a handful of commands for this class:

An overview of navigating in the terminal can be found here.

Make a folder called “Module_0”

If I wanted to make a folder called “Module_0” in my desktop on my Windows PC I would open the Cywgin terminal and type:

cd ../../cygdrive/c/Documents and Settings/Christopher/Desktop/

mkdir Module_0

After you create your Module_0 directory navigate into it and stay there. We will be running some programs in it soon.

cd Module_0

Working with Text Files

We will not be using Word or any other similiar word processing software in this class for working with large text files or program files. These programs do text-wrapping and have other undesirable behaviors.

If you are working on a Mac or a Linux system we recommend opening all files with gedit. Download here

If you are working on a Windows PC use Notepad++. Download here

AWK’s Anatomy

Awk programs have very simple structure.

awk ‘program’ input-file

The above line is the skeleton of an awk one-liner.

Now for everyone’s obligatory first program. Type the following code into your terminal:

echo “Don’t worry about what this line of code is doing”>test.txt

awk ‘{print “Hello World !”}’ test.txt

You should see “Hello World!” printed to your terminal. By default awk will write output to the terminal (default stdout).This can be changed by re-directed your output into a file using the “>” symbol.

Type the following code into your terminal:

echo “Don’t worry about what this line of code is doing”>test.txt

awk ‘{print “Hello World !”}’ test.txt > Hello.txt

If all went well you should see a file called “Hello.txt” in your current folder with the message “Hello World!” inside. Pretty cool, huh?

**Keep your Hello.txt file. You will need to turn it in.

Let’s try doing something a little more useful with awk. DNA sequence data often comes in FASTA format, a text format which always has the following pattern:

>Sequence1

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

>Sequence2

CTGGGTCCCTTTTGCCCCCCCCCCCCCCCCCCCAAAAAAAAAAAA

The “>” symbol designates the header line which typically contains the sequence name and other relevant information. Following the header line is the sequence corresponding to that header. A file with multiple headers and sequences is called a multi-FASTA file. Go to the module_0 folder on Blackboard and download the multi-FASTA file test.fasta and save it in your Module 0 directory.

Let’s use awk to grab the header lines and display them. Type the following code into your terminal.

awk ‘/>/ {print}’ test.fasta

You should see two header lines in your terminal. The “//” is telling awk to do pattern matching. We put “>” between the “//” to let awk know that we only want lines that contain “>” in them (the header lines). Notice that the awk rule “/>/” goes outside to the left of the curly brackets, but still within the single quotes.

Now let’s use code to determine how many sequence entries are in the file. Since there is a one-to-one correspondence between headers and number of sequences, we just have to count the number of header lines to determine how many sequences are in the file. You can do that with the following code:

awk ‘/>/ {print}’ test.fasta | wc -l

You should see “2” displayed in your terminal. You just set up a pipeline! The “|” symbol fed the output from awk into the unix word count utility called “wc.” We added the “-l” flag, which told the word count tool to count the number of lines which corresponds to the number of headers and equals the number of sequences in the file. There are many other unix utilities like word count that you have access to in the command line. We could also write our own line counting program using awk:

awk ‘!/>/ {print}’ test.fasta | awk ‘{sum +=1} END {print sum}’

In the above line the output (two header lines) was piped into another awk program. In the second program we made a variable called “sum” which we add one to for every line in the input and then at the END we print the sum.

What happens if you delete the “END” statement? Try it!

The important take away is that awk executes the action in the curly brackets for every line in the input, unless you include the END statement. With the END statement the following action is only completed once, after all the lines in the input have been processed.

Instead of counting the headers, why not count the number of sequences directly? We could with:

awk ‘!/>/ {sum +=1} END {print sum}’ test.fasta

We added “!” in the above line to tell awk to fetch us any line that doesn’t contain “>”.

Notice that the above code gave us the same answer as our previous program. This was a happy accident. Standard FASTA format only records 80 bases of sequence per line. This means if one of our sequences was 160 bases long it would have taken up two lines and the program above would have told us we had three sequences instead of the correct answer, two. Counting headers will always give us the right answer and therefore our original code is better.

Awk’s default behavior is to move through a file one line at a time and execute your code on each line unless END or BEGIN is placed before your code block. Our FASTA file only has one field (one column), but often common text files created and used by bioinformatic tools contain multiple fields. Let’s take a quick look at how awk can be used on files containing multiple fields. Please go to Blackboard and download the file “blast_output.txt” to your Module_0 directory.

BLAST is used for aligning primary biological sequence data. It is often used to identify homology in sequences, which can be used to infer biological funtion of a gene or protein.

Our BLAST output file contains twelve fields. Use the head command in the terminal to take a look at the first few lines of the file.

head blast_output.txt

Awk automatically recognizes that there are twelve fields because they are separated by white space (a tab in this case). We can use awk to filter this file based on field values. For example, the eleventh field contains the e-value, which can be used to judge the quality of a BLAST alignment. The lower this value the stronger the evidence that the alignment is meaningful.

Let’s filter our alignment file to only keep alignments that have an e-value of zero.

awk ‘$11==0 {print}’ blast_output.txt > filtered_hits.txt

In the above statement we use “$11” to tell awk to look at the eleventh column and then we use the comparison operator “==” to test if the line has a zero in its eleventh field and if it does we print the line to a new file called “filtered_hits.txt”.

**Keep your filtered_hits.txt file. You will need to turn it in.

Congrats! You finished the first module. In the next module you will learn how to process raw FASTQ data fresh off the sequencer.

Please email the following three files to ch728@cornell.edu:

*The final file produced by the Galaxy 101-1 activity (BED file with the five top regions)

*The Hello.txt file you made

*The filtered_hits.txt file