## Understand the contents of your file

When you want to do something with the contents of a file, it’s important to understand it well. Thus:

• spend some time with the file trying to understand it, and more importantly, try to understand what you need from that. For example, if you need to parse a FASTA file, you many need to extract the name of each sequence. Thus, you are interested in the lines with the followint pattern:

>SEQNAME

## Using regular expressions to extract content from a file

Regular expressions are powerful features of Python language (and other languages). With regular expressions, we

• match the lines that contain the information that we need
• extract the information from this line.

### Constructing a regular expression in python with compile

In contrast to other languages, in Python, we can pre-construct the regular expression (the pattern) that we are interested in. For example, assume that we are interested in identifying the lines that contain the sequence name in FASTA, i.e., the lines starting with an >.

import re
### seqpattern will match any line starting with an '>'
seqpattern = re.compile(r'^>')

The variable seqpattern will be able to match any line starting with a >. The special character ‘^’ means the beginning of the line. The special r in the beginning means that the following is a “raw string”, ie. backslash characters are treated literally instead of signifying special treatment of the following character.

http://docs.python.org/reference/lexical_analysis.html#literals

so ‘’ is a single newline and r’’ is two characters - a backslash and the letter ‘n’ another way to write it would be ‘\n’ because the first backslash escapes the second. Usually, we use it in regular expressions, except if we really need to parse special characters (e.g. newline). Thus:

line = ">seqName1"
if seqpattern.match(line):
print("FOUND")
else:
print("NOT FOUND")
## FOUND

The if seqpattern.match(line): means: “Is the pattern described by seqpattern found in line?”. This expression will return either True or False.

The regular expressions are not only useful to find the lines that match a criterion, but also to extract something from those lines. For example, assume that we need to extract the name.

Now, we need to change the pattern described in the regular expression. We are not only interested in the > but also in what follows that. Note the following differences:

1. I have included the \s*. This means 0 or more whitespace characters (spaces, tabs etc). This is important because some times sequences are like:
• >SEQ
• > SEQ
• > SEQ etc
1. I have removed ‘r’. This is because the \s is a special character (ONE CHARACTER, not two) meaning a whitespace. If I will include ‘r’, then it will be translated as two characters: ‘' and ’s’.

2. I have included the (.*) expression. This means two things: First, it matches any character 0 or more times. Second, it saves the match defined by whatever is inside the parentheses.

3. I have used a new variable, called myStartLine. This variable, will play two roles:
• First, if the pattern is to be found, it will receive the value True. Thus, it can be used in expressions like: if myStartLine:, which means if pattern is found.
• Second, it is a “pattern variable”, i.e. special “pattern functions” can be applied on it. For example: myStartLine.group(1) will provide whatever has been obtained by the (.*), i.e., whatever follows the > identifier (without the whitespaces). This will be a string.
## Note that I have removed 'r'
seqpattern = re.compile('^>\s*(.*)')
line = ">seqName1"
myStartLine = seqpattern.match(line)
name = ""
if myStartLine:
name = myStartLine.group(1)
print(name)
## seqName1

## Parsing GEO datasets

GEO datasets are special files that contain both information about the dataset as well as the gene expression values. They are very ‘rich’ files, meaning that they contain really a lot of information. For example, if the sample (e.g. people) is split in two classes based on whether they receive treatment or not and in two classes based on the gender (Female, male), the initial part of the file will contain information about:

• treatment, i.e., which samples have received a drug and which samples have received nothing
• gender, i.e., which samples are males and females.

For example,

^DATABASE = Geo
!Database_name = Gene Expression Omnibus (GEO)
!Database_institute = NCBI NLM NIH
!Database_email = geo@ncbi.nlm.nih.gov
!Database_ref = Nucleic Acids Res. 2005 Jan 1;33 Database Issue:D562-6
^DATASET = GDS5430
!dataset_title = Ethanol effect on lymphoblastoid cell lines from alcoholics
!dataset_description = Analysis of lymphoblastoid cell lines (LCLs), obtained from alcoholics, after treatment with ethanol in vitro. Results provide insight into the feasibility of using LCLs to study the molecular consequences of exposure to alcohol.
!dataset_type = Expression profiling by array
!dataset_pubmed_id = 25129674
!dataset_platform = GPL570
!dataset_platform_organism = Homo sapiens
!dataset_platform_technology_type = in situ oligonucleotide
!dataset_feature_count = 54675
!dataset_sample_organism = Homo sapiens
!dataset_sample_type = RNA
!dataset_channel_count = 1
!dataset_sample_count = 84
!dataset_value_type = count
!dataset_reference_series = GSE52553
!dataset_order = none
!dataset_update_date = Dec 04 2014
^SUBSET = GDS5430_1
!subset_dataset_id = GDS5430
!subset_description = ethanol
!subset_sample_id = GSM1269645,GSM1269647,GSM1269649,GSM1269651,GSM1269653,GSM1269655,GSM1269657,GSM1269659,GSM1269661,GSM1269663,GSM1269665,GSM1269667,GSM1269669,GSM1269671,GSM1269673,GSM1269675,GSM1269677,GSM1269679,GSM1269681,GSM1269683,GSM1269685,GSM1269687,GSM1269689,GSM1269691,GSM1269693,GSM1269695,GSM1269697,GSM1269699,GSM1269701,GSM1269703,GSM1269705,GSM1269707,GSM1269709,GSM1269711,GSM1269713,GSM1269715,GSM1269717,GSM1269719,GSM1269721,GSM1269723,GSM1269725,GSM1269727
!subset_type = agent

What is important for us, is the section:

^SUBSET = GDS5430_1
!subset_dataset_id = GDS5430
!subset_description = ethanol
!subset_sample_id = GSM1269645,GSM1269647,GSM1269649,GSM1269651,GSM1269653,GSM1269655,GSM1269657,GSM1269659,GSM1269661,GSM1269663,GSM1269665,GSM1269667,GSM1269669,GSM1269671,GSM1269673,GSM1269675,GSM1269677,GSM1269679,GSM1269681,GSM1269683,GSM1269685,GSM1269687,GSM1269689,GSM1269691,GSM1269693,GSM1269695,GSM1269697,GSM1269699,GSM1269701,GSM1269703,GSM1269705,GSM1269707,GSM1269709,GSM1269711,GSM1269713,GSM1269715,GSM1269717,GSM1269719,GSM1269721,GSM1269723,GSM1269725,GSM1269727
!subset_type = agent

This section starts with ^SUBSET. It defines the beginning of a section that will describe a part of the samples.

• !subset_dataset_id the dataset in which the subset belongs to. Usually, not interesting for us
• !subset_description the description of the subset. i.e., what ‘unites’ all the samples of this subset?
• !subset_sample_id the samples of the subset
• !subset_type the type of the subset, i.e., the factor from which the subset is obtained.

An example, will make things more clear. Assume that you are studying expression pattern in people who drink water vs people who drink alcohol. Then,

• !subset_type could be something like drink
• !subset_description could be something like alcohol or water
• !subset_sample_id would be the names of the sample that drink alcohol or they drink water.

Going back to the example above:

^SUBSET = GDS5430_1
!subset_dataset_id = GDS5430
!subset_description = ethanol
!subset_sample_id = GSM1269645,GSM1269647,GSM1269649,GSM1269651,GSM1269653,GSM1269655,GSM1269657,GSM1269659,GSM1269661,GSM1269663,GSM1269665,GSM1269667,GSM1269669,GSM1269671,GSM1269673,GSM1269675,GSM1269677,GSM1269679,GSM1269681,GSM1269683,GSM1269685,GSM1269687,GSM1269689,GSM1269691,GSM1269693,GSM1269695,GSM1269697,GSM1269699,GSM1269701,GSM1269703,GSM1269705,GSM1269707,GSM1269709,GSM1269711,GSM1269713,GSM1269715,GSM1269717,GSM1269719,GSM1269721,GSM1269723,GSM1269725,GSM1269727
!subset_type = agent
^SUBSET = GDS5430_2
!subset_dataset_id = GDS5430
!subset_description = untreated
!subset_sample_id = GSM1269644,GSM1269646,GSM1269648,GSM1269650,GSM1269652,GSM1269654,GSM1269656,GSM1269658,GSM1269660,GSM1269662,GSM1269664,GSM1269666,GSM1269668,GSM1269670,GSM1269672,GSM1269674,GSM1269676,GSM1269678,GSM1269680,GSM1269682,GSM1269684,GSM1269686,GSM1269688,GSM1269690,GSM1269692,GSM1269694,GSM1269696,GSM1269698,GSM1269700,GSM1269702,GSM1269704,GSM1269706,GSM1269708,GSM1269710,GSM1269712,GSM1269714,GSM1269716,GSM1269718,GSM1269720,GSM1269722,GSM1269724,GSM1269726
!subset_type = agent


we could say that we have the following hierarchy:

agent -> ethanol -> GSM1269645,GSM1269647,GSM1269649,...

agent -> ethanol -> GSM1269644,GSM1269646,GSM1269648,...

the factor agent is studied, that contained two levels: “ethanol” and “untreated”. Ethanol consumption characterizes the samples GSM1269645,GSM1269647,GSM1269649,… “untreated” contains the samples GSM1269644,GSM1269646,GSM1269648,…

Thus if I want to know which samples have received ethanol, I need to store the information of this line.

Now, I need to think what I would like to do with this information in order to see how I will use it. Let’s say that I want to create a PCA plot of the samples, that will paint one color the samples treated with ethanol and with another color the untreated samples. Thus, based on factor agent, I will color differently, its different levels (the ‘ethanol’ and the ‘untreated’). The dataset contains further data splits. For example, the factor ‘gender’, with levels ‘male’ and ‘female’.

I will think as follows:

1. I will parse the information of the subsets and I will create a hierarchical data structure as follows:
• Level 1: type: e.g. gender, agent
• Level 2: description: e.g. male, female, ethanol, untreated
• Level 3: samples Thus, if I (better the program) will be asked to color differently the points based on the agent, then, I will know directly that the ‘agent’ contains two levels (ethanol and untreated) and which samples belong to each level.
1. I will create a list with colors: a color will characterize the sample that are in one level* and another color the samples of the other level**.

2. Then, I’ll just do the PCA (well, I need to save first the expression table in numpy object).

import re
import argparse
import numpy as np
from sklearn.decomposition import PCA
## /usr/lib/python3/dist-packages/sklearn/externals/joblib.py:1: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
##   import imp
from sklearn.preprocessing import StandardScaler ## To standardize data
import matplotlib.pyplot as plt

## construct the regular expressions for the interesting subset descriptions
subsetIdentifier = re.compile('^\!subset_sample_id\s*=\s*(.*)')
subsetType = re.compile('^\!subset_type\s*=\s*(.*)')
subsetLevel=re.compile('^\!subset_description\s*=\s*(.*)')
subset = re.compile(r'^\^SUBSET')
datasetBegin = re.compile('^ID_REF\s*IDENTIFIER\s*(.*)')

## empty dictionary that will store all sample information
keys = {}

## the input file
inputFile="GDS5430.soft"

## i'm interested to split (and color) the dataset based on this factor
myType = 'agent'

## the final expression dataset
dataset = []

Please pay attention to the regular expressions setup, and get sure that you understand it.

Next we will read the file line by line. When we will find the lines that we are interested in, we will save the information in a dictionary of dictionaries. This may sounds complicated, but it is not so much actually. The main idea is to be able to separate the samples based on some annotation, e.g. ‘agent’. The agent has two levels: “ethanol” and “untreated”, and each level has a bunch of samples. So:

• let keys be a dictionary.
• it will contain the keys “agent”, “gender”, etc
• The keys["agent"] will also be a dictionary. The keys["gender"] will also be another dictionary.
• The keys["agent"] will contain two keys, the “ethanol” and the “untreated”.
• The keys["agent"]["ethanol"] will contain the samples that belong to the ethanol subset. The keys["agent"]["untreated"] will contain the samples that belong to the untreated subset.

This is the idea, let’s see how to implement it. The dictionary keys has been defined above.

flagtype = 0
flagsample = 0
flaglevel = 0

maxlines = 4000
currentline = 0
with open(inputFile, "r") as inputfile:

lines = []
levels = []
types = []
samples = []
values = []
datasetStartFlag = 0

for line in inputfile:

## This is just a demo, thus read only maxlines
currentline = currentline + 1
if currentline > maxlines:
break

## remove the ending \n newline character
line = line.rstrip()

## this is a small trick. A bit sub-optimal, but
## the code becomes more clean.
## I match all the regular expressions to each line.
## It is sub-optimal because only a single (or none) will match. So, trying to match each one is not the best. However, the code becomes more clean.
level_m = subsetLevel.match(line)
type_m = subsetType.match(line)
sample_m = subsetIdentifier.match(line)
dataset_b = datasetBegin.match(line)

## So, now I have tried to match all regular
## expression to the line. Perhaps one of them will
## match

## If the subset regular expression matches
## then, I'm in the line that the subset description
## begins. I empty the samples, and the types because
## I will store there the information that I will need.
## Also, I reset the flagtype, the flagsample and the
## flaglevel. Why?
## I use these three flags for CHECKS. Thus, I want to be sure that I have collected all the information that I need.
## When, I collect some piece of information, I switch the appropriate flag to 1 (from 0). Thus, then I'll be able to check that I have collected all the necessary information.

## Notice the continue statements. If I find a match line, there is no need to look further. There is no way that another match will be fulfilled for this line since I found one already.
if subset.match(line):
samples = []
types = []
flagtype = 0
flagsample = 0
flaglevel = 0
continue

## I get sure that a subset level match (level_m) occurs
## also, I want to get sure that the dataset (the table with the expressions) has not started yet (we should be at the preamble of the document) and that all flags are still 0.
## Note that: 1. I keep the levels (e.g. ethanol, or untreated) in the variable levels
## 2. I set the flaglevel to 1, because I found the line with the levels.
if datasetStartFlag==0 and level_m and flaglevel == 0 and flagtype == 0 and flagsample == 0:
levels = level_m.group(1)
flaglevel = 1
continue
## Here I want to catch the line with the sample inforamtion. I should enter here, only when I have already visited the line with the level information (thus I require that flaglevel == 1). Note, that I keep the samples. Also I directly split them using the ',' as a delimiter. Thus, the variable samples is a list that contains the names of the samples for associated with the level from the "if" above.
if datasetStartFlag==0 and sample_m and flaglevel == 1:
flagsample = 1
## I this line, I keep the samples, by using the sample_m.group(1) and immediately I split them and store them in a list (samples), by using the function split(','), which will split the string using a ',' delimiter
samples = sample_m.group(1).split(',')
## again, there is no need to perform the next "if" checks. I should go to the next repetition of the loop and read the next line
continue

## Here, I will store the type information. The idea is the same as in the previous "if statements"
if datasetStartFlag==0 and type_m and flagtype == 0 and flaglevel == 1 and flagsample == 1:
types = type_m.group(1)
flagtype = 1
## Here I should not put a continue. I already have the three flags set, I should go to the next "if statement" and build my structure to store the samples

## OK this is the big moment. I check that the three flags are set to 1. This means that I have all the necessary information, and moreover I still have not found the beginning of the dataset. Thus, I am still in the preamble. I know, however, the type of the data (e.g. agent) and the levels (e.g. ethanol or untreated)
if datasetStartFlag == 0 and flagtype == 1 and flagsample == 1 and flaglevel == 1:
## There is a small difficulty. The keys is a dictionary. If types is "agent", then by setting keys["agent"] will be defined. The problem is that the type "agent" will be present more than once in the preamble. One time for the "ethanol" and one more time for the "untreated". Thus, If I set keys["agent"] to something, this will be overwritten the next time I will find "agent" again. To avoid this, I use the following "if statement". If it's the first time that I see types (i.e. if types is not in keys), then I initiate keys[types] to an empty dictionary ( {} ). If I have seen the types already before, then I will not re-initiate the keys[types]. I have already done it before (for another level).
#print(types)
if types not in keys:
keys[types] = {}

## Here, I set the second key (i.e. the levels). Thus, the dictionary "keys" has to keys, the types and second (nested in the first), the levels. The value for these two keys are "samples".
keys[types][levels] = samples
## Here, I need to reset flags (otherwise it will not go below)
flaglevel = flagtype = flagsample = 0
continue

## In this check, since dataset_b is True, I should be on the header of the expression values (dataset_b checks for  ID_REF and IDENTIFIER)
#print(dataset_b, flaglevel, datasetStartFlag)
if datasetStartFlag == 0 and dataset_b:
#print("HERE")
## Here I store the samples found in the header
datasetSamples = dataset_b.group(1).split('\t')
## Here I set the class samples to 0. later on, some of these will become 1 or 2. With this way, I split the samples to the categories I want.
classSamples = [0 for i in datasetSamples]
## The next piece of code, will create a list (classSamples) with the labels j for each level. For example if there are four samples s1, s2, s3, s4 and the s1, s2 are "ethanol" and s4 are "untreated", whereas s3 should not be included (because it is neither untreated nor ethanol), then classSamples will be 1,1,0,2.
j = 0
for k in keys[myType]:
j = j+1
for i in range(len(datasetSamples)):
if datasetSamples[i] in keys[myType][k]:
classSamples[i] = j

## boolSamples now contains only the "label" for the samples with positive label.
boolSamples = [i for i in classSamples if i > 0]
datasetStartFlag = 1
continue

## Here, I should already be in the expression values section. THus, datasetStartFlag is already 1. The second condition is necessary because the last line of the file contains the "!endTableSection"

if datasetStartFlag == 1 and re.match(r'^\!', line) is None:
## I store the values, starting from the third column
tmpvalues = np.array(line.split('\t')[2::])

## Here, I keep only the values that correspond to  the samples that I want to keep
values = [tmpvalues[i] for i in range(len(classSamples)) if classSamples[i] > 0]                    ## store the values now to the dataset
dataset.append(values)

#print(dataset)
pca = PCA(n_components = 2)
dataset = np.transpose(np.array(dataset))
## standardize the dataset
dataset = StandardScaler().fit_transform(dataset)
## /usr/lib/python3/dist-packages/sklearn/utils/validation.py:595: DataConversionWarning: Data with input dtype <U9 was converted to float64 by StandardScaler.
##   warnings.warn(msg, DataConversionWarning)
## /usr/lib/python3/dist-packages/sklearn/utils/validation.py:595: DataConversionWarning: Data with input dtype <U9 was converted to float64 by StandardScaler.
##   warnings.warn(msg, DataConversionWarning)
principalComponents = pca.fit_transform(dataset)
color=[i for i in classSamples if i>0]

plt.scatter(principalComponents[:, 0], principalComponents[:, 1], c=color)

plt.show()