Tuesday, August 12, 2014

Problem 003: Pattern matching problem

Note: Beginning with this solution you will notice a change in programming style. The main functionality has been moved from the body of the script into functions. This is done in the spirit of code reuse and to facilitate the development of a separate Python module, myBioMod. We will continue to develop this module as we work through subsequent problems and implement new functionality. For readability, however, each solution will keep its new functionality inline, while all previously developed functionality will reside in myBioMod. For additional details, see Detour - myBioMod

Pattern Matching Problem: Given a target pattern and a sequence of bases, find all the locations of the target pattern in the sequence.
Input: A string, “pattern” and a string, “genome”.
Output: A list of all locations where pattern occurs in genome.

Thinking back to the first problem, the frequent words problem, actually gives us a pretty good approach for solving this one. In the frequent words problem we stepped through the input sequence one character at a time and placed each set of k characters into a dictionary. Here we can adapt that program to instead step through the input sequence one character at a time and compare each block of characters (the same length as pattern) with pattern. If it is a match, we record the current location in a list.
Starting with the pseudo-code from problem 001, we can modify it for this problem. Let’s look at how we could implement this solution in pseudo-code:
1) Read the input text file into the variables pattern (first line) and genome (second line).
2) Count the number of characters in pattern.
3) Step through genome from start to finish, reading that many characters at a time.
4) Compare each block of characters to pattern.
5) If there is an exact match, add the current location to a list.
6) When finished, print the list of locations.

# Set up cProfile to time our code.
import cProfile, pstats, StringIO
pr = cProfile.Profile()
pr.enable()

def PatternMatching(Pattern, Genome):
    # List to store locations of pattern matches
    matchList = list()

    # Number of characters in Genome
    lenGenome = len(Genome)
    # Number of characters in Pattern
    lenPattern = len(Pattern)

    # Step through Genome one character at a time and compare each
    # lenPattern-character substring against Pattern.
    for i in range(0, lenGenome - lenPattern + 1):
        testString = Genome[i:i+lenPattern]
        if testString == Pattern:
            # Store the position as a string so it can be joined for printing.
            matchList.append(str(i))
    return matchList        

f = open('problem003.txt', 'r')

# First line contains the string Pattern
Pattern = f.readline().rstrip('\n')
# Second line contains the string Genome
Genome = f.readline().rstrip('\n')

f.close()

answerList = PatternMatching(Pattern, Genome)

print ' '.join(answerList)
    
# Display the code time results
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
outFile = open('timing_results.txt','w')
print >>outFile, s.getvalue()

If you would like to test the code yourself, here is a sample data set: Problem003 dataset
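As an aside, the same matching could also be sketched with Python's built-in string method find, which returns the index of the next occurrence of a substring (or -1 if there is none). This is just an alternative sketch, not the solution profiled above, and the example input is hypothetical:

def PatternMatchingFind(Pattern, Genome):
    # Alternative sketch using the built-in str.find; not the solution timed above.
    matchList = []
    i = Genome.find(Pattern)
    while i != -1:
        matchList.append(str(i))
        # Resume the search one character after the current match so that
        # overlapping occurrences are still found.
        i = Genome.find(Pattern, i + 1)
    return matchList

# Hypothetical example input: prints 1 3 9
print ' '.join(PatternMatchingFind('ATAT', 'GATATATGCATATACTT'))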

Monday, August 11, 2014

Problem 002: Reverse complement problem

Some brief background first.  DNA spends most of its time in a double-stranded form; i.e., it does not typically exist as a single sequence of bases such as AGTATCCATC, but as two sequences that wrap around each other in a helix.  However, you cannot just put any two sequences together in this manner.  In order for DNA sequences to form a double strand, each base in one sequence must be matched with a complementary base on the opposing strand.  These complements are as follows:
A pairs with T
C pairs with G
G pairs with C
T pairs with A

For example, this is a valid sequence and complement pair:
ACTTATCG
TGAATAGC

This is not:
ACTTATCG
GACCTCAT

Additionally, DNA has directionality.  Imagine the bases as Lego building blocks: there is the bumpy end and the end that the bumps plug into.  With DNA these are referred to as the 5’ and 3’ ends.

All of this is important because if we split the valid sequence and complement pair given above, we do not end up with the two sequences ACTTATCG and TGAATAGC.  Instead, we end up with ACTTATCG and CGATAAGT.  We refer to these as the sequence and its reverse complement.  A fuller representation of the sample data would be:

5’->ACTTATCG->3’
3’<-TGAATAGC<-5’

With this very brief and naïve background we can now define our next bioinformatics algorithm problem, finding the reverse complement.

Reverse Complement Problem: Given a sequence of bases, find the reverse complement of the sequence.
Input: A string, “sequence”.
Output: A string, “revComp”

A common first approach to solving this problem is to use the Python string replace method.  This simply replaces every occurrence of a specified pattern in the string with a specified new pattern.  However, it has to be done once for each base, which leads to a problem.  Consider the sample sequence above, sequence = “ACTTATCG”.

First you might replace all of the A’s with their complement T, such as:
sequence = sequence.replace('A','T') which yields sequence = “TCTTTTCG”
We now run into the problem of how to differentiate between the original T’s in the sequence and the ones that were previously A’s.  This often gives rise to a solution such as:
sequence = sequence.replace('A','t')
sequence = sequence.replace('T','a')
This produces the complement of the original sequence, except that instead of being in all upper case like the original, it is in all lower case.  From there it is trivial to swap the cases back to upper case (using the string swapcase method) and reverse the sequence.
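To make these intermediate steps concrete, here is a quick sketch using the sample sequence (just an illustration; the full solution follows below):

sequence = 'ACTTATCG'                  # the sample sequence from above
sequence = sequence.replace('A', 't')  # A -> t (lower case marks "already replaced")
sequence = sequence.replace('T', 'a')  # only the original T's are left to replace
sequence = sequence.replace('C', 'g')
sequence = sequence.replace('G', 'c')
print sequence                         # tgaatagc  (the complement, in lower case)
print sequence.swapcase()[::-1]        # CGATAAGT  (the reverse complement)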

Following is this solution implemented in Python.  Note that instead of printing the output to the screen as in problem 001, this solution writes the output directly to a text file, given the length of the output.

# Set up cProfile to time our code.
import cProfile, pstats, StringIO
pr = cProfile.Profile()
pr.enable()

f = open('problem002.txt', 'r')
g = open('output.txt', 'w')

sequence = f.readline().rstrip('\n')
f.close()

# Replace each base with their complement, in lowercase.
sequence = sequence.replace('T','a')
sequence = sequence.replace('A','t')
sequence = sequence.replace('C','g')
sequence = sequence.replace('G','c')
# Reverse the sequence.
sequence = sequence[::-1]
# Swap all of the bases back to uppercase.
sequence = sequence.swapcase()

# Write the output to a text file.
g.write(sequence)
g.close()

print 'Finished'

# Display the code time results
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
outFile = open('timing_results.txt','w')
print >>outFile, s.getvalue()


It turns out that this is actually a pretty efficient solution, but is there another approach we could try?  In Python, strings are immutable; one effect of this is that you cannot access a single character in a string by index and change it.  That prevents us from stepping through the string one character at a time and replacing each base with its complement in place.  Instead, we need to convert the string into a list, step through the list replacing each element with its complement, and then convert the list back into a string.  This could be implemented in Python as:

# Set up cProfile to time our code.
import cProfile, pstats, StringIO
pr = cProfile.Profile()
pr.enable()

f = open('problem002.txt', 'r')
g = open('output.txt', 'w')

sequence = f.readline().rstrip('\n')
f.close()

# Create a dictionary of the bases and their complements
complements = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
# Convert the string (sequence) to a list (bases).
bases = list(sequence)
# Reverse the list
bases.reverse()
# Replace each base with its complement
bases = [complements[base] for base in bases]
# Convert the list (bases) back into a string (sequence)
sequence = ''.join(bases)

# Write the output to a text file.
g.write(sequence)
g.close()

print 'Finished'

# Display the code time results
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
outFile = open('timing_results.txt','w')
print >>outFile, s.getvalue()


The Pythonic way.

Once you start using Python, it probably won't be too long until you hear someone discussing whether a particular solution is or is not Pythonic. So what does that mean? As Python has evolved, the community has settled on a number of idioms as being the correct way to do things in Python. This in turn has influenced Python's development, so that many of these idioms have built-in support. A word of caution, though: a truly Pythonic solution typically cannot be translated directly to another programming language.

Now that we have developed two different algorithms for the reverse complement problem, let us look at a Pythonic solution that uses the built-in string method translate. Using this, we can replace our previous algorithms with a single line: sequence = sequence[::-1].translate(maketrans('ACGT','TGCA'))
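Before breaking it down, here is a quick check on the sample sequence from the introduction above (just an illustration, not the timed solution):

from string import maketrans

table = maketrans('ACGT', 'TGCA')        # A->T, C->G, G->C, T->A
print 'ACTTATCG'.translate(table)        # TGAATAGC  (the complement)
print 'ACTTATCG'[::-1].translate(table)  # CGATAAGT  (the reverse complement)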
While this one-liner looks much different from algorithm 2, it is actually very similar. maketrans builds a translation table (conceptually, a mapping) using the first parameter to specify the original characters (the 'keys') and the second parameter to specify what they should become (the 'values'). translate then simply iterates over the string, looks up each character in the table, and replaces it with the appropriate value. Here is what the full solution looks like:

# Set up cProfile to time our code.
import cProfile, pstats, StringIO
pr = cProfile.Profile()
pr.enable()

from string import maketrans

f = open('problem002.txt', 'r')
g = open('output.txt', 'w')

sequence = f.readline().rstrip('\n')
f.close()

sequence = sequence[::-1].translate(maketrans('ACGT','TGCA'))

g.write(sequence)
g.close()

print 'Finished'

# Display the code time results
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
outFile = open('timing_results.txt','w')
print >>outFile, s.getvalue()
If you would like to test the code yourself, here is a sample data set: Problem002 dataset

Wrap Up
So which algorithm was best?

Comparing timing results with our initial dataset, linked above, shows essentially no difference in speed:
Algorithm 1: 0.015 seconds
Algorithm 2: 0.015 seconds
Algorithm 3: 0.017 seconds

As is often the case, differences in algorithm performance do not become apparent with small or trivial datasets.  The initial dataset contained 8,860 bases.  To stress the algorithms a bit more, the initial dataset was repeated until it was 318,960 bases in length.  The timing results now show a bit of a difference:
Algorithm 1: 0.107 seconds
Algorithm 2: 0.052 seconds
Algorithm 3: 0.035 seconds

But does this difference ever become a problem?  How about with real-life data?  Testing the algorithms with the Vibrio cholerae genome (1,108,250 bases) yields:
Algorithm 1: 0.114 seconds
Algorithm 2: 0.068 seconds
Algorithm 3: 0.032 seconds

And testing with the E. coli genome (4,639,675 bases) yields:
Algorithm 1: 0.519 seconds
Algorithm 2: 0.266 seconds
Algorithm 3: 0.103 seconds

Across these (still relatively small) datasets, it appears that Algorithm 2 is consistently ~2x as fast as Algorithm 1.  In Big-O notation, both algorithms would be categorized as O(N).  In other words, both algorithms grow linearly, in direct proportion to the size of the input dataset.  Therefore, there is no compelling asymptotic argument for one algorithm over the other, so in most cases it comes down to personal preference.
Algorithm 3, however, does appear to offer a significant improvement over the other two. Since the built-in string method translate is implemented in compiled C code, we expect it to perform better than Algorithm 2, despite their functional similarity. Also note that the execution time of Algorithm 3 appears to increase in a step-wise manner.

NOTE1:  The performance of all 3 algorithms was still only tested with relatively small datasets.  Performance may change drastically with LARGE datasets.  This might be something interesting to try, perhaps with: 500,000,000 bases,  1,000,000,000 bases, 4,000,000,000 bases, etc… 


NOTE2:  Even though performance appears to scale the same for both algorithms, there are still practical considerations.  For example, if your dataset is large enough that Algorithm 1 requires 36 hrs to complete, you may very well be interested in using Algorithm 2 if it reduces the time to 18 hrs.

Our Approach

Python is used to implement the bioinformatics algorithms that follow, simply because that was the language that my colleague had started with.  Python is a fine language for many reasons, not least of which is the fact that it seems to be easier for many beginners to learn, so there was no compelling reason to switch.

Python is an interpreted language, versus a compiled language like C, which many people equate with being ‘slower’.  However, with good algorithm design it is possible to solve all of these bioinformatics challenges well within a 5-minute time limit.  The “traditional” implementation of Python (a.k.a. CPython) from www.python.org was used, and everything was coded against the 2.7.x branch, not the 3.x branch.  A few add-on packages are used for some solutions, but they will be pointed out in the problems that use them.  PyPy was used occasionally for timing comparisons, but none of the solutions require it.  Finally, the solutions are not dependent on high-end development hardware.  My colleague’s somewhat older laptop (~4 years old) was used for all of the problems and was able to produce solutions within the specified time limits.  Specs for the laptop are: Dell Latitude E4310, Windows 7 64-bit, dual-core 2.40 GHz processor, and 4 GB RAM.  Not the worst system in the world, but definitely not cutting edge.

This is not to say all of the algorithms are optimal and cannot be improved.  Generally, if an algorithm provided a solution within the parameters of the problem then it was left alone.

A quick soapbox
Now for a quick rant against Python.  Actually, this is not specific to Python; it applies to many other languages, such as R, that are traditionally implemented as interpreted languages.  And to be fair, the rant is against the APPROACH that many people bring to programming projects in these languages, not the languages themselves.

Interpreted languages are also commonly referred to as ‘scripting’ languages; i.e., you write ‘scripts’, not ‘programs’.  Yes, this is purely semantics, BUT to many people scripting implies an approach that requires much less rigor (or no rigor at all).  The scripting mindset tends to be completely opposed to concepts such as planning, developing pseudo-code, formal debugging, etc.  Unfortunately, this does not just manifest itself in how the code is written (i.e., tweaking code to get it to work, or to work better, as opposed to debugging and profiling), but in how the work itself is performed.

Imagine Jeff Foxworthy as a computer science graduate, “You might be a scripter / hacker, if:”
You think comments are for wimps and whitespace or indenting is sheer folly.
You think a solution CANNOT be optimal unless it fits on a single line.
You think variable names should be as uninformative as possible, preferably a single character.
You think randomly changing parts of code is the ideal way to debug or optimize.

Obviously changing the mindset of everyone so as to reach consensus on the best approach to programming is not a battle that can be won.  Nor is this to say that you CANNOT be a good programmer if you prefer the scripting / hacking approach.  HOWEVER, this approach to programming is NOT the most conducive to a teaching environment.  Therefore the solutions that follow attempt to describe the problem being solved, explain the method of developing the final algorithm, and use generally accepted coding practices.  If there is a decision to be made between clarity and compactness, the intent is to always choose clarity.


Besides, if you really don’t like this approach, you can always remove the comments and condense the code down to one line afterwards. :)

Bioinformatics Algorithms - Intro

Working for a biotech company, I have many chances to interact with various programmers and with many biologists / scientists.  Recently, one of my scientist colleagues approached me for some help with Python programming.

Not content with just knowing the science aspect of our business, he was also interested in learning the programming and bioinformatics side of it.  A very commendable attitude, and to this end he had stumbled across a code challenge site that claimed to be “… a platform for learning bioinformatics through problem solving”.  For those of you not familiar with these, code challenge sites such as Project Euler, Rosalind, or Python Challenge offer a series of problems of increasing complexity / difficulty that programmers can work through, often with time limits applied.  The first problems are typically relatively easy to get you warmed up, but they often scale up quite quickly in terms of the complexity or novel algorithmic approaches you have to employ to solve them.  These sites are great for bragging rights among other programmers, or for forcing you to hunt down and study up on esoteric approaches.  However, almost without exception, these sites are not intended as TEACHING platforms.  Yes, you may learn a lot in getting to the point where you can complete a challenge, but the sites themselves are not typically where you would do that learning.  I would liken this to a college course where the professor has no book, does not lecture, and only assigns problems.

Predictably, my colleague was able to complete the first few problems before running into a brick wall, hard!  When he came to me he was completely stumped and did not have any ideas for how to move forward.  Actually, that is not completely fair: initially he did ask me if we could try his code on one of our development servers, which he assumed (rightly) would be faster than his laptop.  Practically speaking, though, that is not a scalable solution to a suboptimal algorithm.

This section of our blog narrates our journey through all of the bioinformatics problem sets he encountered.  Instead of just providing a solution though (yes, there is source code, don’t worry), we attempt to explain HOW we developed the algorithms.  Hopefully you will have fun with this and learn a little bit about bioinformatics algorithms along the way.


Welcome! 

Problem 001: Frequent words problem

Yes, we went all the way back to the first problem, despite the fact that my colleague was able to solve it on his own.

Problem Statement: Given a series of characters, find the word(s) of specified length, k, that occur most frequently.

Many of the issues encountered while trying to solve this problem are due to how the problem is stated and the assumptions that are in place. Since this is a bioinformatics problem, the given input is naturally some sequence of the 4 different bases that make up DNA: ‘A’, ‘G’, ‘C’, ‘T’, such as ‘AACGTTCGGTTCCGAATATGCATG’.

Further, in biology, a sequence of some specified number of bases is commonly referred to as a k-mer, where k indicates how many bases are in the sequence.

With this in mind, the problem would be more formally stated as follows…

Frequent Word Problem: Find the most frequent k-mer(s) in a given DNA sequence.
Input: A string, “genome”, and an integer, “k”.
Output: All of the most frequently occurring k-mer(s) in genome.
Example: k = 3, genome = ‘AACGTTCGGTTCCGAATATGCATG’

A not uncommon initial approach to solving this problem is:
1) Find all possible 3-mers, i.e. AAA, AAG, AAC, …, TTT
2) Count how often each 3-mer occurs in genome.

Let’s change how the problem is stated though and see if a different approach presents itself. Suppose instead of the above problem statement we had:
Frequent Word Problem: Find the most frequent word of specified length in a given text.
Input: A string, “text”, and an integer length, “l”.
Output: All of the most frequently occurring words of specified length in the given text.
Example: length = 3, text = “The quick brown fox jumps over the lazy dog.”

Given this problem definition, MOST people would not automatically think to generate EVERY possible 3 letter word. Instead a solution would probably look something like this:
1) Find each 3 letter word (or sequence of 3 letters) that occurs in text.
2) Count each word that appears and return the most frequent one(s).

Simply changing how we think about the problem leads to a much different solution. So which approach is better?

In approach 1, there are 4 x 4 x 4 = 64 possible 3-mers. That means the original input string, genome, has to be parsed 64 times. That may not present much of a challenge for the given example, but what happens if k = 10 (4^10 = 1,048,576 possible 10-mers) and genome has 10,000 or more characters? In approach 2, however, we require only a single pass through the input text, regardless of how many letters are in the word. Clearly this second approach is going to be more scalable as the word length and/or text length increases.

Let’s look at how we could implement this solution in pseudo-code:
1) Read the input text into a variable named genome.
2) Step through genome from start to finish, reading k characters at a time.
3) Place each group of k characters into a dictionary, counting how many times it occurs.
4) Sort the dictionary by how frequently each k-mer occurs.
5) Return the most frequent k-mer(s) from the dictionary.

Shown below is the solution implemented in Python:

# Set up cProfile to time our code.
import cProfile, pstats, StringIO
pr = cProfile.Profile()
pr.enable()

# Open the input text file.
f = open('problem001.txt', 'r')

myDictionary = {}

# First line contains the string Text
genome = f.readline().rstrip('\n')
# Second line contains k
k = int(f.readline().rstrip('\n'))

f.close()
        
# Number of characters in the string
numChars = len(genome)

# Step through the string 1 char at a time and substring out each k characters
for i in range (0, numChars - k + 1):
    kmer = genome[i:i+k]
    if kmer in myDictionary:
        myDictionary[kmer] += 1
    else:
        myDictionary[kmer] = 1

maxValue = 0
answer = ''
# Sort the dictionary keys by count (descending) and collect the most frequent k-mer(s)
for w in sorted(myDictionary, key=myDictionary.get, reverse=True):
    if myDictionary[w] >= maxValue:
        maxValue = myDictionary[w]
        answer += w + ' '
        print w, myDictionary[w]

print 'answer'
print answer 

# Display the code time results
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
outFile = open('timing_results.txt','w')
print >>outFile, s.getvalue()

If you would like to test the code yourself, here is a sample data set: Problem001 dataset

It is recommended that you try running the program as is and then comment out line 36, the print w, myDictionary[w] statement inside the sorting loop (add a # at the beginning), in order to compare timing results. On our system the timing results were:
1) With line 36 executed: 0.044 seconds
2) With line 36 commented out: 0.011 seconds
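
As an aside, the dictionary-counting portion of this solution could also be written with collections.Counter from the standard library (available in Python 2.7). This is only a sketch of an alternative, not the code that was timed above:

from collections import Counter

def FrequentKmers(genome, k):
    # Count every k-mer in a single pass over genome.
    counts = Counter(genome[i:i+k] for i in range(len(genome) - k + 1))
    maxCount = max(counts.values())
    # Return every k-mer that occurs maxCount times.
    return [kmer for kmer, count in counts.items() if count == maxCount]

# Example from the problem statement: k = 3
print ' '.join(FrequentKmers('AACGTTCGGTTCCGAATATGCATG', 3))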

Wrap Up
Hopefully this was a pretty straightforward and easy introduction to the topic and approach being utilized. Even though this was the first problem set and therefore relatively simple, there are two axioms to point out that should be kept in mind throughout the rest of the problems.

Axiom 1 - Whenever possible, limit parsing of the initial data set to one pass. i.e. Do NOT needlessly parse the data set multiple times!

Printing results at various stages is invaluable for debugging, BUT it can add significantly to execution time. In this example very little data was printed to the screen, yet it still increased execution time by a measurable amount! Each print operation carries a fixed minimum overhead, regardless of the amount of data printed. This leads to our second axiom:
Axiom 2a - Limit unnecessary print operations to debugging.
Axiom 2b - Where possible, combine multiple print operations into a single print operation (see the sketch below).
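
As a minimal sketch of Axiom 2b (the k-mers and counts here are made up purely for illustration):

# Instead of calling print inside the loop for every k-mer...
counts = {'ATG': 12, 'GCA': 9}             # hypothetical k-mer counts
lines = []
for kmer, count in counts.items():
    lines.append('%s %d' % (kmer, count))  # build the output in memory
# ...print everything once at the end.
print '\n'.join(lines)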

Bioinformatics Algorithms with Python

This tutorial series works through a set of problems in bioinformatics algorithms like you might find on various code challenge sites. Unlike the typical code challenge site though, this series attempts to explain and teach the solution to each problem as opposed to just testing knowledge that you already have. Have fun learning!

Bioinformatics Algorithms - Intro
Approach
Problem 001: Frequent words problem
Problem 002: Reverse complement problem
Problem 003: Pattern matching problem

New series - Bioinformatics Algorithms

A new series is going up and is permalinked on the menus above. If you are interested in Bioinformatics Algorithms in Python then this is for you. From the menu it is Programming -> Python -> Bioinformatics Algorithms. Or you can use this link: Bioinformatics Algorithms