Monday, August 11, 2014

Problem 002: Reverse complement problem

Some brief background first.  DNA spends most of the time in a double stranded form.  i.e. it does not typically exist as a single sequences of bases such as AGTATCCATC, but as two sequences that wrap around in a helix form.  However, you cannot just put any two sequences together in this manner.  In order for DNA sequences to form into a double strand, each base in a sequence must be matched with a complementary base on the opposing strand.  These complements are as follows:
A pairs with T
C pairs with G
G pairs with C
T pairs with A

ACTTATCG
TGAATAGC for example is a valid sequence and complement.

ACTTATCG
GACCTCAT is not.

Additionally, DNA has directionality.  Imagine the bases as Lego building blocks.  There is the bumpy end and the end that the bumps plug into.  With bases these are referred as the 5’ and 3’ ends.

All of this is important because if we split the sample valid sequence and complement given above we do not end up with the following two sequences: ACTTATCG and TGATAGC.  Instead we end up with ACTTATCG and CGATAAGT.  We refer to these as the sequence and its reverse complement.  A fuller representation of the sample data would be:

5’->ACTTATCG->3’
3’<-TGAATAGC<-5’

With this very brief and naïve background we can now define our next bioinformatics algorithm problem, finding the reverse complement.

Reverse Complement Problem: Given a sequence of bases, find the reverse complement of the sequence.
Input: A string, “sequence”.
Output: A string, “revComp”

A common first approach to solving this problem is to use the Python string replace method.  This will simply replace all specified patterns in the string with a specified new pattern.  However, this has to be done once for each base, which leads to a problem.  Consider the sample pattern above, sequence = “ACTTATCG”.

First you might replace all of the A’s with their complement T, such as:
sequence = sequence.replace(‘A’,’T’) which yields sequence = “TCTTTTCG”
We now run into a problem of how to differentiate between the original Ts in the sequence and the ones that were previously As.  This will often give rise to a solution such as:
sequence = sequence.replace(‘A’,’t’)
sequence = sequence.replace(‘T’,’a’)
This will produce the complement of the original sequence, except instead of being in all upper case like the original; it is in all lower case.  From there it is trivial to swap the cases back to uppercase (using the string swapcase method) and reverse the sequence.

Following is this solution implemented in Python.  Note that instead of printing the output to the screen as in problem 001, the output in this solution has been written directly to a text file instead.  This was due to consideration of the length of the output.

# Set up cProfile to time our code.
import cProfile, pstats, StringIO
pr = cProfile.Profile()
pr.enable()

f = open('problem002.txt', 'r')
g = open('output.txt', 'w')

sequence = f.readline().rstrip('\n')
f.close()

# Replace each base with their complement, in lowercase.
sequence = sequence.replace('T','a')
sequence = sequence.replace('A','t')
sequence = sequence.replace('C','g')
sequence = sequence.replace('G','c')
# Reverse the sequence.
sequence = sequence[::-1]
# Swap all of the bases back to uppercase.
sequence = sequence.swapcase()

# Write the output to a text file.
g.write(sequence)
g.close()

print 'Finished'

# Display the code time results
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
outFile = open('timing_results.txt','w')
print >>outFile, s.getvalue()


It turns out that this is actually a pretty efficient solution, but is there another approach we could try?  In Python, strings are immutable.  One effect of that is that you cannot access a single character by index in the string and change it.  That prevents us from stepping through the string one character at a time and replacing the base with its complement.  Instead, in order to do this we need to convert the string into a list, step through each element in the list replacing it with its complement, and then convert the list back into a string.  This could be implemented in Python as:

# Set up cProfile to time our code.
import cProfile, pstats, StringIO
pr = cProfile.Profile()
pr.enable()

f = open('problem002.txt', 'r')
g = open('output.txt', 'w')

sequence = f.readline().rstrip('\n')
f.close()

# Create a dictionary of the bases and their complements
complements = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
# Convert the string (sequence) to a list (bases).
bases = list(sequence)
# Reverse the list
bases.reverse()
# Replace each base with its complement
bases = [complements[base] for base in bases]
# Convert the list (bases) back into a string (sequence)
sequence = ''.join(bases)

# Write the output to a text file.
g.write(sequence)
g.close()

print 'Finished'

# Display the code time results
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
outFile = open('timing_results.txt','w')
print >>outFile, s.getvalue()


The Pythonic way.

Once you start using Python, it probably won't be too long until you hear someone discussing whether a particular solution is or is not Python. So what is this? As Python has evolved, the general "community" has settled on a number of solutions as being the correct way to do things in Python. This in turn has influenced Python development so that many of these idioms have built in support. A word of caution though is that a truly Pythonic solution typically cannot be directly translated to another programming language.

Now that we have developed two different algorithms for the reverse complement problem, let us look at a Pythonic solution that uses the built in string method, translate. Using this, we can replace our previous algorithms with a single line: sequence = sequence[::-1].translate(maketrans('ACGT','TGCA'))
While this looks much different than algorithm 2, it is actually very similar. Maketrans builds a dictionary using the first parameter to specify what the original characters are (the 'keys') and the second parameter to specify what they should become (the 'values'). Translate then simply iterates over the string and uses the dictionary to look up each character and replace it with the appropriate value. Here is what the full solution looks like:

# Set up cProfile to time our code.
import cProfile, pstats, StringIO
pr = cProfile.Profile()
pr.enable()

from string import maketrans

f = open('problem002.txt', 'r')
g = open('output.txt', 'w')

sequence = f.readline().rstrip('\n')
f.close()

sequence = sequence[::-1].translate(maketrans('ACGT','TGCA'))

g.write(sequence)
g.close()

print 'Finished'

# Display the code time results
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
outFile = open('timing_results.txt','w')
print >>outFile, s.getvalue()
If you would like to test the code yourself, here is a sample data set: Problem002 dataset

Wrap Up
So which algorithm was best?

Comparing timing results with our initial dataset, linked above, shows no difference in speed:
Algorithm 1: 0.015 seconds
Algorithm 2: 0.015 seconds
Algorithm 3: 0.017 seconds

As is often the case, the performance of an algorithm does not become apparent with small or trivial datasets.  The initial data set contained 8,860 bases.  To stress the algorithm a bit more, the initial dataset was repeated until it was 318,960 bases in length.  Timing results now show a bit of difference:
Algorithm 1: 0.107 seconds
Algorithm 2: 0.052 seconds
Algorithm 3: 0.035 seconds

But does this difference ever become a problem?  How about with real life data?  Testing the algorithms with the vibrio cholerae genome (1,108,250 bases) yields:
Algorithm 1: 0.114 seconds
Algorithm 2: 0.068 seconds
Algorithm 3: 0.032 seconds

And with the E-coli genome (4,639,675 bases) yields:
Algorithm 1: 0.519 seconds
Algorithm 2: 0.266 seconds
Algorithm 3: 0.103 seconds

With small datasets at least it appears that Algorithm 2 is consistently ~2x as fast as Algorithm 1.  If we were using Big-O notation, both algorithm s would be categorized as O(N).  In other words, both algorithms grow linearly and in direct proportion to the size of the input data set.  Therefore, there is no compelling argument to use one algorithm vs the other so for most cases it just comes down to personal preference. 
Algorithm 3 however does appear to offer a significant improvement over the other two algorithms. Since the built in string method, translate, is implemented in compiled c code we expect it to perform better than algorithm 2, despite their functional similarity. Also note that the execution time of algorithm 3 appears to increase in a step-wise manner.

NOTE1:  The performance of all 3 algorithms was still only tested with relatively small datasets.  Performance may change drastically with LARGE datasets.  This might be something interesting to try, perhaps with: 500,000,000 bases,  1,000,000,000 bases, 4,000,000,000 bases, etc… 


NOTE2:  Even though performance appears to scale the same for both algorithms, there are still practical considerations.  For example, if your dataset is large enough that Algorithm 1 requires 36 hrs to complete, you may very well be interested in using Algorithm 2 if it reduces the time to 18 hrs.

No comments :

Post a Comment