Tuesday, August 12, 2014

Problem 003: Pattern matching problem

Note: Beginning with this solution you will notice a change in programming style. Main functionality has now been moved from the main body of code to a function(s). This is done in the spirit of code reusability and to facilitate the development of a separate Python module, myBioMod. We will continue to develop this module as we go through subsequent problems and implement new functionality. However, for readability we will keep all new functionality within the current solutions. All previous functionality will reside in myBioMod. For additional details, see Detour - myBioMod

Pattern Matching Problem: Given a target pattern and a sequence of bases, find all the locations of the target pattern in the sequence.
Input: A string, “pattern” and a string, “genome”.
Output: A list of all locations where pattern occurs in genome.

Thinking back to the first problem, frequent words problem, actually gives us a pretty good approach for solving this problem. In the frequent words problem we stepped through the input sequence one character at a time and placed each set of k characters into a dictionary. In this problem we can adapt our program to instead step through the input sequence one character at a time and compare each block of k characters with pattern. If it is a match, place the current location in a list.
Starting with the pseudo code from problem 001, we can modify for this problem. Let’s look at how we could implement this solution in pseudo-code:
1) Read the input text file into a variable named pattern (first line) and genome (second).
2) Count the number of characters in pattern.
2) Step through genome from start to finish, reading number of pattern characters at a time.
3) Compare these characters to pattern.
4) If there is an exact match, place the current location in a list.
5) When finished print the list of locations.

# Set up cProfile to time our code.
import cProfile, pstats, StringIO
pr = cProfile.Profile()
pr.enable()

def PatternMatching(Pattern, Genome):
    # List to store locations of pattern matches
    matchList = list()

    # Number of characters in Genome
    lenGenome = len(Genome)
    # Number of characters in Pattern
    lenPattern = len(Pattern)

    # Step through Genome 1 char at a time and substring out each lenPattern chars
    for i in range (0, lenGenome - lenPattern):
        testString = Genome[i:i+lenPattern]
        if testString == Pattern:
            matchList.append(`i`)
    return matchList        

f = open('problem003.txt', 'r')

# First line contains the string Pattern
Pattern = f.readline().rstrip('\n')
# Second line contains the string Genome
Genome = f.readline().rstrip('\n')

f.close()

answerList = PatternMatching(Pattern, Genome)

print ' '.join(answerList)
    
# Display the code time results
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
outFile = open('timing_results.txt','w')
print >>outFile, s.getvalue()

If you would like to test the code yourself, here is a sample data set: Problem003 dataset

No comments :

Post a Comment