Pattern Matching Problem: Given a target pattern and a sequence of bases, find all the locations of the target pattern in the sequence.
Input: A string, “pattern” and a string, “genome”.
Output: A list of all locations where pattern occurs in genome.
Thinking back to the first problem, frequent words problem, actually gives us a pretty good approach for solving this problem. In the frequent words problem we stepped through the input sequence one character at a time and placed each set of k characters into a dictionary. In this problem we can adapt our program to instead step through the input sequence one character at a time and compare each block of k characters with pattern. If it is a match, place the current location in a list.
Starting with the pseudo code from problem 001, we can modify for this problem. Let’s look at how we could implement this solution in pseudo-code:
1) Read the input text file into a variable named pattern (first line) and genome (second).
2) Count the number of characters in pattern.
2) Step through genome from start to finish, reading number of pattern characters at a time.
3) Compare these characters to pattern.
4) If there is an exact match, place the current location in a list.
5) When finished print the list of locations.
# Set up cProfile to time our code.
import cProfile, pstats, StringIO
pr = cProfile.Profile()
pr.enable()
def PatternMatching(Pattern, Genome):
# List to store locations of pattern matches
matchList = list()
# Number of characters in Genome
lenGenome = len(Genome)
# Number of characters in Pattern
lenPattern = len(Pattern)
# Step through Genome 1 char at a time and substring out each lenPattern chars
for i in range (0, lenGenome - lenPattern):
testString = Genome[i:i+lenPattern]
if testString == Pattern:
matchList.append(`i`)
return matchList
f = open('problem003.txt', 'r')
# First line contains the string Pattern
Pattern = f.readline().rstrip('\n')
# Second line contains the string Genome
Genome = f.readline().rstrip('\n')
f.close()
answerList = PatternMatching(Pattern, Genome)
print ' '.join(answerList)
# Display the code time results
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
outFile = open('timing_results.txt','w')
print >>outFile, s.getvalue()
If you would like to test the code yourself, here is a sample data set: Problem003 dataset
No comments :
Post a Comment