Pattern Matching Problem: Given a target pattern and a sequence of bases, find all the locations of the target pattern in the sequence.
Input: A string, “pattern” and a string, “genome”.
Output: A list of all locations where pattern occurs in genome.
Thinking back to the first problem, frequent words problem, actually gives us a pretty good approach for solving this problem. In the frequent words problem we stepped through the input sequence one character at a time and placed each set of k characters into a dictionary. In this problem we can adapt our program to instead step through the input sequence one character at a time and compare each block of k characters with pattern. If it is a match, place the current location in a list.
Starting with the pseudo code from problem 001, we can modify for this problem. Let’s look at how we could implement this solution in pseudo-code:
1) Read the input text file into a variable named pattern (first line) and genome (second).
2) Count the number of characters in pattern.
2) Step through genome from start to finish, reading number of pattern characters at a time.
3) Compare these characters to pattern.
4) If there is an exact match, place the current location in a list.
5) When finished print the list of locations.
# Set up cProfile to time our code. import cProfile, pstats, StringIO pr = cProfile.Profile() pr.enable() def PatternMatching(Pattern, Genome): # List to store locations of pattern matches matchList = list() # Number of characters in Genome lenGenome = len(Genome) # Number of characters in Pattern lenPattern = len(Pattern) # Step through Genome 1 char at a time and substring out each lenPattern chars for i in range (0, lenGenome - lenPattern): testString = Genome[i:i+lenPattern] if testString == Pattern: matchList.append(`i`) return matchList f = open('problem003.txt', 'r') # First line contains the string Pattern Pattern = f.readline().rstrip('\n') # Second line contains the string Genome Genome = f.readline().rstrip('\n') f.close() answerList = PatternMatching(Pattern, Genome) print ' '.join(answerList) # Display the code time results pr.disable() s = StringIO.StringIO() sortby = 'cumulative' ps = pstats.Stats(pr, stream=s).sort_stats(sortby) ps.print_stats() outFile = open('timing_results.txt','w') print >>outFile, s.getvalue()
If you would like to test the code yourself, here is a sample data set: Problem003 dataset
No comments :
Post a Comment