Monday, August 11, 2014

Problem 001: Frequent words problem

Yes, we went all the way back to the first problem, despite the fact that my colleague was able to solve it on his own.

Problem Statement: Given a string of characters, find the word(s) of a specified length, k, that occur most frequently.

Many of the issues encountered while trying to solve this problem are due to how the problem is stated and the assumptions that come with it. Since this is a bioinformatics problem, the given input is naturally some sequence of the 4 different bases that make up DNA: ‘A’, ‘G’, ‘C’, ‘T’, such as ‘AACGTTCGGTTCCGAATATGCATG’.

Further, in biology, a sequence of some specified number of bases is commonly referred to as a k-mer, where k indicates how many bases are in the sequence.

With this in mind, the problem would be more formally stated as follows…

Frequent Word Problem: Find the most frequent k-mer(s) in a given DNA sequence.
Input: A string, “genome”, and an integer, “k”.
Output: All of the most frequently occurring k-mer(s) in genome.
Example: k = 3, genome = ‘AACGTTCGGTTCCGAATATGCATG’

A common initial approach to solving this problem is:
1) Generate all possible 3-mers, i.e. AAA, AAG, AAC, …, TTT.
2) Count how often each 3-mer occurs in genome.
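To make the cost of approach 1 concrete, here is a minimal sketch of it in Python (the function name brute_force_frequent_kmers is ours for illustration). Note that every possible k-mer triggers its own scan of genome:

```python
from itertools import product

def brute_force_frequent_kmers(genome, k):
    """Approach 1: enumerate every possible k-mer, scanning genome once per k-mer."""
    counts = {}
    for bases in product('ACGT', repeat=k):
        kmer = ''.join(bases)
        # One full pass over genome for THIS k-mer alone.
        count = sum(1 for i in range(len(genome) - k + 1)
                    if genome[i:i+k] == kmer)
        if count > 0:
            counts[kmer] = count
    best = max(counts.values())
    return sorted(kmer for kmer, c in counts.items() if c == best)

print(brute_force_frequent_kmers('AACGTTCGGTTCCGAATATGCATG', 3))
```

This gives the right answer, but the outer loop runs 4^k times, and each iteration walks the entire genome.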

Let’s change how the problem is stated though and see if a different approach presents itself. Suppose instead of the above problem statement we had:
Frequent Word Problem: Find the most frequent word of specified length in a given text.
Input: A string, “text”, and an integer length, “l”.
Output: All of the most frequently occurring words of specified length in the given text.
Example: length = 3, text = “The quick brown fox jumps over the lazy dog.”

Given this problem definition, MOST people would not automatically think to generate EVERY possible 3-letter word. Instead, a solution would probably look something like this:
1) Find each 3-letter word (or sequence of 3 letters) that occurs in text.
2) Count each word as it appears and return the most frequent one(s).

Simply changing how we think about the problem leads to a much different solution. So which approach is better?

In approach 1, there are 4×4×4 = 64 possible 3-mers. That means the original input string, genome, has to be scanned 64 times. That may not present much of a challenge for the given example, but what happens if k = 10 (4^10 = 1,048,576 possible 10-mers) and genome has 10,000 or more characters? In approach 2, however, we only require a single pass through the input text, regardless of how many letters are in the word or how long the text is. Clearly the second approach is going to scale better as the word length and/or text length increases.

Let’s look at how we could implement this solution in pseudo-code:
1) Read the input text into a variable named genome.
2) Step through genome from start to finish, reading k characters at a time.
3) Use a dictionary to count how many times each k-mer appears.
4) Sort the dictionary by how frequently each k-mer occurs.
5) Return the most frequent k-mer(s) from the dictionary.

Shown below is the solution implemented in Python:

# Set up cProfile to time our code.
import cProfile, pstats, io
pr = cProfile.Profile()
pr.enable()

# Open the input text file.
f = open('problem001.txt', 'r')

myDictionary = {}

# First line contains the string Text
genome = f.readline().rstrip('\n')
# Second line contains k
k = int(f.readline().rstrip('\n'))

f.close()

# Number of characters in the string
numChars = len(genome)

# Step through the string 1 char at a time and slice out each k-character substring
for i in range(0, numChars - k + 1):
    kmer = genome[i:i+k]
    if kmer in myDictionary:
        myDictionary[kmer] += 1
    else:
        myDictionary[kmer] = 1

maxValue = 0
answer = ''
# Visit the dictionary keys in descending order of frequency
for w in sorted(myDictionary, key=myDictionary.get, reverse=True):
    if myDictionary[w] >= maxValue:
        maxValue = myDictionary[w]
        answer += w + ' '
        print(w, myDictionary[w])

print('answer')
print(answer)

# Display the code time results
pr.disable()
s = io.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
outFile = open('timing_results.txt', 'w')
print(s.getvalue(), file=outFile)
outFile.close()
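
As an aside, the counting and max-finding steps can be written more concisely with collections.Counter from Python's standard library. This sketch is an alternative to, not a replacement for, the step-by-step version above:

```python
from collections import Counter

def frequent_kmers(genome, k):
    """Single pass: count every k-mer, then keep those tied for the max count."""
    counts = Counter(genome[i:i+k] for i in range(len(genome) - k + 1))
    best = max(counts.values())
    return sorted(kmer for kmer, count in counts.items() if count == best)

print(frequent_kmers('AACGTTCGGTTCCGAATATGCATG', 3))  # -> ['ATG', 'GTT', 'TTC']
```

This version makes the same single pass through genome, but lets the standard library handle the bookkeeping.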

If you would like to test the code yourself, here is a sample data set: Problem001 dataset

It is recommended that you try running the program as is, and then comment out the print statement inside the sorting loop (the one that displays each k-mer and its count; add a # at the beginning) in order to compare timing results. On our system the timing results were:
1) With the print statement executed: 0.044 seconds
2) With the print statement commented out: 0.011 seconds

Wrap Up
Hopefully this was a pretty straightforward and easy introduction to the topic and approach being utilized. Even though this was the first problem set and therefore relatively simple, there are two axioms to point out that should be kept in mind throughout the rest of the problems.

Axiom 1 - Whenever possible, limit parsing of the initial data set to one pass. i.e. Do NOT needlessly parse the data set multiple times!

Printing results at various stages is invaluable for debugging, BUT it can significantly add to execution time. In this example there was very limited data printed to the screen, yet it still managed to increase execution time by a measurable amount! Each print operation carries a minimum timing load, regardless of the amount of data printed. This leads to our second axiom:
Axiom 2a - Limit unnecessary print operations to debugging.
Axiom 2b - Where possible, combine multiple print operations into a single print operation.
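
To illustrate Axiom 2b, the per-k-mer prints in the loop above could be collected and emitted with a single print call. A minimal sketch (the counts dictionary here is made up for illustration):

```python
counts = {'GTT': 2, 'TTC': 2, 'ATG': 2}  # illustrative data

# Many print calls: one per item, each paying the fixed per-call overhead.
for kmer, count in counts.items():
    print(kmer, count)

# One combined print call: build the output first, then print once.
lines = [f'{kmer} {count}' for kmer, count in counts.items()]
print('\n'.join(lines))
```

Both versions display the same text; the second simply pays the print overhead once instead of once per k-mer.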
