Obtaining overrepresented motifs in DNA sequences, part 1
Phase 2, motifs May 9th, 2008Changing gears now, leaving behind Pfam alignments. I decided to start a new series of posts based on the conversion of some small C++ programs I developed in the past. These small programs (I call them modules because they were part of a larger application) were used to count motifs, short nucleotide words up to 10-12 base pairs, and then calculate statistical overrepresentation of these words by comparing a foreground set of DNA sequences against a background set.
We will start comparing the different approaches of the C++ and the Python codes and point out advantages and disadvantages of doing it in one language or the other. First thing we need to do is to count the motifs in all sequences from our foreground and background sets. For the project I was working on, the ideal word length was 10 nucleotides. Basically our C++ approach to increase speed was to transform the character DNA sequences into numbers and then, while sliding a window with the desired word length, hash the base-four numbers into base 10 and increment a vector position, previously initialized with 0. For four nucleotides and a word size of 10 there are 1,048,576 permutations possible, from AAAAAAAAAA to TTTTTTTTTT. Initially the C++ program would do
for(j = 0; j < seqsize; j++)
{
seqfile.get(base);
if(base != '\n')
{
switch(base)
{
case 'A':
bseq.push_back(0);
break;
case 'C':
bseq.push_back(1);
break;
case 'G':
bseq.push_back(2);
break;
case 'T':
bseq.push_back(3);
break;
}
}
}
reading all sequences and pushing an figure for each nucleotide in a vector, and then sliding a window on this vector and hashing the base-four number
int hashSeq(vector<short> subseq)
{
int w, i, hashvalue = 0, power;
w = subseq.size() - 1;
for(i = 0; i < subseq.size(); i++)
{
power = 0;
hashvalue += subseq[i] * pow((double)4,(double)w);
w--;
}
return hashvalue;
}
if(binseq[i].size() > motifwidth)
{
for(j = 0; j < binseq[i].size()-motifwidth+1; j++)
{
sub.assign(binseq[i].begin() + j, binseq[i].begin() + j+motifwidth);
hashed = hashSeq(sub);
nmercount[hashed]++;
sub.clear();
}
}
The whole C++ code has about 400 lines, including all the possible output formatting and printing. Timing with time the C++ executable takes a little bit less than 2 seconds to read, count and output different files.
For Python, we will use a different approach and gain a lot in code simplicity. As we want to count the number of times size-10 words appear in all sequences, we first need to generate all possible permutations (with replacement) of four nucleotides. This can be easily accomplished by using generator functions. Regular functions run until completion and the return a value. For instance, a function that calculates the factorial of 10 will return the last value only, after multiplying 10.9.8.7.6.5.4.3.2. A generator function runs until a value is available to return, yielding it and then suspending its operation until called again. The yielding part was emphasized because yield is the command used by Python to return the value and suspend the function until further notice. In the factorial function, a generator would return the intermediary factorial values up to 10.
To generate all 1 million plus permutations of 4 nucleotides we need a function similar to the one below (modified from here)
def permutations(items, n):
if n == 0:
yield ''
else:
for i in range(len(items)):
for base in permutations(items, n - 1):
yield str(items[i]) + str(base)
Basically, what this generator function does is to combine all four nucleotides in words of size 10. This is a recursive function, where the result of the function is dependent on the n-1 value calculated by the function until n equals 0. The first for loop over the items that we want to permutate (the nucleotides) and the second for recursively calls permutations<code> starting with the initial <code>n passed (10) until we reach 0. Debugging this function we will see that i is constant for each iteration of the second loop and only n changes from 10 to 0, while one by one nucleotides are joined to form a motif. It starts with AAAAAAAAAA, then AAAAAAAAAC, then AAAAAAAAAG, until it gets to a poly-T.
Our final code would look like the one below
import fasta
import sys
def permutations(items, n):
if n == 0:
yield ''
else:
for i in range(len(items)):
for base in permutations(items, n - 1):
yield str(items[i]) + str(base)
seqs = fasta.get_seqs(open(sys.argv[1]).readlines())
length = sys.argv[2]
nucleotides = ['A', 'C', 'G', 'T']
merged_seqs = ''
for i in seqs:
merged_seqs += i.sequence
for i in permutations(nucleotides, int(length)):
print i + '\t' + merged_seqs.count(i)
where we read the input sequence(s), merge them in one long string and as we generate all possible combinations we count the number of times they appear. This code running on the same input file used on the C++ executable is 60 times slower, taking in average one full minute to count motifs in 8 500 bp DNA sequences. The slowest section is the count, as the generation of all possible combinations is straightforward. We lose some speed, but gain a lot on code simplicity and clarity. Next we will modify this code to output different counts needed for the statistical analysis.

May 9th, 2008 at 11:31 pm
Maybe using xrange instead of range for looping is more efficient. xrange uses generator and might take less memory (better caching ? faster ?) ?
May 10th, 2008 at 1:14 am
Hi,
If the line
merged_seqs += i.sequence
is concatenating strings then maybe something like
merged_seqs = ”.join(i.sequence for i in seqs)
would be faster as += is slow for Pythons immutable strings.
- Paddy.
May 10th, 2008 at 3:24 am
merged_seq.count(i) scans through the entire sequence counting the occurances of i. For a 10-element sequence the “for i in permutations(…):” causes merged_seq to be scanned over a million times.
It can be scanned once using slices of merged_seq as a key to a defaultdict as follows:
from collections import defaultdict
from random import choice
length = 10
#generate a random sequence for testing
#need to use a tuple instead of a list, because list slices cannot
# be used as keys in a dict
merged_seqs = tuple(choice(’ACGT’) for n in range(8*500))
#for a missing key, the dict entry is initialized to zero
counts = defaultdict(int)
#count the ‘length’-element subsequences in merged_seq
# takes about 5 second to scan a million item merged_seq
for n in range(len(merged_seqs)-length):
counts[merged_seqs[n:n+length]] += 1
#counts.keys() will then return the nucleotide sequences that were
#actually in merged_seqs
#print out the sequences that occur more than once
for seq,count in counts:
if count >= 2:
print seq,count
#print most common sequences
max_count = max(counts.values())
for seq,count in counts.values():
if count == max_count:
print seq
#print sequences in decending order of frequency
# counts.items returns (seq,count) tuples
# sorted says to sort the tuples,
# key= itemgetter(1) says to use element 1 of the tuple (the count)
# as the sort key
# reverse=True says to sort from largest to smallest (default is smallest first)
for seq,count in sorted(counts.items(),key=itemgetter(1),reverse=True):
print seq,count
May 10th, 2008 at 5:57 am
Thanks for all the comments, I am also learning a lot. Keep them coming.
I will include the suggested modifications, which I think make the script even better. I was going to improve the code too as this is probably a series of 10 or more posts. This one was more focused on showing the gain in code readability in Python when comparing to C++ and to introduce yield and generator function.
Cheers
May 10th, 2008 at 8:17 am
Mike,
Beautiful!
May 10th, 2008 at 12:33 pm
Paulo, I think that my khmer package does something similar:
http://darcs.idyll.org/~t/projects/khmer/doc/
I take a different approach, though, by basing everything on a hash table; then there’s no need to generate the permutations up front.
So, take a nt string and hash every N-length word into a 4**N hash table. Then iterate over the hash table and for every nonzero count, emit the reverse hash of the table index. See the example at the bottom of the above URL.
Also note that by merging the sequences, you’re creating new sequences at the boundaries of the merge… e.g. two sequences ‘AAAAAAAAA’ and ‘TTTTTTTTT’ will be merged to create ‘AAAAAAATTTTTTTT’ and now you’ll have a bunch of words that were not in the original sequences.
cheers,
–titus
May 10th, 2008 at 1:17 pm
Does the comment posting system accept html tags such as ? (so we don’t lose the indentation in the code)
I forgot to mention that the loop:
for n in range(len(merged_seqs)-length):
counts[merged_seqs[n:n+length]] += 1
also works if merged_seqs is a string, as in the original post.
May 10th, 2008 at 1:18 pm
That was “tags such as pre”, and it didn’t.
May 10th, 2008 at 1:55 pm
From the description it appears that one wouldn’t need to merge all the sequences. All that would be needed is to prefix each successive sequence with the last N-1 bases of the previous sequence, N being the word length. I am taking the liberty of posting a somewhat complete implementation (lot of overlap with Mike’s suggestions). A cursory profiling of the script shows that the bulk of the time is in generating the various permutations, followed by the counting. If the length of the message is greater than what is normally acceptable for comments, my apologies.
“”"
Count the number of occurences of N-mers in a sequence
“”"
from collections import defaultdict
class NmerCount(object):
def __init__(self, n):
self.N = n
self.count_dict = defaultdict(int)
def __call__(self, sequence):
N = self.N
d = self.count_dict
for i in xrange(0, len(sequence)-N):
d[sequence[i:i+N]] += 1
def fasta_iterator(fasta_file):
seqfile = open(fasta_file, ‘rU’)
seqid = seq = ”
first_seq_found = False
for line in seqfile:
if line[:1] == ‘>’:
if first_seq_found:
yield seqid, seq
first_seq_found = True
seqid = line[1:].split(None, 1)[0]
seq = ”
else:
seq += ”.join(line.split())
yield seqid, seq
def cross(*sets):
“”"Generate the cross product of an arbitrary number of sets.
From http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/159975“”"
wheels = [iter(s) for s in sets]
digits = [it.next() for it in wheels]
DIGIT_RANGE = range(len(digits)-1, -1, -1)
while 1:
yield digits[:]
for i in DIGIT_RANGE:
try:
digits[i] = wheels[i].next()
break
except StopIteration:
wheels[i] = iter(sets[i])
digits[i] = wheels[i].next()
else:
break
def main():
import sys
fastafile = sys.argv[1]
N = int(sys.argv[2])
NC = NmerCount(N)
lastseq = ”
# process each sequence
numseq = 0
for seqid, sequence in fasta_iterator(fastafile):
numseq += 1
# prepend the last N-1 characters of the previous sequence
seq = lastseq + sequence
# update the n-mer count
NC(seq)
#store the last N-1 characters of the sequence so that
#it can be prepended to the next sequence
if N>1:
lastseq = seq[-(N-1):]
print ‘Total Number of Sequences =’, numseq
alphabet = ‘ACGT’
combos = (alphabet,)*N
for pattern in cross(*combos):
pattern_s = ”.join(pattern)
print pattern_s, NC.count_dict[pattern_s]
if __name__ == “__main__”:
main()
May 10th, 2008 at 2:50 pm
Thanks Paddy.
Note, however, that I am using the same basic technique as in the original C++ code. Lines 16-20 in the second listing scan through the merged sequence and pull out a subsequence of ‘length’ items. The subsequence is hashed, and the hash is used to index into an array of counts. It’s because Python provides slices and dictionaries (which hash the key) that the code is so compact and simple compared to the C++ for which Paulo had to write the hash, slice, and ‘dictionary’ code.
May 10th, 2008 at 4:41 pm
Mike: thanks a lot for the suggestions, I will include them in the next post. And thanks to Paddy too. I will try to change the comments and allow the inclusion of tags.
Titus: yes you are right and right. It’s very similar to the package you pointed out, but at the same time I am trying to be faithful to the motto of the site. Regarding the error you pointed out, I was going to explain it in the next post, which will also include a modification of the motif count. Instead of getting the number of times each motif appear overall, we would only count the motif quorum. I’m doing this trying to retrace the steps of the C++ software development which was closely related to the method we decided to use in the end.