Obtaining overrepresented motifs in DNA sequences, part 4
Phase 2, motifs May 12th, 2008We found a way to make the Python script as good as or better than the C++ executable. But for the analysis we need to do, motif counts are not the value we want. We need the quorum: the number of sequences the motif is present at least once. For instance, if the desired motifs was AAACCCTTTG we will check in which sequences this word was present. Let’s say in a cluster of 10 sequences, we would find it in sequences 1, 2, 3, 4 and 5, giving us a quorum of 5 out of 10, or 50%. The quorum will be used in the future in the statistical calculation in order to determine the overrepresented motifs.
With only a couple of modifications, we can adapt the script used to get the motif counts to get the quorum.
#!/scratch/python/bin/python
from collections import defaultdict
import sys
import fasta
seqs = fasta.get_seqs(open(sys.argv[1]).readlines())
length = int(sys.argv[2])
quorum = defaultdict(list)
seq_number = 0
for i in seqs:
seq_number += 1
for n in range(len(i.sequence) - int(length)):
if not seq_number in quorum[i.sequence[n : n + length]]:
quorum[i.sequence[n : n + length]].append(seq_number)
for i in quorum:
print ''.join(i).upper(), len(quorum[i])
Basically, we change the way the defaultdict is initialized, this time as a list instead of int and we also change the procedure that used to get the counts. The loop does identical work, iterating along the sequences, with a window (of the input length) sliding on them and checking each word. This time instead of incrementing the value of the defaultdict, we append to the list the sequence number, obtained from a index integer variable (incremented in each iteration of the loop), if this number is not already in the list value. In the end each value of quorum will be a list os numbers and by printing the list length we obtain the quorum. Testing the above script there is no performance loss when comparing to the previous count script.
Next we will see which statistical method to use and start to devising an script to calculate it.
May 12th, 2008 at 7:02 pm
The “enumerate” function is handy when you want to number items in a list (or tuple, generator, etc.) as you loop over them with a ‘for’ loop. You pass the iterable into enumerate and it returns two-tuples a sequence number and an item from the iterator.
The built-in “set” class is handy when you want to keep track of a group of items, but don’t want duplicates. The .add() method adds an element to the set, unless it is already in the set.
Dictionary methods items() and iteritems() return tuples of key-value pairs
(Leading “.”s represent spaces for indentation)
quorum = defaultdict(set)
for seq_number,seq in enumerate(seqs):
…for n in range(len(i.sequence) – length):
……quorum[i.sequence[n:n+length]].add(seq_number)
for motif,seq_numbers in quorum.iteritems():
…print motif.upper(), len(seq_numbers)
May 12th, 2008 at 9:08 pm
Hi Mike
I don’t know if I can thank you enough, your comments always make the topic better.
I was aware of the iteritems method but forgot to use in this case. Thanks for pointing out the enumerate. I am wondering about a possible speed increase with the solution you suggested.
Cheers
May 14th, 2008 at 7:34 pm
I coded the routine 8 different ways. The code in comment number 1 was fastest. A routine similar to the code in the original post was second fastest. The difference in run time among the 4 fastest routines was not significant (only about 2 or 3%). The slowest routine was about 20% slower than the fastest routine.
May 16th, 2008 at 7:55 am
Thanks Mike. I am away until Tuesday and I am going to test too. Would you mind sending me the code you have?
Cheers