Obtaining overrepresented motifs in DNA sequences, part 13
Phase 2, Section 3, Section 5, motifs August 20th, 2008Now that we have the best quorum determination function and the ideal function to calculate the binomial expansions it is easy to program a script to calculate the p value of motifs in DNA sequences. To the script
below in the code there are a couple of errors that wordpress don’t let me fix. The
#!/usr/bin/env python
import fasta
import sys
from collections import defaultdict
def choose(n, k):
if 0 <= k <= n:
ntok = 1
ktok = 1
for t in xrange(1, min(k, n - k) + 1):
ntok *= n
ktok *= t
n -= 1
#print ntok // ktok
return ntok // ktok
else:
return 0
def get_quorums(seqs, mlen):
"""
add seq id_no to a set
use explicit counter to create seq_no
"""
quorum = defaultdict(set)
id_no = 0
for seq in seqs:
id_no += 1
for n in range(len(seq) - mlen):
quorum[seq[n:n + mlen]].add(id_no)
return quorum
input_seqs = fasta.read_seqs(open(sys.argv[1]).readlines())
input_seqs2 = fasta.read_seqs(open(sys.argv[2]).readlines())
foreground = get_quorums(input_seqs, 10)
background = get_quorums(input_seqs2, 10)
N = len(input_seqs) + len(input_seqs2)
for i in foreground:
term1 = choose(len(background[i]), len(foreground[i]))
term2 = choose((N - len(background[i])), len(input_seqs)-1)
term3 = choose(N, len(input_seqs))
p = (float(term1) * float(term2)) / term3
if 0 < p <= 0.0001:
print i, len(foreground[i]), len(background[i]), p
We already defined choose in the last post (more information in the link from the Python’s cookbook) and earlier Mike sent us a series of quorum-determination functions and one of the best was portrayed and explained here. We also need our fasta module to read the sequences (and only the sequences) in order to use it in the quorum function.
Basically we use the foreground and background files as input, determine the quorum of the different words (width 10) and then we iterate over the results, calculating the p value for each motif found in the foreground set. The tree terms of the Hypergeometric Distribution are calculated separately and we test for a p value smaller that 0.0001 (this can be modified) and we only print the results that fall in this category.>
![Reblog this post [with Zemanta]](http://img.zemanta.com/reblog_e.png?x-id=cdd03597-362b-4dcc-b588-fb3fe3fba91a)
August 24th, 2008 at 10:11 am
The code doesn’t use the id_no inside of each quorum set, only the length of each set. If you don’t need that number then you can use a counter instead of of a set, as in:
def get_quorums(seqs, mlen):
__quorum = defaultdict(int)
__for seq in seqs:
____for n in range(len(seq) – mlen):
______quorum[seq[n:n + mlen]] += 1
__return quorum
and replace the main code with
for i in foreground:
__term1 = choose(background[i], foreground[i])
__term2 = choose((N – background[i]), len(input_seqs)-1)
__term3 = choose(N, len(input_seqs))
__p = (float(term1) * float(term2)) / term3
__if 0 < p <= 0.0001:
____print i, foreground[i], background[i], p