Obtaining overrepresented motifs in DNA sequences, part 11
Phase 2 August 12th, 2008After a long hiatus we are (almost) back on track in order to get our scripts to determine overrepresented motifs in DNA sequences. Last time we checked we defined the “best” factorial function in Python
def fac_01(n):
result = 1
for i in xrange(2, n+1):
result *= i
return result
and Andrew Dalke sent a couple of links pointing out to a binomial calculation function, one of them is below
# This file contains the Python code from Program 14.10 of
# "Data Structures and Algorithms
# with Object-Oriented Design Patterns in Python"
# by Bruno R. Preiss.
#
# Copyright (c) 2003 by Bruno R. Preiss, P.Eng. All rights reserved.
#
# http://www.brpreiss.com/books/opus7/programs/pgm14_10.txt
#
def binom(n, m):
b = [0] * (n + 1)
b[0] = 1
for i in xrange(1, n + 1):
b[i] = 1
j = i - 1
while j > 0:
b[j] += b[j - 1]
j -= 1
return b[m]
There is a similar implementation in the Python Cookbook online and it is clearly more convenient using this function than actually coding to calculate an identical value using factorials. But anyway, let’s see how a function that calculates one binomial coefficient from the Hypergeometric Distribution (HD) by using the factorial function. Later we benchmark both methods.
Each binomial coefficient can be expanded as
and the HD has three of them. From the Wikipedia “In probability theory and statistics, the hypergeometric distribution is a discrete probability distribution that describes the number of successes in a sequence of n draws from a finite population without replacement.” In the next post we will define each term of the HD for the motifs case.
In each binomial expansion we would have to calculate three factorial values, nine in total. With the binomial function, only three values need to be calculated. So, using the factorial function we would need to code something like this in order to calculate one of the binomials
#let's say the motif quorum in the foreground is 7 #and the total number of sequences is 3000 #we won't touch the other required values fore = 7 total = 3000 hd = fac_01(3000) / (fac_01(7) * fac_01(2993) print hd
Next, we will benchmark and see if there is an advantage to either method.
![Reblog this post [with Zemanta]](http://img.zemanta.com/reblog_e.png?x-id=20cae2f2-5ae2-4e8d-b367-9a6e684efa94)
August 13th, 2008 at 4:09 am
Is there a reason you’re not using numpy, scipy or rpy for maths?
August 13th, 2008 at 6:49 am
Yes, there is. I am trying to create the scripts without many imports, using mostly Python’s included libraries.
August 14th, 2008 at 7:09 am
Um…that’s an interesting approach, although not one I would recommend in practice. Perhaps at the end, you could present a solution that uses one of the main Python scientific libraries. (This may also yield performance improvements if you feel this is important.)
August 14th, 2008 at 7:59 am
I guess in the end the scripts will evolve and I’ll include a scientific package. Also the scripts are very simple and straightforward, that sometimes big package might get in the way.
August 14th, 2008 at 10:31 am
Paulo, hold your ground. Numpy may yet prove useful, but for now it is premature optimization. Your audience will get more value from watching a small app develop and be benchmarked. Already I think the 500x performance difference between the numerically clever binomial approach vs 3 for loops is a good lesson. Keep it up.
August 15th, 2008 at 1:30 am
For those coming across this blog via a search, check out my comments in part 12 of this series. The “binom” implementation here is really slow, as Paulo’s benchmark in #12 shows. That implementation is decidedly not clever.
A faster implementation gives results in much shorter time. For the benchmark case, about 3,000 times faster.