Obtaining overrepresented motifs in DNA sequences, part 12.5
Phase 2 August 15th, 2008So let’s modify a little bit the factorial function and then benchmark both by using timeit. Ideally our factorial function would need to calculate a value similar to the binomial expansion, as we have three factorials to calculate in for each binomial in the Hypergeometric Distribution.
So we can add two extra factorial calculations to our function and perform the multiplication and division to return the equivalent to the binomial calculation. So the function would be
def fac(n, m):
value1 = 1
for i in xrange(2, n + 1):
value1 *= i
value2 = 1
for i in xrange(2, m + 1):
value2 *= i
value3 = 1
for i in xrange(2, (n - m) + 1):
value3 *= i
return value1 / (value2 * value3)
m and n are both values of the binomial and n - m is the subtraction of one by the other that forms the last factorial to be calculated. This way it makes easier to time the performance of both functions. In the end the complete script would look like
#!/usr/bin/env python
import timeit
def fac(n, m):
result1 = 1
for i in xrange(2, n + 1):
result1 *= i
result2 = 1
for i in xrange(2, m + 1):
result2 *= i
result3 = 1
for i in xrange(2, (n - m) + 1):
result3 *= i
return result1 / (result2 * result3)
def binom(n, m):
b = [0] * (n + 1)
b[0] = 1
for i in xrange(1, n + 1):
b[i] = 1
j = i - 1
while j > 0:
b[j] += b[j - 1]
j -= 1
return b[m]
def choose(n, k):
if 0 <= k <= n:
ntok = 1
ktok = 1
for t in xrange(1, min(k, n - k) + 1):
ntok *= n
ktok *= t
n -= 1
#print ntok // ktok
return ntok // ktok
else:
return 0
if __name__ == "__main__":
stmt = "fac(3000, 7)"
t = timeit.Timer(stmt = stmt, setup='from __main__ import fac')
stmt2 = "binom(3000, 7)"
t2 = timeit.Timer(stmt = stmt2, setup = 'from __main__ import binom')
stmt3 = "choose(3000, 7)"
t3 = timeit.Timer(stmt = stmt3, setup = 'from __main__ import choose')
print 'fac: %.9f' % (t.timeit(100)/100)
print 'binom: %.2f' % (t2.timeit(10)/10)
print 'choose %.9f' % (t3.timeit(100)/100)
The final result of the average for ten repetitions is as follow
fac = 0.10 s
binom = 43.24 s
choose = 0.000005 s
Clearly, the factorial function gets a huge advantage over the binomial one. So we will modify it a little bit and use it for our HD script. Clearly the choose function is the fastest one, so we will incorporate it on out HD script.
![Reblog this post [with Zemanta]](http://img.zemanta.com/reblog_e.png?x-id=828aa662-2ad6-4a52-ae01-e8494daa880d)
August 15th, 2008 at 1:24 am
I sent a couple of links to binomial function implementations without reviewing the code myself. You used the second of those links, which happens to be amazingly slow. It’s actually computing all of the values of the distribution for a given size n, then returning the one requested.
Choosing one of the implementations from the first link (the “non-clever” one at http://groups.google.com/group/comp.lang.python/msg/6e7c3358b086ff9c?dmode=source):
def choose(n, k):
if 0 <= k <= n:
ntok = 1
ktok = 1
for t in xrange(1, min(k, n – k) + 1):
ntok *= n
ktok *= t
n -= 1
return ntok // ktok
else:
return 0
I get as output
fac: 0.01
choose: 0.00
Since it doesn’t actually take 0.0 time I changed the repeat count from 10 to 10,000 and removed the “/0″. This will give you an idea of the performance difference:
fac: 140.19
choose: 0.05
or roughly a factor of 3,000 times faster.
(BTW, I had a hard time writing “binomial” – my fingers keep wanting to type “bionomial”
August 15th, 2008 at 8:12 am
Thanks Andrew, I will change the code and post the results.