Obtaining overrepresented motifs in DNA sequences, part 10
Phase 2, motifs June 4th, 2008Let’s get back to the statistical module, that will calculate an Hypergeometric Distribution (HD) p value so we can define the overrepresented motifs. Last time we saw it, we just had defined the factorial function, which is immensely helpful in this case due to the number of factorial calculations needed in the HD. The factorial function was the one below
def fac(n):
value = reduce(lambda i, j : i * j, range(1, n + 1))
return value
but as mentioned in the comments by Dave and by Mike via email the method used is not the best method to calculate factorial in Python. The best approach in this case is to use operator.mul. All functions in the operator modules are in implemented in pure C and they mimic the same operators in Python. So in this module we can find mul for multiplication, sub for subtraction, add for additions, etc.
The operator.mul needs two arguments to multiply, and in our case we still need to use reduce to sum all the results from a series of multiplications. As parameters we should use a range, that can start with 2, that should go up to the number we want the factorial plus one. Finally our function would be
import operator
def fac(n):
value = reduce(operator.mul, xrange(2, n+1))
return value
The time gain, quickly measured in a non-scientific fashion in my system, is around 5 to 15%, depending on the factorial being calculated. It may seem a small gain, but when you need to calculate almost a million factorials for all possible motifs the amount of time saved is crucial. Next time we will be back with more statistics, expanding the module.

June 4th, 2008 at 3:09 pm
http://xkcd.com/386/
Enjoying the series, but I disagree strongly with the entire last paragraph.
As long as you can call the faster version with the same interface as the simpler version there is no reason to go for a more complicated version until you’ve done some actual testing. Premature optimization is the root of all evil.
This is still called ‘beginning python for bioinformatics’. I think we’re better off with a reminder to google ‘python factorial’ and to read a full discussion on 13 different implementations.
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/67668
as long as you hide it behind
def factorial(n):
I don’t care which implementation you pick until I see
1. that the program isn’t fast enough
2. that the factorial function is a bottleneck
If you can do that, I’d really like to see your testing.
http://blog.doughellmann.com/2007/09/pymotw-timeit.html
since that seems very useful for this audience
at which point I’ll probably suggest that
http://en.wikipedia.org/wiki/Memoization
is often a better solution to the larger problem.
June 4th, 2008 at 6:06 pm
Hi Cariaso
http://blindscientist.genedrift.org/2008/04/21/xkcd-got-it-wrong/
Thanks for the comment. I agree with you the premature optimization is sometimes bad, I face challenges like this mostly everyday.
The discussion you pointed out should have been linked, I thought of adding that (due to the fact that it was there that I was able to get a good grip on the implementations) but forgot in the last minute.
I will add some more scientific testing, I was writing something with Python’s time to check.
Speaking from personal experience usually the bottleneck of such applications is the factorial calculation. I implemented a similar algorithm in C++ using MAPM (http://www.tc.umn.edu/~ringx004/mapm-main.html) and in the end opted to a “dumb” memoization by pre-calculating or loading all possible factorial values for my sample size.
I will post about the stats module, then I will have something on the factorial test.
Cheers
June 4th, 2008 at 6:30 pm
To add to Mike’s comments, since you are actually computing a binomial coefficient, you can save yourself a lot of work by not computing the entire factorial. For example code,
http://groups.google.com/group/comp.lang.python/msg/6e7c3358b086ff9c?dmode=source
http://www.brpreiss.com/books/opus7/programs/pgm14_10.txt
I’ll also emphasize something Mike mentioned in passing. It’s better if you call this function “factorial”. That’s easier for anyone to understand what you mean, or Google if they don’t.
If you are working with really big numbers then it’s best to work in log space, as for example: http://aspn.activestate.com/ASPN/Mail/Message/python-list/2954844 . Note also that gmpy is yet another possible solution for you. Assuming this is indeed your bottleneck.
June 4th, 2008 at 6:52 pm
Thanks Andrew. I saw gmpy while searching for factorial implementations in Python. I will probably give it a try after finishing the module.
I could have SciPy, but then it would add an external module to be installed.
Cheers
June 5th, 2008 at 4:52 am
This is probably the faster Python version (for small factorials):
def small_factorial(n):
____result = 1
____for i in xrange(2, n+1):
________result *= i
____return result
import psyco; psyco.bind(small_factorial)
June 6th, 2008 at 11:52 am
[...] factorials in Python (the same “problem” can be found in some other languages too). Cariaso suggested to time the execution of different factorial functions, including the ones found on [...]