A quick assessment of factorial functions in Python

Phase 2 10 Comments »

A short pause on the motifs subject. As mentioned on comments there is a lot of different ways of calculating factorials in Python (the same “problem” can be found in some other languages too). Cariaso suggested to time the execution of different factorial functions, including the ones found on Python’s cookbook (which I should have included in the beginning of last post). Anyway all functions from the webpage were included, as the one mentioned on a comment and both functions seen here. Using timeit (thanks Mike!) the execution time of all of them were measured by calculating the factorial of 800 and 4000. First, the functions:

def fac_01(n):
    result = 1
    for i in xrange(2, n+1):
        result *= i
    return result

def fac_02(n):
    value = reduce(lambda i, j : i * j, range(1, n + 1))
    return value

def fac_03(n):
    import operator
    value = reduce(operator.mul, xrange(2, n + 1))
    return value

def fac_04(n):
    fac = lambda n:n-1 + abs(n-1) and fac(n-1)*long(n) or 1
    return fac(n)

def fac_05(n):
    fac = lambda n:[1,0][n>0] or fac(n-1)*n
    return fac(n)

def fac_06(n):
    fac = lambda n:reduce(lambda a,b:a*(b+1),range(n),1)
    return fac(n)

def fac_07(n):
    fac=lambda n: [1, 0][n > 0] or reduce(lambda x, y: x*y, xrange(1,n + 1))
    return fac(n)

def fac_08(n):
    fac = lambda n: n <= 0 or reduce(lambda a, b: a*b, xrange(1,n + 1))
    return fac(n)

def fac_09(n):
    fac = lambda n: [[[j for j in (j * i,)][0] for i in range(2, n+1)][-1] for j in (1,)][0]
    return fac(n)

def fac_10(n):
    fac = lambda n: [j for j in [1] for i in range(2, n+1) for j in [j * i]] [-1]
    return fac(n)

fac_01 was suggested by bearophile on a comment (no psyco import here), fac_02 was the first one we have seen in the blog, fac_03 was the one “selected as the fastest” (mentioned on comments too) and functions 4 to 10 were gathered from the cookbook, all of them using lambda. Now, to the results (no fancy graphs, yet), average over 1000 repeats, time in seconds:

800!
fac_01: 0.0010
fac_02: 0.0012
fac_03: 0.0010
fac_04: 0.0022
fac_05: 0.0020
fac_06: 0.0015
fac_07: 0.0013
fac_08: 0.0013
fac_09: 0.0017
fac_10: 0.0015

4000!
fac_01: 0.0226
fac_02: 0.0242
fac_03: 0.0230
fac_04: N/A
fac_05: N/A
fac_06: 0.0244
fac_07: 0.0239
fac_08: 0.0241
fac_09: 0.0380
fac_10: 0.0362

In both cases the “best” functions were 1 and 3. For the smallest factorial (800!) 1 and 3 didn’t have a lot of advantage for 2, 6-10, while 4 and 5 were 2 times slower. For the largest factorial (4000!) 1 was 4% better than 3, and the third best (6) was also 4% slower than 3. Why do 4 and 5 have no computed time? Because they are recursive and it blows the recursion limit in Python.

So I guess it is a safe bet to use either function 1 or 3, with a slightly advantage for 3 in the case of large factorials. But at the same time a believe for our case of overrepresented motifs, it would be better to pre-calculate the factorials, store them and access the value when needed. Andrew’s idea of using binomials is also interesting, and I am planning to test it here. He also gave a couple of other suggestions that would need extra imports, scipy and gmpy.

Zemanta Pixie

Obtaining overrepresented motifs in DNA sequences, part 10

Phase 2, motifs 6 Comments »

Let’s get back to the statistical module, that will calculate an Hypergeometric Distribution (HD) p value so we can define the overrepresented motifs. Last time we saw it, we just had defined the factorial function, which is immensely helpful in this case due to the number of factorial calculations needed in the HD. The factorial function was the one below


def fac(n):
    value = reduce(lambda i, j : i * j, range(1, n + 1))
    return value

but as mentioned in the comments by Dave and by Mike via email the method used is not the best method to calculate factorial in Python. The best approach in this case is to use operator.mul. All functions in the operator modules are in implemented in pure C and they mimic the same operators in Python. So in this module we can find mul for multiplication, sub for subtraction, add for additions, etc.

The operator.mul needs two arguments to multiply, and in our case we still need to use reduce to sum all the results from a series of multiplications. As parameters we should use a range, that can start with 2, that should go up to the number we want the factorial plus one. Finally our function would be


import operator

def fac(n):
    value = reduce(operator.mul, xrange(2, n+1))
    return value

The time gain, quickly measured in a non-scientific fashion in my system, is around 5 to 15%, depending on the factorial being calculated. It may seem a small gain, but when you need to calculate almost a million factorials for all possible motifs the amount of time saved is crucial. Next time we will be back with more statistics, expanding the module.

Zemanta Pixie

Obtaining overrepresented motifs in DNA sequences, part 9

Phase 2, motifs No Comments »

Back on new functions for motif quorums. We jump function 7 in order to explain “simpler” ones, 8 and 9. Both functions use generators. We’ve already seen here generators, which are functions that use the yield statement to generate iterators. The generator is very similar to a function but instead of returning a value, it yields one and waits for another call to resume. In function 8, a generator is used to return the motif sequence that is used as a key in the defaultdict. Notice the scope of the generator that is coded inside a function.

def get_quorums_08(seqs, mlen):
    """
    add seq id_no to a set
    use enumerate to create seq_no
    use an explicit generator to create the motifs
    """
    def motif_gen(seq):
        for n in range(len(seq)-mlen):
            yield seq[n:n+mlen]

    quorum = defaultdict(set)
    for id_no, seq in enumerate(seqs):
        for motif in motif_gen(seq):
            quorum[motif].add(id_no)

    return quorum

In function 9 a very similar structure is used but in this cases instead of a “pure” generator it uses generator expressions which a very similar to list comprehensions but with parentheses instead of square brackets. Generator expressions are generators that can be written in one line and have identical behaviour as generators coded in the “regular” inception. In the function below the generator expressions provide the iterator for the loop with motif as index. Very simple and elegant.

def get_quorums_09(seqs, mlen):
    """
    add seq id_no to a set
    use enumerate to create seq_no
    use a generator expression to create the motifs
    """
    quorum = defaultdict(set)
    for id_no, seq in enumerate(seqs):
        for motif in (seq[n:n+mlen] for n in range(len(seq)-mlen)):
            quorum[motif].add(id_no)

    return quorum

In the next post we will go back to the statistical module and soon we will see the remainder 5 functions sent by Mike.

Zemanta Pixie

Obtaining overrerpresented motifs in DNA sequences, part 8

Phase 2, motifs No Comments »

We keep on introducing Mike’s functions. This time there are a couple of Python methods that we haven’t seen here and need some introduction, izip and count. To use these two we also need to import new modules

from itertools import count, izip

count returns consecutive integers starting at a defined point (the method’s parameter). If empty it starts from zero. Basically, by starting a count it will give an iterable with a increasing integer values, in a fashion similar to a function with yield. Every time our loop accesses the count it will “remember” the last return value and increment it by one.

izip also returns an iterator, but from a list of iterables. It is basically used to iterate through a list of many iterables at the same time. In the function below it is used twice: one to generate a tuple (with count) with a sequence number and the sequence itself. The sequence in the tuple is then used in another izip to create the windows on the sequences to count motifs.

def get_quorums_06(seqs, mlen):
    """
    add seq id_no to a set
    use 'izip(count(),...) to create seq_no
    use 'izip(count(),range(...)) to create start/stop indices for motifs
    """
    quorum = defaultdict(set)
    for id_no, seq in izip(count(), seqs):
        for s, e in izip(count(), range(mlen, len(seq))):
            quorum[seq[s:e]].add(id_no)
    return quorum

In the next couple of posts we still be checking motif quorum functions. Stay tuned.

Obtaining overrepresented motifs in DNA sequences, part 7

Phase 2, motifs No Comments »

Continuing on Mike’s functions to obtain motif quorums. We see function 3, 4 and 5. Function get_quorums_03, uses an old friend of the blog, sets. Recall that sets are very similar to lists, but their are unordered and items are unique.

def get_quorums_03(seqs, mlen):
    """
    add seq id_no to a set
    use explicit counter to create seq_no
    """
    quorum = defaultdict(set)
    id_no = 0
    for seq in seqs:
        id_no += 1
        for n in range(len(seq)-mlen):
            quorum[seq[n:n+mlen]].add(id_no)
    return quorum

Basically, the sequence numbers (an incremented counter) are added to a defaultdict which was initialized as a set. This way you don’t need to check for the existence of the sequence number in the defaultdict list and count on the ability of set of being unique. Function 4 is very similar to function 3 with the difference of using enumerate (as in function 02) to make the sequence numbers.

def get_quorums_04(seqs, mlen):
    """
    add seq id_no to a set
    use 'enumerate' to create seq_no
    """
    quorum = defaultdict(set)
    for id_no, seq in enumerate(seqs):
        for n in range(len(seq)-mlen):
            quorum[seq[n:n+mlen]].add(id_no)
    return quorum

Function 5 adds a twist, which is to have an enumerate to set the sequence range (motif/word width) start and stop. This way the window is sliding based on the tuple created by the enumerate method and not on the slicing that were used in all other functions. Again, a defaultdict is initialized as set and the sequence numbers are generated by an enumerate.

def get_quorums_05(seqs, mlen):
    """
    add seq id_no to a set
    use 'enumerate' to create seq_no
    use enumerate(range(...)) to create start/stop indices for motif
    """
    quorum = defaultdict(set)
    for id_no, seq in enumerate(seqs):
        for s, e in enumerate(range(mlen, len(seq))):
            quorum[seq[s:e]].add(id_no)
    return quorum

Next time we will be back with some more functions that introduce new concepts to everyone that is following the blog.

Obtaining overrepresented motifs in DNA sequences, part 6

Phase 2, motifs No Comments »

We will take a break on developing the statistical module to obtain overrepresented motifs (I will introduce mul in the next stats post), and take a deeper look at the possibilities on obtaining the motif quorums. Mike DeHaemer, a regular commenter and contributor to the blog, sent me a Python script with 8 different ways distributed in 13 distinct functions for obtaining the motif quorums. I will take advantage of his contribution and post all of them, with some quick comments on each one of them (his code comments were kept in each function). After, a small benchmarking will be posted.

Most of the functions need to import a couple of module

from collections import defaultdict, deque
from itertools import count, izip, tee

and they have two parameters, a sequence list and the length of the motifs.

The first function uses again the defaultdict and it is very similar to the one used in the final version of the quorum script. The defaultdict is initialized as a list and the ids are added to a the list, keys are motifs, only if they are not already present in it. The sequence id is generated in a variable incremented each time the loop iterates.

def get_quorums_01(seqs, mlen):
    """
    append seq id_no to list after checking to see if already present
    use explicit counter to create seq_no
    """
    quorum = defaultdict(list)
    id_no = 0
    for seq in seqs:
        id_no += 1
        for n in range(len(seq) - mlen):
            if id not in quorum[seq[n:n+mlen]]:
                quorum[seq[n:n + mlen]].append(id_no)
    return quorum

The second function is very similar to the first one, with the caveat that sequence id numbers are generated with enumerate.

def get_quorums_02(seqs, mlen):
    """
    append seq id_no to list after checking to see if already present
    use 'enumerate' to create seq_no
    """
    quorum = defaultdict(list)

    for id_no, seq in enumerate(seqs):
        for n in range(len(seq) - mlen):
            if id_no not in quorum[seq[n:n+mlen]]:
                quorum[seq[n:n+mlen]].append(id_no)
    return quorum

enumerate is a object based on another iterable object. When called enumerate always returns a tuple of an indexed series. For instance, in our case above, enumerate will return a series of tuples (0, sequence1), (1, sequence2) ... (n, sequenceN). That’s the reason the enumerate loop uses a tuple as its index

for id_no, seq in enumerate(seqs)

Next couple of posts will cover the other functions sent by Mike. Then we will go back to the statistical module.

Zemanta Pixie

Obtaining overrepresented motifs in DNA sequences, part 5

Phase 2, motifs 3 Comments »

Now that we have the script to generate the word quorums working (and working fast!) we need then to calculate the a p value for each motif based on the fore and background quorums. A p value cut-off will determine the statistically significant words, or overrepresented. These overrepresented words then can be analysed in more details (that we won’t see here) and for instance determine new or already known transcription factor binding sites.

A well established statistical method to determine such overrepresented words is the Hypergeometric Distribution (HD for short). HD measures “success” and “failures” for values that do not fit in the binomial distribution, and depend on the measurements without replacement.

Basically, HD’s equation has a a series of binomial coefficients/combinations

HD equation

where N is the population size, m is foreground cluster size, k is the motif quorum in the background gene set and x is the word quorum in the foreground set. Note that the above equation is for the cumulative HD, where a sum of probabilities is calculated.

All the combinations in the above equation have to be expanded to factorials that depending on the value to be calculated are very computer intensive and sometimes don’t fit in the memory (either a float or integer). But Python is able to handle very large numbers and the calculation of large factorials is relatively fast.

In C++, I had to use a couple of tricks to achieve a good speed in the factorial determination, and specially in the HD calculation that requires multiple factorials and multiplication, division and subtraction of large numbers. I didn’t want to use any mathematical trick such as Stirling’s approximation. 13! in C++ already blows the size of long, so I had to use the MAPM, A Portable Arbitrary Precision Math Library in C. This library is quite fast to calculate the factorial values but when one needs to calculate more than 200,000 factorials the speed is unbearable. So, I decided to pre-calculate a series of factorial values, keeping 10 decimal places as precision and saving in another column their exponential. Then using this table as an input I was able to multiply, divide and subtract the factorials and by employing the first law of exponents do the same operations with their exponential. This speeds up the process tremendously.

In Python, we don’t need any extra third-party library, we just use Python itself, without importing an extra module. A factorial function in Python can be written in one line, but for clarity is better to define it separately. We can try throwing any number at it and see the result.


def fac(n):
    value = reduce(lambda i, j : i * j, range(1, n + 1))
    return value

We already saw reduce and lambda and using these two methods make the factorial function clear and simple. And why are we not using a recursive function? Because Python has a limit recursion depth (1000). Next time we will implement the code that calculates the HD p values.

Obtaining overrepresented motifs in DNA sequences, part 4

Phase 2, motifs 4 Comments »

We found a way to make the Python script as good as or better than the C++ executable. But for the analysis we need to do, motif counts are not the value we want. We need the quorum: the number of sequences the motif is present at least once. For instance, if the desired motifs was AAACCCTTTG we will check in which sequences this word was present. Let’s say in a cluster of 10 sequences, we would find it in sequences 1, 2, 3, 4 and 5, giving us a quorum of 5 out of 10, or 50%. The quorum will be used in the future in the statistical calculation in order to determine the overrepresented motifs.

With only a couple of modifications, we can adapt the script used to get the motif counts to get the quorum.

#!/scratch/python/bin/python

from collections import defaultdict
import sys
import fasta

seqs = fasta.get_seqs(open(sys.argv[1]).readlines())
length = int(sys.argv[2])

quorum = defaultdict(list)

seq_number = 0
for i in seqs:
    seq_number += 1
    for n in range(len(i.sequence) - int(length)):
        if not seq_number in quorum[i.sequence[n : n + length]]:
            quorum[i.sequence[n : n + length]].append(seq_number)

for i in quorum:
    print ''.join(i).upper(), len(quorum[i])

Basically, we change the way the defaultdict is initialized, this time as a list instead of int and we also change the procedure that used to get the counts. The loop does identical work, iterating along the sequences, with a window (of the input length) sliding on them and checking each word. This time instead of incrementing the value of the defaultdict, we append to the list the sequence number, obtained from a index integer variable (incremented in each iteration of the loop), if this number is not already in the list value. In the end each value of quorum will be a list os numbers and by printing the list length we obtain the quorum. Testing the above script there is no performance loss when comparing to the previous count script.

Next we will see which statistical method to use and start to devising an script to calculate it.

Obtaining overrepresented motifs in DNA sequences, part 3

Phase 2, motifs No Comments »

We take a break of developing code and check for performance (non-scientific testing!). In the previous entry a simple file was used as input: 8 DNA sequences of 500 bases each. That’s not enough to test the performance of the Python script against the C++ compiled executable. So, we use a larger file; two larger files to be more exact. First, we use a 555 sequence file with the sequences averaging 19371 nucleotides and another with 3854 sequences averaging 20000 nucleotides in length. Those files were the largest foreground and background clusters used in analysis. Let’s see how the Pyhton and C++ fared (Linux’s time was used in the comparison, for simplicity).

Foreground cluster

Average of 10 runs

C++: 45.66 seconds
Python: 36.4 seconds

Background cluster

Average of 10 runs

C++: 5 minutes and 4 seconds
Python: 2 minutes and 44 seconds

The C++ analysis of the background cluster was done on a cluster’s node, and not on my desktop computer, due to the fact that my desktop could not handle it. C++ defense: the code was developed by me and I am not a computer scientist. Clearly there is room for improvement and performance gain.

But, definitely we can see that by using some advanced techniques in Python we are able to outperform C++ (at least C++ code developed by me) and still have a short and easy to read script. Next time we will change the Python’s script output.

Obtaining overrepresented motifs in DNA sequences, part 2

Phase 2, motifs No Comments »

We move one on our search for overrepresented motifs in DNA sequences. I was preparing this entry when the comments to part 1 arrived. Because of that we will modify our previous code to include some suggestions and then we will change t a little bit to output the actual values we want. As Titus pointed out in his comment, when we merge the sequences we generate errors, because we artificially including motifs that are not supposed to be there. But in the end we will see that we are no after the actual number of times a motif appears in the sequence and what matters is the motif quorum.

The main focus of the part one was to show the decrease in code length from C++ to Python, introduce generator functions and yield and also show the nice permutation function. Turns out by using the approach of the previous entry there is a big drawback in code execution, which is around 60 times slower than C++. So we need to find another approach, and Mike showed us how (with some small modifications)


#!/usr/bin/env python

from collections import defaultdict
import sys
import fasta

seqs = fasta.get_seqs(open(sys.argv[1]).readlines())
length = int(sys.argv[2])

#for a missing key, the dict entry is initialized to zero
counts = defaultdict(int)

#count the length-element subsequences in each sequence
for i in seqs:
	for n in range(len(i.sequence) - length):
		counts[i.sequence[n : n + length]] += 1

#counts.keys() will then return the nucleotide sequences
#that were actually in merged_seqs

#print out the sequences that occur more than once
for count in counts:
        print ''.join(count), counts[count]

This time the counting is done with a defaultdict initialized with integers (all zeroes), and instead of generating all possible combinations (which was cool and fast, but in the end made our script slower) a window with the input length is slided along the each sequence and the key of the default dictionary is the motif and the value is the actual count incremented as we determine each motif.

Checking this code performance using Linux’s time, we get 0.2 seconds, an improvement of 5 fold on the C++ code and 300 fold on our previous Python code (Thanks Mike and everyone else for the suggestions).

A couple of other differences here is that the output is not ordered and only the seen motifs are printed in the end. We will see next time how to get the quorums.

Design by j david macor.com.Original WP Theme & Icons by N.Design Studio
Entries RSS Comments RSS Log in