Obtaining overrerpresented motifs in DNA sequences, part 8

Phase 2, motifs No Comments »

We keep on introducing Mike’s functions. This time there are a couple of Python methods that we haven’t seen here and need some introduction, izip and count. To use these two we also need to import new modules

from itertools import count, izip

count returns consecutive integers starting at a defined point (the method’s parameter). If empty it starts from zero. Basically, by starting a count it will give an iterable with a increasing integer values, in a fashion similar to a function with yield. Every time our loop accesses the count it will “remember” the last return value and increment it by one.

izip also returns an iterator, but from a list of iterables. It is basically used to iterate through a list of many iterables at the same time. In the function below it is used twice: one to generate a tuple (with count) with a sequence number and the sequence itself. The sequence in the tuple is then used in another izip to create the windows on the sequences to count motifs.

def get_quorums_06(seqs, mlen):
    """
    add seq id_no to a set
    use 'izip(count(),...) to create seq_no
    use 'izip(count(),range(...)) to create start/stop indices for motifs
    """
    quorum = defaultdict(set)
    for id_no, seq in izip(count(), seqs):
        for s, e in izip(count(), range(mlen, len(seq))):
            quorum[seq[s:e]].add(id_no)
    return quorum

In the next couple of posts we still be checking motif quorum functions. Stay tuned.

Obtaining overrepresented motifs in DNA sequences, part 7

Phase 2, motifs 1 Comment »

Continuing on Mike’s functions to obtain motif quorums. We see function 3, 4 and 5. Function get_quorums_03, uses an old friend of the blog, sets. Recall that sets are very similar to lists, but their are unordered and items are unique.

def get_quorums_03(seqs, mlen):
    """
    add seq id_no to a set
    use explicit counter to create seq_no
    """
    quorum = defaultdict(set)
    id_no = 0
    for seq in seqs:
        id_no += 1
        for n in range(len(seq)-mlen):
            quorum[seq[n:n+mlen]].add(id_no)
    return quorum

Basically, the sequence numbers (an incremented counter) are added to a defaultdict which was initialized as a set. This way you don’t need to check for the existence of the sequence number in the defaultdict list and count on the ability of set of being unique. Function 4 is very similar to function 3 with the difference of using enumerate (as in function 02) to make the sequence numbers.

def get_quorums_04(seqs, mlen):
    """
    add seq id_no to a set
    use 'enumerate' to create seq_no
    """
    quorum = defaultdict(set)
    for id_no, seq in enumerate(seqs):
        for n in range(len(seq)-mlen):
            quorum[seq[n:n+mlen]].add(id_no)
    return quorum

Function 5 adds a twist, which is to have an enumerate to set the sequence range (motif/word width) start and stop. This way the window is sliding based on the tuple created by the enumerate method and not on the slicing that were used in all other functions. Again, a defaultdict is initialized as set and the sequence numbers are generated by an enumerate.

def get_quorums_05(seqs, mlen):
    """
    add seq id_no to a set
    use 'enumerate' to create seq_no
    use enumerate(range(...)) to create start/stop indices for motif
    """
    quorum = defaultdict(set)
    for id_no, seq in enumerate(seqs):
        for s, e in enumerate(range(mlen, len(seq))):
            quorum[seq[s:e]].add(id_no)
    return quorum

Obtaining overrepresented motifs in DNA sequences, part 6

Phase 2, motifs No Comments »

We will take a break on developing the statistical module to obtain overrepresented motifs (I will introduce mul in the next stats post), and take a deeper look at the possibilities on obtaining the motif quorums. Mike DeHaemer, a regular commenter and contributor to the blog, sent me a Python script with 8 different ways distributed in 13 distinct functions for obtaining the motif quorums. I will take advantage of his contribution and post all of them, with some quick comments on each one of them (his code comments were kept in each function). After, a small benchmarking will be posted.

Most of the functions need to import a couple of module

from collections import defaultdict, deque
from itertools import count, izip, tee

and they have two parameters, a sequence list and the length of the motifs.

The first function uses again the defaultdict and it is very similar to the one used in the final version of the quorum script. The defaultdict is initialized as a list and the ids are added to a the list, keys are motifs, only if they are not already present in it. The sequence id is generated in a variable incremented each time the loop iterates.

def get_quorums_01(seqs, mlen):
    """
    append seq id_no to list after checking to see if already present
    use explicit counter to create seq_no
    """
    quorum = defaultdict(list)
    id_no = 0
    for seq in seqs:
        id_no += 1
        for n in range(len(seq) - mlen):
            if id not in quorum[seq[n:n+mlen]]:
                quorum[seq[n:n + mlen]].append(id_no)
    return quorum

The second function is very similar to the first one, with the caveat that sequence id numbers are generated with enumerate.

def get_quorums_02(seqs, mlen):
    """
    append seq id_no to list after checking to see if already present
    use 'enumerate' to create seq_no
    """
    quorum = defaultdict(list)

    for id_no, seq in enumerate(seqs):
        for n in range(len(seq) - mlen):
            if id_no not in quorum[seq[n:n+mlen]]:
                quorum[seq[n:n+mlen]].append(id_no)
    return quorum

enumerate is a object based on another iterable object. When called enumerate always returns a tuple of an indexed series. For instance, in our case above, enumerate will return a series of tuples (0, sequence1), (1, sequence2) ... (n, sequenceN). That’s the reason the enumerate loop uses a tuple as its index

for id_no, seq in enumerate(seqs)

Next couple of posts will cover the other functions sent by Mike. Then we will go back to the statistical module.

Zemanta Pixie

Obtaining overrepresented motifs in DNA sequences, part 5

Phase 2, motifs 3 Comments »

Now that we have the script to generate the word quorums working (and working fast!) we need then to calculate the a p value for each motif based on the fore and background quorums. A p value cut-off will determine the statistically significant words, or overrepresented. These overrepresented words then can be analysed in more details (that we won’t see here) and for instance determine new or already known transcription factor binding sites.

A well established statistical method to determine such overrepresented words is the Hypergeometric Distribution (HD for short). HD measures “success” and “failures” for values that do not fit in the binomial distribution, and depend on the measurements without replacement.

Basically, HD’s equation has a a series of binomial coefficients/combinations

HD equation

where N is the population size, m is foreground cluster size, k is the motif quorum in the background gene set and x is the word quorum in the foreground set. Note that the above equation is for the cumulative HD, where a sum of probabilities is calculated.

All the combinations in the above equation have to be expanded to factorials that depending on the value to be calculated are very computer intensive and sometimes don’t fit in the memory (either a float or integer). But Python is able to handle very large numbers and the calculation of large factorials is relatively fast.

In C++, I had to use a couple of tricks to achieve a good speed in the factorial determination, and specially in the HD calculation that requires multiple factorials and multiplication, division and subtraction of large numbers. I didn’t want to use any mathematical trick such as Stirling’s approximation. 13! in C++ already blows the size of long, so I had to use the MAPM, A Portable Arbitrary Precision Math Library in C. This library is quite fast to calculate the factorial values but when one needs to calculate more than 200,000 factorials the speed is unbearable. So, I decided to pre-calculate a series of factorial values, keeping 10 decimal places as precision and saving in another column their exponential. Then using this table as an input I was able to multiply, divide and subtract the factorials and by employing the first law of exponents do the same operations with their exponential. This speeds up the process tremendously.

In Python, we don’t need any extra third-party library, we just use Python itself, without importing an extra module. A factorial function in Python can be written in one line, but for clarity is better to define it separately. We can try throwing any number at it and see the result.


def fac(n):
    value = reduce(lambda i, j : i * j, range(1, n + 1))
    return value

We already saw reduce and lambda and using these two methods make the factorial function clear and simple. And why are we not using a recursive function? Because Python has a limit recursion depth (1000). Next time we will implement the code that calculates the HD p values.

Obtaining overrepresented motifs in DNA sequences, part 4

Phase 2, motifs 4 Comments »

We found a way to make the Python script as good as or better than the C++ executable. But for the analysis we need to do, motif counts are not the value we want. We need the quorum: the number of sequences the motif is present at least once. For instance, if the desired motifs was AAACCCTTTG we will check in which sequences this word was present. Let’s say in a cluster of 10 sequences, we would find it in sequences 1, 2, 3, 4 and 5, giving us a quorum of 5 out of 10, or 50%. The quorum will be used in the future in the statistical calculation in order to determine the overrepresented motifs.

With only a couple of modifications, we can adapt the script used to get the motif counts to get the quorum.

#!/scratch/python/bin/python

from collections import defaultdict
import sys
import fasta

seqs = fasta.get_seqs(open(sys.argv[1]).readlines())
length = int(sys.argv[2])

quorum = defaultdict(list)

seq_number = 0
for i in seqs:
    seq_number += 1
    for n in range(len(i.sequence) - int(length)):
        if not seq_number in quorum[i.sequence[n : n + length]]:
            quorum[i.sequence[n : n + length]].append(seq_number)

for i in quorum:
    print ''.join(i).upper(), len(quorum[i])

Basically, we change the way the defaultdict is initialized, this time as a list instead of int and we also change the procedure that used to get the counts. The loop does identical work, iterating along the sequences, with a window (of the input length) sliding on them and checking each word. This time instead of incrementing the value of the defaultdict, we append to the list the sequence number, obtained from a index integer variable (incremented in each iteration of the loop), if this number is not already in the list value. In the end each value of quorum will be a list os numbers and by printing the list length we obtain the quorum. Testing the above script there is no performance loss when comparing to the previous count script.

Next we will see which statistical method to use and start to devising an script to calculate it.

Obtaining overrepresented motifs in DNA sequences, part 3

Phase 2, motifs No Comments »

We take a break of developing code and check for performance (non-scientific testing!). In the previous entry a simple file was used as input: 8 DNA sequences of 500 bases each. That’s not enough to test the performance of the Python script against the C++ compiled executable. So, we use a larger file; two larger files to be more exact. First, we use a 555 sequence file with the sequences averaging 19371 nucleotides and another with 3854 sequences averaging 20000 nucleotides in length. Those files were the largest foreground and background clusters used in analysis. Let’s see how the Pyhton and C++ fared (Linux’s time was used in the comparison, for simplicity).

Foreground cluster

Average of 10 runs

C++: 45.66 seconds
Python: 36.4 seconds

Background cluster

Average of 10 runs

C++: 5 minutes and 4 seconds
Python: 2 minutes and 44 seconds

The C++ analysis of the background cluster was done on a cluster’s node, and not on my desktop computer, due to the fact that my desktop could not handle it. C++ defense: the code was developed by me and I am not a computer scientist. Clearly there is room for improvement and performance gain.

But, definitely we can see that by using some advanced techniques in Python we are able to outperform C++ (at least C++ code developed by me) and still have a short and easy to read script. Next time we will change the Python’s script output.

Obtaining overrepresented motifs in DNA sequences, part 2

Phase 2, motifs No Comments »

We move one on our search for overrepresented motifs in DNA sequences. I was preparing this entry when the comments to part 1 arrived. Because of that we will modify our previous code to include some suggestions and then we will change t a little bit to output the actual values we want. As Titus pointed out in his comment, when we merge the sequences we generate errors, because we artificially including motifs that are not supposed to be there. But in the end we will see that we are no after the actual number of times a motif appears in the sequence and what matters is the motif quorum.

The main focus of the part one was to show the decrease in code length from C++ to Python, introduce generator functions and yield and also show the nice permutation function. Turns out by using the approach of the previous entry there is a big drawback in code execution, which is around 60 times slower than C++. So we need to find another approach, and Mike showed us how (with some small modifications)


#!/usr/bin/env python

from collections import defaultdict
import sys
import fasta

seqs = fasta.get_seqs(open(sys.argv[1]).readlines())
length = int(sys.argv[2])

#for a missing key, the dict entry is initialized to zero
counts = defaultdict(int)

#count the length-element subsequences in each sequence
for i in seqs:
	for n in range(len(i.sequence) - length):
		counts[i.sequence[n : n + length]] += 1

#counts.keys() will then return the nucleotide sequences
#that were actually in merged_seqs

#print out the sequences that occur more than once
for count in counts:
        print ''.join(count), counts[count]

This time the counting is done with a defaultdict initialized with integers (all zeroes), and instead of generating all possible combinations (which was cool and fast, but in the end made our script slower) a window with the input length is slided along the each sequence and the key of the default dictionary is the motif and the value is the actual count incremented as we determine each motif.

Checking this code performance using Linux’s time, we get 0.2 seconds, an improvement of 5 fold on the C++ code and 300 fold on our previous Python code (Thanks Mike and everyone else for the suggestions).

What is different here is that the output is not ordered and only the seen motifs are printed in the end. We will see next time how to get the quorums.

Obtaining overrepresented motifs in DNA sequences, part 1

Phase 2, motifs 11 Comments »

Changing gears now, leaving behind Pfam alignments. I decided to start a new series of posts based on the conversion of some small C++ programs I developed in the past. These small programs (I call them modules because they were part of a larger application) were used to count motifs, short nucleotide words up to 10-12 base pairs, and then calculate statistical overrepresentation of these words by comparing a foreground set of DNA sequences against a background set.

We will start comparing the different approaches of the C++ and the Python codes and point out advantages and disadvantages of doing it in one language or the other. First thing we need to do is to count the motifs in all sequences from our foreground and background sets. For the project I was working on, the ideal word length was 10 nucleotides. Basically our C++ approach to increase speed was to transform the character DNA sequences into numbers and then, while sliding a window with the desired word length, hash the base-four numbers into base 10 and increment a vector position, previously initialized with 0. For four nucleotides and a word size of 10 there are 1,048,576 permutations possible, from AAAAAAAAAA to TTTTTTTTTT. Initially the C++ program would do

for(j = 0; j < seqsize; j++)
{
    seqfile.get(base);
    if(base != '\n')
    {
        switch(base)
        {
            case 'A':
                bseq.push_back(0);
                break;
            case 'C':
                bseq.push_back(1);
                break;
            case 'G':
                bseq.push_back(2);
                break;
            case 'T':
                bseq.push_back(3);
                break;
        }
    }
}

reading all sequences and pushing an figure for each nucleotide in a vector, and then sliding a window on this vector and hashing the base-four number

int hashSeq(vector<short> subseq)
{
    int w, i, hashvalue = 0, power;
    w = subseq.size() - 1;
    for(i = 0; i &lt; subseq.size(); i++)
    {
            power = 0;
            hashvalue += subseq[i] * pow((double)4,(double)w);
            w--;
    }
    return hashvalue;
}

if(binseq[i].size() &gt; motifwidth)
{
    for(j = 0; j &lt; binseq[i].size()-motifwidth+1; j++)
    {
        sub.assign(binseq[i].begin() + j, binseq[i].begin() + j+motifwidth);
        hashed = hashSeq(sub);
        nmercount[hashed]++;
        sub.clear();
    }
}

The whole C++ code has about 400 lines, including all the possible output formatting and printing. Timing with time the C++ executable takes a little bit less than 2 seconds to read, count and output different files.

For Python, we will use a different approach and gain a lot in code simplicity. As we want to count the number of times size-10 words appear in all sequences, we first need to generate all possible permutations (with replacement) of four nucleotides. This can be easily accomplished by using generator functions. Regular functions run until completion and the return a value. For instance, a function that calculates the factorial of 10 will return the last value only, after multiplying 10.9.8.7.6.5.4.3.2. A generator function runs until a value is available to return, yielding it and then suspending its operation until called again. The yielding part was emphasized because yield is the command used by Python to return the value and suspend the function until further notice. In the factorial function, a generator would return the intermediary factorial values up to 10.

To generate all 1 million plus permutations of 4 nucleotides we need a function similar to the one below (modified from here)

def permutations(items, n):
    if n == 0:
        yield ''
    else:
        for i in range(len(items)):
            for base in permutations(items, n - 1):
                yield str(items[i]) + str(base)

Basically, what this generator function does is to combine all four nucleotides in words of size 10. This is a recursive function, where the result of the function is dependent on the n-1 value calculated by the function until n equals 0. The first for loop over the items that we want to permutate (the nucleotides) and the second for recursively calls permutations starting with the initial n passed (10) until we reach 0. Debugging this function we will see that i is constant for each iteration of the second loop and only n changes from 10 to 0, while one by one nucleotides are joined to form a motif. It starts with AAAAAAAAAA, then AAAAAAAAAC, then AAAAAAAAAG, until it gets to a poly-T.

Our final code would look like the one below

import fasta
import sys

def permutations(items, n):
    if n == 0:
        yield ''
    else:
        for i in range(len(items)):
            for base in permutations(items, n - 1):
                yield str(items[i]) + str(base)

seqs = fasta.get_seqs(open(sys.argv[1]).readlines())
length = sys.argv[2]

nucleotides = ['A', 'C', 'G', 'T']

merged_seqs = ''
for i in seqs:
    merged_seqs += i.sequence

for i in permutations(nucleotides, int(length)):
    print i + '\t' + merged_seqs.count(i)

where we read the input sequence(s), merge them in one long string and as we generate all possible combinations we count the number of times they appear. This code running on the same input file used on the C++ executable is 60 times slower, taking in average one full minute to count motifs in 8 500 bp DNA sequences. The slowest section is the count, as the generation of all possible combinations is straightforward. We lose some speed, but gain a lot on code simplicity and clarity. Next we will modify this code to output different counts needed for the statistical analysis.

Still on merging Pfam alignments …

Phase 2 No Comments »

One of the things I like about Python and the Python community is the search for the making code simple and clear. Tal left a comment in the last post about merging Pfam alignment sequences suggesting another approach to our problem. The code is below

def merge_seqs(data1, data2):
    from itertools import chain, groupby
    format = "%s-%s->%d\n%s%s"
    flist = []
    keyfunc = lambda it: it.name[it.name.find('|') + 1 : it.name.find('/')]
    for it, g in groupby(sorted(chain(data1, data2), key=keyfunc), keyfunc):
        values = list(g)
        if len(values) == 2:
            jname, jseq = values[0].name, values[0].sequence
            kname, kseq = values[1].name, values[1].sequence
            flist.append(format % (jname, kname, len(jseq), jseq, kseq) )

    return flist

The code also uses the itertools module, importing chains and groupby. We already saw chains in the previous post, but groupby is new to us here. groupby was introduced in the 2.4 version of Python and is a method returns keys and groups from an iterable. An Python iterable is any object that can return its elemements at given time, for instance in a for loop, while the index of this loop is the iterator. So, in our case groupby will return the sequence names based on the lambda function defined before the groupby and the chain method. Usually groupby has this syntax

groupby(iterable[, key])

The key is optional, and in our case it is the lambda function. Another method new to use that uses the same lambda function is sorted. As its name hints, sorted returns a sorted list of iterables. The key in this case is the sorting algorithm, that actually creates the comparison between items.

Basically in the code above, a lambda function extracts the desired regions from the sequence names, which are them iterated in a groupby method that returns they key values, one value when the sequence is unique, two values when there are two sequences, of a sorted iterable generated by a chain that read both input lists in one pass. After this we just need to check the number of returned values and we have our list of matching sequences.

Revisting Pfam alignments, again …

Phase 2 No Comments »

Continuing on the Pfam alignment sequence merge, Luke provided two solutions, one for case where we have duplicates in the file (that happens) and anothet one where duplicates are not tested. First for the case with no duplicates:

def merge_seqs(data1, data2):
    first, second = dict(), dict()
    for i in data1:
        first[i.name[i.name.find('|') + 1:i.name.find('/')]] = i

    for i in data2:
        second[i.name[i.name.find('|') + 1:i.name.find('/')]] = i

    shared_ids = set(first).intersection(set(second))

    flist = []
    for i in shared_ids:
        j = first[i]
        k = second[i]
        tempname = j.name + '-' + k.name + '-&gt;' + str(len(j.sequence))
        tempseq = j.sequence + k.sequence
        flist.append(tempname + '\n' + tempseq)

    return flist

Basically, his approach is to create two dictionaries, first and second to store the data from our lists of FASTA objects. The dictionaries keys are the sections fo each sequence name, and the values are the FASTA object created when we read the file. After that, a set of both dictionaries is created, and this represents everything we need in order to find shared sequences between both files. The output is trivial, basically iterating the set and getting the values from both dictionaries.

But, in some cases there are some duplicates in each Pfam alignment, and this would make the above approach to miss some sequences. Luke also provided a solution for this case

from collections import defaultdict
def merge_seqs(data1, data2):
    first, second = defaultdict(list), defaultdict(list)
    for i in data1:
        first[i.name[i.name.find('|') + 1:i.name.find('/')]].append(i)

    for i in data2:
        second[i.name[i.name.find('|') + 1:i.name.find('/')]].append(i)

    shared_ids = set(first).intersection(set(second))

    flist = []
    for i in shared_ids:
        cross = [(a,b) for a in first[i] for b in second[i]]
        for j,k in cross:
            tempname = j.name + '-' + k.name + '-&gt;'’ + str(len(j.sequence))
            tempseq = j.sequence + k.sequence
            flist.append(tempname + '\n' + tempseq)

    return flist

which is similar to what Mike suggested. In his approach defaultdict is used instead of a regular dict, initialized as a list. Again we can take advantage of the fact that we that the value of defaultdict has the same features as the object that was used to initialized it. The same set approach is used and we have a nice set of merged sequences from two Pfam alignments. Remember that defaultdict were introduced in Python 2.5, it won’t work on older versions.

Design by j david macor.com.Original WP Theme & Icons by N.Design Studio
Entries RSS Comments RSS Log in