Creating an interface for the motif finding script, part 1

motifs, wxPython Comments Off

And we are back. After much ado about real life, I am able to “restart” this blog and probably with a good frequency of posts. Last time we saw the final product of our motif finding series. We ended up creating a very elegant script in Python that efficiently counts words in FASTA sequences and then using a basic statistical method, calculates the significance of each word and output the overrepresented ones.

Our script used a little bit less than 50 lines, and if you include the imported fasta module, it won’t top 100. But the number of lines is not important. The efficiency, clarity and speed are key here. At the same time, running a script from the command line is not something everyone is used to do. In order to add more visibility to our simple script, why not including a GUI? With a visual interface, more people can use our script, in different systems. Sounds great.

Python has many options of GUI frameworks, some more cross-platform that others. In the end finding the right framework is more a matter of taste, or availability. My personal experience with wxWidgets lead me to start developing in wxPython, and for me this was a natural choice. But there are many other GUI frameworks for Python, each one providing more or less integration and portability (you can “choose” you own here).

So, let’s create a skeleton for our GUI. First step is to install wxPython. Packages for Windows are available from their website, RPMs for Linux and DMG for Macs (I’m quite sure OS X Leopard comes with wxPython by default, just test importing it). After installing it, start Python and check if everything is in place

import wx
wx.__version__

On my machine, I get no errors and the version is 2.8.9.1 (you don’t need the latest version to create the GUI). Everything seems to be fine. A wxPython script has the same format as any Python script, the only difference is that its output is not directed to the prompt or a file. The script’s product will be the screen, so in most cases the output and program usage will depend on the user’s interaction with objects on the screen. Like any other graphical interface. A very simple script would look like

#!/usr/bin/env python

import wx

class pymot(wx.App):

    def __init__(self, redirect=False):
        wx.App.__init__(self, redirect)

class pymotGUI(wx.Frame):

    def __init__(self, parent, id):
        wx.Frame.__init__(self, parent, id,  'Python Motif Finder', style=wx.DEFAULT_FRAME_STYLE)
        self.__do_layout()

    def __do_layout(self):
        pass

app = pymot()
frame = pymotGUI(parent=None, id = -1)
frame.Show()
app.MainLoop()

Usually a wxPython interface has three parts in its script: a class for the window/frame/dialog, a class for the application and a initialization routine. All wxPython applications, and scripts, need to derive an wx.App class and initialize it (on OnInit or on __init__ functions), i.e. create the window, begin the program, etc. Another class, derived from wx.Frame in this case, will build the window/frame/dialog per se and will also contain initialization for the window, objects, events, etc. The last part is the main script where the application is started, by calling the derived class, the window is also called and shown. The last line is the MainLoop, present in every wxPython script, and it is the main line of the script, the heart of the application. MainLoop processes all the events and manages how the objects interact by receiving and dispatching such events.

The script above could have been created differently, some lines of it omitted and there is also no need to derive an specific class for the frame. But this way it is easier to get a grasp of the script as it will need to be enlarged so accomodates the objects and maybe a couple of extra windows and dialogs. Running the above script will generate the window below

First screencap of our GUI

very simple and barebones. Next will explore the script above, include some extra elements and learn a little bit more of wxPython.

Python, overepresented motifs, the Grand Finale

Phase 2, motifs Comments Off

In this final part, let’s do some very simple refactoring and modify the output section to make the result a little bit better. There are not many options about the functions to calculate the binomial expansion. But Andrew posted some opinions on how to slight change the quorum function.

def get_quorums(seqs, mlen):
    """
    add seq id_no to a set
    use explicit counter to create seq_no
    """
    quorum = defaultdict(int)
    for seq in seqs:
        for n in range(len(seq) - mlen):
            quorum[seq[n:n + mlen]] += 1
    return quorum

His modifications were small but improved the code a bit, as you remove one variable/object from the function. At the same time there is need to change a bit our output section of the code, as we don’t use a defaultdict initialized with a set, but with an integer.

for i in foreground:
    term1 = choose(background[i], foreground[i])
    term2 = choose((N - background[i]), len(input_seqs)-1)
    term3 = choose(N, len(input_seqs))
    p = (float(term1) * float(term2)) / term3
    if 0 < p <= 0.0001:
        print i, foreground[i], background[i], p

Notice that in the term1 line we don’t check for the set length anymore and just use the integer stored in foreground and background. Again a small change, that can make the code a little bit more clear. But we need to modify this section so the output is a little bit more clear, maybe ordered by motif sequence.

But as we are reading the sequences as they are our results are not ordered. It would be great to have a final list starting with AAAAAAAA and ending with TTTTTTTTT. There is an easy way to do that, and very inexpensive regarding code and final performance. Basically we append each one of the motifs (and their extra information) to a list and use the sort method for lists. So our output section of the code will be

res_motifs = []
for i in foreground:
    term1 = choose(background[i], foreground[i])
    term2 = choose((N - background[i]), len(input_seqs)-1)
    term3 = choose(N, len(input_seqs))
    p = (float(term1) * float(term2)) / term3
    if 0 < p <= 0.0001:
        res_motifs.append(i + '\t' + str(foreground[i]) + '\t' + str(background[i]) + '\t' + str(p))

res_motifs.sort()
for i in res_motifs:
    print i

Putting everything together our final motif determination script is (batteries included):

#!/usr/bin/env python

import fasta
import sys
from collections import defaultdict

def choose(n, k):
    if 0 <= k <= n:
        ntok = 1
        ktok = 1
        for t in xrange(1, min(k, n - k) + 1):
            ntok *= n
            ktok *= t
            n -= 1
        return ntok // ktok
    else:
        return 0

def get_quorums(seqs, mlen):
    """
    add seq id_no to a set
    use explicit counter to create seq_no
    """
    quorum = defaultdict(int)
    for seq in seqs:
        for n in range(len(seq) - mlen):
            quorum[seq[n:n + mlen]] += 1
    return quorum

input_seqs = fasta.read_seqs(open(sys.argv[1]).readlines())
input_seqs2 = fasta.read_seqs(open(sys.argv[2]).readlines())

foreground = get_quorums(input_seqs, 10)
background = get_quorums(input_seqs2, 10)

N = len(input_seqs) + len(input_seqs2)

res_motifs = []
for i in foreground:
    term1 = choose(background[i], len(foreground[i])
    term2 = choose((N - background[i]), len(input_seqs)-1)
    term3 = choose(N, len(input_seqs))
    p = (float(term1) * float(term2)) / term3
    if 0 < p <= 0.0001:
        res_motifs.append(i + '\t' + str(foreground[i]) + '\t' + str(background[i]) + '\t' + str(p))

res_motifs.sort()
for i in res_motifs:
    print i

Next we will see some basic Python methods. And maybe start a new series and phase.

Reblog this post [with Zemanta]

Obtaining overrepresented motifs in DNA sequences, part 13

Phase 2, Section 3, Section 5, motifs 1 Comment »

Now that we have the best quorum determination function and the ideal function to calculate the binomial expansions it is easy to program a script to calculate the p value of motifs in DNA sequences. To the script

below in the code there are a couple of errors that wordpress don’t let me fix. The > and < are replaced by their literal html enconding. I am working on it, sorry

#!/usr/bin/env python

import fasta
import sys
from collections import defaultdict

def choose(n, k):
    if 0 <= k <= n:
        ntok = 1
        ktok = 1
        for t in xrange(1, min(k, n - k) + 1):
            ntok *= n
            ktok *= t
            n -= 1
        #print ntok // ktok
        return ntok // ktok
    else:
        return 0

def get_quorums(seqs, mlen):
    """
    add seq id_no to a set
    use explicit counter to create seq_no
    """
    quorum = defaultdict(set)
    id_no = 0
    for seq in seqs:
        id_no += 1
        for n in range(len(seq) - mlen):
            quorum[seq[n:n + mlen]].add(id_no)
    return quorum

input_seqs = fasta.read_seqs(open(sys.argv[1]).readlines())
input_seqs2 = fasta.read_seqs(open(sys.argv[2]).readlines())

foreground = get_quorums(input_seqs, 10)
background = get_quorums(input_seqs2, 10)

N = len(input_seqs) + len(input_seqs2)

for i in foreground:
    term1 = choose(len(background[i]), len(foreground[i]))
    term2 = choose((N - len(background[i])), len(input_seqs)-1)
    term3 = choose(N, len(input_seqs))
    p = (float(term1) * float(term2)) / term3
    if 0 < p <= 0.0001:
        print i, len(foreground[i]), len(background[i]), p

We already defined choose in the last post (more information in the link from the Python’s cookbook) and earlier Mike sent us a series of quorum-determination functions and one of the best was portrayed and explained here. We also need our fasta module to read the sequences (and only the sequences) in order to use it in the quorum function.

Basically we use the foreground and background files as input, determine the quorum of the different words (width 10) and then we iterate over the results, calculating the p value for each motif found in the foreground set. The tree terms of the Hypergeometric Distribution are calculated separately and we test for a p value smaller that 0.0001 (this can be modified) and we only print the results that fall in this category.>

Reblog this post [with Zemanta]

Obtaining overrepresented motifs in DNA sequences, part 10

Phase 2, motifs 6 Comments »

Let’s get back to the statistical module, that will calculate an Hypergeometric Distribution (HD) p value so we can define the overrepresented motifs. Last time we saw it, we just had defined the factorial function, which is immensely helpful in this case due to the number of factorial calculations needed in the HD. The factorial function was the one below


def fac(n):
    value = reduce(lambda i, j : i * j, range(1, n + 1))
    return value

but as mentioned in the comments by Dave and by Mike via email the method used is not the best method to calculate factorial in Python. The best approach in this case is to use operator.mul. All functions in the operator modules are in implemented in pure C and they mimic the same operators in Python. So in this module we can find mul for multiplication, sub for subtraction, add for additions, etc.

The operator.mul needs two arguments to multiply, and in our case we still need to use reduce to sum all the results from a series of multiplications. As parameters we should use a range, that can start with 2, that should go up to the number we want the factorial plus one. Finally our function would be


import operator

def fac(n):
    value = reduce(operator.mul, xrange(2, n+1))
    return value

The time gain, quickly measured in a non-scientific fashion in my system, is around 5 to 15%, depending on the factorial being calculated. It may seem a small gain, but when you need to calculate almost a million factorials for all possible motifs the amount of time saved is crucial. Next time we will be back with more statistics, expanding the module.

Zemanta Pixie

Obtaining overrepresented motifs in DNA sequences, part 9

Phase 2, motifs Comments Off

Back on new functions for motif quorums. We jump function 7 in order to explain “simpler” ones, 8 and 9. Both functions use generators. We’ve already seen here generators, which are functions that use the yield statement to generate iterators. The generator is very similar to a function but instead of returning a value, it yields one and waits for another call to resume. In function 8, a generator is used to return the motif sequence that is used as a key in the defaultdict. Notice the scope of the generator that is coded inside a function.

def get_quorums_08(seqs, mlen):
    """
    add seq id_no to a set
    use enumerate to create seq_no
    use an explicit generator to create the motifs
    """
    def motif_gen(seq):
        for n in range(len(seq)-mlen):
            yield seq[n:n+mlen]

    quorum = defaultdict(set)
    for id_no, seq in enumerate(seqs):
        for motif in motif_gen(seq):
            quorum[motif].add(id_no)

    return quorum

In function 9 a very similar structure is used but in this cases instead of a “pure” generator it uses generator expressions which a very similar to list comprehensions but with parentheses instead of square brackets. Generator expressions are generators that can be written in one line and have identical behaviour as generators coded in the “regular” inception. In the function below the generator expressions provide the iterator for the loop with motif as index. Very simple and elegant.

def get_quorums_09(seqs, mlen):
    """
    add seq id_no to a set
    use enumerate to create seq_no
    use a generator expression to create the motifs
    """
    quorum = defaultdict(set)
    for id_no, seq in enumerate(seqs):
        for motif in (seq[n:n+mlen] for n in range(len(seq)-mlen)):
            quorum[motif].add(id_no)

    return quorum

In the next post we will go back to the statistical module and soon we will see the remainder 5 functions sent by Mike.

Zemanta Pixie

Obtaining overrerpresented motifs in DNA sequences, part 8

Phase 2, motifs Comments Off

We keep on introducing Mike’s functions. This time there are a couple of Python methods that we haven’t seen here and need some introduction, izip and count. To use these two we also need to import new modules

from itertools import count, izip

count returns consecutive integers starting at a defined point (the method’s parameter). If empty it starts from zero. Basically, by starting a count it will give an iterable with a increasing integer values, in a fashion similar to a function with yield. Every time our loop accesses the count it will “remember” the last return value and increment it by one.

izip also returns an iterator, but from a list of iterables. It is basically used to iterate through a list of many iterables at the same time. In the function below it is used twice: one to generate a tuple (with count) with a sequence number and the sequence itself. The sequence in the tuple is then used in another izip to create the windows on the sequences to count motifs.

def get_quorums_06(seqs, mlen):
    """
    add seq id_no to a set
    use 'izip(count(),...) to create seq_no
    use 'izip(count(),range(...)) to create start/stop indices for motifs
    """
    quorum = defaultdict(set)
    for id_no, seq in izip(count(), seqs):
        for s, e in izip(count(), range(mlen, len(seq))):
            quorum[seq[s:e]].add(id_no)
    return quorum

In the next couple of posts we still be checking motif quorum functions. Stay tuned.

Obtaining overrepresented motifs in DNA sequences, part 7

Phase 2, motifs 1 Comment »

Continuing on Mike’s functions to obtain motif quorums. We see function 3, 4 and 5. Function get_quorums_03, uses an old friend of the blog, sets. Recall that sets are very similar to lists, but their are unordered and items are unique.

def get_quorums_03(seqs, mlen):
    """
    add seq id_no to a set
    use explicit counter to create seq_no
    """
    quorum = defaultdict(set)
    id_no = 0
    for seq in seqs:
        id_no += 1
        for n in range(len(seq)-mlen):
            quorum[seq[n:n+mlen]].add(id_no)
    return quorum

Basically, the sequence numbers (an incremented counter) are added to a defaultdict which was initialized as a set. This way you don’t need to check for the existence of the sequence number in the defaultdict list and count on the ability of set of being unique. Function 4 is very similar to function 3 with the difference of using enumerate (as in function 02) to make the sequence numbers.

def get_quorums_04(seqs, mlen):
    """
    add seq id_no to a set
    use 'enumerate' to create seq_no
    """
    quorum = defaultdict(set)
    for id_no, seq in enumerate(seqs):
        for n in range(len(seq)-mlen):
            quorum[seq[n:n+mlen]].add(id_no)
    return quorum

Function 5 adds a twist, which is to have an enumerate to set the sequence range (motif/word width) start and stop. This way the window is sliding based on the tuple created by the enumerate method and not on the slicing that were used in all other functions. Again, a defaultdict is initialized as set and the sequence numbers are generated by an enumerate.

def get_quorums_05(seqs, mlen):
    """
    add seq id_no to a set
    use 'enumerate' to create seq_no
    use enumerate(range(...)) to create start/stop indices for motif
    """
    quorum = defaultdict(set)
    for id_no, seq in enumerate(seqs):
        for s, e in enumerate(range(mlen, len(seq))):
            quorum[seq[s:e]].add(id_no)
    return quorum

Obtaining overrepresented motifs in DNA sequences, part 6

Phase 2, motifs Comments Off

We will take a break on developing the statistical module to obtain overrepresented motifs (I will introduce mul in the next stats post), and take a deeper look at the possibilities on obtaining the motif quorums. Mike DeHaemer, a regular commenter and contributor to the blog, sent me a Python script with 8 different ways distributed in 13 distinct functions for obtaining the motif quorums. I will take advantage of his contribution and post all of them, with some quick comments on each one of them (his code comments were kept in each function). After, a small benchmarking will be posted.

Most of the functions need to import a couple of module

from collections import defaultdict, deque
from itertools import count, izip, tee

and they have two parameters, a sequence list and the length of the motifs.

The first function uses again the defaultdict and it is very similar to the one used in the final version of the quorum script. The defaultdict is initialized as a list and the ids are added to a the list, keys are motifs, only if they are not already present in it. The sequence id is generated in a variable incremented each time the loop iterates.

def get_quorums_01(seqs, mlen):
    """
    append seq id_no to list after checking to see if already present
    use explicit counter to create seq_no
    """
    quorum = defaultdict(list)
    id_no = 0
    for seq in seqs:
        id_no += 1
        for n in range(len(seq) - mlen):
            if id not in quorum[seq[n:n+mlen]]:
                quorum[seq[n:n + mlen]].append(id_no)
    return quorum

The second function is very similar to the first one, with the caveat that sequence id numbers are generated with enumerate.

def get_quorums_02(seqs, mlen):
    """
    append seq id_no to list after checking to see if already present
    use 'enumerate' to create seq_no
    """
    quorum = defaultdict(list)

    for id_no, seq in enumerate(seqs):
        for n in range(len(seq) - mlen):
            if id_no not in quorum[seq[n:n+mlen]]:
                quorum[seq[n:n+mlen]].append(id_no)
    return quorum

enumerate is a object based on another iterable object. When called enumerate always returns a tuple of an indexed series. For instance, in our case above, enumerate will return a series of tuples (0, sequence1), (1, sequence2) ... (n, sequenceN). That’s the reason the enumerate loop uses a tuple as its index

for id_no, seq in enumerate(seqs)

Next couple of posts will cover the other functions sent by Mike. Then we will go back to the statistical module.

Zemanta Pixie

Obtaining overrepresented motifs in DNA sequences, part 5

Phase 2, motifs 3 Comments »

Now that we have the script to generate the word quorums working (and working fast!) we need then to calculate the a p value for each motif based on the fore and background quorums. A p value cut-off will determine the statistically significant words, or overrepresented. These overrepresented words then can be analysed in more details (that we won’t see here) and for instance determine new or already known transcription factor binding sites.

A well established statistical method to determine such overrepresented words is the Hypergeometric Distribution (HD for short). HD measures “success” and “failures” for values that do not fit in the binomial distribution, and depend on the measurements without replacement.

Basically, HD’s equation has a a series of binomial coefficients/combinations

HD equation

where N is the population size, m is foreground cluster size, k is the motif quorum in the background gene set and x is the word quorum in the foreground set. Note that the above equation is for the cumulative HD, where a sum of probabilities is calculated.

All the combinations in the above equation have to be expanded to factorials that depending on the value to be calculated are very computer intensive and sometimes don’t fit in the memory (either a float or integer). But Python is able to handle very large numbers and the calculation of large factorials is relatively fast.

In C++, I had to use a couple of tricks to achieve a good speed in the factorial determination, and specially in the HD calculation that requires multiple factorials and multiplication, division and subtraction of large numbers. I didn’t want to use any mathematical trick such as Stirling’s approximation. 13! in C++ already blows the size of long, so I had to use the MAPM, A Portable Arbitrary Precision Math Library in C. This library is quite fast to calculate the factorial values but when one needs to calculate more than 200,000 factorials the speed is unbearable. So, I decided to pre-calculate a series of factorial values, keeping 10 decimal places as precision and saving in another column their exponential. Then using this table as an input I was able to multiply, divide and subtract the factorials and by employing the first law of exponents do the same operations with their exponential. This speeds up the process tremendously.

In Python, we don’t need any extra third-party library, we just use Python itself, without importing an extra module. A factorial function in Python can be written in one line, but for clarity is better to define it separately. We can try throwing any number at it and see the result.


def fac(n):
    value = reduce(lambda i, j : i * j, range(1, n + 1))
    return value

We already saw reduce and lambda and using these two methods make the factorial function clear and simple. And why are we not using a recursive function? Because Python has a limit recursion depth (1000). Next time we will implement the code that calculates the HD p values.

Obtaining overrepresented motifs in DNA sequences, part 4

Phase 2, motifs 4 Comments »

We found a way to make the Python script as good as or better than the C++ executable. But for the analysis we need to do, motif counts are not the value we want. We need the quorum: the number of sequences the motif is present at least once. For instance, if the desired motifs was AAACCCTTTG we will check in which sequences this word was present. Let’s say in a cluster of 10 sequences, we would find it in sequences 1, 2, 3, 4 and 5, giving us a quorum of 5 out of 10, or 50%. The quorum will be used in the future in the statistical calculation in order to determine the overrepresented motifs.

With only a couple of modifications, we can adapt the script used to get the motif counts to get the quorum.

#!/scratch/python/bin/python

from collections import defaultdict
import sys
import fasta

seqs = fasta.get_seqs(open(sys.argv[1]).readlines())
length = int(sys.argv[2])

quorum = defaultdict(list)

seq_number = 0
for i in seqs:
    seq_number += 1
    for n in range(len(i.sequence) - int(length)):
        if not seq_number in quorum[i.sequence[n : n + length]]:
            quorum[i.sequence[n : n + length]].append(seq_number)

for i in quorum:
    print ''.join(i).upper(), len(quorum[i])

Basically, we change the way the defaultdict is initialized, this time as a list instead of int and we also change the procedure that used to get the counts. The loop does identical work, iterating along the sequences, with a window (of the input length) sliding on them and checking each word. This time instead of incrementing the value of the defaultdict, we append to the list the sequence number, obtained from a index integer variable (incremented in each iteration of the loop), if this number is not already in the list value. In the end each value of quorum will be a list os numbers and by printing the list length we obtain the quorum. Testing the above script there is no performance loss when comparing to the previous count script.

Next we will see which statistical method to use and start to devising an script to calculate it.

Design by j david macor.com.Original WP Theme & Icons by N.Design Studio
Entries RSS Comments RSS Log in