May 08

As mentioned on the last post I am moving the current repository, an html page, to an actual Git repository on github.com. The link to the repository is

http://github.com/nuin/beginning-python-for-bioinformatics/tree/master

and it can be accessed by anyone. There are only a handful of scripts there but I am slowly adding more comments to the scripts and moving them to github. The web interface at github is pretty nice and the code can be viewed on the website with a nice code highlighting, for example. Also there is an RSS feed to receive updates, commits, etc.

How to create a local copy

Git is very easy to use and it is very simple to create a local copy of the repository on your local machine. Git is available on most systems as a command-line utility (there is a gui but I haven’t used yet) and to have an updated copy of the Beginning Python for Bioinformatics, two commands are needed.

- first you have to clone the repository

$ git clone git://github.com/nuin/beginning-python-for-bioinformatics.git

that will create a beginning-python-for-bioinformatics directory wherever you run git (this only needs to be done once)

and to keep the clone updated

$ git pull

from inside the clone directory.

Any questions please let me know.

May 06

I am transferring the current repository, which is “hosted” in a static page on th blog, to a Git repository on github.com. I am posting soon a quick guide on how to access it. I am also making all relevant entries from the blog into stand alone html pages for off-line browsing.

Anyone with Git experience the link to the repository is here (not many commits yet, but it is coming).

May 05

One of the things I like about Python and the Python community is the search for the making code simple and clear. Tal left a comment in the last post about merging Pfam alignment sequences suggesting another approach to our problem. The code is below

def merge_seqs(data1, data2):
    from itertools import chain, groupby
    format = "%s-%s->%d\n%s%s"
    flist = []
    keyfunc = lambda it: it.name[it.name.find('|') + 1 : it.name.find('/')]
    for it, g in groupby(sorted(chain(data1, data2), key=keyfunc), keyfunc):
        values = list(g)
        if len(values) == 2:
            jname, jseq = values[0].name, values[0].sequence
            kname, kseq = values[1].name, values[1].sequence
            flist.append(format % (jname, kname, len(jseq), jseq, kseq) )

    return flist

The code also uses the itertools module, importing chains and groupby. We already saw chains in the previous post, but groupby is new to us here. groupby was introduced in the 2.4 version of Python and is a method returns keys and groups from an iterable. An Python iterable is any object that can return its elemements at given time, for instance in a for loop, while the index of this loop is the iterator. So, in our case groupby will return the sequence names based on the lambda function defined before the groupby and the chain method. Usually groupby has this syntax

groupby(iterable[, key])

The key is optional, and in our case it is the lambda function. Another method new to use that uses the same lambda function is sorted. As its name hints, sorted returns a sorted list of iterables. The key in this case is the sorting algorithm, that actually creates the comparison between items.

Basically in the code above, a lambda function extracts the desired regions from the sequence names, which are them iterated in a groupby method that returns they key values, one value when the sequence is unique, two values when there are two sequences, of a sorted iterable generated by a chain that read both input lists in one pass. After this we just need to check the number of returned values and we have our list of matching sequences.

May 03

Continuing on the Pfam alignment sequence merge, Luke provided two solutions, one for case where we have duplicates in the file (that happens) and anothet one where duplicates are not tested. First for the case with no duplicates:

def merge_seqs(data1, data2):
    first, second = dict(), dict()
    for i in data1:
        first[i.name[i.name.find('|') + 1:i.name.find('/')]] = i

    for i in data2:
        second[i.name[i.name.find('|') + 1:i.name.find('/')]] = i

    shared_ids = set(first).intersection(set(second))

    flist = []
    for i in shared_ids:
        j = first[i]
        k = second[i]
        tempname = j.name + '-' + k.name + '->' + str(len(j.sequence))
        tempseq = j.sequence + k.sequence
        flist.append(tempname + '\n' + tempseq)

    return flist

Basically, his approach is to create two dictionaries, first and second to store the data from our lists of FASTA objects. The dictionaries keys are the sections fo each sequence name, and the values are the FASTA object created when we read the file. After that, a set of both dictionaries is created, and this represents everything we need in order to find shared sequences between both files. The output is trivial, basically iterating the set and getting the values from both dictionaries.

But, in some cases there are some duplicates in each Pfam alignment, and this would make the above approach to miss some sequences. Luke also provided a solution for this case

from collections import defaultdict
def merge_seqs(data1, data2):
    first, second = defaultdict(list), defaultdict(list)
    for i in data1:
        first[i.name[i.name.find('|') + 1:i.name.find('/')]].append(i)

    for i in data2:
        second[i.name[i.name.find('|') + 1:i.name.find('/')]].append(i)

    shared_ids = set(first).intersection(set(second))

    flist = []
    for i in shared_ids:
        cross = [(a,b) for a in first[i] for b in second[i]]
        for j,k in cross:
            tempname = j.name + '-' + k.name + '->'’ + str(len(j.sequence))
            tempseq = j.sequence + k.sequence
            flist.append(tempname + '\n' + tempseq)

    return flist

which is similar to what Mike suggested. In his approach defaultdict is used instead of a regular dict, initialized as a list. Again we can take advantage of the fact that we that the value of defaultdict has the same features as the object that was used to initialized it. The same set approach is used and we have a nice set of merged sequences from two Pfam alignments. Remember that defaultdict were introduced in Python 2.5, it won’t work on older versions.

May 02

I haven’t posted in a while, so let’s get back to the last topic covered here, merging sequences from Pfam alignments. Two comments to my last post suggested some changes to the original code, and both comments made a considerable improvement to script. But following our line of thought here, there were many things in both posts that we haven’t covered in the series. In order to make it clear to anyone that is cronologically following the entries, we will see what new things were suggested in the comments.

We will start with Mike’s comment. The code is below.

def merge_seqs(data1, data2):
    from collections import defaultdict
    from itertools import chain

    data = defaultdict(list)

    for item in chain(data1, data2):
        ident = item.name[item.name.find('|') + 1 : item.name.find('/')]
        data[ident].append((item.name, item.sequence))

    format = "%s-%s->%d\n%s%s"
    flist = []
    for key, value in data.iteritems():
        if len(value) == 2:
            jname, jseq = value[0]
            kname, kseq = value[1]
            flist.append(format % (jname, kname, len(jseq), jseq, kseq) )

    return flist

Mike’s code would only run on Python 2.5, due to the import of collections. He imports only defaultdict from collections, which is similar to a regular dictionary except for the fact that is takes an extra argument, a factory function (that we will examine soon), and this factory function is called everytime a dict key is found for the the first time, initializing the object. The line

data = defaultdict(list)

creates and intializes a defaultdict from an empty list. This defaultdict is used to store data from a chain formed by both input files being compared. A chain makes an iterator that “returns elements from the first iterable until it is exhausted, the proceeds to the next iterable” (from Python docs). Basically a chain in our case will make the loop iterate first through data1 until its last item then will start the iteration through data2.

In the loop it will use the region in the FASTA sequence name as a defaultdict key, and store the fasta name and sequence as the value. Notice that the defaultdict, as it has been initialized as a list, allows appending more than one value to each key, so every time that it finds a sequence with a name identical to some key already present in the defaultdict, it will append the information to the value of that key in a list. So, all the matching is done with this short loop, we only need then to iterate through the keys and print the result.

For the results, Mike set up a format string (we will see in a future post) that receives the merged sequences and their names in a formatted way. The iteration along the defaultdict checks for both keys and values and every time it finds a value with a length of 2 (meaning two sequences shared identical names) it puts the items of the defaultdict’s value list into two tuples. These tuples, containing sequence and name, are then formatted for output and appended to a list of merged sequences.

There are a lot of new concepts, objects and methods but Mike’s suggestion shows how powerful Python is. For a set of 25 alignments, there is little performance improvement, but our coding ability improves exponentially with clever suggestions.

I will comment and explain Luke’s comments later.

Apr 29

I am preparing a couple of posts based on the latest entries and some other subject and after that we going to check again functional programming in Python geared to Bioinformatics. I am also planning on covering topics on Mastering Perl for Bioinformatics and convert them to Python.

Apart from that I already have some suggestions from friends and colleagues, but I am still looking for some subjects that the readers would like to see covered here. Leave me a message or a comment below with some Bioinformatics subject that would be interesting to write about.

On another note, the blog now is syndicated on Planet Python and Unofficial Planet Python. Welcome.

Apr 14

I had problems installing Google’s AppEngine in Vista. I had Python 2.5.1 installed in my machine but every time I tried to install the msi package it failed, claiming that Python was not present, even though C:\Python25 was in the path. AppEngine issues site did not help much either, the “solution” listed there was to make sure Python was in the path.

So, I decided to start over. I removed Python (and ActiveState Python, which I installed before to see if AppEngine would work) and re-installed it, or tried to. Strangely, Python’s msi package was installing it in the C drive root, not under Python25. For half an hour I tested all possible combinations, versions and tricks to have it installed in the proper directory/folder. Then I remembered msiexec, a command line tool that runs msi packages. Running msiexec with no parameters shows a window with options, but basically /i is the the only required. /i tells msiexec the input package, while ‘code>/qb make the installation quiet and with a basic interface. msiexec worked flawlessly and then Python was in its right place.

msiexec /i python-2.5.2.msi TARGETDIR="C:\Python25" /qb

Then the same “trick” was used with AppEngine, but without a TARGETDIR. Bingo, it installed perfectly (I am assuming that).

Now, I just have to wait for my AppEngine account.

Mar 30

Mike and Luke left comments and suggestions on how to attack the problem from the previous entry. I copied their comments here in order to maintain code formatting a little bit better (comments do not have code highlighting and formatting).

Mike:
Why not build a dictionary of the fasta data keyed by protein ID. The value of each dictionary entry would be a list of (name,sequence) tuples. Then iterate over the dictionary items and build the output list from dictionary entries of length == 2.

Obviously, I don’t have any data sets to test it on, but something like this might work:

def merge_seqs(data1, data2):
    from collections import defaultdict
    from itertools import chain

    data = defaultdict(list)

    for item in chain(data1, data2):
        ident = i.name[i.name.find(’|')+1:i.name.find(’/')]
        data[ident].append( (i.name, i.sequence) )

    format = “%s-%s->%d\n%s%s”
    flist = [ ]
    for key,value in data.iteritems():
        if len(value) == 2:
            jname,jseq = value[0]
            kname,kseq = value[1]
            flist.append(fmt % (jname, kname, len(jseq), jseq, kseq) )

    return flist

Luke:

Are there duplicate IDs in a single file? If not, no reason to perform the intersection twice (once by set, once by nested loop). Sets and dictionaries are both hashes, so just remember the instances with the id:

def merge_seqs(data1, data2):
    first, second = dict(), dict()
    for i in data1:
        first[i.name[i.name.find(’|')+1:i.name.find(’/')]] = i

    for i in data2:
        second[i.name[i.name.find(’|')+1:i.name.find(’/')]] = i

    shared_ids = set(first).intersection(set(second))

    flist = []
    for i in shared_ids:
        j = first[i]
        k = second[i]
        tempname = j.name + ‘-’ + k.name + ‘->’ + str(len(j.sequence))
        tempseq = j.sequence + k.sequence
        flist.append(tempname + ‘\n’ + tempseq)

    return flist

If there are duplicate IDs in a file, and you want the behavior of your original script of giving every combination _between_ files, how about a dictionary of lists and the cross-product of j & k? (I’m going to use python2.5’s collections.defaultdict, if you’re on an earlier version use .setdefault . The cross product brings back a nested loop, but it’s only over values we want rather than the whole datasets.)

from collections import defaultdict
def merge_seqs(data1, data2):
    first, second = defaultdict(list), defaultdict(list)
    for i in data1:
        first[i.name[i.name.find(’|')+1:i.name.find(’/')]].append(i)

    for i in data2:
        second[i.name[i.name.find(’|')+1:i.name.find(’/')]].append(i)

    shared_ids = set(first).intersection(set(second))

    flist = []
    for i in shared_ids:
        cross = [(a,b) for a in first[i] for b in second[i]]
        for j,k in cross:
            tempname = j.name + ‘-’ + k.name + ‘->’ + str(len(j.sequence))
            tempseq = j.sequence + k.sequence
            flist.append(tempname + ‘\n’ + tempseq)

    return flist

And if you want combinations from within a file as well, just use one dictionary of lists and cross each list value with itself, excluding identical elements:

cross = [(a,b) for a in data[i] for b in data[i] if a.name != b.name]

To your question of more directly creating the sets of names, can’t be much more direct, but perhaps more concise using a generator (or list, with the extra cost) comprehension:

myset1 = set(i.name… for i in data1)

Could do equivalently for the dict-based first example above, noting that the dict() constructor can take a sequence of (key, value) tuples. This would not work for the defaultdict or setdefault based version.

—–
Thanks a lot for the great comments. I will make some changes in my original code, with the nice things I learned from you.

Mar 28

I added a more evident link to the RSS feed on the sidebar. If you haven’t already subscribed, please do so. The feed is now syndicated through the Unofficial Planet Python.

Mar 28

A colleague came with a “problem”: what would be the most efficient way to merge Pfam alignments? He had FASTA files containing sequences and he wanted to find identical IDs in two files and merge the related sequences from different domains of the same protein. His C++ approach was taking too long to run so I jumped in to help with some Python tricks.

Fasta headers of Pfam alignments look like this

>P00526.2|SRC_RSVP/147-229

where the first section, before the pipe, is the protein family, the section between the pipe and the slash is the protein ID and the after the slash are the start and end positions. Basically we want to match the protein ID, between | and /, which is the only section that should not change from one alignment from the other, if there are similar sequences in both files.

His dataset size was around 8000 FASTA files and he wanted to compare all-versus-all. The first idea that comes to mind is to read the two files and basically run nested loops checking for matches between two sequence names. Simple and easy. But it would be good to find something else to do in the meantime, instead of checking for its progress.

Another idea, to simplify the process, is to read both files, extract the protein IDs and check for the intersection of the two sets of IDs and then just extract the sequences from a list of matches.

def merge_seqs(data1, data2):

    myset1, myset2 = Set([]), Set([])

    for i in data1:
        myset1.add(i.name[i.name.find('|')+1:i.name.find('/')])

    for i in data2:
        myset2.add(i.name[i.name.find('|')+1:i.name.find('/')])

    mylist = Set.intersection(myset1, myset2)

    flist = []
    for i in mylist:
        for j in data1:
            if j.name[j.name.find('|')+1:j.name.find('/')] == i:
                for k in data2:
                    if k.name[k.name.find('|')+1:k.name.find('/')] == j.name[j.name.find('|')+1:j.name.find('/')]:
                        tempname = j.name + '-' + k.name + '->' + str(len(j.sequence))
                        tempseq = j.sequence + k.sequence
                        flist.append(tempname + '\n' + tempseq)

    return flist

This is the function that matches the sequence IDs. As we are reading the FASTA file with the functions/classes in the fasta module there is no* direct way to transform the list of sequence names into a set, that would make easier to extract the intersection between two sets of IDs from distinct files. In the above function, data1 and data2 are instances of the fasta class, and we read these lists and add the fasta.name to a couple of sets and get the intersection of them. The intersection method returns a list that we use in the nested loops to find the matching sequences, as we need to merge them, we add both names and residues. Simple and easy. All merged sequences, and their respective merged names, are appended to a list that is returned at the end of the function.

*if there is a more direct way, just let me know