I haven’t posted in a while, so let’s get back to the last topic covered here, merging sequences from Pfam alignments. Two comments to my last post suggested some changes to the original code, and both comments made a considerable improvement to script. But following our line of thought here, there were many things in both posts that we haven’t covered in the series. In order to make it clear to anyone that is cronologically following the entries, we will see what new things were suggested in the comments.
We will start with Mike’s comment. The code is below.
def merge_seqs(data1, data2):
from collections import defaultdict
from itertools import chain
data = defaultdict(list)
for item in chain(data1, data2):
ident = item.name[item.name.find('|') + 1 : item.name.find('/')]
data[ident].append((item.name, item.sequence))
format = "%s-%s->%d\n%s%s"
flist = []
for key, value in data.iteritems():
if len(value) == 2:
jname, jseq = value[0]
kname, kseq = value[1]
flist.append(format % (jname, kname, len(jseq), jseq, kseq) )
return flist
Mike’s code would only run on Python 2.5, due to the import of collections. He imports only defaultdict from collections, which is similar to a regular dictionary except for the fact that is takes an extra argument, a factory function (that we will examine soon), and this factory function is called everytime a dict key is found for the the first time, initializing the object. The line
data = defaultdict(list)
creates and intializes a defaultdict from an empty list. This defaultdict is used to store data from a chain formed by both input files being compared. A chain makes an iterator that “returns elements from the first iterable until it is exhausted, the proceeds to the next iterable” (from Python docs). Basically a chain in our case will make the loop iterate first through data1 until its last item then will start the iteration through data2.
In the loop it will use the region in the FASTA sequence name as a defaultdict key, and store the fasta name and sequence as the value. Notice that the defaultdict, as it has been initialized as a list, allows appending more than one value to each key, so every time that it finds a sequence with a name identical to some key already present in the defaultdict, it will append the information to the value of that key in a list. So, all the matching is done with this short loop, we only need then to iterate through the keys and print the result.
For the results, Mike set up a format string (we will see in a future post) that receives the merged sequences and their names in a formatted way. The iteration along the defaultdict checks for both keys and values and every time it finds a value with a length of 2 (meaning two sequences shared identical names) it puts the items of the defaultdict’s value list into two tuples. These tuples, containing sequence and name, are then formatted for output and appended to a list of merged sequences.
There are a lot of new concepts, objects and methods but Mike’s suggestion shows how powerful Python is. For a set of 25 alignments, there is little performance improvement, but our coding ability improves exponentially with clever suggestions.
I will comment and explain Luke’s comments later.
May 3rd, 2008 at 8:48 am
Two suggestions for simplifying the code:
defaultdict is not really required here, the same could be done by initializing data to a normal empty dict (i.e. data={}) and using: data.setdefault(ident, []).append(…)
Generally, while this is nice for practicing your programming skills, itertools.groupby can do most of this work for you. Here’s how (not tested) - notice that this version is shorter and clearer:
def merge_seqs(data1, data2):
from itertools import chain, groupby
format = “%s-%s->%d\n%s%s”
flist = []
keyfunc = lambda it: it.name[it.name.find(’|') + 1 : it.name.find(’/')]
for item, g in groupby(sorted(chain(data1, data2), key=keyfunc), keyfunc):
values = list(g)
if len(values) == 2:
jname, jseq = values[0].name, values[0].sequence
kname, kseq = values[1].name, values[1].sequence
flist.append(format % (jname, kname, len(jseq), jseq, kseq) )
return flist
The only “ugliness” here is that groupby requires that the items in the iterator it works on be sorted in advance by the same key function. For this reason I usually just use a dict with setdefault, as I mentioned above, but in this case using groupby tidies up the code considerably.
IMO the best solution here is to use this simple groupby replacement:
def groupby_unsorted(iterator, key=lambda x:x):
“”"Like itertools.groupby, but doesn’t require that the given iterator be
sorted in advance, and doesn’t work with infinite iterators.
“”"
groups={}
for item in iterator:
groups.setdefault(key(item), []).append(item)
return groups.iteritems()
This way you can use the the second, more concise, version of the function above, without needed to first sort the data, by replacing the ugly:
groupby(sorted(chain(data1, data2), key=keyfunc), keyfunc)
with:
groupby_unsorted(chain(data1, data2), keyfunc)
Finally, I just want to say that this was a nice post, and I hope to see more like it
May 3rd, 2008 at 12:36 pm
Hi Tal
I will post your comment in a regular post in order to format it. Thanks a lot for your comment.
Cheers
May 6th, 2008 at 10:07 am
Just a personal opinion of style (it does not change the algorithms in any way): I prefer defaultdict over setdefault and def over named lambdas, they seem easier to read to me.
I prefer
groups=defaultdict(list)
groups[key(item)].append(item)
over
groups={}
groups.setdefault(key(item), []).append(item)
and
def keyfunc(it): return it.name[it.name.find(’|’) + 1 : it.name.find(’/’)]
over
keyfunc = lambda it: it.name[it.name.find(’|’) + 1 : it.name.find(’/’)]