One of the things I like about Python and the Python community is the constant search for ways to make code simple and clear. Tal left a comment on the last post, about merging Pfam alignment sequences, suggesting another approach to our problem. The code is below:
def merge_seqs(data1, data2):
    from itertools import chain, groupby
    # Output template: both names, the sequence length, then both sequences
    format = "%s-%s->%d\n%s%s"
    flist = []
    # Key: the part of the name between '|' and '/'
    keyfunc = lambda it: it.name[it.name.find('|') + 1 : it.name.find('/')]
    # Sort by the key first so that groupby sees matching items side by side
    for it, g in groupby(sorted(chain(data1, data2), key=keyfunc), keyfunc):
        values = list(g)
        if len(values) == 2:
            jname, jseq = values[0].name, values[0].sequence
            kname, kseq = values[1].name, values[1].sequence
            flist.append(format % (jname, kname, len(jseq), jseq, kseq))
    return flist
The code also uses the itertools module, importing chain and groupby. We already saw chain in the previous post, but groupby is new to us here. groupby was introduced in version 2.4 of Python and is a function that returns consecutive keys and groups from an iterable. A Python iterable is any object that can return its elements one at a time, for instance in a for loop; the iterator is the object that keeps track of the position during that traversal. So, in our case groupby will return each key (the part of the sequence name extracted by the lambda function defined just before it) together with the group of items that share that key. Usually groupby has this syntax:
groupby(iterable[, key])
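The key is optional, and in our case it is the lambda function. Another function new to us that uses the same lambda function is sorted. As its name hints, sorted returns a new sorted list built from the items of an iterable. The key in this case is not the sorting algorithm itself: it is a function applied to each item to produce the value used in the comparison between items. Sorting matters here because groupby only groups items that are next to each other.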
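To see why the list is sorted before it is grouped, here is a minimal sketch. The names and the key helper are made up for illustration, they do not come from the Pfam files; the helper only repeats the slicing done by the lambda in merge_seqs on plain strings.

```python
from itertools import groupby

# Hypothetical names in the style "family|ID/range"
names = ["fam1|SEQ_A/1-4", "fam1|SEQ_B/1-4", "fam2|SEQ_A/5-8"]

def key(name):
    # Same slicing as the lambda in merge_seqs, applied to a plain string
    return name[name.find('|') + 1 : name.find('/')]

# Without sorting, the two SEQ_A entries are not adjacent,
# so groupby puts them in separate groups
unsorted_groups = [(k, list(g)) for k, g in groupby(names, key)]

# After sorting by the same key, each ID gives exactly one group
sorted_groups = [(k, list(g)) for k, g in groupby(sorted(names, key=key), key)]

print(len(unsorted_groups))
print(sorted_groups)
```

With the unsorted list groupby yields three groups (SEQ_A, SEQ_B, SEQ_A again), while the sorted list yields two, one per ID.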
Basically, in the code above a lambda function extracts the desired region from each sequence name. chain reads both input lists in one pass, sorted orders the combined items by that region, and groupby then returns each key with its group of values: one value when the sequence appears in only one list, two values when it appears in both. After this we just need to check the number of returned values and we have our list of matching sequences.
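As a quick check, the function can be run on a couple of toy records. This is only a sketch: the Record namedtuple is a hypothetical stand-in for whatever objects the parser from the previous post produces (all it needs are name and sequence attributes), and the names and sequences are invented. The function is repeated here so the example runs on its own.

```python
from collections import namedtuple
from itertools import chain, groupby

# Hypothetical stand-in for the parsed records; only .name and .sequence are used
Record = namedtuple("Record", ["name", "sequence"])

def merge_seqs(data1, data2):
    format = "%s-%s->%d\n%s%s"
    flist = []
    keyfunc = lambda it: it.name[it.name.find('|') + 1 : it.name.find('/')]
    for it, g in groupby(sorted(chain(data1, data2), key=keyfunc), keyfunc):
        values = list(g)
        if len(values) == 2:
            jname, jseq = values[0].name, values[0].sequence
            kname, kseq = values[1].name, values[1].sequence
            flist.append(format % (jname, kname, len(jseq), jseq, kseq))
    return flist

# Invented data: only SEQ_B is present in both lists
data1 = [Record("fam1|SEQ_A/1-4", "ACDE"), Record("fam1|SEQ_B/1-4", "FGHI")]
data2 = [Record("fam2|SEQ_B/5-8", "KLMN"), Record("fam2|SEQ_C/5-8", "PQRS")]

merged = merge_seqs(data1, data2)
for entry in merged:
    print(entry)
```

Only SEQ_B shows up in both lists, so merged holds a single entry with both names, the length of the first sequence, and the two sequences joined together.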