Python sets, intersections, reduce and more RSS
Mar 28

A colleague came with a “problem”: what would be the most efficient way to merge Pfam alignments? He had FASTA files containing sequences and he wanted to find identical IDs in two files and merge the related sequences from different domains of the same protein. His C++ approach was taking too long to run so I jumped in to help with some Python tricks.

Fasta headers of Pfam alignments look like this

>P00526.2|SRC_RSVP/147-229

where the first section, before the pipe, is the protein family, the section between the pipe and the slash is the protein ID and the after the slash are the start and end positions. Basically we want to match the protein ID, between | and /, which is the only section that should not change from one alignment from the other, if there are similar sequences in both files.

His dataset size was around 8000 FASTA files and he wanted to compare all-versus-all. The first idea that comes to mind is to read the two files and basically run nested loops checking for matches between two sequence names. Simple and easy. But it would be good to find something else to do in the meantime, instead of checking for its progress.

Another idea, to simplify the process, is to read both files, extract the protein IDs and check for the intersection of the two sets of IDs and then just extract the sequences from a list of matches.

def merge_seqs(data1, data2):

    myset1, myset2 = Set([]), Set([])

    for i in data1:
        myset1.add(i.name[i.name.find('|')+1:i.name.find('/')])

    for i in data2:
        myset2.add(i.name[i.name.find('|')+1:i.name.find('/')])

    mylist = Set.intersection(myset1, myset2)

    flist = []
    for i in mylist:
        for j in data1:
            if j.name[j.name.find('|')+1:j.name.find('/')] == i:
                for k in data2:
                    if k.name[k.name.find('|')+1:k.name.find('/')] == j.name[j.name.find('|')+1:j.name.find('/')]:
                        tempname = j.name + '-' + k.name + '->' + str(len(j.sequence))
                        tempseq = j.sequence + k.sequence
                        flist.append(tempname + '\n' + tempseq)

    return flist

This is the function that matches the sequence IDs. As we are reading the FASTA file with the functions/classes in the fasta module there is no* direct way to transform the list of sequence names into a set, that would make easier to extract the intersection between two sets of IDs from distinct files. In the above function, data1 and data2 are instances of the fasta class, and we read these lists and add the fasta.name to a couple of sets and get the intersection of them. The intersection method returns a list that we use in the nested loops to find the matching sequences, as we need to merge them, we add both names and residues. Simple and easy. All merged sequences, and their respective merged names, are appended to a list that is returned at the end of the function.

*if there is a more direct way, just let me know

3 Responses to “Merging sequences from a Pfam alignment: using sets”

  1. Luke Says:

    Are there duplicate IDs in a single file? If not, no reason to perform the intersection twice (once by set, once by nested loop). Sets and dictionaries are both hashes, so just remember the instances with the id:

    def merge_seqs(data1, data2):
    first, second = dict(), dict()
    for i in data1:
    first[i.name[i.name.find(’|')+1:i.name.find(’/')]] = i

    for i in data2:
    second[i.name[i.name.find(’|')+1:i.name.find(’/')]] = i

    shared_ids = set(first).intersection(set(second))

    flist = []
    for i in shared_ids:
    j = first[i]
    k = second[i]
    tempname = j.name + ‘-’ + k.name + ‘->’ + str(len(j.sequence))
    tempseq = j.sequence + k.sequence
    flist.append(tempname + ‘\n’ + tempseq)
    return flist

    If there are duplicate IDs in a file, and you want the behavior of your original script of giving every combination _between_ files, how about a dictionary of lists and the cross-product of j & k? (I’m going to use python2.5’s collections.defaultdict, if you’re on an earlier version use .setdefault . The cross product brings back a nested loop, but it’s only over values we want rather than the whole datasets.)

    from collections import defaultdict
    def merge_seqs(data1, data2):
    first, second = defaultdict(list), defaultdict(list)
    for i in data1:
    first[i.name[i.name.find(’|')+1:i.name.find(’/')]].append(i)

    for i in data2:
    second[i.name[i.name.find(’|')+1:i.name.find(’/')]].append(i)

    shared_ids = set(first).intersection(set(second))

    flist = []
    for i in shared_ids:
    cross = [(a,b) for a in first[i] for b in second[i]]
    for j,k in cross:
    tempname = j.name + ‘-’ + k.name + ‘->’ + str(len(j.sequence))
    tempseq = j.sequence + k.sequence
    flist.append(tempname + ‘\n’ + tempseq)
    return flist

    And if you want combinations from within a file as well, just use one dictionary of lists and cross each list value with itself, excluding identical elements:

    cross = [(a,b) for a in data[i] for b in data[i] if a.name != b.name]

    To your question of more directly creating the sets of names, can’t be much more direct, but perhaps more concise using a generator (or list, with the extra cost) comprehension:

    myset1 = set(i.name… for i in data1)

    Could do equivalently for the dict-based first example above, noting that the dict() constructor can take a sequence of (key, value) tuples. This would not work for the defaultdict or setdefault based version.

    - Luke

  2. Paulo Nuin Says:

    I will include your comment on the entry because of the formatting loss, and I will update the entry with some other comments.

    Thanks a lot for the input.

  3. Mike Says:

    Why not build a dictionary of the fasta data keyed by protein ID. The value of each dictionary entry would be a list of (name,sequence) tuples. Then iterate over the dictionary items and build the output list from dictionary entries of length == 2.

    Obviously, I don’t have any data sets to test it on, but something like this might work:

    def merge_seqs(data1, data2):
    from collections import defaultdict
    from itertools import chain

    data = defaultdict(list)

    for item in chain(data1, data2):
    ident = i.name[i.name.find(’|')+1:i.name.find(’/')]

    data[ident].append( (i.name, i.sequence) )

    format = “%s-%s->%d\n%s%s”
    flist = [ ]
    for key,value in data.iteritems():
    if len(value) == 2:
    jname,jseq = value[0]
    kname,kseq = value[1]
    flist.append(fmt % (jname, kname, len(jseq), jseq, kseq) )

    return flist

Leave a Reply