Revisting Pfam alignments, again …

Phase 2 No Comments »

Continuing on the Pfam alignment sequence merge, Luke provided two solutions, one for case where we have duplicates in the file (that happens) and anothet one where duplicates are not tested. First for the case with no duplicates:

def merge_seqs(data1, data2):
    first, second = dict(), dict()
    for i in data1:
        first[i.name[i.name.find('|') + 1:i.name.find('/')]] = i

    for i in data2:
        second[i.name[i.name.find('|') + 1:i.name.find('/')]] = i

    shared_ids = set(first).intersection(set(second))

    flist = []
    for i in shared_ids:
        j = first[i]
        k = second[i]
        tempname = j.name + '-' + k.name + '->' + str(len(j.sequence))
        tempseq = j.sequence + k.sequence
        flist.append(tempname + '\n' + tempseq)

    return flist

Basically, his approach is to create two dictionaries, first and second to store the data from our lists of FASTA objects. The dictionaries keys are the sections fo each sequence name, and the values are the FASTA object created when we read the file. After that, a set of both dictionaries is created, and this represents everything we need in order to find shared sequences between both files. The output is trivial, basically iterating the set and getting the values from both dictionaries.

But, in some cases there are some duplicates in each Pfam alignment, and this would make the above approach to miss some sequences. Luke also provided a solution for this case

from collections import defaultdict
def merge_seqs(data1, data2):
    first, second = defaultdict(list), defaultdict(list)
    for i in data1:
        first[i.name[i.name.find('|') + 1:i.name.find('/')]].append(i)

    for i in data2:
        second[i.name[i.name.find('|') + 1:i.name.find('/')]].append(i)

    shared_ids = set(first).intersection(set(second))

    flist = []
    for i in shared_ids:
        cross = [(a,b) for a in first[i] for b in second[i]]
        for j,k in cross:
            tempname = j.name + '-' + k.name + '->'’ + str(len(j.sequence))
            tempseq = j.sequence + k.sequence
            flist.append(tempname + '\n' + tempseq)

    return flist

which is similar to what Mike suggested. In his approach defaultdict is used instead of a regular dict, initialized as a list. Again we can take advantage of the fact that we that the value of defaultdict has the same features as the object that was used to initialized it. The same set approach is used and we have a nice set of merged sequences from two Pfam alignments. Remember that defaultdict were introduced in Python 2.5, it won’t work on older versions.

Revisiting Pfam alignments: using defaultdicts, chains …

Phase 2 3 Comments »

I haven’t posted in a while, so let’s get back to the last topic covered here, merging sequences from Pfam alignments. Two comments to my last post suggested some changes to the original code, and both comments made a considerable improvement to script. But following our line of thought here, there were many things in both posts that we haven’t covered in the series. In order to make it clear to anyone that is cronologically following the entries, we will see what new things were suggested in the comments.

We will start with Mike’s comment. The code is below.

def merge_seqs(data1, data2):
    from collections import defaultdict
    from itertools import chain

    data = defaultdict(list)

    for item in chain(data1, data2):
        ident = item.name[item.name.find('|') + 1 : item.name.find('/')]
        data[ident].append((item.name, item.sequence))

    format = "%s-%s->%d\n%s%s"
    flist = []
    for key, value in data.iteritems():
        if len(value) == 2:
            jname, jseq = value[0]
            kname, kseq = value[1]
            flist.append(format % (jname, kname, len(jseq), jseq, kseq) )

    return flist

Mike’s code would only run on Python 2.5, due to the import of collections. He imports only defaultdict from collections, which is similar to a regular dictionary except for the fact that is takes an extra argument, a factory function (that we will examine soon), and this factory function is called everytime a dict key is found for the the first time, initializing the object. The line

data = defaultdict(list)

creates and intializes a defaultdict from an empty list. This defaultdict is used to store data from a chain formed by both input files being compared. A chain makes an iterator that “returns elements from the first iterable until it is exhausted, the proceeds to the next iterable” (from Python docs). Basically a chain in our case will make the loop iterate first through data1 until its last item then will start the iteration through data2.

In the loop it will use the region in the FASTA sequence name as a defaultdict key, and store the fasta name and sequence as the value. Notice that the defaultdict, as it has been initialized as a list, allows appending more than one value to each key, so every time that it finds a sequence with a name identical to some key already present in the defaultdict, it will append the information to the value of that key in a list. So, all the matching is done with this short loop, we only need then to iterate through the keys and print the result.

For the results, Mike set up a format string (we will see in a future post) that receives the merged sequences and their names in a formatted way. The iteration along the defaultdict checks for both keys and values and every time it finds a value with a length of 2 (meaning two sequences shared identical names) it puts the items of the defaultdict’s value list into two tuples. These tuples, containing sequence and name, are then formatted for output and appended to a list of merged sequences.

There are a lot of new concepts, objects and methods but Mike’s suggestion shows how powerful Python is. For a set of 25 alignments, there is little performance improvement, but our coding ability improves exponentially with clever suggestions.

I will comment and explain Luke’s comments later.

Merging sequences from a Pfam alignment: using sets, part II

Phase 2 2 Comments »

Mike and Luke left comments and suggestions on how to attack the problem from the previous entry. I copied their comments here in order to maintain code formatting a little bit better (comments do not have code highlighting and formatting).

Mike:
Why not build a dictionary of the fasta data keyed by protein ID. The value of each dictionary entry would be a list of (name,sequence) tuples. Then iterate over the dictionary items and build the output list from dictionary entries of length == 2.

Obviously, I don’t have any data sets to test it on, but something like this might work:

def merge_seqs(data1, data2):
    from collections import defaultdict
    from itertools import chain

    data = defaultdict(list)

    for item in chain(data1, data2):
        ident = i.name[i.name.find(’|')+1:i.name.find(’/')]
        data[ident].append( (i.name, i.sequence) )

    format = “%s-%s->%d\n%s%s”
    flist = [ ]
    for key,value in data.iteritems():
        if len(value) == 2:
            jname,jseq = value[0]
            kname,kseq = value[1]
            flist.append(fmt % (jname, kname, len(jseq), jseq, kseq) )

    return flist

Luke:

Are there duplicate IDs in a single file? If not, no reason to perform the intersection twice (once by set, once by nested loop). Sets and dictionaries are both hashes, so just remember the instances with the id:

def merge_seqs(data1, data2):
    first, second = dict(), dict()
    for i in data1:
        first[i.name[i.name.find(’|')+1:i.name.find(’/')]] = i

    for i in data2:
        second[i.name[i.name.find(’|')+1:i.name.find(’/')]] = i

    shared_ids = set(first).intersection(set(second))

    flist = []
    for i in shared_ids:
        j = first[i]
        k = second[i]
        tempname = j.name + ‘-’ + k.name + ‘->’ + str(len(j.sequence))
        tempseq = j.sequence + k.sequence
        flist.append(tempname + ‘\n’ + tempseq)

    return flist

If there are duplicate IDs in a file, and you want the behavior of your original script of giving every combination _between_ files, how about a dictionary of lists and the cross-product of j & k? (I’m going to use python2.5’s collections.defaultdict, if you’re on an earlier version use .setdefault . The cross product brings back a nested loop, but it’s only over values we want rather than the whole datasets.)

from collections import defaultdict
def merge_seqs(data1, data2):
    first, second = defaultdict(list), defaultdict(list)
    for i in data1:
        first[i.name[i.name.find(’|')+1:i.name.find(’/')]].append(i)

    for i in data2:
        second[i.name[i.name.find(’|')+1:i.name.find(’/')]].append(i)

    shared_ids = set(first).intersection(set(second))

    flist = []
    for i in shared_ids:
        cross = [(a,b) for a in first[i] for b in second[i]]
        for j,k in cross:
            tempname = j.name + ‘-’ + k.name + ‘->’ + str(len(j.sequence))
            tempseq = j.sequence + k.sequence
            flist.append(tempname + ‘\n’ + tempseq)

    return flist

And if you want combinations from within a file as well, just use one dictionary of lists and cross each list value with itself, excluding identical elements:

cross = [(a,b) for a in data[i] for b in data[i] if a.name != b.name]

To your question of more directly creating the sets of names, can’t be much more direct, but perhaps more concise using a generator (or list, with the extra cost) comprehension:

myset1 = set(i.name… for i in data1)

Could do equivalently for the dict-based first example above, noting that the dict() constructor can take a sequence of (key, value) tuples. This would not work for the defaultdict or setdefault based version.

—–
Thanks a lot for the great comments. I will make some changes in my original code, with the nice things I learned from you.

Merging sequences from a Pfam alignment: using sets

Phase 2 3 Comments »

A colleague came with a “problem”: what would be the most efficient way to merge Pfam alignments? He had FASTA files containing sequences and he wanted to find identical IDs in two files and merge the related sequences from different domains of the same protein. His C++ approach was taking too long to run so I jumped in to help with some Python tricks.

Fasta headers of Pfam alignments look like this

>P00526.2|SRC_RSVP/147-229

where the first section, before the pipe, is the protein family, the section between the pipe and the slash is the protein ID and the after the slash are the start and end positions. Basically we want to match the protein ID, between | and /, which is the only section that should not change from one alignment from the other, if there are similar sequences in both files.

His dataset size was around 8000 FASTA files and he wanted to compare all-versus-all. The first idea that comes to mind is to read the two files and basically run nested loops checking for matches between two sequence names. Simple and easy. But it would be good to find something else to do in the meantime, instead of checking for its progress.

Another idea, to simplify the process, is to read both files, extract the protein IDs and check for the intersection of the two sets of IDs and then just extract the sequences from a list of matches.

def merge_seqs(data1, data2):

    myset1, myset2 = Set([]), Set([])

    for i in data1:
        myset1.add(i.name[i.name.find('|')+1:i.name.find('/')])

    for i in data2:
        myset2.add(i.name[i.name.find('|')+1:i.name.find('/')])

    mylist = Set.intersection(myset1, myset2)

    flist = []
    for i in mylist:
        for j in data1:
            if j.name[j.name.find('|')+1:j.name.find('/')] == i:
                for k in data2:
                    if k.name[k.name.find('|')+1:k.name.find('/')] == j.name[j.name.find('|')+1:j.name.find('/')]:
                        tempname = j.name + '-' + k.name + '->' + str(len(j.sequence))
                        tempseq = j.sequence + k.sequence
                        flist.append(tempname + '\n' + tempseq)

    return flist

This is the function that matches the sequence IDs. As we are reading the FASTA file with the functions/classes in the fasta module there is no* direct way to transform the list of sequence names into a set, that would make easier to extract the intersection between two sets of IDs from distinct files. In the above function, data1 and data2 are instances of the fasta class, and we read these lists and add the fasta.name to a couple of sets and get the intersection of them. The intersection method returns a list that we use in the nested loops to find the matching sequences, as we need to merge them, we add both names and residues. Simple and easy. All merged sequences, and their respective merged names, are appended to a list that is returned at the end of the function.

*if there is a more direct way, just let me know

Python sets, intersections, reduce and more

Phase 2 No Comments »

Last time we saw how to use a set intersection to check for clusters of DNA/protein sequences and their genome intersections. We going to use the same example but this time we will see how we can change it and create a function that would calculate an arbitrary number of intersections at a time and also be able to check the intersection of more than two sets.

Our previous code to calculate intersections was

from sets import Set #for Python 2.3 and below
genA, genB, genC = Set([]), Set([]), Set([])

#populate the sets ...

print len(Set.intersection(genA, genB))
print len(Set.intersection(genA, genC))
print len(Set.intersection(genB, genC))

and that’s does not gives us the most important piece of information: the intersection of A, B and C. Let’s see how we can do it. Python has a function that can help us this time: reduce. This function allows to reduce a series of values to a single one, by employing a function defined in its parameters. From the file tutorial

reduce(function, iterable[, initializer])

Basically reduce will make the function modify the values of a iterable (i.e. list). In our case reduce helps us calculate the intersection of many sets at once (our sets are genA, genB and genC)

reduce(Set.intersection, [genA, genB, genC])

where reduce is applying the Set method intersection to a list of sets passed to the function. Python’s reduce might not be available in 3.0, but it is worth knowing.

Now that we have a Pythonic way of checking the intersections we need to put reduce in a function, to make our code more elegant.

def get_intersection(my_set1, my_set2, my_set3):
    return reduce(Set.intersection, [my_set1, my_set2, my_set3])

Done. In this function reduce will operate by first checking the intersection between my_set1 and my_set2 and storing it internally, then checking for the intersection between the previous result and my_set3.

But wait … get_intersection receives three arguments, so it will be able to return the intersection only when three sets are sent, thus we would have to create another function that can receive two items in order to calculate the intersection between two items. And if we have more than three genomes to compare …

There is a trick. If we define our function receiving only one argument and put a star/asterisk on it, this means that an arbitrary number of arguments will be provided to this function in a tuple (and a tuple is iterable, so it can be used in a reduce). Changing our function accordingly, we will have

def get_intersection(*my_sets):
    return reduce(Set.intersection, my_sets)

so we can send 2, 3, 4, 5 or n number of sets to this function and it will return the intersection among all of them. Our simple script would be (including the above function)

from sets import Set #for Python 2.3 and below

def get_intersection(*my_sets):
    return reduce(Set.intersection, my_sets)

genA, genB, genC = Set([]), Set([]), Set([])

#populate the sets ...

print len(get_intersection(genA, genB))
print len(get_intersection(genA, genC))
print len(get_intersection(genB, genC))
print len(get_intersection(genA, genB, genC))

Python sets and intersections

Phase 2 No Comments »

Sometime ago we saw how to use sets and uniquify lists. This time we will see anothe example of the use of sets.

Note: in the previous example we saw sets imported as an extra module. This has to be done for Python versions 2.3 and under. There is a difference between both: when sets are imported the object is noted Set with capital S while on Python 2.4 and above, set objects are noted with a small s, set. In order to make the post as global as possible for people that does not have the latest version of Python I will use the 2.3 notation and import

This time we will see a different example os set usage that includes some methods available for this type of objects. The initial problem was how to determine which genomes are represented in protein/DNA clusters obtained with CD-HIT. Basically CD-HIT uses a multiple fasta file to generate clusters of proteins/DNA using their sequence identity. It clusters the sequences, keeping thei fasta title and assigning an ID for each cluster obtained. We won’t see how is CD-HIT’s output, not how to parse it. For this example, let’s say three genomes (A, B and C) were analysed in CD-HIT and we want to determine which clusterss contain sequences from the combinations AB, AC and CB (we won’t touch unique items and clusters with sequences from all genomes this time).

After reading the clusters and sequence IDs in each one of them we basically need to create unique lists of cluster IDs for each one of the genomes. That’s where sets are used

from sets import Set

genA, genB, genC = Set([]), Set([]), Set([])

We declare three emtpy sets, that will store cluster IDs for each one of the genomes. The important part here is that the sequences’ fasta titles should have a unique identifier for each one of the genomes, just to make it easier to read each cluster contents. In each of our genomes the sequences were tagged with their letter in the first character

>genA_sequence1
ACGT
>genB_sequence1
ACGTT
>genC_sequence1
ACGTT
...

We run CD-HIT and parse the results, maybe creating a class to store information about each cluster and its sequences. Then we analyse this list

for i in clusters:
    for id in i.sequence_id:
        if id.startswith('>genA'):
            genA.add(i.cluster_number)
        elif id.startswith('genB'):
            genB.add(i.cluster_number)
        elif id.startswith('genC'):
            o.add(i.cluster_number)

Remembering that sets are unordered unique lists, we don’t expect to see repeated cluster IDs in each of the three sets. Now to the fun part. Our first thought to determine how many clusters contain sequences of A and B would be to create two loops and iterate checking for equalities in cluster IDs. But with sets, our task is easier. What we need is the intersection from genA and genB, and that’s the method available to do that.

print len(Set.intersection(genA, genB))
print len(Set.intersection(genA, genC))
print len(Set.intersection(genB, genC))

That’s it. Three lines of code, one method. The output will be the number of clusters that contain sequences of A and B, A and C and B and C, respectively.

PS: If anyone is interested in the CD-HIT output parsing class/function just let me know.

GenBank parsing: both scripts

Phase 2 No Comments »

Both scripts to parse GenBank scripts are below and included in the repository.

First the one that extracts amino acid sequences, and then the latest to extract the DNA.

#! /usr/bin/env python

'''
input is a GenBank file. The script searches for gene annotations, extract all lines
from the file and then parses these lines in order to extract protein sequences
Ribosomal genes and other non-coding genes are not extracted - plan to do it later
output is a fasta formatted file
'''

import sys
import fasta

class Protein:
    '''
    class that stores protein information
    '''
    def __init__(self, gi, id, sequence):
        self.gi = gi
        self.id = id
        self.sequence = sequence

def parse_entry(gene_data):
    '''
    parses the entry received from the main function
    in order to extract information as protein id
    gi, etc
    '''
    prot_id = ''
    sequence = ''
    gi_id = ''
    gene_data = gene_data.splitlines()
    for line in gene_data:
        if line.find('/product') >=0:
            prot_id = line[line.find('=') + 2:-1]
        elif line.find('protein_id') >= 0:
            prot_id += '\t' + line[line.find('=') + 2: -1]
        elif line.find('GI:') >= 0:
            gi_id = 'gi' + line[line.find('GI:')+3:-1]
        elif line.find('/translation') >= 0:
            sequence = line[line.find('=') + 2:]
            temp = gene_data.index(line)
            for i in range(temp+1, len(gene_data)):
                if gene_data[i].find('sig_peptide') >= 0:
                    break
                else:
                    sequence += gene_data[i].strip()

    return Protein(gi_id, prot_id, sequence)

#only input is a genbank file
gbfile = open(sys.argv[1])

proteins = []
index = 0
entry = ''
for line in gbfile:
    if line.find('  gene ') >= 0:
        if index >= 1:
            #parses the CDS and appends to a list
            proteins.append(parse_entry(entry))
            entry = ''
        index += 1
        entry += line
    elif line.find('ORIGIN') >= 0:
        #found the DNA sequence, we can stop now
        break
    else:
        entry += line

#parses the last entry after leaving the loop
proteins.append(parse_entry(entry))

#output
for i in proteins:
    if len(i.gi) > 2:
        print i.gi, i.id
        output = open(i.gi + '.fasta', 'w')
        output.write('>' + i.gi + '\t' + i.id + '\n')
        i.sequence = i.sequence.replace('\"', '')
        output.write(fasta.format_output(i.sequence, 80))
        print i.id
#! /usr/bin/env python

'''
script that extracts sequences from a GenBank file. Script reads the gene CDS from the file and
builds a list of start and end positions, and if gene is complement outputs the 5'3' sequence and
its reverse complement
only input is a GenBank file
outputs a fasta file with the GI ID as its name.
'''

import sys
import fasta

class CDSinfo:
    '''
    CDSinf class to store all the information from CDS
    '''
    def __init__(self, gi_id, id, start, end, complement):
        self.gi_id = gi_id
        self.id = id
        self.start = start
        self.end = end
        self.complement = complement

def parse_entry(gene_data):
    '''
    each CDS entry is obtained in the main function and a string of lines with
    information is passed to parse_entry to be parsed and have information extracted
    '''

    gene_data = gene_data.splitlines() #changes a string to list, splitting at line ends
    start, end = 0, 0
    gi_id = ''
    id = ''
    complement = False
    for line in gene_data: #searches for regions annotated as CDS
        if line.find('  CDS  ') >=0:
            temp = line.split()
            #checks for complement sequence, if true remove extra characters
            if temp[1].find('complement') >= 0:
                complement = True
                temp[1] = temp[1].replace('complement(', '')
                temp[1] = temp[1].replace(')', '')
            temp2 = temp[1].split('..')
            start = temp2[0]
            end = temp2[1]
        #checks for GI IDs
        elif line.find('GI:') >= 0:
            gi_id = 'gi' + line[line.find('GI:')+3:-1]
        #get the gene name/function
        elif line.find('/product') >=0:
            id = line[line.find('=') + 2:-1]
        #and adds the protein id
        elif line.find('protein_id') >= 0:
            id += '\t' + line[line.find('=') + 2: -1]

    return CDSinfo(gi_id, id, start, end, complement)

#only input is the genbank file with annotation and sequence
gbfile = open(sys.argv[1])

index = 0
entry = ''
sequence = []
is_seq = False

genes = []
for line in gbfile:
    #reads the genbank file and whenever finds a gene annotation
    #concatenate the lines up to the next gene
    if line.find('  gene ') >= 0:
        #if an entry is complete, send it to parse
        if index >= 1:
            #appends to a list of CDSinfo objects
            genes.append(parse_entry(entry))
            entry = ''
        index += 1
        entry += line
    elif line.find('ORIGIN') >= 0:
        #found sequence start, set the flag on and parses the last entry
        is_seq = True
        genes.append(parse_entry(entry))
    elif is_seq == True:
        #if flag is true keep going, usually sequences are store at the end of the file
        line = line.split()
        sequence.append(line)
    else:
        #this is an entry so append line
        entry += line

str_seq = ''
#make the sequence a string
for i in sequence:
    str_seq += ''.join(i[1:]).upper()

for i in genes:
    if len(i.gi_id) > 2:
        print i.id, i.start, i.end
        output = open(i.gi_id + '.DNA.fasta', 'w')
        output.write('>' + i.gi_id + '\t' + i.id + '\n')
        # if this is a complement, print both 5'-3' and reverse complement sequences
        if i.complement == True:
            output.write(fasta.format_output(fasta.invert(str_seq[int(i.start)-1:int(i.end)]), 80) + '\n')
        else:
            if not i.start.find('join') >= 0:
                output.write(fasta.format_output(str_seq[int(i.start)-1:int(i.end)], 80))

Checking the DNA extract script carefully, we notice that we have an extra function in the fasta module, invert.

Update: A small change to the script is that split('\n') can be replaced by splitlines() which is a built-in method for strings.

This hasn’t been added yet to the repository version of the file and an explanation on its use will be added shortly.

Fasta module: transcribing DNA

Phase 2 No Comments »

A long time ago when the blog was still based on the Perl book we have seen how to transcribe DNA to RNA. This entry serves only to remember the method and add a function to the fasta module in the repositoty.

It is really simple to transcribe in Python, by employing the replace method on strings. Our function looks like

def transcribe(seq):
    RNA = seq.replace('T', 'U')
    return RNA

Some other functions will be added to the module in the next couple of weeks. After that we will divert the focus on optimizing our GenBank parsing script.

Fasta module: generating reverse complement of DNA sequences

Phase 2 5 Comments »

As shown in the GenBank DNA parser script, it is really useful to have the ability to get the reverse complement of some DNA sequences. The reverse complement of a 5′-3′ DNA sequence is on its complementary strand. Using our fasta module it is easy to implement a function to generate the antiparallel sequence

Basically we can do this in two step, one - obtaining the complement and two - reversing it.

def complement(seq):
    abuffer = []
    for i in seq:
        if i == 'A':
            abuffer.append('T')
        elif i == 'C':
            abuffer.append('G')
        elif i == 'T':
            abuffer.append('A')
        elif i == 'G':
            abuffer.append('C')
    return abuffer

That’s not a Pythonic approach, but it works reasonably well for short sequences. What would be a Pythonic approach to it? Using dictionaries and list comprehension. From the Python online documentation:

List comprehensions provide a concise way to create lists without resorting to use of map(), filter() and/or lambda. The resulting list definition tends often to be clearer than lists built using those constructs. Each list comprehension consists of an expression followed by a for clause, then zero or more for or if clauses. The result will be a list resulting from evaluating the expression in the context of the for and if clauses which follow it.

It is basically a specific way to modify lists using a loop and if clauses when needed. In a couple of lines we can do what we accomplished in 10 lines with the above excerpt. First we need a dictionary to set how the nucleotides are related

complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

and then a list comprehension to modify each nucleotide

complseq = [complement[base] for base in seq]

The list comprehension menas: for each base in the sequence get the dictionary value for each key (a nucleotide in the initial sequence). The complete function would be

def complement(seq):
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    complseq = [complement[base] for base in seq]
    return complseq

One step solved. We are already able to get the complement, now we need to reverse it. Simple. A Python list method automatically reverts list elements. Basically what we need then is below

def reverse_complement(seq):
    seq = list(seq)
    seq.reverse()
    return ''.join(complement(seq))

The call from any script to this function would be simply be

fasta.reverse_complement('AACCGGTTTTGGCCAA')

Batteries included, indeed.

more can be found here. that served as an inpiration for the funtion

I am updating the fasta.py file in the repository soon

GenBank parsing: part II

Phase 2 No Comments »

Last thing we saw in the previous entry was the function to parse a GenBank file in order to get the DNA sequences of an annotated gene. This time we will dissect the function.

def parse_entry(gene_data):
    #changes a string to list, splitting at line ends
    gene_data = gene_data.split('\n')
    start, end = 0, 0
    gi_id = ''
    id = ''
    complement = False
    for line in gene_data:
        if line.find('  CDS  ') >=0:
            temp = line.split()
            if temp[1].find('complement') >= 0:
                complement = True
                temp[1] = temp[1].replace('complement(', '')
                temp[1] = temp[1].replace(')', '')
            temp2 = temp[1].split('..')
            start = temp2[0]
            end = temp2[1]
        elif line.find('GI:') >= 0:
            gi_id = 'gi' + line[line.find('GI:')+3:-1]
        elif line.find('/product') >=0:
            id = line[line.find('=') + 2:-1]
        elif line.find('protein_id') >= 0:
            id += '\t' + line[line.find('=') + 2: -1]

    return CDSinfo(gi_id, id, start, end, complement)

First things first. As in the CDS/protein parsing example, the function receives a string and it is easier for us to operate with lists, so we split the string at the carriage returns. This time we also need to check for the start and end positions of the gene in the long string that is the DNA sequence of a GenBank file, so we just declare a couple of integers start and end. We also have a couple of strings to receive the gi ID and the gene product (id). And finally a boolean, that will tell us if the sequence in question is on the DNA complement strand, meaning it is a reverse complement of the DNA that is stored in the file.

So we start our loop checking for the elements of each line. If we find a ' CDS ' we know that we can extract the start and end positions of the gene

if line.find('  CDS  ') >=0:
    temp = line.split()
    if temp[1].find('complement') >= 0:
        complement = True
        temp[1] = temp[1].replace('complement(', '')
        temp[1] = temp[1].replace(')', '')
    temp2 = temp[1].split('..')
    start = temp2[0]
    end = temp2[1]

We use temporary list and split the line at the spaces and then move to get the information from it. If the line is a reverse complement we remove the word from the entry and the parentheses. Whenever the gene annotation relates to the complement strand the entry is as such complement(10728..11330). We then use another temp object and split the second element in the split line and separate it at the double dots. From this latter split, we end up with two elements, the first is the start position and the second the end position of the gene.

To extract the other information everything works as the previous script, basically checking for the presence of certain words in the each line. In the end we return an instance of the class CDSinfo which will be used to extract the DNA sequence.

Design by j david macor.com.Original WP Theme & Icons by N.Design Studio
Entries RSS Comments RSS Log in