Translating DNA into proteins: second approach, now using FASTA files
Section 5 July 11th, 2007We have seen before how to translate DNA sequences into amino acids sequences. We have even created a module that contains the dictionary for the genetic code. Now we are going to combine both (very simple) modules we created in one nice script for day-to-day use.
So, we have the dnatranslate.py and the fasta.py that we are going to import into our script. And that’s basically it: calling function already created, stored in modules that can be reused anytime. In the end our script that translates DNA sequences to proteins takes a little bit more than a handful of lines.
#!/usr/bin/env python
import dnatranslate
import sys
import fasta
dna = fasta.read_fasta(open(sys.argv[1], 'r').readlines())
for item in dna:
protein = dnatranslate.translate_dna(item.sequence)
print item.name
print protein
That’s it. A good example of reusable code, that once created fits everywhere and handles most type of data. We read the FASTA file in the first line, and iterate over the items created translating them as we go. As an extra exercise, we can include the output formatting function. First we need to update the fasta.py module (already on the repository) and slightly change the formatting function, that ends up looking like this
def format_output(sequence, length):
temp = []
for j in range(0,len(sequence),length):
temp.append(sequence[j:j+length])
return '\n'.join(temp)
For this case the ideal formatting function would go through the “longer” route mentioned before, because the final printing should be done by the main script and not by the imported module. This gives us more control on what we want to do with the resulting string. The format_output function receives two arguments: the first is the actual DNA/protein sequence to be formatted and the length we want to output it. We had to remove the loop too, so only one sequence can be sent to the function and, as pointed, the function returns a string with the formatted sequence. In the end our post’s initial sequence has one modification only
#!/usr/bin/env python
import dnatranslate
import sys
import fasta
dna = fasta.read_fasta(open(sys.argv[1], 'r').readlines())
for item in dna:
protein = dnatranslate.translate_dna(item.sequence)
print item.name
print fasta.format_output(protein, 60)
the last line, that instead of printing directly the result of the translation, sends the sequence to the formatting function before.
June 10th, 2008 at 8:45 pm
Here is some alternative implementations of your translate function.
def translate_dna_0(sequence):proteinseq = ''
for n in range(0,len(sequence),3):
# "obj == True" is ugly, use the keyword 'is', as "obj is True"
# for testing singleton like True, False and None, or even
# better omit it completely.
if gencode.has_key(sequence[n:n+3]) == True:
# String concatenation is quadratic Python. Since strings are
# immutable, a new string object needs to be created every time.
# However, the main Python implementation does concatenation quite
# efficiently, so the performance benefits are not extraordinaire.
proteinseq += gencode[sequence[n:n+3]]
return proteinseq
def translate_dna_1(sequence):
# Use a list to avoid the quadratic behavior of string
# concatenation.
proteinseq = []
for n in range(0, len(sequence), 3):
# Use the idiomatic 'in' keyword to test whether a sequence
# is in the gencode dictionary.
if sequence[n:n+3] in gencode:
proteinseq.append(gencode[sequence[n:n+3]])
return ''.join(proteinseq)
def translate_dna_2(sequence):
# Alternative implementation: use re.findall to split the sequence
# into nucleotides.
proteinseq = []
for nucleo in re.findall("...", sequence):
if nucleo in gencode:
proteinseq.append(gencode[nucleo])
return ''.join(proteinseq)
def translate_dna_3(sequence):
# Alternative implementation: use re.sub with a callable to do the whole
# translation. Short, but slow due to the function call overhead.
def replace(match):
return gencode[match.group(0)]
return re.sub("...", replace, sequence)